Dear Cephers,
I am planning a CephFS cluster with ca. 100 OSD nodes, each of which has 12 disks and 2 NVMe devices (for DB/WAL and the CephFS metadata pool). For performance and scalability reasons, I would like to try multiple MDS daemons working active-active (a rough sketch of what I have in mind follows the questions below). From what I have learned in the past, I am not sure about the following questions.
1. Which Ceph version should I run? I had a good experience with Luminous 12.2.13, and I am not yet familiar with Mimic and Nautilus. Is Luminous 12.2.13 stable enough to run multiple active-active MDS servers for CephFS?
2. If I have to go with Mimic or Nautilus for CephFS, which one is preferable?
3. I have some experience with Ceph RBD, but not with CephFS. So my question is: what should I pay attention to when running CephFS? I am somewhat nervous...
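For reference, the multi-active setup I have in mind would be enabled roughly like this (just a sketch from my reading of the CephFS docs; pool names, PG counts, and the MDS count are placeholders, and the metadata pool would additionally need a CRUSH rule targeting the NVMe OSDs):

  ceph osd pool create cephfs_metadata 128
  ceph osd pool create cephfs_data 4096
  ceph fs new cephfs cephfs_metadata cephfs_data
  ceph fs set cephfs max_mds 2     # allow two active MDS ranks; extra MDS daemons stay standby
  ceph fs status cephfs            # verify ranks and standbys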
best regards,
Samuel
huxiaoyu(a)horebdata.cn
Hello everybody,
I'm trying to figure out how often the Ceph client contacts the monitors to update its own view of the cluster map.
Can anyone point me to a document describing this client <-> monitor communication?
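In the meantime I have been poking at the cluster directly; this is what I use to see the current monmap epoch and which clients hold monitor sessions (the session dump has to be run on a monitor host, mon.a is just a placeholder, and the two options at the end are the client-side tunables I found and assume are the relevant ones; ceph config help needs a recent release):

  ceph mon dump | grep epoch                 # current monmap epoch
  ceph daemon mon.a sessions                 # on a monitor host: clients with open sessions
  ceph config help mon_client_ping_interval  # how often the client pings its monitor
  ceph config help mon_client_hunt_interval  # how the client searches for a new monitor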
Thank you,
Laszlo
I need to change the network my monitors are on. It seems this is not a trivial thing to do. Are there any up-to-date instructions for doing so on a cephadm-deployed cluster?
I’ve found some steps in older versions of the docs, but I'm not sure whether they are still correct - they mention using the ceph-mon command, which I don't have.
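What I am hoping the cephadm-era procedure looks like is roughly the following (pieced together from the orchestrator docs, so please correct me if this is wrong; host names and the subnet are placeholders):

  ceph orch apply mon --unmanaged                 # stop cephadm from (re)scheduling mons itself
  ceph config set mon public_network 10.1.2.0/24  # the new monitor network
  ceph orch daemon add mon newhost1:10.1.2.11     # add a mon on the new network
  # wait for it to join quorum (ceph status), repeat for the other new mons, then:
  ceph orch daemon rm mon.oldhost1 --force        # remove the old-network mons one by one
  ceph mon dump                                   # verify the monmap lists only new addresses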
Will
Hi,
As we all know, the default replica setting 'size' is 3, which means there are 3 copies of an object. What are the disadvantages if I set it to 2, apart from getting fewer copies?
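Concretely, the change I am considering is just the following, though I am also unsure what min_size should then be (the pool name is only an example):

  ceph osd pool set mypool size 2
  ceph osd pool set mypool min_size 2   # min_size 1 would keep the pool writable with a single copy, at the risk of data loss
  ceph osd pool get mypool size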
Thanks
Hi all,
RocksDB failed to open when the ceph-osd process was restarted after the OSD data disk had been unplugged, on Ceph 14.2.5 / CentOS 7.6.
1) After the OSD data disk was unplugged, the ceph-osd process aborted:
-3> 2020-07-13 15:25:35.912 7f1ad7254700 -1 bdev(0x559d1134f880 /var/lib/ceph/osd/ceph-10/block) _sync_write sync_file_range error: (5) Input/output error
-2> 2020-07-13 15:25:35.912 7f1ad9c5f700 -1 bdev(0x559d1134f880 /var/lib/ceph/osd/ceph-10/block) _aio_thread got r=-5 ((5) Input/output error)
-1> 2020-07-13 15:25:35.917 7f1ad9c5f700 -1 /root/rpmbuild/BUILD/ceph-14.2.5-1.0.9/src/os/bluestore/KernelDevice.cc: In function 'void KernelDevice::_aio_thread()' thread 7f1ad9c5f700 time 2020-07-13 15:25:35.913821
/root/rpmbuild/BUILD/ceph-14.2.5-1.0.9/src/os/bluestore/KernelDevice.cc: 534: ceph_abort_msg("Unexpected IO error. This may suggest a hardware issue. Please check your kernel log!")
ceph version 14.2.5-93-g9a4f93e (9a4f93e7143bcdd5fadc88eb58bb730ae97b89c5) nautilus (stable)
1: (ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0xdd) [0x559d05b6069a]
2: (KernelDevice::_aio_thread()+0xebe) [0x559d061a54ee]
3: (KernelDevice::AioCompletionThread::entry()+0xd) [0x559d061a7add]
4: (()+0x7dd5) [0x7f1ae66aedd5]
5: (clone()+0x6d) [0x7f1ae5572ead]
2) After plugging the disk back in and restarting the ceph-osd process, RocksDB found that incomplete records existed and stopped working:
2020-07-13 15:51:38.305 7f9801ef5a80 4 rocksdb: [db/db_impl_open.cc:583] Recovering log #9 mode 0
2020-07-13 15:51:38.748 7f9801ef5a80 3 rocksdb: [db/db_impl_open.cc:518] db.wal/000009.log: dropping 2922 bytes; Corruption: missing start of fragmented record(2)
2020-07-13 15:51:38.748 7f9801ef5a80 4 rocksdb: [db/db_impl.cc:390] Shutdown: canceling all background work
2020-07-13 15:51:38.748 7f9801ef5a80 4 rocksdb: [db/db_impl.cc:563] Shutdown complete
2020-07-13 15:51:38.748 7f9801ef5a80 -1 rocksdb: Corruption: missing start of fragmented record(2)
2020-07-13 15:51:38.748 7f9801ef5a80 -1 bluestore(/var/lib/ceph/osd/ceph-10) _open_db erroring opening db:
2020-07-13 15:51:38.748 7f9801ef5a80 1 bluefs umount
2020-07-13 15:51:38.776 7f9801ef5a80 1 fbmap_alloc 0x55c897e0a900 shutdown
2020-07-13 15:51:38.776 7f9801ef5a80 1 bdev(0x55c898a6ce00 /var/lib/ceph/osd/ceph-10/block) close
Why does RocksDB not automatically drop these incomplete records and continue working?
In addition, once this situation has occurred, what is the recommended way to recover?
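The only recovery path I can think of is to fsck/repair the BlueStore OSD and, if that fails, rebuild it and let the data backfill from the replicas, roughly as follows (OSD id taken from the logs above; the device path is illustrative and the flags should be double-checked):

  systemctl stop ceph-osd@10
  ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-10
  ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-10
  # if repair cannot fix the corrupted WAL records, rebuild the OSD:
  ceph osd out 10
  ceph osd purge 10 --yes-i-really-mean-it
  ceph-volume lvm zap /dev/sdX --destroy             # /dev/sdX = the replugged data disk
  ceph-volume lvm create --bluestore --data /dev/sdX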
I tested osd bench with different block sizes: 1MB, 512KB, 256KB, 128KB, 64KB, 32KB, and 16KB. osd.2 is from the cluster whose OSDs have the better 4KB osd bench results, and osd.30 is from the cluster whose OSDs have the lower 4KB results. Down to 64KB, osd.30 was better than osd.2; however, there was a big drop on osd.30 at the 32KB block size.
root@cmn01:~# ceph tell osd.2 bench 1073741824 1048576
{
"bytes_written": 1073741824,
"blocksize": 1048576,
"bytes_per_sec": 188747963
}
root@cmn01:~# ceph tell osd.2 bench 1073741824 524288
{
"bytes_written": 1073741824,
"blocksize": 524288,
"bytes_per_sec": 181071543
}
root@cmn01:~# ceph tell osd.2 bench 786432000 262144
{
"bytes_written": 786432000,
"blocksize": 262144,
"bytes_per_sec": 159007035
}
root@cmn01:~# ceph tell osd.2 bench 393216000 131072
{
"bytes_written": 393216000,
"blocksize": 131072,
"bytes_per_sec": 127179122
}
root@cmn01:~# ceph tell osd.2 bench 196608000 65536
{
"bytes_written": 196608000,
"blocksize": 65536,
"bytes_per_sec": 83365482
}
root@cmn01:~# ceph tell osd.2 bench 98304000 32768
{
"bytes_written": 98304000,
"blocksize": 32768,
"bytes_per_sec": 48351258
}
root@cmn01:~# ceph tell osd.2 bench 49152000 16384
{
"bytes_written": 49152000,
"blocksize": 16384,
"bytes_per_sec": 31725841
}
------------------------------------------------------------------------------------------------------------------
root@stor-mgt01:~# ceph tell osd.30 bench 1073741824 1048576
{
"bytes_written": 1073741824,
"blocksize": 1048576,
"elapsed_sec": 5.344805,
"bytes_per_sec": 200894474.890259,
"iops": 191.587901
}
root@stor-mgt01:~# ceph tell osd.30 bench 1073741824 524288
{
"bytes_written": 1073741824,
"blocksize": 524288,
"elapsed_sec": 5.303052,
"bytes_per_sec": 202476205.680661,
"iops": 386.192714
}
root@stor-mgt01:~# ceph tell osd.30 bench 786432000 262144
{
"bytes_written": 786432000,
"blocksize": 262144,
"elapsed_sec": 3.878248,
"bytes_per_sec": 202780204.655892,
"iops": 773.545092
}
root@stor-mgt01:~# ceph tell osd.30 bench 393216000 131072
{
"bytes_written": 393216000,
"blocksize": 131072,
"elapsed_sec": 1.939532,
"bytes_per_sec": 202737591.242988,
"iops": 1546.765070
}
root@stor-mgt01:~# ceph tell osd.30 bench 196608000 65536
{
"bytes_written": 196608000,
"blocksize": 65536,
"elapsed_sec": 1.081617,
"bytes_per_sec": 181772360.338257,
"iops": 2773.626104
}
root@stor-mgt01:~# ceph tell osd.30 bench 98304000 32768
{
"bytes_written": 98304000,
"blocksize": 32768,
"elapsed_sec": 2.908703,
"bytes_per_sec": 33796507.598640,
"iops": 1031.387561
}
root@stor-mgt01:~# ceph tell osd.30 bench 49152000 16384
{
"bytes_written": 49152000,
"blocksize": 16384,
"elapsed_sec": 3.907744,
"bytes_per_sec": 12578102.861185,
"iops": 767.706473
}
------------------ Original message ------------------
From: "rainning" <tweetypie(a)qq.com>
Sent: Thursday, July 16, 2020, 9:42 AM
To: "Zhenshi Zhou" <deaderzzs(a)gmail.com>
Cc: "ceph-users" <ceph-users(a)ceph.io>
Subject: Re: [ceph-users] Re: osd bench with or without a separate WAL device deployed
Hi Zhenshi,
I did try with a bigger block size. Interestingly, the one whose 4KB osd bench was lower performed slightly better in the 4MB osd bench.
Let me try some other bigger block sizes, e.g. 16K, 64K, 128K, 1M, etc., to see if there is any pattern.
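Something like the sweep below is what I plan to run, assuming jq is available to pull the numbers out (3000 writes per run, the same count as the runs above):

  for bs in 16384 32768 65536 131072; do
    total=$((bs * 3000))
    ceph tell osd.30 bench "${total}" "${bs}" \
      | jq -r --arg bs "${bs}" '"\($bs) B: \(.bytes_per_sec) B/s, \(.iops) iops"'
  done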
Moreover, I did compare the two SSDs; they are an INTEL SSDSC2KB480G8 and an INTEL SSDSC2KB960G8, respectively. Performance-wise, there is not much difference.
Thanks,
Ning
------------------ Original message ------------------
From: "Zhenshi Zhou" <deaderzzs(a)gmail.com>
Sent: Thursday, July 16, 2020, 9:24 AM
To: "rainning" <tweetypie(a)qq.com>
Cc: "ceph-users" <ceph-users(a)ceph.io>
Subject: [ceph-users] Re: osd bench with or without a separate WAL device deployed
Maybe you can try writing with a bigger block size and compare the results.
For BlueStore, write operations fall into two modes: one is COW, the
other is RMW. AFAIK only RMW uses the WAL, in order to protect data from
being corrupted by interrupted writes.
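It is also worth double-checking whether each OSD really has (or lacks) a dedicated WAL device; as far as I remember the OSD metadata exposes this, something like the following (field names are from memory, so adjust the grep if they differ on your version):

  ceph osd metadata 30 | grep bluefs    # look for bluefs_dedicated_db / bluefs_dedicated_wal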
rainning <tweetypie(a)qq.com> wrote on Wednesday, July 15, 2020 at 11:04 PM:
> Hi Zhenshi, thanks very much for the reply.
>
> Yes, I know it is odd that BlueStore is deployed only with a separate
> db device but no separate WAL device. The cluster was deployed in k8s using
> rook; I was told it was because the rook version we used didn't support that.
>
> Moreover, the comparison was made on osd bench, so the network should not
> be a factor. As for the storage node hardware, although the two clusters are
> indeed different, their CPUs and HDDs do have almost the same performance
> numbers. I haven't compared the SSDs that are used as db/WAL devices; that
> might make a difference, but I am not sure it can account for a two-times difference.
>
> ---Original---
> *From:* "Zhenshi Zhou"<deaderzzs(a)gmail.com>
> *Date:* Wed, Jul 15, 2020 18:39
> *To:* "rainning" <tweetypie(a)qq.com>
> *Cc:* "ceph-users" <ceph-users(a)ceph.io>
> *Subject:* [ceph-users] Re: osd bench with or without a separate WAL
> device deployed
>
> I deployed the cluster either with separate db/wal or put db/wal/data
> together. Never tried to have only a separate db.
> AFAIK wal does have an effect on writing but I'm not sure if it could be
> two times of the bench value. Hardware and
> network environment are also important factors.
>
> rainning <tweetypie(a)qq.com> wrote on Wednesday, July 15, 2020 at 4:35 PM:
>
> > Hi all,
> >
> >
> > I am wondering if there is any performance comparison done on osd bench
> > with and without a separate WAL device deployed given that there is
> always
> > a separate db device deployed on SSD in both cases.
> >
> >
> > The reason I am asking this question is that we have two clusters and
> osds
> > in one have separate db and WAL device deployed on SSD but osds in
> another
> > only have a separate db device deployed. And we found 4KB osd bench (i.e.
> > ceph tell osd.X bench 12288000 4096) for the ones having a separate WAL
> > device was two times of the ones without a separate WAL device. Is the
> > performance difference caused by the separate WAL device?
> >
> >
> > Thanks,
> > Ning
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io
Hi all,
I am wondering if there is any performance comparison done on osd bench with and without a separate WAL device deployed, given that there is always a separate db device deployed on SSD in both cases.
The reason I am asking this question is that we have two clusters; the OSDs in one have separate db and WAL devices deployed on SSD, while the OSDs in the other only have a separate db device. And we found that the 4KB osd bench (i.e. ceph tell osd.X bench 12288000 4096) for the ones having a separate WAL device was twice that of the ones without a separate WAL device. Is the performance difference caused by the separate WAL device?
Thanks,
Ning
Hi Liam, All,
We have also run into this bug:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/PCYY2MKRPCP…
Like you, we are also running Octopus 15.2.3
Downgrading the RGWs at this point is not ideal, but if a fix isn't found
soon we might have to.
Has a bug report been filed for this yet?
- Dave