- ceph-users - lists.ceph.io

Cephfs multiple active-active MDS stability and optimization

by huxiaoyu＠horebdata.cn

Dear Cepher， I am planning a cephfs cluster with ca. 100 OSD nodes, each of which has 12 disks, and 2 NVMe (for db wal and cephfs metadata pool). Fpr performance and scalability reasons, i would like to try multi MDS working ative-active. From what i learned in the past, i am not sure about the following questions. 1 Which Ceph version should i run? I had a good experience with Luminous 12.2.13, and not familiar yet with Mimic and Nautilus. Is Lumious 12.2.13 stable enouth to run multiple active-active MDS servers for CephFS? 2 If i had to go Mimic or Nautilus for CephFS, which one is perferable? 3 I did has some experience with Ceph RBD, but not CephFS, So my question is, what should i pay attention to whening running CephFS? I am somehow nervous...... best regards, Samuel huxiaoyu(a)horebdata.cn

3 years, 9 months

1
0
0 0

AdminSocket occurs segment fault with samba vfs ceph plugin

by 380562518＠qq.com

Ceph Version: 14.2.5 Samba Version:4.10.4 OS: Centos 7.6.1810 Procedure: 1，Setup a ceph cluster and create a file system. 2，Setup a samba share with samba vfs ceph plugin. 3，Use 'ceph daemon /var/run/ceph/ceph-client.admin.*****.asok help' to get debug command. 4，It will occurs a segment fault when exec the 3rd command. The call trace list belows: #0 0x00007f35425cb387 in raise () from /lib64/libc.so.6 #1 0x00007f35425cca78 in abort () from /lib64/libc.so.6 #2 0x00007f3545a8fff3 in dump_core () at ../source3/lib/dumpcore.c:338 #3 0x00007f3545a8109b in smb_panic_s3 (why=<optimized out>) at ../source3/lib/util.c:839 #4 0x00007f354627297f in smb_panic (why=why@entry=0x7f35462bb55b "internal error") at ../lib/util/fault.c:174 #5 0x00007f3546272bb6 in fault_report (sig=<optimized out>) at ../lib/util/fault.c:88 #6 sig_fault (sig=<optimized out>) at ../lib/util/fault.c:99 #7 <signal handler called> #8 0x00007f35226d026f in acquire (this=0x564a0a014050) at /usr/src/debug/ceph-14.2.5-1.0.8/build/boost/include/boost/smart_ptr/detail/shared_count.hpp:426 #9 acquire_object_id (this=0x7f350fffe220) at /usr/src/debug/ceph-14.2.5-1.0.8/build/boost/include/boost/spirit/home/classic/core/non_terminal/impl/object_with_id.ipp:157 #10 object_with_id (this=0x7f350fffe220) at /usr/src/debug/ceph-14.2.5-1.0.8/build/boost/include/boost/spirit/home/classic/core/non_terminal/impl/object_with_id.ipp:79 #11 grammar (this=0x7f350fffe220) at /usr/src/debug/ceph-14.2.5-1.0.8/build/boost/include/boost/spirit/home/classic/core/non_terminal/grammar.hpp:51 #12 Json_grammer (semantic_actions=..., this=0x7f350fffe220) at /usr/src/debug/ceph-14.2.5-1.0.8/src/json_spirit/json_spirit_reader_template.h:401 #13 json_spirit::read_range_or_throw<__gnu_cxx::__normal_iterator<char const*, std::string>, json_spirit::Value_impl<json_spirit::Config_map<std::string> > > (begin=123 '{', end=0 '\000', value=...) at /usr/src/debug/ceph-14.2.5-1.0.8/src/json_spirit/json_spirit_reader_template.h:585 #14 0x00007f35226d0c3c in json_spirit::read_range<__gnu_cxx::__normal_iterator<char const*, std::string>, json_spirit::Value_impl<json_spirit::Config_map<std::string> > > ( begin=123 '{', end=..., value=...) at /usr/src/debug/ceph-14.2.5-1.0.8/src/json_spirit/json_spirit_reader_template.h:607 #15 0x00007f35226bb07d in read_string<std::basic_string<char>, json_spirit::Value_impl<json_spirit::Config_map<std::basic_string<char> > > > (value=..., s=<error reading variable: Cannot access memory at address 0x7869666572702263>) at /usr/src/debug/ceph-14.2.5-1.0.8/src/json_spirit/json_spirit_reader.cpp:78 #16 json_spirit::read (s="{\"prefix\": \"get_command_descriptions\"}", value=...) at /usr/src/debug/ceph-14.2.5-1.0.8/src/json_spirit/json_spirit_reader.cpp:78 #17 0x00007f35222953a7 in cmdmap_from_json(std::vector<std::string, std::allocator<std::string> >, std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > >*, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >&) () at /usr/src/debug/ceph-14.2.5-1.0.8/src/common/cmdparse.cc:294 #18 0x00007f3522255492 in AdminSocket::execute_command(std::string const&, ceph::buffer::v14_2_0::list&) () at /opt/rh/devtoolset-8/root/usr/include/c++/8/new:169 #19 0x00007f35222567ef in AdminSocket::do_accept() () at /usr/src/debug/ceph-14.2.5-1.0.8/src/common/admin_socket.cc:341 #20 0x00007f3522258cc8 in AdminSocket::entry (this=0x564a09f3a950) at /usr/src/debug/ceph-14.2.5-1.0.8/src/common/admin_socket.cc:241 #21 0x00007f352270f68f in execute_native_thread_routine () from /usr/lib64/ceph/libceph-common.so.0 #22 0x00007f35466d1ea5 in start_thread () from /lib64/libpthread.so.0 #23 0x00007f35426938dd in clone () from /lib64/libc.so.6

3 years, 9 months

2
2
0 0

client - monitor communication.

by Budai Laszlo

Hello everybody, I'm trying to figure out how often the ceph client is contacting the monitors for updating its own information about the cluster map. Can anyone point me to a document describing this client <-> monitor communication? Thank you, Laszlo

3 years, 9 months

4
6
0 0

Monitor IPs

by Will Payne

I need to change the network my monitors are on. It seems this is not a trivial thing to do. Are there any up-to-date instructions for doing so on a cephadm-deployed cluster? I’ve found some steps in older versions of the docs but not sure if these are still correct - they mention using the ceph-mon command which I don’t have. Will

3 years, 9 months

5
4
0 0

about replica size

by Zhenshi Zhou

Hi, As we all know, the default replica setting of 'size' is 3 which means there are 3 copies of an object. What is the disadvantages if I set it to 2, except I get fewer copies? Thanks

3 years, 9 months

6
6
0 0

high commit_latency and apply_latency

by rainning

we have a cluster with very low load, however, "ceph osd perf" shows high commit_latency and apply_latency. root@stor-mgt01:~# ceph -s   cluster:     id:     3d1ec789-829d-4e0f-b707-9363356a68f1     health: HEALTH_WARN             application not enabled on 3 pool(s)     services:     mon: 3 daemons, quorum a,b,c     mgr: a(active)     mds: rook-ceph-filesystem-1/1/1 up  {0=rook-ceph-filesystem-a=up:active}, 1 up:standby-replay     osd: 55 osds: 55 up, 55 in     rgw: 3 daemons active     data:     pools:   18 pools, 3328 pgs     objects: 1.06M objects, 1.37TiB     usage:   4.18TiB used, 141TiB / 145TiB avail     pgs:     3328 active+clean     io:     client:   46.2KiB/s rd, 2.58MiB/s wr, 197op/s rd, 273op/s wr ---------------------------------------------------------------------------------------- osd commit_latency(ms) apply_latency(ms)  54                  0                 0  53                  0                 0  52                  0                 0  51                  0                 0  50                  0                 0  49                 11                11  48                  9                 9  47                  3                 3  46                 39                39  21                 28                28  20                 26                26  19                 28                28  18                  1                 1  17                 17                17  16                  6                 6  15                 11                11  14                  9                 9  13                  8                 8  12                 12                12  11                 14                14  10                 21                21   0                 10                10   1                 12                12   2                 30                30   3                  3                 3   4                  3                 3   5                 23                23   6                 56                56   7                  2                 2   8                 36                36   9                  5                 5  22                 26                26  23                  3                 3  24                  2                 2  25                  3                 3  26                 19                19  27                  3                 3  28                  4                 4  29                  8                 8  30                 16                16  31                 21                21  32                 13                13  33                  7                 7  34                 10                10  35                 12                12  36                  5                 5  37                  5                 5  38                  4                 4  39                 29                29  40                 26                26  41                 35                35  42                  5                 5  43                  4                 4  44                  8                 8  45                  6                 6

3 years, 9 months

1
0
0 0

How to deal with the incomplete records in rocksdb

by zhouli_2000＠163.com

Hi all, rocksdb failed to open when the ceph-osd process was restarted after unplugging the OSD data disk with Ceph 14.2.5 on Centos 7.6. 1) After unplugging the OSD data disk, the ceph-osd process exist. -3> 2020-07-13 15:25:35.912 7f1ad7254700 -1 bdev(0x559d1134f880 /var/lib/ceph/osd/ceph-10/block) _sync_write sync_file_range error: (5) Input/output error -2> 2020-07-13 15:25:35.912 7f1ad9c5f700 -1 bdev(0x559d1134f880 /var/lib/ceph/osd/ceph-10/block) _aio_thread got r=-5 ((5) Input/output error) -1> 2020-07-13 15:25:35.917 7f1ad9c5f700 -1 /root/rpmbuild/BUILD/ceph-14.2.5-1.0.9/src/os/bluestore/KernelDevice.cc: In function 'void KernelDevice::_aio_thread()' thread 7f1ad9c5f700 time 2020-07-13 15:25:35.913821 /root/rpmbuild/BUILD/ceph-14.2.5-1.0.9/src/os/bluestore/KernelDevice.cc: 534: ceph_abort_msg("Unexpected IO error. This may suggest a hardware issue. Please check your kernel log!") ceph version 14.2.5-93-g9a4f93e (9a4f93e7143bcdd5fadc88eb58bb730ae97b89c5) nautilus (stable) 1: (ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0xdd) [0x559d05b6069a] 2: (KernelDevice::_aio_thread()+0xebe) [0x559d061a54ee] 3: (KernelDevice::AioCompletionThread::entry()+0xd) [0x559d061a7add] 4: (()+0x7dd5) [0x7f1ae66aedd5] 5: (clone()+0x6d) [0x7f1ae5572ead] 2) Plug the disk back in and restart the ceph-osd process, rocksdb found that incomplete records existed and stop to work. 2020-07-13 15:51:38.305 7f9801ef5a80 4 rocksdb: [db/db_impl_open.cc:583] Recovering log #9 mode 0 2020-07-13 15:51:38.748 7f9801ef5a80 3 rocksdb: [db/db_impl_open.cc:518] db.wal/000009.log: dropping 2922 bytes; Corruption: missing start of fragmented record(2) 2020-07-13 15:51:38.748 7f9801ef5a80 4 rocksdb: [db/db_impl.cc:390] Shutdown: canceling all background work 2020-07-13 15:51:38.748 7f9801ef5a80 4 rocksdb: [db/db_impl.cc:563] Shutdown complete 2020-07-13 15:51:38.748 7f9801ef5a80 -1 rocksdb: Corruption: missing start of fragmented record(2) 2020-07-13 15:51:38.748 7f9801ef5a80 -1 bluestore(/var/lib/ceph/osd/ceph-10) _open_db erroring opening db: 2020-07-13 15:51:38.748 7f9801ef5a80 1 bluefs umount 2020-07-13 15:51:38.776 7f9801ef5a80 1 fbmap_alloc 0x55c897e0a900 shutdown 2020-07-13 15:51:38.776 7f9801ef5a80 1 bdev(0x55c898a6ce00 /var/lib/ceph/osd/ceph-10/block) close Why does rocksdb not automatically delete these incomplete records and continue work? In addition, after the occurrence of this situation, what method should be used to recover.

3 years, 9 months

1
0
1 0

回复：Re: osd bench with or without a separate WAL device deployed

by rainning

I tested osd bench with different block size: 1MB, 512KB, 256KB, 128KB, 64KB, and 32KB. osd.2 is one from the cluster where osds have better 4KB osd bench, and osd.30 is from the cluster where osds have lower 4KB osd bench.  Before 32KB, osd.30 was better than osd.2, however, there was a big drop on osd.30 with 32KB block size. root@cmn01:~# ceph tell osd.2 bench 1073741824 1048576 {     "bytes_written": 1073741824,     "blocksize": 1048576,     "bytes_per_sec": 188747963 } root@cmn01:~# ceph tell osd.2 bench 1073741824 524288 {     "bytes_written": 1073741824,     "blocksize": 524288,     "bytes_per_sec": 181071543 } root@cmn01:~# ceph tell osd.2 bench 786432000 262144 {     "bytes_written": 786432000,     "blocksize": 262144,     "bytes_per_sec": 159007035 } root@cmn01:~# ceph tell osd.2 bench 393216000 131072 {     "bytes_written": 393216000,     "blocksize": 131072,     "bytes_per_sec": 127179122 } root@cmn01:~# ceph tell osd.2 bench 196608000 65536 {     "bytes_written": 196608000,     "blocksize": 65536,     "bytes_per_sec": 83365482 } root@cmn01:~# ceph tell osd.2 bench 98304000 32768 {     "bytes_written": 98304000,     "blocksize": 32768,     "bytes_per_sec": 48351258 } root@cmn01:~# ceph tell osd.2 bench 49152000 16384 {     "bytes_written": 49152000,     "blocksize": 16384,     "bytes_per_sec": 31725841 } ------------------------------------------------------------------------------------------------------------------ root@stor-mgt01:~# ceph tell osd.30 bench 1073741824 1048576 {     "bytes_written": 1073741824,     "blocksize": 1048576,     "elapsed_sec": 5.344805,     "bytes_per_sec": 200894474.890259,     "iops": 191.587901 } root@stor-mgt01:~# ceph tell osd.30 bench 1073741824 524288 {     "bytes_written": 1073741824,     "blocksize": 524288,     "elapsed_sec": 5.303052,     "bytes_per_sec": 202476205.680661,     "iops": 386.192714 } root@stor-mgt01:~# ceph tell osd.30 bench 786432000 262144 {     "bytes_written": 786432000,     "blocksize": 262144,     "elapsed_sec": 3.878248,     "bytes_per_sec": 202780204.655892,     "iops": 773.545092 } root@stor-mgt01:~# ceph tell osd.30 bench 393216000 131072 {     "bytes_written": 393216000,     "blocksize": 131072,     "elapsed_sec": 1.939532,     "bytes_per_sec": 202737591.242988,     "iops": 1546.765070 } root@stor-mgt01:~# ceph tell osd.30 bench 196608000 65536 {     "bytes_written": 196608000,     "blocksize": 65536,     "elapsed_sec": 1.081617,     "bytes_per_sec": 181772360.338257,     "iops": 2773.626104 } root@stor-mgt01:~# ceph tell osd.30 bench 98304000 32768 {     "bytes_written": 98304000,     "blocksize": 32768,     "elapsed_sec": 2.908703,     "bytes_per_sec": 33796507.598640,     "iops": 1031.387561 } root@stor-mgt01:~# ceph tell osd.30 bench 49152000 16384 {     "bytes_written": 49152000,     "blocksize": 16384,     "elapsed_sec": 3.907744,     "bytes_per_sec": 12578102.861185,     "iops": 767.706473 } ------------------ 原始邮件 ------------------ 发件人: "rainning" <tweetypie(a)qq.com>gt;; 发送时间: 2020年7月16日(星期四) 上午9:42 收件人: "Zhenshi Zhou"<deaderzzs(a)gmail.com>gt;; 抄送: "ceph-users"<ceph-users(a)ceph.io>gt;; 主题: 回复：[ceph-users] Re: osd bench with or without a separate WAL device deployed Hi Zhenshi, I did try with bigger block size. Interestingly, the one whose 4KB osd bench was lower performed slightly better in 4MB osd bench. Let me try some other bigger block sizes, e.g. 16K, 64K, 128K, 1M etc, to see if there is any pattern. Moreover, I did compare two SSDs, they respectively are INTEL SSDSC2KB480G8 and INTEL SSDSC2KB960G8. Performance wise, there is no much difference. Thanks, Ning ------------------ 原始邮件 ------------------ 发件人: "Zhenshi Zhou" <deaderzzs(a)gmail.com>gt;; 发送时间: 2020年7月16日(星期四) 上午9:24 收件人: "rainning"<tweetypie(a)qq.com>gt;; 抄送: "ceph-users"<ceph-users(a)ceph.io>gt;; 主题: [ceph-users] Re: osd bench with or without a separate WAL device deployed Maybe you can try writing with bigger block size and compare the results. For bluestore, the write operations contain two modes. One is COW, the other is RMW. AFAIK only RMW uses wal in order to prevent data from being interrupted. rainning <tweetypie(a)qq.com> 于2020年7月15日周三下午11:04写道： > Hi Zhenshi, thanks very much for the reply. > > Yes I know it is ood that the bluestore is deployed only with a separate > db device  but no a WAL device. The cluster was deployed in k8s using rook. > I was told it was because the rook we used didn't support that. > > Moreover, the comparison was made on osd bench, so the network should not > be the case. As far as the storage node hardware, although two clusters are > indeed different, their CPUs and HDDs do have almost same performance > numbers. I haven't compared SSDs that are used as db/WAL devices, it might > cause difference, but I am not sure if it can make two times difference. > > ---Original--- > *From:* "Zhenshi Zhou"<deaderzzs(a)gmail.com> > *Date:* Wed, Jul 15, 2020 18:39 PM > *To:* "rainning"<tweetypie(a)qq.com>gt;; > *Cc:* "ceph-users"<ceph-users(a)ceph.io>gt;; > *Subject:* [ceph-users] Re: osd bench with or without a separate WAL > device deployed > > I deployed the cluster either with separate db/wal or put db/wal/data > together. Never tried to have only a seperate db. > AFAIK wal does have an effect on writing but I'm not sure if it could be > two times of the bench value. Hardware and > network environment are also important factors. > > rainning <tweetypie(a)qq.com> 于2020年7月15日周三下午4:35写道： > > > Hi all, > > > > > > I am wondering if there is any performance comparison done on osd bench > > with and without a separate WAL device deployed given that there is > always > > a separate db device deployed on SSD in both cases. > > > > > > The reason I am asking this question is that we have two clusters and > osds > > in one have separate db and WAL device deployed on SSD but osds in > another > > only have a separate db device deployed. And we found 4KB osd bench (i.e. > > ceph tell osd.X bench 12288000 4096) for the ones having a separate WAL > > device was two times of the ones without a separate WAL device. Is the > > performance difference caused by the separate WAL device? > > > > > > Thanks, > > Ning > > _______________________________________________ > > ceph-users mailing list -- ceph-users(a)ceph.io > > To unsubscribe send an email to ceph-users-leave(a)ceph.io > > > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io > _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

3 years, 9 months

2
1
0 0

osd bench with or without a separate WAL device deployed

by rainning

Hi all, I am wondering if there is any performance comparison done on osd bench with and without a separate WAL device deployed given that there is always a separate db device deployed on SSD in both cases. The reason I am asking this question is that we have two clusters and osds in one have separate db and WAL device deployed on SSD but osds in another only have a separate db device deployed. And we found 4KB osd bench (i.e. ceph tell osd.X bench 12288000 4096) for the ones having a separate WAL device was two times of the ones without a separate WAL device. Is the performance difference caused by the separate WAL device? Thanks, Ning

3 years, 9 months

2
5
0 0

Re: [RGW] Space usage vastly overestimated since Octopus upgrade

by David Monschein

Hi Liam, All, We have also run into this bug: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/PCYY2MKRPCP… Like you, we are also running Octopus 15.2.3 Downgrading the RGWs at this point is not ideal, but if a fix isn't found soon we might have to. Has a bug report been filed for this yet? - Dave

3 years, 9 months

1
0
0 0

2024

2023

2022

2021

2020

2019

ceph-users