On 9/25/2020 6:07 PM, Saber(a)PlanetHoster.info wrote:
> Hi Igor,
>
> The only thing abnormal about this osdstore is that it was created by
> Mimic 13.2.8, and I can see that the OSD sizes in this osdstore are not
> the same as the others in the cluster (while they should be exactly
> the same size).
>
> Can it be https://tracker.ceph.com/issues/39151 ?
Hmm, maybe... Did you change the hardware for this OSD's node at some
point, as happened in the ticket?
And it's still unclear to me if the issue is reproducible for you.
Could you please also run fsck first, and then repair, for this OSD and
collect the log(s)?
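For reference, a minimal sketch of those two invocations, run with the OSD
stopped; the OSD data path is an assumption:

  ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-<id>
  ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-<id>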
Thanks,
Igor
>
> Thanks!
> Saber
> CTO @PlanetHoster
>
>> On Sep 25, 2020, at 5:46 AM, Igor Fedotov <ifedotov(a)suse.de> wrote:
>>
>> Hi Saber,
>>
>> I don't think this is related. The new assertion happens along the write
>> path, while the original one occurred on allocator shutdown.
>>
>>
>> Unfortunately there is not much information to troubleshoot this...
>> Are you able to reproduce the case?
>>
>>
>> Thanks,
>>
>> Igor
>>
>> On 9/25/2020 4:21 AM, Saber(a)PlanetHoster.info wrote:
>>> Hi Igor,
>>>
>>> We had an OSD crash a week into running Nautilus. I have attached
>>> the logs; is it related to the same bug?
>>>
>>>
>>>
>>>
>>> Thanks,
>>> Saber
>>> CTO @PlanetHoster
>>>
>>>> On Sep 14, 2020, at 10:22 AM, Igor Fedotov <ifedotov(a)suse.de> wrote:
>>>>
>>>> Thanks!
>>>>
>>>> Now got the root cause. The fix is on its way...
>>>>
>>>> Meanwhile you might want to try to work around the issue by setting
>>>> "bluestore_hybrid_alloc_mem_cap" to 0 or by using a different allocator,
>>>> e.g. avl for bluestore_allocator (and optionally for
>>>> bluefs_allocator too).
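For reference, a minimal sketch of that workaround as ceph.conf entries on
the affected OSD node, followed by an OSD restart; a suggestion only, not a
tested recipe:

  [osd]
  # variant 1: disable the hybrid allocator's memory cap
  bluestore_hybrid_alloc_mem_cap = 0
  # variant 2: switch to the avl allocator instead
  #bluestore_allocator = avl
  #bluefs_allocator = avl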
>>>>
>>>>
>>>> Hope this helps,
>>>>
>>>> Igor.
>>>>
>>>>
>>>>
>>>> On 9/14/2020 5:02 PM, Jean-Philippe Méthot wrote:
>>>>> Alright, here’s the full log file.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Jean-Philippe Méthot
>>>>> Senior Openstack system administrator
>>>>> Administrateur système Openstack sénior
>>>>> PlanetHoster inc.
>>>>> 4414-4416 Louis B Mayer
>>>>> Laval, QC, H7P 0G1, Canada
>>>>> TEL : +1.514.802.1644 - Poste : 2644
>>>>> FAX : +1.514.612.0678
>>>>> CA/US : 1.855.774.4678
>>>>> FR : 01 76 60 41 43
>>>>> UK : 0808 189 0423
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> On Sep 14, 2020, at 06:49, Igor Fedotov <ifedotov(a)suse.de> wrote:
>>>>>>
>>>>>> Well, I can see a duplicate admin socket command
>>>>>> registration/de-registration (and the second de-registration
>>>>>> asserts), but I don't understand how this could happen.
>>>>>>
>>>>>> Would you share the full log, please?
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Igor
>>>>>>
>>>>>> On 9/11/2020 7:26 PM, Jean-Philippe Méthot wrote:
>>>>>>> Here’s the out file, as requested.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Jean-Philippe Méthot
>>>>>>> Senior Openstack system administrator
>>>>>>> Administrateur système Openstack sénior
>>>>>>> PlanetHoster inc.
>>>>>>> 4414-4416 Louis B Mayer
>>>>>>> Laval, QC, H7P 0G1, Canada
>>>>>>> TEL : +1.514.802.1644 - Poste : 2644
>>>>>>> FAX : +1.514.612.0678
>>>>>>> CA/US : 1.855.774.4678
>>>>>>> FR : 01 76 60 41 43
>>>>>>> UK : 0808 189 0423
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On Sep 11, 2020, at 10:38, Igor Fedotov <ifedotov(a)suse.de> wrote:
>>>>>>>>
>>>>>>>> Could you please run:
>>>>>>>>
>>>>>>>> CEPH_ARGS="--log-file log --debug-asok 5" ceph-bluestore-tool
>>>>>>>> repair --path <...> ; cat log | grep asok > out
>>>>>>>>
>>>>>>>> and share 'out' file.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Igor
>>>>>>>>
>>>>>>>> On 9/11/2020 5:15 PM, Jean-Philippe Méthot wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> We’re upgrading our cluster from Mimic to Nautilus, one OSD node
>>>>>>>>> at a time. The release notes recommend running the following
>>>>>>>>> command to fix stats after an upgrade:
>>>>>>>>>
>>>>>>>>> ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-0
>>>>>>>>>
>>>>>>>>> However, running that command gives us the following error
>>>>>>>>> message:
>>>>>>>>>
>>>>>>>>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc: In
>>>>>>>>>> function 'virtual Allocator::SocketHook::~SocketHook()'
>>>>>>>>>> thread 7f1a6467eec0 time 2020-09-10 14:40:25.872353
>>>>>>>>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc: 53
>>>>>>>>>> : FAILED ceph_assert(r == 0)
>>>>>>>>>> ceph version 14.2.11
>>>>>>>>>> (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus (stable)
>>>>>>>>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int,
>>>>>>>>>> char const*)+0x14a) [0x7f1a5a823025]
>>>>>>>>>> 2: (()+0x25c1ed) [0x7f1a5a8231ed]
>>>>>>>>>> 3: (()+0x3c7a4f) [0x55b33537ca4f]
>>>>>>>>>> 4: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517]
>>>>>>>>>> 5: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082]
>>>>>>>>>> 6: (BlueStore::_close_db_and_around(bool)+0x2f8)
>>>>>>>>>> [0x55b335274528]
>>>>>>>>>> 7: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1)
>>>>>>>>>> [0x55b3352749a1]
>>>>>>>>>> 8: (main()+0x10b3) [0x55b335187493]
>>>>>>>>>> 9: (__libc_start_main()+0xf5) [0x7f1a574aa555]
>>>>>>>>>> 10: (()+0x1f9b5f) [0x55b3351aeb5f]
>>>>>>>>>> 2020-09-10 14:40:25.873 7f1a6467eec0 -1
>>>>>>>>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc: In function 'virtual
>>>>>>>>>> Allocator::SocketHook::~SocketHook()' thread 7f1a6467eec0
>>>>>>>>>> time 2020-09-10 14:40:25.872353
>>>>>>>>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc: 53: FAILED ceph_assert(r == 0)
>>>>>>>>>>
>>>>>>>>>> ceph version 14.2.11
>>>>>>>>>> (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus (stable)
>>>>>>>>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int,
>>>>>>>>>> char const*)+0x14a) [0x7f1a5a823025]
>>>>>>>>>> 2: (()+0x25c1ed) [0x7f1a5a8231ed]
>>>>>>>>>> 3: (()+0x3c7a4f) [0x55b33537ca4f]
>>>>>>>>>> 4: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517]
>>>>>>>>>> 5: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082]
>>>>>>>>>> 6: (BlueStore::_close_db_and_around(bool)+0x2f8)
>>>>>>>>>> [0x55b335274528]
>>>>>>>>>> 7: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1)
>>>>>>>>>> [0x55b3352749a1]
>>>>>>>>>> 8: (main()+0x10b3) [0x55b335187493]
>>>>>>>>>> 9: (__libc_start_main()+0xf5) [0x7f1a574aa555]
>>>>>>>>>> 10: (()+0x1f9b5f) [0x55b3351aeb5f]
>>>>>>>>>> *** Caught signal (Aborted) **
>>>>>>>>>> in thread 7f1a6467eec0 thread_name:ceph-bluestore-
>>>>>>>>>> ceph version 14.2.11
>>>>>>>>>> (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus (stable)
>>>>>>>>>> 1: (()+0xf630) [0x7f1a58cf0630]
>>>>>>>>>> 2: (gsignal()+0x37) [0x7f1a574be387]
>>>>>>>>>> 3: (abort()+0x148) [0x7f1a574bfa78]
>>>>>>>>>> 4: (ceph::__ceph_assert_fail(char const*, char const*, int,
>>>>>>>>>> char const*)+0x199) [0x7f1a5a823074]
>>>>>>>>>> 5: (()+0x25c1ed) [0x7f1a5a8231ed]
>>>>>>>>>> 6: (()+0x3c7a4f) [0x55b33537ca4f]
>>>>>>>>>> 7: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517]
>>>>>>>>>> 8: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082]
>>>>>>>>>> 9: (BlueStore::_close_db_and_around(bool)+0x2f8)
>>>>>>>>>> [0x55b335274528]
>>>>>>>>>> 10: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1)
>>>>>>>>>> [0x55b3352749a1]
>>>>>>>>>> 11: (main()+0x10b3) [0x55b335187493]
>>>>>>>>>> 12: (__libc_start_main()+0xf5) [0x7f1a574aa555]
>>>>>>>>>> 13: (()+0x1f9b5f) [0x55b3351aeb5f]
>>>>>>>>>> 2020-09-10 14:40:25.874 7f1a6467eec0 -1 *** Caught signal
>>>>>>>>>> (Aborted) **
>>>>>>>>>> in thread 7f1a6467eec0 thread_name:ceph-bluestore-
>>>>>>>>>>
>>>>>>>>>> ceph version 14.2.11
>>>>>>>>>> (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus (stable)
>>>>>>>>>> 1: (()+0xf630) [0x7f1a58cf0630]
>>>>>>>>>> 2: (gsignal()+0x37) [0x7f1a574be387]
>>>>>>>>>> 3: (abort()+0x148) [0x7f1a574bfa78]
>>>>>>>>>> 4: (ceph::__ceph_assert_fail(char const*, char const*, int,
>>>>>>>>>> char const*)+0x199) [0x7f1a5a823074]
>>>>>>>>>> 5: (()+0x25c1ed) [0x7f1a5a8231ed]
>>>>>>>>>> 6: (()+0x3c7a4f) [0x55b33537ca4f]
>>>>>>>>>> 7: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517]
>>>>>>>>>> 8: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082]
>>>>>>>>>> 9: (BlueStore::_close_db_and_around(bool)+0x2f8)
>>>>>>>>>> [0x55b335274528]
>>>>>>>>>> 10: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1)
>>>>>>>>>> [0x55b3352749a1]
>>>>>>>>>> 11: (main()+0x10b3) [0x55b335187493]
>>>>>>>>>> 12: (__libc_start_main()+0xf5) [0x7f1a574aa555]
>>>>>>>>>> 13: (()+0x1f9b5f) [0x55b3351aeb5f]
>>>>>>>>>> NOTE: a copy of the executable, or `objdump -rdS
>>>>>>>>>> <executable>` is needed to interpret this.
>>>>>>>>>
>>>>>>>>> What could be the source of this error? I haven’t found much
>>>>>>>>> of anything about it online.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Jean-Philippe Méthot
>>>>>>>>> Senior Openstack system administrator
>>>>>>>>> Administrateur système Openstack sénior
>>>>>>>>> PlanetHoster inc.
>>>>>>>>> 4414-4416 Louis B Mayer
>>>>>>>>> Laval, QC, H7P 0G1, Canada
>>>>>>>>> TEL : +1.514.802.1644 - Poste : 2644
>>>>>>>>> FAX : +1.514.612.0678
>>>>>>>>> CA/US : 1.855.774.4678
>>>>>>>>> FR : 01 76 60 41 43
>>>>>>>>> UK : 0808 189 0423
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>
>
Hi all,
I had an issue where the Docker containers on all the Ceph nodes just seemed
to stop at some point, effectively shutting down the cluster. Restarting
Ceph on all of the nodes restored the cluster to normal working order.
I would like to find out why this occurred; any ideas on where to look?
Many thanks,
Darrin
> On Oct 26, 2020, at 00:07, Anthony D'Atri <anthony.datri(a)gmail.com> wrote:
>
>> I'm not entirely sure if primary on SSD will actually make the read happen on SSD.
>
> My understanding is that by default reads always happen from the lead OSD in the acting set. Octopus seems to (finally) have an option to spread the reads around, which IIRC defaults to false.
I also remember that “by default reads always happen from the lead OSD in the acting set”. I dug through git blame and it seems ceph-fuse has had a --localize-reads option since 10 years ago [1], though it is not documented anywhere. I can’t find such a setting in the kernel ceph module.
[1]: https://github.com/ceph/ceph/commit/7912f5c7034bd26d22615d1be1d398849e124749
> I’ve never seen anything that implies that lead OSDs within an acting set are a function of CRUSH rule ordering. I’m not asserting that they aren’t though, but I’m … skeptical.
That conclusion is from experiments. I created an empty pool with the above-mentioned CRUSH rule, and all 32 PGs had an SSD as primary.
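For reference, that experiment was roughly along these lines; a sketch, with the test pool name made up:

  ceph osd pool create mixed-test 32 32 replicated mixed_replicated_rule
  ceph pg ls-by-pool mixed-test        # inspect UP/ACTING and the primary of each PG
  ceph osd pool delete mixed-test mixed-test --yes-i-really-really-mean-it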
> Setting primary affinity would do the job, and you’d want to have cron continually update it across the cluster to react to topology changes. I was told of this strategy back in 2014, but haven’t personally seen it implemented.
I’m also considering this. But if I set the primary affinity of the HDDs to 0, then what will happen if I create another all-HDD pool? Or should I just set the primary affinity to a very small value, say 0.00001?
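For reference, primary affinity is set per OSD, roughly like this; the OSD id and the tiny value are just examples:

  ceph osd primary-affinity osd.12 0          # never preferred as primary
  ceph osd primary-affinity osd.12 0.00001    # almost never preferred as primary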
> That said, HDDs are more of a bottleneck for writes than reads and just might be fine for your application. Tiny reads are going to limit you to some degree regardless of drive type, and you do mention throughput, not IOPS.
>
> I must echo Frank’s notes about capacity too. Ceph can do a lot of things, but that doesn’t mean something exotic is necessarily the best choice. You’re concerned about 3R only yielding 1/3 of raw capacity in an all-SSD cluster, but the architecture you propose limits you anyway because of drive size. Consider chassis, CPU, RAM, RU, and switch-port costs as well, and the cost of you fussing over an exotic solution instead of the hundreds of other things in your backlog.
>
> And your cluster as described is *tiny*. Honestly I’d suggest considering one of these alternatives:
>
> * Ditch the HDDs, use QLC flash. The emerging EDSFF drives are really promising for replacing HDDs for density in this kind of application. You might even consider ARM if IOPS aren’t a concern.
> * An NVMeoF solution
Thanks for the advice, we will discuss these. But this deployment is on existing server hardware, so we don’t have many choices, and our budget is very limited. We want to make the best use of our existing SSDs, and we have plenty of cold data to fill our HDDs, so we are not worried about wasting HDD capacity.
Sorry Anthony, I sent this mail twice. I forgot to CC the mailing list at first.
> Cache tiers are “deprecated”, but then so are custom cluster names. Neither appears
>
>> For EC pools there is an option "fast_read" (https://docs.ceph.com/en/latest/rados/operations/pools/?highlight=fast_read…), which states that a read will return as soon as the first k shards have arrived. The default is to wait for all k+m shards (all replicas). This option is not available for replicated pools.
>>
>> Now, not sure if this option is not available for replicated pools because the read will always be served by the acting primary, or if it currently waits for all replicas. In the latter case, reads will wait for the slowest device.
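For reference, a minimal sketch of toggling that option on an EC pool; the pool name is an assumption:

  ceph osd pool set my-ec-pool fast_read 1
  ceph osd pool get my-ec-pool fast_read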
>>
>> I'm not sure if I interpret this correctly. I think you should test the setup with HDD only and SSD+HDD to see if read speed improves. Note that write speed will always depend on the slowest device.
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Frank Schilder <frans(a)dtu.dk>
>> Sent: 25 October 2020 15:03:16
>> To: 胡 玮文; Alexander E. Patrakov
>> Cc: ceph-users(a)ceph.io
>> Subject: [ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool
>>
>> A cache pool might be an alternative, heavily depending on how much data is hot. However, then you will have much less SSD capacity available, because it also requires replication.
>>
>> Looking at the setup, you have only 10*1T = 10T of SSD but 20*6T = 120T of HDD, so you will probably run short of SSD capacity. Or, looking at it the other way around, with copies on 1 SSD + 3 HDDs, you will only be able to use about 30T out of the 120T HDD capacity.
>>
>> With this replication, the usable storage will be 10T and raw used will be 10T SSD and 30T HDD. If you can't do anything else on the HDD space, you will need more SSDs. If your servers have more free disk slots, you can add SSDs over time until you have at least 40T SSD capacity to balance SSD and HDD capacity.
>>
>> Personally, I think the 1SSD + 3HDD is a good option compared with a cache pool. You have the data security of 3-times replication and, if everything is up, need only 1 copy in the SSD cache, which means that you have 3 times the cache capacity.
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: 胡 玮文 <huww98(a)outlook.com>
>> Sent: 25 October 2020 13:40:55
>> To: Alexander E. Patrakov
>> Cc: ceph-users(a)ceph.io
>> Subject: [ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool
>>
>> Yes. This is a limitation of the CRUSH algorithm, in my mind. In order to guard against 2 host failures, I’m going to use 4 replicas, 1 on SSD and 3 on HDD. This will work as intended, right? Because at least I can ensure the 3 HDDs are from different hosts.
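For reference, applying that to a pool would look roughly like this; the pool name and min_size are assumptions:

  ceph osd pool set cephfs_data crush_rule mixed_replicated_rule
  ceph osd pool set cephfs_data size 4
  ceph osd pool set cephfs_data min_size 2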
>>
>>>> On Oct 25, 2020, at 20:04, Alexander E. Patrakov <patrakov(a)gmail.com> wrote:
>>>
>>> On Sun, Oct 25, 2020 at 12:11 PM huww98(a)outlook.com <huww98(a)outlook.com> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> We are planning a new pool to store our dataset using CephFS. These data are almost read-only (but not guaranteed) and consist of a lot of small files. Each node in our cluster has 1 * 1T SSD and 2 * 6T HDD, and we will deploy about 10 such nodes. We aim to get the highest read throughput.
>>>>
>>>> If we just use a replicated pool of size 3 on SSD, we should get the best performance; however, that only leaves us 1/3 of the usable SSD space. And EC pools are not friendly to such small-object read workloads, I think.
>>>>
>>>> Now I’m evaluating a mixed SSD and HDD replication strategy. Ideally, I want 3 data replicas, each on a different host (failure domain): 1 of them on SSD, the other 2 on HDD. And normally every read request is directed to the SSD. So, if every SSD OSD is up, I’d expect the same read throughput as an all-SSD deployment.
>>>>
>>>> I’ve read the documents and did some tests. Here is the crush rule I’m testing with:
>>>>
>>>> rule mixed_replicated_rule {
>>>> id 3
>>>> type replicated
>>>> min_size 1
>>>> max_size 10
>>>> step take default class ssd
>>>> step chooseleaf firstn 1 type host
>>>> step emit
>>>> step take default class hdd
>>>> step chooseleaf firstn -1 type host
>>>> step emit
>>>> }
>>>>
>>>> Now I have the following conclusions, but I’m not very sure:
>>>> * The first OSD produced by CRUSH will be the primary OSD (at least if I don’t change the “primary affinity”). So, the above rule is guaranteed to map an SSD OSD as the primary of each PG, and every read request will be served from the SSD if it is up.
>>>> * It is currently not possible to enforce that the SSD and HDD OSDs are chosen from different hosts. So, if I want to ensure data availability even if 2 hosts fail, I need to choose 1 SSD and 3 HDD OSDs. That means setting the replication size to 4, instead of the ideal value of 3, on the pool using the above CRUSH rule.
>>>>
>>>> Am I correct about the above statements? How would this work from your experience? Thanks.
>>>
>>> This works (i.e. guards against host failures) only if you have
>>> strictly separate sets of hosts that have SSDs and that have HDDs.
>>> I.e., there should be no host that has both, otherwise there is a
>>> chance that one hdd and one ssd from that host will be picked.
>>>
>>> --
>>> Alexander E. Patrakov
>>> CV: https://nam10.safelinks.protection.outlook.com/?url=http%3A%2F%2Fpc.cd%2FPL…
Hi, my cluster crashed when one of my DCs went down. 'ceph -s' doesn't show
me the current working status and nothing has changed for a long time. How
can I see what Ceph is really doing?
  cluster:
    health: HEALTH_ERR
            mons fond-beagle,guided-tuna are using a lot of disk space
            1/3 mons down, quorum fond-beagle,guided-tuna
            18/404368 objects unfound (0.004%)
            Reduced data availability: 235 pgs inactive, 72 pgs down, 9 pgs incomplete
            Possible data damage: 3 pgs recovery_unfound
            Degraded data redundancy: 306574/2607020 objects degraded (11.760%), 10 pgs degraded, 10 pgs undersized
            2 pgs not deep-scrubbed in time
            32408 slow ops, oldest one blocked for 62348 sec, daemons [osd.0,osd.10,osd.11,osd.13,osd.14,osd.15,osd.16,osd.17,osd.18,osd.19]... have slow ops.

  services:
    mon: 3 daemons, quorum fond-beagle,guided-tuna (age 31m), out of quorum: alive-lynx
    mgr: fond-beagle(active, since 31m)
    osd: 52 osds: 28 up (since 30m), 28 in (since 11h); 3 remapped pgs

  data:
    pools:   7 pools, 2305 pgs
    objects: 404.37k objects, 1.7 TiB
    usage:   2.7 TiB used, 22 TiB / 24 TiB avail
    pgs:     6.681% pgs unknown
             3.514% pgs not active
             306574/2607020 objects degraded (11.760%)
             18/404368 objects unfound (0.004%)
             2060 active+clean
             154  unknown
             72   down
             9    incomplete
             7    active+undersized+degraded
             3    active+recovery_unfound+undersized+degraded+remapped
Hi All,
My main CephFS data pool on a Luminous 12.2.10 cluster hit capacity
overnight. Metadata is on a separate pool, which didn't hit capacity, but the
filesystem stopped working, which I'd expect. I increased the OSD full-ratio
to give me some breathing room to get some data deleted once the filesystem
is back online. When I attempt to restart the MDS service, I see the usual
stuff I'd expect in the log, but then:
heartbeat_map is_healthy 'MDSRank' had timed out after 15
Followed by:
mds.beacon.hostnamecephssd01 Skipping beacon heartbeat to monitors (last
> acked 4.00013s ago); MDS internal heartbeat is not healthy!
Eventually I get:
>
> mds.beacon.hostnamecephssd01 is_laggy 29.372 > 15 since last acked beacon
> mds.0.90884 skipping upkeep work because connection to Monitors appears
> laggy
> mds.hostnamecephssd01 Updating MDS map to version 90885 from mon.0
> mds.beacon.hostnamecephssd01 MDS is no longer laggy
The "MDS is no longer laggy" appears to be where the service fails
Meanwhile a ceph -s is showing:
>
> cluster:
> id: 5c5998fd-dc9b-47ec-825e-beaba66aad11
> health: HEALTH_ERR
> 1 filesystem is degraded
> insufficient standby MDS daemons available
> 67 backfillfull osd(s)
> 11 nearfull osd(s)
> full ratio(s) out of order
> 2 pool(s) backfillfull
> 2 pool(s) nearfull
> 6 scrub errors
> Possible data damage: 5 pgs inconsistent
> services:
> mon: 3 daemons, quorum hostnameceph01,hostnameceph02,hostnameceph03
> mgr: hostnameceph03(active), standbys: hostnameceph02, hostnameceph01
> mds: cephfs-1/1/1 up {0=hostnamecephssd01=up:replay}
> osd: 172 osds: 161 up, 161 in
> data:
> pools: 5 pools, 8384 pgs
> objects: 76.25M objects, 124TiB
> usage: 373TiB used, 125TiB / 498TiB avail
> pgs: 8379 active+clean
> 5 active+clean+inconsistent
> io:
> client: 676KiB/s rd, 0op/s rd, 0op/s w
The 5 pgs inconsistent is not a new issue; those are from past scrubs. I just
haven't gotten around to manually clearing them, although I suppose they
could be related to my issue.
The cluster has no clients connected.
I did notice in the ceph.log that some OSDs on the same host as the MDS
service briefly went down when trying to restart the MDS, but examining the
logs of those particular OSDs isn't showing any glaring issues.
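For reference, the full-ratio bump mentioned above was done along these
lines; the exact value is an assumption:

  ceph osd set-full-ratio 0.97
  ceph osd dump | grep ratio    # verify the resulting ratios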
Full MDS log at debug 5 (can go higher if needed):
2020-10-22 11:27:10.987652 7f6f696f5240 0 set uid:gid to 167:167
(ceph:ceph)
2020-10-22 11:27:10.987669 7f6f696f5240 0 ceph version 12.2.10
(177915764b752804194937482a39e95e0ca3de94) luminous (stable), process
ceph-mds, pid 2022582
2020-10-22 11:27:10.990567 7f6f696f5240 0 pidfile_write: ignore empty
--pid-file
2020-10-22 11:27:11.027981 7f6f62616700 1 mds.hostnamecephssd01 Updating
MDS map to version 90882 from mon.0
2020-10-22 11:27:15.097957 7f6f62616700 1 mds.hostnamecephssd01 Updating
MDS map to version 90883 from mon.0
2020-10-22 11:27:15.097989 7f6f62616700 1 mds.hostnamecephssd01 Map has
assigned me to become a standby
2020-10-22 11:27:15.101071 7f6f62616700 1 mds.hostnamecephssd01 Updating
MDS map to version 90884 from mon.0
2020-10-22 11:27:15.105310 7f6f62616700 1 mds.0.90884 handle_mds_map i am
now mds.0.90884
2020-10-22 11:27:15.105316 7f6f62616700 1 mds.0.90884 handle_mds_map state
change up:boot --> up:replay
2020-10-22 11:27:15.105325 7f6f62616700 1 mds.0.90884 replay_start
2020-10-22 11:27:15.105333 7f6f62616700 1 mds.0.90884 recovery set is
2020-10-22 11:27:15.105344 7f6f62616700 1 mds.0.90884 waiting for osdmap
73745 (which blacklists prior instance)
2020-10-22 11:27:15.149092 7f6f5be09700 0 mds.0.cache creating system
inode with ino:0x100
2020-10-22 11:27:15.149693 7f6f5be09700 0 mds.0.cache creating system
inode with ino:0x1
2020-10-22 11:27:41.021708 7f6f63618700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:27:43.029290 7f6f5f610700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:27:43.029297 7f6f5f610700 0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 4.00013s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:27:45.866711 7f6f5fe11700 1 heartbeat_map reset_timeout
'MDSRank' had timed out after 15
2020-10-22 11:28:01.021965 7f6f63618700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:28:03.029862 7f6f5f610700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:28:03.029885 7f6f5f610700 0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 4.00113s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:28:06.022033 7f6f63618700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:28:07.029955 7f6f5f610700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:28:07.029961 7f6f5f610700 0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 8.00126s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:28:11.022099 7f6f63618700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:28:11.030024 7f6f5f610700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:28:11.030028 7f6f5f610700 0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 12.0014s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:28:15.030092 7f6f5f610700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:28:15.030099 7f6f5f610700 0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 16.0015s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:28:16.022165 7f6f63618700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:28:19.030163 7f6f5f610700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:28:19.030169 7f6f5f610700 0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 20.0016s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:28:21.022231 7f6f63618700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:28:23.030233 7f6f5f610700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:28:23.030241 7f6f5f610700 0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 24.0008s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:28:26.022295 7f6f63618700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:28:27.030305 7f6f5f610700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:28:27.030311 7f6f5f610700 0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 28.0009s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:28:28.401161 7f6f5fe11700 1 heartbeat_map reset_timeout
'MDSRank' had timed out after 15
2020-10-22 11:28:28.401168 7f6f5fe11700 1 mds.beacon.hostnamecephssd01
is_laggy 29.372 > 15 since last acked beacon
2020-10-22 11:28:28.401177 7f6f5fe11700 1 mds.0.90884 skipping upkeep work
because connection to Monitors appears laggy
2020-10-22 11:28:28.401187 7f6f62616700 1 mds.hostnamecephssd01 Updating
MDS map to version 90885 from mon.0
2020-10-22 11:28:31.659817 7f6f64595700 0 mds.beacon.hostnamecephssd01
MDS is no longer laggy
2020-10-22 11:36:15.880009 7f88ee4ac240 0 set uid:gid to 167:167
(ceph:ceph)
2020-10-22 11:36:15.880026 7f88ee4ac240 0 ceph version 12.2.10
(177915764b752804194937482a39e95e0ca3de94) luminous (stable), process
ceph-mds, pid 2022663
2020-10-22 11:36:15.883118 7f88ee4ac240 0 pidfile_write: ignore empty
--pid-file
2020-10-22 11:36:15.921200 7f88e73cd700 1 mds.hostnamecephssd01 Updating
MDS map to version 90887 from mon.2
2020-10-22 11:36:20.270298 7f88e73cd700 1 mds.hostnamecephssd01 Updating
MDS map to version 90888 from mon.2
2020-10-22 11:36:20.270329 7f88e73cd700 1 mds.hostnamecephssd01 Map has
assigned me to become a standby
2020-10-22 11:36:20.272917 7f88e73cd700 1 mds.hostnamecephssd01 Updating
MDS map to version 90889 from mon.2
2020-10-22 11:36:20.277063 7f88e73cd700 1 mds.0.90889 handle_mds_map i am
now mds.0.90889
2020-10-22 11:36:20.277069 7f88e73cd700 1 mds.0.90889 handle_mds_map state
change up:boot --> up:replay
2020-10-22 11:36:20.277079 7f88e73cd700 1 mds.0.90889 replay_start
2020-10-22 11:36:20.277086 7f88e73cd700 1 mds.0.90889 recovery set is
2020-10-22 11:36:20.277096 7f88e73cd700 1 mds.0.90889 waiting for osdmap
73746 (which blacklists prior instance)
2020-10-22 11:36:20.322318 7f88e0bc0700 0 mds.0.cache creating system
inode with ino:0x100
2020-10-22 11:36:20.322918 7f88e0bc0700 0 mds.0.cache creating system
inode with ino:0x1
2020-10-22 11:36:47.922531 7f88e43c7700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:36:47.922549 7f88e43c7700 0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 4.00013s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:36:50.914516 7f88e83cf700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:36:51.351457 7f88e4bc8700 1 heartbeat_map reset_timeout
'MDSRank' had timed out after 15
2020-10-22 11:37:07.923089 7f88e43c7700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:37:07.923126 7f88e43c7700 0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 3.99913s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:37:10.914767 7f88e83cf700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:37:11.923216 7f88e43c7700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:37:11.923223 7f88e43c7700 0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 7.99926s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:37:15.914831 7f88e83cf700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:37:15.923286 7f88e43c7700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:37:15.923294 7f88e43c7700 0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 11.9994s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:37:19.923359 7f88e43c7700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:37:19.923366 7f88e43c7700 0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 15.9995s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:37:20.914917 7f88e83cf700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:37:23.923430 7f88e43c7700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:37:23.923437 7f88e43c7700 0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 19.9996s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:37:25.914981 7f88e83cf700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:37:27.923501 7f88e43c7700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:37:27.923508 7f88e43c7700 0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 23.9998s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:37:30.915046 7f88e83cf700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:37:31.923572 7f88e43c7700 1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:37:31.923579 7f88e43c7700 0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 27.9999s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:37:32.412628 7f88e4bc8700 1 heartbeat_map reset_timeout
'MDSRank' had timed out after 15
2020-10-22 11:37:32.412635 7f88e4bc8700 1 mds.beacon.hostnamecephssd01
is_laggy 28.4889 > 15 since last acked beacon
2020-10-22 11:37:32.412643 7f88e4bc8700 1 mds.0.90889 skipping upkeep work
because connection to Monitors appears laggy
2020-10-22 11:37:32.412657 7f88e73cd700 1 mds.hostnamecephssd01 Updating
MDS map to version 90890 from mon.2
2020-10-22 11:37:35.978858 7f88e934c700 0 mds.beacon.hostnamecephssd01
MDS is no longer laggy
Thanks in advance for any assistance you can provide!
David
For some days now I have been recovering my Ceph cluster. It all started with
OSDs being killed by OOM, so I created a script to delete the corrupted PGs
from the OSDs (I say corrupted because those PGs are the cause of the 100%
RAM usage by the OSDs).
Great, I am almost done with all the OSDs of my cluster, but now the monitors
are consuming all of the servers' RAM, and the managers too. Why? Why do they
use 60 GB of RAM? Is there something to limit that? I have tried configuring
every kind of RAM limit to the minimum.
Hi,
My rgw.buckets.index pool has the cluster in WARN. I'm either not understanding the real issue or I'm making it worse, or both.
OMAP_BYTES: 70461524
OMAP_KEYS: 250874
I thought I'd head this off by deleting rgw objects which would normally get deleted in the near future, but this only seemed to make the values grow. Before I deleted lots of objects, the values were:
OMAP_BYTES: 65450132
OMAP_KEYS: 209843
I read that the default threshold is 200k, but I haven't found the proper way to manage this situation. What reading should I dive into? I could probably craft a command to increase the threshold to clear the warning, but I'm guessing that might not be great long-term.
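For reference, the threshold in question is presumably osd_deep_scrub_large_omap_object_key_threshold, and the usual longer-term fix for a large bucket index is resharding; a sketch only, with the value, bucket name and shard count as assumptions:

  ceph config set osd osd_deep_scrub_large_omap_object_key_threshold 400000   # hides the warning, not the cause
  radosgw-admin bucket reshard --bucket=my-bucket --num-shards=101            # spreads index keys over more shards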
Other errata which might matter:
Size: 3
Pool: nvme
CLASS SIZE AVAIL USED RAW USED %RAW USED
nvme 256 TiB 165 TiB 91 TiB 91 TiB 35.53
Errata: the complete statements:
PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG STATE SINCE VERSION REPORTED UP ACTING SCRUB_STAMP DEEP_SCRUB_STAMP
43.d 2 0 0 0 0 70461524 250874 3070 active+clean 36m 185904'456870 185904:1357091 [99,90,48]p99 [99,90,48]p99 2020-10-21 13:53:42.102363 2020-10-21 13:53:42.102363
Thanks!
peter
Peter Eisch
Senior Site Reliability Engineer
T1.612.445.5135
virginpulse.com
Hello everyone,
I recently created a new Ceph 14.2.7 Nautilus cluster. The cluster consists
of 3 racks with 2 OSD nodes in each rack and 12 new HDDs in each node. The
HDD model is TOSHIBA MG07ACA14TE 14 TB. All data pools are EC pools.
Yesterday I decided to increase the PG number on one of the pools with the
command "ceph osd pool set photo.buckets.data pg_num 512". After that, many
OSDs started to crash and go "out" and "down". I tried increasing
recovery_sleep to 1s but the OSDs still crashed. The OSDs started working
properly only when I set the "norecover" flag, but OSD scrub errors appeared
after that.
In the OSD logs during the crashes I found this:
---
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]:
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHIN
E_SIZE/huge/release/14.2.7/rpm/el7/BUILD/ceph-14.2.7/src/osd/ECBackend.cc:
In function 'void ECBackend::continue_recovery_op(ECBackend::RecoveryOp&,
RecoveryMessages*)'
thread 7f8af535d700 time 2020-10-21 15:12:11.460092
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]:
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHIN
E_SIZE/huge/release/14.2.7/rpm/el7/BUILD/ceph-14.2.7/src/osd/ECBackend.cc:
648: FAILED ceph_assert(pop.data.length() ==
sinfo.aligned_logical_offset_to_chunk_offset( aft
er_progress.data_recovered_to - op.recovery_progress.data_recovered_to))
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: ceph version 14.2.7
(3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 1:
(ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x14a) [0x55fc694d6c0f]
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 2: (()+0x4dddd7)
[0x55fc694d6dd7]
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 3:
(ECBackend::continue_recovery_op(ECBackend::RecoveryOp&,
RecoveryMessages*)+0x1740) [0x55fc698cafa0]
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 4:
(ECBackend::handle_recovery_read_complete(hobject_t const&,
boost::tuples::tuple<unsigned long, unsigned long, std::map<pg_shard_t,
ceph::buffer::v14_2_0::list, std::less<pg_shard_t>,
std::allocator<std::pair<pg_shard_t const, ceph::buffer::v14_2_0::list> >
>, boost::tuples::null_type, boost::tuples::null_type,
boost::tuples::null_type, boost::tuples::null_type,
boost::tuples::null_type, boost::tuples::null_type,
boost::tuples::null_type>&, boost::optional<std::map<std::string,
ceph::buffer::v14_2_0::list, std::less<std::string>,
std::allocator<std::pair<std::string const, ceph::buffer::v14_2_0::list> >
> >, RecoveryMessages*)+0x734) [0x55fc698cb804]
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 5:
(OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*,
ECBackend::read_result_t&>&)+0x94) [0x55fc698ebbe4]
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 6:
(ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x8c)
[0x55fc698bfdcc]
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 7:
(ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&,
RecoveryMessages*, ZTracer::Trace const&)+0x109c) [0x55fc698d6b8c]
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 8:
(ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x17f)
[0x55fc698d718f]
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 9:
(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x4a)
[0x55fc697c18ea]
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 10:
(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&,
ThreadPool::TPHandle&)+0x5b3) [0x55fc697676b3]
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 11:
(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>,
ThreadPool::TPHandle&)+0x362) [0x55fc695b3d72]
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 12: (PGOpItem::run(OSD*,
OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62)
[0x55fc698415c2]
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 13:
(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90f)
[0x55fc695cebbf]
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 14:
(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6)
[0x55fc69b6f976]
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 15:
(ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55fc69b72490]
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 16: (()+0x7e65)
[0x7f8b1ddede65]
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 17: (clone()+0x6d)
[0x7f8b1ccb188d]
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: *** Caught signal (Aborted) **
Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: in thread 7f8af535d700
thread_name:tp_osd_tp
---
Current EC profile and pool info below:
# ceph osd erasure-code-profile get EC42
crush-device-class=hdd
crush-failure-domain=host
crush-root=main
jerasure-per-chunk-alignment=false
k=4
m=2
plugin=jerasure
technique=reed_sol_van
w=8
pool 25 'photo.buckets.data' erasure size 6 min_size 4 crush_rule 6
object_hash rjenkins pg_num 512 pgp_num 280 pgp_num_target 512
autoscale_mode warn last_change 43418 lfor 0/0/42223 flags hashpspool
stripe_width 1048576 application rgw
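For reference, a profile like the one above would have been created roughly
like this; a sketch reconstructed from the dump above:

  ceph osd erasure-code-profile set EC42 k=4 m=2 plugin=jerasure \
      technique=reed_sol_van crush-device-class=hdd \
      crush-failure-domain=host crush-root=main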
Current ceph status:
ceph -s
  cluster:
    id:     9ec8d309-a620-4ad8-93fa-c2d111e5256e
    health: HEALTH_ERR
            norecover flag(s) set
            1 pools have many more objects per pg than average
            4542629 scrub errors
            Possible data damage: 6 pgs inconsistent
            Degraded data redundancy: 1207268/578535561 objects degraded (0.209%), 51 pgs degraded, 35 pgs undersized
            85 pgs not deep-scrubbed in time

  services:
    mon: 3 daemons, quorum ceph-osd-101,ceph-osd-201,ceph-osd-301 (age 2w)
    mgr: ceph-osd-101(active, since 3w), standbys: ceph-osd-301, ceph-osd-201
    osd: 72 osds: 72 up (since 11h), 72 in (since 21h); 48 remapped pgs
         flags norecover
    rgw: 6 daemons active (ceph-osd-101.rgw0, ceph-osd-102.rgw0, ceph-osd-201.rgw0, ceph-osd-202.rgw0, ceph-osd-301.rgw0, ceph-osd-302.rgw0)

  data:
    pools:   26 pools, 15680 pgs
    objects: 96.46M objects, 124 TiB
    usage:   303 TiB used, 613 TiB / 917 TiB avail
    pgs:     1207268/578535561 objects degraded (0.209%)
             14068769/578535561 objects misplaced (2.432%)
             15290 active+clean
             312   active+recovering
             30    active+undersized+degraded+remapped+backfilling
             21    active+recovering+degraded
             13    active+remapped+backfilling
             6     active+clean+inconsistent
             5     active+recovering+undersized+remapped
             3     active+clean+scrubbing+deep
So now my cluster is stuck and can't recover properly. Can someone give me
some information about this problem? Is it a bug?