What is the easiest and most reliable way to migrate a bucket from an old cluster to a new one?
The source is Luminous and the target is Octopus; I'm not sure whether that matters from a data perspective.
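One common approach (not the only one, and not something the thread itself prescribes) is to copy the bucket over S3 with rclone, using one remote per cluster's RGW endpoint. The remote names, bucket name, and endpoints below are placeholders:

```shell
# Hypothetical rclone remotes "oldceph" and "newceph" in rclone.conf,
# each of type "s3" with provider "Ceph", pointing at the respective
# cluster's RGW endpoint and credentials.

# Dry-run first to see what would be copied:
rclone sync --dry-run oldceph:mybucket newceph:mybucket

# Then copy for real; --checksum compares objects by hash rather than
# by modification time, --progress shows transfer status:
rclone sync --checksum --progress oldceph:mybucket newceph:mybucket
```

Since this goes through the S3 API, it works across Ceph versions; bucket policies, ACLs, and lifecycle rules would need to be recreated on the new cluster separately.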
Of course, I forgot to mention that; thank you for bringing it up! We made sure the balancer and the PG autoscaler were turned off for the (only) pool that uses those PGs shortly after we noticed the cycle of remapping/backfilling:
# ceph balancer status
{
"active": false,
"last_optimize_duration": "",
"last_optimize_started": "",
"mode": "none",
"optimize_result": "",
"plans": []
}
# ceph osd pool get avl1.rgw.buckets.data pg_autoscale_mode
pg_autoscale_mode: off
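One thing worth noting (beyond what the thread covers): even with the balancer disabled, any pg-upmap entries it created earlier remain in the OSD map and can still cause remapping. A quick hedged check:

```shell
# List any pg-upmap / pg-upmap-items entries left behind by the balancer:
ceph osd dump | grep -i upmap

# Stale entries can be removed per PG if needed, e.g. (pgid is a placeholder):
# ceph osd rm-pg-upmap-items <pgid>
```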
Thanks,
Stefan
On 1/1/21, 11:23 PM, "Anthony D'Atri" <anthony.datri(a)gmail.com> wrote:
I have to ask whether this might be the balancer or the PG autoscaler at work.
> On Jan 1, 2021, at 7:15 PM, Stefan Wild <swild(a)tiltworks.com> wrote:
>
> Our setup is not using SSDs as the Bluestore DB devices. We only have 2 SSDs vs 12 HDDs, which is normally fine for the low workload of the cluster. The SSDs are serving a pool that is just used by RGW for index and meta.
>
> Since the compaction two weeks ago the OSDs have all been stable. However, besides some other minor issues, the cluster now keeps remapping erasure-coded PGs to the same set of OSDs, just in a different order. Ceph remaps 11 (out of 128) PGs, slowly backfills them, and the second that's done it picks another 11 PGs and remaps those. I had to set osd_max_backfills to 0 just to get any scrubbing/repair in. I'm not sure how to stop the constant cycle of remapping/backfilling.
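As a stopgap while investigating (not something suggested in the thread itself), the standard cluster flags can pause the cycle without touching osd_max_backfills:

```shell
# Pause rebalancing and backfill; PGs stay where they currently are:
ceph osd set norebalance
ceph osd set nobackfill

# ... investigate, scrub, repair ...

# Re-enable once the cause is understood:
ceph osd unset nobackfill
ceph osd unset norebalance
```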
>
> Thanks,
> Stefan
>
>
>
>
> On 12/16/20, 4:24 AM, "Frédéric Nass" <frederic.nass(a)univ-lorraine.fr> wrote:
>
> Regarding RocksDB compaction: if you were in a situation where RocksDB
> had spilled over to the HDDs (assuming your cluster uses a hybrid
> setup), the compaction should have moved the bits back to the fast
> devices. So it might have helped in this situation too.
>
> Regards,
>
> Frédéric.
>
> Le 16/12/2020 à 09:57, Frédéric Nass a écrit :
>> Hi Stefan,
>>
>> This has me thinking that the issue your cluster may be facing is
>> probably related to bluefs_buffered_io being set to true, as this has
>> been reported to induce excessive swap usage (with OSDs flapping or
>> OOMing as a consequence) in some versions, starting from Nautilus I
>> believe.
>>
>> Can you check the value of bluefs_buffered_io that the OSDs are
>> currently using?
>> ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep bluefs_buffered_io
>>
>> Can you check the kernel's vm.swappiness value? (The default is
>> typically 60.)
>> sysctl vm.swappiness
>>
>> And can you describe your OSD nodes? Number of HDDs and SSDs/NVMes,
>> the HDD/SSD ratio, and how much memory they have?
>>
>> You should be able to avoid swap usage by setting bluefs_buffered_io
>> to false, but your cluster/workload might not allow that, performance-
>> and stability-wise.
>> Alternatively, you may be able to work around the excessive swap usage
>> (when bluefs_buffered_io is set to true) by lowering vm.swappiness or
>> disabling swap.
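A sketch of that workaround; the swappiness value is illustrative, not a recommendation from the thread:

```shell
# Lower the kernel's tendency to swap (0-100; lower = swap less eagerly):
sysctl -w vm.swappiness=10

# Persist the setting across reboots (file name is a placeholder):
echo 'vm.swappiness = 10' > /etc/sysctl.d/99-ceph-swap.conf

# Or disable swap entirely -- only safe if osd_memory_target times the
# number of OSDs comfortably fits in RAM:
swapoff -a
```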
>>
>> Regards,
>>
>> Frédéric.
>>
>> Le 14/12/2020 à 22:12, Stefan Wild a écrit :
>>> Hi Frédéric,
>>>
>>> Thanks for the additional input. We are currently only running RGW on
>>> the cluster, so no snapshot removal, but there have been plenty of
>>> remappings with the OSDs failing (all of them at first during and
>>> after the OOM incident, then one-by-one). I haven't had a chance to
>>> look into or test the bluefs_buffered_io setting, but will do that
>>> next. Initial results from compacting all OSDs' RocksDBs look
>>> promising (thank you, Igor!). Things have been stable for the past
>>> two hours, including the two OSDs with issues (one in reboot loop,
>>> the other with some heartbeats missed), while 15 degraded PGs are
>>> backfilling.
>>>
>>> The ballooning of each OSD to over 15 GB of memory right after the
>>> initial crash happened even with osd_memory_target set to 2 GB. The only
>>> thing that helped at that point was to temporarily add enough swap
>>> space to fit 12 x 15GB and let them do their thing. Once they had all
>>> booted, memory usage went back down to normal levels.
>>>
>>> I will report back here with more details when the cluster is
>>> hopefully back to a healthy state.
>>>
>>> Thanks,
>>> Stefan
>>>
>>>
>>>
>>> On 12/14/20, 3:35 PM, "Frédéric Nass"
>>> <frederic.nass(a)univ-lorraine.fr> wrote:
>>>
>>> Hi Stefan,
>>>
>>> Initial data removal could also have resulted from a snapshot
>>> removal leading to OSDs OOMing, then PG remappings leading to more
>>> removals after the OOMed OSDs rejoined the cluster, and so on.
>>>
>>> As mentioned by Igor : "Additionally there are users' reports that
>>> recent default value's modification for bluefs_buffered_io
>>> setting has
>>> negative impact (or just worsen existing issue with massive
>>> removal) as
>>> well. So you might want to switch it back to true."
>>>
>>> We're some of them. Our cluster suffered a severe performance drop
>>> during snapshot removal right after upgrading to Nautilus, due to
>>> bluefs_buffered_io being set to false by default, with slow requests
>>> observed around the cluster.
>>> Once it was set back to true (can be done with ceph tell osd.*
>>> injectargs '--bluefs_buffered_io=true'), snap trimming was fast
>>> again, as before the upgrade, with no more slow requests.
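Worth adding that injectargs is runtime-only and is lost on OSD restart; since Nautilus the setting can also be stored persistently in the mon config database:

```shell
# Runtime change (as in the message above), lost on OSD restart:
ceph tell osd.* injectargs '--bluefs_buffered_io=true'

# Persistent change via the central config store (Nautilus and later):
ceph config set osd bluefs_buffered_io true

# Verify what a given OSD will actually use:
ceph config get osd.0 bluefs_buffered_io
```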
>>>
>>> But of course we've seen the excessive memory swap usage
>>> described here
>>> : https://github.com/ceph/ceph/pull/34224
>>> So we lowered osd_memory_target from 8 GB to 4 GB and haven't
>>> observed any
>>> swap usage since then. You can also have a look here :
>>> https://github.com/ceph/ceph/pull/38044
>>>
>>> What you need to look at, to understand whether your cluster would
>>> benefit from changing bluefs_buffered_io back to true, is the %util
>>> of your RocksDB devices in iostat. Run iostat -dmx 1 /dev/sdX (if
>>> you're using SSD RocksDB devices) and compare the device's %util
>>> with bluefs_buffered_io=false and with bluefs_buffered_io=true. If,
>>> with bluefs_buffered_io=false, the %util is over 75% most of the
>>> time, then you'd better change it to true. :-)
>>>
>>> Regards,
>>>
>>> Frédéric.
>>>
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users(a)ceph.io
>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>
Hi,
We recently upgraded a cluster from 15.2.1 to 15.2.5. About two days later, one of the servers ran out of memory for unknown reasons (normally the machine uses about 60 of its 128 GB). Since then, some OSDs on that machine get caught in an endless restart loop; the logs just show systemd seeing the daemon fail and restarting it. Since the out-of-memory incident, we've had 3 OSDs fail this way at separate times. We resorted to wiping each affected OSD and re-adding it to the cluster, but it seems that as soon as all PGs have moved back onto the OSD, the next one fails.
This is also keeping us from re-deploying RGW, which was affected by the same out-of-memory incident, since cephadm runs a health check and won't deploy the service unless the cluster is in HEALTH_OK status.
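A few hedged first steps for diagnosing a restart loop like this (the OSD id 12 is a placeholder; under cephadm the systemd unit is namespaced by cluster fsid, so the unit name may differ):

```shell
# See why systemd keeps restarting the daemon and the OSD's last log lines:
journalctl -u ceph-osd@12 --since "-1 hour" | tail -n 100

# Inspect where the OSD's memory is actually going (daemon must be up):
ceph daemon osd.12 dump_mempools

# Confirm the memory target the OSD is running with:
ceph daemon osd.12 config get osd_memory_target
```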
Any help would be greatly appreciated.
Thanks,
Stefan