The issue has been found and is fixed in 15.2.3.
Thanks for your response, Igor!
Kind regards,
Wout
42on
________________________________________
From: Wout van Heeswijk <wout(a)42on.com>
Sent: Friday, 26 February 2021 16:10
To: ceph-users(a)ceph.io
Subject: [ceph-users] Re: Nautilus Cluster Struggling to Come Back Online
For those interested in this issue: we've been seeing OSDs with corrupted WALs after
they hit a suicide timeout. I've updated the ticket created by William with some of
our logs.
https://tracker.ceph.com/issues/48827#note-16
We're using Ceph 15.2.2 in this cluster. We are currently considering a way forward,
but it looks like the WALs are being corrupted under load.
Kind regards,
Wout
42on
________________________________________
From: William Law <wlaw(a)stanford.edu>
Sent: Tuesday, 19 January 2021 18:48
To: ceph-users(a)ceph.io
Subject: [ceph-users] Nautilus Cluster Struggling to Come Back Online
This is a sort of follow-up to my previous post. Our Nautilus (14.2.16 on Ubuntu
18.04) cluster had some sort of event that caused memory errors on many of the
machines. In the aftermath, some OSDs hit (and continue to hit) this error:
https://tracker.ceph.com/issues/48827; others won't start for various reasons.
The OSDs that *will* start are, for the most part, badly behind the current OSD map epoch.
It sounds very similar to this:
https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-ha…
We are having trouble getting things back online.
I think the path forward is to:
- set noup/nodown/noout/nobackfill and wait for the OSDs that do run to come up; we were
making good progress yesterday until some of the OSDs crashed with OOM errors. We are
moving forward again, but understandably nervous.
- export the PGs from questionable OSDs and then rebuild those OSDs; import the PGs if
necessary (very likely), and repeat until we are up (a rough sketch of the commands is below).
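In case it is useful to anyone following the thread, here is a minimal sketch of the
commands we have in mind, assuming the questionable OSD is stopped before running
ceph-objectstore-tool; the OSD IDs, PG ID, and file path below are placeholders, not
our actual ones:

  # freeze cluster state while the surviving OSDs catch up
  ceph osd set noup
  ceph osd set nodown
  ceph osd set noout
  ceph osd set nobackfill

  # export a PG from a stopped, questionable OSD...
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
      --pgid 3.1a --op export --file /root/3.1a.export

  # ...and, if needed, import it into a rebuilt OSD (also stopped)
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-45 \
      --op import --file /root/3.1a.export

  # unset the flags again once things look healthy
  ceph osd unset nobackfill
  ceph osd unset noout
  ceph osd unset nodown
  ceph osd unset noup

Exporting before rebuilding means the PG data is still available even if the import
turns out not to be needed.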
Any suggestions for speeding this up? We are using noup/nobackfill/norebalance/pause, but
the epoch catch-up is taking a very long time. Any tips for keeping the epoch from moving
forward, or for helping the OSDs catch up faster? How can we estimate how long it should take?
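One way we might be able to gauge progress (a guess on my part, not something I've
verified end to end): each OSD's admin socket reports the oldest and newest map it
currently holds, so comparing that with the cluster's current epoch should show how far
behind each OSD is, and watching newest_map advance for a few minutes gives a rough rate:

  # current cluster epoch (first line of the dump)
  ceph osd dump | head -1

  # map range held by a given OSD, via its admin socket on the OSD host
  ceph daemon osd.12 status
  # look at the "oldest_map" / "newest_map" fields in the output

The gap between an OSD's newest_map and the cluster epoch, divided by the rate at which
newest_map advances, would be a crude estimate of the remaining catch-up time.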
Thank you for any ideas or assistance anyone can provide.
Will
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io