If it helps at all, I've posted a log with a recent backtrace
ceph-post-file: 589aa7aa-7a80-49a2-ba55-376e467c4550
In fact, the log seems to span two different lifetimes of the same OSD:
the first aborted, and rather than repairing it, the OSD was recreated,
only to abort again 12 hours later. The log captures the most recent
sudden abort (the one after which every restart resulted in immediate
crashing). I don't know how useful this is without more context. I'll
update the tracker with the ceph-post-file id as well.
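(For anyone unfamiliar with the upload mechanism: the id above comes from
a plain ceph-post-file invocation, roughly like the following; the log
path is only an example, not the actual file name.)

  # upload a log to the Ceph developers; the command prints the id
  # quoted above when it finishes (path below is an example)
  ceph-post-file /var/log/ceph/ceph-osd.35.log
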
421/265138) [35,249,486]/[249,24] r=-1 lpr=265421 pi=[265069,265421)/1 luod=0'0 crt=253132'490078 lcod 0'0 active+remapped mbc={}] start_peering_interval up [35,249] -> [35,249,486], acting [249,24] -> [249,24], acting_primary 249 -> 249, up_primary 35 -> 35, role -1 -> -1, features acting 4611087854031667195 upacting 4611087854031667195
2020-02-20 21:07:45.943 7f572e31f700 1 osd.35 pg_epoch: 265421 pg[26.7a( v 253132'490078 (172352'487078,253132'490078] lb MIN (bitwise) local-lis/les=265138/265139 n=0 ec=24133/13954 lis/c 265373/265069 les/c/f 265374/265070/0 265421/265421/265138) [35,249,486]/[249,24] r=-1 lpr=265421 pi=[265069,265421)/1 crt=253132'490078 lcod 0'0 remapped NOTIFY mbc={}] state<Start>: transitioning to Stray
2020-02-20 21:07:46.328 7f5740b44700 4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.8/rpm/el7/BUILD/ceph-13.2.8/src/rocksdb/db/compaction_job.cc:1166] [default] [JOB 3073] Generated table #9747: 177780 keys, 68462819 bytes
2020-02-20 21:07:46.328 7f5740b44700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1582232866330355, "cf_name": "default", "job": 3073, "event": "table_file_creation", "file_number": 9747, "file_size": 68462819, "table_properties": {"data_size": 67112690, "index_size": 755444, "filter_size": 593634, "raw_key_size": 17705409, "raw_average_key_size": 99, "raw_value_size": 51990011, "raw_average_value_size": 292, "num_data_blocks": 17287, "num_entries": 177780, "filter_policy_name": "rocksdb.BuiltinBloomFilter", "kDeletedKeys": "0", "kMergeOperands": "0"}}
2020-02-20 21:07:46.754 7f5740b44700 4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.8/rpm/el7/BUILD/ceph-13.2.8/src/rocksdb/db/compaction_job.cc:1166] [default] [JOB 3073] Generated table #9748: 177726 keys, 68453977 bytes
2020-02-20 21:07:46.754 7f5740b44700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1582232866756343, "cf_name": "default", "job": 3073, "event": "table_file_creation", "file_number": 9748, "file_size": 68453977, "table_properties": {"data_size": 67112764, "index_size": 746762, "filter_size": 593400, "raw_key_size": 17666827, "raw_average_key_size": 99, "raw_value_size": 51937731, "raw_average_value_size": 292, "num_data_blocks": 17285, "num_entries": 177726, "filter_policy_name": "rocksdb.BuiltinBloomFilter", "kDeletedKeys": "0", "kMergeOperands": "0"}}
2020-02-20 21:07:47.091 7f573b493700 -1 *** Caught signal (Aborted) **
in thread 7f573b493700 thread_name:ms_dispatch
ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
1: (()+0xf5f0) [0x7f574d8b85f0]
2: (gsignal()+0x37) [0x7f574c8d8337]
3: (abort()+0x148) [0x7f574c8d9a28]
4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f574d1e87d5]
5: (()+0x5e746) [0x7f574d1e6746]
6: (()+0x5e773) [0x7f574d1e6773]
7: (()+0x5e993) [0x7f574d1e6993]
8: (OSDMap::decode(ceph::buffer::list::iterator&)+0x160e) [0x7f5750b0b68e]
9: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f5750b0ce31]
10: (OSD::handle_osd_map(MOSDMap*)+0x1c2d) [0x55d76e60a2dd]
11: (OSD::_dispatch(Message*)+0xa1) [0x55d76e6143f1]
12: (OSD::ms_dispatch(Message*)+0x56) [0x55d76e614746]
13: (DispatchQueue::entry()+0xb7a) [0x7f5750a0524a]
14: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f5750aa34bd]
15: (()+0x7e65) [0x7f574d8b0e65]
16: (clone()+0x6d) [0x7f574c9a088d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
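
(Not from the original post: to make the raw frame addresses above usable,
the disassembly can be dumped with something like the following. The paths
assume the CentOS 7 RPM layout for ceph 13.2.8 with the matching debuginfo
package installed; treat them as assumptions.)

  # annotated disassembly of the OSD binary and of libceph-common, which
  # is where OSDMap::decode() lives (paths are assumptions, adjust as needed)
  objdump -rdS /usr/bin/ceph-osd > ceph-osd-13.2.8.objdump
  objdump -rdS /usr/lib64/ceph/libceph-common.so.0 > libceph-common-13.2.8.objdump
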
On 2/20/20 7:00 PM, Troy Ablan wrote:
> Dan,
>
> This has happened to HDDs also, and NVMe most recently. CentOS 7.7;
> the kernel is usually within 6 months of current updates. We try to
> stay relatively up to date.
>
> -Troy
>
> On 2/20/20 5:28 PM, Dan van der Ster wrote:
>> Another thing... in your thread you said that only the *SSDs* in
>> your cluster had crashed, but not the HDDs.
>> Were both SSDs and HDDs bluestore? Did the HDDs ever crash subsequently?
>> Which OS/kernel do you run? We're on CentOS 7 with quite some uptime.
>>
>>
>> On Thu, Feb 20, 2020 at 10:29 PM Troy Ablan <tablan(a)gmail.com> wrote:
>>>
>>> I hope I don't sound too happy to hear that you've run into this same
>>> problem, but still I'm glad to see that it's not just a one-off problem
>>> with us. :)
>>>
>>> We're still running Mimic. I haven't yet deployed Nautilus anywhere.
>>>
>>> Thanks
>>> -Troy
>>>
>>> On 2/20/20 2:14 PM, Dan van der Ster wrote:
>>>> Thanks Troy for the quick response.
>>>> Are you still running mimic on that cluster? Seeing the crashes in
>>>> nautilus too?
>>>>
>>>> Our cluster is also quite old -- so it could very well be memory or
>>>> network gremlins.
>>>>
>>>> Cheers, Dan
>>>>
>>>> On Thu, Feb 20, 2020 at 10:11 PM Troy Ablan <tablan(a)gmail.com> wrote:
>>>>>
>>>>> Dan,
>>>>>
>>>>> Yes, I have had this happen several times since, but fortunately the
>>>>> last couple of times it has only happened to one or two OSDs at a
>>>>> time, so it didn't take down entire pools. Remedy has been the same.
>>>>>
>>>>> I had been holding off on too much further investigation because I
>>>>> thought the source of the issue may have been some old hardware
>>>>> gremlins, and we're waiting on some new hardware.
>>>>>
>>>>> -Troy
>>>>>
>>>>>
>>>>> On 2/20/20 1:40 PM, Dan van der Ster wrote:
>>>>>> Hi Troy,
>>>>>>
>>>>>> Looks like we hit the same today -- Sage posted some observations
>>>>>> here: https://tracker.ceph.com/issues/39525#note-6
>>>>>>
>>>>>> Did it happen again in your cluster?
>>>>>>
>>>>>> Cheers, Dan
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 20, 2019 at 2:18 AM Troy Ablan <tablan(a)gmail.com> wrote:
>>>>>>>
>>>>>>> While I'm still unsure how this happened, this is what was done to
>>>>>>> solve this.
>>>>>>>
>>>>>>> Started OSD in foreground with debug 10, watched for the most recent
>>>>>>> osdmap epoch mentioned before abort(). For example, if it mentioned
>>>>>>> that it just tried to load 80896 and then crashed
>>>>>>>
>>>>>>> # ceph osd getmap -o osdmap.80896 80896
>>>>>>> # ceph-objectstore-tool --op set-osdmap --data-path
>>>>>>> /var/lib/ceph/osd/ceph-77/ --file osdmap.80896
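
(Side note, not part of the quoted message: "foreground with debug 10"
corresponds to something like the command below; the osd id and the exact
debug option are assumptions on my part.)

  # run the OSD in the foreground, logging to stderr, with verbose OSD
  # debugging so the last osdmap epoch loaded before the abort is visible
  # (osd.77 and the debug level are examples)
  ceph-osd -d --cluster ceph --id 77 --setuser ceph --setgroup ceph --debug_osd 10
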
>>>>>>>
>>>>>>> Then I restarted the osd in foreground/debug, and repeated for the
>>>>>>> next osdmap epoch until it got past the first few seconds. This
>>>>>>> process worked for all but two OSDs. For the ones that succeeded I'd
>>>>>>> ^C and then start the osd via systemd.
>>>>>>>
>>>>>>> For the remaining two, it would try loading the incremental map and
>>>>>>> then crash. I had presence of mind to make dd images of every OSD
>>>>>>> before starting this process, so I reverted these two to the state
>>>>>>> before injecting the osdmaps.
>>>>>>>
>>>>>>> I then injected the last 15 or so epochs of the osdmap in sequential
>>>>>>> order before starting the osd, with success.
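
(Also not part of the quoted message: the sequential injection described
above amounts to a small loop along these lines; the epoch range, osd id,
and data path are placeholders, and the OSD must be stopped first.)

  # inject a contiguous range of full osdmaps into a stopped OSD
  # (epochs 80882..80896, osd.77, and the data path are example values)
  for e in $(seq 80882 80896); do
      ceph osd getmap -o osdmap.$e $e
      ceph-objectstore-tool --op set-osdmap \
          --data-path /var/lib/ceph/osd/ceph-77/ --file osdmap.$e
  done
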
>>>>>>>
>>>>>>> This leads me to believe that the step-wise injection didn't work
>>>>>>> because the osd had more subtle corruption that it got past, but it
>>>>>>> was confused when it requested the next incremental delta.
>>>>>>>
>>>>>>> Thanks again to Brad/badone for the guidance!
>>>>>>>
>>>>>>> Tracker issue updated.
>>>>>>>
>>>>>>> Here's the closing IRC dialogue re this issue (UTC-0700)
>>>>>>>
>>>>>>> 2019-08-19 16:27:42 < MooingLemur> badone: I appreciate you reaching
>>>>>>> out yesterday, you've helped a ton, twice now :) I'm still concerned
>>>>>>> because I don't know how this happened. I'll feel better once
>>>>>>> everything's active+clean, but it's all at least active.
>>>>>>>
>>>>>>> 2019-08-19 16:30:28 < badone> MooingLemur: I had a quick discussion
>>>>>>> with Josh earlier and he shares my opinion this is likely somehow
>>>>>>> related to these drives or perhaps controllers, or at least specific
>>>>>>> to these machines
>>>>>>>
>>>>>>> 2019-08-19 16:31:04 < badone> however, there is a possibility you
>>>>>>> are seeing some extremely rare race that no one up to this point has
>>>>>>> seen before
>>>>>>>
>>>>>>> 2019-08-19 16:31:20 < badone> that is less likely though
>>>>>>>
>>>>>>> 2019-08-19 16:32:50 < badone> the osd read the osdmap over the wire
>>>>>>> successfully but wrote it out to disk in a format that it could not
>>>>>>> then read back in (unlikely) or...
>>>>>>>
>>>>>>> 2019-08-19 16:33:21 < badone> the map "changed" after it had been
>>>>>>> written to disk
>>>>>>>
>>>>>>> 2019-08-19 16:33:46 < badone> the second is considered most likely
>>>>>>> by us but I recognise you may not share that opinion