Dan,
Yes, I have had this happen several times since, but fortunately the
last couple of times it has only happened to one or two OSDs at a time,
so it didn't take down entire pools. The remedy has been the same.
I had been holding off on much further investigation because I thought
the source of the issue might be some old hardware gremlins, and we're
waiting on some new hardware.
-Troy
On 2/20/20 1:40 PM, Dan van der Ster wrote:
> Hi Troy,
>
> Looks like we hit the same today -- Sage posted some observations
> here:
> https://tracker.ceph.com/issues/39525#note-6
>
> Did it happen again in your cluster?
>
> Cheers, Dan
>
>
>
> On Tue, Aug 20, 2019 at 2:18 AM Troy Ablan <tablan@gmail.com> wrote:
>>
>> While I'm still unsure how this happened, here's what was done to
>> resolve it.
>>
>> I started the OSD in the foreground with debug 10 and watched for the
>> most recent osdmap epoch mentioned before the abort(). For example, if
>> it mentioned that it had just tried to load epoch 80896 and then crashed:
>>
>> # ceph osd getmap -o osdmap.80896 80896
>> # ceph-objectstore-tool --op set-osdmap --data-path
>> /var/lib/ceph/osd/ceph-77/ --file osdmap.80896
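>>
>> (Roughly, the foreground run and the search for the failing epoch looked
>> like the sketch below; the log filename and the grep are only
>> illustrative, and osd 77 is just the OSD from the data path above.)
>>
>> # ceph-osd -f --cluster ceph --id 77 --debug-osd 10 2>&1 | tee osd.77.log
>> # grep -i osdmap osd.77.log | tail -n 20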
>>
>> Then I restarted the OSD in foreground/debug and repeated this for the
>> next osdmap epoch until it got past the first few seconds of startup.
>> This process worked for all but two OSDs. For the ones that succeeded,
>> I'd ^C the foreground process and then start the OSD via systemd.
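>>
>> (Assuming the standard ceph-osd@<id> systemd unit template, the handoff
>> was just something like:)
>>
>> # systemctl start ceph-osd@77
>> # systemctl status ceph-osd@77 --no-pager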
>>
>> For the remaining two, the OSD would try loading the incremental map and
>> then crash. I had the presence of mind to make dd images of every OSD
>> before starting this process, so I reverted these two to the state
>> before injecting the osdmaps.
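>>
>> (The images were plain block-level copies; the device and destination
>> paths below are placeholders:)
>>
>> # dd if=/dev/sdX of=/backup/osd-NN.img bs=4M status=progress
>> # dd if=/backup/osd-NN.img of=/dev/sdX bs=4M status=progress   # revert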
>>
>> I then injected the last 15 or so osdmap epochs in sequential order
>> before starting the OSD, with success.
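>>
>> (In shell terms, that sequential injection was roughly the loop below;
>> the epoch range is only illustrative, 80896 being the example epoch
>> from earlier:)
>>
>> # for e in $(seq 80882 80896); do ceph osd getmap -o osdmap.$e $e; done
>> # for e in $(seq 80882 80896); do ceph-objectstore-tool --op set-osdmap \
>>       --data-path /var/lib/ceph/osd/ceph-77/ --file osdmap.$e; done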
>>
>> This leads me to believe that the step-wise injection didn't work
>> because the OSD had more subtle corruption that it initially got past,
>> but it then became confused when it requested the next incremental delta.
>>
>> Thanks again to Brad/badone for the guidance!
>>
>> Tracker issue updated.
>>
>> Here's the closing IRC dialogue re this issue (timestamps UTC-0700):
>>
>> 2019-08-19 16:27:42 < MooingLemur> badone: I appreciate you reaching out
>> yesterday, you've helped a ton, twice now :) I'm still concerned
>> because I don't know how this happened. I'll feel better once
>> everything's active+clean, but it's all at least active.
>>
>> 2019-08-19 16:30:28 < badone> MooingLemur: I had a quick discussion with
>> Josh earlier and he shares my opinion this is likely somehow related to
>> these drives or perhaps controllers, or at least specific to these machines
>>
>> 2019-08-19 16:31:04 < badone> however, there is a possibility you are
>> seeing some extremely rare race that no one up to this point has seen before
>>
>> 2019-08-19 16:31:20 < badone> that is less likely though
>>
>> 2019-08-19 16:32:50 < badone> the osd read the osdmap over the wire
>> successfully but wrote it out to disk in a format that it could not then
>> read back in (unlikely) or...
>>
>> 2019-08-19 16:33:21 < badone> the map "changed" after it had been
>> written to disk
>>
>> 2019-08-19 16:33:46 < badone> the second is considered most likely by us
>> but I recognise you may not share that opinion