Dan,
This has happened to HDDs also, and to NVMe most recently. We're on
CentOS 7.7; the kernel is usually within 6 months of current updates.
We try to stay relatively up to date.
-Troy
On 2/20/20 5:28 PM, Dan van der Ster wrote:
> Another thing... in your thread you said that only the *SSDs* in
> your cluster had crashed, but not the HDDs.
> Both SSDs and HDDs were bluestore? Did the hdds ever crash subsequently?
> Which OS/kernel do you run? We're on CentOS 7 with quite some uptime.
>
>
> On Thu, Feb 20, 2020 at 10:29 PM Troy Ablan <tablan(a)gmail.com> wrote:
>>
>> I hope I don't sound too happy to hear that you've run into this same
>> problem, but still I'm glad to see that it's not just a one-off problem
>> with us. :)
>>
>> We're still running Mimic. I haven't yet deployed Nautilus anywhere.
>>
>> Thanks
>> -Troy
>>
>> On 2/20/20 2:14 PM, Dan van der Ster wrote:
>>> Thanks Troy for the quick response.
>>> Are you still running mimic on that cluster? Seeing the crashes in
>>> nautilus too?
>>>
>>> Our cluster is also quite old -- so it could very well be memory or
>>> network gremlins.
>>>
>>> Cheers, Dan
>>>
>>> On Thu, Feb 20, 2020 at 10:11 PM Troy Ablan <tablan(a)gmail.com> wrote:
>>>>
>>>> Dan,
>>>>
>>>> Yes, I have had this happen several times since, but fortunately the
>>>> last couple of times it has only happened to one or two OSDs at a
>>>> time, so it didn't take down entire pools. The remedy has been the same.
>>>>
>>>> I had been holding off on much further investigation because I
>>>> thought the source of the issue might be some old hardware
>>>> gremlins, and we're waiting on some new hardware.
>>>>
>>>> -Troy
>>>>
>>>>
>>>> On 2/20/20 1:40 PM, Dan van der Ster wrote:
>>>>> Hi Troy,
>>>>>
>>>>> Looks like we hit the same today -- Sage posted some observations
>>>>> here: https://tracker.ceph.com/issues/39525#note-6
>>>>>
>>>>> Did it happen again in your cluster?
>>>>>
>>>>> Cheers, Dan
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Aug 20, 2019 at 2:18 AM Troy Ablan <tablan(a)gmail.com> wrote:
>>>>>>
>>>>>> While I'm still unsure how this happened, this is what was done to
>>>>>> solve this.
>>>>>>
>>>>>> Started the OSD in the foreground with debug 10 and watched for the
>>>>>> most recent osdmap epoch mentioned before the abort(). For example,
>>>>>> if it mentioned that it had just tried to load 80896 and then crashed:
>>>>>>
>>>>>> # ceph osd getmap -o osdmap.80896 80896
>>>>>> # ceph-objectstore-tool --op set-osdmap --data-path
>>>>>> /var/lib/ceph/osd/ceph-77/ --file osdmap.80896
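>>>>>>
>>>>>> (For reference, the foreground/debug start was along these lines --
>>>>>> exact flags may differ by release, and osd.77 is just the example id:)
>>>>>>
>>>>>>  # ceph-osd -f -i 77 --debug-osd 10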
>>>>>>
>>>>>> Then I restarted the osd in foreground/debug and repeated for the
>>>>>> next osdmap epoch until it got past the first few seconds. This
>>>>>> process worked for all but two OSDs. For the ones that succeeded,
>>>>>> I'd ^C and then start the osd via systemd.
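>>>>>>
>>>>>> (i.e. the usual unit for that OSD id, something along the lines of:)
>>>>>>
>>>>>>  # systemctl start ceph-osd@77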
>>>>>>
>>>>>> For the remaining two, the osd would try loading the incremental
>>>>>> map and then crash. I had the presence of mind to make dd images of
>>>>>> every OSD before starting this process, so I reverted these two to
>>>>>> the state before injecting the osdmaps.
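>>>>>>
>>>>>> (These were plain raw copies of each OSD's block device, roughly as
>>>>>> below -- the source device and destination path are placeholders:)
>>>>>>
>>>>>>  # dd if=/dev/sdX of=/backup/osd-77.img bs=4M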
>>>>>>
>>>>>> I then injected the last 15 or so epochs of the osdmap in
>>>>>> sequential order before starting the osd, with success.
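>>>>>>
>>>>>> (Roughly the following loop, with the epoch range and OSD id here
>>>>>> only as placeholders for the example:)
>>>>>>
>>>>>>  # for e in $(seq 80882 80896); do
>>>>>>      ceph osd getmap -o osdmap.$e $e
>>>>>>      ceph-objectstore-tool --op set-osdmap \
>>>>>>        --data-path /var/lib/ceph/osd/ceph-77/ --file osdmap.$e
>>>>>>    done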
>>>>>>
>>>>>> This leads me to believe that the step-wise injection didn't work
>>>>>> because the osd had more subtle corruption that it got past, but it
>>>>>> was confused when it requested the next incremental delta.
>>>>>>
>>>>>> Thanks again to Brad/badone for the guidance!
>>>>>>
>>>>>> Tracker issue updated.
>>>>>>
>>>>>> Here's the closing IRC dialogue re this issue (UTC-0700):
>>>>>>
>>>>>> 2019-08-19 16:27:42 < MooingLemur> badone: I appreciate you reaching
>>>>>> out yesterday, you've helped a ton, twice now :) I'm still concerned
>>>>>> because I don't know how this happened. I'll feel better once
>>>>>> everything's active+clean, but it's all at least active.
>>>>>>
>>>>>> 2019-08-19 16:30:28 < badone> MooingLemur: I had a quick discussion
>>>>>> with Josh earlier and he shares my opinion this is likely somehow
>>>>>> related to these drives or perhaps controllers, or at least specific
>>>>>> to these machines
>>>>>>
>>>>>> 2019-08-19 16:31:04 < badone> however, there is a possibility you
>>>>>> are seeing some extremely rare race that no one up to this point has
>>>>>> seen before
>>>>>>
>>>>>> 2019-08-19 16:31:20 < badone> that is less likely though
>>>>>>
>>>>>> 2019-08-19 16:32:50 < badone> the osd read the osdmap over the wire
>>>>>> successfully but wrote it out to disk in a format that it could not
>>>>>> then read back in (unlikely) or...
>>>>>>
>>>>>> 2019-08-19 16:33:21 < badone> the map "changed" after it had been
>>>>>> written to disk
>>>>>>
>>>>>> 2019-08-19 16:33:46 < badone> the second is considered most likely
>>>>>> by us but I recognise you may not share that opinion