I hope I don't sound too happy to hear that you've run into this same
problem, but I'm glad to see that it's not just a one-off issue on our
end. :)
We're still running Mimic. I haven't yet deployed Nautilus anywhere.
Thanks
-Troy
On 2/20/20 2:14 PM, Dan van der Ster wrote:
> Thanks Troy for the quick response.
> Are you still running mimic on that cluster? Seeing the crashes in nautilus too?
>
> Our cluster is also quite old -- so it could very well be memory or
> network gremlins.
>
> Cheers, Dan
>
> On Thu, Feb 20, 2020 at 10:11 PM Troy Ablan <tablan(a)gmail.com> wrote:
>>
>> Dan,
>>
>> Yes, I have had this happen several times since, but fortunately the
>> last couple of times it has only happened to one or two OSDs at a time,
>> so it didn't take down entire pools. The remedy has been the same.
>>
>> I had been holding off on further investigation because I thought the
>> source of the issue might be some old hardware gremlins, and we're
>> waiting on some new hardware.
>>
>> -Troy
>>
>>
>> On 2/20/20 1:40 PM, Dan van der Ster wrote:
>>> Hi Troy,
>>>
>>> Looks like we hit the same today -- Sage posted some observations
>>> here: https://tracker.ceph.com/issues/39525#note-6
>>>
>>> Did it happen again in your cluster?
>>>
>>> Cheers, Dan
>>>
>>>
>>>
>>> On Tue, Aug 20, 2019 at 2:18 AM Troy Ablan <tablan(a)gmail.com> wrote:
>>>>
>>>> While I'm still unsure how this happened, this is what was done to
>>>> solve this.
>>>>
>>>> Started OSD in foreground with debug 10, watched for the most recent
>>>> osdmap epoch mentioned before abort(). For example, if it mentioned
>>>> that it just tried to load 80896 and then crashed:
>>>>
>>>> # ceph osd getmap -o osdmap.80896 80896
>>>> # ceph-objectstore-tool --op set-osdmap --data-path
>>>> /var/lib/ceph/osd/ceph-77/ --file osdmap.80896
>>>>
>>>> Then I restarted the osd in foreground/debug, and repeated for the next
>>>> osdmap epoch until it got past the first few seconds. This process
>>>> worked for all but two OSDs. For the ones that succeeded I'd ^C and
>>>> then start the osd via systemd.
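>>>>
>>>> Concretely, each round was roughly the following (the osd id and debug
>>>> level here are just examples):
>>>>
>>>> # ceph-osd -f -i 77 --debug-osd 10
>>>>
>>>> then note the epoch it aborts on, inject it as above, retry, and once a
>>>> foreground run survives startup, ^C and
>>>>
>>>> # systemctl start ceph-osd@77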
>>>>
>>>> For the remaining two, each would try loading the incremental map and
>>>> then crash. I had the presence of mind to make dd images of every OSD
>>>> before starting this process, so I reverted these two to the state
>>>> before injecting the osdmaps.
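>>>>
>>>> (The images were just raw copies of each OSD's data device, along the
>>>> lines of the following; the device and destination paths are examples.)
>>>>
>>>> # dd if=/dev/sdX of=/backup/osd-77.img bs=4M conv=sparse status=progress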
>>>>
>>>> I then injected the last 15 or so epochs of the osdmap in sequential
>>>> order before starting the osd, with success.
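>>>>
>>>> Roughly, that amounted to a small loop like this (the epoch range and
>>>> data path are illustrative):
>>>>
>>>> for e in $(seq 80885 80900); do
>>>>   ceph osd getmap -o osdmap.$e $e
>>>>   ceph-objectstore-tool --op set-osdmap \
>>>>     --data-path /var/lib/ceph/osd/ceph-77/ --file osdmap.$e
>>>> done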
>>>>
>>>> This leads me to believe that the step-wise injection didn't work
>>>> because the osd had more subtle corruption that it initially got past,
>>>> but then became confused when it requested the next incremental delta.
>>>>
>>>> Thanks again to Brad/badone for the guidance!
>>>>
>>>> Tracker issue updated.
>>>>
>>>> Here's the closing IRC dialogue re this issue (UTC-0700):
>>>>
>>>> 2019-08-19 16:27:42 < MooingLemur> badone: I appreciate you reaching out
>>>> yesterday, you've helped a ton, twice now :) I'm still concerned
>>>> because I don't know how this happened. I'll feel better once
>>>> everything's active+clean, but it's all at least active.
>>>>
>>>> 2019-08-19 16:30:28 < badone> MooingLemur: I had a quick discussion with
>>>> Josh earlier and he shares my opinion this is likely somehow related to
>>>> these drives or perhaps controllers, or at least specific to these machines
>>>>
>>>> 2019-08-19 16:31:04 < badone> however, there is a possibility you are
>>>> seeing some extremely rare race that no one up to this point has seen before
>>>>
>>>> 2019-08-19 16:31:20 < badone> that is less likely though
>>>>
>>>> 2019-08-19 16:32:50 < badone> the osd read the osdmap over the wire
>>>> successfully but wrote it out to disk in a format that it could not then
>>>> read back in (unlikely) or...
>>>>
>>>> 2019-08-19 16:33:21 < badone> the map "changed" after it had been
>>>> written to disk
>>>>
>>>> 2019-08-19 16:33:46 < badone> the second is considered most likely by us
>>>> but I recognise you may not share that opinion