On 20 Feb 2020, at 19:54, Dan van der Ster
<dan(a)vanderster.com> wrote:
For those following along, the issue is here:
https://tracker.ceph.com/issues/39525#note-6
Somehow single bits are getting flipped in the osdmaps -- maybe
network, maybe memory, maybe a bug.
Weird!
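For anyone debugging similar corruption: one quick way to confirm a single-bit flip is to byte-compare the bad on-disk map against a known-good copy of the same epoch with `cmp -l`. A fabricated demo (in practice the two files would be the map extracted from the bad osd and the same epoch fetched from the mon):

```shell
#!/bin/sh
# Demo: how a single flipped bit shows up in cmp -l output.
# Two 4-byte files are fabricated here; real file names would be e.g.
# the bad osd's extracted map vs. the mon's copy of the same epoch.
printf '\001\002\003\004' > good.map
printf '\001\002\203\004' > bad.map   # octal 203 = 003 with one bit flipped

# cmp -l prints one line per differing byte: offset, byte-in-A, byte-in-B (octal)
diffs=$(cmp -l bad.map good.map) || true
echo "$diffs"
```

A single output line whose two octal bytes differ in exactly one bit position is the signature of this kind of flip.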
But I did see things like this happen before, under Hammer and Jewel, where I
needed to do these kinds of things. The crashes looked very similar.
To get an osd starting, we have to extract the full osdmap from the
mon, and set it into the crashing osd. So for osd.666:

# ceph osd getmap 2982809 -o 2982809
# ceph-objectstore-tool --op set-osdmap --data-path /var/lib/ceph/osd/ceph-666/ --file 2982809
Some osds had multiple corrupted osdmaps -- so we scriptified the above.
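The "scriptified" version might have looked roughly like the sketch below (the osd id and epoch range are placeholders; the `echo` keeps it a dry run -- drop it to actually execute):

```shell
#!/bin/sh
# Hypothetical sketch of the recovery loop described above: fetch each
# full osdmap from the mon and inject it into the stopped OSD's store.
# OSD_ID, FIRST, and LAST are placeholders for the crashing osd and the
# range of corrupted epochs; 'echo' makes this a dry run.
OSD_ID=666
FIRST=2982808
LAST=2982809

for epoch in $(seq "$FIRST" "$LAST"); do
    echo ceph osd getmap "$epoch" -o "/tmp/osdmap.$epoch"
    echo ceph-objectstore-tool --op set-osdmap \
        --data-path "/var/lib/ceph/osd/ceph-$OSD_ID/" \
        --file "/tmp/osdmap.$epoch"
done
```

The osd must be stopped while the maps are injected, since ceph-objectstore-tool operates on the store directly.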
Were those corrupted ones in sequence?
As of now our PGs are all active, but we're not
confident that this won't happen again (without knowing why the maps
were corrupting).

Awesome!

Wido
Thanks to all who helped!
dan
> On Thu, Feb 20, 2020 at 1:01 PM Dan van der Ster <dan(a)vanderster.com> wrote:
>
> 680 is epoch 2983572
> 666 crashes at 2982809 or 2982808
>
> -407> 2020-02-20 11:20:24.960 7f4d931b5b80 10 osd.666 0 add_map_bl
> 2982809 612069 bytes
> -407> 2020-02-20 11:20:24.966 7f4d931b5b80 -1 *** Caught signal (Aborted) **
> in thread 7f4d931b5b80 thread_name:ceph-osd
>
> So I grabbed 2982809 and 2982808 and am setting them.
>
> Checking if the osds will start with that.
>
> -- dan
>
>
>
>> On Thu, Feb 20, 2020 at 12:47 PM Wido den Hollander <wido(a)42on.com> wrote:
>> On 2/20/20 12:40 PM, Dan van der Ster wrote:
>>> Hi,
>>>
>>> My turn.
>>> We suddenly have a big outage which is similar/identical to
>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-August/036519.html
>>>
>>> Some of the osds are runnable, but most crash when they start -- crc
>>> error in osdmap::decode.
>>> I'm able to extract an osd map from a good osd and it decodes well
>>> with osdmaptool:
>>>
>>> # ceph-objectstore-tool --op get-osdmap --data-path
>>> /var/lib/ceph/osd/ceph-680/ --file osd.680.map
>>>
>>> But when I try on one of the bad osds I get:
>>>
>>> # ceph-objectstore-tool --op get-osdmap --data-path
>>> /var/lib/ceph/osd/ceph-666/ --file osd.666.map
>>> terminate called after throwing an instance of 'ceph::buffer::malformed_input'
>>> what(): buffer::malformed_input: bad crc, actual 822724616 !=
>>> expected 2334082500
>>> *** Caught signal (Aborted) **
>>> in thread 7f600aa42d00 thread_name:ceph-objectstor
>>> ceph version 13.2.7 (71bd687b6e8b9424dd5e5974ed542595d8977416) mimic (stable)
>>> 1: (()+0xf5f0) [0x7f5ffefc45f0]
>>> 2: (gsignal()+0x37) [0x7f5ffdbae337]
>>> 3: (abort()+0x148) [0x7f5ffdbafa28]
>>> 4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f5ffe4be7d5]
>>> 5: (()+0x5e746) [0x7f5ffe4bc746]
>>> 6: (()+0x5e773) [0x7f5ffe4bc773]
>>> 7: (()+0x5e993) [0x7f5ffe4bc993]
>>> 8: (OSDMap::decode(ceph::buffer::list::iterator&)+0x160e) [0x7f6000f4168e]
>>> 9: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f6000f42e31]
>>> 10: (get_osdmap(ObjectStore*, unsigned int, OSDMap&,
>>> ceph::buffer::list&)+0x1d0) [0x55d30a489190]
>>> 11: (main()+0x5340) [0x55d30a3aae70]
>>> 12: (__libc_start_main()+0xf5) [0x7f5ffdb9a505]
>>> 13: (()+0x3a0f40) [0x55d30a483f40]
>>> Aborted (core dumped)
>>>
>>>
>>>
>>> I think I want to inject the osdmap, but can't:
>>>
>>> # ceph-objectstore-tool --op set-osdmap --data-path
>>> /var/lib/ceph/osd/ceph-666/ --file osd.680.map
>>> osdmap (#-1:b65b78ab:::osdmap.2983572:0#) does not exist.
>>>
>>
>> Have you tried to list which epoch osd.680 is at and which one osd.666
>> is at? And which one the MONs are at?
>>
>> Maybe there is a difference there?
>>
>> Wido
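A sketch of that epoch comparison ('osd.680.map' is the map extracted earlier with ceph-objectstore-tool; the commands are echoed as a dry run -- remove the `echo` to run them against a live cluster):

```shell
#!/bin/sh
# Sketch of the suggested epoch check. 'osd.680.map' is a placeholder for
# the map extracted from a good osd; echoed here as a dry run.
echo 'osdmaptool --print osd.680.map | grep "^epoch"'  # epoch of the extracted map
echo 'ceph osd dump | head -1'                         # current epoch on the mons
```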
>>
>>>
>>> How do I do this?
>>>
>>> Thanks for any help!
>>>
>>> dan
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users(a)ceph.io
>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>>>