Hi,
It is much better with these options. The OSD has now been up for 10
minutes without crashing (before, it was restarting after ~1 minute).
F.
On 29/04/2020 at 15:16, Igor Fedotov wrote:
Hi Francois,
I'll write a more thorough response a bit later.
Meanwhile, could you please try an OSD startup with the following
settings:
debug-bluefs and debug-bdev = 20
bluefs sync write = false
bluefs buffered io = false
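
For reference, in /etc/ceph/ceph.conf this could look like the
following (a minimal sketch, assuming the affected OSD is osd.5 as in
the rest of the thread):

[osd.5]
debug bluefs = 20
debug bdev = 20
bluefs sync write = false
bluefs buffered io = false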
Thanks,
Igor
On 4/29/2020 3:35 PM, Francois Legrand wrote:
> Hi Igor,
> Here is what we did:
> First, as other OSDs were going down, we stopped all operations with:
> ceph osd set norecover
> ceph osd set norebalance
> ceph osd set nobackfill
> ceph osd set pause
>
> to avoid further crashes!
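>
> (These flags can be cleared again once the cluster is stable, with
> the matching unset commands:
> ceph osd unset norecover
> ceph osd unset norebalance
> ceph osd unset nobackfill
> ceph osd unset pause
> )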
>
> Then we moved on to your recommendations (still testing on osd.5):
> in /etc/ceph/ceph.conf we added:
> [osd.5]
> debug bluefs = 20
> debug bdev = 20
>
> We ran:
> ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-5 -l
> /var/log/ceph/bluestore-tool-fsck-osd-5.log --log-level 20 >
> /var/log/ceph/bluestore-tool-fsck-osd-5.out 2>&1
>
> It ended with "fsck success".
>
> It seems that the default value for bluefs sync write is false
> (https://github.com/ceph/ceph/blob/v14.2.8/src/common/options.cc), so
> we changed /etc/ceph/ceph.conf to:
> [osd.5]
> debug bluefs = 20
> debug bdev = 20
> bluefs sync write = true
>
> and restarted the OSD. It crashed!
> We then tried explicitly setting bluefs sync write = false and
> restarted... same result!
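> (Each restart was done with the usual service command; assuming a
> systemd-managed deployment:
> systemctl restart ceph-osd@5
> )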
>
> The logs are here:
> https://we.tl/t-HMiFDu22XH
>
> Moreover, we have a rados gateway pool with hundreds of 4 GB files.
> Could this be the origin of the problem?
> Do you think that removing this pool would help?
>
> Thanks again for your expertise.
> F.
>
>
> On 28/04/2020 at 18:52, Igor Fedotov wrote:
>> Short update: please use the bluefs_sync_write parameter instead of
>> bdev-aio. Changing the latter is in fact not supported.
>>
>> On 4/28/2020 7:35 PM, Igor Fedotov wrote:
>>> Francois,
>>>
>>> here are some observations from your log.
>>>
>>> 1) RocksDB reports an error on the following .sst file:
>>>
>>> -35> 2020-04-28 15:23:47.612 7f4856e82a80 -1 rocksdb: Corruption:
>>> Bad table magic number: expected 9863518390377041911, found
>>> 12950032858166034944 in db/068269.sst
>>>
>>> 2) which relates to BlueFS ino 53361:
>>>
>>> -50> 2020-04-28 15:23:45.103 7f4856e82a80 10 bluefs open_for_read
>>> db/068269.sst (random)
>>> -49> 2020-04-28 15:23:45.103 7f4856e82a80 10 bluefs open_for_read h
>>> 0x557914fb80b0 on file(ino 53361 size 0xc496f919 mtime 2020-04-28
>>> 15:23:39.827515 bdev 1 allocated c4970000 extents
>>> [1:0x383db280000~c4970000])
>>>
>>> 3) and the failed read happens at the end (0xc496f8e4~35, the last
>>> 0x35 bytes) of this huge (3+ GB) file:
>>>
>>> -44> 2020-04-28 15:23:47.514 7f4856e82a80 10 bluefs _read_random h
>>> 0x557914fb80b0 0xc496f8e4~35 from file(ino 53361 size 0xc496f919
>>> mtime 2020-04-28 15:23:39.827515 bdev 1 allocated c4970000 extents
>>> [1:0x383db280000~c4970000])
>>> -43> 2020-04-28 15:23:47.514 7f4856e82a80 20 bluefs _read_random
>>> left 0x71c 0xc496f8e4~35
>>> -42> 2020-04-28 15:23:47.514 7f4856e82a80 20 bluefs _read_random
>>> got 53
>>>
>>> 4) This .sst file was created from scratch shortly before, with a
>>> single-shot 3+ GB write:
>>>
>>> -88> 2020-04-28 15:23:35.661 7f4856e82a80 10 bluefs open_for_write
>>> db/068269.sst
>>> -87> 2020-04-28 15:23:35.661 7f4856e82a80 20 bluefs open_for_write
>>> mapping db/068269.sst to bdev 1
>>> -86> 2020-04-28 15:23:35.662 7f4856e82a80 10 bluefs open_for_write h
>>> 0x5579145e7a40 on file(ino 53361 size 0x0 mtime 2020-04-28
>>> 15:23:35.663142 bdev 1 allocated 0 extents [])
>>> -85> 2020-04-28 15:23:39.826 7f4856e82a80 10 bluefs _flush
>>> 0x5579145e7a40 0x0~c496f919 to file(ino 53361 size 0x0 mtime
>>> 2020-04-28 15:23:35.663142 bdev 1 allocated 0 extents [])
>>>
>>> 5) Presumably RocksDB created this file in an attempt to
>>> recover/compact/process another existing file (ino 52405), which is
>>> pretty large as well.
>>>
>>> There are multiple earlier reads of it in the log; the last one:
>>>
>>> -92> 2020-04-28 15:23:29.857 7f4856e82a80 10 bluefs _read h
>>> 0x5579147286e0 0xc6788000~8000 from file(ino 52405 size 0xc67888a0
>>> mtime 2020-04-25 13:34:55.325699 bdev 0 allocated c6790000 extents
>>> [1:0x381c8220000~10000,1:
>>>
>>> The rationale for connecting these two files is their uncommonly
>>> large, similar sizes.
>>>
>>>
>>> So you have a 3+ GB single-shot BlueFS write and an immediate read
>>> from the end of the written extent which returns an unexpected
>>> magic number.
>>>
>>> It's well known in the software world that implementations handling
>>> large (2+ GB) chunks of data tend to be error-prone. And Ceph is no
>>> exception. Here are a couple of recent examples which are pretty
>>> close to your case:
>>>
>>> https://github.com/ceph/ceph/commit/4d33114a40d5ae0d541c36175977ca22789a3b88
>>>
>>> https://github.com/ceph/ceph/commit/207806abaa91259d9bb726c2555e7e21ac541527
>>>
>>> Although these are already fixed in Nautilus 14.2.8, there might be
>>> others along the write path (including H/W firmware).
>>>
>>> The good news is that the failure happens on a newly written file
>>> (remember, the invalid magic number is read at the end(!) of this
>>> new large file shortly(!) after the write). So it looks like the
>>> data hasn't actually landed on disk yet. And hopefully the original
>>> data is untouched.
>>>
>>> You can try running ceph-bluestore-tool's fsck command on the
>>> broken OSD. This opens the DB read-only and will hopefully confirm
>>> the idea that the issue is on the write path rather than in
>>> already-corrupted on-disk data.
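>>>
>>> A minimal invocation sketch (assuming the broken OSD is osd.5 with
>>> the usual data path; the log options are optional):
>>>
>>> ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-5 \
>>>     --log-level 20 -l /var/log/ceph/bluestore-tool-fsck-osd-5.log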
>>>
>>> Also, you might want to set bdev-aio to false (or to true if it's
>>> already false) and try OSD startup once more - this takes a
>>> slightly different write path and might provide a workaround.
>>>
>>> Also, please collect debug logs for the OSD startup (with both the
>>> current and the updated bdev-aio parameter) and with
>>> --debug-bdev/--debug-bluefs set to 20. You can omit the
>>> --debug-bluestore increase for now to reduce log size.
>>>
>>>
>>> Thanks,
>>>
>>> Igor
>>>
>>>
>>> On 4/28/2020 5:16 PM, Francois Legrand wrote:
>>>> Here is the output of ceph-bluestore-tool bluefs-bdev-sizes:
>>>> inferring bluefs devices from bluestore path
>>>> slot 1 /var/lib/ceph/osd/ceph-5/block -> /dev/dm-17
>>>> 1 : device size 0x746c0000000 : own 0x[37e1eb00000~4a82900000] =
>>>> 0x4a82900000 : using 0x5bc780000(23 GiB)
>>>>
>>>>
>>>> The log with debug-bluestore (and debug-bluefs) set to 20 for
>>>> osd.5 is at the following address (28 MB):
>>>>
>>>>
>>>> https://wetransfer.com/downloads/a193ab15ab5e2395fe2462c963507a7f2020042814…
>>>>
>>>>
>>>> Thanks for your help.
>>>> F.
>>>>
>>>> On 28/04/2020 at 13:33, Igor Fedotov wrote:
>>>>> Hi Francois,
>>>>>
>>>>>
>>>>> Could you please share an OSD startup log with debug-bluestore
>>>>> (and debug-bluefs) set to 20?
>>>>>
>>>>> Also please run ceph-bluestore-tool's bluefs-bdev-sizes command
>>>>> and share the output.
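>>>>>
>>>>> A minimal sketch of that invocation (assuming the usual OSD data
>>>>> path):
>>>>>
>>>>> ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-5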
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Igor
>>>>>
>>>>>
>>>>> On 4/28/2020 12:55 AM, Francois Legrand wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> *** Short version ***
>>>>>> Is there a way to repair a RocksDB that reports "Encountered
>>>>>> error while reading data from compression dictionary block
>>>>>> Corruption: block checksum mismatch" and "_open_db erroring
>>>>>> opening db"?
>>>>>>
>>>>>>
>>>>>> *** Long version ***
>>>>>> We operate a Nautilus Ceph cluster (with 100 disks of 8 TB in 6
>>>>>> servers + 4 mons/mgrs + 3 MDSes).
>>>>>> We recently (Monday the 20th) upgraded from 14.2.7 to 14.2.8.
>>>>>> This triggered a rebalancing of some data.
>>>>>> Two days later (Wednesday the 22nd) we had a very short power
>>>>>> outage. Only one of the OSD servers went down (and unfortunately
>>>>>> died). This triggered a reconstruction of the lost OSDs.
>>>>>> Operations went fine until Saturday the 25th, when some OSDs on
>>>>>> the 5 remaining servers started to crash for no apparent reason.
>>>>>> We tried to restart them, but they crashed again. We ended up
>>>>>> with 18 OSDs down (+ 16 in the dead server, so 34 OSDs down out
>>>>>> of 100).
>>>>>> Looking at the logs, we found for all the crashed OSDs:
>>>>>>
>>>>>> -237> 2020-04-25 16:32:51.835 7f1f45527a80 3 rocksdb:
>>>>>> [table/block_based_table_reader.cc:1117] Encountered error while
>>>>>> reading data from compression dictionary block Corruption: block
>>>>>> checksum mismatch: expected 0, got 2729370997 in db/181355.sst
>>>>>> offset 18446744073709551615 size 18446744073709551615
>>>>>>
>>>>>> and
>>>>>>
>>>>>> 2020-04-25 16:05:47.251 7fcbd1e46a80 -1
>>>>>> bluestore(/var/lib/ceph/osd/ceph-3) _open_db erroring opening db:
>>>>>>
>>>>>> We also noticed that the "Encountered error while reading data
>>>>>> from compression dictionary block Corruption: block checksum
>>>>>> mismatch" message was already present a few days before the
>>>>>> crash.
>>>>>> We also have some OSDs with this error that are still up.
>>>>>>
>>>>>> We tried to repair with:
>>>>>> ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-3 repair
>>>>>> but with no success (it ends with "_open_db erroring opening db").
>>>>>>
>>>>>> So, does anybody have an idea of how to fix this, or at least
>>>>>> know whether it's possible to repair and correct the "Encountered
>>>>>> error while reading data from compression dictionary block
>>>>>> Corruption: block checksum mismatch" and "_open_db erroring
>>>>>> opening db" errors?
>>>>>> Thanks for your help (we are desperate because we will lose data
>>>>>> and are fighting to save something)!!!
>>>>>> F.
>>>>>>
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list -- ceph-users(a)ceph.io
>>>>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>>>>
>