Short update - please use the bluefs_sync_write parameter instead of
bdev-aio. Changing the latter isn't actually supported.
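
Something along these lines in ceph.conf should do (just a sketch - flip
the value relative to whatever you currently have), then retry the OSD
startup:

[osd]
bluefs_sync_write = true
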
On 4/28/2020 7:35 PM, Igor Fedotov wrote:
> Francois,
>
> here are some observations from your log.
>
> 1) RocksDB reports an error on the following .sst file:
>
> -35> 2020-04-28 15:23:47.612 7f4856e82a80 -1 rocksdb: Corruption: Bad table magic number: expected 9863518390377041911, found 12950032858166034944 in db/068269.sst
>
> 2) which relates to BlueFS ino 53361:
>
> -50> 2020-04-28 15:23:45.103 7f4856e82a80 10 bluefs open_for_read db/068269.sst (random)
> -49> 2020-04-28 15:23:45.103 7f4856e82a80 10 bluefs open_for_read h 0x557914fb80b0 on file(ino 53361 size 0xc496f919 mtime 2020-04-28 15:23:39.827515 bdev 1 allocated c4970000 extents [1:0x383db280000~c4970000])
>
> 3) and the failed read happens at the end (0xc496f8e4~35, the last 0x35
> bytes) of this huge (3+GB) file:
>
> -44> 2020-04-28 15:23:47.514 7f4856e82a80 10 bluefs _read_random h 0x557914fb80b0 0xc496f8e4~35 from file(ino 53361 size 0xc496f919 mtime 2020-04-28 15:23:39.827515 bdev 1 allocated c4970000 extents [1:0x383db280000~c4970000])
> -43> 2020-04-28 15:23:47.514 7f4856e82a80 20 bluefs _read_random left 0x71c 0xc496f8e4~35
> -42> 2020-04-28 15:23:47.514 7f4856e82a80 20 bluefs _read_random got 53
>
> 4) This .sst file was created from scratch shortly before with a
> single-shot 3+GB write:
>
> -88> 2020-04-28 15:23:35.661 7f4856e82a80 10 bluefs open_for_write db/068269.sst
> -87> 2020-04-28 15:23:35.661 7f4856e82a80 20 bluefs open_for_write mapping db/068269.sst to bdev 1
> -86> 2020-04-28 15:23:35.662 7f4856e82a80 10 bluefs open_for_write h 0x5579145e7a40 on file(ino 53361 size 0x0 mtime 2020-04-28 15:23:35.663142 bdev 1 allocated 0 extents [])
> -85> 2020-04-28 15:23:39.826 7f4856e82a80 10 bluefs _flush 0x5579145e7a40 0x0~c496f919 to file(ino 53361 size 0x0 mtime 2020-04-28 15:23:35.663142 bdev 1 allocated 0 extents [])
>
> 5) Presumably RocksDB created this file while attempting to
> recover/compact/process another existing file (ino 52405), which is
> pretty large as well.
>
> There are multiple earlier reads from it in the log; here is the last one:
>
> -92> 2020-04-28 15:23:29.857 7f4856e82a80 10 bluefs _read h 0x5579147286e0 0xc6788000~8000 from file(ino 52405 size 0xc67888a0 mtime 2020-04-25 13:34:55.325699 bdev 0 allocated c6790000 extents [1:0x381c8220000~10000,1:
>
> The rationale for linking these two files is their rather uncommon
> file sizes.
>
>
> So you have a 3+GB single-shot BlueFS write and an immediate read from
> the end of the written extent, which returns an unexpected magic number.
>
> It's well known in the software world that implementations processing
> large (2+GB) chunks of data tend to be error-prone, and Ceph is no
> exception. Here are a couple of recent examples which are pretty close
> to your case:
>
> https://github.com/ceph/ceph/commit/4d33114a40d5ae0d541c36175977ca22789a3b88
>
> https://github.com/ceph/ceph/commit/207806abaa91259d9bb726c2555e7e21ac541527
>
> Although those are already fixed in Nautilus 14.2.8, there might be
> other issues along the write path (including H/W firmware).
>
> The good news is that the failure happens on a newly written file
> (remember, the invalid magic is read at the end(!) of this new large
> file shortly(!) after the write). So it looks like the data hasn't
> actually landed on disk yet, and hopefully the original data is untouched.
>
> You can try running ceph-bluestore-tool's fsck command on the broken
> OSD. This opens the DB read-only and will hopefully confirm that the
> issue is on the write path rather than in already-corrupted on-disk data.
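> For instance (with the OSD process stopped; the path is the one from
> your earlier output):
>
> ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-5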
>
> Also, you might want to set bdev-aio to false (or to true if it's
> already false) and try OSD startup once more - this takes a slightly
> different write path and might provide a workaround.
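> Roughly like this in ceph.conf (or the opposite value if that's what
> you're already running with):
>
> [osd]
> bdev_aio = false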
>
> Also, please collect debug logs for OSD startup (with both the current
> and the updated bdev-aio setting) with --debug-bdev/--debug-bluefs set
> to 20. You can skip raising --debug-bluestore for now to keep the log
> size down.
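> Something like this should do for a one-off run (the osd id here is
> just an example):
>
> ceph-osd -i 5 --debug-bdev 20 --debug-bluefs 20
>
> or, equivalently, put debug_bdev = 20 and debug_bluefs = 20 under [osd]
> in ceph.conf and restart the OSD.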
>
>
> Thanks,
>
> Igor
>
>
> On 4/28/2020 5:16 PM, Francois Legrand wrote:
>> Here is the output of ceph-bluestore-tool bluefs-bdev-sizes
>> inferring bluefs devices from bluestore path
>> slot 1 /var/lib/ceph/osd/ceph-5/block -> /dev/dm-17
>> 1 : device size 0x746c0000000 : own 0x[37e1eb00000~4a82900000] =
>> 0x4a82900000 : using 0x5bc780000(23 GiB)
>>
>>
>> The log with debug-bluestore (and debug-bluefs) set to 20 for osd.5
>> is at the following address (28 MB):
>>
>>
>> https://wetransfer.com/downloads/a193ab15ab5e2395fe2462c963507a7f2020042814…
>>
>>
>> Thanks for your help.
>> F.
>>
>>> On 28/04/2020 at 13:33, Igor Fedotov wrote:
>>> Hi Francois,
>>>
>>>
>>> Could you please share the OSD startup log with debug-bluestore (and
>>> debug-bluefs) set to 20?
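>>> For example, by adding something like this to ceph.conf before
>>> starting the OSD (just one way of doing it; any equivalent way of
>>> raising these debug levels is fine):
>>>
>>> [osd]
>>> debug_bluestore = 20
>>> debug_bluefs = 20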
>>>
>>> Also please run ceph-bluestore-tool's bluefs-bdev-sizes command and
>>> share the output.
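>>> That is, something like (with the OSD stopped; adjust the path to the
>>> affected OSD):
>>>
>>> ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-<id>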
>>>
>>> Thanks,
>>>
>>> Igor
>>>
>>>
>>> On 4/28/2020 12:55 AM, Francois Legrand wrote:
>>>> Hi all,
>>>>
>>>> *** Short version ***
>>>> Is there a way to repair a RocksDB that shows the errors "Encountered
>>>> error while reading data from compression dictionary block Corruption:
>>>> block checksum mismatch" and "_open_db erroring opening db"?
>>>>
>>>>
>>>> *** Long version ***
>>>> We operate a Nautilus Ceph cluster (with 100 disks of 8 TB across 6
>>>> servers + 4 mons/mgrs + 3 mds).
>>>> We recently (Monday 20) upgraded from 14.2.7 to 14.2.8. This
>>>> triggered a rebalancing of some data.
>>>> Two days later (Wednesday 22) we had a very short power outage.
>>>> Only one of the OSD servers went down (and unfortunately died).
>>>> This triggered a reconstruction of the lost OSDs. Operations went
>>>> fine until Saturday 25, when some OSDs on the 5 remaining servers
>>>> started to crash for no apparent reason.
>>>> We tried to restart them, but they crashed again. We ended up with
>>>> 18 OSDs down (+ 16 on the dead server, so 34 OSDs down out of 100).
>>>> Looking at the logs, we found the following for all the crashed OSDs:
>>>>
>>>> -237> 2020-04-25 16:32:51.835 7f1f45527a80 3 rocksdb:
>>>> [table/block_based_table_reader.cc:1117] Encountered error while
>>>> reading data from compression dictionary block Corruption: block
>>>> checksum mismatch: expected 0, got 2729370997 in db/181355.sst
>>>> offset 18446744073709551615 size 18446744073709551615
>>>>
>>>> and
>>>>
>>>> 2020-04-25 16:05:47.251 7fcbd1e46a80 -1
>>>> bluestore(/var/lib/ceph/osd/ceph-3) _open_db erroring opening db:
>>>>
>>>> We also noticed that the "Encountered error while reading data from
>>>> compression dictionary block Corruption: block checksum mismatch"
>>>> message was already present a few days before the crashes.
>>>> We also have some OSDs with this error that are still up.
>>>>
>>>> We tried to repair with:
>>>> ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-3 repair
>>>> but with no success (it ends with _open_db erroring opening db).
>>>>
>>>> So, does somebody have an idea how to fix this, or at least know
>>>> whether it's possible to repair the "Encountered error while reading
>>>> data from compression dictionary block Corruption: block checksum
>>>> mismatch" and "_open_db erroring opening db" errors?
>>>> Thanks for your help (we are desperate because we will lose data and
>>>> are fighting to save what we can)!
>>>> F.
>>>>
>>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io