Hello,
We set bluefs_buffered_io to false for the whole cluster except two OSDs
(21 and 49), for which we kept the value at true for future
experiments/troubleshooting, as you asked.
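For reference, here is roughly what our /etc/ceph/ceph.conf now contains
(a sketch; the per-OSD sections override the [osd] default):

[osd]
bluefs buffered io = false

[osd.21]
bluefs buffered io = true

[osd.49]
bluefs buffered io = true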
We then restarted all 25 down OSDs and they started... except one
(number 8), which still crashes with the same kind of errors.
I tried an fsck on this OSD, which ended in success. I set debug to
20 and recorded the logs.
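The fsck command was roughly the same as the one we used for osd.5 below,
with the log level raised (the log path here is just an example):

ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-8 \
  -l /var/log/ceph/bluestore-tool-fsck-osd-8.log --log-level 20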
You will find the logs there if you want to have a look:
Now we plan to reactivate the recovery, backfill and rebalancing if you
think it's safe.
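That would mean undoing the flags we set earlier:

ceph osd unset pause
ceph osd unset nobackfill
ceph osd unset norebalance
ceph osd unset norecover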
F.
On 29/04/2020 at 16:45, Igor Fedotov wrote:
So the crash seems to be caused by the same issue: a big (and
presumably incomplete) write and a subsequent read failure.
I've managed to repro this locally. So bluefs_buffered_io seems to be
a remedy for now.
But additionally I can observe multiple slow op indications in this
new log, and I think they cause those big writes.
And I presume some RGW-triggered ops are in flight: bucket resharding
or removal or something...
I've seen multiple reports about this causing OSD slowdown, high
memory utilization, and finally huge reads and/or writes from RocksDB.
I don't know how to deal with this at the moment...
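If you want to check whether any bucket resharding is still in flight,
something like this may show it (assuming you can run radosgw-admin
against this cluster):

radosgw-admin reshard list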
Thanks,
Igor
On 4/29/2020 5:33 PM, Francois Legrand wrote:
> Here are the logs of the newly crashed osd.
> F.
>
> On 29/04/2020 at 16:21, Igor Fedotov wrote:
>> Sounds interesting - could you please share the crash log for these
>> new OSDs? They presumably suffer from another issue. At least that
>> first crash is caused by something else.
>>
>>
>> "bluefs buffered io" can be injected on the fly but I expect it to
>> help when OSD isn't starting up only.
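>>
>> For the OSDs that are still running, the on-the-fly injection would be
>> something like:
>>
>> ceph tell osd.* injectargs '--bluefs_buffered_io=false'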
>>
>>
>> On 4/29/2020 5:17 PM, Francois Legrand wrote:
>>> OK, we will try that.
>>> Indeed, restarting osd.5 triggered the crash of two other
>>> OSDs in the cluster.
>>> Thus we will set bluefs buffered io = false for all OSDs and force
>>> bluefs buffered io = true for one of the down OSDs.
>>> Does that modification need injectargs, or is changing it in the
>>> configuration enough to have it applied on the fly?
>>> F.
>>>
>>> On 29/04/2020 at 15:56, Igor Fedotov wrote:
>>>> It's bluefs buffered io = false that did the trick.
>>>>
>>>> It modified the write path, and this presumably fixed the large write(s).
>>>>
>>>> I'm trying to reproduce this locally, but please preserve at least one
>>>> failing OSD (i.e. do not start it with buffered io disabled)
>>>> for future experiments/troubleshooting for a while, if possible.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Igor
>>>>
>>>>
>>>>
>>>> On 4/29/2020 4:50 PM, Francois Legrand wrote:
>>>>> Hi,
>>>>> It seems much better with these options. The OSD has now been up for
>>>>> 10 minutes without crashing (before, it was restarting after ~1 minute).
>>>>> F.
>>>>>
>>>>> On 29/04/2020 at 15:16, Igor Fedotov wrote:
>>>>>> Hi Francois,
>>>>>>
>>>>>> I'll write a more thorough response a bit later.
>>>>>>
>>>>>> Meanwhile could you please try OSD startup with the following
>>>>>> settings now:
>>>>>>
>>>>>> debug-bluefs and debug-bdev = 20
>>>>>>
>>>>>> bluefs sync write = false
>>>>>>
>>>>>> bluefs buffered io = false
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Igor
>>>>>>
>>>>>>
>>>>>> On 4/29/2020 3:35 PM, Francois Legrand wrote:
>>>>>>> Hi Igor,
>>>>>>> Here is what we did:
>>>>>>> First, as other OSDs were falling down, we stopped all
>>>>>>> operations with:
>>>>>>> ceph osd set norecover
>>>>>>> ceph osd set norebalance
>>>>>>> ceph osd set nobackfill
>>>>>>> ceph osd set pause
>>>>>>>
>>>>>>> to avoid further crashes!
>>>>>>>
>>>>>>> Then we moved on to your recommendations (still testing on osd.5):
>>>>>>> in /etc/ceph/ceph.conf we added:
>>>>>>> [osd.5]
>>>>>>> debug bluefs = 20
>>>>>>> debug bdev = 20
>>>>>>>
>>>>>>> We ran:
>>>>>>> ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-5 -l
>>>>>>> /var/log/ceph/bluestore-tool-fsck-osd-5.log --log-level 20 >
>>>>>>> /var/log/ceph/bluestore-tool-fsck-osd-5.out 2>&1
>>>>>>>
>>>>>>> It ended with "fsck success".
>>>>>>>
>>>>>>> It seems that the default value for bluefs sync write is false
>>>>>>> (https://github.com/ceph/ceph/blob/v14.2.8/src/common/options.cc),
>>>>>>> thus we changed /etc/ceph/ceph.conf to:
>>>>>>> [osd.5]
>>>>>>> debug bluefs = 20
>>>>>>> debug bdev = 20
>>>>>>> bluefs sync write = true
>>>>>>>
>>>>>>> and restarted the OSD. It crashed!
>>>>>>> We tried explicitly setting bluefs sync write = false and
>>>>>>> restarted... same result!
>>>>>>>
>>>>>>> The logs are here: https://we.tl/t-HMiFDu22XH
>>>>>>>
>>>>>>> Moreover, we have a rados gateway pool with hundreds of 4GB
>>>>>>> files. Could this be the origin of the problem?
>>>>>>> Do you think that removing this pool would help?
>>>>>>>
>>>>>>> Thanks again for your expertise.
>>>>>>> F.
>>>>>>>
>>>>>>>
>>>>>>> On 28/04/2020 at 18:52, Igor Fedotov wrote:
>>>>>>>> Short update: please use the bluefs_sync_write parameter
>>>>>>>> instead of bdev-aio. Changing the latter isn't actually supported.
>>>>>>>>
>>>>>>>> On 4/28/2020 7:35 PM, Igor Fedotov wrote:
>>>>>>>>> Francois,
>>>>>>>>>
>>>>>>>>> here are some observations gathered from your log.
>>>>>>>>>
>>>>>>>>> 1) RocksDB reports an error on the following .sst file:
>>>>>>>>>
>>>>>>>>> -35> 2020-04-28 15:23:47.612 7f4856e82a80 -1 rocksdb:
>>>>>>>>> Corruption: Bad table magic number: expected
>>>>>>>>> 9863518390377041911, found 12950032858166034944 in db/068269.sst
>>>>>>>>>
>>>>>>>>> 2) which relates to BlueFS ino 53361:
>>>>>>>>>
>>>>>>>>> -50> 2020-04-28 15:23:45.103 7f4856e82a80 10 bluefs
>>>>>>>>> open_for_read db/068269.sst (random)
>>>>>>>>> -49> 2020-04-28 15:23:45.103 7f4856e82a80 10 bluefs
>>>>>>>>> open_for_read h 0x557914fb80b0 on file(ino 53361 size
>>>>>>>>> 0xc496f919 mtime 2020-04-28 15:23:39.827515 bdev 1
>>>>>>>>> allocated c4970000 extents [1:0x383db280000~c4970000])
>>>>>>>>>
>>>>>>>>> 3) and the failed read happens at the end (0xc496f8e4~35, the
>>>>>>>>> last 0x35 bytes) of this huge (3+GB) file:
>>>>>>>>>
>>>>>>>>> -44> 2020-04-28 15:23:47.514 7f4856e82a80 10 bluefs
>>>>>>>>> _read_random h 0x557914fb80b0 0xc496f8e4~35 from file(ino
>>>>>>>>> 53361 size 0xc496f919 mtime 2020-04-28 15:23:39.827515
>>>>>>>>> bdev 1 allocated c4970000 extents [1:0x383db280000~c4970000])
>>>>>>>>> -43> 2020-04-28 15:23:47.514 7f4856e82a80 20 bluefs
>>>>>>>>> _read_random left 0x71c 0xc496f8e4~35
>>>>>>>>> -42> 2020-04-28 15:23:47.514 7f4856e82a80 20 bluefs
>>>>>>>>> _read_random got 53
>>>>>>>>>
>>>>>>>>> 4) This .sst file was created from scratch shortly before,
>>>>>>>>> with a single-shot 3+GB write:
>>>>>>>>>
>>>>>>>>> -88> 2020-04-28 15:23:35.661 7f4856e82a80 10 bluefs
>>>>>>>>> open_for_write db/068269.sst
>>>>>>>>> -87> 2020-04-28 15:23:35.661 7f4856e82a80 20 bluefs
>>>>>>>>> open_for_write mapping db/068269.sst to bdev 1
>>>>>>>>> -86> 2020-04-28 15:23:35.662 7f4856e82a80 10 bluefs
>>>>>>>>> open_for_write h 0x5579145e7a40 on file(ino 53361 size 0x0
>>>>>>>>> mtime 2020-04-28 15:23:35.663142 bdev 1 allocated 0
>>>>>>>>> extents [])
>>>>>>>>> -85> 2020-04-28 15:23:39.826 7f4856e82a80 10 bluefs _flush
>>>>>>>>> 0x5579145e7a40 0x0~c496f919 to file(ino 53361 size 0x0
>>>>>>>>> mtime 2020-04-28 15:23:35.663142 bdev 1 allocated 0
>>>>>>>>> extents [])
>>>>>>>>>
>>>>>>>>> 5) Presumably RocksDB creates this file in an attempt to
>>>>>>>>> recover/compact/process another existing file (ino 52405),
>>>>>>>>> which is pretty large as well.
>>>>>>>>>
>>>>>>>>> There are multiple earlier reads of it; the last one:
>>>>>>>>>
>>>>>>>>> -92> 2020-04-28 15:23:29.857 7f4856e82a80 10 bluefs _read
>>>>>>>>> h 0x5579147286e0 0xc6788000~8000 from file(ino 52405
>>>>>>>>> size 0xc67888a0 mtime 2020-04-25 13:34:55.325699 bdev 0
>>>>>>>>> allocated c6790000 extents [1:0x381c8220000~10000,1:
>>>>>>>>>
>>>>>>>>> The rationale for tying these two files together is their
>>>>>>>>> similarly uncommon file sizes.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> So you have a 3+GB single-shot BlueFS write and an immediate read
>>>>>>>>> from the end of the written extent, which returns an unexpected
>>>>>>>>> magic number.
>>>>>>>>>
>>>>>>>>> It's well known in the software world that large (2+GB) data
>>>>>>>>> processing implementations tend to be error-prone, and Ceph
>>>>>>>>> is not an exception. Here are a couple of recent examples
>>>>>>>>> which are pretty close to your case:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://github.com/ceph/ceph/commit/4d33114a40d5ae0d541c36175977ca22789a3b88
>>>>>>>>>
>>>>>>>>> https://github.com/ceph/ceph/commit/207806abaa91259d9bb726c2555e7e21ac541527
>>>>>>>>>
>>>>>>>>> Although these are already fixed in Nautilus 14.2.8, there
>>>>>>>>> might be others present along the write path (including H/W
>>>>>>>>> firmware).
>>>>>>>>>
>>>>>>>>> The good news is that the failure happens on a newly written
>>>>>>>>> file (remember, the invalid magic is read at the end(!) of this
>>>>>>>>> new large file shortly(!) after the write). So it looks like the
>>>>>>>>> data hasn't actually landed on disk yet, and hopefully the
>>>>>>>>> original data is untouched.
>>>>>>>>>
>>>>>>>>> You can try running ceph-bluestore-tool's fsck command on the
>>>>>>>>> broken OSD. This will open the DB read-only and hopefully
>>>>>>>>> prove the idea that the issue is on the write path rather
>>>>>>>>> than in already-corrupted data.
>>>>>>>>>
>>>>>>>>> Also, you might want to set bdev-aio to false (or true if it's
>>>>>>>>> already false) and try OSD startup once more; this will
>>>>>>>>> go via a somewhat different write path and may provide a
>>>>>>>>> workaround.
>>>>>>>>>
>>>>>>>>> Also please collect debug logs for OSD startup (with both the
>>>>>>>>> current and the updated bdev-aio parameter) and
>>>>>>>>> --debug-bdev/--debug-bluefs set to 20. You can omit the
>>>>>>>>> --debug-bluestore increase for now to reduce log size.
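>>>>>>>>>
>>>>>>>>> For example, you could run the OSD in the foreground with the
>>>>>>>>> debug options on the command line (a sketch, assuming osd.5):
>>>>>>>>>
>>>>>>>>> ceph-osd -f -i 5 --debug-bdev 20 --debug-bluefs 20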
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Igor
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 4/28/2020 5:16 PM, Francois Legrand wrote:
>>>>>>>>>> Here is the output of ceph-bluestore-tool bluefs-bdev-sizes:
>>>>>>>>>> inferring bluefs devices from bluestore path
>>>>>>>>>> slot 1 /var/lib/ceph/osd/ceph-5/block -> /dev/dm-17
>>>>>>>>>> 1 : device size 0x746c0000000 : own
>>>>>>>>>> 0x[37e1eb00000~4a82900000] = 0x4a82900000 : using
>>>>>>>>>> 0x5bc780000(23 GiB)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The result of debug-bluestore (and debug-bluefs) set to 20
>>>>>>>>>> for osd.5 is at the following address (28MB):
>>>>>>>>>>
>>>>>>>>>> https://wetransfer.com/downloads/a193ab15ab5e2395fe2462c963507a7f2020042814…
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks for your help.
>>>>>>>>>> F.
>>>>>>>>>>
>>>>>>>>>> On 28/04/2020 at 13:33, Igor Fedotov wrote:
>>>>>>>>>>> Hi Francois,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Could you please share an OSD startup log with debug-bluestore
>>>>>>>>>>> (and debug-bluefs) set to 20?
>>>>>>>>>>>
>>>>>>>>>>> Also please run ceph-bluestore-tool's bluefs-bdev-sizes
>>>>>>>>>>> command and share the output.
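>>>>>>>>>>>
>>>>>>>>>>> For example (assuming the affected OSD is osd.5 with the usual
>>>>>>>>>>> data path):
>>>>>>>>>>>
>>>>>>>>>>> ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-5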
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Igor
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 4/28/2020 12:55 AM, Francois Legrand wrote:
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> *** Short version ***
>>>>>>>>>>>> Is there a way to repair a RocksDB showing the errors
>>>>>>>>>>>> "Encountered error while reading data from compression
>>>>>>>>>>>> dictionary block Corruption: block checksum mismatch" and
>>>>>>>>>>>> "_open_db erroring opening db"?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> *** Long version ***
>>>>>>>>>>>> We operate a Nautilus Ceph cluster (with 100 disks of 8TB
>>>>>>>>>>>> in 6 servers + 4 mons/mgr + 3 mds).
>>>>>>>>>>>> We recently (Monday 20) upgraded from 14.2.7 to 14.2.8.
>>>>>>>>>>>> This triggered a rebalancing of some data.
>>>>>>>>>>>> Two days later (Wednesday 22) we had a very short power
>>>>>>>>>>>> outage. Only one of the OSD servers went down (and
>>>>>>>>>>>> unfortunately died).
>>>>>>>>>>>> This triggered a reconstruction of the lost OSDs.
>>>>>>>>>>>> Operations went fine until Saturday 25, when some OSDs in
>>>>>>>>>>>> the 5 remaining servers started to crash, apparently for
>>>>>>>>>>>> no reason.
>>>>>>>>>>>> We tried to restart them, but they crashed again. We ended up
>>>>>>>>>>>> with 18 OSDs down (+ 16 in the dead server, so 34 OSDs down
>>>>>>>>>>>> out of 100).
>>>>>>>>>>>> Looking at the logs, we found for all the crashed OSDs:
>>>>>>>>>>>>
>>>>>>>>>>>> -237> 2020-04-25 16:32:51.835 7f1f45527a80 3 rocksdb:
>>>>>>>>>>>> [table/block_based_table_reader.cc:1117] Encountered error
>>>>>>>>>>>> while reading data from compression dictionary block
>>>>>>>>>>>> Corruption: block checksum mismatch: expected 0, got
>>>>>>>>>>>> 2729370997 in db/181355.sst offset 18446744073709551615
>>>>>>>>>>>> size 18446744073709551615
>>>>>>>>>>>>
>>>>>>>>>>>> and
>>>>>>>>>>>>
>>>>>>>>>>>> 2020-04-25 16:05:47.251 7fcbd1e46a80 -1
>>>>>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-3) _open_db erroring
>>>>>>>>>>>> opening db:
>>>>>>>>>>>>
>>>>>>>>>>>> We also noticed that the "Encountered error while reading
>>>>>>>>>>>> data from compression dictionary block Corruption: block
>>>>>>>>>>>> checksum mismatch" error was present a few days before the
>>>>>>>>>>>> crashes. We also have some OSDs with this error that are
>>>>>>>>>>>> still up.
>>>>>>>>>>>>
>>>>>>>>>>>> We tried to repair with:
>>>>>>>>>>>> ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-3 repair
>>>>>>>>>>>> But no success (it ends with "_open_db erroring
>>>>>>>>>>>> opening db").
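>>>>>>>>>>>>
>>>>>>>>>>>> We have not yet dared to try ceph-bluestore-tool's own repair
>>>>>>>>>>>> command, i.e. something like:
>>>>>>>>>>>> ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-3
>>>>>>>>>>>> as we don't know whether it is safe at this point.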
>>>>>>>>>>>>
>>>>>>>>>>>> Thus, does somebody have an idea of how to fix this, or at
>>>>>>>>>>>> least know whether it is possible to repair and correct the
>>>>>>>>>>>> "Encountered error while reading data from compression
>>>>>>>>>>>> dictionary block Corruption: block checksum mismatch" and
>>>>>>>>>>>> "_open_db erroring opening db" errors?
>>>>>>>>>>>> Thanks for your help (we are desperate because we will
>>>>>>>>>>>> lose data and are fighting to save something)!!!
>>>>>>>>>>>> F.