I created the following ticket and PR to track/fix the issue with
incomplete large writes when bluefs_buffered_io=1:
https://tracker.ceph.com/issues/45337
https://github.com/ceph/ceph/pull/34836
But in fact, setting bluefs_buffered_io to false is the mainstream
approach for now; see
https://github.com/ceph/ceph/pull/34224
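
(If you want that to stick cluster-wide, one way - just a sketch, assuming
you manage options through the monitors' config database rather than
ceph.conf - is:)

ceph config set osd bluefs_buffered_io false
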
Francois, you can proceed with OSD.21 & OSD.49; I can reproduce the issue
locally, so there is not much need for them now.
Still investigating what's happening with OSD.8...
As for reactivating recovery/backfill/rebalancing - I can't say for sure
yet whether it's safe or not.
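
For reference, once it is deemed safe, re-enabling should just be a matter
of clearing the flags that were set earlier (a sketch - please hold off
until then):

ceph osd unset pause
ceph osd unset nobackfill
ceph osd unset norebalance
ceph osd unset norecover
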
Thanks,
Igor
On 4/30/2020 1:39 AM, Francois Legrand wrote:
> Hello,
> We set bluefs_buffered_io to false for the whole cluster except 2 OSDs
> (21 and 49), for which we decided to keep the value at true for future
> experiments/troubleshooting, as you asked.
> We then restarted all 25 down OSDs and they started... except one
> (number 8), which still continues to crash with the same kind of errors.
> I tried an fsck on this OSD, which ended with success. I set the debug
> level to 20 and recorded the logs.
> You will find the logs here if you want to have a look:
>
> https://we.tl/t-GDvvvi2Gmm
> Now we plan to reactivate the recovery, backfill and rebalancing if
> you think it's safe.
> F.
>
> On 29/04/2020 at 16:45, Igor Fedotov wrote:
>> So the crash seems to be caused by the same issue - big (and
>> presumably incomplete) write and subsequent read failure.
>>
>> I've managed to reproduce this locally. So bluefs_buffered_io = false
>> seems to be a remedy for now.
>>
>> But additionally I can see multiple slow op indications in this new
>> log, and I think they cause those big writes.
>>
>> And I presume some RGW-triggered ops are in flight - bucket
>> resharding or removal or something..
>>
>> I've seen multiple reports about this stuff causing OSD slowdown,
>> high memory utilization and finally huge reads and/or writes from
>> RocksDB.
>>
>> Don't know how to deal with this at the moment...
>>
>> Thanks,
>>
>> Igor
>>
>>
>> On 4/29/2020 5:33 PM, Francois Legrand wrote:
>>> Here are the logs of the newly crashed osd.
>>> F.
>>>
>>>> On 29/04/2020 at 16:21, Igor Fedotov wrote:
>>>> Sounds interesting - could you please share the crash log for these
>>>> new OSDs? They presumably suffer from another issue. At least that
>>>> first crash is caused by something else.
>>>>
>>>>
>>>> "bluefs buffered io" can be injected on the fly but I expect it
to
>>>> help when OSD isn't starting up only.
>>>>
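>>>> (If you do want to try applying it at runtime anyway, something along
>>>> these lines should inject it on the fly - the osd target is just an
>>>> example - though, as said, I'd expect it to matter mostly at startup:)
>>>>
>>>> ceph tell osd.* injectargs '--bluefs_buffered_io=false'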
>>>>
>>>> On 4/29/2020 5:17 PM, Francois Legrand wrote:
>>>>> OK, we will try that.
>>>>> Indeed, restarting osd.5 caused two other OSDs in the cluster to go
>>>>> down.
>>>>> Thus we will set bluefs buffered io = false for all OSDs and force
>>>>> bluefs buffered io = true for one of the down OSDs.
>>>>> Does that modification require injectargs, or is changing it in the
>>>>> configuration enough to have it applied on the fly?
>>>>> F.
>>>>>
>>>>>> On 29/04/2020 at 15:56, Igor Fedotov wrote:
>>>>>> That's bluefs buffered io = false which did the trick.
>>>>>>
>>>>>> It modified the write path, and this presumably fixed the large
>>>>>> write(s).
>>>>>>
>>>>>> I'm trying to reproduce this locally, but please preserve at least
>>>>>> one failing OSD (i.e. do not start it with buffered io disabled) for
>>>>>> future experiments/troubleshooting for a while, if possible.
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Igor
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 4/29/2020 4:50 PM, Francois Legrand wrote:
>>>>>>> Hi,
>>>>>>> It seems much better with these options. The OSD has now been up
>>>>>>> for 10 minutes without crashing (before, it was rebooting after
>>>>>>> ~1 minute).
>>>>>>> F.
>>>>>>>
>>>>>>>> On 29/04/2020 at 15:16, Igor Fedotov wrote:
>>>>>>>> Hi Francois,
>>>>>>>>
>>>>>>>> I'll write a more thorough response a bit later.
>>>>>>>>
>>>>>>>> Meanwhile, could you please try OSD startup with the following
>>>>>>>> settings now:
>>>>>>>>
>>>>>>>> debug-bluefs and debug-bdev = 20
>>>>>>>>
>>>>>>>> bluefs sync write = false
>>>>>>>>
>>>>>>>> bluefs buffered io = false
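>>>>>>>>
>>>>>>>> (I.e., roughly this in ceph.conf for the OSD under test - osd.5
>>>>>>>> here is just the example id:)
>>>>>>>>
>>>>>>>> [osd.5]
>>>>>>>> debug bluefs = 20
>>>>>>>> debug bdev = 20
>>>>>>>> bluefs sync write = false
>>>>>>>> bluefs buffered io = false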
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Igor
>>>>>>>>
>>>>>>>>
>>>>>>>> On 4/29/2020 3:35 PM, Francois Legrand wrote:
>>>>>>>>> Hi Igor,
>>>>>>>>> Here is what we did:
>>>>>>>>> First, as other OSDs were falling down, we stopped all
>>>>>>>>> operations with
>>>>>>>>> ceph osd set norecover
>>>>>>>>> ceph osd set norebalance
>>>>>>>>> ceph osd set nobackfill
>>>>>>>>> ceph osd set pause
>>>>>>>>>
>>>>>>>>> to avoid further crashes!
>>>>>>>>>
>>>>>>>>> Then we moved on to your recommendations (still testing on osd.5):
>>>>>>>>> In /etc/ceph/ceph.conf we added:
>>>>>>>>> [osd.5]
>>>>>>>>> debug bluefs = 20
>>>>>>>>> debug bdev = 20
>>>>>>>>>
>>>>>>>>> We ran:
>>>>>>>>> ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-5 \
>>>>>>>>>   -l /var/log/ceph/bluestore-tool-fsck-osd-5.log --log-level 20 \
>>>>>>>>>   > /var/log/ceph/bluestore-tool-fsck-osd-5.out 2>&1
>>>>>>>>>
>>>>>>>>> It ended with "fsck success".
>>>>>>>>>
>>>>>>>>> It seems that the default value for bluefs sync write is false
>>>>>>>>> (https://github.com/ceph/ceph/blob/v14.2.8/src/common/options.cc),
>>>>>>>>> thus we changed /etc/ceph/ceph.conf to:
>>>>>>>>> [osd.5]
>>>>>>>>> debug bluefs = 20
>>>>>>>>> debug bdev = 20
>>>>>>>>> bluefs sync write = true
>>>>>>>>>
>>>>>>>>> and restarted the OSD. It crashed!
>>>>>>>>> We tried to explicitly set bluefs sync write = false and
>>>>>>>>> restarted... same result!
>>>>>>>>>
>>>>>>>>> The logs are here: https://we.tl/t-HMiFDu22XH
>>>>>>>>>
>>>>>>>>> Moreover, we have a rados gateway pool with hundreds of 4GB
>>>>>>>>> files. Can this be the origin of the problem?
>>>>>>>>> Do you think that removing this pool could help?
>>>>>>>>>
>>>>>>>>> Thanks again for your expertise.
>>>>>>>>> F.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 28/04/2020 at 18:52, Igor Fedotov wrote:
>>>>>>>>>> Short update - please use the bluefs_sync_write parameter
>>>>>>>>>> instead of bdev-aio. Changing the latter isn't in fact
>>>>>>>>>> supported.
>>>>>>>>>>
>>>>>>>>>> On 4/28/2020 7:35 PM, Igor Fedotov wrote:
>>>>>>>>>>> Francois,
>>>>>>>>>>>
>>>>>>>>>>> here are some observations from your log.
>>>>>>>>>>>
>>>>>>>>>>> 1) RocksDB reports an error on the following .sst file:
>>>>>>>>>>>
>>>>>>>>>>> -35> 2020-04-28 15:23:47.612 7f4856e82a80 -1 rocksdb:
>>>>>>>>>>> Corruption: Bad table magic number: expected 9863518390377041911,
>>>>>>>>>>> found 12950032858166034944 in db/068269.sst
>>>>>>>>>>>
>>>>>>>>>>> 2) which relates to BlueFS ino 53361:
>>>>>>>>>>>
>>>>>>>>>>> -50> 2020-04-28 15:23:45.103 7f4856e82a80 10 bluefs
>>>>>>>>>>> open_for_read db/068269.sst (random)
>>>>>>>>>>> -49> 2020-04-28 15:23:45.103 7f4856e82a80 10 bluefs
>>>>>>>>>>> open_for_read h 0x557914fb80b0 on file(ino 53361 size 0xc496f919
>>>>>>>>>>> mtime 2020-04-28 15:23:39.827515 bdev 1 allocated c4970000
>>>>>>>>>>> extents [1:0x383db280000~c4970000])
>>>>>>>>>>>
>>>>>>>>>>> 3) and the failed read happens at the end (0xc496f8e4~35, the
>>>>>>>>>>> last 0x35 bytes) of this huge (3+GB) file:
>>>>>>>>>>>
>>>>>>>>>>> -44> 2020-04-28 15:23:47.514 7f4856e82a80 10 bluefs
>>>>>>>>>>> _read_random h 0x557914fb80b0 0xc496f8e4~35 from file(ino 53361
>>>>>>>>>>> size 0xc496f919 mtime 2020-04-28 15:23:39.827515 bdev 1 allocated
>>>>>>>>>>> c4970000 extents [1:0x383db280000~c4970000])
>>>>>>>>>>> -43> 2020-04-28 15:23:47.514 7f4856e82a80 20 bluefs
>>>>>>>>>>> _read_random left 0x71c 0xc496f8e4~35
>>>>>>>>>>> -42> 2020-04-28 15:23:47.514 7f4856e82a80 20 bluefs
>>>>>>>>>>> _read_random got 53
>>>>>>>>>>>
>>>>>>>>>>> 4) This .sst file was created from scratch shortly before, with
>>>>>>>>>>> a single-shot 3+GB write:
>>>>>>>>>>>
>>>>>>>>>>> -88> 2020-04-28 15:23:35.661 7f4856e82a80 10 bluefs
>>>>>>>>>>> open_for_write db/068269.sst
>>>>>>>>>>> -87> 2020-04-28 15:23:35.661 7f4856e82a80 20 bluefs
>>>>>>>>>>> open_for_write mapping db/068269.sst to bdev 1
>>>>>>>>>>> -86> 2020-04-28 15:23:35.662 7f4856e82a80 10 bluefs
>>>>>>>>>>> open_for_write h 0x5579145e7a40 on file(ino 53361 size 0x0
>>>>>>>>>>> mtime 2020-04-28 15:23:35.663142 bdev 1 allocated 0 extents [])
>>>>>>>>>>> -85> 2020-04-28 15:23:39.826 7f4856e82a80 10 bluefs
>>>>>>>>>>> _flush 0x5579145e7a40 0x0~c496f919 to file(ino 53361 size 0x0
>>>>>>>>>>> mtime 2020-04-28 15:23:35.663142 bdev 1 allocated 0 extents [])
>>>>>>>>>>>
>>>>>>>>>>> 5) Presumably RocksDB creates this file in an attempt to
>>>>>>>>>>> recover/compact/process another existing file (ino 52405),
>>>>>>>>>>> which is pretty large as well.
>>>>>>>>>>>
>>>>>>>>>>> Please find multiple earlier reads, the last one:
>>>>>>>>>>>
>>>>>>>>>>> -92> 2020-04-28 15:23:29.857 7f4856e82a80 10 bluefs _read
>>>>>>>>>>> h 0x5579147286e0 0xc6788000~8000 from file(ino 52405 size
>>>>>>>>>>> 0xc67888a0 mtime 2020-04-25 13:34:55.325699 bdev 0 allocated
>>>>>>>>>>> c6790000 extents [1:0x381c8220000~10000,1:
>>>>>>>>>>>
>>>>>>>>>>> The rationale for linking these two files is their pretty
>>>>>>>>>>> uncommon file sizes.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> So you have a 3+GB single-shot BlueFS write and an immediate
>>>>>>>>>>> read from the end of the written extent which returns an
>>>>>>>>>>> unexpected magic number.
>>>>>>>>>>>
>>>>>>>>>>> It's well known in the software world that large (2+GB) data
>>>>>>>>>>> processing implementations tend to be error-prone, and Ceph is
>>>>>>>>>>> no exception. Here are a couple of recent examples which are
>>>>>>>>>>> pretty close to your case:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/ceph/ceph/commit/4d33114a40d5ae0d541c36175977ca22789a3b88
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/ceph/ceph/commit/207806abaa91259d9bb726c2555e7e21ac541527
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Although those are already fixed in Nautilus 14.2.8, there might
>>>>>>>>>>> be other such issues present along the write path (including in
>>>>>>>>>>> H/W firmware).
>>>>>>>>>>>
>>>>>>>>>>> The good news is that the failure happens on a newly written
>>>>>>>>>>> file (remember, the invalid magic is read at the end(!) of this
>>>>>>>>>>> new large file shortly(!) after the write). So it looks like the
>>>>>>>>>>> data hasn't actually landed on disk yet, and hopefully the
>>>>>>>>>>> original data is untouched.
>>>>>>>>>>>
>>>>>>>>>>> You can try running ceph-bluestore-tool's fsck command on the
>>>>>>>>>>> broken OSD. This will open the DB read-only and hopefully confirm
>>>>>>>>>>> the idea that the issue is on the write path rather than in
>>>>>>>>>>> already corrupted on-disk data.
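>>>>>>>>>>>
>>>>>>>>>>> (Roughly something like this - the path below is just an example,
>>>>>>>>>>> point it at the failing OSD's data directory:)
>>>>>>>>>>>
>>>>>>>>>>> ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-5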
>>>>>>>>>>>
>>>>>>>>>>> Also, you might want to set bdev-aio to false (or true if it's
>>>>>>>>>>> already at false) and try OSD startup once more - this will go
>>>>>>>>>>> via a slightly different write path and may provide a workaround.
>>>>>>>>>>>
>>>>>>>>>>> Also, please collect debug logs for the OSD startup (with both
>>>>>>>>>>> the current and the updated bdev-aio parameter) and with
>>>>>>>>>>> --debug-bdev/--debug-bluefs set to 20. You can omit increasing
>>>>>>>>>>> --debug-bluestore for now to keep the log size down.
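>>>>>>>>>>>
>>>>>>>>>>> (As a sketch, either put the debug settings into the OSD's
>>>>>>>>>>> ceph.conf section or pass them when starting the daemon by hand -
>>>>>>>>>>> osd id 5 is just an example:)
>>>>>>>>>>>
>>>>>>>>>>> ceph-osd -f -i 5 --debug-bluefs 20 --debug-bdev 20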
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Igor
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 4/28/2020 5:16 PM, Francois Legrand wrote:
>>>>>>>>>>>> Here is the output of ceph-bluestore-tool bluefs-bdev-sizes:
>>>>>>>>>>>> inferring bluefs devices from bluestore path
>>>>>>>>>>>> slot 1 /var/lib/ceph/osd/ceph-5/block -> /dev/dm-17
>>>>>>>>>>>> 1 : device size 0x746c0000000 : own 0x[37e1eb00000~4a82900000] =
>>>>>>>>>>>> 0x4a82900000 : using 0x5bc780000(23 GiB)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The result of the debug-bluestore (and debug-bluefs) set to 20
>>>>>>>>>>>> for osd.5 is at the following address (28MB):
>>>>>>>>>>>>
>>>>>>>>>>>> https://wetransfer.com/downloads/a193ab15ab5e2395fe2462c963507a7f2020042814…
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for your help.
>>>>>>>>>>>> F.
>>>>>>>>>>>>
>>>>>>>>>>>> On 28/04/2020 at 13:33, Igor Fedotov wrote:
>>>>>>>>>>>>> Hi Francois,
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Could you please share an OSD startup log with debug-bluestore
>>>>>>>>>>>>> (and debug-bluefs) set to 20?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Also please run ceph-bluestore-tool's bluefs-bdev-sizes command
>>>>>>>>>>>>> and share the output.
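>>>>>>>>>>>>>
>>>>>>>>>>>>> (Something along these lines, pointing --path at the OSD's data
>>>>>>>>>>>>> directory - ceph-5 is just an example:)
>>>>>>>>>>>>>
>>>>>>>>>>>>> ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-5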
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Igor
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 4/28/2020 12:55 AM, Francois Legrand wrote:
>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *** Short version ***
>>>>>>>>>>>>>> Is there a way to repair a RocksDB showing the errors
>>>>>>>>>>>>>> "Encountered error while reading data from compression
>>>>>>>>>>>>>> dictionary block Corruption: block checksum mismatch" and
>>>>>>>>>>>>>> "_open_db erroring opening db"?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *** Long version ***
>>>>>>>>>>>>>> We operate a Nautilus ceph cluster (with 100 disks of 8TB in
>>>>>>>>>>>>>> 6 servers + 4 mons/mgr + 3 mds).
>>>>>>>>>>>>>> We recently (Monday 20) upgraded from 14.2.7 to 14.2.8.
>>>>>>>>>>>>>> This triggered a rebalancing of some data.
>>>>>>>>>>>>>> Two days later (Wednesday 22) we had a very short power
>>>>>>>>>>>>>> outage. Only one of the OSD servers went down (and
>>>>>>>>>>>>>> unfortunately died).
>>>>>>>>>>>>>> This triggered a reconstruction of the lost OSDs.
>>>>>>>>>>>>>> Operations went fine until Saturday 25, when some OSDs in the
>>>>>>>>>>>>>> 5 remaining servers started to crash for no apparent reason.
>>>>>>>>>>>>>> We tried to restart them, but they crashed again. We ended up
>>>>>>>>>>>>>> with 18 OSDs down (+ 16 in the dead server, so 34 OSDs down
>>>>>>>>>>>>>> out of 100).
>>>>>>>>>>>>>> Looking at the logs, we found for all the crashed OSDs:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -237> 2020-04-25 16:32:51.835 7f1f45527a80 3 rocksdb:
>>>>>>>>>>>>>> [table/block_based_table_reader.cc:1117] Encountered error
>>>>>>>>>>>>>> while reading data from compression dictionary block
>>>>>>>>>>>>>> Corruption: block checksum mismatch: expected 0, got
>>>>>>>>>>>>>> 2729370997 in db/181355.sst offset 18446744073709551615 size
>>>>>>>>>>>>>> 18446744073709551615
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2020-04-25 16:05:47.251 7fcbd1e46a80 -1
>>>>>>>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-3) _open_db erroring
>>>>>>>>>>>>>> opening db:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We also noticed that the "Encountered error while reading
>>>>>>>>>>>>>> data from compression dictionary block Corruption: block
>>>>>>>>>>>>>> checksum mismatch" error was present a few days before the
>>>>>>>>>>>>>> crash. We also have some OSDs with this error that are still
>>>>>>>>>>>>>> up.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We tried to repair with:
>>>>>>>>>>>>>> ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-3 repair
>>>>>>>>>>>>>> but with no success (it ends with "_open_db erroring opening
>>>>>>>>>>>>>> db").
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thus, does somebody have an idea how to fix this, or at least
>>>>>>>>>>>>>> know if it's possible to repair and correct the "Encountered
>>>>>>>>>>>>>> error while reading data from compression dictionary block
>>>>>>>>>>>>>> Corruption: block checksum mismatch" and "_open_db erroring
>>>>>>>>>>>>>> opening db" errors?
>>>>>>>>>>>>>> Thanks for your help (we are desperate because we will lose
>>>>>>>>>>>>>> data and are fighting to save something)!!!
>>>>>>>>>>>>>> F.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>
>