Sounds interesting - could you please share the crash log for these new
OSDs? They presumably suffer from another issue. At least that first
crash is caused by something else.
"bluefs buffered io" can be injected on the fly but I expect it to help
when OSD isn't starting up only.
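For reference (just a sketch - osd.5 below is an example id, adjust to your
OSDs), on a running OSD something like this should work:

  # inject into the running daemon (takes effect immediately)
  ceph tell osd.5 injectargs '--bluefs_buffered_io=false'
  # or persist it in the monitor config database (Nautilus and later)
  ceph config set osd.5 bluefs_buffered_io false

For an OSD that doesn't start, changing ceph.conf (or the central config)
before startup is the way to go, since injectargs needs a live daemon.
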
On 4/29/2020 5:17 PM, Francois Legrand wrote:
> Ok we will try that.
> Indeed, restarting osd.5 caused two other OSDs in the cluster to go
> down.
> Thus we will set bluefs buffered io = false for all OSDs and force
> bluefs buffered io = true for one of the down OSDs.
> Does that modification need injectargs, or is changing it in the
> configuration enough to have it applied on the fly?
> F.
>
> On 29/04/2020 at 15:56, Igor Fedotov wrote:
>> That's bluefs buffered io = false which did the trick.
>>
>> It modified the write path, and this presumably fixed the large write(s).
>>
>> I'm trying to reproduce this locally, but please preserve at least one
>> failing OSD (i.e. do not start it with buffered io disabled) for future
>> experiments/troubleshooting for a while, if possible.
>>
>>
>> Thanks,
>>
>> Igor
>>
>>
>>
>> On 4/29/2020 4:50 PM, Francois Legrand wrote:
>>> Hi,
>>> It seems much better with these options. The OSD has now been up for
>>> 10 min without crashing (before, it was restarting after ~1 min).
>>> F.
>>>
>>> On 29/04/2020 at 15:16, Igor Fedotov wrote:
>>>> Hi Francois,
>>>>
>>>> I'll write a more thorough response a bit later.
>>>>
>>>> Meanwhile could you please try OSD startup with the following
>>>> settings now:
>>>>
>>>> debug-bluefs and debug-bdev = 20
>>>>
>>>> bluefs sync write = false
>>>>
>>>> bluefs buffered io = false
>>>>
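>>>> E.g. (a sketch - adjust the OSD id to the one you're testing), add to
>>>> ceph.conf on that OSD's host and restart just that daemon:
>>>>
>>>> [osd.5]
>>>> # verbose BlueFS and block-device logging for the startup attempt
>>>> debug bluefs = 20
>>>> debug bdev = 20
>>>> # explicitly disable BlueFS sync writes and buffered (page-cache) io
>>>> bluefs sync write = false
>>>> bluefs buffered io = false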
>>>>
>>>> Thanks,
>>>>
>>>> Igor
>>>>
>>>>
>>>> On 4/29/2020 3:35 PM, Francois Legrand wrote:
>>>>> Hi Igor,
>>>>> Here is what we did:
>>>>> First, as other OSDs were falling down, we stopped all operations
>>>>> with
>>>>> ceph osd set norecover
>>>>> ceph osd set norebalance
>>>>> ceph osd set nobackfill
>>>>> ceph osd set pause
>>>>>
>>>>> to avoid other crashes!
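>>>>> (These flags can be reverted with the matching unset commands once
>>>>> the situation is stable, e.g.:
>>>>> ceph osd unset pause
>>>>> ceph osd unset nobackfill
>>>>> ceph osd unset norebalance
>>>>> ceph osd unset norecover )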
>>>>>
>>>>> Then we followed your recommendations (still testing on osd.5):
>>>>> in /etc/ceph/ceph.conf we added:
>>>>> [osd.5]
>>>>> debug bluefs = 20
>>>>> debug bdev = 20
>>>>>
>>>>> We ran:
>>>>> ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-5 -l
>>>>> /var/log/ceph/bluestore-tool-fsck-osd-5.log --log-level 20 >
>>>>> /var/log/ceph/bluestore-tool-fsck-osd-5.out 2>&1
>>>>>
>>>>> It ended with "fsck success".
>>>>>
>>>>> It seems that the default value for bluefs sync write is false
>>>>> (https://github.com/ceph/ceph/blob/v14.2.8/src/common/options.cc),
>>>>> thus we changed /etc/ceph/ceph.conf to:
>>>>> [osd.5]
>>>>> debug bluefs = 20
>>>>> debug bdev = 20
>>>>> bluefs sync write = true
>>>>>
>>>>> and restarted the OSD. It crashed!
>>>>> We then tried to explicitly set bluefs sync write = false and
>>>>> restarted... same result!
>>>>>
>>>>> The logs are here: https://we.tl/t-HMiFDu22XH
>>>>>
>>>>> Moreover, we have a rados gateway pool with hundreds of 4GB files.
>>>>> Can this be the origin of the problem?
>>>>> Do you think that removing this pool could help?
>>>>>
>>>>> Thanks again for your expertise.
>>>>> F.
>>>>>
>>>>>
>>>>>> On 28/04/2020 at 18:52, Igor Fedotov wrote:
>>>>>> Short update - please use the bluefs_sync_write parameter instead
>>>>>> of bdev-aio. Changing the latter isn't actually supported.
>>>>>>
>>>>>> On 4/28/2020 7:35 PM, Igor Fedotov wrote:
>>>>>>> Francois,
>>>>>>>
>>>>>>> here are some observations from your log.
>>>>>>>
>>>>>>> 1) Rocksdb reports error on the following .sst file:
>>>>>>>
>>>>>>> -35> 2020-04-28 15:23:47.612 7f4856e82a80 -1 rocksdb:
>>>>>>> Corruption: Bad table magic number: expected 9863518390377041911,
>>>>>>> found 12950032858166034944 in db/068269.sst
>>>>>>>
>>>>>>> 2) which relates to BlueFS ino 53361:
>>>>>>>
>>>>>>> -50> 2020-04-28 15:23:45.103 7f4856e82a80 10 bluefs
>>>>>>> open_for_read db/068269.sst (random)
>>>>>>> -49> 2020-04-28 15:23:45.103 7f4856e82a80 10 bluefs
>>>>>>> open_for_read h 0x557914fb80b0 on file(ino 53361 size
>>>>>>> 0xc496f919 mtime 2020-04-28 15:23:39.827515 bdev 1 allocated
>>>>>>> c4970000 extents [1:0x383db280000~c4970000])
>>>>>>>
>>>>>>> 3) and the failed read happens at the end (0xc496f8e4~35, the
>>>>>>> last 0x35 bytes) of this huge (3+GB) file:
>>>>>>>
>>>>>>> -44> 2020-04-28 15:23:47.514 7f4856e82a80 10 bluefs
>>>>>>> _read_random h 0x557914fb80b0 0xc496f8e4~35 from file(ino 53361
>>>>>>> size 0xc496f919 mtime 2020-04-28 15:23:39.827515 bdev 1
>>>>>>> allocated c4970000 extents [1:0x383db280000~c4970000])
>>>>>>> -43> 2020-04-28 15:23:47.514 7f4856e82a80 20 bluefs
>>>>>>> _read_random left 0x71c 0xc496f8e4~35
>>>>>>> -42> 2020-04-28 15:23:47.514 7f4856e82a80 20 bluefs
>>>>>>> _read_random got 53
>>>>>>>
>>>>>>> 4) This .sst file was created from scratch shortly before with a
>>>>>>> single-shot 3+GB write:
>>>>>>>
>>>>>>> -88> 2020-04-28 15:23:35.661 7f4856e82a80 10 bluefs
>>>>>>> open_for_write db/068269.sst
>>>>>>> -87> 2020-04-28 15:23:35.661 7f4856e82a80 20 bluefs
>>>>>>> open_for_write mapping db/068269.sst to bdev 1
>>>>>>> -86> 2020-04-28 15:23:35.662 7f4856e82a80 10 bluefs
>>>>>>> open_for_write h 0x5579145e7a40 on file(ino 53361 size 0x0
>>>>>>> mtime 2020-04-28 15:23:35.663142 bdev 1 allocated 0 extents [])
>>>>>>> -85> 2020-04-28 15:23:39.826 7f4856e82a80 10 bluefs _flush
>>>>>>> 0x5579145e7a40 0x0~c496f919 to file(ino 53361 size 0x0
>>>>>>> mtime 2020-04-28 15:23:35.663142 bdev 1 allocated 0 extents [])
>>>>>>>
>>>>>>> 5) Presumably RocksDB creates this file in an attempt to
>>>>>>> recover/compact/process another existing file (ino 52405), which
>>>>>>> is pretty large as well.
>>>>>>>
>>>>>>> There are multiple earlier reads of it; here is the last one:
>>>>>>>
>>>>>>> -92> 2020-04-28 15:23:29.857 7f4856e82a80 10 bluefs _read h
>>>>>>> 0x5579147286e0 0xc6788000~8000 from file(ino 52405
>>>>>>> size 0xc67888a0 mtime 2020-04-25 13:34:55.325699 bdev 0
>>>>>>> allocated c6790000 extents [1:0x381c8220000~10000,1:
>>>>>>>
>>>>>>> The rationale for linking these two files is their rather
>>>>>>> uncommon file sizes.
>>>>>>>
>>>>>>>
>>>>>>> So you have a 3+GB single-shot BlueFS write and an immediate read
>>>>>>> from the end of the written extent which returns an unexpected
>>>>>>> magic number.
>>>>>>>
>>>>>>> It's well known in the software world that large (2+GB) data
>>>>>>> processing implementations tend to be error-prone, and Ceph is
>>>>>>> not an exception. Here are a couple of recent examples which are
>>>>>>> pretty close to your case:
>>>>>>>
>>>>>>>
>>>>>>> https://github.com/ceph/ceph/commit/4d33114a40d5ae0d541c36175977ca22789a3b88
>>>>>>>
>>>>>>> https://github.com/ceph/ceph/commit/207806abaa91259d9bb726c2555e7e21ac541527
>>>>>>>
>>>>>>>
>>>>>>> Although they are already fixed in Nautilus 14.2.8, there might
>>>>>>> be others present along the write path (including H/W firmware).
>>>>>>>
>>>>>>> The good news is that the failure happens on a newly written file
>>>>>>> (remember, the invalid magic is read at the end(!) of this new
>>>>>>> large file shortly(!) after the write). So it looks like the data
>>>>>>> hasn't yet landed on disk, and hopefully the original data is
>>>>>>> untouched.
>>>>>>>
>>>>>>> You can try to run ceph-bluestore-tool's fsck command on the
>>>>>>> broken OSD. This will open the DB read-only and hopefully prove
>>>>>>> the idea that the issue is on the write path rather than in
>>>>>>> already corrupted data.
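>>>>>>> Something along these lines should do it (the path below is just
>>>>>>> the usual default location - adjust to the failing OSD's id):
>>>>>>>
>>>>>>> # read-only check of BlueStore/BlueFS/RocksDB metadata;
>>>>>>> # the OSD daemon must be stopped first
>>>>>>> ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-5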
>>>>>>>
>>>>>>> Also you might want to set bdev-aio to false (or true if it's
>>>>>>> already at false) and try OSD startup once more - this will go
>>>>>>> through a slightly different write path and may provide a
>>>>>>> workaround.
>>>>>>>
>>>>>>> Also please collect debug logs for OSD startup (with both the
>>>>>>> current and the updated bdev-aio setting) and
>>>>>>> --debug-bdev/debug-bluefs set to 20. You can omit the
>>>>>>> --debug-bluestore increase for now to reduce log size.
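>>>>>>> One way to capture that (a sketch, assuming the centralized
>>>>>>> config available since Nautilus and osd.5 as the id):
>>>>>>>
>>>>>>> # bump debug levels in the mon config store, then restart the OSD
>>>>>>> ceph config set osd.5 debug_bluefs 20
>>>>>>> ceph config set osd.5 debug_bdev 20
>>>>>>> systemctl restart ceph-osd@5
>>>>>>> # the startup log lands in /var/log/ceph/ceph-osd.5.log on that host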
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Igor
>>>>>>>
>>>>>>>
>>>>>>> On 4/28/2020 5:16 PM, Francois Legrand wrote:
>>>>>>>> Here is the output of ceph-bluestore-tool bluefs-bdev-sizes:
>>>>>>>> inferring bluefs devices from bluestore path
>>>>>>>> slot 1 /var/lib/ceph/osd/ceph-5/block -> /dev/dm-17
>>>>>>>> 1 : device size 0x746c0000000 : own 0x[37e1eb00000~4a82900000]
>>>>>>>> = 0x4a82900000 : using 0x5bc780000(23 GiB)
>>>>>>>>
>>>>>>>>
>>>>>>>> The result of the debug-bluestore (and debug-bluefs) set to 20
>>>>>>>> for osd.5 is at the following address (28MB):
>>>>>>>>
>>>>>>>> https://wetransfer.com/downloads/a193ab15ab5e2395fe2462c963507a7f2020042814…
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks for your help.
>>>>>>>> F.
>>>>>>>>
>>>>>>>> On 28/04/2020 at 13:33, Igor Fedotov wrote:
>>>>>>>>> Hi Francois,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Could you please share the OSD startup log with debug-bluestore
>>>>>>>>> (and debug-bluefs) set to 20.
>>>>>>>>>
>>>>>>>>> Also please run ceph-bluestore-tool's bluefs-bdev-sizes
>>>>>>>>> command and share the output.
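>>>>>>>>> For example (the path below is the usual default - adjust the
>>>>>>>>> OSD id):
>>>>>>>>>
>>>>>>>>> # shows how much space BlueFS owns and uses on each device
>>>>>>>>> ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-5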
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Igor
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 4/28/2020 12:55 AM, Francois Legrand wrote:
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> *** Short version ***
>>>>>>>>>> Is there a way to repair a RocksDB from the errors
>>>>>>>>>> "Encountered error while reading data from compression
>>>>>>>>>> dictionary block Corruption: block checksum mismatch" and
>>>>>>>>>> "_open_db erroring opening db"?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *** Long version ***
>>>>>>>>>> We operate a Nautilus Ceph cluster (with 100 disks of 8TB in
>>>>>>>>>> 6 servers + 4 mons/mgr + 3 mds).
>>>>>>>>>> We recently (Monday 20) upgraded from 14.2.7 to 14.2.8. This
>>>>>>>>>> triggered a rebalancing of some data.
>>>>>>>>>> Two days later (Wednesday 22) we had a very short power
>>>>>>>>>> outage. Only one of the OSD servers went down (and
>>>>>>>>>> unfortunately died).
>>>>>>>>>> This triggered a reconstruction of the lost OSDs. Operations
>>>>>>>>>> went fine until Saturday 25, when some OSDs on the 5
>>>>>>>>>> remaining servers started to crash with no apparent reason.
>>>>>>>>>> We tried to restart them, but they crashed again. We ended up
>>>>>>>>>> with 18 OSDs down (+ 16 in the dead server, so 34 OSDs down
>>>>>>>>>> out of 100).
>>>>>>>>>> Looking at the logs, we found for all the crashed OSDs:
>>>>>>>>>>
>>>>>>>>>> -237> 2020-04-25 16:32:51.835 7f1f45527a80 3 rocksdb:
>>>>>>>>>> [table/block_based_table_reader.cc:1117] Encountered error
>>>>>>>>>> while reading data from compression dictionary block
>>>>>>>>>> Corruption: block checksum mismatch: expected 0, got
>>>>>>>>>> 2729370997 in db/181355.sst offset 18446744073709551615
>>>>>>>>>> size 18446744073709551615
>>>>>>>>>>
>>>>>>>>>> and
>>>>>>>>>>
>>>>>>>>>> 2020-04-25 16:05:47.251 7fcbd1e46a80 -1
>>>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-3) _open_db erroring
>>>>>>>>>> opening db:
>>>>>>>>>>
>>>>>>>>>> We also noticed that the "Encountered error while reading
>>>>>>>>>> data from compression dictionary block Corruption: block
>>>>>>>>>> checksum mismatch" error was present a few days before the
>>>>>>>>>> crash.
>>>>>>>>>> We also have some OSDs with this error that are still up.
>>>>>>>>>>
>>>>>>>>>> We tried to repair with:
>>>>>>>>>> ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-3 repair
>>>>>>>>>> but with no success (it ends with "_open_db erroring opening
>>>>>>>>>> db").
>>>>>>>>>>
>>>>>>>>>> So does anybody have an idea how to fix this, or at least
>>>>>>>>>> know whether it's possible to repair and correct the
>>>>>>>>>> "Encountered error while reading data from compression
>>>>>>>>>> dictionary block Corruption: block checksum mismatch" and
>>>>>>>>>> "_open_db erroring opening db" errors?
>>>>>>>>>> Thanks for your help (we are desperate because we will lose
>>>>>>>>>> data and are fighting to save something)!!!
>>>>>>>>>> F.
>>>>>>>>>>
>>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> ceph-users mailing list -- ceph-users(a)ceph.io
>>>>>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>>>>>
>>>
>