Hi all,
in various Rook-operated Ceph clusters I have seen OSDs go into a CrashLoop due to
the following BlueFS allocation failure:
debug 2020-12-16T13:19:25.500+0000 7fc4c3f13f40 4 rocksdb: EVENT_LOG_v1
{"time_micros": 1608124765507105, "job": 1, "event":
"recovery_started", "log_files": [1400, 1402]}
debug 2020-12-16T13:19:25.500+0000 7fc4c3f13f40 4 rocksdb: [db/db_impl_open.cc:583]
Recovering log #1400 mode 0
debug 2020-12-16T13:19:27.724+0000 7fc4c3f13f40 1 bluefs _allocate failed to allocate
0x43ce43d on bdev 1, free 0x2e50000; fallback to bdev 2
debug 2020-12-16T13:19:27.724+0000 7fc4c3f13f40 1 bluefs _allocate unable to allocate
0x43ce43d on bdev 2, free 0xffffffffffffffff; fallback to slow device expander
debug 2020-12-16T13:19:27.724+0000 7fc4c3f13f40 -1 bluestore(/var/lib/ceph/osd/ceph-1)
allocate_bluefs_freespace failed to allocate on 0x3d1b0000 min_size 0x43d0000 >
allocated total 0x300000 bluefs_shared_alloc_size 0x10000 allocated 0x300000 available 0x
b019c8000
debug 2020-12-16T13:19:27.724+0000 7fc4c3f13f40 -1 bluefs _allocate failed to expand slow
device to fit +0x43ce43d
debug 2020-12-16T13:19:27.724+0000 7fc4c3f13f40 -1 bluefs _flush_range allocated: 0x0
offset: 0x0 length: 0x43ce43d
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.6/rpm/el8/BUILD/ceph-15.2.6/src/os/bluestore/BlueFS.cc:
In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)'
thread 7fc4c3f13f40 time 2020-12-16T13:19:27.731533+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.6/rpm/el8/BUILD/ceph-15.2.6/src/os/bluestore/BlueFS.cc:
2721: ceph_abort_msg("bluefs enospc")
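To make the log easier to read, here is a quick sketch (my own, not part of Ceph) that decodes the hex byte counts from the _allocate lines above:

```python
# Decode the hex sizes from the BlueFS log lines above.
MiB = 1 << 20

need = 0x43ce43d        # allocation request that failed during WAL replay
free_bdev1 = 0x2e50000  # free space reported on bdev 1 (the DB device)

print(f"requested:       {need / MiB:.1f} MiB")        # ~67.8 MiB
print(f"free on bdev 1:  {free_bdev1 / MiB:.1f} MiB")  # ~46.3 MiB
print(f"shortfall:       {(need - free_bdev1) / MiB:.1f} MiB")  # ~21.5 MiB
```

So BlueFS needs roughly 68 MiB for the recovery write but only about 46 MiB are free on the DB device, and the fallback allocations fail as well.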
The OSD is not really full:
# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-1
inferring bluefs devices from bluestore path
1 : device size 0x18ffc00000 : own 0x[bffe10000~fffe0000] = 0xfffe0000 : using
0xfd190000(4.0 GiB) : bluestore has 0x82b360000(33 GiB) available
Expanding DB/WAL...
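Decoding the bluefs-bdev-expand output the same way (again just my own sketch):

```python
# Decode the sizes reported by ceph-bluestore-tool bluefs-bdev-expand above.
GiB = 1 << 30

device_size = 0x18ffc00000  # DB device size
bluefs_own  = 0xfffe0000    # space BlueFS owns on it
bluefs_used = 0xfd190000    # space BlueFS actually uses ("4.0 GiB" in the output)
slow_avail  = 0x82b360000   # free space on the slow device ("33 GiB" in the output)

print(f"DB device size:    {device_size / GiB:.0f} GiB")  # ~100 GiB
print(f"BlueFS owns/uses:  {bluefs_own / GiB:.0f} / {bluefs_used / GiB:.0f} GiB")
print(f"slow device free:  {slow_avail / GiB:.0f} GiB")   # ~33 GiB
```

In other words, BlueFS owns only ~4 GiB of a ~100 GiB device while 33 GiB are free on the slow device, which is why I say the OSD is not really full.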
Expanding the underlying block device by just 1 GiB, followed by "ceph-bluestore-tool
bluefs-bdev-expand" and "ceph-bluestore-tool repair", resolves the situation. In
general, larger OSDs seem to reduce the likelihood of hitting this issue.
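For the archives, the workaround amounts to the following sequence. This is a sketch only; the LV path in step 1 is a placeholder (how you grow the device depends on your Rook/LVM layout), and the OSD must be stopped while the tools run:

```shell
# Workaround sketch -- the LV path below is a placeholder, not from this report.
# 1. Grow the block device backing the DB/WAL by ~1 GiB, e.g. for an LVM-backed OSD:
lvextend -L +1G /dev/ceph-db-vg/db-lv

# 2. Let BlueFS pick up the new space:
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-1

# 3. Run a repair pass, then restart the OSD:
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-1
```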
Ceph version is v15.2.6.
Is this a known bug?
Ceph report and logs are attached.
Thanks for your help
Stephan