We have a ceph cluster of only nvme drives.
Very recently our overall OSD write latency increased pretty dramatically and our overall
throughput has really decreased.
One thing that seems to correlate with the onset of this problem is the ERROR line below
from the logs. All of our OSD nodes are emitting these log lines now.
Can anyone tell me what this might be telling us? Any and all help is greatly
appreciated.
Mar 31 23:21:56 ceph1d03 ceph-8797e570-96be-11ed-b022-506b4b7d76e1-osd-46[12898]: debug
2024-04-01T03:21:56.953+0000 7effbba51700 0 <cls>
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/cls/fifo/cls_fifo.cc:112:
ERROR: int rados::cls::fifo::{anonymous}::read_part_header(cls_method_context_t,
rados::cls::fifo::part_header*): failed decoding part header
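One thing I have been doing to see whether these errors actually track the latency
regression is counting them per node over time. A minimal sketch, assuming journald
logging as in the snippet above (the time window and the helper name are just
illustrative, not anything Ceph ships):

```shell
#!/bin/sh
# count_fifo_errors: count cls_fifo "failed decoding part header" lines
# in whatever log text is piped in on stdin. Hypothetical helper for
# illustration; grep -c prints the number of matching lines.
count_fifo_errors() {
  grep -c 'failed decoding part header'
}

# Typical usage on an OSD node (adjust the --since window as needed):
#   journalctl --since "1 hour ago" | count_fifo_errors
```

Running that periodically on each OSD node should show whether the error rate lines up
with the latency spikes, or whether the two just happened to start around the same time.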
--
Mark Selby
Sr Linux Administrator, The Voleon Group
mselby@voleon.com