After confirming that the corruption was limited to a single object, we deleted the object
(first via radosgw-admin, and then via a 'rados rm'), and restarted the new OSD in the set.
The backfill has continued past the point of the original crash, so things are looking
promising.
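For the record, the cleanup amounted to roughly the following (the bucket, key and pool names below are placeholders, not the real values; 2449 is the new 'up' OSD from the earlier logs):

```shell
# 1) Delete the object through RGW first, so the bucket index stays consistent
#    (bucket and key are placeholders):
radosgw-admin object rm --bucket=somebucket --object='some/key.nxspe'

# 2) The leftover multipart piece then went via rados directly (pool name and
#    object name are placeholders; the real name is the long __multipart_ one
#    from the OSD logs):
rados -p default.rgw.buckets.data rm 'the__multipart_object_name'

# 3) Bring the new OSD back so the backfill can resume:
systemctl start ceph-osd@2449
```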
I'm still concerned about how we ended up in this state. Clearly there was a very sad
disk involved, but it's worrying that the object was corrupted across all OSDs in
the set, and also that the user had no idea the file was corrupted (200 HTTP
codes were returned).
I'm going to dig further into the logs of the original primary to see if anything
looks suspicious around the time of the original write, but for now I'm
glad to have a happy cluster in time for the weekend!
Cheers,
Tom
-----Original Message-----
From: Byrne, Thomas (STFC,RAL,SC) <tom.byrne(a)stfc.ac.uk>
Sent: 10 December 2020 23:34
To: 'ceph-users' <ceph-users(a)ceph.io>
Cc: Dan van der Ster <daniel.vanderster(a)cern.ch>
Subject: [ceph-users] Re: Incomplete PG due to primary OSD crashing during
EC backfill - get_hash_info: Mismatch of total_chunk_size 0
A few more things of note after more poking with the help of Dan vdS.
1) The object that the backfill is crashing on has an mtime of a few minutes
before the original primary died this morning, and a 'rados get' gives an
input/output error. So it looks like a new object that was possibly corrupted by
the dying primary OSD. I can't see any disk I/O errors in any of the PG's OSD
logs when trying the 'get', but I do see this error in most of the OSD logs:
2020-12-10 23:22:31.840 7fc7161e3700 0 osd.4134 pg_epoch: 1162547 pg[11.214s8( v 1162547'714924 (1162114'711864,1162547'714924] local-lis/les=1162304/1162305 n=133402 ec=1069520/992 lis/c 1162304/1125301 les/c/f 1162305/1125302/257760 1162303/1162304/1162301) [2147483647,1708,2099,1346,4309,777,5098,4501,4134,217,4643]p1708(1) r=8 lpr=1162304 pi=[1125301,1162304)/2 luod=0'0 crt=1162547'714924 active mbc={}] get_hash_info: Mismatch of total_chunk_size 0
2020-12-10 23:22:31.840 7fc7161e3700 -1 log_channel(cluster) log [ERR] : Corruption detected: object 11:28447b4a:::962de230-ed6c-44f2-ab02-788c52ea6a82.3210530112.122__multipart_201%2fin5%2fexp_4-05-737%2fprocessed%2fspe%2fsqw_187570.nxspe.2~bgZPo_rC64ZXJWKyTfdn4dIApqLNDPp.22:head is missing hash_info
This error was present in the logs of OSDs 1708, 2099, 1346, 4309, 777, 5098, 4501
and 4134, and absent for 217 and 4643 (possibly because they are unused parity
shards?). Checking on one of the FileStore OSDs that returned the error
message, the underlying file is present and at least the correct size.
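As a sanity check on "the correct size": the shard file size does line up with the EC arithmetic, assuming each shard holds object_size / k bytes for k data chunks (a trivial sketch, not anything clever):

```python
# In our k=8, m=3 EC profile, a full 4 MiB RADOS object should leave
# object_size / k bytes on each shard.
k = 8                   # data chunks in the 8+3 profile
object_size = 4194304   # 'size: 4194304' from the recovery log (4 MiB)

shard_size = object_size // k
print(shard_size)       # 524288, matching 'found on disk, size 524288'
```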
I'm checking all objects in the PG for corruption now. I'm only 25% through
the 133385 objects in the PG, but that object is the only corrupted one I've
seen so far, so hopefully it is an isolated corruption. If so, I can possibly try
deleting the problematic object and seeing if the backfill can continue.
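The check itself is nothing clever, roughly this (pool name is a placeholder, and --pgid needs a reasonably recent rados client):

```shell
#!/bin/bash
# Walk every object in PG 11.214 and flag any that can't be read back.
pool=default.rgw.buckets.data   # placeholder pool name
rados -p "$pool" --pgid 11.214 ls | while read -r obj; do
    if ! rados -p "$pool" get "$obj" /dev/null 2>/dev/null; then
        echo "unreadable: $obj"
    fi
done
```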
2) This PG is on a mix of FileStore and BlueStore OSDs, all 14.2.9. The original
primary that died (1466) was FileStore. BlueStore: 1708, 4309, 5098, 4501,
4134, 4643; FileStore: 2099, 1346, 777, 217.
PG query for reference:
https://pastebin.com/ZUUH2mQ6
Cheers,
Tom
-----Original Message-----
From: Byrne, Thomas (STFC,RAL,SC) <tom.byrne(a)stfc.ac.uk>
Sent: 10 December 2020 18:40
To: 'ceph-users' <ceph-users(a)ceph.io>
Subject: [ceph-users] Incomplete PG due to primary OSD crashing during
EC backfill - get_hash_info: Mismatch of total_chunk_size 0
Hi all,
Got an odd issue that I'm not sure how to solve on our Nautilus 14.2.9
EC cluster.
The primary OSD of an EC 8+3 PG died this morning with a very sad disk
(thousands of pending sectors). After the down out interval a new 'up'
primary was assigned and the backfill started. Twenty minutes later
the acting primary (not the new 'up' primary) started crashing with a
"get_hash_info: Mismatch of total_chunk_size 0" error (see log below).
This crash always happens at the same object, with different acting
primaries and a different new 'up' primary. I can't see anything
in the logs that points to a particular OSD being the issue, so I
suspect there is a corrupted object in the PG that is causing issues,
but I'm not sure how to dig into this further. The PG is currently
active (but degraded), but only whilst nobackfill and noout are set
(and the new OSD is turned off); if the flags are unset the backfill
will eventually crash enough OSDs to render the PG incomplete, which
is not ideal. I would appreciate being able to resolve this so I can
go back to letting Ceph deal with down OSDs itself :)
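For reference, "keeping the PG active" currently amounts to:

```shell
# Pause backfill and stop Ceph marking down OSDs out...
ceph osd set nobackfill
ceph osd set noout
# ...and stop the new 'up' OSD (2449 here) so it doesn't trigger the crash:
systemctl stop ceph-osd@2449

# Unsetting the flags restarts the backfill (and, eventually, the crashes):
ceph osd unset nobackfill
ceph osd unset noout
```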
Does anyone have some pointers on how to dig into or resolve this?
Happy to create a tracker ticket and post more logs if this looks like a bug.
Thanks,
Tom
OSD log with debug_osd=20 (preamble cut from subsequent lines in an
attempt to improve readability...):
2020-12-10 15:14:16.130 7fc0a1575700 10 osd.1708 pg_epoch: 1162259 pg[11.214s1( v 1162255'714638 (1162110'711564,1162255'714638] local-lis/les=1162253/1162254 n=133385 ec=1069520/992 lis/c 1162253/1125301 les/c/f 1162254/1125302/257760 1162252/1162253/1162253) [2449,1708,2099,1346,4309,777,5098,4501,4134,217,4643]/[2147483647,1708,2099,1346,4309,777,5098,4501,4134,217,4643]p1708(1) backfill=[2449(0)] r=1 lpr=1162253 pi=[1125301,1162253)/3 rops=1 crt=1162255'714638 lcod 1162254'714637 mlcod 1162254'714637 active+undersized+degraded+remapped+backfilling mbc={}] run_recovery_op: starting RecoveryOp(hoid=11:28447b4a:::962de230-ed6c-44f2-ab02-788c52ea6a82.3210530112.122__multipart_201%2fin5%2fexp_4-05-737%2fprocessed%2fspe%2fsqw_187570.nxspe.2~bgZPo_rC64ZXJWKyTfdn4dIApqLNDPp.22:head v=1162125'713150 missing_on=2449(0) missing_on_shards=0 recovery_info=ObjectRecoveryInfo(11:28447b4a:::962de230-ed6c-44f2-ab02-788c52ea6a82.3210530112.122__multipart_201%2fin5%2fexp_4-05-737%2fprocessed%2fspe%2fsqw_187570.nxspe.2~bgZPo_rC64ZXJWKyTfdn4dIApqLNDPp.22:head@1162125'713150, size: 4194304, copy_subset: [], clone_subset: {}, snapset: 0=[]:{}) recovery_progress=ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:true, error:false) obc refcount=3 state=IDLE waiting_on_pushes= extent_requested=0,0)
continue_recovery_op: continuing RecoveryOp(hoid=11:28447b4a:::962de230-ed6c-44f2-ab02-788c52ea6a82.3210530112.122__multipart_201%2fin5%2fexp_4-05-737%2fprocessed%2fspe%2fsqw_187570.nxspe.2~bgZPo_rC64ZXJWKyTfdn4dIApqLNDPp.22:head v=1162125'713150 missing_on=2449(0) missing_on_shards=0 recovery_info=ObjectRecoveryInfo(11:28447b4a:::962de230-ed6c-44f2-ab02-788c52ea6a82.3210530112.122__multipart_201%2fin5%2fexp_4-05-737%2fprocessed%2fspe%2fsqw_187570.nxspe.2~bgZPo_rC64ZXJWKyTfdn4dIApqLNDPp.22:head@1162125'713150, size: 4194304, copy_subset: [], clone_subset: {}, snapset: 0=[]:{}) recovery_progress=ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:true, error:false) obc refcount=4 state=IDLE waiting_on_pushes= extent_requested=0,0)
get_hash_info: Getting attr on 11:28447b4a:::962de230-ed6c-44f2-ab02-788c52ea6a82.3210530112.122__multipart_201%2fin5%2fexp_4-05-737%2fprocessed%2fspe%2fsqw_187570.nxspe.2~bgZPo_rC64ZXJWKyTfdn4dIApqLNDPp.22:head
get_hash_info: not in cache 11:28447b4a:::962de230-ed6c-44f2-ab02-788c52ea6a82.3210530112.122__multipart_201%2fin5%2fexp_4-05-737%2fprocessed%2fspe%2fsqw_187570.nxspe.2~bgZPo_rC64ZXJWKyTfdn4dIApqLNDPp.22:head
get_hash_info: found on disk, size 524288
get_hash_info: Mismatch of total_chunk_size 0
2020-12-10 15:14:16.136 7fc0a1575700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.9/rpm/el7/BUILD/ceph-14.2.9/src/osd/ECBackend.cc: In function 'void ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, RecoveryMessages*)' thread 7fc0a1575700 time 2020-12-10 15:14:16.132060
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.9/rpm/el7/BUILD/ceph-14.2.9/src/osd/ECBackend.cc: 585: FAILED ceph_assert(op.hinfo)
ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x55e1569acf7d]
2: (()+0x4cb145) [0x55e1569ad145]
3: (ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, RecoveryMessages*)+0x1764) [0x55e156da5834]
4: (ECBackend::run_recovery_op(PGBackend::RecoveryHandle*, int)+0x65b) [0x55e156da6c6b]
5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x1491) [0x55e156c26681]
6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x114c) [0x55e156c29f7c]
7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x2ff) [0x55e156a8b32f]
8: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0x55e156d1aa19]
9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90f) [0x55e156aa6b4f]
10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55e15704b216]
11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55e15704dd30]
12: (()+0x7ea5) [0x7fc0c2d5dea5]
13: (clone()+0x6d) [0x7fc0c1c208dd]
2020-12-10 15:14:16.143 7fc0a1575700 -1 *** Caught signal (Aborted) **
This email and any attachments are intended solely for the use of the
named recipients. If you are not the intended recipient you must not
use, disclose, copy or distribute this email or any of its attachments
and should notify the sender immediately and delete this email from
your system. UK Research and Innovation (UKRI) has taken every
reasonable precaution to minimise risk of this email or any
attachments containing viruses or malware but the recipient should
carry out its own virus and malware checks before opening the
attachments. UKRI does not accept any liability for any losses or
damages which the recipient may sustain due to presence of any
viruses. Opinions, conclusions or other information in this message
and attachments that are not related directly to UKRI business are solely those of
the author and do not represent the views of UKRI.
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io