Martin Conway wrote:
I find that backfilling, and possibly scrubbing, often comes to a halt for no apparent
reason. If I put a server into maintenance mode, or kill and restart OSDs, it bursts back
into life again.
I am not sure how to diagnose why the recovery processes have stalled.
My cluster is in this stalled state now, and I have saved some details below.
Everything seems to point quite heavily to OSD.32 and OSD.33, but there is nothing of note
in their logs. They were experiencing slow ops last night, and this morning they have
logged nothing. I am certain recovery and scrubbing would resume if I restarted those
OSDs, but it would be nice to know what keeps causing this.
ceph -s
  cluster:
    id:     16bb4f7a-cf04-4667-aeee-94ce7f6ab672
    health: HEALTH_WARN
            441 pgs not deep-scrubbed in time
            43 pgs not scrubbed in time

  services:
    mon: 5 daemons, quorum scustor3,scustor2,scustor1,scustor4,scustor5 (age 23h)
    mgr: scustor3.wplaov(active, since 2d), standbys: scustor4.giyegr, scustor1.luywbi, scustor2.ncfaec
    mds: 2/2 daemons up, 1 standby
    osd: 31 osds: 31 up (since 23h), 31 in (since 4d); 54 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   8 pools, 897 pgs
    objects: 48.52M objects, 47 TiB
    usage:   130 TiB used, 130 TiB / 260 TiB avail
    pgs:     3497437/145569867 objects misplaced (2.403%)
             757 active+clean
             70  active+clean+scrubbing
             48  active+remapped+backfilling
             14  active+clean+scrubbing+deep
             6   active+recovering+remapped
             2   active+recovering

  io:
    client: 24 KiB/s rd, 416 KiB/s wr, 1 op/s rd, 44 op/s wr
ceph pg dump
https://pastebin.com/raw/KPBie7SD
ceph pg dump_stuck
PG_STAT STATE UP UP_PRIMARY ACTING ACTING_PRIMARY
5.1f4 active+remapped+backfilling [32,5,18] 32 [32,5,21] 32
5.1f0 active+remapped+backfilling [32,2,13] 32 [32,2,5] 32
5.1e4 active+remapped+backfilling [33,13,10] 33 [33,6,5] 33
5.1c5 active+remapped+backfilling [33,2,10] 33 [33,10,16] 33
5.1bc active+remapped+backfilling [18,33,11] 18 [33,11,6] 33
5.193 active+remapped+backfilling [33,14,3] 33 [33,13,5] 33
5.180 active+remapped+backfilling [32,13,1] 32 [32,1,10] 32
5.171 active+remapped+backfilling [33,1,18] 33 [33,1,20] 33
5.16b active+recovering+remapped [32,9,13] 32 [32,9,2] 32
5.16a active+remapped+backfilling [13,4,6] 13 [33,4,16] 33
5.169 active+remapped+backfilling [14,33,10] 14 [33,10,13] 33
5.162 active+remapped+backfilling [32,6,13] 32 [32,6,22] 32
5.130 active+remapped+backfilling [13,32,3] 13 [32,3,1] 32
5.1cd active+remapped+backfilling [33,5,18] 33 [33,5,13] 33
6.3b active+recovering [32,1,3] 32 [32,1,3] 32
6.42 active+remapped+backfilling [18,33,9] 18 [33,9,13] 33
5.4e active+remapped+backfilling [32,11,10] 32 [32,10,22] 32
5.167 active+remapped+backfilling [33,18,6] 33 [33,6,2] 33
6.20 active+remapped+backfilling [14,32,10] 14 [32,13,1] 32
5.52 active+remapped+backfilling [14,1,32] 14 [32,1,13] 32
5.57 active+recovering+remapped [32,9,13] 32 [32,9,22] 32
5.49 active+remapped+backfilling [18,32,9] 18 [32,9,13] 32
5.1f7 active+recovering+remapped [9,32,13] 9 [32,20,1] 32
5.100 active+remapped+backfilling [33,9,13] 33 [33,9,20] 33
5.58 active+remapped+backfilling [32,6,4] 32 [32,6,20] 32
5.16 active+recovering+remapped [32,18,3] 32 [32,3,9] 32
5.60 active+recovering+remapped [31,13,4] 31 [32,4,9] 32
5.fc active+recovering [32,9,10] 32 [32,9,10] 32
5.c8 active+remapped+backfilling [32,14,3] 32 [32,3,13] 32
5.ad active+remapped+backfilling [32,3,16] 32 [32,3,22] 32
5.6e active+remapped+backfilling [32,5,18] 32 [32,5,11] 32
6.6d active+remapped+backfilling [32,6,5] 32 [32,5,21] 32
5.cf active+remapped+backfilling [33,9,18] 33 [33,11,16] 33
5.7e active+remapped+backfilling [32,9,18] 32 [32,4,16] 32
6.37 active+remapped+backfilling [32,4,3] 32 [32,4,21] 32
5.1aa active+remapped+backfilling [13,33,10] 13 [33,10,1] 33
5.165 active+remapped+backfilling [33,5,16] 33 [33,5,21] 33
5.76 active+remapped+backfilling [13,1,32] 13 [32,1,22] 32
5.102 active+remapped+backfilling [33,5,6] 33 [33,5,21] 33
5.2d active+remapped+backfilling [32,18,4] 32 [32,4,5] 32
6.24 active+remapped+backfilling [33,18,2] 33 [33,9,3] 33
5.f6 active+remapped+backfilling [32,1,14] 32 [32,1,22] 32
5.1c active+remapped+backfilling [33,18,3] 33 [33,3,22] 33
5.d9 active+remapped+backfilling [33,18,11] 33 [33,11,9] 33
5.184 active+remapped+backfilling [32,14,2] 32 [32,20,5] 32
5.e6 active+remapped+backfilling [18,33,16] 18 [33,16,13] 33
5.18f active+recovering+remapped [18,32,9] 18 [32,9,13] 32
5.e9 active+remapped+backfilling [32,13,9] 32 [32,13,2] 32
5.55 active+remapped+backfilling [32,6,14] 32 [32,6,3] 32
5.eb active+remapped+backfilling [18,33,11] 18 [32,20,6] 32
6.13 active+remapped+backfilling [14,10,1] 14 [32,20,13] 32
5.107 active+remapped+backfilling [14,3,31] 14 [32,3,1] 32
5.109 active+remapped+backfilling [32,4,14] 32 [32,4,13] 32
5.117 active+remapped+backfilling [33,16,3] 33 [33,16,20] 33
6.30 active+remapped+backfilling [32,4,1] 32 [32,1,21] 32
5.126 active+remapped+backfilling [33,9,4] 33 [33,9,21] 33
ok
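
To back up the impression that the stuck PGs cluster on OSD.32 and OSD.33, the dump_stuck listing can be tallied by its ACTING_PRIMARY column. The following is only a minimal sketch: the function name tally_acting_primaries and the SAMPLE string are illustrative assumptions, and SAMPLE holds just the first five rows of the listing above rather than the full output. It assumes the six whitespace-separated columns shown in the header.

```python
from collections import Counter

# First five rows of the `ceph pg dump_stuck` listing above (sample only).
SAMPLE = """\
PG_STAT STATE UP UP_PRIMARY ACTING ACTING_PRIMARY
5.1f4 active+remapped+backfilling [32,5,18] 32 [32,5,21] 32
5.1f0 active+remapped+backfilling [32,2,13] 32 [32,2,5] 32
5.1e4 active+remapped+backfilling [33,13,10] 33 [33,6,5] 33
5.1c5 active+remapped+backfilling [33,2,10] 33 [33,10,16] 33
5.1bc active+remapped+backfilling [18,33,11] 18 [33,11,6] 33
ok
"""


def tally_acting_primaries(text):
    """Count stuck PGs per acting primary OSD id (hypothetical helper)."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        # Skip the header row, the trailing "ok", and any malformed line.
        if len(fields) != 6 or not fields[5].isdigit():
            continue
        counts[int(fields[5])] += 1
    return counts


if __name__ == "__main__":
    # Print OSDs ordered by how many stuck PGs they lead.
    for osd, n in tally_acting_primaries(SAMPLE).most_common():
        print(f"osd.{osd}: {n} stuck PGs as acting primary")
```

Feeding the saved dump_stuck text to the same function instead of SAMPLE gives the per-OSD counts for the whole listing. In the listing above, every row names either 32 or 33 as its acting primary.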