Hey Andreas,
thanks for the insights. Maybe a bit more background:
We are running a variety of pools; the majority of the data is stored on
the "hdd" and "ssd" pools, which make use of the "ssd" and "hdd-big"
(as in 3.5") device classes.
Andreas John <aj(a)net-lab.net> writes:
> On 22.09.20 22:09, Nico Schottelius wrote:
>> [...]
>> All nodes are connected with 2x 10 Gbit/s bonded/LACP, so I'd expect at [...]
The disks in question are 3.5"/10TB/6 Gbit/s SATA disks connected to an
H800 controller - so generally speaking I do not see a reasonable
bottleneck here.
> Yes, I should! I saw in your mail:
>
> 1.) 1532 slow requests are blocked > 32 sec
>     789 slow ops, oldest one blocked for 1949 sec, daemons
>     [osd.12,osd.14,osd.2,osd.20,osd.23,osd.25,osd.3,osd.33,osd.35,osd.50]...
>     have slow ops.
>
> A request that is blocked for > 32 sec is odd! Same goes for 1949 sec.
> In my experience, they will never finish. Sometimes they go away with osd
> restarts. Are those OSDs the ones you relocated?
We tried restarting some of the osds; however, the slow ops come back
soon after the restart. And this is the most puzzling part: the move of
the osds only affected PGs that are related to the "ssd" pool. While data
was rebalancing, one hdd osd crashed and was restarted, but what we see
at the moment is that there are slow ops on a lot of osds:
REQUEST_SLOW 4560 slow requests are blocked > 32 sec
    1262 ops are blocked > 2097.15 sec
    1121 ops are blocked > 1048.58 sec
    602 ops are blocked > 524.288 sec
    849 ops are blocked > 262.144 sec
    407 ops are blocked > 131.072 sec
    175 ops are blocked > 65.536 sec
    144 ops are blocked > 32.768 sec
    osd.82 has blocked requests > 131.072 sec
    osds 1,9,11,19,28,44,45,48,58,72,73,84 have blocked requests > 262.144 sec
    osds 2,4,21,22,27,29,31,34,61 have blocked requests > 524.288 sec
    osds 15,20,32,52,55,62,71,74,79,83 have blocked requests > 1048.58 sec
    osds 5,6,7,12,14,16,18,25,33,35,47,50,51,69 have blocked requests > 2097.15 sec
REQUEST_STUCK 1228 stuck requests are blocked > 4096 sec
    330 ops are blocked > 8388.61 sec
    898 ops are blocked > 4194.3 sec
    osds 3,23,56,59,60 have stuck requests > 4194.3 sec
    osds 30,46,49,63,64,65,66,68,70,75,85 have stuck requests > 8388.61 sec
SLOW_OPS 2360 slow ops, oldest one blocked for 6517 sec, daemons
    [osd.0,osd.1,osd.11,osd.12,osd.14,osd.15,osd.16,osd.18,osd.19,osd.2]... have slow ops.
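For completeness, a sketch of how a single flagged OSD can be inspected via
its admin socket (osd.12 is just one ID picked from the list above; the
daemon commands have to be run on the host carrying that OSD):

    # ops currently in flight / blocked on this OSD
    ceph daemon osd.12 dump_ops_in_flight
    # recently completed ops with per-stage timestamps
    ceph daemon osd.12 dump_historic_ops
    # which host / crush location the OSD belongs to
    ceph osd find 12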
We have checked DNS, MTU and network congestion via Prometheus, and on
the network side nothing seems to be wrong.
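For reference, the kind of checks meant here, sketched with placeholder
names (server2/bond0 are made up; the 8972-byte payload assumes a 9000
MTU, with 1500 it would be 1472):

    # path-MTU ping between OSD hosts, fragmentation not allowed
    ping -M do -s 8972 -c 3 server2
    # bond / link state on each node
    cat /proc/net/bonding/bond0
    ip -s link show bond0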
> 2.) client: 91 MiB/s rd, 28 MiB/s wr, 1.76k op/s rd, 686 op/s wr
>     recovery: 67 MiB/s, 17 objects/s
>
> 67 MB/sec is slower than a single rotational disk can deliver. Even 67
> + 91 MB/s is not much, especially not for an 85 OSD @ 10G cluster. The
> ~2500 IOPS client I/O will translate to 7500 "net" IOPS with pool size
> 3, maybe that is the limit.
>
> But I guess you already know that. But before tuning, you should
> probably listen to Frank's advice about the placements (see other post).
> As soon as the unknown OSDs come back, the speed will probably go up
> due to parallelism.
I am not sure whether that is a good idea at the moment, after the
rebalance has already been in progress for some hours.

What really looks wrong are the extremely long peering and activation
times:
  data:
    pools:   12 pools, 3000 pgs
    objects: 35.03M objects, 133 TiB
    usage:   394 TiB used, 163 TiB / 557 TiB avail
    pgs:     5.667% pgs unknown
             24.967% pgs not active
             1365063/105076392 objects degraded (1.299%)
             252605/105076392 objects misplaced (0.240%)
             1955 active+clean
             608  peering
             170  unknown
             59   activating
             57   active+remapped+backfill_wait
             35   activating+undersized
             32   active+undersized+degraded
             20   stale+peering
             17   activating+undersized+degraded
             9    active+remapped+backfilling
             6    stale+active+clean
             5    active+recovery_wait
             4    active+undersized
             4    activating+degraded
             4    active+clean+scrubbing+deep
             4    stale+activating
             3    active+recovery_wait+degraded
             3    active+undersized+degraded+remapped+backfill_wait
             2    remapped+peering
             1    active+recovery_wait+undersized+degraded
             1    active+undersized+degraded+remapped+backfilling
             1    active+remapped+backfill_toofull

  io:
    client:   34 MiB/s rd, 3.6 MiB/s wr, 1.08k op/s rd, 324 op/s wr
    recovery: 82 MiB/s, 20 objects/s
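As a side note, the inactive PGs can be narrowed down roughly like this
(2.1f is a placeholder PG id, to be taken from the dump_stuck output):

    # list PGs stuck in peering/activating/unknown
    ceph pg dump_stuck inactive
    # ask one of them why it does not go active (acting set, blocked_by, ...)
    ceph pg 2.1f query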
Still debugging. It's impressive how the very simple task of moving 4
SSDs caused (and keeps causing) such problems. I wonder (and suspect)
that something else must be wrong here.

We recently (some months ago) upgraded from luminous via mimic to
nautilus; I will triple-check whether there are any changes that could
cause these effects.
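A quick sketch of what could be looked at for that (nothing
cluster-specific assumed):

    # all daemons should report a nautilus (14.2.x) version
    ceph versions
    # the osdmap should require nautilus after the upgrade
    ceph osd dump | grep require_osd_release
    # runtime config overrides that may have survived the upgrades
    ceph config dump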
--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch