Hi,
I want to use balancer mode "upmap" for all pools.
This mode is currently enabled for pool "hdb_backup" with ~600 TB of used space.
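For context, this is roughly how I enabled upmap mode (a sketch of the standard commands; `ceph balancer status` shows the active mode and whether the module is running):

```shell
# upmap requires all clients to speak at least the luminous protocol
ceph osd set-require-min-compat-client luminous
# select upmap mode and turn the balancer module on
ceph balancer mode upmap
ceph balancer on
# verify: active mode, plans, last optimization
ceph balancer status
```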
root@ld3955:~# rados df
POOL_NAME         USED      OBJECTS    CLONES  COPIES     MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS    RD       WR_OPS     WR       USED COMPR  UNDER COMPR
backup            0 B       0          0       0          0                   0        0         0         0 B      0          0 B      0 B         0 B
cephfs_data       1.1 TiB   89592      0       268776     0                   0        0         0         0 B      43443      144 GiB  0 B         0 B
cephfs_metadata   311 MiB   48         0       144        0                   0        0         6         6 KiB    7465       106 MiB  0 B         0 B
hdb_backup        585 TiB   51077985   0       153233955  0                   0        0         12577024  4.3 TiB  281002173  523 TiB  0 B         0 B
hdd               6.3 TiB   585051     0       1755153    0                   0        0         4420255   69 GiB   8219453    1.2 TiB  0 B         0 B
root@ld3955:~# ceph osd lspools
11 hdb_backup
51 hdd
52 backup
57 cephfs_data
58 cephfs_metadata
I started with pool "cephfs_metadata", which holds comparably little data.
At the moment I executed
ceph config set mgr mgr/balancer/pool_ids 11,52,58
Ceph status showed an increasing number of
Reduced data availability: <x> pgs inactive, <y> pgs peering
and the count of
<z> slow requests are blocked > 32 sec
increased heavily, up to 180.
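For reference, this is how I verify what the balancer is configured to touch and how imbalanced a pool still is (a sketch; pool names are the ones from `ceph osd lspools` above):

```shell
# confirm which pool ids the balancer is restricted to
ceph config get mgr mgr/balancer/pool_ids
# score the current distribution (lower is better) for one pool
ceph balancer eval hdb_backup
# show module state, mode, and any optimization in progress
ceph balancer status
```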
Observing the Ceph log (ceph -w), it becomes clear that there's a correlation
between inactive PGs and blocked slow requests in my cluster.
How can I start analyzing why the cluster reports slow requests?
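In case it helps others with the same question, the usual starting points I know of are (a sketch; `<id>` is the OSD reported by health detail, and the daemon commands must run on the host carrying that OSD):

```shell
# which OSDs are reporting the blocked requests
ceph health detail
# commit/apply latency per OSD, to spot slow disks
ceph osd perf
# on the affected OSD's host: inspect the currently blocked ops
ceph daemon osd.<id> dump_ops_in_flight
# and recently completed slow ops, with per-phase timestamps
ceph daemon osd.<id> dump_historic_ops
```

The `dump_ops_in_flight` output shows at which stage each op is stuck (e.g. waiting for peering), which should tie the slow requests back to the inactive/peering PGs.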
THX