Hello,
Our ceph cluster performance has become horrifically slow over the past few
months.
Nobody here is terribly familiar with ceph, and we inherited this
cluster without much direction.
Architecture: 40Gbps QDR IB fabric between all ceph nodes and our ovirt VM
hosts. 11 OSD nodes with a total of 163 OSDs. 14 pools, 3616 PGs, 1.19PB
total capacity.
Ceph versions:
{
    "mon": {
        "ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)": 3
    },
    "mgr": {
        "ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)": 3
    },
    "osd": {
        "ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)": 118,
        "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 22,
        "ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)": 19
    },
    "mds": {},
    "overall": {
        "ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)": 124,
        "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 22,
        "ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)": 19
    }
}
The majority of disks are spindles, but there are also some NVMe SSDs.
There is a lot of variability in drive sizes - two different sets of
admins added disks ranging from 6TB to 16TB, and I suspect this, along
with imbalanced weighting, is to blame.
Performance on the ovirt VMs can dip as low as several *kilobytes*
per second (!) on reads and a few MB/sec on writes. There are also several
scrub errors. In short, it's a complete wreck.
STATUS:
[root@ceph-admin davei]# ceph -s
  cluster:
    id:     1b8d958c-e50b-40ef-a681-16cfeb9390b8
    health: HEALTH_ERR
            3 scrub errors
            Possible data damage: 3 pgs inconsistent

  services:
    mon: 3 daemons, quorum ceph1,ceph2,ceph3
    mgr: ceph3(active), standbys: ceph2, ceph1
    osd: 163 osds: 159 up, 158 in

  data:
    pools:   14 pools, 3616 pgs
    objects: 46.28M objects, 174TiB
    usage:   527TiB used, 694TiB / 1.19PiB avail
    pgs:     3609 active+clean
             4    active+clean+scrubbing+deep
             3    active+clean+inconsistent

  io:
    client: 74.3MiB/s rd, 96.0MiB/s wr, 3.85kop/s rd, 3.68kop/s wr
---
HEALTH:
[root@ceph-admin davei]# ceph health detail
HEALTH_ERR 3 scrub errors; Possible data damage: 3 pgs inconsistent
OSD_SCRUB_ERRORS 3 scrub errors
PG_DAMAGED Possible data damage: 3 pgs inconsistent
    pg 2.8a is active+clean+inconsistent, acting [13,152,127]
    pg 2.ce is active+clean+inconsistent, acting [145,13,152]
    pg 2.e8 is active+clean+inconsistent, acting [150,162,42]
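For the inconsistent PGs, this is roughly what we were planning to try - first inspecting what actually mismatches, then an explicit repair - on the assumption that the underlying disks on the acting OSDs check out first (the PG IDs below are the ones from the health output above):

```shell
# See which objects/shards are actually inconsistent before repairing:
rados list-inconsistent-obj 2.8a --format=json-pretty

# Check dmesg/SMART on the acting OSDs' drives first, since a repair
# pulls from what ceph considers the authoritative copy.
# Then kick off a repair for each damaged PG:
ceph pg repair 2.8a
ceph pg repair 2.ce
ceph pg repair 2.e8
```

Happy to hear if that's the wrong approach for scrub errors on luminous.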
---
CEPH OSD DF:
(not going to paste it all in here):
https://pastebin.com/CNW5RKWx
What else should I be sharing with you all?
Any advice on how we should reweight these OSDs to get the performance
back to something reasonable?
Thanks all,
-Dave