We have separate public and cluster networks, but on the same physical
switches with different VLANs.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
On Fri, Aug 23, 2019 at 1:29 PM <DHilsbos(a)performair.com> wrote:
Frank;
Just out of curiosity; do you use a dedicated cluster network, or do you
run a single network?
Thank you,
Dominic L. Hilsbos, MBA
Director – Information Technology
Perform Air International Inc.
DHilsbos(a)PerformAir.com
www.PerformAir.com
From: Robert LeBlanc [mailto:robert@leblancnet.us]
Sent: Friday, August 23, 2019 8:29 AM
To: Frank Schilder
Cc: ceph-users
Subject: [ceph-users] Re: ceph fs crashes on simple fio test
The WPQ scheduler may help your clients back off when things get busy.
Put this in your ceph.conf and restart your OSDs.
osd op queue = wpq
osd op queue cut off = high
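In ceph.conf these belong in the [osd] section (or [global]); a minimal snippet:

```ini
[osd]
# Weighted priority queue; the high cut-off keeps high-priority ops
# (heartbeats, peering) from being starved by bulk client IO.
osd op queue = wpq
osd op queue cut off = high
```

Both settings only take effect after an OSD restart.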
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
On Fri, Aug 23, 2019 at 5:03 AM Frank Schilder <frans(a)dtu.dk> wrote:
I need to harden our ceph cluster to satisfy the following properties:
Assuming all hardware is functioning properly,
1) Cluster health has highest priority. Heartbeats have priority over
client requests.
2) The cluster does not accept more IO than the OSDs can handle. The only
exception might be a configurable burst option.
3) Client IO is accepted as long as it does not compromise 1.
4) Ideally, there is fair sharing of the cluster's IO budget between
clients (like deadline or completely fair scheduling). Rogue clients should
not get priority just because they push a lot.
Unfortunately, with default settings our (any?) cluster prefers client IO
over cluster health, which opens the door to a simple but serious attack on
cluster health by a non-privileged client:
I observed a serious issue on our cluster when running a simple fio test
that does 4K random writes on 100GB files (see details below). What I
observe is that within a few seconds the cluster goes to health_warn with
the MDS reporting slow metadata IO and being behind on trimming. What is not
shown in ceph health detail is that all OSDs report thousands of slow ops
and that the counter increases really fast (I include some snippets below). This
goes rapidly to the point where OSDs start missing heartbeats and start
flapping, some PGs become inactive+degraded and start peering regularly.
CPU load, memory usage and even network load were all rather low
(single-digit % CPU load). Nothing out of the ordinary. The mons had no issues.
After the fio test completed, the OSDs slowly crunched through the backlog
and managed to complete all ops. Fully processing the ops of a 30-second fio
test took ca. 10 minutes. However, even then the cluster did not come back
healthy, as essential messages between daemons seem to have been lost. It is
not a long stretch to assume that one could destroy a ceph fs beyond repair
by running this test, or an application with the same IO pattern, from
multiple clients for several hours. We have a 500-node cluster as clients
and I'm afraid that even ordinary IO might trigger this scenario under
unlucky circumstances.
Since our cluster is only half-trusted (we have root access to clients, but
no control over user IO patterns), we need to harden it against such
destructive IO patterns as much as possible.
To me it looks like the cluster is accepting way more IO than the OSDs can
handle. Ideally, I would like to configure effective rate limiting on
(rogue) clients depending on how many ops they already have in flight. I
would expect that there are tunables for the MDS/OSD daemons that control
how many IO requests a client can submit, or an OSD will accept, before IO
is throttled. In particular, I would like to prioritize heartbeats to
prevent load-induced OSD flapping. How can I tune the cluster to satisfy
the conditions outlined at the top of this e-mail?
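The closest knobs I have found so far are the OSD message throttles, which cap how much client work an OSD will hold before it stops reading from the socket. Something like the following (values purely illustrative on my part, not tested):

```ini
[osd]
# Illustrative values only - I do not know whether these are sane:
osd client message cap = 256             # max in-flight client messages per OSD
osd client message size cap = 268435456  # 256 MiB of in-flight client data
```

Whether these provide the kind of per-client fairness described above, or just a global cap, is part of my question.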
There were recent threads with similar topics, in particular, "MDS failing
under load with large cache sizes" and others reporting unstable MDS
daemons under load. However, I believe those were mostly related to cache
trimming issues caused by large numbers of files being created. That is not
the case here; it's just 4 files with lots of random IO from a single client.
A bit of information about our cluster and observations:
The cluster is bluestore-only with an 8+2 EC fs data pool on spinning disks
and a 3(2) replicated fs metadata pool on SSD. We have 8 OSD hosts with 2 EC
shards per host. Each host has 4 SSDs and 12 10TB SAS-12Gb HDDs with 4k
block size. The network is 2x10G bonded for clients and 2x10G bonded for
replication; the replication network will be extended to 4x10G soon. We are
aware that the network bandwidth currently greatly exceeds what the spinning
disks can handle. It is dimensioned for adding more disks in the future.
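For scale, a rough back-of-envelope calculation (the per-disk IOPS figure is an assumption, not a measurement) of what the data pool can sustain for 4k random writes:

```python
# Rough 4k random-write budget for the cluster described above.
# All per-device numbers are assumptions for illustration only.
hdd_count = 8 * 12        # 8 OSD hosts x 12 HDDs
iops_per_hdd = 150        # assumed random-write IOPS of a 10TB SAS HDD
ec_k, ec_m = 8, 2         # 8+2 erasure-coded fs data pool

raw_iops = hdd_count * iops_per_hdd        # aggregate backend IOPS
# Each client write hits all k+m shards, so the client-visible budget is
# roughly raw_iops / (k + m) - ignoring WAL/DB traffic and read-modify-write
# amplification for sub-stripe writes, which only make it worse.
client_iops_budget = raw_iops // (ec_k + ec_m)

print(f"backend: {raw_iops} IOPS, client budget: ~{client_iops_budget} IOPS")
```

A single 10G client can submit far more than that (buffered 4k writes are absorbed by the client cache and flushed asynchronously), so without back-pressure the ops simply pile up in the OSD queues.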
The fio job script is:
[global]
name=fio-rand-write
directory=/home/fio # /home is on ceph fs
filename_format=tmp/fio-$jobname-${HOSTNAME}-$jobnum-$filenum
rw=randwrite
bs=4K
numjobs=4
time_based=1
runtime=30
[file1]
size=100G
ioengine=sync
It's one of the examples in the fio source repo with small modifications
(file pattern, run time). Shortly after starting this fio job from a client
connected with a single 10G line, all OSDs start reporting slow ops. Picking
one, the log messages look like this:
Aug 22 12:06:22 ceph-09 ceph-osd: 2019-08-22 10:06:22.151 7f399d1bd700 -1
osd.125 3778 get_health_metrics reporting 146 slow ops, oldest is
osd_op(client.1178165.0:1670940 5.3fs0
5:fcc65fe3:::10000f403bc.0000063e:head [write 307200~4096,write
315392~8192,write 380928~12288,write 405504~12288,write 421888~4096,write
458752~4096,write 466944~4096,write 475136~4096,write 487424~8192,write
512000~4096,write 524288~4096,write 565248~4096,write 589824~8192,write
622592~4096,write 651264~4096,write 724992~12288] snapc 12e=[]
ondisk+write+known_if_redirected e3778)
...
Aug 22 12:06:49 ceph-09 journal: 2019-08-22 10:06:49.415 7f399d1bd700 -1
osd.125 3779 get_health_metrics reporting 4595 slow ops, oldest is
osd_op(client.1178165.0:1686216 5.acs0
5:354e96d5:::10000f403bc.00000bc1:head [write 2359296~4096,write
2375680~4096,write 2383872~4096,write 2404352~4096,write 2428928~8192,write
2469888~4096,write 2490368~8192,write 2514944~4096,write 2527232~4096,write
2535424~4096,write 2588672~4096,write 2600960~4096,write 2621440~8192,write
2658304~4096,write 2715648~8192,write 2727936~4096] snapc 12e=[]
ondisk+write+known_if_redirected e3778)
...
Aug 22 12:12:57 ceph-09 journal: 2019-08-22 10:12:57.650 7f399d1bd700 -1
osd.125 3839 get_health_metrics reporting 8419 slow ops, oldest is
osd_op(client.1178165.0:2009417 5.3fs0
5:fcdcf2bd:::10000f47e2d.00001501:head [write 2236416~4096,write
2256896~4096,write 2265088~4096,write 2301952~4096,write 2322432~4096,write
2355200~4096,write 2371584~4096,write 2387968~4096,write 2449408~4096,write
2486272~4096,write 2547712~8192,write 2617344~4096,write 2809856~4096,write
3018752~4096,write 3194880~4096,write 3223552~4096] snapc 12e=[]
ondisk+write+known_if_redirected e3839)
Aug 22 12:12:58 ceph-09 ceph-osd: 2019-08-22 10:12:58.681 7f399d1bd700 -1
osd.125 3839 get_health_metrics reporting 8862 slow ops, oldest is
osd_op(mds.0.16909:69577511 5.ees0 5.992388ee (undecoded)
ondisk+write+known_if_redirected+full_force e3839)
...
Aug 22 12:13:27 ceph-09 journal: 2019-08-22 10:13:27.691 7f399d1bd700 -1
osd.125 3839 get_health_metrics reporting 13795 slow ops, oldest is
osd_op(mds.0.16909:69577573 5.e6s0 5.d8994de6 (undecoded)
ondisk+write+known_if_redirected+full_force e3839)
...
Aug 22 12:13:59 ceph-09 ceph-osd: 2019-08-22 10:13:59.762 7f399d1bd700 -1
osd.125 3900 get_health_metrics reporting 12 slow ops, oldest is
osd_op(mds.0.16909:69577511 5.ees0 5.992388ee (undecoded)
ondisk+retry+write+known_if_redirected+full_force e3875)
...
Aug 22 12:14:46 ceph-09 journal: 2019-08-22 10:14:46.569 7f399d1bd700 -1
osd.125 3916 get_health_metrics reporting 969 slow ops, oldest is
osd_op(mds.0.16909:69577511 5.ees0 5.992388ee (undecoded)
ondisk+retry+write+known_if_redirected+full_force e3875)
Aug 22 12:14:47 ceph-09 ceph-osd: 2019-08-22 10:14:47.617 7f399d1bd700 -1
osd.125 3935 get_health_metrics reporting 1 slow ops, oldest is
osd_op(mds.0.16909:69577511 5.ees0
5:7711c499:::10000f4798b.00000000:head [create,setxattr parent
(289),setxattr layout (30)] snapc 0=[] RETRY=2
ondisk+retry+write+known_if_redirected+full_force e3875)
...
Aug 22 12:14:53 ceph-09 journal: 2019-08-22 10:14:53.675 7f399d1bd700 -1
osd.125 3939 get_health_metrics reporting 1 slow ops, oldest is
osd_op(mds.0.16909:69577511 5.ees0 5:7711c499:::10000f4798b.00000000:head
[create,setxattr parent (289),setxattr layout (30)] snapc 0=[] RETRY=2
ondisk+retry+write+known_if_redirected+full_force e3875)
This is the last log message; the OSD seems to have executed all ops at this
point or shortly after. The cluster state went to health_ok - at least for a
short while.
At about 12:45 (yes, lunch break) we looked at the cluster again and it
was back in health_warn with the following status output:
[root@ceph-01 ~]# ceph status
cluster:
id: e4ece518-f2cb-4708-b00f-b6bf511e91d9
health: HEALTH_WARN
1 MDSs report slow metadata IOs
2 MDSs behind on trimming
services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-01(active), standbys: ceph-02, ceph-03
mds: con-fs-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
osd: 192 osds: 192 up, 192 in
data:
pools: 7 pools, 790 pgs
objects: 9.01 M objects, 16 TiB
usage: 20 TiB used, 1.3 PiB / 1.3 PiB avail
pgs: 790 active+clean
io:
client: 1.9 MiB/s rd, 21 MiB/s wr, 60 op/s rd, 721 op/s wr
[root@ceph-01 ~]# ceph health detail
HEALTH_WARN 1 MDSs report slow metadata IOs; 2 MDSs behind on trimming
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
mdsceph-08(mds.0): 2 slow metadata IOs are blocked > 30 secs, oldest
blocked for 3275 secs
MDS_TRIM 2 MDSs behind on trimming
mdsceph-08(mds.0): Behind on trimming (1778/128) max_segments: 128,
num_segments: 1778
mdsceph-12(mds.0): Behind on trimming (1780/128) max_segments: 128,
num_segments: 1780
Num_segments was increasing constantly. Apparently, an operation had got
stuck and never completed. We hunted around a bit and found:
[root@ceph-mds:ceph-08 /]# ceph daemon mds.ceph-08 objecter_requests
{
"ops": [
{
"tid": 71206302,
"pg": "4.952611b7",
"osd": 19,
"object_id": "200.0001be74",
"object_locator": "@4",
"target_object_id": "200.0001be74",
"target_object_locator": "@4",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"last_sent": "268516s",
"attempts": 1,
"snapid": "head",
"snap_context": "0=[]",
"mtime": "2019-08-22 11:12:41.0.565912s",
"osd_ops": [
"write 890478~1923"
]
},
{
"tid": 71206303,
"pg": "4.952611b7",
"osd": 19,
"object_id": "200.0001be74",
"object_locator": "@4",
"target_object_id": "200.0001be74",
"target_object_locator": "@4",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"last_sent": "268516s",
"attempts": 1,
"snapid": "head",
"snap_context": "0=[]",
"mtime": "2019-08-22 11:12:41.0.566236s",
"osd_ops": [
"write 892401~1931"
]
},
{
"tid": 71206301,
"pg": "5.eeb918c6",
"osd": 67,
"object_id": "10000f26f67.00000000",
"object_locator": "@5",
"target_object_id": "10000f26f67.00000000",
"target_object_locator": "@5",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"last_sent": "268516s",
"attempts": 1,
"snapid": "head",
"snap_context": "12e=[]",
"mtime": "1970-01-01 00:00:00.000000s",
"osd_ops": [
"trimtrunc 81854@573"
]
},
{
"tid": 69577573,
"pg": "5.d8994de6",
"osd": 125,
"object_id": "10000f479c9.00000000",
"object_locator": "@5",
"target_object_id": "10000f479c9.00000000",
"target_object_locator": "@5",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"last_sent": "265042s",
"attempts": 5,
"snapid": "head",
"snap_context": "0=[]",
"mtime": "2019-08-22 10:12:05.0.256058s",
"osd_ops": [
"create",
"setxattr parent (319)",
"setxattr layout (30)"
]
},
{
"tid": 69577598,
"pg": "5.deb003e6",
"osd": 125,
"object_id": "10000f479da.00000000",
"object_locator": "@5",
"target_object_id": "10000f479da.00000000",
"target_object_locator": "@5",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"last_sent": "265042s",
"attempts": 5,
"snapid": "head",
"snap_context": "0=[]",
"mtime": "2019-08-22 10:12:05.0.258824s",
"osd_ops": [
"create",
"setxattr parent (288)",
"setxattr layout (30)"
]
},
{
"tid": 71206300,
"pg": "5.5cd5b20b",
"osd": 163,
"object_id": "10000f01396.00000000",
"object_locator": "@5",
"target_object_id": "10000f01396.00000000",
"target_object_locator": "@5",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"last_sent": "268516s",
"attempts": 1,
"snapid": "head",
"snap_context": "12e=[]",
"mtime": "1970-01-01 00:00:00.000000s",
"osd_ops": [
"trimtrunc 208782@573"
]
}
],
"linger_ops": [],
"pool_ops": [],
"pool_stat_ops": [],
"statfs_ops": [],
"command_ops": []
}
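As a side note, pulling the suspicious entries out of such a dump can be scripted; a hypothetical sketch (the `stuck_ops` helper is mine, not a ceph tool) that flags ops with epoch (1970) mtimes or a high attempt count:

```python
import json

def stuck_ops(dump_json, max_attempts=3):
    """Filter `ceph daemon mds.<id> objecter_requests` output for ops that
    look stuck: epoch (1970) mtimes or more than max_attempts send attempts."""
    dump = json.loads(dump_json)
    return [op for op in dump.get("ops", [])
            if op["mtime"].startswith("1970-") or op["attempts"] > max_attempts]

# Trimmed-down sample in the shape of the dump above:
sample = """{"ops": [
  {"tid": 71206302, "osd": 19,  "mtime": "2019-08-22 11:12:41.0.565912s", "attempts": 1},
  {"tid": 71206301, "osd": 67,  "mtime": "1970-01-01 00:00:00.000000s",  "attempts": 1},
  {"tid": 69577573, "osd": 125, "mtime": "2019-08-22 10:12:05.0.256058s", "attempts": 5}
]}"""
print([op["tid"] for op in stuck_ops(sample)])  # -> [71206301, 69577573]
```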
Notice the ops with mtimes from 1970. Checking ops and dump_blocked_ops on
osd.19 showed that these lists were empty. So we decided to restart osd.19,
which cleared out most of the stuck requests, but did not clear the health
warnings:
[root@ceph-mds:ceph-08 /]# ceph daemon mds.ceph-08 objecter_requests
{
"ops": [
{
"tid": 69577573,
"pg": "5.d8994de6",
"osd": 125,
"object_id": "10000f479c9.00000000",
"object_locator": "@5",
"target_object_id": "10000f479c9.00000000",
"target_object_locator": "@5",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"last_sent": "265042s",
"attempts": 5,
"snapid": "head",
"snap_context": "0=[]",
"mtime": "2019-08-22 10:12:05.0.256058s",
"osd_ops": [
"create",
"setxattr parent (319)",
"setxattr layout (30)"
]
},
{
"tid": 69577598,
"pg": "5.deb003e6",
"osd": 125,
"object_id": "10000f479da.00000000",
"object_locator": "@5",
"target_object_id": "10000f479da.00000000",
"target_object_locator": "@5",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"last_sent": "265042s",
"attempts": 5,
"snapid": "head",
"snap_context": "0=[]",
"mtime": "2019-08-22 10:12:05.0.258824s",
"osd_ops": [
"create",
"setxattr parent (288)",
"setxattr layout (30)"
]
}
],
"linger_ops": [],
"pool_ops": [],
"pool_stat_ops": [],
"statfs_ops": [],
"command_ops": []
}
Restarting osd.125 finally resolved the health issues. However, the client
I ran fio on had lost its connection to ceph due to this incident, which is
really annoying. This client is the head node of our HPC cluster, and it was
not possible to restore ceph fs access without a reboot. This is additional
bad fallout, as all users will lose access to our HPC cluster when this
happens (/home is on the ceph fs).
I dumped the dump_historic_slow_ops of osd.125 in case anyone can use this
information.
I might be able to repeat this experiment, but cannot promise anything.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io