On Tue, Sep 3, 2019 at 11:33 AM Frank Schilder <frans@dtu.dk> wrote:

Hi Robert and Paul,

Sad news. I ran a 5-second, single-threaded test after setting osd_op_queue_cut_off=high on all OSDs and MDSs. Here are the current settings:

[root@ceph-01 ~]# ceph config show osd.0
NAME                                    VALUE               SOURCE   OVERRIDES  IGNORES
bluestore_compression_min_blob_size_hdd 262144              file
bluestore_compression_mode              aggressive          file
cluster_addr                            192.168.16.68:0/0   override
cluster_network                         192.168.16.0/20     file
crush_location                          host=c-04-A         file
daemonize                               false               override
err_to_syslog                           true                file
keyring                                 $osd_data/keyring   default
leveldb_log                                                 default
mgr_initial_modules                     balancer dashboard  file
mon_allow_pool_delete                   false               file
mon_pool_quota_crit_threshold           90                  file
mon_pool_quota_warn_threshold           70                  file
osd_journal_size                        4096                file
osd_max_backfills                       3                   mon
osd_op_queue_cut_off                    high                mon
osd_pool_default_flag_nodelete          true                file
osd_recovery_max_active                 8                   mon
osd_recovery_sleep                      0.050000            mon
public_addr                             192.168.32.68:0/0   override
public_network                          192.168.32.0/19     file
rbd_default_features                    61                  default
setgroup                                disk                cmdline
setuser                                 ceph                cmdline
[root@ceph-01 ~]# ceph config get osd.0 osd_op_queue
wpq

Unfortunately, the problem is not resolved. The fio job script is:

=====================
[global]
name=fio-rand-write
filename_format=fio-$jobname-${HOSTNAME}-$jobnum-$filenum
rw=randwrite
bs=4K
numjobs=1
time_based=1
runtime=5

[file1]
size=100G
ioengine=sync
=====================

That's a random write test on a 100G file with a write size of 4K. Note that fio uses "direct=0" (buffered I/O) by default. With "direct=1", the same test runs absolutely fine.

Even this short burst of load is enough to make the cluster unhealthy:

cluster log:

2019-09-03 20:00:00.000160 [INF] overall HEALTH_OK
2019-09-03 20:08:36.450527 [WRN] Health check failed: 1 MDSs report slow metadata IOs (MDS_SLOW_METADATA_IO)
2019-09-03 20:08:59.867124 [INF] MDS health message cleared (mds.0): 2 slow metadata IOs are blocked > 30 secs, oldest blocked for 49 secs
2019-09-03 20:09:00.373050 [INF] Health check cleared: MDS_SLOW_METADATA_IO (was: 1 MDSs report slow metadata IOs)
2019-09-03 20:09:00.373094 [INF] Cluster is now healthy

/var/log/messages: loads of these (all OSDs!)

Sep 3 20:08:39 ceph-09 journal: 2019-09-03 20:08:39.269 7f6a3d63c700 -1 osd.161 10411 get_health_metrics reporting 354 slow ops, oldest is osd_op(client.4497435.0:38244 5.f7s0 5:ef9f1be4:::100010ed9bd.0000390c:head [write 8192~4096,write 32768~4096,write 139264~4096,write 172032~4096,write 270336~4096,write 512000~4096,write 688128~4096,write 876544~4096,write 1048576~4096,write 1257472~4096,write 1425408~4096,write 1445888~4096,write 1503232~4096,write 1552384~4096,write 1716224~4096,write 1765376~4096] snapc 12e=[] ondisk+write+known_if_redirected e10411)
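
To see how the slow-op count develops per OSD, the "get_health_metrics reporting N slow ops" lines can be pulled out of /var/log/messages. The following is only a minimal sketch: the regular expression is modeled on the single log line quoted above and is an assumption about the general format, and the function names are made up for illustration.

```python
import re
from collections import defaultdict

# Pattern modeled on the log line quoted above (assumed format):
#   "... osd.161 10411 get_health_metrics reporting 354 slow ops, oldest is ..."
SLOW_OPS = re.compile(
    r"\b(osd\.\d+) \d+ get_health_metrics reporting (\d+) slow ops"
)


def slow_ops_by_osd(lines):
    """Return {osd_name: [slow-op counts in order of appearance]}."""
    counts = defaultdict(list)
    for line in lines:
        match = SLOW_OPS.search(line)
        if match:
            counts[match.group(1)].append(int(match.group(2)))
    return dict(counts)


def peak_slow_ops(lines):
    """Return the highest slow-op count reported by each OSD."""
    return {osd: max(vals) for osd, vals in slow_ops_by_osd(lines).items()}


if __name__ == "__main__":
    # Shortened copy of the line above; the op list is cut off here on purpose.
    sample = (
        "Sep 3 20:08:39 ceph-09 journal: 2019-09-03 20:08:39.269 "
        "7f6a3d63c700 -1 osd.161 10411 get_health_metrics reporting "
        "354 slow ops, oldest is osd_op(client.4497435.0:38244 5.f7s0 ..."
    )
    print(peak_slow_ops([sample]))
```

Feeding it the whole log file (for example `peak_slow_ops(open("/var/log/messages"))`) gives one peak value per OSD, which makes it easier to check whether really all OSDs are affected.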

It looks like the MDS is pushing waaaayyy too many requests onto the HDDs instead of throttling the client.

An ordinary user should not have so much power in his hands. This makes it trivial to destroy a ceph cluster.

This very short fio test is probably sufficient to reproduce the issue on any test cluster. Should I open an issue?

Best regards,

Are your metadata pools on SSD or HDD? For us, as long as the blocked I/O count fluctuates up and down, the cluster usually runs fine even with the warnings. Typically, on an idle cluster, a client sends a burst of data that fills up the queues, and the HDDs then have to work through it. At that point the client realizes that the storage is 'slow', starts throttling its traffic, and ends up matching the speed at which the HDDs can do the work. If you run the job for a long time, are you still seeing the trimming errors, or just a steady rise in blocked I/O on the OSDs without any drops in the count? What about the client: does it see a fairly even distribution of latencies, or are there a lot that are just really long?
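
Both questions can be checked mechanically once the numbers are collected. The sketch below is only an illustration under assumptions: it takes a list of blocked-op counts sampled over time and a list of per-I/O client latencies (how you collect them is up to you), and the helper names and the nearest-rank percentile are my own choices, not anything Ceph or fio provides.

```python
import math


def steadily_rising(counts):
    """True if the blocked-op count never drops and ends higher than it started."""
    if len(counts) < 2:
        return False
    never_drops = all(b >= a for a, b in zip(counts, counts[1:]))
    return never_drops and counts[-1] > counts[0]


def percentile(sorted_vals, p):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(p / 100 * len(sorted_vals)))
    return sorted_vals[rank - 1]


def latency_summary(latencies):
    """Return p50, p99 and max so a long tail is easy to spot."""
    vals = sorted(latencies)
    return {
        "p50": percentile(vals, 50),
        "p99": percentile(vals, 99),
        "max": vals[-1],
    }


if __name__ == "__main__":
    # Made-up numbers, purely to show the call pattern.
    print(steadily_rising([10, 50, 30, 60]))    # count fluctuates
    print(steadily_rising([10, 20, 35]))        # count only rises
    print(latency_summary([1, 2, 3, 4, 100]))   # one very long outlier
```

A p99 or max that is far above p50 points to a few very long requests rather than an evenly slow client.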
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1