On Tue, Sep 3, 2019 at 11:33 AM Frank Schilder <frans@dtu.dk> wrote:
Hi Robert and Paul,

sad news. I did a 5-second single-thread test after setting osd_op_queue_cut_off=high on all OSDs and MDSs. Here are the current settings:

[root@ceph-01 ~]# ceph config show osd.0
NAME                                    VALUE              SOURCE   OVERRIDES IGNORES
bluestore_compression_min_blob_size_hdd 262144             file                       
bluestore_compression_mode              aggressive         file                       
cluster_addr                    override                   
cluster_network                   file                       
crush_location                          host=c-04-A        file                       
daemonize                               false              override                   
err_to_syslog                           true               file                       
keyring                                 $osd_data/keyring  default                   
leveldb_log                                                default                   
mgr_initial_modules                     balancer dashboard file                       
mon_allow_pool_delete                   false              file                       
mon_pool_quota_crit_threshold           90                 file                       
mon_pool_quota_warn_threshold           70                 file                       
osd_journal_size                        4096               file                       
osd_max_backfills                       3                  mon                       
osd_op_queue_cut_off                    high               mon                       
osd_pool_default_flag_nodelete          true               file                       
osd_recovery_max_active                 8                  mon                       
osd_recovery_sleep                      0.050000           mon                       
public_addr                     override                   
public_network                    file                       
rbd_default_features                    61                 default                   
setgroup                                disk               cmdline                   
setuser                                 ceph               cmdline                   
[root@ceph-01 ~]# ceph config get osd.0 osd_op_queue

Unfortunately, the problem is not resolved. The fio job script is:



That's a random write test on a 100G file with write size 4K. Note that fio uses "direct=0" by default. Using "direct=1" is absolutely fine.
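The script itself did not survive in the message above; a job matching this description (4K random writes to a 100G file, buffered I/O, short runtime) would look roughly like the following — the job name, runtime, and target directory are illustrative, not Frank's actual values:

```ini
[global]
name=cephfs-randwrite      ; illustrative job name
rw=randwrite               ; random writes
bs=4k                      ; 4K write size
size=100g                  ; 100G test file
direct=0                   ; buffered I/O (fio's default) -- this is what triggers the problem
runtime=5                  ; short 5-second burst
time_based=1

[job1]
directory=/mnt/cephfs/fio-test   ; illustrative CephFS mount point
```

With direct=0 the page cache absorbs the writes and flushes them to CephFS in large bursts, which is consistent with the flood of small writes seen on the OSDs below.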

Running this short burst of load, I already get the cluster unhealthy:

cluster log:

2019-09-03 20:00:00.000160 [INF]  overall HEALTH_OK
2019-09-03 20:08:36.450527 [WRN]  Health check failed: 1 MDSs report slow metadata IOs (MDS_SLOW_METADATA_IO)
2019-09-03 20:08:59.867124 [INF]  MDS health message cleared (mds.0): 2 slow metadata IOs are blocked > 30 secs, oldest blocked for 49 secs
2019-09-03 20:09:00.373050 [INF]  Health check cleared: MDS_SLOW_METADATA_IO (was: 1 MDSs report slow metadata IOs)
2019-09-03 20:09:00.373094 [INF]  Cluster is now healthy

/var/log/messages: loads of these (all OSDs!)

Sep  3 20:08:39 ceph-09 journal: 2019-09-03 20:08:39.269 7f6a3d63c700 -1 osd.161 10411 get_health_metrics reporting 354 slow ops, oldest is osd_op(client.4497435.0:38244 5.f7s0 5:ef9f1be4:::100010ed9bd.0000390c:head [write 8192~4096,write 32768~4096,write 139264~4096,write 172032~4096,write 270336~4096,write 512000~4096,write 688128~4096,write 876544~4096,write 1048576~4096,write 1257472~4096,write 1425408~4096,write 1445888~4096,write 1503232~4096,write 1552384~4096,write 1716224~4096,write 1765376~4096] snapc 12e=[] ondisk+write+known_if_redirected e10411)

It looks like the clients are pushing way too many requests onto the HDDs, and nothing is throttling them.

An ordinary user should not have this much power in their hands. It makes it trivial to bring down a ceph cluster.

This very short fio test is probably sufficient to reproduce the issue on any test cluster. Should I open an issue?

Best regards,

Are your metadata pools on SSD or HDD?

For us, as long as the blocked I/O count fluctuates (goes up and comes back down), the cluster usually runs fine even with the warnings. On an idle cluster, a client sends a burst of data that fills up the queues; the HDDs then have to work through them, at which point the client realizes the storage is 'slow', starts throttling its traffic, and settles at the speed the HDDs can sustain.

If you run the job for a long time, do you still see the trimming errors, or just a steady rise in blocked I/O on the OSDs without any drops in the count? And what about the client: does it show a fairly even distribution of latencies, or a lot that are just really long?
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1