On Tue, Sep 10, 2019 at 1:11 PM Frank Schilder <frans@dtu.dk> wrote:
Hi Robert,
I have metadata on SSD (3x replicated) and data on 8+2 EC on spinning disks, so
the speed difference is orders of magnitude. Our usage is quite metadata
heavy, so this suits us well, in particular since EC pools give high
throughput with large IO sizes.
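Roughly, the layout looks like this (pool names, PG counts and the exact
commands below are illustrative, not copied from our cluster; wiring the EC
pool into CephFS as a data pool is omitted):

    ceph osd erasure-code-profile set ec-8-2 k=8 m=2 crush-device-class=hdd
    ceph osd pool create fs-data 1024 1024 erasure ec-8-2
    ceph osd pool set fs-data allow_ec_overwrites true   # needed for CephFS data on EC
    ceph osd crush rule create-replicated ssd-only default host ssd
    ceph osd pool create fs-meta 128 128 replicated ssd-only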
As long as one uses fio with direct=1 (probably also with sync=1
and/or fsync=1), everything is fine and behaves as you describe. IOPS
fluctuate but adjust to media speed. No problems at all.
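For illustration, a direct job along these lines behaves fine (path, size and
queue depth are placeholders, not the exact job I ran):

    fio --name=direct-write --directory=/mnt/cephfs/fiotest --rw=randwrite \
        --bs=4k --size=4G --ioengine=libaio --iodepth=16 --direct=1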
As mentioned in my last update (trimmed from the quote below), the destructive fio
command runs with direct=0 and neither sync=1 nor fsync=1. This test just
writes as fast as it can (to buffers) without waiting for acks. I would have
expected a Ceph client to translate that into synced or direct IO,
which would be fine.
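The problematic pattern is roughly this kind of job (again placeholders, not
the exact command):

    fio --name=buffered-write --directory=/mnt/cephfs/fiotest --rw=write \
        --bs=4k --size=16G --numjobs=8 --direct=0
    # no sync=1/fsync=1: purely buffered writes, acked from the page cache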
But it doesn't. Instead, it also pushes the IO to the cluster as fast as
possible. I have seen 40k write ops/s on the EC pool (100+ HDDs), which can
handle maybe 1k write ops/s in total. The queues were growing at
an incredible rate (several hundred ops per second). I hope that with the change
to cut_off=high heartbeats will no longer get lost, but this will
still destabilize our Ceph cluster quite dramatically.
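(For reference, the queue build-up can be watched on the OSD nodes with
something like this; osd.12 and the pool name are just examples:)

    ceph daemon osd.12 dump_ops_in_flight | grep num_ops
    ceph daemon osd.12 dump_historic_ops
    ceph osd pool stats fs-data   # per-pool client IO rates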
Changing the cut_off to high is not what keeps heartbeats from getting lost
(heartbeats have a priority far above the high mark). What cut_off = high
does is put replication ops into the main queue instead of the strict
priority queue. That way an OSD doesn't get DDoSed by its peers to the point
where it can never service its own clients.
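For reference, the setting can be applied roughly like this (option spelling
as in the osd config reference; it typically only takes effect after an OSD
restart, so check your version's docs):

    # ceph.conf, [osd] section:
    osd op queue cut off = high

    # or via the monitors' config database:
    ceph config set osd osd_op_queue_cut_off high
    ceph daemon osd.0 config get osd_op_queue_cut_off   # verify on a running OSD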
When I did my fio testing, it was on Firefly/Hammer and on RBD, so I can't
speak specifically to newer versions and CephFS. We haven't had time to set
up our test cluster, so I can't run benchmarks at the moment.
My problem is not so much that such an IO pattern could occur in reasonable
software, but rather
- that someone might try it just for fun, and
- that our 500+ clients might occasionally produce such a workload in
aggregate.
I find it somewhat alarming that a storage system that promises data
integrity and reliability can be taken down with a publicly available
benchmark tool in a matter of a few dozen seconds by ordinary users,
potentially with damaging effects. I guess something similar could be
achieved with a modified rogue client.
I would expect that a storage cluster should have basic self-defence
mechanisms that prevent this kind of overload or DOS attack by throttling
clients with crazy IO requests. Are there any settings that can be enabled
to prevent this from happening?
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1