On Wed, Jul 3, 2019 at 12:30 PM Jeff Layton <jlayton(a)redhat.com> wrote:
On Tue, 2019-07-02 at 17:24 +0200, Dan van der Ster wrote:
Hi,
Are there any plans to implement a per-client throttle on mds client requests?
We just had an interesting case where a new cephfs user was hammering
an mds from several hosts. In the end we found that their code was
doing:
    while d = getafewbytesofdata():
        f = open(file.dat)
        f.append(d)
        f.close()
By changing their code to:
    f = open(file.dat)
    while d = getafewbytesofdata():
        f.append(d)
    f.close()
it completely removes their load on the mds (for obvious reasons: each
open/close is a round trip to the MDS, while writes to an already-open
file go straight to the OSDs).
In a multi-user environment it's hard to scrutinize every user's
application, so we'd prefer to just throttle down the client req rates
(and let them suffer from the poor performance).
Thoughts?
(cc'ing Xuehan)
It sounds like a reasonable thing to do at first glance. Xuehan Xu
recently posted a patchset to add a new io controller policy for
cephfs, but that was focused on OSD ops issued on behalf of cephfs
clients, which is not quite what you're asking about.
The challenge with all of these sorts of throttling schemes is how to
parcel things out to individual clients. MDS/OSD ops are not a discrete
resource, and it's difficult to gauge how much to allocate to each
client.
I think if we were going to do something along these lines, it'd be good
to work out how you'd throttle both MDS and OSD ops to keep a lid on
things. That said, this is not a trivial problem to tackle, IMO.
Thanks for the reply, Jeff.
At the moment I'm only considering a simple throttle on MDS client
ops. From a practical standpoint, we have already seen clients
overloading MDSs, but haven't suffered from any OSD-related load
issues.
Plus, the distributed QoS stuff should better solve the OSD problem, AFAIU.
Some questions to get you started should you choose to
pursue this:
- Will you throttle these ops at the MDS or on the clients? Ditto for
the OSDs...
- How will it work? Will there be a fixed cap of some sort for a given
amount of time, or are you more looking to just delay processing ops for
a single client when it's "too busy"?
My basic idea is to add this machinery early on in handle_client_request:
- record each Session's last req timestamp
- if now()-last_req_timestamp for a given req/session is less than a
configurable delay, inject a delay. (e.g.
mds_per_client_request_sleep, defaults to 0, we'd use 0.01 to throttle
clients to 100Hz each)
That said, I haven't understood exactly how to inject that delay just
yet. Is h_c_r async per req, or is it looping with one thread over the
queued requests? If it's async per Session or req, then we could just
sleep right there in h_c_r. If h_c_r is handled by one thread, we need
to be more clever. Is there a standard way to tell a client to retry
the req after some delay?
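To make the idea concrete, here is a minimal Python sketch of that check
(illustrative names only; the real handle_client_request is C++ inside the
MDS, and the blocking sleep shown here would only be safe if requests are
dispatched per-session/async, per the open question above):

```python
import time

# Hypothetical tunable: 0 disables throttling; 0.01 caps each
# client at roughly 100 requests/sec.
mds_per_client_request_sleep = 0.01

class Session:
    """Stand-in for the MDS Session object, holding only the
    last-request timestamp the proposal needs."""
    def __init__(self):
        self.last_req_timestamp = 0.0

def handle_client_request(session):
    now = time.monotonic()
    elapsed = now - session.last_req_timestamp
    if mds_per_client_request_sleep > 0 and elapsed < mds_per_client_request_sleep:
        # Inject the delay. In a single-dispatch-thread MDS this would
        # have to become a timer-based requeue instead of a sleep,
        # otherwise one busy client stalls everyone.
        time.sleep(mds_per_client_request_sleep - elapsed)
    session.last_req_timestamp = time.monotonic()
    # ... dispatch the request as usual ...
```

With this in place, back-to-back requests from the same session are spaced
out to at least mds_per_client_request_sleep apart, while idle sessions pay
nothing.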
Also, this kind of v1 PoC would obviously have the same
mds_per_client_request_sleep for all clients. v2 could add a
configurable sleep for specific clients.
And in addition to the per-client approach, a second idea would be to
throttle per mount prefix (which would be useful in cases of multiple
clients accessing the same path, e.g. multi-tenant with Manila).
A simple way to achieve this would be to use the session's
client_metadata.root as a key in a hash of last req time (per mount
root), delaying requests as needed (like above).
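As a sketch of that variant (again with illustrative names), the only
change is that the last-request map is keyed on client_metadata.root, so
every client sharing a mount prefix shares one throttle bucket:

```python
import time

# Hypothetical tunable and per-root state; in the MDS this map would
# live alongside the session table.
mds_per_root_request_sleep = 0.01
last_req_by_root = {}

def throttle_by_root(client_metadata_root):
    """Delay the current request if this mount root has been seen
    more recently than the configured interval."""
    now = time.monotonic()
    last = last_req_by_root.get(client_metadata_root, 0.0)
    if mds_per_root_request_sleep > 0 and now - last < mds_per_root_request_sleep:
        time.sleep(mds_per_root_request_sleep - (now - last))
    last_req_by_root[client_metadata_root] = time.monotonic()
```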
The holy grail path throttle solution would be to allow throttling per
subpath, e.g. for a home directory use-case where you have 10000
subdirs in /home/, and we want to throttle any /home/{*}/ to 100Hz.
This could be exposed as an xattr on a directory, but for each request
we'd have to resolve the path upwards to find a req/s quota (like we
do for space quotas) and sleep accordingly.
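The upward path walk could look something like this sketch, where the
quota table stands in for a hypothetical req/s xattr set on directories
(name and lookup mechanism are assumptions, modeled on how space quotas
are resolved):

```python
# Directories that have a rate quota set, e.g. via a hypothetical
# "max request rate" xattr; /home at 100Hz per the example above.
quota_hz = {"/home": 100}

def find_req_quota(path):
    """Walk from the request's path toward the root and return the
    nearest ancestor's req/s quota, or None if none is set."""
    while path:
        if path in quota_hz:
            return quota_hz[path]
        path = path.rsplit("/", 1)[0]  # strip the last component
    return quota_hz.get("/")
```

The cost is one ancestor walk per request, which is the same shape of
work the space-quota check already does, so caching the result per
session/dentry would presumably follow the same pattern.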
- If you're thinking of something more like a
cgroup, how will you
determine how large a pool of operations you will have, and can parcel
out to each client? If you've parceled out 100% of your MDS ops budget,
how will you rebalance things when new clients are added or removed from
the cluster?
I'm not a fan of the cgroup approach because it's nicer if the
throttling can be enforced/configured dynamically on the server-side.
The simple sleep proposed above is inspired by the various osd sleeps
that we've added over the years -- they turn out to be super effective
for busy prod clusters.
If we realize in prod that the sleep is too aggressive, we can just
lower it on the mds's as needed :)
- if a client is holding a file lock, then throttling
it could delay it
releasing locks and that could slow down other (mostly idle) clients
that are contending for it. Do we care? How will we deal with that
situation if so?
That would indeed be a concern, but my understanding from checking
dispatch_client_request is that these are all client md ops like
lookup, create, rm, setxattr; and not the responses to a revoke cap
request from the mds to the client.
Is that right?
Thanks!
Dan
> - Would you need separate tunables for OSD and MDS ops, or is there some
> way to tune both under a single knob?
> --
> Jeff Layton <jlayton(a)redhat.com>
>