Frank,
I wrote the WPQ and cut-off code because the only scheduler at the time
was not servicing other priorities under extreme load. The default op
scheduler prioritized replication ops by placing them in the strict queue,
which meant that as long as there were any replication ops from other
OSDs, no client or backfill ops would be serviced. Once the strict queue
was empty, it would start dequeuing client ops, but because of the way the
token bucket code worked, it would drain the client queue quickly and then
start running backfill/recovery ops, which didn't drain their bucket as
fast. This did not sit well with our VMs with heavy write loads.
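To make the failure mode concrete, here is a toy model in Python; it is
not the real Ceph code and the priority numbers are made up, but it shows
how a strict priority queue starves everything below a constantly
refilled top priority:

from collections import deque

# Toy model, not the actual Ceph code, and the priority numbers are
# illustrative: a strict priority queue always drains the highest
# non-empty priority first, so as long as replication ops keep arriving
# at the top, the client and backfill ops below are never serviced.
strict = {
    196: deque(["rep-op"] * 5),      # replication ops from peer OSDs
    63:  deque(["client-op"] * 5),   # client ops
    10:  deque(["backfill-op"] * 5), # backfill/recovery ops
}

def strict_dequeue(queues):
    for prio in sorted(queues, reverse=True):
        if queues[prio]:
            return queues[prio].popleft()
    return None

# A busy peer keeps refilling the top queue; the lower queues starve.
for _ in range(10):
    strict[196].append("rep-op")
    print(strict_dequeue(strict))  # prints "rep-op" every time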
I wrote WPQ to dequeue each op priority based on the weight of the op
rather than through the token bucket, and showed that it dequeued ops
proportionally to their priority. It meant that sometimes a higher
priority op would be held back to run a lower priority op, but no queue
was ever starved of dequeuing an op like before. An op that had twice the
priority of another op had twice the probability of being dequeued. The op
scheduler in Ceph actually consists of two queues: a strict priority queue
and a TB/WPQ queue. The cut off refers to the op priority number that
separates the strict priority queue from the WPQ or default token bucket.
Setting it to 'high' tells Ceph to put the replication ops in the token
bucket or WPQ rather than the strict queue, and to allow only very small
ops that don't require disk access (heartbeats, mon messages, OSD
messages, etc.) into the strict priority queue, so that all the slow work
is prioritized by the WPQ/TB queue.
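A minimal sketch of the WPQ idea, again not the real implementation and
with made-up priority values, just to show the proportional behavior:

import random
from collections import Counter, deque

# Sketch of the WPQ idea, not the real implementation: each non-empty
# priority class is picked with probability proportional to its
# priority, so an op with twice the priority of another is twice as
# likely to be dequeued, but no class is ever starved.
queues = {
    63: deque(["client-op"] * 5000),
    40: deque(["rep-op"] * 5000),       # illustrative priority
    10: deque(["backfill-op"] * 5000),  # illustrative priority
}

def wpq_dequeue(queues):
    nonempty = {p: q for p, q in queues.items() if q}
    if not nonempty:
        return None
    pick = random.uniform(0, sum(nonempty))  # sum of priorities = weight pool
    for prio, q in nonempty.items():
        pick -= prio
        if pick <= 0:
            return q.popleft()
    return q.popleft()  # float rounding fallback

counts = Counter(wpq_dequeue(queues) for _ in range(2000))
print(counts)  # roughly in the 63:40:10 ratio

The settings that enable this in Ceph are exactly the two lines quoted at
the bottom of this mail.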
With this, we found that we didn't need QoS, as all clients now got a fair
share of I/O instead of some clients being 'lucky' to land on a non-busy
OSD and send many rep ops to a busy OSD that could only service
replication ops and never any client ops. I also found that op priorities
worked as expected. We could raise the number of backfill operations on an
OSD and it would have a negligible impact on clients, since the OSD used
only idle capacity for the backfill and prioritized client traffic. I
assume that if you change the op priority of the different classes of ops,
it would work even more predictably with WPQ, but I don't think you can
change it on the fly; it would require an OSD restart, which I could not
do at the time I tried.
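For reference, the priority knobs I mean are the following; the values
shown should be the upstream defaults, if I remember them right:

osd client op priority = 63
osd recovery op priority = 3

With WPQ these should act as proportional weights, so with both queues
full, client ops would get roughly a 63:3 share versus recovery ops.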
The WPQ did not prevent all blocked I/O, but it did prevent any single
client from being blocked indefinitely. I saw latencies become very tight
across all clients: instead of some clients having very good latency and
others extremely poor latency, each client had statistically the same
latency. The cluster was no longer limited by its slowest drive; the OSD
with the slow drive would still execute client ops, sending rep ops to
other OSDs and thereby generating more load on less loaded OSDs, which
could in turn reduce the load on the overloaded OSD (because the idle
OSDs now had other work to do besides just servicing client ops). This
allowed the cluster to throttle clients appropriately, by increasing
latency on all clients in a more uniform manner, and it allowed the
cluster to reach 100% utilization at the same time.
WPQ was planned to be the default scheduler, but I left the company I was
working for shortly after getting it merged, and my new company wasn't
doing object storage, so I wasn't there to see it become the default. I'm
now at a new company, again working with Ceph, and have made it the
default on our two large production clusters with great success. The
client latencies and backfill pain that my co-workers experienced on a
daily basis have all been alleviated since moving to WPQ.
Honestly, WPQ may do what you need without having to try to configure QoS.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
On Sat, Aug 24, 2019 at 2:08 AM Frank Schilder <frans@dtu.dk> wrote:
Hi Robert,
thanks for your reply. These are actually the settings I found in the
cases I referred to as "other cases" in my mail. These settings could be a
first step. Looking at the documentation, solving the overload problem
might require some of the QoS settings described below "osd op queue" at
https://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/#opera…
.
I see some possibilities, but I'm not sure how to use these settings to
enforce load-dependent rate limiting on clients. As far as I can see, IOPS
QoS does not take backlog into account, which would be important for
distinguishing a burst from a sustained overload. In addition, this
requires mClock, which is labelled experimental.
If anyone could shed some light on what possibilities currently exist
beyond playing with "osd op queue" and "osd op queue cut off", that would
be great. It would also be good to hear about any experience out there
with this problem.
For example, would reducing "osd client op priority" have any effect? As
far as I can see, this only weights recovery against client IO; it says
nothing about the priority of IO already in flight versus new client ops.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Robert LeBlanc <robert@leblancnet.us>
Sent: 23 August 2019 17:28
To: Frank Schilder
Cc: ceph-users
Subject: Re: [ceph-users] ceph fs crashes on simple fio test
The WPQ scheduler may help your clients back off when things get busy.
Put this in your ceph.conf and restart your OSDs.
osd op queue = wpq
osd op queue cut off = high
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1