Frank,

I wrote WPQ and the cut off code because the only scheduler at the time was not servicing other priorities under extreme load. The default op scheduler prioritized replication ops in the strict queue, which meant that as long as there were any replication ops from other OSDs, no client or backfill ops would be serviced. Once the strict queue was empty it would start dequeuing client ops, but the way the token bucket code worked, it would drain the client queue quickly and then start running backfill/recovery ops, which didn't drain that bucket as fast. This did not sit well with our VMs with heavy write loads.
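
To illustrate the starvation described above, here is a toy Python sketch. It is not the Ceph code: the function name, the priority values 2 and 1, and the "one op serviced per round" load model are all made up for illustration, and the token bucket behaviour is not modelled at all.

```python
import heapq
import itertools


def strict_dequeue_counts(n_rounds=1000):
    """Toy strict-priority queue under sustained load (hypothetical model)."""
    heap = []
    seq = itertools.count()  # tie-breaker keeps FIFO order within a priority
    counts = {"rep": 0, "client": 0}
    for _ in range(n_rounds):
        # Each round one replication op and one client op arrive.
        # Priorities are negated because heapq is a min-heap.
        heapq.heappush(heap, (-2, next(seq), "rep"))
        heapq.heappush(heap, (-1, next(seq), "client"))
        # The OSD only has capacity to service one op per round, and a
        # strict queue always picks the highest priority available.
        _, _, op = heapq.heappop(heap)
        counts[op] += 1
    return counts


counts = strict_dequeue_counts()
```

In this model the client ops pile up and are never serviced for as long as replication ops keep arriving, which is the behaviour we were seeing under extreme load.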

I wrote WPQ to dequeue each op priority based on the weight of the op rather than using a token bucket, and showed that it dequeued ops in proportion to their priority. It meant that sometimes a higher priority op would be blocked to run a lower priority op, but no queue was ever starved from dequeuing an op like before. An op that had twice the priority of another op had twice the probability of being dequeued. The op scheduler in Ceph actually consists of two queues, a strict priority queue and a TB/WPQ queue. The cut off refers to the op priority number that separates the strict priority queue from the WPQ or default token bucket. By setting it to high, you are telling Ceph to include the replication ops in the token bucket or WPQ rather than the strict queue, and to allow only very small ops that don't require disk access into the strict priority queue (heartbeats, Mon messages, OSD messages, etc.), so that all the slow work is prioritized by the WPQ/TB queue.
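
The "twice the priority, twice the probability" idea can be sketched roughly as follows. This is a simplified Python model, not the actual WPQ implementation in Ceph: the class and function names and the priorities 2 and 1 are assumptions for illustration, and it ignores the strict queue and anything else the real scheduler takes into account.

```python
import random
from collections import deque


class WeightedPriorityQueue:
    """Toy model: one FIFO per priority; a non-empty FIFO is picked with
    probability proportional to its priority."""

    def __init__(self, seed=None):
        self._queues = {}  # priority -> deque of ops
        self._rng = random.Random(seed)

    def enqueue(self, priority, op):
        self._queues.setdefault(priority, deque()).append(op)

    def dequeue(self):
        # Only priorities that currently have ops take part in the draw.
        priorities = [p for p, q in self._queues.items() if q]
        if not priorities:
            raise IndexError("dequeue from empty queue")
        chosen = self._rng.choices(priorities, weights=priorities, k=1)[0]
        return self._queues[chosen].popleft()


def simulate(n_draws=30000, seed=1):
    """Keep both classes backlogged and count what gets dequeued."""
    wpq = WeightedPriorityQueue(seed=seed)
    counts = {"client": 0, "backfill": 0}
    for _ in range(n_draws):
        wpq.enqueue(2, "client")    # hypothetical priority 2
        wpq.enqueue(1, "backfill")  # hypothetical priority 1
        counts[wpq.dequeue()] += 1
    return counts


counts = simulate()
```

With both classes always backlogged, the client ops should come out roughly twice as often as the backfill ops, and the backfill count never drops to zero, so neither class is starved.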

With this, we found that we didn't need QoS, as all clients now got a fair share of I/O instead of some clients being 'lucky' enough to land on a non-busy OSD and send many rep ops to a busy OSD that could only service replication ops and never any client ops. I also found that op priorities worked as expected. We could raise the number of backfill operations on an OSD and it would negligibly impact clients, as it started using only idle capacity to do the backfill and prioritized client traffic. I assume that if you change the op priority of the different classes of ops, it would work more predictably with WPQ, but I don't think you can change it on the fly; it would require an OSD reboot, which I could not do at the time I tried.

The WPQ did not prevent all blocked I/O, but what it did was prevent any single client from being blocked indefinitely. I saw latencies become very tight across all clients: instead of some clients having very good latency and others extremely poor latency, each client had statistically the same latency. No longer was the cluster limited by the slowest drive in the cluster. The OSD with the slow drive would now execute client ops, sending rep ops to other OSDs and helping to generate more load on a less loaded OSD, which would then possibly reduce the load on the overloaded OSD (because now the idle OSD had other work to do than just servicing client ops). This allowed the cluster to appropriately throttle clients by increasing latency on all clients in a more uniform manner. It allows the cluster to achieve 100% utilization at the same time.

WPQ was planned to be the default scheduler, but I left the company I was working for shortly after getting it merged, and my new company wasn't doing object storage, so I wasn't there to see it become the default. I'm at a new company and again working with Ceph, and have made it the default on our two large production clusters with great success. The client latencies and backfill pain that my co-workers experienced on a daily basis have all been alleviated since moving to WPQ.

Honestly, WPQ may do what you need without having to try to configure QoS.

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Sat, Aug 24, 2019 at 2:08 AM Frank Schilder <frans@dtu.dk> wrote:

Hi Robert,

thanks for your reply. These are actually settings I found in cases I referred to with "other cases" in my mail. These settings could be a first step. Looking at the documentation, solving the overload problem might require some QoS settings I found below the description of "osd op queue" https://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/#operations .

I see some possibilities, but I'm not sure how to use these settings to enforce load dependent rate limiting on clients. As far as I can see, IOPs QoS does not take backlog into account, which would be important for distinguishing a burst from a sustained overload. In addition, this requires mClock, which is labelled experimental.

If anyone could shed some light on what possibilities currently exist beyond playing with "osd op queue" and "osd op queue cut off", that would be great. I would also be interested in any experience out there with this problem.

For example, would reducing "osd client op priority" have any effect? As far as I can see, this is only for weighting between recovery and client IO, not for priority of IO already in flight versus new client ops.

Best regards,

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Robert LeBlanc <robert@leblancnet.us>
Sent: 23 August 2019 17:28
To: Frank Schilder
Cc: ceph-users
Subject: Re: [ceph-users] ceph fs crashes on simple fio test

The WPQ scheduler may help your clients back off when things get busy.

Put this in your ceph.conf and restart your OSDs.
osd op queue = wpq
osd op queue cut off = high
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1