Hi Robert and Paul,
I checked today: the scheduler is WPQ with cut off low (I was using the defaults). I changed the cut off to high in the config database, but I still need to restart all OSDs for the change to take effect.
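For reference, the change amounts to something like the following (assuming the config database on Mimic or newer and a systemd deployment; adjust to your setup):

ceph config set osd osd_op_queue_cut_off high
# then, per host, restart the OSDs to pick up the new value:
systemctl restart ceph-osd.target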
I'm not sure how much it will help, though. Maybe heartbeats will at least get through despite the congestion, which would be a plus in any case.
My observation was that not some but *all* OSDs were heavily overloaded by just one client doing aggressive IO. Following your explanations, I'm not yet convinced that the changed cut off will limit the rate of client OPs accepted by the cluster. Well, I will try again after the OSD restarts and report back.
Thanks for your help!
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Robert LeBlanc <robert@leblancnet.us>
Sent: 26 August 2019 22:24
To: Paul Emmerich
Cc: Frank Schilder; ceph-users
Subject: Re: [ceph-users] Re: ceph fs crashes on simple fio test
If it is the default, then the documentation should be updated. [0]
[0]
https://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/?highl…
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
On Mon, Aug 26, 2019 at 1:22 PM Robert LeBlanc <robert@leblancnet.us> wrote:
High should be the default with WPQ.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
On Mon, Aug 26, 2019 at 10:44 AM Paul Emmerich <paul.emmerich@croit.io> wrote:
WPQ has been the default queue for quite some time now (since Luminous?). However, the default cut off is low. I remember changing it to high in some early Jewel (or Kraken?) version, and it helped a lot with the only cluster we had back then. We've been running all of our clusters with cut off high ever since. Is there any reason why this isn't the default?
Paul
--
Paul Emmerich
Looking for help with your Ceph cluster? Contact us at
https://croit.io
croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
On Mon, Aug 26, 2019 at 6:21 PM Robert LeBlanc <robert@leblancnet.us> wrote:
>
> Frank,
>
> I wrote the WPQ and the cut off code because the only scheduler at the time was not servicing other priorities under extreme load. The default op scheduler put replication ops in the strict queue, which meant that as long as there were any replication ops from other OSDs, no client or backfill ops would be serviced. Once the strict queue was empty it would start dequeuing client ops, but the way the token bucket code worked, it would drain the client queue quickly and then start running backfill/recovery ops, which didn't drain their bucket as fast. This did not sit well with our VMs under heavy write loads.
>
> I wrote WPQ to dequeue each op based on the weight of its priority rather than through the token bucket queue, and showed that it dequeued ops proportionally to their priority. That means a higher priority op is sometimes held back to run a lower priority op, but no queue is ever starved the way it was before: an op with twice the priority of another has twice the probability of being dequeued. The op scheduler in Ceph actually consists of two queues, a strict priority queue and a TB/WPQ queue. The cut off refers to the op priority number that separates the strict priority queue from the WPQ or default token bucket. By setting it to high, you are telling Ceph to put the replication ops into the token bucket or WPQ rather than the strict queue, and to allow only very small ops that don't require disk access (heartbeats, mon messages, OSD messages, etc.) into the strict priority queue, so that all the slow work is prioritized by the WPQ/TB queue.
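> To put rough numbers on the weighting, the stock priorities are (shown only for illustration, these are the defaults)
>
> osd client op priority = 63
> osd recovery op priority = 3
>
> so under WPQ a client op is about 63/3 = 21 times as likely to be dequeued as a recovery op, rather than always pre-empting it the way the strict queue did.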
>
> With this, we found that we didn't need QoS, as all clients now got a fair share of I/O, instead of some clients being 'lucky' to land on a non-busy OSD and send many rep ops to a busy OSD that could only service replication ops and never any client ops. I also found that op priorities worked as expected: we could raise the number of backfill operations on an OSD with negligible impact on clients, as it used only idle capacity for the backfill and prioritized client traffic. I assume that if you change the op priority of the different classes of ops, it would work more predictably with WPQ, but I don't think you can change it on the fly; it would require an OSD restart, which I could not do at the time I tried.
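> As a concrete example of the backfill point, that knob can be turned up on running OSDs (the value here is just for illustration):
>
> ceph tell osd.* injectargs '--osd-max-backfills 4'
>
> and under WPQ the extra backfill work should only eat into idle capacity.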
>
> WPQ did not prevent all blocked I/O, but it did prevent any single client from being blocked indefinitely. I saw latencies become very tight across all clients: instead of some clients having very good latency and others extremely poor latency, each client had statistically the same latency. The cluster was no longer limited by its slowest drive; the OSD with the slow drive would now execute client ops and send rep ops to other OSDs, generating more load on less loaded OSDs, which could in turn reduce the load on the overloaded OSD (because the idle OSDs now had work to do other than just servicing client ops). This allowed the cluster to throttle clients by increasing latency on all of them in a more uniform manner, and to reach 100% utilization at the same time.
>
> WPQ was planned to be the default scheduler, but I left the company I was working for shortly after getting it merged, and my new company wasn't doing object storage, so I wasn't around to see it become the default. I'm at a new company, again working with Ceph, and have made it the default on our two large production clusters with great success. The client latency and backfill pain that my co-workers experienced on a daily basis have all been alleviated since moving to WPQ.
>
> Honestly, WPQ may do what you need without having to try to configure QoS.
>
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
>
> On Sat, Aug 24, 2019 at 2:08 AM Frank Schilder <frans@dtu.dk> wrote:
>>
>> Hi Robert,
>>
>> thanks for your reply. These are actually the settings I found in the cases I referred to as "other cases" in my mail, and they could be a first step. Looking at the documentation, though, solving the overload problem might require some of the QoS settings described below "osd op queue" at https://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/#opera… .
>>
>> I see some possibilities, but I'm not sure how to use these settings to enforce load-dependent rate limiting on clients. As far as I can see, IOPS QoS does not take backlog into account, which would be important for distinguishing a burst from sustained overload. In addition, it requires mClock, which is labelled experimental.
>>
>> If anyone could shed some light on what possibilities currently exist beyond playing with "osd op queue" and "osd op queue cut off", that would be great, as would any reports of experience with this problem.
>>
>> For example, would reducing "osd client op priority" have any effect? As far as I can see, it only weights recovery IO against client IO; it does not prioritize IO already in flight over new client OPs.
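>> Concretely, I mean something like this in ceph.conf (the value is hypothetical, just to illustrate the question):
>>
>> osd client op priority = 40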
>>
>> Best regards,
>>
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Robert LeBlanc <robert@leblancnet.us>
>> Sent: 23 August 2019 17:28
>> To: Frank Schilder
>> Cc: ceph-users
>> Subject: Re: [ceph-users] ceph fs crashes on simple fio test
>>
>> The WPQ scheduler may help your clients back off when things get busy.
>>
>> Put this in your ceph.conf and restart your OSDs.
>> osd op queue = wpq
>> osd op queue cut off = high
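>> After the restart you can check what a daemon is actually running with via its admin socket, e.g.:
>>
>> ceph daemon osd.0 config get osd_op_queue
>> ceph daemon osd.0 config get osd_op_queue_cut_off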
>> ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-leave@ceph.io