Hi Peter,
your suggestion pointed me to the right spot.
I didn't know about the feature that lets Ceph read from replica
PGs.
So, I found two functions in osd/PrimaryLogPG.cc:
"check_laggy" and "check_laggy_requeue". Both begin with a check
whether the peers have the Octopus feature; if not, the function is
skipped. This explains why the problem started once about half the
cluster was updated.
To verify this, I added "return true" as the first line of both
functions. With that change the issue is gone, but I don't know what
problems it could trigger, and I know it doesn't fix the root cause.
I think I will open a bug ticket with these findings.
osd_op_queue_cut_off is set to high,
and ICMP rate limiting should not be happening.
Thanks
Manuel
On Thu, 10 Jun 2021 11:28:48 +0200
Peter Lieven <pl(a)kamp.de> wrote:
On 10.06.21 at 11:08, Manuel Lausch wrote:
Hi,
does no one have an idea what could cause this issue, or how I could
debug it?
In a few days I have to go live with this cluster. If I don't find a
solution, I will have to go live with Nautilus.
Hi Manuel,
I had similar issues with Octopus and I am thus stuck with Nautilus.
Can you debug the slow ops and see whether they are caused by the
status "waiting for readable"?
I suspect it has something to do with the new feature in
Octopus that reads from all OSDs regardless of whether
they are the primary for a PG or not.
Can you also verify that osd_op_queue_cut_off is set to high and that
ICMP rate limiting is disabled on your hosts?
Peter