Hi Peter,
your suggestion pointed me to the right spot.
I didn't know about the feature that lets Ceph read from replica
PGs.
So, I found two functions in osd/PrimaryLogPG.cc:
"check_laggy" and "check_laggy_requeue". Both begin with a check
whether the peers have the Octopus feature; if not, the function is
skipped. This explains why the problem started once about half the
cluster was updated.
To verify this, I added "return true" as the first line of both
functions. With that change the issue is gone, but I don't know what
problems it could trigger, and I know it doesn't fix the root cause.
I think I will open a bug ticket with these findings.
osd_op_queue_cut_off is set to high,
and ICMP rate limiting should not be happening.
Thanks
Manuel
On Thu, 10 Jun 2021 11:28:48 +0200
Peter Lieven <pl(a)kamp.de> wrote:
On 10.06.21 at 11:08, Manuel Lausch wrote:
Hi,
does no one have an idea what could cause this issue, or how I could
debug it?
In a few days I have to go live with this cluster. If I don't find a
solution, I will have to go live with Nautilus.
Hi Manuel,
I had similar issues with Octopus and I am thus stuck with Nautilus.
Can you debug the slow ops and see whether they are caused by the
status "waiting for readable"?
I suspect it has something to do with the new feature in
Octopus that reads from all OSDs regardless of whether
they are the primary for a PG or not.
Can you also verify that osd_op_queue_cut_off is set to high and that
ICMP rate limiting is disabled on your hosts?
Peter