On 10.06.21 at 17:45, Manuel Lausch wrote:
Hi Peter,
your suggestion pointed me to the right spot.
I didn't know about the feature that lets Ceph read from replica PGs.
Moving on, I found two functions in osd/PrimaryLogPG.cc:
"check_laggy" and "check_laggy_requeue". Both start with a check of
whether the peers have the octopus features. If not, the function is
skipped. This explains why the problem began after about half of the
cluster had been updated.
To verify this, I added "return true" as the first line of both
functions. With that change the issue is gone. But I don't know what
problems this could trigger, and I know the root cause is not fixed
by it.
I think I will open a bug ticket with this knowledge.
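As a rough illustration of the control flow described above (not the actual code in osd/PrimaryLogPG.cc), a simplified C++ model might look like the following. The struct, its fields, and the function name check_laggy_model are hypothetical; the "laggy" flag stands in for whatever the real function evaluates once the feature check has passed, and the meaning of the return value is an assumption.

```cpp
// Simplified model only; none of these names come from the Ceph source.
struct PGModel {
    bool all_peers_have_octopus;  // do all peers advertise the octopus features?
    bool laggy;                   // placeholder for the real laggy condition
};

// Assumed convention: true means the op may proceed, false means it must wait.
// debug_patch models adding "return true" as the first line of the function.
bool check_laggy_model(const PGModel& pg, bool debug_patch = false) {
    if (debug_patch) {
        return true;  // the check is bypassed entirely
    }
    if (!pg.all_peers_have_octopus) {
        return true;  // mixed-version cluster: the check is skipped
    }
    return !pg.laggy;  // all peers on octopus: the check is actually applied
}
```

In this model the laggy condition only starts to matter once every peer reports the octopus features, and the debugging patch hides it again without addressing why the PG is considered laggy.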
I wonder if I faced the same issue. The issue I had occurred when OSDs came back up and
peering started.
My cluster was a fresh octopus install, so I think the min osd release was set to octopus.
Is it generally safe to leave this switch at nautilus while running octopus, so as to
stay on a maintained release?
osd_op_queue_cut_off is set to high
and icmp rate limiting should not happen
It could if you choose fast shutdown and connections to the OSD daemon are refused with
icmp port unreachable?!
Peter