I've also seen this problem once on Nautilus, with no obvious reason for the
slowness. In my case it was a rather old cluster that had been upgraded all
the way from Firefly.
--
Paul Emmerich
On Tue, Feb 18, 2020 at 5:52 PM Wido den Hollander <wido(a)42on.com> wrote:
>
>
>
> On 8/27/19 11:49 PM, Bryan Stillwell wrote:
> > We ran into a problem this afternoon on our test cluster, which is running
> > Nautilus (14.2.2). It seems that any time PGs move on the cluster (from
> > marking an OSD down, setting the primary-affinity to 0, or using the
> > balancer), a large number of the OSDs in the cluster peg the CPU cores
> > they're running on for a while, which causes slow requests. From what I can
> > tell, it appears to be related to slow peering caused by osd_pg_create()
> > taking a long time.
> >
> > This was seen on quite a few OSDs while waiting for peering to complete:
> >
> > # ceph daemon osd.3 ops
> > {
> >     "ops": [
> >         {
> >             "description": "osd_pg_create(e179061 287.7a:177739 287.9a:177739 287.e2:177739 287.e7:177739 287.f6:177739 287.187:177739 287.1aa:177739 287.216:177739 287.306:177739 287.3e6:177739)",
> >             "initiated_at": "2019-08-27 14:34:46.556413",
> >             "age": 318.25234538000001,
> >             "duration": 318.25241895300002,
> >             "type_data": {
> >                 "flag_point": "started",
> >                 "events": [
> >                     {
> >                         "time": "2019-08-27 14:34:46.556413",
> >                         "event": "initiated"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:34:46.556413",
> >                         "event": "header_read"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:34:46.556299",
> >                         "event": "throttled"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:34:46.556456",
> >                         "event": "all_read"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:35:12.456901",
> >                         "event": "dispatched"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:35:12.456903",
> >                         "event": "wait for new map"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:40:01.292346",
> >                         "event": "started"
> >                     }
> >                 ]
> >             }
> >         },
> > ...snip...
> >         {
> >             "description": "osd_pg_create(e179066 287.7a:177739 287.9a:177739 287.e2:177739 287.e7:177739 287.f6:177739 287.187:177739 287.1aa:177739 287.216:177739 287.306:177739 287.3e6:177739)",
> >             "initiated_at": "2019-08-27 14:35:09.908567",
> >             "age": 294.900191001,
> >             "duration": 294.90068416899999,
> >             "type_data": {
> >                 "flag_point": "delayed",
> >                 "events": [
> >                     {
> >                         "time": "2019-08-27 14:35:09.908567",
> >                         "event": "initiated"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:35:09.908567",
> >                         "event": "header_read"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:35:09.908520",
> >                         "event": "throttled"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:35:09.908617",
> >                         "event": "all_read"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:35:12.456921",
> >                         "event": "dispatched"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:35:12.456923",
> >                         "event": "wait for new map"
> >                     }
> >                 ]
> >             }
> >         }
> >     ],
> >     "num_ops": 6
> > }
> >
> >
> > That "wait for new map" message made us think something was getting
> > hung up on the monitors, so we restarted them all without any luck.
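
To see where the time actually goes in ops output like the one quoted above,
the gaps between consecutive events can be computed. What follows is only a
minimal sketch in Python, assuming the timestamp format shown in the quoted
output (other Ceph releases may print it differently). SAMPLE, event_gaps and
slowest_gap are names made up for illustration; they are not part of Ceph.

```python
import json
from datetime import datetime

# Assumption: timestamps look like the ones in the quoted output.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"

# Trimmed copy of the first op quoted in this thread (description shortened,
# unrelated fields dropped).
SAMPLE = """
{
  "ops": [
    {
      "description": "osd_pg_create(e179061 ...)",
      "type_data": {
        "flag_point": "started",
        "events": [
          {"time": "2019-08-27 14:34:46.556413", "event": "initiated"},
          {"time": "2019-08-27 14:34:46.556413", "event": "header_read"},
          {"time": "2019-08-27 14:34:46.556299", "event": "throttled"},
          {"time": "2019-08-27 14:34:46.556456", "event": "all_read"},
          {"time": "2019-08-27 14:35:12.456901", "event": "dispatched"},
          {"time": "2019-08-27 14:35:12.456903", "event": "wait for new map"},
          {"time": "2019-08-27 14:40:01.292346", "event": "started"}
        ]
      }
    }
  ]
}
"""


def event_gaps(op):
    """Return (seconds, from_event, to_event) for each pair of consecutive events."""
    # Sort by timestamp only; the sort is stable, so events with identical
    # timestamps keep the order in which the OSD reported them.
    events = sorted(
        ((datetime.strptime(e["time"], TIME_FORMAT), e["event"])
         for e in op["type_data"]["events"]),
        key=lambda pair: pair[0],
    )
    return [
        ((later - earlier).total_seconds(), before, after)
        for (earlier, before), (later, after) in zip(events, events[1:])
    ]


def slowest_gap(op):
    """Return the consecutive-event pair with the largest gap.

    Raises ValueError if the op has fewer than two events.
    """
    return max(event_gaps(op), key=lambda gap: gap[0])


for op in json.loads(SAMPLE)["ops"]:
    seconds, before, after = slowest_gap(op)
    print(f"{seconds:.1f}s between {before!r} and {after!r}")
```

For the first op quoted above, this reports about 288.8 s between "wait for
new map" and "started", against roughly 26 s between "all_read" and
"dispatched". To use it on a live OSD, the output of `ceph daemon osd.3 ops`
could be saved to a file and loaded with json.load() in place of SAMPLE.
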
> >
> > I'll keep investigating, but so far my Google searches aren't pulling
> > anything up, so I wanted to see if anyone else is running into this.
> >
>
> I've seen this twice now on a ~1400 OSD cluster running Nautilus.
>
> I created a bug report for this: https://tracker.ceph.com/issues/44184
>
> Did you make any progress on this or run into it a second time?
>
> Wido
>
> > Thanks,
> > Bryan
> > _______________________________________________
> > ceph-users mailing list -- ceph-users(a)ceph.io
> > To unsubscribe send an email to ceph-users-leave(a)ceph.io
> >