We've run into a problem on our test cluster this afternoon which is running Nautilus
(14.2.2). It seems that any time PGs move on the cluster (from marking an OSD down,
setting the primary-affinity to 0, or by using the balancer), a large number of the OSDs
in the cluster peg the CPU cores they're running on for a while which causes slow
requests. From what I can tell it appears to be related to slow peering caused by
osd_pg_create() taking a long time.
This was seen on quite a few OSDs while waiting for peering to complete:
# ceph daemon osd.3 ops
{
"ops": [
{
"description": "osd_pg_create(e179061 287.7a:177739
287.9a:177739 287.e2:177739 287.e7:177739 287.f6:177739 287.187:177739 287.1aa:177739
287.216:177739 287.306:177739 287.3e6:177739)",
"initiated_at": "2019-08-27 14:34:46.556413",
"age": 318.25234538000001,
"duration": 318.25241895300002,
"type_data": {
"flag_point": "started",
"events": [
{
"time": "2019-08-27 14:34:46.556413",
"event": "initiated"
},
{
"time": "2019-08-27 14:34:46.556413",
"event": "header_read"
},
{
"time": "2019-08-27 14:34:46.556299",
"event": "throttled"
},
{
"time": "2019-08-27 14:34:46.556456",
"event": "all_read"
},
{
"time": "2019-08-27 14:35:12.456901",
"event": "dispatched"
},
{
"time": "2019-08-27 14:35:12.456903",
"event": "wait for new map"
},
{
"time": "2019-08-27 14:40:01.292346",
"event": "started"
}
]
}
},
...snip...
{
"description": "osd_pg_create(e179066 287.7a:177739
287.9a:177739 287.e2:177739 287.e7:177739 287.f6:177739 287.187:177739 287.1aa:177739
287.216:177739 287.306:177739 287.3e6:177739)",
"initiated_at": "2019-08-27 14:35:09.908567",
"age": 294.900191001,
"duration": 294.90068416899999,
"type_data": {
"flag_point": "delayed",
"events": [
{
"time": "2019-08-27 14:35:09.908567",
"event": "initiated"
},
{
"time": "2019-08-27 14:35:09.908567",
"event": "header_read"
},
{
"time": "2019-08-27 14:35:09.908520",
"event": "throttled"
},
{
"time": "2019-08-27 14:35:09.908617",
"event": "all_read"
},
{
"time": "2019-08-27 14:35:12.456921",
"event": "dispatched"
},
{
"time": "2019-08-27 14:35:12.456923",
"event": "wait for new map"
}
]
}
}
],
"num_ops": 6
}
That "wait for new map" message made us think something was getting hung up on
the monitors, so we restarted them all without any luck.
I'll keep investigating, but so far my google searches aren't pulling anything up
so I wanted to see if anyone else is running into this?
Thanks,
Bryan