Quoting Stefan Kooman (stefan(a)bit.nl):
Hi,
As I said in an earlier mail to this list, we re-balanced ~60% of the
CephFS metadata pool to NVMe-backed devices: roughly 422 M objects (1.2
billion with replication), with 512 PGs allocated to the pool. While
rebalancing we suffered quite a few SLOW_OPS. Memory, CPU and
device IOPS capacity were not limiting factors as far as we can see (plenty of
headroom, nowhere near max capacity). Many of the slow ops showed the
following events:
    {
        "time": "2019-12-19 09:41:02.712010",
        "event": "reached_pg"
    },
    {
        "time": "2019-12-19 09:41:02.712014",
        "event": "waiting for rw locks"
    },
    {
        "time": "2019-12-19 09:41:02.881939",
        "event": "reached_pg"
    }
... and this pattern repeated hundreds of times per op, taking ~30 seconds to complete.
Does this indicate PG lock contention?
If so ... would we need to give the metadata pool more PGs to avoid this?
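For reference, event histories like the one above come from the OSD admin
socket (e.g. `ceph daemon osd.0 dump_historic_ops`; osd.0 is just a
placeholder). A small sketch, assuming that command's usual JSON layout
(`ops[].type_data.events[].event`), to count how often each op sat in
"waiting for rw locks":

```python
import json
import sys

def count_rwlock_waits(dump):
    """Given a parsed `ceph daemon osd.N dump_historic_ops` dump, return
    the number of "waiting for rw locks" events recorded per op."""
    return [
        sum(1 for e in op.get("type_data", {}).get("events", [])
            if e.get("event") == "waiting for rw locks")
        for op in dump.get("ops", [])
    ]

if __name__ == "__main__":
    # Usage: ceph daemon osd.0 dump_historic_ops | python3 count_waits.py
    print(count_rwlock_waits(json.load(sys.stdin)))
```

Ops with counts in the hundreds, while CPU/IOPS are idle, point at
serialization on the PG rather than at the device.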
The metadata pool is only ~166 MiB big ... but with loads of OMAP data.
Most advice on PG planning is concerned with the _amount_ of data ... but the
metadata pool (and this might also be true for RGW index pools) seems to be a
special case.
This does seem to be the case. We moved the data to a subset of the
cluster, which turned out not to be a good idea: those OSDs suffered
badly. Spreading the workload across all OSDs (reverting the change)
fixed the issues. If you have *lots* of small files and/or directories
in your cluster ... scale your metadata PGs accordingly.
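To make "scale accordingly" concrete: the classic pgcalc heuristic budgets
PGs per pool as OSDs × target-PGs-per-OSD × the pool's expected share of the
workload, divided by replica size and rounded up to a power of two. The
cluster sizes and ratios below are made-up illustrations, not our numbers;
the point is that weighting the metadata pool by its omap/IOPS share, rather
than its ~166 MiB of raw data, yields far more PGs (and thus more independent
PG locks):

```python
import math

def pg_count(num_osds, replica_size, pool_ratio, target_pgs_per_osd=100):
    """Classic pgcalc heuristic: share of the cluster-wide PG budget,
    rounded up to the nearest power of two."""
    raw = num_osds * target_pgs_per_osd * pool_ratio / replica_size
    return 2 ** math.ceil(math.log2(raw)) if raw >= 1 else 1

# Hypothetical 24-OSD cluster, 3x replication:
print(pg_count(24, 3, 0.05))  # sized by tiny data share -> 64
print(pg_count(24, 3, 0.25))  # sized by omap/IOPS share -> 256
```

Four times as many PGs means four times as many PG locks over which the
omap-heavy metadata ops can spread.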
Gr. Stefan
--
| BIT BV
https://www.bit.nl/ Kamer van Koophandel 09090351
| GPG: 0xD14839C6 +31 318 648 688 / info(a)bit.nl