Dear Igor,
Thanks a lot for the analysis and recommendations.
Here is a brief analysis:
1) Your DB is pretty large: 27 GB on the DB device (filling it
completely) and 279 GB on the main spinning one. I.e. RocksDB is
experiencing a huge spillover to the slow main device; expect a
performance drop. And generally the DB is highly under-provisioned.
Yes, we have known about this issue for a long time. This cluster, and
in particular its SSD devices, was dimensioned in the pre-BlueStore
days. We haven't yet found a viable migration path towards something
more sensible (with ~1500 OSDs on two separate clusters and quite a bit
of user data on them).
2) The main device's space is highly fragmented: 0.84012572151981013,
where 1.0 is the maximum. I can't say for sure, but I presume it's
pretty full as well.
Not too full:
$ ceph osd df | sort -n
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP   META    AVAIL   %USE  VAR  PGS STATUS
[...]
 3  hdd  7.27699  1.00000 7.3 TiB 4.6 TiB 4.3 TiB 49 MiB 319 GiB 2.7 TiB 63.46 1.10   0 down
The above are only indirect factors in the current failure, though;
primarily I just want to make you aware of them, since they might cause
other issues later on.
Thanks.
The major reason preventing the OSD from starting properly is a BlueFS
attempt to claim additional space (~52 GB); see in the log:
[...]
I can suggest the following workarounds to start the OSD for now:
1) Switch the allocator to 'stupid' by setting the 'bluestore allocator'
parameter to 'stupid'. I presume you currently have the default setting
of 'bitmap'. This will allow more contiguous allocations for the BlueFS
space claim, and hence a shorter log write. But given the high
fragmentation of the main disk, this might not be enough. The 'stupid'
allocator has some issues of its own (e.g. high RAM utilization over
time in some cases), but they're rather irrelevant for OSD startup.
Thanks, we'll try that & report.
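For reference, we would apply it roughly like this on the host of the
down OSD (osd.3 in the listing above) - just our own sketch, assuming a
local ceph.conf override is picked up at startup; please correct us if
'ceph config set' would be the better mechanism:

  # /etc/ceph/ceph.conf on the host of osd.3 (our guess at the right place)
  [osd.3]
  bluestore allocator = stupid

  # then restart the OSD so the new allocator is used
  $ systemctl restart ceph-osd@3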
2) Increase the 'bluefs_max_log_runway' parameter to 8-12 MB (the
default value is 4 MB).
I suggest starting with 1) and then additionally proceeding with 2) if
the first one doesn't help.
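Understood. If we do end up needing 2) as well, we would presumably add
the following to the same section and restart the OSD once more (again
only a sketch; 12582912 bytes for ~12 MB, with 4194304 being the default
you mentioned):

  [osd.3]
  # our guess at the upper end of the suggested 8-12 MB range
  bluefs_max_log_runway = 12582912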
Once the OSD is up and the cluster is healthy, please consider adding
more DB space and/or OSDs to your cluster to fight the dangerous
factors I started with.
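We will. If we manage to free up or add SSD space, is growing the
existing DB volumes in place the recommended path? We would guess at
something like the following per OSD, after enlarging the underlying
partition/LV (purely a sketch on our side, not tested here):

  $ systemctl stop ceph-osd@3
  # let BlueFS take over the additional space on the enlarged DB device
  $ ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-3
  $ systemctl start ceph-osd@3

Or would you rather recommend redeploying the OSDs with bigger DB
devices?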
BTW, I am wondering what the primary payload for your cluster is - RGW
or something else?
The payload has changed over the lifetime of the cluster (which has been
in operation for more than four years, growing and being upgraded).
Initially it was almost exclusively RBD (for OpenStack VMs), then we
added RadosGW (still all with 3-way replication). As RadosGW/S3 became
more popular, we added an EC 8+3 pool. (We also added an NVMe-only pool,
which is used for RadosGW indexes.) Lately this EC 8+3 pool has become
very popular, and users have been storing hundreds of terabytes on it.
Unfortunately they tend to use a small object size (~1 MB per object).
That's why we have close to a billion objects in the EC pool now, and
things are starting to fail.
As I said, it's a problem of finding a viable migration path to a better
configuration. Unfortunately we cannot just throw away the current
installation and start from scratch...
Cheers,
--
Simon.