Hi,
> 1. Ceph dashboard shows more PGs for an OSD than I can extract from the
> pg dump information. I assume that the value in the dashboard is coming
> from Prometheus, though I'm not sure. Querying Prometheus through Grafana
> also gives me the same figures as the one I see in the dashboard (which is
> incorrect). I can't find out how this value is calculated... All the
> information I can find regarding calculating the PGs stored on an OSD is
> derived from pg dump :-( Help help help

If you could show the actual difference, it would be easier to respond or explain.
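
For example, something along these lines is what I would use to count the
PGs per OSD from the pg dump JSON and put it next to the "pgs" column of
"ceph osd df" (which is, as far as I know, also what the ceph_osd_numpg
metric in the prometheus module exports). Only a rough sketch; the JSON
field names ("pg_map", "pg_stats", "up", "acting", "pgs") are what a
Nautilus cluster returns here, please double-check on yours:

#!/usr/bin/env python3
# Compare the per-OSD PG count derived from "ceph pg dump" with the
# "pgs" column of "ceph osd df".  Sketch only, JSON field names assumed
# from a Nautilus cluster.
import json
import subprocess
from collections import Counter

def ceph_json(*args):
    out = subprocess.check_output(("ceph",) + args + ("--format", "json"))
    return json.loads(out)

pg_dump = ceph_json("pg", "dump")
# Nautilus wraps the pg stats in a "pg_map" object, older releases don't.
pg_stats = pg_dump.get("pg_map", pg_dump).get("pg_stats", [])

up_count, acting_count = Counter(), Counter()
for pg in pg_stats:
    for osd in pg["up"]:
        if 0 <= osd < 2147483647:          # skip "none" placeholders
            up_count[osd] += 1
    for osd in pg["acting"]:
        if 0 <= osd < 2147483647:
            acting_count[osd] += 1

for node in ceph_json("osd", "df").get("nodes", []):
    osd = node["id"]
    print("osd.%-4d up=%-5d acting=%-5d osd_df_pgs=%s"
          % (osd, up_count[osd], acting_count[osd], node.get("pgs", "?")))

If the up/acting counts and the osd df value disagree for some OSDs,
posting a few of those lines here would make it much easier to see what
the dashboard is actually showing.
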
Quoting Kristof Coucke <kristof.coucke(a)gmail.com>:
> Hi,
>
> My cluster is in a warning state as it is rebalancing after I've added a
> bunch of disks. (no issue here!)
> Though, there are a few things which I just cannot understand... I hope
> someone can help me... I'm getting hopeless finding the answers... If you
> can answer any question (even one), it will be greatly appreciated.
> Environment: Ceph Nautilus 14.2.11, 282 OSDs, mainly erasure coding.
>
> 1. Ceph dashboard shows more PGs for an OSD than I can extract from the
> pg dump information. I assume that the value in the dashboard is coming
> from Prometheus, though I'm not sure. Querying Prometheus through Grafana
> also gives me the same figures as the one I see in the dashboard (which is
> incorrect). I can't find out how this value is calculated... All the
> information I can find regarding calculating the PGs stored on an OSD is
> derived from pg dump :-( Help help help
> 2. As the cluster is rebalancing and there is a huge gap between the acting
> and the up osd's for a bunch of pg's, some disks sometimes have slow
> responses due to the backfilling and are from time to time marked as down.
> (I know I can work around this one by setting "nodown" temporarily). If the
> disk which is going down is in an acting set of a specific pg (not in the
> upmap for that pg), then the pg will be marked as degraded as the system
> will try to rebuild the missing data towards an osd in the up set. This I
> understand... What I don't understand is that when the disk is restarted
> and thus marked back as "up" (or simply being patient...), it doesn't add
> that osd back to the acting set... Restarting other osds (and thus causing
> more peering operations again) results in the disk being added again to the
> acting set... I don't understand how this is happening.
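
Regarding 2: the flag you mention is set and cleared with "ceph osd set
nodown" / "ceph osd unset nodown". To see where up and acting currently
disagree (and whether the restarted OSD reappears in either set), a quick
filter over the same pg dump JSON might help. Again only a sketch with the
same assumed Nautilus field names:

#!/usr/bin/env python3
# List PGs whose acting set differs from their up set, e.g. to check
# whether a restarted OSD comes back in "up" but not in "acting".
# Sketch only, JSON field names assumed from a Nautilus cluster.
import json
import subprocess

out = subprocess.check_output(["ceph", "pg", "dump", "--format", "json"])
pg_dump = json.loads(out)
pg_stats = pg_dump.get("pg_map", pg_dump).get("pg_stats", [])

for pg in pg_stats:
    if pg["up"] != pg["acting"]:
        print("%-10s %-40s up=%s acting=%s"
              % (pg["pgid"], pg["state"], pg["up"], pg["acting"]))
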
> 3. Addition to question 2: If an osd which was in an acting set went down
> and is removed from the acting set (replaced with -1), which process will
> remove the obsolete data from the osd that went down once it is back up?
> Which process is cleaning up the obsolete copy in the end? Does scrubbing
> take this process into account? I assume that this might be related to my
> first question too.
> 4. I'm data mining the pg dump output... If I get an answer on all the
> previous questions, this will probably be answered automatically: When I
> look at all the acting PGs for a specific osd, look at the num_bytes of
> each PG, and calculate the size that should be stored on that osd (taking
> the erasure coding into account), I get a difference between the disk
> space that should be used and what is effectively used. E.g. for a disk the
> system is saying 11.3 TiB is being used... Calculating it using pg dump
> gives me +/- 10 TiB which should be used. I know that a delta might occur
> due to the block size etc, but it doesn't seem correct as the usage is too
> high...
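
Regarding 4: this is roughly how I would sum up the expected usage per OSD
from pg dump, dividing num_bytes by k for the erasure coded pools. The
pool-id -> k mapping below is just a placeholder, the real values come from
"ceph osd pool ls detail" or the erasure-code-profile, and it ignores omap
data and allocation overhead, so treat it as a sketch, not as the exact
calculation:

#!/usr/bin/env python3
# Estimate how many bytes an OSD should hold, based on "ceph pg dump":
# replicated pools store num_bytes once per acting OSD, EC pools roughly
# num_bytes / k per shard.  The K_PER_POOL values are PLACEHOLDERS.
import json
import subprocess
from collections import defaultdict

K_PER_POOL = {1: 1, 7: 6}   # placeholder: e.g. pool 7 being a k=6,m=2 EC pool

out = subprocess.check_output(["ceph", "pg", "dump", "--format", "json"])
pg_dump = json.loads(out)
pg_stats = pg_dump.get("pg_map", pg_dump).get("pg_stats", [])

expected = defaultdict(float)
for pg in pg_stats:
    pool_id = int(pg["pgid"].split(".")[0])
    per_osd = pg["stat_sum"]["num_bytes"] / K_PER_POOL.get(pool_id, 1)
    for osd in pg["acting"]:
        if 0 <= osd < 2147483647:          # skip holes in the acting set
            expected[osd] += per_osd

for osd in sorted(expected):
    print("osd.%-4d expected %7.2f TiB" % (osd, expected[osd] / 2**40))

Whatever gap remains between that estimate and what ceph osd df reports
could be BlueStore allocation overhead, omap/RocksDB data, or PG copies
that haven't been cleaned up yet after backfill, but without seeing the
actual numbers that is just a guess.
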
>
> I've tried to search for the processes which clean up the osds, garbage
> collection, etc. but no good information is available. You can find tons of
> information related to garbage collection in combination with RGW, but not
> for the RADOS mechanism... I really can't find any clue how the PGs of a
> disk are removed after it went down and is no longer used in the
> acting/up set of those PGs...
>
> Another question, which I should probably post in the dev group: which
> IDE is recommended for developing in the Ceph project? I'm working on a
> Mac... I don't know if there are any recommendations.
>
> I really hope I can get some help on these questions.
>
> Many thanks!
>
> Regards,
>
> Kristof
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io