Dallas;
First, I should point out that you have an issue with your units. Your cluster is
reporting 81 TiB (1024^4) of available space, not 81 TB (1000^4). Similarly, it's
reporting 22.8 TiB free space in the pool, not 22.8 TB. For comparison, each of your
5.5 TB drives (TB is the correct unit here, since it's the manufacturer's figure) is
only about 5.0 TiB. Hard drive manufacturers market in decimal units, while software
systems report in binary units. Thus, while you added 66 TB to your cluster, that is
only about 60 TiB. For background, these pages are interesting:
https://en.wikipedia.org/wiki/Tebibyte
https://en.wikipedia.org/wiki/Binary_prefix#Consumer_confusion
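To make the arithmetic concrete, the conversions above can be checked with a quick one-liner (awk here, but any calculator works):

```shell
# TB is 10^12 bytes; TiB is 2^40 bytes -- same drive, two numbers
awk 'BEGIN {
    printf "5.5 TB = %.2f TiB\n", 5.5e12 / 2^40;   # one of the new drives
    printf "66 TB  = %.2f TiB\n", 66e12 / 2^40;    # all 12 new drives together
}'
```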
It looks like you're using a replicated rule (size 3) for your cephfs_data pool. With
81.2 TiB available in the cluster, the maximum free space you can expect in the pool is
about 27.07 TiB (81.2 / 3). As we've seen, you can't actually fill a cluster to 100%.
It might be worth noting that the remaining discrepancy, 81.2 - (22.8 x 3) = 12.8 TiB
of raw space, is roughly 10% of your entire cluster (12.8 / 122.8).
From your previously provided OSD map, I'm seeing some reweights that aren't 1.
It's possible that has some impact.
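If you want to check, something like the following should show the REWEIGHT column per OSD and let you clear an override (a sketch only; exact output columns vary between Ceph releases, and osd.8 is just an example id):

```shell
# Show per-OSD weight, reweight and utilization
ceph osd df tree

# Clear an override reweight on a single OSD (osd.8 here is just an example)
ceph osd reweight 8 1.0
```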
It's also possible that your cluster is "reserving" space on your HDDs for
DB and WAL operations.
It would take someone more familiar with the CephFS and Dashboard code than I am to
answer your question definitively.
Thank you,
Dominic L. Hilsbos, MBA
Director – Information Technology
Perform Air International, Inc.
DHilsbos(a)PerformAir.com
www.PerformAir.com
From: Dallas Jones [mailto:djones@tech4learning.com]
Sent: Monday, August 31, 2020 2:59 PM
To: Dominic Hilsbos
Cc: ceph-users(a)ceph.io
Subject: [ceph-users] Re: Cluster degraded after adding OSDs to increase capacity
Thanks to everyone who replied. After setting osd_recovery_sleep_hdd to 0 and changing
osd_max_backfills to 16, my recovery throughput increased from under 1 MB/s to
40-60 MB/s and finished up late last night.
The cluster is mopping up a bunch of queued deep scrubs, but is otherwise now healthy.
I do have one remaining question: the cluster now shows 81 TB of free space, but the
data pool only shows 22.8 TB of free space. I was expecting/hoping to see the free
space value for the pool grow more after doubling the capacity of the cluster (it
previously had 21 OSDs w/ 2.7 TB SAS drives; I just added 12 more OSDs w/ 5.5 TB
drives).
Are my expectations flawed, or is there something I can do to prod Ceph into growing the
data pool free space?
On Fri, Aug 28, 2020 at 9:37 AM <DHilsbos(a)performair.com> wrote:
Dallas;
I would expect so, yes.
I wouldn't be surprised to see the used percentage slowly drop as the recovery /
rebalance progresses. I believe the pool free space number is based on the free space
of the most-filled OSD under any of the pool's PGs, so I expect the free space will go
up as your near-full OSDs drain.
I've added OSDs to one of our clusters once, and the recovery / rebalance completed
fairly quickly. I don't remember how the pool sizes progressed. I'm going to need to
expand our other cluster in the next couple of months, so follow-up on how this
proceeds would be appreciated.
Thank you,
Dominic L. Hilsbos, MBA
Director – Information Technology
Perform Air International, Inc.
DHilsbos(a)PerformAir.com
www.PerformAir.com
From: Dallas Jones [mailto:djones@tech4learning.com]
Sent: Friday, August 28, 2020 7:58 AM
To: Florian Pritz
Cc: ceph-users(a)ceph.io; Dominic Hilsbos
Subject: Re: [ceph-users] Re: Cluster degraded after adding OSDs to increase capacity
Thanks for the reply. I dialed up the value for max backfills yesterday, which
increased my recovery throughput from about 1 MB/s to 5 or so. After tweaking
osd_recovery_sleep_hdd, I'm seeing 50-60 MB/s - which is fairly epic. No clients are
currently using this cluster, so I'm not worried about tanking client performance.
One remaining question: Will the pool sizes begin to adjust once the recovery process is
complete? Per the following screenshot, my data pool is ~94% full...
On Fri, Aug 28, 2020 at 4:31 AM Florian Pritz <florian.pritz(a)rise-world.com>
wrote:
On Thu, Aug 27, 2020 at 05:56:22PM +0000, DHilsbos(a)performair.com wrote:
2) Adjust performance settings to allow the data
movement to go faster. Again, I don't have those settings immediately to hand, but
Googling something like 'ceph recovery tuning,' or searching this list, should
point you in the right direction. Notice that you only have 6 PGs trying to move at a
time, with 2 blocked on your near-full OSDs (8 & 19). I believe, by default, each OSD
daemon is only involved in 1 data movement at a time. The tradeoff here is that user
activity suffers if you adjust to favor recovery; however, with the cluster in ERROR
status, I suspect user activity is already suffering.
We've set osd_max_backfills to 16 in the config and when necessary we
manually change the runtime value of osd_recovery_sleep_hdd. It defaults
to 0.1 seconds of wait time between objects (I think?). If you really
want fast recovery try this additional change:
ceph tell osd.\* config set osd_recovery_sleep_hdd 0
Be warned though, this will seriously affect client performance. Then
again it can bump your recovery speed by multiple orders of magnitude.
If you want to go back to how things were, set it back to 0.1 instead of
0. It may take a couple of seconds (maybe a minute) until performance
for clients starts to improve. I guess the OSDs are too busy with
recovery to instantly accept the changed value.
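For what it's worth, on Nautilus and later you should be able to check the current runtime value the same way, and restore the default afterwards (a sketch, untested here; osd.0 is just an example):

```shell
# Check what one OSD is currently using
ceph tell osd.0 config get osd_recovery_sleep_hdd

# Put the default back once recovery has caught up
ceph tell osd.\* config set osd_recovery_sleep_hdd 0.1
```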
Florian