After you have filled that up, if such a host crashes or needs
maintenance, another 80-100 TB will need to be recreated from the other
huge drives.
A judicious setting of mon_osd_down_out_subtree_limit can help mitigate
the thundering herd, FWIW.
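Something along these lines, say; just a sketch, assuming a recent
release where the "ceph config" CLI is available (the option defaults
to "rack"):

    # Don't auto-mark OSDs "out" when an entire host goes down, so a
    # rebooted or briefly failed host doesn't kick off a mass backfill
    # of its whole contents.
    ceph config set mon mon_osd_down_out_subtree_limit host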
I don't think there are specific limitations on the size itself, but
as individual drives become larger and larger, just adding a new host
or a single drive will mean the cluster is rebalancing for days or
weeks, if not longer.
Especially if EC is used. A corollary here is that IOPS/TB decreases as
HDDs grow larger. We see some incremental tweaks, but in the end the
interface speed hasn't grown in some time. Seek and rotational latency
are helped somewhat by increasing areal density, though capacity growth
is also achieved by making the platters increasingly thinner and more
numerous: recent drives pack as many as nine in there (perhaps fewer
for SMR models). I've seen scale deployments cap HDD size at, say,
8 TB because the IOPS/TB beyond that was increasingly untenable,
depending of course on the use case.
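To put rough numbers on it (a back-of-the-envelope sketch, assuming a
7200 RPM spindle delivers on the order of ~200 random IOPS regardless
of capacity):

    8 TB drive:   ~200 IOPS / 8 TB   ~=  25 IOPS per TB
    20 TB drive:  ~200 IOPS / 20 TB  ~=  10 IOPS per TB

So each usable TB on the bigger drive gets well under half the random
IOPS, which is why that per-drive size cap shows up at scale.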
At some point you would end up having the cluster almost never in a
HEALTH_OK state because of normal replacements, expansions, and other
surprises, which in turn could cause secondary problems with mon DBs
and things like that.

With recent releases backfill doesn't trigger HEALTH_WARN, though, right?
Your point is well made, though: Dan @ CERN observed several years ago
that with a sufficiently large cluster one has to come to terms with
backfill going on all the time.
The idea here is that mon DB compaction tends to block if there is any
degradation; with at least some releases, that means it can block even
when you have HEALTH_OK but some OSDs are down/out.
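If the mon DBs do bloat during a prolonged backfill, a manual
compaction once things settle can help. A sketch, assuming the usual
admin commands (exact behavior varies somewhat by release):

    # Ask one monitor at a time to compact its RocksDB store
    ceph tell mon.<id> compact

    # Or, in ceph.conf, compact the store each time the mon starts
    [mon]
        mon compact on start = true

Compacting one mon at a time avoids surprising the quorum.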