Phil;
I'm probably going to get crucified for this, but I put a year of testing into this
before determining it was sufficient to the needs of my organization...
If the primary concerns are capability and cost (rather than top-of-the-line performance), then
I can tell you that we have had great success using Intel Atom C3000 series CPUs. We
have built 2 clusters with capacities on the order of 130 TiB, for less than $30,000 each.
The initial clusters cost $20,000 each, for half the capacity. Our testing cluster cost
$8,000 to build, and most of that hardware could have been wrapped into the first
production cluster build.
For those keeping track: no, that is not the lowest cost per unit of space.
Thank you,
Dominic L. Hilsbos, MBA
Director – Information Technology
Perform Air International Inc.
DHilsbos(a)PerformAir.com
www.PerformAir.com
-----Original Message-----
From: Phil Merricks [mailto:seffyroff@gmail.com]
Sent: Monday, November 16, 2020 5:52 PM
To: Janne Johansson
Cc: Hans van den Bogert; ceph-users
Subject: [ceph-users] Re: (Ceph Octopus) Repairing a neglected Ceph cluster - Degraded
Data Redundancy, all PGs degraded, undersized, not scrubbed in time
Thanks for all the replies, folks. I think it's a testament to the
versatility of Ceph that there are some differences of opinion and
experience here.
With regards to the purpose of this cluster, it provides distributed
storage for stateful container workloads. The data produced is largely
immutable: it can be regenerated over time, although regeneration does
cause some slowdown for the teams that use the data as part of their
development pipeline. To the best of my understanding, the goals here were
to provide a data loss safety net while still making efficient use of the block
devices assigned to the cluster, which I imagine is where the EC direction
came from. The cluster is 3 nodes, with the OSDs themselves mainly housed
in two of them. Additionally, there was an initiative to 'use what we
have' (or, as I like to put it, 'cobble it together') with commodity
hardware that was immediately to hand. The departure of my
predecessor has left some unanswered questions, so I am not going to bother
second-guessing beyond what I already know. As I understand it, my steps
are:
1. Move the data off and scrap the cluster as it stands (already under way).
2. Group the block devices into pools of the same geometry and type (and
maybe do some tiering?).
3. Spread the OSDs across all 3 nodes so recovery scope isn't so easily
compromised by a loss at the bare-metal level.
4. Add more hosts/OSDs if EC is the right solution (this may be outside
the scope of this implementation, but I'll keep a-cobblin'!).
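For what it's worth, steps 2 and 3 map roughly onto CRUSH device classes plus a host-level failure domain. A sketch of the commands involved, with hypothetical pool and rule names (`ssd-by-host`, `container-data`) standing in for whatever fits your naming, and untested against any real cluster:

```shell
# Tag OSDs by media type so pools of the same geometry/type can be grouped.
# Ceph normally auto-detects hdd/ssd; override only where it guessed wrong.
ceph osd crush set-device-class ssd osd.3

# A replicated rule that only uses ssd OSDs and places each replica on a
# different host, so losing one node can't take out a whole PG.
ceph osd crush rule create-replicated ssd-by-host default host ssd

# A 3-way replicated pool on that rule: survives one host failure and
# stays writable (min_size 2) while it recovers.
ceph osd pool create container-data 128 128 replicated ssd-by-host
ceph osd pool set container-data size 3
ceph osd pool set container-data min_size 2
```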
The additional ceph outputs follow:
ceph osd tree <https://termbin.com/vq63>
ceph osd erasure-code-profile get cephfs-media-ec <https://termbin.com/h33h>
I am fully prepared to do away with EC to keep things simple and efficient
in terms of CPU occupancy.
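The efficiency side of that trade-off is easy enough to put into numbers. A small sketch of the generic EC-versus-replication arithmetic (nothing here is specific to your profile):

```python
def raw_per_usable_tb(k=None, m=None, replicas=None):
    """Raw TB consumed per usable TB: `replicas` for replication, (k+m)/k for EC."""
    if replicas is not None:
        return float(replicas)
    return (k + m) / k

def failures_tolerated(k=None, m=None, replicas=None):
    """OSD losses survivable without data loss: m for EC, replicas-1 for replication."""
    return m if m is not None else replicas - 1

# Replication size=3: 3.0x raw, survives 2 losses.
# EC 4+2: 1.5x raw, also survives 2 losses -- but wants at least 6 failure
# domains (ideally hosts), which a 3-node cluster can't provide.
print(raw_per_usable_tb(replicas=3), failures_tolerated(replicas=3))  # 3.0 2
print(raw_per_usable_tb(k=4, m=2), failures_tolerated(k=4, m=2))      # 1.5 2
```

So on 3 nodes the space savings of EC come at the cost of host-level fault tolerance, which is another argument for keeping it simple with replication here.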
On Mon, 16 Nov 2020 at 02:32, Janne Johansson <icepic.dz(a)gmail.com> wrote:
On Mon, 16 Nov 2020 at 10:54, Hans van den Bogert <hansbogert(a)gmail.com>
wrote:
With this profile you can only lose one OSD at a time, which is really
not that redundant.
That's rather situation-dependent. I don't have really large disks, so
the repair time isn't that long.
Further, my SLO isn't so high that I need 99.xxx% uptime; if 2 disks
break in the same repair window, that would be unfortunate, but I'd just
grab a backup from a mirroring cluster. Looking at it from another
perspective, I came from a single-host RAID5 scenario, and I'd argue this is
better, since I can survive a host failure.
Also, this is a sliding problem, right? Someone with K+3 could argue K+2
is not enough as well.
There are a few situations, like when you are moving data or when a scrub
has found a bad PG, where you are suddenly out of copies if something bad
happens. I think RAID5 operators also found this out: when your cold spare
disk kicks in, you find that old undetected error on one of the other disks,
and conclude that repairs are bad or stress your RAID too much.
As with RAIDs, the cheapest resource is often the actual disks, not
operator time, restore wait times and so on, which is why many on this
list advocate for K+2-or-more, or repl=3: we have seen the errors
one normally didn't expect. Yes, a double surprise of two disks failing in
the same night after running for years is uncommon, but it is not as
uncommon to resize pools, move PGs around, or find a scrub error or two some
day.
So while one could always say "one more drive is better than your amount",
there are people losing data with repl=2 or K+1 because some more normal
operation was in flight and _then_ a single surprise happened. You can
have a weird reboot that leaves PGs needing backfill later, and if one
of the up-to-date hosts has any single surprise during the recovery, the
cluster will lack some of the current data even if two disks were never
down at the same time.
Drive manufacturers print Mean Time Between Failures; storage admins count
Mean Time Between Surprises.
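The surprise-window argument above can be given rough numbers. A back-of-envelope sketch, assuming independent failures at a constant rate; the AFR, disk count, and repair times below are made-up illustrative values:

```python
def p_second_failure(n_other_disks, afr, repair_hours):
    """Probability that at least one of the surviving disks fails during
    the repair window, given an annualized failure rate (afr) per disk."""
    hours_per_year = 24 * 365
    # Chance one given disk fails within the window.
    p_one = 1 - (1 - afr) ** (repair_hours / hours_per_year)
    # Chance at least one of the survivors does.
    return 1 - (1 - p_one) ** n_other_disks

# Example: 23 surviving disks at 3% AFR, comparing a 24h repair to a week.
fast = p_second_failure(23, 0.03, 24)
slow = p_second_failure(23, 0.03, 24 * 7)
# The risk grows roughly linearly with the repair window, which is why
# "my repair time isn't that long" genuinely changes the calculus --
# and why operations that extend the window (backfill, rebalancing)
# matter as much as the disks themselves.
```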
--
May the most significant bit of your life be positive.
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io