You pay $300-$350 per HDD when you buy in large quantities, so such a
server costs $50k-$60k.
There are 20TB drives, but they are SMR(1); they are fine for archival
use cases, but Ceph does not recommend SMR drives.
We use 90-disk 4U servers(2). I remember HPE showed us 110-drive
servers but I can't find them now.
Seagate also has a 4U 106-drive JBOD enclosure(3).
1- https://www.westerndigital.com/products/data-center-platforms/ultrastar-dc-…
2- https://www.delltechnologies.com/no-no/collaterals/unauth/data-sheets/produ…
3- https://www.seagate.com/files/www-content/datasheets/pdfs/exos-e-4u106-DS19…
On Sat, Feb 20, 2021 at 3:31 PM Loïc Dachary <loic(a)dachary.org> wrote:
>
> I did not know it was possible to buy such a machine, very impressive. How much does
> that cost? I thought 18TB was the current maximum size for HDD :-P
>
> On 18/02/2021 04:37, Serkan Çoban wrote:
> > I still prefer the simplest solution. There are 4U servers with 110 x
> > 20TB disks on the market.
> > After RAID you get 1.5PiB per server. This is 30 months of data.
> > 2 such servers will hold 5 years of data with minimal problems.
> > If you need backup, buy 2 more sets and just send zfs snapshot
> > diffs to this set.
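> >
> > A quick back-of-the-envelope check of those figures (assuming ~25% of raw
> > capacity is lost to RAID parity/spares and ~50TB of new data per month):
> >
> >     raw_tb = 110 * 20            # 110 x 20TB drives = 2200 TB raw
> >     usable_tb = raw_tb * 0.75    # assume ~25% goes to RAID parity/spares
> >     usable_pib = usable_tb * 1e12 / 2**50
> >     months = usable_tb / 50      # at ~50 TB of new data per month
> >     print(f"{usable_pib:.1f} PiB usable, ~{months:.0f} months")
> >     # -> "1.5 PiB usable, ~33 months", i.e. roughly 30 months per server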
> >
> >
> > On Wed, Feb 17, 2021 at 11:15 PM Loïc Dachary <loic(a)dachary.org> wrote:
> >>
> >>
> >> On 17/02/2021 18:27, Serkan Çoban wrote:
> >>> Why not put all the data in a zfs pool with a 3-4 level deep directory
> >>> structure, each directory named with one byte in hex (00-FF)?
> >>> Four levels deep, you get 256^4 = ~4B folders with ~2-3 objects per
> >>> folder, or three levels deep you get 256^3 = ~16M folders with ~600
> >>> objects each.
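> >>>
> >>> As a rough sketch of the layout I have in mind (hypothetical mount point,
> >>> three levels deep, one hex byte per level, assuming object names are the
> >>> SHA256 of the content as mentioned further down the thread):
> >>>
> >>>     import hashlib, os
> >>>
> >>>     ROOT = "/tank/objects"   # hypothetical zfs dataset mount point
> >>>
> >>>     def object_path(content: bytes, levels: int = 3) -> str:
> >>>         """Derive a sharded path from the SHA256 of the content."""
> >>>         digest = hashlib.sha256(content).hexdigest()
> >>>         # one byte (two hex digits) per directory level: b9/4d/27/<digest>
> >>>         parts = [digest[2 * i:2 * i + 2] for i in range(levels)]
> >>>         return os.path.join(ROOT, *parts, digest)
> >>>
> >>>     print(object_path(b"hello world"))
> >>>     # /tank/objects/b9/4d/27/b94d27b9934d3e08a5...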
> >> It is more or less the current setup :-) I should have mentioned that there
> >> currently are ~750TB and 10 billion objects. But it's growing by 50TB every
> >> month and it will keep growing indefinitely, which is why a solution that
> >> scales out is desirable.
> >>>> On Wed, Feb 17, 2021 at 8:14 PM Loïc Dachary <loic(a)dachary.org> wrote:
> >>>> Hi Nathan,
> >>>>
> >>>> Good thinking :-) The names of the objects are indeed the SHA256 of
> >>>> their content, which provides deduplication.
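> >>>>
> >>>> For anyone following along, a toy sketch of the dedup this gives us (an
> >>>> in-memory dict standing in for the real store):
> >>>>
> >>>>     import hashlib
> >>>>
> >>>>     store = {}   # sha256 hex digest -> content
> >>>>
> >>>>     def put(content: bytes) -> str:
> >>>>         key = hashlib.sha256(content).hexdigest()
> >>>>         store[key] = content   # writing the same content twice is a no-op
> >>>>         return key
> >>>>
> >>>>     assert put(b"same bytes") == put(b"same bytes")
> >>>>     assert len(store) == 1     # duplicates collapse into a single object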
> >>>>
> >>>> Cheers
> >>>>
> >>>> On 17/02/2021 18:04, Nathan Fish wrote:
> >>>>> I'm not much of a programmer, but as soon as I hear "immutable
> >>>>> objects" I think "content-addressed". I don't know if you have many
> >>>>> duplicate objects in this set, but content-addressing gives you
> >>>>> object-level dedup for free. Do you have to preserve some meaningful
> >>>>> object names from the original dataset, or do you just need some
> >>>>> kind of ID?
> >>>>>
> >>>>> On Wed, Feb 17, 2021 at 11:37 AM Loïc Dachary <loic(a)dachary.org> wrote:
> >>>>>> Bonjour,
> >>>>>>
> >>>>>> TL;DR: Is it more advisable to work on Ceph internals to make it
> >>>>>> friendly to this particular workload, or to write something similar to
> >>>>>> EOS[0] (i.e. RocksDB + XRootD + RBD)?
> >>>>>>
> >>>>>> This is a followup of two previous mails[1] sent while researching this
> >>>>>> topic. In a nutshell, the Software Heritage project currently has ~750TB
> >>>>>> and 10 billion objects, 75% of which have a size smaller than 16KB and
> >>>>>> 50% have a size smaller than 4KB. But they only account for ~5% of the
> >>>>>> 750TB: 25% of the objects have a size > 16KB and total ~700TB. The
> >>>>>> objects can be compressed by ~50%, so the 750TB only needs ~350TB of
> >>>>>> actual storage (if you're interested in the details see [2]).
> >>>>>>
> >>>>>> Let's say those 10 billion objects are stored in a single 4+2 erasure
> >>>>>> coded pool with bluestore compression set for objects that have a size >
> >>>>>> 32KB and the smallest allocation size for bluestore set to 4KB[3]. The
> >>>>>> 750TB won't use the expected 350TB but about 30% more, i.e. ~450TB (see
> >>>>>> [4] for the maths). This space amplification is because storing a 1 byte
> >>>>>> object uses the same space as storing a 16KB object (see [5] to repeat
> >>>>>> the experiment at home). In a 4+2 erasure coded pool, each of the 6
> >>>>>> chunks will use no less than 4KB because that's the smallest allocation
> >>>>>> size for bluestore. That's 4 * 4KB = 16KB of data chunks (plus two more
> >>>>>> 4KB parity chunks) even when all that is needed is 1 byte.
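> >>>>>>
> >>>>>> To make that concrete, here is a simplified version of the arithmetic
> >>>>>> (it only counts bluestore allocation units, assuming min_alloc_size =
> >>>>>> 4KB and a 4+2 stripe; real usage also includes metadata):
> >>>>>>
> >>>>>>     import math
> >>>>>>
> >>>>>>     MIN_ALLOC = 4 * 1024   # assumed bluestore min allocation size
> >>>>>>     K, M = 4, 2            # 4+2 erasure coded pool
> >>>>>>
> >>>>>>     def raw_usage(object_size: int) -> int:
> >>>>>>         """Bytes allocated for one object, data + parity chunks."""
> >>>>>>         chunk = math.ceil(object_size / K)   # logical chunk size
> >>>>>>         alloc = max(MIN_ALLOC, math.ceil(chunk / MIN_ALLOC) * MIN_ALLOC)
> >>>>>>         return alloc * (K + M)   # every one of the 6 chunks gets >= 4KB
> >>>>>>
> >>>>>>     for size in (1, 4 * 1024, 16 * 1024):
> >>>>>>         print(size, raw_usage(size))
> >>>>>>     # 1 byte, 4KB and 16KB objects all allocate 24576 bytes (6 x 4KB)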
> >>>>>>
> >>>>>> It was suggested[6] to have two different pools: one 4+2 erasure coded
> >>>>>> pool with compression for all objects with a size > 32KB that are
> >>>>>> expected to compress to 16KB, and another with 3 replicas for the smaller
> >>>>>> objects, to reduce space amplification to a minimum without compromising
> >>>>>> on durability. A client looking for an object could make two simultaneous
> >>>>>> requests to the two pools: it would get 404 from one of them and the
> >>>>>> object from the other.
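> >>>>>>
> >>>>>> A minimal sketch of such a client, using the standard python-rados
> >>>>>> bindings (the pool names and the 32KB cutoff are hypothetical, and the
> >>>>>> two lookups are done sequentially here for brevity rather than
> >>>>>> simultaneously):
> >>>>>>
> >>>>>>     import rados
> >>>>>>
> >>>>>>     THRESHOLD = 32 * 1024
> >>>>>>     cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
> >>>>>>     cluster.connect()
> >>>>>>     small = cluster.open_ioctx("objects-small")  # 3x replicated pool
> >>>>>>     large = cluster.open_ioctx("objects-large")  # 4+2 EC + compression
> >>>>>>
> >>>>>>     def put(key: str, data: bytes) -> None:
> >>>>>>         pool = large if len(data) > THRESHOLD else small
> >>>>>>         pool.write_full(key, data)
> >>>>>>
> >>>>>>     def get(key: str) -> bytes:
> >>>>>>         for pool in (small, large):
> >>>>>>             try:
> >>>>>>                 size, _ = pool.stat(key)
> >>>>>>                 return pool.read(key, size)
> >>>>>>             except rados.ObjectNotFound:
> >>>>>>                 continue
> >>>>>>         raise KeyError(key)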
> >>>>>>
> >>>>>> Another workaround is best described in the "Finding a needle in
> >>>>>> Haystack: Facebook’s photo storage"[9] paper and essentially boils down
> >>>>>> to using a database to store a map between the object name and its
> >>>>>> location. That does not scale out (writing the database index is the
> >>>>>> bottleneck) but it's simple enough and is successfully implemented in
> >>>>>> EOS[0] with >200PB worth of data, and in seaweedfs[10], another promising
> >>>>>> object store based on the same idea.
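> >>>>>>
> >>>>>> A toy sketch of that idea (an ordinary append-only file stands in for an
> >>>>>> RBD image, and an in-memory dict stands in for the database index; the
> >>>>>> real thing would persist the index and handle many volumes):
> >>>>>>
> >>>>>>     import hashlib
> >>>>>>
> >>>>>>     index = {}   # sha256 -> (offset, length), the "needle" map
> >>>>>>
> >>>>>>     def put(volume, content: bytes) -> str:
> >>>>>>         key = hashlib.sha256(content).hexdigest()
> >>>>>>         if key not in index:           # content addressing dedups here too
> >>>>>>             offset = volume.seek(0, 2) # append at the end of the volume
> >>>>>>             volume.write(content)
> >>>>>>             index[key] = (offset, len(content))
> >>>>>>         return key
> >>>>>>
> >>>>>>     def get(volume, key: str) -> bytes:
> >>>>>>         offset, length = index[key]
> >>>>>>         volume.seek(offset)
> >>>>>>         return volume.read(length)
> >>>>>>
> >>>>>>     with open("/tmp/volume-0001", "a+b") as vol:
> >>>>>>         k = put(vol, b"some immutable object")
> >>>>>>         assert get(vol, k) == b"some immutable object"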
> >>>>>>
> >>>>>> Instead of working around the problem, maybe Ceph could be modified to
> >>>>>> make better use of the immutability of these objects[7], a hint that is
> >>>>>> apparently only used to figure out how best to compress them and for
> >>>>>> checksum calculation[8]. I honestly have no clue how difficult that would
> >>>>>> be. All I know is that it's not easy, otherwise it would have been done
> >>>>>> already: there seems to be a general need for efficiently (space wise and
> >>>>>> performance wise) storing large quantities of objects smaller than 4KB.
> >>>>>>
> >>>>>> Is it more advisable to:
> >>>>>>
> >>>>>> * work on Ceph internals to make it friendly to this particular
> >>>>>>   workload, or
> >>>>>> * write another implementation of "Finding a needle in Haystack:
> >>>>>>   Facebook’s photo storage"[9] based on RBD[11]?
> >>>>>>
> >>>>>> I'm currently leaning toward working on Ceph internals but there are
> >>>>>> pros and cons to both approaches[12]. And since all this is still very
> >>>>>> new to me, there also is the possibility that I'm missing something.
> >>>>>> Maybe it's *super* difficult to improve Ceph in this way. I should try
> >>>>>> to figure that out sooner rather than later.
> >>>>>>
> >>>>>> I realize it's a lot to take in and unless you're facing the exact same
> >>>>>> problem there is very little chance you read this far :-) But if you
> >>>>>> did... I'm *really* interested to hear what you think. In any case I'll
> >>>>>> report back to this thread once a decision has been made.
> >>>>>>
> >>>>>> Cheers
> >>>>>>
> >>>>>> [0] https://eos-web.web.cern.ch/eos-web/
> >>>>>> [1] https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AEMW6O7WVJF…
> >>>>>>     https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/RHQ5ZCHJISX…
> >>>>>> [2] https://forge.softwareheritage.org/T3054
> >>>>>> [3] https://github.com/ceph/ceph/blob/3f5e778ad6f055296022e8edabf701b6958fb602/…
> >>>>>> [4] https://forge.softwareheritage.org/T3052#58864
> >>>>>> [5] https://forge.softwareheritage.org/T3052#58917
> >>>>>> [6] https://forge.softwareheritage.org/T3052#58876
> >>>>>> [7] https://docs.ceph.com/en/latest/rados/api/librados/#c.@3.LIBRADOS_ALLOC_HIN…
> >>>>>> [8] https://forge.softwareheritage.org/T3055
> >>>>>> [9] https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf
> >>>>>> [10] https://github.com/chrislusf/seaweedfs/wiki/Components
> >>>>>> [11] https://forge.softwareheritage.org/T3049
> >>>>>> [12] https://forge.softwareheritage.org/T3054#58977
> >>>>>>
> >>>>>> --
> >>>>>> Loïc Dachary, Artisan Logiciel Libre
> >>>>>>
> >>>>>>
> >>>> --
> >>>> Loïc Dachary, Artisan Logiciel Libre
> >>>>
> >>>>
> >> --
> >> Loïc Dachary, Artisan Logiciel Libre
> >>
> >>
>
> --
> Loïc Dachary, Artisan Logiciel Libre
>
>