Hi Andrei,
The answers to your questions depend on the Ceph version you're using and
on your major use case: RBD, RGW or CephFS?
Burkhard's comments are perfectly valid for Ceph before Octopus: DB
volume sizes should be selected from the sequence 3-6 GB, 30-60 GB,
300+ GB. Intermediate values (e.g. 100 GB) would result in wasted space,
since BlueFS/RocksDB wouldn't use it.
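To illustrate why the useful sizes cluster around 3/30/300 GB, here is a
rough sketch of the level math. It assumes Ceph's default RocksDB tuning
(max_bytes_for_level_base = 256 MB, level multiplier = 10); check your
actual bluestore_rocksdb_options, as these are illustrative values only:

```python
# Pre-Octopus rule of thumb: BlueFS only keeps a RocksDB level on the
# fast DB device if the WHOLE level fits, so a DB volume is only as
# useful as the largest complete set of levels it can hold.
BASE_GB = 0.25   # L1 target size (256 MB), assumed default
MULT = 10        # size multiplier per level, assumed default

def usable_db_gb(db_size_gb, max_levels=4):
    """Return how much of a DB volume BlueFS can actually use:
    the cumulative size of the largest whole set of levels that fits."""
    used, level_size = 0.0, BASE_GB
    for _ in range(max_levels):
        if used + level_size > db_size_gb:
            break
        used += level_size
        level_size *= MULT
    return used

for size_gb in (30, 100, 300):
    print(f"{size_gb} GB DB volume -> ~{usable_db_gb(size_gb):.2f} GB usable")
```

A 100 GB volume ends up no more usable than a 30 GB one (both top out
below the ~28 GB mark for levels L1-L3), which is the waste described above.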
Since Octopus the situation has improved a bit (see
https://github.com/ceph/ceph/pull/29687). One can now make BlueFS use
the additional space for higher DB levels, which makes DB space
assignment less restrictive.
From your warnings I presume that your OSDs use up to L4 in RocksDB,
and the spilled-over amounts are most probably pretty close to the
amount of data at L4.
So in the case of a [planned] Octopus upgrade, I'd suggest reserving
around 64GB per OSD for the WAL and DB levels L1-L3, plus at least an
additional 40GB for L4. More is better if you plan to put in additional
data and can afford such drives.
Deferred DB volume extension is also available these days, so you can
grow the DB gradually by adding more drives and/or extending the LVM
volume. So IMO the primary concern is having disk slots available for
new DB devices, so that more DB space can be added if needed.
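For reference, such a gradual extension can look roughly like the
following sketch. The VG/LV names and OSD id are made up for
illustration, and the OSD must be stopped before expanding:

```shell
# Hypothetical names; adjust to your layout. Run with the OSD stopped.

# 1. Grow the LV backing the OSD's DB device (here by 40G):
lvextend -L +40G /dev/ceph-db-vg/db-osd36

# 2. Let BlueFS detect and claim the enlarged device:
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-36
```

After restarting the OSD, `ceph daemon osd.36 perf dump bluefs` should
reflect the larger DB device.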
Thanks,
Igor
On 7/1/2020 6:05 PM, Andrei Mikhailovsky wrote:
> Thanks for the information, Burkhard.
>
> My current setup shows a bunch of these warnings (24 OSDs with spillover out of the 36 which have wal/db on the ssd):
>
> osd.36 spilled over 1.9 GiB metadata from 'db' device (7.2 GiB used of 30 GiB) to slow device
> osd.37 spilled over 13 GiB metadata from 'db' device (4.2 GiB used of 30 GiB) to slow device
> osd.44 spilled over 26 GiB metadata from 'db' device (13 GiB used of 30 GiB) to slow device
> osd.45 spilled over 33 GiB metadata from 'db' device (10 GiB used of 30 GiB) to slow device
> osd.46 spilled over 37 GiB metadata from 'db' device (8.8 GiB used of 30 GiB) to slow device
>
>
> From the above, for example, osd.36 is a 3TB disk and osd.45 is a 10TB disk.
>
> I was hoping to address those spillovers with the upgrade too, if it means increasing the SSD space. Currently we've got a WAL of 1GB and a DB of 30GB. Am I right in understanding that in the case of osd.46 the DB size should be at least 67GB to stop the spillover (30 + 37)?
>
>
> Cheers
>
> Andrei
>
> ----- Original Message -----
>> From: "Burkhard Linke" <Burkhard.Linke(a)computational.bio.uni-giessen.de>
>> To: "ceph-users" <ceph-users(a)ceph.io>
>> Sent: Wednesday, 1 July, 2020 13:09:34
>> Subject: [ceph-users] Re: Advice on SSD choices for WAL/DB?
>> Hi,
>>
>> On 7/1/20 1:57 PM, Andrei Mikhailovsky wrote:
>>> Hello,
>>>
>>> We are planning to perform a small upgrade to our cluster and slowly start adding 12TB SATA HDD drives. We need to accommodate additional SSD WAL/DB requirements as well. Currently we are considering the following:
>>>
>>> HDD Drives - Seagate EXOS 12TB
>>> SSD Drives for WAL/DB - Intel D3 S4510 960GB or Intel D3 S4610 960GB
>>>
>>> Our cluster isn't hosting any IO intensive DBs nor IO hungry VMs such as
>>> Exchange, MSSQL, etc.
>>>
>>> From the documentation that I've read, the recommended size for the DB is between 1% and 4% of the size of the OSD. Would a 2% figure be sufficient (so around a 240GB DB size for each 12TB OSD)?
>>
>> The documentation is wrong. RocksDB uses different levels to store data,
>> and needs to store each level either completely in the DB partition or on
>> the data partition. There have been a number of mail threads about the
>> correct sizing.
>>
>>
>> In your case the best size would be 30GB for the DB part + the WAL size
>> (usually 2 GB). For compaction and other actions the ideal DB size needs
>> to be doubled, so you end up with 62GB per OSD. Larger DB partitions are
>> a waste of capacity, unless they can hold the next level (300GB per OSD).
>>
>>
>> If you have spare capacity on the SSD (>100GB) you can either leave it
>> untouched or create a small SSD based OSD for small pools that require a
>> lower latency, e.g. a small extra fast pool for RBD or the RGW
>> configuration pools.
>>
>>> Also, from your experience, which is a better model for the SSD DB/WAL? Would the Intel S4510 be sufficient for our purpose, or would the S4610 be a much better choice? Are there any other cost-effective options to consider instead of the above models?
>> The SSD model should support fast sync writes, similar to the known
>> requirements for filestore journal SSDs. If your selected model is a
>> good fit according to the test methods, then it is probably also a good
>> choice for bluestore DBs.
>>
>>
>> Since not all data is written to the bluestore DB (there is no full data
>> journal, in contrast to filestore), the amount of data written to the SSD
>> is probably lower, and the DWPD requirements might be lower as well. To be
>> on the safe side, use the better model (higher DWPD / "write intensive")
>> if possible.
>>
>> Regards,
>>
>> Burkhard
>> _______________________________________________
>> ceph-users mailing list -- ceph-users(a)ceph.io
>> To unsubscribe send an email to ceph-users-leave(a)ceph.io