I was also confused by this topic and had intended to post a question
this week. The documentation I recall reading said something like 'if
you want to use erasure coding on a CephFS, you should use a small
replicated data pool as the first pool, and your erasure-coded pool as
the second.' I did not see any obvious indication of how this would
'auto-magically' put the small files in the replicated pool and the
large files in the erasure-coded pool, although that sounds like
desirable behavior. Instead I found the notes on 'file layouts', which
don't seem to allow size as a criterion.
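For what it's worth, as far as I can tell layouts only let you pin a
directory (or an individual file) to a specific data pool; nothing is
placed by size. Roughly what I have in mind is the following (a rough
sketch only; the fs, pool, and path names here are made up):

  # allow the EC pool to be used for CephFS data, then attach it
  ceph osd pool set my_ec_pool allow_ec_overwrites true
  ceph fs add_data_pool myfs my_ec_pool

  # new files created under this directory go to the EC pool
  setfattr -n ceph.dir.layout.pool -v my_ec_pool /mnt/myfs/big-files

  # check where a particular file's data actually ended up
  getfattr -n ceph.file.layout.pool /mnt/myfs/big-files/somefile

Note that the layout only applies to files created after the attribute
is set; existing files keep whatever layout they were created with.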
Does anybody have anything further to add that would help clarify this?
Thanks.
-Dave
Dave Hall
Binghamton University
On 2/10/20 1:26 PM, Gregory Farnum wrote:
> On Mon, Feb 10, 2020 at 12:29 AM Håkan T Johansson <f96hajo(a)chalmers.se> wrote:
>>
>> On Mon, 10 Feb 2020, Gregory Farnum wrote:
>>
>>> On Sun, Feb 9, 2020 at 3:24 PM Håkan T Johansson <f96hajo(a)chalmers.se> wrote:
>>>
>>> Hi,
>>>
>>> running 14.2.6, debian buster (backports).
>>>
>>> Have set up a cephfs with 3 data pools and one metadata pool:
>>> myfs_data, myfs_data_hdd, myfs_data_ssd, and myfs_metadata.
>>>
>>> The data of all files is, via ceph.dir.layout.pool, stored in either
>>> the pool myfs_data_hdd or myfs_data_ssd. This has also been checked
>>> by dumping the ceph.file.layout.pool attributes of all files.
>>>
>>> The filesystem has 1617949 files and 36042 directories.
>>>
>>> There are, however, approximately as many objects in the first pool
>>> created for the cephfs, myfs_data, as there are files. They also
>>> become more or fewer as files are created or deleted (so they cannot
>>> be leftovers from earlier exercises). Note how the USED size is
>>> reported as 0 bytes, correctly reflecting that no file data is stored
>>> in them.
>>>
>>> POOL_NAME         USED  OBJECTS  CLONES   COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED   RD_OPS       RD    WR_OPS       WR  USED COMPR  UNDER COMPR
>>> myfs_data          0 B  1618229       0  4854687                   0        0         0  2263590  129 GiB  23312479  124 GiB         0 B          0 B
>>> myfs_data_hdd  831 GiB   136309       0   408927                   0        0         0   106046  200 GiB    269084  277 GiB         0 B          0 B
>>> myfs_data_ssd   43 GiB  1552412       0  4657236                   0        0         0   181468  2.3 GiB   4661935   12 GiB         0 B          0 B
>>> myfs_metadata  1.2 GiB    36096       0   108288                   0        0         0  4828623   82 GiB   1355102  143 GiB         0 B          0 B
>>>
>>> Is this expected?
>>>
>>> I was assuming that in this scenario all objects, both their data and
>>> any keys, would be either in the metadata pool or in the two pools
>>> where the objects are stored.
>>>
>>> Are some additional metadata keys stored in the first created data
>>> pool for cephfs? That would not be so nice if the OSD selection rules
>>> for that pool use worse disks than the data itself...
>>>
>>>
>>>
>>> https://docs.ceph.com/docs/master/cephfs/file-layouts/#adding-a-data-pool-t…
>>> notes there is “a small amount of metadata” kept in the primary pool.
>> Thanks! This I managed to miss, probably as it was at the bottom of the
>> page. In case one wants to use layouts to separate fast (likely many)
>> from slow (likely large) files, it then sounds as if the primary pool
>> should be of the fast kind too, due to the large number of objects.
>> Thus this needs to be highlighted early in that documentation.
>>
>>> That’s not terribly clear; what is actually stored is a per-file
>>> location backtrace (its location in the directory tree) used for
>>> hardlink lookups and disaster recovery scenarios.
>> This info would be nice to add to the manual page. It is nice to know
>> what kind of information is stored there.
> Yeah, PRs welcome. :p
> Just to be clear though, that shouldn't be performance-critical. It's
> lazily updated by the MDS when the directory location changes, but not
> otherwise.
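In case it is useful to anyone later (and please correct me if I have
this wrong): my understanding is that this backtrace lives in the
'parent' xattr of the file's first RADOS object in the primary data
pool, so it can be peeked at roughly like this (untested sketch; the
object name is made up, real ones look like <inode-in-hex>.00000000,
and this assumes the inode_backtrace_t type is known to ceph-dencoder
on your build):

  # find the file's inode number in hex
  printf '%x\n' $(stat -c %i /mnt/myfs/some/file)

  # fetch and decode the backtrace from the primary data pool
  rados -p myfs_data getxattr 10000000001.00000000 parent > /tmp/parent.bin
  ceph-dencoder type inode_backtrace_t import /tmp/parent.bin decode dump_json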
>
>> Again thanks for the clarification!
>>
>>> Btw: is there any tool to see the amount of key-value data associated
>>> with a pool? 'ceph osd df' gives omap and meta for OSDs, but not
>>> broken down per pool.
>>>
>>>
>>> I think this is in the newest master code, but I’m not certain which
>>> release it’s in...
>> Would it then (when available) also be in the 'rados df' command?
> I really don't remember how everything is shared out but I think so?
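For anyone searching the archives later: on 14.2.x the closest I have
found is the per-OSD view; I would guess per-pool omap numbers will
eventually show up in 'ceph df detail' once the change Greg mentions is
in a release, but that is speculation on my part:

  ceph osd df       # OMAP / META columns, but per OSD only
  ceph df detail    # per-pool stats (no omap breakdown that I can see on 14.2.x)
  rados df          # per-pool object and usage stats, also without omap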
>
>> Best regards,
>> Håkan
>>
>>
>>> -Greg
>>>
>>>
>>>
>>> Best regards,
>>> Håkan