On Mon, Feb 10, 2020 at 12:29 AM Håkan T Johansson
<f96hajo(a)chalmers.se> wrote:
On Mon, 10 Feb 2020, Gregory Farnum wrote:
On Sun, Feb 9, 2020 at 3:24 PM Håkan T Johansson
<f96hajo(a)chalmers.se> wrote:
Hi,
running 14.2.6, debian buster (backports).
Have set up a cephfs with 3 data pools and one metadata pool:
myfs_data, myfs_data_hdd, myfs_data_ssd, and myfs_metadata.
The data of all files is, via ceph.dir.layout.pool, stored in either
myfs_data_hdd or myfs_data_ssd. This has also been verified by dumping
the ceph.file.layout.pool attribute of all files.
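For reference, the layouts were set and checked roughly like the
following, with the usual xattr tools on the mount point (the paths
here are just examples):

    # route new files under a directory to the ssd pool
    setfattr -n ceph.dir.layout.pool -v myfs_data_ssd /mnt/myfs/fast
    # check which pool an individual file's data ended up in
    getfattr -n ceph.file.layout.pool /mnt/myfs/fast/somefile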
The filesystem has 1617949 files and 36042 directories.
There are however approximately as many objects in the first pool
created for the cephfs, myfs_data, as there are files (1618229 objects
vs 1617949 files). Their number also grows and shrinks as files are
created or deleted, so they cannot be leftovers from earlier exercises.
Note how the USED size is reported as 0 bytes, correctly reflecting
that no file data is stored in those objects.
POOL_NAME         USED  OBJECTS  CLONES   COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED   RD_OPS       RD    WR_OPS       WR  USED COMPR  UNDER COMPR
myfs_data          0 B  1618229       0  4854687                   0        0         0  2263590  129 GiB  23312479  124 GiB         0 B          0 B
myfs_data_hdd  831 GiB   136309       0   408927                   0        0         0   106046  200 GiB    269084  277 GiB         0 B          0 B
myfs_data_ssd   43 GiB  1552412       0  4657236                   0        0         0   181468  2.3 GiB   4661935   12 GiB         0 B          0 B
myfs_metadata  1.2 GiB    36096       0   108288                   0        0         0  4828623   82 GiB   1355102  143 GiB         0 B          0 B
Is this expected? I was assuming that in this scenario all objects,
both their data and any keys, would be in either the metadata pool or
the two pools where the file data is stored.
Are these some additional metadata keys stored in the first-created
data pool of the cephfs? That would not be so nice in case the OSD
selection rules for that pool use worse disks than those for the data
itself...
https://docs.ceph.com/docs/master/cephfs/file-layouts/#adding-a-data-pool-t…
notes there is “a small amount of metadata” kept in the primary pool.
Thanks! This I managed to miss, probably because it is at the bottom of
the page. If one wants to use layouts to separate fast (likely many)
files from slow (likely large) ones, it then sounds as if the primary
pool should be of the fast kind too, given the large number of objects.
So this needs to be highlighted early in that documentation.
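Concretely, if I understand right, that would mean creating the
filesystem with the fast pool as its first data pool and attaching the
slow pool afterwards; a sketch with the pool names from above:

    # make the fast pool the first (primary) data pool at creation time
    ceph fs new myfs myfs_metadata myfs_data_ssd
    # attach the slower pool, then direct bulk data to it via layouts
    ceph fs add_data_pool myfs myfs_data_hdd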
That’s not terribly clear; what is actually stored is a per-file
location backtrace (its location in the directory tree) used for
hardlink lookups and disaster recovery scenarios.
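If you want to look at one, the backtrace sits in the "parent" xattr of
a file's first object in that pool; with a made-up inode number it can
be decoded along these lines:

    # fetch and decode the backtrace of inode 10000000000 (hypothetical)
    rados -p myfs_data getxattr 10000000000.00000000 parent > parent.bin
    ceph-dencoder type inode_backtrace_t import parent.bin decode dump_json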
This info would be nice to add to the manual page. It is good to know
what kind of information is stored there.
Yeah, PRs welcome. :p
Just to be clear though, that shouldn't be performance-critical. It's
lazily updated by the MDS when the directory location changes, but not
otherwise.
Again thanks for the clarification!
Btw: is there any tool to see the amount of key/value data associated
with a pool? 'ceph osd df' gives omap and meta for OSDs, but not broken
down per pool.
I think this is in the newest master code, but I’m not certain which
release it’s in...
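One partial workaround on 14.2, if I remember right, is to sum the
per-PG omap stats for the pool, e.g.:

    # per-PG OMAP_BYTES*/OMAP_KEYS* for one pool; the values are only
    # refreshed by (deep) scrub, hence the asterisks
    ceph pg ls-by-pool myfs_metadata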
Would it then (when available) also be in the 'rados df' command?
Best regards,
Håkan
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io