Hi Simon and Janne,
Thanks for the reply.
It does indeed seem related to bluestore_min_alloc_size.
In an old thread I've also found the following:
S3 object saving pipeline:
- The S3 object is divided into multipart shards by the client.
- RGW shards each multipart shard into rados objects of size rgw_obj_stripe_size.
- The primary OSD stripes each rados object into EC stripes of width ec.k * profile.stripe_unit, erasure-codes them, and sends the units to the secondary OSDs to be written into the object store (BlueStore).
- Each subobject of a rados object has size == (rados object size) / k.
- While writing to disk, BlueStore can divide a rados subobject into extents of minimal size == bluestore_min_alloc_size_hdd.
The following rules can save some space and IOPS:
- rgw_multipart_min_part_size SHOULD be a multiple of rgw_obj_stripe_size (the client can use a different value, as long as it is greater)
- rgw_obj_stripe_size MUST be == rgw_max_chunk_size
- ec stripe == osd_pool_erasure_code_stripe_unit or profile.stripe_unit
- rgw_obj_stripe_size SHOULD be a multiple of profile.stripe_unit * ec.k
- bluestore_min_alloc_size_hdd MAY be equal to bluefs_alloc_size (to avoid fragmentation)
- rgw_obj_stripe_size / ec.k SHOULD be a multiple of bluestore_min_alloc_size_hdd
- bluestore_min_alloc_size_hdd MAY be a multiple of profile.stripe_unit
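For what it's worth, the current values of these knobs can be checked on a live cluster roughly like this (a sketch; osd.0, the profile name and the RGW socket path are placeholders to adapt):

# On an OSD host:
ceph daemon osd.0 config show | egrep 'bluestore_min_alloc_size|bluefs_alloc_size'
# The EC profile (shows k and m; stripe_unit only appears if set explicitly):
ceph osd erasure-code-profile get <profile>
# On the RGW host:
ceph --admin-daemon /var/run/ceph/ceph-client.rgw.*.asok config show | egrep 'rgw_obj_stripe_size|rgw_max_chunk_size|rgw_multipart_min_part_size'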
Doing this calculation shows that smaller files of around 135 KB end up in chunks of roughly 22 KB each. Writing those into 64 KB allocation units gives me quite some wasted space.
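Spelling out the arithmetic (a rough sketch; assumes k=6, m=3 and the 64 KB default for bluestore_min_alloc_size_hdd):

# A 135 KB object splits into 6 data chunks plus 3 coding chunks,
# and each chunk rounds up to one 64 KB allocation unit:
echo "per-chunk KB: $(( 135 / 6 ))"       # ~22 KB of data per chunk
echo "allocated KB: $(( (6 + 3) * 64 ))"  # 9 x 64 KB = 576 KB on disk
echo "ideal KB:     $(( 135 * 9 / 6 ))"   # ~202 KB without rounding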
As far as I found, the allocation setting is kept at the OSD level and fixed at creation time, so adapting it requires recreating each OSD. As we have around 150 OSDs, I know what to script :-)
We will perform some testing in our test environment and I'll try to post
our feedback as long as I don't forget it...
To be sure, we just want to check the size on disk of the object. AFAIK we'll need to export the RocksDB and run some queries on it, unless someone else can help me with this? Before BlueStore this was quite easy to do...
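The drill-down I have in mind, instead of querying RocksDB directly, looks roughly like this (a sketch; the bucket, object, pool and OSD names are placeholders, and whether the objectstore tool's dump shows the allocated extents in this form on 14.2.x is something I still need to verify):

# 1) Find the rados objects behind an S3 object (the manifest):
radosgw-admin object stat --bucket=<bucket> --object=<object>
# 2) Map one rados object to its PG and acting OSDs:
ceph osd map <data-pool> <rados-object>
# 3) With that OSD stopped, dump the object from the store:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> <object> dump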
Regards,
Hi all,
I have an issue on my Ceph cluster.
For one of my pools I have 107TiB STORED and 298TiB USED.
This is strange, since I've configured erasure coding (6 data chunks, 3 coding chunks), so in an ideal world this should result in approx. 160.5 TiB USED (107 x (6+3)/6).
The question now is why this is the case...
There are 473+M objects stored. Lots of these files are pretty small (read: 150 KB files), though not all of them.
I am running Nautilus version 14.2.4.
I suspect that the stripe size is related to this issue. It is still the default (4 MB), but I am not sure.
Before BlueStore it was easy to check the size of the chunks on the disk... With BlueStore this is another story.
I have the following questions:
1. How can I check this, to be sure that this is the case? I actually want to drill down starting from an object I've sent to the Ceph cluster through the RGW, and see where the chunks are stored and how much space is allocated for them on the disks (see the sketch below).
2. If it is related to the stripe size, can I safely adapt this parameter? Will it only affect newly written data, or will it also apply to data already stored?
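For question 1, the starting point I have in mind is something like this (a sketch; the names are placeholders):

radosgw-admin object stat --bucket=<bucket> --object=<object>  # which rados objects make up the S3 object
ceph osd map <data-pool> <rados-object>                        # which PG and OSDs hold its chunks
rados -p <data-pool> stat <rados-object>                       # logical size of one rados object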
Many thanks,
Kristof
---------- Forwarded message ---------
From: David Seith <david.seith(a)kit.edu>
Date: Wed, 12 Feb 2020 at 11:24
Subject: Finding erasure-code-profile of crush rule
To: <ceph-users(a)lists.ceph.com>
Dear all,
On our Ceph cluster we have created multiple erasure coding profiles and then created a number of crush rules for these profiles using:
ceph osd crush rule create-erasure {name} {profile-name}
Is it now possible to find out which erasure coding profile is used by a certain crush rule?
Doing:
ceph osd crush rule dump
does not show the name of the erasure coding profile associated with the rules.
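The closest workaround I can think of is correlating through the pools, since each erasure-coded pool records both its crush rule and its profile. Something like this (a sketch; it only covers rules that are in use by at least one pool):

ceph osd erasure-code-profile ls
ceph osd crush rule dump | jq '.[] | {rule_name, rule_id}'
for p in $(ceph osd pool ls); do
  echo "$p: $(ceph osd pool get "$p" crush_rule) / $(ceph osd pool get "$p" erasure_code_profile 2>/dev/null)"
done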
Best,
David
Say I think my CephFS is slow when I rsync to it, slower than it used to be. First of all, I do not get why it reads so much data. I assume the file attributes need to come from the MDS server, so the rsync backup should mostly cause writes, shouldn't it?
I think it started being slow after enabling snapshots on the file system.
- how can I determine whether mds_cache_memory_limit = 8000000000 is still correct?
- how can I test the MDS performance from the command line, so I can experiment with CPU power configurations and see whether this brings a significant change? (See the sketch below.)
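I guess a starting point would be something like this (a sketch; mds.a is a placeholder for the MDS name):

ceph daemon mds.a cache status  # current cache usage vs. the configured limit
ceph daemon mds.a perf dump     # request/reply counters and latencies
ceph tell mds.a heap stats      # allocator-level view of memory use

but I'm not sure how to interpret the numbers, or how to turn them into a repeatable benchmark.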
Not sure if the previous message went through, since I was not subscribed. If it did, sorry for the spam.
Dear all
Running Nautilus 14.2.7. The data in the FS is important and cannot be lost.
Today I increased the PG count of the volume pool from 8k to 16k. The active MDS started reporting slow ops (the filesystem is not in the volume pool). After a few hours the FS was very slow; I reduced the backfill to 1, and since the situation was not improving, I restarted the MDS (no other standby MDSs, it was a single MDS).
After that, the crash. The MDS does not come back up, with this error:
2020-02-07 07:03:32.477 7fbf69647700 -1 NetHandler create_socket couldn't
create socket (97) Address family not supported by protocol
2020-02-07 07:03:32.541 7fbf65e6a700 1 mds.ceph-mon-01 Updating MDS map
to version 48461 from mon.2
2020-02-07 07:03:37.613 7fbf65e6a700 1 mds.ceph-mon-01 Updating MDS map
to version 48462 from mon.2
2020-02-07 07:03:37.613 7fbf65e6a700 1 mds.ceph-mon-01 Map has assigned
me to become a standby
2020-02-07 07:14:11.789 7fbf66e42700 -1 received signal: Terminated
from /sbin/init (PID: 1) UID: 0
2020-02-07 07:14:11.789 7fbf66e42700 -1 mds.ceph-mon-01 *** got signal
Terminated ***
2020-02-07 07:14:11.789 7fbf66e42700 1 mds.ceph-mon-01 suicide! Wanted
state up:standby
2020-02-07 07:14:12.565 7fbf65e6a700 0 ms_deliver_dispatch: unhandled
message 0x563fcb438d00 mdsmap(e 48465) v1 from mon.2 v1:10.3.78.32:6789/0
2020-02-07 07:25:16.782 7f26c39de2c0 0 set uid:gid to 64045:64045
(ceph:ceph)
2020-02-07 07:25:16.782 7f26c39de2c0 0 ceph version 14.2.7
(3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable), process
ceph-mds, pid 3724
2020-02-07 07:25:16.782 7f26c39de2c0 0 pidfile_write: ignore empty
--pid-file
2020-02-07 07:25:16.786 7f26b5326700 -1 NetHandler create_socket
couldn't create socket (97) Address family not supported by protocol
2020-02-07 07:25:16.790 7f26b1b49700 1 mds.ceph-mon-01 Updating MDS map
to version 48472 from mon.0
2020-02-07 07:25:17.691 7f26b1b49700 1 mds.ceph-mon-01 Updating MDS map
to version 48473 from mon.0
2020-02-07 07:25:17.691 7f26b1b49700 1 mds.ceph-mon-01 Map has assigned
me to become a standby
2020-02-07 07:29:50.306 7f26b2b21700 -1 received signal: Terminated
from /sbin/init (PID: 1) UID: 0
2020-02-07 07:29:50.306 7f26b2b21700 -1 mds.ceph-mon-01 *** got signal
Terminated ***
2020-02-07 07:29:50.306 7f26b2b21700 1 mds.ceph-mon-01 suicide! Wanted
state up:standby
2020-02-07 07:29:50.526 7f26b5b27700 1 mds.beacon.ceph-mon-01
discarding unexpected beacon reply down:dne seq 70 dne
2020-02-07 07:29:52.802 7f26b1b49700 0 ms_deliver_dispatch: unhandled
message 0x55ef110ab200 mdsmap(e 48474) v1 from mon.0 v1:10.3.78.22:6789/0
Rebooting did not help.
I asked in #ceph on OFTC and they suggested bringing up another "fresh" MDS. I did that, but it does not start either, and goes to standby. Logs:
2020-02-07 07:12:46.696 7fe4b388b2c0 0 set uid:gid to 64045:64045
(ceph:ceph)
2020-02-07 07:12:46.696 7fe4b388b2c0 0 ceph version 14.2.7
(3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable), process
ceph-mds, pid 74742
2020-02-07 07:12:46.696 7fe4b388b2c0 0 pidfile_write: ignore empty
--pid-file
2020-02-07 07:12:46.704 7fe4a19f6700 1 mds.ceph-mon-02 Updating MDS map
to version 48462 from mon.0
2020-02-07 07:12:47.456 7fe4a19f6700 1 mds.ceph-mon-02 Updating MDS map
to version 48463 from mon.0
2020-02-07 07:12:47.456 7fe4a19f6700 1 mds.ceph-mon-02 Map has assigned
me to become a standby
2020-02-07 07:14:16.615 7fe4a29ce700 -1 received signal: Terminated
from /sbin/init (PID: 1) UID: 0
2020-02-07 07:14:16.615 7fe4a29ce700 -1 mds.ceph-mon-02 *** got signal
Terminated ***
2020-02-07 07:14:16.615 7fe4a29ce700 1 mds.ceph-mon-02 suicide! Wanted
state up:standby
2020-02-07 07:14:16.947 7fe4a51d3700 1 mds.beacon.ceph-mon-02
discarding unexpected beacon reply down:dne seq 24 dne
2020-02-07 07:14:18.715 7fe4a19f6700 0 ms_deliver_dispatch: unhandled
message 0x5602fbc6df80 mdsmap(e 48466) v1 from mon.0 v2:10.3.78.22:3300/0
2020-02-07 07:25:02.093 7f3c2f92a2c0 0 set uid:gid to 64045:64045
(ceph:ceph)
2020-02-07 07:25:02.093 7f3c2f92a2c0 0 ceph version 14.2.7
(3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable), process
ceph-mds, pid 75471
2020-02-07 07:25:02.093 7f3c2f92a2c0 0 pidfile_write: ignore empty
--pid-file
2020-02-07 07:25:02.097 7f3c1da95700 1 mds.ceph-mon-02 Updating MDS map
to version 48471 from mon.2
2020-02-07 07:25:06.413 7f3c1da95700 1 mds.ceph-mon-02 Updating MDS map
to version 48472 from mon.2
2020-02-07 07:25:06.413 7f3c1da95700 1 mds.ceph-mon-02 Map has assigned
me to become a standby
2020-02-07 07:29:56.869 7f3c1ea6d700 -1 received signal: Terminated
from /sbin/init (PID: 1) UID: 0
2020-02-07 07:29:56.869 7f3c1ea6d700 -1 mds.ceph-mon-02 *** got signal
Terminated ***
2020-02-07 07:29:56.869 7f3c1ea6d700 1 mds.ceph-mon-02 suicide! Wanted
state up:standby
2020-02-07 07:29:58.113 7f3c1da95700 0 ms_deliver_dispatch: unhandled
message 0x563c5df33f80 mdsmap(e 48475) v1 from mon.2 v2:10.3.78.32:3300/0
Here is the ceph status:
  cluster:
    id:     a8dde71d-ca7b-4cf5-bd38-8989c6a27011
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged
            2 daemons have recently crashed

  services:
    mon: 3 daemons, quorum ceph-mon-01,ceph-mon-02,ceph-mon-03 (age 41m)
    mgr: ceph-mon-02(active, since 41m), standbys: ceph-mon-03, ceph-mon-01
    mds: pawsey-sync-fs:0/1, 1 damaged
    osd: 925 osds: 715 up (since 2h), 715 in (since 23h)
    rgw: 3 daemons active (radosgw-01, radosgw-02, radosgw-03)

  data:
    pools:   24 pools, 26569 pgs
    objects: 52.64M objects, 199 TiB
    usage:   685 TiB used, 6.7 PiB / 7.3 PiB avail
    pgs:     26513 active+clean
             54    active+clean+scrubbing+deep
             2     active+clean+scrubbing
Ceph osd ls detail: https://pastebin.com/raw/bxi4HSa5
the metadata pool is on NVMe
Can anyone give me some help?
Any commands I run, like journal repairs, do not work, as they expect the MDS to be up.
Thanks
Cheers
--
Luca Cervigni
Infrastructure Architect
Tel. +61864368802
Pawsey Supercomputing Centre
1 Bryce Ave, Kensington WA 6151
Australia
Following the list migration I need to re-open this thread:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2020-January/038014.html
...
Upgraded to 14.2.7, doesn't appear to have affected the behavior. As requested:
~$ ceph tell mds.mds1 heap stats
2020-02-10 16:52:44.313 7fbda2cae700 0 client.59208005
ms_handle_reset on v2:x.x.x.x:6800/3372494505
2020-02-10 16:52:44.337 7fbda3cb0700 0 client.59249562
ms_handle_reset on v2:x.x.x.x:6800/3372494505
mds.mds1 tcmalloc heap stats:------------------------------------------------
MALLOC: 50000388656 (47684.1 MiB) Bytes in use by application
MALLOC: + 0 ( 0.0 MiB) Bytes in page heap freelist
MALLOC: + 174879528 ( 166.8 MiB) Bytes in central cache freelist
MALLOC: + 14511680 ( 13.8 MiB) Bytes in transfer cache freelist
MALLOC: + 14089320 ( 13.4 MiB) Bytes in thread cache freelists
MALLOC: + 90534048 ( 86.3 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 50294403232 (47964.5 MiB) Actual memory used (physical + swap)
MALLOC: + 50987008 ( 48.6 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 50345390240 (48013.1 MiB) Virtual address space used
MALLOC:
MALLOC: 260018 Spans in use
MALLOC: 20 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
~$ ceph tell mds.mds1 heap release
2020-02-10 16:52:47.205 7f037eff5700 0 client.59249625
ms_handle_reset on v2:x.x.x.x:6800/3372494505
2020-02-10 16:52:47.237 7f037fff7700 0 client.59249634
ms_handle_reset on v2:x.x.x.x:6800/3372494505
mds.mds1 releasing free RAM back to system.
The buffer_anon pool over 15 minutes or so:
~$ ceph daemon mds.mds1 dump_mempools | jq .mempool.by_pool.buffer_anon
{
"items": 2045,
"bytes": 3069493686
}
~$ ceph daemon mds.mds1 dump_mempools | jq .mempool.by_pool.buffer_anon
{
"items": 2445,
"bytes": 3111162538
}
~$ ceph daemon mds.mds1 dump_mempools | jq .mempool.by_pool.buffer_anon
{
"items": 7850,
"bytes": 7658678767
}
~$ ceph daemon mds.mds1 dump_mempools | jq .mempool.by_pool.buffer_anon
{
"items": 12274,
"bytes": 11436728978
}
~$ ceph daemon mds.mds1 dump_mempools | jq .mempool.by_pool.buffer_anon
{
"items": 13747,
"bytes": 11539478519
}
~$ ceph daemon mds.mds1 dump_mempools | jq .mempool.by_pool.buffer_anon
{
"items": 14615,
"bytes": 13859676992
}
~$ ceph daemon mds.mds1 dump_mempools | jq .mempool.by_pool.buffer_anon
{
"items": 23267,
"bytes": 22290063830
}
~$ ceph daemon mds.mds1 dump_mempools | jq .mempool.by_pool.buffer_anon
{
"items": 44944,
"bytes": 40726959425
}
And one more sample about a minute after the heap release, showing continued growth:
~$ ceph daemon mds.mds1 dump_mempools | jq .mempool.by_pool.buffer_anon
{
"items": 50694,
"bytes": 47343942094
}
This is on a single active MDS with 2 standbys, while scanning about a million files with about 20 parallel threads on two clients, opening and reading each file if it exists.
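For reference, the samples above were just the same command repeated; a loop like this (a sketch) captures the growth over time:

while true; do
  date
  ceph daemon mds.mds1 dump_mempools | jq .mempool.by_pool.buffer_anon
  sleep 60
done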
Hello list,
As stated in this document:
https://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/
there are multiple parameters defining cache limits for BlueStore: bluestore_cache_size (presumably controlling the cache size), bluestore_cache_size_hdd (presumably doing the same for HDD storage only), and bluestore_cache_size_ssd (presumably the equivalent for SSD). My question is: does bluestore_cache_size override the disk-specific parameters, or do I need to set the storage-type-specific ones separately if I want to keep them at a certain value?
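My current reading, which I'd like confirmed, is that bluestore_cache_size = 0 (the default) defers to the _hdd/_ssd variant depending on the device type, while a non-zero value overrides both. The effective values on a running OSD should be visible with (a sketch; osd.0 is a placeholder):

ceph daemon osd.0 config show | grep bluestore_cache_size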
Thanks in advance.
Boris.
Dear All,
Following a clunky* cluster restart, we had:
- 23 objects unfound
- 14 PGs in recovery_unfound
Seeing no way to recover the unfound objects, we decided to mark the unfound objects in one PG as lost:
[root@ceph1 bad_oid]# ceph pg 5.f2f mark_unfound_lost delete
pg has 2 objects unfound and apparently lost marking
Unfortunately, this immediately crashed the primary OSD for this PG.
OSD log showing the OSD crashing 3 times: <http://p.ip.fi/gV8r>
The assert was:
2020-02-10 13:38:45.003 7fa713ef3700 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.6/rpm/el7/BUILD/ceph-14.2.6/src/osd/PrimaryLogPG.cc:
In function 'int PrimaryLogPG::recover_missing(const hobject_t&,
eversion_t, int, PGBackend::RecoveryHandle*)' thread 7fa713ef3700 time
2020-02-10 13:38:45.000875
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.6/rpm/el7/BUILD/ceph-14.2.6/src/osd/PrimaryLogPG.cc:
11550: FAILED ceph_assert(head_obc)
Questions:
1) Is it possible to recover the flapping OSD? Or should we fail out the flapping OSD (see the sketch below) and hope the cluster recovers?
2) We have 13 other PGs with unfound objects. Do we need to mark_unfound these one at a time, and then fail out their primary OSDs (allowing the cluster to recover before marking the next PG's objects unfound and failing its primary OSD)?
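For reference, by "fail out" I mean the usual procedure, roughly (a sketch; <id> is a placeholder, and this is not specific advice for the assert above):

systemctl stop ceph-osd@<id>  # on the OSD host
ceph osd out <id>             # let the cluster backfill around it
ceph -w                       # watch recovery progress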
* Thread describing the bad restart:
<https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/IRKCDRRAH7Y…>
many thanks!
Jake
--
Dr Jake Grimmett
Head Of Scientific Computing
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.
Phone 01223 267019
Mobile 0776 9886539