Hi, I'm contacting you with a question about quotas.
The situation is as follows:
1. I set the user quota to 10M.
2. Using S3 Browser, I uploaded one 12M file.
3. The upload failed as expected, but some objects remain in the pool (almost 10M), and S3 Browser doesn't show the failed file.
I expected nothing to be left in Ceph.
My question is: can the user or an admin remove the remaining objects?
Hello,
I've tried to add an OSD node with 12 rotational disks and 1 NVMe to my Ceph cluster. My YAML spec was this:
service_type: osd
service_id: osd_spec_default
service_name: osd.osd_spec_default
placement:
  host_pattern: osd8
spec:
  block_db_size: 64G
  data_devices:
    rotational: 1
  db_devices:
    paths:
    - /dev/nvme0n1
  filter_logic: AND
  objectstore: bluestore
Now I have 12 OSDs with the DB on the NVMe device, but without a WAL. How can I add a WAL to these OSDs?
The NVMe device still has 128 GB of free space.
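For what it's worth, a hedged sketch of one way this is commonly done on already-deployed OSDs, assuming a free LV can be carved out of the NVMe's volume group (the VG/LV names and OSD id below are made-up examples, and the OSD has to be stopped first). Note that when the DB already sits on the NVMe, BlueStore keeps the WAL on the DB device anyway, so a separate WAL is often unnecessary.

# stop the OSD (with cephadm: ceph orch daemon stop osd.12)
# create an LV for the WAL on the NVMe volume group (names are examples)
lvcreate -L 10G -n osd12-wal ceph-nvme-vg

# inside the OSD's container, attach the LV as its WAL device
cephadm shell --name osd.12
ceph-volume lvm new-wal --osd-id 12 --osd-fsid <osd fsid> --target ceph-nvme-vg/osd12-wal

# then restart the OSD and check that it picked up the new WAL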
Thanks a lot.
Sincerely
Jan Marek
--
Ing. Jan Marek
University of South Bohemia
Academic Computer Centre
Phone: +420389032080
http://www.gnu.org/philosophy/no-word-attachments.cs.html
Hi, I'm evaluating whether an even number of replicas is safe. Are 4 and 6 replicas still safe compared to 5 and 7?
With 4 replicas the min_size is 2. Does this mean that in a split-brain situation OSDs on both sides of the split could accept writes? Here's the scenario I'm thinking of. Please correct any assumptions I'm making; so far I've been unable to reproduce this problem in my testing:
There are two sites, A and B. There are 5 mons, 2 in A, 3 in B. Looking at just one PG and 4 replicas, we have 2 replicas in site A and 2 replicas in site B. Site A holds the primary OSD for this PG. When a network split happens, I/O would still be working in site A since there are still 2 OSDs, even without mon quorum. The primary OSD can still reach one of the other OSDs but not a quorum of mons.
After a period of time, site B has quorum and changes the topology of the cluster to say it has lost site A. Writes would then be accepted in site B, and still in site A as well, because min_size 2 remains satisfied there.
My question is - is this an actual scenario that can happen with an even number of replicas and an even network split? Do I need to go to 5 replicas with a min_size of 3 to prevent an even split from accepting writes on both sides? In that case one side would have 3 copies and the other would have 2.
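For context, a hedged sketch of the two-site protection Ceph itself offers for exactly this layout (stretch mode, available since Pacific); the monitor names, site names and rule name below are examples, not taken from my setup:

# use the connectivity-based monitor election strategy
ceph mon set election_strategy connectivity

# place the mons into the two sites plus a tiebreaker at a third location
ceph mon set_location a datacenter=site-a
ceph mon set_location b datacenter=site-b
ceph mon set_location e datacenter=tiebreaker

# enable stretch mode with the tiebreaker mon and a CRUSH rule
# that places 2 copies per datacenter (size 4 / min_size 2)
ceph mon enable_stretch_mode e stretch_rule datacenter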
Thanks.
Hey guys and girls,
I noticed that CephFS on my fairly default 17.2.6 volume does not
support setting the immutable bit. (I want to start using it with the
Veeam hardened repository, which relies on the immutable bit.)
I do see a lot of very, very old posts with technical details on how
to implement it, but is there a way for me to use that yet?
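For context, a minimal way to reproduce what I'm seeing (mount point and file name are examples); chattr fails because CephFS does not implement the required flags ioctl:

# on a CephFS kernel mount (path is an example)
touch /mnt/cephfs/testfile
chattr +i /mnt/cephfs/testfile   # fails: the immutable flag is not supported on CephFS
lsattr /mnt/cephfs/testfile      # likewise fails instead of listing flags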
Angelo.
see https://tracker.ceph.com/issues/10679
Hi.
Fresh cluster - after a dance where the autoscaler did not work
(returned blank output) as described in the docs - I now seemingly have it
working. It has bumped the target to something reasonable -- and is slowly
incrementing pg_num and pgp_num by 2 over time (I hope this is correct?).
But:
jskr@dkcphhpcmgt028:/$ sudo ceph osd pool ls detail | grep 62
pool 22 'cephfs.archive.ec62data' erasure profile ecprof62 size 8
min_size 7 crush_rule 3 object_hash rjenkins pg_num 150 pgp_num 22
pg_num_target 512 pgp_num_target 512 autoscale_mode on last_change 9159
lfor 0/0/9147 flags hashpspool,ec_overwrites,selfmanaged_snaps,bulk
stripe_width 24576 pg_num_min 128 target_size_ratio 0.4 application
cephfs
pg_num = 150
pgp_num = 22
and setting pgp_num seemingly has zero effect on the system .. not even
with autoscaling set to off.
jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data
pg_autoscale_mode off
set pool 22 pg_autoscale_mode to off
jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data
pgp_num 150
set pool 22 pgp_num to 150
jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data
pg_num_min 128
set pool 22 pg_num_min to 128
jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data
pg_num 150
set pool 22 pg_num to 150
jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data
pg_autoscale_mode on
set pool 22 pg_autoscale_mode to on
jskr@dkcphhpcmgt028:/$ sudo ceph progress
PG autoscaler increasing pool 22 PGs from 150 to 512 (14s)
[............................]
jskr@dkcphhpcmgt028:/$ sudo ceph osd pool ls detail | grep 62
pool 22 'cephfs.archive.ec62data' erasure profile ecprof62 size 8
min_size 7 crush_rule 3 object_hash rjenkins pg_num 150 pgp_num 22
pg_num_target 512 pgp_num_target 512 autoscale_mode on last_change 9159
lfor 0/0/9147 flags hashpspool,ec_overwrites,selfmanaged_snaps,bulk
stripe_width 24576 pg_num_min 128 target_size_ratio 0.4 application
cephfs
pgp_num != pg_num ?
In earlier versions of Ceph (without the autoscaler) I have only ever seen
setting pg_num and pgp_num take immediate effect?
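A hedged sketch of how this can be inspected, on the assumption that what I'm seeing is the gradual pg_num/pgp_num ramp the mgr performs (the values set above only move the pg_num_target/pgp_num_target fields, which the mgr then applies step by step):

# show the autoscaler's view of each pool (targets vs. actual)
ceph osd pool autoscale-status

# the mgr only raises pgp_num while the misplaced ratio stays below this value
ceph config get mgr target_max_misplaced_ratio

# (example only) allow more misplaced objects so the ramp proceeds faster
ceph config set mgr target_max_misplaced_ratio 0.10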
Jesper
jskr@dkcphhpcmgt028:/$ sudo ceph version
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy
(stable)
jskr@dkcphhpcmgt028:/$ sudo ceph health
HEALTH_OK
jskr@dkcphhpcmgt028:/$ sudo ceph status
  cluster:
    id:     5c384430-da91-11ed-af9c-c780a5227aff
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum dkcphhpcmgt031,dkcphhpcmgt029,dkcphhpcmgt028 (age 15h)
    mgr: dkcphhpcmgt031.afbgjx(active, since 32h), standbys: dkcphhpcmgt029.bnsegi, dkcphhpcmgt028.bxxkqd
    mds: 2/2 daemons up, 1 standby
    osd: 40 osds: 40 up (since 44h), 40 in (since 39h); 33 remapped pgs

  data:
    volumes: 2/2 healthy
    pools:   9 pools, 495 pgs
    objects: 24.85M objects, 60 TiB
    usage:   117 TiB used, 158 TiB / 276 TiB avail
    pgs:     13494029/145763897 objects misplaced (9.257%)
             462 active+clean
             23  active+remapped+backfilling
             10  active+remapped+backfill_wait

  io:
    client:   0 B/s rd, 1.1 MiB/s wr, 0 op/s rd, 94 op/s wr
    recovery: 705 MiB/s, 208 objects/s

  progress:
--
Jesper Krogh
Dear Ceph community,
We want to restructure (i.e. move around) a lot of data (hundreds of
terabytes) in our CephFS.
Now I'm wondering what happens within snapshots when I move data
around inside a snapshotted folder.
I.e. do I need to account for a lot of increased storage usage due to older
snapshots differing from the new, restructured state?
In the end these are just metadata changes. Are the snapshots aware of this?
Consider the following examples.
Copying data:
Let's say I have a folder /test, with a file XYZ in sub-folder
/test/sub1 and an empty sub-folder /test/sub2.
I create snapshot snapA in /test/.snap, copy XYZ to sub-folder
/test/sub2, delete it from /test/sub1 and create another snapshot snapB.
I would have two snapshots each with distinct copies of XYZ, hence using
double the space in the FS:
/test/.snap/snapA/sub1/XYZ <-- copy 1
/test/.snap/snapA/sub2/
/test/.snap/snapB/sub1/
/test/.snap/snapB/sub2/XYZ <-- copy 2
Moving data:
Let's assume the same structure.
But now after creating snapshot snapA, I move XYZ to sub-folder
/test/sub2 and then create the other snapshot snapB.
The directory tree will look the same. But how is this treated internally?
Once I move the data, will an actual copy be created in snapA to
represent the old state?
Or will it remain the same data (like a link to the inode or so),
and hence not double the storage used for that file?
I couldn't find (or understand) anything related to this in the docs.
The closest seems to be the hard-link section here:
https://docs.ceph.com/en/quincy/dev/cephfs-snapshots/#hard-links
Which unfortunately goes a bit over my head.
So I'm not sure if this answers my question.
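A hedged sketch of how one could check this empirically on a small test tree before touching the real data (mount point and file size are examples); CephFS snapshots are created with mkdir in .snap, recursive usage is exposed as a virtual xattr, and pool usage can be compared before and after the move:

# create the test layout and the first snapshot
mkdir -p /mnt/cephfs/test/sub1 /mnt/cephfs/test/sub2
dd if=/dev/urandom of=/mnt/cephfs/test/sub1/XYZ bs=1M count=1024
mkdir /mnt/cephfs/test/.snap/snapA

# move the file and take the second snapshot
mv /mnt/cephfs/test/sub1/XYZ /mnt/cephfs/test/sub2/XYZ
mkdir /mnt/cephfs/test/.snap/snapB

# compare the recursive byte count and the raw pool usage
getfattr -n ceph.dir.rbytes /mnt/cephfs/test
ceph df detail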
Thank you all for your help. Appreciate it.
Best Wishes,
Mathias Kuhring
The first issue of "Ceph Quarterly" is attached to this email. Ceph Quarterly (or "CQ") is an overview of the past three months of upstream Ceph development. We provide CQ in three formats: A4, letter, and plain text wrapped at 80 columns.
Zac Dover
Upstream Documentation
Ceph Foundation