https://docs.ceph.com/en/reef/cephfs/createfs/ says:
> The data pool used to create the file system is the “default” data pool and the location for storing all inode backtrace information, which is used for hard link management and disaster recovery.
> For this reason, all CephFS inodes have at least one object in the default data pool. If erasure-coded pools are planned for file system data, it is best to configure the default as a replicated pool to improve small-object write and read performance when updating backtraces.
This poses the question:
Are normal replicated CephFS installations (metadata on SSDs, data on HDDs) set up with suboptimal performance because they don't do this?
If having inodes/backtraces on replicated instead of EC improves performance, shouldn't one expect that putting inodes/backtraces on SSD would improve it even more?
From the docs I also cannot really tell when inodes/backtraces become important.
Is that all the time, or only sometimes?
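For context, my understanding of the layout the docs recommend looks roughly like the sketch below (pool names, PG counts, the CRUSH rule and the directory path are placeholders I made up for illustration):
```
# replicated default data pool: holds the backtrace objects for every inode
ceph osd pool create cephfs_data_default 128 replicated
ceph osd pool create cephfs_metadata 64 replicated

# EC pool that will hold the bulk of the file data
ceph osd pool create cephfs_data_ec 128 erasure
ceph osd pool set cephfs_data_ec allow_ec_overwrites true

# create the filesystem with the replicated pool as the default data pool,
# then attach the EC pool as an additional data pool
ceph fs new cephfs cephfs_metadata cephfs_data_default
ceph fs add_data_pool cephfs cephfs_data_ec

# optionally pin the default data pool to SSDs via a device-class CRUSH rule
ceph osd crush rule create-replicated replicated_ssd default host ssd
ceph osd pool set cephfs_data_default crush_rule replicated_ssd

# direct file data for a directory tree at the EC pool via a file layout
setfattr -n ceph.dir.layout.pool -v cephfs_data_ec /mnt/cephfs/data
```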
Thanks!
Hi All,
So I've got a Ceph Reef Cluster (latest version) with a CephFS system set up with a number of directories on it.
On a Laptop (running Rocky Linux (latest version)) I've used fstab to mount a number of those directories - all good, everything works, happy happy joy joy! :-)
However, when the laptop goes into sleep or hibernate mode (i.e. when I close the lid) and then comes back out of sleep/hibernate (i.e. I open the lid), the CephFS mounts are "not present". The only way to get them back is to run `mount -a` as root or via sudo. This, as I'm sure you'll agree, is less than ideal - especially as this is a pilot project for non-admin users (i.e. they won't have access to the root account or sudo on their own (corporate) laptops).
So, my question to the combined wisdom of the Community is what's the best way to resolve this issue?
I've looked at autofs, and even tried (half-heartedly - it was late, and I wanted to go home :-) ) to get this running, but I'm not sure if this is the best way to resolve things.
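For reference, the other approach I've been considering is to let systemd automount the shares on first access instead of mounting them at boot - something like the fstab line below (monitor addresses, filesystem path, CephX user and secret file are all placeholders):
```
# /etc/fstab - CephFS kernel mount, mounted on demand by systemd
192.168.1.10,192.168.1.11,192.168.1.12:/projects  /mnt/projects  ceph  name=laptopuser,secretfile=/etc/ceph/laptopuser.secret,noauto,x-systemd.automount,x-systemd.idle-timeout=60,_netdev  0  0
```
The idea being that the mount drops away when idle and is transparently re-established on the next access, so nothing needs to be run by hand after a resume - but I'm not sure whether that or autofs is the more robust option.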
All help and advice on this is greatly appreciated - thanks in advance.
Cheers
Dulux-Oz
Hello all.
I have a cluster with ~80 TB of spinning disk. Its primary role is CephFS. Recently I had a multiple-drive failure (it was not simultaneous), and it's left me with 20 incomplete PGs.
I know this data is toast, but I need to be able to get what isn't toast out of the CephFS - well, out of that pool and into a new pool.
The issue is that the incomplete PGs block I/O, and that hinders browsing the filesystem.
I'm attempting to use the "new" ceph-objectstore-tool mark-complete operation, but I'm struggling to work out what to mark complete, given that the pool is EC and each PG is made up of multiple shards (I think that's the right word), each with its own status.
I did manage to mark one of these shard PGs complete on what appeared to be the primary OSD, but it had no effect on that shard when checking it with `ceph pg X query` - by that I mean the shard was marked incomplete both before and after using ceph-objectstore-tool.
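For reference, what I ran looked roughly like this (the OSD id and PG/shard id below are examples, not my real ones):
```
# stop the OSD that appears to be primary for the shard
systemctl stop ceph-osd@12

# inspect the shard, then mark it complete
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --pgid 5.1fs0 --op info
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --pgid 5.1fs0 --op mark-complete

systemctl start ceph-osd@12
```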
I'm running Ceph 18.2.2 with all-HDD OSDs. I can get any logs that will help; I'm just not 100% sure where to start, and I don't want to just dump the `ceph pg X query` output for 20 PGs on the mailing list.
Thanks so much
Mal
Hello,
We have a question about mClock scheduling reads on pacific (16.2.14 currently).
When we do massive reads - for example when draining machines that hold a lot of data on EC pools - we quite frequently observe slow ops on the source OSDs. Those slow ops affect the client services that talk directly to RADOS. If we kill the OSD causing the slow ops, the recovery stays at more or less the same speed, but there are no more slow ops.
And when we tweak mClock, limiting on the source OSDs has no observable effect; however, if we limit on the target OSDs, the global speed slows down and the slow ops disappear.
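(For reference, the kind of tweak we applied - the option names are the standard mClock settings, but the OSD id and values below are only illustrative:)
```
# switch the target OSD to the custom mClock profile so its limits can be overridden
ceph config set osd.42 osd_mclock_profile custom

# cap background recovery/backfill IOPS and favour client ops on that OSD
ceph config set osd.42 osd_mclock_scheduler_background_recovery_lim 128
ceph config set osd.42 osd_mclock_scheduler_client_wgt 4
```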
So our question is: does mClock take reads into account as well as writes, or are reads calculated to be less expensive than writes?
Thanks,
Luis Domingues
Proton AG
Hi,
We're wondering what we are missing in our netplan configuration on Ubuntu for Ceph to tolerate a link failure properly.
We are using this bond configuration on Ubuntu 20.04 with Octopus:
bond1:
  macaddress: x.x.x.x.x.50
  dhcp4: no
  dhcp6: no
  addresses:
    - 192.168.199.7/24
  interfaces:
    - ens2f0np0
    - ens2f1np1
  mtu: 9000
  parameters:
    mii-monitor-interval: 100
    mode: 802.3ad
    lacp-rate: fast
    transmit-hash-policy: layer3+4
ens2f1np1 failed and caused slow ops, all OSDs down ... = disaster.
Any idea what is wrong with this bond config?
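In case it helps with the diagnosis, the bond state can be inspected with the usual (non-Ceph) tools, e.g.:
```
# check which links negotiated LACP and which one is currently active
cat /proc/net/bonding/bond1

# per-interface link state and error counters
ip -s link show ens2f1np1
```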
Thank you
(adding the list back to the thread)
On Wednesday, March 27, 2024 12:54:34 PM EDT Daniel Brown wrote:
> John
>
>
> I got curious and was taking another quick look through the python script
> for cephadm.
>
That's always welcome. :-D
> This is probably too simple of a question to be asking — or maybe I should
> say, I’m not expecting that there’s a simple answer to what might seem like
> a simple question -
>
> Is there anything that notifies the cluster, or the other hosts in a
> cluster, when a host is going into maintenance mode that it is going into
> maintenance mode, or is cephadm just doing systemctl commands behind the
> scenes to stop and later restart the appropriate ceph containers locally on
> that host?
>
> Maybe a better way to say it would be - what is differentiating between
> maintenance mode and a host simply crashing or going offline?
I'll paraphrase Adam King, tech lead for cephadm here:
If one runs the command from the cephadm binary directly, it will only disable/stop the systemd target. The intention is for users to use the `ceph orch host maintenance` ... commands.
When you use the orch command (quoting Adam here):
```
when we put something into maintenance mode we
1) disable and stop the systemd target for the daemons on the host
2) set the noout flag for all the OSDs on that host
3) internally to cephadm mark the host as having a status of "maintenance"
which has some effects such as us not refreshing metadata on that host or
attempting to place/remove daemons from there
The main difference from that to a host going offline is the noout flag for the
OSDs, and that cephadm will not periodically try to check if the host is
alive, as it would do for an offline host.
I believe the noout flag stops it from trying to migrate all the data on that
OSDs to other OSDs as it shouldn't be necessary if they will be coming back
```
The `cephadm host-maintenance enter` command is meant to be a component of the `ceph orch host maintenance` workflow. It still has a bug - the way it always exits with an error is wrong - but you may not want to use it directly anyway.
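For completeness, the orchestrator-driven workflow is just (hostname is a placeholder):
```
# put the host into maintenance: stops its daemons and sets noout for its OSDs
ceph orch host maintenance enter host01

# ... do the maintenance work, then bring it back ...
ceph orch host maintenance exit host01
```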
Reference links:
https://docs.ceph.com/en/latest/cephadm/host-management/#maintenance-mode
https://docs.ceph.com/en/latest/dev/cephadm/host-maintenance/
> > On Mar 22, 2024, at 6:26 AM, Daniel Brown <daniel.h.brown(a)thermify.cloud>
> > wrote:
> >
> >
> > Looks like it got OK’ed. I’ll put in something today.
> >
> >
> > --
> > Dan Brown
> >
> >> On Mar 21, 2024, at 13:44, John Mulligan <phlogistonjohn(a)asynchrono.us>
> >> wrote:
> >> On Thursday, March 21, 2024 11:43:19 AM EDT Daniel Brown wrote:
> >>> Assuming I need admin approval to report this on tracker, how long does
> >>> it
> >>> take to get approved?? Signed up a couple days ago, but still seeing
> >>> “Your
> >>> account was created and is now pending administrator approval.”
> >>
> >> That's unfortunate. I pinged about your issue signing up on the ceph
> >> slack
> >> channel for infrastructure. Hopefully, that'll get somebody's attention.
> >> If
> >> you don't get access by tomorrow feel free to ping me again directly and
> >> then *I'll* file the issue for you instead of having you wait around
> >> more.
Hey,
I'm running ceph version 18.2.1 (reef) but this problem must have existed a
long time before reef.
The documentation says the autoscaler will target 100 PGs per OSD, but I'm only seeing ~10. My erasure coding is a stripe of 6 data + 3 parity chunks. Could that be the reason, i.e. are the PG numbers for that EC pool effectively multiplied by k+m in the autoscaler's calculations?
Is backfill_toofull calculated by checking the total size of the PG against every OSD it is destined for? In my case I have ~1 TiB PGs because the autoscaler is creating only 10 per host, and backfill_toofull is then triggered because one of my OSDs only has ~500 GiB free - although that doesn't quite add up either, because two ~1 TiB PGs that have OSD 1 in them are currently backfilling. My backfill full ratio is set to 97%.
Would it be correct for me to change the autoscaler to target ~700 PGs per OSD, and to set the bias for storagefs and all EC pools to k+m? Should that be the default, or the value the documentation recommends?
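(For reference, the knobs I'm thinking of touching - the values below are only illustrative:)
```
# raise the cluster-wide PG-per-OSD target used by the autoscaler
ceph config set global mon_target_pg_per_osd 200

# give the EC data pool a higher bias so the autoscaler chooses more PGs for it
ceph osd pool set storagefs pg_autoscale_bias 4
```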
How scary is changing PG_NUM while misplaced PGs are backfilling? It seems like there's a chance the backfill might succeed, so I think I can wait.
Any help is greatly appreciated, I've tried to include as much of the
relevant debugging output as I can think of.
Daniel
# ceph osd ls | wc -l
44
# ceph pg ls | wc -l
484
# ceph osd pool autoscale-status
POOL                   SIZE    TARGET SIZE  RATE   RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
.rgw.root              216.0k               3.0    480.2T        0.0000                                 1.0   32                  on         False
default.rgw.control    0                    3.0    480.2T        0.0000                                 1.0   32                  on         False
default.rgw.meta       0                    3.0    480.2T        0.0000                                 1.0   32                  on         False
default.rgw.log        1636k                3.0    480.2T        0.0000                                 1.0   32                  on         False
storagefs              233.5T               1.5    480.2T        0.7294                                 1.0   256                 on         False
storagefs-meta         850.2M               4.0    480.2T        0.0000                                 4.0   32                  on         False
storagefs_wide         355.3G               1.375  480.2T        0.0010                                 1.0   32                  on         False
.mgr                   457.3M               3.0    480.2T        0.0000                                 1.0   1                   on         False
mgr-backup-2022-08-19  370.6M               3.0    480.2T        0.0000                                 1.0   32                  on         False
# ceph osd pool ls detail | column -t
pool 15 '.rgw.root'              replicated           size 3   min_size 2  crush_rule 0  object_hash rjenkins  pg_num 32   pgp_num 32   autoscale_mode on
pool 16 'default.rgw.control'    replicated           size 3   min_size 2  crush_rule 0  object_hash rjenkins  pg_num 32   pgp_num 32   autoscale_mode on
pool 17 'default.rgw.meta'       replicated           size 3   min_size 2  crush_rule 0  object_hash rjenkins  pg_num 32   pgp_num 32   autoscale_mode on
pool 18 'default.rgw.log'        replicated           size 3   min_size 2  crush_rule 0  object_hash rjenkins  pg_num 32   pgp_num 32   autoscale_mode on
pool 36 'storagefs'              erasure profile 6.3  size 9   min_size 7  crush_rule 2  object_hash rjenkins  pg_num 256  pgp_num 256  autoscale_mode on
pool 37 'storagefs-meta'         replicated           size 4   min_size 1  crush_rule 0  object_hash rjenkins  pg_num 32   pgp_num 32   autoscale_mode on
pool 45 'storagefs_wide'         erasure profile 8.3  size 11  min_size 9  crush_rule 8  object_hash rjenkins  pg_num 32   pgp_num 32   autoscale_mode on
pool 46 '.mgr'                   replicated           size 3   min_size 2  crush_rule 0  object_hash rjenkins  pg_num 1    pgp_num 1    autoscale_mode on
pool 48 'mgr-backup-2022-08-19'  replicated           size 3   min_size 2  crush_rule 0  object_hash rjenkins  pg_num 32   pgp_num 32   autoscale_mode on
# ceph osd erasure-code-profile get 6.3
crush-device-class=
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=6
m=3
plugin=jerasure
technique=reed_sol_van
w=8
# ceph pg ls | awk 'NR==1 || /backfill_toofull/' | awk '{print $1" "$2" "$4" "$6" "$11" "$15" "$16}' | column -t
PG     OBJECTS  MISPLACED  BYTES         STATE                             UP                              ACTING
36.f   222077   141392     953817797727  active+remapped+backfill_toofull  [1,27,41,8,36,17,14,40,32]p1    [33,32,29,23,16,17,28,1,14]p33
36.5c  221761   147015     950692130045  active+remapped+backfill_toofull  [26,27,40,29,1,37,39,11,42]p26  [12,24,4,2,31,25,17,33,8]p12
36.60  222710   0          957109050809  active+remapped+backfill_toofull  [41,34,22,3,1,35,9,39,29]p41    [2,34,22,3,27,32,28,24,1]p2
36.6b  222202   427168     953843892012  active+remapped+backfill_toofull  [20,15,7,21,37,1,38,17,32]p20   [7,2,32,26,5,35,24,17,23]p7
36.74  222681   777546     957679960067  active+remapped+backfill_toofull  [42,24,12,34,38,10,27,1,25]p42  [34,33,12,0,19,14,17,30,25]p34
36.7b  222974   1560818    957691042940  active+remapped+backfill_toofull  [2,35,27,1,20,18,19,12,8]p2     [31,23,21,24,35,18,19,33,25]p31
36.82  222362   1998670    954507657022  active+remapped+backfill_toofull  [37,22,1,38,11,23,27,32,33]p37  [27,33,0,32,5,25,20,13,15]p27
36.b5  221676   1330056    953443725830  active+remapped+backfill_toofull  [6,8,38,12,21,1,39,34,27]p6     [33,8,26,12,3,10,22,34,1]p33
36.b6  222669   1335327    956973704883  active+remapped+backfill_toofull  [11,13,41,4,12,34,29,6,1]p11    [2,29,34,4,12,9,15,6,28]p2
36.e0  221518   1772144    952581426388  active+remapped+backfill_toofull  [1,27,21,31,30,23,37,13,28]p1   [25,21,14,31,1,2,34,17,24]p25
# ceph pg ls | awk 'NR==1 || /backfilling/' | grep -e BYTES -e '\[1' -e ',1,' -e '1\]' | awk '{print $1" "$2" "$4" "$6" "$11" "$15" "$16}' | column -t
PG     OBJECTS  MISPLACED  BYTES         STATE                        UP                              ACTING
36.4a  221508   89144      951346455917  active+remapped+backfilling  [40,43,33,32,30,38,22,35,9]p40  [27,10,20,7,30,21,1,28,31]p27
36.79  222315   1111575    955797107713  active+remapped+backfilling  [1,36,31,33,25,23,14,3,13]p1    [27,6,31,23,25,5,14,29,13]p27
36.8d  222229   1284156    955234423342  active+remapped+backfilling  [35,34,27,37,38,36,43,3,16]p35  [35,34,15,26,1,11,27,18,16]p35
36.ba  222039   0          952547107971  active+remapped+backfilling  [0,40,33,23,41,4,27,22,28]p0    [0,35,33,27,1,3,30,22,28]p0
36.da  221607   277464     951599928383  active+remapped+backfilling  [21,31,8,9,11,25,36,23,28]p21   [0,10,1,22,33,11,35,15,28]p0
36.db  221685   58816      951420054091  active+remapped+backfilling  [3,28,12,13,1,38,40,35,43]p3    [27,20,17,21,1,23,28,24,31]p27
# ceph osd df | sort -nk 17 | tail -n 5
21  hdd  9.09598  1.00000  9.1 TiB  7.7 TiB  7.7 TiB  0 B     31 GiB  1.4 TiB   84.62  1.16  68  up
24  hdd  9.09598  1.00000  9.1 TiB  7.7 TiB  7.7 TiB  1 KiB   25 GiB  1.4 TiB   84.98  1.16  69  up
29  hdd  9.09569  1.00000  9.1 TiB  8.0 TiB  8.0 TiB  72 MiB  23 GiB  1.1 TiB   88.42  1.21  73  up
13  hdd  9.09569  1.00000  9.1 TiB  8.1 TiB  8.1 TiB  1 KiB   22 GiB  1023 GiB  89.02  1.22  76  up
 1  hdd  7.27698  1.00000  7.3 TiB  6.8 TiB  6.8 TiB  27 MiB  18 GiB  451 GiB   93.94  1.28  64  up
# cat /etc/ceph/ceph.conf | grep full
mon_osd_full_ratio = .98
mon_osd_nearfull_ratio = .96
mon_osd_backfillfull_ratio = .97
osd_backfill_full_ratio = .97
osd_failsafe_full_ratio = .99
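(For completeness, my understanding is that on current releases the ratios actually in effect live in the OSDMap rather than ceph.conf; they can be checked and adjusted at runtime like this:)
```
# show the full/backfillfull/nearfull ratios currently stored in the OSDMap
ceph osd dump | grep ratio

# adjust the backfillfull ratio at runtime (value is just my current setting)
ceph osd set-backfillfull-ratio 0.97
```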
Hi team,
I am new to Ceph and I am looking to monitor the per-user/per-bucket usage for Ceph, as per the following link:
https://docs.ceph.com/en/latest/radosgw/metrics/
But when I enabled it using the command:
'ceph config set client.rgw CONFIG_VARIABLE VALUE'
I could only see the following perf schema:
```
"rgw": {
"req": {
"type": 10,
"metric_type": "counter",
"value_type": "integer",
"description": "Requests",
"nick": "",
"priority": 5,
"units": "none"
},
"failed_req": {
"type": 10,
"metric_type": "counter",
"value_type": "integer",
"description": "Aborted requests",
"nick": "",
"priority": 5,
"units": "none"
xxxxxxxxxxxxxxxxxxxSNIPxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```
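(For reference, this is roughly how I inspected the counters - the admin socket path is just an example, adjust it to your RGW instance:)
```
# dump the perf counter schema from the running RGW daemon
ceph daemon /var/run/ceph/ceph-client.rgw.myrgw.asok perf schema

# on 18.2.x the labeled counters, if present, should also show up via
ceph daemon /var/run/ceph/ceph-client.rgw.myrgw.asok counter dump
```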
But as per the link we should have also gotten the following metrics:
```
"rgw_op": [
{
"labels": {},
"counters": {
"put_obj_ops": 2,
"put_obj_bytes": 5327,
"put_obj_lat": {
"avgcount": 2,
"sum": 2.818064835,
"avgtime": 1.409032417
},
"get_obj_ops": 5,
"get_obj_bytes": 5325,
"get_obj_lat": {
"avgcount": 2,
"sum": 0.003000069,
"avgtime": 0.001500034
},
...
"list_buckets_ops": 1,
"list_buckets_lat": {
"avgcount": 1,
"sum": 0.002300000,
"avgtime": 0.002300000
}
}
},
]
```
But as per the following links:
https://github.com/ceph/ceph/blob/v19.0.0/src/rgw/rgw_perf_counters.cc
https://github.com/ceph/ceph/blob/v18.2.2/src/rgw/rgw_perf_counters.cc
I don't think this feature is currently supported.
Could anyone please help me with this?
Ceph version being used by us: 18.2.0 (Reef) / 18.2.2
Thanks and Regards,
Kushagra Gupta