Hi,
I'd like to know the cache hit rate of my Ceph OSDs. I have installed Prometheus and Grafana, but there is no cache hit rate metric on the Grafana dashboards...
Does Ceph have a cache hit rate counter? I'd like to understand the impact of caching on READ performance in my Ceph cluster.
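For example, I had hoped to find something like the BlueStore onode cache counters in a perf dump, if those are even the right thing to look at (a sketch; osd.0 is just an example, and the exact counter names seem to vary between releases):
# ceph tell osd.0 perf dump | jq '.bluestore | with_entries(select(.key | test("onode")))'
If those counters are appropriate, I assume hits / (hits + misses) would give a hit rate, but I'm not sure they reflect the cache behaviour I'm after.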
Regards,
--
Mitsumasa KONDO
Hi,
tl;dr why are my osds still spilling?
I've recently upgraded to 16.2.14 from 16.2.9 and started receiving bluefs
spillover warnings (due to the "fix spillover alert" per the 16.2.14
release notes). E.g. from 'ceph health detail', the warning on one of
these (there are a few):
osd.76 spilled over 128 KiB metadata from 'db' device (56 GiB used of 60 GiB) to slow device
This is a 15T HDD with only a 60G SSD for the db, so the spillover isn't
surprising: the db is way below the recommendation for rbd usage of a db
size of 1-2% of the storage size.
There was some spare space on the db SSD, so I increased the db LV to over
400G and did a bluefs-bdev-expand.
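For reference, the steps were roughly these (a sketch from memory; the VG/LV names are made up, and ceph-bluestore-tool was run with the OSD stopped, inside "cephadm shell --name osd.76"):
# lvextend -L +400G /dev/ceph-db-vg/osd-76-db
# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-76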
However, days later, I'm still getting the spillover warning for that osd,
including after running a manual compact:
# ceph tell osd.76 compact
See attached perf-dump-76 for the perf dump output:
# cephadm enter --name 'osd.76' ceph daemon 'osd.76' perf dump | jq -r '.bluefs'
In particular, if my understanding is correct, that's telling me the db
available size is 477G (i.e. the LV expand worked), of which it's using
59G, and there's 128K spilled to the slow device:
"db_total_bytes": 512309059584, # 477G
"db_used_bytes": 63470305280, # 59G
"slow_used_bytes": 131072, # 128K
A "bluefs stats" also says the db is using 128K of slow storage (although
perhaps it's getting the info from the same place as the perf dump?):
# ceph tell osd.76 bluefs stats
1 : device size 0x7747ffe000 : using 0xea6200000(59 GiB)
2 : device size 0xe8d7fc00000 : using 0x6554d689000(6.3 TiB)
RocksDBBlueFSVolumeSelector Usage Matrix:
DEV/LEV WAL DB SLOW * * REAL FILES
LOG 0 B 10 MiB 0 B 0 B 0 B 8.8 MiB 1
WAL 0 B 2.5 GiB 0 B 0 B 0 B 751 MiB 8
DB 0 B 56 GiB 128 KiB 0 B 0 B 50 GiB 842
SLOW 0 B 0 B 0 B 0 B 0 B 0 B 0
TOTAL 0 B 58 GiB 128 KiB 0 B 0 B 0 B 850
MAXIMUMS:
LOG 0 B 22 MiB 0 B 0 B 0 B 18 MiB
WAL 0 B 3.9 GiB 0 B 0 B 0 B 1.0 GiB
DB 0 B 71 GiB 282 MiB 0 B 0 B 62 GiB
SLOW 0 B 0 B 0 B 0 B 0 B 0 B
TOTAL 0 B 74 GiB 282 MiB 0 B 0 B 0 B
>> SIZE << 0 B 453 GiB 14 TiB
I had a look at the "DUMPING STATS" output in the logs, but I don't know
how to interpret it. I did try totalling the sizes on the "Sum" lines, but
that comes to 100G, so I don't know what that all means.
See attached log-stats-76.
I also tried "ceph-kvstore-tool bluestore-kv ... stats":
$ {
cephadm unit --fsid $clusterid --name osd.76 stop
cephadm shell --fsid $clusterid --name osd.76 -- ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-76 stats
cephadm unit --fsid $clusterid --name osd.76 start
}
Output attached as bluestore-kv-stats-76. I can't see anything interesting
in there, although again I don't really know how to interpret it.
So... why is this osd db still spilling onto slow storage, and how do I fix
things so it's no longer using the slow storage?
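If it helps frame an answer: is something like bluefs-bdev-migrate the right tool to push that last 128K back to the db device? E.g. (an untested sketch, again with the OSD stopped, inside "cephadm shell --name osd.76"):
# ceph-bluestore-tool bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-76 \
    --devs-source /var/lib/ceph/osd/ceph-76/block --dev-target /var/lib/ceph/osd/ceph-76/block.db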
And a bonus issue... on another osd that hasn't yet been resized (i.e.
again with a grossly undersized 60G db on SSD with a 15T HDD) I'm also
getting a spillover warning. The "bluefs stats" seems to be saying the db
is NOT currently spilling (i.e. "0 B" at the DB/SLOW position in the
matrix), but there's "something" currently using 59G on the slow device:
$ ceph tell osd.85 bluefs stats
1 : device size 0xeffffe000 : using 0x3a3900000(15 GiB)
2 : device size 0xe8d7fc00000 : using 0x7aea7434000(7.7 TiB)
RocksDBBlueFSVolumeSelector Usage Matrix:
DEV/LEV WAL DB SLOW * * REAL FILES
LOG 0 B 10 MiB 0 B 0 B 0 B 7.4 MiB 1
WAL 0 B 564 MiB 0 B 0 B 0 B 132 MiB 2
DB 0 B 11 GiB 0 B 0 B 0 B 8.1 GiB 177
SLOW 0 B 3.0 GiB 59 GiB 0 B 0 B 56 GiB 898
TOTAL 0 B 13 GiB 59 GiB 0 B 0 B 0 B 1072
MAXIMUMS:
LOG 0 B 24 MiB 0 B 0 B 0 B 20 MiB
WAL 0 B 2.8 GiB 0 B 0 B 0 B 1.0 GiB
DB 0 B 22 GiB 448 KiB 0 B 0 B 18 GiB
SLOW 0 B 3.3 GiB 62 GiB 0 B 0 B 62 GiB
TOTAL 0 B 27 GiB 62 GiB 0 B 0 B 0 B
>> SIZE << 0 B 57 GiB 14 TiB
Is there anywhere that describes how to interpret this output, and
specifically, what stuff is going into the SLOW row? Seemingly there's
898 "files" there, but not LOG, WAL or DB files - so what are they?
Cheers,
Chris
First, an abject apology for the horrors I'm about to unveil. I made a
cold migration from GlusterFS to Ceph a few months back, so it was a
learn-/screwup/-as-you-go affair.
For reasons of presumed compatibility with some of my older servers, I
started with Ceph Octopus. Unfortunately, Octopus seems to have been a
nexus of transitions from the older Ceph organization and management to
the newer (cephadm) system, combined with a relocation of many Ceph
resources and compounded by stale bits of documentation (notably some
references to SysV procedures and an obsolete installer that doesn't
even ship with Octopus).
A far bigger problem was a known issue where actions would be scheduled
but never executed if the system was even slightly dirty. And of
course, since my system was hopelessly dirty, that was a major issue.
Finally I took a risk and bumped up to Pacific, where that issue no
longer exists. I won't say that I'm 100% clean even now, but at least
the remaining crud is in areas where it cannot do any harm. Presumably.
Given that, the only bar now remaining to total joy has been my
inability to connect the Ceph Dashboard to the Object Gateway.
This seems to be an oft-reported problem, but generally referenced
relative to higher-level administrative interfaces like Kubernetes and
Rook. I'm interfacing more directly, however. Regardless, the error
reported is notably familiar:
[quote]
The Object Gateway Service is not configured
Error connecting to Object Gateway: RGW REST API failed request with
status code 404
(b'{"Code":"NoSuchBucket","Message":"","BucketName":"default","RequestI
d":"tx00' b'000dd0c65b8bda685b4-00652d8e0f-5e3a9b-
default","HostId":"5e3a9b-default-defa' b'ult"}')
Please consult the documentation on how to configure and enable the
Object Gateway management functionality.
[/quote]
In point of fact, what this REALLY means in my case is that the bucket
that is supposed to contain the necessary information for the dashboard
and rgw to communicate has not been created. Presumably that SHOULD have
been done by the "ceph dashboard set-rgw-credentials" command, but
apparently wasn't, because the default zone has no buckets at all, much
less one named "default".
By way of reference, the dashboard is definitely trying to interact
with the rgw container, because exercising the object gateway options on
the dashboard results in the container logging the following:
beast: 0x7efd29621620: 10.0.1.16 - dashboard [16/Oct/2023:19:25:03.678
+0000] "GET /default/metadata/user?myself HTTP/1.1" 404
To make everything happy, I'd be glad to accept instructions on how to
manually brute-force construct this bucket.
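For what it's worth, here are the checks I can run and report back on (a sketch; "dashboard" is my assumption about the uid that set-rgw-credentials creates):
# radosgw-admin user list
# radosgw-admin user info --uid=dashboard
# ceph dashboard set-rgw-credentials
The first two should show whether the dashboard's rgw user exists at all, and re-running the third might surface an error that was previously swallowed.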
Of course, as a cleaner long-term solution, it would be nice if the
failure to create could be detected and logged.
And of course, the ultimate solution: something that would assist in
making whatever processes are unhappy be happy.
Thanks,
Tim
Hello All,
Greetings. We have a Ceph cluster running version
ceph version 14.2.16-402-g7d47dbaf4d
(7d47dbaf4d0960a2e910628360ae36def84ed913) nautilus (stable)
=========================================
Issue: unable to delete RBD images
We deleted the target from the dashboard and are now trying to delete the
RBD images from the CLI, but the deletion fails.
When we run "rbd rm -f tegile-500tb -p iscsi-images" it returns:
2023-10-16 15:22:16.719 7f90bb332700 -1 librbd::image::PreRemoveRequest: 0x7f90a80041a0 check_image_watchers: image has watchers - not removing
Removing image: 0% complete...failed.
rbd: error: image still has watchers
This means the image is still open or the client using it crashed. Try again after closing/unmapping it or waiting 30s for the crashed client to timeout.
============================
It also cannot be deleted from the dashboard.
============================
We also tried to list the watchers, but the command fails with "No such file or directory":
============================
"rbd info iscsi-images/tegile-500tb"
rbd: error opening image tegile-500tb: (2) No such file or directory
============================
The image does not show up in "rbd showmapped" output either, hence we
cannot unmap it.
We cannot restart the iSCSI gateway because it is in active use and we
cannot interrupt it.
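For completeness, here is what we plan to try next, based on list archives (a sketch; SOMEID stands for the image's internal id, which we still have to find, and CLIENT_ADDR would come from the listwatchers output):
# rados -p iscsi-images ls | grep rbd_header
# rados -p iscsi-images listwatchers rbd_header.SOMEID
# ceph osd blacklist add CLIENT_ADDR
The idea is that blacklisting the stale client should release the watch so the image can be removed.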
===========================
Please suggest how to fix this issue.
Hi,
I'm still trying to fight the large disk writes produced by Ceph monitors.
One option I considered is enabling RocksDB compression, as our nodes have
more than sufficient RAM and CPU. Unfortunately, monitors seem to
completely ignore the compression setting:
I tried:
- setting ceph config set mon.ceph05 mon_rocksdb_options
"write_buffer_size=33554432,compression=kLZ4Compression,level_compaction_dynamic_level_bytes=true"
and restarting the test monitor. The monitor started with no RocksDB
compression:
debug 2023-10-13T19:47:00.403+0000 7f1cd967a880 4 rocksdb: Compression
algorithms supported:
debug 2023-10-13T19:47:00.403+0000 7f1cd967a880 4 rocksdb:
kZSTDNotFinalCompression supported: 0
debug 2023-10-13T19:47:00.403+0000 7f1cd967a880 4 rocksdb:
kXpressCompression supported: 0
debug 2023-10-13T19:47:00.403+0000 7f1cd967a880 4 rocksdb:
kLZ4HCCompression supported: 1
debug 2023-10-13T19:47:00.403+0000 7f1cd967a880 4 rocksdb:
kLZ4Compression supported: 1
debug 2023-10-13T19:47:00.403+0000 7f1cd967a880 4 rocksdb:
kBZip2Compression supported: 0
debug 2023-10-13T19:47:00.403+0000 7f1cd967a880 4 rocksdb:
kZlibCompression supported: 1
debug 2023-10-13T19:47:00.403+0000 7f1cd967a880 4 rocksdb:
kSnappyCompression supported: 1
...
debug 2023-10-13T19:47:00.403+0000 7f1cd967a880 4 rocksdb:
Options.compression: NoCompression
debug 2023-10-13T19:47:00.403+0000 7f1cd967a880 4 rocksdb:
Options.bottommost_compression: Disabled
- setting ceph config set mon mon_rocksdb_options
"write_buffer_size=33554432,compression=kLZ4Compression,level_compaction_dynamic_level_bytes=true"
and restarting the test monitor. The monitor started with no RocksDB
compression, the same way as above.
In each case, the config options were correctly set and readable with
config get. I also found a suggestion in ceph-users (
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/KJM232IHN7…)
to set compression in a similar manner. Unfortunately, these options appear
to be ignored.
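For reference, this is how I verified the setting (a sketch; output abbreviated):
# ceph config get mon.ceph05 mon_rocksdb_options
write_buffer_size=33554432,compression=kLZ4Compression,...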
How can I enable RocksDB compression in Ceph monitors?
I would very much appreciate your advice and comments.
Best regards,
Zakhar
Hello.
It's been a while. For the past couple years I've had a cluster running
Nautilus on Debian 10 using the Debian Ceph packages, and deployed with
Ceph-Ansible. It's not a huge cluster - 10 OSD nodes with 80 x 12TB HDD
OSDs, plus 3 management nodes, and about 40% full at the moment - but it is
a critical resource for one of our researchers.
Back then I had some misgivings about non-Debian packages and also about
containerized Ceph. I don't know if my feelings about these things have
changed that much, but it's time to upgrade, and, with the advent of
cephadm, it looks like it's just better to stay mainstream.
So I'm looking for advice on how to get from where I'm at to at least
Pacific or Quincy.
I've read a little in the last couple of days. I've seen various opinions
on (not) skipping releases and on when to switch to cephadm. I'm also
concerned about cleaning up those old Debian packages - will there be a
point where I can 'apt-get purge' them without harming the cluster?
One particular thing: The upgrade instructions in various places on
docs.ceph.com say something like
Upgrade monitors by installing the new packages and restarting the monitor
daemons.
To me this is kind of vague. Perhaps there is a different concept of
'packages' within the cephadm environment. I could really use some
clarification on this.
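For instance, is the cephadm-era flow something like the following (my guess from skimming the docs, untested, and presumably only after the cluster is already on a cephadm-capable release)?
# on each host, convert the existing daemons to cephadm management
cephadm adopt --style legacy --name mon.$(hostname -s)
# then drive the rest of the upgrade through the orchestrator
ceph orch upgrade start --ceph-version 16.2.14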
I'd also consider decommissioning a few nodes, setting up a new cluster on
fresh Debian installs, and migrating the data and remaining nodes. This
would be a long and painful process - decommission a node, move it, move
some data, decommission another node - and I don't know what effect it
would have on external references to our object store.
Please advise.
Thanks.
-Dave
--
Dave Hall
Binghamton University
kdhall(a)binghamton.edu
Hi
We have Kerberos working with bare-metal kernel NFS exporting RBDs. I
can see in the Ceph documentation[1] that nfs-ganesha should work with
Kerberos, but I'm having little luck getting it to work.
This bit from the container log seems to suggest that some plumbing is
missing?
"
13/10/2023 08:09:12 : epoch 6528fb25 : ceph-flash1 :
ganesha.nfsd-2[main] nfs_rpc_cb_init_ccache :NFS STARTUP :EVENT
:Callback creds directory (/var/run/ganesha) already exists
13/10/2023 08:09:12 : epoch 6528fb25 : ceph-flash1 :
ganesha.nfsd-2[main] find_keytab_entry :NFS CB :WARN :Configuration file
does not specify default realm while getting default realm name
13/10/2023 08:09:12 : epoch 6528fb25 : ceph-flash1 :
ganesha.nfsd-2[main] gssd_refresh_krb5_machine_credential :NFS CB :CRIT
:ERROR: gssd_refresh_krb5_machine_credential: no usable keytab entry
found in keytab /etc/krb5.keytab for connection with host localhost
13/10/2023 08:09:12 : epoch 6528fb25 : ceph-flash1 :
ganesha.nfsd-2[main] nfs_rpc_cb_init_ccache :NFS STARTUP :WARN
:gssd_refresh_krb5_machine_credential failed (-1765328160:0)
"
Thoughts?
Mvh.
Torkil
[1] https://docs.ceph.com/en/quincy/mgr/nfs/#create-cephfs-export
--
Torkil Svensgaard
Systems Administrator
Danish Research Centre for Magnetic Resonance DRCMR, Section 714
Copenhagen University Hospital Amager and Hvidovre
Kettegaard Allé 30, 2650 Hvidovre, Denmark
Hi!
Further to my thread "Ceph 16.2.x mon compactions, disk writes" (
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/XGCI2LFW5RH…)
where we established that Ceph monitors indeed write considerable
amounts of data to disk, I would like to ask fellow Ceph users to
provide feedback and help gather statistics on whether this happens on
all clusters or only on some subset of them.
The procedure is rather simple and won't take much of your time.
If you are willing to help, please follow these steps:
---------
1. Install iotop and run the following command on any of your monitor nodes:
iotop -ao -bn 2 -d 300 2>&1 | grep -E "TID|ceph-mon"
This will collect 5 minutes of disk I/O statistics and produce output
containing the stats for the Ceph monitor threads running on the node:
TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
4854 be/4 167 8.62 M 2.27 G 0.00 % 0.72 % ceph-mon -n
mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false
--default-log-to-stderr=true --default-log-stderr-prefix=debug
--default-mon-cluster-log-to-file=false
--default-mon-cluster-log-to-stderr=true [rocksdb:low0]
4919 be/4 167 0.00 B 39.43 M 0.00 % 0.02 % ceph-mon -n
mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false
--default-log-to-stderr=true --default-log-stderr-prefix=debug
--default-mon-cluster-log-to-file=false
--default-mon-cluster-log-to-stderr=true [ms_dispatch]
4855 be/4 167 8.00 K 19.55 M 0.00 % 0.00 % ceph-mon -n
mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false
--default-log-to-stderr=true --default-log-stderr-prefix=debug
--default-mon-cluster-log-to-file=false
--default-mon-cluster-log-to-stderr=true [rocksdb:high0]
We're particularly interested in the amount of written data.
---------
2. Optional: collect the number of "manual compaction" events from the
monitor.
This step will depend on how your monitor runs. My cluster is managed by
cephadm and the monitors run in Docker containers, so I can do something
like this, where MYMONCONTAINERID is the container ID of the Ceph monitor:
# date; d=$(date +'%Y-%m-%d'); docker logs MYMONCONTAINERID 2>&1 | grep $d
| grep -ci "manual compaction from"
Fri 13 Oct 2023 06:29:39 AM UTC
580
Alternatively, I could run the command against the log file MYMONLOGFILE,
whose location I obtained with docker inspect:
# date; d=$(date +'%Y-%m-%d'); grep $d MYMONLOGFILE | grep -ci "manual
compaction from"
Fri 13 Oct 2023 06:35:27 AM UTC
588
If you run monitors with podman or without containerization, please get
this information in whatever way is most convenient in your setup; one
possibility is sketched below.
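For example, with systemd/podman something like this should work (a sketch; substitute your cluster fsid and mon name in the unit name):
# date; journalctl -u ceph-FSID@mon.NAME --since today | grep -ci "manual compaction from"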
---------
3. Optional: collect the monitor store.db size.
Usually the monitor store.db is available at
/var/lib/ceph/FSID/mon.NAME/store.db/, for example:
# du -hs
/var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db/
642M
/var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db/
---------
4. Optional: collect Ceph cluster version and status.
For example:
root@ceph01:/# ceph version; ceph -s
ceph version 16.2.14 (238ba602515df21ea7ffc75c88db29f9e5ef12c9) pacific
(stable)
cluster:
id: 3f50555a-ae2a-11eb-a2fc-ffde44714d86
health: HEALTH_OK
services:
mon: 5 daemons, quorum ceph01,ceph03,ceph04,ceph05,ceph02 (age 2w)
mgr: ceph01.vankui(active, since 13d), standbys: ceph02.shsinf
osd: 96 osds: 96 up (since 2w), 95 in (since 3w)
data:
pools: 10 pools, 2400 pgs
objects: 6.30M objects, 16 TiB
usage: 61 TiB used, 716 TiB / 777 TiB avail
pgs: 2396 active+clean
3 active+clean+scrubbing+deep
1 active+clean+scrubbing
io:
client: 71 MiB/s rd, 60 MiB/s wr, 2.94k op/s rd, 2.56k op/s wr
---------
5. Reply to this thread and submit the collected information.
For example:
1) iotop results:
... Paste data obtained in step 1)
2) manual compactions:
... Paste data obtained in step 2), or put "N/A"
3) monitor store.db size:
... Paste data obtained in step 3), or put "N/A"
4) cluster version and status:
... Paste data obtained in step 4), or put "N/A"
-------------
I would very much appreciate your effort and help with gathering these
stats. Please don't hesitate to contact me with any questions or concerns.
Best regards,
Zakhar