Hi,
I'd like to know the cache hit rate of my Ceph OSDs. I have installed Prometheus and Grafana, but there is no cache hit rate metric on the Grafana dashboards...
Does Ceph have a cache hit rate counter? I'd like to understand the impact of caching on READ performance in my Ceph cluster.
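For example, I had hoped to find something like the BlueStore onode cache counters in a perf dump, if those are even the right thing to look at (a sketch; osd.0 is just an example, and the exact counter names seem to vary between releases):
# ceph tell osd.0 perf dump | jq '.bluestore | with_entries(select(.key | test("onode")))'
If those counters are appropriate, I assume hits / (hits + misses) would give a hit rate, but I'm not sure they reflect the cache behaviour I'm after.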
Regards,
--
Mitsumasa KONDO
Hi,
tl;dr why are my osds still spilling?
I've recently upgraded to 16.2.14 from 16.2.9 and started receiving bluefs
spillover warnings (due to the "fix spillover alert" per the 16.2.14
release notes). E.g. from 'ceph health detail', the warning on one of
these (there are a few):
osd.76 spilled over 128 KiB metadata from 'db' device (56 GiB used of 60 GiB) to slow device
This is a 15T HDD with only a 60G SSD for the db, so the spillover isn't
surprising: the db is way below the recommendation for rbd usage of a db
size of 1-2% of the storage size.
There was some spare space on the db SSD, so I increased the db LV to over
400G and did a bluefs-bdev-expand.
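For reference, the steps were roughly these (a sketch from memory; the VG/LV names are made up, and ceph-bluestore-tool was run with the OSD stopped, inside "cephadm shell --name osd.76"):
# lvextend -L +400G /dev/ceph-db-vg/osd-76-db
# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-76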
However, days later, I'm still getting the spillover warning for that osd,
including after running a manual compact:
# ceph tell osd.76 compact
See attached perf-dump-76 for the perf dump output:
# cephadm enter --name 'osd.76' ceph daemon 'osd.76' perf dump | jq -r '.bluefs'
In particular, if my understanding is correct, that's telling me the db
available size is 477G (i.e. the LV expand worked), of which it's using
59G, and there's 128K spilled to the slow device:
"db_total_bytes": 512309059584, # 477G
"db_used_bytes": 63470305280, # 59G
"slow_used_bytes": 131072, # 128K
A "bluefs stats" also says the db is using 128K of slow storage (although
perhaps it's getting the info from the same place as the perf dump?):
# ceph tell osd.76 bluefs stats
1 : device size 0x7747ffe000 : using 0xea6200000(59 GiB)
2 : device size 0xe8d7fc00000 : using 0x6554d689000(6.3 TiB)
RocksDBBlueFSVolumeSelector Usage Matrix:
DEV/LEV WAL DB SLOW * * REAL FILES
LOG 0 B 10 MiB 0 B 0 B 0 B 8.8 MiB 1
WAL 0 B 2.5 GiB 0 B 0 B 0 B 751 MiB 8
DB 0 B 56 GiB 128 KiB 0 B 0 B 50 GiB 842
SLOW 0 B 0 B 0 B 0 B 0 B 0 B 0
TOTAL 0 B 58 GiB 128 KiB 0 B 0 B 0 B 850
MAXIMUMS:
LOG 0 B 22 MiB 0 B 0 B 0 B 18 MiB
WAL 0 B 3.9 GiB 0 B 0 B 0 B 1.0 GiB
DB 0 B 71 GiB 282 MiB 0 B 0 B 62 GiB
SLOW 0 B 0 B 0 B 0 B 0 B 0 B
TOTAL 0 B 74 GiB 282 MiB 0 B 0 B 0 B
>> SIZE << 0 B 453 GiB 14 TiB
I had a look at the "DUMPING STATS" output in the logs, but I don't know
how to interpret it. I did try totalling the sizes on the "Sum" lines, but
that comes to 100G, so I don't know what that all means.
See attached log-stats-76.
I also tried "ceph-kvstore-tool bluestore-kv ... stats":
$ {
cephadm unit --fsid $clusterid --name osd.76 stop
cephadm shell --fsid $clusterid --name osd.76 -- ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-76 stats
cephadm unit --fsid $clusterid --name osd.76 start
}
Output attached as bluestore-kv-stats-76. I can't see anything interesting
in there, although again I don't really know how to interpret it.
So... why is this osd db still spilling onto slow storage, and how do I fix
things so it's no longer using the slow storage?
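If it helps frame an answer: is something like bluefs-bdev-migrate the right tool to push that last 128K back to the db device? E.g. (an untested sketch, again with the OSD stopped, inside "cephadm shell --name osd.76"):
# ceph-bluestore-tool bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-76 \
    --devs-source /var/lib/ceph/osd/ceph-76/block --dev-target /var/lib/ceph/osd/ceph-76/block.db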
And a bonus issue... on another osd that hasn't yet been resized (i.e.
again with a grossly undersized 60G db on SSD with a 15T HDD) I'm also
getting a spillover warning. The "bluefs stats" seems to be saying the db
is NOT currently spilling (i.e. "0 B" at the DB/SLOW position in the
matrix), but there's "something" currently using 59G on the slow device:
$ ceph tell osd.85 bluefs stats
1 : device size 0xeffffe000 : using 0x3a3900000(15 GiB)
2 : device size 0xe8d7fc00000 : using 0x7aea7434000(7.7 TiB)
RocksDBBlueFSVolumeSelector Usage Matrix:
DEV/LEV WAL DB SLOW * * REAL FILES
LOG 0 B 10 MiB 0 B 0 B 0 B 7.4 MiB 1
WAL 0 B 564 MiB 0 B 0 B 0 B 132 MiB 2
DB 0 B 11 GiB 0 B 0 B 0 B 8.1 GiB 177
SLOW 0 B 3.0 GiB 59 GiB 0 B 0 B 56 GiB 898
TOTAL 0 B 13 GiB 59 GiB 0 B 0 B 0 B 1072
MAXIMUMS:
LOG 0 B 24 MiB 0 B 0 B 0 B 20 MiB
WAL 0 B 2.8 GiB 0 B 0 B 0 B 1.0 GiB
DB 0 B 22 GiB 448 KiB 0 B 0 B 18 GiB
SLOW 0 B 3.3 GiB 62 GiB 0 B 0 B 62 GiB
TOTAL 0 B 27 GiB 62 GiB 0 B 0 B 0 B
>> SIZE << 0 B 57 GiB 14 TiB
Is there anywhere that describes how to interpret this output, and
specifically, what stuff is going into the SLOW row? Seemingly there's
898 "files" there, but not LOG, WAL or DB files - so what are they?
Cheers,
Chris
First, an abject apology for the horrors I'm about to unveil. I made a
cold migration from GlusterFS to Ceph a few months back, so it was a
learn-/screwup/-as-you-go affair.
For reasons of presumed compatibility with some of my older servers, I
started with Ceph Octopus. Unfortunately, Octopus seems to have been a
nexus of transitions from the older Ceph organization and management to
the newer (cephadm) system, combined with a relocation of many Ceph
resources and compounded by stale bits of documentation (notably some
references to SysV procedures and an obsolete installer that doesn't
even ship with Octopus).
A far bigger problem was a known issue where actions would be scheduled
but never executed if the system was even slightly dirty. And of
course, since my system was hopelessly dirty, that was a major issue.
Finally I took a risk and bumped up to Pacific, where that issue no
longer exists. I won't say that I'm 100% clean even now, but at least
the remaining crud is in areas where it cannot do any harm. Presumably.
Given that, the only bar now remaining to total joy has been my
inability to connect the Ceph Dashboard to the Object Gateway.
This seems to be an oft-reported problem, but generally referenced
relative to higher-level administrative interfaces like Kubernetes and
Rook. I'm interfacing more directly, however. Regardless, the error
reported is notably familiar:
[quote]
The Object Gateway Service is not configured
Error connecting to Object Gateway: RGW REST API failed request with
status code 404
(b'{"Code":"NoSuchBucket","Message":"","BucketName":"default","RequestI
d":"tx00' b'000dd0c65b8bda685b4-00652d8e0f-5e3a9b-
default","HostId":"5e3a9b-default-defa' b'ult"}')
Please consult the documentation on how to configure and enable the
Object Gateway management functionality.
[/quote]
In point of fact, what this REALLY means in my case is that the bucket
that is supposed to contain the necessary information for the dashboard
and rgw to communicate has not been created. Presumably that SHOULD have
been done by the "ceph dashboard set-rgw-credentials" command, but
apparently wasn't, because the default zone has no buckets at all, much
less one named "default".
By way of reference, the dashboard is definitely trying to interact
with the rgw container, because exercising the object gateway options on
the dashboard results in the container logging the following:
beast: 0x7efd29621620: 10.0.1.16 - dashboard [16/Oct/2023:19:25:03.678
+0000] "GET /default/metadata/user?myself HTTP/1.1" 404
To make everything happy, I'd be glad to accept instructions on how to
manually brute-force construct this bucket.
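For what it's worth, here are the checks I can run and report back on (a sketch; "dashboard" is my assumption about the uid that set-rgw-credentials creates):
# radosgw-admin user list
# radosgw-admin user info --uid=dashboard
# ceph dashboard set-rgw-credentials
The first two should show whether the dashboard's rgw user exists at all, and re-running the third might surface an error that was previously swallowed.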
Of course, as a cleaner long-term solution, it would be nice if the
failure to create could be detected and logged.
And of course, the ultimate solution: something that would assist in
making whatever processes are unhappy be happy.
Thanks,
Tim
Hello All,
Greetings. We have a Ceph cluster running version
ceph version 14.2.16-402-g7d47dbaf4d
(7d47dbaf4d0960a2e910628360ae36def84ed913) nautilus (stable)
=========================================
Issue: unable to delete RBD images
We deleted the target from the dashboard and are now trying to delete the
RBD images from the CLI, but the deletion fails.
When we run "rbd rm -f tegile-500tb -p iscsi-images" it returns:
2023-10-16 15:22:16.719 7f90bb332700 -1 librbd::image::PreRemoveRequest: 0x7f90a80041a0 check_image_watchers: image has watchers - not removing
Removing image: 0% complete...failed.
rbd: error: image still has watchers
This means the image is still open or the client using it crashed. Try again after closing/unmapping it or waiting 30s for the crashed client to timeout.
============================
It also cannot be deleted from the dashboard.
============================
We also tried to list the watchers, but the command fails with "No such file or directory":
============================
"rbd info iscsi-images/tegile-500tb"
rbd: error opening image tegile-500tb: (2) No such file or directory
============================
The image does not show up in "rbd showmapped" output either, hence we
cannot unmap it.
We cannot restart the iSCSI gateway because it is in active use and we
cannot interrupt it.
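For completeness, here is what we plan to try next, based on list archives (a sketch; SOMEID stands for the image's internal id, which we still have to find, and CLIENT_ADDR would come from the listwatchers output):
# rados -p iscsi-images ls | grep rbd_header
# rados -p iscsi-images listwatchers rbd_header.SOMEID
# ceph osd blacklist add CLIENT_ADDR
The idea is that blacklisting the stale client should release the watch so the image can be removed.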
===========================
Please suggest how to fix this issue.
Hi,
I'm still trying to fight the large disk writes produced by Ceph monitors.
One option I considered is enabling RocksDB compression, as our nodes have
more than sufficient RAM and CPU. Unfortunately, monitors seem to
completely ignore the compression setting:
I tried:
- setting ceph config set mon.ceph05 mon_rocksdb_options
"write_buffer_size=33554432,compression=kLZ4Compression,level_compaction_dynamic_level_bytes=true"
and restarting the test monitor. The monitor started with no RocksDB
compression:
debug 2023-10-13T19:47:00.403+0000 7f1cd967a880 4 rocksdb: Compression
algorithms supported:
debug 2023-10-13T19:47:00.403+0000 7f1cd967a880 4 rocksdb:
kZSTDNotFinalCompression supported: 0
debug 2023-10-13T19:47:00.403+0000 7f1cd967a880 4 rocksdb:
kXpressCompression supported: 0
debug 2023-10-13T19:47:00.403+0000 7f1cd967a880 4 rocksdb:
kLZ4HCCompression supported: 1
debug 2023-10-13T19:47:00.403+0000 7f1cd967a880 4 rocksdb:
kLZ4Compression supported: 1
debug 2023-10-13T19:47:00.403+0000 7f1cd967a880 4 rocksdb:
kBZip2Compression supported: 0
debug 2023-10-13T19:47:00.403+0000 7f1cd967a880 4 rocksdb:
kZlibCompression supported: 1
debug 2023-10-13T19:47:00.403+0000 7f1cd967a880 4 rocksdb:
kSnappyCompression supported: 1
...
debug 2023-10-13T19:47:00.403+0000 7f1cd967a880 4 rocksdb:
Options.compression: NoCompression
debug 2023-10-13T19:47:00.403+0000 7f1cd967a880 4 rocksdb:
Options.bottommost_compression: Disabled
- setting ceph config set mon mon_rocksdb_options
"write_buffer_size=33554432,compression=kLZ4Compression,level_compaction_dynamic_level_bytes=true"
and restarting the test monitor. The monitor started with no RocksDB
compression, the same way as above.
In each case, the config options were correctly set and readable with
config get. I also found a suggestion in ceph-users (
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/KJM232IHN7…)
to set compression in a similar manner. Unfortunately, these options appear
to be ignored.
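For reference, this is how I verified the setting (a sketch; output abbreviated):
# ceph config get mon.ceph05 mon_rocksdb_options
write_buffer_size=33554432,compression=kLZ4Compression,...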
How can I enable RocksDB compression in Ceph monitors?
I would very much appreciate your advice and comments.
Best regards,
Zakhar
Hello.
It's been a while. For the past couple years I've had a cluster running
Nautilus on Debian 10 using the Debian Ceph packages, and deployed with
Ceph-Ansible. It's not a huge cluster - 10 OSD nodes with 80 x 12TB HDD
OSDs, plus 3 management nodes, and about 40% full at the moment - but it is
a critical resource for one of our researchers.
Back then I had some misgivings about non-Debian packages and also about
containerized Ceph. I don't know if my feelings about these things have
changed that much, but it's time to upgrade, and, with the advent of
cephadm, it looks like it's just better to stay mainstream.
So I'm looking for advice on how to get from where I'm at to at least
Pacific or Quincy.
I've read a little in the last couple of days. I've seen various opinions
on (not) skipping releases and on when to switch to cephadm. I'm also
concerned about cleaning up those old Debian packages - will there be a
point where I can 'apt-get purge' them without harming the cluster?
One particular thing: The upgrade instructions in various places on
docs.ceph.com say something like
Upgrade monitors by installing the new packages and restarting the monitor
daemons.
To me this is kind of vague. Perhaps there is a different concept of
'packages' within the cephadm environment. I could really use some
clarification on this.
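For instance, is the cephadm-era flow something like the following (my guess from skimming the docs, untested, and presumably only after the cluster is already on a cephadm-capable release)?
# on each host, convert the existing daemons to cephadm management
cephadm adopt --style legacy --name mon.$(hostname -s)
# then drive the rest of the upgrade through the orchestrator
ceph orch upgrade start --ceph-version 16.2.14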
I'd also consider decommissioning a few nodes, setting up a new cluster on
fresh Debian installs, and migrating the data and remaining nodes. This
would be a long and painful process - decommission a node, move it, move
some data, decommission another node - and I don't know what effect it
would have on external references to our object store.
Please advise.
Thanks.
-Dave
--
Dave Hall
Binghamton University
kdhall(a)binghamton.edu
Hi
We have Kerberos working with bare-metal kernel NFS exporting RBDs. I
can see in the Ceph documentation[1] that nfs-ganesha should work with
Kerberos, but I'm having little luck getting it to work.
This bit from the container log seems to suggest that some plumbing is
missing?
"
13/10/2023 08:09:12 : epoch 6528fb25 : ceph-flash1 :
ganesha.nfsd-2[main] nfs_rpc_cb_init_ccache :NFS STARTUP :EVENT
:Callback creds directory (/var/run/ganesha) already exists
13/10/2023 08:09:12 : epoch 6528fb25 : ceph-flash1 :
ganesha.nfsd-2[main] find_keytab_entry :NFS CB :WARN :Configuration file
does not specify default realm while getting default realm name
13/10/2023 08:09:12 : epoch 6528fb25 : ceph-flash1 :
ganesha.nfsd-2[main] gssd_refresh_krb5_machine_credential :NFS CB :CRIT
:ERROR: gssd_refresh_krb5_machine_credential: no usable keytab entry
found in keytab /etc/krb5.keytab for connection with host localhost
13/10/2023 08:09:12 : epoch 6528fb25 : ceph-flash1 :
ganesha.nfsd-2[main] nfs_rpc_cb_init_ccache :NFS STARTUP :WARN
:gssd_refresh_krb5_machine_credential failed (-1765328160:0)
"
Thoughts?
Mvh.
Torkil
[1] https://docs.ceph.com/en/quincy/mgr/nfs/#create-cephfs-export
--
Torkil Svensgaard
Systems Administrator
Danish Research Centre for Magnetic Resonance DRCMR, Section 714
Copenhagen University Hospital Amager and Hvidovre
Kettegaard Allé 30, 2650 Hvidovre, Denmark
Hi!
Further to my thread "Ceph 16.2.x mon compactions, disk writes" (
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/XGCI2LFW5RH…)
where we established that Ceph monitors indeed write considerable
amounts of data to disk, I would like to ask fellow Ceph users to
provide feedback and help gather statistics on whether this happens on
all clusters or only on some subset of them.
The procedure is rather simple and won't take much of your time.
If you are willing to help, please follow these steps:
---------
1. Install iotop and run the following command on any of your monitor nodes:
iotop -ao -bn 2 -d 300 2>&1 | grep -E "TID|ceph-mon"
This will collect 5 minutes of disk I/O statistics and produce output
containing the stats for the Ceph monitor threads running on the node:
TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
4854 be/4 167 8.62 M 2.27 G 0.00 % 0.72 % ceph-mon -n
mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false
--default-log-to-stderr=true --default-log-stderr-prefix=debug
--default-mon-cluster-log-to-file=false
--default-mon-cluster-log-to-stderr=true [rocksdb:low0]
4919 be/4 167 0.00 B 39.43 M 0.00 % 0.02 % ceph-mon -n
mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false
--default-log-to-stderr=true --default-log-stderr-prefix=debug
--default-mon-cluster-log-to-file=false
--default-mon-cluster-log-to-stderr=true [ms_dispatch]
4855 be/4 167 8.00 K 19.55 M 0.00 % 0.00 % ceph-mon -n
mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false
--default-log-to-stderr=true --default-log-stderr-prefix=debug
--default-mon-cluster-log-to-file=false
--default-mon-cluster-log-to-stderr=true [rocksdb:high0]
We're particularly interested in the amount of written data.
---------
2. Optional: collect the number of "manual compaction" events from the
monitor.
This step will depend on how your monitor runs. My cluster is managed by
cephadm and the monitors run in Docker containers, so I can do something
like this, where MYMONCONTAINERID is the container ID of the Ceph monitor:
# date; d=$(date +'%Y-%m-%d'); docker logs MYMONCONTAINERID 2>&1 | grep $d
| grep -ci "manual compaction from"
Fri 13 Oct 2023 06:29:39 AM UTC
580
Alternatively, I could run the command against the log file MYMONLOGFILE,
whose location I obtained with docker inspect:
# date; d=$(date +'%Y-%m-%d'); grep $d MYMONLOGFILE | grep -ci "manual
compaction from"
Fri 13 Oct 2023 06:35:27 AM UTC
588
If you run monitors with podman or without containerization, please get
this information in whatever way is most convenient in your setup; one
possibility is sketched below.
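For example, with systemd/podman something like this should work (a sketch; substitute your cluster fsid and mon name in the unit name):
# date; journalctl -u ceph-FSID@mon.NAME --since today | grep -ci "manual compaction from"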
---------
3. Optional: collect the monitor store.db size.
Usually the monitor store.db is available at
/var/lib/ceph/FSID/mon.NAME/store.db/, for example:
# du -hs
/var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db/
642M
/var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db/
---------
4. Optional: collect Ceph cluster version and status.
For example:
root@ceph01:/# ceph version; ceph -s
ceph version 16.2.14 (238ba602515df21ea7ffc75c88db29f9e5ef12c9) pacific
(stable)
cluster:
id: 3f50555a-ae2a-11eb-a2fc-ffde44714d86
health: HEALTH_OK
services:
mon: 5 daemons, quorum ceph01,ceph03,ceph04,ceph05,ceph02 (age 2w)
mgr: ceph01.vankui(active, since 13d), standbys: ceph02.shsinf
osd: 96 osds: 96 up (since 2w), 95 in (since 3w)
data:
pools: 10 pools, 2400 pgs
objects: 6.30M objects, 16 TiB
usage: 61 TiB used, 716 TiB / 777 TiB avail
pgs: 2396 active+clean
3 active+clean+scrubbing+deep
1 active+clean+scrubbing
io:
client: 71 MiB/s rd, 60 MiB/s wr, 2.94k op/s rd, 2.56k op/s wr
---------
5. Reply to this thread and submit the collected information.
For example:
1) iotop results:
... Paste data obtained in step 1)
2) manual compactions:
... Paste data obtained in step 2), or put "N/A"
3) monitor store.db size:
... Paste data obtained in step 3), or put "N/A"
4) cluster version and status:
... Paste data obtained in step 4), or put "N/A"
-------------
I would very much appreciate your effort and help with gathering these
stats. Please don't hesitate to contact me with any questions or concerns.
Best regards,
Zakhar