Hello Ceph users,
we are seeing a strange issue on our recent Ceph installation, v17.6.2. We store
data on an HDD pool; the index pool is on SSD. Each OSD stores its WAL on an NVMe
partition. Benchmarks didn't expose any issues with the cluster, but since we
put production load on it we see constantly growing OSD latency
(osd_read_latency) on the SSD disks (where the index pool is located). The latency
grows day by day, even though the disks are not utilized at even 50%.
Interestingly, when we move the index pool from SSD to the NVMe disks (disk
space allows it for now), the OSD latency drops to zero and starts growing again
from scratch. We also noticed that any change of pg_num for the index pool
(from 256 to 128, for instance) also drops the latency to zero, after which it
starts growing again (https://postimg.cc/5YHk9bby).
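For clarity, the changes that reset the latency are nothing exotic, just the usual pool commands, roughly (pool and rule names are placeholders):
ceph osd pool set <index-pool> crush_rule <nvme-rule>   # move the index pool to the NVMe OSDs
ceph osd pool set <index-pool> pg_num 128               # or simply change pg_num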
From the client's perspective it looks like one operation takes longer and longer
each day, and the operation time drops each time we make some change to the
index pool. I've enabled debug_optracker 10/0 and it shows that the OSD spends
most of its time in the `queued_for_pg` state, while physical disk utilization is
only about 10-20%. Per the logs, the longest operation is ListBucket; it is
strange that with fewer than 100,000 items in a bucket, listing it even with
'max_keys=1' takes 3-40 seconds.
If it matters, the client is Apache Flink doing checkpoints via the S3 protocol.
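The optracker data below was gathered roughly like this (OSD id is a placeholder; 10/0 keeps the in-memory op history verbose without flooding the log file):
ceph config set osd debug_optracker 10/0      # enable detailed op tracking on the OSDs
ceph daemon osd.<id> dump_historic_ops        # dump recent slow ops with their per-event timeline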
Here is an example of one operation from the debug_optracker logs:
2023-12-29T16:24:28.873353+0300, event: throttled, op: osd_op(client.1227774
.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:28.861549+0300, event: header_read, op: osd_op(client.
1227774.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:28.873358+0300, event: all_read, op: osd_op(client.1227774
.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:28.873359+0300, event: dispatched, op: osd_op(client.
1227774.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:28.873389+0300, event: queued_for_pg, op: osd_op(client.
1227774.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:38.077528+0300, event: reached_pg, op: osd_op(client.
1227774.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:38.077561+0300, event: started, op: osd_op(client.1227774
.0:22575820 7.19 7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.
9780498.1.83:head [stat,call rgw.guard_bucket_resharding in=36b,call
rgw.bucket_prepare_op in=331b] snapc 0=[]
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:38.077714+0300, event: waiting for subops from 59,494, op:
osd_op(client.1227774.0:22575820 7.19
7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head
[stat,call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op
in=331b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio
e83833)
2023-12-29T16:24:38.146157+0300, event: sub_op_commit_rec, op:
osd_op(client.1227774.0:22575820 7.19
7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head
[stat,call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op
in=331b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio
e83833)
2023-12-29T16:24:38.146166+0300, event: op_commit, op: osd_op(client.1227774
.0:22575820 7.19 7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.
9780498.1.83:head [stat,call rgw.guard_bucket_resharding in=36b,call
rgw.bucket_prepare_op in=331b] snapc 0=[]
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:38.146191+0300, event: sub_op_commit_rec, op:
osd_op(client.1227774.0:22575820 7.19
7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head
[stat,call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op
in=331b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio
e83833)
2023-12-29T16:24:38.146204+0300, event: commit_sent, op: osd_op(client.
1227774.0:22575820 7.19
7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head
[stat,call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op
in=331b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio
e83833)
2023-12-29T16:24:38.146216+0300, event: done, op:
osd_op(client.1227774.0:22575820
7.19 7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head
[stat,call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op
in=331b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio
e83833)
Has anybody faced the same issue? I'd be very grateful for any ideas,
because at this point I'm stuck on what to tune and where to look.
Cluster setup: replication 3x, 15 servers in 3 datacenters with datacenter
as the failure domain; 7x HDD (data), 2x SSD (index), 1x NVMe (WAL + OS).
ceph config - https://pastebin.com/pCqxXhT3
OSD read latency graph - https://postimg.cc/5YHk9bby
--
Thank you,
Roman
1024 PGs on NVMe.
From: Anthony D'Atri <anthony.datri(a)gmail.com>
Sent: Friday, February 2, 2024 2:37 PM
To: Cory Snyder <csnyder(a)1111systems.com>
Subject: Re: [ceph-users] OSD read latency grows over time
Thanks. What type of media are your index OSDs? How many PGs?
> On Feb 2, 2024, at 2:32 PM, Cory Snyder <csnyder(a)1111systems.com> wrote:
>
> Yes, we changed osd_memory_target to 10 GB on just our index OSDs. These OSDs have over 300 GB of lz4 compressed bucket index omap data. Here is a graph showing the latencies before/after that single change:
>
> https://pasteboard.co/IMCUWa1t3Uau.png
>
> Cory Snyder
>
>
> From: Anthony D'Atri <anthony.datri(a)gmail.com>
> Sent: Friday, February 2, 2024 2:15 PM
> To: Cory Snyder <csnyder(a)1111systems.com>
> Cc: ceph-users <ceph-users(a)ceph.io>
> Subject: Re: [ceph-users] OSD read latency grows over time
>
> You adjusted osd_memory_target? Higher than the default 4GB?
>
>
>
> Another thing that we've found is that rocksdb can become quite slow if it doesn't have enough memory for internal caches. As our cluster usage has grown, we've needed to increase OSD memory in accordance with bucket index pool usage. On one cluster, we found that increasing OSD memory improved rocksdb latencies by over 10x.
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
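For anyone who wants to apply the same change only to their index OSDs, osd_memory_target can be scoped with a config mask or set per OSD, roughly like this (the 10 GB value and the 'ssd' device class are assumptions about the setup):
ceph config set osd/class:ssd osd_memory_target 10737418240   # 10 GiB for every OSD whose device class is ssd
ceph config set osd.12 osd_memory_target 10737418240          # or target an individual OSD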
Hi,
A little basic question.
I created a volume with
ceph fs volume
then a subvolume called «erasure». I can see it with:
root@cthulhu1:/etc/ceph# ceph fs subvolume info cephfs erasure
{
"atime": "2024-02-02 11:02:07",
"bytes_pcent": "undefined",
"bytes_quota": "infinite",
"bytes_used": 0,
"created_at": "2024-02-02 11:02:07",
"ctime": "2024-02-02 14:12:30",
"data_pool": "data_ec",
"features": [
"snapshot-clone",
"snapshot-autoprotect",
"snapshot-retention"
],
"gid": 0,
"mode": 16877,
"mon_addrs": [
"145.238.187.184:6789",
"145.238.187.185:6789",
"145.238.187.186:6789",
"145.238.187.188:6789",
"145.238.187.187:6789"
],
"mtime": "2024-02-02 14:12:30",
"path": "/volumes/_nogroup/erasure/998e3bdf-f92b-4508-99ed-69f03a7303e9",
"pool_namespace": "",
"state": "complete",
"type": "subvolume",
"uid": 0
}
From the mon server I was able to mount the «partition» with
mount -t ceph admin@fXXXXXXX-c0f2-11ee-9307-f7e3b9f03075.cephfs=/volumes/_nogroup/erasure/998e3bdf-f92b-4508-99ed-69f03a7303e9 /mnt
but on my test client I'm unable to mount it:
root@ceph-vo-m:/etc/ceph# mount -t ceph vo@fxxxxxxx-c0f2-11ee-9307-f7e3b9f03075.cephfs=/volumes/_nogroup/erasure/998e3bdf-f92b-4508-99ed-69f03a7303e9/ /vo --verbose
parsing options: rw
source mount path was not specified
unable to parse mount source: -22
root@ceph-vo-m:/etc/ceph#
So I copied /etc/ceph/ceph.conf to my client
and put /etc/ceph/ceph.client.vo.keyring on the client as well.
There is no firewall between the client and the cluster.
The weird part is that when I run tcpdump on the client, I don't see any TCP
activity at all.
Any way to debug this problem?
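For reference, one thing I have not tried yet is the legacy mount syntax; as far as I understand it would look roughly like this (mon address taken from the subvolume info above; the name/secretfile options are assumptions about my client setup):
mount -t ceph 145.238.187.184:6789:/volumes/_nogroup/erasure/998e3bdf-f92b-4508-99ed-69f03a7303e9 /vo -o name=vo,secretfile=/etc/ceph/vo.secret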
Thanks
Regards
--
Albert SHIH 🦫 🐸
France
Heure locale/Local time:
Fri 02 Feb 2024 16:21:01 CET
Hi group,
Today I conducted a small experiment to test an assumption of mine,
namely that Ceph incurs a substantial network overhead when dealing with many
small files.
One RBD image was created, and on top of that an XFS filesystem containing
1.6 M files, each 10 kiB in size:
# rbd info libvirt/bobtest
rbd image 'bobtest':
size 20 GiB in 5120 objects
order 22 (4 MiB objects)
[...]
# df -h /space
Filesystem Size Used Avail Use% Mounted on
/dev/rbd0 20G 20G 181M 100% /space
# ls -lh /space |head
total 19G
-rw-r--r--. 1 root root 10K Feb 2 14:13 xaa
-rw-r--r--. 1 root root 10K Feb 2 14:13 xab
-rw-r--r--. 1 root root 10K Feb 2 14:13 xac
-rw-r--r--. 1 root root 10K Feb 2 14:13 xad
-rw-r--r--. 1 root root 10K Feb 2 14:13 xae
-rw-r--r--. 1 root root 10K Feb 2 14:13 xaf
-rw-r--r--. 1 root root 10K Feb 2 14:13 xag
-rw-r--r--. 1 root root 10K Feb 2 14:13 xah
-rw-r--r--. 1 root root 10K Feb 2 14:13 xai
# ls /space |wc -l
1638400
All files contain pseudorandom (i.e. incompressible) junk.
My assumption was that, as the backend RBD object size is 4 MiB, the client
machine would need to download at least that 4 MiB worth of data for any given
request, even if the file in the XFS is only 10 kiB.
I.e. when I cat(1) a small file, the RBD client grabs the relevant 4 MiB
object from Ceph, and from it the small amount of requested data is
extracted and presented to userspace.
That's not what I see, however. My testing procedure is as follows:
I have a list of all the files on the RBD, order randomized, stored in
root's home folder -- this to make sure that I can pick file names at
random by going through the list from the top, and not causing network
traffic by listing files directly in the target FS. I then reboot the
node to ensure that all caches are empty and start an iftop(1) to
monitor network usage.
Mapping the RBD and mounting the XFS results in 5.29 MB worth of data
read from the network.
Reading one file at random from the XFS results in approx. 200 kB of
network read.
Reading 100 files at random results in approx. 3.83 MB of network read.
Reading 1000 files at random results in approx. 36.2 MB of network read.
Bottom line is that reading any 10 kiB of actual data results in
approximately 37 kiB of data being transferred over the network. Overhead,
sure, but nowhere near what I expected, which was 4 MiB per object "hit" in
the backend.
Is the RBD client performing partial object reads? Is that even a thing?
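One way I might verify this directly, bypassing XFS and the page cache, is to read a small aligned chunk straight off the RBD device while watching iftop (the offset below is arbitrary, just somewhere in the middle of the image):
# read 16 KiB at a 10 GiB offset with O_DIRECT and watch how much crosses the network
dd if=/dev/rbd0 of=/dev/null bs=4k count=4 skip=2621440 iflag=direct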
Cheers,
Ruben Vestergaard
Hi cephers,
I've been looking into better balancing our clusters with upmaps lately,
and ran into upmap cases that behave in a less than ideal way. If there
is any cycle in the upmaps like
ceph osd pg-upmap-items <pgid> a b b a
or
ceph osd pg-upmap-items <pgid> a b b c c a
the upmap validation passes, the upmap gets added to the osdmap, but
then gets silently ignored. Obviously this is for EC pools - irrelevant
for replicated pools where the order of OSDs is not significant.
The relevant code OSDMap::_apply_upmap even has a comment about this:
if (q != pg_upmap_items.end()) {
// NOTE: this approach does not allow a bidirectional swap,
// e.g., [[1,2],[2,1]] applied to [0,1,2] -> [0,2,1].
for (auto& r : q->second) {
// make sure the replacement value doesn't already appear
...
I'm trying to understand the reason for this limitation: is this just a matter
of coding convenience (OSDMap::_apply_upmap could handle these cases with a
slightly more careful approach), or is there some inherent limitation elsewhere
that prevents them from working? I did notice that just updating
prevents these cases from working? I did notice that just updating
crush weights (without using upmaps) produces similar changes to the UP
set (swaps OSDs in EC pools sometimes), so the OSDs seem to be perfectly
capable of doing backfills for osdmap changes that shuffle the order of
OSDs in the UP set. Some insight/history here would be appreciated.
Either way, the behavior of validation passing on an upmap and then the
upmap getting silently ignored is not ideal. I do realize that all
clients would have to agree on this code, since clients independently
execute it to find the OSDs to access (so rolling out a change to this
is challenging).
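For anyone wanting to reproduce the silent-ignore behavior, the check I have been using is roughly this (pgid and OSD ids are placeholders):
ceph osd pg-upmap-items <pgid> a b b a    # validation passes and the mapping is stored
ceph osd dump | grep pg_upmap_items       # the entry shows up in the osdmap
ceph pg map <pgid>                        # but the UP set is unchanged, i.e. the upmap is ignored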
Andras
In our department we're getting started with Ceph 'reef', using the Ceph FUSE client for our Ubuntu workstations.
So far so good, except I can't quite figure out one aspect of subvolumes.
When I do the commands:
[root@ceph1 ~]# ceph fs subvolumegroup create cephfs csvg
[root@ceph1 ~]# ceph fs subvolume create cephfs staff csvg --size 2000000000000
I get these results:
- A subvolume group csvg is created on volume cephfs
- A subvolume called staff is created in the csvg subvolume group (like /volumes/csvg/staff ); however, there is no size limit set on this folder in the Ceph dashboard view
- A folder with a random UUID name is created inside the subvolume staff (like /volumes/csvg/staff/6a1b3de5-f6ab-4878-aea3-3c3c6f96ffcf ); this folder does have a size of 2 TB set on it
My questions are:
- what's the purpose of this UUID, and is it a requirement?
- which directory should be mounted for my clients, staff/ or staff/{UUID}, for the size limit to take effect?
- is there any way to hide or disable this UUID for client mounts? (eg in /etc/fstab)
[root@ceph1 ~]# ceph fs subvolume getpath cephfs staff csvg
/volumes/csvg/staff/6a1b3de5-f6ab-4878-aea3-3c3c6f96ffcf
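For what it's worth, the mount I had in mind would use that full getpath output; a kernel-client line in /etc/fstab would look roughly like this (mon addresses, mount point and secret file path are placeholders for my setup):
mon1,mon2,mon3:/volumes/csvg/staff/6a1b3de5-f6ab-4878-aea3-3c3c6f96ffcf /mnt/staff ceph name=staff,secretfile=/etc/ceph/staff.secret,_netdev 0 0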
[root@ceph1 ~]# ceph fs subvolume ls cephfs csvg
[
{
"name": "staff"
}
]
--
Sincerely,
Matthew Melendy
IT Services Specialist
CS System Services Group
FEC 3550, University of New Mexico
Hello team,
I am unable to log in to my Ceph dashboard, which is running Pacific
and was deployed using ceph-ansible. I set the admin password using the
following command: "ceph dashboard ac-user-set-password admin -i
ceph-dash-pass", where ceph-dash-pass contains the real password. I get
the following output: "{"username": "admin", "password":
"$2b$12$Ge/2cpg0ZGjRPnBC2YREP.E5oVyNvV4SC9HU4PMsWWMBtC9UvL7mG", "roles":
["administrator"], "name": null, "email": null, "lastUpdate": 1706866328,
"enabled": false, "pwdExpirationDate": null, "pwdUpdateRequired": false}"
When I then try to log in to the dashboard, I still get the same error message. I am
guessing this is because the "enabled" field above is set to false. How can I
set that field to true? Or, if there is another way to fix it, please
advise.
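In case it is relevant, one thing I was considering is re-enabling the account, since dashboard accounts can end up disabled (for example after too many failed login attempts); I believe the commands would be something like:
ceph dashboard ac-user-enable admin    # set the "enabled" flag back to true
ceph dashboard ac-user-show admin      # verify the account settings afterwards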
Thank you
Hi,
During an upgrade from Pacific to Quincy, we needed to recreate the mons
because they were quite old and still using leveldb.
So step one was to destroy one of the mons. After that we recreated the
monitor, and although it starts, it remains in state ‘probing’, as you
can see below.
No matter what I tried, it won't come up. I've seen quite a few messages
suggesting that the MTU might be an issue, but that seems to be fine:
root@proxmox03:/var/log/ceph# fping -b 1472 10.10.10.{1..3} -M
10.10.10.1 is alive
10.10.10.2 is alive
10.10.10.3 is alive
Does anyone have an idea how to fix this? I’ve tried destroying and
recreating the mon a few times now. Could it be that the leveldb mons
only support mon.$id notation for the monitors?
root@proxmox03:/var/log/ceph# ceph daemon mon.proxmox03 mon_status
{
    "name": "proxmox03",
    "rank": 2,
    "state": "probing",
    "election_epoch": 0,
    "quorum": [],
    "features": {
        "required_con": "2449958197560098820",
        "required_mon": [
            "kraken",
            "luminous",
            "mimic",
            "osdmap-prune",
            "nautilus",
            "octopus",
            "pacific",
            "elector-pinging"
        ],
        "quorum_con": "0",
        "quorum_mon": []
    },
    "outside_quorum": [
        "proxmox03"
    ],
    "extra_probe_peers": [],
    "sync_provider": [],
    "monmap": {
        "epoch": 0,
        "fsid": "39b1e85c-7b47-4262-9f0a-47ae91042bac",
        "modified": "2024-01-23T21:02:12.631320Z",
        "created": "2017-03-15T14:54:55.743017Z",
        "min_mon_release": 16,
        "min_mon_release_name": "pacific",
        "election_strategy": 1,
        "disallowed_leaders: ": "",
        "stretch_mode": false,
        "tiebreaker_mon": "",
        "removed_ranks: ": "2",
        "features": {
            "persistent": [
                "kraken",
                "luminous",
                "mimic",
                "osdmap-prune",
                "nautilus",
                "octopus",
                "pacific",
                "elector-pinging"
            ],
            "optional": []
        },
        "mons": [
            {
                "rank": 0,
                "name": "0",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "10.10.10.1:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "10.10.10.1:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "10.10.10.1:6789/0",
                "public_addr": "10.10.10.1:6789/0",
                "priority": 0,
                "weight": 0,
                "crush_location": "{}"
            },
            {
                "rank": 1,
                "name": "1",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "10.10.10.2:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "10.10.10.2:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "10.10.10.2:6789/0",
                "public_addr": "10.10.10.2:6789/0",
                "priority": 0,
                "weight": 0,
                "crush_location": "{}"
            },
            {
                "rank": 2,
                "name": "proxmox03",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "10.10.10.3:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "10.10.10.3:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "10.10.10.3:6789/0",
                "public_addr": "10.10.10.3:6789/0",
                "priority": 0,
                "weight": 0,
                "crush_location": "{}"
            }
        ]
    },
    "feature_map": {
        "mon": [
            {
                "features": "0x3f01cfbdfffdffff",
                "release": "luminous",
                "num": 1
            }
        ]
    },
    "stretch_mode": false
}
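In case it helps with suggestions, the monmap can also be inspected offline; something along these lines should show how the quorum mons are actually named (the output path is a placeholder):
ceph mon getmap -o /tmp/monmap     # fetch the current monmap from the existing quorum
monmaptool --print /tmp/monmap     # list mon names, ranks and addresses as the cluster sees them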
—
Mark Schouten
CTO, Tuxis B.V.
+31 318 200208 / mark(a)tuxis.nl
Hello team,
I have a production cluster composed of 3 OSD servers with 20 disks
each, deployed with ceph-ansible on Ubuntu; the version is Pacific.
These days it is in WARN state because of PGs that have not been deep-scrubbed
in time. I tried to deep-scrub some PGs manually, but it seems the cluster
can get slow. I would appreciate your assistance in getting the cluster back to
HEALTH_OK as before, without any interruption of service. The cluster
is used as OpenStack backend storage.
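For reference, the manual deep scrubs were issued roughly like this (PG ids taken from the health detail below), and I am wondering whether carefully raising osd_max_scrubs would be a reasonable next step:
ceph pg deep-scrub 6.78                  # repeat for each PG listed in 'ceph health detail'
ceph config set osd osd_max_scrubs 2     # assumption: default is 1 on Pacific, so this allows one extra concurrent scrub per OSD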
Best Regards
Michel
ceph -s
  cluster:
    id:     cb0caedc-eb5b-42d1-a34f-96facfda8c27
    health: HEALTH_WARN
            6 pgs not deep-scrubbed in time

  services:
    mon: 3 daemons, quorum ceph-mon1,ceph-mon2,ceph-mon3 (age 11M)
    mgr: ceph-mon2(active, since 11M), standbys: ceph-mon3, ceph-mon1
    osd: 48 osds: 48 up (since 11M), 48 in (since 11M)
    rgw: 6 daemons active (6 hosts, 1 zones)

  data:
    pools:   10 pools, 385 pgs
    objects: 5.97M objects, 23 TiB
    usage:   151 TiB used, 282 TiB / 433 TiB avail
    pgs:     381 active+clean
             4   active+clean+scrubbing+deep

  io:
    client: 59 MiB/s rd, 860 MiB/s wr, 155 op/s rd, 665 op/s wr
root@ceph-osd3:~# ceph health detail
HEALTH_WARN 6 pgs not deep-scrubbed in time
[WRN] PG_NOT_DEEP_SCRUBBED: 6 pgs not deep-scrubbed in time
pg 6.78 not deep-scrubbed since 2024-01-11T16:07:54.875746+0200
pg 6.60 not deep-scrubbed since 2024-01-13T19:44:26.922000+0200
pg 6.5c not deep-scrubbed since 2024-01-13T09:07:24.780936+0200
pg 4.12 not deep-scrubbed since 2024-01-13T09:09:22.176240+0200
pg 10.d not deep-scrubbed since 2024-01-12T08:04:02.078062+0200
pg 5.f not deep-scrubbed since 2024-01-12T06:06:00.970665+0200