Hello Ceph users,
we are seeing a strange issue on our most recent Ceph installation, v17.6.2. We
store data on an HDD pool; the index pool is on SSD. Each OSD stores its WAL on
an NVMe partition. Benchmarks didn't expose any issues with the cluster, but
since we placed production load on it we have seen constantly growing OSD latency
(osd_read_latency) on the SSD disks (where the index pool is located). The latency
keeps growing day by day, yet the disks are not even 50% utilized.
Interestingly, when we move the index pool from SSD to NVMe disks (disk
space allows it for now), OSD latency drops to zero and starts climbing again
from scratch. We also noticed that any change of pg_num for the index pool
(from 256 to 128, for instance) likewise drops the latency to zero, after which
it starts growing again (https://postimg.cc/5YHk9bby).
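For reference, the pg_num change that resets the latency is just the standard
pool command; <index-pool> is a placeholder for our index pool name:
ceph osd pool set <index-pool> pg_num 128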
From the client's perspective it looks like each operation takes longer and
longer every day, and operation times drop each time we make some change to the
index pool. I've enabled debug_optracker 10/0 and it shows that the OSD spends
most of its time in the `queued_for_pg` state, while physical disk utilization is
only about 10-20%. The logs also show that the longest operation is ListBucket,
and strangely, with fewer than 100,000 items in the bucket, even a listing with
'max_keys=1' takes 3-40 seconds.
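Enabling the op tracker debugging and dumping the ops looks roughly like this
(osd.<id> is a placeholder for the OSD being inspected):
ceph tell 'osd.*' config set debug_optracker 10/0
ceph daemon osd.<id> dump_ops_in_flight
ceph daemon osd.<id> dump_historic_ops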
If it matters, the client is Apache Flink doing checkpoints via the S3 protocol.
Here is an example of one operation from the debug_optracker logs:
2023-12-29T16:24:28.873353+0300, event: throttled, op: osd_op(client.1227774
.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:28.861549+0300, event: header_read, op: osd_op(client.
1227774.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:28.873358+0300, event: all_read, op: osd_op(client.1227774
.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:28.873359+0300, event: dispatched, op: osd_op(client.
1227774.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:28.873389+0300, event: queued_for_pg, op: osd_op(client.
1227774.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:38.077528+0300, event: reached_pg, op: osd_op(client.
1227774.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:38.077561+0300, event: started, op: osd_op(client.1227774
.0:22575820 7.19 7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.
9780498.1.83:head [stat,call rgw.guard_bucket_resharding in=36b,call
rgw.bucket_prepare_op in=331b] snapc 0=[]
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:38.077714+0300, event: waiting for subops from 59,494, op:
osd_op(client.1227774.0:22575820 7.19
7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head
[stat,call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op
in=331b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio
e83833)
2023-12-29T16:24:38.146157+0300, event: sub_op_commit_rec, op:
osd_op(client.1227774.0:22575820 7.19
7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head
[stat,call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op
in=331b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio
e83833)
2023-12-29T16:24:38.146166+0300, event: op_commit, op: osd_op(client.1227774
.0:22575820 7.19 7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.
9780498.1.83:head [stat,call rgw.guard_bucket_resharding in=36b,call
rgw.bucket_prepare_op in=331b] snapc 0=[]
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:38.146191+0300, event: sub_op_commit_rec, op:
osd_op(client.1227774.0:22575820 7.19
7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head
[stat,call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op
in=331b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio
e83833)
2023-12-29T16:24:38.146204+0300, event: commit_sent, op: osd_op(client.
1227774.0:22575820 7.19
7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head
[stat,call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op
in=331b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio
e83833)
2023-12-29T16:24:38.146216+0300, event: done, op:
osd_op(client.1227774.0:22575820
7.19 7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head
[stat,call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op
in=331b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio
e83833)
Has anybody faced the same issue? I'd be very grateful for any ideas, because at
this point I'm stuck on what to tune and where to look.
Cluster setup: 3x replication, 15 servers in 3 datacenters with datacenter as the
failure domain; 7x HDD (data), 2x SSD (index), 1x NVMe (WAL + OS).
ceph config - https://pastebin.com/pCqxXhT3
OSD read latency graph - https://postimg.cc/5YHk9bby
--
Thank you,
Roman
In our department we're getting started with Ceph 'reef', using the Ceph FUSE client on our Ubuntu workstations.
So far so good, except I can't quite figure out one aspect of subvolumes.
When I do the commands:
[root@ceph1 ~]# ceph fs subvolumegroup create cephfs csvg
[root@ceph1 ~]# ceph fs subvolume create cephfs staff csvg --size 2000000000000
I get these results:
- A subvolume group csvg is created on volume cephfs
- A subvolume called staff is created in the csvg subvolume group (as /volumes/csvg/staff); however, no size limit is shown on this folder in the Ceph dashboard view
- A folder with a random UUID name is created inside the subvolume staff (like /volumes/csvg/staff/6a1b3de5-f6ab-4878-aea3-3c3c6f96ffcf); this folder does have the 2 TB size limit set on it
My questions are:
- what's the purpose of this UUID, and is it a requirement?
- which directory should be mounted for my clients, staff/ or staff/{UUID}, for the size limit to take effect?
- is there any way to hide or disable this UUID for client mounts? (e.g. in /etc/fstab; see the fstab sketch after the command output below)
[root@ceph1 ~]# ceph fs subvolume getpath cephfs staff csvg
/volumes/csvg/staff/6a1b3de5-f6ab-4878-aea3-3c3c6f96ffcf
[root@ceph1 ~]# ceph fs subvolume ls cephfs csvg
[
{
"name": "staff"
}
]
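For what it's worth, the fstab entry I'm experimenting with looks roughly like
this, pointing at the path that getpath returns (the client name 'staff' and the
mount point are made up):
none  /mnt/staff  fuse.ceph  ceph.id=staff,ceph.client_mountpoint=/volumes/csvg/staff/6a1b3de5-f6ab-4878-aea3-3c3c6f96ffcf,_netdev,defaults  0 0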
--
Sincerely,
Matthew Melendy
IT Services Specialist
CS System Services Group
FEC 3550, University of New Mexico
Hi,
During an upgrade from Pacific to Quincy, we needed to recreate the mons,
because they were pretty old and still using LevelDB.
So step one was to destroy one of the mons. After that we recreated the
monitor, and although it starts, it remains in state ‘probing’, as you
can see below.
No matter what I tried, it won't come up. I've seen quite a few messages
suggesting that the MTU might be an issue, but that seems to be OK:
root@proxmox03:/var/log/ceph# fping -b 1472 10.10.10.{1..3} -M
10.10.10.1 is alive
10.10.10.2 is alive
10.10.10.3 is alive
Does anyone have an idea how to fix this? I’ve tried destroying and
recreating the mon a few times now. Could it be that the leveldb mons
only support mon.$id notation for the monitors?
root@proxmox03:/var/log/ceph# ceph daemon mon.proxmox03 mon_status
{
    "name": "proxmox03",
    "rank": 2,
    "state": "probing",
    "election_epoch": 0,
    "quorum": [],
    "features": {
        "required_con": "2449958197560098820",
        "required_mon": [
            "kraken",
            "luminous",
            "mimic",
            "osdmap-prune",
            "nautilus",
            "octopus",
            "pacific",
            "elector-pinging"
        ],
        "quorum_con": "0",
        "quorum_mon": []
    },
    "outside_quorum": [
        "proxmox03"
    ],
    "extra_probe_peers": [],
    "sync_provider": [],
    "monmap": {
        "epoch": 0,
        "fsid": "39b1e85c-7b47-4262-9f0a-47ae91042bac",
        "modified": "2024-01-23T21:02:12.631320Z",
        "created": "2017-03-15T14:54:55.743017Z",
        "min_mon_release": 16,
        "min_mon_release_name": "pacific",
        "election_strategy": 1,
        "disallowed_leaders: ": "",
        "stretch_mode": false,
        "tiebreaker_mon": "",
        "removed_ranks: ": "2",
        "features": {
            "persistent": [
                "kraken",
                "luminous",
                "mimic",
                "osdmap-prune",
                "nautilus",
                "octopus",
                "pacific",
                "elector-pinging"
            ],
            "optional": []
        },
        "mons": [
            {
                "rank": 0,
                "name": "0",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "10.10.10.1:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "10.10.10.1:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "10.10.10.1:6789/0",
                "public_addr": "10.10.10.1:6789/0",
                "priority": 0,
                "weight": 0,
                "crush_location": "{}"
            },
            {
                "rank": 1,
                "name": "1",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "10.10.10.2:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "10.10.10.2:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "10.10.10.2:6789/0",
                "public_addr": "10.10.10.2:6789/0",
                "priority": 0,
                "weight": 0,
                "crush_location": "{}"
            },
            {
                "rank": 2,
                "name": "proxmox03",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "10.10.10.3:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "10.10.10.3:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "10.10.10.3:6789/0",
                "public_addr": "10.10.10.3:6789/0",
                "priority": 0,
                "weight": 0,
                "crush_location": "{}"
            }
        ]
    },
    "feature_map": {
        "mon": [
            {
                "features": "0x3f01cfbdfffdffff",
                "release": "luminous",
                "num": 1
            }
        ]
    },
    "stretch_mode": false
}
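For completeness, a sketch of how the current monmap can be dumped and inspected
(the output path is arbitrary):
ceph mon getmap -o /tmp/monmap
monmaptool --print /tmp/monmap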
—
Mark Schouten
CTO, Tuxis B.V.
+31 318 200208 / mark(a)tuxis.nl
Hello team,
I have a production cluster composed of 3 OSD servers with 20 disks each,
deployed using ceph-ansible on Ubuntu, and the version is Pacific. These days it
is in WARN state because of PGs which have not been deep-scrubbed in time. I
tried to deep-scrub some PGs manually, but the cluster seems to be slow. I would
appreciate your assistance in getting the cluster back to HEALTH_OK as before,
without any interruption of service. The cluster is used as OpenStack backend
storage.
Best Regards
Michel
ceph -s
  cluster:
    id:     cb0caedc-eb5b-42d1-a34f-96facfda8c27
    health: HEALTH_WARN
            6 pgs not deep-scrubbed in time

  services:
    mon: 3 daemons, quorum ceph-mon1,ceph-mon2,ceph-mon3 (age 11M)
    mgr: ceph-mon2(active, since 11M), standbys: ceph-mon3, ceph-mon1
    osd: 48 osds: 48 up (since 11M), 48 in (since 11M)
    rgw: 6 daemons active (6 hosts, 1 zones)

  data:
    pools:   10 pools, 385 pgs
    objects: 5.97M objects, 23 TiB
    usage:   151 TiB used, 282 TiB / 433 TiB avail
    pgs:     381 active+clean
             4   active+clean+scrubbing+deep

  io:
    client: 59 MiB/s rd, 860 MiB/s wr, 155 op/s rd, 665 op/s wr
root@ceph-osd3:~# ceph health detail
HEALTH_WARN 6 pgs not deep-scrubbed in time
[WRN] PG_NOT_DEEP_SCRUBBED: 6 pgs not deep-scrubbed in time
pg 6.78 not deep-scrubbed since 2024-01-11T16:07:54.875746+0200
pg 6.60 not deep-scrubbed since 2024-01-13T19:44:26.922000+0200
pg 6.5c not deep-scrubbed since 2024-01-13T09:07:24.780936+0200
pg 4.12 not deep-scrubbed since 2024-01-13T09:09:22.176240+0200
pg 10.d not deep-scrubbed since 2024-01-12T08:04:02.078062+0200
pg 5.f not deep-scrubbed since 2024-01-12T06:06:00.970665+0200
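For reference, the manual deep scrub I tried was along these lines, using one of
the PG IDs listed above:
ceph pg deep-scrub 6.78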
Hi All,
Quick Q: How easy/hard is it to change the IP networks of:
1) A Ceph Cluster's "Front-End" Network?
2) A Ceph Cluster's "Back-End" Network?
Is it "simply" a matter of:
a) Placing the Nodes in maintenance mode
b) Changing a config file (I assume it's /etc/ceph/ceph.conf; see the sketch after this list) on each Node
c) Rebooting the Nodes
d) Taking each Node out of Maintenance Mode
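For b), a sketch of the lines I'm assuming would need to change in
/etc/ceph/ceph.conf (the addresses are made up):
[global]
public_network  = 192.168.10.0/24   # "front-end" / public network
cluster_network = 192.168.20.0/24   # "back-end" / cluster network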
Thanks in advance
Cheers
Dulux-Oz
Hello,
Recently we got a problem report from an internal customer of our S3. Our setup
consists of roughly 10 servers with 140 OSDs. Our 3 RGWs are co-located with the
monitors on dedicated servers, in an HA setup with HAProxy in front. We are
running 16.2.14 on Podman with cephadm.
Our S3 constantly handles around 500 req/s on average per RGW instance.
The problem is described in this issue: https://tracker.ceph.com/issues/63935.
Basically, this customer has a Grafana Mimir instance pushing to our S3, and
during a compaction process it issues a request pattern like this:
```
29/Dec/2023:17:13:28.961 rgw-frontend~ rgw-backend/server-mon-01-rgw0 0/0/0/127/127 200 228 - - ---- 132/132/70/67/0 0/0 "PUT /1234/object HTTP/1.1"
29/Dec/2023:17:13:29.101 rgw-frontend~ rgw-backend/server-mon-01-rgw0 0/0/0/1/1 200 381 - - ---- 132/132/76/71/0 0/0 "GET /1234/object HTTP/1.1"
29/Dec/2023:17:13:29.121 rgw-frontend~ rgw-backend/server-mon-01-rgw0 0/0/0/1/1 200 381 - - ---- 132/132/71/59/0 0/0 "GET /1234/object HTTP/1.1"
29/Dec/2023:17:13:29.137 rgw-frontend~ rgw-backend/server-mon-03-rgw0 0/0/0/4/4 204 153 - - ---- 132/132/71/6/0 0/0 "DELETE /1234/object HTTP/1.1"
29/Dec/2023:19:03:21.671 rgw-frontend~ rgw-backend/server-mon-03-rgw0 0/0/0/1/1 404 472 - - ---- 55/55/26/0/0 0/0 "GET /1234/object HTTP/1.1"
```
It does a PUT, GET and DELETE on the same object within the same second.
Afterwards the customer can still see the deleted object when doing a ListObjects
on the bucket, but if he tries to access it, RGW returns a 404.
After looking in Ceph, it appears the object still has a bucket index entry but
the associated RADOS object does not exist anymore. The bucket has neither
versioning nor object locking enabled.
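The kind of check I mean is roughly the following (bucket and object names are
placeholders, not the real ones):
```
radosgw-admin bucket list --bucket=<bucket> | grep <object>    # index entry still present
radosgw-admin object stat --bucket=<bucket> --object=<object>  # fails, backing RADOS object gone
```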
Did anyone encounter something similar? Thank you!
Regards,
--
Mathias Chapelain
Storage Engineer
Proton AG
Hi all,
after performing "ceph orch host drain" on one of our hosts, with only the mgr
container left on it, I find that another mgr daemon is indeed deployed on
another host, but the "old" one does not get removed by the drain command. The
same happens if I edit the mgr service via the UI to define different hosts for
the daemon: again, the old mgr daemons are not removed. Any recommendations? I am
using a setup with Podman and RHEL.
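For context, what I ran looks roughly like this (the host name is a placeholder,
and the ps call is just how I check which daemons are still there):
ceph orch host drain <host>
ceph orch ps <host>    # the old mgr daemon is still listed here afterwards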
Best,
Mevludin
Good morning,
I was struggling to understand why I cannot find this setting in my Reef version.
Is it because it is only in the latest dev Ceph version and not before?
https://docs.ceph.com/en/latest/radosgw/metrics/#user-bucket-counter-caches
The same page on Reef gives a 404:
https://docs.ceph.com/en/reef/radosgw/metrics/
Thank you!