Hello Ceph users,
we are seeing a strange issue on our most recent Ceph installation, v17.6.2. We
store data on an HDD pool; the index pool is on SSD. Each OSD stores its WAL on
an NVMe partition. Benchmarks didn't expose any issues with the cluster, but
since we placed production load on it we have seen constantly growing OSD latency
(osd_read_latency) on the SSD disks (where the index pool is located). The latency
keeps growing day by day, yet the disks are not even 50% utilized.
Interestingly, when we move the index pool from SSD to NVMe disks (disk
space allows it for now), OSD latency drops to zero and starts climbing again
from scratch. We also noticed that any change of pg_num for the index pool
(from 256 to 128, for instance) likewise drops the latency to zero, after which
it starts growing again (https://postimg.cc/5YHk9bby).
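For reference, the pg_num change that resets the latency is just the standard
pool command; <index-pool> is a placeholder for our index pool name:
ceph osd pool set <index-pool> pg_num 128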
From the client's perspective it looks like each operation takes longer and
longer every day, and operation times drop each time we make some change to the
index pool. I've enabled debug_optracker 10/0 and it shows that the OSD spends
most of its time in the `queued_for_pg` state, while physical disk utilization is
only about 10-20%. The logs also show that the longest operation is ListBucket,
and strangely, with fewer than 100,000 items in the bucket, even a listing with
'max_keys=1' takes 3-40 seconds.
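Enabling the op tracker debugging and dumping the ops looks roughly like this
(osd.<id> is a placeholder for the OSD being inspected):
ceph tell 'osd.*' config set debug_optracker 10/0
ceph daemon osd.<id> dump_ops_in_flight
ceph daemon osd.<id> dump_historic_ops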
If it matters, the client is Apache Flink doing checkpoints via the S3 protocol.
Here is an example of one operation from the debug_optracker logs:
2023-12-29T16:24:28.873353+0300, event: throttled, op: osd_op(client.1227774
.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:28.861549+0300, event: header_read, op: osd_op(client.
1227774.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:28.873358+0300, event: all_read, op: osd_op(client.1227774
.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:28.873359+0300, event: dispatched, op: osd_op(client.
1227774.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:28.873389+0300, event: queued_for_pg, op: osd_op(client.
1227774.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:38.077528+0300, event: reached_pg, op: osd_op(client.
1227774.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:38.077561+0300, event: started, op: osd_op(client.1227774
.0:22575820 7.19 7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.
9780498.1.83:head [stat,call rgw.guard_bucket_resharding in=36b,call
rgw.bucket_prepare_op in=331b] snapc 0=[]
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:38.077714+0300, event: waiting for subops from 59,494, op:
osd_op(client.1227774.0:22575820 7.19
7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head
[stat,call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op
in=331b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio
e83833)
2023-12-29T16:24:38.146157+0300, event: sub_op_commit_rec, op:
osd_op(client.1227774.0:22575820 7.19
7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head
[stat,call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op
in=331b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio
e83833)
2023-12-29T16:24:38.146166+0300, event: op_commit, op: osd_op(client.1227774
.0:22575820 7.19 7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.
9780498.1.83:head [stat,call rgw.guard_bucket_resharding in=36b,call
rgw.bucket_prepare_op in=331b] snapc 0=[]
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:38.146191+0300, event: sub_op_commit_rec, op:
osd_op(client.1227774.0:22575820 7.19
7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head
[stat,call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op
in=331b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio
e83833)
2023-12-29T16:24:38.146204+0300, event: commit_sent, op: osd_op(client.
1227774.0:22575820 7.19
7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head
[stat,call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op
in=331b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio
e83833)
2023-12-29T16:24:38.146216+0300, event: done, op:
osd_op(client.1227774.0:22575820
7.19 7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head
[stat,call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op
in=331b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio
e83833)
Has anybody faced the same issue? I'd be very grateful for any ideas, because at
this point I'm stuck on what to tune and where to look.
Cluster setup: 3x replication, 15 servers in 3 datacenters with datacenter as the
failure domain; 7x HDD (data), 2x SSD (index), 1x NVMe (WAL + OS).
ceph config - https://pastebin.com/pCqxXhT3
OSD read latency graph - https://postimg.cc/5YHk9bby
--
Thank you,
Roman
In our department we're getting started with Ceph 'reef', using the Ceph FUSE client on our Ubuntu workstations.
So far so good, except I can't quite figure out one aspect of subvolumes.
When I do the commands:
[root@ceph1 ~]# ceph fs subvolumegroup create cephfs csvg
[root@ceph1 ~]# ceph fs subvolume create cephfs staff csvg --size 2000000000000
I get these results:
- A subvolume group csvg is created on volume cephfs
- A subvolume called staff is created in the csvg subvolume group (as /volumes/csvg/staff); however, no size limit is shown on this folder in the Ceph dashboard view
- A folder with a random UUID name is created inside the subvolume staff (like /volumes/csvg/staff/6a1b3de5-f6ab-4878-aea3-3c3c6f96ffcf); this folder does have the 2 TB size limit set on it
My questions are:
- what's the purpose of this UUID, and is it a requirement?
- which directory should be mounted for my clients, staff/ or staff/{UUID}, for the size limit to take effect?
- is there any way to hide or disable this UUID for client mounts? (e.g. in /etc/fstab; see the fstab sketch after the command output below)
[root@ceph1 ~]# ceph fs subvolume getpath cephfs staff csvg
/volumes/csvg/staff/6a1b3de5-f6ab-4878-aea3-3c3c6f96ffcf
[root@ceph1 ~]# ceph fs subvolume ls cephfs csvg
[
{
"name": "staff"
}
]
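For what it's worth, the fstab entry I'm experimenting with looks roughly like
this, pointing at the path that getpath returns (the client name 'staff' and the
mount point are made up):
none  /mnt/staff  fuse.ceph  ceph.id=staff,ceph.client_mountpoint=/volumes/csvg/staff/6a1b3de5-f6ab-4878-aea3-3c3c6f96ffcf,_netdev,defaults  0 0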
--
Sincerely,
Matthew Melendy
IT Services Specialist
CS System Services Group
FEC 3550, University of New Mexico
Hi,
During an upgrade from Pacific to Quincy, we needed to recreate the mons,
because they were pretty old and still using LevelDB.
So step one was to destroy one of the mons. After that we recreated the
monitor, and although it starts, it remains in state ‘probing’, as you
can see below.
No matter what I tried, it won't come up. I've seen quite a few messages
suggesting that the MTU might be an issue, but that seems to be OK:
root@proxmox03:/var/log/ceph# fping -b 1472 10.10.10.{1..3} -M
10.10.10.1 is alive
10.10.10.2 is alive
10.10.10.3 is alive
Does anyone have an idea how to fix this? I’ve tried destroying and
recreating the mon a few times now. Could it be that the leveldb mons
only support mon.$id notation for the monitors?
root@proxmox03:/var/log/ceph# ceph daemon mon.proxmox03 mon_status
{
    "name": "proxmox03",
    "rank": 2,
    "state": "probing",
    "election_epoch": 0,
    "quorum": [],
    "features": {
        "required_con": "2449958197560098820",
        "required_mon": [
            "kraken",
            "luminous",
            "mimic",
            "osdmap-prune",
            "nautilus",
            "octopus",
            "pacific",
            "elector-pinging"
        ],
        "quorum_con": "0",
        "quorum_mon": []
    },
    "outside_quorum": [
        "proxmox03"
    ],
    "extra_probe_peers": [],
    "sync_provider": [],
    "monmap": {
        "epoch": 0,
        "fsid": "39b1e85c-7b47-4262-9f0a-47ae91042bac",
        "modified": "2024-01-23T21:02:12.631320Z",
        "created": "2017-03-15T14:54:55.743017Z",
        "min_mon_release": 16,
        "min_mon_release_name": "pacific",
        "election_strategy": 1,
        "disallowed_leaders: ": "",
        "stretch_mode": false,
        "tiebreaker_mon": "",
        "removed_ranks: ": "2",
        "features": {
            "persistent": [
                "kraken",
                "luminous",
                "mimic",
                "osdmap-prune",
                "nautilus",
                "octopus",
                "pacific",
                "elector-pinging"
            ],
            "optional": []
        },
        "mons": [
            {
                "rank": 0,
                "name": "0",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "10.10.10.1:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "10.10.10.1:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "10.10.10.1:6789/0",
                "public_addr": "10.10.10.1:6789/0",
                "priority": 0,
                "weight": 0,
                "crush_location": "{}"
            },
            {
                "rank": 1,
                "name": "1",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "10.10.10.2:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "10.10.10.2:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "10.10.10.2:6789/0",
                "public_addr": "10.10.10.2:6789/0",
                "priority": 0,
                "weight": 0,
                "crush_location": "{}"
            },
            {
                "rank": 2,
                "name": "proxmox03",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "10.10.10.3:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "10.10.10.3:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "10.10.10.3:6789/0",
                "public_addr": "10.10.10.3:6789/0",
                "priority": 0,
                "weight": 0,
                "crush_location": "{}"
            }
        ]
    },
    "feature_map": {
        "mon": [
            {
                "features": "0x3f01cfbdfffdffff",
                "release": "luminous",
                "num": 1
            }
        ]
    },
    "stretch_mode": false
}
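For completeness, a sketch of how the current monmap can be dumped and inspected
(the output path is arbitrary):
ceph mon getmap -o /tmp/monmap
monmaptool --print /tmp/monmap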
—
Mark Schouten
CTO, Tuxis B.V.
+31 318 200208 / mark(a)tuxis.nl
Hello team,
I have a production cluster composed of 3 OSD servers with 20 disks each,
deployed using ceph-ansible on Ubuntu, and the version is Pacific. These days it
is in WARN state because of PGs which have not been deep-scrubbed in time. I
tried to deep-scrub some PGs manually, but the cluster seems to be slow. I would
appreciate your assistance in getting the cluster back to HEALTH_OK as before,
without any interruption of service. The cluster is used as OpenStack backend
storage.
Best Regards
Michel
ceph -s
  cluster:
    id:     cb0caedc-eb5b-42d1-a34f-96facfda8c27
    health: HEALTH_WARN
            6 pgs not deep-scrubbed in time

  services:
    mon: 3 daemons, quorum ceph-mon1,ceph-mon2,ceph-mon3 (age 11M)
    mgr: ceph-mon2(active, since 11M), standbys: ceph-mon3, ceph-mon1
    osd: 48 osds: 48 up (since 11M), 48 in (since 11M)
    rgw: 6 daemons active (6 hosts, 1 zones)

  data:
    pools:   10 pools, 385 pgs
    objects: 5.97M objects, 23 TiB
    usage:   151 TiB used, 282 TiB / 433 TiB avail
    pgs:     381 active+clean
             4   active+clean+scrubbing+deep

  io:
    client: 59 MiB/s rd, 860 MiB/s wr, 155 op/s rd, 665 op/s wr
root@ceph-osd3:~# ceph health detail
HEALTH_WARN 6 pgs not deep-scrubbed in time
[WRN] PG_NOT_DEEP_SCRUBBED: 6 pgs not deep-scrubbed in time
pg 6.78 not deep-scrubbed since 2024-01-11T16:07:54.875746+0200
pg 6.60 not deep-scrubbed since 2024-01-13T19:44:26.922000+0200
pg 6.5c not deep-scrubbed since 2024-01-13T09:07:24.780936+0200
pg 4.12 not deep-scrubbed since 2024-01-13T09:09:22.176240+0200
pg 10.d not deep-scrubbed since 2024-01-12T08:04:02.078062+0200
pg 5.f not deep-scrubbed since 2024-01-12T06:06:00.970665+0200
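For reference, the manual deep scrub I tried was along these lines, using one of
the PG IDs listed above:
ceph pg deep-scrub 6.78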
Hi All,
Quick Q: How easy/hard is it to change the IP networks of:
1) A Ceph Cluster's "Front-End" Network?
2) A Ceph Cluster's "Back-End" Network?
Is it "simply" a matter of:
a) Placing the Nodes in maintenance mode
b) Changing a config file (I assume it's /etc/ceph/ceph.conf; see the sketch after this list) on each Node
c) Rebooting the Nodes
d) Taking each Node out of Maintenance Mode
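For b), a sketch of the lines I'm assuming would need to change in
/etc/ceph/ceph.conf (the addresses are made up):
[global]
public_network  = 192.168.10.0/24   # "front-end" / public network
cluster_network = 192.168.20.0/24   # "back-end" / cluster network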
Thanks in advance
Cheers
Dulux-Oz
Hello,
Recently we got a problem report from an internal customer of our S3. Our setup
consists of roughly 10 servers with 140 OSDs. Our 3 RGWs are co-located with the
monitors on dedicated servers, in an HA setup with HAProxy in front. We are
running 16.2.14 on Podman with cephadm.
Our S3 constantly handles around 500 req/s on average per RGW instance.
The problem is described in this issue: https://tracker.ceph.com/issues/63935.
Basically, this customer has a Grafana Mimir instance pushing to our S3, and
during a compaction process it issues a request pattern like this:
```
29/Dec/2023:17:13:28.961 rgw-frontend~ rgw-backend/server-mon-01-rgw0 0/0/0/127/127 200 228 - - ---- 132/132/70/67/0 0/0 "PUT /1234/object HTTP/1.1"
29/Dec/2023:17:13:29.101 rgw-frontend~ rgw-backend/server-mon-01-rgw0 0/0/0/1/1 200 381 - - ---- 132/132/76/71/0 0/0 "GET /1234/object HTTP/1.1"
29/Dec/2023:17:13:29.121 rgw-frontend~ rgw-backend/server-mon-01-rgw0 0/0/0/1/1 200 381 - - ---- 132/132/71/59/0 0/0 "GET /1234/object HTTP/1.1"
29/Dec/2023:17:13:29.137 rgw-frontend~ rgw-backend/server-mon-03-rgw0 0/0/0/4/4 204 153 - - ---- 132/132/71/6/0 0/0 "DELETE /1234/object HTTP/1.1"
29/Dec/2023:19:03:21.671 rgw-frontend~ rgw-backend/server-mon-03-rgw0 0/0/0/1/1 404 472 - - ---- 55/55/26/0/0 0/0 "GET /1234/object HTTP/1.1"
```
It does a PUT, GET and DELETE on the same object within the same second.
Afterwards the customer can still see the deleted object when doing a ListObjects
on the bucket, but if he tries to access it, RGW returns a 404.
After looking in Ceph, it appears the object still has a bucket index entry but
the associated RADOS object does not exist anymore. The bucket has neither
versioning nor object locking enabled.
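The kind of check I mean is roughly the following (bucket and object names are
placeholders, not the real ones):
```
radosgw-admin bucket list --bucket=<bucket> | grep <object>    # index entry still present
radosgw-admin object stat --bucket=<bucket> --object=<object>  # fails, backing RADOS object gone
```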
Did anyone encounter something similar? Thank you!
Regards,
--
Mathias Chapelain
Storage Engineer
Proton AG
Hi all,
after performing "ceph orch host drain" on one of our hosts, with only the mgr
container left on it, I find that another mgr daemon is indeed deployed on
another host, but the "old" one does not get removed by the drain command. The
same happens if I edit the mgr service via the UI to define different hosts for
the daemon: again, the old mgr daemons are not removed. Any recommendations? I am
using a setup with Podman and RHEL.
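For context, what I ran looks roughly like this (the host name is a placeholder,
and the ps call is just how I check which daemons are still there):
ceph orch host drain <host>
ceph orch ps <host>    # the old mgr daemon is still listed here afterwards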
Best,
Mevludin
Good morning,
I was struggling to understand why I cannot find this setting in my Reef version.
Is it because it is only in the latest dev Ceph version and not before?
https://docs.ceph.com/en/latest/radosgw/metrics/#user-bucket-counter-caches
The same page on Reef gives a 404:
https://docs.ceph.com/en/reef/radosgw/metrics/
Thank you!