Hello Ceph users,
we are seeing a strange issue on our recent Ceph installation, v17.6.2. We store
data on an HDD pool; the index pool is on SSD. Each OSD stores its WAL on an NVMe
partition. Benchmarks didn't expose any issues with the cluster, but since we
put production load on it we see constantly growing OSD latency
(osd_read_latency) on the SSD disks (where the index pool is located). The latency
grows day by day, even though the disks are not utilized at even 50%.
Interestingly, when we move the index pool from SSD to the NVMe disks (disk
space allows it for now), the OSD latency drops to zero and starts growing again
from scratch. We also noticed that any change of pg_num for the index pool
(from 256 to 128, for instance) also drops the latency to zero, after which it
starts growing again (https://postimg.cc/5YHk9bby).
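For clarity, the changes that reset the latency are nothing exotic, just the usual pool commands, roughly (pool and rule names are placeholders):
ceph osd pool set <index-pool> crush_rule <nvme-rule>   # move the index pool to the NVMe OSDs
ceph osd pool set <index-pool> pg_num 128               # or simply change pg_num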
From the client's perspective it looks like one operation takes longer and longer
each day, and the operation time drops each time we make some change to the
index pool. I've enabled debug_optracker 10/0 and it shows that the OSD spends
most of its time in the `queued_for_pg` state, while physical disk utilization is
only about 10-20%. Per the logs, the longest operation is ListBucket; it is
strange that with fewer than 100,000 items in a bucket, listing it even with
'max_keys=1' takes 3-40 seconds.
If it matters, the client is Apache Flink doing checkpoints via the S3 protocol.
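The optracker data below was gathered roughly like this (OSD id is a placeholder; 10/0 keeps the in-memory op history verbose without flooding the log file):
ceph config set osd debug_optracker 10/0      # enable detailed op tracking on the OSDs
ceph daemon osd.<id> dump_historic_ops        # dump recent slow ops with their per-event timeline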
Here is an example of one operation from the debug_optracker logs:
2023-12-29T16:24:28.873353+0300, event: throttled, op: osd_op(client.1227774
.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:28.861549+0300, event: header_read, op: osd_op(client.
1227774.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:28.873358+0300, event: all_read, op: osd_op(client.1227774
.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:28.873359+0300, event: dispatched, op: osd_op(client.
1227774.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:28.873389+0300, event: queued_for_pg, op: osd_op(client.
1227774.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:38.077528+0300, event: reached_pg, op: osd_op(client.
1227774.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:38.077561+0300, event: started, op: osd_op(client.1227774
.0:22575820 7.19 7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.
9780498.1.83:head [stat,call rgw.guard_bucket_resharding in=36b,call
rgw.bucket_prepare_op in=331b] snapc 0=[]
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:38.077714+0300, event: waiting for subops from 59,494, op:
osd_op(client.1227774.0:22575820 7.19
7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head
[stat,call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op
in=331b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio
e83833)
2023-12-29T16:24:38.146157+0300, event: sub_op_commit_rec, op:
osd_op(client.1227774.0:22575820 7.19
7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head
[stat,call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op
in=331b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio
e83833)
2023-12-29T16:24:38.146166+0300, event: op_commit, op: osd_op(client.1227774
.0:22575820 7.19 7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.
9780498.1.83:head [stat,call rgw.guard_bucket_resharding in=36b,call
rgw.bucket_prepare_op in=331b] snapc 0=[]
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:38.146191+0300, event: sub_op_commit_rec, op:
osd_op(client.1227774.0:22575820 7.19
7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head
[stat,call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op
in=331b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio
e83833)
2023-12-29T16:24:38.146204+0300, event: commit_sent, op: osd_op(client.
1227774.0:22575820 7.19
7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head
[stat,call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op
in=331b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio
e83833)
2023-12-29T16:24:38.146216+0300, event: done, op:
osd_op(client.1227774.0:22575820
7.19 7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head
[stat,call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op
in=331b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio
e83833)
Has anybody faced the same issue? I'd be very grateful for any ideas,
because at this point I'm stuck on what to tune and where to look.
Cluster setup: replication 3x, 15 servers in 3 datacenters with datacenter
as the failure domain; 7x HDD (data), 2x SSD (index), 1x NVMe (WAL + OS).
ceph config - https://pastebin.com/pCqxXhT3
OSD read latency graph - https://postimg.cc/5YHk9bby
--
Thank you,
Roman
1024 PGs on NVMe.
From: Anthony D'Atri <anthony.datri(a)gmail.com>
Sent: Friday, February 2, 2024 2:37 PM
To: Cory Snyder <csnyder(a)1111systems.com>
Subject: Re: [ceph-users] OSD read latency grows over time
Thanks. What type of media are your index OSDs? How many PGs?
> On Feb 2, 2024, at 2:32 PM, Cory Snyder <csnyder(a)1111systems.com> wrote:
>
> Yes, we changed osd_memory_target to 10 GB on just our index OSDs. These OSDs have over 300 GB of lz4 compressed bucket index omap data. Here is a graph showing the latencies before/after that single change:
>
> https://pasteboard.co/IMCUWa1t3Uau.png
>
> Cory Snyder
>
>
> From: Anthony D'Atri <anthony.datri(a)gmail.com>
> Sent: Friday, February 2, 2024 2:15 PM
> To: Cory Snyder <csnyder(a)1111systems.com>
> Cc: ceph-users <ceph-users(a)ceph.io>
> Subject: Re: [ceph-users] OSD read latency grows over time
>
> You adjusted osd_memory_target? Higher than the default 4GB?
>
>
>
> Another thing that we've found is that rocksdb can become quite slow if it doesn't have enough memory for internal caches. As our cluster usage has grown, we've needed to increase OSD memory in accordance with bucket index pool usage. On one cluster, we found that increasing OSD memory improved rocksdb latencies by over 10x.
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
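For anyone who wants to apply the same change only to their index OSDs, osd_memory_target can be scoped with a config mask or set per OSD, roughly like this (the 10 GB value and the 'ssd' device class are assumptions about the setup):
ceph config set osd/class:ssd osd_memory_target 10737418240   # 10 GiB for every OSD whose device class is ssd
ceph config set osd.12 osd_memory_target 10737418240          # or target an individual OSD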
Hi,
A little basic question.
I created a volume with
ceph fs volume
then a subvolume called «erasure». I can see it with:
root@cthulhu1:/etc/ceph# ceph fs subvolume info cephfs erasure
{
"atime": "2024-02-02 11:02:07",
"bytes_pcent": "undefined",
"bytes_quota": "infinite",
"bytes_used": 0,
"created_at": "2024-02-02 11:02:07",
"ctime": "2024-02-02 14:12:30",
"data_pool": "data_ec",
"features": [
"snapshot-clone",
"snapshot-autoprotect",
"snapshot-retention"
],
"gid": 0,
"mode": 16877,
"mon_addrs": [
"145.238.187.184:6789",
"145.238.187.185:6789",
"145.238.187.186:6789",
"145.238.187.188:6789",
"145.238.187.187:6789"
],
"mtime": "2024-02-02 14:12:30",
"path": "/volumes/_nogroup/erasure/998e3bdf-f92b-4508-99ed-69f03a7303e9",
"pool_namespace": "",
"state": "complete",
"type": "subvolume",
"uid": 0
}
From the mon server I was able to mount the «partition» with
mount -t ceph admin@fXXXXXXX-c0f2-11ee-9307-f7e3b9f03075.cephfs=/volumes/_nogroup/erasure/998e3bdf-f92b-4508-99ed-69f03a7303e9 /mnt
but on my test client I'm unable to mount it:
root@ceph-vo-m:/etc/ceph# mount -t ceph vo@fxxxxxxx-c0f2-11ee-9307-f7e3b9f03075.cephfs=/volumes/_nogroup/erasure/998e3bdf-f92b-4508-99ed-69f03a7303e9/ /vo --verbose
parsing options: rw
source mount path was not specified
unable to parse mount source: -22
root@ceph-vo-m:/etc/ceph#
So I copied /etc/ceph/ceph.conf to my client
and put /etc/ceph/ceph.client.vo.keyring on the client as well.
There is no firewall between the client and the cluster.
The weird part is that when I run tcpdump on the client, I don't see any TCP
activity at all.
Any way to debug this problem?
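For reference, one thing I have not tried yet is the legacy mount syntax; as far as I understand it would look roughly like this (mon address taken from the subvolume info above; the name/secretfile options are assumptions about my client setup):
mount -t ceph 145.238.187.184:6789:/volumes/_nogroup/erasure/998e3bdf-f92b-4508-99ed-69f03a7303e9 /vo -o name=vo,secretfile=/etc/ceph/vo.secret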
Thanks
Regards
--
Albert SHIH 🦫 🐸
France
Heure locale/Local time:
Fri 02 Feb 2024 16:21:01 CET
Hi group,
Today I conducted a small experiment to test an assumption of mine,
namely that Ceph incurs a substantial network overhead when dealing with many
small files.
One RBD image was created, and on top of that an XFS filesystem containing
1.6 M files, each 10 kiB in size:
# rbd info libvirt/bobtest
rbd image 'bobtest':
size 20 GiB in 5120 objects
order 22 (4 MiB objects)
[...]
# df -h /space
Filesystem Size Used Avail Use% Mounted on
/dev/rbd0 20G 20G 181M 100% /space
# ls -lh /space |head
total 19G
-rw-r--r--. 1 root root 10K Feb 2 14:13 xaa
-rw-r--r--. 1 root root 10K Feb 2 14:13 xab
-rw-r--r--. 1 root root 10K Feb 2 14:13 xac
-rw-r--r--. 1 root root 10K Feb 2 14:13 xad
-rw-r--r--. 1 root root 10K Feb 2 14:13 xae
-rw-r--r--. 1 root root 10K Feb 2 14:13 xaf
-rw-r--r--. 1 root root 10K Feb 2 14:13 xag
-rw-r--r--. 1 root root 10K Feb 2 14:13 xah
-rw-r--r--. 1 root root 10K Feb 2 14:13 xai
# ls /space |wc -l
1638400
All files contain pseudorandom (i.e. incompressible) junk.
My assumption was that, as the backend RBD object size is 4 MiB, the client
machine would need to download at least that 4 MiB worth of data for any given
request, even if the file in the XFS is only 10 kiB.
I.e. when I cat(1) a small file, the RBD client grabs the relevant 4 MiB
object from Ceph, and from it the small amount of requested data is
extracted and presented to userspace.
That's not what I see, however. My testing procedure is as follows:
I have a list of all the files on the RBD, order randomized, stored in
root's home folder -- this to make sure that I can pick file names at
random by going through the list from the top, and not causing network
traffic by listing files directly in the target FS. I then reboot the
node to ensure that all caches are empty and start an iftop(1) to
monitor network usage.
Mapping the RBD and mounting the XFS results in 5.29 MB worth of data
read from the network.
Reading one file at random from the XFS results in approx. 200 kB of
network read.
Reading 100 files at random results in approx. 3.83 MB of network read.
Reading 1000 files at random results in approx. 36.2 MB of network read.
Bottom line is that reading any 10 kiB of actual data results in
approximately 37 kiB of data being transferred over the network. Overhead,
sure, but nowhere near what I expected, which was 4 MiB per object "hit" in
the backend.
Is the RBD client performing partial object reads? Is that even a thing?
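One way I might verify this directly, bypassing XFS and the page cache, is to read a small aligned chunk straight off the RBD device while watching iftop (the offset below is arbitrary, just somewhere in the middle of the image):
# read 16 KiB at a 10 GiB offset with O_DIRECT and watch how much crosses the network
dd if=/dev/rbd0 of=/dev/null bs=4k count=4 skip=2621440 iflag=direct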
Cheers,
Ruben Vestergaard
Hi cephers,
I've been looking into better balancing our clusters with upmaps lately,
and ran into upmap cases that behave in a less than ideal way. If there
is any cycle in the upmaps like
ceph osd pg-upmap-items <pgid> a b b a
or
ceph osd pg-upmap-items <pgid> a b b c c a
the upmap validation passes, the upmap gets added to the osdmap, but
then gets silently ignored. Obviously this is for EC pools - irrelevant
for replicated pools where the order of OSDs is not significant.
The relevant code OSDMap::_apply_upmap even has a comment about this:
if (q != pg_upmap_items.end()) {
// NOTE: this approach does not allow a bidirectional swap,
// e.g., [[1,2],[2,1]] applied to [0,1,2] -> [0,2,1].
for (auto& r : q->second) {
// make sure the replacement value doesn't already appear
...
I'm trying to understand the reason for this limitation: is this just a matter
of coding convenience (OSDMap::_apply_upmap could handle these cases with a
slightly more careful approach), or is there some inherent limitation elsewhere
that prevents them from working? I did notice that just updating
prevents these cases from working? I did notice that just updating
crush weights (without using upmaps) produces similar changes to the UP
set (swaps OSDs in EC pools sometimes), so the OSDs seem to be perfectly
capable of doing backfills for osdmap changes that shuffle the order of
OSDs in the UP set. Some insight/history here would be appreciated.
Either way, the behavior of validation passing on an upmap and then the
upmap getting silently ignored is not ideal. I do realize that all
clients would have to agree on this code, since clients independently
execute it to find the OSDs to access (so rolling out a change to this
is challenging).
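For anyone wanting to reproduce the silent-ignore behavior, the check I have been using is roughly this (pgid and OSD ids are placeholders):
ceph osd pg-upmap-items <pgid> a b b a    # validation passes and the mapping is stored
ceph osd dump | grep pg_upmap_items       # the entry shows up in the osdmap
ceph pg map <pgid>                        # but the UP set is unchanged, i.e. the upmap is ignored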
Andras
In our department we're getting started with Ceph 'reef', using the Ceph FUSE client for our Ubuntu workstations.
So far so good, except I can't quite figure out one aspect of subvolumes.
When I do the commands:
[root@ceph1 ~]# ceph fs subvolumegroup create cephfs csvg
[root@ceph1 ~]# ceph fs subvolume create cephfs staff csvg --size 2000000000000
I get these results:
- A subvolume group csvg is created on volume cephfs
- A subvolume called staff is created in the csvg subvolume group (like /volumes/csvg/staff ); however, there is no size limit set on this folder in the Ceph dashboard view
- A folder with a random UUID name is created inside the subvolume staff (like /volumes/csvg/staff/6a1b3de5-f6ab-4878-aea3-3c3c6f96ffcf ); this folder does have a size of 2 TB set on it
My questions are:
- what's the purpose of this UUID, and is it a requirement?
- which directory should be mounted for my clients, staff/ or staff/{UUID}, for the size limit to take effect?
- is there any way to hide or disable this UUID for client mounts? (eg in /etc/fstab)
[root@ceph1 ~]# ceph fs subvolume getpath cephfs staff csvg
/volumes/csvg/staff/6a1b3de5-f6ab-4878-aea3-3c3c6f96ffcf
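For what it's worth, the mount I had in mind would use that full getpath output; a kernel-client line in /etc/fstab would look roughly like this (mon addresses, mount point and secret file path are placeholders for my setup):
mon1,mon2,mon3:/volumes/csvg/staff/6a1b3de5-f6ab-4878-aea3-3c3c6f96ffcf /mnt/staff ceph name=staff,secretfile=/etc/ceph/staff.secret,_netdev 0 0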
[root@ceph1 ~]# ceph fs subvolume ls cephfs csvg
[
{
"name": "staff"
}
]
--
Sincerely,
Matthew Melendy
IT Services Specialist
CS System Services Group
FEC 3550, University of New Mexico
Hello team,
I am unable to log in to my Ceph dashboard, which is running Pacific
and was deployed using ceph-ansible. I set the admin password using the
following command: "ceph dashboard ac-user-set-password admin -i
ceph-dash-pass", where ceph-dash-pass contains the real password. I get
the following output: "{"username": "admin", "password":
"$2b$12$Ge/2cpg0ZGjRPnBC2YREP.E5oVyNvV4SC9HU4PMsWWMBtC9UvL7mG", "roles":
["administrator"], "name": null, "email": null, "lastUpdate": 1706866328,
"enabled": false, "pwdExpirationDate": null, "pwdUpdateRequired": false}"
When I then try to log in to the dashboard, I still get the same error message. I am
guessing this is because the "enabled" field above is set to false. How can I
set that field to true? Or, if there is another way to fix it, please
advise.
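In case it is relevant, one thing I was considering is re-enabling the account, since dashboard accounts can end up disabled (for example after too many failed login attempts); I believe the commands would be something like:
ceph dashboard ac-user-enable admin    # set the "enabled" flag back to true
ceph dashboard ac-user-show admin      # verify the account settings afterwards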
Thank you
Hi,
During an upgrade from Pacific to Quincy, we needed to recreate the mons
because they were quite old and still using leveldb.
So step one was to destroy one of the mons. After that we recreated the
monitor, and although it starts, it remains in state ‘probing’, as you
can see below.
No matter what I tried, it won't come up. I've seen quite a few messages
suggesting that the MTU might be an issue, but that seems to be fine:
root@proxmox03:/var/log/ceph# fping -b 1472 10.10.10.{1..3} -M
10.10.10.1 is alive
10.10.10.2 is alive
10.10.10.3 is alive
Does anyone have an idea how to fix this? I’ve tried destroying and
recreating the mon a few times now. Could it be that the leveldb mons
only support mon.$id notation for the monitors?
root@proxmox03:/var/log/ceph# ceph daemon mon.proxmox03 mon_status
{
    "name": "proxmox03",
    "rank": 2,
    "state": "probing",
    "election_epoch": 0,
    "quorum": [],
    "features": {
        "required_con": "2449958197560098820",
        "required_mon": [
            "kraken",
            "luminous",
            "mimic",
            "osdmap-prune",
            "nautilus",
            "octopus",
            "pacific",
            "elector-pinging"
        ],
        "quorum_con": "0",
        "quorum_mon": []
    },
    "outside_quorum": [
        "proxmox03"
    ],
    "extra_probe_peers": [],
    "sync_provider": [],
    "monmap": {
        "epoch": 0,
        "fsid": "39b1e85c-7b47-4262-9f0a-47ae91042bac",
        "modified": "2024-01-23T21:02:12.631320Z",
        "created": "2017-03-15T14:54:55.743017Z",
        "min_mon_release": 16,
        "min_mon_release_name": "pacific",
        "election_strategy": 1,
        "disallowed_leaders: ": "",
        "stretch_mode": false,
        "tiebreaker_mon": "",
        "removed_ranks: ": "2",
        "features": {
            "persistent": [
                "kraken",
                "luminous",
                "mimic",
                "osdmap-prune",
                "nautilus",
                "octopus",
                "pacific",
                "elector-pinging"
            ],
            "optional": []
        },
        "mons": [
            {
                "rank": 0,
                "name": "0",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "10.10.10.1:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "10.10.10.1:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "10.10.10.1:6789/0",
                "public_addr": "10.10.10.1:6789/0",
                "priority": 0,
                "weight": 0,
                "crush_location": "{}"
            },
            {
                "rank": 1,
                "name": "1",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "10.10.10.2:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "10.10.10.2:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "10.10.10.2:6789/0",
                "public_addr": "10.10.10.2:6789/0",
                "priority": 0,
                "weight": 0,
                "crush_location": "{}"
            },
            {
                "rank": 2,
                "name": "proxmox03",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "10.10.10.3:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "10.10.10.3:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "10.10.10.3:6789/0",
                "public_addr": "10.10.10.3:6789/0",
                "priority": 0,
                "weight": 0,
                "crush_location": "{}"
            }
        ]
    },
    "feature_map": {
        "mon": [
            {
                "features": "0x3f01cfbdfffdffff",
                "release": "luminous",
                "num": 1
            }
        ]
    },
    "stretch_mode": false
}
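In case it helps with suggestions, the monmap can also be inspected offline; something along these lines should show how the quorum mons are actually named (the output path is a placeholder):
ceph mon getmap -o /tmp/monmap     # fetch the current monmap from the existing quorum
monmaptool --print /tmp/monmap     # list mon names, ranks and addresses as the cluster sees them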
—
Mark Schouten
CTO, Tuxis B.V.
+31 318 200208 / mark(a)tuxis.nl
Hello team,
I have a production cluster composed of 3 OSD servers with 20 disks
each, deployed with ceph-ansible on Ubuntu; the version is Pacific.
These days it is in WARN state because of PGs that have not been deep-scrubbed
in time. I tried to deep-scrub some PGs manually, but it seems the cluster
can get slow. I would appreciate your assistance in getting the cluster back to
HEALTH_OK as before, without any interruption of service. The cluster
is used as OpenStack backend storage.
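For reference, the manual deep scrubs were issued roughly like this (PG ids taken from the health detail below), and I am wondering whether carefully raising osd_max_scrubs would be a reasonable next step:
ceph pg deep-scrub 6.78                  # repeat for each PG listed in 'ceph health detail'
ceph config set osd osd_max_scrubs 2     # assumption: default is 1 on Pacific, so this allows one extra concurrent scrub per OSD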
Best Regards
Michel
ceph -s
  cluster:
    id:     cb0caedc-eb5b-42d1-a34f-96facfda8c27
    health: HEALTH_WARN
            6 pgs not deep-scrubbed in time

  services:
    mon: 3 daemons, quorum ceph-mon1,ceph-mon2,ceph-mon3 (age 11M)
    mgr: ceph-mon2(active, since 11M), standbys: ceph-mon3, ceph-mon1
    osd: 48 osds: 48 up (since 11M), 48 in (since 11M)
    rgw: 6 daemons active (6 hosts, 1 zones)

  data:
    pools:   10 pools, 385 pgs
    objects: 5.97M objects, 23 TiB
    usage:   151 TiB used, 282 TiB / 433 TiB avail
    pgs:     381 active+clean
             4   active+clean+scrubbing+deep

  io:
    client: 59 MiB/s rd, 860 MiB/s wr, 155 op/s rd, 665 op/s wr
root@ceph-osd3:~# ceph health detail
HEALTH_WARN 6 pgs not deep-scrubbed in time
[WRN] PG_NOT_DEEP_SCRUBBED: 6 pgs not deep-scrubbed in time
pg 6.78 not deep-scrubbed since 2024-01-11T16:07:54.875746+0200
pg 6.60 not deep-scrubbed since 2024-01-13T19:44:26.922000+0200
pg 6.5c not deep-scrubbed since 2024-01-13T09:07:24.780936+0200
pg 4.12 not deep-scrubbed since 2024-01-13T09:09:22.176240+0200
pg 10.d not deep-scrubbed since 2024-01-12T08:04:02.078062+0200
pg 5.f not deep-scrubbed since 2024-01-12T06:06:00.970665+0200