May 2023 - ceph-users - lists.ceph.io

Small RGW objects and RADOS 64KB minimun size

by Loïc Dachary

Bonjour, Reading Karan's blog post about benchmarking the insertion of billions objects to Ceph via S3 / RGW[0] from last year, it reads: > we decided to lower bluestore_min_alloc_size_hdd to 18KB and re-test. As represented in chart-5, the object creation rate found to be notably reduced after lowering the bluestore_min_alloc_size_hdd parameter from 64KB (default) to 18KB. As such, for objects larger than the bluestore_min_alloc_size_hdd , the default values seems to be optimal, smaller objects further require more investigation if you intended to reduce bluestore_min_alloc_size_hdd parameter. There also is a mail thread dated 2018 on this topic as well, with the same conclusion although using RADOS directly and not RGW[3]. I read the RGW data layout page in the documentation[1] and concluded that by default every object inserted with S3 / RGW will indeed use at least 64kb. A pull request from last year[2] seems to confirm it and also suggests modifying bluestore_min_alloc_size_hdd has adverse side effects. That being said, I'm curious to know if people developed strategies to cope with this overhead. Someone mentioned packing objects together client side to make them larger. But maybe there are simpler ways to do the same? Cheers [0] https://www.redhat.com/en/blog/scaling-ceph-billion-objects-and-beyond [1] https://docs.ceph.com/en/latest/radosgw/layout/ [2] https://github.com/ceph/ceph/pull/32809 [3] https://www.spinics.net/lists/ceph-users/msg45755.html -- Loïc Dachary, Artisan Logiciel Libre

11 months, 1 week

4
6
0 0

BlueStore fragmentation woes

by Hector Martin

Hi, I've been seeing relatively large fragmentation numbers on all my OSDs: ceph daemon osd.13 bluestore allocator score block { "fragmentation_rating": 0.77251526920454427 } These aren't that old, as I recreated them all around July last year. They mostly hold CephFS data with erasure coding, with a mix of large and small files. The OSDs are at around 80%-85% utilization right now. Most of the data was written sequentially when the OSDs were created (I rsynced everything from a remote backup). Since then more data has been added, but not particularly quickly. At some point I noticed pathologically slow writes, and I couldn't figure out what was wrong. Eventually I did some block tracing and noticed the I/Os were very small, even though CephFS-side I was just writing one large file sequentially, and that's when I stumbled upon the free space fragmentation problem. Indeed, deleting some large files opened up some larger free extents and resolved the problem, but only until those get filled up and I'm back to fragmented tiny extents. So effectively I'm stuck at the current utilization, as trying to fill them up any more just slows down to an absolute crawl. I'm adding a few more OSDs and plan on doing the dance of removing one OSD at a time and replacing it with another one to hopefully improve the situation, but obviously this is going to take forever. Is there any plan for offering a defrag tool of some sort for bluestore? - Hector

11 months, 1 week

7
27
0 0

[Pacific] Admin keys no longer works I get access denied URGENT!!!

by wodel youchi

Hi, After a wrong manipulation, the admin key no longer works, it seems it has been modified. My cluster is built using containers. When I execute ceph -s I get [root@controllera ceph]# ceph -s 2023-05-31T11:33:20.940+0100 7ff7b2d13700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1] 2023-05-31T11:33:20.940+0100 7ff7b1d11700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1] 2023-05-31T11:33:20.940+0100 7ff7b2512700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1] [errno 13] RADOS permission denied (error connecting to the cluster) From the log file I am getting : May 31 11:03:02 controllera docker[214909]: debug 2023-05-31T11:03:02.714+0100 7fcfc0c91700 0 cephx server client.admin: unexpected key: req.key=5fea877f2a68548b expected_key=8c2074e03ffa449a How can I recover the correct key? Regards.

11 months, 1 week

2
1
0 0

all buckets mtime = "0.000000" after upgrade to 17.2.6

by alyarb＠gmail.com

I was running on 17.2.5 since October, and just upgraded to 17.2.6, and now the "mtime" property on all my buckets is 0.000000. On all previous versions going back to Nautilus this wasn't an issue, and we do like to have that value present. radosgw-admin has no quick way to get the last object in the bucket. Here's my tracker submission: https://tracker.ceph.com/issues/61264#change-239348

11 months, 1 week

2
1
0 0

MDS corrupt (also RADOS-level copy?)

by Jake Grimmett

Dear All, we are trying to recover from what we suspect is a corrupt MDS :( and have been following the guide here: <https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/> Symptoms: MDS SSD pool (2TB) filled completely over the weekend, normally uses less than 400GB, resulting in MDS crash. We added 4 x extra SSD to increase pool capacity to 3.5TB, however MDS did not recover # ceph fs status cephfs2 - 0 clients ======= RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS 0 failed 1 resolve wilma-s3 8065 8063 8047 0 2 resolve wilma-s2 901k 802k 34.4k 0 POOL TYPE USED AVAIL mds_ssd metadata 2296G 3566G primary_fs_data data 0 3566G ec82pool data 2168T 3557T STANDBY MDS wilma-s1 wilma-s4 setting "ceph mds repaired 0" causes rank 0 to restart, and then immediately fail. Following the disaster-recovery-experts guide, the first step we did was to export the MDS journals, e.g: # cephfs-journal-tool --rank=cephfs2:0 journal export /root/backup.bin.0 journal is 9744716714163~658103700 wrote 658103700 bytes at offset 9744716714163 to /root/backup.bin.0 so far so good, however when we try to backup the final MDS the process consumes all available RAM (470GB) and needs to be killed after 14 minutes. # cephfs-journal-tool --rank=cephfs2:2 journal export /root/backup.bin.2 similarly, "recover_dentries summary" consumes all RAM when applied to MDS 2 # cephfs-journal-tool --rank=cephfs2:2 event recover_dentries summary We successfully ran "cephfs-journal-tool --rank=cephfs2:0 event recover_dentries summary" and "cephfs-journal-tool --rank=cephfs2:1 event recover_dentries summary" at this point, we tried to follow the instructions and make a RADOS level copy of the journal data, however the link in the docs doesn't explain how to do this and just points to <http://tracker.ceph.com/issues/9902> At this point we are tempted to reset the journal on MDS 2, but wanted to get a feeling from others about how dangerous this could be? We have a backup, but as there is 1.8PB of data, it's going to take a few weeks to restore.... any ideas gratefully received. Jake -- Dr Jake Grimmett Head Of Scientific Computing MRC Laboratory of Molecular Biology Francis Crick Avenue, Cambridge CB2 0QH, UK.

11 months, 1 week

2
3
0 0

how to use ctdb_mutex_ceph_rados_helper

by Angelo Höngens

Hey, I have a test setup with a 3-node samba cluster. This cluster consists of 3 vm's storing its locks on a replicated gluster volume. I want to switch to 2 physical smb-gateways for performance reasons (not enough money for 3), and since the 2-node cluster can't get quorum, I hope to switch to storing the ctdb lock in ceph and hope that will work reliably. (experiences with 2 node SMB clusters?) I am looking into the ctdb rados helper: [cluster] recovery lock = !/usr/lib/x86_64-linux-gnu/ctdb/ctdb_mutex_ceph_rados_helper ceph client.tenant1 cephfs_metadata ctdb_lock Now I do have a bit of experience with cephfs, rbd and rgw, but not rados. How do I give the user client.tenant1 permissions? We have a single cephfs, with 4 different tenants (departments). Each department has their own samba cluster. We're using cephfs permissions to limit the tenants to their own path (I hope). example of ceph auth: client.tenant1 key: ***** caps: [mds] allow rws fsname=cephfs path=/tenant1 caps: [mon] allow r fsname=cephfs caps: [osd] allow rw tag cephfs data=cephfs If I try some stuff manually (without really knowing how to specify objects or what that means), I get this permission denied error: root@tenant1-1:~# /usr/lib/x86_64-linux-gnu/ctdb/ctdb_mutex_ceph_rados_helper ceph client.tenant1 cephfs_metadata tenant1/ctdb_lock 1 /usr/lib/x86_64-linux-gnu/ctdb/ctdb_mutex_ceph_rados_helper: Failed to get lock on RADOS object 'tenant1/ctdb_lock' - (Operation not permitted) Angelo.

11 months, 1 week

1
0
0 0

Important: RGW multisite bug may silently corrupt encrypted objects on replication

by Casey Bodley

Our downstream QE team recently observed an md5 mismatch of replicated objects when testing rgw's server-side encryption in multisite. This corruption is specific to s3 multipart uploads, and only affects the replicated copy - the original object remains intact. The bug likely affects Ceph releases all the way back to Luminous where server-side encryption was first introduced. To expand on the cause of this corruption: Encryption of multipart uploads requires special handling around the part boundaries, because each part is uploaded and encrypted separately. In multisite, objects are replicated in their encrypted form, and multipart uploads are replicated as a single part. As a result, the replicated copy loses its knowledge about the original part boundaries required to decrypt the data correctly. We don't have a fix yet, but we're tracking it in https://tracker.ceph.com/issues/46062. The fix will only modify the replication logic, so won't repair any objects that have already replicated incorrectly. We'll need to develop a radosgw-admin command to search for affected objects and reschedule their replication. In the meantime, I can only advise multisite users to avoid using encryption for multipart uploads. If you'd like to scan your cluster for existing encrypted multipart uploads, you can identify them with a s3 HeadObject request. The response would include a x-amz-server-side-encryption header, and the ETag header value (with "s removed) would be longer than 32 characters (multipart ETags are in the special form "<md5sum>-<num parts>"). Take care not to delete the corrupted replicas, because an active-active multisite configuration would go on to delete the original copy.

11 months, 1 week

2
4
0 0

Custom CRUSH maps HOWTO?

by Thorne Lawler

Hi folks! I have a Ceph production 17.2.6 cluster with 6 machines in it - four newer, faster machines with 4x3.84TB NVME drives each, and two with 24x1.68TB SAS disks each. I know I should have done something smart with the CRUSH maps for this up front, but until now I have shied away from CRUSH maps as they sound really complex. Right now my cluster's performance, especially write performance, is not what it needs to be, and I am looking for advice: 1. How should I be structuring my crush map, and why? 2. How does one actually edit and manage a CRUSH map? What /commands/ does one use? This isn't clear at all in the documentation. Are there any GUI tools out there for managing CRUSH? 3. Is this going to impact production performance or availability while I'm configuring it? I have tens of thousands of users relying on this thing, so I can't take any risks. Thanks in advance! -- Regards, Thorne Lawler - Senior System Administrator *DDNS* | ABN 76 088 607 265 First registrar certified ISO 27001-2013 Data Security Standard ITGOV40172 P +61 499 449 170 _DDNS /_*Please note:* The information contained in this email message and any attached files may be confidential information, and may also be the subject of legal professional privilege. _If you are not the intended recipient any use, disclosure or copying of this email is unauthorised. _If you received this email in error, please notify Discount Domain Name Services Pty Ltd on 03 9815 6868 to report this matter and delete all copies of this transmission together with any attachments. /

11 months, 1 week

3
3
0 0

Ceph client version vs server version inter-operability

by Mark Kirkwood

Hi, We are running a ceph cluster that is currently on Luminous. At this point most of our clients are also Luminous, but as we provision new client hosts we are using client versions that are more recent (e.g Octopus, Pacific and more recently Quincy). Is this safe? Is there a known list of what client versions are compatible with what server version? We are only using RBD and are specifying rbd_default_features (the same) on all server and client hosts. regards Mark

11 months, 1 week

1
0
0 0

Seeking feedback on Improving cephadm bootstrap process

by Redouane Kachach

Dear ceph community, As you are aware, cephadm has become the default tool for installing Ceph on bare-metal systems. Currently, during the bootstrap process of a new cluster, if the user interrupts the process manually or if there are any issues causing the bootstrap process to fail, cephadm leaves behind the failed cluster files and processes on the current host. While this can be beneficial for debugging and resolving issues related to the cephadm bootstrap process, it can create difficulties for inexperienced users who need to delete the faulty cluster and proceed with the Ceph installation. The problem described in the tracker https://tracker.ceph.com/issues/57016 is a good example of this issue.In the cephadm development team, we are considering ways to enhance the user experience during the bootstrap of a new cluster. We have discussed the following options:1) Retain the cluster files without deleting them, but provide the user with a clear command to remove the broken/faulty cluster. 2) Automatically delete the broken/failed ceph installation and offer an option for the user to disable this behavior if desired.Both options have their advantages and disadvantages, which is why we are seeking your feedback. We would like to know which option you prefer and the reasoning behind your choice. Please provide reasonable arguments to justify your preference.Your feedback will be taken into careful consideration when we work on improving the ceph bootstrap process.Thank you, Redouane, On behalf of cephadm dev team.

11 months, 1 week

6
8
0 0

2024

2023

2022

2021

2020

2019

ceph-users May 2023