Hello,
Reading Karan's blog post from last year about benchmarking the insertion of billions of objects into Ceph via S3 / RGW[0], it reads:
> we decided to lower bluestore_min_alloc_size_hdd to 18KB and re-test. As represented in chart-5, the object creation rate found to be notably reduced after lowering the bluestore_min_alloc_size_hdd parameter from 64KB (default) to 18KB. As such, for objects larger than the bluestore_min_alloc_size_hdd , the default values seems to be optimal, smaller objects further require more investigation if you intended to reduce bluestore_min_alloc_size_hdd parameter.
There is also a mail thread from 2018 on this topic, with the same conclusion, although it uses RADOS directly rather than RGW[3]. I read the RGW data layout page in the documentation[1] and concluded that, by default, every object inserted via S3 / RGW will indeed use at least 64KB. A pull request from last year[2] seems to confirm this and also suggests that modifying bluestore_min_alloc_size_hdd has adverse side effects.
That being said, I'm curious to know whether people have developed strategies to cope with this overhead. Someone mentioned packing objects together client-side to make them larger, but maybe there are simpler ways to achieve the same result?
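For context, this is how I have been checking the value on my side (a minimal sketch; osd.0 is just an example, and note that the effective value is frozen when an OSD is created, so changing the option only affects OSDs built afterwards):

# default that newly created HDD OSDs will pick up
ceph config get osd bluestore_min_alloc_size_hdd
# what a given running OSD has configured (via its admin socket)
ceph daemon osd.0 config get bluestore_min_alloc_size_hdd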
Cheers
[0] https://www.redhat.com/en/blog/scaling-ceph-billion-objects-and-beyond
[1] https://docs.ceph.com/en/latest/radosgw/layout/
[2] https://github.com/ceph/ceph/pull/32809
[3] https://www.spinics.net/lists/ceph-users/msg45755.html
--
Loïc Dachary, Artisan Logiciel Libre
Hi,
I've been seeing relatively large fragmentation numbers on all my OSDs:
ceph daemon osd.13 bluestore allocator score block
{
"fragmentation_rating": 0.77251526920454427
}
These aren't that old, as I recreated them all around July last year.
They mostly hold CephFS data with erasure coding, with a mix of large
and small files. The OSDs are at around 80%-85% utilization right now.
Most of the data was written sequentially when the OSDs were created (I
rsynced everything from a remote backup). Since then more data has been
added, but not particularly quickly.
At some point I noticed pathologically slow writes, and I couldn't
figure out what was wrong. Eventually I did some block tracing and
noticed the I/Os were very small, even though CephFS-side I was just
writing one large file sequentially, and that's when I stumbled upon the
free space fragmentation problem. Indeed, deleting some large files
opened up some larger free extents and resolved the problem, but only
until those got filled up again and I was back to fragmented tiny extents.
So effectively I'm stuck at the current utilization, as trying to fill the
OSDs up any further just slows everything down to an absolute crawl.
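For anyone who wants to check their own OSDs, this is roughly what I have
been running (osd.13 is just one example, and the dump output can be very
large):

# fragmentation score of the free space (0 = none, 1 = fully fragmented)
ceph daemon osd.13 bluestore allocator score block
# full list of free extents, to see how small they actually are
ceph daemon osd.13 bluestore allocator dump block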
I'm adding a few more OSDs and plan on doing the dance of removing one
OSD at a time and replacing it with another one to hopefully improve the
situation, but obviously this is going to take forever.
Is there any plan for offering a defrag tool of some sort for bluestore?
- Hector
Hi,
After a mistaken operation, the admin key no longer works; it seems it has
been modified.
My cluster is built using containers.
When I execute ceph -s I get:
[root@controllera ceph]# ceph -s
2023-05-31T11:33:20.940+0100 7ff7b2d13700 -1 monclient(hunting):
handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
2023-05-31T11:33:20.940+0100 7ff7b1d11700 -1 monclient(hunting):
handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
2023-05-31T11:33:20.940+0100 7ff7b2512700 -1 monclient(hunting):
handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
[errno 13] RADOS permission denied (error connecting to the cluster)
From the log file I am getting:
May 31 11:03:02 controllera docker[214909]: debug
2023-05-31T11:03:02.714+0100 7fcfc0c91700 0 cephx server client.admin:
unexpected key: req.key=5fea877f2a68548b expected_key=8c2074e03ffa449a
How can I recover the correct key?
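While waiting for suggestions: would something like the following be a sane
way to read the expected key back out of the monitor's database, assuming
the monitor's own keyring is still intact? (the keyring path below is a
guess on my part; it will differ depending on the container layout)

# from inside a monitor container, authenticate as mon. and dump the
# key the cluster actually expects for client.admin
ceph -n mon. -k /var/lib/ceph/mon/ceph-controllera/keyring auth get client.admin
# then copy that key into /etc/ceph/ceph.client.admin.keyring on the client side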
Regards.
I had been running 17.2.5 since October and just upgraded to 17.2.6, and now the "mtime" property on all my buckets is 0.000000.
On all previous versions going back to Nautilus this wasn't an issue, and we do like to have that value present; radosgw-admin has no quick way to get the last object in the bucket.
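For reference, this is how we see it (the bucket name is just an example):

radosgw-admin bucket stats --bucket=mybucket | grep mtime
# on 17.2.6 the mtime shows as 0.000000, where 17.2.5 and earlier showed a real timestamp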
Here's my tracker submission:
https://tracker.ceph.com/issues/61264#change-239348
Dear All,
we are trying to recover from what we suspect is a corrupt MDS :(
and have been following the guide here:
<https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/>
Symptoms: the MDS SSD pool (2TB) filled completely over the weekend (it
normally uses less than 400GB), resulting in an MDS crash.
We added 4 extra SSDs to increase the pool capacity to 3.5TB, however the
MDS did not recover:
# ceph fs status
cephfs2 - 0 clients
=======
RANK  STATE    MDS       ACTIVITY  DNS    INOS   DIRS   CAPS
 0    failed
 1    resolve  wilma-s3            8065   8063   8047   0
 2    resolve  wilma-s2            901k   802k   34.4k  0
POOL             TYPE      USED    AVAIL
mds_ssd          metadata  2296G   3566G
primary_fs_data  data      0       3566G
ec82pool         data      2168T   3557T
STANDBY MDS
wilma-s1
wilma-s4
setting "ceph mds repaired 0" causes rank 0 to restart, and then
immediately fail.
Following the disaster-recovery-experts guide, the first step we took was
to export the MDS journals, e.g.:
# cephfs-journal-tool --rank=cephfs2:0 journal export /root/backup.bin.0
journal is 9744716714163~658103700
wrote 658103700 bytes at offset 9744716714163 to /root/backup.bin.0
So far so good; however, when we try to back up the journal of the final
rank, the process consumes all available RAM (470GB) and has to be killed
after 14 minutes.
# cephfs-journal-tool --rank=cephfs2:2 journal export /root/backup.bin.2
similarly, "recover_dentries summary" consumes all RAM when applied to MDS 2
# cephfs-journal-tool --rank=cephfs2:2 event recover_dentries summary
We successfully ran "cephfs-journal-tool --rank=cephfs2:0 event
recover_dentries summary" and "cephfs-journal-tool --rank=cephfs2:1
event recover_dentries summary"
At this point, we tried to follow the instructions and make a RADOS-level
copy of the journal data; however, the docs don't explain how to do this
and the link just points to
<http://tracker.ceph.com/issues/9902>
At this point we are tempted to reset the journal on rank 2, but wanted
to get a feeling from others about how dangerous this could be.
We have a backup, but as there is 1.8PB of data, it's going to take a
few weeks to restore....
Any ideas gratefully received.
Jake
--
Dr Jake Grimmett
Head Of Scientific Computing
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.
Hey,
I have a test setup with a 3-node Samba cluster. This cluster consists
of three VMs storing their locks on a replicated Gluster volume.
I want to switch to two physical SMB gateways for performance reasons
(not enough money for three), and since a 2-node cluster can't get
quorum, I hope to switch to storing the CTDB lock in Ceph and hope
that will work reliably. (Any experiences with 2-node SMB clusters?)
I am looking into the ctdb rados helper:
[cluster]
recovery lock =
!/usr/lib/x86_64-linux-gnu/ctdb/ctdb_mutex_ceph_rados_helper ceph
client.tenant1 cephfs_metadata ctdb_lock
Now I do have a bit of experience with CephFS, RBD and RGW, but not
RADOS directly. How do I give the user client.tenant1 the right permissions?
We have a single CephFS with 4 different tenants (departments). Each
department has its own Samba cluster. We're using CephFS path restrictions
to limit the tenants to their own paths (I hope).
example of ceph auth:
client.tenant1
key: *****
caps: [mds] allow rws fsname=cephfs path=/tenant1
caps: [mon] allow r fsname=cephfs
caps: [osd] allow rw tag cephfs data=cephfs
If I try some stuff manually (without really knowing how to specify
objects or what that means), I get this permission denied error:
root@tenant1-1:~#
/usr/lib/x86_64-linux-gnu/ctdb/ctdb_mutex_ceph_rados_helper ceph
client.tenant1 cephfs_metadata tenant1/ctdb_lock 1
/usr/lib/x86_64-linux-gnu/ctdb/ctdb_mutex_ceph_rados_helper: Failed to
get lock on RADOS object 'tenant1/ctdb_lock' - (Operation not
permitted)
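In case it helps, this is what I was planning to try next. It is purely a
guess on my side: the pool name ctdb_locks is made up by me, and I'm not
sure whether the fsname-restricted mon cap is enough for a plain RADOS
client, so I widened it to 'allow r' here:

# small dedicated pool for the lock objects, instead of writing into the
# cephfs metadata pool directly
ceph osd pool create ctdb_locks 8
# ceph auth caps replaces all caps, so the existing ones must be repeated
ceph auth caps client.tenant1 \
    mds 'allow rws fsname=cephfs path=/tenant1' \
    mon 'allow r' \
    osd 'allow rw tag cephfs data=cephfs, allow rwx pool=ctdb_locks'
# and then point the helper at that pool:
# !/usr/lib/x86_64-linux-gnu/ctdb/ctdb_mutex_ceph_rados_helper ceph client.tenant1 ctdb_locks ctdb_lock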
Angelo.
Our downstream QE team recently observed an md5 mismatch of replicated
objects when testing rgw's server-side encryption in multisite. This
corruption is specific to s3 multipart uploads, and only affects the
replicated copy - the original object remains intact. The bug likely
affects Ceph releases all the way back to Luminous where server-side
encryption was first introduced.
To expand on the cause of this corruption: Encryption of multipart
uploads requires special handling around the part boundaries, because
each part is uploaded and encrypted separately. In multisite, objects
are replicated in their encrypted form, and multipart uploads are
replicated as a single part. As a result, the replicated copy loses
its knowledge about the original part boundaries required to decrypt
the data correctly.
We don't have a fix yet, but we're tracking it in
https://tracker.ceph.com/issues/46062. The fix will only modify the
replication logic, so won't repair any objects that have already
replicated incorrectly. We'll need to develop a radosgw-admin command
to search for affected objects and reschedule their replication.
In the meantime, I can only advise multisite users to avoid using
encryption for multipart uploads. If you'd like to scan your cluster
for existing encrypted multipart uploads, you can identify them with an
S3 HeadObject request. The response would include an
x-amz-server-side-encryption header, and the ETag header value (with the
surrounding quotes removed) would be longer than 32 characters (multipart
ETags have the special form "<md5sum>-<num parts>"). Take care not to delete the
corrupted replicas, because an active-active multisite configuration
would go on to delete the original copy.
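For example, with the AWS CLI pointed at your RGW endpoint (the endpoint,
bucket and key below are placeholders), an encrypted multipart upload (the
kind at risk here) would look like this:

aws --endpoint-url http://rgw.example.com:8080 s3api head-object --bucket mybucket --key mykey
# look for both of these in the response:
#   "ServerSideEncryption": "AES256"   (or an SSE-KMS / SSE-C equivalent)
#   an ETag of the form "<md5sum>-<num parts>", i.e. longer than 32 characters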
Hi folks!
I have a production Ceph 17.2.6 cluster with six machines in it: four
newer, faster machines with 4x3.84TB NVMe drives each, and two with
24x1.68TB SAS disks each.
I know I should have done something smart with the CRUSH maps for this
up front, but until now I have shied away from CRUSH maps as they sound
really complex.
Right now my cluster's performance, especially write performance, is not
what it needs to be, and I am looking for advice:
1. How should I be structuring my CRUSH map, and why?
2. How does one actually edit and manage a CRUSH map? What /commands/
does one use? This isn't clear at all in the documentation (the workflows
I've found so far are sketched below, after these questions). Are there
any GUI tools out there for managing CRUSH?
3. Is this going to impact production performance or availability while
I'm configuring it? I have tens of thousands of users relying on this
thing, so I can't take any risks.
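To show where I've got to so far: from my reading of the docs, the two
routes seem to be either editing the raw map with crushtool, or using
per-device-class rules (sketched below with a placeholder pool name). Is
that really how people do it, or is there a higher-level way?

# the raw edit cycle
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt    # decompile to editable text
# ... edit crushmap.txt ...
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new

# or the device-class route, which avoids hand-editing
ceph osd crush rule create-replicated fast-nvme default host nvme
ceph osd pool set <some-pool> crush_rule fast-nvme    # triggers data movement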
Thanks in advance!
--
Regards,
Thorne Lawler - Senior System Administrator
*DDNS* | ABN 76 088 607 265
First registrar certified ISO 27001-2013 Data Security Standard ITGOV40172
P +61 499 449 170
Hi,
We are running a Ceph cluster that is currently on Luminous. At this
point most of our clients are also Luminous, but as we provision new
client hosts we are using more recent client versions (e.g. Octopus,
Pacific and, more recently, Quincy). Is this safe? Is there a known list
of which client versions are compatible with which server versions?
We are only using RBD and are specifying rbd_default_features (the same)
on all server and client hosts.
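For what it's worth, this is how we have been checking what is actually
connecting (assuming these behave the same on Luminous):

# per-release / per-feature summary of all currently connected clients and daemons
ceph features
# the oldest client release the cluster is configured to accept
ceph osd dump | grep min_compat_client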
regards
Mark
Dear ceph community,
As you are aware, cephadm has become the default tool for installing Ceph
on bare-metal systems. Currently, during the bootstrap process of a new
cluster, if the user interrupts the process manually or if there are any
issues causing the bootstrap process to fail, cephadm leaves behind the
failed cluster files and processes on the current host. While this can be
beneficial for debugging and resolving issues related to the cephadm
bootstrap process, it can create difficulties for inexperienced users who
need to delete the faulty cluster and proceed with the Ceph installation.
The problem described in the tracker https://tracker.ceph.com/issues/57016 is
a good example of this issue.
In the cephadm development team, we are considering ways to enhance the user
experience during the bootstrap of a new cluster. We have discussed the
following options:
1) Retain the cluster files without deleting them, but provide the user with
a clear command to remove the broken/faulty cluster.
2) Automatically delete the broken/failed ceph installation and offer an
option for the user to disable this behavior if desired.
Both options have their advantages and disadvantages, which is why we are
seeking your feedback. We would like to know which option you prefer and the
reasoning behind your choice. Please provide reasonable arguments to justify
your preference. Your feedback will be taken into careful consideration when
we work on improving the ceph bootstrap process.
Thank you,
Redouane,
On behalf of cephadm dev team.
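P.S. For context, the manual path that exists today, which is roughly what
option 1 would point users at, looks like this (the fsid is whatever
cephadm reports for the broken cluster):

cephadm ls                                  # find the fsid of the failed bootstrap
cephadm rm-cluster --fsid <fsid> --force    # remove that cluster's daemons and data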