Our downstream QE team recently observed an md5 mismatch of replicated
objects when testing rgw's server-side encryption in multisite. This
corruption is specific to s3 multipart uploads, and only affects the
replicated copy - the original object remains intact. The bug likely
affects Ceph releases all the way back to Luminous where server-side
encryption was first introduced.
To expand on the cause of this corruption: Encryption of multipart
uploads requires special handling around the part boundaries, because
each part is uploaded and encrypted separately. In multisite, objects
are replicated in their encrypted form, and multipart uploads are
replicated as a single part. As a result, the replicated copy loses
its knowledge about the original part boundaries required to decrypt
the data correctly.
We don't have a fix yet, but we're tracking it in
https://tracker.ceph.com/issues/46062. The fix will only modify the
replication logic, so it won't repair any objects that have already
replicated incorrectly. We'll need to develop a radosgw-admin command
to search for affected objects and reschedule their replication.
In the meantime, I can only advise multisite users to avoid using
encryption for multipart uploads. If you'd like to scan your cluster
for existing encrypted multipart uploads, you can identify them with an
S3 HeadObject request. The response would include an
x-amz-server-side-encryption header, and the ETag header value (with
the double quotes removed) would be longer than 32 characters (multipart
ETags are in the special form "<md5sum>-<num parts>"). Take care not to delete the
corrupted replicas, because an active-active multisite configuration
would go on to delete the original copy.
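The HeadObject check described above could be sketched roughly as follows. This is only an illustration: the function name is hypothetical, and it assumes you already have the response headers as a plain dictionary.

```python
import re

# Multipart ETags have the form "<md5sum>-<num parts>", as described above.
MULTIPART_ETAG = re.compile(r"^[0-9a-f]{32}-\d+$", re.IGNORECASE)


def is_encrypted_multipart(headers):
    """Return True if HeadObject headers describe an encrypted multipart upload.

    `headers` is assumed to be a dict of HTTP response headers
    (hypothetical helper, not part of any Ceph or AWS tool).
    """
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    encrypted = "x-amz-server-side-encryption" in lowered
    # Strip the surrounding double quotes from the ETag value.
    etag = lowered.get("etag", "").strip('"')
    multipart = len(etag) > 32 and MULTIPART_ETAG.match(etag) is not None
    return encrypted and multipart
```

With boto3, the raw headers should be available under `response["ResponseMetadata"]["HTTPHeaders"]` of a `head_object` call, but please verify that against your client version before relying on it.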
I have a Ceph production 17.2.6 cluster with 6 machines in it - four
newer, faster machines with 4x3.84TB NVMe drives each, and two with
24x1.68TB SAS disks each.
I know I should have done something smart with the CRUSH maps for this
up front, but until now I have shied away from CRUSH maps as they sound
Right now my cluster's performance, especially write performance, is not
what it needs to be, and I am looking for advice:
1. How should I be structuring my crush map, and why?
2. How does one actually edit and manage a CRUSH map? What /commands/
does one use? This isn't clear at all in the documentation. Are there
any GUI tools out there for managing CRUSH?
3. Is this going to impact production performance or availability while
I'm configuring it? I have tens of thousands of users relying on this
thing, so I can't take any risks.
Thanks in advance!
Thorne Lawler - Senior System Administrator
We are running a ceph cluster that is currently on Luminous. At this
point most of our clients are also Luminous, but as we provision new
client hosts we are using client versions that are more recent (e.g.
Octopus, Pacific and more recently Quincy). Is this safe? Is there a
known list of what client versions are compatible with what server version?
We are only using RBD and are specifying rbd_default_features (the same)
on all server and client hosts.
Dear ceph community,
As you are aware, cephadm has become the default tool for installing Ceph
on bare-metal systems. Currently, during the bootstrap process of a new
cluster, if the user interrupts the process manually or if there are any
issues causing the bootstrap process to fail, cephadm leaves behind the
failed cluster files and processes on the current host. While this can be
beneficial for debugging and resolving issues related to the cephadm
bootstrap process, it can create difficulties for inexperienced users who
need to delete the faulty cluster and proceed with the Ceph installation.
The problem described in the tracker https://tracker.ceph.com/issues/57016 is
a good example of this issue.
In the cephadm development team, we are considering ways to enhance the
user experience during the bootstrap of a new cluster. We have discussed
the following options:
1) Retain the cluster files without deleting them, but provide the user
with a clear command to remove the broken/faulty cluster.
2) Automatically delete the broken/failed ceph installation and offer an
option for the user to disable this behavior if desired.
Both options have their advantages and disadvantages, which is why we are
seeking your feedback. We would like to know which option you prefer and
the reasoning behind your choice. Please provide reasonable arguments to
justify your preference.
Your feedback will be taken into careful consideration when we work on
improving the ceph bootstrap process.
Thank you,
On behalf of cephadm dev team.
What would happen if we set up an RBD mirroring configuration, and in the
target system (the system where the RBD image is mirrored) we create
snapshots of this image? Would that cause some problems?
Also, what happens if we delete the source RBD image? Would that trigger a
deletion in the target system RBD image as well?
Thanks in advance!
Sorry for poking this old thread, but does this issue still persist in
the 6.3 kernels?
Clyso GmbH | https://www.clyso.com
On Wed, Dec 7, 2022 at 3:42 AM William Edwards <wedwards(a)cyberfusion.nl> wrote:
> > On 7 Dec 2022, at 11:59, Stefan Kooman <stefan(a)bit.nl> wrote:
> > On 5/13/22 09:38, Xiubo Li wrote:
> >>> On 5/12/22 12:06 AM, Stefan Kooman wrote:
> >>> Hi List,
> >>> We have quite a few linux kernel clients for CephFS. One of our customers has been running mainline kernels (CentOS 7 elrepo) for the past two years. They started out with 3.x kernels (default CentOS 7), but upgraded to mainline when those kernels would frequently generate MDS warnings like "failing to respond to capability release". That worked fine until 5.14 kernel. 5.14 and up would use a lot of CPU and *way* more bandwidth on CephFS than older kernels (order of magnitude). After the MDS was upgraded from Nautilus to Octopus that behavior is gone (comparable CPU / bandwidth usage as older kernels). However, the newer kernels are now the ones that give "failing to respond to capability release", and worse, clients get evicted (unresponsive as far as the MDS is concerned). Even the latest 5.17 kernels have that. No difference is observed between using messenger v1 or v2. MDS version is 15.2.16.
> >>> Surprisingly the latest stable kernels from CentOS 7 work flawlessly now. Although that is good news, newer operating systems come with newer kernels.
> >>> Does anyone else observe the same behavior with newish kernel clients?
> >> There are some known bugs, which have been fixed or are being fixed recently, even in mainline, and I'm not sure whether they are related. Such as . For more detail please see the ceph-client repo testing branch .
> > None of the issues you mentioned were related. We gained some more experience with newer kernel clients, specifically on Ubuntu Focal / Jammy (5.15). Performance issues seem to arise in certain workloads, specifically load-balanced Apache shared web hosting clusters with CephFS. We have tested linux kernel clients from 5.8 up to and including 6.0 with a production workload and the short summary is:
> > < 5.13, everything works fine
> > 5.13 and up is giving issues
> I see this issue on 6.0.0 as well.
> > We tested 5.13-rc1 as well, and already that kernel is giving issues. So something changed in 5.13 that results in a performance regression in certain workloads. And I wonder if it has something to do with the changes related to fscache that have happened, and are still happening, in the kernel. These web servers might access the same directories / files concurrently.
> > Note: we have quite a few 5.15 kernel clients not doing any (load-balanced) web based workload (container clusters on CephFS) that don't have any performance issue running these kernels.
> > Issue: poor CephFS performance
> > Symptom / result: excessive CephFS network usage (an order of magnitude higher than for older kernels not having this issue). Within a minute there are a bunch of slow web service processes claiming loads of virtual memory, which results in heavy swap usage and basically renders the node unusably slow.
> > Other users that replied to this thread experienced similar symptoms. It is reproducible on both CentOS (EPEL mainline kernels) as well as on Ubuntu (hwe as well as default release kernel).
> > MDS version used: 15.2.16 (with a backported patch from 15.2.17) (single active / standby-replay)
> > Does this ring a bell?
> > Gr. Stefan
> > _______________________________________________
> > ceph-users mailing list -- ceph-users(a)ceph.io
> > To unsubscribe send an email to ceph-users-leave(a)ceph.io
I just restarted one of our MDS servers. I can find some "progress" in the logs:
mds.beacon.icadmin006 Sending beacon up:replay seq 461
mds.beacon.icadmin006 received beacon reply up:replay seq 461 rtt 0
How do I know how long the sequence is (i.e. when the node will be finished
On 29/05/2023 20.55, Anthony D'Atri wrote:
> Check the uptime for the OSDs in question
I restarted all my OSDs within the past 10 days or so. Maybe OSD
restarts are somehow breaking these stats?
>> On May 29, 2023, at 6:44 AM, Hector Martin <marcan(a)marcan.st> wrote:
>> I'm watching a cluster finish a bunch of backfilling, and I noticed that
>> quite often PGs end up with zero misplaced objects, even though they are
>> still backfilling.
>> Right now the cluster is down to 6 backfilling PGs:
>> volumes: 1/1 healthy
>> pools: 6 pools, 268 pgs
>> objects: 18.79M objects, 29 TiB
>> usage: 49 TiB used, 25 TiB / 75 TiB avail
>> pgs: 262 active+clean
>> 6 active+remapped+backfilling
>> But there are no misplaced objects, and the misplaced column in `ceph pg
>> dump` is zero for all PGs.
>> If I do a `ceph pg dump_json`, I can see `num_objects_recovered`
>> increasing for these PGs... but the misplaced count is still 0.
>> Is there something else that would cause recoveries/backfills other than
>> misplaced objects? Or perhaps there is a bug somewhere causing the
>> misplaced object count to be misreported as 0 sometimes?
>> # ceph -v
>> ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy
>> - Hector
we are running a cephfs cluster with the following version:
ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific
Several MDSs are reporting slow requests:
HEALTH_WARN 4 MDSs report slow requests
[WRN] MDS_SLOW_REQUEST: 4 MDSs report slow requests
mds.icadmin011(mds.5): 1 slow requests are blocked > 30 secs
mds.icadmin015(mds.6): 2 slow requests are blocked > 30 secs
mds.icadmin006(mds.4): 8 slow requests are blocked > 30 secs
mds.icadmin007(mds.2): 2 slow requests are blocked > 30 secs
According to Quincy's documentation (
https://docs.ceph.com/en/quincy/cephfs/troubleshooting/), this can be
investigated by issuing:
ceph mds.icadmin011 dump cache /tmp/dump.txt
Unfortunately, this command fails:
no valid command found; 10 closest matches:
pg dump [all|summary|sum|delta|pools|osds|pgs|pgs_brief...]
pg dump_json [all|summary|sum|pools|osds|pgs...]
pg ls-by-pool <poolstr> [<states>...]
pg ls-by-primary <id|osd.id> [<pool:int>] [<states>...]
pg ls-by-osd <id|osd.id> [<pool:int>] [<states>...]
pg ls [<pool:int>] [<states>...]
pg dump_stuck [inactive|unclean|stale|undersized|degraded...]
Error EINVAL: invalid command
I imagine that it is related to the fact that we are running the Pacific
version and not the Quincy version.
When looking at the Pacific documentation (
https://docs.ceph.com/en/pacific/cephfs/health-messages/), I should:
> Use the ops admin socket command to list outstanding metadata operations.
Unfortunately, I fail to really understand what I'm supposed to do. Can
someone give a pointer?
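One possible reading of the "ops admin socket command" hint is to run `ceph daemon mds.icadmin011 dump_ops_in_flight` on the host where that MDS runs and look at the oldest entries. The sketch below only summarises such a dump; it assumes the output is JSON with an "ops" list whose entries carry "age" and "description" fields, which should be checked against the actual output on Pacific. The function name is hypothetical.

```python
def slow_ops(dump, min_age=30.0):
    """List (age, description) pairs for ops older than min_age seconds.

    `dump` is assumed to be the parsed JSON of an MDS dump_ops_in_flight
    call; the "ops", "age" and "description" keys are assumptions.
    """
    ops = dump.get("ops", [])
    slow = [
        (op.get("age", 0.0), op.get("description", ""))
        for op in ops
        if op.get("age", 0.0) > min_age
    ]
    # Oldest operations first, since those are usually the ones blocking others.
    return sorted(slow, reverse=True)
```

The 30 second default mirrors the "blocked > 30 secs" threshold in the health warning; the parsed input could come from `json.loads()` on the command output.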
I'm watching a cluster finish a bunch of backfilling, and I noticed that
quite often PGs end up with zero misplaced objects, even though they are
still backfilling.
Right now the cluster is down to 6 backfilling PGs:
volumes: 1/1 healthy
pools: 6 pools, 268 pgs
objects: 18.79M objects, 29 TiB
usage: 49 TiB used, 25 TiB / 75 TiB avail
pgs: 262 active+clean
6 active+remapped+backfilling
But there are no misplaced objects, and the misplaced column in `ceph pg
dump` is zero for all PGs.
If I do a `ceph pg dump_json`, I can see `num_objects_recovered`
increasing for these PGs... but the misplaced count is still 0.
Is there something else that would cause recoveries/backfills other than
misplaced objects? Or perhaps there is a bug somewhere causing the
misplaced object count to be misreported as 0 sometimes?
# ceph -v
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy
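To make the observation easier to reproduce, something like the following could list the PGs that are backfilling while reporting zero misplaced objects. It is a rough sketch over the parsed output of `ceph pg dump_json`; the helper name is hypothetical, and the assumed layout ("pg_stats" entries with "pgid", "state" and a "stat_sum" section, possibly nested under "pg_map") should be checked against the real output.

```python
def backfilling_without_misplaced(pg_dump):
    """Return (pgid, num_objects_recovered) for backfilling PGs with 0 misplaced.

    `pg_dump` is assumed to be the parsed JSON from `ceph pg dump_json`.
    """
    # Assumption: stats may sit under "pg_map" or directly at the top level.
    stats = pg_dump.get("pg_map", pg_dump).get("pg_stats", [])
    hits = []
    for pg in stats:
        # PG states are "+"-joined flags, e.g. "active+remapped+backfilling".
        flags = pg.get("state", "").split("+")
        summary = pg.get("stat_sum", {})
        if "backfilling" in flags and summary.get("num_objects_misplaced", 0) == 0:
            hits.append((pg["pgid"], summary.get("num_objects_recovered", 0)))
    return hits
```

Running it twice a few seconds apart and comparing the recovered counts would show whether those PGs are making progress despite the zero misplaced count.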