Hi,
I've been seeing relatively large fragmentation numbers on all my OSDs:
ceph daemon osd.13 bluestore allocator score block
{
"fragmentation_rating": 0.77251526920454427
}
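In case anyone wants to compare numbers, this is roughly how I collect the scores per host (a sketch; the admin socket path and naming will differ on containerized deployments):
```
# loop over the OSD admin sockets on this host and print each fragmentation score
# (assumes the default /var/run/ceph/ceph-osd.<id>.asok naming)
for sock in /var/run/ceph/ceph-osd.*.asok; do
    id=$(basename "$sock" .asok | cut -d. -f2)
    echo -n "osd.$id: "
    ceph daemon "osd.$id" bluestore allocator score block | grep fragmentation_rating
done
```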
These aren't that old, as I recreated them all around July last year.
They mostly hold CephFS data with erasure coding, with a mix of large
and small files. The OSDs are at around 80%-85% utilization right now.
Most of the data was written sequentially when the OSDs were created (I
rsynced everything from a remote backup). Since then more data has been
added, but not particularly quickly.
At some point I noticed pathologically slow writes, and I couldn't
figure out what was wrong. Eventually I did some block tracing and
noticed the I/Os were very small, even though on the CephFS side I was just
writing one large file sequentially. That's when I stumbled upon the
free space fragmentation problem. Indeed, deleting some large files
opened up some larger free extents and resolved the problem, but only
until those got filled up and I was back to tiny, fragmented extents. So
effectively I'm stuck at the current utilization, as trying to fill them
up any more just slows down to an absolute crawl.
I'm adding a few more OSDs and plan on doing the dance of removing one
OSD at a time and replacing it with another one to hopefully improve the
situation, but obviously this is going to take forever.
Is there any plan for offering a defrag tool of some sort for bluestore?
- Hector
Dear ceph community,
As you are aware, cephadm has become the default tool for installing Ceph
on bare-metal systems. Currently, during the bootstrap process of a new
cluster, if the user interrupts the process manually or if there are any
issues causing the bootstrap process to fail, cephadm leaves behind the
failed cluster files and processes on the current host. While this can be
beneficial for debugging and resolving issues related to the cephadm
bootstrap process, it can create difficulties for inexperienced users who
need to delete the faulty cluster and proceed with the Ceph installation.
The problem described in the tracker https://tracker.ceph.com/issues/57016 is
a good example of this issue.
In the cephadm development team, we are considering ways to enhance the user
experience during the bootstrap of a new cluster. We have discussed the
following options:
1) Retain the cluster files without deleting them, but provide the user with a
clear command to remove the broken/faulty cluster.
2) Automatically delete the broken/failed Ceph installation and offer an
option for the user to disable this behavior if desired.
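For reference, the manual cleanup that option 1 would point users at already exists today; a rough sketch, where <fsid> is the id reported by the failed bootstrap:
```
# sketch: remove a failed/partial bootstrap by hand; <fsid> comes from the
# bootstrap output or from `cephadm ls`
cephadm rm-cluster --fsid <fsid> --force
# add --zap-osds as well if OSD devices were already prepared
```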
Both options have their advantages and disadvantages, which is why we are
seeking your feedback. We would like to know which option you prefer and the
reasoning behind your choice. Please provide reasonable arguments to justify
your preference.
Your feedback will be taken into careful consideration when we work on
improving the Ceph bootstrap process.
Thank you,
Redouane,
On behalf of cephadm dev team.
I am looking at using an iSCSI gateway in front of a Ceph setup. However,
the warning in the docs is concerning:
The iSCSI gateway is in maintenance as of November 2022. This means that
it is no longer in active development and will not be updated to add new
features.
Does this mean I should be wary of using it, or is it simply that it
does all the stuff it needs to and no further development is needed?
regards
Mark
Dear All,
I'm trying to recover failed MDS metadata by following the link below, but I'm
having trouble. Thanks in advance.
Question 1: How do I scan two data pools with scan_extents (cmd 1)? The
command didn't work with two pools specified. Should I scan one pool and then
the other?
Question 2: For scan_inodes (cmd 2), should I specify only the first data pool,
as the documentation says? I'm concerned that if the second pool is not
scanned, it will cause metadata loss.
My fs name: cephfs; data pools: cephfs_hdd, cephfs_ssd
cmd 1: cephfs-data-scan scan_extents --filesystem cephfs cephfs_hdd
cephfs_ssd
cmd 2: cephfs-data-scan scan_inodes --filesystem cephfs cephfs_hdd
cephfs-data-scan scan_extents [<data pool> [<extra data pool> ...]]
cephfs-data-scan scan_inodes [<data pool>]
cephfs-data-scan scan_links
Note, the data pool parameters for ‘scan_extents’, ‘scan_inodes’ and
‘cleanup’ commands are optional, and usually the tool will be able to
detect the pools automatically. Still you may override this. The
‘scan_extents’ command needs all data pools to be specified, while
‘scan_inodes’ and ‘cleanup’ commands need only the main data pool.
https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/
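For reference, this is the full sequence as I read the quoted text above, applied to my pool names (please correct me if I've got it wrong):
```
# my reading of the doc text: scan_extents takes all data pools,
# scan_inodes (and cleanup) only the main/first data pool -- please verify
cephfs-data-scan scan_extents --filesystem cephfs cephfs_hdd cephfs_ssd
cephfs-data-scan scan_inodes --filesystem cephfs cephfs_hdd
cephfs-data-scan scan_links --filesystem cephfs
cephfs-data-scan cleanup --filesystem cephfs cephfs_hdd
```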
--
Best Regards,
Justin Li
IT Support/Systems Administrator
Justin.Li2030(a)Gmail.com
http://www.linkedin.com/in/justinli7
Dear Ceph folks,
Recently one of our clients approached us with a request for per-user
encryption, i.e. using an individual encryption key for each user to encrypt
their files and objects.
Does anyone know (or have experience with) how to do this with CephFS and Ceph RGW?
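For RGW, one thing we are wondering about is S3 SSE-C, where each user supplies their own key with every request. A rough sketch with the AWS CLI (endpoint, bucket and key file are placeholders; RGW needs SSL, or rgw_crypt_require_ssl=false, to accept such requests):
```
# sketch only: per-user key with S3 SSE-C against RGW (placeholder names)
openssl rand 32 > user1.key                      # the user's own 256-bit key
aws --endpoint-url https://rgw.example.com s3 cp ./report.pdf s3://user1-bucket/report.pdf \
    --sse-c AES256 --sse-c-key fileb://user1.key
# the object can only be read back with the same key:
aws --endpoint-url https://rgw.example.com s3 cp s3://user1-bucket/report.pdf ./report.pdf \
    --sse-c AES256 --sse-c-key fileb://user1.key
```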
Any suggestions or comments are highly appreciated,
best regards,
Samuel
huxiaoyu(a)horebdata.cn
Our downstream QE team recently observed an md5 mismatch of replicated
objects when testing rgw's server-side encryption in multisite. This
corruption is specific to s3 multipart uploads, and only affects the
replicated copy - the original object remains intact. The bug likely
affects Ceph releases all the way back to Luminous where server-side
encryption was first introduced.
To expand on the cause of this corruption: Encryption of multipart
uploads requires special handling around the part boundaries, because
each part is uploaded and encrypted separately. In multisite, objects
are replicated in their encrypted form, and multipart uploads are
replicated as a single part. As a result, the replicated copy loses
its knowledge about the original part boundaries required to decrypt
the data correctly.
We don't have a fix yet, but we're tracking it in
https://tracker.ceph.com/issues/46062. The fix will only modify the
replication logic, so won't repair any objects that have already
replicated incorrectly. We'll need to develop a radosgw-admin command
to search for affected objects and reschedule their replication.
In the meantime, I can only advise multisite users to avoid using
encryption for multipart uploads. If you'd like to scan your cluster
for existing encrypted multipart uploads, you can identify them with a
s3 HeadObject request. The response would include an
x-amz-server-side-encryption header, and the ETag header value (with the
surrounding double quotes removed) would be longer than 32 characters
(multipart ETags are in the special form "<md5sum>-<num parts>"). Take care
not to delete the
corrupted replicas, because an active-active multisite configuration
would go on to delete the original copy.
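For example, a rough per-bucket scan could look something like this (untested sketch; endpoint and bucket names are placeholders, and it issues one HeadObject per key):
```
# untested sketch: flag encrypted multipart objects in one bucket
# note: SSE-C objects report SSECustomerAlgorithm rather than ServerSideEncryption
endpoint=http://rgw.example.com   # placeholder
bucket=mybucket                   # placeholder
aws --endpoint-url "$endpoint" s3api list-objects-v2 --bucket "$bucket" \
    --query 'Contents[].Key' --output text | tr '\t' '\n' |
while read -r key; do
    info=$(aws --endpoint-url "$endpoint" s3api head-object --bucket "$bucket" --key "$key" \
        --query '[ServerSideEncryption,ETag]' --output text) || continue
    enc=$(echo "$info" | cut -f1)
    etag=$(echo "$info" | cut -f2 | tr -d '"')
    # multipart ETags have the form "<md5sum>-<num parts>", i.e. longer than 32 chars
    if [ "$enc" != "None" ] && [ "${#etag}" -gt 32 ]; then
        echo "encrypted multipart: $key"
    fi
done
```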
Hi everyone
I'm new to Ceph; just a French four-day training session with Octopus on
VMs, which convinced me to build my first cluster.
At this time I have 4 old identical nodes for testing, each with 3 HDDs and
2 network interfaces, running AlmaLinux 8 (el8). I tried to replay the
training session but it failed, breaking the web interface because of
some problems with podman 4.2 not being compatible with Octopus.
So I tried to deploy Pacific with the cephadm tool on my first node (mostha1),
which will also let me test an upgrade later.
dnf -y install
https://download.ceph.com/rpm-16.2.13/el8/noarch/cephadm-16.2.13-0.el8.noar…
monip=$(getent ahostsv4 mostha1 |head -n 1| awk '{ print $1 }')
cephadm bootstrap --mon-ip $monip --initial-dashboard-password xxxxx \
--initial-dashboard-user admceph \
--allow-fqdn-hostname --cluster-network 10.1.0.0/16
This was successful.
But running "ceph orch device ls" does not show any HDDs, even though I have
/dev/sda (used by the OS), /dev/sdb and /dev/sdc.
The web interface shows a raw capacity which is an aggregate of the sizes of
the 3 HDDs for the node.
I've also tried to reset /dev/sdb, but cephadm does not see it:
[ceph: root@mostha1 /]# ceph orch device zap
mostha1.legi.grenoble-inp.fr /dev/sdb --force
Error EINVAL: Device path '/dev/sdb' not found on host
'mostha1.legi.grenoble-inp.fr'
On my first attempt with Octopus, I was able to list the available HDDs
with this command line. Before moving to Pacific, the OS on this node
was reinstalled from scratch.
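If it helps, this is what I plan to check next on the node (suggestions welcome if these are the wrong commands):
```
# local checks I plan to try on mostha1 (sketch, may need adjusting)
lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT     # leftover partitions/FS signatures on sdb/sdc?
cephadm ceph-volume inventory                 # what ceph-volume itself reports about the disks
ceph orch device ls --refresh                 # ask the orchestrator to rescan
# and only if the disks really need to be blanked first:
# wipefs --all /dev/sdb
```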
Any advice for a Ceph beginner?
Thanks
Patrick
Hi,
lately, we have had some issues with our MDSs (Ceph version 16.2.10
Pacific).
Some of them are related to the MDS being behind on trimming.
I checked the documentation and found the following information (
https://docs.ceph.com/en/pacific/cephfs/health-messages/):
> CephFS maintains a metadata journal that is divided into *log segments*.
The length of journal (in number of segments) is controlled by the setting
mds_log_max_segments, and when the number of segments exceeds that setting
the MDS starts writing back metadata so that it can remove (trim) the
oldest segments. If this writeback is happening too slowly, or a software
bug is preventing trimming, then this health message may appear. The
threshold for this message to appear is controlled by the config option
mds_log_warn_factor, the default is 2.0.
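As far as I understand, I can see where an MDS stands relative to that threshold with something like this (sketch; mds.<name> stands for the active MDS):
```
# compare the journal segment count against the configured maximum
ceph --cluster floki tell mds.<name> config get mds_log_max_segments
ceph --cluster floki tell mds.<name> perf dump mds_log   # segment/event counters of the journal
ceph --cluster floki health detail                       # shows how far behind on trimming we are
```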
Some resources on the web (https://www.suse.com/support/kb/doc/?id=000019740)
indicated that a solution would be to change the `mds_log_max_segments`.
Which I did:
```
ceph --cluster floki tell mds.* injectargs '--mds_log_max_segments=400000'
```
Of course, the warning disappeared, but I have a feeling that I just hid
the problem. Pushing the value to 400,000 when the default value is 512 is
a lot.
Why is the trimming not taking place? How can I troubleshoot this further?
Best,
Emmanuel