Hi everyone!
My Ceph cluster (17.2.6) has a CephFS volume which is showing 41TB usage
for the data pool, but there are only 5.5TB of files in it. There are
fewer than 100 files on the filesystem in total, so where is all that
space going?
How can I analyze my CephFS to understand what is using that space, and,
if possible, how can I reclaim it?
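I'm happy to gather whatever output would help; so far the only checks I know of are along these lines (assuming a kernel mount at /mnt/cephfs, which is just a placeholder path):

ceph df detail
ceph fs status
rados df
getfattr -n ceph.dir.rbytes /mnt/cephfs    # recursive byte usage as CephFS sees it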
Thank you.
--
Regards,
Thorne Lawler - Senior System Administrator
*DDNS* | ABN 76 088 607 265
First registrar certified ISO 27001-2013 Data Security Standard ITGOV40172
P +61 499 449 170
Hi Cephers,
This is to sound out potential interest in supporting VMware's Photon OS
distribution with Ceph/Cephadm. Photon OS is a lightweight distribution
that is optimized for deploying containers within virtual machines. It is
based on systemd, includes Docker, and uses the tdnf (Tiny DNF) package
manager.
Cephadm already supports Azure Linux/CBL-Mariner
<https://github.com/microsoft/azurelinux>, which is a Microsoft derivative
from the Photon OS project for the Azure ecosystem. Thus, no major changes
would be required.
If you're interested and willing to take part in the deployment beta-test,
please drop us a line!
Thank you!
Kind Regards,
Ernesto
Hi All
We have a nicely functional NVMe cluster running, but in the process of expanding it we have encountered slow ops.
The drill was:
1. Maintenance mode
2. ceph orch osd add
3. Change weights to 1
4. Disable maintenance mode.
Something goes wrong around step 3, where slow ops begin to kick in, but restarting the OSDs underneath can make it go away (tested twice).
In one situation step 2 failed to bring up the OSD, and in that case the entire process worked 100% correctly. Thus, changing the weight from 7.68 to 1 on a live OSD in the above process seems to be the issue.
It was actually a flaw that we managed to get the entire cluster created with weight 1 instead of 7.68 in the first place, thus:
1. Can we bulk-change weights on all existing OSDs without huge data movement? If so, how?
2. Can we ceph orch add, but with a specific initial weight?
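For reference, the only approach we have come up with ourselves so far is roughly the following; this is untested on the live cluster, and osd.12 and the weights are placeholder values, so please correct us if this is wrong:

# 1) bulk reweight, one CRUSH reweight per OSD (e.g. wrapped in a shell loop):
ceph osd crush reweight osd.12 7.68

# 2) make newly added OSDs come in at a specific CRUSH weight instead of
#    their size-derived default, before running the orch add (we assume
#    osd_crush_initial_weight is the relevant option):
ceph config set osd osd_crush_initial_weight 1.0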
Thanks.
Best regards,
Jesper Agerbo Krogh
Director Digitalization
Digitalization
Topsoe A/S
Haldor Topsøes Allé 1
2800 Kgs. Lyngby
Denmark
Phone (direct): 27773240
Hi,
There is no such attribute.
/mnt: ceph.dir.subvolume: No such attribute
I did not have getfattr installed, so I needed to install the attr package.
Could it be that this package was not installed when the fs was created, so
that ceph.dir.subvolume could not be set at creation?
I did not get any warnings at creation, though.
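Would it make sense to just set the flag manually? My guess from the docs is something like the following, though I have not tried it yet (and /mnt/dir is just an example path):

setfattr -n ceph.dir.subvolume -v 1 /mnt/dir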
Thanks for your help!!
On Sat, Mar 16 2024 at 00:53:22 +0530, Neeraj Pratap Singh
<neesingh(a)redhat.com> wrote:
> Can you please run getfattr on the root directory and tell us what the output is?
> Run this command: getfattr -n ceph.dir.subvolume /mnt
>
> On Thu, Mar 14, 2024 at 4:38 PM Marcus <marcus(a)marcux.org> wrote:
>>
>> Hi all,
>> I have just setup a small ceph cluster with ceph fs.
>> The setup is reef 18.2.1 on Debian bookworm.
>> The system is up and running the way it should,
>> though I have a problem with ceph fs snapshots.
>>
>> When I read the doc I should be able to make a
>> snapshot in any directory in the filesystem.
>> I can do a snapshot in the root of the filesystem
>> but if I try somewhere else I get:
>> Operation not permitted
>> This is the same if I do it with mkdir or
>> with ceph fs subvolume snapshot create ...
>>
>> I have created an auth client with rws:
>> [client.snap-mount]
>> key = ****
>> caps mds = "allow rws fsname=gds-common"
>> caps mon = "allow r fsname=gds-common"
>> caps osd = "allow rw tag cephfs data=gds-common"
>>
>> Where the filesystem is called gds-common,
>> saved in a file on the client:
>> /etc/ceph/ceph.client.snap-mount.keyring
>>
>> I mount ceph fs with:
>> mount -t ceph :/ -o name=snap-mount /mnt
>>
>> If I create a snapshot in root, it works fine, as in:
>> mkdir /mnt/.snap/mysnap
>> I also notice that in every subdir there is a "snapshot dir" as well
>> with the name _mysnap_1, as in:
>> /mnt/dir/.snap/_mysnap_1
>>
>> My guess is that this is part of the snapshot system; this "snapshot"
>> disappears when the snapshot is removed with:
>> rmdir /mnt/.snap/mysnap
>>
>> If I try to make a snapshot in another directory this does not work:
>> mkdir /mnt/dir/.snap/othersnap
>> I get the error:
>> cannot create directory ‘/mnt/dir/.snap/othersnap’: Operation not permitted
>>
>> It is the same thing on the commandline, root works:
>> ceph fs subvolume snapshot create gds-common / fromcmd
>>
>> But not in a subdir:
>> ceph fs subvolume snapshot create gds-common /dir dirsnap
>> Error EINVAL: invalid value specified for ceph.dir.subvolume
>>
>> I also notice that when you use any command of type:
>> ceph fs subvolume snapshot ...
>> You get a new directory (volumes) in the root:
>> /mnt/volumes/_legacy/6666cd76f96956469e7be39d750cc7d9.meta
>>
>> I do not know if I am missing something, perhaps some missing
>> config or so.
>>
>> Thanks for your help!!
>>
>> Best regards
>> Marcus
>>
Hello
I found myself in the following situation:
[WRN] PG_AVAILABILITY: Reduced data availability: 3 pgs inactive
pg 4.3d is stuck inactive for 8d, current state
activating+undersized+degraded+remapped, last acting
[4,NONE,46,NONE,10,13,NONE,74]
pg 4.6e is stuck inactive for 9d, current state
activating+undersized+degraded+remapped, last acting
[NONE,27,77,79,55,48,50,NONE]
pg 4.cb is stuck inactive for 8d, current state
activating+undersized+degraded+remapped, last acting
[6,NONE,42,8,60,22,35,45]
I have one cephfs with two backing pools -- one for replicated data, the
other for erasure data. Each pool is mapped to REPLICATED/ vs. ERASURE/
directories on the filesystem.
The above PGs belong to the ERASURE pool (5+3) backing the FS. How
can I get Ceph to recover these three PGs?
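So far the only diagnostics I know to gather are along these lines (the pool name below is just a placeholder for my EC data pool):

ceph pg 4.3d query
ceph pg dump_stuck inactive
ceph osd pool get cephfs-ec-data min_size    # placeholder pool name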
Thank you.
I was just looking at our crush rules as we need to change them from
failure domain host to failure domain datacenter. The replicated ones
seem trivial but what about this one for EC 4+2?
rule rbd_ec_data {
id 0
type erasure
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default class hdd
step chooseleaf indep 0 type host
step emit
}
We already have this crush rule for EC 4+5:
"
rule cephfs.hdd.data {
id 7
type erasure
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default class hdd
step choose indep 0 type datacenter
step chooseleaf indep 3 type host
step emit
}
"
I don't understand the "num" argument for the choose step. The
documentation[1] says:
"
If {num} == 0, choose pool-num-replicas buckets (as many buckets as are
available).
If pool-num-replicas > {num} > 0, choose that many buckets.
If {num} < 0, choose pool-num-replicas - {num} buckets.
"
Num is 0, but the pool is erasure coded rather than replicated, so how does
this translate to picking 3 of 3 datacenters?
I am thinking we should just change 3 to 2 in the chooseleaf line for
the 4+2 rule, since for 4+5 each DC needs 3 shards and for 4+2 each DC
needs 2 shards; a sketch is below. Comments?
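Concretely, this is the rule I have in mind for the 4+2 pool (untested sketch; the rule name and id are just carried over from the existing rule):
"
rule rbd_ec_data {
id 0
type erasure
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default class hdd
step choose indep 0 type datacenter
step chooseleaf indep 2 type host
step emit
}
"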
Mvh.
Torkil
[1] https://docs.ceph.com/en/reef/rados/operations/crush-map-edits/
--
Torkil Svensgaard
Systems Administrator
Danish Research Centre for Magnetic Resonance DRCMR, Section 714
Copenhagen University Hospital Amager and Hvidovre
Kettegaard Allé 30, 2650 Hvidovre, Denmark
Hello everyone,
We've been experiencing repeated slow ops with our kernel client mounts
(Ceph 17.2.7 and version 4 Linux kernels on all clients) on our Quincy
CephFS clusters (one 17.2.6 and another 17.2.7); these seem to
originate from slow ops on OSDs despite the underlying hardware being
fine. Our two clusters are similar and are both Alma8 systems; more
specifically:
* Cluster (1) is Alma8.8 running Ceph version 17.2.7 with 7 NVMe SSD
OSDs storing the metadata and 432 spinning SATA disks storing the
bulk data in an EC pool (8 data shards and 2 parity blocks) across
40 nodes. The whole cluster is used to support a single file system
with 1 active MDS and 2 standby ones.
* Cluster (2) is Alma8.7 running Ceph version 17.2.6 with 4 NVMe SSD
OSDs storing the metadata and 348 spinning SAS disks storing the
bulk data in EC pools (8 data shards and 2 parity blocks). This
cluster houses multiple filesystems each with their own dedicated
MDS along with 3 communal standby ones.
Nearly every day we find that we get the following error messages:
MDS_CLIENT_LATE_RELEASE, MDS_SLOW_METADATA_IO, and MDS_SLOW_REQUEST.
Along with these messages, certain files and directories cannot be stat-ed
and any processes involving these files hang indefinitely. We have been
fixing this by:
1. First, finding the oldest blocked MDS op and the inode listed there:
~$ ceph tell mds.${my_mds} dump_blocked_ops 2>> /dev/null | grep description
"description": "client_request(client.251247219:662 getattr AsLsXsFs #0x100922d1102 2024-03-13T12:51:57.988115+0000 caller_uid=26983, caller_gid=26983)",
# inode/object of interest: 100922d1102
2. Second, finding all the current clients that have a cap for this
blocked inode from the faulty MDS' session list (i.e. ceph tell
mds.${my_mds} session ls --cap-dump) and then examining the client
that has had the cap the longest:
~$ ceph tell mds.${my_mds} session ls --cap-dump ...
2024-03-13T13:01:36: client.251247219
2024-03-13T12:50:28: client.245466949
3. Then on the aforementioned oldest client, get the current ops in
flight to the OSDs (via the "/sys/kernel/debug/ceph/*/osdc" files)
and get the op corresponding to the blocked inode along with the OSD
the I/O is going to:
root@client245466949 $ grep 100922d1102
/sys/kernel/debug/ceph/*/osdc
48366 osd79 2.249f8a51 2.a51s0
[79,351,232,179,107,195,323,14,128,167]/79
[79,351,232,179,107,195,323,14,128,167]/79 e374191
100922d1102.000000f5 0x400024 1 write
# osd causing errors is osd.79
4. Finally, we restart this "hanging" OSD, after which ls and I/O on
the previously "stuck" files no longer hang.
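For completeness, the restart in step 4 is just an ordinary systemd restart of the affected OSD; on our nodes that amounts to something like the following (osd.79 being the example from above):

~$ systemctl restart ceph-osd@79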
Once we have identified the OSD that the blocked inode is waiting on, we can
see in the system logs that the OSD has slow ops:
~$ systemctl --no-pager --full status ceph-osd@79
...
2024-03-13T12:49:37 -1 osd.79 374175 get_health_metrics reporting 3
slow ops, oldest is osd_op(client.245466949.0:41350 2.ca4s0
2.ce648ca4 (undecoded) ondisk+write+known_if_redirected e374173)
...
Files that these "hanging" inodes correspond to, as well as the
directories housing them, can't be opened or stat-ed (causing
directory listings to hang). We've found restarting the OSD with slow ops
to be the least disruptive way of resolving this (compared with a forced
umount and re-mount on the client). There are no issues with the
underlying hardware, either for the OSD reporting these slow ops or for any
other drive within the acting PG, and there seems to be no correlation
between which processes are involved or what type of files these are.
What could be causing these slow ops and certain files and directories
to "hang"? There aren't workflows being performed that generate a large
number of small files, nor are there directories with a large number of
files within them. This happens with a wide range of hard drives, both
SATA and SAS, and our nodes are interconnected with 25 Gb/s NICs, so we
can't see how the underlying hardware would be causing any I/O
bottlenecks. Has anyone else seen this type of behaviour before and have
any ideas? Is there a way to stop these from happening? We are having to
resolve them nearly daily now and can't seem to find a way to reduce
them. We do use snapshots to back up our cluster; we've been doing this
for ~6 months, while these issues have only been occurring on and off
for a couple of months, but much more frequently now.
Kindest regards,
Ivan Clayson
--
Ivan Clayson
-----------------
Scientific Computing Officer
Room 2N249
Structural Studies
MRC Laboratory of Molecular Biology
Francis Crick Ave, Cambridge
CB2 0QH
Hi again, hopefully for the last time with problems.
We had an MDS crash earlier, with the MDS staying in a failed state, and used a command to reset the filesystem (this was wrong, I know now; thanks to Patrick Donnelly for pointing this out). I did a full scrub on the filesystem and two files were damaged. One of those got repaired, but the following file keeps giving errors and can't be removed.
What can I do now? Below some information.
# ceph tell mds.atlassian-prod:0 damage ls
[
{
"damage_type": "backtrace",
"id": 2244444901,
"ino": 1099534008829,
"path": "/app1/shared/data/repositories/11271/objects/41/8f82507a0737c611720ed224bcc8b7a24fda01"
}
]
Trying to repair the error (online research shows this should work for a backtrace damage type)
----------
# ceph tell mds.atlassian-prod:0 scrub start /app1/shared/data/repositories/11271 recursive,repair,force
{
"return_code": 0,
"scrub_tag": "d10ead42-5280-4224-971e-4f3022e79278",
"mode": "asynchronous"
}
Cluster logs after this
----------
1/2/24 9:37:05 AM
[INF]
scrub summary: idle
1/2/24 9:37:02 AM
[INF]
scrub summary: idle+waiting paths [/app1/shared/data/repositories/11271]
1/2/24 9:37:01 AM
[INF]
scrub summary: active paths [/app1/shared/data/repositories/11271]
1/2/24 9:37:01 AM
[INF]
scrub summary: idle+waiting paths [/app1/shared/data/repositories/11271]
1/2/24 9:37:01 AM
[INF]
scrub queued for path: /app1/shared/data/repositories/11271
But the error doesn't disappear and we still can't remove the file.
On the client, trying to remove the file (we have a backup):
----------
$ rm -f /mnt/shared_disk-app1/shared/data/repositories/11271/objects/41/8f82507a0737c611720ed224bcc8b7a24fda01
rm: cannot remove '/mnt/shared_disk-app1/shared/data/repositories/11271/objects/41/8f82507a0737c611720ed224bcc8b7a24fda01': Input/output error
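The only other idea we have at the moment is to clear the damage table entry and rerun the scrub, roughly as below; we haven't tried this yet, so please correct me if it's a bad idea (the id is the one from damage ls above):
# ceph tell mds.atlassian-prod:0 damage rm 2244444901
# ceph tell mds.atlassian-prod:0 scrub start /app1/shared/data/repositories/11271 recursive,repair,force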
Best regards,
Sake
Hi!
As I'm reading through the documentation about subtree pinning, I was wondering if the following is possible.
We've got the following directory structure.
/
/app1
/app2
/app3
/app4
Can I pin /app1 to MDS ranks 0 and 1, the directory /app2 to rank 2, and finally /app3 and /app4 to rank 3?
I would like to load-balance the subfolders of /app1 across 2 (or 3) MDS servers.
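In case it clarifies what I mean, my guess at the commands (based on the export pin documentation; /mnt/cephfs is just an example mount point, and I'm hoping ceph.dir.pin.distributed covers the /app1 case) would be:

setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/app1    # spread /app1's subdirectories over the active ranks
setfattr -n ceph.dir.pin -v 2 /mnt/cephfs/app2
setfattr -n ceph.dir.pin -v 3 /mnt/cephfs/app3
setfattr -n ceph.dir.pin -v 3 /mnt/cephfs/app4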
Best regards,
Sake