Hi,
Is there any Ceph configuration that would block or prevent space reclamation?
I am testing on one pool which contains only one image, with 1.8 TiB in use.
rbd $p du im/root
warning: fast-diff map is not enabled for root. operation may be slow.
NAME PROVISIONED USED
root 2.2 TiB 1.8 TiB
I already removed all snapshots, so the pool now holds only this one image.
I ran fstrim over the filesystem (XFS) and also tried rbd sparsify im/root (I don't know exactly what it does, but it mentions reclaiming something).
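For reference, the reclaim sequence as I understand it is roughly the following (mount point illustrative):

fstrim -v /mnt/root-image    # ask XFS to discard unused blocks down to the RBD layer
rbd sparsify im/root         # deallocate object extents that are entirely zero

As far as I know, rbd sparsify only frees extents that read back as all zeroes, which is why the fstrim should come first.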
The pool still shows 6.9 TiB used, which makes no sense, right? It should be at most 3.6 TiB (1.8 * 2) given its replica count.
POOLS:
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL QUOTA OBJECTS QUOTA BYTES DIRTY USED COMPR UNDER COMPR
im 19 32 3.5 TiB 918.34k 6.9 TiB 4.80 69 TiB N/A 10 TiB 918.34k 0 B 0 B
I think some of our other pools have this issue too; we cleaned up a lot, but the space does not seem to be reclaimed.
I estimate more than 50 TiB should be reclaimable; the actual usage of this cluster is much lower than the currently reported number.
Thank you for your help.
Dear Ceph users,
Our CephFS is not releasing/freeing up space after deleting hundreds of
terabytes of data.
By now, this has driven us into a "nearfull" osd/pool situation and thus
throttles IO.
We are on ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5)
quincy (stable).
Recently, we moved a bunch of data to a new pool with better EC.
This was done by adding a new EC pool to the FS.
Then assigning the FS root to the new EC pool via the directory layout xattr
(so all new data is written to the new pool).
And finally copying old data to new folders.
I swapped the data as follows to retain the old directory structure.
I also made snapshots for validation purposes.
So basically:
cp -r mymount/mydata/ mymount/new/ # this creates copy on new pool
mkdir mymount/mydata/.snap/tovalidate
mkdir mymount/new/mydata/.snap/tovalidate
mv mymount/mydata/ mymount/old/
mv mymount/new/mydata mymount/
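For context, assigning the new pool via the directory layout xattr is typically done like this (pool name illustrative):

setfattr -n ceph.dir.layout.pool -v new_ec_pool /mnt/cephfs
getfattr -n ceph.dir.layout /mnt/cephfs   # verify the layout took effect

New files created below that directory then go to the new pool; existing files keep their old layout, hence the copy.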
I could see the increase of data in the new pool as expected (ceph df).
I compared the snapshots with hashdeep to make sure the new data is alright.
Then I went ahead deleting the old data, basically:
rmdir mymount/old/mydata/.snap/* # this also included a bunch of other
older snapshots
rm -r mymount/old/mydata
At first we had a bunch of PGs with snaptrim/snaptrim_wait.
But they have been done for quite some time now.
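A quick way to confirm that nothing is still trimming (output columns may vary by release) is something like:

ceph pg dump pgs_brief 2>/dev/null | egrep snaptrim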
And now, two weeks later, the size of the old pool still hasn't
really decreased.
I'm still waiting for around 500 TB to be released (and much more is
planned).
I honestly have no clue where to go from here.
From my point of view (i.e. the CephFS mount), the data is gone.
I also never hard/soft-linked it anywhere.
This doesn't seem to be a regular issue.
At least I couldn't find anything related or resolved in the docs or
user list, yet.
If anybody has an idea how to resolve this, I would highly appreciate it.
Best Wishes,
Mathias
Hi all,
For about a week our CephFS has experienced issues with its MDS.
Currently the MDS is stuck in "up:rejoin"
Issues became apparent when simple commands like "mv foo bar/" hung.
I unmounted CephFS on the clients, evicted those remaining, and then issued
ceph config set mds.0 mds_wipe_sessions true
ceph config set mds.1 mds_wipe_sessions true
which allowed me to delete the hung requests.
I've lost the exact commands I used, but something like
rados -p cephfs_metadata ls | grep mds
rados rm -p cephfs_metadata mds0_openfiles.0
etc
This allowed the MDS to get to "up:rejoin", where it has been stuck ever since, going on five days now.
# ceph mds stat
cephfs:1/1 {0=cephfs.ceph00.uvlkrw=up:rejoin} 2 up:standby
root@ceph00:/var/log/ceph/a614303a-5eb5-11ed-b492-011f01e12c9a# ceph -s
cluster:
id: a614303a-5eb5-11ed-b492-011f01e12c9a
health: HEALTH_WARN
1 filesystem is degraded
1 pgs not deep-scrubbed in time
2 pool(s) do not have an application enabled
1 daemons have recently crashed
services:
mon: 3 daemons, quorum ceph00,ceph01,ceph02 (age 57m)
mgr: ceph01.lvdgyr(active, since 2h), standbys: ceph00.gpwpgs
mds: 1/1 daemons up, 2 standby
osd: 91 osds: 90 up (since 78m), 90 in (since 112m)
data:
volumes: 0/1 healthy, 1 recovering
pools: 5 pools, 1539 pgs
objects: 138.83M objects, 485 TiB
usage: 971 TiB used, 348 TiB / 1.3 PiB avail
pgs: 1527 active+clean
12 active+clean+scrubbing+deep
io:
client: 3.1 MiB/s rd, 3.16k op/s rd, 0 op/s wr
# ceph --version
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
I've tried failing the MDS so it switches. Rebooted a couple of times.
I've added more OSDs to the metadata pool and took one out as I thought it might be a bad metadata OSD (The "recently crashed" daemon).
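For reference, failing over an MDS rank can be done by role (a standby should then take over):

ceph mds fail cephfs:0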
The error logs are full of entries like the ones below, each prefixed with:
Nov 27 14:02:44 ceph00 bash[2145]: debug 2023-11-27T14:02:44.619+0000 7f74e845e700 1 -- [v2:192.168.1.128:6800/2157301677,v1:192.168.1.128:6801/2157301677] --> [v2:192.168.1.133:6896/4289132926,v1:192.168.1.133:6897/4289132926]
crc :-1 s=READY pgs=12 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).send_message enqueueing message m=0x559be00adc00 type=42 osd_op(mds.0.36244:8142873 3.ff 3:ff5b34d6:::1.00000000:head [getxattr parent in=6b] snapc 0=[] ondisk+read+known_if_redirected+full_force+supports_pool_eio e32465) v8
crc :-1 s=READY pgs=12 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).write_message sending message m=0x559be00adc00 seq=8142643 osd_op(mds.0.36244:8142873 3.ff 3:ff5b34d6:::1.00000000:head [getxattr parent in=6b] snapc 0=[] ondisk+read+known_if_redirected+full_force+supports_pool_eio e32465) v8
crc :-1 s=THROTTLE_DONE pgs=12 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).handle_message got 154 + 0 + 30 byte message. envelope type=43 src osd.89 off 0
crc :-1 s=READ_MESSAGE_COMPLETE pgs=12 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).handle_message received message m=0x559be01f4480 seq=8142643 from=osd.89 type=43 osd_op_reply(8142873 1.00000000 [getxattr (30) out=30b] v0'0 uv560123 ondisk = 0) v8
osd_op_reply(8142873 1.00000000 [getxattr (30) out=30b] v0'0 uv560123 ondisk = 0) v8 ==== 154+0+30 (crc 0 0 0) 0x559be01f4480 con 0x559be00ad800
osd_op(unknown.0.36244:8142874 3.ff 3:ff5b34d6:::1.00000000:head [getxattr parent in=6b] snapc 0=[] ondisk+read+known_if_redirected+full_force+supports_pool_eio e32465) v8 -- 0x559be2caec00 con 0x559be00ad800
These repeat multiple times a second (and are filling /var).
Prior to taking one of the cephfs_metadata OSDs offline, these came from communications from ceph00 to the node hosting the suspected bad OSD.
Now they are between ceph00 and the host of the replacement metadata OSD.
Does anyone have any suggestion on how to get the MDS to switch from "up:rejoin" to "up:active"?
Is there any way to debug this, to determine what the issue really is? I'm unable to interpret the debug log.
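A way to get more interpretable detail, assuming cluster-wide config access, might be to raise the MDS verbosity temporarily:

ceph config set mds debug_mds 20
ceph config set mds debug_ms 1

(and reset both afterwards, as these settings are very verbose).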
Cheers,
Eric
________________________________________________________
Dr Eric Tittley
Research Computing Officer    www.roe.ac.uk/~ert
Institute for Astronomy Royal Observatory, Edinburgh
Hi, experts,
We are using CephFS 16.2.* with multiple active MDS daemons. Recently we mounted two nodes with ceph-fuse because of their old OS.
One node runs a Python script calling `glob.glob(path)`, while another client runs `cp` on the same path.
We then see logs about `mds slow request`, and the logs complain "failed to authpin, subtree is being exported".
We then need to restart the MDS.
Our question is: is there a deadlock somewhere? How can we avoid this, and how can we fix it without restarting the MDS (which affects other users)?
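One way to inspect the stuck requests without restarting anything, assuming admin access, might be:

ceph tell mds.<name> dump_ops_in_flight   # all in-flight requests
ceph tell mds.<name> dump_blocked_ops     # only the blocked ones

with <name> replaced by the active MDS daemon name.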
Thanks a ton!
xz
Hi,
I have an image with a snapshot and some changes after snapshot.
```
$ rbd du backup/f0408e1e-06b6-437b-a2b5-70e3751d0a26
NAME PROVISIONED USED
f0408e1e-06b6-437b-a2b5-70e3751d0a26@snapshot-eb085877-7557-4620-9c01-c5587b857029 10 GiB 2.4 GiB
f0408e1e-06b6-437b-a2b5-70e3751d0a26 10 GiB 2.4 GiB
<TOTAL> 10 GiB 4.8 GiB
```
If there are no changes after the snapshot, the image line shows 0 B used.
I did export and import.
```
$ rbd export --export-format 2 backup/f0408e1e-06b6-437b-a2b5-70e3751d0a26 - | rbd import --export-format 2 - backup/test
Exporting image: 100% complete...done.
Importing image: 100% complete...done.
```
When checking the imported image, the image line shows 0 B used.
```
$ rbd du backup/test
NAME PROVISIONED USED
test@snapshot-eb085877-7557-4620-9c01-c5587b857029 10 GiB 2.4 GiB
test 10 GiB 0 B
<TOTAL> 10 GiB 2.4 GiB
```
Any clues how that happened? I'd expect the same du as the source.
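A way to cross-check whether the head image really carries data despite the 0 B shown by du (summing allocated extents directly, so not relying on object-map/fast-diff) might be:
```
$ rbd diff backup/test | awk '{sum += $2} END {print sum, "bytes allocated"}'
```
If that roughly matches the source image, the discrepancy would only be in how du attributes the data between snapshot and head.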
I tried another quick test. It works fine.
```
$ rbd create backup/test-src --size 10G
$ sudo rbd map backup/test-src
/dev/rbd0
$ echo "hello" | sudo tee /dev/rbd0
hello
$ rbd du backup/test-src
NAME PROVISIONED USED
test-src 10 GiB 4 MiB
$ rbd snap create backup/test-src@snap-1
Creating snap: 100% complete...done.
$ rbd du backup/test-src
NAME PROVISIONED USED
test-src@snap-1 10 GiB 4 MiB
test-src 10 GiB 0 B
<TOTAL> 10 GiB 4 MiB
$ echo "world" | sudo tee /dev/rbd0
world
$ rbd du backup/test-src
NAME PROVISIONED USED
test-src@snap-1 10 GiB 4 MiB
test-src 10 GiB 4 MiB
<TOTAL> 10 GiB 8 MiB
$ rbd export --export-format 2 backup/test-src - | rbd import --export-format 2 - backup/test-dst
Exporting image: 100% complete...done.
Importing image: 100% complete...done.
$ rbd du backup/test-dst
NAME PROVISIONED USED
test-dst@snap-1 10 GiB 4 MiB
test-dst 10 GiB 4 MiB
<TOTAL> 10 GiB 8 MiB
```
Thanks!
Tony
Is anyone already running containerized Ceph on CentOS Stream 9 hosts?
I think there is a pretty big issue here if Ceph images are built on
CentOS but never tested against it.
Hello again guys,
Can you recommend a book that explains best practices with Ceph?
For example, is it okay to have mon, mgr, and osd in the same virtual machine?
What is the recommended architecture according to your experience?
Because by default it is doing this:
                               Cluster Ceph
            +--------------------------+--------------------------+
            |                          |                          |
            |10.0.0.52                 |10.0.0.194                |10.0.0.229
+-----------+-----------+  +-----------+-----------+  +-----------+-----------+
|[node01.jotelulu.space]|  |[node02.jotelulu.space]|  |[node03.jotelulu.space]|
|          OSD          +--+          OSD          +--+          OSD          |
|     Monitor Daemon    |  |     Monitor Daemon    |  |     Monitor Daemon    |
|     Manager Daemon    |  |Manager Daemon(standby)|  |                       |
+-----------------------+  +-----------------------+  +-----------------------+
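For what it's worth, cephadm lets you control daemon placement explicitly, e.g. (host names from the diagram, counts illustrative):

ceph orch apply mon --placement="3 node01 node02 node03"
ceph orch apply mgr --placement="2 node01 node02"

so colocating mon/mgr/OSD on the same hosts is a default, not a requirement.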
--
Regards
*Francisco Arencibia Quesada.*
*DevOps Engineer*
Hi everyone.
Status : Installing a ceph cluster
Version : 17.2.7 Quincy
OS : Debian 11.
Each of my servers has two IP addresses: one public and one private.
When I try to deploy my cluster on a server named server1 (the hostname)
with
cephadm bootstrap --mon-id hostname --mon-ip IP_PRIVATE --cluster-network PRIVATE_SUB
I end up with the private network as the value of
ceph config get mon public_network
So I try to change it with
ceph config set mon public_network PUBLIC_SUB
but with lsof -i | grep -i listen I still see
ceph-mgr 31427 ceph 49u IPv4 119937 0t0 TCP server1-ceph.private.:7150 (LISTEN)
node_expo 31572 nobody 3u IPv6 65495 0t0 TCP *:9100 (LISTEN)
alertmana 31573 nobody 3u IPv6 21377 0t0 TCP *:9094 (LISTEN)
alertmana 31573 nobody 8u IPv6 136298 0t0 TCP *:9093 (LISTEN)
prometheu 31757 nobody 7u IPv6 109680 0t0 TCP *:9095 (LISTEN)
grafana 31758 node-exporter 11u IPv6 100726 0t0 TCP *:3000 (LISTEN)
ceph-mon 31850 ceph 27u IPv4 139664 0t0 TCP server1-ceph.private.:3300 (LISTEN)
ceph-mon 31850 ceph 28u IPv4 139665 0t0 TCP server1-ceph.private.:6789 (LISTEN)
So the ceph-mon listens on the private interface.
Is this normal? Because according to
https://access.redhat.com/documentation/fr-fr/red_hat_ceph_storage/5/html/c…
only the OSDs should listen on the private network.
Is there any way to configure both public_network and cluster_network
with cephadm bootstrap?
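One approach that might work, keeping your own subnet placeholders, is to hand both networks to bootstrap via an initial config file:

# initial-ceph.conf
[global]
public_network = PUBLIC_SUB
cluster_network = PRIVATE_SUB

cephadm bootstrap --mon-ip IP_PUBLIC --config initial-ceph.conf

Note that changing public_network afterwards only takes effect for newly (re)deployed monitors, which may be why the running mon still listens on the private address.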
Regards.
--
Albert SHIH 🦫 🐸
France
Heure locale/Local time:
jeu. 30 nov. 2023 18:27:08 CET
Hi,
After updating from 17.2.6 to 17.2.7 with cephadm, our cluster went into
MDS_DAMAGE state. We had some prior issues with faulty kernel clients
not releasing capabilities, therefore the update might just be a
coincidence.
`ceph tell mds.cephfs:0 damage ls` lists 56 affected files all with
these general details:
{
    "damage_type": "dentry",
    "id": 123456,
    "ino": 1234567890,
    "frag": "*",
    "dname": "some-filename.ext",
    "snap_id": "head",
    "path": "/full/path/to/file"
}
The behaviour upon trying to access file information in the (kernel-
mounted) filesystem is a bit inconsistent. Generally, the first `stat`
call seems to result in "Input/output error", the next call provides all
`stat` data as expected from an undamaged file. The file can be read
with `cat` with full and correct content (verified with backup) once the
stat call succeeds.
Scrubbing the affected subdirectories with `ceph tell mds.cephfs:0 scrub
start /path/to/dir/ recursive,repair,force` does not fix the issue.
Trying to delete the file results in an "Input/output error". If the
stat calls beforehand succeeded, this also crashes the active MDS with
these messages in the system journal:
> Nov 24 14:21:15 iceph-18.servernet ceph-mds[1946861]: mds.0.cache.den(0x10012271195 DisplaySettings.json) newly corrupt dentry to be committed: [dentry #0x1/homes/huser/d3data/transfer/hortkrass/FLIMSIM/2023-04-12-irf-characterization/2-qwp-no-extra-filter-pc-off-tirf-94-tirf-cursor/DisplaySettings.json [1000275c4a0,head] auth (dversion lock) pv=0 v=225 ino=0x10012271197 state=1073741824 | inodepin=1 0x56413e1e2780]
> Nov 24 14:21:15 iceph-18.servernet ceph-mds[1946861]: log_channel(cluster) log [ERR] : MDS abort because newly corrupt dentry to be committed: [dentry #0x1/homes/huser/d3data/transfer/hortkrass/FLIMSIM/2023-04-12-irf-characterization/2-qwp-no-extra-filter-pc-off-tirf-94-tirf-cursor/DisplaySettings.json [1000275c4a0,head] auth (dversion lock) pv=0 v=225 ino=0x10012271197 state=1073741824 | inodepin=1 0x56413e1e2780]
> Nov 24 14:21:15 iceph-18.servernet ceph-eafd0514-3644-11eb-bc6a-3cecef2330fa-mds-cephfs-iceph-18-ujfqnd[1946838]: 2023-11-24T13:21:15.654+0000 7f3fdcde0700 -1 mds.0.cache.den(0x10012271195 DisplaySettings.json) newly corrupt dentry to be committed: [dentry #0x1/homes/huser/d3data/transfer/hortkrass/FLIMSIM/2023-04-12-irf-characterization/2-qwp-no-extra-filter-pc-off-tirf-94-tirf-cursor/DisplaySettings.json [1000275c4a0,head] auth (dversion lock) pv=0 v=225 ino=0x1001>
> Nov 24 14:21:15 iceph-18.servernet ceph-eafd0514-3644-11eb-bc6a-3cecef2330fa-mds-cephfs-iceph-18-ujfqnd[1946838]: 2023-11-24T13:21:15.654+0000 7f3fdcde0700 -1 log_channel(cluster) log [ERR] : MDS abort because newly corrupt dentry to be committed: [dentry #0x1/homes/huser/d3data/transfer/hortkrass/FLIMSIM/2023-04-12-irf-characterization/2-qwp-no-extra-filter-pc-off-tirf-94-tirf-cursor/DisplaySettings.json [1000275c4a0,head] auth (dversion lock) pv=0 v=225 ino=0x10012>
> Nov 24 14:21:15 iceph-18.servernet ceph-eafd0514-3644-11eb-bc6a-3cecef2330fa-mds-cephfs-iceph-18-ujfqnd[1946838]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/mds/MDSRank.cc: In function 'void MDSRank::abort(std::string_view)' thread 7f3fdcde0700 time 2023-11-24T13:21:15.655088+0000
> Nov 24 14:21:15 iceph-18.servernet ceph-mds[1946861]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/mds/MDSRank.cc: In function 'void MDSRank::abort(std::string_view)' thread 7f3fdcde0700 time 2023-11-24T13:21:15.655088+0000
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/mds/MDSRank.cc: 937: ceph_abort_msg("abort() called")
>
> ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
> 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xd7) [0x7f3fe5a1cb03]
> 2: (MDSRank::abort(std::basic_string_view<char, std::char_traits<char> >)+0x7d) [0x5640f2e6fa2d]
> 3: (CDentry::check_corruption(bool)+0x740) [0x5640f30e4820]
> 4: (EMetaBlob::add_primary_dentry(EMetaBlob::dirlump&, CDentry*, CInode*, unsigned char)+0x47) [0x5640f2f41877]
> 5: (EOpen::add_clean_inode(CInode*)+0x121) [0x5640f2f49fc1]
> 6: (Locker::adjust_cap_wanted(Capability*, int, int)+0x426) [0x5640f305e036]
> 7: (Locker::process_request_cap_release(boost::intrusive_ptr<MDRequestImpl>&, client_t, ceph_mds_request_release const&, std::basic_string_view<char, std::char_traits<char> >)+0x599) [0x5640f307f7e9]
> 8: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0xc06) [0x5640f2f2a7c6]
> 9: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x13c) [0x5640f2f2ef6c]
> 10: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x5db) [0x5640f2e7727b]
> 11: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x5c) [0x5640f2e778bc]
> 12: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1bf) [0x5640f2e60c2f]
> 13: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x478) [0x7f3fe5c97ed8]
> 14: (DispatchQueue::entry()+0x50f) [0x7f3fe5c9531f]
> 15: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f3fe5d5f381]
> 16: /lib64/libpthread.so.0(+0x81ca) [0x7f3fe4a0b1ca]
> 17: clone()
Deleting the file with cephfs-shell also gives an Input/output error (5).
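In case it is relevant: the MDS damage table apparently supports dropping individual entries by id (the ids shown by `damage ls` above), though I am not sure whether that is advisable before the dentries are actually repaired, and it only clears the flag rather than fixing anything:

ceph tell mds.cephfs:0 damage rm 123456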
Does anyone have an idea on how to proceed here? I am perfectly fine
with losing the affected files, they can all be easily restored from
backup.
Cheers
Sebastian