Hi,
Perhaps this is a known issue and I was simply too dumb to find it, but
we are having problems with our CephFS metadata pool filling up overnight.
Our cluster has a small SSD pool of around 15TB which hosts our CephFS
metadata pool. Usually, that's more than enough. The normal size of the
pool ranges between 200 and 800GiB (which is quite a lot of fluctuation
already). Yesterday, the pool suddenly filled up entirely, and the only
way to fix it was to add more capacity. I increased the pool size to
18TB by adding more SSDs, which resolved the problem. After a
couple of hours of reshuffling, the pool size finally went back to 230GiB.
But then we had another fill-up tonight to 7.6TiB. Luckily, I had
adjusted the weights so that not all disks could fill up entirely like
last time, so it ended there.
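For anyone who wants to watch for the same symptom, a rough sketch of
the kind of commands involved (the OSD id is a placeholder):

  ceph df detail                  # pool usage, including the CephFS metadata pool
  ceph osd df tree                # per-OSD fill level, to spot single SSDs running full
  ceph osd reweight <osd-id> 0.9  # temporarily lower the weight of an OSD that is filling up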
I wasn't really able to identify the problem yesterday, but under the
more controlled conditions today, I was able to check the MDS logs at
debug_mds=10, and to me it looks like the problem is caused by snapshot
trimming. The logs contain a lot of snapshot-related messages for paths
that haven't been touched in a long time. The messages all look
something like this:
May 31 09:16:48 XXX ceph-mds[2947525]: 2023-05-31T09:16:48.292+0200
7f7ce1bd9700 10 mds.1.cache.ino(0x1000b3c3670) add_client_cap first cap,
joining realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2
snaps={185f=snap(185f 0x10000000000 'monthly_20221201'
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000
'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941
0x10000000000 ...
May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.396+0200
7f0e6a6ca700 10 mds.0.cache | |______ 3 rep [dir
0x100000218fe.101111101* /storage/REDACTED/| ptrwaiter=0 request=0
child=0 frozen=0 subtree=1 replicated=0 dirty=0 waiter=0 authpin=0
tempexporting=0 0x5607759d9600]
May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.452+0200
7f0e6a6ca700 10 mds.0.cache | | |____ 4 rep [dir
0x100000ff904.100111101010* /storage/REDACTED/| ptrwaiter=0 request=0
child=0 frozen=0 subtree=1 importing=0 replicated=0 waiter=0 authpin=0
tempexporting=0 0x56034ed25200]
May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.716+0200
7f0e6becd700 10 mds.0.server set_trace_dist snaprealm
snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2
snaps={185f=snap(185f 0x10000000000 'monthly_20221201'
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000
'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941
0x10000000000 'monthly_20230201'
2023-02-01T00:00:01.854059+0100),19a6=snap(19a6 0x10000000000
'monthly_20230301' 2023-03-01T00:00:01.215197+0100),1a24=snap(1a24
0x10000000000 'monthly_20230401' ...) len=384
May 31 09:25:36 deltaweb055 ceph-mds[3268481]:
2023-05-31T09:25:36.076+0200 7f0e6becd700 10
mds.0.cache.ino(0x10004d74911) remove_client_cap last cap, leaving realm
snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2
snaps={185f=snap(185f 0x10000000000 'monthly_20221201'
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000
'monthly_20230101' ...)
The daily_*, monthly_*, etc. names are those of our regular snapshots.
I posted a larger log file snippet using ceph-post-file with the ID:
da0eb93d-f340-4457-8a3f-434e8ef37d36
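In case anyone wants to reproduce the log capture, the debug level was
raised roughly like this (sketch):

  ceph config set mds debug_mds 10   # verbose MDS logging while the problem is happening
  ceph config rm mds debug_mds       # reset to the default afterwards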
Is it possible that the MDS are trimming old snapshots without taking
care not to fill up the entire metadata pool?
Cheers
Janek
Hi all,
I wanted to call attention to some RGW issues that we've observed on a
Pacific cluster over the past several weeks. The problems relate to versioned
buckets and index entries that can be left behind after transactions complete
abnormally. The scenario is multi-faceted and we're still investigating some of
the details, but I wanted to provide a big-picture summary of what we've found
so far. It looks like most of these issues should be reproducible on versions
before and after Pacific as well. I'll enumerate the individual issues below:
1. PUT requests during reshard of versioned bucket fail with 404 and leave
behind dark data
Tracker: https://tracker.ceph.com/issues/61359
2. When bucket index ops are cancelled, they can leave behind zombie index entries
   The fix for this one was merged a few months ago and did make the v16.2.13
   release, but in our case we had billions of extra index entries by the time
   we had upgraded to the patched version.
Tracker: https://tracker.ceph.com/issues/58673
3. Issuing a delete for a key that already has a delete marker as the current
version leaves behind index entries and OLH objects
Note that the tracker's original description describes the problem a bit
differently, but I've clarified the nature of the issue in a comment.
Tracker: https://tracker.ceph.com/issues/59663
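For anyone wanting to check whether they are affected, a rough way to spot
leftover entries is to compare the bucket stats with the raw index (a sketch
only; the bucket name is a placeholder, and dumping the index of a very large
bucket produces a lot of output):

  radosgw-admin bucket stats --bucket=<bucket> | grep num_objects   # what the bucket thinks it holds
  radosgw-admin bi list --bucket=<bucket> | grep -c '"idx":'        # rough count of raw index entries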
The extra index entries and OLH objects that are left behind by these sorts
of issues are obviously annoying because they unnecessarily consume space,
but we've found that they can also cause severe performance degradation for
bucket listings, lifecycle processing, and other ops, indirectly through
higher OSD latencies.
The reason for the performance impact is that bucket listing calls must
repeatedly perform additional OSD ops until they find the requisite number
of entries to return. The OSD cls method for bucket listing also does its own
internal iteration for the same purpose. Since these entries are invalid, they
are skipped. In the case that we observed, where some of our bucket indexes were
filled with a sea of contiguous leftover entries, the process of continually
iterating over and skipping invalid entries caused enormous read amplification.
I believe that the following tracker is describing symptoms that are related to
the same issue: https://tracker.ceph.com/issues/59164.
Note that this can also cause LC processing to repeatedly fail in cases where
there are enough contiguous invalid entries, since the OSD cls code eventually
gives up and returns an error that isn't handled.
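As a quick check on the lifecycle side, the per-bucket LC status can be
listed with (sketch):

  radosgw-admin lc list    # buckets that never reach a completed status may be hitting this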
The severity of these issues likely varies greatly based upon client behavior.
If anyone has experienced similar problems, we'd love to hear about the nature
of how they've manifested for you so that we can be more confident that we've
plugged all of the holes.
Thanks,
Cory Snyder
11:11 Systems
Dear Community,
I would like to collect your feedback on this issue. This is a followup
from a discussion that started in the RGW refactoring meeting on 31-May-23
(thanks @Krunal Chheda <kchheda3(a)bloomberg.net> for bringing up this
topic!).
Currently persistent notifications are retried indefinitely.
The only limiting mechanism that exists is that all notifications to a
specific topic are stored in one RADOS object (of size 128MB).
Assuming notifications are ~1KB at most, this would give us at least 128K
notifications that can wait in the queue.
When the queue fills up (e.g. the kafka broker is down for 20 minutes while
we are sending ~100 notifications per second), we start sending "slow down"
replies to the client, and in that case the S3 operation is not performed.
This means that, for example, an outage of the kafka system would
eventually cause an outage of our service. Note that this may also be a
result of a misconfiguration of the kafka broker, or decommissioning of a
broker.
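To put rough numbers on it: at ~1KB per notification the 128MB queue object
holds on the order of 128K entries, so at ~100 notifications per second it
fills up in roughly 128,000 / 100 ≈ 1,300 seconds, i.e. a bit over 20 minutes
of broker downtime before clients start seeing "slow down" replies.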
To avoid that, we propose several options:
* use a FIFO instead of a queue. This would allow us to hold more than 128K
messages and so survive longer broker outages and higher message rates;
there should probably still be a limit set on the size of the FIFO
* define a maximum number of retries allowed for a notification
* define a maximum time a notification may stay in the queue before it is
removed
We should probably start by defining these limits as topic attributes,
reflecting our delivery guarantees for that specific destination.
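As a purely illustrative sketch of what that could look like (push-endpoint
and persistent are existing topic attributes; max-retries and
max-time-in-queue are only the proposed, currently non-existent ones, and
all endpoint values are placeholders):

  aws --endpoint-url http://<rgw-host>:8000 sns create-topic --name my-topic \
      --attributes '{"push-endpoint": "kafka://<broker>:9092", "persistent": "true", "max-retries": "5", "max-time-in-queue": "3600"}'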
Will try to capture the results of the discussion in this tracker:
https://tracker.ceph.com/issues/61532
Thanks,
Yuval
Hi,
we are currently running a ceph fs cluster at the following version:
MDS version: ceph version 16.2.10
(45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)
The cluster is composed of 7 active MDSs and 1 standby MDS:
RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS
0 active icadmin012 Reqs: 73 /s 1938k 1880k 85.3k 92.8k
1 active icadmin008 Reqs: 206 /s 2375k 2375k 7081 171k
2 active icadmin007 Reqs: 91 /s 5709k 5256k 149k 299k
3 active icadmin014 Reqs: 93 /s 679k 664k 40.1k 216k
4 active icadmin013 Reqs: 86 /s 3585k 3569k 12.7k 197k
5 active icadmin011 Reqs: 72 /s 225k 221k 8611 164k
6 active icadmin015 Reqs: 87 /s 1682k 1610k 27.9k 274k
POOL TYPE USED AVAIL
cephfs_metadata metadata 8552G 22.3T
cephfs_data data 226T 22.3T
STANDBY MDS
icadmin006
When I restart one of the active MDSs, the standby MDS becomes active and
its state becomes "replay". So far, so good!
However, only one of the other "active" MDSs seems to remain truly active;
request activity drops to zero on all the others:
RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS
0 active icadmin012 Reqs: 0 /s 1938k 1881k 85.3k 9720
1 active icadmin008 Reqs: 0 /s 2375k 2375k 7080 2505
2 active icadmin007 Reqs: 2 /s 5709k 5256k 149k 26.5k
3 active icadmin014 Reqs: 0 /s 679k 664k 40.1k 3259
4 replay icadmin006 801k 801k 1279 0
5 active icadmin011 Reqs: 0 /s 225k 221k 8611 9241
6 active icadmin015 Reqs: 0 /s 1682k 1610k 27.9k 34.8k
POOL TYPE USED AVAIL
cephfs_metadata metadata 8539G 22.8T
cephfs_data data 225T 22.8T
STANDBY MDS
icadmin013
In effect, the cluster becomes almost unavailable until the newly promoted
MDS finishes rejoining the cluster.
Obviously, this defeats the purpose of having 7 MDSs.
Is this the expected behavior?
If not, what configuration items should I check to go back to "normal"
operations?
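For what it's worth, the only related knob I'm aware of is standby-replay,
which I assume would be enabled with something like (sketch, assuming the
filesystem is called "cephfs"):

  ceph fs set cephfs allow_standby_replay true

but I'm not sure whether that would do anything about the other ranks
stalling while the new MDS replays.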
Best,
Emmanuel
I'm in the process of exploring whether it is worthwhile to add RadosGW to
our existing ceph cluster. We've had a few internal requests for
exposing the S3 API for some of our business units; right now we just
use the ceph cluster for VM disk image storage via RBD.
Everything looks pretty straightforward until we hit multitenancy. The
page on multi-tenancy doesn't dive into permission delegation:
https://docs.ceph.com/en/quincy/radosgw/multitenancy/
The end goal I want is to be able to create a single user per tenant
(Business Unit) which will act as their 'administrator', where they can
then do basically whatever they want under their tenant sandbox (though
I don't think we need more advanced cases like creations of roles or
policies, just create/delete their own users, buckets, objects). I was
hopeful this would just work, so I asked on the ceph IRC channel on
OFTC and was told that once I grant a user caps="users=*", they would be
allowed to create users *outside* of their own tenant using the RGW
Admin Ops API, and that I should explore IAM roles.
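To make the scenario concrete, this is roughly the kind of setup I have in
mind (names are made up):

  # create the per-tenant 'administrator' inside its own tenant
  radosgw-admin user create --tenant bu1 --uid bu1-admin --display-name "BU1 admin"
  # grant admin caps; as I understand it, this is what also lets them create users outside the tenant
  radosgw-admin caps add --uid 'bu1$bu1-admin' --caps "users=*;buckets=*"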
I think it would make sense to add a feature, such as a flag that can be
set on a user, to ensure they stay in their "sandbox". I'd assume this
is probably a common use-case.
Anyhow, if it's possible to do today using IAM roles/policies, then
great; unfortunately this is my first time looking at this stuff and
some things are not immediately obvious.
I saw this online about AWS itself and creating a permissions boundary,
but that's for allowing creation of roles within a boundary:
https://www.qloudx.com/delegate-aws-iam-user-and-role-creation-without-givi…
I'm not sure which "Action" is associated with the RGW Admin Ops API's
create-user call for applying a boundary so that the user can only create
users under the same tenant name.
https://docs.ceph.com/en/quincy/radosgw/adminops/#create-user
Any guidance on this would be extremely helpful.
Thanks!
-Brad
Hi,
I'm in the process of upgrading my cluster from 17.2.5 to 17.2.6, but the
following problem already existed when I was still on 17.2.5 everywhere.
I had a major issue in my cluster which I was able to solve with a lot of
your help and even more trial and error. Right now it seems that most of
it is fixed, but I can't rule out that there's still some hidden problem.
The issue I'm asking about here started during that repair.
When I want to orchestrate the cluster, it logs the command but
doesn't do anything, no matter whether I use the Ceph dashboard or "ceph
orch" in a "cephadm shell". I don't get any error message when I try to
deploy new services, redeploy them, etc.; the log only says "scheduled"
and that's it. The same happens when I change placement rules. Usually I
use tags, but since they don't work anymore either, I tried explicit
hosts and unmanaged. No success. The only way I can actually start and
stop containers is via systemctl on the host itself.
When I run "ceph orch ls" or "ceph orch ps" I see services I deployed
for testing being deleted (for weeks now). Ans especially a lot of old
MDS are listed as "error" or "starting". The list doesn't match reality
at all because I had to start them by hand.
I tried "ceph mgr fail" and even a complete shutdown of the whole
cluster with all nodes including all mgs, mds even osd - everything
during a maintenance window. Didn't change anything.
Could you help me? To be honest, I'm still rather new to Ceph, and since I
didn't find anything in the logs that caught my eye, I would be thankful
for hints on how to debug this.
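In case it matters, I assume the way to get more detail out of the cephadm
module would be something like this (sketch; please correct me if there is
a better approach):

  ceph config set mgr mgr/cephadm/log_to_cluster_level debug
  ceph -W cephadm --watch-debug     # watch cephadm activity while re-running a 'ceph orch' command
  ceph log last cephadm             # show recent cephadm events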
Cheers,
Thomas
--
http://www.widhalm.or.at
GnuPG : 6265BAE6 , A84CB603
Threema: H7AV7D33
Telegram, Signal: widhalmt(a)widhalm.or.at
Dear Ceph folks,
Recently one of our clients approached us with a request for per-user encryption, i.e. using an individual encryption key for each user when encrypting their files and objects.
Does anyone know (or have experience with) how to do this with CephFS and Ceph RGW?
Any suggestions or comments are highly appreciated,
best regards,
Samuel
huxiaoyu(a)horebdata.cn
Details of this release are summarized here:
https://tracker.ceph.com/issues/61515#note-1
Release Notes - TBD
Seeking approvals/reviews for:
rados - Neha, Radek, Travis, Ernesto, Adam King (we still have to
merge https://github.com/ceph/ceph/pull/51788 for
the core)
rgw - Casey
fs - Venky
orch - Adam King
rbd - Ilya
krbd - Ilya
upgrade/octopus-x - deprecated
upgrade/pacific-x - known issues, Ilya, Laura?
upgrade/reef-p2p - N/A
clients upgrades - not run yet
powercycle - Brad
ceph-volume - in progress
Please reply to this email with approval and/or trackers of known
issues/PRs to address them.
gibba upgrade was done and will need to be done again this week.
LRC upgrade TBD
TIA
Hi Team,
I'm writing to bring to your attention an issue we have encountered with the "mtime" (modification time) behavior for directories in the Ceph filesystem.
We have noticed that when the mtime of a directory (let's say dir1) is explicitly changed in CephFS, subsequent additions of files or directories within
'dir1' fail to update the directory's mtime as expected.
This behavior appears to be specific to CephFS; we have reproduced the issue on both Quincy and Pacific. The same steps work as expected on ext4, among other filesystems.
Reproduction steps:
1. Create a directory - mkdir dir1
2. Modify mtime using the touch command - touch dir1
3. Create a file or directory inside of 'dir1' - mkdir dir1/dir2
Expected result:
mtime for dir1 should change to the time the file or directory was created in step 3
Actual result:
there was no change to the mtime for 'dir1'
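A minimal shell sketch of the reproduction (paths are illustrative):

  mkdir dir1
  stat -c '%y' dir1    # note the initial mtime
  touch dir1           # explicitly update the directory's mtime
  mkdir dir1/dir2
  stat -c '%y' dir1    # on CephFS the mtime is unchanged; on ext4 it reflects the mkdir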
Note: for more detail, please see the attached logs.
Our questions are:
1. Is this expected behavior for CephFS?
2. If so, can you explain why the directory's behavior is inconsistent depending on whether its mtime has previously been manually updated?
Best Regards,
Sandip Divekar
Component QA Lead SDET.