I was thinking about how a refcounted cas pool would interact with
snapshots and it occurred to me that dropping refs when an object is
deleted may break snapshotted versions of that object. If object A has
a ref to chunk X, is snapshotted, then A is deleted, we'll (currently)
drop the ref to X and remove it. That means the snapshot of A can't be read.
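To make the failure concrete, here is a toy in-memory model of the sequence described above. It is only a sketch: ChunkPool, SourcePool and their methods are hypothetical names for illustration, not RADOS or OSD interfaces.

```python
class ChunkPool:
    """Hypothetical refcounted chunk store (not a real Ceph API)."""

    def __init__(self):
        self.data = {}
        self.refs = {}

    def get_ref(self, chunk_id, payload):
        # Store the chunk on first use and take one reference.
        self.data.setdefault(chunk_id, payload)
        self.refs[chunk_id] = self.refs.get(chunk_id, 0) + 1

    def put_ref(self, chunk_id):
        # Drop one reference; remove the chunk when nobody refers to it.
        self.refs[chunk_id] -= 1
        if self.refs[chunk_id] == 0:
            del self.refs[chunk_id]
            del self.data[chunk_id]

    def read(self, chunk_id):
        return self.data.get(chunk_id)


class SourcePool:
    """Hypothetical source pool: a head per object plus snapshot clones."""

    def __init__(self, chunk_pool):
        self.chunk_pool = chunk_pool
        self.head = {}    # object name -> chunk id
        self.clones = {}  # object name -> chunk ids captured by snapshots

    def write(self, name, chunk_id, payload):
        self.chunk_pool.get_ref(chunk_id, payload)
        self.head[name] = chunk_id

    def snapshot(self, name):
        # The clone records the chunk id but takes no reference of its own.
        self.clones.setdefault(name, []).append(self.head[name])

    def delete_head(self, name):
        # Current behaviour as described: deleting the head drops the ref.
        self.chunk_pool.put_ref(self.head.pop(name))

    def read_clone(self, name, index):
        return self.chunk_pool.read(self.clones[name][index])


chunks = ChunkPool()
src = SourcePool(chunks)
src.write("A", "X", b"payload")
src.snapshot("A")
src.delete_head("A")
print(src.read_clone("A", 0))  # the clone still points at X, but X is gone
```

The clone holds a chunk id without holding a reference, so the chunk's lifetime is tied to the head alone.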
One way to get around that would be to mirror snaps from the source pool
to the chunk pool--this is how cache tiering works. The problem I see
there is that I'd hoped to allow multiple pools to share/consume the same
chunk pool, but each pool has its own snapid namespace.
Another would be to bake the refs more deeply into the source rados pool
so that the refs are only dropped after all clones also drop the ref.
That is harder to track, though, since I think you'd need to examine all
of the clones to know whether the ref is truly gone. Unless we embed
even more metadata in the SnapSet--something analogous to clone_overlap to
identify the chunks. That seems like it will bloat that structure,
I've been working on a mgr module to push some types of ceph system
activity into the kubernetes events api, so kubernetes users get a more
granular view of what's going on within the ceph cluster.
The code is not complete, but if you're interested it's here -
At this point it does the following:
- ties into the log_monitor2 rados call to pick up audit and healthcheck
- creates an hourly health heartbeat (health + capacity)
- emits events for configuration changes
- host add/remove
- osd add/remove
- pool add/remove
- pool size and min_size changes
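As an illustration of the pool-related items in that list, the sketch below diffs two pool snapshots and produces event-like records. It is not taken from the module: diff_pools, the dictionary layout and the reason strings are assumptions, and nothing here talks to the Ceph mgr or the Kubernetes API.

```python
def diff_pools(old, new):
    """Compare two {pool_name: {"size": int, "min_size": int}} snapshots.

    Returns a list of event-like dicts. The "reason" and "message" keys are
    hypothetical field names chosen for this sketch.
    """
    events = []
    for name in sorted(set(new) - set(old)):
        events.append({"reason": "PoolAdded", "message": f"pool {name} added"})
    for name in sorted(set(old) - set(new)):
        events.append({"reason": "PoolRemoved", "message": f"pool {name} removed"})
    for name in sorted(set(old) & set(new)):
        for attr in ("size", "min_size"):
            before, after = old[name][attr], new[name][attr]
            if before != after:
                events.append({
                    "reason": "PoolChanged",
                    "message": f"pool {name} {attr} changed from {before} to {after}",
                })
    return events


previous = {"rbd": {"size": 3, "min_size": 2}, "tmp": {"size": 2, "min_size": 1}}
current = {"rbd": {"size": 2, "min_size": 1}, "cephfs_data": {"size": 3, "min_size": 2}}
for event in diff_pools(previous, current):
    print(event["reason"], "-", event["message"])
```

Host and OSD add/remove could follow the same pattern, comparing the previous and current sets of names.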
Interested to hear if there are any other ceph 'events' that I should be
While discussing the snap-schedule module, a feature request for protecting
snapshots against deletion came up. This is partly tied to the scheduled
pruning, when a user wants to protect an individual snapshot from being pruned.
It could however also be used to protect individual snapshots from being
One proposed interface (thx Lars) would be to implement a new extended attribute
on snapshot directories. I like this as an interface as it's quite flexible and
could be built upon (by say a mgr module command).
Does snapshot deletion involve the MDS? Otherwise older clients not aware might
still delete them (like quotas depend on client cooperation)
Is a virtual xattr on a virtual directory a bad idea?
Other concerns or better ideas?
Without calling ceph_mount_perms_set(), libcephfs consumers such as
Samba can rely upon UserPerm::uid() and UserPerm::gid() to fall back to
geteuid() and getegid() respectively for things such as ACL enforcement.
However, there is no such fallback for supplementary groups, so ACL
checks for a user which is only permitted path access via a
supplementary group will result in a permission denied error.
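The gap can be modelled with a small sketch. This is a toy, not libcephfs or Samba code: Perms, group_allowed and the fallback_to_process_groups switch are hypothetical, and the process group list is passed in explicitly so the example is deterministic.

```python
import os
from dataclasses import dataclass, field

UNSET = -1  # stand-in for "no uid/gid was configured"


@dataclass
class Perms:
    """Hypothetical stand-in for per-mount credentials."""
    uid: int = UNSET
    gid: int = UNSET
    gids: list = field(default_factory=list)  # supplementary groups

    def effective_uid(self):
        # Falls back to the process's effective uid when unset.
        return self.uid if self.uid != UNSET else os.geteuid()

    def effective_gid(self):
        # Falls back to the process's effective gid when unset.
        return self.gid if self.gid != UNSET else os.getegid()


def group_allowed(required_gid, perms, fallback_to_process_groups, process_groups=None):
    """Toy group check: is required_gid among the caller's groups?"""
    groups = {perms.effective_gid(), *perms.gids}
    if not perms.gids and fallback_to_process_groups:
        # Assumed fallback for this sketch: use the calling process's
        # supplementary groups when none were set explicitly.
        groups.update(os.getgroups() if process_groups is None else process_groups)
    return required_gid in groups


perms = Perms(uid=1000, gid=1000)  # no supplementary groups set
print(group_allowed(2000, perms, False, process_groups=[1000, 2000]))
print(group_allowed(2000, perms, True, process_groups=[1000, 2000]))
```

Without the fallback, a user whose only access to the path is through group 2000 is denied; with it, the check passes.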
Samba ticket: https://bugzilla.samba.org/show_bug.cgi?id=14053
I've written a patch to address this (it currently omits the get_gids()
Does this approach make sense, or should Samba go down the
ceph_mount_perms_set() route to avoid this bug? The latter
would likely be problematic, as user/group details for a mount will
We are seeing a trend towards rather large RGW S3 buckets lately.
We've worked on several clusters with 100 - 500 million objects in a single
bucket, and we've been asked about the possibilities of buckets with several
billion objects more
From our experience: buckets with tens of millions of objects work just fine with
no big problems usually. Buckets with hundreds of millions of objects require some
attention. Buckets with billions of objects? "How about indexless buckets?" -
"No, we need to list them".
A few stories and some questions:
1. The recommended number of objects per shard is 100k. Why? How was this
default configuration derived?
It doesn't really match my experiences. We know a few clusters running with
larger shards because resharding isn't possible for various reasons at the
moment. They sometimes work better than buckets with lots of shards.
So we've been considering at least doubling that 100k target shard size
for large buckets; that would make the following point far less annoying.
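For a sense of scale, the arithmetic behind the shard counts is sketched below. shards_needed is a hypothetical helper, not an RGW function, and it ignores any rounding RGW itself may apply when picking a shard count.

```python
import math


def shards_needed(num_objects, objects_per_shard=100_000):
    """Smallest shard count that keeps every shard at or below the target."""
    return max(1, math.ceil(num_objects / objects_per_shard))


for target in (100_000, 200_000):
    print(target, shards_needed(500_000_000, target))
```

With 500 million objects this gives 5000 shards at the 100k target and 2500 shards at 200k.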
2. Many shards + ordered object listing = lots of IO
Unfortunately telling people to not use ordered listings when they don't really
need them doesn't really work as their software usually just doesn't support
A listing request for X objects will retrieve up to X objects from each shard
for ordering them. That will lead to quite a lot of traffic between the OSDs
and the radosgw instances, even for relatively innocent simple queries as X
defaults to 1000 usually.
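The read amplification can be sketched as a k-way merge over per-shard listings. This is a toy model: the CRC-based shard placement and the function names are assumptions for illustration, not how RGW hashes keys or talks to the index objects.

```python
import heapq
import itertools
import zlib


def build_shards(keys, num_shards):
    """Spread keys over shards with a stand-in hash; each shard stays sorted."""
    shards = [[] for _ in range(num_shards)]
    for key in keys:
        shards[zlib.crc32(key.encode()) % num_shards].append(key)
    for shard in shards:
        shard.sort()
    return shards


def ordered_first_page(shards, page_size):
    """Fetch up to page_size keys from every shard, then merge and truncate."""
    per_shard = [shard[:page_size] for shard in shards]
    fetched = sum(len(part) for part in per_shard)
    page = list(itertools.islice(heapq.merge(*per_shard), page_size))
    return page, fetched


keys = [f"obj{i:06d}" for i in range(20_000)]
page, fetched = ordered_first_page(build_shards(keys, 16), 100)
print(len(page), "keys returned,", fetched, "entries fetched from shards")
```

Any single shard might hold the entire first page, so each one has to be asked for a full page. With 4096 shards and a page size of 1000, that is up to 4096 * 1000 = 4,096,000 index entries read to return 1000.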
Simple example: just getting the first page of a bucket listing with 4096
shards fetches around 1 GB of data from the OSDs to return ~300 kB or so to the
I've got two clusters here that are only used for a relatively low-bandwidth
backup use case. However, there are a few buckets with hundreds of millions
of objects that are sometimes listed by the backup system.
The result is that this cluster has an average read IO of 1-2 GB/s, all going
to the index pool. Not a big deal since that's coming from SSDs and goes over
80 Gbit/s LACP bonds. But it does raise the question of scalability, as the
user-visible load created by the S3 clients is quite low.
3. Deleting large buckets
Someone accidentally put 450 million small objects into a bucket and only noticed
when the cluster ran full. The bucket isn't needed, so just delete it and case
Deleting is unfortunately far slower than adding objects, also
memory during deletion: https://tracker.ceph.com/issues/40700
Increasing --max-concurrent-ios helps with deletion speed (the option does affect
deletion concurrency, although the documentation says it's only for other
specific commands).
Since the deletion is going faster than new data is being added to that cluster
the "solution" was to run the deletion command in a memory-limited cgroup and
restart it automatically after it gets killed due to leaking.
What could bucket deletion look like in the future? Would it be possible
to put all objects in buckets into RADOS namespaces and implement some kind
of efficient namespace deletion on the OSD level similar to how pool deletions
are handled at a lower level?
4. Common prefixes could be filtered in the rgw class on the OSD instead of in radosgw
Consider a bucket with 100 folders with 1000 objects in each and only one shard:
/p1/1, /p1/2, ..., /p1/1000, /p2/1, /p2/2, ..., /p2/1000, ... /p100/1000
Now a user wants to list / with aggregating common prefixes on the delimiter /
and wants up to 1000 results.
So there'll be 100 results returned to the client: the common prefixes p1 to p100.
How much data will be transferred between the OSDs and radosgw for this request?
How many omap entries does the OSD scan?
radosgw will ask the (single) index object to list the first 1000 objects. It'll
return 1000 objects in a quite unhelpful way: /p1/1, /p1/2, ...., /p1/1000
radosgw will discard 999 of these and detect one common prefix and continue the
iteration at /p1/\xFF to skip the remaining entries in /p1/ if there are any.
The OSD will then return everything in /p2/ in that next request and so on.
So it'll internally list every single object in that bucket. That will be a
problem for large buckets and having lots of shards doesn't help either.
This shouldn't be too hard to fix: add an option "aggregate prefixes" to the RGW
class method and duplicate the fast-forward logic from radosgw in cls_rgw. It
doesn't even need to change the response type or anything; it just needs to
limit entries in common prefixes to one result.
Is this a good idea or am I missing something?
IO would be reduced by a factor of 100 for that particular pathological case.
I've unfortunately seen a real-world setup that I think hits a case like that.
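A toy model of the two strategies, for comparison. Everything in it is an assumption made for illustration: the sorted Python list stands in for the omap of the single index object, shard_list stands in for the cls_rgw listing call, and the keys drop the leading slash. The counts it prints are properties of this toy, not measurements of RGW.

```python
import bisect


def build_index(folders=100, per_folder=1000):
    return sorted(f"p{f}/{i}" for f in range(1, folders + 1)
                  for i in range(1, per_folder + 1))


def shard_list(index, marker, limit):
    """Stand-in for the index listing call: up to limit keys after marker."""
    start = bisect.bisect_right(index, marker)
    return index[start:start + limit]


def list_gateway_side(index, delimiter="/", limit=1000):
    """Aggregate prefixes in the gateway, fast-forwarding past each prefix."""
    results, marker, requests, transferred = [], "", 0, 0
    while len(results) < limit:
        batch = shard_list(index, marker, limit)
        requests += 1
        transferred += len(batch)
        if not batch:
            break
        for key in batch:
            head, sep, _ = key.partition(delimiter)
            entry = head + sep if sep else key
            if results and results[-1] == entry:
                continue  # same common prefix as before: discarded
            results.append(entry)
            if len(results) == limit:
                break
        last = results[-1]
        marker = last + "\xff" if last.endswith(delimiter) else last
    return {"results": results, "requests": requests, "transferred": transferred}


def list_osd_side(index, delimiter="/", limit=1000):
    """Same fast-forward logic, run next to the index: one entry per prefix."""
    results, marker = [], ""
    while len(results) < limit:
        start = bisect.bisect_right(index, marker)
        if start == len(index):
            break
        head, sep, _ = index[start].partition(delimiter)
        entry = head + sep if sep else index[start]
        results.append(entry)
        marker = entry + "\xff" if sep else entry
    return {"results": results, "requests": 1, "transferred": len(results)}


index = build_index()
for name, fn in (("gateway", list_gateway_side), ("osd", list_osd_side)):
    out = fn(index)
    print(name, len(out["results"]), out["requests"], out["transferred"])
```

Both variants return the same 100 common prefixes. In the gateway-side variant every object in the bucket crosses the wire once; in the other, only the aggregated entries do.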
On Thu, Jun 27, 2019 at 8:58 PM nokia ceph <nokiacephusers(a)gmail.com> wrote:
> Hi Team,
> We have a requirement to create multiple copies of an object and currently we are handling it in client side to write as separate objects and this causes huge network traffic between client and cluster.
> Is there possibility of cloning an object to multiple copies using librados api?
> Please share the document details if it is feasible.
It may be possible to use an object class to accomplish what you want
to achieve but the more we understand what you are trying to do, the
better the advice we can offer (at the moment your description sounds
like replication which is already part of RADOS as you know).
More on object classes from Cephalocon Barcelona in May this year: