For developers submitting jobs using teuthology, we now have
recommendations on what priority level to use:
https://docs.ceph.com/docs/master/dev/developer_guide/#testing-priority
--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
Hi,
With Pacific (16.2.0) out the door we now have the persistent write-back
cache for RBD:
https://docs.ceph.com/en/latest/rbd/rbd-persistent-write-back-cache/
Has anybody performed any benchmarks with the RBD cache?
I'm mainly interested in QD=1, bs=4k performance.
I don't have any proper hardware available to run benchmarks on yet.
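To make the question concrete, this is roughly the fio run I have in mind
(rbd engine; the pool/image names are placeholders, and the cache would be
enabled per the docs linked above):

    fio --name=pwl-qd1 --ioengine=rbd --clientname=admin --pool=rbd \
        --rbdname=test --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
        --direct=1 --time_based --runtime=60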
Wido
Hi,
I would like to bring some attention to a problem we have been
observing with nautilus, and which I reported here [1].
If a pg is in the backfill_unfound state ("unfound" objects were detected
during backfill) and one of the osds from the active set is restarted, the
state changes to clean, losing the information about the unfound objects.
And when I tried to reproduce the issue on master with the same scenario,
the status did not change, but I observed the primary osd crash after a
non-primary restart.
I looked through the commit log and did not find a commit explicitly
saying (or hinting) that this problem was addressed in master, and I see
there has been a large refactoring in the related code since nautilus. So
probably the issue was "solved" during the refactoring?
We would love to see the problem fixed in nautilus, and I would like to
backport the "fix", but right now I don't have a clear understanding of
whether there really was a fix in master and what to do with that crash
that may be related to the "fix".
I might try to find the commit that changed the behaviour by bisecting,
but that looks like a long road, so I want to ask here first if anybody
has a hint.
[1] https://tracker.ceph.com/issues/50351
Thanks,
--
Mykola Golub
Hi everyone,
You may have noticed some unusual activity in the backport PRs in the
past week, namely force pushes to the base branch and temporary changes
of the same, resulting in humongous changesets/diffstats being shown.
This was done to work around some deficiencies in the release process,
apologies for the inconvenience.
Now that 14.2.20, 15.2.11 and 16.2.1 are out the door, everything has
been restored. Jenkins is currently backed up processing "make check"
and "ceph API tests" jobs, but it should clear up by tomorrow. Please
retrigger with "jenkins test make check", "jenkins test api", etc if
needed.
Some of you have clicked the unhelpful "Update branch" button, which
generated an unneeded merge commit. Please get rid of it, either by
rebasing or simply rolling back to the parent. I went through the PRs
and commented on those that need action, but please double check that
there are no merge commits or unrelated changes in the "Commits" list
before merging.
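For example, with the backport branch checked out locally and assuming
"octopus" is the base branch:

    # roll back to the parent of the merge commit and force push
    git reset --hard HEAD^
    git push --force origin HEAD

    # or rebase onto the base branch (rebase drops merge commits by default)
    git rebase origin/octopus
    git push --force origin HEAD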
Thanks,
Ilya
Hi everyone,
In June 2021, we're hosting a month of Ceph presentations, lightning
talks, and unconference sessions such as BOFs. There is no
registration or cost to attend this event.
The CFP is now open until May 12th.
https://ceph.io/events/ceph-month-june-2021/cfp
Speakers will receive confirmation that their presentation is accepted
and further instructions for scheduling by May 16th.
The schedule will be available on May 19th.
Join the Ceph community as we discuss how Ceph, the massively
scalable, open-source, software-defined storage system, can radically
improve the economics and management of data storage for your
enterprise.
--
Mike Perez
tl;dr: we need to change the MDS infrastructure for fscrypt (again), and
I want to do it in a way that would clean up some existing mess and more
easily allow for future changes. The design is a bit odd though...
Sorry for the long email here, but I needed to communicate this design, and
the rationale for the changes I'm proposing. First, the rationale:
I've been (intermittently) working on the fscrypt implementation for
cephfs, and have posted a few different draft proposals for the first
part of it [1], which rely on a couple of changes in the MDS:
- the alternate_names feature [2]. This is needed to handle extra-long
filenames without allowing unprintable characters in the filename.
- setting an "fscrypted" flag if the inode has an fscrypt context blob
in encryption.ctx xattr [3].
With the filenames part more or less done, the next steps are to plumb
in content encryption. Because the MDS handles truncates, we have to
teach it to align those on fscrypt block boundaries. Rather than foist
those details onto the MDS, the current idea is to add an opaque blob to
the inode that would get updated along with size changes. The client
would be responsible for filling out that field with the actual i_size,
and would always round the existing size field up to the end of the last
crypto block. That keeps the real size opaque to the MDS and the
existing size handling logic should "just work". Regardless, that means
we need another inode field for the size.
Storing the context in an xattr is also proving to be problematic [4].
There are some situations where we can end up with an inode that is
flagged as encrypted but doesn't have the caps to trust its xattrs. We
could just treat "encryption.ctx" as special and not require Xs caps to
read whatever cached value we have, and that might fix that issue, but
I'm not fully convinced that's foolproof. In some cases we might end up
with no cached context at all on a directory that is actually encrypted.
At this point, I'm thinking it might be best to unify all of the
per-inode info into a single field that the MDS would treat as opaque.
Note that the alternate_names feature would remain more or less
untouched since it's associated more with dentries than inodes.
The initial version of this field would look something like this:
struct ceph_fscrypt_context {
        u8 version;                              // == 1
        struct fscrypt_context_v2 fscrypt_ctx;   // 40 bytes
        __le32 blocksize;                        // 4k for now
        __le64 size;                             // "real" i_size
};
The MDS would send this along with any size updates (InodeStat, and
MClientCaps replies). The client would need to send this in cap
flushes/updates, and we'd need to extend the SETATTR op too, so the
client can update this field in truncates (at least).
I don't look forward to having to plumb this into all of the different
client ops that can create inodes though. What I'm thinking we might
want to do is expose this field as the "ceph.fscrypt" vxattr.
The client can stuff that into the xattr blob when creating a new inode,
and the MDS can scrape it out of that and move the data into the correct
field in the inode. A setxattr on this field would update the new field
too. It's an ugly interface, but shouldn't be too bad to handle and we
have some precedent for this sort of thing.
The rules for handling the new field in the client would be a bit weird
though. We'll need to allow reading the fscrypt_ctx part without
any caps (since that should be static once it's set), but the size
handling needs to be under the same caps as the traditional size field
(Is that Fsx? The rules for this are never quite clear to me.)
Would it be better to have two different fields here -- fscrypt_auth and
fscrypt_file? Or maybe, fscrypt_static/_dynamic? We don't necessarily
need to keep all of this info together, but it seemed neater that way.
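For the sake of discussion, a strawman of that split (the layout below is
purely illustrative, nothing is decided) might be:

struct ceph_fscrypt_auth {                       // static once set; readable without caps
        u8 version;                              // == 1
        struct fscrypt_context_v2 fscrypt_ctx;   // 40 bytes
};

struct ceph_fscrypt_file {                       // changes with size; same caps as the size field
        __le32 blocksize;                        // 4k for now
        __le64 size;                             // "real" i_size
};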
Thoughts? Opinions? Is this a horrible idea? What would be better?
Thanks,
--
Jeff Layton <jlayton(a)redhat.com>
[1]: latest draft was posted here:
https://lore.kernel.org/ceph-devel/53d5bebb28c1e0cd354a336a56bf103d5e3a6344…
[2]: https://github.com/ceph/ceph/pull/37297
[3]:
https://github.com/ceph/ceph/commit/7fe1c57846a42443f0258fd877d7166f33fd596f
[4]:
https://lore.kernel.org/ceph-devel/53d5bebb28c1e0cd354a336a56bf103d5e3a6344…
hi folks,
while looking at https://github.com/ceph/ceph/pull/32422, i think a
probably safer approach is to make the monitor more efficient. currently,
the monitor is sort of a single-threaded application. quite a few critical
code paths in the monitor are protected by Monitor::lock, among other things:
- the periodic tasks performed by tick(), which is in turn called by SafeTimer.
the "safety" of the SafeTimer is ensured by Monitor::lock
- Monitor::_ms_dispatch is also called with the Monitor::lock acquired. in
the case of https://github.com/ceph/ceph/pull/32422, one or more kcephfs
clients are even able to slow down the whole cluster by asking for the
latest osdmap with an ancient one in their hands, if the cluster is able
to rebalance/recover quickly and accumulates lots of osdmaps in a short
time.
a typical scary use case is:
1. an all-flash cluster just completed a rebalance/recovery. the rebalance
completed quickly, leaving the cluster with a ton of osdmaps before some of
the clients have had a chance to pick up these updated maps.
2. (kcephfs) clients with ancient osdmaps in their hands wake up randomly,
and they want the latest osdmap!
3. monitors are occupied with loading the maps from rocksdb and encoding
them in very large batches (when discussing with the author of
https://github.com/ceph/ceph/pull/32422, he mentioned that the total size
of the inc osdmaps could be up to 200~300 MiB).
4. and the cluster is basically unresponsive.
so, does it sound like the right way to improve the monitor's performance
under this CPU-intensive workload: dissect the data dependencies in the
monitor and explore the possibility of making it more multi-threaded?
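just to make one piece of that concrete, here is a minimal, self-contained
sketch (not actual ceph code; all names are made up) of a read-mostly cache
of encoded inc maps guarded by its own lock, so that replying to osdmap
requests would not need to hold Monitor::lock or hit rocksdb every time:

#include <cstdint>
#include <map>
#include <optional>
#include <shared_mutex>
#include <string>

class EncodedMapCache {
  std::shared_mutex lock;                 // readers (dispatch threads) share it
  std::map<uint64_t, std::string> cache;  // epoch -> encoded inc map (stand-in for a bufferlist)
  size_t max_bytes = 64 << 20;            // cap the memory, evict oldest epochs
  size_t bytes = 0;

public:
  // called by the thread that commits a new map epoch
  void add(uint64_t epoch, std::string encoded) {
    std::unique_lock l(lock);
    size_t sz = encoded.size();
    cache[epoch] = std::move(encoded);
    bytes += sz;
    while (bytes > max_bytes && !cache.empty()) {
      bytes -= cache.begin()->second.size();
      cache.erase(cache.begin());
    }
  }

  // called by dispatch threads: only a shared lock is taken, so many client
  // requests can be served concurrently without serializing on one big lock
  std::optional<std::string> get(uint64_t epoch) {
    std::shared_lock l(lock);
    auto it = cache.find(epoch);
    if (it == cache.end())
      return std::nullopt;  // fall back to the slow path (rocksdb + encode)
    return it->second;
  }
};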
thoughts?
Hi all,
Just sharing my Sunday morning frustration of checking the build of my ports.
This occurs in ./src/test/encoding/check-generated.sh
In itself this type of problem is of course trivial to solve.
But in this case we use diff to compare the output, so there is
no easy way to fix this:
2 DecayCounter
/tmp/typ-s31EUGoSy /tmp/typ-iLTjVqhpI differ: char 24, line 2
**** DecayCounter test 1 dump_json check failed ****
ceph-dencoder type DecayCounter select_test 1 dump_json > /tmp/typ-s31EUGoSy
ceph-dencoder type DecayCounter select_test 1 encode decode dump_json > /tmp/typ-iLTjVqhpI
2c2
< "value": 2.99990449484967,
---
> "value": 2.9999046414456356,
Probably the easiest is to exclude the test and go on with life as it is.
But the correct way is probably to shorten the representation of the float
when printing, so we end up with '2.9999' or perhaps even shorter, which
would make it '3.000'.
Is that something that is appropriate to do in the dump_json part, if I can
single out DecayCounter as an "exception"?
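A tiny illustration (not the actual ceph-dencoder code) of why trimming the
printed precision makes the comparison stable:

#include <cstdio>

int main() {
  // the two values from the failing diff above
  double before = 2.99990449484967;    // dump_json of the original object
  double after  = 2.9999046414456356;  // dump_json after encode/decode
  std::printf("%.17g vs %.17g -> differ\n", before, after);
  std::printf("%.4f vs %.4f -> identical\n", before, after);  // both print 2.9999
  return 0;
}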
--WjW
Hi Mark,
While trying to figure out a random failure in the mempool tests[0]
introduced when fixing a bug in how mempool selects the shards holding the
byte count of a given pool[1] earlier this year, I was intrigued by this
"cache line ping pong" problem[2]. And I wonder if you have some kind of
benchmark, somewhere in your toolbox, that someone could use to demonstrate
the problem. Maybe such code could be adapted to show the benefit of the
optimization implemented in mempool?
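In case it helps the discussion, this is the kind of minimal sketch I have
in mind (not tied to the mempool code at all): two threads bump two
counters that either share a cache line or sit 64 bytes apart; on most
machines the padded variant is several times faster, which is exactly the
ping-pong effect.

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

// two counters that (almost certainly) share one cache line
struct Packed {
  std::atomic<long> a{0};
  std::atomic<long> b{0};
};

// the same two counters, each on its own 64-byte cache line
struct Padded {
  alignas(64) std::atomic<long> a{0};
  alignas(64) std::atomic<long> b{0};
};

template <typename Counters>
static long long run() {
  Counters c;
  auto bump = [](std::atomic<long>& n) {
    for (long i = 0; i < 100000000; ++i)
      n.fetch_add(1, std::memory_order_relaxed);
  };
  auto start = std::chrono::steady_clock::now();
  std::thread t1(bump, std::ref(c.a));
  std::thread t2(bump, std::ref(c.b));
  t1.join();
  t2.join();
  return std::chrono::duration_cast<std::chrono::milliseconds>(
      std::chrono::steady_clock::now() - start).count();
}

int main() {
  std::printf("shared cache line : %lld ms\n", run<Packed>());
  std::printf("padded 64B apart  : %lld ms\n", run<Padded>());
  return 0;
}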
Cheers
[0] https://tracker.ceph.com/issues/49781#note-9
[1] https://github.com/ceph/ceph/pull/39057/files
[2] https://www.drdobbs.com/parallel/understanding-and-avoiding-memory-issues/2…
--
Loïc Dachary, Artisan Logiciel Libre