Hi everyone,
As a first step to providing QoS in the OSD by default in Quincy, we
have started using the mclock scheduler by default in master with
https://github.com/ceph/ceph/pull/40016. The aim is to get more
testing with it and catch bugs early.
I wouldn't expect it to cause too many issues in test suites other
than rados, but if anybody finds new issues related to things not
finishing in time or hitting timeouts, it may be worth checking if
this change is responsible. It should be a matter of adjusting some
tests.
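If a suite does turn out to be sensitive to the new scheduler, one way to
rule this change out (a rough sketch; if I have the option name right, the
PR above switches the osd_op_queue default from "wpq" to "mclock_scheduler")
is to check what a given OSD is running and temporarily flip a test cluster
back to the old default:

  # check the op queue a running OSD is using (via its admin socket)
  ceph daemon osd.0 config get osd_op_queue

  # revert to the previous default for all OSDs (test clusters only)
  ceph config set osd osd_op_queue wpq

Note that I believe osd_op_queue is only read at OSD start-up, so the OSDs
would need a restart for the change to take effect.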
Cheers,
Neha
tl;dr: we need to change the MDS infrastructure for fscrypt (again), and
I want to do it in a way that would clean up some existing mess and more
easily allow for future changes. The design is a bit odd though...
Sorry for the long email here, but I needed to communicate this design and
the rationale for the changes I'm proposing. First, the rationale:
I've been (intermittently) working on the fscrypt implementation for
cephfs, and have posted a few different draft proposals for the first
part of it [1], which rely on a couple of changes in the MDS:
- the alternate_names feature [2]. This is needed to handle extra-long
filenames without allowing unprintable characters in the filename.
- setting an "fscrypted" flag if the inode has an fscrypt context blob
in the encryption.ctx xattr [3].
With the filenames part more or less done, the next steps are to plumb
in content encryption. Because the MDS handles truncates, we have to
teach it to align those on fscrypt block boundaries. Rather than foist
those details onto the MDS, the current idea is to add an opaque blob to
the inode that would get updated along with size changes. The client
would be responsible for filling out that field with the actual i_size,
and would always round the existing size field up to the end of the last
crypto block. That keeps the real size opaque to the MDS and the
existing size handling logic should "just work". Regardless, that means
we need another inode field for the size.
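To make the rounding concrete, here's a minimal sketch (the helper name and
4k block size are assumptions for illustration, not proposed kernel code):

#include <cstdint>

const uint64_t fscrypt_block_size = 4096;   // assumed fscrypt data unit size

// Size reported in the traditional inode size field: the real i_size
// rounded up to the end of the last crypto block. The real size would
// travel in the opaque per-inode blob instead.
uint64_t mds_visible_size(uint64_t real_i_size)
{
        if (real_i_size == 0)
                return 0;
        return (real_i_size + fscrypt_block_size - 1)
                / fscrypt_block_size * fscrypt_block_size;
}

// e.g. a 5000-byte file would report 8192 to the MDS, while 5000 is kept
// in the fscrypt context.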
Storing the context in an xattr is also proving to be problematic [4].
There are some situations where we can end up with an inode that is
flagged as encrypted but doesn't have the caps to trust its xattrs. We
could just treat "encryption.ctx" as special and not require Xs caps to
read whatever cached value we have, and that might fix that issue, but
I'm not fully convinced that's foolproof. In some cases we might end up with
no cached context at all on a directory that is actually encrypted.
At this point, I'm thinking it might be best to unify all of the
per-inode info into a single field that the MDS would treat as opaque.
Note that the alternate_names feature would remain more or less
untouched since it's associated more with dentries than inodes.
The initial version of this field would look something like this:
struct ceph_fscrypt_context {
        u8 version;                             // == 1
        struct fscrypt_context_v2 fscrypt_ctx;  // 40 bytes
        __le32 blocksize;                       // 4k for now
        __le64 size;                            // "real" i_size
};
The MDS would send this along with any size updates (InodeStat, and
MClientCaps replies). The client would need to send this in cap
flushes/updates, and we'd also need to extend the SETATTR op, so the
client can update this field in truncates (at least).
I don't look forward to having to plumb this into all of the different
client ops that can create inodes though. What I'm thinking we might
want to do is expose this field as the "ceph.fscrypt" vxattr.
The client can stuff that into the xattr blob when creating a new inode,
and the MDS can scrape it out of that and move the data into the correct
field in the inode. A setxattr on this field would update the new field
too. It's an ugly interface, but shouldn't be too bad to handle and we
have some precedent for this sort of thing.
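As a rough usage sketch (the vxattr name and blob layout are just the
proposal above, nothing that exists today), a userspace tool could populate
it at create time or update it later with an ordinary setxattr:

#include <cstddef>
#include <sys/xattr.h>

// Hypothetical helper: write the proposed fscrypt blob into the
// "ceph.fscrypt" vxattr; the MDS would scrape it out of the xattr blob
// and store it in the new inode field.
int set_fscrypt_blob(const char *path, const void *blob, size_t len)
{
        return setxattr(path, "ceph.fscrypt", blob, len, 0);
}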
The rules for handling the new field in the client would be a bit weird
though. We'll need to allow the client to read the fscrypt_ctx part without
any caps (since that should be static once it's set), but the size
handling needs to be under the same caps as the traditional size field
(Is that Fsx? The rules for this are never quite clear to me.)
Would it be better to have two different fields here -- fscrypt_auth and
fscrypt_file? Or maybe, fscrypt_static/_dynamic? We don't necessarily
need to keep all of this info together, but it seemed neater that way.
Thoughts? Opinions? Is this a horrible idea? What would be better?
Thanks,
--
Jeff Layton <jlayton(a)redhat.com>
[1]: latest draft was posted here:
https://lore.kernel.org/ceph-devel/53d5bebb28c1e0cd354a336a56bf103d5e3a6344…
[2]: https://github.com/ceph/ceph/pull/37297
[3]:
https://github.com/ceph/ceph/commit/7fe1c57846a42443f0258fd877d7166f33fd596f
[4]:
https://lore.kernel.org/ceph-devel/53d5bebb28c1e0cd354a336a56bf103d5e3a6344…
Dear Ceph Developers,
On our Ceph S3 storage clusters we have found that recovery/backfill on the
cluster whose S3/RADOS object sizes are between 10KB and 100KB takes much
longer than on our other cluster, whose S3 object sizes are usually a few
tens of MB (with "rgw_obj_stripe_size" of 4MB, so RADOS objects are 4MB or
smaller).
We're exploring ways to improve the recovery speed by keeping the following
factors constant (since tweaking them would lead to other issues):
1. Type of media - this would be HDD as moving all data to SSD would be
prohibitively expensive
2. "osd_max_backfills" - We do not want to increase this as it leads to
blocked requests and interferes with client I/O. We suspect that the disks'
requests per second (RPS) would be saturated if it were increased.
3. PG count - Increasing this would lead to more memory usage beyond
what's available with the OSDs.
I came across the same question posted on this forum a few years back, but
it seems to have no answers. See this
<http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-December/023403.ht…>
and this <https://www.spinics.net/lists/ceph-users/msg43136.html>.
Can the community help me understand what is theoretically causing this
slowness? Is the per-object overhead of recovery (grabbing a lock on the PG,
transaction overhead) so high that any increase in the number of objects
decreases recovery throughput?
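For a rough sense of scale (back-of-the-envelope numbers, not measurements
from our clusters): recovering 1 TiB stored as 4 MiB RADOS objects means
roughly 262 thousand objects, while the same 1 TiB stored as 64 KiB objects
means roughly 16.8 million; if each object carries even ~5 ms of fixed
overhead (PG lock, metadata lookup, transaction commit), that alone is about
22 minutes versus about 23 hours, before any data is actually moved.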
Should I just tweak our workloads to avoid generating small S3/RADOS
objects so that the MTTR for our cluster improves?
Thanks,
Prasad Krishnan
Hi,
This problem also happened in a customer's environment of mine, so I would like to solve it.
To facilitate the discussion, I will restate the problem and the current solution.
(Mykola has already written up the solution idea; my apologies if anything here differs from his idea.)
In master:
Problem: A primary OSD crashes in a situation where it does not need to. (I think this is a bug.)
Solution: Remove the ceph_assert from the code below.
---------------------------------------------------------------
diff --git a/src/osd/PrimaryLogPG.cc b/src/osd/PrimaryLogPG.cc
index 626e8ccefb..12956424bd 100644
--- a/src/osd/PrimaryLogPG.cc
+++ b/src/osd/PrimaryLogPG.cc
@@ -13079,7 +13079,6 @@ void PrimaryLogPG::_clear_recovery_state()
last_backfill_started = hobject_t();
set<hobject_t>::iterator i = backfills_in_flight.begin();
while (i != backfills_in_flight.end()) {
- ceph_assert(recovering.count(*i));
backfills_in_flight.erase(i++);
}
---------------------------------------------------------------
The reason is as follows.
- The above code assumes that every object contained in backfills_in_flight is also contained in recovering.
- However, in the current implementation of on_failed_pull [1], if the OSD is a non-primary, unfound objects remain only in backfills_in_flight (they are unconditionally removed from recovering [2]).
Therefore, the above ceph_assert does not match the current implementation of on_failed_pull.
I think this ceph_assert should be removed, but I would like to hear opinions from the community.
[1]: https://github.com/ceph/ceph/blob/813933f81e3d682a0b1ae6dd906e38e78c4859a4/…
[2]: https://github.com/ceph/ceph/blob/813933f81e3d682a0b1ae6dd906e38e78c4859a4/…
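To make the mismatch concrete, here is a small stand-alone illustration
(plain C++, not Ceph code) of the invariant the ceph_assert encodes and how
it breaks once an unfound object is left only in backfills_in_flight:

#include <cassert>
#include <set>
#include <string>

int main()
{
        // _clear_recovery_state() assumes everything in backfills_in_flight
        // is also in recovering. If on_failed_pull() drops an unfound object
        // from recovering but (on a non-primary) leaves it in
        // backfills_in_flight, the assert would fire and abort the OSD.
        std::set<std::string> recovering          = {"obj_a"};
        std::set<std::string> backfills_in_flight = {"obj_a", "obj_b"}; // "obj_b": unfound, left behind

        for (auto i = backfills_in_flight.begin(); i != backfills_in_flight.end(); ) {
                // assert(recovering.count(*i));  // corresponds to the removed ceph_assert; aborts on "obj_b"
                i = backfills_in_flight.erase(i);
        }
        return 0;
}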
In nautilus:
Problem: the backfill_unfound state is cleared when the OSD is restarted. (This is also a bug.)
This causes a user to mistakenly think the problem has been solved, which can lead to unexpected trouble.
Solution: Keep unfound objects in backfills_in_flight in on_failed_pull, as master does, when the OSD is a non-primary.
There is the following commit [3], but since the scope of its changes is wide, I think only the minimal change necessary to solve the problem should be committed directly to nautilus.
[3]: https://github.com/ceph/ceph/commit/8a8947d2a32d6390cb17099398e7f2212660c9a1
In addition, once this problem is fixed, the primary OSD crash described above can occur, so the master commit also needs to be backported.
I am planning to send PRs next week, so please let me know if anyone in the community has opinions before then.
--
Jin
We are running Ceph Luminous 12.2.13-0 in one of our clusters and we are observing OSDs flapping.
The stack trace we have from the OSDs that went down is given below:
/var/log/ceph/ceph-osd.27.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge
/release/12.2.12/rpm/el7/BUILD/ceph-12.2.12/src/osd/ECTransaction.h: 179: FAILED assert(plan.to_read.count(i.first) == 0 || (!plan.to_read.at(i.first).empty() && !i.second.has_source()))
We have seen a similar bug raised at https://tracker.ceph.com/issues/21756
But the fix was made in the Nautilus version (https://github.com/ceph/ceph/pull/18241/commits/fb50f43244f0a9bc59f9aa4e231…).
Please let us know whether we can backport the same fix to Luminous, or whether there is something else we can do to fix the issue.
Thanks!
Hi all,
Highlights from this week's CLT meeting:
- Can we deprecate ceph-nano?
- Yes; it is no longer maintained. Josh will note as much in the
README and archive this and other old repos in the github org.
- cephadm replaces most use cases since it supports
containerization and deployment onto a single host. Team will explore
supporting OSDs on loopback devices, which is the main differentiator.
- Discussed the Redmine tracker triage process and considered removing
the top-level "Ceph" project from the list of available new ticket
destinations. I and others pushed back on this since we need a place
to put non-subproject-specific issues, so we agreed that going forward
project leads will scrub the top-level Ceph project for new relevant
issues during their regular bug scrubs. I took an action item to go
through the existing backlog and sort it appropriately (though I also
think Sage did a bunch this morning).
- We added a COSI repository in the Ceph org for working with RGW.
- Pacific v16.2.2 needed release notes review, which it got, so the
release is out now.
- We got a question about new options for the dashboard and other
component communication, following on from the CDS sessions. Ernesto
will follow up on this.
-Greg
Hi Lucian,
Could you please take a look at the recent build failures?
The following is an excerpt of the build output.
make[6]: Leaving directory '/mnt/ceph/build.deps/src/curl/docs/libcurl'
make[5]: Leaving directory '/mnt/ceph/build.deps/src/curl/docs/libcurl'
make[4]: Leaving directory '/mnt/ceph/build.deps/src/curl/docs/libcurl'
make[3]: Leaving directory '/mnt/ceph/build.deps/src/curl'
make[2]: Leaving directory '/mnt/ceph/build.deps/src/curl'
make[1]: Leaving directory '/mnt/ceph/build.deps/src/curl'
gzip: stdin: unexpected end of file
tar: Child returned status 1
tar: Error is not recoverable: exiting now
see https://jenkins.ceph.com/job/ceph-dev-new-build/ARCH=x86_64,AVAILABLE_ARCH=…
for the full output.
cheers,