Hi folks,
As many of you are aware, RAM and CPU are particularly scarce on
teuthology.front.sepia.ceph.com, which means that any log viewing of QA
results can cause swapping and general slowness. At today's Ceph
Infrastructure Weekly call [1], I proposed only allowing root,
www-data, and teuthworker access to /teuthology, so that others would
be forced to look at logs on beefier development machines like
senta [2], vossi [3], your personal workstation [4], or even a
temporarily locked node [5].
The current plan is to proceed barring some reasonable justification
not to. There are a few to-do items laid out in the minutes to make it
happen.
Until then, the teuthology admins would appreciate it if you start
moving your log viewing to other machines now, without waiting for a
technical barrier to be put in place.
[1] https://pad.ceph.com/p/ceph-infra-weekly
[2] https://wiki.sepia.ceph.com/doku.php?id=hardware:senta
[3] https://wiki.sepia.ceph.com/doku.php?id=hardware:vossi
[4] https://wiki.sepia.ceph.com/doku.php?id=services:cephfs
[5] https://wiki.sepia.ceph.com/doku.php?id=testnodeaccess
--
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
thanks Chunmei (and cc ceph dev list),
"RGWPutObj uses async_md5 for ETag" is a first draft at hooking this
up for rgw, and the pr description in
https://github.com/ceph/ceph/pull/52488 includes a todo list for
further work
the first step is to run this async hashing in parallel with
filter->process(), which is what applies compression/encryption and
ultimately writes the data to rados. ideally this would mask most of
the latency from md5
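as a rough sketch of that overlap (plain python, not the actual rgw code; `process` stands in for filter->process(), and the function name is mine):

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def put_object(chunks, process):
    """hash each chunk in the background while the data filter
    (compression/encryption/rados write) runs on the same chunk"""
    md5 = hashlib.md5()
    with ThreadPoolExecutor(max_workers=1) as pool:
        for chunk in chunks:
            fut = pool.submit(md5.update, chunk)  # md5 starts asynchronously...
            process(chunk)                        # ...in parallel with the filter
            fut.result()  # join, so updates stay ordered per object
    return md5.hexdigest()
```

if the md5 of each chunk reliably finishes before process() of the same chunk does, the hashing adds (almost) no latency to the request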
the pr adds a single instance of async_md5::Batch that runs on a
strand executor of rgw's thread pool. this means the md5 calculations
can run on any available thread, but will only utilize one core at a
time. we might improve utilization by adding more Batch instances, but
that could reduce the probability that rgw requests can fill those
batches within the "batch timeout". this timeout is a free parameter
that would need tuning
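for illustration, here's a hedged python sketch of the batch/timeout idea (the names and structure are mine, not the real async_md5::Batch): submissions accumulate until the batch fills or the timeout fires; a full batch flushes immediately, while a partial one waits at most the timeout

```python
import hashlib
import threading

class Md5Batch:
    """illustrative sketch only: collect requests until the batch is
    full or the timeout fires, then hash the whole batch at once
    (standing in for the multi-buffer SIMD pass)"""
    def __init__(self, capacity, timeout):
        self.capacity = capacity
        self.timeout = timeout
        self.lock = threading.Lock()
        self.pending = []   # list of (data, done-event, result-slot)
        self.timer = None

    def _flush_locked(self):
        batch, self.pending = self.pending, []
        if self.timer is not None:
            self.timer.cancel()
            self.timer = None
        for data, done, out in batch:           # the real code hashes all
            out.append(hashlib.md5(data).hexdigest())  # lanes in one pass
            done.set()

    def _on_timeout(self):
        with self.lock:
            if self.pending:      # force the partial batch through
                self._flush_locked()

    def submit(self, data):
        done, out = threading.Event(), []
        with self.lock:
            self.pending.append((data, done, out))
            if len(self.pending) >= self.capacity:
                self._flush_locked()     # full batch: no waiting
            elif self.timer is None:     # first entry: bound the latency
                self.timer = threading.Timer(self.timeout, self._on_timeout)
                self.timer.start()
        done.wait()
        return out[0]
```

tuning capacity against timeout is exactly the free parameter above: bigger batches use the simd lanes better, but a sparse workload pays the full timeout in added latency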
i'd like to determine how effective one instance is at offloading the
md5 calculations. if the md5 part (almost) always completes before
filter->process() does, that's probably good enough. if not, we'd want
a way to scale the number of instances based on latency or load
unfortunately, Mark and i are seeing crashes while testing this rgw
integration. i added a comment about this in
https://github.com/ceph/ceph/pull/52385#pullrequestreview-1578597699,
and will follow up there
finally, regarding the base async/batching library in
https://github.com/ceph/ceph/pull/52385, i'd like to explore ways to
extend that to cover other types of hashes and other libraries like
QAT. do you think QAT would work well under the same async/batching
interface?
overall, does this sound like a reasonable design for rgw?
On Mon, Aug 14, 2023 at 5:52 PM Liu, Chunmei <chunmei.liu(a)intel.com> wrote:
>
> Hi Casey,
>
> In your following two PRS:
> https://github.com/ceph/ceph/pull/52385 builds an asynchronous batching library on top of the isa-l crypto library's multi-buffer md5 facilities.
> https://github.com/ceph/ceph/pull/52488 rgw/op: RGWPutObj uses async_md5 for ETag.
> It seems rgw can do async md5 batching. I am wondering what the current work status is. Are all features implemented, or are there other features that need to be implemented in the next step? What can the intel team help with here?
>
> Thanks!
> -Chunmei
>
> > -----Original Message-----
> > From: Cheng, Yingxin <yingxin.cheng(a)intel.com>
> > Sent: Thursday, August 10, 2023 10:32 PM
> > To: Casey Bodley <cbodley(a)redhat.com>; Liu, Chunmei
> > <chunmei.liu(a)intel.com>; Feng, Hualong <hualong.feng(a)intel.com>
> > Cc: Tang, Guifeng <guifeng.tang(a)intel.com>; mbenjami
> > <mbenjami(a)redhat.com>; Mark Kogan <mkogan(a)redhat.com>; Marcus
> > Watts <mwatts(a)redhat.com>; Gabriel BenHanokh <gbenhano(a)redhat.com>
> > Subject: RE: rgw: interest in isa-l_crypto for md5 acceleration
> >
> > Include Chunmei.
> >
> > Regards,
> > -Yingxin
> >
> > > -----Original Message-----
> > > From: Cheng, Yingxin
> > > Sent: Thursday, July 13, 2023 3:54 PM
> > > To: Casey Bodley <cbodley(a)redhat.com>; Feng, Hualong
> > > <hualong.feng(a)intel.com>
> > > Cc: Tang, Guifeng <guifeng.tang(a)intel.com>; mbenjami
> > > <mbenjami(a)redhat.com>; Mark Kogan <mkogan(a)redhat.com>; Marcus
> > Watts
> > > <mwatts(a)redhat.com>; Gabriel BenHanokh <gbenhano(a)redhat.com>
> > > Subject: RE: rgw: interest in isa-l_crypto for md5 acceleration
> > >
> > > Merging another RGW thread here, since it is discussing the same thing.
> > >
> > > I'm also learning, and will try to answer below:
> > >
> > > > But when less than 100% CPU is running, using AVX512 to save core
> > > > may cause
> > > slowdown.
> > >
> > > > well put, this is the part i'm struggling with too. it feels like a
> > > > tradeoff between cpu usage and added latency (both from thread
> > > > synchronization and waiting for full batches)
> > >
> > > Yeah, there are combined effects from both software and hardware:
> > > latency vs batching, CPU downclocking (which should have less impact on
> > > later CPU models such as SPR), sync vs async, etc.
> > >
> > > > I'm not sure if the current work is an informed implementation with
> > > > respect to
> > > the isa-l interface; I'm fairly sure it is with regard to boost::asio
> > > and related topics.
> > >
> > > My understanding is that the asynchronous way has the best opportunity
> > > to fully use the 16-lane AVX512 acceleration by batching the
> > > requests, and falling back to the synchronous way is worth considering
> > > for small sizes or low queue depth. But the decisions need to be based on real
> > test results.
> > >
> > > > Is there anything we can do to spread the load evenly on
> > > SSE/AVX/AVX2/AVX512 units?
> > > > I assume that each physical core has a standalone unit - can we make
> > > > sure to
> > > employ them all in parallel?
> > >
> > > SIMD is a set of CPU instructions rather than an off-loadable device; it can
> > > only be executed from a thread, synchronously. So if the parallelism
> > > exceeds 16 lanes, it should be reasonable to start another worker
> > > thread. And if there are only 8 outstanding lanes at the moment, AVX2
> > > should be a better choice than the heavier AVX512. I'm not sure
> > > yet whether the isa-l library is intelligent enough to pick the right instruction or
> > whether it is manually controlled.
> > >
> > > > This handles only MD5, which certainly matters to us, but I suspect
> > > > we strongly
> > > need acceleration for sha-256 and sha-512.
> > >
> > > It looks like isa-l supports these optimizations, which is the recommended way
> > > because there might be multiple options available at the same time (AVX512
> > and sha-ni).
> > > Looking at the source code, the library is able to detect the
> > > availability and select the best available option.
> > >
> > > Regards,
> > > -Yingxin
> > >
> > > > -----Original Message-----
> > > > From: Casey Bodley <cbodley(a)redhat.com>
> > > > Sent: Thursday, July 13, 2023 3:19 AM
> > > > To: Feng, Hualong <hualong.feng(a)intel.com>
> > > > Cc: Cheng, Yingxin <yingxin.cheng(a)intel.com>; Tang, Guifeng
> > > > <guifeng.tang(a)intel.com>; mbenjami <mbenjami(a)redhat.com>; Mark
> > Kogan
> > > > <mkogan(a)redhat.com>
> > > > Subject: Re: rgw: interest in isa-l_crypto for md5 acceleration
> > > >
> > > > thanks Hualong, (cc Matt and Mark)
> > > >
> > > > On Wed, Jul 12, 2023 at 6:22 AM Feng, Hualong
> > > > <hualong.feng(a)intel.com>
> > > > wrote:
> > > > >
> > > > > Hi Casey
> > > > >
> > > > > Our team is interested in this, but we need to study more details.
> > > > >
> > > > > We have looked at md5 implemented with AVX512 before. We know
> > > > > that
> > > > md5 can only be calculated serially for a single object, and cannot be
> > > > split up for concurrent calculation. For AVX512, the basic
> > > > idea is to divide a core into 16 lanes, and these 16
> > > > lanes can then calculate at the same time. So when we use md5
> > > > implemented with AVX512, we need to wait for multiple request objects to
> > calculate md5 at the same time.
> > > > >
> > > > > So here are two difficulties to consider:
> > > > > 1. When calculating, we need to fill all the lanes on a core as
> > > > > much as
> > > possible.
> > > > Only in this way can the advantages of AVX512/AVX2 be realized.
> > > > > 2. What we have seen so far is the comparison between using
> > > > > AVX512/AVX2
> > > > on a single core and not using it. But when we actually run RGW,
> > > > using AVX512 will only increase the calculation speed of md5 if
> > > > RGW, or the machine it runs on, is already at 100% CPU.
> > > > When less than 100% of the CPU is in use,
> > > > using AVX512 to save cores may cause a slowdown. So how do we know
> > > > under what circumstances we should use AVX instructions in code?
> > > >
> > > > well put, this is the part i'm struggling with too. it feels like a
> > > > tradeoff between cpu usage and added latency (both from thread
> > > > synchronization and waiting for full batches)
> > > >
> > > > if we could track the rate of hash updates per second, we might use
> > > > that to decide whether we're likely to get a full batch within some
> > > > limit of acceptable latency
> > > >
> > > > RGWPutObj could probably mask some of this latency by running these
> > > > asynchronous hashes in parallel while reading the next 4MB chunk
> > > > from the frontend
> > > >
> > > > in https://github.com/ceph/ceph/pull/52385 i introduced the concept
> > > > of a batch_timeout, which can force the processing of a partial batch.
> > > > bounding this latency seemed like a necessary part of the model. if
> > > > RGWPutObj can mask this latency, then we might use that
> > > > batch_timeout to avoid the need to track a global hash rate
> > > >
> > > > >
> > > > > We have implemented a POC before, using QAT to implement the hash
> > > > > algorithm in ceph (SHA256, MD5, SHA1, SHA512, HMACSHA256,
> > > > > HMACSHA1),
> > > >
> > > > very cool, thanks. what is the benefit of QAT here, compared to
> > > > something like isa-l_crypto that just uses AVX instructions? is QAT
> > > > able to
> > > offload some of this?
> > > > would the use of QAT rule out any hardware (like AMD cpus) that
> > > > would otherwise support AVX?
> > > >
> > > > > However, due to md5 security concerns and the lack of a convenient
> > > > alternative framework in ceph, it is temporarily blocked.
> > > > >
> > > > > https://github.com/ceph/ceph/compare/main...hualongfeng:ceph:hash_qat_mode
> > > >
> > > > i'll reach out to our security contact to get some more clarity on
> > > > this stuff. i know that openssl's FIPS certification is important to
> > > > downstream products. but
> > > > md5 isn't used as a cryptographic hash here and etag isn't used for security, so
> > > > i've assumed we could use other md5 implementations there. i'm less
> > > > sure about the SHA family, but i know Matt's interested in using
> > > > those for checksumming in rgw
> > > >
> > > > >
> > > > > Thanks
> > > > > -Hualong
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Casey Bodley <cbodley(a)redhat.com>
> > > > > > Sent: Tuesday, July 11, 2023 11:06 PM
> > > > > > To: Feng, Hualong <hualong.feng(a)intel.com>
> > > > > > Subject: rgw: interest in isa-l_crypto for md5 acceleration
> > > > > >
> > > > > > hey Hualong,
> > > > > >
> > > > > > ceph is already using intel's isa-l_crypto library for crypto
> > > > > > acceleration. i just started looking into its multi-buffer md5
> > > > > > implementation
> > > > > > (https://github.com/intel/isa-l_crypto/blob/master/include/md5_mb.
> > > > > > h) for use in rgw to vectorize our ETag calculations (a feature
> > > > > > tracked in https://tracker.ceph.com/issues/61646). i've started
> > > > > > some initial work in https://github.com/ceph/ceph/pull/52385,
> > > > > > but we'll still need to decide how best to integrate that into
> > > > > > rgw
> > > > > >
> > > > > > would your team be interested in collaborating on this? we'd
> > > > > > love your input on the design for rgw, and how best to measure
> > > > > > and tune its performance
> > > > >
>
I apologize for any inconvenience, but I have encountered a challenging situation while analyzing the Ceph filesystem's behavior during concurrent rename operations across directories. Specifically, my concern revolves around the potential for directory loops when two clients initiate renames simultaneously.
In the Linux VFS, there's a per-filesystem s_vfs_rename_mutex that serializes cross-directory renames. In Ceph, I noticed the presence of a global client lock. However, I'm uncertain whether the MDS serializes rename requests.
Consider the following scenario:
      a
     /  \
    b    c
   /      \
  d        e
 /          \
f            g
If Client 1 attempts to rename "c" to "f" while Client 2 tries to rename "b" to "g" concurrently, and both succeed, we could end up with a loop in the directory structure.
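To make the hazard concrete, here is an illustrative sketch (plain Python, not CephFS code) of the tree above, where each client's loop check passes against the original tree before either rename is applied:

```python
# Parent pointers for the tree above: b,c under a; d under b; e under c;
# f under d; g under e.
parent = {"b": "a", "c": "a", "d": "b", "e": "c", "f": "d", "g": "e"}

def is_ancestor(anc, node):
    """Walk up the parent chain; True if anc is above node."""
    while node in parent:
        node = parent[node]
        if node == anc:
            return True
    return False

def check_rename(src, new_parent):
    """The loop check a filesystem must make before moving a subtree."""
    return src != new_parent and not is_ancestor(src, new_parent)

# Serialized, the second rename is refused: once c lives under d,
# moving b under e would be moving b underneath its own subtree.
# Interleaved, both checks run against the original tree and both pass:
assert check_rename("c", "d")   # client 1: move c into b's subtree
assert check_rename("b", "e")   # client 2: move b into c's subtree
parent["c"], parent["b"] = "d", "e"
# Now c -> d -> b -> e -> c: a directory loop, detached from the root.
```

The sketch only shows why per-rename checks alone are insufficient; presumably the MDS must serialize such renames (or re-validate under a common lock) to rule this out, which is exactly my question.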
Could you please provide clarity on how CephFS handles such situations? Your insights would be invaluable.
We're very happy to announce the first stable release of the Reef series.
We express our gratitude to all members of the Ceph community who
contributed by proposing pull requests, testing this release,
providing feedback, and offering valuable suggestions.
Major Changes from Quincy:
- RADOS: RocksDB has been upgraded to version 7.9.2.
- RADOS: There have been significant improvements to RocksDB iteration
overhead and performance.
- RADOS: The perf dump and perf schema commands have been deprecated
in favor of the new counter dump and counter schema commands.
- RADOS: Cache tiering is now deprecated.
- RADOS: A new feature, the "read balancer", is now available, which
allows users to balance primary PGs per pool on their clusters.
- RGW: Bucket resharding is now supported for multi-site configurations.
- RGW: There have been significant improvements to the stability and
consistency of multi-site replication.
- RGW: Compression is now supported for objects uploaded with
Server-Side Encryption.
- Dashboard: There is a new Dashboard page with improved layout.
Active alerts and some important charts are now displayed inside
cards.
- RBD: Support for layered client-side encryption has been added.
- Telemetry: Users can now opt in to participate in a leaderboard in
the telemetry public dashboards.
We encourage you to read the full release notes at
https://ceph.io/en/news/blog/2023/v18-2-0-reef-released/
Getting Ceph
------------
* Git at git://github.com/ceph/ceph.git
* Tarball at https://download.ceph.com/tarballs/ceph-18.2.0.tar.gz
* Containers at https://quay.io/repository/ceph/ceph
* For packages, see https://docs.ceph.com/docs/master/install/get-packages/
* Release git sha1: 5dd24139a1eada541a3bc16b6941c5dde975e26d
Did you know? Every Ceph release is built and tested on resources
funded directly by the non-profit Ceph Foundation.
If you would like to support this and our other efforts, please
consider joining now https://ceph.io/en/foundation/.
* following up on the recent issue of teuthology's /home directory
filling up, Patrick is testing the use of home directories on cephfs.
Laura, Yuri, and Venky volunteered their home directories for
additional testing
* more admin permissions were granted for core infrastructure to help
resolve lab issues: see
https://pad.ceph.com/p/ceph-infra-contact-sheet
* backport testing for pacific 16.2.14 is nearly complete and qa
validation may start next week
Hi,
Let's do some serious necromancy here.
I just had this exact problem. It turns out that after rebooting all nodes
(one at a time, of course), the monitor could join perfectly.
Why? You tell me. We did not see any trace of the IP address in any
dumps that we could get hold of. I had restarted all ceph-mgr daemons
beforehand as well.
Kind regards,
Josef Johansson
On 10/5/21 15:37, Konstantin Shalygin wrote:
> As a last resort we changed the IP address of this host, and the mon successfully joined the quorum. When we reverted the IP address, the mon couldn't join; we think there is something on the switch side or on the old mons' side. From the old mons I checked connectivity to the new mon process via telnet - everything works.
> It would be good to build a reproducer of this network problem, to find out exactly which message of the ceph protocol is broken.
>
>
>
> k
>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
On Quincy (17.2.6) we are able to use per-host RGW SSL certificates using:
```shell
echo "{{ host_ssl_crt }}" | ceph config-key set rgw/cert/{{ realmname }}/{{ zonename }}/{{ hostname }}.crt -i -
echo "{{ host_ssl_key }}" | ceph config-key set rgw/cert/{{ realmname }}/{{ zonename }}/{{ hostname }}.key -i -
```
Logging on 17.2.6 shows:
```
Aug 14 14:41:01 ceph-dev-gw2 radosgw[1373]: framework conf key: ssl_certificate, val: config://rgw/cert/_ Realmname _/_ Zonename _/ceph-dev-gw2.crt
Aug 14 14:41:01 ceph-dev-gw2 radosgw[1373]: framework conf key: ssl_private_key, val: config://rgw/cert/_ Realmname _/_ Zonename _/ceph-dev-gw2.key
```
On reef (18.2.0) this fails and the logging shows:
```
Aug 15 10:50:17 ceph-dev-gw2 radosgw[178335]: framework conf key: ssl_certificate, val: config://rgw/cert/$realm/$zone.crt
Aug 15 10:50:17 ceph-dev-gw2 radosgw[178335]: framework conf key: ssl_private_key, val: config://rgw/cert/$realm/$zone.key
```
Is this a regression, or should we configure this in another way?
With kind regards,
Jeroen
I'd appreciate if any cmake experts could look into this:
https://tracker.ceph.com/issues/62428
--
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D