Thanks Casey, I understand the design, status, and further work. Give us some time to
discuss the design; we will let you know if we have any ideas.
Thanks!
-Chunmei
-----Original Message-----
From: Casey Bodley <cbodley(a)redhat.com>
Sent: Tuesday, August 15, 2023 9:44 AM
To: Liu, Chunmei <chunmei.liu(a)intel.com>
Cc: Cheng, Yingxin <yingxin.cheng(a)intel.com>; Feng, Hualong
<hualong.feng(a)intel.com>; Tang, Guifeng <guifeng.tang(a)intel.com>;
mbenjami <mbenjami(a)redhat.com>; Mark Kogan <mkogan(a)redhat.com>;
Marcus Watts <mwatts(a)redhat.com>; Gabriel BenHanokh
<gbenhano(a)redhat.com>; dev(a)ceph.io; seenafallah(a)gmail.com
Subject: Re: rgw: interest in isa-l_crypto for md5 acceleration
thanks Chunmei (and cc ceph dev list),
"RGWPutObj uses async_md5 for ETag" is a first draft at hooking this up for
rgw, and the pr description in
https://github.com/ceph/ceph/pull/52488 includes a todo list for further work
the first step is to run this async hashing in parallel with
filter->process(), which is what applies compression/encryption and
ultimately writes the data to rados. ideally this would mask most of the
latency from md5
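a rough sketch of that overlap (hypothetical, not the actual rgw code: hash_chunk() and process_chunk() are made-up stand-ins, and std::async stands in for the boost::asio machinery in the pr):

```cpp
#include <future>
#include <string>

// hypothetical stand-ins, not rgw code: hash_chunk() models the async md5
// update and process_chunk() models filter->process() (compression,
// encryption, and ultimately the rados write)
std::string hash_chunk(const std::string& data) {
  return "md5(" + data + ")";
}

bool process_chunk(const std::string& data) {
  return !data.empty();
}

// overlap the hash with the write so the md5 latency is (mostly) hidden
// behind filter->process()
bool write_and_hash(const std::string& chunk, std::string& etag_part) {
  auto md5 = std::async(std::launch::async, hash_chunk, chunk);
  const bool written = process_chunk(chunk);  // runs while md5 is hashing
  etag_part = md5.get();  // ideally already complete by the time we get here
  return written;
}
```

the point is just the ordering: kick off the hash first, do the write, then collect the digest, so the md5 cost only shows up when it outlasts the write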
the pr adds a single instance of async_md5::Batch that runs on a strand
executor of rgw's thread pool. this means the md5 calculations can run on
any available thread, but will only utilize one core at a time. we might improve
utilization by adding more Batch instances, but that could reduce the
probability that rgw requests can fill those batches within the "batch
timeout". this timeout is a free parameter that would need tuning
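to make that tradeoff concrete, here's a minimal sketch of the batch/timeout policy (BatchSketch is a made-up illustration, not the real async_md5::Batch interface, and it ignores the strand executor entirely):

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <utility>
#include <vector>

// minimal sketch of the batching policy, NOT the real async_md5::Batch
// interface: hash requests accumulate until the batch is full, but a
// partial batch is released once the batch timeout expires
class BatchSketch {
  std::mutex mutex;
  std::condition_variable cond;
  std::vector<int> pending;  // stand-in for queued hash requests
  const size_t capacity;
  const std::chrono::milliseconds timeout;  // the free parameter to tune
 public:
  BatchSketch(size_t cap, std::chrono::milliseconds t)
      : capacity(cap), timeout(t) {}

  void submit(int request) {
    std::lock_guard<std::mutex> lock{mutex};
    pending.push_back(request);
    if (pending.size() >= capacity) {
      cond.notify_one();  // wake the hasher as soon as the batch is full
    }
  }

  // returns either a full batch immediately, or whatever is pending
  // once the batch timeout expires (possibly empty)
  std::vector<int> next_batch() {
    std::unique_lock<std::mutex> lock{mutex};
    cond.wait_for(lock, timeout,
                  [this] { return pending.size() >= capacity; });
    return std::exchange(pending, {});
  }
};
```

with this shape the tension is visible: more instances split the submit() traffic across more queues, which is exactly what lowers the chance that any one of them fills up before its timeout fires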
i'd like to determine how effective one instance is at offloading the
md5 calculations. if the md5 part (almost) always completes before
filter->process() does, that's probably good enough. if not, we'd want
a way to scale the number of instances based on latency or load
unfortunately, Mark and i are seeing crashes while testing this rgw integration.
i added a comment about this in
https://github.com/ceph/ceph/pull/52385#pullrequestreview-1578597699,
and will follow up there
finally, regarding the base async/batching library in
https://github.com/ceph/ceph/pull/52385, i'd like to explore ways to extend
that to cover other types of hashes and other libraries like QAT. do you think
QAT would work well under the same async/batching interface?
overall, does this sound like a reasonable design for rgw?
On Mon, Aug 14, 2023 at 5:52 PM Liu, Chunmei <chunmei.liu(a)intel.com>
wrote:
Hi Casey,
In your following two PRs:
https://github.com/ceph/ceph/pull/52385 builds an asynchronous
batching library on top of the isa-l crypto library's multi-buffer md5
facilities, and https://github.com/ceph/ceph/pull/52488 uses
async_md5 for the ETag.
It seems rgw can do async md5 batching. I am wondering what the
current work status is. Are all features implemented, or do other
features still need to be implemented in the next step? What can the
Intel team help with here?
Thanks!
-Chunmei
> -----Original Message-----
> From: Cheng, Yingxin <yingxin.cheng(a)intel.com>
> Sent: Thursday, August 10, 2023 10:32 PM
> To: Casey Bodley <cbodley(a)redhat.com>; Liu, Chunmei
> <chunmei.liu(a)intel.com>; Feng, Hualong <hualong.feng(a)intel.com>
> Cc: Tang, Guifeng <guifeng.tang(a)intel.com>; mbenjami
> <mbenjami(a)redhat.com>; Mark Kogan <mkogan(a)redhat.com>; Marcus Watts
> <mwatts(a)redhat.com>; Gabriel BenHanokh <gbenhano(a)redhat.com>
> Subject: RE: rgw: interest in isa-l_crypto for md5 acceleration
>
> Include Chunmei.
>
> Regards,
> -Yingxin
>
> > -----Original Message-----
> > From: Cheng, Yingxin
> > Sent: Thursday, July 13, 2023 3:54 PM
> > To: Casey Bodley <cbodley(a)redhat.com>; Feng, Hualong
> > <hualong.feng(a)intel.com>
> > Cc: Tang, Guifeng <guifeng.tang(a)intel.com>; mbenjami
> > <mbenjami(a)redhat.com>; Mark Kogan <mkogan(a)redhat.com>; Marcus Watts
> > <mwatts(a)redhat.com>; Gabriel BenHanokh <gbenhano(a)redhat.com>
> > Subject: RE: rgw: interest in isa-l_crypto for md5 acceleration
> >
> > Merging another RGW thread here; it is discussing the same thing.
> >
> > I'm also learning and will try to answer below:
> >
> > > But when less than 100% CPU is running, using AVX512 to save
> > > cores may cause slowdown.
> >
> > > well put, this is the part i'm struggling with too. it feels
> > > like a tradeoff between cpu usage and added latency (both from
> > > thread synchronization and waiting for full batches)
> >
> > Yeah, there are combined effects from both software and hardware:
> > latency vs batching, CPU downclocking (which should have less
> > impact on later CPU models such as SPR), sync vs async, etc.
> >
> > > I'm not sure if the current work is an informed implementation
> > > with respect to the isa-l interface; I'm fairly sure it is with
> > > regard to boost::asio and related topics.
> >
> > My understanding is that the asynchronous way has the best
> > opportunity to fully use the 16-lane AVX512 acceleration by
> > batching the requests, and falling back to the synchronous way is
> > worth considering for small sizes or low depth. But the decisions
> > need to be based on real test results.
> >
> > > Is there anything we can do to spread the load evenly on
> > > SSE/AVX/AVX2/AVX512 units?
> > > I assume that each physical core has a standalone unit - can we
> > > make sure to employ them all in parallel?
> >
> > SIMD is a set of CPU instructions rather than an off-loadable
> > device; they can only be executed synchronously from a thread. So
> > if the parallelism exceeds 16 lanes, it should be reasonable to
> > start another worker thread. And if there are only 8 outstanding
> > lanes at the moment, AVX2 should be a better choice than the
> > heavier AVX512. I'm not sure yet whether the isa-l library is
> > intelligent enough to pick the right instruction set or whether it
> > is manually controlled.
> >
> > > This handles only MD5, which certainly matters to us, but I
> > > suspect we strongly need acceleration for sha-256 and sha-512.
> >
> > It looks like isa-l supports these optimizations, which is the
> > recommended way because there might be multiple options available
> > at the same time (AVX512 and sha-ni). Looking at the source code,
> > the library is able to detect the availability and select the best
> > possible option.
> >
> > Regards,
> > -Yingxin
> >
> > > -----Original Message-----
> > > From: Casey Bodley <cbodley(a)redhat.com>
> > > Sent: Thursday, July 13, 2023 3:19 AM
> > > To: Feng, Hualong <hualong.feng(a)intel.com>
> > > Cc: Cheng, Yingxin <yingxin.cheng(a)intel.com>; Tang, Guifeng
> > > <guifeng.tang(a)intel.com>; mbenjami <mbenjami(a)redhat.com>; Mark
> > > Kogan <mkogan(a)redhat.com>
> > > Subject: Re: rgw: interest in isa-l_crypto for md5 acceleration
> > >
> > > thanks Hualong, (cc Matt and Mark)
> > >
> > > On Wed, Jul 12, 2023 at 6:22 AM Feng, Hualong
> > > <hualong.feng(a)intel.com>
> > > wrote:
> > > >
> > > > Hi Casey
> > > >
> > > > Our team is interested in this, but we need to study more details.
> > > >
> > > > We have learned about md5 implemented with AVX512 before. We
> > > > know that md5 can only be calculated serially for a single
> > > > object and cannot be split for concurrent calculation. The
> > > > basic idea of AVX512 is to divide a core into 16 lanes, and
> > > > then these 16 lanes can calculate at the same time. So when we
> > > > use md5 implemented with AVX512, we need to wait for multiple
> > > > request objects to calculate their md5s at the same time.
> > > >
> > > > So here are two difficulties to consider:
> > > > 1. When calculating, we need to fill all the lanes on a core
> > > > as much as possible. Only in this way can the advantages of
> > > > AVX512/AVX2 be realized.
> > > > 2. What we have seen so far is the comparison between using
> > > > AVX512/AVX2 on a single core and not using it. But when we
> > > > actually run RGW, unless RGW or the machine it runs on is
> > > > already at 100% CPU, using AVX512 will increase the calculation
> > > > speed of md5. But when less than 100% CPU is running, using
> > > > AVX512 to save cores may cause slowdown. So how do we know
> > > > under what circumstances we should use AVX instructions in
> > > > code?
> > >
> > > well put, this is the part i'm struggling with too. it feels
> > > like a tradeoff between cpu usage and added latency (both from
> > > thread synchronization and waiting for full batches)
> > >
> > > if we could track the rate of hash updates per second, we might
> > > use that to decide whether we're likely to get a full batch
> > > within some limit of acceptable latency
> > >
> > > RGWPutObj could probably mask some of this latency by running
> > > these asynchronous hashes in parallel while reading the next 4MB
> > > chunk from the frontend
> > >
> > > in https://github.com/ceph/ceph/pull/52385 i introduced the
> > > concept of a batch_timeout, which can force the processing of a
> > > partial batch. bounding this latency seemed like a necessary
> > > part of the model. if RGWPutObj can mask this latency, then we
> > > might use that batch_timeout to avoid the need to track a global
> > > hash rate
> > >
> > > >
> > > > We have implemented a POC before, using QAT to implement the
> > > > hash algorithm in ceph (SHA256, MD5, SHA1, SHA512, HMACSHA256,
> > > > HMACSHA1),
> > >
> > > very cool, thanks. what is the benefit of QAT here, compared to
> > > something like isa-l_crypto that just uses AVX instructions? is
> > > QAT able to offload some of this? would the use of QAT rule out
> > > any hardware (like AMD cpus) that would otherwise support AVX?
> > >
> > > > However, due to md5 security concerns and the lack of a
> > > > convenient alternative framework in ceph, it is temporarily
> > > > blocked.
> > > > https://github.com/ceph/ceph/compare/main...hualongfeng:ceph:hash_qat_mode
> > >
> > > i'll reach out to our security contact to get some more clarity
> > > on this stuff. i know that openssl's FIPS certification is
> > > important to downstream products. but md5 isn't used as a
> > > cryptographic hash here and etag isn't used for security, so
> > > i've assumed we could use other md5 implementations there. i'm
> > > less sure about the SHA family, but i know Matt's interested in
> > > using those for checksumming in rgw
> > >
> > > >
> > > > Thanks
> > > > -Hualong
> > > >
> > > > > -----Original Message-----
> > > > > From: Casey Bodley <cbodley(a)redhat.com>
> > > > > Sent: Tuesday, July 11, 2023 11:06 PM
> > > > > To: Feng, Hualong <hualong.feng(a)intel.com>
> > > > > Subject: rgw: interest in isa-l_crypto for md5 acceleration
> > > > >
> > > > > hey Hualong,
> > > > >
> > > > > ceph is already using intel's isa-l_crypto library for
> > > > > crypto acceleration. i just started looking into its
> > > > > multi-buffer md5 implementation
> > > > > (https://github.com/intel/isa-l_crypto/blob/master/include/md5_mb.h)
> > > > > for use in rgw to vectorize our ETag calculations (a feature
> > > > > tracked in https://tracker.ceph.com/issues/61646). i've
> > > > > started some initial work in
> > > > > https://github.com/ceph/ceph/pull/52385, but we'll still
> > > > > need to decide how best to integrate that into rgw
> > > > >
> > > > > would your team be interested in collaborating on this?
> > > > > we'd love your input on the design for rgw, and how best to
> > > > > measure and tune its performance
> > > > >
>