Thanks Casey, I understand the design, status, and further work. Give us some time to
discuss the design; we will let you know if we have any ideas.
Thanks!
-Chunmei
-----Original Message-----
From: Casey Bodley <cbodley(a)redhat.com>
Sent: Tuesday, August 15, 2023 9:44 AM
To: Liu, Chunmei <chunmei.liu(a)intel.com>
Cc: Cheng, Yingxin <yingxin.cheng(a)intel.com>; Feng, Hualong
<hualong.feng(a)intel.com>; Tang, Guifeng <guifeng.tang(a)intel.com>;
mbenjami <mbenjami(a)redhat.com>; Mark Kogan <mkogan(a)redhat.com>;
Marcus Watts <mwatts(a)redhat.com>; Gabriel BenHanokh
<gbenhano(a)redhat.com>; dev(a)ceph.io; seenafallah(a)gmail.com
Subject: Re: rgw: interest in isa-l_crypto for md5 acceleration
thanks Chunmei (and cc ceph dev list),
"RGWPutObj uses async_md5 for ETag" is a first draft at hooking this up for
rgw, and the pr description in
https://github.com/ceph/ceph/pull/52488 includes a todo list for further work
the first step is to run this async hashing in parallel with
filter->process(), which is what applies compression/encryption and
ultimately writes the data to rados. ideally this would mask most of the
latency from md5
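a rough sketch of that overlap (hypothetical, not the actual rgw code: hash_chunk() and process_chunk() are made-up stand-ins, and std::async stands in for the boost::asio machinery in the pr):

```cpp
#include <future>
#include <string>

// hypothetical stand-ins, not rgw code: hash_chunk() models the async md5
// update and process_chunk() models filter->process() (compression,
// encryption, and ultimately the rados write)
std::string hash_chunk(const std::string& data) {
  return "md5(" + data + ")";
}

bool process_chunk(const std::string& data) {
  return !data.empty();
}

// overlap the hash with the write so the md5 latency is (mostly) hidden
// behind filter->process()
bool write_and_hash(const std::string& chunk, std::string& etag_part) {
  auto md5 = std::async(std::launch::async, hash_chunk, chunk);
  const bool written = process_chunk(chunk);  // runs while md5 is hashing
  etag_part = md5.get();  // ideally already complete by the time we get here
  return written;
}
```

the point is just the ordering: kick off the hash first, do the write, then collect the digest, so the md5 cost only shows up when it outlasts the write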
the pr adds a single instance of async_md5::Batch that runs on a strand
executor of rgw's thread pool. this means the md5 calculations can run on
any available thread, but will only utilize one core at a time. we might improve
utilization by adding more Batch instances, but that could reduce the
probability that rgw requests can fill those batches within the "batch
timeout". this timeout is a free parameter that would need tuning
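to make that tradeoff concrete, here's a minimal sketch of the batch/timeout policy (BatchSketch is a made-up illustration, not the real async_md5::Batch interface, and it ignores the strand executor entirely):

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <utility>
#include <vector>

// minimal sketch of the batching policy, NOT the real async_md5::Batch
// interface: hash requests accumulate until the batch is full, but a
// partial batch is released once the batch timeout expires
class BatchSketch {
  std::mutex mutex;
  std::condition_variable cond;
  std::vector<int> pending;  // stand-in for queued hash requests
  const size_t capacity;
  const std::chrono::milliseconds timeout;  // the free parameter to tune
 public:
  BatchSketch(size_t cap, std::chrono::milliseconds t)
      : capacity(cap), timeout(t) {}

  void submit(int request) {
    std::lock_guard<std::mutex> lock{mutex};
    pending.push_back(request);
    if (pending.size() >= capacity) {
      cond.notify_one();  // wake the hasher as soon as the batch is full
    }
  }

  // returns either a full batch immediately, or whatever is pending
  // once the batch timeout expires (possibly empty)
  std::vector<int> next_batch() {
    std::unique_lock<std::mutex> lock{mutex};
    cond.wait_for(lock, timeout,
                  [this] { return pending.size() >= capacity; });
    return std::exchange(pending, {});
  }
};
```

with this shape the tension is visible: more instances split the submit() traffic across more queues, which is exactly what lowers the chance that any one of them fills up before its timeout fires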
i'd like to determine how effective one instance is at offloading the
md5 calculations. if the md5 part (almost) always completes before
filter->process() does, that's probably good enough. if not, we'd want
a way to scale the number of instances based on latency or load
unfortunately, Mark and i are seeing crashes while testing this rgw integration.
i added a comment about this in
https://github.com/ceph/ceph/pull/52385#pullrequestreview-1578597699,
and will follow up there
finally, regarding the base async/batching library in
https://github.com/ceph/ceph/pull/52385, i'd like to explore ways to extend
that to cover other types of hashes and other libraries like QAT. do you think
QAT would work well under the same async/batching interface?
overall, does this sound like a reasonable design for rgw?
On Mon, Aug 14, 2023 at 5:52 PM Liu, Chunmei <chunmei.liu(a)intel.com>
wrote:
Hi Casey,
In your following two PRs:
https://github.com/ceph/ceph/pull/52385 builds an asynchronous
batching library on top of the isa-l crypto library's multi-buffer md5
facilities, and https://github.com/ceph/ceph/pull/52488 uses
async_md5 for the ETag.
It seems rgw can do async md5 batching. I am wondering what the
current work status is. Are all features implemented, or do other
features still need to be implemented in the next step? What can the
Intel team help with here?
Thanks!
-Chunmei
> -----Original Message-----
> From: Cheng, Yingxin <yingxin.cheng(a)intel.com>
> Sent: Thursday, August 10, 2023 10:32 PM
> To: Casey Bodley <cbodley(a)redhat.com>; Liu, Chunmei
> <chunmei.liu(a)intel.com>; Feng, Hualong <hualong.feng(a)intel.com>
> Cc: Tang, Guifeng <guifeng.tang(a)intel.com>; mbenjami
> <mbenjami(a)redhat.com>; Mark Kogan <mkogan(a)redhat.com>; Marcus Watts
> <mwatts(a)redhat.com>; Gabriel BenHanokh <gbenhano(a)redhat.com>
> Subject: RE: rgw: interest in isa-l_crypto for md5 acceleration
>
> Include Chunmei.
>
> Regards,
> -Yingxin
>
> > -----Original Message-----
> > From: Cheng, Yingxin
> > Sent: Thursday, July 13, 2023 3:54 PM
> > To: Casey Bodley <cbodley(a)redhat.com>; Feng, Hualong
> > <hualong.feng(a)intel.com>
> > Cc: Tang, Guifeng <guifeng.tang(a)intel.com>; mbenjami
> > <mbenjami(a)redhat.com>; Mark Kogan <mkogan(a)redhat.com>; Marcus Watts
> > <mwatts(a)redhat.com>; Gabriel BenHanokh <gbenhano(a)redhat.com>
> > Subject: RE: rgw: interest in isa-l_crypto for md5 acceleration
> >
> > Merging another RGW thread here; it is discussing the same thing.
> >
> > I'm also learning and will try to answer below:
> >
> > > But when less than 100% CPU is running, using AVX512 to save
> > > cores may cause slowdown.
> >
> > > well put, this is the part i'm struggling with too. it feels
> > > like a tradeoff between cpu usage and added latency (both from
> > > thread synchronization and waiting for full batches)
> >
> > Yeah, there are combined effects from both software and hardware:
> > latency vs batching, CPU downclocking (which should have less
> > impact on later CPU models such as SPR), sync vs async, etc.
> >
> > > I'm not sure if the current work is an informed implementation
> > > with respect to the isa-l interface; I'm fairly sure it is with
> > > regard to boost::asio and related topics.
> >
> > My understanding is that the asynchronous way has the best
> > opportunity to fully use the 16-lane AVX512 acceleration by
> > batching the requests, and falling back to the synchronous way is
> > worth considering for small sizes or low depth. But the decisions
> > need to be based on real test results.
> >
> > > Is there anything we can do to spread the load evenly on
> > > SSE/AVX/AVX2/AVX512 units?
> > > I assume that each physical core has a standalone unit - can we
> > > make sure to employ them all in parallel?
> >
> > SIMD is a set of CPU instructions rather than an off-loadable
> > device; they can only be executed synchronously from a thread. So
> > if the parallelism exceeds 16 lanes, it should be reasonable to
> > start another worker thread. And if there are only 8 outstanding
> > lanes at the moment, AVX2 should be a better choice than the
> > heavier AVX512. I'm not sure yet whether the isa-l library is
> > intelligent enough to pick the right instruction set or whether it
> > is manually controlled.
> >
> > > This handles only MD5, which certainly matters to us, but I
> > > suspect we strongly need acceleration for sha-256 and sha-512.
> >
> > It looks like isa-l supports these optimizations, which is the
> > recommended way because there might be multiple options available
> > at the same time (AVX512 and sha-ni). Looking at the source code,
> > the library is able to detect the availability and select the best
> > possible option.
> >
> > Regards,
> > -Yingxin
> >
> > > -----Original Message-----
> > > From: Casey Bodley <cbodley(a)redhat.com>
> > > Sent: Thursday, July 13, 2023 3:19 AM
> > > To: Feng, Hualong <hualong.feng(a)intel.com>
> > > Cc: Cheng, Yingxin <yingxin.cheng(a)intel.com>; Tang, Guifeng
> > > <guifeng.tang(a)intel.com>; mbenjami <mbenjami(a)redhat.com>; Mark
> > > Kogan <mkogan(a)redhat.com>
> > > Subject: Re: rgw: interest in isa-l_crypto for md5 acceleration
> > >
> > > thanks Hualong, (cc Matt and Mark)
> > >
> > > On Wed, Jul 12, 2023 at 6:22 AM Feng, Hualong
> > > <hualong.feng(a)intel.com>
> > > wrote:
> > > >
> > > > Hi Casey
> > > >
> > > > Our team is interested in this, but we need to study more details.
> > > >
> > > > We have learned about md5 implemented with AVX512 before. We
> > > > know that md5 can only be calculated serially for a single
> > > > object and cannot be split for concurrent calculation. The
> > > > basic idea of AVX512 is to divide a core into 16 lanes, and
> > > > then these 16 lanes can calculate at the same time. So when we
> > > > use md5 implemented with AVX512, we need to wait for multiple
> > > > request objects to calculate their md5s at the same time.
> > > >
> > > > So here are two difficulties to consider:
> > > > 1. When calculating, we need to fill all the lanes on a core
> > > > as much as possible. Only in this way can the advantages of
> > > > AVX512/AVX2 be realized.
> > > > 2. What we have seen so far is the comparison between using
> > > > AVX512/AVX2 on a single core and not using it. But when we
> > > > actually run RGW, unless RGW or the machine it runs on is
> > > > already at 100% CPU, using AVX512 will increase the calculation
> > > > speed of md5. But when less than 100% CPU is running, using
> > > > AVX512 to save cores may cause slowdown. So how do we know
> > > > under what circumstances we should use AVX instructions in
> > > > code?
> > >
> > > well put, this is the part i'm struggling with too. it feels
> > > like a tradeoff between cpu usage and added latency (both from
> > > thread synchronization and waiting for full batches)
> > >
> > > if we could track the rate of hash updates per second, we might
> > > use that to decide whether we're likely to get a full batch
> > > within some limit of acceptable latency
> > >
> > > RGWPutObj could probably mask some of this latency by running
> > > these asynchronous hashes in parallel while reading the next 4MB
> > > chunk from the frontend
> > >
> > > in https://github.com/ceph/ceph/pull/52385 i introduced the
> > > concept of a batch_timeout, which can force the processing of a
> > > partial batch. bounding this latency seemed like a necessary
> > > part of the model. if RGWPutObj can mask this latency, then we
> > > might use that batch_timeout to avoid the need to track a global
> > > hash rate
> > >
> > > >
> > > > We have implemented a POC before, using QAT to implement the
> > > > hash algorithm in ceph (SHA256, MD5, SHA1, SHA512, HMACSHA256,
> > > > HMACSHA1),
> > >
> > > very cool, thanks. what is the benefit of QAT here, compared to
> > > something like isa-l_crypto that just uses AVX instructions? is
> > > QAT able to offload some of this? would the use of QAT rule out
> > > any hardware (like AMD cpus) that would otherwise support AVX?
> > >
> > > > However, due to md5 security concerns and the lack of a
> > > > convenient alternative framework in ceph, it is temporarily
> > > > blocked.
> > > > https://github.com/ceph/ceph/compare/main...hualongfeng:ceph:hash_qat_mode
> > >
> > > i'll reach out to our security contact to get some more clarity
> > > on this stuff. i know that openssl's FIPS certification is
> > > important to downstream products. but md5 isn't used as a
> > > cryptographic hash here and etag isn't used for security, so
> > > i've assumed we could use other md5 implementations there. i'm
> > > less sure about the SHA family, but i know Matt's interested in
> > > using those for checksumming in rgw
> > >
> > > >
> > > > Thanks
> > > > -Hualong
> > > >
> > > > > -----Original Message-----
> > > > > From: Casey Bodley <cbodley(a)redhat.com>
> > > > > Sent: Tuesday, July 11, 2023 11:06 PM
> > > > > To: Feng, Hualong <hualong.feng(a)intel.com>
> > > > > Subject: rgw: interest in isa-l_crypto for md5 acceleration
> > > > >
> > > > > hey Hualong,
> > > > >
> > > > > ceph is already using intel's isa-l_crypto library for
> > > > > crypto acceleration. i just started looking into its
> > > > > multi-buffer md5 implementation
> > > > > (https://github.com/intel/isa-l_crypto/blob/master/include/md5_mb.h)
> > > > > for use in rgw to vectorize our ETag calculations (a feature
> > > > > tracked in https://tracker.ceph.com/issues/61646). i've
> > > > > started some initial work in
> > > > > https://github.com/ceph/ceph/pull/52385, but we'll still
> > > > > need to decide how best to integrate that into rgw
> > > > >
> > > > > would your team be interested in collaborating on this?
> > > > > we'd love your input on the design for rgw, and how best to
> > > > > measure and tune its performance
> > > > >
>