Re: RGW encrypt is implemented by qat batch and queue mode - Dev

18 Feb 2023

On Mon, Feb 13, 2023 at 8:48 PM Feng, Hualong &lt;hualong.feng(a)intel.com&gt; wrote:
...

  -----Original Message-----
 From: Casey Bodley &lt;cbodley(a)redhat.com&gt;
 Sent: Wednesday, October 12, 2022 11:11 PM
 To: Feng, Hualong &lt;hualong.feng(a)intel.com&gt;
 Cc: Mark Kogan &lt;mkogan(a)redhat.com&gt;om>; Tang, Guifeng
 &lt;guifeng.tang(a)intel.com&gt;om>; Ma, Jianpeng &lt;jianpeng.ma(a)intel.com&gt;om>;
 dev(a)ceph.io
 Subject: Re: RGW encrypt is implemented by qat batch and queue mode

 On Thu, Sep 22, 2022 at 9:31 PM Feng, Hualong &lt;hualong.feng(a)intel.com&gt;
 wrote:

 > -----Original Message-----
 > From: Casey Bodley &lt;cbodley(a)redhat.com&gt;
 > Sent: Wednesday, September 21, 2022 10:20 PM
 > To: Feng, Hualong &lt;hualong.feng(a)intel.com&gt;
 > Cc: Mark Kogan &lt;mkogan(a)redhat.com&gt;om>; Tang, Guifeng
 > &lt;guifeng.tang(a)intel.com&gt;om>; Ma, Jianpeng &lt;jianpeng.ma(a)intel.com&gt;om>;
 > dev(a)ceph.io
 > Subject: Re: RGW encrypt is implemented by qat batch and queue mode
   > On Mon, Sep 19, 2022 at 4:06 AM Feng,
Hualong
 > &lt;hualong.feng(a)intel.com&gt;
 > wrote:
 >   > > Hi Mark, Casey
 >  
 >  
 >   > >
Could you spare some time to help review these two PRs or add them
 > > to
 > your plan?
 >  
 >  
 >   > > The
PR link is below:
 >   > >
https://github.com/ceph/ceph/pull/47040
 >   > >
https://github.com/ceph/ceph/pull/47845
 >  
 >  
 >   > > I
reimplemented the qat encryption plugin. Since the existing RGW
 > encryption uses 4KB as an encryption unit, the performance is poor
 > when the qat batch interface is not used. Now I have reimplemented
 > the encryption plug-in using the qat batch interface, which is done
 > in two PRs. PR47040 is used to realize that when the encrypted data
 > block is larger than 128KB, 32 pieces of 4K data are taken out for a
 > batch submission each time. PR47845 is based on PR47040, each time
 > the encrypted data block is smaller than 128KB, it is put into a
 > buffer queue first, and when 32 pieces of 4K data or timeout can be  reached,
a batch submission is performed.

   > The
performance result is below, and moreover, the higher the CPU
 > usage,
 the more obvious the effect of qat.

> From the flame graph, the proportion of the encryption plug-in
 implemented by qat in the RGWPutObj::execute function is lower than
 that of the encryption plug-in implemented by isal.

   > Thanks
   > -Hualong

 hey Hualong et al, (cc dev list)

 thanks for reaching out, this really helps me understand what those
 PRs are trying to accomplish

 in general i'm concerned about the need for threads, locking, and
 buffering down in the crypto plugins. ideally this stuff would be
 under the application's control. in radosgw, we've been trying to
 eliminate any blocking waits on condition variables in our i/o path
 now that requests are handled in coroutines - instead of blocking an
 entire thread, we just suspend the coroutine and run others in the
 meantime 
 I agree with your view, but now crypto function calls are still using the 
synchronous interface. If we don't want the plugin to contain condition
 variables, we need to implement the plugin in an asynchronous way and
 provide an asynchronous interface. This requires the RGW to call the
 interface to make changes.

 And the number of QAT instances is difficult to keep consistent with the  number of
threads. The number of QAT instance(hardware resources) is
 limited. When the number of instances is less than the number of threads, we
 still need to wait for the free instance. If the asynchronous interface is used,
 we can use the queue as a buffer to avoid blocking the current thread while
 waiting for a free instance.

 If it is still a synchronous interface, there is no good way to eliminate the 
condition variable. Do you have a better suggestion here?

 below you suggest that we could fall back to CPU processing for small object
 uploads. could we use that same fallback in the cases where we'd otherwise
 have to block waiting for a QAT instance?

  seeing that graph by object size, my first
impression was that
 radosgw should be using bigger buffers. 
 Use a bigger buffer? You mean we should change the encrypted  CHUNK_SIZE, from the
current 4096B, to a bigger one?
  Or other buffers? 
 sorry not the CHUNK_SIZE, but the total amount of data we can feed into QAT
 at a time. i see in https://github.com/ceph/ceph/pull/47040
 that you've found the loop in AES_256_CBC::cbc_transform() which breaks
 the input into CHUNK_SIZEd pieces, and you converted that loop into a single
 batch() call - that part looks great

 if each call to cbc_transform() is getting large enough buffers, it could acquire
 exclusive access to one QAT instance, feed all of its data through it, then
 release the instance back to a pool. for a large object workload it seems like
 this strategy would best utilize the hardware resources, because you never
 have to coordinate a single batch across several threads. you just need to
 acquire/release access to a QAT instance every 4MB, which you can use for
 32 batches (assuming batch size is 32*4k=128k?) at a time

  GetObj and PutObj are both reading data in 4MB
chunks, maybe we can
 find a way to use the qat batch interfaces within those chunks? 
 Yes, they are both reading data in 4MB.
 But when the object we put is larger than 4MB, the data block size when  calling
the encryption function is not necessarily 4MB.

 Such as the below that put an object, but the data block size that the
 encryption function using is 64KB PUT /examplebucket/chunkObject.txt

 content-encoding:aws-chunked
 content-length:66824
 host:s3.amazonaws.com
 x-amz-content-sha256:STREAMING-AWS4-HMAC-SHA256-PAYLOAD
 x-amz-date:20130524T000000Z
 x-amz-decoded-content-length:66560
 x-amz-storage-class:REDUCED_REDUNDANCY
 Authorization:AWS4-HMAC-SHA256
  Credential=AKIAIOSFODNN7EXAMPLE/20130524/us-east-1/s3/aws4_request
 ,Sig

nedHeaders=content-encoding;content-length;host;x-amz-content-sha256;x
 > -amz-date;x-amz-decoded-content-length;x-amz-storage-class,Signature=4

f232c4386841ef735655705268965c44a0e4690baa4adea153f7db9fa80a0a9
 > ---------------

10000;chunk-signature=ad80c730a21e5b8d04586a2213dd63b9a0e99e0e230
 7b0ad
  e35a65485a288648
 <65536-bytes>  > ---------------

10000;chunk-signature=ad80c730a21e5b8d04586a2213dd63b9a0e99e0e230
 7b0ad
  e35a65485a288648
 <65536-bytes> 
 are you sure that this http-level chunking has an effect on the buffer sizes that
 encryption sees? it may cause the buffers to be segmented at 64k, but
 encryption and decryption both call bufferlist::c_str() to reallocate a single
 contiguous buffer:
 https://github.com/ceph/ceph/blob/9aa8bed/src/rgw/rgw_crypt.cc#L490

 so i'd still expect this loop in RGWPutObj::execute() to read up to
 rgw_max_chunk_size at a time:
 https://github.com/ceph/ceph/blob/fc01eeb7/src/rgw/rgw_op.cc#L4111-L41
 41

 if there are cases where the RGWPutObj_BlockEncrypt filter isn't getting large
 enough buffers, we can use the same strategy as
 https://github.com/ceph/ceph/pull/21479, where we improved compression
 ratios by adding a buffering filter in front

   > > that could avoid the need for
cross-thread queues and
 > > synchronization. compared to your approach in
 > > https://github.com/ceph/ceph/pull/47845, i imagine this would show
 > > less of a benefit for small object uploads, but more of a benefit
 > > for the big ones. do you think this could work?
   > If in order to avoid the need for
cross-thread queues and show less of a
 benefit for small object uploads, we can turn small objects to CPU processing.
 Only for big object, we use QAT batch api.
   > Hi Casey
   > Thanks for your reply. The detail
message and some question are above.
   > Thanks
 > -Hualong

 all of my feedback here relates to large objects, though you've really focused
 on small objects in https://github.com/ceph/ceph/pull/47845.
 for small object workloads, i do agree that the queuing and thread
 synchronization is necessary to take advantage of this batching

 it's just hard for me to tell whether that extra complexity is worth it. we've
 tried to minimize any synchronization between rgw's requests so that we're
 able to scale to thousands of concurrent requests/connections. at scale, i'd
 worry that lock contention here would negate some of the gains from QAT

 in workloads with a mix of small and large objects, i think we'd make the best
 use of QAT if we applied it to the larger objects (>= 128k?) where we can use
 it most efficiently 
 Hi Casey

 About the PR https://github.com/ceph/ceph/pull/47040. I have changed the code to be
coroutine,
 so it is able to scale without lock contention. And I restrict the use of qat only when
object>64K so that
 we can use it most efficiently.

 In the QAT part code, I use two async_* interface: one used to get instance, another one
used to submit perform request.
 And in rgw code, in order to get `yield` parameter in crypto plugin, I add extra
parameter in all process function.

 Can you help to review, is the coroutine mode I changed feasible? 
thanks Hualong,

the coroutine changes are nicely done; however i still have concerns
about the overall design:

1. these crypto plugins are meant to be common to all ceph components.
rgw may be the only user now, but this reliance on a coroutine-based
runtime could make the plugin unusable elsewhere

the `optional_yield` wrapper (which may or may not contain a real
coroutine yield_context) can potentially make this more general,
if-and-only-if the plugin has a synchronous implementation as a
fallback. currently, the plugin interfaces take an optional_yield, but
QccCrypto::perform_op_batch() calls y.get_yield_context() on it
unconditionally. even within rgw, there may not be a real
yield_context - rgw_beast_enable_async may be false, or the object
write maybe be driven synchronously by an admin command like
`radosgw-admin obj rewrite`

2. even with coroutines, rgw requests may still have to wait for a qat
instance. with a limited number of instances, couldn't this itself
become a bottleneck as we scale up the number of concurrent requests?

earlier in the thread, we discussed falling back to the cpu
implementation if there wasn't a qat instance available. that could
avoid the need for waits, synchronous or otherwise, inside of the
plugin. this would let us take advantage of hardware acceleration when
we can without introducing any new contention. do you see any
drawbacks to this approach?

i hate to keep sending you back to the drawing board; would it help to
discuss this in person? the Performance Weekly call
(https://pad.ceph.com/p/performance_weekly) could be a good place for
that. if that isn't a good time, we might schedule a separate call or
wait until March 1st for the Ceph Developer Monthly (APAC)

...

 And about the PR https://github.com/ceph/ceph/pull/47845, I do agree your view that the
extra complexity isn't
 worth it. I will close this PR.

 Thanks
 -Hualong