On Thu, Sep 10, 2020 at 10:19 AM shubjero <shubjero(a)gmail.com> wrote:
Hi Casey,
I was never setting rgw_max_chunk_size in my ceph.conf so it must have
been the default? Funnily enough, I don't even see this configuration
parameter in the documentation at
https://docs.ceph.com/docs/nautilus/radosgw/config-ref/ .
Armed with your information, I tried setting the following in my
ceph.conf (values confirmed below via the admin socket):
root@ceph-1:~# ceph --admin-daemon
/var/run/ceph/ceph-client.rgw.ceph-1.28726.94406486979736.asok config
show | egrep "rgw_max_chunk_size|rgw_put_obj_min|rgw_obj_stripe_size"
"rgw_max_chunk_size": "67108864",
"rgw_obj_stripe_size": "67108864",
"rgw_put_obj_min_window_size": "67108864",
And with this configuration I was able to upload with large part sizes
(2GB) using the aws client without error.
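For reference, those overrides amount to something like the following in
ceph.conf (the section name here is only an example based on the gateway
name in the admin socket path above; adjust to your own deployment):

[client.rgw.ceph-1]
rgw_max_chunk_size = 67108864
rgw_obj_stripe_size = 67108864
rgw_put_obj_min_window_size = 67108864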
are you sure there's a benefit to using such large part sizes? a
smaller part size should allow the client to stream more uploads at a
time. it also makes recovery much cheaper; if a 2GB PUT request times
out, the client will retry and send the entire 2GB again. with a
smaller part size, the server can commit this data more frequently and
limit the amount of bandwidth wasted on retries
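as a sketch, the client-side knob is the same aws cli 'config' stanza
quoted further down in this thread; something like this keeps parts at a
more modest size (the 64MB value here is only an example):

s3 =
  multipart_chunksize = 64MB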
Do you know if there is any expected performance improvement with
larger chunk/stripe/window sizes? As I said previously our use case is
dealing with very large genomic files being uploaded and downloaded
(average is probably 100GB per file).
rgw_max_chunk_size specifies how much data we'll send in a single osd
request. rgw_obj_stripe_size specifies how much data we'll write to a
single rados object before creating a new stripe/object.
rgw_put_obj_min_window_size specifies how much object data we'll
buffer in memory as we stream chunks out to their osds
i don't think we saw any benefit from chunk sizes over 4M, but you're
welcome to experiment and measure that in your environment. generally
you want rgw_obj_stripe_size == rgw_max_chunk_size so that each of
your writes goes to a different rados object; if, for example, your
stripe size was 2x the chunk size, we would write two chunks to each
rados object - but the osd has to apply those writes sequentially, so
you lose some parallelism that way
regarding rgw_put_obj_min_window_size, the number of parallel writes
we can do is equal to (rgw_put_obj_min_window_size /
rgw_max_chunk_size). in a default configuration, this is 16M/4M = 4.
you can experiment with a larger multiplier here, but do take overall
memory usage into account! If rgw_max_concurrent_requests is 1024 and
all of those are large PUT requests, then we'd use up to
(rgw_max_concurrent_requests * rgw_put_obj_min_window_size) or 16G of
memory
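to make that concrete, a quick back-of-the-envelope in shell with the
default values (plain arithmetic, not rgw code):

window=$((16 * 1024 * 1024))   # rgw_put_obj_min_window_size, 16M default
chunk=$((4 * 1024 * 1024))     # rgw_max_chunk_size, 4M default
echo "parallel writes per PUT: $((window / chunk))"                      # -> 4
echo "worst-case buffering: $((1024 * window / 1024 / 1024 / 1024)) GiB" # 1024 concurrent requests -> 16 GiB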
in general, i think the default tunings should perform well here. if
you have a lot of memory to work with on rgw nodes, you can experiment
with larger values of rgw_put_obj_min_window_size
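one way to experiment without editing ceph.conf is to inject a value
through the admin socket, e.g. (the socket path is just an example, and
i'm not sure the option takes effect without a restart, so treat this as
a sketch):

ceph --admin-daemon /var/run/ceph/ceph-client.rgw.<name>.asok config set rgw_put_obj_min_window_size 33554432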
On Wed, Sep 9, 2020 at 11:29 AM Casey Bodley <cbodley(a)redhat.com> wrote:
What is your rgw_max_chunk_size? It looks like you'll get these
EDEADLK errors when rgw_max_chunk_size > rgw_put_obj_min_window_size,
because we try to write in units of chunk size but the window is too
small to write a single chunk.
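as an aside, EDEADLK is errno 35 on linux, which lines up with the
"err_no=35 ... resorting to 500" lines in the log excerpt further down.
to check both values on a running gateway, something like this against
the rgw admin socket works (socket path will differ per host):

ceph --admin-daemon /var/run/ceph/ceph-client.rgw.<name>.asok config show | egrep "rgw_max_chunk_size|rgw_put_obj_min_window_size"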
On Wed, Sep 9, 2020 at 8:51 AM shubjero <shubjero(a)gmail.com> wrote:
Will do Matt
On Tue, Sep 8, 2020 at 5:36 PM Matt Benjamin <mbenjami(a)redhat.com> wrote:
thanks, Shubjero
Would you consider creating a ceph tracker issue for this?
regards,
Matt
On Tue, Sep 8, 2020 at 4:13 PM shubjero <shubjero(a)gmail.com> wrote:
>
> I had been looking into this issue all day and during testing found
> that a specific configuration option we had been setting for years was
> the culprit. Not setting this value and letting it fall back to the
> default seems to have fixed our issue with multipart uploads.
>
> If you are curious, the configuration option is rgw_obj_stripe_size
> which was being set to 67108864 bytes (64MiB). The default is 4194304
> bytes (4MiB). This is a documented option
> (https://docs.ceph.com/docs/nautilus/radosgw/config-ref/) and from my
> testing it seems like using anything but the default (I only tried
> larger values) breaks multipart uploads.
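>
> for reference, the offending setting amounts to this single line in our
> ceph.conf (shown here just for illustration):
>
> rgw_obj_stripe_size = 67108864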
>
> On Tue, Sep 8, 2020 at 12:12 PM shubjero <shubjero(a)gmail.com> wrote:
> >
> > Hey all,
> >
> > I'm creating a new post for this issue as we've narrowed the problem
> > down to a partsize limitation on multipart upload. We have discovered
> > that in our production Nautilus (14.2.11) cluster and our lab Nautilus
> > (14.2.10) cluster that multipart uploads with a configured part size
> > of greater than 16777216 bytes (16MiB) will return a status 500 /
> > internal server error from radosgw.
> >
> > So far I have increased the following rgw settings/values that looked
> > suspect, without any success/improvement with part sizes.
> > Such as:
> > "rgw_get_obj_window_size": "16777216",
> > "rgw_put_obj_min_window_size": "16777216",
> >
> > I am trying to determine whether this is because of a conservative
> > default setting somewhere that I don't know about, or whether this is
> > perhaps a bug.
> >
> > I would appreciate it if someone on Nautilus with rgw could also test
> > / provide feedback. It's very easy to reproduce; configuring your
> > part size with aws2cli requires you to put the following in your aws
> > 'config':
> > s3 =
> >   multipart_chunksize = 32MB
> >
> > rgw server logs during a failed multipart upload (32MB chunk/part size):
> > 2020-09-08 15:59:36.054 7f2d32fa6700 1 ====== starting new request
> > req=0x55953dc36930 =====
> > 2020-09-08 15:59:36.082 7f2d32fa6700 -1 res_query() failed
> > 2020-09-08 15:59:36.138 7f2d32fa6700 1 ====== req done
> > req=0x55953dc36930 op status=0 http_status=200 latency=0.0839988s
> > ======
> > 2020-09-08 16:00:07.285 7f2d3dfbc700 1 ====== starting new request
> > req=0x55953dc36930 =====
> > 2020-09-08 16:00:07.285 7f2d3dfbc700 -1 res_query() failed
> > 2020-09-08 16:00:07.353 7f2d00741700 1 ====== starting new request
> > req=0x55954dd5e930 =====
> > 2020-09-08 16:00:07.357 7f2d00741700 -1 res_query() failed
> > 2020-09-08 16:00:07.413 7f2cc56cb700 1 ====== starting new request
> > req=0x55953dc02930 =====
> > 2020-09-08 16:00:07.417 7f2cc56cb700 -1 res_query() failed
> > 2020-09-08 16:00:07.473 7f2cb26a5700 1 ====== starting new request
> > req=0x5595426f6930 =====
> > 2020-09-08 16:00:07.473 7f2cb26a5700 -1 res_query() failed
> > 2020-09-08 16:00:09.465 7f2d3dfbc700 0 WARNING: set_req_state_err
> > err_no=35 resorting to 500
> > 2020-09-08 16:00:09.465 7f2d3dfbc700 1 ====== req done
> > req=0x55953dc36930 op status=-35 http_status=500 latency=2.17997s
> > ======
> > 2020-09-08 16:00:09.549 7f2d00741700 0 WARNING: set_req_state_err
> > err_no=35 resorting to 500
> > 2020-09-08 16:00:09.549 7f2d00741700 1 ====== req done
> > req=0x55954dd5e930 op status=-35 http_status=500 latency=2.19597s
> > ======
> > 2020-09-08 16:00:09.605 7f2cc56cb700 0 WARNING: set_req_state_err
> > err_no=35 resorting to 500
> > 2020-09-08 16:00:09.609 7f2cc56cb700 1 ====== req done
> > req=0x55953dc02930 op status=-35 http_status=500 latency=2.19597s
> > ======
> > 2020-09-08 16:00:09.641 7f2cb26a5700 0 WARNING: set_req_state_err
> > err_no=35 resorting to 500
> > 2020-09-08 16:00:09.641 7f2cb26a5700 1 ====== req done
> > req=0x5595426f6930 op status=-35 http_status=500 latency=2.16797s
> > ======
> >
> > awscli client side output during a failed multipart upload:
> > root@jump:~# aws --no-verify-ssl --endpoint-url
> > http://lab-object.cancercollaboratory.org:7480 s3 cp 4GBfile
> > s3://troubleshooting
> > upload failed: ./4GBfile to s3://troubleshooting/4GBfile An error
> > occurred (UnknownError) when calling the UploadPart operation (reached
> > max retries: 2): Unknown
> >
> > Thanks,
> >
> > Jared Baker
> > Cloud Architect for the Cancer Genome Collaboratory
> > Ontario Institute for Cancer Research
--
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103
http://www.redhat.com/en/technologies/storage
tel. 734-821-5101
fax. 734-769-8938
cel. 734-216-5309
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io