Hey all,
I'm starting a new thread for this issue, as we've narrowed the problem
down to a part-size limitation on multipart uploads. We have discovered,
on both our production Nautilus (14.2.11) cluster and our lab Nautilus
(14.2.10) cluster, that multipart uploads with a configured part size
greater than 16777216 bytes (16 MiB) fail with a status 500 /
internal server error from radosgw.
So far I have increased the following rgw settings, which looked
suspect given that their values match the 16777216-byte failure
threshold exactly, without any success or improvement with part sizes:
"rgw_get_obj_window_size": "16777216",
"rgw_put_obj_min_window_size": "16777216",
I am trying to determine whether this is caused by a conservative
default setting somewhere that I don't know about, or whether this is
perhaps a bug.
I would appreciate it if someone running rgw on Nautilus could also
test this and provide feedback. It's very easy to reproduce; to set
the part size with awscli, put the following in your aws 'config'
file (under your profile section, e.g. [default]):

  s3 =
    multipart_chunksize = 32MB
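If you'd rather script a quick bisect of the threshold, something
along these lines should work (an untested sketch; the endpoint and
bucket are placeholders for your own, and `aws configure set` writes
the same config key shown above):

  # sweep part sizes to find where uploads start failing; endpoint/bucket are placeholders
  for size in 8MB 16MB 17MB 32MB; do
      aws configure set default.s3.multipart_chunksize "$size"
      if aws --endpoint-url http://rgw.example.com:7480 s3 cp 4GBfile "s3://your-bucket/4GBfile-$size"; then
          echo "part size $size: OK"
      else
          echo "part size $size: FAILED"
      fi
  done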
rgw server logs during a failed multipart upload (32MB chunk/part size):
2020-09-08 15:59:36.054 7f2d32fa6700  1 ====== starting new request req=0x55953dc36930 =====
2020-09-08 15:59:36.082 7f2d32fa6700 -1 res_query() failed
2020-09-08 15:59:36.138 7f2d32fa6700  1 ====== req done req=0x55953dc36930 op status=0 http_status=200 latency=0.0839988s ======
2020-09-08 16:00:07.285 7f2d3dfbc700  1 ====== starting new request req=0x55953dc36930 =====
2020-09-08 16:00:07.285 7f2d3dfbc700 -1 res_query() failed
2020-09-08 16:00:07.353 7f2d00741700  1 ====== starting new request req=0x55954dd5e930 =====
2020-09-08 16:00:07.357 7f2d00741700 -1 res_query() failed
2020-09-08 16:00:07.413 7f2cc56cb700  1 ====== starting new request req=0x55953dc02930 =====
2020-09-08 16:00:07.417 7f2cc56cb700 -1 res_query() failed
2020-09-08 16:00:07.473 7f2cb26a5700  1 ====== starting new request req=0x5595426f6930 =====
2020-09-08 16:00:07.473 7f2cb26a5700 -1 res_query() failed
2020-09-08 16:00:09.465 7f2d3dfbc700  0 WARNING: set_req_state_err err_no=35 resorting to 500
2020-09-08 16:00:09.465 7f2d3dfbc700  1 ====== req done req=0x55953dc36930 op status=-35 http_status=500 latency=2.17997s ======
2020-09-08 16:00:09.549 7f2d00741700  0 WARNING: set_req_state_err err_no=35 resorting to 500
2020-09-08 16:00:09.549 7f2d00741700  1 ====== req done req=0x55954dd5e930 op status=-35 http_status=500 latency=2.19597s ======
2020-09-08 16:00:09.605 7f2cc56cb700  0 WARNING: set_req_state_err err_no=35 resorting to 500
2020-09-08 16:00:09.609 7f2cc56cb700  1 ====== req done req=0x55953dc02930 op status=-35 http_status=500 latency=2.19597s ======
2020-09-08 16:00:09.641 7f2cb26a5700  0 WARNING: set_req_state_err err_no=35 resorting to 500
2020-09-08 16:00:09.641 7f2cb26a5700  1 ====== req done req=0x5595426f6930 op status=-35 http_status=500 latency=2.16797s ======
awscli client-side output during a failed multipart upload:
root@jump:~# aws --no-verify-ssl --endpoint-url http://lab-object.cancercollaboratory.org:7480 s3 cp 4GBfile s3://troubleshooting
upload failed: ./4GBfile to s3://troubleshooting/4GBfile An error occurred (UnknownError) when calling the UploadPart operation (reached max retries: 2): Unknown
Thanks,
Jared Baker
Cloud Architect for the Cancer Genome Collaboratory
Ontario Institute for Cancer Research