Hi All,
We’re using a Ceph cluster (Nautilus 14.2.10) as an S3 object storage layer for Spark 3 with
YARN, running in a distributed environment. The issue we see, however, is slow performance when
running even a simple Spark query on data stored across a large number of objects, for example
50,000 objects. We’re aware of slow object listing in S3, but should that really kill
performance when using Spark for reading/analyzing/writing data on S3? Running the
same query on the same dataset content, but stored in 100 files, is multiple times faster.
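In case it helps to reproduce or compare: a sketch of the kind of client-side settings that affect the small-object case, assuming Spark reads via the Hadoop S3A connector (values are illustrative, not our production config):

```properties
# spark-defaults.conf excerpt (illustrative values, assuming the S3A connector)
spark.hadoop.fs.s3a.connection.maximum   64
spark.hadoop.fs.s3a.threads.max          32
spark.hadoop.fs.s3a.list.version         2
# Favor fewer, larger input splits over one task per tiny object
spark.sql.files.maxPartitionBytes        268435456
spark.sql.files.openCostInBytes          8388608
```

Compacting the dataset into fewer, larger files (e.g. via repartition before the write) avoids the per-object listing and open cost entirely, which matches the 100-file result above.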
The bucket and bucket index we use for Spark are stored on OSDs backed by SSDs (we have 150 of
them), and we’re using 12 RGW instances, each limited by a custom app to 32 concurrent
connections to prevent the RGW queue from blowing up. When running Spark queries, the RGW
queues rise, depending on the number of Spark executors, to around 30 per instance, provided
the number of executors per RGW instance is similar. We don’t see any bottleneck on the
infrastructure side other than the RGW queues. We applied various tuning options for Ceph
regarding RGW/OSD/BlueStore performance (e.g. objecter_inflight_op_bytes, objecter_inflight_ops,
rgw_bucket_index_max_aio, rgw_cache_lru_size), but Spark still runs really slowly in the
scenario described above.
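For reference, the kind of tuning we mean looks roughly like this (the option names are the ones listed above; the values here are illustrative, not our exact production settings):

```ini
# ceph.conf excerpt (illustrative values only)
[client.rgw]
rgw_bucket_index_max_aio = 128       ; parallel bucket-index operations
rgw_cache_lru_size = 100000          ; RGW metadata cache entries

[osd]
objecter_inflight_ops = 8192         ; ceiling on in-flight objecter ops
objecter_inflight_op_bytes = 1073741824  ; ceiling on in-flight op bytes
```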
The other problem we observed is that when using 4 RGW instances instead of 12, we see only
about 40–50% performance degradation. We would expect RGW scaling to behave more
efficiently than that.
Is anyone else using Ceph in a similar way, as a storage layer for Spark? Do you observe
similar behavior, and do you perhaps have workarounds/solutions for the slowness when working
with a high number of objects?