Hi Reed,
Have you tried just starting multiple rsync processes simultaneously to transfer different
directories? Distributed systems like Ceph often benefit from more parallelism.
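For example, a rough sketch with xargs (the mount points /mnt/rbd and /mnt/cephfs here are
just placeholders for your actual source and destination):

    # run up to 8 rsyncs at once, one per top-level directory; tune -P to taste
    find /mnt/rbd -mindepth 1 -maxdepth 1 -type d | xargs -P8 -I{} rsync -a {} /mnt/cephfs/

GNU parallel works just as well if you prefer it over xargs.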
Weiwen Hu
On May 28, 2021, at 03:54, Reed Dier <reed.dier(a)focusvq.com> wrote:
Hoping someone may be able to help point out where my bottleneck(s) may be.
I have an 80TB kRBD image on an EC8:2 pool, with an XFS filesystem on top of that.
This was not an ideal scenario; rather, it was a rescue mission to dump a large, aging
RAID array before it was too late, so I'm working with the hand I was dealt.
To further complicate matters, the main directory structure consists of lots and lots of
small files in deep directories.
My goal is to rsync (or otherwise copy) data from the RBD to CephFS, but it's just
unbearably slow and will take ~150 days to transfer ~35TB, which is far from ideal.
15.41G  79%  4.36MB/s  0:56:09  (xfr#23165, ir-chk=4061/27259)

avg-cpu:  %user   %nice  %system  %iowait  %steal   %idle
           0.17    0.00     1.34    13.23    0.00   85.26

Device   r/s   rMB/s  rrqm/s  %rrqm  r_await  rareq-sz    w/s  wMB/s  wrqm/s  %wrqm  w_await  wareq-sz   d/s  dMB/s  drqm/s  %drqm  d_await  dareq-sz  aqu-sz  %util
rbd0   124.00   0.66    0.00   0.00    17.30      5.48  50.00   0.17    0.00   0.00    31.70      3.49  0.00   0.00    0.00   0.00     0.00      0.00    3.39  96.40
Rsync progress and iostat (during the rsync) from the RBD to a local SSD, to remove any
bottlenecks from doubling back to CephFS.
About 16G in 1h, not exactly blazing, and this is just 5 of the ~7000 directories I'm
looking to offload to CephFS.
Currently running Ceph 15.2.11, and the host is Ubuntu 20.04 (5.4.0-72-generic) with a single
E5-2620, 64GB of memory, and a 4x10GbT bond talking to Ceph; iperf proves the network out.
The pool is EC 8:2 across about 16 hosts and 240 OSDs, 24 of those being 8TB 7.2k SAS and the
other 216 being 2TB 7.2k SATA, so there are quite a few spindles in play here.
There are only 128 PGs in this pool, but it's the only RBD image in the pool. The autoscaler
recommends going to 512, but I was hoping to avoid the performance overhead of the PG splits
if possible, given perf is bad enough as is.
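(For reference, if I do give in to the autoscaler, the change itself is just a pool setting;
a sketch assuming it is pointing at the EC data pool named in the rbd info below:

    ceph osd pool autoscale-status
    ceph osd pool set rbd-ec82-pool pg_num 512

but that kicks off exactly the PG splitting I'd like to avoid while the copy is running.)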
Examining the main directory structure, it looks like there are ~7000 files per directory,
about 60% of which are <1MiB, totaling nearly 5GiB per directory in all.
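(Rough numbers, pulled with something along these lines against a sample directory; the path
is a placeholder.)

    find /mnt/rbd/somedir -type f | wc -l                  # total files per directory
    find /mnt/rbd/somedir -type f -size -1048576c | wc -l  # files under 1 MiB (byte count avoids find's 1M rounding)
    du -sh /mnt/rbd/somedir                                 # total size per directory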
My fstab for this is:
xfs _netdev,noatime 0 0
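(Device and mount point got trimmed above; for context, a hypothetical full entry using the
udev symlink that rbdmap creates would look like the following, with pool, image, and mount
point as placeholders.)

    /dev/rbd/<pool>/<image>   /mnt/rbd   xfs   _netdev,noatime   0 0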
I tried to increase the read_ahead_kb to 4M from 128K at
/sys/block/rbd0/queue/read_ahead_kb to match the object/stripe size of the EC pool, but
that doesn't appear to have had much of an impact.
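(That was just a sysfs write, roughly the following; the value is in KiB, and it resets
whenever the image is remapped.)

    echo 4096 > /sys/block/rbd0/queue/read_ahead_kb   # 4096 KiB = 4 MiB readahead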
The only other thing I can think of to try would be increasing the queue depth in the rbdmap
up from 128, so that's my next bullet to fire.
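(That would look something like the following at map time, or the equivalent options= entry
in /etc/ceph/rbdmap; pool/image are placeholders and 1024 is just a guess at a sensible step up.)

    rbd map <pool>/<image> -o queue_depth=1024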
Attaching xfs_info in case there are any useful nuggets:
meta-data=/dev/rbd0              isize=256    agcount=81, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=0
         =                       crc=0        finobt=0, sparse=0, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=21483470848, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=0
log      =internal log           bsize=4096   blocks=32768, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0
And rbd info:
rbd image 'rbd-image-name':
    size 85 TiB in 22282240 objects
    order 22 (4 MiB objects)
    snapshot_count: 0
    id: a09cac2b772af5
    data_pool: rbd-ec82-pool
    block_name_prefix: rbd_data.29.a09cac2b772af5
    format: 2
    features: layering, exclusive-lock, object-map, fast-diff, deep-flatten, data-pool
    op_features:
    flags:
    create_timestamp: Mon Apr 12 18:44:38 2021
    access_timestamp: Mon Apr 12 18:44:38 2021
    modify_timestamp: Mon Apr 12 18:44:38 2021
Any other ideas or hints are greatly appreciated.
Thanks,
Reed
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io