Hello Sebastian,
On Fri, Feb 5, 2021 at 8:38 AM Sebastian Knust
<sknust(a)physik.uni-bielefeld.de> wrote:
Hi,
I am running a Ceph Octopus (15.2.8) cluster primarily for CephFS.
v15.2.9 (not yet released) would be the recommended version to start
doing this (it builds some protections into how subvolumes are
snapshotted) but your current workflow looks safe for v15.2.8.
Metadata is stored on SSD, data is stored in three different pools on HDD. Currently, I use 22 subvolumes.
I am rotating snapshots on 16 subvolumes, all in the same pool, which is
the primary data pool for CephFS. Currently I have 41 snapshots per
subvolume. The goal is 50 snapshots (see bottom of mail for details).
Snapshots are only placed in the root subvolume directory, i.e.
/volumes/_nogroup/subvolname/hex-id/.snap
Keep in mind those snapshots will not be visible via the `ceph fs
subvolume` interface. Are you not using the `ceph fs subvolume
snapshot` interface?
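For reference, a minimal sketch of driving snapshots through that interface (the volume, subvolume, and snapshot names below are placeholders):

    # create a snapshot of a subvolume via the mgr subvolume API
    ceph fs subvolume snapshot create cephfs subvolname snap-2021-02-05
    # list snapshots known to the subvolume interface
    ceph fs subvolume snapshot ls cephfs subvolname
    # remove a rotated-out snapshot
    ceph fs subvolume snapshot rm cephfs subvolname snap-2020-12-01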
I create the snapshots from one of the nodes: the complete CephFS is mounted, mkdir and rmdir are performed for each relevant subvolume, then CephFS is unmounted again. All PGs are active+clean most of the time, with only a few in snaptrim for 1-2 minutes after snapshot deletion. I therefore assume that snaptrim is not a limiting factor.
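(As a rough illustration of the rotation described above; mount options, paths, and snapshot names are placeholders, with hex-id standing in for the subvolume's UUID directory:)

    # mount the full filesystem, rotate snapshots per subvolume, unmount
    mount -t ceph mon1:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret
    mkdir /mnt/cephfs/volumes/_nogroup/subvolname/hex-id/.snap/snap-2021-02-05
    rmdir /mnt/cephfs/volumes/_nogroup/subvolname/hex-id/.snap/snap-2020-12-01
    umount /mnt/cephfs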
Obviously, the total number of snapshots is more than the 400 and 100 I
see mentioned in some documentation. I am unsure if that is an issue
here, as the snapshots are all in disjoint subvolumes.
That guidance is obsolete now with the changes present in v15.2.9.
When mounting the subvolumes with the kernel client (ranging from the CentOS 7 supplied 3.10 up to 5.4.93), after some time and for some subvolumes the kworker process begins to hog 100% CPU and stat operations become very slow (even slower than with the fuse client). I can mostly reproduce this by starting specific rsync operations (with many small files, e.g. CTAN, CentOS, Debian mirrors) and by running a bareos backup. The kworker process seems to remain stuck even after terminating the causing operation, i.e. rsync or bareos-fd.
Interestingly, I can even trigger these issues on a host that has only a
single CephFS subvolume without any snapshots mounted, as long as that
subvolume is in the same pool as other subvolumes with snapshots.
I don't see any abnormal behaviour on the cluster nodes or on other
clients during these kworker hanging phases.
Can you retest with CentOS 8? There are numerous changes and bugfixes
that may have addressed this.
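If it does recur, one way to see what the kernel client is stuck on (assuming debugfs is mounted; the exact directory name under /sys/kernel/debug/ceph depends on the fsid and client id):

    # in-flight MDS and OSD requests for the hung mount
    cat /sys/kernel/debug/ceph/*/mdsc
    cat /sys/kernel/debug/ceph/*/osdc
    # kernel stack of the spinning kworker (replace <pid> with its PID)
    cat /proc/<pid>/stack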
With the fuse client, in normal operation stat calls are about 10-20x slower
than with the kernel client. However, I don't encounter the extreme
slowdown behaviour. I am therefore currently mounting some
known-problematic subvolumes with fuse and non-problematic subvolumes
with the kernel client.
My questions are:
- Is this known or expected behaviour?
No
- I could move the subvolumes with snapshots into a subvolumegroup and snapshot the whole group instead of each subvolume. Is this likely to solve the issues?
No, please don't do this except through the subvolume API. Note: subvolume group snapshots are currently disabled (though they may not yet be disabled in your version of Octopus), but we expect to bring them back soon.
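Once re-enabled, group snapshots would go through the same mgr interface; the command form is roughly as below (names are placeholders and availability depends on your exact release):

    ceph fs subvolumegroup snapshot create cephfs groupname snap-2021-02-05
    ceph fs subvolumegroup snapshot rm cephfs groupname snap-2021-02-05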
- What is the current recommendation regarding CephFS and the max number of snapshots?
A given directory should have less than ~100 snapshots including
inherited snapshots.
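A rough way to see how many snapshots (including inherited ones) apply to a given directory is to count the entries in its .snap directory; the path below is an example:

    ls /mnt/cephfs/volumes/_nogroup/subvolname/hex-id/.snap | wc -l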
--
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D