[RFC] CephFS dmClock QoS Scheduler - Dev

25 Sep 2020

Hi Ceph maintainers and developers, 

The purpose of this message is to discuss our work on dmClock-based client QoS management
for CephFS.

Our group at LINE maintains Ceph storage clusters providing RGW, RBD, and CephFS to
internally support an OpenStack and K8S based private cloud environment for various
applications and platforms, including LINE messenger. We have seen that the RGW and RBD
services can provide consistent performance to multiple active users, since RGW employs
the dmClock QoS scheduler for S3 clients and hypervisors internally use an I/O throttler
for VM block storage clients. Unfortunately, unlike RGW and RBD, CephFS clients can
directly issue metadata requests to MDSs and file data requests to OSDs at whatever rate
they want. When one client does so heavily, which happens occasionally (or even
frequently), the performance of the other clients may be degraded by this noisy neighbor.
As a result, consistent performance cannot be guaranteed in our environment. Motivated by
this observation, we are now considering a client QoS scheduler for CephFS based on the
dmClock library.

Here are a few points on how we plan to realize the QoS scheduler.

- Per-subvolume QoS management. IOPS resources are shared only among the clients that
mount the same root directory. QoS parameters can be easily configured through extended
attributes (similar to quota). Each dmClock scheduler can manage clients' requests using
client session information.
- MDS QoS management. Client metadata requests such as create, lookup, etc. are managed
by a dmClock scheduler placed between the dispatcher and the main request handler (e.g.,
Server::handle_client_request()). We have observed that two active MDSs provide
approximately 20K IOPS. Since this capacity can be scarce when many clients are active,
QoS management is needed for MDS.
- OSD QoS management. We would like to reopen and improve the previous work available at
https://github.com/ceph/ceph/pull/20235.
- Client QoS management. Each client maintains a dmClock tracker that keeps track of both
rho and delta, which are packed into client request messages.
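
To make the rho/delta bookkeeping concrete, here is a minimal Python sketch of the
tagging rules as we understand them from the dmClock algorithm. It is only an
illustration: ClientTracker, Tags, next_tags, and the server names "mds.a" and "mds.b"
are hypothetical and are not the API of the dmClock library or of our patches. The
sketch assumes delta counts the responses a client received from other servers since
its last request to a given server (plus one for the request itself), and rho counts
only those served in the reservation phase.

```python
from dataclasses import dataclass


class ClientTracker:
    """Hypothetical client-side tracker producing (rho, delta) per request."""

    def __init__(self):
        self._done = {}  # server -> [all responses, reservation-phase responses]
        self._seen = {}  # server -> other-server counts at the last send

    def on_response(self, server, by_reservation):
        counts = self._done.setdefault(server, [0, 0])
        counts[0] += 1
        if by_reservation:
            counts[1] += 1

    def _others(self, server):
        # Responses this client has received from every server except `server`.
        rows = [c for s, c in self._done.items() if s != server]
        return sum(c[0] for c in rows), sum(c[1] for c in rows)

    def next_params(self, server):
        total, reserved = self._others(server)
        seen_total, seen_reserved = self._seen.get(server, (total, reserved))
        self._seen[server] = (total, reserved)
        rho = 1 + reserved - seen_reserved
        delta = 1 + total - seen_total
        return rho, delta


@dataclass
class Tags:
    reservation: float = 0.0
    proportion: float = 0.0
    limit: float = 0.0


def next_tags(prev, now, rho, delta, reservation, weight, limit):
    """Server-side tag update; a tag never falls behind the current time."""
    return Tags(
        reservation=max(prev.reservation + rho / reservation, now),
        proportion=max(prev.proportion + delta / weight, now),
        limit=max(prev.limit + delta / limit, now),
    )


tracker = ClientTracker()
tracker.next_params("mds.a")  # first request to mds.a
for by_reservation in (True, False, False):
    tracker.next_params("mds.b")
    tracker.on_response("mds.b", by_reservation)
# The next request to mds.a carries the work done on mds.b in the meantime.
rho, delta = tracker.next_params("mds.a")
tags = next_tags(Tags(), 0.0, rho, delta, reservation=200, weight=500, limit=1000)
```

With a single server, rho and delta stay at 1 and the rules reduce to plain mClock; with
several servers, each server advances the client's tags by the service the client
already received elsewhere.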

On the CLI side, QoS parameters are configured using extended attributes on each
subvolume directory. Separate QoS configurations are provided for MDSs and OSDs.

setfattr -n ceph.dmclock.mds_reservation -v 200 \
    /volumes/_nogroup/fdffc126-7961-4bbc-add2-2675b9e35a55
setfattr -n ceph.dmclock.mds_weight -v 500 \
    /volumes/_nogroup/fdffc126-7961-4bbc-add2-2675b9e35a55
setfattr -n ceph.dmclock.mds_limit -v 1000 \
    /volumes/_nogroup/fdffc126-7961-4bbc-add2-2675b9e35a55

setfattr -n ceph.dmclock.osd_reservation -v 500 \
    /volumes/_nogroup/fdffc126-7961-4bbc-add2-2675b9e35a55
setfattr -n ceph.dmclock.osd_weight -v 1000 \
    /volumes/_nogroup/fdffc126-7961-4bbc-add2-2675b9e35a55
setfattr -n ceph.dmclock.osd_limit -v 2000 \
    /volumes/_nogroup/fdffc126-7961-4bbc-add2-2675b9e35a55
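
For scripting, the same six commands could be generated per subvolume. The Python sketch
below is only an illustration: qos_commands is a hypothetical helper, the
ceph.dmclock.* attribute names are the ones proposed in this RFC (they do not exist in
CephFS today), and the check that a reservation must not exceed its limit is our
assumption about a sensible sanity rule, not something the proposal enforces yet. The
helper only builds the command lines; it does not run them.

```python
import shlex


def qos_commands(path, mds, osd):
    """Return setfattr command lines for the proposed ceph.dmclock.* attributes."""
    commands = []
    for prefix, params in (("mds", mds), ("osd", osd)):
        # Assumed sanity rule: a reservation above the limit can never be honored.
        if params["reservation"] > params["limit"]:
            raise ValueError(f"{prefix}: reservation exceeds limit")
        for key in ("reservation", "weight", "limit"):
            name = f"ceph.dmclock.{prefix}_{key}"
            argv = ["setfattr", "-n", name, "-v", str(params[key]), path]
            commands.append(shlex.join(argv))
    return commands


subvolume = "/volumes/_nogroup/fdffc126-7961-4bbc-add2-2675b9e35a55"
commands = qos_commands(
    subvolume,
    mds={"reservation": 200, "weight": 500, "limit": 1000},
    osd={"reservation": 500, "weight": 1000, "limit": 2000},
)
for line in commands:
    print(line)
```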

Our QoS work kicked off last month. Our first step was to go over the prior work and the
dmClock algorithm and library. We are now actively checking the feasibility of our idea
with some modifications to MDS and ceph-fuse. Our development is planned as follows.

- The dmClock scheduler will be integrated into MDS and ceph-fuse by December 2020.
- The dmClock scheduler will be integrated into OSD in the first half of next year (2021).

Does the community have any plans to develop per-client QoS management? Are there any
other issues related to our QoS work that we should be aware of? We look forward to
hearing your valuable comments and feedback at this early stage.

Thanks

Yongseok Oh