Hi Josh and maintainers,
We have confirmed again that the dmClock algorithm does not address multiple clients
running on the same subvolume. In dmClock, the client tracker monitors how much
work each server has performed and acts as a sort of scheduler by sending
rho/delta values to the servers. Our simple idea was to divide the total QoS IOPS
among the clients within a subvolume, either based on workload dynamics or evenly.
For example, suppose 1000 IOPS is allocated to a subvolume and that value is
periodically shared among multiple clients. If 100 clients simultaneously issue
requests on the same subvolume, each client can consume 10 IOPS under the mClock
scheduler.
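The sharing idea above could be sketched roughly as follows. This is only an illustration of the arithmetic; the function and parameter names are hypothetical and not part of dmClock or Ceph:

```python
# Hypothetical sketch: dividing a subvolume's QoS IOPS budget among its
# active clients for one refresh period, either evenly or weighted by
# each client's recently observed workload.

def divide_iops(subvolume_limit, client_workloads, evenly=True):
    """Return a per-client IOPS allocation.

    subvolume_limit:  total QoS IOPS configured for the subvolume
    client_workloads: {client_id: recent IOPS observed for that client}
    """
    n = len(client_workloads)
    if n == 0:
        return {}
    total = sum(client_workloads.values())
    if evenly or total == 0:
        # Even split, e.g. 1000 IOPS / 100 clients = 10 IOPS each.
        return {c: subvolume_limit / n for c in client_workloads}
    # Weighted split proportional to each client's recent workload.
    return {c: subvolume_limit * w / total
            for c, w in client_workloads.items()}

# 1000 IOPS shared evenly by 100 clients -> 10 IOPS per client
alloc = divide_iops(1000, {f"client{i}": 1 for i in range(100)})
```

As noted below, the accuracy of the weighted variant depends on how often the workload metrics are refreshed.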
As described, a client workload metric-based approach is possible, but it is not
easy to ensure QoS stability because client workloads change dynamically and the
metric collection period affects allocation accuracy. Alternatively, per-client-session
QoS could be considered, but it is difficult to predict and limit the number of
sessions in a subvolume.
For this reason, instead of applying dmClock, the mClock scheduler can be considered
a good solution to the noisy neighbor problem. The expected per-subvolume QoS IOPS
can be calculated as follows. (Assuming MDS and OSD requests are almost evenly
distributed; reservation and weight values are omitted for brevity.)
[MDS]
- Per MDS Limit IOPS * # of MDSs (Per MDS Limit IOPS * 1 when ephemeral random pinning is
configured.)
[OSD]
- Per OSD Limit IOPS * # of OSDs
Of course, depending on the workload or server conditions, the above equations may
not hold exactly. However, this QoS scheduler can be implemented without client or
manager modifications.
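The two equations above can be sketched as follows. The function names are illustrative only, and the ephemeral-random-pinning case simply follows the note above (effective MDS count of 1):

```python
# Hypothetical sketch of the expected per-subvolume limit IOPS,
# assuming requests are evenly distributed across daemons and
# reservation/weight values are ignored, as stated above.

def expected_mds_limit(per_mds_limit_iops, num_mds,
                       ephemeral_random_pinning=False):
    # Per the note above, use an effective MDS count of 1 when
    # ephemeral random pinning is configured.
    effective_mds = 1 if ephemeral_random_pinning else num_mds
    return per_mds_limit_iops * effective_mds

def expected_osd_limit(per_osd_limit_iops, num_osds):
    return per_osd_limit_iops * num_osds

# e.g. 4 MDSs at 500 IOPS each -> 2000 IOPS; with ephemeral
# random pinning -> 500 IOPS; 10 OSDs at 100 IOPS each -> 1000 IOPS
```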
Are there any other comments from a CephFS point of view?
Finally, we are going to implement a prototype that applies the mClock scheduler to
the MDS and then open a pull request to share and discuss it.
Thanks
Yongseok