Hi,
further debugging showed the following: the deletion completes very
quickly if creation and deletion are done on the same host. In our case,
the directories to be deleted are created on a cluster node, and the
removal is done on the central server. The problem seems to be correlated
with the kernel upgrade from 4.19.* to 5.1.* on the cluster nodes.
Any ideas?
Cheers,
Andrej
On 03/07/2019 09:58, Andrej Filipčič wrote:
Hi,
We have been experiencing very slow "rm" on a client lately, and it is
not clear what is wrong. The symptom looks like this:
on client:
# cat /sys/kernel/debug/ceph/*/mdsc
3436737 mds0 rmdir #100216d0113/utils
(hpc/session/H9MKDmgVn2unmmR0Xox1SiGmABFKDmABFKDmmH9XDmABFKDmzmjnGo/pilot/radical/utils)
mds:
# ceph daemon mds.velikaponca dump_ops_in_flight
{
    "ops": [
        {
            "description": "client_request(client.14321210:3436737 rmdir #0x100216d0113/utils 2019-07-03 09:48:02.000548 caller_uid=0, caller_gid=0{})",
            "initiated_at": "2019-07-03 09:48:02.001005",
            "age": 4.046726,
            "duration": 4.046759,
            "type_data": {
                "flag_point": "failed to xlock, waiting",
                "reqid": "client.14321210:3436737",
                "op_type": "client_request",
                "client_info": {
                    "client": "client.14321210",
                    "tid": 3436737
                },
                "events": [
                    {
                        "time": "2019-07-03 09:48:02.001005",
                        "event": "initiated"
                    },
                    {
                        "time": "2019-07-03 09:48:02.001005",
                        "event": "header_read"
                    },
                    {
                        "time": "2019-07-03 09:48:02.001006",
                        "event": "throttled"
                    },
                    {
                        "time": "2019-07-03 09:48:02.001011",
                        "event": "all_read"
                    },
                    {
                        "time": "2019-07-03 09:48:02.001131",
                        "event": "dispatched"
                    },
                    {
                        "time": "2019-07-03 09:48:02.001261",
                        "event": "failed to wrlock, waiting"
                    },
                    {
                        "time": "2019-07-03 09:48:02.001744",
                        "event": "failed to xlock, waiting"
                    },
                    {
                        "time": "2019-07-03 09:48:02.010963",
                        "event": "failed to xlock, waiting"
                    }
                ]
            }
        }
    ],
    "num_ops": 1
}
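When there are many in-flight ops, the dump above can be filtered
programmatically instead of read by eye. A minimal sketch (assuming the
JSON layout shown above; the field names "ops", "age", "type_data",
"reqid", and "events" are taken from that dump, and the helper name
stuck_ops is mine) that flags requests whose last recorded event is a
lock wait:

```python
def stuck_ops(dump, min_age=1.0):
    """Return (reqid, age, last_event) for ops older than min_age
    whose most recent event is a lock wait ('... waiting')."""
    stuck = []
    for op in dump.get("ops", []):
        events = op.get("type_data", {}).get("events", [])
        last = events[-1]["event"] if events else ""
        if op.get("age", 0) >= min_age and "waiting" in last:
            stuck.append((op["type_data"]["reqid"], op["age"], last))
    return stuck

# Abridged version of the dump shown above, as a Python dict
# (in practice you would json.loads() the admin-socket output):
dump = {
    "ops": [{
        "age": 4.046726,
        "type_data": {
            "reqid": "client.14321210:3436737",
            "events": [
                {"event": "initiated"},
                {"event": "failed to xlock, waiting"},
            ],
        },
    }],
}

for reqid, age, event in stuck_ops(dump):
    print(f"{reqid} age={age:.1f}s last_event={event!r}")
# prints: client.14321210:3436737 age=4.0s last_event='failed to xlock, waiting'
```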
The client requests themselves appear to be very fast, but rmdir
frequently fails to take the xlock, and it can take up to 10 s before
the operation completes, which significantly slows down the services
running on the client. It is also not related to the load on the ceph
servers. Restarting the mds solves the issue for a few minutes, but then
it reappears.
The version is mimic 13.2.6, and the cluster is healthy; most of the
clients are on 5.1.* kernels. The server experiencing the issues is on
CentOS 7 kernel 3.10.0-957.21.3.el7.x86_64; we have also tested it with
the 5.1.15 kernel, which shows the same symptoms.
Any ideas on how to solve this problem?
Cheers,
Andrej
--
_____________________________________________________________
prof. dr. Andrej Filipcic, E-mail: Andrej.Filipcic(a)ijs.si
Department of Experimental High Energy Physics - F9
Jozef Stefan Institute, Jamova 39, P.o.Box 3000
SI-1001 Ljubljana, Slovenia
Tel.: +386-1-477-3674 Fax: +386-1-477-3166
-------------------------------------------------------------