Hi,
Some more info: kernel 5.0.13 works fine, the 5.1 series does not. As a
test, if a Linux kernel tarball is unpacked in CephFS on a 5.1-kernel
node, the complete removal of the tree on another machine takes almost
4 hours. Unpacking on a 5.0.13-kernel node and then removing on the
server takes a few minutes. Unpacking and removing on the same
5.1-kernel node is also fast.
So something bad was introduced in the 5.1 kernel series.
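The test above can be sketched roughly as follows. The path and the tree
size here are made up for illustration (a real kernel tarball holds tens
of thousands of files, and on the affected setup the unpack would happen
on a 5.1 node with the removal timed on another machine); substitute your
actual CephFS mount point for a real run:

```shell
# Miniature stand-in for "unpack a kernel tree, then time its removal".
# /tmp/cephfs-rm-test is a hypothetical path; use e.g. /mnt/cephfs/test
# on a real CephFS mount, with unpack and removal on different nodes.
DIR=/tmp/cephfs-rm-test
mkdir -p "$DIR/tree"
for i in $(seq 1 100); do
  mkdir "$DIR/tree/d$i"
  # Many small files per directory, as in an unpacked source tree.
  for j in $(seq 1 10); do echo x > "$DIR/tree/d$i/f$j"; done
done
# Time the recursive removal of the whole tree.
time rm -rf "$DIR/tree"
rm -rf "$DIR"
```

On a healthy setup the removal of even a full kernel tree takes minutes,
not hours, which is what makes the 4-hour figure stand out.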
Cheers,
Andrej
On 09/07/2019 11:36, Andrej Filipčič wrote:
Hi,
Further debugging showed this: the complete deletion works very fast
if creation and deletion are done on the same host. In our case, the
directories to be deleted are created on a cluster node, and the
removal is done on the central server. The problem seems to be
correlated with the kernel upgrade from 4.19.* to 5.1.* on the cluster
nodes.
Any ideas?
Cheers,
Andrej
On 03/07/2019 09:58, Andrej Filipčič wrote:
Hi,
We have been experiencing very slow "rm" on a client lately, and it is
not clear what is wrong. The symptoms look like this:
on client:
# cat /sys/kernel/debug/ceph/*/mdsc
3436737 mds0 rmdir #100216d0113/utils
(hpc/session/H9MKDmgVn2unmmR0Xox1SiGmABFKDmABFKDmmH9XDmABFKDmzmjnGo/pilot/radical/utils)
mds:
# ceph daemon mds.velikaponca dump_ops_in_flight
{
    "ops": [
        {
            "description": "client_request(client.14321210:3436737
                rmdir #0x100216d0113/utils 2019-07-03 09:48:02.000548
                caller_uid=0, caller_gid=0{})",
            "initiated_at": "2019-07-03 09:48:02.001005",
            "age": 4.046726,
            "duration": 4.046759,
            "type_data": {
                "flag_point": "failed to xlock, waiting",
                "reqid": "client.14321210:3436737",
                "op_type": "client_request",
                "client_info": {
                    "client": "client.14321210",
                    "tid": 3436737
                },
                "events": [
                    {
                        "time": "2019-07-03 09:48:02.001005",
                        "event": "initiated"
                    },
                    {
                        "time": "2019-07-03 09:48:02.001005",
                        "event": "header_read"
                    },
                    {
                        "time": "2019-07-03 09:48:02.001006",
                        "event": "throttled"
                    },
                    {
                        "time": "2019-07-03 09:48:02.001011",
                        "event": "all_read"
                    },
                    {
                        "time": "2019-07-03 09:48:02.001131",
                        "event": "dispatched"
                    },
                    {
                        "time": "2019-07-03 09:48:02.001261",
                        "event": "failed to wrlock, waiting"
                    },
                    {
                        "time": "2019-07-03 09:48:02.001744",
                        "event": "failed to xlock, waiting"
                    },
                    {
                        "time": "2019-07-03 09:48:02.010963",
                        "event": "failed to xlock, waiting"
                    }
                ]
            }
        }
    ],
    "num_ops": 1
}
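When watching for these stalls, the interesting fields are "age" and
"flag_point". A minimal sketch for picking them out of a saved dump,
reusing a trimmed copy of the output above (on a live system the file
would instead be produced with
`ceph daemon mds.velikaponca dump_ops_in_flight > /tmp/ops.json`;
the /tmp path is made up):

```shell
# Trimmed copy of the dump shown above, written to a scratch file so the
# grep below has something to work on.
cat > /tmp/ops.json <<'EOF'
{
    "ops": [
        {
            "age": 4.046726,
            "duration": 4.046759,
            "type_data": {
                "flag_point": "failed to xlock, waiting"
            }
        }
    ],
    "num_ops": 1
}
EOF
# Pull out how long each op has been stuck and why.
grep -E '"(age|flag_point|num_ops)"' /tmp/ops.json
```

An op whose flag_point stays at "failed to xlock, waiting" while its age
keeps growing is exactly the pattern we see during the slow rmdir.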
The client requests appear to be very fast, but rmdir frequently fails
to xlock, and it takes up to 10 s before the operation completes, which
considerably slows down the services running on the client; it is also
not related to the load on the ceph servers. Restarting the mds solves
the issue for a few minutes, but then it reappears.
The version is mimic 13.2.6, and the cluster is healthy; most of the
clients are on 5.1.* kernels. The server experiencing the issues is on
a CentOS 7 kernel (3.10.0-957.21.3.el7.x86_64); we have also tested it
with a 5.1.15 kernel, which shows the same symptoms.
Any ideas on how to solve this problem?
Cheers,
Andrej
--
_____________________________________________________________
prof. dr. Andrej Filipcic, E-mail: Andrej.Filipcic(a)ijs.si
Department of Experimental High Energy Physics - F9
Jozef Stefan Institute, Jamova 39, P.o.Box 3000
SI-1001 Ljubljana, Slovenia
Tel.: +386-1-477-3674 Fax: +386-1-477-3166
-------------------------------------------------------------