Hi,
further debugging showed the following: the deletion completes very
quickly if creation and deletion are done on the same host. In our case,
the directories to be deleted are created on a cluster node, and the
removal is done on the central server. The problem seems to be correlated
with the kernel upgrade from 4.19.* to 5.1.* on the cluster nodes.
Any ideas?
Cheers,
Andrej
On 03/07/2019 09:58, Andrej Filipčič wrote:
Hi,
We have been experiencing very slow "rm" on a client lately, and it is
not clear what is wrong. The symptom looks like this:
on client:
# cat /sys/kernel/debug/ceph/*/mdsc
3436737 mds0 rmdir #100216d0113/utils
(hpc/session/H9MKDmgVn2unmmR0Xox1SiGmABFKDmABFKDmmH9XDmABFKDmzmjnGo/pilot/radical/utils)
mds:
# ceph daemon mds.velikaponca dump_ops_in_flight
{
    "ops": [
        {
            "description": "client_request(client.14321210:3436737 rmdir #0x100216d0113/utils 2019-07-03 09:48:02.000548 caller_uid=0, caller_gid=0{})",
            "initiated_at": "2019-07-03 09:48:02.001005",
            "age": 4.046726,
            "duration": 4.046759,
            "type_data": {
                "flag_point": "failed to xlock, waiting",
                "reqid": "client.14321210:3436737",
                "op_type": "client_request",
                "client_info": {
                    "client": "client.14321210",
                    "tid": 3436737
                },
                "events": [
                    {
                        "time": "2019-07-03 09:48:02.001005",
                        "event": "initiated"
                    },
                    {
                        "time": "2019-07-03 09:48:02.001005",
                        "event": "header_read"
                    },
                    {
                        "time": "2019-07-03 09:48:02.001006",
                        "event": "throttled"
                    },
                    {
                        "time": "2019-07-03 09:48:02.001011",
                        "event": "all_read"
                    },
                    {
                        "time": "2019-07-03 09:48:02.001131",
                        "event": "dispatched"
                    },
                    {
                        "time": "2019-07-03 09:48:02.001261",
                        "event": "failed to wrlock, waiting"
                    },
                    {
                        "time": "2019-07-03 09:48:02.001744",
                        "event": "failed to xlock, waiting"
                    },
                    {
                        "time": "2019-07-03 09:48:02.010963",
                        "event": "failed to xlock, waiting"
                    }
                ]
            }
        }
    ],
    "num_ops": 1
}
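When there are many in-flight ops, the dump above can be filtered
programmatically instead of read by eye. A minimal sketch (assuming the
JSON layout shown above; the field names "ops", "age", "type_data",
"reqid", and "events" are taken from that dump, and the helper name
stuck_ops is mine) that flags requests whose last recorded event is a
lock wait:

```python
def stuck_ops(dump, min_age=1.0):
    """Return (reqid, age, last_event) for ops older than min_age
    whose most recent event is a lock wait ('... waiting')."""
    stuck = []
    for op in dump.get("ops", []):
        events = op.get("type_data", {}).get("events", [])
        last = events[-1]["event"] if events else ""
        if op.get("age", 0) >= min_age and "waiting" in last:
            stuck.append((op["type_data"]["reqid"], op["age"], last))
    return stuck

# Abridged version of the dump shown above, as a Python dict
# (in practice you would json.loads() the admin-socket output):
dump = {
    "ops": [{
        "age": 4.046726,
        "type_data": {
            "reqid": "client.14321210:3436737",
            "events": [
                {"event": "initiated"},
                {"event": "failed to xlock, waiting"},
            ],
        },
    }],
}

for reqid, age, event in stuck_ops(dump):
    print(f"{reqid} age={age:.1f}s last_event={event!r}")
# prints: client.14321210:3436737 age=4.0s last_event='failed to xlock, waiting'
```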
The client requests themselves appear to be very fast, but rmdir
frequently fails to take the xlock, and it can take up to 10 s before
the operation completes, which significantly slows down the services
running on the client. It is also not related to the load on the ceph
servers. Restarting the mds solves the issue for a few minutes, but then
it reappears.
The version is mimic 13.2.6, and the cluster is healthy; most of the
clients are on 5.1.* kernels. The server experiencing the issues is on
CentOS 7 kernel 3.10.0-957.21.3.el7.x86_64; we have also tested it with
the 5.1.15 kernel, which shows the same symptoms.
Any ideas on how to solve this problem?
Cheers,
Andrej
--
_____________________________________________________________
prof. dr. Andrej Filipcic, E-mail: Andrej.Filipcic(a)ijs.si
Department of Experimental High Energy Physics - F9
Jozef Stefan Institute, Jamova 39, P.o.Box 3000
SI-1001 Ljubljana, Slovenia
Tel.: +386-1-477-3674 Fax: +386-1-477-3166
-------------------------------------------------------------