Hi,
Sorry to ping this old thread, but we have a few kernel client nodes
stuck like this after an outage on their network.
The MDSs are running v14.2.11 and the client has kernel
3.10.0-1127.19.1.el7.x86_64.
This is the first time at our lab that clients didn't reconnect after
a network issue (but this might be the first large client network
outage after we upgraded from luminous to nautilus).
It looks identical to Florian's issue:
Feb 08 10:07:23 hpc-qcd027.cern.ch kernel: libceph: mds0
10.32.5.17:6821 socket closed (con state NEGOTIATING)
Feb 08 10:07:51 hpc-qcd027.cern.ch kernel: ceph: get_quota_realm: ino
(10004fe5035.fffffffffffffffe) null i_snap_realm
The full kernel log | grep ceph is at
https://termbin.com/zdwc
As of now, this client's mountpoint is "stuck" and it does not have a
session open on mds.0, but has sessions on mds.1 and mds.2 (see below
[1]).
I evicted this client from all mds's but the client didn't manage to reconnect:
Feb 08 10:20:01 hpc-qcd027.cern.ch kernel: libceph: mds1
188.185.88.47:6801 socket closed (con state OPEN)
Feb 08 10:20:01 hpc-qcd027.cern.ch kernel: libceph: mds2
188.185.88.90:6801 socket closed (con state OPEN)
Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: libceph: mds1
188.185.88.47:6801 connection reset
Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: libceph: reset on mds1
Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: ceph: mds1 closed our session
Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: ceph: mds1 reconnect start
Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: libceph: mds2
188.185.88.90:6801 connection reset
Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: libceph: reset on mds2
Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: ceph: mds2 closed our session
Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: ceph: mds2 reconnect start
Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: ceph: mds1 reconnect denied
Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: ceph: mds2 reconnect denied
Feb 08 10:20:21 hpc-qcd027.cern.ch kernel: ceph: get_quota_realm: ino
(10004fe5035.fffffffffffffffe) null i_snap_realm
Feb 08 10:20:51 hpc-qcd027.cern.ch kernel: ceph: get_quota_realm: ino
(10004fe5035.fffffffffffffffe) null i_snap_realm
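
To summarise a kernel log like the one above across many clients, something along these lines might help. It is only a rough sketch: it assumes each journal entry sits on one line (not wrapped as in this mail), and the function name and regular expression are made up for illustration. It keeps the last "ceph: mdsN ..." message seen per MDS rank and ignores the libceph and get_quota_realm lines.

```python
import re

# Matches kernel messages of the form "kernel: ceph: mdsN <event text>".
# "kernel: libceph: ..." and "ceph: get_quota_realm ..." lines do not match.
MDS_EVENT = re.compile(r"kernel: ceph: mds(\d+) (.+)$")


def last_event_per_mds(lines):
    """Return {mds rank: last event text} for 'ceph: mdsN ...' messages."""
    last = {}
    for line in lines:
        match = MDS_EVENT.search(line.rstrip())
        if match:
            last[int(match.group(1))] = match.group(2)
    return last


# A few entries from the log above, joined back into single lines.
sample = [
    "Feb 08 10:20:01 hpc-qcd027.cern.ch kernel: libceph: mds1 188.185.88.47:6801 socket closed (con state OPEN)",
    "Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: ceph: mds1 closed our session",
    "Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: ceph: mds1 reconnect start",
    "Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: ceph: mds2 closed our session",
    "Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: ceph: mds2 reconnect start",
    "Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: ceph: mds1 reconnect denied",
    "Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: ceph: mds2 reconnect denied",
    "Feb 08 10:20:21 hpc-qcd027.cern.ch kernel: ceph: get_quota_realm: ino (10004fe5035.fffffffffffffffe) null i_snap_realm",
]

if __name__ == "__main__":
    for rank, event in sorted(last_event_per_mds(sample).items()):
        print(f"mds{rank}: {event}")
```

For the sample lines this reports "reconnect denied" for both mds1 and mds2, and nothing for mds0, which matches what is seen on this client after the eviction.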
Here are some logs from mds.0:
# egrep '10.32.3.150|137564444' /var/log/ceph/ceph-mds.cephflax-mds-ca21a8a1c6.log
2021-02-08 09:16:46.875 7f9b22faa700 0 log_channel(cluster) log [WRN]
: evicting unresponsive client hpc-qcd027.cern.ch:hpc (137564444),
after 304.536 seconds
2021-02-08 09:17:28.326 7f9b28a6c700 0 --1-
[v2:188.184.96.191:6800/1781566860,v1:188.184.96.191:6801/1781566860] >>
v1:10.32.3.150:0/3998218413 conn(0x562ac988b800 0x562e1a6e8800 :6801
s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
l=0).handle_connect_message_2 accept we reset (peer sent cseq 1),
sending RESETSESSION
2021-02-08 09:17:28.628 7f9b28a6c700 0 --1-
[v2:188.184.96.191:6800/1781566860,v1:188.184.96.191:6801/1781566860] >>
v1:10.32.3.150:0/3998218413 conn(0x5629ce1c6800 0x56274776f000 :6801
s=OPENED pgs=26571 cs=1 l=0).fault server, going to standby
2021-02-08 09:49:56.318 7f9b28a6c700 0 --1-
[v2:188.184.96.191:6800/1781566860,v1:188.184.96.191:6801/1781566860] >>
v1:10.32.3.150:0/3998218413 conn(0x5629f5eaa800 0x5627460ab800 :6801
s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
l=0).handle_connect_message_2 accept peer reset, then tried to connect
to us, replacing
Compare this with mds.1, where the reconnect succeeded:
# egrep '10.32.3.150|137564444' /var/log/ceph/ceph-mds.cephflax-mds-370212ad58.log
2021-02-08 09:16:42.970 7fc98299b700 0 log_channel(cluster) log [WRN]
: evicting unresponsive client hpc-qcd027.cern.ch:hpc (137564444),
after 300.629 seconds
2021-02-08 09:17:28.327 7fc987c2f700 0 --1-
[v2:188.185.88.47:6800/3666946863,v1:188.185.88.47:6801/3666946863] >>
v1:10.32.3.150:0/3998218413 conn(0x55a652fea400 0x55a7220ba800 :6801
s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
l=0).handle_connect_message_2 accept we reset (peer sent cseq 1),
sending RESETSESSION
2021-02-08 09:17:28.414 7fc987c2f700 0 --1-
[v2:188.185.88.47:6800/3666946863,v1:188.185.88.47:6801/3666946863] >>
v1:10.32.3.150:0/3998218413 conn(0x55a791481000 0x55a6c7a2d800 :6801
s=OPENED pgs=26573 cs=1 l=0).fault server, going to standby
2021-02-08 10:05:12.810 7fc988430700 0 --1-
[v2:188.185.88.47:6800/3666946863,v1:188.185.88.47:6801/3666946863] >>
v1:10.32.3.150:0/3998218413 conn(0x55a75789d400 0x55a78695b000 :6801
s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
l=0).handle_connect_message_2 accept peer reset, then tried to connect
to us, replacing
2021-02-08 10:05:13.374 7fc97e192700 2 mds.1.server New client
session:
addr="v1:10.32.3.150:0/3998218413",elapsed=0.057278,throttled=0.000007,status="ACCEPTED",root="/hpcqcd"
2021-02-08 10:20:01.666 7fc98499f700 1 mds.1.242924 Evicting client
session 137564444 (v1:10.32.3.150:0/3998218413)
2021-02-08 10:20:01.666 7fc98499f700 0 log_channel(cluster) log [INF]
: Evicting client session 137564444 (v1:10.32.3.150:0/3998218413)
2021-02-08 10:20:02.343 7fc988430700 0 --1-
[v2:188.185.88.47:6800/3666946863,v1:188.185.88.47:6801/3666946863] >>
v1:10.32.3.150:0/3998218413 conn(0x55a7a5d81c00 0x55a7823d1000 :6801
s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
l=0).handle_connect_message_2 accept we reset (peer sent cseq 2),
sending RESETSESSION
2021-02-08 10:20:02.345 7fc988430700 0 --1-
[v2:188.185.88.47:6800/3666946863,v1:188.185.88.47:6801/3666946863] >>
v1:10.32.3.150:0/3998218413 conn(0x55a75bb6a000 0x55a744f69800 :6801
s=OPENED pgs=26588 cs=1 l=0).fault server, going to standby
We have this in the mds config:
mds session blacklist on evict = false
mds session blacklist on timeout = false
The clients are working fine after they are rebooted.
Given the age of this thread, is this perhaps a known issue that has
already been fixed in newer kernels?
Note that during this incident a few clients also crashed and rebooted
-- we are still trying to get the kernel backtrace for those cases, to
see if it matches
https://tracker.ceph.com/issues/40862.
Thanks!
Dan
[1] session ls:
mds.cephflax-mds-ca21a8a1c6: []
mds.cephflax-mds-370212ad58: [
{
"id": 137564444,
"entity": {
"name": {
"type": "client",
"num": 137564444
},
"addr": {
"type": "v1",
"addr": "10.32.3.150:0",
"nonce": 3998218413
}
},
"state": "open",
"num_leases": 0,
"num_caps": 0,
"request_load_avg": 0,
"uptime": 844.92158032299994,
"requests_in_flight": 0,
"completed_requests": 0,
"reconnecting": false,
"recall_caps": {
"value": 0,
"halflife": 60
},
"release_caps": {
"value": 0,
"halflife": 60
},
"recall_caps_throttle": {
"value": 0,
"halflife": 2.5
},
"recall_caps_throttle2o": {
"value": 0,
"halflife": 0.5
},
"session_cache_liveness": {
"value": 0,
"halflife": 300
},
"inst": "client.137564444 v1:10.32.3.150:0/3998218413",
"completed_requests": [],
"prealloc_inos": [],
"used_inos": [],
"client_metadata": {
"features": "0x00000000000000ff",
"entity_id": "hpc",
"hostname": "hpc-qcd027.cern.ch",
"kernel_version": "3.10.0-1127.19.1.el7.x86_64",
"root": "/hpcqcd"
}
}
]
mds.cephflax-mds-adccf51169: [
{
"id": 137564444,
"entity": {
"name": {
"type": "client",
"num": 137564444
},
"addr": {
"type": "v1",
"addr": "10.32.3.150:0",
"nonce": 3998218413
}
},
"state": "open",
"num_leases": 0,
"num_caps": 0,
"request_load_avg": 0,
"uptime": 1761.964491447,
"requests_in_flight": 0,
"completed_requests": 0,
"reconnecting": false,
"recall_caps": {
"value": 0,
"halflife": 60
},
"release_caps": {
"value": 0,
"halflife": 60
},
"recall_caps_throttle": {
"value": 0,
"halflife": 2.5
},
"recall_caps_throttle2o": {
"value": 0,
"halflife": 0.5
},
"session_cache_liveness": {
"value": 0,
"halflife": 300
},
"inst": "client.137564444 v1:10.32.3.150:0/3998218413",
"completed_requests": [],
"prealloc_inos": [],
"used_inos": [],
"client_metadata": {
"features": "0x00000000000000ff",
"entity_id": "hpc",
"hostname": "hpc-qcd027.cern.ch",
"kernel_version": "3.10.0-1127.19.1.el7.x86_64",
"root": "/hpcqcd"
}
}
]
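
To spot this condition without reading the JSON by eye, a small check like the following could be used. It is a sketch under assumptions: it expects the 'session ls' JSON from each daemon to have already been collected into a dict keyed by daemon name (that collection step is not shown), the function name is invented for this example, and the sample data is a trimmed copy of the output above with only "id" and "state" kept.

```python
import json


def mds_without_session(session_ls_by_mds, client_id):
    """Return the MDS names whose 'session ls' output has no entry for client_id."""
    missing = []
    for mds_name, sessions in session_ls_by_mds.items():
        if not any(s.get("id") == client_id for s in sessions):
            missing.append(mds_name)
    return missing


# Trimmed copy of the 'session ls' output above (only "id" and "state" kept).
outputs = {
    "mds.cephflax-mds-ca21a8a1c6": json.loads("[]"),
    "mds.cephflax-mds-370212ad58": json.loads('[{"id": 137564444, "state": "open"}]'),
    "mds.cephflax-mds-adccf51169": json.loads('[{"id": 137564444, "state": "open"}]'),
}

if __name__ == "__main__":
    print(mds_without_session(outputs, 137564444))
```

With the data above, only mds.cephflax-mds-ca21a8a1c6 is reported as missing the session for client 137564444.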
On Mon, Oct 14, 2019 at 12:03 PM Florian Pritz
<florian.pritz(a)rise-world.com> wrote:
>
> On Wed, Oct 02, 2019 at 10:24:41PM +0800, "Yan, Zheng"
> <ukernel(a)gmail.com> wrote:
> > Can you reproduce this? If you can, run 'ceph daemon mds.x session ls'
> > before restarting the mds.
>
> I just managed to run into this issue again. 'ceph daemon mds.x session
> ls' doesn't work because apparently our setup doesn't have the admin
> socket in the expected place. I've therefore used 'ceph tell mds.0
> session ls' which I think should be the same except for how the daemon
> is contacted.
>
> When the issue happens and 2 clients are hanging, 'ceph tell mds.0
> session ls' shows only 9 clients instead of 11. The hanging clients are
> missing from the list. Once they are rebooted they show up in the
> output.
>
> On a potentially interesting note: The clients that were hanging this
> time are the same ones as last time. They aren't set up any differently
> from the others as far as I can tell though.
>
> Florian