Hi all!
Hopefully some of you can shed some light on this. We have big problems with Samba
crashing when macOS SMB clients access certain (seemingly random) folders and files over
vfs_ceph.
When browsing the CephFS folder in question directly on a Ceph node where CephFS is
mounted, we also see issues such as slow directory listings. We suspect that macOS
fetching xattr metadata creates a lot of traffic, but that should not lock up the cluster
like this. In the logs we see both rdlock and wrlock, but mostly rdlocks.
End clients experience spurious disconnects when the issue occurs, up to a handful of
times a day. Is this a config issue? Have we hit a bug? It's certainly not a feature
:/
Any pointers on how to troubleshoot or rectify this problem are most welcome.
ceph version 14.2.11
samba version 4.12.10-SerNet-Ubuntu-10.focal
Supermicro X11, Intel Xeon Silver 4110, 9 Ceph nodes, 2x 40GbE network, 150 OSDs on
spinners, NVMe DB/journal
--
2020-11-17 22:09:07.525706 [WRN] evicting unresponsive client bo-samba-03 (3887652779), after 301.746 seconds
2020-11-17 22:09:07.525580 [INF] Evicting (and blacklisting) client session 3877970532 (10.40.30.133:0/3971626932)
2020-11-17 22:09:07.525536 [WRN] evicting unresponsive client bo-samba-03 (3877970532), after 302.034 seconds
2020-11-17 22:07:23.915412 [INF] Cluster is now healthy
2020-11-17 22:07:23.915381 [INF] Health check cleared: MDS_SLOW_REQUEST (was: 1 MDSs report slow requests)
2020-11-17 22:07:23.915330 [INF] Health check cleared: MDS_CLIENT_LATE_RELEASE (was: 1 clients failing to respond to capability release)
2020-11-17 22:07:23.064492 [INF] MDS health message cleared (mds.?): 1 slow requests are blocked > 30 secs
2020-11-17 22:07:23.064457 [INF] MDS health message cleared (mds.?): Client bo-samba-03 failing to respond to capability release
2020-11-17 22:07:17.524023 [WRN] client.3887663354 isn't responding to mclientcaps(revoke), ino 0x10001202b55 pending pAsLsXsFs issued pAsLsXsFsx, sent 63.325997 seconds ago
2020-11-17 22:07:17.523987 [INF] Evicting (and blacklisting) client session 3887663354 (10.40.30.133:0/3230547239)
2020-11-17 22:07:17.523967 [WRN] evicting unresponsive client bo-samba-03 (3887663354), after 64.5412 seconds
2020-11-17 22:07:17.523610 [WRN] slow request 63.325528 seconds old, received at 2020-11-17 22:06:14.197986: client_request(client.3878823430:4 lookup #0x100011f9a68/mappe uten navn 2020-11-17 22:06:14.197908 caller_uid=111139, caller_gid=110513{}) currently failed to rdlock, waiting
2020-11-17 22:07:17.523596 [WRN] 1 slow requests, 1 included below; oldest blocked for > 63.325529 secs
2020-11-17 22:07:19.255177 [WRN] Health check failed: 1 clients failing to respond to capability release (MDS_CLIENT_LATE_RELEASE)
2020-11-17 22:07:12.523453 [WRN] 1 slow requests, 0 included below; oldest blocked for > 58.325433 secs
2020-11-17 22:07:07.523382 [WRN] 1 slow requests, 0 included below; oldest blocked for > 53.325362 secs
2020-11-17 22:07:02.523360 [WRN] 1 slow requests, 0 included below; oldest blocked for > 48.325307 secs
2020-11-17 22:06:57.523218 [WRN] 1 slow requests, 0 included below; oldest blocked for > 43.325199 secs
2020-11-17 22:06:52.523203 [WRN] 1 slow requests, 0 included below; oldest blocked for > 38.325158 secs
2020-11-17 22:06:47.523105 [WRN] slow request 33.325065 seconds old, received at 2020-11-17 22:06:14.197986: client_request(client.3878823430:4 lookup #0x100011f9a68/mappe uten navn 2020-11-17 22:06:14.197908 caller_uid=111139, caller_gid=110513{}) currently failed to rdlock, waiting
2020-11-17 22:06:47.523100 [WRN] 1 slow requests, 1 included below; oldest blocked for > 33.325065 secs
2020-11-17 22:06:51.431745 [WRN] Health check failed: 1 MDSs report slow requests (MDS_SLOW_REQUEST)
2020-11-17 22:06:20.045030 [INF] Cluster is now healthy
2020-11-17 22:06:20.045008 [INF] Health check cleared: MDS_SLOW_REQUEST (was: 1 MDSs report slow requests)
2020-11-17 22:06:20.044960 [INF] Health check cleared: MDS_CLIENT_LATE_RELEASE (was: 1 clients failing to respond to capability release)
2020-11-17 22:06:19.062307 [INF] MDS health message cleared (mds.?): 1 slow requests are blocked > 30 secs
2020-11-17 22:06:19.062253 [INF] MDS health message cleared (mds.?): Client bo-samba-03 failing to respond to capability release
2020-11-17 22:06:15.936150 [WRN] Health check failed: 1 clients failing to respond to capability release (MDS_CLIENT_LATE_RELEASE)
2020-11-17 22:06:12.522624 [WRN] client.3869410498 isn't responding to mclientcaps(revoke), ino 0x10001202b55 pending pAsLsXsFs issued pAsLsXsFsx, sent 64.045677 seconds ago
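
To make the "isn't responding to mclientcaps(revoke)" lines easier to read, here is a minimal Python sketch that works out which capability the MDS is waiting to get back, by subtracting the "pending" cap set from the "issued" one. It assumes the usual CephFS cap notation (p = pin; A, L, X, F = auth, link, xattr, file; lowercase letters are the bits held per group, e.g. s = shared, x = exclusive). The names parse_caps and revoked_caps are made up for this illustration and are not part of any Ceph tool.

```python
import re

# One cap group: a letter (A, L, X or F) followed by its lowercase bit flags.
CAP_GROUP = re.compile(r"([ALXF])([a-z]*)")

# Matches the MDS warning about a client that has not answered a cap revoke.
REVOKE_LINE = re.compile(
    r"client\.(?P<client>\d+) isn't responding to mclientcaps\(revoke\), "
    r"ino (?P<ino>0x[0-9a-f]+) pending (?P<pending>[A-Za-z]+) "
    r"issued (?P<issued>[A-Za-z]+),"
)


def parse_caps(caps):
    """Split a cap string such as 'pAsLsXsFsx' into individual bits,
    e.g. {'p', 'As', 'Ls', 'Xs', 'Fs', 'Fx'}."""
    bits = set()
    if caps.startswith("p"):
        bits.add("p")
        caps = caps[1:]
    for group, flags in CAP_GROUP.findall(caps):
        for flag in flags:
            bits.add(group + flag)
    return bits


def revoked_caps(line):
    """Return client id, inode and the caps being revoked, or None
    if the line is not a cap-revoke warning."""
    match = REVOKE_LINE.search(line)
    if match is None:
        return None
    # Caps the client still holds ("issued") but should give up ("pending" is what remains).
    revoked = parse_caps(match["issued"]) - parse_caps(match["pending"])
    return {
        "client": match["client"],
        "ino": match["ino"],
        "revoked": sorted(revoked),
    }
```

Fed the 22:07:17.524023 line from the log above (joined onto one line), this reports that only the exclusive file cap (Fx) on inode 0x10001202b55 is being revoked, which is the cap the Samba client is not releasing.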
--thomas
--
Thomas Hukkelberg
thomas(a)hovedkvarteret.no
+47 971 81 192
--
support(a)hovedkvarteret.no
+47 966 44 999
Thomas,
This is controlled by the MDS option mds_cap_revoke_eviction_timeout (300 s
by default). If a client has crashed or hung for a long time, the cluster
will evict it.
This prevents other clients from hanging while they wait for locks. If you are sure the
client will recover later, you can set it to zero.
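
As a rough sketch of how that option could be inspected and changed, the Python below only builds the `ceph config get` / `ceph config set` command lines and leaves running them to you. It assumes the centralized config store is in use on this 14.2.x cluster and that the `ceph` CLI is on the PATH with admin credentials; the helper names are hypothetical. It is worth reading the current value first rather than assuming a default.

```python
import subprocess

# Option named in this thread; applied to all MDS daemons via the "mds" section.
OPTION = "mds_cap_revoke_eviction_timeout"


def config_get_cmd(option=OPTION):
    """Command line that prints the current value of an MDS option."""
    return ["ceph", "config", "get", "mds", option]


def config_set_cmd(seconds, option=OPTION):
    """Command line that sets an MDS option to a number of seconds."""
    if seconds < 0:
        raise ValueError("timeout must be zero or a positive number of seconds")
    return ["ceph", "config", "set", "mds", option, str(seconds)]


def run(cmd):
    """Run a command and return its trimmed stdout; raises if the command fails."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip()


# Example use on an admin node (not executed here):
#   print(run(config_get_cmd()))   # show the value currently in effect
#   run(config_set_cmd(0))         # set the timeout to zero
```

Keeping command construction separate from execution makes it easy to print and review the exact command before touching a production cluster.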
Hoping this helps.
Yours, Norman
On 18/11/2020 6:49 AM, Thomas Hukkelberg wrote: