Hello Stefan,
On Thu, Feb 13, 2020 at 9:19 AM Stefan Kooman <stefan(a)bit.nl> wrote:
Hi,
We hit the following assert:
-10001> 2020-02-13 17:42:35.543 7f11b5669700 -1 /build/ceph-13.2.8/src/mds/MDCache.cc: In function 'MDRequestRef MDCache::request_get(metareqid_t)' thread 7f11b5669700 time 2020-02-13 17:42:35.545815
/build/ceph-13.2.8/src/mds/MDCache.cc: 9523: FAILED assert(p != active_requests.end())
ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7f11bd8e69de]
2: (()+0x287b67) [0x7f11bd8e6b67]
3: (MDCache::request_get(metareqid_t)+0x94) [0x560cde8bb214]
4: (Server::journal_close_session(Session*, int, Context*)+0x9dd) [0x560cde829d1d]
5: (Server::handle_client_session(MClientSession*)+0x1071) [0x560cde82b0f1]
6: (Server::dispatch(Message*)+0x30b) [0x560cde86f87b]
7: (MDSRank::handle_deferrable_message(Message*)+0x434) [0x560cde7e1664]
8: (MDSRank::_dispatch(Message*, bool)+0x89b) [0x560cde7f8c7b]
9: (MDSRankDispatcher::ms_dispatch(Message*)+0xa3) [0x560cde7f92e3]
10: (MDSDaemon::ms_dispatch(Message*)+0xd3) [0x560cde7d92b3]
11: (DispatchQueue::entry()+0xb92) [0x7f11bd9a9e52]
12: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f11bda46e2d]
13: (()+0x76db) [0x7f11bd1d76db]
14: (clone()+0x3f) [0x7f11bc3bd88f]
Before we hit this assert, a few clients (kernel clients, 5.3.0-26/28)
were not playing nicely:
16:32 < bitrot> mds.mds1 [WRN] client.61994841 isn't responding to mclientcaps(revoke), ino 0x1003846ddc5 pending pAsLsXsFscr issued pAsLsXsFscr, sent 62.342791 seconds ago
16:32 < bitrot> mon.mon1 [WRN] Health check failed: 1 clients failing to respond to capability release (MDS_CLIENT_LATE_RELEASE)
We rebooted both clients. After that, one of them again had some slow
requests. We unmounted the file system, and shortly after that the MDS hit
the assert. Failover went fine this time.
This looks like issue
https://tracker.ceph.com/issues/23059 ... but
that should already have been resolved. Is this the same issue and/or a
regression?
We run 13.2.8.
Thanks for the information. It looks like this bug:
https://tracker.ceph.com/issues/42467#note-7
Do you have logs you can share? You can use ceph-post-file [1] to share.
[1] https://docs.ceph.com/docs/master/man/8/ceph-post-file/
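
If it helps when deciding which part of the logs to upload, the client IDs and
inodes from the late-release warnings can be pulled out with a few lines of
Python. This is only a rough sketch: the regular expression is assumed from the
single warning line quoted above and may not match other Ceph releases, and
find_late_revokes is a hypothetical helper name, not part of Ceph.

```python
import re

# Pattern for the MDS "isn't responding to mclientcaps(revoke)" warning.
# Assumption: the field layout is taken from the one sample line in this
# thread; other releases may format the message differently.
LATE_REVOKE = re.compile(
    r"(?P<client>client\.\d+) isn't responding to mclientcaps\(revoke\), "
    r"ino (?P<ino>0x[0-9a-f]+) pending (?P<pending>\S+) "
    r"issued (?P<issued>\S+), sent (?P<age>[\d.]+) seconds ago"
)


def find_late_revokes(lines):
    """Return one dict per line that matches the late-revoke warning."""
    hits = []
    for line in lines:
        match = LATE_REVOKE.search(line)
        if match:
            hit = match.groupdict()
            hit["age"] = float(hit["age"])  # seconds since the revoke was sent
            hits.append(hit)
    return hits


# The warning from this thread, joined back onto a single line.
sample = [
    "mds.mds1 [WRN] client.61994841 isn't responding to mclientcaps(revoke), "
    "ino 0x1003846ddc5 pending pAsLsXsFscr issued pAsLsXsFscr, "
    "sent 62.342791 seconds ago",
]

hits = find_late_revokes(sample)
for hit in hits:
    print(hit["client"], hit["ino"], hit["age"])
```

The matcher works on any iterable of lines, so it can be fed an open log file
as well as the in-memory sample used here.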
--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D