...the "mdsmap_decode" errors stopped suddenly on all our clients...
Not exactly sure what the problem was, but restarting our standby MDS
daemons seems to have been the fix.
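For reference, on a systemd-managed cluster the restart would look something like the sketch below (the unit name is illustrative; substitute the MDS name shown by `ceph mds stat` for your standby daemon, and run it on that daemon's host):

```shell
# Restart the standby MDS daemon on its host.
# "ceph-s2" is an example MDS name; use your own standby's name.
systemctl restart ceph-mds@ceph-s2.service
```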
Here's the log on the standby mds exactly when the errors stopped:
2020-04-29 15:41:22.944 7f3d04e06700 1 mds.ceph-s2 Map has assigned me to become a standby
2020-04-29 15:43:05.621 7f3d04e06700 1 mds.ceph-s2 Updating MDS map to version 394712 from mon.0
2020-04-29 15:43:05.623 7f3d04e06700 1 mds.0.0 handle_mds_map i am now mds.34541673.0 replaying mds.0.0
2020-04-29 15:43:05.623 7f3d04e06700 1 mds.0.0 handle_mds_map state change up:boot --> up:standby-replay
2020-04-29 15:43:05.623 7f3d04e06700 1 mds.0.0 replay_start
2020-04-29 15:43:05.623 7f3d04e06700 1 mds.0.0 recovery set is
2020-04-29 15:43:05.655 7f3cfe5f9700 0 mds.0.cache creating system inode with ino:0x100
2020-04-29 15:43:05.656 7f3cfe5f9700 0 mds.0.cache creating system inode with ino:0x1
best regards,
Jake
On 29/04/2020 14:33, Jake Grimmett wrote:
Dear all,
After enabling "allow_standby_replay" on our cluster, we are getting
lots of identical errors in the clients' /var/log/messages, like:
Apr 29 14:21:26 hal kernel: ceph: mdsmap_decode got incorrect state(up:standby-replay)
We are using the mainline (kernel-ml) kernel 5.6.4-1.el7 on Scientific Linux 7.8
Cluster and client are running Ceph v14.2.9
Setting was enabled with:
# ceph fs set cephfs allow_standby_replay true
[root@ceph-s1 ~]# ceph mds stat
cephfs:1 {0=ceph-s3=up:active} 1 up:standby-replay 2 up:standby
Is this something to worry about, or should we just disable
allow_standby_replay?
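If disabling it is the route taken, the setting can be turned off the same way it was enabled, by flipping the flag back to false (a minimal sketch, assuming the filesystem is still named "cephfs" as above):

```shell
# Disable standby-replay; the standby-replay daemon returns to plain standby.
ceph fs set cephfs allow_standby_replay false
```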
Any advice appreciated,
many thanks
Jake
Note: I am working from home until further notice.
For help, contact unixadmin(a)mrc-lmb.cam.ac.uk
--
Dr Jake Grimmett
Head Of Scientific Computing
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.
Phone 01223 267019
Mobile 0776 9886539