Colleagues, I have an update.
Since yesterday the Ceph health situation has become much worse than it was
previously.
We found that:
- "ceph -s" reports that some PGs are in the "stale" state;
- many diagnostic ceph subcommands hang. For example, "ceph osd ls",
"ceph osd dump", "ceph osd tree" and "ceph health detail" still produce
output, but "ceph osd status", all the "ceph pg ..." subcommands and some
other ones hang.
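To map out which subcommands hang without blocking the terminal each time, we wrapped them in timeout(1). This is just a sketch of the approach (the 10-second cutoff is arbitrary):

```shell
# Probe a ceph subcommand with a timeout so a hung one does not block the shell.
# timeout(1) exits with 124 when it had to kill the command, i.e. it hung.
probe() {
    timeout "${PROBE_TIMEOUT:-10}" "$@" >/dev/null 2>&1
    rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "HUNG: $*"
    elif [ "$rc" -eq 0 ]; then
        echo "ok: $*"
    else
        echo "err($rc): $*"
    fi
}

probe ceph osd ls
probe ceph osd dump
probe ceph osd status
probe ceph pg stat
```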
So it looks like the crashes of the MDS daemons were only the first sign of the problem.
I read that the "stale" state for a PG means that all nodes storing that placement
group may be down - but that is not the case here: all OSD daemons are up on all three nodes:
------- ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 68.05609 root default
-3 22.68536 host asrv-dev-stor-1
0 hdd 5.45799 osd.0 up 1.00000 1.00000
1 hdd 5.45799 osd.1 up 1.00000 1.00000
2 hdd 5.45799 osd.2 up 1.00000 1.00000
3 hdd 5.45799 osd.3 up 1.00000 1.00000
12 ssd 0.42670 osd.12 up 1.00000 1.00000
13 ssd 0.42670 osd.13 up 1.00000 1.00000
-5 22.68536 host asrv-dev-stor-2
4 hdd 5.45799 osd.4 up 1.00000 1.00000
5 hdd 5.45799 osd.5 up 1.00000 1.00000
6 hdd 5.45799 osd.6 up 1.00000 1.00000
7 hdd 5.45799 osd.7 up 1.00000 1.00000
14 ssd 0.42670 osd.14 up 1.00000 1.00000
15 ssd 0.42670 osd.15 up 1.00000 1.00000
-7 22.68536 host asrv-dev-stor-3
8 hdd 5.45799 osd.8 up 1.00000 1.00000
10 hdd 5.45799 osd.10 up 1.00000 1.00000
11 hdd 5.45799 osd.11 up 1.00000 1.00000
18 hdd 5.45799 osd.18 up 1.00000 1.00000
16 ssd 0.42670 osd.16 up 1.00000 1.00000
17 ssd 0.42670 osd.17 up 1.00000 1.00000
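Since "up" in the tree only reflects what the monitors believe, we also want to confirm each OSD daemon actually responds. A sketch of that check, assuming "ceph tell" itself does not hang in this state:

```shell
# Ask each OSD daemon directly for its version; a reply shows the process
# is alive and reachable, not merely marked "up" in the OSD map.
check_osd() {
    # $1: osd id; prints "responds" or "no response"
    if timeout 10 ceph tell "osd.$1" version >/dev/null 2>&1; then
        echo "osd.$1 responds"
    else
        echo "osd.$1: no response"
    fi
}

# Our OSD ids from the tree above (note there is no osd.9):
for id in 0 1 2 3 4 5 6 7 8 10 11 12 13 14 15 16 17 18; do
    check_osd "$id"
done
```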
Could it be a physical problem with our drives? "smartctl -a" reports nothing
wrong. We have also started a surface check with dd, but it will take at least 7 hours
per drive...
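For reference, this is the kind of read-only surface scan we started (the device name is a placeholder for our actual drives):

```shell
# Read every block of the device once; an unreadable sector makes dd abort
# with an I/O error, and the failing LBA then shows up in dmesg.
surface_scan() {
    # $1: block device (or file) to read, e.g. /dev/sdb
    dd if="$1" of=/dev/null bs=4M status=progress
}

# One drive at a time - at least 7 hours each, as estimated above:
# surface_scan /dev/sdb
```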
What else should we do?
The output of "ceph health detail":
ceph health detail
HEALTH_ERR 1 MDSs report damaged metadata; insufficient standby MDS daemons available;
Reduced data availability: 50 pgs stale; 90 daemons have recently crashed; 3 mgr modules
have recently crashed
[ERR] MDS_DAMAGE: 1 MDSs report damaged metadata
mds.asrv-dev-stor-2(mds.0): Metadata damage detected
[WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
have 0; want 1 more
[WRN] PG_AVAILABILITY: Reduced data availability: 50 pgs stale
pg 5.0 is stuck stale for 67m, current state stale+active+clean, last acting
[4,1,11]
pg 5.13 is stuck stale for 67m, current state stale+active+clean, last acting
[4,0,10]
pg 5.18 is stuck stale for 67m, current state stale+active+clean, last acting
[4,11,2]
pg 5.19 is stuck stale for 67m, current state stale+active+clean, last acting
[4,3,10]
pg 5.1e is stuck stale for 10h, current state stale+active+clean, last acting
[0,7,11]
pg 5.22 is stuck stale for 10h, current state stale+active+clean, last acting
[0,6,18]
pg 5.26 is stuck stale for 67m, current state stale+active+clean, last acting
[4,1,18]
pg 5.29 is stuck stale for 10h, current state stale+active+clean, last acting
[0,11,6]
pg 5.2b is stuck stale for 10h, current state stale+active+clean, last acting
[0,18,6]
pg 5.30 is stuck stale for 10h, current state stale+active+clean, last acting
[0,8,7]
pg 5.37 is stuck stale for 67m, current state stale+active+clean, last acting
[4,10,0]
pg 5.3c is stuck stale for 67m, current state stale+active+clean, last acting
[4,10,3]
pg 5.43 is stuck stale for 10h, current state stale+active+clean, last acting
[0,6,18]
pg 5.44 is stuck stale for 67m, current state stale+active+clean, last acting
[4,2,11]
pg 5.45 is stuck stale for 67m, current state stale+active+clean, last acting
[4,11,3]
pg 5.47 is stuck stale for 67m, current state stale+active+clean, last acting
[4,10,1]
pg 5.48 is stuck stale for 10h, current state stale+active+clean, last acting
[0,5,11]
pg 5.60 is stuck stale for 10h, current state stale+active+clean, last acting
[0,10,7]
pg 7.2 is stuck stale for 67m, current state stale+active+clean, last acting
[4,2,10]
pg 7.4 is stuck stale for 67m, current state stale+active+clean, last acting
[4,18,3]
pg 7.f is stuck stale for 10h, current state stale+active+clean, last acting [0,4,8]
pg 7.13 is stuck stale for 10h, current state stale+active+clean, last acting
[0,7,11]
pg 7.18 is stuck stale for 67m, current state stale+active+clean, last acting
[4,0,10]
pg 7.1b is stuck stale for 67m, current state stale+active+clean, last acting
[4,8,0]
pg 7.1f is stuck stale for 10h, current state stale+active+clean, last acting
[0,5,11]
pg 7.2a is stuck stale for 10h, current state stale+active+clean, last acting
[0,6,8]
pg 7.35 is stuck stale for 10h, current state stale+active+clean, last acting
[0,6,10]
pg 7.36 is stuck stale for 67m, current state stale+active+clean, last acting
[4,2,8]
pg 7.37 is stuck stale for 10h, current state stale+active+clean, last acting
[0,8,7]
pg 7.38 is stuck stale for 10h, current state stale+active+clean, last acting
[0,6,11]
pg 9.10 is stuck stale for 67m, current state stale+active+clean, last acting
[4,0,8]
pg 9.16 is stuck stale for 10h, current state stale+active+clean, last acting
[0,4,11]
pg 9.20 is stuck stale for 67m, current state stale+active+clean, last acting
[4,3,8]
pg 9.2a is stuck stale for 67m, current state stale+active+clean, last acting
[4,8,0]
pg 9.33 is stuck stale for 10h, current state stale+active+clean, last acting
[0,18,5]
pg 9.3a is stuck stale for 10h, current state stale+active+clean, last acting
[0,8,5]
pg 9.48 is stuck stale for 67m, current state stale+active+clean, last acting
[4,2,11]
pg 9.4b is stuck stale for 10h, current state stale+active+clean, last acting
[0,7,11]
pg 9.4f is stuck stale for 10h, current state stale+active+clean, last acting
[0,6,10]
pg 9.52 is stuck stale for 67m, current state stale+active+clean, last acting
[4,8,0]
pg 9.53 is stuck stale for 10h, current state stale+active+clean, last acting
[0,11,7]
pg 9.56 is stuck stale for 10h, current state stale+active+clean, last acting
[0,5,18]
pg 9.5a is stuck stale for 10h, current state stale+active+clean, last acting
[0,7,8]
pg 9.5d is stuck stale for 10h, current state stale+active+clean, last acting
[0,6,10]
pg 9.6b is stuck stale for 67m, current state stale+active+clean, last acting
[4,11,0]
pg 9.6f is stuck stale for 67m, current state stale+active+clean, last acting
[4,2,18]
pg 9.73 is stuck stale for 67m, current state stale+active+clean, last acting
[4,2,10]
pg 9.76 is stuck stale for 67m, current state stale+active+clean, last acting
[4,10,2]
pg 9.79 is stuck stale for 10h, current state stale+active+clean, last acting
[0,6,8]
pg 9.7f is stuck stale for 10h, current state stale+active+clean, last acting
[0,10,5]