OSD 12 looks much the same. I don't have logs back to the original date, but
this looks very similar: db/sst corruption. The standard fsck approaches
couldn't fix it. I believe it was a form of ATA failure; OSD 11 and 12, if
I recall correctly, did not actually experience SMART-reportable errors.
(Essentially, fans died on an internal SATA enclosure. As the enclosure had
no sensor mechanism, I didn't realize it until drive temps started to
climb. I believe most of the drives survived OK, but the enclosure itself I
ultimately had to completely bypass, even after replacing fans.)
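By "standard fsck approaches" I mean roughly the following, run offline with
the OSD stopped (the OSD id and data path are just examples from my layout):

    systemctl stop ceph-osd@11
    ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-11
    ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-11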
My assumption, once ceph fsck approaches failed, was that I'd need to mark
11 and 12 (and maybe 4) as lost, but I was reluctant to do so until I
confirmed that I had absolutely lost data beyond recall.
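If it does come to marking them lost, my understanding is the sequence would
be something like the following (this is exactly the step I've been holding
off on):

    ceph osd lost 11 --yes-i-really-mean-it
    ceph osd lost 12 --yes-i-really-mean-it
    # and only as a true last resort, per PG with unfound objects:
    ceph pg 19.5 mark_unfound_lost delete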
On Sat, Dec 12, 2020 at 10:24 PM Igor Fedotov <ifedotov(a)suse.de> wrote:
Hi Jeremy,
wondering what were the OSDs' logs when they crashed for the first time?
And does OSD.12 report a similar problem now:
3> 2020-12-12 20:23:45.756 7f2d21404700 -1 rocksdb: submit_common error:
Corruption: block checksum mismatch: expected 3113305400, got 1242690251 in
db/000348.sst offset 47935290 size 4704 code = 2 Rocksdb transaction:
?
Thanks,
Igor
On 12/13/2020 8:48 AM, Jeremy Austin wrote:
I could use some input from more experienced folks…
First time seeing this behavior. I've been running ceph in production
(replicated) since 2016 or earlier.
This, however, is a small 3-node cluster for testing EC. Crush map rules
should sustain the loss of an entire node.
Here's the EC rule:
rule cephfs425 {
        id 6
        type erasure
        min_size 3
        max_size 6
        step set_chooseleaf_tries 40
        step set_choose_tries 400
        step take default
        step choose indep 3 type host
        step choose indep 2 type osd
        step emit
}
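As I read that rule, each PG gets 6 shards laid out as 2 per host across all
3 hosts, so with a k=4/m=2 profile (which is what I believe backs this rule,
hence the name) a whole-host failure should cost exactly m shards. The
profile itself can be double-checked with:

    ceph osd erasure-code-profile ls
    ceph osd erasure-code-profile get <profile-name>   # shows k, m, crush-failure-domain, etc.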
I had actual hardware failure on one node. Interestingly, this appears to
have resulted in data loss. OSDs began to crash in a cascade on other nodes
(i.e., nodes with no known hardware failure). Not a low RAM problem.
I could use some pointers about how to get the down PGs back up — I *think*
there are enough EC shards, even disregarding the OSDs that crash on start.
nautilus 14.2.15
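For reference, the detail below comes from ceph osd tree, ceph -s,
ceph health detail, and per-PG queries along the lines of:

    ceph pg 15.10 query    # 15.10 is one of the seven down PGs

(15.10 is just an example; the same applies to any of the down PGs.)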
ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 54.75960 root default
-10 16.81067 host sumia
1 hdd 5.57719 osd.1 up 1.00000 1.00000
5 hdd 5.58469 osd.5 up 1.00000 1.00000
6 hdd 5.64879 osd.6 up 1.00000 1.00000
-7 16.73048 host sumib
0 hdd 5.57899 osd.0 up 1.00000 1.00000
2 hdd 5.56549 osd.2 up 1.00000 1.00000
3 hdd 5.58600 osd.3 up 1.00000 1.00000
-3 21.21844 host tower1
4 hdd 3.71680 osd.4 up 0 1.00000
7 hdd 1.84799 osd.7 up 1.00000 1.00000
8 hdd 3.71680 osd.8 up 1.00000 1.00000
9 hdd 1.84929 osd.9 up 1.00000 1.00000
10 hdd 2.72899 osd.10 up 1.00000 1.00000
11 hdd 3.71989 osd.11 down 0 1.00000
12 hdd 3.63869 osd.12 down 0 1.00000
  cluster:
    id:     d0b4c175-02ba-4a64-8040-eb163002cba6
    health: HEALTH_ERR
            1 MDSs report slow requests
            4/4239345 objects unfound (0.000%)
            Too many repaired reads on 3 OSDs
            Reduced data availability: 7 pgs inactive, 7 pgs down
            Possible data damage: 4 pgs recovery_unfound
            Degraded data redundancy: 95807/24738783 objects degraded (0.387%), 4 pgs degraded, 3 pgs undersized
            7 pgs not deep-scrubbed in time
            7 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum sumib,tower1,sumia (age 4d)
    mgr: sumib(active, since 7d), standbys: sumia, tower1
    mds: cephfs:1 {0=sumib=up:active} 2 up:standby
    osd: 13 osds: 11 up (since 3d), 10 in (since 4d); 3 remapped pgs

  data:
    pools:   5 pools, 256 pgs
    objects: 4.24M objects, 15 TiB
    usage:   24 TiB used, 24 TiB / 47 TiB avail
    pgs:     2.734% pgs not active
             95807/24738783 objects degraded (0.387%)
             47910/24738783 objects misplaced (0.194%)
             4/4239345 objects unfound (0.000%)
             245 active+clean
             7   down
             3   active+recovery_unfound+undersized+degraded+remapped
             1   active+recovery_unfound+degraded+repair

  progress:
    Rebalancing after osd.12 marked out
      [============================..]
    Rebalancing after osd.4 marked out
      [=============================.]
A snippet from a query of an example down pg:
"up": [
3,
2,
5,
1,
8,
9
],
"acting": [
3,
2,
5,
1,
8,
9
],
<snip>
],
"blocked": "peering is blocked due to down osds",
"down_osds_we_would_probe": [
11,
12
],
"peering_blocked_by": [
{
"osd": 11,
"current_lost_at": 0,
"comment": "starting or marking this osd lost may let
us proceed"
},
{
"osd": 12,
"current_lost_at": 0,
"comment": "starting or marking this osd lost may let
us proceed"
}
]
},
{
Oddly, these OSDs possibly did NOT experience hardware failure. However,
they won't start -- see pastebin for ceph-osd.11.log
https://pastebin.com/6U6sQJuJ
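Before going the "mark lost" route, one thing I'm considering (untested here,
so corrections welcome) is trying to lift the needed shards off the dead OSDs
with ceph-objectstore-tool, since it works against the store offline and
doesn't need the OSD to start. Something like, for pg 15.10 (the sN shard
suffix below is a placeholder; --op list-pgs on the data path shows the real
ones, and the destination OSD and file path are just examples):

    # on the node with the down OSD (OSD stopped):
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11 \
        --pgid 15.10s4 --op export --file /root/15.10s4.export
    # then, with a healthy OSD stopped, import the shard there:
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-7 \
        --op import --file /root/15.10s4.export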
HEALTH_ERR 1 MDSs report slow requests; 4/4239345 objects unfound (0.000%);
Too many repaired reads on 3 OSDs; Reduced data availability: 7 pgs inactive,
7 pgs down; Possible data damage: 4 pgs recovery_unfound; Degraded data
redundancy: 95807/24738783 objects degraded (0.387%), 4 pgs degraded,
3 pgs undersized; 7 pgs not deep-scrubbed in time; 7 pgs not scrubbed in time
MDS_SLOW_REQUEST 1 MDSs report slow requests
mdssumib(mds.0): 42 slow requests are blocked > 30 secs
OBJECT_UNFOUND 4/4239345 objects unfound (0.000%)
pg 19.5 has 1 unfound objects
pg 15.2f has 1 unfound objects
pg 15.41 has 1 unfound objects
pg 15.58 has 1 unfound objects
OSD_TOO_MANY_REPAIRS Too many repaired reads on 3 OSDs
osd.9 had 9664 reads repaired
osd.7 had 9665 reads repaired
osd.4 had 12 reads repaired
PG_AVAILABILITY Reduced data availability: 7 pgs inactive, 7 pgs down
pg 15.10 is down, acting [3,2,5,1,8,9]
pg 15.1e is down, acting [5,1,9,8,2,3]
pg 15.40 is down, acting [7,10,1,5,3,2]
pg 15.4a is down, acting [0,3,5,6,9,10]
pg 15.6a is down, acting [3,2,6,1,10,8]
pg 15.71 is down, acting [3,2,1,6,8,10]
pg 15.76 is down, acting [2,0,6,5,10,9]
PG_DAMAGED Possible data damage: 4 pgs recovery_unfound
pg 15.2f is active+recovery_unfound+undersized+degraded+remapped,
acting [5,1,0,3,2147483647,7], 1 unfound
pg 15.41 is active+recovery_unfound+undersized+degraded+remapped,
acting [5,1,0,3,2147483647,2147483647], 1 unfound
pg 15.58 is active+recovery_unfound+undersized+degraded+remapped,
acting [10,2147483647,2,3,1,5], 1 unfound
pg 19.5 is active+recovery_unfound+degraded+repair, acting
[3,2,5,1,8,10], 1 unfound
PG_DEGRADED Degraded data redundancy: 95807/24738783 objects degraded
(0.387%), 4 pgs degraded, 3 pgs undersized
pg 15.2f is stuck undersized for 635305.932075, current state
active+recovery_unfound+undersized+degraded+remapped, last acting
[5,1,0,3,2147483647,7]
pg 15.41 is stuck undersized for 364298.836902, current state
active+recovery_unfound+undersized+degraded+remapped, last acting
[5,1,0,3,2147483647,2147483647]
pg 15.58 is stuck undersized for 384461.110229, current state
active+recovery_unfound+undersized+degraded+remapped, last acting
[10,2147483647,2,3,1,5]
pg 19.5 is active+recovery_unfound+degraded+repair, acting
[3,2,5,1,8,10], 1 unfound
PG_NOT_DEEP_SCRUBBED 7 pgs not deep-scrubbed in time
pg 15.76 not deep-scrubbed since 2020-10-21 14:30:03.935228
pg 15.71 not deep-scrubbed since 2020-10-21 12:20:46.235792
pg 15.6a not deep-scrubbed since 2020-10-21 07:52:33.914083
pg 15.10 not deep-scrubbed since 2020-10-22 03:24:40.465367
pg 15.1e not deep-scrubbed since 2020-10-22 10:37:36.169959
pg 15.40 not deep-scrubbed since 2020-10-23 05:33:35.208748
pg 15.4a not deep-scrubbed since 2020-10-22 05:14:06.981035
PG_NOT_SCRUBBED 7 pgs not scrubbed in time
pg 15.76 not scrubbed since 2020-10-24 08:12:40.090831
pg 15.71 not scrubbed since 2020-10-25 05:22:40.573572
pg 15.6a not scrubbed since 2020-10-24 15:03:09.189964
pg 15.10 not scrubbed since 2020-10-24 16:25:08.826981
pg 15.1e not scrubbed since 2020-10-24 16:05:03.080127
pg 15.40 not scrubbed since 2020-10-24 11:58:04.290488
pg 15.4a not scrubbed since 2020-10-24 11:32:44.573551