It's not the time:
[root@gnosis ~]# pdsh -w ceph-[01-20] date
ceph-01: Tue May 5 17:34:52 CEST 2020
ceph-03: Tue May 5 17:34:52 CEST 2020
ceph-02: Tue May 5 17:34:52 CEST 2020
ceph-04: Tue May 5 17:34:52 CEST 2020
ceph-07: Tue May 5 17:34:52 CEST 2020
ceph-14: Tue May 5 17:34:52 CEST 2020
ceph-10: Tue May 5 17:34:52 CEST 2020
ceph-12: Tue May 5 17:34:52 CEST 2020
ceph-11: Tue May 5 17:34:52 CEST 2020
ceph-15: Tue May 5 17:34:52 CEST 2020
ceph-08: Tue May 5 17:34:52 CEST 2020
ceph-09: Tue May 5 17:34:52 CEST 2020
ceph-05: Tue May 5 17:34:52 CEST 2020
ceph-13: Tue May 5 17:34:52 CEST 2020
ceph-19: Tue May 5 17:34:52 CEST 2020
ceph-06: Tue May 5 17:34:52 CEST 2020
ceph-17: Tue May 5 17:34:52 CEST 2020
ceph-18: Tue May 5 17:34:52 CEST 2020
ceph-20: Tue May 5 17:34:52 CEST 2020
ceph-16: Tue May 5 17:34:52 CEST 2020
I would guess a timeout or packet loss.
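(For what it's worth, date only shows whole seconds; assuming the nodes run chrony, something like
pdsh -w ceph-[01-20] 'chronyc tracking | grep "System time"'
would show the actual sub-second offsets.)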
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Alex Gorbachev <ag@iss-integration.com>
Sent: 05 May 2020 17:31:17
To: Frank Schilder
Cc: Dan van der Ster; ceph-users
Subject: Re: [ceph-users] Re: Ceph meltdown, need help
On Tue, May 5, 2020 at 11:27 AM Frank Schilder
<frans@dtu.dk> wrote:
I tried that and get:
2020-05-05 17:23:17.008 7fbbeffff700 0 -- 192.168.32.64:0/2061991714 >> 192.168.32.68:6826/5216 conn(0x7fbbf01d6f80 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
I had that when my time was off on the MONs. We had some NTP problems once at a client site
following a major power outage, and I recall this exact message. Check your time sync.
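If your release is recent enough, the mons also report the clock skew they measure themselves:
ceph time-sync-status
and chronyc tracking or ntpq -p on each node will show the local NTP offset, depending on what you run.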
--
Alex Gorbachev
Intelligent Systems Services Inc.
Strange.
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Alex Gorbachev <ag@iss-integration.com>
Sent: 05 May 2020 17:19:26
To: Frank Schilder
Cc: Dan van der Ster; ceph-users
Subject: Re: [ceph-users] Re: Ceph meltdown, need help
Hi Frank,
On Tue, May 5, 2020 at 10:43 AM Frank Schilder
<frans@dtu.dk> wrote:
Dear Dan,
thank you for your fast response. Please find the log of the first OSD that went down and
the ceph.log at these links:
https://files.dtu.dk/u/tF1zv5zdc6mmXXO_/ceph.log?l
https://files.dtu.dk/u/hPb5qax2-b6W9vmp/ceph-osd.2.log?l
I can collect more osd logs if this helps.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Dan van der Ster
<dan@vanderster.com>
Sent: 05 May 2020 16:25:31
To: Frank Schilder
Cc: ceph-users
Subject: Re: [ceph-users] Ceph meltdown, need help
Hi Frank,
Could you share any ceph-osd logs and also the ceph.log from a mon to
see why the cluster thinks all those osds are down?
Simply marking them up isn't going to help, I'm afraid.
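(On a default install these are typically /var/log/ceph/ceph.log on a mon host and
/var/log/ceph/ceph-osd.<id>.log on the OSD hosts; grepping the cluster log for the first
"failed" entries should show which OSDs were reported down, and by whom.)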
Cheers, Dan
On Tue, May 5, 2020 at 4:12 PM Frank Schilder
<frans@dtu.dk> wrote:
Hi all,
a lot of OSDs went down in our cluster, Mimic 13.2.8. Current status included below. All
daemons are running, no OSD process actually crashed. Can I start marking OSDs in and up to
get them talking to each other again?
Please advise on next steps. Thanks!!
[root@gnosis ~]# ceph status
  cluster:
    id:     e4ece518-f2cb-4708-b00f-b6bf511e91d9
    health: HEALTH_WARN
            2 MDSs report slow metadata IOs
            1 MDSs report slow requests
            nodown,noout,norecover flag(s) set
            125 osds down
            3 hosts (48 osds) down
            Reduced data availability: 2221 pgs inactive, 1943 pgs down, 190 pgs peering, 13 pgs stale
            Degraded data redundancy: 5134396/500993581 objects degraded (1.025%), 296 pgs degraded, 299 pgs undersized
            9622 slow ops, oldest one blocked for 2913 sec, daemons [osd.0,osd.100,osd.101,osd.112,osd.118,osd.133,osd.136,osd.142,osd.144,osd.145]... have slow ops.

  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
    mgr: ceph-02(active), standbys: ceph-03, ceph-01
    mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
    osd: 288 osds: 90 up, 215 in; 230 remapped pgs
         flags nodown,noout,norecover

  data:
    pools:   10 pools, 2545 pgs
    objects: 62.61 M objects, 144 TiB
    usage:   219 TiB used, 1.6 PiB / 1.8 PiB avail
    pgs:     1.729% pgs unknown
             85.540% pgs not active
             5134396/500993581 objects degraded (1.025%)
             1796 down
             226  active+undersized+degraded
             147  down+remapped
             140  peering
             65   active+clean
             44   unknown
             38   undersized+degraded+peered
             38   remapped+peering
             17   active+undersized+degraded+remapped+backfill_wait
             12   stale+peering
             12   active+undersized+degraded+remapped+backfilling
             4    active+undersized+remapped
             2    remapped
             2    undersized+degraded+remapped+peered
             1    stale
             1    undersized+degraded+remapped+backfilling+peered

  io:
    client: 26 KiB/s rd, 206 KiB/s wr, 21 op/s rd, 50 op/s wr
[root@gnosis ~]# ceph health detail
HEALTH_WARN 2 MDSs report slow metadata IOs; 1 MDSs report slow requests; nodown,noout,norecover flag(s) set; 125 osds down; 3 hosts (48 osds) down; Reduced data availability: 2219 pgs inactive, 1943 pgs down, 188 pgs peering, 13 pgs stale; Degraded data redundancy: 5214696/500993589 objects degraded (1.041%), 298 pgs degraded, 299 pgs undersized; 9788 slow ops, oldest one blocked for 2953 sec, daemons [osd.0,osd.100,osd.101,osd.112,osd.118,osd.133,osd.136,osd.142,osd.144,osd.145]... have slow ops.
MDS_SLOW_METADATA_IO 2 MDSs report slow metadata IOs
    mdsceph-08(mds.0): 100+ slow metadata IOs are blocked > 30 secs, oldest blocked for 2940 secs
    mdsceph-12(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 2942 secs
MDS_SLOW_REQUEST 1 MDSs report slow requests
    mdsceph-08(mds.0): 100 slow requests are blocked > 30 secs
OSDMAP_FLAGS nodown,noout,norecover flag(s) set
OSD_DOWN 125 osds down
    osd.0 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-21) is down
    osd.6 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12) is down
    osd.7 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10) is down
    osd.8 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-11) is down
    osd.16 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-08) is down
    osd.18 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10) is down
    osd.19 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-11) is down
    osd.21 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-13) is down
    osd.31 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-18) is down
    osd.37 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-04) is down
    osd.38 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-07) is down
    osd.48 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-04) is down
    osd.51 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-22) is down
    osd.53 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-21) is down
    osd.55 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-19) is down
    osd.62 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-17) is down
    osd.67 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-11) is down
    osd.72 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-21) is down
    osd.75 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-08) is down
    osd.78 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10) is down
    osd.79 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-11) is down
    osd.80 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12) is down
    osd.81 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-13) is down
    osd.82 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-14) is down
    osd.83 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-15) is down
    osd.88 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-08) is down
    osd.89 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10) is down
    osd.92 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-13) is down
    osd.93 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12) is down
    osd.95 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-15) is down
    osd.96 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-16) is down
    osd.97 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-17) is down
    osd.100 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-13) is down
    osd.104 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12) is down
    osd.105 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-13) is down
    osd.107 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-15) is down
    osd.108 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-17) is down
    osd.109 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-16) is down
    osd.111 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-14) is down
    osd.113 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10) is down
    osd.114 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-09) is down
    osd.116 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12) is down
    osd.117 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-13) is down
    osd.119 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-15) is down
    osd.122 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12) is down
    osd.123 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-15) is down
    osd.124 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-08) is down
    osd.125 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-09) is down
    osd.126 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10) is down
    osd.128 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12) is down
    osd.131 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-15) is down
    osd.134 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-13) is down
    osd.139 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10) is down
    osd.140 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12) is down
    osd.141 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-13) is down
    osd.145 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-04) is down
    osd.149 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10) is down
    osd.151 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-09) is down
    osd.152 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12) is down
    osd.153 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-13) is down
    osd.154 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-14) is down
    osd.155 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-15) is down
    osd.156 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-04) is down
    osd.157 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-05) is down
    osd.159 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-07) is down
    osd.161 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-09) is down
    osd.162 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10) is down
    osd.164 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12) is down
    osd.165 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-13) is down
    osd.166 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-15) is down
    osd.167 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-14) is down
    osd.171 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-08) is down
    osd.172 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-07) is down
    osd.174 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10) is down
    osd.176 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12) is down
    osd.177 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-13) is down
    osd.179 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-15) is down
    osd.182 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-06) is down
    osd.183 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-07) is down
    osd.184 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-08) is down
    osd.186 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10) is down
    osd.187 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-11) is down
    osd.190 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-14) is down
    osd.191 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-15) is down
    osd.194 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-16) is down
    osd.195 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-17) is down
    osd.196 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-16) is down
    osd.199 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-17) is down
    osd.200 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-16) is down
    osd.201 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-17) is down
    osd.202 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-16) is down
    osd.203 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-17) is down
    osd.204 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-16) is down
    osd.208 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-08) is down
    osd.210 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-08) is down
    osd.212 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10) is down
    osd.213 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-11) is down
    osd.214 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-09) is down
    osd.215 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10) is down
    osd.216 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-11) is down
    osd.218 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-09) is down
    osd.219 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-11) is down
    osd.221 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12) is down
    osd.224 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-16) is down
    osd.226 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-17) is down
    osd.228 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-20) is down
    osd.230 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-20) is down
    osd.233 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-19) is down
    osd.236 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-19) is down
    osd.238 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-18) is down
    osd.247 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-21) is down
    osd.248 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-18) is down
    osd.254 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-04) is down
    osd.256 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-04) is down
    osd.259 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-18) is down
    osd.260 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-20) is down
    osd.262 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-19) is down
    osd.266 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-18) is down
    osd.267 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-18) is down
    osd.272 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-20) is down
    osd.274 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-21) is down
    osd.275 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-19) is down
    osd.276 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-22) is down
    osd.281 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-22) is down
    osd.285 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-05) is down
OSD_HOST_DOWN 3 hosts (48 osds) down
    host ceph-11 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1) (16 osds) is down
    host ceph-10 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1) (16 osds) is down
    host ceph-13 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1) (16 osds) is down
PG_AVAILABILITY Reduced data availability: 2219 pgs inactive, 1943 pgs down, 188 pgs peering, 13 pgs stale
    pg 14.513 is stuck inactive for 1681.564244, current state down, last acting [2147483647,2147483647,2147483647,2147483647,2147483647,143,2147483647,2147483647,2147483647,2147483647]
    pg 14.514 is down, acting [193,2147483647,2147483647,2147483647,2147483647,118,2147483647,2147483647,2147483647,2147483647]
    pg 14.515 is down, acting [2147483647,2147483647,2147483647,211,133,135,2147483647,2147483647,2147483647,2147483647]
    pg 14.516 is down, acting [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,205,2147483647]
    pg 14.517 is down, acting [2147483647,2147483647,5,2147483647,2147483647,2147483647,2147483647,2147483647,61,112]
    pg 14.518 is down, acting [2147483647,198,2147483647,2147483647,2147483647,2147483647,4,185,2147483647,2147483647]
    pg 14.519 is down, acting [2147483647,2147483647,68,2147483647,2147483647,2147483647,2147483647,185,2147483647,94]
    pg 14.51a is down, acting [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,101,2147483647]
    pg 14.51b is down, acting [2147483647,2147483647,2147483647,2147483647,2147483647,197,2147483647,2147483647,2147483647,2147483647]
    pg 14.51c is down, acting [193,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,197]
    pg 14.51d is down, acting [2147483647,2147483647,61,2147483647,77,2147483647,2147483647,2147483647,112,2147483647]
    pg 14.51e is down, acting [2147483647,2147483647,2147483647,2147483647,112,2147483647,2147483647,193,2147483647,2147483647]
    pg 14.51f is down, acting [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,94,2147483647,2147483647]
    pg 14.520 is down, acting [2147483647,2147483647,2147483647,2147483647,2147483647,207,2147483647,101,133,2147483647]
    pg 14.521 is down, acting [205,2147483647,133,2147483647,2147483647,2147483647,2147483647,4,2147483647,193]
    pg 14.522 is down, acting [101,2147483647,2147483647,11,197,2147483647,136,94,2147483647,2147483647]
    pg 14.523 is down, acting [2147483647,2147483647,2147483647,118,2147483647,71,2147483647,2147483647,2147483647,2147483647]
    pg 14.524 is down, acting [2147483647,111,2147483647,2147483647,2147483647,8,2147483647,112,2147483647,2147483647]
    pg 14.525 is down, acting [2147483647,2147483647,2147483647,142,2147483647,61,2147483647,2147483647,2147483647,2147483647]
    pg 14.526 is down, acting [2147483647,2147483647,2147483647,2147483647,2147483647,61,193,2147483647,2147483647,2147483647]
    pg 14.527 is down, acting [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,109,2147483647,2147483647]
    pg 14.528 is down, acting [2147483647,133,2147483647,2147483647,2147483647,2147483647,4,2147483647,2147483647,2147483647]
    pg 14.529 is down, acting [2147483647,112,2147483647,2147483647,2147483647,2147483647,185,2147483647,118,2147483647]
    pg 14.52a is down, acting [2147483647,2147483647,2147483647,2147483647,2147483647,136,2147483647,135,2147483647,2147483647]
    pg 14.52b is down, acting [2147483647,2147483647,2147483647,112,142,211,2147483647,2147483647,2147483647,2147483647]
    pg 14.52c is down, acting [185,2147483647,198,2147483647,118,2147483647,2147483647,2147483647,2147483647,2147483647]
    pg 14.52d is down, acting [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,5,2147483647,2147483647,2147483647]
    pg 14.52e is down, acting [71,101,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,142,2147483647]
    pg 14.52f is down, acting [198,2147483647,2147483647,2147483647,2147483647,11,2147483647,2147483647,118,2147483647]
    pg 14.530 is down, acting [142,2147483647,2147483647,2147483647,133,2147483647,2147483647,2147483647,2147483647,112]
    pg 14.531 is down, acting [2147483647,142,2147483647,2147483647,2147483647,185,2147483647,2147483647,2147483647,2147483647]
    pg 14.532 is down, acting [135,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,136,118]
    pg 14.533 is down, acting [2147483647,77,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647]
    pg 14.534 is down, acting [2147483647,2147483647,2147483647,185,118,2147483647,2147483647,207,2147483647,2147483647]
    pg 14.535 is down, acting [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,136,142,133,2147483647]
    pg 14.536 is down, acting [2147483647,11,2147483647,2147483647,136,2147483647,2147483647,2147483647,2147483647,2147483647]
    pg 14.537 is down, acting [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,77,2147483647]
    pg 14.538 is down, acting [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,205,2147483647,2147483647]
    pg 14.539 is down, acting [2147483647,2147483647,2147483647,198,2147483647,2147483647,4,2147483647,2147483647,2147483647]
    pg 14.53a is down, acting [2147483647,11,136,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647]
    pg 14.53b is down, acting [2147483647,2147483647,2147483647,2147483647,112,2147483647,2147483647,2147483647,2147483647,2147483647]
    pg 14.53c is down, acting [2147483647,2147483647,2147483647,71,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647]
    pg 14.53d is down, acting [2147483647,2147483647,2147483647,185,2147483647,2147483647,2147483647,2147483647,2147483647,136]
    pg 14.53e is down, acting [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,112,185]
    pg 14.53f is down, acting [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,185,2147483647,2147483647,2147483647]
    pg 14.540 is down, acting [205,2147483647,2147483647,2147483647,2147483647,2147483647,142,2147483647,112,77]
    pg 14.541 is down, acting [2147483647,2147483647,2147483647,2147483647,2147483647,197,211,2147483647,2147483647,2147483647]
    pg 14.542 is down, acting [112,2147483647,101,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647]
    pg 14.543 is down, acting [111,2147483647,2147483647,2147483647,2147483647,101,2147483647,2147483647,2147483647,2147483647]
    pg 14.544 is down, acting [4,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,205]
    pg 14.545 is down, acting [2147483647,2147483647,2147483647,2147483647,2147483647,142,5,2147483647,2147483647,2147483647]
PG_DEGRADED Degraded data redundancy: 5214696/500993589 objects degraded (1.041%), 298 pgs degraded, 299 pgs undersized
    pg 1.29 is stuck undersized for 2075.633328, current state active+undersized+degraded, last acting [253,258]
    pg 1.2a is stuck undersized for 1642.864920, current state active+undersized+degraded, last acting [252,255]
    pg 1.2b is stuck undersized for 2355.149928, current state active+undersized+degraded+remapped+backfill_wait, last acting [240,268]
    pg 1.2c is stuck undersized for 1459.277329, current state active+undersized+degraded, last acting [241,273]
    pg 1.2d is stuck undersized for 803.339131, current state undersized+degraded+peered, last acting [282]
    pg 2.25 is active+undersized+degraded, acting [253,2147483647,2147483647,258,261,273,277,243]
    pg 2.28 is stuck undersized for 803.340163, current state active+undersized+degraded, last acting [282,241,246,2147483647,273,252,2147483647,268]
    pg 2.29 is stuck undersized for 803.341160, current state active+undersized+degraded, last acting [240,258,277,264,2147483647,2147483647,271,250]
    pg 2.2a is stuck undersized for 1447.684978, current state active+undersized+degraded+remapped+backfilling, last acting [252,270,2147483647,261,2147483647,255,287,264]
    pg 2.2e is stuck undersized for 2030.849944, current state active+undersized+degraded, last acting [264,2147483647,251,245,257,286,261,258]
    pg 2.51 is stuck undersized for 1459.274671, current state active+undersized+degraded+remapped+backfilling, last acting [270,2147483647,2147483647,265,241,243,240,252]
    pg 2.52 is stuck undersized for 2030.850897, current state active+undersized+degraded+remapped+backfilling, last acting [240,2147483647,270,265,269,280,278,2147483647]
    pg 2.53 is stuck undersized for 1459.273517, current state active+undersized+degraded, last acting [261,2147483647,280,282,2147483647,245,243,241]
    pg 2.61 is stuck undersized for 2075.633140, current state active+undersized+degraded+remapped+backfilling, last acting [269,2147483647,258,286,270,255,2147483647,264]
    pg 2.62 is stuck undersized for 803.340577, current state active+undersized+degraded, last acting [2147483647,253,258,2147483647,250,287,264,284]
    pg 2.66 is stuck undersized for 803.341231, current state active+undersized+degraded, last acting [264,280,265,255,257,269,2147483647,270]
    pg 2.6c is stuck undersized for 963.369539, current state active+undersized+degraded, last acting [286,269,278,251,2147483647,273,2147483647,280]
    pg 2.70 is stuck undersized for 873.662725, current state active+undersized+degraded, last acting [2147483647,268,255,273,253,265,278,2147483647]
    pg 2.74 is stuck undersized for 2075.632312, current state active+undersized+degraded+remapped+backfilling, last acting [240,242,2147483647,245,243,269,2147483647,265]
    pg 3.24 is stuck undersized for 1570.800184, current state active+undersized+degraded, last acting [235,263]
    pg 3.25 is stuck undersized for 733.673503, current state undersized+degraded+peered, last acting [232]
    pg 3.28 is stuck undersized for 2610.307886, current state active+undersized+degraded, last acting [263,84]
    pg 3.2a is stuck undersized for 1214.710839, current state active+undersized+degraded, last acting [181,232]
    pg 3.2b is stuck undersized for 2075.630671, current state active+undersized+degraded, last acting [63,144]
    pg 3.52 is stuck undersized for 1570.777598, current state active+undersized+degraded, last acting [158,237]
    pg 3.54 is stuck undersized for 1350.257189, current state active+undersized+degraded, last acting [239,74]
    pg 3.55 is stuck undersized for 2592.642531, current state active+undersized+degraded, last acting [157,233]
    pg 3.5a is stuck undersized for 2075.608257, current state undersized+degraded+peered, last acting [168]
    pg 3.5c is stuck undersized for 733.674836, current state active+undersized+degraded, last acting [263,234]
    pg 3.5d is stuck undersized for 2610.307220, current state active+undersized+degraded, last acting [180,84]
    pg 3.5e is stuck undersized for 1710.756037, current state undersized+degraded+peered, last acting [146]
    pg 3.61 is stuck undersized for 1080.210021, current state active+undersized+degraded, last acting [168,239]
    pg 3.62 is stuck undersized for 831.217622, current state active+undersized+degraded, last acting [84,263]
    pg 3.63 is stuck undersized for 733.674204, current state active+undersized+degraded, last acting [263,232]
    pg 3.65 is stuck undersized for 1570.790824, current state active+undersized+degraded, last acting [63,84]
    pg 3.66 is stuck undersized for 733.682973, current state undersized+degraded+peered, last acting [63]
    pg 3.68 is stuck undersized for 1570.624462, current state active+undersized+degraded, last acting [229,148]
    pg 3.69 is stuck undersized for 1350.316213, current state undersized+degraded+peered, last acting [235]
    pg 3.6b is stuck undersized for 783.813654, current state undersized+degraded+peered, last acting [63]
    pg 3.6c is stuck undersized for 783.819083, current state undersized+degraded+peered, last acting [229]
    pg 3.6f is stuck undersized for 2610.321349, current state active+undersized+degraded, last acting [232,158]
    pg 3.72 is stuck undersized for 1350.358149, current state active+undersized+degraded, last acting [229,74]
    pg 3.73 is stuck undersized for 1570.788310, current state undersized+degraded+peered, last acting [234]
    pg 11.20 is stuck undersized for 733.682510, current state active+undersized+degraded, last acting [2147483647,239,87,2147483647,158,237,63,76]
    pg 11.26 is stuck undersized for 1914.334332, current state active+undersized+degraded, last acting [2147483647,237,2147483647,263,158,148,181,180]
    pg 11.2d is stuck undersized for 1350.365988, current state active+undersized+degraded, last acting [2147483647,2147483647,73,229,86,158,169,84]
    pg 11.54 is stuck undersized for 1914.398125, current state active+undersized+degraded, last acting [231,169,2147483647,229,84,85,237,63]
    pg 11.5b is stuck undersized for 2047.980719, current state active+undersized+degraded, last acting [86,237,168,263,144,1,229,2147483647]
    pg 11.5e is stuck undersized for 873.643661, current state active+undersized+degraded, last acting [181,2147483647,229,158,231,1,169,2147483647]
    pg 11.62 is stuck undersized for 1144.491696, current state active+undersized+degraded, last acting [2147483647,85,235,74,63,234,181,2147483647]
    pg 11.6f is stuck undersized for 873.646628, current state active+undersized+degraded, last acting [234,3,2147483647,158,180,63,2147483647,181]
SLOW_OPS 9788 slow ops, oldest one blocked for 2953 sec, daemons [osd.0,osd.100,osd.101,osd.112,osd.118,osd.133,osd.136,osd.142,osd.144,osd.145]... have slow ops.
I don't want to butt in, but I looked at your OSD log and saw these messages:
2020-05-05 15:28:09.593 7f2d9cf29700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.2 down, but it is still running
2020-05-05 15:28:09.593 7f2d9cf29700 0 log_channel(cluster) log [DBG] : map e112673 wrongly marked me down at e112634
As far as I know, this happens when an OSD is under stress, whether from IO or from saturated
network communication. I typically inject a large recovery sleep value and see if the OSDs
come back, like so:
ceph tell osd.* injectargs '--osd-recovery-sleep 1'
ceph tell osd.* injectargs '--osd-max-backfills 1'
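To verify the values took effect, you can query a running OSD over its admin socket on its host (osd.2 here is just an example):
ceph daemon osd.2 config show | grep -e osd_recovery_sleep -e osd_max_backfills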
Hope this helps.
--
Alex Gorbachev
Intelligent Systems Services Inc.
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-leave@ceph.io