I created a ticket:
https://tracker.ceph.com/issues/50637
Hope a purge will do the trick.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder <frans(a)dtu.dk>
Sent: 03 May 2021 15:21:38
To: Dan van der Ster; Vladimir Sigunov
Cc: ceph-users(a)ceph.io
Subject: [ceph-users] Re: OSD slow ops warning not clearing after OSD down
Hi Dan,
just restarted all MONs, no change though :(
Thanks for looking at this. I will wait until tomorrow. My plan is to get the disk up
again with the same OSD ID, and I would expect that this will eventually allow the
message to be cleared.
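For reference, re-using the same OSD ID is possible if the old OSD is destroyed rather than purged. A rough sketch, assuming ceph-volume with BlueStore; /dev/sdX stands in for the replacement device and is a placeholder, not a value from this thread:

```shell
# Destroy keeps the OSD ID in the osdmap (marked destroyed) so it can be reused:
ceph osd destroy 580 --yes-i-really-mean-it

# Recreate the OSD on the new/reseated disk with the same ID:
ceph-volume lvm create --osd-id 580 --data /dev/sdX
```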
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Dan van der Ster <dan(a)vanderster.com>
Sent: 03 May 2021 15:08:03
To: Vladimir Sigunov
Cc: ceph-users(a)ceph.io; Frank Schilder
Subject: Re: [ceph-users] Re: OSD slow ops warning not clearing after OSD down
Wait, first just restart the leader mon.
See:
https://tracker.ceph.com/issues/47380 for a related issue.
-- dan
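A sketch of that step, assuming systemd-managed mons and that mon names match the hostnames (ceph-01..03 as in the status output further down); the leader name comes from the quorum status:

```shell
# Identify the current leader mon:
ceph quorum_status -f json | jq -r '.quorum_leader_name'

# Then, on that host only, restart its mon daemon, e.g.:
systemctl restart ceph-mon@ceph-01
```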
On Mon, May 3, 2021 at 2:55 PM Vladimir Sigunov
<vladimir.sigunov(a)gmail.com> wrote:
Hi Frank,
Yes, I would purge the osd. The cluster looks absolutely healthy except for this osd.584.
Probably, the purge will help the cluster to forget this faulty one. Also, I would
restart the monitors, too.
With the amount of data you maintain in your cluster, I don't think your ceph.conf
contains any information about particular osds, but if it does, don't forget to
remove the osd.584 configuration from ceph.conf.
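A minimal sketch of that cleanup (ceph CLI on a mon/admin node assumed; purge is available in Mimic and later):

```shell
# Purge removes the OSD from the CRUSH map, deletes its auth key,
# and removes it from the osdmap in one step:
ceph osd purge 584 --yes-i-really-mean-it

# If ceph.conf carries a per-OSD section, drop it on every host, e.g.
# any [osd.584] block.
```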
________________________________
From: Frank Schilder <frans(a)dtu.dk>
Sent: Monday, May 3, 2021 8:37:09 AM
To: Vladimir Sigunov <vladimir.sigunov(a)gmail.com>; ceph-users(a)ceph.io
<ceph-users(a)ceph.io>
Subject: Re: OSD slow ops warning not clearing after OSD down
Hi Vladimir,
thanks for your reply. I did, the cluster is healthy:
[root@gnosis ~]# ceph status
cluster:
id: ---
health: HEALTH_WARN
430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-01(active), standbys: ceph-02, ceph-03
mds: con-fs2-2/2/2 up {0=ceph-08=up:active,1=ceph-12=up:active}, 2 up:standby
osd: 584 osds: 578 up, 578 in
data:
pools: 11 pools, 3215 pgs
objects: 610.3 M objects, 1.2 PiB
usage: 1.5 PiB used, 4.6 PiB / 6.0 PiB avail
pgs: 3191 active+clean
13 active+clean+scrubbing+deep
9 active+clean+snaptrim_wait
2 active+clean+snaptrim
io:
client: 358 MiB/s rd, 56 MiB/s wr, 2.35 kop/s rd, 1.32 kop/s wr
[root@gnosis ~]# ceph health detail
HEALTH_WARN 430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
SLOW_OPS 430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
OSD 580 is down+out and the message does not even increment the seconds. It's probably
stuck in some part of the health checking that tries to query osd.580 and doesn't
understand that the OSD being down means there are no ops.
I tried to restart the OSD on this disk, but the disk seems completely dead. The iDRAC log on
the server says that the disk was removed during operation, possibly due to a physical
connection failure on the SAS lanes. I somehow need to get rid of this message and am
wondering if purging the OSD would help.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Vladimir Sigunov <vladimir.sigunov(a)gmail.com>
Sent: 03 May 2021 13:45:19
To: ceph-users(a)ceph.io; Frank Schilder
Subject: Re: OSD slow ops warning not clearing after OSD down
Hi Frank.
Check your cluster for inactive/incomplete placement groups. I saw similar behavior on
Octopus when some PGs were stuck in an incomplete/inactive or peering state.
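For anyone following along, those states can be checked with a couple of standard commands (nothing cluster-specific assumed here):

```shell
# List PGs stuck in problematic states:
ceph pg dump_stuck inactive
ceph pg dump_stuck unclean

# Cross-check against the health summary:
ceph health detail
```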
________________________________
From: Frank Schilder <frans(a)dtu.dk>
Sent: Monday, May 3, 2021 3:42:48 AM
To: ceph-users(a)ceph.io <ceph-users(a)ceph.io>
Subject: [ceph-users] OSD slow ops warning not clearing after OSD down
Dear cephers,
I have a strange problem. An OSD went down and recovery finished. For some reason, I have
a slow ops warning for the failed OSD stuck in the system:
health: HEALTH_WARN
430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
The OSD is auto-out:
| 580 | ceph-22 | 0 | 0 | 0 | 0 | 0 | 0 | autoout,exists |
It is probably a warning dating back to just before the failure. How can I clear it?
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io