Failed OSD has 29 Slow MDS Ops. - ceph-users

7 Jun 2021

Hello,

Ceph version: Nautilus 14.2.16

I had an OSD go bad about 10 days ago.  Apparently as it was going down
some MDS ops got hung up waiting for it to come back.  I was out of town
for a couple days and found the OSD 'Down and Out' when I checked in.
(Also, oddly, the cluster did not appear to initiate recovery right away;
recovery only started after I rebooted the OSD node.)

As of right now, the damaged OSD is 'safe-to-destroy' but the slow ops are
still hanging around.  Earlier today I quiesced the clients that were
accessing the CephFS, then unmounted and re-mounted it.  However, this did
not clear the lingering ops.

When I had the node offline I verified that the HDD and NVMe associated
with the OSD seem to actually be healthy, so I plan to zap and re-deploy
using the same hardware.  I would also like to upgrade to 14.2.20 (latest
Ceph for Debian 10), but I'm hesitant to do any of this until I get rid of
these 29 slow ops.

Can anybody suggest a path forward?

Thanks.

-Dave

--
Dave Hall
Binghamton University
kdhall(a)binghamton.edu