Thank you Eugen for your warm help!
I'm trying to understand the difference between the 2 methods.
For method 1, i.e. "ceph orch osd rm osd_id", OSD Service — Ceph Documentation
<https://docs.ceph.com/en/latest/cephadm/services/osd/#remove-an-osd>
says it involves 2 steps:
1. evacuating all placement groups (PGs) from the OSD
2. removing the PG-free OSD from the cluster
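For concreteness, this is roughly what we ran and watched for method 1; osd
id 12 below is just a made-up example, not our real osd:
    ceph orch osd rm 12 --zap    # drain (evacuate) the PGs, then remove and zap the OSD
    ceph orch osd rm status      # show drain/removal progress per OSD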
For method 2, i.e. the procedure you recommended, Adding/Removing OSDs — Ceph
Documentation
<https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#removing-osds-manual>
says: "After the OSD has been taken out of the cluster, Ceph begins rebalancing
the cluster by migrating placement groups out of the OSD that was removed."
What's the difference between "evacuating PGs" in method 1 and "migrating
PGs" in method 2? I think method 1 must read from the OSD being removed;
otherwise we would not see the slow ops warning. Does method 2 not involve
reading this OSD?
Thanks,
Mary
On Fri, Apr 26, 2024 at 5:15 AM Eugen Block <eblock(a)nde.ag> wrote:
Hi,
if you remove the OSD this way, it will be drained, which means that
Ceph will try to recover PGs from this OSD, and in case of a hardware
failure that might lead to slow requests. It might make sense to
forcefully remove the OSD without draining (rough command sketch below):
- stop the osd daemon
- mark it as out
- osd purge <id|osd.id> [--force] [--yes-i-really-mean-it]
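Roughly, as a sketch (osd.12 / id 12 is a placeholder, adjust to your cluster):
    ceph orch daemon stop osd.12                       # stop the osd daemon
    ceph osd out 12                                    # mark it as out
    ceph osd purge 12 --force --yes-i-really-mean-it   # purge without draining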
Regards,
Eugen
Zitat von Mary Zhang <maryzhang0920(a)gmail.com>:
Hi,
We recently removed an osd from our Ceph cluster. Its underlying disk
has a hardware issue.
We used the command: ceph orch osd rm osd_id --zap
During the process, the ceph cluster sometimes enters a warning state with
slow ops on this osd. Our rgw also failed to respond to requests and
returned 503.
We restarted the rgw daemon to make it work again, but the same failure
occurred from time to time. Eventually we noticed that the rgw 503 errors
are a result of the osd slow ops.
Our cluster has 18 hosts and 210 OSDs. We expect that removing an osd with a
hardware issue won't impact cluster performance & rgw availability. Is our
expectation reasonable? What's the best way to handle osds with hardware
failures?
Thank you in advance for any comments or suggestions.
Best Regards,
Mary Zhang
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io