I think the recovery might be blocked by all those PGs in an inactive state:
"""
Inactive: Placement groups cannot process reads or writes because they are waiting for an
OSD with the most up-to-date data to come back up.
"""
What is your pool configuration, and what other non-default settings do you have?
Can you send the output of "ceph config dump" and "ceph osd pool ls detail"?
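In the meantime, something along these lines (untested, and the pg id below is
just a placeholder - pick one of your 'incomplete' PGs from the dump) should
show which OSDs the stuck PGs are waiting for; the recovery_state section of
the query output usually says what peering is blocked on:

# ceph pg dump_stuck inactive
# ceph pg 1.2f query        # replace 1.2f with one of the stuck PG ids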
On 05/05 11:00, Andres Rojas Guerrero wrote:
Yes, the main problem is that the MDSs start to report slow requests, the
information is no longer accessible, and the cluster never recovers.
# ceph status
  cluster:
    id:     c74da5b8-3d1b-483e-8b3a-739134db6cf8
    health: HEALTH_WARN
            2 clients failing to respond to capability release
            2 MDSs report slow metadata IOs
            1 MDSs report slow requests
            2 MDSs behind on trimming
            Reduced data availability: 238 pgs inactive, 8 pgs down, 230 pgs incomplete
            Degraded data redundancy: 1400453/220552172 objects degraded (0.635%), 461 pgs degraded, 464 pgs undersized
            241 slow ops, oldest one blocked for 638 sec, daemons [osd.101,osd.127,osd.155,osd.166,osd.172,osd.189,osd.200,osd.210,osd.214,osd.233]... have slow ops.

  services:
    mon: 3 daemons, quorum ceph2mon01,ceph2mon02,ceph2mon03 (age 25h)
    mgr: ceph2mon02(active, since 6d), standbys: ceph2mon01, ceph2mon03
    mds: nxtclfs:2 {0=ceph2mon01=up:active,1=ceph2mon02=up:active} 1 up:standby
    osd: 768 osds: 736 up (since 11m), 736 in (since 95s); 416 remapped pgs

  data:
    pools:   2 pools, 16384 pgs
    objects: 33.40M objects, 39 TiB
    usage:   63 TiB used, 2.6 PiB / 2.6 PiB avail
    pgs:     1.489% pgs not active
             1400453/220552172 objects degraded (0.635%)
             15676 active+clean
             285   active+undersized+degraded+remapped+backfill_wait
             230   incomplete
             176   active+undersized+degraded+remapped+backfilling
             8     down
             6     peering
             3     active+undersized+remapped
On 5/5/21 at 10:54, David Caro wrote:
Can you share more information?
The output of 'ceph status' while the OSDs are down would help; 'ceph health
detail' could also be useful.
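Roughly something like the following; the grep is only there to narrow the
tree output down to the OSDs that are not up:

# ceph status
# ceph health detail
# ceph osd tree | grep -w down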
On 05/05 10:48, Andres Rojas Guerrero wrote:
Hi, I have a Nautilus cluster (version 14.2.6), and I have noticed that when
some OSDs go down the cluster doesn't start to recover. I have checked
that the noout option is unset.
What could be the reason for this behavior?
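In case it is useful, this is roughly how I checked the flags (the grep just
picks out the flags line from the osd dump):

# ceph osd dump | grep flags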
--
*******************************************************
Andrés Rojas Guerrero
Unidad Sistemas Linux
Area Arquitectura Tecnológica
Secretaría General Adjunta de Informática
Consejo Superior de Investigaciones Científicas (CSIC)
Pinar 19
28006 - Madrid
Tel: +34 915680059 -- Ext. 990059
email: a.rojas(a)csic.es
ID comunicate.csic.es: @50852720l:matrix.csic.es
*******************************************************
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io
--
David Caro
SRE - Cloud Services
Wikimedia Foundation <https://wikimediafoundation.org/>
PGP Signature: 7180 83A2 AC8B 314F B4CE 1171 4071 C7E1 D262 69C3
"Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment."