I think the recovery might be blocked by all those PGs in an inactive state:
"""
Inactive: Placement groups cannot process reads or writes because they are waiting for an
OSD with the most up-to-date data to come back up.
"""
What is your pool configuration, and what other non-default settings do you have?
Can you send the output of "ceph config dump" and "ceph osd pool ls detail"?
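In the meantime, something along these lines (untested, and the pg id below is
just a placeholder - pick one of your 'incomplete' PGs from the dump) should
show which OSDs the stuck PGs are waiting for; the recovery_state section of
the query output usually says what peering is blocked on:

# ceph pg dump_stuck inactive
# ceph pg 1.2f query        # replace 1.2f with one of the stuck PG ids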
On 05/05 11:00, Andres Rojas Guerrero wrote:
Yes, the main problem is that the MDSs start to report slow requests, the
information is no longer accessible, and the cluster never recovers.
# ceph status
  cluster:
    id:     c74da5b8-3d1b-483e-8b3a-739134db6cf8
    health: HEALTH_WARN
            2 clients failing to respond to capability release
            2 MDSs report slow metadata IOs
            1 MDSs report slow requests
            2 MDSs behind on trimming
            Reduced data availability: 238 pgs inactive, 8 pgs down, 230 pgs incomplete
            Degraded data redundancy: 1400453/220552172 objects degraded (0.635%), 461 pgs degraded, 464 pgs undersized
            241 slow ops, oldest one blocked for 638 sec, daemons [osd.101,osd.127,osd.155,osd.166,osd.172,osd.189,osd.200,osd.210,osd.214,osd.233]... have slow ops.

  services:
    mon: 3 daemons, quorum ceph2mon01,ceph2mon02,ceph2mon03 (age 25h)
    mgr: ceph2mon02(active, since 6d), standbys: ceph2mon01, ceph2mon03
    mds: nxtclfs:2 {0=ceph2mon01=up:active,1=ceph2mon02=up:active} 1 up:standby
    osd: 768 osds: 736 up (since 11m), 736 in (since 95s); 416 remapped pgs

  data:
    pools:   2 pools, 16384 pgs
    objects: 33.40M objects, 39 TiB
    usage:   63 TiB used, 2.6 PiB / 2.6 PiB avail
    pgs:     1.489% pgs not active
             1400453/220552172 objects degraded (0.635%)
             15676 active+clean
             285   active+undersized+degraded+remapped+backfill_wait
             230   incomplete
             176   active+undersized+degraded+remapped+backfilling
             8     down
             6     peering
             3     active+undersized+remapped
On 5/5/21 at 10:54, David Caro wrote:
Can you share more information?
The output of 'ceph status' while the OSDs are down would help; 'ceph health
detail' could also be useful.
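Roughly something like the following; the grep is only there to narrow the
tree output down to the OSDs that are not up:

# ceph status
# ceph health detail
# ceph osd tree | grep -w down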
On 05/05 10:48, Andres Rojas Guerrero wrote:
Hi, I have a Nautilus cluster (version 14.2.6), and I have noticed that when
some OSDs go down the cluster doesn't start to recover. I have checked
that the noout option is unset.
What could be the reason for this behavior?
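In case it is useful, this is roughly how I checked the flags (the grep just
picks out the flags line from the osd dump):

# ceph osd dump | grep flags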
--
*******************************************************
Andrés Rojas Guerrero
Unidad Sistemas Linux
Area Arquitectura Tecnológica
Secretaría General Adjunta de Informática
Consejo Superior de Investigaciones Científicas (CSIC)
Pinar 19
28006 - Madrid
Tel: +34 915680059 -- Ext. 990059
email: a.rojas(a)csic.es
ID comunicate.csic.es: @50852720l:matrix.csic.es
*******************************************************
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io
--
David Caro
SRE - Cloud Services
Wikimedia Foundation <https://wikimediafoundation.org/>
PGP Signature: 7180 83A2 AC8B 314F B4CE 1171 4071 C7E1 D262 69C3
"Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment."