I understand, but I deleted the OSDs from the CRUSH map, so Ceph won't
wait for these OSDs anymore, am I right?
On 2020-10-27 04:06, Eugen Block wrote:
> Hi,
>
> just to clarify so I don't miss anything: you have two DCs and one of
> them is down. And two of the MONs were in that failed DC? Now you
> removed all OSDs and two MONs from the failed DC hoping that your
> cluster will recover? If you have reasonable crush rules in place
> (e.g. to recover from a failed DC), your cluster will never recover in
> the current state unless you bring OSDs back up in the second DC.
> That's why you don't see progress in the recovery process: the PGs are
> waiting for their peers in the other DC so they can follow the crush
> rules.
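>
> For example, you can check what your rules actually require with:
>
>   ceph osd crush rule ls
>   ceph osd crush rule dump <rule-name>   # <rule-name> is a placeholder
>
> If a rule contains a step like "chooseleaf firstn 0 type datacenter",
> it requires OSDs from distinct datacenters, so with one DC down those
> PGs cannot be fully placed and will stay undersized/degraded.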
>
> Regards,
> Eugen
>
>
> Zitat von "Ing. Luis Felipe Domínguez Vega"
<luis.dominguez(a)desoft.cu>cu>:
>
>> I had 3 MONs, but I have 2 physical datacenters and one of them broke
>> with no short-term fix, so I removed all the OSDs and the Ceph MONs
>> (2 of them), and now I only have the OSDs of 1 datacenter with the
>> remaining monitor. I had stopped the Ceph manager, but I saw that when
>> I restart the Ceph manager, ceph -s shows recovery info for roughly 20
>> minutes and then all the info disappears.
>>
>> The thing is that the cluster does not seem to be recovering on its
>> own, and the Ceph monitor is "eating" all of the HDD.
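>>
>> To see how much the MON store has actually grown and to try to reclaim
>> some space, something like this should work (compaction may not gain
>> much while recovery is still outstanding):
>>
>>   du -sh /var/lib/ceph/mon/ceph-fond-beagle/store.db
>>   ceph tell mon.fond-beagle compact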
>>
>>> On 2020-10-26 15:57, Eugen Block wrote:
>>> The recovery process (ceph -s) is independent of the MGR service and
>>> only depends on the MON service. It seems you only have the one MON;
>>> if the MGR is overloading it (not clear why), it could help to leave
>>> the MGR off and see if the MON service then has enough RAM to proceed
>>> with the recovery. Is there any chance you could add two more MONs? A
>>> single MON is of course a single point of failure.
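>>>
>>> If you have two more hosts available, the manual procedure is roughly
>>> this (the hostname "newmon" is only an example; adjust it to your
>>> deployment tooling):
>>>
>>>   mkdir -p /var/lib/ceph/mon/ceph-newmon
>>>   ceph auth get mon. -o /tmp/mon.keyring
>>>   ceph mon getmap -o /tmp/monmap
>>>   ceph-mon -i newmon --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
>>>   chown -R ceph:ceph /var/lib/ceph/mon/ceph-newmon
>>>   systemctl start ceph-mon@newmon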
>>>
>>>
>>> Zitat von "Ing. Luis Felipe Domínguez Vega"
>>> <luis.dominguez(a)desoft.cu>cu>:
>>>
>>>> On 2020-10-26 15:16, Eugen Block wrote:
>>>>> You could stop the MGRs and wait for the recovery to finish; MGRs
>>>>> are not a critical component. You won't have a dashboard or metrics
>>>>> during that time, but it would prevent the high RAM usage.
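>>>>>
>>>>> Assuming a package-based install with systemd units, stopping them
>>>>> would be something like this (take the hostname from your ceph -s
>>>>> output, and make sure nothing restarts the MGR automatically):
>>>>>
>>>>>   systemctl stop ceph-mgr.target
>>>>>   systemctl disable ceph-mgr@fond-beagle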
>>>>>
>>>>> Zitat von "Ing. Luis Felipe Domínguez Vega"
>>>>> <luis.dominguez(a)desoft.cu>cu>:
>>>>>
>>>>>> On 2020-10-26 12:23, 胡 玮文 wrote:
>>>>>>>> On 2020-10-26 at 23:29, Ing. Luis Felipe Domínguez Vega
>>>>>>>> <luis.dominguez(a)desoft.cu> wrote:
>>>>>>>>
>>>>>>>> mgr: fond-beagle(active, since 39s)
>>>>>>>
>>>>>>> Your manager seems to be crash looping; it has only been up for
>>>>>>> 39s. Looking at the mgr logs may help you identify why your
>>>>>>> cluster is not recovering. You may have hit some bug in the mgr.
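>>>>>>>
>>>>>>> For example, to get more detail out of the mgr before its next
>>>>>>> restart (standard debug settings; raise or lower the levels as
>>>>>>> needed):
>>>>>>>
>>>>>>>   ceph config set mgr debug_mgr 10
>>>>>>>   ceph config set mgr debug_ms 1
>>>>>>>   tail -f /var/log/ceph/ceph-mgr.fond-beagle.log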
>>>>>> Nope, I'm restarting the Ceph manager myself because it eats all
>>>>>> the server RAM. I have a script that restarts the manager whenever
>>>>>> only 1 GB of RAM is free (the server has 94 GB of RAM); a sketch of
>>>>>> it is after the log below. I don't know why this happens, and the
>>>>>> manager logs are:
>>>>>>
>>>>>> -----------------------------------
>>>>>> root@fond-beagle:/var/lib/ceph/mon/ceph-fond-beagle/store.db# tail -f /var/log/ceph/ceph-mgr.fond-beagle.log
>>>>>> 2020-10-26T12:54:12.497-0400 7f2a8112b700 0 log_channel(cluster) log [DBG] : pgmap v584: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
>>>>>> 2020-10-26T12:54:12.497-0400 7f2a8112b700 0 log_channel(cluster) do_log log to syslog
>>>>>> 2020-10-26T12:54:14.501-0400 7f2a8112b700 0 log_channel(cluster) log [DBG] : pgmap v585: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
>>>>>> 2020-10-26T12:54:14.501-0400 7f2a8112b700 0 log_channel(cluster) do_log log to syslog
>>>>>> 2020-10-26T12:54:16.517-0400 7f2a8112b700 0 log_channel(cluster) log [DBG] : pgmap v586: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
>>>>>> 2020-10-26T12:54:16.517-0400 7f2a8112b700 0 log_channel(cluster) do_log log to syslog
>>>>>> 2020-10-26T12:54:18.521-0400 7f2a8112b700 0 log_channel(cluster) log [DBG] : pgmap v587: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
>>>>>> 2020-10-26T12:54:18.521-0400 7f2a8112b700 0 log_channel(cluster) do_log log to syslog
>>>>>> 2020-10-26T12:54:20.537-0400 7f2a8112b700 0 log_channel(cluster) log [DBG] : pgmap v588: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
>>>>>> 2020-10-26T12:54:20.537-0400 7f2a8112b700 0 log_channel(cluster) do_log log to syslog
>>>>>> 2020-10-26T12:54:22.541-0400 7f2a8112b700 0 log_channel(cluster) log [DBG] : pgmap v589: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
>>>>>> 2020-10-26T12:54:22.541-0400 7f2a8112b700 0 log_channel(cluster) do_log log to syslog
>>>>>> ---------------
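>>>>>>
>>>>>> The watchdog I mentioned above is essentially of this shape (a
>>>>>> minimal sketch only, the real script differs; it assumes a
>>>>>> systemd-managed mgr unit named ceph-mgr@fond-beagle):
>>>>>>
>>>>>>   #!/bin/bash
>>>>>>   # restart the mgr whenever less than ~1 GB of RAM is available
>>>>>>   while true; do
>>>>>>       avail_mb=$(free -m | awk '/^Mem:/ {print $7}')
>>>>>>       if [ "$avail_mb" -lt 1024 ]; then
>>>>>>           systemctl restart ceph-mgr@fond-beagle
>>>>>>       fi
>>>>>>       sleep 30
>>>>>>   done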
>>>>
>>>> OK, I will do that... but the thing is that the cluster does not
>>>> show any recovery activity; it doesn't show that it is doing
>>>> anything (like the recovery info in the ceph -s output), so I don't
>>>> know whether it is recovering or what it is doing.
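>>>>
>>>> As far as I understand, the PG counters come from the mgr's pgmap,
>>>> so they won't update while the mgr is stopped; while a mgr is
>>>> briefly up I can at least sample them, for example:
>>>>
>>>>   ceph pg stat
>>>>   ceph health detail | head -30
>>>>   ceph pg dump_stuck unclean | head -30
>>>>
>>>> and compare the degraded/misplaced object counts over time.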