Needed data:
ceph -s :
https://pastebin.ubuntu.com/p/S9gKjyZtdK/
ceph osd tree :
https://pastebin.ubuntu.com/p/SCZHkk6Mk4/
ceph osd df : (will follow later; I have been waiting for 10 minutes
and there is still no output)
ceph osd pool ls detail :
https://pastebin.ubuntu.com/p/GRdPjxhv3D/
crush rules : (ceph osd crush rule dump)
https://pastebin.ubuntu.com/p/cjyjmbQ4Wq/
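
For reference, a replicated rule that spreads copies across datacenters
(the situation Eugen describes below) would look roughly like the
following in decompiled crushmap syntax. This is only a generic sketch,
not necessarily the rule in the dump above:

rule replicated_across_dcs {
    id 1
    type replicated
    step take default
    step chooseleaf firstn 0 type datacenter
    step emit
}

With a rule of that shape and only one datacenter left in the CRUSH
tree, CRUSH cannot find a second datacenter to place replicas in, so
the affected PGs stay undersized/incomplete no matter which OSDs are
deleted.
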
On 2020-10-27 07:14, Eugen Block wrote:
>> I understand, but I deleted the OSDs from the CRUSH map, so ceph
>> won't wait for those OSDs anymore, right?
>
> It depends on your actual crush tree and rules. Can you share (maybe
> you already did)
>
> ceph osd tree
> ceph osd df
> ceph osd pool ls detail
>
> and a dump of your crush rules?
>
> As I already said, if you have rules in place that distribute data
> across 2 DCs and one of them is down the PGs will never recover even
> if you delete the OSDs from the failed DC.
>
>
>
> Quoting "Ing. Luis Felipe Domínguez Vega" <luis.dominguez(a)desoft.cu>:
>
>> I understand, but I deleted the OSDs from the CRUSH map, so ceph
>> won't wait for those OSDs anymore, right?
>>
>> On 2020-10-27 04:06, Eugen Block wrote:
>>> Hi,
>>>
>>> just to clarify so I don't miss anything: you have two DCs and one of
>>> them is down. And two of the MONs were in that failed DC? Now you
>>> removed all OSDs and two MONs from the failed DC hoping that your
>>> cluster will recover? If you have reasonable crush rules in place
>>> (e.g. to recover from a failed DC) your cluster will never recover in
>>> the current state unless you bring OSDs back up on the second DC.
>>> That's why you don't see progress in the recovery process: the PGs are
>>> waiting for their peers in the other DC so they can follow the crush
>>> rules.
>>>
>>> Regards,
>>> Eugen
>>>
>>>
>>> Quoting "Ing. Luis Felipe Domínguez Vega" <luis.dominguez(a)desoft.cu>:
>>>
>>>> I had 3 mons, but I have 2 physical datacenters and one of them broke
>>>> with no short-term fix, so I removed all of its OSDs and 2 of the ceph
>>>> mons, and now I only have the OSDs of 1 datacenter plus the remaining
>>>> monitor. I had stopped the ceph manager, but I saw that when I restart
>>>> a ceph manager, ceph -s shows recovery info for roughly 20 minutes and
>>>> then all of that info disappears.
>>>>
>>>> The thing is that the cluster doesn't seem to be self-recovering, and
>>>> the ceph monitor is "eating" all of the HDD.
>>>>
>>>> On 2020-10-26 15:57, Eugen Block wrote:
>>>>> The recovery process (ceph -s) is independent of the MGR service
>>>>> but
>>>>> only depends on the MON service. It seems you only have the one
>>>>> MON,
>>>>> if the MGR is overloading it (not clear why) it could help to leave
>>>>> MGR off and see if the MON service then has enough RAM to proceed
>>>>> with
>>>>> the recovery. Do you have any chance to add two more MONs? A single
>>>>> MON is of course a single point of failure.
>>>>>
>>>>>
>>>>> Quoting "Ing. Luis Felipe Domínguez Vega" <luis.dominguez(a)desoft.cu>:
>>>>>
>>>>>> On 2020-10-26 15:16, Eugen Block wrote:
>>>>>>> You could stop the MGRs and wait for the recovery to finish; MGRs are
>>>>>>> not a critical component. You won't have a dashboard or metrics during
>>>>>>> that time, but it would prevent the high RAM usage.
>>>>>>>
>>>>>>> Quoting "Ing. Luis Felipe Domínguez Vega" <luis.dominguez(a)desoft.cu>:
>>>>>>>
>>>>>>>> On 2020-10-26 12:23, 胡 玮文 wrote:
>>>>>>>>>> On Oct 26, 2020, at 23:29, Ing. Luis Felipe Domínguez Vega
>>>>>>>>>> <luis.dominguez(a)desoft.cu> wrote:
>>>>>>>>>>
>>>>>>>>>> mgr: fond-beagle(active, since 39s)
>>>>>>>>>
>>>>>>>>> Your manager seems to be crash looping; it only started 39s ago.
>>>>>>>>> Looking at the mgr logs may help you identify why your cluster is
>>>>>>>>> not recovering. You may be hitting a bug in the mgr.
>>>>>>>> Nope, I'm restarting the ceph manager myself because it eats all of
>>>>>>>> the server's RAM. I have a script that restarts the manager whenever
>>>>>>>> free RAM drops to 1 GB (the server has 94 GB of RAM); I don't know
>>>>>>>> why the manager does this.
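>>>>>>>>
>>>>>>>> (Roughly, that watchdog amounts to something like the following; a
>>>>>>>> simplified, hypothetical sketch, not the exact script:)
>>>>>>>>
>>>>>>>> #!/bin/bash
>>>>>>>> # Restart ceph-mgr whenever available memory drops below ~1 GB.
>>>>>>>> # Threshold and sleep interval are illustrative only.
>>>>>>>> while true; do
>>>>>>>>     avail_mb=$(free -m | awk '/^Mem:/ {print $7}')  # "available" column
>>>>>>>>     if [ "$avail_mb" -lt 1024 ]; then
>>>>>>>>         systemctl restart ceph-mgr.target
>>>>>>>>     fi
>>>>>>>>     sleep 60
>>>>>>>> done
>>>>>>>>
>>>>>>>> The manager logs during this are: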
>>>>>>>>
>>>>>>>> -----------------------------------
>>>>>>>> root@fond-beagle:/var/lib/ceph/mon/ceph-fond-beagle/store.db# tail -f /var/log/ceph/ceph-mgr.fond-beagle.log
>>>>>>>> 2020-10-26T12:54:12.497-0400 7f2a8112b700 0 log_channel(cluster) log [DBG] : pgmap v584: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
>>>>>>>> 2020-10-26T12:54:12.497-0400 7f2a8112b700 0 log_channel(cluster) do_log log to syslog
>>>>>>>> 2020-10-26T12:54:14.501-0400 7f2a8112b700 0 log_channel(cluster) log [DBG] : pgmap v585: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
>>>>>>>> 2020-10-26T12:54:14.501-0400 7f2a8112b700 0 log_channel(cluster) do_log log to syslog
>>>>>>>> 2020-10-26T12:54:16.517-0400 7f2a8112b700 0 log_channel(cluster) log [DBG] : pgmap v586: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
>>>>>>>> 2020-10-26T12:54:16.517-0400 7f2a8112b700 0 log_channel(cluster) do_log log to syslog
>>>>>>>> 2020-10-26T12:54:18.521-0400 7f2a8112b700 0 log_channel(cluster) log [DBG] : pgmap v587: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
>>>>>>>> 2020-10-26T12:54:18.521-0400 7f2a8112b700 0 log_channel(cluster) do_log log to syslog
>>>>>>>> 2020-10-26T12:54:20.537-0400 7f2a8112b700 0 log_channel(cluster) log [DBG] : pgmap v588: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
>>>>>>>> 2020-10-26T12:54:20.537-0400 7f2a8112b700 0 log_channel(cluster) do_log log to syslog
>>>>>>>> 2020-10-26T12:54:22.541-0400 7f2a8112b700 0 log_channel(cluster) log [DBG] : pgmap v589: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
>>>>>>>> 2020-10-26T12:54:22.541-0400 7f2a8112b700 0 log_channel(cluster) do_log log to syslog
>>>>>>>> ---------------
>>>>>>
>>>>>> Ok, I will do that... but the thing is that the cluster doesn't show any
>>>>>> recovery activity; there is nothing like the recovery info that ceph -s
>>>>>> normally shows, so I can't tell whether it is recovering or what it is
>>>>>> doing.