Needed data:
ceph -s :
https://pastebin.ubuntu.com/p/S9gKjyZtdK/
ceph osd tree :
https://pastebin.ubuntu.com/p/SCZHkk6Mk4/
ceph osd df : (will follow later; I have been waiting for 10 minutes
and there is still no output)
ceph osd pool ls detail :
https://pastebin.ubuntu.com/p/GRdPjxhv3D/
crush rules : (ceph osd crush rule dump)
https://pastebin.ubuntu.com/p/cjyjmbQ4Wq/
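
For reference, a replicated rule that spreads copies across datacenters
(the situation Eugen describes below) would look roughly like the
following in decompiled crushmap syntax. This is only a generic sketch,
not necessarily the rule in the dump above:

rule replicated_across_dcs {
    id 1
    type replicated
    step take default
    step chooseleaf firstn 0 type datacenter
    step emit
}

With a rule of that shape and only one datacenter left in the CRUSH
tree, CRUSH cannot find a second datacenter to place replicas in, so
the affected PGs stay undersized/incomplete no matter which OSDs are
deleted.
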
On 2020-10-27 07:14, Eugen Block wrote:
>> I understand, but I deleted the OSDs from the CRUSH map, so ceph
>> won't wait for those OSDs anymore, right?
>
> It depends on your actual crush tree and rules. Can you share (maybe
> you already did)
>
> ceph osd tree
> ceph osd df
> ceph osd pool ls detail
>
> and a dump of your crush rules?
>
> As I already said, if you have rules in place that distribute data
> across 2 DCs and one of them is down the PGs will never recover even
> if you delete the OSDs from the failed DC.
>
>
>
> Quoting "Ing. Luis Felipe Domínguez Vega" <luis.dominguez(a)desoft.cu>:
>
>> I understand, but I deleted the OSDs from the CRUSH map, so ceph
>> won't wait for those OSDs anymore, right?
>>
>> On 2020-10-27 04:06, Eugen Block wrote:
>>> Hi,
>>>
>>> just to clarify so I don't miss anything: you have two DCs and one of
>>> them is down. And two of the MONs were in that failed DC? Now you
>>> removed all OSDs and two MONs from the failed DC hoping that your
>>> cluster will recover? If you have reasonable crush rules in place
>>> (e.g. to recover from a failed DC) your cluster will never recover in
>>> the current state unless you bring OSDs back up on the second DC.
>>> That's why you don't see progress in the recovery process: the PGs are
>>> waiting for their peers in the other DC so they can follow the crush
>>> rules.
>>>
>>> Regards,
>>> Eugen
>>>
>>>
>>> Quoting "Ing. Luis Felipe Domínguez Vega" <luis.dominguez(a)desoft.cu>:
>>>
>>>> I had 3 mons, but I have 2 physical datacenters and one of them broke
>>>> with no short-term fix, so I removed all of its OSDs and 2 of the ceph
>>>> mons, and now I only have the OSDs of 1 datacenter plus the remaining
>>>> monitor. I had stopped the ceph manager, but I saw that when I restart
>>>> a ceph manager, ceph -s shows recovery info for roughly 20 minutes and
>>>> then all of that info disappears.
>>>>
>>>> The thing is that the cluster doesn't seem to be self-recovering, and
>>>> the ceph monitor is "eating" all of the HDD.
>>>>
>>>> On 2020-10-26 15:57, Eugen Block wrote:
>>>>> The recovery process (ceph -s) is independent of the MGR service
>>>>> but
>>>>> only depends on the MON service. It seems you only have the one
>>>>> MON,
>>>>> if the MGR is overloading it (not clear why) it could help to leave
>>>>> MGR off and see if the MON service then has enough RAM to proceed
>>>>> with
>>>>> the recovery. Do you have any chance to add two more MONs? A single
>>>>> MON is of course a single point of failure.
>>>>>
>>>>>
>>>>> Quoting "Ing. Luis Felipe Domínguez Vega" <luis.dominguez(a)desoft.cu>:
>>>>>
>>>>>> On 2020-10-26 15:16, Eugen Block wrote:
>>>>>>> You could stop the MGRs and wait for the recovery to finish; MGRs are
>>>>>>> not a critical component. You won't have a dashboard or metrics during
>>>>>>> that time, but it would prevent the high RAM usage.
>>>>>>>
>>>>>>> Quoting "Ing. Luis Felipe Domínguez Vega" <luis.dominguez(a)desoft.cu>:
>>>>>>>
>>>>>>>> On 2020-10-26 12:23, 胡 玮文 wrote:
>>>>>>>>>> On Oct 26, 2020, at 23:29, Ing. Luis Felipe Domínguez Vega
>>>>>>>>>> <luis.dominguez(a)desoft.cu> wrote:
>>>>>>>>>>
>>>>>>>>>> mgr: fond-beagle(active, since 39s)
>>>>>>>>>
>>>>>>>>> Your manager seems to be crash looping; it only started 39s ago.
>>>>>>>>> Looking at the mgr logs may help you identify why your cluster is
>>>>>>>>> not recovering. You may be hitting a bug in the mgr.
>>>>>>>> Nope, I'm restarting the ceph manager myself because it eats all of
>>>>>>>> the server's RAM. I have a script that restarts the manager whenever
>>>>>>>> free RAM drops to 1 GB (the server has 94 GB of RAM); I don't know
>>>>>>>> why the manager does this.
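>>>>>>>>
>>>>>>>> (Roughly, that watchdog amounts to something like the following; a
>>>>>>>> simplified, hypothetical sketch, not the exact script:)
>>>>>>>>
>>>>>>>> #!/bin/bash
>>>>>>>> # Restart ceph-mgr whenever available memory drops below ~1 GB.
>>>>>>>> # Threshold and sleep interval are illustrative only.
>>>>>>>> while true; do
>>>>>>>>     avail_mb=$(free -m | awk '/^Mem:/ {print $7}')  # "available" column
>>>>>>>>     if [ "$avail_mb" -lt 1024 ]; then
>>>>>>>>         systemctl restart ceph-mgr.target
>>>>>>>>     fi
>>>>>>>>     sleep 60
>>>>>>>> done
>>>>>>>>
>>>>>>>> The manager logs during this are: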
>>>>>>>>
>>>>>>>> -----------------------------------
>>>>>>>> root@fond-beagle:/var/lib/ceph/mon/ceph-fond-beagle/store.db# tail -f /var/log/ceph/ceph-mgr.fond-beagle.log
>>>>>>>> 2020-10-26T12:54:12.497-0400 7f2a8112b700 0 log_channel(cluster) log [DBG] : pgmap v584: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
>>>>>>>> 2020-10-26T12:54:12.497-0400 7f2a8112b700 0 log_channel(cluster) do_log log to syslog
>>>>>>>> 2020-10-26T12:54:14.501-0400 7f2a8112b700 0 log_channel(cluster) log [DBG] : pgmap v585: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
>>>>>>>> 2020-10-26T12:54:14.501-0400 7f2a8112b700 0 log_channel(cluster) do_log log to syslog
>>>>>>>> 2020-10-26T12:54:16.517-0400 7f2a8112b700 0 log_channel(cluster) log [DBG] : pgmap v586: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
>>>>>>>> 2020-10-26T12:54:16.517-0400 7f2a8112b700 0 log_channel(cluster) do_log log to syslog
>>>>>>>> 2020-10-26T12:54:18.521-0400 7f2a8112b700 0 log_channel(cluster) log [DBG] : pgmap v587: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
>>>>>>>> 2020-10-26T12:54:18.521-0400 7f2a8112b700 0 log_channel(cluster) do_log log to syslog
>>>>>>>> 2020-10-26T12:54:20.537-0400 7f2a8112b700 0 log_channel(cluster) log [DBG] : pgmap v588: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
>>>>>>>> 2020-10-26T12:54:20.537-0400 7f2a8112b700 0 log_channel(cluster) do_log log to syslog
>>>>>>>> 2020-10-26T12:54:22.541-0400 7f2a8112b700 0 log_channel(cluster) log [DBG] : pgmap v589: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
>>>>>>>> 2020-10-26T12:54:22.541-0400 7f2a8112b700 0 log_channel(cluster) do_log log to syslog
>>>>>>>> ---------------
>>>>>>
>>>>>> Ok, I will do that... but the thing is that the cluster doesn't show any
>>>>>> recovery activity; there is nothing like the recovery info that ceph -s
>>>>>> normally shows, so I can't tell whether it is recovering or what it is
>>>>>> doing.