Great response, thanks. I will now use only one site, but first I need
to stabilize the cluster, remove the EC (erasure coding) pool and use
replication only. Could you help me?
The thing is that I have 2 pools, cinder-ceph and data_storage.
data_storage is only the data pool for the cinder-ceph pool, but now I
use only cinder-ceph with replication 3. How can I move all data from
data_storage to cinder-ceph and remove the EC pool?
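
For reference, the kind of per-image copy I have in mind is something
like this (just a rough sketch, volume-XXXX is a placeholder name, and
I do not know if it is the right or safe way, especially while the EC
pool still has incomplete/unfound PGs):

  rbd cp cinder-ceph/volume-XXXX cinder-ceph/volume-XXXX-repl
  rbd rm cinder-ceph/volume-XXXX
  rbd rename cinder-ceph/volume-XXXX-repl cinder-ceph/volume-XXXX

i.e. copy each image, while the volume is not in use, to a destination
created without a separate data pool so that the data lands in the
replicated cinder-ceph pool. As far as I know, rbd cp does not preserve
snapshots.
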
On 2020-10-28 06:55, Frank Schilder wrote:
> Hi all, I need to go back to a small piece of information:
>
>> I had 3 mons, but I have 2 physical datacenters; one of them broke
>> with no short-term fix, so I removed all OSDs and ceph mons (2 of
>> them) and now I have only the OSDs of 1 datacenter with the monitor.
>
> When I look at the data about pools and crush map, I don't see
> anything that is multi-site. Maybe the physical location was 2-site,
> but the crush rules don't reflect that. Consequently, the ceph cluster
> was configured single-site and will act accordingly when you lose 50%
> of it.
>
> Quick interlude: when people recommend adding servers, they do not
> necessarily mean *new* servers. They mean you have to go to ground
> zero, dig out as much hardware as you can, drive it to the working
> site and make it rejoin the cluster.
>
> A hypothetical scenario: assume we want to build a 2-site cluster
> (sites A and B) that can sustain the total loss of any 1 site, with
> equal (mirrored) capacity at each site.
>
> Short answer: this is not exactly possible, because you always need a
> qualified majority of monitors for quorum and you cannot distribute
> both N MONs and a qualified majority evenly and simultaneously over 2
> sites. So we already have an additional constraint: site A will have 2
> monitors and site B 1 monitor. The condition is that, in case site A
> goes down, the monitors from site A can be rescued and moved to site B
> to restore data access. Hence, a loss of site A implies a temporary
> loss of service. (Note that 2+2=4 MONs will not help, because then 3
> MONs are required for a qualified majority; again, MONs would need to
> be rescued from the down site.) If this constraint is satisfied, then
> one can configure pools as follows:
>
> replicated: size 4, min_size 2, crush rule places 2 copies at each
> site (see the rule sketch below)
> erasure coded: k+m with min_size=k+1, m even and m>=k+2, for example,
> k=2, m=4, crush rule places 3 shards at each site
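>
> As an illustration only (the datacenter bucket names and the rule id
> are made up here, and the crush map would first need the hosts placed
> under two datacenter buckets), a rule that puts 2 copies at each site
> could look like this in the decompiled crush map:
>
>   rule replicated_two_sites {
>       id 10
>       type replicated
>       min_size 1
>       max_size 10
>       step take default
>       step choose firstn 2 type datacenter
>       step chooseleaf firstn 2 type host
>       step emit
>   }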
>
> With such a configuration, it is possible to sustain the loss of any
> one site if the monitors can be recovered from site A. Note that such
> EC pools will be very compute-intensive and have high latency (use the
> option fast_read to get at least reasonable read speeds). Essentially,
> EC is not really suitable for multi-site redundancy, but the above EC
> setup will require a bit less capacity than 4 copies.
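>
> For completeness, creating such a profile and pool would be along
> these lines (profile and pool names are just examples; the rule that
> actually places 3 shards at each site would again have to be a custom
> crush rule over the datacenter buckets):
>
>   ceph osd erasure-code-profile set ec-k2-m4 k=2 m=4 crush-failure-domain=host
>   ceph osd pool create my-ec-pool 128 128 erasure ec-k2-m4
>   ceph osd pool set my-ec-pool min_size 3
>   ceph osd pool set my-ec-pool fast_read 1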
>
> This setup can sustain the total loss of 1 site (minus MONs on site A)
> and will rebuild all data once a large enough second site is brought
> up again.
>
> When I look at the information you posted, I see replication 3(2) and
> EC 5+2 pools, all having crush root default. I do not see any of these
> mandatory configurations; the sites are ignored in the crush rules.
> Hence, if you can't get material from the down site back up, you are
> looking at permanent data loss.
>
> You may be able to recover some more data in the replicated pools by
> setting min_size=1 for some time. However, you will lose any writes
> that are on the other 2 disks but not on the 1 disk now used for
> recovery, and it will certainly not recover PGs with all 3 copies on
> the down site. Therefore, I would not attempt this, also because for
> the EC pools you will need to get hold of the hosts from the down site
> and re-integrate them into the cluster anyway. If you can't do this,
> the data is lost.
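>
> (For reference, that temporary change would be applied per pool, e.g.
> "ceph osd pool set cinder-ceph min_size 1", and reverted to min_size 2
> as soon as possible; but again, I would not do it here.)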
>
> In the long run, given your crush map and rules, you either stop
> placing stuff at 2 sites, or you create a proper 2-site set-up and
> copy data over.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Ing. Luis Felipe Domínguez Vega <luis.dominguez(a)desoft.cu>
> Sent: 28 October 2020 05:14:27
> To: Eugen Block
> Cc: Ceph Users
> Subject: [ceph-users] Re: Huge HDD ceph monitor usage [EXT]
>
> Well, recovery is not working yet... I started 6 more servers and the
> cluster has still not recovered.
> ceph status does not show any recovery progress.
>
> ceph -s : https://pastebin.ubuntu.com/p/zRQPbvGzbw/
> ceph osd tree : https://pastebin.ubuntu.com/p/sTDs8vd7Sk/
> ceph osd df : https://pastebin.ubuntu.com/p/ysbh8r2VVz/
> ceph osd pool ls detail : https://pastebin.ubuntu.com/p/GRdPjxhv3D/
> crush rules (ceph osd crush rule dump) : https://pastebin.ubuntu.com/p/cjyjmbQ4Wq/
>
> On 2020-10-27 09:59, Eugen Block wrote:
>> Your pool 'data_storage' has a size of 7 (or 7 chunks since it's
>> erasure-coded) and the rule requires each chunk on a different host
>> but you currently have only 5 hosts available, that's why the recovery
>> is not progressing. It's waiting for two more hosts. Unfortunately,
>> you can't change the EC profile or the rule of that pool. I'm not sure
>> if it would work in the current cluster state, but if you can't add
>> two more hosts (which would be your best option for recovery) it might
>> be possible to create a new replicated pool (you seem to have enough
>> free space) and copy the contents from that EC pool. But as I said,
>> I'm not sure if that would work in a degraded state, I've never tried
>> that.
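>>
>> A very rough sketch of that fallback (pool name and PG count made up,
>> and as said, untested in a degraded state like this):
>>
>>   ceph osd pool create data_storage_rep 32 32 replicated
>>   ceph osd pool application enable data_storage_rep rbd
>>
>> followed by a per-image copy (e.g. with 'rbd cp'), since data_storage
>> only holds the data objects of RBD images whose headers live in
>> cinder-ceph.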
>>
>> So your best bet is to get two more hosts somehow.
>>
>>
>>> pool 4 'data_storage' erasure profile desoft size 7 min_size 5
>>> crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32
>>> autoscale_mode
>>> off last_change 154384 lfor 0/121016/121014 flags
>>> hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384
>>> application rbd
>>
>>
>> Quoting "Ing. Luis Felipe Domínguez Vega"
>> <luis.dominguez(a)desoft.cu>:
>>
>>> Needed data:
>>>
>>> ceph -s : https://pastebin.ubuntu.com/p/S9gKjyZtdK/
>>> ceph osd tree : https://pastebin.ubuntu.com/p/SCZHkk6Mk4/
>>> ceph osd df : (later, because I have been waiting for 10 minutes
>>> with no output yet)
>>> ceph osd pool ls detail : https://pastebin.ubuntu.com/p/GRdPjxhv3D/
>>> crush rules (ceph osd crush rule dump) : https://pastebin.ubuntu.com/p/cjyjmbQ4Wq/
>>>
>>> On 2020-10-27 07:14, Eugen Block wrote:
>>>>> I understand, but I deleted the OSDs from the CRUSH map, so ceph
>>>>> won't wait for these OSDs, am I right?
>>>>
>>>> It depends on your actual crush tree and rules. Can you share (maybe
>>>> you already did)
>>>>
>>>> ceph osd tree
>>>> ceph osd df
>>>> ceph osd pool ls detail
>>>>
>>>> and a dump of your crush rules?
>>>>
>>>> As I already said, if you have rules in place that distribute data
>>>> across 2 DCs and one of them is down the PGs will never recover even
>>>> if you delete the OSDs from the failed DC.
>>>>
>>>>
>>>>
>>>> Quoting "Ing. Luis Felipe Domínguez Vega"
>>>> <luis.dominguez(a)desoft.cu>:
>>>>
>>>>> I understand, but I deleted the OSDs from the CRUSH map, so ceph
>>>>> won't wait for these OSDs, am I right?
>>>>>
>>>>> On 2020-10-27 04:06, Eugen Block wrote:
>>>>>> Hi,
>>>>>>
>>>>>> just to clarify so I don't miss anything: you have two DCs and one
>>>>>> of them is down. And two of the MONs were in that failed DC? Now
>>>>>> you removed all OSDs and two MONs from the failed DC hoping that
>>>>>> your cluster will recover? If you have reasonable crush rules in
>>>>>> place (e.g. to recover from a failed DC) your cluster will never
>>>>>> recover in the current state unless you bring OSDs back up on the
>>>>>> second DC. That's why you don't see progress in the recovery
>>>>>> process, the PGs are waiting for their peers in the other DC so
>>>>>> they can follow the crush rules.
>>>>>>
>>>>>> Regards,
>>>>>> Eugen
>>>>>>
>>>>>>
>>>>>> Quoting "Ing. Luis Felipe Domínguez Vega"
>>>>>> <luis.dominguez(a)desoft.cu>:
>>>>>>
>>>>>>> I had 3 mons, but I have 2 physical datacenters; one of them
>>>>>>> broke with no short-term fix, so I removed all OSDs and ceph mons
>>>>>>> (2 of them) and now I have only the OSDs of 1 datacenter with the
>>>>>>> monitor. I had stopped the ceph manager, but I saw that when I
>>>>>>> restart a ceph manager then ceph -s shows recovery info for
>>>>>>> roughly 20 minutes, and then all that info disappears.
>>>>>>>
>>>>>>> The thing is that it seems the cluster is not recovering on its
>>>>>>> own and the ceph monitor is "eating" all of the HDD.
>>>>>>>
>>>>>>> On 2020-10-26 15:57, Eugen Block wrote:
>>>>>>>> The recovery process (ceph -s) is independent of the MGR service
>>>>>>>> but only depends on the MON service. It seems you only have the
>>>>>>>> one MON; if the MGR is overloading it (not clear why) it could
>>>>>>>> help to leave MGR off and see if the MON service then has enough
>>>>>>>> RAM to proceed with the recovery. Do you have any chance to add
>>>>>>>> two more MONs? A single MON is of course a single point of
>>>>>>>> failure.
>>>>>>>>
>>>>>>>>
>>>>>>>> Quoting "Ing. Luis Felipe Domínguez Vega"
>>>>>>>> <luis.dominguez(a)desoft.cu>:
>>>>>>>>
>>>>>>>>> On 2020-10-26 15:16, Eugen Block wrote:
>>>>>>>>>> You could stop the MGRs and wait for the recovery to finish;
>>>>>>>>>> MGRs are not a critical component. You won't have a dashboard
>>>>>>>>>> or metrics during that time, but it would prevent the high RAM
>>>>>>>>>> usage.
>>>>>>>>>>
>>>>>>>>>> Quoting "Ing. Luis Felipe Domínguez Vega"
>>>>>>>>>> <luis.dominguez(a)desoft.cu>:
>>>>>>>>>>
>>>>>>>>>>> On 2020-10-26 12:23, 胡 玮文 wrote:
>>>>>>>>>>>>> On 2020-10-26, at 23:29, Ing. Luis Felipe Domínguez Vega
>>>>>>>>>>>>> <luis.dominguez(a)desoft.cu> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> mgr: fond-beagle(active, since 39s)
>>>>>>>>>>>>
>>>>>>>>>>>> Your manager seems to be crash-looping; it has only been up
>>>>>>>>>>>> for 39s. Looking at the mgr logs may help you identify why
>>>>>>>>>>>> your cluster is not recovering. You may be hitting a bug in
>>>>>>>>>>>> the mgr.
>>>>>>>>>>> Nope, I'm restarting the ceph manager because it eats all
>>>>>>>>>>> the server RAM; I have a script that restarts the manager
>>>>>>>>>>> when only 1 GB of free RAM is left (the server has 94 GB of
>>>>>>>>>>> RAM). I don't know why, and the manager logs are:
>>>>>>>>>>>
>>>>>>>>>>> -----------------------------------
>>>>>>>>>>> root@fond-beagle:/var/lib/ceph/mon/ceph-fond-beagle/store.db# tail -f /var/log/ceph/ceph-mgr.fond-beagle.log
>>>>>>>>>>> 2020-10-26T12:54:12.497-0400 7f2a8112b700 0 log_channel(cluster) log [DBG] : pgmap v584: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
>>>>>>>>>>> 2020-10-26T12:54:12.497-0400 7f2a8112b700 0 log_channel(cluster) do_log log to syslog
>>>>>>>>>>> [... the same pgmap line repeats every 2 seconds for v585 through v589 (12:54:14 to 12:54:22) with identical statistics ...]
>>>>>>>>>>> ---------------
>>>>>>>>>
>>>>>>>>> OK, I will do that... but the thing is that the cluster does
>>>>>>>>> not show any recovery, it does not show that it is doing
>>>>>>>>> anything, like the recovery info normally shown by the ceph -s
>>>>>>>>> command, so I don't know whether it is recovering or what it is
>>>>>>>>> doing.
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io