Great response, thanks. I will now use only one site, but first I need
to stabilize the cluster, remove the EC (erasure coding) pool and use
replication only. Could you help me?
The thing is that I have 2 pools, cinder-ceph and data_storage.
data_storage is only the data pool for the cinder-ceph pool, but now I
use only cinder-ceph with replication 3. How can I move all data from
data_storage to cinder-ceph and remove the EC pool?
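
For reference, the kind of per-image copy I have in mind is something
like this (just a rough sketch, volume-XXXX is a placeholder name, and
I do not know if it is the right or safe way, especially while the EC
pool still has incomplete/unfound PGs):

  rbd cp cinder-ceph/volume-XXXX cinder-ceph/volume-XXXX-repl
  rbd rm cinder-ceph/volume-XXXX
  rbd rename cinder-ceph/volume-XXXX-repl cinder-ceph/volume-XXXX

i.e. copy each image, while the volume is not in use, to a destination
created without a separate data pool so that the data lands in the
replicated cinder-ceph pool. As far as I know, rbd cp does not preserve
snapshots.
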
On 2020-10-28 06:55, Frank Schilder wrote:
> Hi all, I need to go back to a small piece of information:
>
>> I had 3 mons, but I have 2 physical datacenters; one of them broke
>> with no short-term fix, so I removed all OSDs and ceph mons (2 of
>> them) and now I have only the OSDs of 1 datacenter with the monitor.
>
> When I look at the data about pools and crush map, I don't see
> anything that is multi-site. Maybe the physical location was 2-site,
> but the crush rules don't reflect that. Consequently, the ceph cluster
> was configured single-site and will act accordingly when you lose 50%
> of it.
>
> Quick interlude: when people recommend adding servers, they do not
> necessarily mean *new* servers. They mean you have to go to ground
> zero, dig out as much hardware as you can, drive it to the working
> site and make it rejoin the cluster.
>
> A hypothetical scenario: assume we want to build a 2-site cluster
> (sites A and B) that can sustain the total loss of any 1 site, with
> equal (mirrored) capacity at each site.
>
> Short answer: this is not exactly possible, because you always need a
> qualified majority of monitors for quorum and you cannot distribute
> both N MONs and a qualified majority evenly and simultaneously over 2
> sites. So we already have an additional constraint: site A will have 2
> monitors and site B 1 monitor. The condition is that, in case site A
> goes down, the monitors from site A can be rescued and moved to site B
> to restore data access. Hence, a loss of site A implies a temporary
> loss of service. (Note that 2+2=4 MONs will not help, because then 3
> MONs are required for a qualified majority; again, MONs would need to
> be rescued from the down site.) If this constraint is satisfied, then
> one can configure pools as follows:
>
> replicated: size 4, min_size 2, crush rule places 2 copies at each
> site (see the rule sketch below)
> erasure coded: k+m with min_size=k+1, m even and m>=k+2, for example,
> k=2, m=4, crush rule places 3 shards at each site
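>
> As an illustration only (the datacenter bucket names and the rule id
> are made up here, and the crush map would first need the hosts placed
> under two datacenter buckets), a rule that puts 2 copies at each site
> could look like this in the decompiled crush map:
>
>   rule replicated_two_sites {
>       id 10
>       type replicated
>       min_size 1
>       max_size 10
>       step take default
>       step choose firstn 2 type datacenter
>       step chooseleaf firstn 2 type host
>       step emit
>   }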
>
> With such a configuration, it is possible to sustain the loss of any
> one site if the monitors can be recovered from site A. Note that such
> EC pools will be very compute-intensive and have high latency (use the
> option fast_read to get at least reasonable read speeds). Essentially,
> EC is not really suitable for multi-site redundancy, but the above EC
> setup will require a bit less capacity than 4 copies.
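>
> For completeness, creating such a profile and pool would be along
> these lines (profile and pool names are just examples; the rule that
> actually places 3 shards at each site would again have to be a custom
> crush rule over the datacenter buckets):
>
>   ceph osd erasure-code-profile set ec-k2-m4 k=2 m=4 crush-failure-domain=host
>   ceph osd pool create my-ec-pool 128 128 erasure ec-k2-m4
>   ceph osd pool set my-ec-pool min_size 3
>   ceph osd pool set my-ec-pool fast_read 1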
>
> This setup can sustain the total loss of 1 site (minus MONs on site A)
> and will rebuild all data once a large enough second site is brought
> up again.
>
> When I look at the information you posted, I see replication 3(2) and
> EC 5+2 pools, all having crush root default. I do not see any of these
> mandatory configurations; the sites are ignored in the crush rules.
> Hence, if you can't get material from the down site back up, you are
> looking at permanent data loss.
>
> You may be able to recover some more data in the replicated pools by
> setting min_size=1 for some time. However, you will lose any writes
> that are on the other 2 disks but not on the 1 disk now used for
> recovery, and it will certainly not recover PGs with all 3 copies on
> the down site. Therefore, I would not attempt this, also because for
> the EC pools you will need to get hold of the hosts from the down site
> and re-integrate them into the cluster anyway. If you can't do this,
> the data is lost.
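>
> (For reference, that temporary change would be applied per pool, e.g.
> "ceph osd pool set cinder-ceph min_size 1", and reverted to min_size 2
> as soon as possible; but again, I would not do it here.)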
>
> In the long run, given your crush map and rules, you either stop
> placing stuff at 2 sites, or you create a proper 2-site set-up and
> copy data over.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Ing. Luis Felipe Domínguez Vega <luis.dominguez(a)desoft.cu>
> Sent: 28 October 2020 05:14:27
> To: Eugen Block
> Cc: Ceph Users
> Subject: [ceph-users] Re: Huge HDD ceph monitor usage [EXT]
>
> Well, recovery is not working yet... I started 6 more servers and the
> cluster has still not recovered.
> ceph status does not show any recovery progress.
>
> ceph -s : https://pastebin.ubuntu.com/p/zRQPbvGzbw/
> ceph osd tree : https://pastebin.ubuntu.com/p/sTDs8vd7Sk/
> ceph osd df : https://pastebin.ubuntu.com/p/ysbh8r2VVz/
> ceph osd pool ls detail : https://pastebin.ubuntu.com/p/GRdPjxhv3D/
> crush rules (ceph osd crush rule dump) : https://pastebin.ubuntu.com/p/cjyjmbQ4Wq/
>
> On 2020-10-27 09:59, Eugen Block wrote:
>> Your pool 'data_storage' has a size of 7 (or 7 chunks since it's
>> erasure-coded) and the rule requires each chunk on a different host
>> but you currently have only 5 hosts available, that's why the recovery
>> is not progressing. It's waiting for two more hosts. Unfortunately,
>> you can't change the EC profile or the rule of that pool. I'm not sure
>> if it would work in the current cluster state, but if you can't add
>> two more hosts (which would be your best option for recovery) it might
>> be possible to create a new replicated pool (you seem to have enough
>> free space) and copy the contents from that EC pool. But as I said,
>> I'm not sure if that would work in a degraded state, I've never tried
>> that.
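>>
>> A very rough sketch of that fallback (pool name and PG count made up,
>> and as said, untested in a degraded state like this):
>>
>>   ceph osd pool create data_storage_rep 32 32 replicated
>>   ceph osd pool application enable data_storage_rep rbd
>>
>> followed by a per-image copy (e.g. with 'rbd cp'), since data_storage
>> only holds the data objects of RBD images whose headers live in
>> cinder-ceph.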
>>
>> So your best bet is to get two more hosts somehow.
>>
>>
>>> pool 4 'data_storage' erasure profile desoft size 7 min_size 5
>>> crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32
>>> autoscale_mode
>>> off last_change 154384 lfor 0/121016/121014 flags
>>> hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384
>>> application rbd
>>
>>
>> Quoting "Ing. Luis Felipe Domínguez Vega"
>> <luis.dominguez(a)desoft.cu>:
>>
>>> Needed data:
>>>
>>> ceph -s : https://pastebin.ubuntu.com/p/S9gKjyZtdK/
>>> ceph osd tree : https://pastebin.ubuntu.com/p/SCZHkk6Mk4/
>>> ceph osd df : (later, because I have been waiting for 10 minutes
>>> with no output yet)
>>> ceph osd pool ls detail : https://pastebin.ubuntu.com/p/GRdPjxhv3D/
>>> crush rules (ceph osd crush rule dump) : https://pastebin.ubuntu.com/p/cjyjmbQ4Wq/
>>>
>>> On 2020-10-27 07:14, Eugen Block wrote:
>>>>> I understand, but I deleted the OSDs from the CRUSH map, so ceph
>>>>> won't wait for these OSDs, am I right?
>>>>
>>>> It depends on your actual crush tree and rules. Can you share (maybe
>>>> you already did)
>>>>
>>>> ceph osd tree
>>>> ceph osd df
>>>> ceph osd pool ls detail
>>>>
>>>> and a dump of your crush rules?
>>>>
>>>> As I already said, if you have rules in place that distribute data
>>>> across 2 DCs and one of them is down the PGs will never recover even
>>>> if you delete the OSDs from the failed DC.
>>>>
>>>>
>>>>
>>>> Quoting "Ing. Luis Felipe Domínguez Vega"
>>>> <luis.dominguez(a)desoft.cu>:
>>>>
>>>>> I understand, but I deleted the OSDs from the CRUSH map, so ceph
>>>>> won't wait for these OSDs, am I right?
>>>>>
>>>>> On 2020-10-27 04:06, Eugen Block wrote:
>>>>>> Hi,
>>>>>>
>>>>>> just to clarify so I don't miss anything: you have two DCs and one
>>>>>> of them is down. And two of the MONs were in that failed DC? Now
>>>>>> you removed all OSDs and two MONs from the failed DC hoping that
>>>>>> your cluster will recover? If you have reasonable crush rules in
>>>>>> place (e.g. to recover from a failed DC) your cluster will never
>>>>>> recover in the current state unless you bring OSDs back up on the
>>>>>> second DC. That's why you don't see progress in the recovery
>>>>>> process, the PGs are waiting for their peers in the other DC so
>>>>>> they can follow the crush rules.
>>>>>>
>>>>>> Regards,
>>>>>> Eugen
>>>>>>
>>>>>>
>>>>>> Quoting "Ing. Luis Felipe Domínguez Vega"
>>>>>> <luis.dominguez(a)desoft.cu>:
>>>>>>
>>>>>>> I had 3 mons, but I have 2 physical datacenters; one of them
>>>>>>> broke with no short-term fix, so I removed all OSDs and ceph mons
>>>>>>> (2 of them) and now I have only the OSDs of 1 datacenter with the
>>>>>>> monitor. I had stopped the ceph manager, but I saw that when I
>>>>>>> restart a ceph manager then ceph -s shows recovery info for
>>>>>>> roughly 20 minutes, and then all that info disappears.
>>>>>>>
>>>>>>> The thing is that it seems the cluster is not recovering on its
>>>>>>> own and the ceph monitor is "eating" all of the HDD.
>>>>>>>
>>>>>>> On 2020-10-26 15:57, Eugen Block wrote:
>>>>>>>> The recovery process (ceph -s) is independent of the MGR service
>>>>>>>> but only depends on the MON service. It seems you only have the
>>>>>>>> one MON; if the MGR is overloading it (not clear why) it could
>>>>>>>> help to leave MGR off and see if the MON service then has enough
>>>>>>>> RAM to proceed with the recovery. Do you have any chance to add
>>>>>>>> two more MONs? A single MON is of course a single point of
>>>>>>>> failure.
>>>>>>>>
>>>>>>>>
>>>>>>>> Quoting "Ing. Luis Felipe Domínguez Vega"
>>>>>>>> <luis.dominguez(a)desoft.cu>:
>>>>>>>>
>>>>>>>>> On 2020-10-26 15:16, Eugen Block wrote:
>>>>>>>>>> You could stop the MGRs and wait for the recovery to finish;
>>>>>>>>>> MGRs are not a critical component. You won't have a dashboard
>>>>>>>>>> or metrics during that time, but it would prevent the high RAM
>>>>>>>>>> usage.
>>>>>>>>>>
>>>>>>>>>> Quoting "Ing. Luis Felipe Domínguez Vega"
>>>>>>>>>> <luis.dominguez(a)desoft.cu>:
>>>>>>>>>>
>>>>>>>>>>> On 2020-10-26 12:23, 胡 玮文 wrote:
>>>>>>>>>>>>> On 2020-10-26, at 23:29, Ing. Luis Felipe Domínguez Vega
>>>>>>>>>>>>> <luis.dominguez(a)desoft.cu> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> mgr: fond-beagle(active, since 39s)
>>>>>>>>>>>>
>>>>>>>>>>>> Your manager seems to be crash-looping; it has only been up
>>>>>>>>>>>> for 39s. Looking at the mgr logs may help you identify why
>>>>>>>>>>>> your cluster is not recovering. You may be hitting a bug in
>>>>>>>>>>>> the mgr.
>>>>>>>>>>> Nope, I'm restarting the ceph manager because it eats all
>>>>>>>>>>> the server RAM; I have a script that restarts the manager
>>>>>>>>>>> when only 1 GB of free RAM is left (the server has 94 GB of
>>>>>>>>>>> RAM). I don't know why, and the manager logs are:
>>>>>>>>>>>
>>>>>>>>>>> -----------------------------------
>>>>>>>>>>> root@fond-beagle:/var/lib/ceph/mon/ceph-fond-beagle/store.db# tail -f /var/log/ceph/ceph-mgr.fond-beagle.log
>>>>>>>>>>> 2020-10-26T12:54:12.497-0400 7f2a8112b700 0 log_channel(cluster) log [DBG] : pgmap v584: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
>>>>>>>>>>> 2020-10-26T12:54:12.497-0400 7f2a8112b700 0 log_channel(cluster) do_log log to syslog
>>>>>>>>>>> [... the same pgmap line repeats every 2 seconds for v585 through v589 (12:54:14 to 12:54:22) with identical statistics ...]
>>>>>>>>>>> ---------------
>>>>>>>>>
>>>>>>>>> OK, I will do that... but the thing is that the cluster does
>>>>>>>>> not show any recovery, it does not show that it is doing
>>>>>>>>> anything, like the recovery info normally shown by the ceph -s
>>>>>>>>> command, so I don't know whether it is recovering or what it is
>>>>>>>>> doing.
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io