Great response, thanks. I will now use only one site, but first I need
to stabilize the cluster, remove the EC (erasure coding) pool and use
replication instead. Could you help me?
The thing is that I have 2 pools, cinder-ceph and data_storage.
data_storage is only the data path for the cinder-ceph pool, but now I
use only cinder-ceph with replication 3. How can I move all data
from data_storage to cinder-ceph and remove the EC pool?
On 2020-10-28 06:55, Frank Schilder wrote:
Hi all, I need to go back to a small piece of
information:
I had 3 MONs, but I have 2 physical datacenters; one of them broke
with no short-term fix, so I removed all OSDs and Ceph MONs (2 of
them), and now I have only the OSDs of 1 datacenter with the monitor.
When I look at the data about pools and crush map, I don't see
anything that is multi-site. Maybe the physical location was 2-site,
but the crush rules don't reflect that. Consequently, the ceph cluster
was configured single-site and will act accordingly when you lose 50%
of it.
Quick interlude: when people recommend adding servers, they do not
necessarily mean *new* servers. They mean you have to go to ground
zero, dig out as much hardware as you can, drive it to the working
site and make it rejoin the cluster.
A hypothesis. Assume we want to build a 2-site cluster (sites A and B)
that can sustain the total loss of any 1 site, capacity at each site
is equal (mirrored).
Short answer: this is not exactly possible, because you always need a
qualified majority of monitors for quorum, and you cannot distribute
both N MONs and a qualified majority evenly and simultaneously over 2
sites. We therefore have an additional constraint: site A will have 2
monitors and site B will have 1. The condition is that, in case site A
goes down, the monitors from site A can be rescued and moved to site B
to restore data access. Hence, a loss of site A implies a temporary
loss of service. (Note that 2+2=4 MONs will not help, because then 3
MONs are required for a qualified majority; again, MONs need to be
rescued from the down site.) If this constraint is satisfied, then one
can configure pools as follows:
replicated: size 4, min_size 2, crush rule places 2 copies at each site
erasure coded: k+m with min_size=k+1, m even and m>=k+2, for example,
k=2, m=4, crush rule places 3 shards at each site
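As a rough sketch (not from the thread; all names are hypothetical, and it assumes a crush hierarchy with two `datacenter` buckets under root `default`), the replicated variant could be set up along these lines:

```shell
# Append a crush rule that places 2 copies in each of the 2 datacenters.
ceph osd getcrushmap -o map.bin
crushtool -d map.bin -o map.txt
cat >> map.txt <<'EOF'
rule rep_2site {
    id 10
    type replicated
    step take default
    step choose firstn 2 type datacenter
    step chooseleaf firstn 2 type host
    step emit
}
EOF
crushtool -c map.txt -o map.new
ceph osd setcrushmap -i map.new

# Replicated pool as described above: size 4, min_size 2
ceph osd pool create mirrored_pool 128 128 replicated rep_2site
ceph osd pool set mirrored_pool size 4
ceph osd pool set mirrored_pool min_size 2
```

Before trusting such a rule with data, one would verify the placements offline, e.g. with `crushtool -i map.new --test --rule 10 --num-rep 4 --show-mappings`.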
With such a configuration, it is possible to sustain the loss of any
one site, provided the monitors can be recovered from site A. Note
that such EC pools will be very compute-intensive and have high
latency (use the option fast_read to get at least reasonable read
speeds). Essentially, EC is not really suitable for multi-site
redundancy, but the above EC setup requires a bit less capacity than 4
copies.
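To make the capacity remark concrete, here is a back-of-the-envelope comparison of the raw-capacity overhead (raw bytes stored per byte of user data) for the two variants:

```shell
# 4-way replication stores 4 full copies: overhead 4.0
rep_overhead=$(awk 'BEGIN { printf "%.1f", 4 }')
# EC k=2, m=4 stores k+m shards holding k shards' worth of data: overhead 3.0
ec_overhead=$(awk -v k=2 -v m=4 'BEGIN { printf "%.1f", (k+m)/k }')
echo "replicated size 4: ${rep_overhead}x  EC 2+4: ${ec_overhead}x"
```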
This setup can sustain the total loss of 1 site (minus MONs on site A)
and will rebuild all data once a large enough second site is brought
up again.
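The monitor arithmetic in the hypothesis above can be sanity-checked: a qualified majority of n MONs is floor(n/2) + 1.

```shell
# Qualified majority required for MON quorum
majority() { echo $(( $1 / 2 + 1 )); }

echo "3 MONs need $(majority 3)"  # 2: a 2+1 split survives only loss of the 1-MON site
echo "4 MONs need $(majority 4)"  # 3: with 2+2 MONs, the 2 survivors still lack quorum
```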
When I look at the information you posted, I see replication 3(2) and
EC 5+2 pools, all having crush root default. I do not see any of these
mandatory configurations; the sites are ignored in the crush rules.
Hence, if you can't get the material from the down site back up, you
are looking at permanent data loss.
You may be able to recover some more data in the replicated pools by
setting min_size=1 for some time. However, you will lose any writes
that are on the other 2 disks but not on the 1 disk now used for
recovery, and it will certainly not recover PGs with all 3 copies on
the down site. Therefore, I would not attempt this, also because for
the EC pools you will need to get hold of the hosts from the down site
and re-integrate them into the cluster anyway. If you can't do this,
the data is lost.
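For reference only, since the paragraph above advises against it, the setting being discussed would be changed like this (pool name hypothetical):

```shell
# NOT recommended above: allows I/O with a single surviving replica,
# at the risk of losing newer writes held only on the down copies.
ceph osd pool set cinder-ceph min_size 1
# ...and revert once recovery has finished:
ceph osd pool set cinder-ceph min_size 2
```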
In the long run, given your crush map and rules, you either stop
placing stuff at 2 sites, or you create a proper 2-site set-up and
copy data over.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Ing. Luis Felipe Domínguez Vega <luis.dominguez(a)desoft.cu>
Sent: 28 October 2020 05:14:27
To: Eugen Block
Cc: Ceph Users
Subject: [ceph-users] Re: Huge HDD ceph monitor usage [EXT]
Well, recovery is not working yet... I started 6 more servers and the
cluster still has not recovered.
Ceph status does not show any recovery progress.
ceph -s :
https://pastebin.ubuntu.com/p/zRQPbvGzbw/
ceph osd tree :
https://pastebin.ubuntu.com/p/sTDs8vd7Sk/
ceph osd df :
https://pastebin.ubuntu.com/p/ysbh8r2VVz/
ceph osd pool ls detail :
https://pastebin.ubuntu.com/p/GRdPjxhv3D/
crush rules : (ceph osd crush rule dump)
https://pastebin.ubuntu.com/p/cjyjmbQ4Wq/
On 2020-10-27 09:59, Eugen Block wrote:
Your pool 'data_storage' has a size of 7
(or 7 chunks since it's
erasure-coded) and the rule requires each chunk on a different host
but you currently have only 5 hosts available, that's why the recovery
is not progressing. It's waiting for two more hosts. Unfortunately,
you can't change the EC profile or the rule of that pool. I'm not sure
if it would work in the current cluster state, but if you can't add
two more hosts (which would be your best option for recovery) it might
be possible to create a new replicated pool (you seem to have enough
free space) and copy the contents from that EC pool. But as I said,
I'm not sure if that would work in a degraded state, I've never tried
that.
So your best bet is to get two more hosts somehow.
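A minimal sketch of the copy approach mentioned above (pool names are hypothetical; `rados cppool` is deprecated, copies objects flat without snapshots, and requires client I/O to be stopped, so treat this only as an illustration, not a recommendation for a degraded cluster):

```shell
# Create a replicated target pool (PG counts illustrative)
ceph osd pool create data_storage_rep 32 32 replicated
ceph osd pool set data_storage_rep size 3
ceph osd pool application enable data_storage_rep rbd
# Flat object-level copy out of the EC pool
rados cppool data_storage data_storage_rep
```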
pool 4 'data_storage' erasure profile desoft size 7 min_size 5 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 154384 lfor 0/121016/121014 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384 application rbd
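The arithmetic behind the stall, as a quick check: an EC 5+2 pool writes k+m shards per PG, and this rule wants each shard on a distinct host, but only 5 hosts are up:

```shell
k=5; m=2; hosts_up=5
shards=$(( k + m ))                   # 7 shards per PG, one per host
unplaceable=$(( shards - hosts_up ))  # 2 shards have no host to land on
echo "$shards shards needed, only $hosts_up hosts: $unplaceable unplaceable"
```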
Quoting "Ing. Luis Felipe Domínguez Vega" <luis.dominguez(a)desoft.cu>:
> Needed data:
>
> ceph -s : https://pastebin.ubuntu.com/p/S9gKjyZtdK/
> ceph osd tree : https://pastebin.ubuntu.com/p/SCZHkk6Mk4/
> ceph osd df : (later, because I have been waiting for 10 minutes
> with no output yet)
> ceph osd pool ls detail : https://pastebin.ubuntu.com/p/GRdPjxhv3D/
> crush rules : (ceph osd crush rule dump)
> https://pastebin.ubuntu.com/p/cjyjmbQ4Wq/
>
> On 2020-10-27 07:14, Eugen Block wrote:
>>> I understand, but I deleted the OSDs from the CRUSH map, so Ceph
>>> won't wait for these OSDs, am I right?
>>
>> It depends on your actual crush tree and rules. Can you share (maybe
>> you already did)
>>
>> ceph osd tree
>> ceph osd df
>> ceph osd pool ls detail
>>
>> and a dump of your crush rules?
>>
>> As I already said, if you have rules in place that distribute data
>> across 2 DCs and one of them is down the PGs will never recover even
>> if you delete the OSDs from the failed DC.
>>
>>
>>
>> Quoting "Ing. Luis Felipe Domínguez Vega"
>> <luis.dominguez(a)desoft.cu>:
>>
>>> I understand, but I deleted the OSDs from the CRUSH map, so Ceph
>>> won't wait for these OSDs, am I right?
>>>
>>> On 2020-10-27 04:06, Eugen Block wrote:
>>>> Hi,
>>>>
>>>> just to clarify so I don't miss anything: you have two DCs and one
>>>> of them is down. And two of the MONs were in that failed DC? Now
>>>> you removed all OSDs and two MONs from the failed DC hoping that
>>>> your cluster will recover? If you have reasonable crush rules in
>>>> place (e.g. to recover from a failed DC) your cluster will never
>>>> recover in the current state unless you bring OSDs back up on the
>>>> second DC. That's why you don't see progress in the recovery
>>>> process: the PGs are waiting for their peers in the other DC so
>>>> they can follow the crush rules.
>>>>
>>>> Regards,
>>>> Eugen
>>>>
>>>>
>>>> Quoting "Ing. Luis Felipe Domínguez Vega"
>>>> <luis.dominguez(a)desoft.cu>:
>>>>
>>>>> I had 3 MONs, but I have 2 physical datacenters; one of them
>>>>> broke with no short-term fix, so I removed all OSDs and Ceph
>>>>> MONs (2 of them), and now I have only the OSDs of 1 datacenter
>>>>> with the monitor. I had stopped the ceph manager, but I saw that
>>>>> when I restart a ceph manager, ceph -s shows recovery info for
>>>>> roughly 20 minutes, and then all the info disappears.
>>>>>
>>>>> The thing is that it seems the cluster is not recovering on its
>>>>> own, and the ceph monitor is "eating" all of the HDD.
>>>>>
>>>>> On 2020-10-26 15:57, Eugen Block wrote:
>>>>>> The recovery process (ceph -s) is independent of the MGR service
>>>>>> but only depends on the MON service. It seems you only have the
>>>>>> one MON; if the MGR is overloading it (not clear why) it could
>>>>>> help to leave the MGR off and see if the MON service then has
>>>>>> enough RAM to proceed with the recovery. Do you have any chance
>>>>>> to add two more MONs? A single MON is of course a single point of
>>>>>> failure.
>>>>>>
>>>>>>
>>>>>> Quoting "Ing. Luis Felipe Domínguez Vega"
>>>>>> <luis.dominguez(a)desoft.cu>:
>>>>>>
>>>>>>> On 2020-10-26 15:16, Eugen Block wrote:
>>>>>>>> You could stop the MGRs and wait for the recovery to finish;
>>>>>>>> MGRs are not a critical component. You won’t have a dashboard
>>>>>>>> or metrics during that time, but it would prevent the high RAM
>>>>>>>> usage.
>>>>>>>>
>>>>>>>> Quoting "Ing. Luis Felipe Domínguez Vega"
>>>>>>>> <luis.dominguez(a)desoft.cu>:
>>>>>>>>
>>>>>>>>> On 2020-10-26 12:23, 胡 玮文 wrote:
>>>>>>>>>>> On 2020-10-26, at 23:29, Ing. Luis Felipe Domínguez Vega
>>>>>>>>>>> <luis.dominguez(a)desoft.cu> wrote:
>>>>>>>>>>>
>>>>>>>>>>> mgr: fond-beagle(active, since 39s)
>>>>>>>>>>
 
>>>>>>>>>> Your manager seems to be crash looping; it only started 39s
>>>>>>>>>> ago. Looking at the mgr logs may help you identify why your
>>>>>>>>>> cluster is not recovering. You may be hitting some bug in the
>>>>>>>>>> mgr.
>>>>>>>>> Nope, I'm restarting the ceph manager because it eats all the
>>>>>>>>> server RAM. I have a script that restarts the manager when the
>>>>>>>>> server (which has 94 GB of RAM) is down to 1 GB of free RAM. I
>>>>>>>>> don't know why, and the manager logs show:
>>>>>>>>>
>>>>>>>>> -----------------------------------
>>>>>>>>>
>>>>>>>>> root@fond-beagle:/var/lib/ceph/mon/ceph-fond-beagle/store.db# tail -f /var/log/ceph/ceph-mgr.fond-beagle.log
>>>>>>>>> 2020-10-26T12:54:12.497-0400 7f2a8112b700 0 log_channel(cluster) log [DBG] : pgmap v584: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
>>>>>>>>> 2020-10-26T12:54:12.497-0400 7f2a8112b700 0 log_channel(cluster) do_log log to syslog
>>>>>>>>> 2020-10-26T12:54:14.501-0400 7f2a8112b700 0 log_channel(cluster) log [DBG] : pgmap v585: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
>>>>>>>>> 2020-10-26T12:54:14.501-0400 7f2a8112b700 0 log_channel(cluster) do_log log to syslog
>>>>>>>>> 2020-10-26T12:54:16.517-0400 7f2a8112b700 0 log_channel(cluster) log [DBG] : pgmap v586: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
>>>>>>>>> 2020-10-26T12:54:16.517-0400 7f2a8112b700 0 log_channel(cluster) do_log log to syslog
>>>>>>>>> 2020-10-26T12:54:18.521-0400 7f2a8112b700 0 log_channel(cluster) log [DBG] : pgmap v587: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
>>>>>>>>> 2020-10-26T12:54:18.521-0400 7f2a8112b700 0 log_channel(cluster) do_log log to syslog
>>>>>>>>> 2020-10-26T12:54:20.537-0400 7f2a8112b700 0 log_channel(cluster) log [DBG] : pgmap v588: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
>>>>>>>>> 2020-10-26T12:54:20.537-0400 7f2a8112b700 0 log_channel(cluster) do_log log to syslog
>>>>>>>>> 2020-10-26T12:54:22.541-0400 7f2a8112b700 0 log_channel(cluster) log [DBG] : pgmap v589: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
>>>>>>>>> 2020-10-26T12:54:22.541-0400 7f2a8112b700 0 log_channel(cluster) do_log log to syslog
>>>>>>>>> ---------------
>>>>>>>>> _______________________________________________
>>>>>>>>> ceph-users mailing list -- ceph-users(a)ceph.io
>>>>>>>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> OK, I will do that... but the thing is that the cluster does
>>>>>>> not show recovery; it does not show that it is doing anything
>>>>>>> (like the recovery info in the ceph -s output), and so I don't
>>>>>>> know whether it is recovering or what it is doing.