Hi Andres,
does the command work with the original rule/crushmap?
___________________________________
Clyso GmbH - Ceph Foundation Member
support(a)clyso.com
https://www.clyso.com
On 06.05.2021 at 15:21, Andres Rojas Guerrero wrote:
> Yes, my ceph version is Nautilus:
>
> # ceph -v
> ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9)
> nautilus (stable)
>
> First dump the crush map:
>
> # ceph osd getcrushmap -o crush_map
>
> Then, decompile the crush map:
>
> # crushtool -d crush_map -o crush_map_d
>
>
> Now, edit the crush rule and compile:
>
> # crushtool -c crush_map_d -o crush_map_new
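>
> For reference, after that edit the rule in crush_map_d looked roughly
> like this (decompiled text format; the exact syntax produced by
> crushtool -d may differ slightly):
>
> rule nxtcloudAFhost {
>         id 2
>         type erasure
>         min_size 3
>         max_size 7
>         step set_chooseleaf_tries 5
>         step set_choose_tries 100
>         step take default
>         step choose indep 0 type host
>         step emit
> }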
>
>
> And finally, test the mappings:
>
> # crushtool -i crush_map_new --test --rule 2 --num-rep 7 --show-mappings
> CRUSH rule 2 x 0 [-5,-45,-49,-47,-43,-41,-29]
> *** Caught signal (Segmentation fault) **
> in thread 7f2d717acb40 thread_name:crushtool
>
>
> On 6/5/21 at 14:13, Eugen Block wrote:
>> Interesting, I haven't had that yet with crushtool. Your ceph version
>> is Nautilus, right? And you did decompile the binary crushmap with
>> crushtool, correct? I don't know how to reproduce that.
>>
>> Quoting Andres Rojas Guerrero <a.rojas(a)csic.es>:
>>
>>> I get this error when I try to show the mappings with crushtool:
>>>
>>> # crushtool -i crush_map_new --test --rule 2 --num-rep 7
>>> --show-mappings
>>> CRUSH rule 2 x 0 [-5,-45,-49,-47,-43,-41,-29]
>>> *** Caught signal (Segmentation fault) **
>>> in thread 7f7f7a0ccb40 thread_name:crushtool
>>>
>>>
>>>
>>>
>>> On 6/5/21 at 13:47, Eugen Block wrote:
>>>> Yes it is possible, but you should validate it with crushtool before
>>>> injecting it to make sure the PGs land where they belong.
>>>>
>>>> crushtool -i crushmap.bin --test --rule 2 --num-rep 7 --show-mappings
>>>> crushtool -i crushmap.bin --test --rule 2 --num-rep 7
>>>> --show-bad-mappings
>>>>
>>>> If you don't get bad mappings and the 'show-mappings' output confirms
>>>> the PG distribution by host, you can inject it. But be aware that this
>>>> causes a lot of data movement, which could explain the (temporarily)
>>>> unavailable PGs. To make your cluster resilient against host failure,
>>>> though, you'll have to go through that at some point.
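>>>>
>>>> If both checks look good, injecting the new map would be something
>>>> along these lines (just a sketch, reusing the file name from above):
>>>>
>>>> ceph osd setcrushmap -i crushmap.bin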
>>>>
>>>>
>>>>
>>>> https://docs.ceph.com/en/latest/rados/operations/crush-map-edits/
>>>>
>>>>
>>>> Quoting Andres Rojas Guerrero <a.rojas(a)csic.es>:
>>>>
>>>>> Hi, I'm trying to make a new crush rule (Nautilus) in order to move to
>>>>> the correct failure domain (host):
>>>>>
>>>>> "rule_id": 2,
>>>>> "rule_name": "nxtcloudAFhost",
>>>>> "ruleset": 2,
>>>>> "type": 3,
>>>>> "min_size": 3,
>>>>> "max_size": 7,
>>>>> "steps": [
>>>>> {
>>>>> "op": "set_chooseleaf_tries",
>>>>> "num": 5
>>>>> },
>>>>> {
>>>>> "op": "set_choose_tries",
>>>>> "num": 100
>>>>> },
>>>>> {
>>>>> "op": "take",
>>>>> "item": -1,
>>>>> "item_name": "default"
>>>>> },
>>>>> {
>>>>> "op": "choose_indep",
>>>>> "num": 0,
>>>>> "type": "host"
>>>>> },
>>>>> {
>>>>> "op": "emit"
>>>>>
>>>>> And I have changed the pool to this new crush rule:
>>>>>
>>>>> # ceph osd pool set nxtcloudAF crush_rule nxtcloudAFhost
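>>>>>
>>>>> (The assignment can be checked afterwards with:
>>>>> # ceph osd pool get nxtcloudAF crush_rule)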
>>>>>
>>>>> But suddenly the cephfs is unavailable:
>>>>>
>>>>> # ceph status
>>>>> cluster:
>>>>> id: c74da5b8-3d1b-483e-8b3a-739134db6cf8
>>>>> health: HEALTH_WARN
>>>>> 11 clients failing to respond to capability release
>>>>> 2 MDSs report slow metadata IOs
>>>>> 1 MDSs report slow requests
>>>>>
>>>>>
>>>>> And clients are failing to respond:
>>>>>
>>>>> HEALTH_WARN 11 clients failing to respond to capability release; 2
>>>>> MDSs
>>>>> report slow metadata IOs; 1 MDSs report slow requests
>>>>> MDS_CLIENT_LATE_RELEASE 11 clients failing to respond to capability
>>>>> release
>>>>> mdsceph2mon03(mds.1): Client nxtcl3: failing to respond to
>>>>> capability release client_id: 1524269
>>>>> mdsceph2mon01(mds.0): Client nxtcl5:nxtclproAF failing to
>>>>> respond to
>>>>>
>>>>>
>>>>> I reversed the change, returning to the original crush rule, and all
>>>>> is OK now. My question is whether it's possible to change the crush
>>>>> rule of an EC pool on the fly.
>>>>>
>>>>>
>>>>> Thanks
>>>>> On 5/5/21 at 18:14, Andres Rojas Guerrero wrote:
>>>>>> Thanks, I will test it.
>>>>>>
>>>>>> On 5/5/21 at 16:37, Joachim Kraftmayer wrote:
>>>>>>> Create a new crush rule with the correct failure domain, test it
>>>>>>> properly and assign it to the pool(s).
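>>>>>>>
>>>>>>> A minimal sketch of those steps (names and the k/m values are only
>>>>>>> placeholders; k/m should match the pool's existing EC profile):
>>>>>>>
>>>>>>> # ceph osd erasure-code-profile set <profile> k=4 m=3 crush-failure-domain=host
>>>>>>> # ceph osd crush rule create-erasure <rule> <profile>
>>>>>>> # ceph osd pool set <pool> crush_rule <rule>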
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> *******************************************************
>>>>> Andrés Rojas Guerrero
>>>>> Unidad Sistemas Linux
>>>>> Area Arquitectura Tecnológica
>>>>> Secretaría General Adjunta de Informática
>>>>> Consejo Superior de Investigaciones Científicas (CSIC)
>>>>> Pinar 19
>>>>> 28006 - Madrid
>>>>> Tel: +34 915680059 -- Ext. 990059
>>>>> email: a.rojas(a)csic.es
>>>>> ID comunicate.csic.es: @50852720l:matrix.csic.es
>>>>> *******************************************************
>>>>
>>>>
>>>
>>> --
>>> *******************************************************
>>> Andrés Rojas Guerrero
>>> Unidad Sistemas Linux
>>> Area Arquitectura Tecnológica
>>> Secretaría General Adjunta de Informática
>>> Consejo Superior de Investigaciones Científicas (CSIC)
>>> Pinar 19
>>> 28006 - Madrid
>>> Tel: +34 915680059 -- Ext. 990059
>>> email: a.rojas(a)csic.es
>>> ID comunicate.csic.es: @50852720l:matrix.csic.es
>>> *******************************************************
>>
>>
>