On 23/08/2019 22:14, Paul Emmerich wrote:
> On Fri, Aug 23, 2019 at 3:54 PM Florian Haas
> <florian(a)citynetwork.eu> wrote:
>> On 23/08/2019 13:34, Paul Emmerich wrote:
>>> Is this reproducible with crushtool?
>>
>> Not for me.
>>
>>> ceph osd getcrushmap -o crushmap
>>> crushtool -i crushmap --update-item XX 1.0 osd.XX --loc host
>>> hostname-that-doesnt-exist-yet -o crushmap.modified
>>>
>>> Replacing XX with the osd ID you tried to add.
>>
>> Just checking whether this was intentional. As the issue pops up when
>> adding a new OSD *on* a new host, not moving an existing OSD *to* a new
>> host, I would have used --add-item here. Is there a specific reason why
>> you're suggesting to test with --update-item?
>
> yes, update should map to create or move which it should use internally
>
>> At any rate, I tried with multiple different combinations (this is on a
>> 12.2.12 test cluster; I can't test this in production):
>
> which also ran into this bug? The idea of using crushtool is to not
> crash your production cluster but just the local tool.
Ah, gotcha. I thought you wanted me to be able to at least do "ceph osd
setcrushmap" with the resulting crushmap, which would require a running
cluster.
So yes, doing this completely offline shows that you're definitely on to
something. I am able to crash crushtool with the original crushmap, and
what it appears to be falling over on is a choose_args map in there.
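
For the record, the offline test boils down to something like the following (the OSD id, weight and host name are placeholders; this only ever modifies a local file, so the worst that can crash is crushtool itself):

```shell
# Fetch the cluster's current CRUSH map (read-only as far as the
# cluster is concerned).
ceph osd getcrushmap -o crushmap

# Offline: add a hypothetical osd.42 under a host bucket that does
# not exist yet, writing the result to a new file. With a
# choose_args map present, this is where crushtool falls over.
crushtool -i crushmap --add-item 42 1.0 osd.42 \
  --loc host hostname-that-doesnt-exist-yet -o crushmap.modified
```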
I've updated the bug report with this comment:
https://tracker.ceph.com/issues/40029#note-11
It would seem that at this stage there are two workarounds for
pre-Nautilus users who have a choose_args map in their crushmap and are
for some reason unable to upgrade to Nautilus yet:
1. Add host buckets manually before adding new OSDs.
2. Drop any choose_args map from their crushmap.
As it happens I am not aware of any way to do #2 other than
- using getcrushmap,
- decompiling the crushmap,
- dropping the choose_args map from the textual representation of the
crushmap,
- recompiling, and then
- using setcrushmap.
Are you, by any chance?
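
Spelled out, the round-trip I have in mind looks roughly like this (file names are illustrative):

```shell
# 1. Fetch and decompile the current CRUSH map.
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# 2. Edit crushmap.txt and delete the choose_args section
#    (any text editor will do).

# 3. Recompile the edited map and inject it into the cluster.
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new
```

Note that since choose_args carries weight-set data (as used by the balancer, for instance), dropping it will presumably revert placement to the default weights, so some data movement should be expected.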
Thanks again for your help!
Cheers,
Florian