Hey Frank,
regarding destroying a cluster, I'd suggest reusing the old
--yes-i-really-mean-it parameter, as it is already in use by ceph osd
destroy [0]. Then it doesn't matter whether it's prod or not, if you
really mean it ... ;-)
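A guard like that is cheap to wire up; very roughly, it could look
like the sketch below (the rm-cluster subcommand and its wiring are my
invention for illustration, only the flag name is borrowed from the
existing commands):

  import argparse
  import sys

  parser = argparse.ArgumentParser(prog='cephadm')
  sub = parser.add_subparsers(dest='cmd')

  # hypothetical destroy subcommand, gated by the confirmation flag
  rm = sub.add_parser('rm-cluster')
  rm.add_argument('--fsid', required=True)
  rm.add_argument('--yes-i-really-mean-it', action='store_true',
                  dest='confirmed')

  args = parser.parse_args()
  if args.cmd == 'rm-cluster' and not args.confirmed:
      sys.exit('refusing to destroy cluster %s: add '
               '--yes-i-really-mean-it if you really mean it' % args.fsid)
  # ... the actual teardown would only run past this point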
Best regards,
Nico
[0]
https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/
Frank Schilder <frans(a)dtu.dk> writes:
Hi, I would like to second Nico's comment. What
happened to the idea that a deployment tool should be idempotent? The most natural option
would be:
1) start install -> something fails
2) fix the problem
3) repeat the exact same deploy command -> deployment picks up at the
current state (including cleaning up failed state markers) and
continues until the next issue (go to 2); a rough sketch follows below
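That sketch, with made-up step names and state file (not cephadm code,
just the shape of it):

  import json
  import os

  STATE = '/var/lib/mycluster/deploy-state.json'   # invented path
  STEPS = ['pull-image', 'create-mon', 'create-mgr', 'add-osds']

  def run_step(step):
      print('running', step)            # placeholder for the real work

  def deploy():
      done = []
      if os.path.exists(STATE):
          with open(STATE) as f:
              done = json.load(f)       # resume from the recorded state
      for step in STEPS:
          if step in done:
              continue                  # finished on an earlier run
          run_step(step)                # raises on failure, not recorded
          done.append(step)
          with open(STATE, 'w') as f:
              json.dump(done, f)        # checkpoint for repeated runs

Re-running the exact same deploy() after fixing the problem then
continues from the last checkpoint instead of starting over.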
I'm not sure (meaning: it's a terrible idea) if it's a good idea to
provide a single command to wipe a cluster, if only because of
fat-finger syndrome. This seems safe only if it were possible to mark a
cluster as production somehow (must be sticky, that is, cannot be
unset), which would prevent a cluster destroy command (or any similarly
dangerous command) from executing. I understand the test case in the
tracker, but having such test-case utils that can run on a production
cluster and destroy everything seems a bit dangerous.
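Something along these lines (marker path and names invented; a truly
sticky flag would need more than a file, for example an immutable
attribute or a monitor-side setting):

  import os
  import sys

  MARKER = '/var/lib/ceph/PRODUCTION'   # invented marker location

  def mark_production():
      # write-once: mode 'x' fails if the marker already exists,
      # and there is deliberately no code path that removes it
      with open(MARKER, 'x') as f:
          f.write('this cluster is production\n')

  def guard_destroy():
      if os.path.exists(MARKER):
          sys.exit('cluster is marked production; destroy refused')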
I think destroying a cluster should be a manual and tedious process
and figuring out how to do it should be part of the learning
experience. So my answer to "how do I start over" would be "go figure
it out, it's an important lesson".
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Nico Schottelius <nico.schottelius(a)ungleich.ch>
Sent: Friday, May 26, 2023 10:40 PM
To: Redouane Kachach
Cc: ceph-users(a)ceph.io
Subject: [ceph-users] Re: Seeking feedback on Improving cephadm bootstrap process
Hello Redouane,
much appreciated kick-off for improving cephadm. I was wondering why
cephadm does not use an approach similar to rook's, in the sense of
"repeat until it is fixed"?
For background, rook uses a controller that checks the state of the
cluster, the state of monitors, whether there are disks to be added,
etc. It periodically re-runs the checks and, when needed, shifts
monitors, creates OSDs, etc.
My question is, why not have a daemon or checker subcommand of cephadm
that a) checks what the current cluster status is (i.e. cephadm
verify-cluster) and b) fixes the situation (i.e. cephadm verify-and-fix-cluster)?
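Very roughly, the shape I have in mind (function names invented to
match the hypothetical subcommands above):

  import time

  def verify_cluster():
      """Return a list of detected problems; empty means healthy."""
      # e.g. compare desired vs. actual daemons, mon quorum, OSDs ...
      return []

  def fix(problem):
      print('fixing:', problem)         # placeholder for the real repair

  def reconcile(apply_fixes=False, interval=60):
      # verify-cluster would report only; verify-and-fix-cluster repairs
      while True:
          for problem in verify_cluster():
              if apply_fixes:
                  fix(problem)
              else:
                  print('found:', problem)
          time.sleep(interval)          # periodically re-check, like rook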
I think that option would be much more beneficial than the other two
suggested ones.
Best regards,
Nico
--
Sustainable and modern Infrastructures by ungleich.ch