How does a "ceph orch restart SERVICE" affect availability?


Mikael Öhman

19 Jun 2023

4:17 p.m.

The documentation very briefly explains a few core commands for restarting things:
https://docs.ceph.com/en/quincy/cephadm/operations/#starting-and-stopping-d…
but I feel I'm lacking quite some details of what is safe to do.

I have a system in production, with clusters connected via CephFS and some shared block devices. We would like to restart some things due to some new network configurations. Going daemon by daemon would take forever, so I'm curious as to what happens if one tries the command:

    ceph orch restart osd

Will that try to be smart and just restart a few at a time to keep things up and available? Or will it just trigger a restart everywhere simultaneously?

I guess in my current scenario, restarting one host at a time makes most sense, with a

    systemctl restart ceph-{fsid}.target

and then checking that "ceph -s" says OK before proceeding to the next host, but I'm still curious as to what the "ceph orch restart xxx" command would do (but not enough to try it out in production).

Best regards,
Mikael
Chalmers University of Technology


Eugen Block

21 Jun 2023

12:03 p.m.

Hi,

> Will that try to be smart and just restart a few at a time to keep things up and available. Or will it just trigger a restart everywhere simultaneously.

Basically, that's what happens, for example, during an upgrade when services are restarted. It's designed to be a rolling upgrade procedure, because restarting all daemons of a specific service at the same time would cause an interruption. So the daemons are scheduled to restart and the mgr decides when it's safe to restart the next one (this is a test cluster started on Nautilus, but it's on Quincy now):

    nautilus:~ # ceph orch restart osd.osd-hdd-ssd
    Scheduled to restart osd.5 on host 'nautilus'
    Scheduled to restart osd.0 on host 'nautilus'
    Scheduled to restart osd.2 on host 'nautilus'
    Scheduled to restart osd.1 on host 'nautilus2'
    Scheduled to restart osd.4 on host 'nautilus2'
    Scheduled to restart osd.7 on host 'nautilus2'
    Scheduled to restart osd.3 on host 'nautilus3'
    Scheduled to restart osd.8 on host 'nautilus3'
    Scheduled to restart osd.6 on host 'nautilus3'

When it comes to OSDs it's possible (or even likely) that multiple OSDs are restarted at the same time, depending on the pools (and their replication size) they are part of. But Ceph tries to avoid "inactive PGs", which is critical, of course. An edge case would be a pool with size 1, where restarting an OSD would cause an inactive PG until the OSD is up again. But since size 1 would be a bad idea anyway (except for testing purposes), you'd have to live with that.

If you have the option, I'd recommend creating a test cluster and playing around with these things to get a better understanding, especially when it comes to upgrade tests etc.
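To make the point about inactive PGs concrete, here is a small toy model, not how the mgr actually decides. It assumes a PG stays active as long as at least `min_size` of its OSDs remain up; `min_size` is a pool setting that is not mentioned in this thread, and the PG-to-OSD mapping below is invented for illustration (only the OSD ids and host layout are borrowed from the output above). On a real cluster, `ceph osd ok-to-stop <id>` is the command to ask instead.

```python
from typing import Dict, Iterable, List


def pgs_going_inactive(acting_sets: Dict[str, List[int]],
                       min_size: int,
                       restarting: Iterable[int]) -> List[str]:
    """Return the PGs that would have fewer than min_size OSDs left up.

    Toy model only: a PG is treated as active while at least min_size of
    the OSDs in its acting set are still running.
    """
    down = set(restarting)
    return sorted(
        pg for pg, osds in acting_sets.items()
        if len([osd for osd in osds if osd not in down]) < min_size
    )


# Hypothetical replicated pool (size 3), one replica per host:
# nautilus has OSDs 5, 0, 2; nautilus2 has 1, 4, 7; nautilus3 has 3, 8, 6.
size3_pool = {"1.0": [5, 1, 3], "1.1": [0, 4, 8], "1.2": [2, 7, 6]}

# Hypothetical pool with size 1: a single copy on osd.5.
size1_pool = {"2.0": [5]}
```

In this model, restarting every OSD on 'nautilus' at once (`[5, 0, 2]`) with `min_size=2` leaves all three PGs of the size-3 pool active, while restarting osd.5 and osd.1 together would make PG 1.0 inactive. The size-1 pool goes inactive as soon as osd.5 restarts, which is the edge case described above.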

> I guess in my current scenario, restarting one host at the time makes most sense, with a systemctl restart ceph-{fsid}.target and then checking that "ceph -s" says OK before proceeding to the next

Yes, if your crush-failure-domain is host that should be safe, too.

Regards,
Eugen
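The host-by-host procedure discussed here could be scripted roughly as follows. This is a sketch under several assumptions that are not from the thread: passwordless SSH to each host with enough privileges to run systemctl, `ceph health` as a stand-in for eyeballing "ceph -s", a fixed settle delay, and the unit name `ceph-{fsid}.target` (the last message in this thread suggests checking the actual target name on a host first, for example with `systemctl list-units 'ceph*.target'`). The function names and parameters are hypothetical.

```python
import subprocess
import time
from typing import Callable, Iterable, Sequence

Runner = Callable[[Sequence[str]], str]


def run(cmd: Sequence[str]) -> str:
    """Run a command, return its stdout, and raise if it exits non-zero."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def wait_for_health_ok(runner: Runner = run, timeout: float = 1800.0,
                       poll: float = 10.0,
                       sleep: Callable[[float], None] = time.sleep) -> None:
    """Poll `ceph health` until it starts with HEALTH_OK, or give up."""
    waited = 0.0
    while True:
        status = runner(["ceph", "health"]).strip()
        if status.startswith("HEALTH_OK"):
            return
        if waited >= timeout:
            raise TimeoutError(f"cluster still reports: {status}")
        sleep(poll)
        waited += poll


def rolling_host_restart(hosts: Iterable[str], runner: Runner = run,
                         settle: float = 30.0,
                         sleep: Callable[[float], None] = time.sleep) -> None:
    """Restart the Ceph target on one host at a time, gated on HEALTH_OK."""
    fsid = runner(["ceph", "fsid"]).strip()
    target = f"ceph-{fsid}.target"  # assumed unit name; verify it on a host first
    wait_for_health_ok(runner, sleep=sleep)  # do not start on a degraded cluster
    for host in hosts:
        runner(["ssh", host, "systemctl", "restart", target])
        sleep(settle)  # give the cluster time to notice the daemons went down
        wait_for_health_ok(runner, sleep=sleep)
```

A call such as `rolling_host_restart(["nautilus", "nautilus2", "nautilus3"])` would then walk the hosts in order. As noted above, this is only expected to be safe when the CRUSH failure domain is host, and the script stops with an exception rather than moving on if the cluster does not return to HEALTH_OK within the timeout.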


Mikael Öhman

22 Jun 2023

8:11 p.m.

Thank you Eugen! After finding what the target name actually was, it all worked like a charm.

Best regards,
Mikael


_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-leave@ceph.io
