Advice needed: stuck cluster halfway upgraded, comms issues and MON space usage

List overview All Threads
Download

newer

older

fixing future rctimes

Nautilus block-db resize -...

Sam Skipsey

22 Mar 2021 22 Mar '21

4:04 a.m.

Hi everyone: I posted to the list on Friday morning (UK time), but apparently my email is still in moderation (I have an email from the list bot telling me that it's held for moderation but no updates). Since this is a bit urgent - we have ~3PB of storage offline - I'm posting again. To save retyping the whole thing, I will direct you to a copy of the email I wrote on Friday: http://aoanla.pythonanywhere.com/Logs/EmailToCephUsers.txt (Since that was sent, we did successfully add big SSDs to the MON hosts so they don't fill up their disks with store.db s). I would appreciate any advice - assuming this also doesn't get stuck in moderation queues. -- Sam Skipsey (he/him, they/them)

Show replies by date

Dan van der Ster

22 Mar 22 Mar

5:10 a.m.

Hi Sam, The daemons restart (for *some* releases) because of this: https://tracker.ceph.com/issues/21672 In short, if the selinux module changes, and if you have selinux enabled, then midway through yum update, there will be a systemctl restart ceph.target issued. For the rest -- I think you should focus on getting the PGs all active+clean as soon as possible, because the degraded and remapped states are what leads to mon / osdmap growth. This kind of scenario is why we wrote this tool: https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-rema… It will use pg-upmap-items to force the PGs to the OSDs where they are currently residing. But there is some clarification needed before you go ahead with that. Could you share `ceph status`, `ceph health detail`? Cheers, Dan On Mon, Mar 22, 2021 at 12:05 PM Sam Skipsey <aoanla(a)gmail.com> wrote: > > Hi everyone: > > I posted to the list on Friday morning (UK time), but apparently my email > is still in moderation (I have an email from the list bot telling me that > it's held for moderation but no updates). > > Since this is a bit urgent - we have ~3PB of storage offline - I'm posting > again. > > To save retyping the whole thing, I will direct you to a copy of the email > I wrote on Friday: > > http://aoanla.pythonanywhere.com/Logs/EmailToCephUsers.txt > > (Since that was sent, we did successfully add big SSDs to the MON hosts so > they don't fill up their disks with store.db s). > > I would appreciate any advice - assuming this also doesn't get stuck in > moderation queues. > > -- > Sam Skipsey (he/him, they/them) > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io

Sam Skipsey

5:20 a.m.

Hi Dan: Thanks for the reply - at present, our mons and mgrs are off [because of the unsustainable nature of the filesystem usage]. We'll try putting them on again for long enough to get "ceph status" out of them, but because the mgr was unable to actually talk to anything, and reply at that point. (And thanks for the link to the bug tracker - I guess this mismatch of expectations is why the devs are so keen to move to containerised deployments where there is no co-location of different types of server, as it means they don't need to worry as much about the assumptions about when it's okay to restart a service on package update. Disappointing that it seems stale after 2 years...) Sam On Mon, 22 Mar 2021 at 12:11, Dan van der Ster <dan(a)vanderster.com> wrote:

...

posting

again. To save retyping the whole thing, I will direct you to a copy of the

I wrote on Friday: http://aoanla.pythonanywhere.com/Logs/EmailToCephUsers.txt (Since that was sent, we did successfully add big SSDs to the MON hosts

they don't fill up their disks with store.db s). I would appreciate any advice - assuming this also doesn't get stuck in moderation queues. -- Sam Skipsey (he/him, they/them) _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

-- Sam Skipsey (he/him, they/them)

Sam Skipsey

7:49 a.m.

So, we started the mons and mgr up again, and here's the relevant logs, including also ceph versions. We've also turned off all of the firewalls on all of the nodes so we know that there can't be network issues [and, indeed, all of our management of the OSDs happens via logins from the service nodes or to each other]

...

ceph status

cluster: id: a1148af2-6eaf-4486-a27e-a05a78c2b378 health: HEALTH_WARN pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set 1 nearfull osd(s) 3 pool(s) nearfull Reduced data availability: 2048 pgs inactive mons cephs01,cephs02,cephs03 are using a lot of disk space services: mon: 3 daemons, quorum cephs01,cephs02,cephs03 (age 61s) mgr: cephs01(active, since 76s) osd: 329 osds: 329 up (since 63s), 328 in (since 4d); 466 remapped pgs flags pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover data: pools: 3 pools, 2048 pgs objects: 0 objects, 0 B usage: 0 B used, 0 B / 0 B avail pgs: 100.000% pgs unknown 2048 unknown

...

ceph health detail

HEALTH_WARN pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set; 1 nearfull osd(s); 3 pool(s) nearfull; Reduced data availability: 2048 pgs inactive; mons cephs01,cephs02,cephs03 are using a lot of disk space OSDMAP_FLAGS pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set OSD_NEARFULL 1 nearfull osd(s) osd.63 is near full POOL_NEARFULL 3 pool(s) nearfull pool 'dteam' is nearfull pool 'atlas' is nearfull pool 'atlas-localgroup' is nearfull PG_AVAILABILITY Reduced data availability: 2048 pgs inactive pg 13.1ef is stuck inactive for 89.322981, current state unknown, last acting [] pg 13.1f0 is stuck inactive for 89.322981, current state unknown, last acting [] pg 13.1f1 is stuck inactive for 89.322981, current state unknown, last acting [] pg 13.1f2 is stuck inactive for 89.322981, current state unknown, last acting [] pg 13.1f3 is stuck inactive for 89.322981, current state unknown, last acting [] pg 13.1f4 is stuck inactive for 89.322981, current state unknown, last acting [] pg 13.1f5 is stuck inactive for 89.322981, current state unknown, last acting [] pg 13.1f6 is stuck inactive for 89.322981, current state unknown, last acting [] pg 13.1f7 is stuck inactive for 89.322981, current state unknown, last acting [] pg 13.1f8 is stuck inactive for 89.322981, current state unknown, last acting [] pg 13.1f9 is stuck inactive for 89.322981, current state unknown, last acting [] pg 13.1fa is stuck inactive for 89.322981, current state unknown, last acting [] pg 13.1fb is stuck inactive for 89.322981, current state unknown, last acting [] pg 13.1fc is stuck inactive for 89.322981, current state unknown, last acting [] pg 13.1fd is stuck inactive for 89.322981, current state unknown, last acting [] pg 13.1fe is stuck inactive for 89.322981, current state unknown, last acting [] pg 13.1ff is stuck inactive for 89.322981, current state unknown, last acting [] pg 14.1ec is stuck inactive for 89.322981, current state unknown, last acting [] pg 14.1f0 is stuck inactive for 89.322981, current state unknown, last acting [] pg 14.1f1 is stuck inactive for 89.322981, current state unknown, last acting [] pg 14.1f2 is stuck inactive for 89.322981, current state unknown, last acting [] pg 14.1f3 is stuck inactive for 89.322981, current state unknown, last acting [] pg 14.1f4 is stuck inactive for 89.322981, current state unknown, last acting [] pg 14.1f5 is stuck inactive for 89.322981, current state unknown, last acting [] pg 14.1f6 is stuck inactive for 89.322981, current state unknown, last acting [] pg 14.1f7 is stuck inactive for 89.322981, current state unknown, last acting [] pg 14.1f8 is stuck inactive for 89.322981, current state unknown, last acting [] pg 14.1f9 is stuck inactive for 89.322981, current state unknown, last acting [] pg 14.1fa is stuck inactive for 89.322981, current state unknown, last acting [] pg 14.1fb is stuck inactive for 89.322981, current state unknown, last acting [] pg 14.1fc is stuck inactive for 89.322981, current state unknown, last acting [] pg 14.1fd is stuck inactive for 89.322981, current state unknown, last acting [] pg 14.1fe is stuck inactive for 89.322981, current state unknown, last acting [] pg 14.1ff is stuck inactive for 89.322981, current state unknown, last acting [] pg 15.1ed is stuck inactive for 89.322981, current state unknown, last acting [] pg 15.1f0 is stuck inactive for 89.322981, current state unknown, last acting [] pg 15.1f1 is stuck inactive for 89.322981, current state unknown, last acting [] pg 15.1f2 is stuck inactive for 89.322981, current state unknown, last acting [] pg 15.1f3 is stuck inactive for 89.322981, current state unknown, last acting [] pg 15.1f4 is stuck inactive for 89.322981, current state unknown, last acting [] pg 15.1f5 is stuck inactive for 89.322981, current state unknown, last acting [] pg 15.1f6 is stuck inactive for 89.322981, current state unknown, last acting [] pg 15.1f7 is stuck inactive for 89.322981, current state unknown, last acting [] pg 15.1f8 is stuck inactive for 89.322981, current state unknown, last acting [] pg 15.1f9 is stuck inactive for 89.322981, current state unknown, last acting [] pg 15.1fa is stuck inactive for 89.322981, current state unknown, last acting [] pg 15.1fb is stuck inactive for 89.322981, current state unknown, last acting [] pg 15.1fc is stuck inactive for 89.322981, current state unknown, last acting [] pg 15.1fd is stuck inactive for 89.322981, current state unknown, last acting [] pg 15.1fe is stuck inactive for 89.322981, current state unknown, last acting [] pg 15.1ff is stuck inactive for 89.322981, current state unknown, last acting [] MON_DISK_BIG mons cephs01,cephs02,cephs03 are using a lot of disk space mon.cephs01 is 96 GiB >= mon_data_size_warn (15 GiB) mon.cephs02 is 96 GiB >= mon_data_size_warn (15 GiB) mon.cephs03 is 96 GiB >= mon_data_size_warn (15 GiB)

...

ceph versions

{ "mon": { "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 3 }, "mgr": { "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 1 }, "osd": { "ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)": 1, "ceph version 14.2.15 (afdd217ae5fb1ed3f60e16bd62357ca58cc650e5) nautilus (stable)": 188, "ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)": 18, "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 122 },

...

>>>>>

As a note, the log where the mgr explodes (which precipitated all of this) definitely shows the problem occurring on the 12th [when 14.2.17 dropped], but things didn't "break" until we tried upgrading OSDs to 14.2.18... Sam On Mon, 22 Mar 2021 at 12:20, Sam Skipsey <aoanla(a)gmail.com> wrote:

...

Hi everyone: I posted to the list on Friday morning (UK time), but apparently my

is still in moderation (I have an email from the list bot telling me

that

it's held for moderation but no updates). Since this is a bit urgent - we have ~3PB of storage offline - I'm

posting

again. To save retyping the whole thing, I will direct you to a copy of the

I wrote on Friday: http://aoanla.pythonanywhere.com/Logs/EmailToCephUsers.txt (Since that was sent, we did successfully add big SSDs to the MON hosts

-- Sam Skipsey (he/him, they/them)

Dan van der Ster

7:57 a.m.

Hi, I would unset nodown (hiding osd failures) and norecover (blcoking PGs from recovering degraded objects), then start starting osds. As soon as you have some osd logs reporting some failures, then share those... - Dan On Mon, Mar 22, 2021 at 3:49 PM Sam Skipsey <aoanla(a)gmail.com> wrote: > > So, we started the mons and mgr up again, and here's the relevant logs, including also ceph versions. We've also turned off all of the firewalls on all of the nodes so we know that there can't be network issues [and, indeed, all of our management of the OSDs happens via logins from the service nodes or to each other] > > > ceph status > > > cluster: > id: a1148af2-6eaf-4486-a27e-a05a78c2b378 > health: HEALTH_WARN > pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set > 1 nearfull osd(s) > 3 pool(s) nearfull > Reduced data availability: 2048 pgs inactive > mons cephs01,cephs02,cephs03 are using a lot of disk space > > services: > mon: 3 daemons, quorum cephs01,cephs02,cephs03 (age 61s) > mgr: cephs01(active, since 76s) > osd: 329 osds: 329 up (since 63s), 328 in (since 4d); 466 remapped pgs > flags pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover > > data: > pools: 3 pools, 2048 pgs > objects: 0 objects, 0 B > usage: 0 B used, 0 B / 0 B avail > pgs: 100.000% pgs unknown > 2048 unknown > > > > ceph health detail > > HEALTH_WARN pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set; 1 nearfull osd(s); 3 pool(s) nearfull; Reduced data availability: 2048 pgs inactive; mons cephs01,cephs02,cephs03 are using a lot of disk space > OSDMAP_FLAGS pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set > OSD_NEARFULL 1 nearfull osd(s) > osd.63 is near full > POOL_NEARFULL 3 pool(s) nearfull > pool 'dteam' is nearfull > pool 'atlas' is nearfull > pool 'atlas-localgroup' is nearfull > PG_AVAILABILITY Reduced data availability: 2048 pgs inactive > pg 13.1ef is stuck inactive for 89.322981, current state unknown, last acting [] > pg 13.1f0 is stuck inactive for 89.322981, current state unknown, last acting [] > pg 13.1f1 is stuck inactive for 89.322981, current state unknown, last acting [] > pg 13.1f2 is stuck inactive for 89.322981, current state unknown, last acting [] > pg 13.1f3 is stuck inactive for 89.322981, current state unknown, last acting [] > pg 13.1f4 is stuck inactive for 89.322981, current state unknown, last acting [] > pg 13.1f5 is stuck inactive for 89.322981, current state unknown, last acting [] > pg 13.1f6 is stuck inactive for 89.322981, current state unknown, last acting [] > pg 13.1f7 is stuck inactive for 89.322981, current state unknown, last acting [] > pg 13.1f8 is stuck inactive for 89.322981, current state unknown, last acting [] > pg 13.1f9 is stuck inactive for 89.322981, current state unknown, last acting [] > pg 13.1fa is stuck inactive for 89.322981, current state unknown, last acting [] > pg 13.1fb is stuck inactive for 89.322981, current state unknown, last acting [] > pg 13.1fc is stuck inactive for 89.322981, current state unknown, last acting [] > pg 13.1fd is stuck inactive for 89.322981, current state unknown, last acting [] > pg 13.1fe is stuck inactive for 89.322981, current state unknown, last acting [] > pg 13.1ff is stuck inactive for 89.322981, current state unknown, last acting [] > pg 14.1ec is stuck inactive for 89.322981, current state unknown, last acting [] > pg 14.1f0 is stuck inactive for 89.322981, current state unknown, last acting [] > pg 14.1f1 is stuck inactive for 89.322981, current state unknown, last acting [] > pg 14.1f2 is stuck inactive for 89.322981, current state unknown, last acting [] > pg 14.1f3 is stuck inactive for 89.322981, current state unknown, last acting [] > pg 14.1f4 is stuck inactive for 89.322981, current state unknown, last acting [] > pg 14.1f5 is stuck inactive for 89.322981, current state unknown, last acting [] > pg 14.1f6 is stuck inactive for 89.322981, current state unknown, last acting [] > pg 14.1f7 is stuck inactive for 89.322981, current state unknown, last acting [] > pg 14.1f8 is stuck inactive for 89.322981, current state unknown, last acting [] > pg 14.1f9 is stuck inactive for 89.322981, current state unknown, last acting [] > pg 14.1fa is stuck inactive for 89.322981, current state unknown, last acting [] > pg 14.1fb is stuck inactive for 89.322981, current state unknown, last acting [] > pg 14.1fc is stuck inactive for 89.322981, current state unknown, last acting [] > pg 14.1fd is stuck inactive for 89.322981, current state unknown, last acting [] > pg 14.1fe is stuck inactive for 89.322981, current state unknown, last acting [] > pg 14.1ff is stuck inactive for 89.322981, current state unknown, last acting [] > pg 15.1ed is stuck inactive for 89.322981, current state unknown, last acting [] > pg 15.1f0 is stuck inactive for 89.322981, current state unknown, last acting [] > pg 15.1f1 is stuck inactive for 89.322981, current state unknown, last acting [] > pg 15.1f2 is stuck inactive for 89.322981, current state unknown, last acting [] > pg 15.1f3 is stuck inactive for 89.322981, current state unknown, last acting [] > pg 15.1f4 is stuck inactive for 89.322981, current state unknown, last acting [] > pg 15.1f5 is stuck inactive for 89.322981, current state unknown, last acting [] > pg 15.1f6 is stuck inactive for 89.322981, current state unknown, last acting [] > pg 15.1f7 is stuck inactive for 89.322981, current state unknown, last acting [] > pg 15.1f8 is stuck inactive for 89.322981, current state unknown, last acting [] > pg 15.1f9 is stuck inactive for 89.322981, current state unknown, last acting [] > pg 15.1fa is stuck inactive for 89.322981, current state unknown, last acting [] > pg 15.1fb is stuck inactive for 89.322981, current state unknown, last acting [] > pg 15.1fc is stuck inactive for 89.322981, current state unknown, last acting [] > pg 15.1fd is stuck inactive for 89.322981, current state unknown, last acting [] > pg 15.1fe is stuck inactive for 89.322981, current state unknown, last acting [] > pg 15.1ff is stuck inactive for 89.322981, current state unknown, last acting [] > MON_DISK_BIG mons cephs01,cephs02,cephs03 are using a lot of disk space > mon.cephs01 is 96 GiB >= mon_data_size_warn (15 GiB) > mon.cephs02 is 96 GiB >= mon_data_size_warn (15 GiB) > mon.cephs03 is 96 GiB >= mon_data_size_warn (15 GiB) > > > > ceph versions > > { > "mon": { > "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 3 > }, > "mgr": { > "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 1 > }, > "osd": { > "ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)": 1, > "ceph version 14.2.15 (afdd217ae5fb1ed3f60e16bd62357ca58cc650e5) nautilus (stable)": 188, > "ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)": 18, > "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 122 > }, > > > >>>>>> > > As a note, the log where the mgr explodes (which precipitated all of this) definitely shows the problem occurring on the 12th [when 14.2.17 dropped], but things didn't "break" until we tried upgrading OSDs to 14.2.18... > > > Sam > > > On Mon, 22 Mar 2021 at 12:20, Sam Skipsey <aoanla(a)gmail.com> wrote: >> >> Hi Dan: >> >> Thanks for the reply - at present, our mons and mgrs are off [because of the unsustainable nature of the filesystem usage]. We'll try putting them on again for long enough to get "ceph status" out of them, but because the mgr was unable to actually talk to anything, and reply at that point. >> >> (And thanks for the link to the bug tracker - I guess this mismatch of expectations is why the devs are so keen to move to containerised deployments where there is no co-location of different types of server, as it means they don't need to worry as much about the assumptions about when it's okay to restart a service on package update. Disappointing that it seems stale after 2 years...) >> >> Sam >> >> >> >> On Mon, 22 Mar 2021 at 12:11, Dan van der Ster <dan(a)vanderster.com> wrote: >>> >>> Hi Sam, >>> >>> The daemons restart (for *some* releases) because of this: >>> https://tracker.ceph.com/issues/21672 >>> In short, if the selinux module changes, and if you have selinux >>> enabled, then midway through yum update, there will be a systemctl >>> restart ceph.target issued. >>> >>> For the rest -- I think you should focus on getting the PGs all >>> active+clean as soon as possible, because the degraded and remapped >>> states are what leads to mon / osdmap growth. >>> This kind of scenario is why we wrote this tool: >>> https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-rema… >>> It will use pg-upmap-items to force the PGs to the OSDs where they are >>> currently residing. >>> >>> But there is some clarification needed before you go ahead with that. >>> Could you share `ceph status`, `ceph health detail`? >>> >>> Cheers, Dan >>> >>> >>> On Mon, Mar 22, 2021 at 12:05 PM Sam Skipsey <aoanla(a)gmail.com> wrote: >>> > >>> > Hi everyone: >>> > >>> > I posted to the list on Friday morning (UK time), but apparently my email >>> > is still in moderation (I have an email from the list bot telling me that >>> > it's held for moderation but no updates). >>> > >>> > Since this is a bit urgent - we have ~3PB of storage offline - I'm posting >>> > again. >>> > >>> > To save retyping the whole thing, I will direct you to a copy of the email >>> > I wrote on Friday: >>> > >>> > http://aoanla.pythonanywhere.com/Logs/EmailToCephUsers.txt >>> > >>> > (Since that was sent, we did successfully add big SSDs to the MON hosts so >>> > they don't fill up their disks with store.db s). >>> > >>> > I would appreciate any advice - assuming this also doesn't get stuck in >>> > moderation queues. >>> > >>> > -- >>> > Sam Skipsey (he/him, they/them) >>> > _______________________________________________ >>> > ceph-users mailing list -- ceph-users(a)ceph.io >>> > To unsubscribe send an email to ceph-users-leave(a)ceph.io >> >> >> >> -- >> Sam Skipsey (he/him, they/them) >> >> > > > -- > Sam Skipsey (he/him, they/them) > >

Sam Skipsey

10:06 a.m.

hi Dan: So, unsetting nodown results in... almost all of the OSDs being marked down. (231 down out of 328). Checking the actual OSD services, most of them were actually up and active on the nodes, even when the mons had marked them down. (On a few nodes, the down services corresponded to OSDs that had been flapping - but increasing osd_max_markdown locally to keep them up despite the previous flapping, and restarting the services... didn't help.) In fact, starting up the few OSD services which had actually stopped, resulted in a different set of OSDs being marked down, and some others coming up. We currently have a sort of "rolling OSD outness" passing through the cluster - there's always ~230 OSDs marked down now, but which ones those are changes (we've had everything from 1 HOST down to 4 HOSTS down over the past 14 minutes as things fluctuate. A log from one of the "down" OSDs [which is actually running, and on the same host as OSDs which are marked up] shows this worrying snippet 2021-03-22 17:01:45.298 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) 2021-03-22 17:01:45.298 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot 2021-03-22 17:01:46.340 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) 2021-03-22 17:01:46.340 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot 2021-03-22 17:01:47.376 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) 2021-03-22 17:01:47.376 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot 2021-03-22 17:01:48.395 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) 2021-03-22 17:01:48.395 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot 2021-03-22 17:01:49.407 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) 2021-03-22 17:01:49.407 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot 2021-03-22 17:01:50.400 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) 2021-03-22 17:01:50.400 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot 2021-03-22 17:01:50.922 7f6c9f088700 -1 --2- 10.1.50.21:0/23673 >> [v2: 127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667] conn(0x56010903e400 0x56011a71fc00 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2: 127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667] is using msgr V1 protocol 2021-03-22 17:01:50.922 7f6c9f889700 -1 --2- 10.1.50.21:0/23673 >> [v2: 127.0.0.1:6821/13015214,v1:127.0.0.1:6831/13015214] conn(0x5600df434000 0x56011718e000 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2: 127.0.0.1:6821/13015214,v1:127.0.0.1:6831/13015214] is using msgr V1 protocol 2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> [v2: 127.0.0.1:6826/11091658,v1:127.0.0.1:6828/11091658] conn(0x5600f85ed800 0x560109df2a00 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2: 127.0.0.1:6826/11091658,v1:127.0.0.1:6828/11091658] is using msgr V1 protocol 2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> [v2: 127.0.0.1:6859/2683393,v1:127.0.0.1:6862/2683393] conn(0x5600f22ea000 0x560117182300 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2: 127.0.0.1:6859/2683393,v1:127.0.0.1:6862/2683393] is using msgr V1 protocol 2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> [v2: 127.0.0.1:6901/15090566,v1:127.0.0.1:6907/15090566] conn(0x5600df435c00 0x560139370300 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2: 127.0.0.1:6901/15090566,v1:127.0.0.1:6907/15090566] is using msgr V1 protocol 2021-03-22 17:01:51.377 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) 2021-03-22 17:01:51.377 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot 2021-03-22 17:01:52.370 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) 2021-03-22 17:01:52.370 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot 2021-03-22 17:01:53.377 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) 2021-03-22 17:01:53.377 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot 2021-03-22 17:01:54.385 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) 2021-03-22 17:01:54.385 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot 2021-03-22 17:01:55.385 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) 2021-03-22 17:01:55.385 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot 2021-03-22 17:01:56.362 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) 2021-03-22 17:01:56.362 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot 2021-03-22 17:01:57.324 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) 2021-03-22 17:01:57.324 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot Any suggestions? Sam P.S. an example ceph status as it is now [with everything now on 14.2.18, since we had to restart osds anyway]: cluster: id: a1148af2-6eaf-4486-a27e-a05a78c2b378 health: HEALTH_WARN pauserd,pausewr,noout,nobackfill,norebalance flag(s) set 230 osds down 4 hosts (80 osds) down Reduced data availability: 2048 pgs inactive 8 slow ops, oldest one blocked for 901 sec, mon.cephs01 has slow ops services: mon: 3 daemons, quorum cephs01,cephs02,cephs03 (age 2h) mgr: cephs01(active, since 77m) osd: 329 osds: 98 up (since 4s), 328 in (since 4d) flags pauserd,pausewr,noout,nobackfill,norebalance data: pools: 3 pools, 2048 pgs objects: 0 objects, 0 B usage: 0 B used, 0 B / 0 B avail pgs: 100.000% pgs unknown 2048 unknown On Mon, 22 Mar 2021 at 14:57, Dan van der Ster <dan(a)vanderster.com> wrote:

...

So, we started the mons and mgr up again, and here's the relevant logs,

including also ceph versions. We've also turned off all of the firewalls on all of the nodes so we know that there can't be network issues [and, indeed, all of our management of the OSDs happens via logins from the service nodes or to each other]

ceph status

cluster: id: a1148af2-6eaf-4486-a27e-a05a78c2b378 health: HEALTH_WARN

pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set

1 nearfull osd(s) 3 pool(s) nearfull Reduced data availability: 2048 pgs inactive mons cephs01,cephs02,cephs03 are using a lot of disk space services: mon: 3 daemons, quorum cephs01,cephs02,cephs03 (age 61s) mgr: cephs01(active, since 76s) osd: 329 osds: 329 up (since 63s), 328 in (since 4d); 466 remapped

pgs

flags

pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover

data: pools: 3 pools, 2048 pgs objects: 0 objects, 0 B usage: 0 B used, 0 B / 0 B avail pgs: 100.000% pgs unknown 2048 unknown

ceph health detail

HEALTH_WARN

pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set; 1 nearfull osd(s); 3 pool(s) nearfull; Reduced data availability: 2048 pgs inactive; mons cephs01,cephs02,cephs03 are using a lot of disk space

OSDMAP_FLAGS

pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set

OSD_NEARFULL 1 nearfull osd(s) osd.63 is near full POOL_NEARFULL 3 pool(s) nearfull pool 'dteam' is nearfull pool 'atlas' is nearfull pool 'atlas-localgroup' is nearfull PG_AVAILABILITY Reduced data availability: 2048 pgs inactive pg 13.1ef is stuck inactive for 89.322981, current state unknown,

last acting []

pg 13.1f0 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 13.1f1 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 13.1f2 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 13.1f3 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 13.1f4 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 13.1f5 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 13.1f6 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 13.1f7 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 13.1f8 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 13.1f9 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 13.1fa is stuck inactive for 89.322981, current state unknown,

last acting []

pg 13.1fb is stuck inactive for 89.322981, current state unknown,

last acting []

pg 13.1fc is stuck inactive for 89.322981, current state unknown,

last acting []

pg 13.1fd is stuck inactive for 89.322981, current state unknown,

last acting []

pg 13.1fe is stuck inactive for 89.322981, current state unknown,

last acting []

pg 13.1ff is stuck inactive for 89.322981, current state unknown,

last acting []

pg 14.1ec is stuck inactive for 89.322981, current state unknown,

last acting []

pg 14.1f0 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 14.1f1 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 14.1f2 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 14.1f3 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 14.1f4 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 14.1f5 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 14.1f6 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 14.1f7 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 14.1f8 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 14.1f9 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 14.1fa is stuck inactive for 89.322981, current state unknown,

last acting []

pg 14.1fb is stuck inactive for 89.322981, current state unknown,

last acting []

pg 14.1fc is stuck inactive for 89.322981, current state unknown,

last acting []

pg 14.1fd is stuck inactive for 89.322981, current state unknown,

last acting []

pg 14.1fe is stuck inactive for 89.322981, current state unknown,

last acting []

pg 14.1ff is stuck inactive for 89.322981, current state unknown,

last acting []

pg 15.1ed is stuck inactive for 89.322981, current state unknown,

last acting []

pg 15.1f0 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 15.1f1 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 15.1f2 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 15.1f3 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 15.1f4 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 15.1f5 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 15.1f6 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 15.1f7 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 15.1f8 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 15.1f9 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 15.1fa is stuck inactive for 89.322981, current state unknown,

last acting []

pg 15.1fb is stuck inactive for 89.322981, current state unknown,

last acting []

pg 15.1fc is stuck inactive for 89.322981, current state unknown,

last acting []

pg 15.1fd is stuck inactive for 89.322981, current state unknown,

last acting []

pg 15.1fe is stuck inactive for 89.322981, current state unknown,

last acting []

pg 15.1ff is stuck inactive for 89.322981, current state unknown,

last acting []

MON_DISK_BIG mons cephs01,cephs02,cephs03 are using a lot of disk space mon.cephs01 is 96 GiB >= mon_data_size_warn (15 GiB) mon.cephs02 is 96 GiB >= mon_data_size_warn (15 GiB) mon.cephs03 is 96 GiB >= mon_data_size_warn (15 GiB)

ceph versions

{ "mon": { "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9)

nautilus (stable)": 3

}, "mgr": { "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9)

nautilus (stable)": 1

}, "osd": { "ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20)

nautilus (stable)": 1,

"ceph version 14.2.15 (afdd217ae5fb1ed3f60e16bd62357ca58cc650e5)

nautilus (stable)": 188,

"ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c)

nautilus (stable)": 18,

"ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9)

nautilus (stable)": 122

>>>>>

As a note, the log where the mgr explodes (which precipitated all of

this) definitely shows the problem occurring on the 12th [when 14.2.17 dropped], but things didn't "break" until we tried upgrading OSDs to 14.2.18...

Sam On Mon, 22 Mar 2021 at 12:20, Sam Skipsey <aoanla(a)gmail.com> wrote: > > Hi Dan: > > Thanks for the reply - at present, our mons and mgrs are off [because

of the unsustainable nature of the filesystem usage]. We'll try putting them on again for long enough to get "ceph status" out of them, but because the mgr was unable to actually talk to anything, and reply at that point.

> > (And thanks for the link to the bug tracker - I guess this mismatch of

expectations is why the devs are so keen to move to containerised deployments where there is no co-location of different types of server, as it means they don't need to worry as much about the assumptions about when it's okay to restart a service on package update. Disappointing that it seems stale after 2 years...)

> > Sam > > > > On Mon, 22 Mar 2021 at 12:11, Dan van der Ster <dan(a)vanderster.com>

wrote:

>> >> Hi Sam, >> >> The daemons restart (for *some* releases) because of this: >> https://tracker.ceph.com/issues/21672 >> In short, if the selinux module changes, and if you have selinux >> enabled, then midway through yum update, there will be a systemctl >> restart ceph.target issued. >> >> For the rest -- I think you should focus on getting the PGs all >> active+clean as soon as possible, because the degraded and remapped >> states are what leads to mon / osdmap growth. >> This kind of scenario is why we wrote this tool: >>

https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-rema…

>> It will use pg-upmap-items to force the PGs to the OSDs where they are >> currently residing. >> >> But there is some clarification needed before you go ahead with that. >> Could you share `ceph status`, `ceph health detail`? >> >> Cheers, Dan >> >> >> On Mon, Mar 22, 2021 at 12:05 PM Sam Skipsey <aoanla(a)gmail.com> wrote: >> > >> > Hi everyone: >> > >> > I posted to the list on Friday morning (UK time), but apparently my

>> > is still in moderation (I have an email from the list bot telling me

that

>> > it's held for moderation but no updates). >> > >> > Since this is a bit urgent - we have ~3PB of storage offline - I'm

posting

>> > again. >> > >> > To save retyping the whole thing, I will direct you to a copy of the

>> > I wrote on Friday: >> > >> > http://aoanla.pythonanywhere.com/Logs/EmailToCephUsers.txt >> > >> > (Since that was sent, we did successfully add big SSDs to the MON

hosts so

>> > they don't fill up their disks with store.db s). >> > >> > I would appreciate any advice - assuming this also doesn't get stuck

> moderation queues. > > -- > Sam Skipsey (he/him, they/them) > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io

-- Sam Skipsey (he/him, they/them)

Dan van der Ster

10:20 a.m.

...

So, we started the mons and mgr up again, and here's the relevant logs,

ceph status

cluster: id: a1148af2-6eaf-4486-a27e-a05a78c2b378 health: HEALTH_WARN

pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set

pgs

flags

pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover

data: pools: 3 pools, 2048 pgs objects: 0 objects, 0 B usage: 0 B used, 0 B / 0 B avail pgs: 100.000% pgs unknown 2048 unknown

ceph health detail

HEALTH_WARN

OSDMAP_FLAGS

pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set

last acting []

pg 13.1f0 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 13.1f1 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 13.1f2 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 13.1f3 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 13.1f4 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 13.1f5 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 13.1f6 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 13.1f7 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 13.1f8 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 13.1f9 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 13.1fa is stuck inactive for 89.322981, current state unknown,

last acting []

pg 13.1fb is stuck inactive for 89.322981, current state unknown,

last acting []

pg 13.1fc is stuck inactive for 89.322981, current state unknown,

last acting []

pg 13.1fd is stuck inactive for 89.322981, current state unknown,

last acting []

pg 13.1fe is stuck inactive for 89.322981, current state unknown,

last acting []

pg 13.1ff is stuck inactive for 89.322981, current state unknown,

last acting []

pg 14.1ec is stuck inactive for 89.322981, current state unknown,

last acting []

pg 14.1f0 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 14.1f1 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 14.1f2 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 14.1f3 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 14.1f4 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 14.1f5 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 14.1f6 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 14.1f7 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 14.1f8 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 14.1f9 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 14.1fa is stuck inactive for 89.322981, current state unknown,

last acting []

pg 14.1fb is stuck inactive for 89.322981, current state unknown,

last acting []

pg 14.1fc is stuck inactive for 89.322981, current state unknown,

last acting []

pg 14.1fd is stuck inactive for 89.322981, current state unknown,

last acting []

pg 14.1fe is stuck inactive for 89.322981, current state unknown,

last acting []

pg 14.1ff is stuck inactive for 89.322981, current state unknown,

last acting []

pg 15.1ed is stuck inactive for 89.322981, current state unknown,

last acting []

pg 15.1f0 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 15.1f1 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 15.1f2 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 15.1f3 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 15.1f4 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 15.1f5 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 15.1f6 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 15.1f7 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 15.1f8 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 15.1f9 is stuck inactive for 89.322981, current state unknown,

last acting []

pg 15.1fa is stuck inactive for 89.322981, current state unknown,

last acting []

pg 15.1fb is stuck inactive for 89.322981, current state unknown,

last acting []

pg 15.1fc is stuck inactive for 89.322981, current state unknown,

last acting []

pg 15.1fd is stuck inactive for 89.322981, current state unknown,

last acting []

pg 15.1fe is stuck inactive for 89.322981, current state unknown,

last acting []

pg 15.1ff is stuck inactive for 89.322981, current state unknown,

last acting []

ceph versions

{ "mon": { "ceph version 14.2.18

(befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 3

}, "mgr": { "ceph version 14.2.18

(befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 1

}, "osd": { "ceph version 14.2.10

(b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)": 1,

"ceph version 14.2.15

(afdd217ae5fb1ed3f60e16bd62357ca58cc650e5) nautilus (stable)": 188,

"ceph version 14.2.16

(762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)": 18,

"ceph version 14.2.18

(befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 122

>>>>>

As a note, the log where the mgr explodes (which precipitated all of

this) definitely shows the problem occurring on the 12th [when 14.2.17 dropped], but things didn't "break" until we tried upgrading OSDs to 14.2.18...

Sam On Mon, 22 Mar 2021 at 12:20, Sam Skipsey <aoanla(a)gmail.com> wrote: > > Hi Dan: > > Thanks for the reply - at present, our mons and mgrs are off [because

> > (And thanks for the link to the bug tracker - I guess this mismatch of

> > Sam > > > > On Mon, 22 Mar 2021 at 12:11, Dan van der Ster <dan(a)vanderster.com>

wrote:

https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-rema…

wrote:

>> > >> > Hi everyone: >> > >> > I posted to the list on Friday morning (UK time), but apparently my

>> > is still in moderation (I have an email from the list bot telling

me that

>> > it's held for moderation but no updates). >> > >> > Since this is a bit urgent - we have ~3PB of storage offline - I'm

posting

>> > again. >> > >> > To save retyping the whole thing, I will direct you to a copy of

the email

>> > I wrote on Friday: >> > >> > http://aoanla.pythonanywhere.com/Logs/EmailToCephUsers.txt >> > >> > (Since that was sent, we did successfully add big SSDs to the MON

hosts so

>> > they don't fill up their disks with store.db s). >> > >> > I would appreciate any advice - assuming this also doesn't get

stuck in

> > moderation queues. > > > > -- > > Sam Skipsey (he/him, they/them) > > _______________________________________________ > > ceph-users mailing list -- ceph-users(a)ceph.io > > To unsubscribe send an email to ceph-users-leave(a)ceph.io -- Sam Skipsey (he/him, they/them)

-- Sam Skipsey (he/him, they/them)

Sam Skipsey

10:28 a.m.

Hm, yes it does [and I was wondering why loopbacks were showing up suddenly in the logs]. This wasn't happening with 14.2.16 so what's changed about how we specify stuff? This might correlate with the other person on the IRC list who has problems with 14.2.18 and their OSDs deciding they don't work sometimes until they forcibly restart their network links... Sam On Mon, 22 Mar 2021 at 17:20, Dan van der Ster <dan(a)vanderster.com> wrote:

...

What's with the OSDs having loopback addresses? E.g. v2: 127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667 Does `ceph osd dump` show those same loopback addresses for each OSD? This sounds familiar... I'm trying to find the recent ticket. .. dan On Mon, Mar 22, 2021, 6:07 PM Sam Skipsey <aoanla(a)gmail.com> wrote: > hi Dan: > > So, unsetting nodown results in... almost all of the OSDs being marked > down. (231 down out of 328). > Checking the actual OSD services, most of them were actually up and > active on the nodes, even when the mons had marked them down. > (On a few nodes, the down services corresponded to OSDs that had been > flapping - but increasing osd_max_markdown locally to keep them up despite > the previous flapping, and restarting the services... didn't help.) > > In fact, starting up the few OSD services which had actually stopped, > resulted in a different set of OSDs being marked down, and some others > coming up. > We currently have a sort of "rolling OSD outness" passing through the > cluster - there's always ~230 OSDs marked down now, but which ones those > are changes (we've had everything from 1 HOST down to 4 HOSTS down over the > past 14 minutes as things fluctuate. > > A log from one of the "down" OSDs [which is actually running, and on the > same host as OSDs which are marked up] shows this worrying snippet > > 2021-03-22 17:01:45.298 7f6c9c883700 1 osd.127 253515 is_healthy false > -- only 0/10 up peers (less than 33%) > 2021-03-22 17:01:45.298 7f6c9c883700 1 osd.127 253515 not healthy; > waiting to boot > 2021-03-22 17:01:46.340 7f6c9c883700 1 osd.127 253515 is_healthy false > -- only 0/10 up peers (less than 33%) > 2021-03-22 17:01:46.340 7f6c9c883700 1 osd.127 253515 not healthy; > waiting to boot > 2021-03-22 17:01:47.376 7f6c9c883700 1 osd.127 253515 is_healthy false > -- only 0/10 up peers (less than 33%) > 2021-03-22 17:01:47.376 7f6c9c883700 1 osd.127 253515 not healthy; > waiting to boot > 2021-03-22 17:01:48.395 7f6c9c883700 1 osd.127 253515 is_healthy false > -- only 0/10 up peers (less than 33%) > 2021-03-22 17:01:48.395 7f6c9c883700 1 osd.127 253515 not healthy; > waiting to boot > 2021-03-22 17:01:49.407 7f6c9c883700 1 osd.127 253515 is_healthy false > -- only 0/10 up peers (less than 33%) > 2021-03-22 17:01:49.407 7f6c9c883700 1 osd.127 253515 not healthy; > waiting to boot > 2021-03-22 17:01:50.400 7f6c9c883700 1 osd.127 253515 is_healthy false > -- only 0/10 up peers (less than 33%) > 2021-03-22 17:01:50.400 7f6c9c883700 1 osd.127 253515 not healthy; > waiting to boot > 2021-03-22 17:01:50.922 7f6c9f088700 -1 --2- 10.1.50.21:0/23673 >> [v2: > 127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667] conn(0x56010903e400 > 0x56011a71fc00 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 > tx=0)._handle_peer_banner peer [v2: > 127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667] is using msgr V1 > protocol > 2021-03-22 17:01:50.922 7f6c9f889700 -1 --2- 10.1.50.21:0/23673 >> [v2: > 127.0.0.1:6821/13015214,v1:127.0.0.1:6831/13015214] conn(0x5600df434000 > 0x56011718e000 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 > tx=0)._handle_peer_banner peer [v2: > 127.0.0.1:6821/13015214,v1:127.0.0.1:6831/13015214] is using msgr V1 > protocol > 2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> [v2: > 127.0.0.1:6826/11091658,v1:127.0.0.1:6828/11091658] conn(0x5600f85ed800 > 0x560109df2a00 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 > tx=0)._handle_peer_banner peer [v2: > 127.0.0.1:6826/11091658,v1:127.0.0.1:6828/11091658] is using msgr V1 > protocol > 2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> [v2: > 127.0.0.1:6859/2683393,v1:127.0.0.1:6862/2683393] conn(0x5600f22ea000 > 0x560117182300 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 > tx=0)._handle_peer_banner peer [v2: > 127.0.0.1:6859/2683393,v1:127.0.0.1:6862/2683393] is using msgr V1 > protocol > 2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> [v2: > 127.0.0.1:6901/15090566,v1:127.0.0.1:6907/15090566] conn(0x5600df435c00 > 0x560139370300 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 > tx=0)._handle_peer_banner peer [v2: > 127.0.0.1:6901/15090566,v1:127.0.0.1:6907/15090566] is using msgr V1 > protocol > 2021-03-22 17:01:51.377 7f6c9c883700 1 osd.127 253515 is_healthy false > -- only 0/10 up peers (less than 33%) > 2021-03-22 17:01:51.377 7f6c9c883700 1 osd.127 253515 not healthy; > waiting to boot > 2021-03-22 17:01:52.370 7f6c9c883700 1 osd.127 253515 is_healthy false > -- only 0/10 up peers (less than 33%) > 2021-03-22 17:01:52.370 7f6c9c883700 1 osd.127 253515 not healthy; > waiting to boot > 2021-03-22 17:01:53.377 7f6c9c883700 1 osd.127 253515 is_healthy false > -- only 0/10 up peers (less than 33%) > 2021-03-22 17:01:53.377 7f6c9c883700 1 osd.127 253515 not healthy; > waiting to boot > 2021-03-22 17:01:54.385 7f6c9c883700 1 osd.127 253515 is_healthy false > -- only 0/10 up peers (less than 33%) > 2021-03-22 17:01:54.385 7f6c9c883700 1 osd.127 253515 not healthy; > waiting to boot > 2021-03-22 17:01:55.385 7f6c9c883700 1 osd.127 253515 is_healthy false > -- only 0/10 up peers (less than 33%) > 2021-03-22 17:01:55.385 7f6c9c883700 1 osd.127 253515 not healthy; > waiting to boot > 2021-03-22 17:01:56.362 7f6c9c883700 1 osd.127 253515 is_healthy false > -- only 0/10 up peers (less than 33%) > 2021-03-22 17:01:56.362 7f6c9c883700 1 osd.127 253515 not healthy; > waiting to boot > 2021-03-22 17:01:57.324 7f6c9c883700 1 osd.127 253515 is_healthy false > -- only 0/10 up peers (less than 33%) > 2021-03-22 17:01:57.324 7f6c9c883700 1 osd.127 253515 not healthy; > waiting to boot > > > > Any suggestions? > > Sam > > P.S. an example ceph status as it is now [with everything now on 14.2.18, > since we had to restart osds anyway]: > > cluster: > id: a1148af2-6eaf-4486-a27e-a05a78c2b378 > health: HEALTH_WARN > pauserd,pausewr,noout,nobackfill,norebalance flag(s) set > 230 osds down > 4 hosts (80 osds) down > Reduced data availability: 2048 pgs inactive > 8 slow ops, oldest one blocked for 901 sec, mon.cephs01 has > slow ops > > services: > mon: 3 daemons, quorum cephs01,cephs02,cephs03 (age 2h) > mgr: cephs01(active, since 77m) > osd: 329 osds: 98 up (since 4s), 328 in (since 4d) > flags pauserd,pausewr,noout,nobackfill,norebalance > > data: > pools: 3 pools, 2048 pgs > objects: 0 objects, 0 B > usage: 0 B used, 0 B / 0 B avail > pgs: 100.000% pgs unknown > 2048 unknown > > > > On Mon, 22 Mar 2021 at 14:57, Dan van der Ster <dan(a)vanderster.com> > wrote: > >> Hi, >> >> I would unset nodown (hiding osd failures) and norecover (blcoking PGs >> from recovering degraded objects), then start starting osds. >> As soon as you have some osd logs reporting some failures, then share >> those... >> >> - Dan >> >> On Mon, Mar 22, 2021 at 3:49 PM Sam Skipsey <aoanla(a)gmail.com> wrote: >> > >> > So, we started the mons and mgr up again, and here's the relevant >> logs, including also ceph versions. We've also turned off all of the >> firewalls on all of the nodes so we know that there can't be network issues >> [and, indeed, all of our management of the OSDs happens via logins from the >> service nodes or to each other] >> > >> > > ceph status >> > >> > >> > cluster: >> > id: a1148af2-6eaf-4486-a27e-a05a78c2b378 >> > health: HEALTH_WARN >> > >> pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set >> > 1 nearfull osd(s) >> > 3 pool(s) nearfull >> > Reduced data availability: 2048 pgs inactive >> > mons cephs01,cephs02,cephs03 are using a lot of disk space >> > >> > services: >> > mon: 3 daemons, quorum cephs01,cephs02,cephs03 (age 61s) >> > mgr: cephs01(active, since 76s) >> > osd: 329 osds: 329 up (since 63s), 328 in (since 4d); 466 remapped >> pgs >> > flags >> pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover >> > >> > data: >> > pools: 3 pools, 2048 pgs >> > objects: 0 objects, 0 B >> > usage: 0 B used, 0 B / 0 B avail >> > pgs: 100.000% pgs unknown >> > 2048 unknown >> > >> > >> > > ceph health detail >> > >> > HEALTH_WARN >> pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set; >> 1 nearfull osd(s); 3 pool(s) nearfull; Reduced data availability: 2048 pgs >> inactive; mons cephs01,cephs02,cephs03 are using a lot of disk space >> > OSDMAP_FLAGS >> pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set >> > OSD_NEARFULL 1 nearfull osd(s) >> > osd.63 is near full >> > POOL_NEARFULL 3 pool(s) nearfull >> > pool 'dteam' is nearfull >> > pool 'atlas' is nearfull >> > pool 'atlas-localgroup' is nearfull >> > PG_AVAILABILITY Reduced data availability: 2048 pgs inactive >> > pg 13.1ef is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 13.1f0 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 13.1f1 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 13.1f2 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 13.1f3 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 13.1f4 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 13.1f5 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 13.1f6 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 13.1f7 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 13.1f8 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 13.1f9 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 13.1fa is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 13.1fb is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 13.1fc is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 13.1fd is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 13.1fe is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 13.1ff is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 14.1ec is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 14.1f0 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 14.1f1 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 14.1f2 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 14.1f3 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 14.1f4 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 14.1f5 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 14.1f6 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 14.1f7 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 14.1f8 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 14.1f9 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 14.1fa is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 14.1fb is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 14.1fc is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 14.1fd is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 14.1fe is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 14.1ff is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 15.1ed is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 15.1f0 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 15.1f1 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 15.1f2 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 15.1f3 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 15.1f4 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 15.1f5 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 15.1f6 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 15.1f7 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 15.1f8 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 15.1f9 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 15.1fa is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 15.1fb is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 15.1fc is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 15.1fd is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 15.1fe is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 15.1ff is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > MON_DISK_BIG mons cephs01,cephs02,cephs03 are using a lot of disk space >> > mon.cephs01 is 96 GiB >= mon_data_size_warn (15 GiB) >> > mon.cephs02 is 96 GiB >= mon_data_size_warn (15 GiB) >> > mon.cephs03 is 96 GiB >= mon_data_size_warn (15 GiB) >> > >> > >> > > ceph versions >> > >> > { >> > "mon": { >> > "ceph version 14.2.18 >> (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 3 >> > }, >> > "mgr": { >> > "ceph version 14.2.18 >> (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 1 >> > }, >> > "osd": { >> > "ceph version 14.2.10 >> (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)": 1, >> > "ceph version 14.2.15 >> (afdd217ae5fb1ed3f60e16bd62357ca58cc650e5) nautilus (stable)": 188, >> > "ceph version 14.2.16 >> (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)": 18, >> > "ceph version 14.2.18 >> (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 122 >> > }, >> > >> > >> > >>>>>> >> > >> > As a note, the log where the mgr explodes (which precipitated all of >> this) definitely shows the problem occurring on the 12th [when 14.2.17 >> dropped], but things didn't "break" until we tried upgrading OSDs to >> 14.2.18... >> > >> > >> > Sam >> > >> > >> > On Mon, 22 Mar 2021 at 12:20, Sam Skipsey <aoanla(a)gmail.com> wrote: >> >> >> >> Hi Dan: >> >> >> >> Thanks for the reply - at present, our mons and mgrs are off [because >> of the unsustainable nature of the filesystem usage]. We'll try putting >> them on again for long enough to get "ceph status" out of them, but because >> the mgr was unable to actually talk to anything, and reply at that point. >> >> >> >> (And thanks for the link to the bug tracker - I guess this mismatch >> of expectations is why the devs are so keen to move to containerised >> deployments where there is no co-location of different types of server, as >> it means they don't need to worry as much about the assumptions about when >> it's okay to restart a service on package update. Disappointing that it >> seems stale after 2 years...) >> >> >> >> Sam >> >> >> >> >> >> >> >> On Mon, 22 Mar 2021 at 12:11, Dan van der Ster <dan(a)vanderster.com> >> wrote: >> >>> >> >>> Hi Sam, >> >>> >> >>> The daemons restart (for *some* releases) because of this: >> >>> https://tracker.ceph.com/issues/21672 >> >>> In short, if the selinux module changes, and if you have selinux >> >>> enabled, then midway through yum update, there will be a systemctl >> >>> restart ceph.target issued. >> >>> >> >>> For the rest -- I think you should focus on getting the PGs all >> >>> active+clean as soon as possible, because the degraded and remapped >> >>> states are what leads to mon / osdmap growth. >> >>> This kind of scenario is why we wrote this tool: >> >>> >> https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-rema… >> >>> It will use pg-upmap-items to force the PGs to the OSDs where they >> are >> >>> currently residing. >> >>> >> >>> But there is some clarification needed before you go ahead with that. >> >>> Could you share `ceph status`, `ceph health detail`? >> >>> >> >>> Cheers, Dan >> >>> >> >>> >> >>> On Mon, Mar 22, 2021 at 12:05 PM Sam Skipsey <aoanla(a)gmail.com> >> wrote: >> >>> > >> >>> > Hi everyone: >> >>> > >> >>> > I posted to the list on Friday morning (UK time), but apparently >> my email >> >>> > is still in moderation (I have an email from the list bot telling >> me that >> >>> > it's held for moderation but no updates). >> >>> > >> >>> > Since this is a bit urgent - we have ~3PB of storage offline - I'm >> posting >> >>> > again. >> >>> > >> >>> > To save retyping the whole thing, I will direct you to a copy of >> the email >> >>> > I wrote on Friday: >> >>> > >> >>> > http://aoanla.pythonanywhere.com/Logs/EmailToCephUsers.txt >> >>> > >> >>> > (Since that was sent, we did successfully add big SSDs to the MON >> hosts so >> >>> > they don't fill up their disks with store.db s). >> >>> > >> >>> > I would appreciate any advice - assuming this also doesn't get >> stuck in >> >>> > moderation queues. >> >>> > >> >>> > -- >> >>> > Sam Skipsey (he/him, they/them) >> >>> > _______________________________________________ >> >>> > ceph-users mailing list -- ceph-users(a)ceph.io >> >>> > To unsubscribe send an email to ceph-users-leave(a)ceph.io >> >> >> >> >> >> >> >> -- >> >> Sam Skipsey (he/him, they/them) >> >> >> >> >> > >> > >> > -- >> > Sam Skipsey (he/him, they/them) >> > >> > >> > > > -- > Sam Skipsey (he/him, they/them) > > >

-- Sam Skipsey (he/him, they/them)

Dan van der Ster

10:34 a.m.

...

What's with the OSDs having loopback addresses? E.g. v2: 127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667 Does `ceph osd dump` show those same loopback addresses for each OSD? This sounds familiar... I'm trying to find the recent ticket. .. dan On Mon, Mar 22, 2021, 6:07 PM Sam Skipsey <aoanla(a)gmail.com> wrote: > hi Dan: > > So, unsetting nodown results in... almost all of the OSDs being marked > down. (231 down out of 328). > Checking the actual OSD services, most of them were actually up and > active on the nodes, even when the mons had marked them down. > (On a few nodes, the down services corresponded to OSDs that had been > flapping - but increasing osd_max_markdown locally to keep them up despite > the previous flapping, and restarting the services... didn't help.) > > In fact, starting up the few OSD services which had actually stopped, > resulted in a different set of OSDs being marked down, and some others > coming up. > We currently have a sort of "rolling OSD outness" passing through the > cluster - there's always ~230 OSDs marked down now, but which ones those > are changes (we've had everything from 1 HOST down to 4 HOSTS down over the > past 14 minutes as things fluctuate. > > A log from one of the "down" OSDs [which is actually running, and on the > same host as OSDs which are marked up] shows this worrying snippet > > 2021-03-22 17:01:45.298 7f6c9c883700 1 osd.127 253515 is_healthy false > -- only 0/10 up peers (less than 33%) > 2021-03-22 17:01:45.298 7f6c9c883700 1 osd.127 253515 not healthy; > waiting to boot > 2021-03-22 17:01:46.340 7f6c9c883700 1 osd.127 253515 is_healthy false > -- only 0/10 up peers (less than 33%) > 2021-03-22 17:01:46.340 7f6c9c883700 1 osd.127 253515 not healthy; > waiting to boot > 2021-03-22 17:01:47.376 7f6c9c883700 1 osd.127 253515 is_healthy false > -- only 0/10 up peers (less than 33%) > 2021-03-22 17:01:47.376 7f6c9c883700 1 osd.127 253515 not healthy; > waiting to boot > 2021-03-22 17:01:48.395 7f6c9c883700 1 osd.127 253515 is_healthy false > -- only 0/10 up peers (less than 33%) > 2021-03-22 17:01:48.395 7f6c9c883700 1 osd.127 253515 not healthy; > waiting to boot > 2021-03-22 17:01:49.407 7f6c9c883700 1 osd.127 253515 is_healthy false > -- only 0/10 up peers (less than 33%) > 2021-03-22 17:01:49.407 7f6c9c883700 1 osd.127 253515 not healthy; > waiting to boot > 2021-03-22 17:01:50.400 7f6c9c883700 1 osd.127 253515 is_healthy false > -- only 0/10 up peers (less than 33%) > 2021-03-22 17:01:50.400 7f6c9c883700 1 osd.127 253515 not healthy; > waiting to boot > 2021-03-22 17:01:50.922 7f6c9f088700 -1 --2- 10.1.50.21:0/23673 >> [v2: > 127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667] conn(0x56010903e400 > 0x56011a71fc00 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 > tx=0)._handle_peer_banner peer [v2: > 127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667] is using msgr V1 > protocol > 2021-03-22 17:01:50.922 7f6c9f889700 -1 --2- 10.1.50.21:0/23673 >> [v2: > 127.0.0.1:6821/13015214,v1:127.0.0.1:6831/13015214] conn(0x5600df434000 > 0x56011718e000 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 > tx=0)._handle_peer_banner peer [v2: > 127.0.0.1:6821/13015214,v1:127.0.0.1:6831/13015214] is using msgr V1 > protocol > 2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> [v2: > 127.0.0.1:6826/11091658,v1:127.0.0.1:6828/11091658] conn(0x5600f85ed800 > 0x560109df2a00 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 > tx=0)._handle_peer_banner peer [v2: > 127.0.0.1:6826/11091658,v1:127.0.0.1:6828/11091658] is using msgr V1 > protocol > 2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> [v2: > 127.0.0.1:6859/2683393,v1:127.0.0.1:6862/2683393] conn(0x5600f22ea000 > 0x560117182300 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 > tx=0)._handle_peer_banner peer [v2: > 127.0.0.1:6859/2683393,v1:127.0.0.1:6862/2683393] is using msgr V1 > protocol > 2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> [v2: > 127.0.0.1:6901/15090566,v1:127.0.0.1:6907/15090566] conn(0x5600df435c00 > 0x560139370300 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 > tx=0)._handle_peer_banner peer [v2: > 127.0.0.1:6901/15090566,v1:127.0.0.1:6907/15090566] is using msgr V1 > protocol > 2021-03-22 17:01:51.377 7f6c9c883700 1 osd.127 253515 is_healthy false > -- only 0/10 up peers (less than 33%) > 2021-03-22 17:01:51.377 7f6c9c883700 1 osd.127 253515 not healthy; > waiting to boot > 2021-03-22 17:01:52.370 7f6c9c883700 1 osd.127 253515 is_healthy false > -- only 0/10 up peers (less than 33%) > 2021-03-22 17:01:52.370 7f6c9c883700 1 osd.127 253515 not healthy; > waiting to boot > 2021-03-22 17:01:53.377 7f6c9c883700 1 osd.127 253515 is_healthy false > -- only 0/10 up peers (less than 33%) > 2021-03-22 17:01:53.377 7f6c9c883700 1 osd.127 253515 not healthy; > waiting to boot > 2021-03-22 17:01:54.385 7f6c9c883700 1 osd.127 253515 is_healthy false > -- only 0/10 up peers (less than 33%) > 2021-03-22 17:01:54.385 7f6c9c883700 1 osd.127 253515 not healthy; > waiting to boot > 2021-03-22 17:01:55.385 7f6c9c883700 1 osd.127 253515 is_healthy false > -- only 0/10 up peers (less than 33%) > 2021-03-22 17:01:55.385 7f6c9c883700 1 osd.127 253515 not healthy; > waiting to boot > 2021-03-22 17:01:56.362 7f6c9c883700 1 osd.127 253515 is_healthy false > -- only 0/10 up peers (less than 33%) > 2021-03-22 17:01:56.362 7f6c9c883700 1 osd.127 253515 not healthy; > waiting to boot > 2021-03-22 17:01:57.324 7f6c9c883700 1 osd.127 253515 is_healthy false > -- only 0/10 up peers (less than 33%) > 2021-03-22 17:01:57.324 7f6c9c883700 1 osd.127 253515 not healthy; > waiting to boot > > > > Any suggestions? > > Sam > > P.S. an example ceph status as it is now [with everything now on > 14.2.18, since we had to restart osds anyway]: > > cluster: > id: a1148af2-6eaf-4486-a27e-a05a78c2b378 > health: HEALTH_WARN > pauserd,pausewr,noout,nobackfill,norebalance flag(s) set > 230 osds down > 4 hosts (80 osds) down > Reduced data availability: 2048 pgs inactive > 8 slow ops, oldest one blocked for 901 sec, mon.cephs01 has > slow ops > > services: > mon: 3 daemons, quorum cephs01,cephs02,cephs03 (age 2h) > mgr: cephs01(active, since 77m) > osd: 329 osds: 98 up (since 4s), 328 in (since 4d) > flags pauserd,pausewr,noout,nobackfill,norebalance > > data: > pools: 3 pools, 2048 pgs > objects: 0 objects, 0 B > usage: 0 B used, 0 B / 0 B avail > pgs: 100.000% pgs unknown > 2048 unknown > > > > On Mon, 22 Mar 2021 at 14:57, Dan van der Ster <dan(a)vanderster.com> > wrote: > >> Hi, >> >> I would unset nodown (hiding osd failures) and norecover (blcoking PGs >> from recovering degraded objects), then start starting osds. >> As soon as you have some osd logs reporting some failures, then share >> those... >> >> - Dan >> >> On Mon, Mar 22, 2021 at 3:49 PM Sam Skipsey <aoanla(a)gmail.com> wrote: >> > >> > So, we started the mons and mgr up again, and here's the relevant >> logs, including also ceph versions. We've also turned off all of the >> firewalls on all of the nodes so we know that there can't be network issues >> [and, indeed, all of our management of the OSDs happens via logins from the >> service nodes or to each other] >> > >> > > ceph status >> > >> > >> > cluster: >> > id: a1148af2-6eaf-4486-a27e-a05a78c2b378 >> > health: HEALTH_WARN >> > >> pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set >> > 1 nearfull osd(s) >> > 3 pool(s) nearfull >> > Reduced data availability: 2048 pgs inactive >> > mons cephs01,cephs02,cephs03 are using a lot of disk space >> > >> > services: >> > mon: 3 daemons, quorum cephs01,cephs02,cephs03 (age 61s) >> > mgr: cephs01(active, since 76s) >> > osd: 329 osds: 329 up (since 63s), 328 in (since 4d); 466 >> remapped pgs >> > flags >> pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover >> > >> > data: >> > pools: 3 pools, 2048 pgs >> > objects: 0 objects, 0 B >> > usage: 0 B used, 0 B / 0 B avail >> > pgs: 100.000% pgs unknown >> > 2048 unknown >> > >> > >> > > ceph health detail >> > >> > HEALTH_WARN >> pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set; >> 1 nearfull osd(s); 3 pool(s) nearfull; Reduced data availability: 2048 pgs >> inactive; mons cephs01,cephs02,cephs03 are using a lot of disk space >> > OSDMAP_FLAGS >> pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set >> > OSD_NEARFULL 1 nearfull osd(s) >> > osd.63 is near full >> > POOL_NEARFULL 3 pool(s) nearfull >> > pool 'dteam' is nearfull >> > pool 'atlas' is nearfull >> > pool 'atlas-localgroup' is nearfull >> > PG_AVAILABILITY Reduced data availability: 2048 pgs inactive >> > pg 13.1ef is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 13.1f0 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 13.1f1 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 13.1f2 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 13.1f3 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 13.1f4 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 13.1f5 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 13.1f6 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 13.1f7 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 13.1f8 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 13.1f9 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 13.1fa is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 13.1fb is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 13.1fc is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 13.1fd is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 13.1fe is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 13.1ff is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 14.1ec is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 14.1f0 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 14.1f1 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 14.1f2 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 14.1f3 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 14.1f4 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 14.1f5 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 14.1f6 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 14.1f7 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 14.1f8 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 14.1f9 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 14.1fa is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 14.1fb is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 14.1fc is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 14.1fd is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 14.1fe is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 14.1ff is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 15.1ed is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 15.1f0 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 15.1f1 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 15.1f2 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 15.1f3 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 15.1f4 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 15.1f5 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 15.1f6 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 15.1f7 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 15.1f8 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 15.1f9 is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 15.1fa is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 15.1fb is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 15.1fc is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 15.1fd is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 15.1fe is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > pg 15.1ff is stuck inactive for 89.322981, current state unknown, >> last acting [] >> > MON_DISK_BIG mons cephs01,cephs02,cephs03 are using a lot of disk >> space >> > mon.cephs01 is 96 GiB >= mon_data_size_warn (15 GiB) >> > mon.cephs02 is 96 GiB >= mon_data_size_warn (15 GiB) >> > mon.cephs03 is 96 GiB >= mon_data_size_warn (15 GiB) >> > >> > >> > > ceph versions >> > >> > { >> > "mon": { >> > "ceph version 14.2.18 >> (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 3 >> > }, >> > "mgr": { >> > "ceph version 14.2.18 >> (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 1 >> > }, >> > "osd": { >> > "ceph version 14.2.10 >> (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)": 1, >> > "ceph version 14.2.15 >> (afdd217ae5fb1ed3f60e16bd62357ca58cc650e5) nautilus (stable)": 188, >> > "ceph version 14.2.16 >> (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)": 18, >> > "ceph version 14.2.18 >> (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 122 >> > }, >> > >> > >> > >>>>>> >> > >> > As a note, the log where the mgr explodes (which precipitated all of >> this) definitely shows the problem occurring on the 12th [when 14.2.17 >> dropped], but things didn't "break" until we tried upgrading OSDs to >> 14.2.18... >> > >> > >> > Sam >> > >> > >> > On Mon, 22 Mar 2021 at 12:20, Sam Skipsey <aoanla(a)gmail.com> wrote: >> >> >> >> Hi Dan: >> >> >> >> Thanks for the reply - at present, our mons and mgrs are off >> [because of the unsustainable nature of the filesystem usage]. We'll try >> putting them on again for long enough to get "ceph status" out of them, but >> because the mgr was unable to actually talk to anything, and reply at that >> point. >> >> >> >> (And thanks for the link to the bug tracker - I guess this mismatch >> of expectations is why the devs are so keen to move to containerised >> deployments where there is no co-location of different types of server, as >> it means they don't need to worry as much about the assumptions about when >> it's okay to restart a service on package update. Disappointing that it >> seems stale after 2 years...) >> >> >> >> Sam >> >> >> >> >> >> >> >> On Mon, 22 Mar 2021 at 12:11, Dan van der Ster <dan(a)vanderster.com> >> wrote: >> >>> >> >>> Hi Sam, >> >>> >> >>> The daemons restart (for *some* releases) because of this: >> >>> https://tracker.ceph.com/issues/21672 >> >>> In short, if the selinux module changes, and if you have selinux >> >>> enabled, then midway through yum update, there will be a systemctl >> >>> restart ceph.target issued. >> >>> >> >>> For the rest -- I think you should focus on getting the PGs all >> >>> active+clean as soon as possible, because the degraded and remapped >> >>> states are what leads to mon / osdmap growth. >> >>> This kind of scenario is why we wrote this tool: >> >>> >> https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-rema… >> >>> It will use pg-upmap-items to force the PGs to the OSDs where they >> are >> >>> currently residing. >> >>> >> >>> But there is some clarification needed before you go ahead with >> that. >> >>> Could you share `ceph status`, `ceph health detail`? >> >>> >> >>> Cheers, Dan >> >>> >> >>> >> >>> On Mon, Mar 22, 2021 at 12:05 PM Sam Skipsey <aoanla(a)gmail.com> >> wrote: >> >>> > >> >>> > Hi everyone: >> >>> > >> >>> > I posted to the list on Friday morning (UK time), but apparently >> my email >> >>> > is still in moderation (I have an email from the list bot telling >> me that >> >>> > it's held for moderation but no updates). >> >>> > >> >>> > Since this is a bit urgent - we have ~3PB of storage offline - >> I'm posting >> >>> > again. >> >>> > >> >>> > To save retyping the whole thing, I will direct you to a copy of >> the email >> >>> > I wrote on Friday: >> >>> > >> >>> > http://aoanla.pythonanywhere.com/Logs/EmailToCephUsers.txt >> >>> > >> >>> > (Since that was sent, we did successfully add big SSDs to the MON >> hosts so >> >>> > they don't fill up their disks with store.db s). >> >>> > >> >>> > I would appreciate any advice - assuming this also doesn't get >> stuck in >> >>> > moderation queues. >> >>> > >> >>> > -- >> >>> > Sam Skipsey (he/him, they/them) >> >>> > _______________________________________________ >> >>> > ceph-users mailing list -- ceph-users(a)ceph.io >> >>> > To unsubscribe send an email to ceph-users-leave(a)ceph.io >> >> >> >> >> >> >> >> -- >> >> Sam Skipsey (he/him, they/them) >> >> >> >> >> > >> > >> > -- >> > Sam Skipsey (he/him, they/them) >> > >> > >> > > > -- > Sam Skipsey (he/him, they/them) > > >

-- Sam Skipsey (he/him, they/them)

Sam Skipsey

11:42 a.m.

I don't think we explicitly set any ms settings in the OSD host ceph.conf [all the OSDs ceph.confs are identical across the entire cluster]. ip a gives: ip a 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: em1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000 link/ether 4c:d9:8f:55:92:f6 brd ff:ff:ff:ff:ff:ff 3: em2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000 link/ether 4c:d9:8f:55:92:f7 brd ff:ff:ff:ff:ff:ff 4: p2p1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000 link/ether b4:96:91:3f:62:20 brd ff:ff:ff:ff:ff:ff 5: p2p2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000 link/ether b4:96:91:3f:62:22 brd ff:ff:ff:ff:ff:ff inet 10.1.50.21/8 brd 10.255.255.255 scope global noprefixroute p2p2 valid_lft forever preferred_lft forever inet6 fe80::b696:91ff:fe3f:6222/64 scope link noprefixroute valid_lft forever preferred_lft forever (where here p2p2 is the only active network link, and is also the private and public network for the ceph cluster) The output is similar on other hosts - with p2p2 either at position 3 or 5 depending on the order the interfaces were enumerated. Sam On Mon, 22 Mar 2021 at 17:34, Dan van der Ster <dan(a)vanderster.com> wrote:

...

Which `ms` settings do you have in the OSD host's ceph.conf or the ceph config dump? And how does `ip a` look on one of these hosts where the osd is registering itself as 127.0.0.1? You might as well set nodown again now. This will make ops pile up, but that's the least of your concerns at the moment. (With osds flapping the osdmaps churn and that inflates the mon store) .. Dan On Mon, Mar 22, 2021, 6:28 PM Sam Skipsey <aoanla(a)gmail.com> wrote: > Hm, yes it does [and I was wondering why loopbacks were showing up > suddenly in the logs]. This wasn't happening with 14.2.16 so what's changed > about how we specify stuff? > > This might correlate with the other person on the IRC list who has > problems with 14.2.18 and their OSDs deciding they don't work sometimes > until they forcibly restart their network links... > > > Sam > > On Mon, 22 Mar 2021 at 17:20, Dan van der Ster <dan(a)vanderster.com> > wrote: > >> What's with the OSDs having loopback addresses? E.g. v2: >> 127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667 >> >> Does `ceph osd dump` show those same loopback addresses for each OSD? >> >> This sounds familiar... I'm trying to find the recent ticket. >> >> .. dan >> >> >> On Mon, Mar 22, 2021, 6:07 PM Sam Skipsey <aoanla(a)gmail.com> wrote: >> >>> hi Dan: >>> >>> So, unsetting nodown results in... almost all of the OSDs being marked >>> down. (231 down out of 328). >>> Checking the actual OSD services, most of them were actually up and >>> active on the nodes, even when the mons had marked them down. >>> (On a few nodes, the down services corresponded to OSDs that had been >>> flapping - but increasing osd_max_markdown locally to keep them up despite >>> the previous flapping, and restarting the services... didn't help.) >>> >>> In fact, starting up the few OSD services which had actually stopped, >>> resulted in a different set of OSDs being marked down, and some others >>> coming up. >>> We currently have a sort of "rolling OSD outness" passing through the >>> cluster - there's always ~230 OSDs marked down now, but which ones those >>> are changes (we've had everything from 1 HOST down to 4 HOSTS down over the >>> past 14 minutes as things fluctuate. >>> >>> A log from one of the "down" OSDs [which is actually running, and on >>> the same host as OSDs which are marked up] shows this worrying snippet >>> >>> 2021-03-22 17:01:45.298 7f6c9c883700 1 osd.127 253515 is_healthy false >>> -- only 0/10 up peers (less than 33%) >>> 2021-03-22 17:01:45.298 7f6c9c883700 1 osd.127 253515 not healthy; >>> waiting to boot >>> 2021-03-22 17:01:46.340 7f6c9c883700 1 osd.127 253515 is_healthy false >>> -- only 0/10 up peers (less than 33%) >>> 2021-03-22 17:01:46.340 7f6c9c883700 1 osd.127 253515 not healthy; >>> waiting to boot >>> 2021-03-22 17:01:47.376 7f6c9c883700 1 osd.127 253515 is_healthy false >>> -- only 0/10 up peers (less than 33%) >>> 2021-03-22 17:01:47.376 7f6c9c883700 1 osd.127 253515 not healthy; >>> waiting to boot >>> 2021-03-22 17:01:48.395 7f6c9c883700 1 osd.127 253515 is_healthy false >>> -- only 0/10 up peers (less than 33%) >>> 2021-03-22 17:01:48.395 7f6c9c883700 1 osd.127 253515 not healthy; >>> waiting to boot >>> 2021-03-22 17:01:49.407 7f6c9c883700 1 osd.127 253515 is_healthy false >>> -- only 0/10 up peers (less than 33%) >>> 2021-03-22 17:01:49.407 7f6c9c883700 1 osd.127 253515 not healthy; >>> waiting to boot >>> 2021-03-22 17:01:50.400 7f6c9c883700 1 osd.127 253515 is_healthy false >>> -- only 0/10 up peers (less than 33%) >>> 2021-03-22 17:01:50.400 7f6c9c883700 1 osd.127 253515 not healthy; >>> waiting to boot >>> 2021-03-22 17:01:50.922 7f6c9f088700 -1 --2- 10.1.50.21:0/23673 >> [v2: >>> 127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667] >>> conn(0x56010903e400 0x56011a71fc00 unknown :-1 s=BANNER_CONNECTING pgs=0 >>> cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2: >>> 127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667] is using msgr V1 >>> protocol >>> 2021-03-22 17:01:50.922 7f6c9f889700 -1 --2- 10.1.50.21:0/23673 >> [v2: >>> 127.0.0.1:6821/13015214,v1:127.0.0.1:6831/13015214] >>> conn(0x5600df434000 0x56011718e000 unknown :-1 s=BANNER_CONNECTING pgs=0 >>> cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2: >>> 127.0.0.1:6821/13015214,v1:127.0.0.1:6831/13015214] is using msgr V1 >>> protocol >>> 2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> [v2: >>> 127.0.0.1:6826/11091658,v1:127.0.0.1:6828/11091658] >>> conn(0x5600f85ed800 0x560109df2a00 unknown :-1 s=BANNER_CONNECTING pgs=0 >>> cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2: >>> 127.0.0.1:6826/11091658,v1:127.0.0.1:6828/11091658] is using msgr V1 >>> protocol >>> 2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> [v2: >>> 127.0.0.1:6859/2683393,v1:127.0.0.1:6862/2683393] conn(0x5600f22ea000 >>> 0x560117182300 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 >>> tx=0)._handle_peer_banner peer [v2: >>> 127.0.0.1:6859/2683393,v1:127.0.0.1:6862/2683393] is using msgr V1 >>> protocol >>> 2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> [v2: >>> 127.0.0.1:6901/15090566,v1:127.0.0.1:6907/15090566] >>> conn(0x5600df435c00 0x560139370300 unknown :-1 s=BANNER_CONNECTING pgs=0 >>> cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2: >>> 127.0.0.1:6901/15090566,v1:127.0.0.1:6907/15090566] is using msgr V1 >>> protocol >>> 2021-03-22 17:01:51.377 7f6c9c883700 1 osd.127 253515 is_healthy false >>> -- only 0/10 up peers (less than 33%) >>> 2021-03-22 17:01:51.377 7f6c9c883700 1 osd.127 253515 not healthy; >>> waiting to boot >>> 2021-03-22 17:01:52.370 7f6c9c883700 1 osd.127 253515 is_healthy false >>> -- only 0/10 up peers (less than 33%) >>> 2021-03-22 17:01:52.370 7f6c9c883700 1 osd.127 253515 not healthy; >>> waiting to boot >>> 2021-03-22 17:01:53.377 7f6c9c883700 1 osd.127 253515 is_healthy false >>> -- only 0/10 up peers (less than 33%) >>> 2021-03-22 17:01:53.377 7f6c9c883700 1 osd.127 253515 not healthy; >>> waiting to boot >>> 2021-03-22 17:01:54.385 7f6c9c883700 1 osd.127 253515 is_healthy false >>> -- only 0/10 up peers (less than 33%) >>> 2021-03-22 17:01:54.385 7f6c9c883700 1 osd.127 253515 not healthy; >>> waiting to boot >>> 2021-03-22 17:01:55.385 7f6c9c883700 1 osd.127 253515 is_healthy false >>> -- only 0/10 up peers (less than 33%) >>> 2021-03-22 17:01:55.385 7f6c9c883700 1 osd.127 253515 not healthy; >>> waiting to boot >>> 2021-03-22 17:01:56.362 7f6c9c883700 1 osd.127 253515 is_healthy false >>> -- only 0/10 up peers (less than 33%) >>> 2021-03-22 17:01:56.362 7f6c9c883700 1 osd.127 253515 not healthy; >>> waiting to boot >>> 2021-03-22 17:01:57.324 7f6c9c883700 1 osd.127 253515 is_healthy false >>> -- only 0/10 up peers (less than 33%) >>> 2021-03-22 17:01:57.324 7f6c9c883700 1 osd.127 253515 not healthy; >>> waiting to boot >>> >>> >>> >>> Any suggestions? >>> >>> Sam >>> >>> P.S. an example ceph status as it is now [with everything now on >>> 14.2.18, since we had to restart osds anyway]: >>> >>> cluster: >>> id: a1148af2-6eaf-4486-a27e-a05a78c2b378 >>> health: HEALTH_WARN >>> pauserd,pausewr,noout,nobackfill,norebalance flag(s) set >>> 230 osds down >>> 4 hosts (80 osds) down >>> Reduced data availability: 2048 pgs inactive >>> 8 slow ops, oldest one blocked for 901 sec, mon.cephs01 has >>> slow ops >>> >>> services: >>> mon: 3 daemons, quorum cephs01,cephs02,cephs03 (age 2h) >>> mgr: cephs01(active, since 77m) >>> osd: 329 osds: 98 up (since 4s), 328 in (since 4d) >>> flags pauserd,pausewr,noout,nobackfill,norebalance >>> >>> data: >>> pools: 3 pools, 2048 pgs >>> objects: 0 objects, 0 B >>> usage: 0 B used, 0 B / 0 B avail >>> pgs: 100.000% pgs unknown >>> 2048 unknown >>> >>> >>> >>> On Mon, 22 Mar 2021 at 14:57, Dan van der Ster <dan(a)vanderster.com> >>> wrote: >>> >>>> Hi, >>>> >>>> I would unset nodown (hiding osd failures) and norecover (blcoking PGs >>>> from recovering degraded objects), then start starting osds. >>>> As soon as you have some osd logs reporting some failures, then share >>>> those... >>>> >>>> - Dan >>>> >>>> On Mon, Mar 22, 2021 at 3:49 PM Sam Skipsey <aoanla(a)gmail.com> wrote: >>>> > >>>> > So, we started the mons and mgr up again, and here's the relevant >>>> logs, including also ceph versions. We've also turned off all of the >>>> firewalls on all of the nodes so we know that there can't be network issues >>>> [and, indeed, all of our management of the OSDs happens via logins from the >>>> service nodes or to each other] >>>> > >>>> > > ceph status >>>> > >>>> > >>>> > cluster: >>>> > id: a1148af2-6eaf-4486-a27e-a05a78c2b378 >>>> > health: HEALTH_WARN >>>> > >>>> pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set >>>> > 1 nearfull osd(s) >>>> > 3 pool(s) nearfull >>>> > Reduced data availability: 2048 pgs inactive >>>> > mons cephs01,cephs02,cephs03 are using a lot of disk >>>> space >>>> > >>>> > services: >>>> > mon: 3 daemons, quorum cephs01,cephs02,cephs03 (age 61s) >>>> > mgr: cephs01(active, since 76s) >>>> > osd: 329 osds: 329 up (since 63s), 328 in (since 4d); 466 >>>> remapped pgs >>>> > flags >>>> pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover >>>> > >>>> > data: >>>> > pools: 3 pools, 2048 pgs >>>> > objects: 0 objects, 0 B >>>> > usage: 0 B used, 0 B / 0 B avail >>>> > pgs: 100.000% pgs unknown >>>> > 2048 unknown >>>> > >>>> > >>>> > > ceph health detail >>>> > >>>> > HEALTH_WARN >>>> pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set; >>>> 1 nearfull osd(s); 3 pool(s) nearfull; Reduced data availability: 2048 pgs >>>> inactive; mons cephs01,cephs02,cephs03 are using a lot of disk space >>>> > OSDMAP_FLAGS >>>> pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set >>>> > OSD_NEARFULL 1 nearfull osd(s) >>>> > osd.63 is near full >>>> > POOL_NEARFULL 3 pool(s) nearfull >>>> > pool 'dteam' is nearfull >>>> > pool 'atlas' is nearfull >>>> > pool 'atlas-localgroup' is nearfull >>>> > PG_AVAILABILITY Reduced data availability: 2048 pgs inactive >>>> > pg 13.1ef is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 13.1f0 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 13.1f1 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 13.1f2 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 13.1f3 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 13.1f4 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 13.1f5 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 13.1f6 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 13.1f7 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 13.1f8 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 13.1f9 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 13.1fa is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 13.1fb is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 13.1fc is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 13.1fd is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 13.1fe is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 13.1ff is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 14.1ec is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 14.1f0 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 14.1f1 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 14.1f2 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 14.1f3 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 14.1f4 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 14.1f5 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 14.1f6 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 14.1f7 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 14.1f8 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 14.1f9 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 14.1fa is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 14.1fb is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 14.1fc is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 14.1fd is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 14.1fe is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 14.1ff is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 15.1ed is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 15.1f0 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 15.1f1 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 15.1f2 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 15.1f3 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 15.1f4 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 15.1f5 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 15.1f6 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 15.1f7 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 15.1f8 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 15.1f9 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 15.1fa is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 15.1fb is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 15.1fc is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 15.1fd is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 15.1fe is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 15.1ff is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > MON_DISK_BIG mons cephs01,cephs02,cephs03 are using a lot of disk >>>> space >>>> > mon.cephs01 is 96 GiB >= mon_data_size_warn (15 GiB) >>>> > mon.cephs02 is 96 GiB >= mon_data_size_warn (15 GiB) >>>> > mon.cephs03 is 96 GiB >= mon_data_size_warn (15 GiB) >>>> > >>>> > >>>> > > ceph versions >>>> > >>>> > { >>>> > "mon": { >>>> > "ceph version 14.2.18 >>>> (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 3 >>>> > }, >>>> > "mgr": { >>>> > "ceph version 14.2.18 >>>> (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 1 >>>> > }, >>>> > "osd": { >>>> > "ceph version 14.2.10 >>>> (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)": 1, >>>> > "ceph version 14.2.15 >>>> (afdd217ae5fb1ed3f60e16bd62357ca58cc650e5) nautilus (stable)": 188, >>>> > "ceph version 14.2.16 >>>> (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)": 18, >>>> > "ceph version 14.2.18 >>>> (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 122 >>>> > }, >>>> > >>>> > >>>> > >>>>>> >>>> > >>>> > As a note, the log where the mgr explodes (which precipitated all of >>>> this) definitely shows the problem occurring on the 12th [when 14.2.17 >>>> dropped], but things didn't "break" until we tried upgrading OSDs to >>>> 14.2.18... >>>> > >>>> > >>>> > Sam >>>> > >>>> > >>>> > On Mon, 22 Mar 2021 at 12:20, Sam Skipsey <aoanla(a)gmail.com> wrote: >>>> >> >>>> >> Hi Dan: >>>> >> >>>> >> Thanks for the reply - at present, our mons and mgrs are off >>>> [because of the unsustainable nature of the filesystem usage]. We'll try >>>> putting them on again for long enough to get "ceph status" out of them, but >>>> because the mgr was unable to actually talk to anything, and reply at that >>>> point. >>>> >> >>>> >> (And thanks for the link to the bug tracker - I guess this mismatch >>>> of expectations is why the devs are so keen to move to containerised >>>> deployments where there is no co-location of different types of server, as >>>> it means they don't need to worry as much about the assumptions about when >>>> it's okay to restart a service on package update. Disappointing that it >>>> seems stale after 2 years...) >>>> >> >>>> >> Sam >>>> >> >>>> >> >>>> >> >>>> >> On Mon, 22 Mar 2021 at 12:11, Dan van der Ster <dan(a)vanderster.com> >>>> wrote: >>>> >>> >>>> >>> Hi Sam, >>>> >>> >>>> >>> The daemons restart (for *some* releases) because of this: >>>> >>> https://tracker.ceph.com/issues/21672 >>>> >>> In short, if the selinux module changes, and if you have selinux >>>> >>> enabled, then midway through yum update, there will be a systemctl >>>> >>> restart ceph.target issued. >>>> >>> >>>> >>> For the rest -- I think you should focus on getting the PGs all >>>> >>> active+clean as soon as possible, because the degraded and remapped >>>> >>> states are what leads to mon / osdmap growth. >>>> >>> This kind of scenario is why we wrote this tool: >>>> >>> >>>> https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-rema… >>>> >>> It will use pg-upmap-items to force the PGs to the OSDs where they >>>> are >>>> >>> currently residing. >>>> >>> >>>> >>> But there is some clarification needed before you go ahead with >>>> that. >>>> >>> Could you share `ceph status`, `ceph health detail`? >>>> >>> >>>> >>> Cheers, Dan >>>> >>> >>>> >>> >>>> >>> On Mon, Mar 22, 2021 at 12:05 PM Sam Skipsey <aoanla(a)gmail.com> >>>> wrote: >>>> >>> > >>>> >>> > Hi everyone: >>>> >>> > >>>> >>> > I posted to the list on Friday morning (UK time), but apparently >>>> my email >>>> >>> > is still in moderation (I have an email from the list bot >>>> telling me that >>>> >>> > it's held for moderation but no updates). >>>> >>> > >>>> >>> > Since this is a bit urgent - we have ~3PB of storage offline - >>>> I'm posting >>>> >>> > again. >>>> >>> > >>>> >>> > To save retyping the whole thing, I will direct you to a copy of >>>> the email >>>> >>> > I wrote on Friday: >>>> >>> > >>>> >>> > http://aoanla.pythonanywhere.com/Logs/EmailToCephUsers.txt >>>> >>> > >>>> >>> > (Since that was sent, we did successfully add big SSDs to the >>>> MON hosts so >>>> >>> > they don't fill up their disks with store.db s). >>>> >>> > >>>> >>> > I would appreciate any advice - assuming this also doesn't get >>>> stuck in >>>> >>> > moderation queues. >>>> >>> > >>>> >>> > -- >>>> >>> > Sam Skipsey (he/him, they/them) >>>> >>> > _______________________________________________ >>>> >>> > ceph-users mailing list -- ceph-users(a)ceph.io >>>> >>> > To unsubscribe send an email to ceph-users-leave(a)ceph.io >>>> >> >>>> >> >>>> >> >>>> >> -- >>>> >> Sam Skipsey (he/him, they/them) >>>> >> >>>> >> >>>> > >>>> > >>>> > -- >>>> > Sam Skipsey (he/him, they/them) >>>> > >>>> > >>>> >>> >>> >>> -- >>> Sam Skipsey (he/him, they/them) >>> >>> >>> > > -- > Sam Skipsey (he/him, they/them) > > >

-- Sam Skipsey (he/him, they/them)

Dan van der Ster

12:12 p.m.

There are two commits between 14.2.16 and 14.2.18 related to loopback network. Perhaps one of these is responsible for your issue [1]. I'd try playing with the options like cluster/public bind addr and cluster/public bind interface until you can convince the osd to bind to the correct listening IP. (That said, i don't know which version you're running on the logs shared earlier. But I think you should try to get 14.2.18 working anyway). .. dan [1]

...

git log v14.2.18...v14.2.16 ipaddr.cc commit

...

Which `ms` settings do you have in the OSD host's ceph.conf or the ceph config dump? And how does `ip a` look on one of these hosts where the osd is registering itself as 127.0.0.1? You might as well set nodown again now. This will make ops pile up, but that's the least of your concerns at the moment. (With osds flapping the osdmaps churn and that inflates the mon store) .. Dan On Mon, Mar 22, 2021, 6:28 PM Sam Skipsey <aoanla(a)gmail.com> wrote: > Hm, yes it does [and I was wondering why loopbacks were showing up > suddenly in the logs]. This wasn't happening with 14.2.16 so what's changed > about how we specify stuff? > > This might correlate with the other person on the IRC list who has > problems with 14.2.18 and their OSDs deciding they don't work sometimes > until they forcibly restart their network links... > > > Sam > > On Mon, 22 Mar 2021 at 17:20, Dan van der Ster <dan(a)vanderster.com> > wrote: > >> What's with the OSDs having loopback addresses? E.g. v2: >> 127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667 >> >> Does `ceph osd dump` show those same loopback addresses for each OSD? >> >> This sounds familiar... I'm trying to find the recent ticket. >> >> .. dan >> >> >> On Mon, Mar 22, 2021, 6:07 PM Sam Skipsey <aoanla(a)gmail.com> wrote: >> >>> hi Dan: >>> >>> So, unsetting nodown results in... almost all of the OSDs being marked >>> down. (231 down out of 328). >>> Checking the actual OSD services, most of them were actually up and >>> active on the nodes, even when the mons had marked them down. >>> (On a few nodes, the down services corresponded to OSDs that had been >>> flapping - but increasing osd_max_markdown locally to keep them up despite >>> the previous flapping, and restarting the services... didn't help.) >>> >>> In fact, starting up the few OSD services which had actually stopped, >>> resulted in a different set of OSDs being marked down, and some others >>> coming up. >>> We currently have a sort of "rolling OSD outness" passing through the >>> cluster - there's always ~230 OSDs marked down now, but which ones those >>> are changes (we've had everything from 1 HOST down to 4 HOSTS down over the >>> past 14 minutes as things fluctuate. >>> >>> A log from one of the "down" OSDs [which is actually running, and on >>> the same host as OSDs which are marked up] shows this worrying snippet >>> >>> 2021-03-22 17:01:45.298 7f6c9c883700 1 osd.127 253515 is_healthy >>> false -- only 0/10 up peers (less than 33%) >>> 2021-03-22 17:01:45.298 7f6c9c883700 1 osd.127 253515 not healthy; >>> waiting to boot >>> 2021-03-22 17:01:46.340 7f6c9c883700 1 osd.127 253515 is_healthy >>> false -- only 0/10 up peers (less than 33%) >>> 2021-03-22 17:01:46.340 7f6c9c883700 1 osd.127 253515 not healthy; >>> waiting to boot >>> 2021-03-22 17:01:47.376 7f6c9c883700 1 osd.127 253515 is_healthy >>> false -- only 0/10 up peers (less than 33%) >>> 2021-03-22 17:01:47.376 7f6c9c883700 1 osd.127 253515 not healthy; >>> waiting to boot >>> 2021-03-22 17:01:48.395 7f6c9c883700 1 osd.127 253515 is_healthy >>> false -- only 0/10 up peers (less than 33%) >>> 2021-03-22 17:01:48.395 7f6c9c883700 1 osd.127 253515 not healthy; >>> waiting to boot >>> 2021-03-22 17:01:49.407 7f6c9c883700 1 osd.127 253515 is_healthy >>> false -- only 0/10 up peers (less than 33%) >>> 2021-03-22 17:01:49.407 7f6c9c883700 1 osd.127 253515 not healthy; >>> waiting to boot >>> 2021-03-22 17:01:50.400 7f6c9c883700 1 osd.127 253515 is_healthy >>> false -- only 0/10 up peers (less than 33%) >>> 2021-03-22 17:01:50.400 7f6c9c883700 1 osd.127 253515 not healthy; >>> waiting to boot >>> 2021-03-22 17:01:50.922 7f6c9f088700 -1 --2- 10.1.50.21:0/23673 >> >>> [v2:127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667] >>> conn(0x56010903e400 0x56011a71fc00 unknown :-1 s=BANNER_CONNECTING pgs=0 >>> cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2: >>> 127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667] is using msgr V1 >>> protocol >>> 2021-03-22 17:01:50.922 7f6c9f889700 -1 --2- 10.1.50.21:0/23673 >> >>> [v2:127.0.0.1:6821/13015214,v1:127.0.0.1:6831/13015214] >>> conn(0x5600df434000 0x56011718e000 unknown :-1 s=BANNER_CONNECTING pgs=0 >>> cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2: >>> 127.0.0.1:6821/13015214,v1:127.0.0.1:6831/13015214] is using msgr V1 >>> protocol >>> 2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> >>> [v2:127.0.0.1:6826/11091658,v1:127.0.0.1:6828/11091658] >>> conn(0x5600f85ed800 0x560109df2a00 unknown :-1 s=BANNER_CONNECTING pgs=0 >>> cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2: >>> 127.0.0.1:6826/11091658,v1:127.0.0.1:6828/11091658] is using msgr V1 >>> protocol >>> 2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> >>> [v2:127.0.0.1:6859/2683393,v1:127.0.0.1:6862/2683393] >>> conn(0x5600f22ea000 0x560117182300 unknown :-1 s=BANNER_CONNECTING pgs=0 >>> cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2: >>> 127.0.0.1:6859/2683393,v1:127.0.0.1:6862/2683393] is using msgr V1 >>> protocol >>> 2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> >>> [v2:127.0.0.1:6901/15090566,v1:127.0.0.1:6907/15090566] >>> conn(0x5600df435c00 0x560139370300 unknown :-1 s=BANNER_CONNECTING pgs=0 >>> cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2: >>> 127.0.0.1:6901/15090566,v1:127.0.0.1:6907/15090566] is using msgr V1 >>> protocol >>> 2021-03-22 17:01:51.377 7f6c9c883700 1 osd.127 253515 is_healthy >>> false -- only 0/10 up peers (less than 33%) >>> 2021-03-22 17:01:51.377 7f6c9c883700 1 osd.127 253515 not healthy; >>> waiting to boot >>> 2021-03-22 17:01:52.370 7f6c9c883700 1 osd.127 253515 is_healthy >>> false -- only 0/10 up peers (less than 33%) >>> 2021-03-22 17:01:52.370 7f6c9c883700 1 osd.127 253515 not healthy; >>> waiting to boot >>> 2021-03-22 17:01:53.377 7f6c9c883700 1 osd.127 253515 is_healthy >>> false -- only 0/10 up peers (less than 33%) >>> 2021-03-22 17:01:53.377 7f6c9c883700 1 osd.127 253515 not healthy; >>> waiting to boot >>> 2021-03-22 17:01:54.385 7f6c9c883700 1 osd.127 253515 is_healthy >>> false -- only 0/10 up peers (less than 33%) >>> 2021-03-22 17:01:54.385 7f6c9c883700 1 osd.127 253515 not healthy; >>> waiting to boot >>> 2021-03-22 17:01:55.385 7f6c9c883700 1 osd.127 253515 is_healthy >>> false -- only 0/10 up peers (less than 33%) >>> 2021-03-22 17:01:55.385 7f6c9c883700 1 osd.127 253515 not healthy; >>> waiting to boot >>> 2021-03-22 17:01:56.362 7f6c9c883700 1 osd.127 253515 is_healthy >>> false -- only 0/10 up peers (less than 33%) >>> 2021-03-22 17:01:56.362 7f6c9c883700 1 osd.127 253515 not healthy; >>> waiting to boot >>> 2021-03-22 17:01:57.324 7f6c9c883700 1 osd.127 253515 is_healthy >>> false -- only 0/10 up peers (less than 33%) >>> 2021-03-22 17:01:57.324 7f6c9c883700 1 osd.127 253515 not healthy; >>> waiting to boot >>> >>> >>> >>> Any suggestions? >>> >>> Sam >>> >>> P.S. an example ceph status as it is now [with everything now on >>> 14.2.18, since we had to restart osds anyway]: >>> >>> cluster: >>> id: a1148af2-6eaf-4486-a27e-a05a78c2b378 >>> health: HEALTH_WARN >>> pauserd,pausewr,noout,nobackfill,norebalance flag(s) set >>> 230 osds down >>> 4 hosts (80 osds) down >>> Reduced data availability: 2048 pgs inactive >>> 8 slow ops, oldest one blocked for 901 sec, mon.cephs01 >>> has slow ops >>> >>> services: >>> mon: 3 daemons, quorum cephs01,cephs02,cephs03 (age 2h) >>> mgr: cephs01(active, since 77m) >>> osd: 329 osds: 98 up (since 4s), 328 in (since 4d) >>> flags pauserd,pausewr,noout,nobackfill,norebalance >>> >>> data: >>> pools: 3 pools, 2048 pgs >>> objects: 0 objects, 0 B >>> usage: 0 B used, 0 B / 0 B avail >>> pgs: 100.000% pgs unknown >>> 2048 unknown >>> >>> >>> >>> On Mon, 22 Mar 2021 at 14:57, Dan van der Ster <dan(a)vanderster.com> >>> wrote: >>> >>>> Hi, >>>> >>>> I would unset nodown (hiding osd failures) and norecover (blcoking PGs >>>> from recovering degraded objects), then start starting osds. >>>> As soon as you have some osd logs reporting some failures, then share >>>> those... >>>> >>>> - Dan >>>> >>>> On Mon, Mar 22, 2021 at 3:49 PM Sam Skipsey <aoanla(a)gmail.com> wrote: >>>> > >>>> > So, we started the mons and mgr up again, and here's the relevant >>>> logs, including also ceph versions. We've also turned off all of the >>>> firewalls on all of the nodes so we know that there can't be network issues >>>> [and, indeed, all of our management of the OSDs happens via logins from the >>>> service nodes or to each other] >>>> > >>>> > > ceph status >>>> > >>>> > >>>> > cluster: >>>> > id: a1148af2-6eaf-4486-a27e-a05a78c2b378 >>>> > health: HEALTH_WARN >>>> > >>>> pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set >>>> > 1 nearfull osd(s) >>>> > 3 pool(s) nearfull >>>> > Reduced data availability: 2048 pgs inactive >>>> > mons cephs01,cephs02,cephs03 are using a lot of disk >>>> space >>>> > >>>> > services: >>>> > mon: 3 daemons, quorum cephs01,cephs02,cephs03 (age 61s) >>>> > mgr: cephs01(active, since 76s) >>>> > osd: 329 osds: 329 up (since 63s), 328 in (since 4d); 466 >>>> remapped pgs >>>> > flags >>>> pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover >>>> > >>>> > data: >>>> > pools: 3 pools, 2048 pgs >>>> > objects: 0 objects, 0 B >>>> > usage: 0 B used, 0 B / 0 B avail >>>> > pgs: 100.000% pgs unknown >>>> > 2048 unknown >>>> > >>>> > >>>> > > ceph health detail >>>> > >>>> > HEALTH_WARN >>>> pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set; >>>> 1 nearfull osd(s); 3 pool(s) nearfull; Reduced data availability: 2048 pgs >>>> inactive; mons cephs01,cephs02,cephs03 are using a lot of disk space >>>> > OSDMAP_FLAGS >>>> pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set >>>> > OSD_NEARFULL 1 nearfull osd(s) >>>> > osd.63 is near full >>>> > POOL_NEARFULL 3 pool(s) nearfull >>>> > pool 'dteam' is nearfull >>>> > pool 'atlas' is nearfull >>>> > pool 'atlas-localgroup' is nearfull >>>> > PG_AVAILABILITY Reduced data availability: 2048 pgs inactive >>>> > pg 13.1ef is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 13.1f0 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 13.1f1 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 13.1f2 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 13.1f3 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 13.1f4 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 13.1f5 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 13.1f6 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 13.1f7 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 13.1f8 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 13.1f9 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 13.1fa is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 13.1fb is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 13.1fc is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 13.1fd is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 13.1fe is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 13.1ff is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 14.1ec is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 14.1f0 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 14.1f1 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 14.1f2 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 14.1f3 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 14.1f4 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 14.1f5 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 14.1f6 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 14.1f7 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 14.1f8 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 14.1f9 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 14.1fa is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 14.1fb is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 14.1fc is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 14.1fd is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 14.1fe is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 14.1ff is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 15.1ed is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 15.1f0 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 15.1f1 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 15.1f2 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 15.1f3 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 15.1f4 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 15.1f5 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 15.1f6 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 15.1f7 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 15.1f8 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 15.1f9 is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 15.1fa is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 15.1fb is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 15.1fc is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 15.1fd is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 15.1fe is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > pg 15.1ff is stuck inactive for 89.322981, current state >>>> unknown, last acting [] >>>> > MON_DISK_BIG mons cephs01,cephs02,cephs03 are using a lot of disk >>>> space >>>> > mon.cephs01 is 96 GiB >= mon_data_size_warn (15 GiB) >>>> > mon.cephs02 is 96 GiB >= mon_data_size_warn (15 GiB) >>>> > mon.cephs03 is 96 GiB >= mon_data_size_warn (15 GiB) >>>> > >>>> > >>>> > > ceph versions >>>> > >>>> > { >>>> > "mon": { >>>> > "ceph version 14.2.18 >>>> (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 3 >>>> > }, >>>> > "mgr": { >>>> > "ceph version 14.2.18 >>>> (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 1 >>>> > }, >>>> > "osd": { >>>> > "ceph version 14.2.10 >>>> (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)": 1, >>>> > "ceph version 14.2.15 >>>> (afdd217ae5fb1ed3f60e16bd62357ca58cc650e5) nautilus (stable)": 188, >>>> > "ceph version 14.2.16 >>>> (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)": 18, >>>> > "ceph version 14.2.18 >>>> (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 122 >>>> > }, >>>> > >>>> > >>>> > >>>>>> >>>> > >>>> > As a note, the log where the mgr explodes (which precipitated all >>>> of this) definitely shows the problem occurring on the 12th [when 14.2.17 >>>> dropped], but things didn't "break" until we tried upgrading OSDs to >>>> 14.2.18... >>>> > >>>> > >>>> > Sam >>>> > >>>> > >>>> > On Mon, 22 Mar 2021 at 12:20, Sam Skipsey <aoanla(a)gmail.com> wrote: >>>> >> >>>> >> Hi Dan: >>>> >> >>>> >> Thanks for the reply - at present, our mons and mgrs are off >>>> [because of the unsustainable nature of the filesystem usage]. We'll try >>>> putting them on again for long enough to get "ceph status" out of them, but >>>> because the mgr was unable to actually talk to anything, and reply at that >>>> point. >>>> >> >>>> >> (And thanks for the link to the bug tracker - I guess this >>>> mismatch of expectations is why the devs are so keen to move to >>>> containerised deployments where there is no co-location of different types >>>> of server, as it means they don't need to worry as much about the >>>> assumptions about when it's okay to restart a service on package update. >>>> Disappointing that it seems stale after 2 years...) >>>> >> >>>> >> Sam >>>> >> >>>> >> >>>> >> >>>> >> On Mon, 22 Mar 2021 at 12:11, Dan van der Ster <dan(a)vanderster.com> >>>> wrote: >>>> >>> >>>> >>> Hi Sam, >>>> >>> >>>> >>> The daemons restart (for *some* releases) because of this: >>>> >>> https://tracker.ceph.com/issues/21672 >>>> >>> In short, if the selinux module changes, and if you have selinux >>>> >>> enabled, then midway through yum update, there will be a systemctl >>>> >>> restart ceph.target issued. >>>> >>> >>>> >>> For the rest -- I think you should focus on getting the PGs all >>>> >>> active+clean as soon as possible, because the degraded and >>>> remapped >>>> >>> states are what leads to mon / osdmap growth. >>>> >>> This kind of scenario is why we wrote this tool: >>>> >>> >>>> https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-rema… >>>> >>> It will use pg-upmap-items to force the PGs to the OSDs where >>>> they are >>>> >>> currently residing. >>>> >>> >>>> >>> But there is some clarification needed before you go ahead with >>>> that. >>>> >>> Could you share `ceph status`, `ceph health detail`? >>>> >>> >>>> >>> Cheers, Dan >>>> >>> >>>> >>> >>>> >>> On Mon, Mar 22, 2021 at 12:05 PM Sam Skipsey <aoanla(a)gmail.com> >>>> wrote: >>>> >>> > >>>> >>> > Hi everyone: >>>> >>> > >>>> >>> > I posted to the list on Friday morning (UK time), but >>>> apparently my email >>>> >>> > is still in moderation (I have an email from the list bot >>>> telling me that >>>> >>> > it's held for moderation but no updates). >>>> >>> > >>>> >>> > Since this is a bit urgent - we have ~3PB of storage offline - >>>> I'm posting >>>> >>> > again. >>>> >>> > >>>> >>> > To save retyping the whole thing, I will direct you to a copy >>>> of the email >>>> >>> > I wrote on Friday: >>>> >>> > >>>> >>> > http://aoanla.pythonanywhere.com/Logs/EmailToCephUsers.txt >>>> >>> > >>>> >>> > (Since that was sent, we did successfully add big SSDs to the >>>> MON hosts so >>>> >>> > they don't fill up their disks with store.db s). >>>> >>> > >>>> >>> > I would appreciate any advice - assuming this also doesn't get >>>> stuck in >>>> >>> > moderation queues. >>>> >>> > >>>> >>> > -- >>>> >>> > Sam Skipsey (he/him, they/them) >>>> >>> > _______________________________________________ >>>> >>> > ceph-users mailing list -- ceph-users(a)ceph.io >>>> >>> > To unsubscribe send an email to ceph-users-leave(a)ceph.io >>>> >> >>>> >> >>>> >> >>>> >> -- >>>> >> Sam Skipsey (he/him, they/them) >>>> >> >>>> >> >>>> > >>>> > >>>> > -- >>>> > Sam Skipsey (he/him, they/them) >>>> > >>>> > >>>> >>> >>> >>> -- >>> Sam Skipsey (he/him, they/them) >>> >>> >>> > > -- > Sam Skipsey (he/him, they/them) > > >

-- Sam Skipsey (he/him, they/them)

Sam Skipsey

12:35 p.m.

Hi Dan: Aha - I think the first commit is probably it - before that commit, the fact that lo is highest in the interfaces enumeration didn't matter for us [since it would always be skipped]. This actually almost certainly also is associated with that other site with a similar problem (OSDs drop out until you restart the network interface), since I imagine that would reorder the interface list. Playing with our public and cluster bind address explicitly does seem to help, so we'll iterate on that and get to a suitable ceph.conf. Thanks for the help [and it was the network all along]! Sam On Mon, 22 Mar 2021 at 19:12, Dan van der Ster <dan(a)vanderster.com> wrote:

...

git log v14.2.18...v14.2.16 ipaddr.cc commit

89321762ad4cfdd1a68cae467181bdd1a501f14d Author: Thomas Goirand <zigo(a)debian.org> Date: Fri Jan 15 10:50:05 2021 +0100 common/ipaddr: Allow binding on lo Commmit 5cf0fa872231f4eaf8ce6565a04ed675ba5b689b, solves the issue that the osd can't restart after seting a virtual local loopback IP. However, this commit also prevents a bgp-to-the-host over unumbered Ipv6 local-link is setup, where OSD typically are bound to the lo interface. To solve this, this single char patch simply checks against "lo:" to match only virtual interfaces instead of anything that starts with "lo". Fixes: https://tracker.ceph.com/issues/48893 Signed-off-by: Thomas Goirand <zigo(a)debian.org> (cherry picked from commit 201b59204374ebdab91bb554b986577a97b19c36) commit b52cae90d67eb878b3ddfe547b8bf16e0d4d1a45 Author: lijaiwei1 <lijiawei1(a)chinatelecom.cn> Date: Tue Dec 24 22:34:46 2019 +0800 common: skip interfaces starting with "lo" in find_ipv{4,6}_in_subnet() This will solve the issue that the osd can't restart after seting a virtual local loopback IP. In find_ipv4_in_subnet() and find_ipv6_in_subnet(), I use boost::starts_with(addrs->ifa_name, "lo") to ship the interfaces starting with "lo". Fixes: https://tracker.ceph.com/issues/43417 Signed-off-by: Jiawei Li <lijiawei1(a)chinatelecom.cn> (cherry picked from commit 5cf0fa872231f4eaf8ce6565a04ed675ba5b689b) On Mon, Mar 22, 2021, 7:42 PM Sam Skipsey <aoanla(a)gmail.com> wrote: > I don't think we explicitly set any ms settings in the OSD host ceph.conf > [all the OSDs ceph.confs are identical across the entire cluster]. > > ip a gives: > > ip a > 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group > default qlen 1000 > link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 > inet 127.0.0.1/8 scope host lo > valid_lft forever preferred_lft forever > inet6 ::1/128 scope host > valid_lft forever preferred_lft forever > 2: em1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN > group default qlen 1000 > link/ether 4c:d9:8f:55:92:f6 brd ff:ff:ff:ff:ff:ff > 3: em2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN > group default qlen 1000 > link/ether 4c:d9:8f:55:92:f7 brd ff:ff:ff:ff:ff:ff > 4: p2p1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN > group default qlen 1000 > link/ether b4:96:91:3f:62:20 brd ff:ff:ff:ff:ff:ff > 5: p2p2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP > group default qlen 1000 > link/ether b4:96:91:3f:62:22 brd ff:ff:ff:ff:ff:ff > inet 10.1.50.21/8 brd 10.255.255.255 scope global noprefixroute p2p2 > valid_lft forever preferred_lft forever > inet6 fe80::b696:91ff:fe3f:6222/64 scope link noprefixroute > valid_lft forever preferred_lft forever > > (where here p2p2 is the only active network link, and is also the private > and public network for the ceph cluster) > > The output is similar on other hosts - with p2p2 either at position 3 or > 5 depending on the order the interfaces were enumerated. > > Sam > > On Mon, 22 Mar 2021 at 17:34, Dan van der Ster <dan(a)vanderster.com> > wrote: > >> Which `ms` settings do you have in the OSD host's ceph.conf or the ceph >> config dump? >> >> And how does `ip a` look on one of these hosts where the osd is >> registering itself as 127.0.0.1? >> >> >> You might as well set nodown again now. This will make ops pile up, but >> that's the least of your concerns at the moment. >> (With osds flapping the osdmaps churn and that inflates the mon store) >> >> .. Dan >> >> On Mon, Mar 22, 2021, 6:28 PM Sam Skipsey <aoanla(a)gmail.com> wrote: >> >>> Hm, yes it does [and I was wondering why loopbacks were showing up >>> suddenly in the logs]. This wasn't happening with 14.2.16 so what's changed >>> about how we specify stuff? >>> >>> This might correlate with the other person on the IRC list who has >>> problems with 14.2.18 and their OSDs deciding they don't work sometimes >>> until they forcibly restart their network links... >>> >>> >>> Sam >>> >>> On Mon, 22 Mar 2021 at 17:20, Dan van der Ster <dan(a)vanderster.com> >>> wrote: >>> >>>> What's with the OSDs having loopback addresses? E.g. v2: >>>> 127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667 >>>> >>>> Does `ceph osd dump` show those same loopback addresses for each OSD? >>>> >>>> This sounds familiar... I'm trying to find the recent ticket. >>>> >>>> .. dan >>>> >>>> >>>> On Mon, Mar 22, 2021, 6:07 PM Sam Skipsey <aoanla(a)gmail.com> wrote: >>>> >>>>> hi Dan: >>>>> >>>>> So, unsetting nodown results in... almost all of the OSDs being >>>>> marked down. (231 down out of 328). >>>>> Checking the actual OSD services, most of them were actually up and >>>>> active on the nodes, even when the mons had marked them down. >>>>> (On a few nodes, the down services corresponded to OSDs that had been >>>>> flapping - but increasing osd_max_markdown locally to keep them up despite >>>>> the previous flapping, and restarting the services... didn't help.) >>>>> >>>>> In fact, starting up the few OSD services which had actually stopped, >>>>> resulted in a different set of OSDs being marked down, and some others >>>>> coming up. >>>>> We currently have a sort of "rolling OSD outness" passing through the >>>>> cluster - there's always ~230 OSDs marked down now, but which ones those >>>>> are changes (we've had everything from 1 HOST down to 4 HOSTS down over the >>>>> past 14 minutes as things fluctuate. >>>>> >>>>> A log from one of the "down" OSDs [which is actually running, and on >>>>> the same host as OSDs which are marked up] shows this worrying snippet >>>>> >>>>> 2021-03-22 17:01:45.298 7f6c9c883700 1 osd.127 253515 is_healthy >>>>> false -- only 0/10 up peers (less than 33%) >>>>> 2021-03-22 17:01:45.298 7f6c9c883700 1 osd.127 253515 not healthy; >>>>> waiting to boot >>>>> 2021-03-22 17:01:46.340 7f6c9c883700 1 osd.127 253515 is_healthy >>>>> false -- only 0/10 up peers (less than 33%) >>>>> 2021-03-22 17:01:46.340 7f6c9c883700 1 osd.127 253515 not healthy; >>>>> waiting to boot >>>>> 2021-03-22 17:01:47.376 7f6c9c883700 1 osd.127 253515 is_healthy >>>>> false -- only 0/10 up peers (less than 33%) >>>>> 2021-03-22 17:01:47.376 7f6c9c883700 1 osd.127 253515 not healthy; >>>>> waiting to boot >>>>> 2021-03-22 17:01:48.395 7f6c9c883700 1 osd.127 253515 is_healthy >>>>> false -- only 0/10 up peers (less than 33%) >>>>> 2021-03-22 17:01:48.395 7f6c9c883700 1 osd.127 253515 not healthy; >>>>> waiting to boot >>>>> 2021-03-22 17:01:49.407 7f6c9c883700 1 osd.127 253515 is_healthy >>>>> false -- only 0/10 up peers (less than 33%) >>>>> 2021-03-22 17:01:49.407 7f6c9c883700 1 osd.127 253515 not healthy; >>>>> waiting to boot >>>>> 2021-03-22 17:01:50.400 7f6c9c883700 1 osd.127 253515 is_healthy >>>>> false -- only 0/10 up peers (less than 33%) >>>>> 2021-03-22 17:01:50.400 7f6c9c883700 1 osd.127 253515 not healthy; >>>>> waiting to boot >>>>> 2021-03-22 17:01:50.922 7f6c9f088700 -1 --2- 10.1.50.21:0/23673 >> >>>>> [v2:127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667] >>>>> conn(0x56010903e400 0x56011a71fc00 unknown :-1 s=BANNER_CONNECTING pgs=0 >>>>> cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2: >>>>> 127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667] is using msgr V1 >>>>> protocol >>>>> 2021-03-22 17:01:50.922 7f6c9f889700 -1 --2- 10.1.50.21:0/23673 >> >>>>> [v2:127.0.0.1:6821/13015214,v1:127.0.0.1:6831/13015214] >>>>> conn(0x5600df434000 0x56011718e000 unknown :-1 s=BANNER_CONNECTING pgs=0 >>>>> cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2: >>>>> 127.0.0.1:6821/13015214,v1:127.0.0.1:6831/13015214] is using msgr V1 >>>>> protocol >>>>> 2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> >>>>> [v2:127.0.0.1:6826/11091658,v1:127.0.0.1:6828/11091658] >>>>> conn(0x5600f85ed800 0x560109df2a00 unknown :-1 s=BANNER_CONNECTING pgs=0 >>>>> cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2: >>>>> 127.0.0.1:6826/11091658,v1:127.0.0.1:6828/11091658] is using msgr V1 >>>>> protocol >>>>> 2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> >>>>> [v2:127.0.0.1:6859/2683393,v1:127.0.0.1:6862/2683393] >>>>> conn(0x5600f22ea000 0x560117182300 unknown :-1 s=BANNER_CONNECTING pgs=0 >>>>> cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2: >>>>> 127.0.0.1:6859/2683393,v1:127.0.0.1:6862/2683393] is using msgr V1 >>>>> protocol >>>>> 2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> >>>>> [v2:127.0.0.1:6901/15090566,v1:127.0.0.1:6907/15090566] >>>>> conn(0x5600df435c00 0x560139370300 unknown :-1 s=BANNER_CONNECTING pgs=0 >>>>> cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2: >>>>> 127.0.0.1:6901/15090566,v1:127.0.0.1:6907/15090566] is using msgr V1 >>>>> protocol >>>>> 2021-03-22 17:01:51.377 7f6c9c883700 1 osd.127 253515 is_healthy >>>>> false -- only 0/10 up peers (less than 33%) >>>>> 2021-03-22 17:01:51.377 7f6c9c883700 1 osd.127 253515 not healthy; >>>>> waiting to boot >>>>> 2021-03-22 17:01:52.370 7f6c9c883700 1 osd.127 253515 is_healthy >>>>> false -- only 0/10 up peers (less than 33%) >>>>> 2021-03-22 17:01:52.370 7f6c9c883700 1 osd.127 253515 not healthy; >>>>> waiting to boot >>>>> 2021-03-22 17:01:53.377 7f6c9c883700 1 osd.127 253515 is_healthy >>>>> false -- only 0/10 up peers (less than 33%) >>>>> 2021-03-22 17:01:53.377 7f6c9c883700 1 osd.127 253515 not healthy; >>>>> waiting to boot >>>>> 2021-03-22 17:01:54.385 7f6c9c883700 1 osd.127 253515 is_healthy >>>>> false -- only 0/10 up peers (less than 33%) >>>>> 2021-03-22 17:01:54.385 7f6c9c883700 1 osd.127 253515 not healthy; >>>>> waiting to boot >>>>> 2021-03-22 17:01:55.385 7f6c9c883700 1 osd.127 253515 is_healthy >>>>> false -- only 0/10 up peers (less than 33%) >>>>> 2021-03-22 17:01:55.385 7f6c9c883700 1 osd.127 253515 not healthy; >>>>> waiting to boot >>>>> 2021-03-22 17:01:56.362 7f6c9c883700 1 osd.127 253515 is_healthy >>>>> false -- only 0/10 up peers (less than 33%) >>>>> 2021-03-22 17:01:56.362 7f6c9c883700 1 osd.127 253515 not healthy; >>>>> waiting to boot >>>>> 2021-03-22 17:01:57.324 7f6c9c883700 1 osd.127 253515 is_healthy >>>>> false -- only 0/10 up peers (less than 33%) >>>>> 2021-03-22 17:01:57.324 7f6c9c883700 1 osd.127 253515 not healthy; >>>>> waiting to boot >>>>> >>>>> >>>>> >>>>> Any suggestions? >>>>> >>>>> Sam >>>>> >>>>> P.S. an example ceph status as it is now [with everything now on >>>>> 14.2.18, since we had to restart osds anyway]: >>>>> >>>>> cluster: >>>>> id: a1148af2-6eaf-4486-a27e-a05a78c2b378 >>>>> health: HEALTH_WARN >>>>> pauserd,pausewr,noout,nobackfill,norebalance flag(s) set >>>>> 230 osds down >>>>> 4 hosts (80 osds) down >>>>> Reduced data availability: 2048 pgs inactive >>>>> 8 slow ops, oldest one blocked for 901 sec, mon.cephs01 >>>>> has slow ops >>>>> >>>>> services: >>>>> mon: 3 daemons, quorum cephs01,cephs02,cephs03 (age 2h) >>>>> mgr: cephs01(active, since 77m) >>>>> osd: 329 osds: 98 up (since 4s), 328 in (since 4d) >>>>> flags pauserd,pausewr,noout,nobackfill,norebalance >>>>> >>>>> data: >>>>> pools: 3 pools, 2048 pgs >>>>> objects: 0 objects, 0 B >>>>> usage: 0 B used, 0 B / 0 B avail >>>>> pgs: 100.000% pgs unknown >>>>> 2048 unknown >>>>> >>>>> >>>>> >>>>> On Mon, 22 Mar 2021 at 14:57, Dan van der Ster <dan(a)vanderster.com> >>>>> wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> I would unset nodown (hiding osd failures) and norecover (blcoking >>>>>> PGs >>>>>> from recovering degraded objects), then start starting osds. >>>>>> As soon as you have some osd logs reporting some failures, then >>>>>> share those... >>>>>> >>>>>> - Dan >>>>>> >>>>>> On Mon, Mar 22, 2021 at 3:49 PM Sam Skipsey <aoanla(a)gmail.com> >>>>>> wrote: >>>>>> > >>>>>> > So, we started the mons and mgr up again, and here's the relevant >>>>>> logs, including also ceph versions. We've also turned off all of the >>>>>> firewalls on all of the nodes so we know that there can't be network issues >>>>>> [and, indeed, all of our management of the OSDs happens via logins from the >>>>>> service nodes or to each other] >>>>>> > >>>>>> > > ceph status >>>>>> > >>>>>> > >>>>>> > cluster: >>>>>> > id: a1148af2-6eaf-4486-a27e-a05a78c2b378 >>>>>> > health: HEALTH_WARN >>>>>> > >>>>>> pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set >>>>>> > 1 nearfull osd(s) >>>>>> > 3 pool(s) nearfull >>>>>> > Reduced data availability: 2048 pgs inactive >>>>>> > mons cephs01,cephs02,cephs03 are using a lot of disk >>>>>> space >>>>>> > >>>>>> > services: >>>>>> > mon: 3 daemons, quorum cephs01,cephs02,cephs03 (age 61s) >>>>>> > mgr: cephs01(active, since 76s) >>>>>> > osd: 329 osds: 329 up (since 63s), 328 in (since 4d); 466 >>>>>> remapped pgs >>>>>> > flags >>>>>> pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover >>>>>> > >>>>>> > data: >>>>>> > pools: 3 pools, 2048 pgs >>>>>> > objects: 0 objects, 0 B >>>>>> > usage: 0 B used, 0 B / 0 B avail >>>>>> > pgs: 100.000% pgs unknown >>>>>> > 2048 unknown >>>>>> > >>>>>> > >>>>>> > > ceph health detail >>>>>> > >>>>>> > HEALTH_WARN >>>>>> pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set; >>>>>> 1 nearfull osd(s); 3 pool(s) nearfull; Reduced data availability: 2048 pgs >>>>>> inactive; mons cephs01,cephs02,cephs03 are using a lot of disk space >>>>>> > OSDMAP_FLAGS >>>>>> pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set >>>>>> > OSD_NEARFULL 1 nearfull osd(s) >>>>>> > osd.63 is near full >>>>>> > POOL_NEARFULL 3 pool(s) nearfull >>>>>> > pool 'dteam' is nearfull >>>>>> > pool 'atlas' is nearfull >>>>>> > pool 'atlas-localgroup' is nearfull >>>>>> > PG_AVAILABILITY Reduced data availability: 2048 pgs inactive >>>>>> > pg 13.1ef is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 13.1f0 is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 13.1f1 is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 13.1f2 is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 13.1f3 is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 13.1f4 is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 13.1f5 is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 13.1f6 is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 13.1f7 is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 13.1f8 is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 13.1f9 is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 13.1fa is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 13.1fb is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 13.1fc is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 13.1fd is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 13.1fe is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 13.1ff is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 14.1ec is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 14.1f0 is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 14.1f1 is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 14.1f2 is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 14.1f3 is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 14.1f4 is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 14.1f5 is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 14.1f6 is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 14.1f7 is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 14.1f8 is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 14.1f9 is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 14.1fa is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 14.1fb is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 14.1fc is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 14.1fd is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 14.1fe is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 14.1ff is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 15.1ed is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 15.1f0 is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 15.1f1 is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 15.1f2 is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 15.1f3 is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 15.1f4 is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 15.1f5 is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 15.1f6 is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 15.1f7 is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 15.1f8 is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 15.1f9 is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 15.1fa is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 15.1fb is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 15.1fc is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 15.1fd is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 15.1fe is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > pg 15.1ff is stuck inactive for 89.322981, current state >>>>>> unknown, last acting [] >>>>>> > MON_DISK_BIG mons cephs01,cephs02,cephs03 are using a lot of disk >>>>>> space >>>>>> > mon.cephs01 is 96 GiB >= mon_data_size_warn (15 GiB) >>>>>> > mon.cephs02 is 96 GiB >= mon_data_size_warn (15 GiB) >>>>>> > mon.cephs03 is 96 GiB >= mon_data_size_warn (15 GiB) >>>>>> > >>>>>> > >>>>>> > > ceph versions >>>>>> > >>>>>> > { >>>>>> > "mon": { >>>>>> > "ceph version 14.2.18 >>>>>> (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 3 >>>>>> > }, >>>>>> > "mgr": { >>>>>> > "ceph version 14.2.18 >>>>>> (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 1 >>>>>> > }, >>>>>> > "osd": { >>>>>> > "ceph version 14.2.10 >>>>>> (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)": 1, >>>>>> > "ceph version 14.2.15 >>>>>> (afdd217ae5fb1ed3f60e16bd62357ca58cc650e5) nautilus (stable)": 188, >>>>>> > "ceph version 14.2.16 >>>>>> (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)": 18, >>>>>> > "ceph version 14.2.18 >>>>>> (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 122 >>>>>> > }, >>>>>> > >>>>>> > >>>>>> > >>>>>> >>>>>> > >>>>>> > As a note, the log where the mgr explodes (which precipitated all >>>>>> of this) definitely shows the problem occurring on the 12th [when 14.2.17 >>>>>> dropped], but things didn't "break" until we tried upgrading OSDs to >>>>>> 14.2.18... >>>>>> > >>>>>> > >>>>>> > Sam >>>>>> > >>>>>> > >>>>>> > On Mon, 22 Mar 2021 at 12:20, Sam Skipsey <aoanla(a)gmail.com> >>>>>> wrote: >>>>>> >> >>>>>> >> Hi Dan: >>>>>> >> >>>>>> >> Thanks for the reply - at present, our mons and mgrs are off >>>>>> [because of the unsustainable nature of the filesystem usage]. We'll try >>>>>> putting them on again for long enough to get "ceph status" out of them, but >>>>>> because the mgr was unable to actually talk to anything, and reply at that >>>>>> point. >>>>>> >> >>>>>> >> (And thanks for the link to the bug tracker - I guess this >>>>>> mismatch of expectations is why the devs are so keen to move to >>>>>> containerised deployments where there is no co-location of different types >>>>>> of server, as it means they don't need to worry as much about the >>>>>> assumptions about when it's okay to restart a service on package update. >>>>>> Disappointing that it seems stale after 2 years...) >>>>>> >> >>>>>> >> Sam >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> >> On Mon, 22 Mar 2021 at 12:11, Dan van der Ster < >>>>>> dan(a)vanderster.com> wrote: >>>>>> >>> >>>>>> >>> Hi Sam, >>>>>> >>> >>>>>> >>> The daemons restart (for *some* releases) because of this: >>>>>> >>> https://tracker.ceph.com/issues/21672 >>>>>> >>> In short, if the selinux module changes, and if you have selinux >>>>>> >>> enabled, then midway through yum update, there will be a >>>>>> systemctl >>>>>> >>> restart ceph.target issued. >>>>>> >>> >>>>>> >>> For the rest -- I think you should focus on getting the PGs all >>>>>> >>> active+clean as soon as possible, because the degraded and >>>>>> remapped >>>>>> >>> states are what leads to mon / osdmap growth. >>>>>> >>> This kind of scenario is why we wrote this tool: >>>>>> >>> >>>>>> https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-rema… >>>>>> >>> It will use pg-upmap-items to force the PGs to the OSDs where >>>>>> they are >>>>>> >>> currently residing. >>>>>> >>> >>>>>> >>> But there is some clarification needed before you go ahead with >>>>>> that. >>>>>> >>> Could you share `ceph status`, `ceph health detail`? >>>>>> >>> >>>>>> >>> Cheers, Dan >>>>>> >>> >>>>>> >>> >>>>>> >>> On Mon, Mar 22, 2021 at 12:05 PM Sam Skipsey <aoanla(a)gmail.com> >>>>>> wrote: >>>>>> >>> > >>>>>> >>> > Hi everyone: >>>>>> >>> > >>>>>> >>> > I posted to the list on Friday morning (UK time), but >>>>>> apparently my email >>>>>> >>> > is still in moderation (I have an email from the list bot >>>>>> telling me that >>>>>> >>> > it's held for moderation but no updates). >>>>>> >>> > >>>>>> >>> > Since this is a bit urgent - we have ~3PB of storage offline - >>>>>> I'm posting >>>>>> >>> > again. >>>>>> >>> > >>>>>> >>> > To save retyping the whole thing, I will direct you to a copy >>>>>> of the email >>>>>> >>> > I wrote on Friday: >>>>>> >>> > >>>>>> >>> > http://aoanla.pythonanywhere.com/Logs/EmailToCephUsers.txt >>>>>> >>> > >>>>>> >>> > (Since that was sent, we did successfully add big SSDs to the >>>>>> MON hosts so >>>>>> >>> > they don't fill up their disks with store.db s). >>>>>> >>> > >>>>>> >>> > I would appreciate any advice - assuming this also doesn't get >>>>>> stuck in >>>>>> >>> > moderation queues. >>>>>> >>> > >>>>>> >>> > -- >>>>>> >>> > Sam Skipsey (he/him, they/them) >>>>>> >>> > _______________________________________________ >>>>>> >>> > ceph-users mailing list -- ceph-users(a)ceph.io >>>>>> >>> > To unsubscribe send an email to ceph-users-leave(a)ceph.io >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> >> -- >>>>>> >> Sam Skipsey (he/him, they/them) >>>>>> >> >>>>>> >> >>>>>> > >>>>>> > >>>>>> > -- >>>>>> > Sam Skipsey (he/him, they/them) >>>>>> > >>>>>> > >>>>>> >>>>> >>>>> >>>>> -- >>>>> Sam Skipsey (he/him, they/them) >>>>> >>>>> >>>>> >>> >>> -- >>> Sam Skipsey (he/him, they/them) >>> >>> >>> > > -- > Sam Skipsey (he/him, they/them) > > >

-- Sam Skipsey (he/him, they/them)

Dan van der Ster

23 Mar 23 Mar

12:29 a.m.

Hi Sam, Yeah somehow `lo:` is not getting skipped, probably due to those patches. (I guess it is because the 2nd patch looks for `lo:` but in fact the ifa_name is probably just `lo` without the colon) https://github.com/ceph/ceph/blob/master/src/common/ipaddr.cc#L110 I don't know why this impacts you but not us -- we already upgraded one of our clusters to 14.2.18 on Centos 8, and ceph is choosing the correct interface without needing any network options. And lo: is the first interface [1] here too. Could it be as simple as the iface names being sorted alphabetically? Here we have ens785f0 which would come before lo, but your interface `p2p2` would come after. -- dan [1] # ip a 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: eno1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000 link/ether a4:bf:01:60:67:a0 brd ff:ff:ff:ff:ff:ff 3: ens785f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000 link/ether 0c:42:a1:ad:36:9a brd ff:ff:ff:ff:ff:ff inet 10.116.6.8/26 brd 10.116.6.63 scope global dynamic noprefixroute ens785f0 valid_lft 432177sec preferred_lft 432177sec inet6 fd01:1458:e00:1e::100:5/128 scope global dynamic noprefixroute valid_lft 513502sec preferred_lft 513502sec inet6 fe80::bdbd:76be:63fd:a4c2/64 scope link noprefixroute valid_lft forever preferred_lft forever 4: ens785f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000 link/ether 0c:42:a1:ad:36:9b brd ff:ff:ff:ff:ff:ff 5: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000 link/ether a4:bf:01:60:67:a1 brd ff:ff:ff:ff:ff:ff\ On Mon, Mar 22, 2021 at 8:35 PM Sam Skipsey <aoanla(a)gmail.com> wrote: > > Hi Dan: > > Aha - I think the first commit is probably it - before that commit, the fact that lo is highest in the interfaces enumeration didn't matter for us [since it would always be skipped]. > > This actually almost certainly also is associated with that other site with a similar problem (OSDs drop out until you restart the network interface), since I imagine that would reorder the interface list. > > Playing with our public and cluster bind address explicitly does seem to help, so we'll iterate on that and get to a suitable ceph.conf. > > Thanks for the help [and it was the network all along]! > > > Sam > > On Mon, 22 Mar 2021 at 19:12, Dan van der Ster <dan(a)vanderster.com> wrote: >> >> There are two commits between 14.2.16 and 14.2.18 related to loopback network. Perhaps one of these is responsible for your issue [1]. >> >> I'd try playing with the options like cluster/public bind addr and cluster/public bind interface until you can convince the osd to bind to the correct listening IP. >> >> (That said, i don't know which version you're running on the logs shared earlier. But I think you should try to get 14.2.18 working anyway). >> >> .. dan >> >> [1] >> >> > git log v14.2.18...v14.2.16 ipaddr.cc commit 89321762ad4cfdd1a68cae467181bdd1a501f14d >> Author: Thomas Goirand <zigo(a)debian.org> >> Date: Fri Jan 15 10:50:05 2021 +0100 >> >> common/ipaddr: Allow binding on lo >> >> Commmit 5cf0fa872231f4eaf8ce6565a04ed675ba5b689b, solves the issue that >> the osd can't restart after seting a virtual local loopback IP. However, >> this commit also prevents a bgp-to-the-host over unumbered Ipv6 >> local-link is setup, where OSD typically are bound to the lo interface. >> >> To solve this, this single char patch simply checks against "lo:" to >> match only virtual interfaces instead of anything that starts with "lo". >> >> Fixes: https://tracker.ceph.com/issues/48893 >> Signed-off-by: Thomas Goirand <zigo(a)debian.org> >> (cherry picked from commit 201b59204374ebdab91bb554b986577a97b19c36) >> >> commit b52cae90d67eb878b3ddfe547b8bf16e0d4d1a45 >> Author: lijaiwei1 <lijiawei1(a)chinatelecom.cn> >> Date: Tue Dec 24 22:34:46 2019 +0800 >> >> common: skip interfaces starting with "lo" in find_ipv{4,6}_in_subnet() >> >> This will solve the issue that the osd can't restart after seting a >> virtual local loopback IP. >> In find_ipv4_in_subnet() and find_ipv6_in_subnet(), I use >> boost::starts_with(addrs->ifa_name, "lo") to ship the interfaces >> starting with "lo". >> >> Fixes: https://tracker.ceph.com/issues/43417 >> Signed-off-by: Jiawei Li <lijiawei1(a)chinatelecom.cn> >> (cherry picked from commit 5cf0fa872231f4eaf8ce6565a04ed675ba5b689b) >> >> >> >> >> >> On Mon, Mar 22, 2021, 7:42 PM Sam Skipsey <aoanla(a)gmail.com> wrote: >>> >>> I don't think we explicitly set any ms settings in the OSD host ceph.conf [all the OSDs ceph.confs are identical across the entire cluster]. >>> >>> ip a gives: >>> >>> ip a >>> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 >>> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 >>> inet 127.0.0.1/8 scope host lo >>> valid_lft forever preferred_lft forever >>> inet6 ::1/128 scope host >>> valid_lft forever preferred_lft forever >>> 2: em1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000 >>> link/ether 4c:d9:8f:55:92:f6 brd ff:ff:ff:ff:ff:ff >>> 3: em2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000 >>> link/ether 4c:d9:8f:55:92:f7 brd ff:ff:ff:ff:ff:ff >>> 4: p2p1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000 >>> link/ether b4:96:91:3f:62:20 brd ff:ff:ff:ff:ff:ff >>> 5: p2p2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000 >>> link/ether b4:96:91:3f:62:22 brd ff:ff:ff:ff:ff:ff >>> inet 10.1.50.21/8 brd 10.255.255.255 scope global noprefixroute p2p2 >>> valid_lft forever preferred_lft forever >>> inet6 fe80::b696:91ff:fe3f:6222/64 scope link noprefixroute >>> valid_lft forever preferred_lft forever >>> >>> (where here p2p2 is the only active network link, and is also the private and public network for the ceph cluster) >>> >>> The output is similar on other hosts - with p2p2 either at position 3 or 5 depending on the order the interfaces were enumerated. >>> >>> Sam >>> >>> On Mon, 22 Mar 2021 at 17:34, Dan van der Ster <dan(a)vanderster.com> wrote: >>>> >>>> Which `ms` settings do you have in the OSD host's ceph.conf or the ceph config dump? >>>> >>>> And how does `ip a` look on one of these hosts where the osd is registering itself as 127.0.0.1? >>>> >>>> >>>> You might as well set nodown again now. This will make ops pile up, but that's the least of your concerns at the moment. >>>> (With osds flapping the osdmaps churn and that inflates the mon store) >>>> >>>> .. Dan >>>> >>>> On Mon, Mar 22, 2021, 6:28 PM Sam Skipsey <aoanla(a)gmail.com> wrote: >>>>> >>>>> Hm, yes it does [and I was wondering why loopbacks were showing up suddenly in the logs]. This wasn't happening with 14.2.16 so what's changed about how we specify stuff? >>>>> >>>>> This might correlate with the other person on the IRC list who has problems with 14.2.18 and their OSDs deciding they don't work sometimes until they forcibly restart their network links... >>>>> >>>>> >>>>> Sam >>>>> >>>>> On Mon, 22 Mar 2021 at 17:20, Dan van der Ster <dan(a)vanderster.com> wrote: >>>>>> >>>>>> What's with the OSDs having loopback addresses? E.g. v2:127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667 >>>>>> >>>>>> Does `ceph osd dump` show those same loopback addresses for each OSD? >>>>>> >>>>>> This sounds familiar... I'm trying to find the recent ticket. >>>>>> >>>>>> .. dan >>>>>> >>>>>> >>>>>> On Mon, Mar 22, 2021, 6:07 PM Sam Skipsey <aoanla(a)gmail.com> wrote: >>>>>>> >>>>>>> hi Dan: >>>>>>> >>>>>>> So, unsetting nodown results in... almost all of the OSDs being marked down. (231 down out of 328). >>>>>>> Checking the actual OSD services, most of them were actually up and active on the nodes, even when the mons had marked them down. >>>>>>> (On a few nodes, the down services corresponded to OSDs that had been flapping - but increasing osd_max_markdown locally to keep them up despite the previous flapping, and restarting the services... didn't help.) >>>>>>> >>>>>>> In fact, starting up the few OSD services which had actually stopped, resulted in a different set of OSDs being marked down, and some others coming up. >>>>>>> We currently have a sort of "rolling OSD outness" passing through the cluster - there's always ~230 OSDs marked down now, but which ones those are changes (we've had everything from 1 HOST down to 4 HOSTS down over the past 14 minutes as things fluctuate. >>>>>>> >>>>>>> A log from one of the "down" OSDs [which is actually running, and on the same host as OSDs which are marked up] shows this worrying snippet >>>>>>> >>>>>>> 2021-03-22 17:01:45.298 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) >>>>>>> 2021-03-22 17:01:45.298 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot >>>>>>> 2021-03-22 17:01:46.340 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) >>>>>>> 2021-03-22 17:01:46.340 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot >>>>>>> 2021-03-22 17:01:47.376 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) >>>>>>> 2021-03-22 17:01:47.376 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot >>>>>>> 2021-03-22 17:01:48.395 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) >>>>>>> 2021-03-22 17:01:48.395 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot >>>>>>> 2021-03-22 17:01:49.407 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) >>>>>>> 2021-03-22 17:01:49.407 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot >>>>>>> 2021-03-22 17:01:50.400 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) >>>>>>> 2021-03-22 17:01:50.400 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot >>>>>>> 2021-03-22 17:01:50.922 7f6c9f088700 -1 --2- 10.1.50.21:0/23673 >> [v2:127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667] conn(0x56010903e400 0x56011a71fc00 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2:127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667] is using msgr V1 protocol >>>>>>> 2021-03-22 17:01:50.922 7f6c9f889700 -1 --2- 10.1.50.21:0/23673 >> [v2:127.0.0.1:6821/13015214,v1:127.0.0.1:6831/13015214] conn(0x5600df434000 0x56011718e000 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2:127.0.0.1:6821/13015214,v1:127.0.0.1:6831/13015214] is using msgr V1 protocol >>>>>>> 2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> [v2:127.0.0.1:6826/11091658,v1:127.0.0.1:6828/11091658] conn(0x5600f85ed800 0x560109df2a00 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2:127.0.0.1:6826/11091658,v1:127.0.0.1:6828/11091658] is using msgr V1 protocol >>>>>>> 2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> [v2:127.0.0.1:6859/2683393,v1:127.0.0.1:6862/2683393] conn(0x5600f22ea000 0x560117182300 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2:127.0.0.1:6859/2683393,v1:127.0.0.1:6862/2683393] is using msgr V1 protocol >>>>>>> 2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> [v2:127.0.0.1:6901/15090566,v1:127.0.0.1:6907/15090566] conn(0x5600df435c00 0x560139370300 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2:127.0.0.1:6901/15090566,v1:127.0.0.1:6907/15090566] is using msgr V1 protocol >>>>>>> 2021-03-22 17:01:51.377 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) >>>>>>> 2021-03-22 17:01:51.377 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot >>>>>>> 2021-03-22 17:01:52.370 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) >>>>>>> 2021-03-22 17:01:52.370 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot >>>>>>> 2021-03-22 17:01:53.377 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) >>>>>>> 2021-03-22 17:01:53.377 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot >>>>>>> 2021-03-22 17:01:54.385 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) >>>>>>> 2021-03-22 17:01:54.385 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot >>>>>>> 2021-03-22 17:01:55.385 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) >>>>>>> 2021-03-22 17:01:55.385 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot >>>>>>> 2021-03-22 17:01:56.362 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) >>>>>>> 2021-03-22 17:01:56.362 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot >>>>>>> 2021-03-22 17:01:57.324 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) >>>>>>> 2021-03-22 17:01:57.324 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot >>>>>>> >>>>>>> >>>>>>> >>>>>>> Any suggestions? >>>>>>> >>>>>>> Sam >>>>>>> >>>>>>> P.S. an example ceph status as it is now [with everything now on 14.2.18, since we had to restart osds anyway]: >>>>>>> >>>>>>> cluster: >>>>>>> id: a1148af2-6eaf-4486-a27e-a05a78c2b378 >>>>>>> health: HEALTH_WARN >>>>>>> pauserd,pausewr,noout,nobackfill,norebalance flag(s) set >>>>>>> 230 osds down >>>>>>> 4 hosts (80 osds) down >>>>>>> Reduced data availability: 2048 pgs inactive >>>>>>> 8 slow ops, oldest one blocked for 901 sec, mon.cephs01 has slow ops >>>>>>> >>>>>>> services: >>>>>>> mon: 3 daemons, quorum cephs01,cephs02,cephs03 (age 2h) >>>>>>> mgr: cephs01(active, since 77m) >>>>>>> osd: 329 osds: 98 up (since 4s), 328 in (since 4d) >>>>>>> flags pauserd,pausewr,noout,nobackfill,norebalance >>>>>>> >>>>>>> data: >>>>>>> pools: 3 pools, 2048 pgs >>>>>>> objects: 0 objects, 0 B >>>>>>> usage: 0 B used, 0 B / 0 B avail >>>>>>> pgs: 100.000% pgs unknown >>>>>>> 2048 unknown >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Mon, 22 Mar 2021 at 14:57, Dan van der Ster <dan(a)vanderster.com> wrote: >>>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> I would unset nodown (hiding osd failures) and norecover (blcoking PGs >>>>>>>> from recovering degraded objects), then start starting osds. >>>>>>>> As soon as you have some osd logs reporting some failures, then share those... >>>>>>>> >>>>>>>> - Dan >>>>>>>> >>>>>>>> On Mon, Mar 22, 2021 at 3:49 PM Sam Skipsey <aoanla(a)gmail.com> wrote: >>>>>>>> > >>>>>>>> > So, we started the mons and mgr up again, and here's the relevant logs, including also ceph versions. We've also turned off all of the firewalls on all of the nodes so we know that there can't be network issues [and, indeed, all of our management of the OSDs happens via logins from the service nodes or to each other] >>>>>>>> > >>>>>>>> > > ceph status >>>>>>>> > >>>>>>>> > >>>>>>>> > cluster: >>>>>>>> > id: a1148af2-6eaf-4486-a27e-a05a78c2b378 >>>>>>>> > health: HEALTH_WARN >>>>>>>> > pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set >>>>>>>> > 1 nearfull osd(s) >>>>>>>> > 3 pool(s) nearfull >>>>>>>> > Reduced data availability: 2048 pgs inactive >>>>>>>> > mons cephs01,cephs02,cephs03 are using a lot of disk space >>>>>>>> > >>>>>>>> > services: >>>>>>>> > mon: 3 daemons, quorum cephs01,cephs02,cephs03 (age 61s) >>>>>>>> > mgr: cephs01(active, since 76s) >>>>>>>> > osd: 329 osds: 329 up (since 63s), 328 in (since 4d); 466 remapped pgs >>>>>>>> > flags pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover >>>>>>>> > >>>>>>>> > data: >>>>>>>> > pools: 3 pools, 2048 pgs >>>>>>>> > objects: 0 objects, 0 B >>>>>>>> > usage: 0 B used, 0 B / 0 B avail >>>>>>>> > pgs: 100.000% pgs unknown >>>>>>>> > 2048 unknown >>>>>>>> > >>>>>>>> > >>>>>>>> > > ceph health detail >>>>>>>> > >>>>>>>> > HEALTH_WARN pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set; 1 nearfull osd(s); 3 pool(s) nearfull; Reduced data availability: 2048 pgs inactive; mons cephs01,cephs02,cephs03 are using a lot of disk space >>>>>>>> > OSDMAP_FLAGS pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set >>>>>>>> > OSD_NEARFULL 1 nearfull osd(s) >>>>>>>> > osd.63 is near full >>>>>>>> > POOL_NEARFULL 3 pool(s) nearfull >>>>>>>> > pool 'dteam' is nearfull >>>>>>>> > pool 'atlas' is nearfull >>>>>>>> > pool 'atlas-localgroup' is nearfull >>>>>>>> > PG_AVAILABILITY Reduced data availability: 2048 pgs inactive >>>>>>>> > pg 13.1ef is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 13.1f0 is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 13.1f1 is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 13.1f2 is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 13.1f3 is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 13.1f4 is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 13.1f5 is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 13.1f6 is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 13.1f7 is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 13.1f8 is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 13.1f9 is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 13.1fa is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 13.1fb is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 13.1fc is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 13.1fd is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 13.1fe is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 13.1ff is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 14.1ec is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 14.1f0 is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 14.1f1 is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 14.1f2 is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 14.1f3 is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 14.1f4 is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 14.1f5 is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 14.1f6 is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 14.1f7 is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 14.1f8 is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 14.1f9 is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 14.1fa is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 14.1fb is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 14.1fc is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 14.1fd is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 14.1fe is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 14.1ff is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 15.1ed is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 15.1f0 is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 15.1f1 is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 15.1f2 is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 15.1f3 is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 15.1f4 is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 15.1f5 is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 15.1f6 is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 15.1f7 is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 15.1f8 is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 15.1f9 is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 15.1fa is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 15.1fb is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 15.1fc is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 15.1fd is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 15.1fe is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > pg 15.1ff is stuck inactive for 89.322981, current state unknown, last acting [] >>>>>>>> > MON_DISK_BIG mons cephs01,cephs02,cephs03 are using a lot of disk space >>>>>>>> > mon.cephs01 is 96 GiB >= mon_data_size_warn (15 GiB) >>>>>>>> > mon.cephs02 is 96 GiB >= mon_data_size_warn (15 GiB) >>>>>>>> > mon.cephs03 is 96 GiB >= mon_data_size_warn (15 GiB) >>>>>>>> > >>>>>>>> > >>>>>>>> > > ceph versions >>>>>>>> > >>>>>>>> > { >>>>>>>> > "mon": { >>>>>>>> > "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 3 >>>>>>>> > }, >>>>>>>> > "mgr": { >>>>>>>> > "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 1 >>>>>>>> > }, >>>>>>>> > "osd": { >>>>>>>> > "ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)": 1, >>>>>>>> > "ceph version 14.2.15 (afdd217ae5fb1ed3f60e16bd62357ca58cc650e5) nautilus (stable)": 188, >>>>>>>> > "ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)": 18, >>>>>>>> > "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 122 >>>>>>>> > }, >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>> >>>>>>>> > >>>>>>>> > As a note, the log where the mgr explodes (which precipitated all of this) definitely shows the problem occurring on the 12th [when 14.2.17 dropped], but things didn't "break" until we tried upgrading OSDs to 14.2.18... >>>>>>>> > >>>>>>>> > >>>>>>>> > Sam >>>>>>>> > >>>>>>>> > >>>>>>>> > On Mon, 22 Mar 2021 at 12:20, Sam Skipsey <aoanla(a)gmail.com> wrote: >>>>>>>> >> >>>>>>>> >> Hi Dan: >>>>>>>> >> >>>>>>>> >> Thanks for the reply - at present, our mons and mgrs are off [because of the unsustainable nature of the filesystem usage]. We'll try putting them on again for long enough to get "ceph status" out of them, but because the mgr was unable to actually talk to anything, and reply at that point. >>>>>>>> >> >>>>>>>> >> (And thanks for the link to the bug tracker - I guess this mismatch of expectations is why the devs are so keen to move to containerised deployments where there is no co-location of different types of server, as it means they don't need to worry as much about the assumptions about when it's okay to restart a service on package update. Disappointing that it seems stale after 2 years...) >>>>>>>> >> >>>>>>>> >> Sam >>>>>>>> >> >>>>>>>> >> >>>>>>>> >> >>>>>>>> >> On Mon, 22 Mar 2021 at 12:11, Dan van der Ster <dan(a)vanderster.com> wrote: >>>>>>>> >>> >>>>>>>> >>> Hi Sam, >>>>>>>> >>> >>>>>>>> >>> The daemons restart (for *some* releases) because of this: >>>>>>>> >>> https://tracker.ceph.com/issues/21672 >>>>>>>> >>> In short, if the selinux module changes, and if you have selinux >>>>>>>> >>> enabled, then midway through yum update, there will be a systemctl >>>>>>>> >>> restart ceph.target issued. >>>>>>>> >>> >>>>>>>> >>> For the rest -- I think you should focus on getting the PGs all >>>>>>>> >>> active+clean as soon as possible, because the degraded and remapped >>>>>>>> >>> states are what leads to mon / osdmap growth. >>>>>>>> >>> This kind of scenario is why we wrote this tool: >>>>>>>> >>> https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-rema… >>>>>>>> >>> It will use pg-upmap-items to force the PGs to the OSDs where they are >>>>>>>> >>> currently residing. >>>>>>>> >>> >>>>>>>> >>> But there is some clarification needed before you go ahead with that. >>>>>>>> >>> Could you share `ceph status`, `ceph health detail`? >>>>>>>> >>> >>>>>>>> >>> Cheers, Dan >>>>>>>> >>> >>>>>>>> >>> >>>>>>>> >>> On Mon, Mar 22, 2021 at 12:05 PM Sam Skipsey <aoanla(a)gmail.com> wrote: >>>>>>>> >>> > >>>>>>>> >>> > Hi everyone: >>>>>>>> >>> > >>>>>>>> >>> > I posted to the list on Friday morning (UK time), but apparently my email >>>>>>>> >>> > is still in moderation (I have an email from the list bot telling me that >>>>>>>> >>> > it's held for moderation but no updates). >>>>>>>> >>> > >>>>>>>> >>> > Since this is a bit urgent - we have ~3PB of storage offline - I'm posting >>>>>>>> >>> > again. >>>>>>>> >>> > >>>>>>>> >>> > To save retyping the whole thing, I will direct you to a copy of the email >>>>>>>> >>> > I wrote on Friday: >>>>>>>> >>> > >>>>>>>> >>> > http://aoanla.pythonanywhere.com/Logs/EmailToCephUsers.txt >>>>>>>> >>> > >>>>>>>> >>> > (Since that was sent, we did successfully add big SSDs to the MON hosts so >>>>>>>> >>> > they don't fill up their disks with store.db s). >>>>>>>> >>> > >>>>>>>> >>> > I would appreciate any advice - assuming this also doesn't get stuck in >>>>>>>> >>> > moderation queues. >>>>>>>> >>> > >>>>>>>> >>> > -- >>>>>>>> >>> > Sam Skipsey (he/him, they/them) >>>>>>>> >>> > _______________________________________________ >>>>>>>> >>> > ceph-users mailing list -- ceph-users(a)ceph.io >>>>>>>> >>> > To unsubscribe send an email to ceph-users-leave(a)ceph.io >>>>>>>> >> >>>>>>>> >> >>>>>>>> >> >>>>>>>> >> -- >>>>>>>> >> Sam Skipsey (he/him, they/them) >>>>>>>> >> >>>>>>>> >> >>>>>>>> > >>>>>>>> > >>>>>>>> > -- >>>>>>>> > Sam Skipsey (he/him, they/them) >>>>>>>> > >>>>>>>> > >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Sam Skipsey (he/him, they/them) >>>>>>> >>>>>>> >>>>> >>>>> >>>>> -- >>>>> Sam Skipsey (he/him, they/them) >>>>> >>>>> >>> >>> >>> -- >>> Sam Skipsey (he/him, they/them) >>> >>> > > > -- > Sam Skipsey (he/him, they/them) > >

Dan van der Ster

1:18 a.m.

Sam, see https://tracker.ceph.com/issues/49938 and https://github.com/ceph/ceph/pull/40334 On Tue, Mar 23, 2021 at 8:29 AM Dan van der Ster <dan(a)vanderster.com> wrote: > > Hi Sam, > > Yeah somehow `lo:` is not getting skipped, probably due to those > patches. (I guess it is because the 2nd patch looks for `lo:` but in > fact the ifa_name is probably just `lo` without the colon) > > https://github.com/ceph/ceph/blob/master/src/common/ipaddr.cc#L110 > > I don't know why this impacts you but not us -- we already upgraded > one of our clusters to 14.2.18 on Centos 8, and ceph is choosing the > correct interface without needing any network options. And lo: is the > first interface [1] here too. > Could it be as simple as the iface names being sorted alphabetically? > Here we have ens785f0 which would come before lo, but your interface > `p2p2` would come after. > > -- dan > > [1] > # ip a > 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN > group default qlen 1000 > link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 > inet 127.0.0.1/8 scope host lo > valid_lft forever preferred_lft forever > inet6 ::1/128 scope host > valid_lft forever preferred_lft forever > 2: eno1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state > DOWN group default qlen 1000 > link/ether a4:bf:01:60:67:a0 brd ff:ff:ff:ff:ff:ff > 3: ens785f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state > UP group default qlen 1000 > link/ether 0c:42:a1:ad:36:9a brd ff:ff:ff:ff:ff:ff > inet 10.116.6.8/26 brd 10.116.6.63 scope global dynamic > noprefixroute ens785f0 > valid_lft 432177sec preferred_lft 432177sec > inet6 fd01:1458:e00:1e::100:5/128 scope global dynamic noprefixroute > valid_lft 513502sec preferred_lft 513502sec > inet6 fe80::bdbd:76be:63fd:a4c2/64 scope link noprefixroute > valid_lft forever preferred_lft forever > 4: ens785f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq > state DOWN group default qlen 1000 > link/ether 0c:42:a1:ad:36:9b brd ff:ff:ff:ff:ff:ff > 5: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state > DOWN group default qlen 1000 > link/ether a4:bf:01:60:67:a1 brd ff:ff:ff:ff:ff:ff\ > > On Mon, Mar 22, 2021 at 8:35 PM Sam Skipsey <aoanla(a)gmail.com> wrote: > > > > Hi Dan: > > > > Aha - I think the first commit is probably it - before that commit, the fact that lo is highest in the interfaces enumeration didn't matter for us [since it would always be skipped]. > > > > This actually almost certainly also is associated with that other site with a similar problem (OSDs drop out until you restart the network interface), since I imagine that would reorder the interface list. > > > > Playing with our public and cluster bind address explicitly does seem to help, so we'll iterate on that and get to a suitable ceph.conf. > > > > Thanks for the help [and it was the network all along]! > > > > > > Sam > > > > On Mon, 22 Mar 2021 at 19:12, Dan van der Ster <dan(a)vanderster.com> wrote: > >> > >> There are two commits between 14.2.16 and 14.2.18 related to loopback network. Perhaps one of these is responsible for your issue [1]. > >> > >> I'd try playing with the options like cluster/public bind addr and cluster/public bind interface until you can convince the osd to bind to the correct listening IP. > >> > >> (That said, i don't know which version you're running on the logs shared earlier. But I think you should try to get 14.2.18 working anyway). > >> > >> .. dan > >> > >> [1] > >> > >> > git log v14.2.18...v14.2.16 ipaddr.cc commit 89321762ad4cfdd1a68cae467181bdd1a501f14d > >> Author: Thomas Goirand <zigo(a)debian.org> > >> Date: Fri Jan 15 10:50:05 2021 +0100 > >> > >> common/ipaddr: Allow binding on lo > >> > >> Commmit 5cf0fa872231f4eaf8ce6565a04ed675ba5b689b, solves the issue that > >> the osd can't restart after seting a virtual local loopback IP. However, > >> this commit also prevents a bgp-to-the-host over unumbered Ipv6 > >> local-link is setup, where OSD typically are bound to the lo interface. > >> > >> To solve this, this single char patch simply checks against "lo:" to > >> match only virtual interfaces instead of anything that starts with "lo". > >> > >> Fixes: https://tracker.ceph.com/issues/48893 > >> Signed-off-by: Thomas Goirand <zigo(a)debian.org> > >> (cherry picked from commit 201b59204374ebdab91bb554b986577a97b19c36) > >> > >> commit b52cae90d67eb878b3ddfe547b8bf16e0d4d1a45 > >> Author: lijaiwei1 <lijiawei1(a)chinatelecom.cn> > >> Date: Tue Dec 24 22:34:46 2019 +0800 > >> > >> common: skip interfaces starting with "lo" in find_ipv{4,6}_in_subnet() > >> > >> This will solve the issue that the osd can't restart after seting a > >> virtual local loopback IP. > >> In find_ipv4_in_subnet() and find_ipv6_in_subnet(), I use > >> boost::starts_with(addrs->ifa_name, "lo") to ship the interfaces > >> starting with "lo". > >> > >> Fixes: https://tracker.ceph.com/issues/43417 > >> Signed-off-by: Jiawei Li <lijiawei1(a)chinatelecom.cn> > >> (cherry picked from commit 5cf0fa872231f4eaf8ce6565a04ed675ba5b689b) > >> > >> > >> > >> > >> > >> On Mon, Mar 22, 2021, 7:42 PM Sam Skipsey <aoanla(a)gmail.com> wrote: > >>> > >>> I don't think we explicitly set any ms settings in the OSD host ceph.conf [all the OSDs ceph.confs are identical across the entire cluster]. > >>> > >>> ip a gives: > >>> > >>> ip a > >>> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 > >>> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 > >>> inet 127.0.0.1/8 scope host lo > >>> valid_lft forever preferred_lft forever > >>> inet6 ::1/128 scope host > >>> valid_lft forever preferred_lft forever > >>> 2: em1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000 > >>> link/ether 4c:d9:8f:55:92:f6 brd ff:ff:ff:ff:ff:ff > >>> 3: em2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000 > >>> link/ether 4c:d9:8f:55:92:f7 brd ff:ff:ff:ff:ff:ff > >>> 4: p2p1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000 > >>> link/ether b4:96:91:3f:62:20 brd ff:ff:ff:ff:ff:ff > >>> 5: p2p2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000 > >>> link/ether b4:96:91:3f:62:22 brd ff:ff:ff:ff:ff:ff > >>> inet 10.1.50.21/8 brd 10.255.255.255 scope global noprefixroute p2p2 > >>> valid_lft forever preferred_lft forever > >>> inet6 fe80::b696:91ff:fe3f:6222/64 scope link noprefixroute > >>> valid_lft forever preferred_lft forever > >>> > >>> (where here p2p2 is the only active network link, and is also the private and public network for the ceph cluster) > >>> > >>> The output is similar on other hosts - with p2p2 either at position 3 or 5 depending on the order the interfaces were enumerated. > >>> > >>> Sam > >>> > >>> On Mon, 22 Mar 2021 at 17:34, Dan van der Ster <dan(a)vanderster.com> wrote: > >>>> > >>>> Which `ms` settings do you have in the OSD host's ceph.conf or the ceph config dump? > >>>> > >>>> And how does `ip a` look on one of these hosts where the osd is registering itself as 127.0.0.1? > >>>> > >>>> > >>>> You might as well set nodown again now. This will make ops pile up, but that's the least of your concerns at the moment. > >>>> (With osds flapping the osdmaps churn and that inflates the mon store) > >>>> > >>>> .. Dan > >>>> > >>>> On Mon, Mar 22, 2021, 6:28 PM Sam Skipsey <aoanla(a)gmail.com> wrote: > >>>>> > >>>>> Hm, yes it does [and I was wondering why loopbacks were showing up suddenly in the logs]. This wasn't happening with 14.2.16 so what's changed about how we specify stuff? > >>>>> > >>>>> This might correlate with the other person on the IRC list who has problems with 14.2.18 and their OSDs deciding they don't work sometimes until they forcibly restart their network links... > >>>>> > >>>>> > >>>>> Sam > >>>>> > >>>>> On Mon, 22 Mar 2021 at 17:20, Dan van der Ster <dan(a)vanderster.com> wrote: > >>>>>> > >>>>>> What's with the OSDs having loopback addresses? E.g. v2:127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667 > >>>>>> > >>>>>> Does `ceph osd dump` show those same loopback addresses for each OSD? > >>>>>> > >>>>>> This sounds familiar... I'm trying to find the recent ticket. > >>>>>> > >>>>>> .. dan > >>>>>> > >>>>>> > >>>>>> On Mon, Mar 22, 2021, 6:07 PM Sam Skipsey <aoanla(a)gmail.com> wrote: > >>>>>>> > >>>>>>> hi Dan: > >>>>>>> > >>>>>>> So, unsetting nodown results in... almost all of the OSDs being marked down. (231 down out of 328). > >>>>>>> Checking the actual OSD services, most of them were actually up and active on the nodes, even when the mons had marked them down. > >>>>>>> (On a few nodes, the down services corresponded to OSDs that had been flapping - but increasing osd_max_markdown locally to keep them up despite the previous flapping, and restarting the services... didn't help.) > >>>>>>> > >>>>>>> In fact, starting up the few OSD services which had actually stopped, resulted in a different set of OSDs being marked down, and some others coming up. > >>>>>>> We currently have a sort of "rolling OSD outness" passing through the cluster - there's always ~230 OSDs marked down now, but which ones those are changes (we've had everything from 1 HOST down to 4 HOSTS down over the past 14 minutes as things fluctuate. > >>>>>>> > >>>>>>> A log from one of the "down" OSDs [which is actually running, and on the same host as OSDs which are marked up] shows this worrying snippet > >>>>>>> > >>>>>>> 2021-03-22 17:01:45.298 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) > >>>>>>> 2021-03-22 17:01:45.298 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot > >>>>>>> 2021-03-22 17:01:46.340 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) > >>>>>>> 2021-03-22 17:01:46.340 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot > >>>>>>> 2021-03-22 17:01:47.376 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) > >>>>>>> 2021-03-22 17:01:47.376 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot > >>>>>>> 2021-03-22 17:01:48.395 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) > >>>>>>> 2021-03-22 17:01:48.395 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot > >>>>>>> 2021-03-22 17:01:49.407 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) > >>>>>>> 2021-03-22 17:01:49.407 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot > >>>>>>> 2021-03-22 17:01:50.400 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) > >>>>>>> 2021-03-22 17:01:50.400 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot > >>>>>>> 2021-03-22 17:01:50.922 7f6c9f088700 -1 --2- 10.1.50.21:0/23673 >> [v2:127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667] conn(0x56010903e400 0x56011a71fc00 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2:127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667] is using msgr V1 protocol > >>>>>>> 2021-03-22 17:01:50.922 7f6c9f889700 -1 --2- 10.1.50.21:0/23673 >> [v2:127.0.0.1:6821/13015214,v1:127.0.0.1:6831/13015214] conn(0x5600df434000 0x56011718e000 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2:127.0.0.1:6821/13015214,v1:127.0.0.1:6831/13015214] is using msgr V1 protocol > >>>>>>> 2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> [v2:127.0.0.1:6826/11091658,v1:127.0.0.1:6828/11091658] conn(0x5600f85ed800 0x560109df2a00 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2:127.0.0.1:6826/11091658,v1:127.0.0.1:6828/11091658] is using msgr V1 protocol > >>>>>>> 2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> [v2:127.0.0.1:6859/2683393,v1:127.0.0.1:6862/2683393] conn(0x5600f22ea000 0x560117182300 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2:127.0.0.1:6859/2683393,v1:127.0.0.1:6862/2683393] is using msgr V1 protocol > >>>>>>> 2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> [v2:127.0.0.1:6901/15090566,v1:127.0.0.1:6907/15090566] conn(0x5600df435c00 0x560139370300 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2:127.0.0.1:6901/15090566,v1:127.0.0.1:6907/15090566] is using msgr V1 protocol > >>>>>>> 2021-03-22 17:01:51.377 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) > >>>>>>> 2021-03-22 17:01:51.377 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot > >>>>>>> 2021-03-22 17:01:52.370 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) > >>>>>>> 2021-03-22 17:01:52.370 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot > >>>>>>> 2021-03-22 17:01:53.377 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) > >>>>>>> 2021-03-22 17:01:53.377 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot > >>>>>>> 2021-03-22 17:01:54.385 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) > >>>>>>> 2021-03-22 17:01:54.385 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot > >>>>>>> 2021-03-22 17:01:55.385 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) > >>>>>>> 2021-03-22 17:01:55.385 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot > >>>>>>> 2021-03-22 17:01:56.362 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) > >>>>>>> 2021-03-22 17:01:56.362 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot > >>>>>>> 2021-03-22 17:01:57.324 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%) > >>>>>>> 2021-03-22 17:01:57.324 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> Any suggestions? > >>>>>>> > >>>>>>> Sam > >>>>>>> > >>>>>>> P.S. an example ceph status as it is now [with everything now on 14.2.18, since we had to restart osds anyway]: > >>>>>>> > >>>>>>> cluster: > >>>>>>> id: a1148af2-6eaf-4486-a27e-a05a78c2b378 > >>>>>>> health: HEALTH_WARN > >>>>>>> pauserd,pausewr,noout,nobackfill,norebalance flag(s) set > >>>>>>> 230 osds down > >>>>>>> 4 hosts (80 osds) down > >>>>>>> Reduced data availability: 2048 pgs inactive > >>>>>>> 8 slow ops, oldest one blocked for 901 sec, mon.cephs01 has slow ops > >>>>>>> > >>>>>>> services: > >>>>>>> mon: 3 daemons, quorum cephs01,cephs02,cephs03 (age 2h) > >>>>>>> mgr: cephs01(active, since 77m) > >>>>>>> osd: 329 osds: 98 up (since 4s), 328 in (since 4d) > >>>>>>> flags pauserd,pausewr,noout,nobackfill,norebalance > >>>>>>> > >>>>>>> data: > >>>>>>> pools: 3 pools, 2048 pgs > >>>>>>> objects: 0 objects, 0 B > >>>>>>> usage: 0 B used, 0 B / 0 B avail > >>>>>>> pgs: 100.000% pgs unknown > >>>>>>> 2048 unknown > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> On Mon, 22 Mar 2021 at 14:57, Dan van der Ster <dan(a)vanderster.com> wrote: > >>>>>>>> > >>>>>>>> Hi, > >>>>>>>> > >>>>>>>> I would unset nodown (hiding osd failures) and norecover (blcoking PGs > >>>>>>>> from recovering degraded objects), then start starting osds. > >>>>>>>> As soon as you have some osd logs reporting some failures, then share those... > >>>>>>>> > >>>>>>>> - Dan > >>>>>>>> > >>>>>>>> On Mon, Mar 22, 2021 at 3:49 PM Sam Skipsey <aoanla(a)gmail.com> wrote: > >>>>>>>> > > >>>>>>>> > So, we started the mons and mgr up again, and here's the relevant logs, including also ceph versions. We've also turned off all of the firewalls on all of the nodes so we know that there can't be network issues [and, indeed, all of our management of the OSDs happens via logins from the service nodes or to each other] > >>>>>>>> > > >>>>>>>> > > ceph status > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > cluster: > >>>>>>>> > id: a1148af2-6eaf-4486-a27e-a05a78c2b378 > >>>>>>>> > health: HEALTH_WARN > >>>>>>>> > pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set > >>>>>>>> > 1 nearfull osd(s) > >>>>>>>> > 3 pool(s) nearfull > >>>>>>>> > Reduced data availability: 2048 pgs inactive > >>>>>>>> > mons cephs01,cephs02,cephs03 are using a lot of disk space > >>>>>>>> > > >>>>>>>> > services: > >>>>>>>> > mon: 3 daemons, quorum cephs01,cephs02,cephs03 (age 61s) > >>>>>>>> > mgr: cephs01(active, since 76s) > >>>>>>>> > osd: 329 osds: 329 up (since 63s), 328 in (since 4d); 466 remapped pgs > >>>>>>>> > flags pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover > >>>>>>>> > > >>>>>>>> > data: > >>>>>>>> > pools: 3 pools, 2048 pgs > >>>>>>>> > objects: 0 objects, 0 B > >>>>>>>> > usage: 0 B used, 0 B / 0 B avail > >>>>>>>> > pgs: 100.000% pgs unknown > >>>>>>>> > 2048 unknown > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > ceph health detail > >>>>>>>> > > >>>>>>>> > HEALTH_WARN pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set; 1 nearfull osd(s); 3 pool(s) nearfull; Reduced data availability: 2048 pgs inactive; mons cephs01,cephs02,cephs03 are using a lot of disk space > >>>>>>>> > OSDMAP_FLAGS pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set > >>>>>>>> > OSD_NEARFULL 1 nearfull osd(s) > >>>>>>>> > osd.63 is near full > >>>>>>>> > POOL_NEARFULL 3 pool(s) nearfull > >>>>>>>> > pool 'dteam' is nearfull > >>>>>>>> > pool 'atlas' is nearfull > >>>>>>>> > pool 'atlas-localgroup' is nearfull > >>>>>>>> > PG_AVAILABILITY Reduced data availability: 2048 pgs inactive > >>>>>>>> > pg 13.1ef is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 13.1f0 is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 13.1f1 is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 13.1f2 is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 13.1f3 is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 13.1f4 is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 13.1f5 is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 13.1f6 is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 13.1f7 is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 13.1f8 is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 13.1f9 is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 13.1fa is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 13.1fb is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 13.1fc is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 13.1fd is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 13.1fe is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 13.1ff is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 14.1ec is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 14.1f0 is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 14.1f1 is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 14.1f2 is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 14.1f3 is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 14.1f4 is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 14.1f5 is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 14.1f6 is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 14.1f7 is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 14.1f8 is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 14.1f9 is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 14.1fa is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 14.1fb is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 14.1fc is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 14.1fd is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 14.1fe is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 14.1ff is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 15.1ed is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 15.1f0 is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 15.1f1 is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 15.1f2 is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 15.1f3 is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 15.1f4 is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 15.1f5 is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 15.1f6 is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 15.1f7 is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 15.1f8 is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 15.1f9 is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 15.1fa is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 15.1fb is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 15.1fc is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 15.1fd is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 15.1fe is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > pg 15.1ff is stuck inactive for 89.322981, current state unknown, last acting [] > >>>>>>>> > MON_DISK_BIG mons cephs01,cephs02,cephs03 are using a lot of disk space > >>>>>>>> > mon.cephs01 is 96 GiB >= mon_data_size_warn (15 GiB) > >>>>>>>> > mon.cephs02 is 96 GiB >= mon_data_size_warn (15 GiB) > >>>>>>>> > mon.cephs03 is 96 GiB >= mon_data_size_warn (15 GiB) > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > ceph versions > >>>>>>>> > > >>>>>>>> > { > >>>>>>>> > "mon": { > >>>>>>>> > "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 3 > >>>>>>>> > }, > >>>>>>>> > "mgr": { > >>>>>>>> > "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 1 > >>>>>>>> > }, > >>>>>>>> > "osd": { > >>>>>>>> > "ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)": 1, > >>>>>>>> > "ceph version 14.2.15 (afdd217ae5fb1ed3f60e16bd62357ca58cc650e5) nautilus (stable)": 188, > >>>>>>>> > "ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)": 18, > >>>>>>>> > "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 122 > >>>>>>>> > }, > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > >>>>>> > >>>>>>>> > > >>>>>>>> > As a note, the log where the mgr explodes (which precipitated all of this) definitely shows the problem occurring on the 12th [when 14.2.17 dropped], but things didn't "break" until we tried upgrading OSDs to 14.2.18... > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > Sam > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > On Mon, 22 Mar 2021 at 12:20, Sam Skipsey <aoanla(a)gmail.com> wrote: > >>>>>>>> >> > >>>>>>>> >> Hi Dan: > >>>>>>>> >> > >>>>>>>> >> Thanks for the reply - at present, our mons and mgrs are off [because of the unsustainable nature of the filesystem usage]. We'll try putting them on again for long enough to get "ceph status" out of them, but because the mgr was unable to actually talk to anything, and reply at that point. > >>>>>>>> >> > >>>>>>>> >> (And thanks for the link to the bug tracker - I guess this mismatch of expectations is why the devs are so keen to move to containerised deployments where there is no co-location of different types of server, as it means they don't need to worry as much about the assumptions about when it's okay to restart a service on package update. Disappointing that it seems stale after 2 years...) > >>>>>>>> >> > >>>>>>>> >> Sam > >>>>>>>> >> > >>>>>>>> >> > >>>>>>>> >> > >>>>>>>> >> On Mon, 22 Mar 2021 at 12:11, Dan van der Ster <dan(a)vanderster.com> wrote: > >>>>>>>> >>> > >>>>>>>> >>> Hi Sam, > >>>>>>>> >>> > >>>>>>>> >>> The daemons restart (for *some* releases) because of this: > >>>>>>>> >>> https://tracker.ceph.com/issues/21672 > >>>>>>>> >>> In short, if the selinux module changes, and if you have selinux > >>>>>>>> >>> enabled, then midway through yum update, there will be a systemctl > >>>>>>>> >>> restart ceph.target issued. > >>>>>>>> >>> > >>>>>>>> >>> For the rest -- I think you should focus on getting the PGs all > >>>>>>>> >>> active+clean as soon as possible, because the degraded and remapped > >>>>>>>> >>> states are what leads to mon / osdmap growth. > >>>>>>>> >>> This kind of scenario is why we wrote this tool: > >>>>>>>> >>> https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-rema… > >>>>>>>> >>> It will use pg-upmap-items to force the PGs to the OSDs where they are > >>>>>>>> >>> currently residing. > >>>>>>>> >>> > >>>>>>>> >>> But there is some clarification needed before you go ahead with that. > >>>>>>>> >>> Could you share `ceph status`, `ceph health detail`? > >>>>>>>> >>> > >>>>>>>> >>> Cheers, Dan > >>>>>>>> >>> > >>>>>>>> >>> > >>>>>>>> >>> On Mon, Mar 22, 2021 at 12:05 PM Sam Skipsey <aoanla(a)gmail.com> wrote: > >>>>>>>> >>> > > >>>>>>>> >>> > Hi everyone: > >>>>>>>> >>> > > >>>>>>>> >>> > I posted to the list on Friday morning (UK time), but apparently my email > >>>>>>>> >>> > is still in moderation (I have an email from the list bot telling me that > >>>>>>>> >>> > it's held for moderation but no updates). > >>>>>>>> >>> > > >>>>>>>> >>> > Since this is a bit urgent - we have ~3PB of storage offline - I'm posting > >>>>>>>> >>> > again. > >>>>>>>> >>> > > >>>>>>>> >>> > To save retyping the whole thing, I will direct you to a copy of the email > >>>>>>>> >>> > I wrote on Friday: > >>>>>>>> >>> > > >>>>>>>> >>> > http://aoanla.pythonanywhere.com/Logs/EmailToCephUsers.txt > >>>>>>>> >>> > > >>>>>>>> >>> > (Since that was sent, we did successfully add big SSDs to the MON hosts so > >>>>>>>> >>> > they don't fill up their disks with store.db s). > >>>>>>>> >>> > > >>>>>>>> >>> > I would appreciate any advice - assuming this also doesn't get stuck in > >>>>>>>> >>> > moderation queues. > >>>>>>>> >>> > > >>>>>>>> >>> > -- > >>>>>>>> >>> > Sam Skipsey (he/him, they/them) > >>>>>>>> >>> > _______________________________________________ > >>>>>>>> >>> > ceph-users mailing list -- ceph-users(a)ceph.io > >>>>>>>> >>> > To unsubscribe send an email to ceph-users-leave(a)ceph.io > >>>>>>>> >> > >>>>>>>> >> > >>>>>>>> >> > >>>>>>>> >> -- > >>>>>>>> >> Sam Skipsey (he/him, they/them) > >>>>>>>> >> > >>>>>>>> >> > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > -- > >>>>>>>> > Sam Skipsey (he/him, they/them) > >>>>>>>> > > >>>>>>>> > > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> -- > >>>>>>> Sam Skipsey (he/him, they/them) > >>>>>>> > >>>>>>> > >>>>> > >>>>> > >>>>> -- > >>>>> Sam Skipsey (he/him, they/them) > >>>>> > >>>>> > >>> > >>> > >>> -- > >>> Sam Skipsey (he/him, they/them) > >>> > >>> > > > > > > -- > > Sam Skipsey (he/him, they/them) > > > >

Stefan Kooman

4:43 a.m.

On 3/23/21 8:29 AM, Dan van der Ster wrote:

...

I renamed a 14.2.18 osd server's network interface to "zzzceph" (was ceph before" and restarted the node. It does come up with the correct ip (ms_bind_ipv4=false, ms_bind_ipv6=true). So it does not seem to be as simple as this, or the ms_bind option migth matter here as well, dunno. FYI, Gr. Stefan

Sam Skipsey

6:45 a.m.

I should note that our cluster is entirely IPv4 [because it's on a private network, so there's no need to go IPv6], which maybe influences matters? Sam On Tue, 23 Mar 2021 at 11:43, Stefan Kooman <stefan(a)bit.nl> wrote:

...

On 3/23/21 8:29 AM, Dan van der Ster wrote:

-- Sam Skipsey (he/him, they/them)

Dan van der Ster

6:52 a.m.

Not sure. But anyway ceph has been skipping interfaces named "lo" since v10, but then dropped that in 14.2.18 (by accident, IMO). You should be able to get your osds listening to the correct IP using cluster network = 10.1.50.0/8 public network = 10.1.50.0/8 does that work? - dan On Tue, Mar 23, 2021 at 2:45 PM Sam Skipsey <aoanla(a)gmail.com> wrote: > > I should note that our cluster is entirely IPv4 [because it's on a private network, so there's no need to go IPv6], which maybe influences matters? > > Sam > > On Tue, 23 Mar 2021 at 11:43, Stefan Kooman <stefan(a)bit.nl> wrote: >> >> On 3/23/21 8:29 AM, Dan van der Ster wrote: >> > Hi Sam, >> > >> > Yeah somehow `lo:` is not getting skipped, probably due to those >> > patches. (I guess it is because the 2nd patch looks for `lo:` but in >> > fact the ifa_name is probably just `lo` without the colon) >> > >> > https://github.com/ceph/ceph/blob/master/src/common/ipaddr.cc#L110 >> > >> > I don't know why this impacts you but not us -- we already upgraded >> > one of our clusters to 14.2.18 on Centos 8, and ceph is choosing the >> > correct interface without needing any network options. And lo: is the >> > first interface [1] here too. >> > Could it be as simple as the iface names being sorted alphabetically? >> > Here we have ens785f0 which would come before lo, but your interface >> > `p2p2` would come after. >> >> I renamed a 14.2.18 osd server's network interface to "zzzceph" (was >> ceph before" and restarted the node. It does come up with the correct ip >> (ms_bind_ipv4=false, ms_bind_ipv6=true). So it does not seem to be as >> simple as this, or the ms_bind option migth matter here as well, dunno. >> >> FYI, >> >> Gr. Stefan > > > > -- > Sam Skipsey (he/him, they/them) > >

Stefan Kooman

7:11 a.m.

On 3/23/21 2:52 PM, Dan van der Ster wrote:

...

And if that doesn't, you can tell each daemon to which IP it should bind like so: ceph config set osd.$id public_addr 10.1.50.x Gr. Stefan

Sam Skipsey

7:21 a.m.

Hi, Indeed, we ended up with a config like that yesterday, and the cluster is pretty healthy now [just moving a few pgs around as ceph is wont to do]. Sam On Tue, 23 Mar 2021 at 14:11, Stefan Kooman <stefan(a)bit.nl> wrote:

...

On 3/23/21 2:52 PM, Dan van der Ster wrote:

And if that doesn't, you can tell each daemon to which IP it should bind like so: ceph config set osd.$id public_addr 10.1.50.x Gr. Stefan

-- Sam Skipsey (he/him, they/them)

1128

days inactive

1129

days old

ceph-users@ceph.io

Manage subscription

18 comments

3 participants

tags (0)

participants (3)

Dan van der Ster
Sam Skipsey
Stefan Kooman