Hi Dan:
So, unsetting nodown results in... almost all of the OSDs being marked
down (231 out of 328).
Checking the OSD services themselves, most of them were actually up and
active on the nodes, even though the mons had marked them down.
(On a few nodes, the down services did correspond to OSDs that had been
flapping - but increasing osd_max_markdown locally to keep them up despite
the previous flapping, and then restarting the services, didn't help.)
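(For concreteness, the sort of change involved is sketched below -
osd_max_markdown_count / osd_max_markdown_period are what we understand
"osd_max_markdown" to cover, and the values shown are illustrative rather
than exactly what we set:)

  # /etc/ceph/ceph.conf on the affected node, under [osd]
  osd_max_markdown_count = 100
  osd_max_markdown_period = 600

  # then restart the OSD daemons on that node
  systemctl restart ceph-osd.target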
In fact, starting up the few OSD services which had actually stopped
resulted in a different set of OSDs being marked down, and some others
coming up.
We currently have a sort of "rolling OSD outness" passing through the
cluster - there are always ~230 OSDs marked down now, but which ones they
are keeps changing (we've had everything from 1 host down to 4 hosts down
over the past 14 minutes as things fluctuate).
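(Roughly how we're watching the churn - nothing clever, and assuming the
state filter on `ceph osd tree` behaves as expected on this release:)

  watch -n 30 'ceph osd stat; ceph osd tree down | head -40'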
A log from one of the "down" OSDs [which is actually running, and on the
same host as OSDs which are marked up] shows this worrying snippet:
2021-03-22 17:01:45.298 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%)
2021-03-22 17:01:45.298 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot
2021-03-22 17:01:46.340 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%)
2021-03-22 17:01:46.340 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot
2021-03-22 17:01:47.376 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%)
2021-03-22 17:01:47.376 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot
2021-03-22 17:01:48.395 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%)
2021-03-22 17:01:48.395 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot
2021-03-22 17:01:49.407 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%)
2021-03-22 17:01:49.407 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot
2021-03-22 17:01:50.400 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%)
2021-03-22 17:01:50.400 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot
2021-03-22 17:01:50.922 7f6c9f088700 -1 --2- 10.1.50.21:0/23673 >> [v2:127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667] conn(0x56010903e400 0x56011a71fc00 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2:127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667] is using msgr V1 protocol
2021-03-22 17:01:50.922 7f6c9f889700 -1 --2- 10.1.50.21:0/23673 >> [v2:127.0.0.1:6821/13015214,v1:127.0.0.1:6831/13015214] conn(0x5600df434000 0x56011718e000 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2:127.0.0.1:6821/13015214,v1:127.0.0.1:6831/13015214] is using msgr V1 protocol
2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> [v2:127.0.0.1:6826/11091658,v1:127.0.0.1:6828/11091658] conn(0x5600f85ed800 0x560109df2a00 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2:127.0.0.1:6826/11091658,v1:127.0.0.1:6828/11091658] is using msgr V1 protocol
2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> [v2:127.0.0.1:6859/2683393,v1:127.0.0.1:6862/2683393] conn(0x5600f22ea000 0x560117182300 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2:127.0.0.1:6859/2683393,v1:127.0.0.1:6862/2683393] is using msgr V1 protocol
2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> [v2:127.0.0.1:6901/15090566,v1:127.0.0.1:6907/15090566] conn(0x5600df435c00 0x560139370300 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2:127.0.0.1:6901/15090566,v1:127.0.0.1:6907/15090566] is using msgr V1 protocol
2021-03-22 17:01:51.377 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%)
2021-03-22 17:01:51.377 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot
2021-03-22 17:01:52.370 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%)
2021-03-22 17:01:52.370 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot
2021-03-22 17:01:53.377 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%)
2021-03-22 17:01:53.377 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot
2021-03-22 17:01:54.385 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%)
2021-03-22 17:01:54.385 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot
2021-03-22 17:01:55.385 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%)
2021-03-22 17:01:55.385 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot
2021-03-22 17:01:56.362 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%)
2021-03-22 17:01:56.362 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot
2021-03-22 17:01:57.324 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%)
2021-03-22 17:01:57.324 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot
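In case it's useful, we can also pull state straight off osd.127's admin
socket on that host - e.g. something like the below (assuming the default
admin socket setup):

  ceph daemon osd.127 status                          # boot state + oldest/newest osdmap epoch it holds
  ceph daemon osd.127 config get osd_heartbeat_grace
  ss -tnp | grep ceph-osd | head                      # check it really has TCP peers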
Any suggestions?
Sam
P.S. An example ceph status as it is now [with everything on 14.2.18, since
we had to restart OSDs anyway]:
  cluster:
    id:     a1148af2-6eaf-4486-a27e-a05a78c2b378
    health: HEALTH_WARN
            pauserd,pausewr,noout,nobackfill,norebalance flag(s) set
            230 osds down
            4 hosts (80 osds) down
            Reduced data availability: 2048 pgs inactive
            8 slow ops, oldest one blocked for 901 sec, mon.cephs01 has slow ops

  services:
    mon: 3 daemons, quorum cephs01,cephs02,cephs03 (age 2h)
    mgr: cephs01(active, since 77m)
    osd: 329 osds: 98 up (since 4s), 328 in (since 4d)
         flags pauserd,pausewr,noout,nobackfill,norebalance

  data:
    pools:   3 pools, 2048 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             2048 unknown
On Mon, 22 Mar 2021 at 14:57, Dan van der Ster <dan(a)vanderster.com> wrote:
Hi,
I would unset nodown (hiding osd failures) and norecover (blocking PGs
from recovering degraded objects), then start starting osds.
As soon as you have some osd logs reporting some failures, then share
those...
- Dan
On Mon, Mar 22, 2021 at 3:49 PM Sam Skipsey <aoanla(a)gmail.com> wrote:
So, we started the mons and mgr up again, and here are the relevant logs,
including ceph versions. We've also turned off all of the firewalls on all
of the nodes, so we know there can't be network issues [and, indeed, all of
our management of the OSDs happens via logins from the service nodes or to
each other].
ceph status
  cluster:
    id:     a1148af2-6eaf-4486-a27e-a05a78c2b378
    health: HEALTH_WARN
            pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set
            1 nearfull osd(s)
            3 pool(s) nearfull
            Reduced data availability: 2048 pgs inactive
            mons cephs01,cephs02,cephs03 are using a lot of disk space

  services:
    mon: 3 daemons, quorum cephs01,cephs02,cephs03 (age 61s)
    mgr: cephs01(active, since 76s)
    osd: 329 osds: 329 up (since 63s), 328 in (since 4d); 466 remapped pgs
         flags pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover

  data:
    pools:   3 pools, 2048 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             2048 unknown
ceph health detail
HEALTH_WARN pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set; 1 nearfull osd(s); 3 pool(s) nearfull; Reduced data availability: 2048 pgs inactive; mons cephs01,cephs02,cephs03 are using a lot of disk space
OSDMAP_FLAGS pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set
OSD_NEARFULL 1 nearfull osd(s)
osd.63 is near full
POOL_NEARFULL 3 pool(s) nearfull
pool 'dteam' is nearfull
pool 'atlas' is nearfull
pool 'atlas-localgroup' is nearfull
PG_AVAILABILITY Reduced data availability: 2048 pgs inactive
    pg 13.1ef is stuck inactive for 89.322981, current state unknown, last acting []
    pg 13.1f0 is stuck inactive for 89.322981, current state unknown, last acting []
    pg 13.1f1 is stuck inactive for 89.322981, current state unknown, last acting []
    pg 13.1f2 is stuck inactive for 89.322981, current state unknown, last acting []
    pg 13.1f3 is stuck inactive for 89.322981, current state unknown, last acting []
    pg 13.1f4 is stuck inactive for 89.322981, current state unknown, last acting []
    pg 13.1f5 is stuck inactive for 89.322981, current state unknown, last acting []
    pg 13.1f6 is stuck inactive for 89.322981, current state unknown, last acting []
    pg 13.1f7 is stuck inactive for 89.322981, current state unknown, last acting []
    pg 13.1f8 is stuck inactive for 89.322981, current state unknown, last acting []
    pg 13.1f9 is stuck inactive for 89.322981, current state unknown, last acting []
    pg 13.1fa is stuck inactive for 89.322981, current state unknown, last acting []
    pg 13.1fb is stuck inactive for 89.322981, current state unknown, last acting []
    pg 13.1fc is stuck inactive for 89.322981, current state unknown, last acting []
    pg 13.1fd is stuck inactive for 89.322981, current state unknown, last acting []
    pg 13.1fe is stuck inactive for 89.322981, current state unknown, last acting []
    pg 13.1ff is stuck inactive for 89.322981, current state unknown, last acting []
    pg 14.1ec is stuck inactive for 89.322981, current state unknown, last acting []
    pg 14.1f0 is stuck inactive for 89.322981, current state unknown, last acting []
    pg 14.1f1 is stuck inactive for 89.322981, current state unknown, last acting []
    pg 14.1f2 is stuck inactive for 89.322981, current state unknown, last acting []
    pg 14.1f3 is stuck inactive for 89.322981, current state unknown, last acting []
    pg 14.1f4 is stuck inactive for 89.322981, current state unknown, last acting []
    pg 14.1f5 is stuck inactive for 89.322981, current state unknown, last acting []
    pg 14.1f6 is stuck inactive for 89.322981, current state unknown, last acting []
    pg 14.1f7 is stuck inactive for 89.322981, current state unknown, last acting []
    pg 14.1f8 is stuck inactive for 89.322981, current state unknown, last acting []
    pg 14.1f9 is stuck inactive for 89.322981, current state unknown, last acting []
    pg 14.1fa is stuck inactive for 89.322981, current state unknown, last acting []
    pg 14.1fb is stuck inactive for 89.322981, current state unknown, last acting []
    pg 14.1fc is stuck inactive for 89.322981, current state unknown, last acting []
    pg 14.1fd is stuck inactive for 89.322981, current state unknown, last acting []
    pg 14.1fe is stuck inactive for 89.322981, current state unknown, last acting []
    pg 14.1ff is stuck inactive for 89.322981, current state unknown, last acting []
    pg 15.1ed is stuck inactive for 89.322981, current state unknown, last acting []
    pg 15.1f0 is stuck inactive for 89.322981, current state unknown, last acting []
    pg 15.1f1 is stuck inactive for 89.322981, current state unknown, last acting []
    pg 15.1f2 is stuck inactive for 89.322981, current state unknown, last acting []
    pg 15.1f3 is stuck inactive for 89.322981, current state unknown, last acting []
    pg 15.1f4 is stuck inactive for 89.322981, current state unknown, last acting []
    pg 15.1f5 is stuck inactive for 89.322981, current state unknown, last acting []
    pg 15.1f6 is stuck inactive for 89.322981, current state unknown, last acting []
    pg 15.1f7 is stuck inactive for 89.322981, current state unknown, last acting []
    pg 15.1f8 is stuck inactive for 89.322981, current state unknown, last acting []
    pg 15.1f9 is stuck inactive for 89.322981, current state unknown, last acting []
    pg 15.1fa is stuck inactive for 89.322981, current state unknown, last acting []
    pg 15.1fb is stuck inactive for 89.322981, current state unknown, last acting []
    pg 15.1fc is stuck inactive for 89.322981, current state unknown, last acting []
    pg 15.1fd is stuck inactive for 89.322981, current state unknown, last acting []
    pg 15.1fe is stuck inactive for 89.322981, current state unknown, last acting []
    pg 15.1ff is stuck inactive for 89.322981, current state unknown, last acting []
MON_DISK_BIG mons cephs01,cephs02,cephs03 are using a lot of disk space
mon.cephs01 is 96 GiB >= mon_data_size_warn (15 GiB)
mon.cephs02 is 96 GiB >= mon_data_size_warn (15 GiB)
mon.cephs03 is 96 GiB >= mon_data_size_warn (15 GiB)
ceph versions
{
    "mon": {
        "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)": 1,
        "ceph version 14.2.15 (afdd217ae5fb1ed3f60e16bd62357ca58cc650e5) nautilus (stable)": 188,
        "ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)": 18,
        "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 122
    },
As a note, the log where the mgr explodes (which precipitated all of this)
definitely shows the problem occurring on the 12th [when 14.2.17 dropped],
but things didn't "break" until we tried upgrading OSDs to 14.2.18...
Sam
On Mon, 22 Mar 2021 at 12:20, Sam Skipsey <aoanla(a)gmail.com> wrote:
>
> Hi Dan:
>
> Thanks for the reply - at present, our mons and mgrs are off [because of the
> unsustainable nature of the filesystem usage]. We'll try putting them on
> again for long enough to get "ceph status" out of them, but note that the
> mgr was unable to actually talk to anything, or reply, at that point.
>
> (And thanks for the link to the bug tracker - I guess this mismatch of
> expectations is why the devs are so keen to move to containerised
> deployments where there is no co-location of different types of server, as
> it means they don't need to worry as much about the assumptions about when
> it's okay to restart a service on package update. Disappointing that it
> seems stale after 2 years...)
>
> Sam
>
>
>
> On Mon, 22 Mar 2021 at 12:11, Dan van der Ster <dan(a)vanderster.com> wrote:
>>
>> Hi Sam,
>>
>> The daemons restart (for *some* releases) because of this:
>>
>> https://tracker.ceph.com/issues/21672
>> In short, if the selinux module changes, and if you have selinux
>> enabled, then midway through yum update, there will be a systemctl
>> restart ceph.target issued.
>>
>> For the rest -- I think you should focus on getting the PGs all
>> active+clean as soon as possible, because the degraded and remapped
>> states are what leads to mon / osdmap growth.
>> This kind of scenario is why we wrote this tool:
>>
>> https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-rema…
>> It will use pg-upmap-items to force the PGs to the OSDs where they are
>> currently residing.
>>
>> But there is some clarification needed before you go ahead with that.
>> Could you share `ceph status`, `ceph health detail`?
>>
>> Cheers, Dan
>>
>>
>> On Mon, Mar 22, 2021 at 12:05 PM Sam Skipsey <aoanla(a)gmail.com> wrote:
>> >
>> > Hi everyone:
>> >
>> > I posted to the list on Friday morning (UK time), but apparently my email
>> > is still in moderation (I have an email from the list bot telling me that
>> > it's held for moderation but no updates).
>> >
>> > Since this is a bit urgent - we have ~3PB of storage offline - I'm posting
>> > again.
>> >
>> > To save retyping the whole thing, I will direct you to a copy of the email
>> > I wrote on Friday:
>> >
>> >
>> > http://aoanla.pythonanywhere.com/Logs/EmailToCephUsers.txt
>> >
>> > (Since that was sent, we did successfully add big SSDs to the MON hosts so
>> > they don't fill up their disks with store.db s).
>> >
>> > I would appreciate any advice - assuming this also doesn't get stuck in
>> > moderation queues.
>> >
>> > --
>> > Sam Skipsey (he/him, they/them)
>> > _______________________________________________
>> > ceph-users mailing list -- ceph-users(a)ceph.io
>> > To unsubscribe send an email to ceph-users-leave(a)ceph.io
--
Sam Skipsey (he/him, they/them)