How to fix 1 pg marked as stale+active+clean
pg 30.4 is stuck stale for 175342.419261, current state
stale+active+clean, last acting [31]
I had just one OSD go down (31). Why is Ceph not auto-healing in this
'simple' case?
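A minimal diagnostic sketch of how to check this, assuming the c01 host
prompt used later in the thread; the pool id 30 is read off the pg id
prefix:

[@c01 ~]# ceph pg dump_stuck stale
[@c01 ~]# ceph osd pool ls detail | grep 'pool 30 '

If the pool turns out to be replicated with size 1 and its only OSD is
gone, there is no surviving copy for Ceph to heal from.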
-----Original Message-----
To: ceph-users
Subject: [ceph-users] How to fix 1 pg stale+active+clean
How to fix 1 pg marked as stale+active+clean
pg 30.4 is stuck stale for 175342.419261, current state
stale+active+clean, last acting [31]
This is an edge case and probably not something that would be done in production, so I
suspect the answer is “lol, no,” but here goes:
I have three nodes running Nautilus courtesy of Proxmox. One of them is a self-built Ryzen
5 3600 system, and the other two are salvaged i5 Skylake desktops that I have pressed into
service as virtualization and storage nodes. I want to replace the i5 systems with
machines that are identical to the Ryzen 5 system. What I want to know is whether it’s
possible to just take the devices that are currently hosting the OSDs, together with the
hard drive that is hosting Proxmox, move them into the new machine, power up and have
everything working. I don’t *think* the device names should change. What does everyone
think about this possibly insane plan? (Yes, I will back up all my important data before
trying this.)
Thanks,
J
As far as Ceph is concerned, as long as there are no separate
journal/blockdb/wal devices, you absolutely can transfer OSDs between
hosts. If there are separate journal/blockdb/wal devices, you can
still do it, provided they move with the OSDs.
With Nautilus and up, make sure the OSD bootstrap key is on the new
host, and run 'ceph-volume lvm activate --all'. It will scan the
logical volumes, identify the Ceph OSDs, and start them on the new host.
There are no other "gotchas" that I remember.
I cannot speak to Proxmox, however.
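A minimal sketch of that sequence on the new host, assuming the default
bootstrap-osd keyring location and borrowing c01 (a host name that
appears elsewhere in this thread) as the source host; adjust names and
paths for your deployment:

# copy the bootstrap-osd keyring from a surviving cluster host
scp c01:/var/lib/ceph/bootstrap-osd/ceph.keyring /var/lib/ceph/bootstrap-osd/
# detect the OSD logical volumes by their LVM tags and start the daemons
ceph-volume lvm activate --all

Afterwards 'ceph osd tree' should show the transplanted OSDs up under
the new host.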
--
Adam
On Sat, Apr 11, 2020 at 12:45 PM Jarett DeAngelis <jarett(a)reticulum.us> wrote:
>
> This is an edge case and probably not something that would be done in production, so
I suspect the answer is “lol, no,” but here goes:
>
> I have three nodes running Nautilus courtesy of Proxmox. One of them is a self-built
Ryzen 5 3600 system, and the other two are salvaged i5 Skylake desktops that I have
pressed into service as virtualization and storage nodes. I want to replace the i5 systems
with machines that are identical to the Ryzen 5 system. What I want to know is whether
it’s possible to just take the devices that are currently hosting the OSDs, together with
the hard drive that is hosting Proxmox, move them into the new machine, power up and have
everything working. I don’t *think* the device names should change. What does everyone
think about this possibly insane plan? (Yes, I will back up all my important data before
trying this.)
>
> Thanks,
> J
The cause of the stale pg is fs_data.r1, a 1-replica pool. It should be
empty, but ceph df shows 128 KiB used.
I have already marked the OSD as lost and removed it from the CRUSH
map.
PG_AVAILABILITY Reduced data availability: 1 pg stale
pg 30.4 is stuck stale for 407878.113092, current state
stale+active+clean, last acting [31]
[@c01 ~]# ceph pg map 30.4
osdmap e72814 pg 30.4 (30.4) -> up [29] acting [29]
[@c01 ~]# ceph pg 30.4 query
Error ENOENT: i don't have pgid 30.4
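The two outputs are consistent: the osdmap says where pg 30.4 should map
now (osd.29), but osd.29 never received a copy, because the only replica
lived on the lost osd.31. With a 1-replica pool there is nothing left to
recover from, which is why no auto-healing happens. If the pool's
contents are genuinely expendable, one possible (and destructive) way
out is to recreate the pg as empty; a hedged sketch, noting that recent
releases guard the command behind --yes-i-really-mean-it:

[@c01 ~]# ceph osd force-create-pg 30.4 --yes-i-really-mean-it

This discards whatever pg 30.4 held and recreates it empty on the
current acting set, so it is only appropriate when the data is already
written off.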
-----Original Message-----
To: ceph-users
Subject: [ceph-users] Re: How to fix 1 pg stale+active+clean
I had just one OSD go down (31). Why is Ceph not auto-healing in this
'simple' case?
-----Original Message-----
To: ceph-users
Subject: [ceph-users] How to fix 1 pg stale+active+clean
How to fix 1 pg marked as stale+active+clean
pg 30.4 is stuck stale for 175342.419261, current state
stale+active+clean, last acting [31]
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io