[ceph-users] ERROR: osd init failed: (1) Operation not permitted

11 Feb 2020

Hello List,

First of all: yes, I made mistakes. Now I am trying to recover :-/

I had a healthy 3-node cluster which I wanted to convert to a single-node one.
My goal was to reinstall a fresh 3-node cluster, starting with 2 nodes.

I was able to shrink it from a 3-node cluster to a 2-node cluster while
keeping it healthy. Then the problems began.

I then set size=1 and min_size=1.
Health was okay up to this point. Then all of a sudden both nodes got
fenced: one node refused to boot, mons were missing, etc. To make a
long story short, here is where I am right now:

root@node03:~ # ceph -s
    cluster b3be313f-d0ef-42d5-80c8-6b41380a47e3
     health HEALTH_WARN
            53 pgs stale
            53 pgs stuck stale
     monmap e4: 2 mons at {0=10.15.15.3:6789/0,1=10.15.15.2:6789/0}
            election epoch 298, quorum 0,1 1,0
     osdmap e6097: 14 osds: 9 up, 9 in
      pgmap v93644673: 512 pgs, 1 pools, 1193 GB data, 304 kobjects
            1088 GB used, 32277 GB / 33366 GB avail
                 459 active+clean
                  53 stale+active+clean

root@node03:~ # ceph osd tree
ID WEIGHT   TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 32.56990 root default
-2 25.35992     host node03
 0  3.57999         osd.0        up  1.00000          1.00000
 5  3.62999         osd.5        up  1.00000          1.00000
 6  3.62999         osd.6        up  1.00000          1.00000
 7  3.62999         osd.7        up  1.00000          1.00000
 8  3.62999         osd.8        up  1.00000          1.00000
19  3.62999         osd.19       up  1.00000          1.00000
20  3.62999         osd.20       up  1.00000          1.00000
-3  7.20998     host node02
 3  3.62999         osd.3        up  1.00000          1.00000
 4  3.57999         osd.4        up  1.00000          1.00000
 1        0 osd.1              down        0          1.00000
 9        0 osd.9              down        0          1.00000
10        0 osd.10             down        0          1.00000
17        0 osd.17             down        0          1.00000
18        0 osd.18             down        0          1.00000
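
To pull the down OSDs out of a tree listing like the one above, a minimal sketch follows. It assumes the column layout shown here (ID, WEIGHT, name, UP/DOWN, ...) as printed by this 0.94 release; other releases may print different columns. The function name down_osds is hypothetical, not part of any Ceph tooling.

```python
def down_osds(tree_text: str) -> list:
    """Return the numeric ids of OSD rows whose status column is 'down'."""
    ids = []
    for line in tree_text.splitlines():
        cols = line.split()
        # Assumed row shape: ID WEIGHT osd.N UP/DOWN REWEIGHT PRIMARY-AFFINITY
        # Header, root and host rows have no "osd." name in the third column.
        if len(cols) >= 4 and cols[2].startswith("osd.") and cols[3] == "down":
            ids.append(int(cols[0]))
    return ids


# A few rows copied from the listing above.
sample = """\
-3  7.20998     host node02
 3  3.62999         osd.3        up  1.00000          1.00000
 1        0 osd.1              down        0          1.00000
 9        0 osd.9              down        0          1.00000
"""

print(down_osds(sample))  # [1, 9]
```

The text can come from a saved copy of the `ceph osd tree` output; the sketch only reads it and never talks to the cluster.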

My main mistakes seem to be:
--------------------------------
ceph osd out osd.1
ceph auth del osd.1
systemctl stop ceph-osd@1
ceph osd rm 1
umount /var/lib/ceph/osd/ceph-1
ceph osd crush remove osd.1

As far as I can tell, Ceph is waiting for data from osd.1 (which
I removed).

root@node03:~ # ceph health detail
HEALTH_WARN 53 pgs stale; 53 pgs stuck stale
pg 0.1a6 is stuck stale for 5086.552795, current state
stale+active+clean, last acting [1]
pg 0.142 is stuck stale for 5086.552784, current state
stale+active+clean, last acting [1]
pg 0.1e is stuck stale for 5086.552820, current state
stale+active+clean, last acting [1]
pg 0.e0 is stuck stale for 5086.552855, current state
stale+active+clean, last acting [1]
pg 0.1d is stuck stale for 5086.552822, current state
stale+active+clean, last acting [1]
pg 0.13c is stuck stale for 5086.552791, current state
stale+active+clean, last acting [1]
[...] SNIP [...]
pg 0.e9 is stuck stale for 5086.552955, current state
stale+active+clean, last acting [1]
pg 0.87 is stuck stale for 5086.552939, current state
stale+active+clean, last acting [1]
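
To check that every stale PG really points at the same OSD, the output above can be grouped by its "last acting" set. This is only a sketch: it assumes the "is stuck stale ... last acting [...]" wording shown above, tolerates the line wrapping added by the mail client, and the name stale_pgs_by_acting is hypothetical.

```python
import re
from collections import defaultdict

# \s+ between words lets an entry span a wrapped line break.
_STALE = re.compile(
    r"pg\s+(\S+)\s+is\s+stuck\s+stale\s+for\s+[\d.]+,"
    r"\s+current\s+state\s+(\S+),"
    r"\s+last\s+acting\s+\[([^\]]*)\]"
)


def stale_pgs_by_acting(health_detail: str) -> dict:
    """Map each last-acting OSD set (as a tuple of ints) to its stale PG ids."""
    groups = defaultdict(list)
    for pgid, _state, acting in _STALE.findall(health_detail):
        osds = tuple(int(x) for x in acting.split(",") if x.strip())
        groups[osds].append(pgid)
    return dict(groups)


# Two entries copied from the output above, including the wrapped lines.
sample = (
    "HEALTH_WARN 53 pgs stale; 53 pgs stuck stale\n"
    "pg 0.1a6 is stuck stale for 5086.552795, current state\n"
    "stale+active+clean, last acting [1]\n"
    "pg 0.142 is stuck stale for 5086.552784, current state\n"
    "stale+active+clean, last acting [1]\n"
)

for osds, pgs in stale_pgs_by_acting(sample).items():
    print(osds, len(pgs), pgs)
```

If the full output yields a single key (1,), all stale PGs were last served by osd.1 alone, which matches what the listing above suggests.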

When I try to start osd.1 manually, I get:
--------------------------------------------
2020-02-10 18:48:26.107444 7f9ce31dd880  0 ceph version 0.94.10
(b1e0532418e4631af01acbc0cedd426f1905f4af), process ceph-osd, pid
10210
2020-02-10 18:48:26.134417 7f9ce31dd880  0
filestore(/var/lib/ceph/osd/ceph-1) backend xfs (magic 0x58465342)
2020-02-10 18:48:26.184202 7f9ce31dd880  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features:
FIEMAP ioctl is supported and appears to work
2020-02-10 18:48:26.184209 7f9ce31dd880  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features:
FIEMAP ioctl is disabled via 'filestore fiemap' config option
2020-02-10 18:48:26.184526 7f9ce31dd880  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features:
syncfs(2) syscall fully supported (by glibc and kernel)
2020-02-10 18:48:26.184585 7f9ce31dd880  0
xfsfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_feature: extsize
is disabled by conf
2020-02-10 18:48:26.309755 7f9ce31dd880  0
filestore(/var/lib/ceph/osd/ceph-1) mount: enabling WRITEAHEAD journal
mode: checkpoint is not enabled
2020-02-10 18:48:26.633926 7f9ce31dd880  1 journal _open
/var/lib/ceph/osd/ceph-1/journal fd 20: 5367660544 bytes, block size
4096 bytes, directio = 1, aio = 1
2020-02-10 18:48:26.642185 7f9ce31dd880  1 journal _open
/var/lib/ceph/osd/ceph-1/journal fd 20: 5367660544 bytes, block size
4096 bytes, directio = 1, aio = 1
2020-02-10 18:48:26.664273 7f9ce31dd880  0 <cls>
cls/hello/cls_hello.cc:271: loading cls_hello
2020-02-10 18:48:26.732154 7f9ce31dd880  0 osd.1 6002 crush map has
features 1107558400, adjusting msgr requires for clients
2020-02-10 18:48:26.732163 7f9ce31dd880  0 osd.1 6002 crush map has
features 1107558400 was 8705, adjusting msgr requires for mons
2020-02-10 18:48:26.732167 7f9ce31dd880  0 osd.1 6002 crush map has
features 1107558400, adjusting msgr requires for osds
2020-02-10 18:48:26.732179 7f9ce31dd880  0 osd.1 6002 load_pgs
2020-02-10 18:48:31.939810 7f9ce31dd880  0 osd.1 6002 load_pgs opened 53 pgs
2020-02-10 18:48:31.940546 7f9ce31dd880 -1 osd.1 6002 log_to_monitors
{default=true}
2020-02-10 18:48:31.942471 7f9ce31dd880  1 journal close
/var/lib/ceph/osd/ceph-1/journal
2020-02-10 18:48:31.969205 7f9ce31dd880 -1 ** ERROR: osd
init failed: (1) Operation not permitted

It's mounted:
/dev/sdg1       3.7T  127G  3.6T   4% /var/lib/ceph/osd/ceph-1

Is there any way I can get osd.1 back in?

Thanks a lot,
mario
