Hi,
I'm still evaluating Ceph 15.2.5 in a lab, so the problem is not really hurting me, but I want
to understand it and hopefully fix it; it's good practice. To test the resilience of the cluster
I try to break it by doing all kinds of things. Today I powered off (clean shutdown) one OSD node
and powered it back on. Last time I tried this there was no problem getting it back online: after
a few minutes the cluster health was back to OK. This time it stayed degraded indefinitely. I
checked and noticed that the osd.0 service on that node was failing. So I searched around and
people recommended simply deleting the OSD and re-creating it. I tried that, but I still can't
get the OSD back in service.
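(For reference, this is roughly how the failing daemon can be inspected on the OSD node under
cephadm; the systemd unit name includes the cluster fsid, so adjust it to yours:)

systemctl status ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.0.service
journalctl -u ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.0.service
cephadm logs --name osd.0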
First I removed the OSD:
[root@gedasvl02 ~]# ceph osd out 0
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config
/var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
osd.0 is already out.
[root@gedasvl02 ~]# ceph auth del 0
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config
/var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
Error EINVAL: bad entity name
[root@gedasvl02 ~]# ceph auth del osd.0
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config
/var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
updated
[root@gedasvl02 ~]# ceph osd rm 0
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config
/var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
removed osd.0
[root@gedasvl02 ~]# ceph osd tree
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config
/var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.43658 root default
-7 0.21829 host gedaopl01
2 ssd 0.21829 osd.2 up 1.00000 1.00000
-3 0 host gedaopl02
-5 0.21829 host gedaopl03
3 ssd 0.21829 osd.3 up 1.00000 1.00000
Looks OK, it's gone.
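(Side note: as far as I know, "ceph osd purge" combines the crush remove / auth del / osd rm
steps into a single command:

ceph osd purge 0 --yes-i-really-mean-it
)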
Then I zapped it:
[root@gedasvl02 ~]# ceph orch device zap gedaopl02 /dev/sdb --force
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config
/var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
INFO:cephadm:/usr/bin/podman:stderr WARNING: The same type, major and minor should not be
used for multiple devices.
INFO:cephadm:/usr/bin/podman:stderr --> Zapping: /dev/sdb
INFO:cephadm:/usr/bin/podman:stderr --> Zapping lvm member /dev/sdb. lv_path is
/dev/ceph-3bf1bb28-0858-4464-a848-d7f56319b40a/osd-block-3a79800d-2a19-45d8-a850-82c6a8113323
INFO:cephadm:/usr/bin/podman:stderr Running command: /usr/bin/dd if=/dev/zero
of=/dev/ceph-3bf1bb28-0858-4464-a848-d7f56319b40a/osd-block-3a79800d-2a19-45d8-a850-82c6a8113323
bs=1M count=10 conv=fsync
INFO:cephadm:/usr/bin/podman:stderr stderr: 10+0 records in
INFO:cephadm:/usr/bin/podman:stderr 10+0 records out
INFO:cephadm:/usr/bin/podman:stderr 10485760 bytes (10 MB, 10 MiB) copied, 0.0314447 s,
333 MB/s
INFO:cephadm:/usr/bin/podman:stderr stderr:
INFO:cephadm:/usr/bin/podman:stderr --> Only 1 LV left in VG, will proceed to destroy
volume group ceph-3bf1bb28-0858-4464-a848-d7f56319b40a
INFO:cephadm:/usr/bin/podman:stderr Running command: /usr/sbin/vgremove -v -f
ceph-3bf1bb28-0858-4464-a848-d7f56319b40a
INFO:cephadm:/usr/bin/podman:stderr stderr: Removing
ceph--3bf1bb28--0858--4464--a848--d7f56319b40a-osd--block--3a79800d--2a19--45d8--a850--82c6a8113323
(253:0)
INFO:cephadm:/usr/bin/podman:stderr stderr: Archiving volume group
"ceph-3bf1bb28-0858-4464-a848-d7f56319b40a" metadata (seqno 5).
INFO:cephadm:/usr/bin/podman:stderr stderr: Releasing logical volume
"osd-block-3a79800d-2a19-45d8-a850-82c6a8113323"
INFO:cephadm:/usr/bin/podman:stderr stderr: Creating volume group backup
"/etc/lvm/backup/ceph-3bf1bb28-0858-4464-a848-d7f56319b40a" (seqno 6).
INFO:cephadm:/usr/bin/podman:stderr stdout: Logical volume
"osd-block-3a79800d-2a19-45d8-a850-82c6a8113323" successfully removed
INFO:cephadm:/usr/bin/podman:stderr stderr: Removing physical volume "/dev/sdb"
from volume group "ceph-3bf1bb28-0858-4464-a848-d7f56319b40a"
INFO:cephadm:/usr/bin/podman:stderr stdout: Volume group
"ceph-3bf1bb28-0858-4464-a848-d7f56319b40a" successfully removed
INFO:cephadm:/usr/bin/podman:stderr Running command: /usr/bin/dd if=/dev/zero of=/dev/sdb
bs=1M count=10 conv=fsync
INFO:cephadm:/usr/bin/podman:stderr stderr: 10+0 records in
INFO:cephadm:/usr/bin/podman:stderr 10+0 records out
INFO:cephadm:/usr/bin/podman:stderr stderr: 10485760 bytes (10 MB, 10 MiB) copied,
0.0355641 s, 295 MB/s
INFO:cephadm:/usr/bin/podman:stderr --> Zapping successful for: <Raw Device:
/dev/sdb>
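Before re-adding, I assume the orchestrator's device inventory can be refreshed to confirm the
disk is seen as available again:

ceph orch device ls --refresh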
And re-added it:
[root@gedasvl02 ~]# ceph orch daemon add osd gedaopl02:/dev/sdb
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config
/var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
Created osd(s) 0 on host 'gedaopl02'
But the OSD is still out:
[root@gedasvl02 ~]# ceph osd tree
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config
/var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.43658 root default
-7 0.21829 host gedaopl01
2 ssd 0.21829 osd.2 up 1.00000 1.00000
-3 0 host gedaopl02
-5 0.21829 host gedaopl03
3 ssd 0.21829 osd.3 up 1.00000 1.00000
0 0 osd.0 down 0 1.00000
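To see whether the new osd.0 daemon was actually deployed and why it does not come up, I assume
something like this would help (cephadm ls runs on the OSD node itself):

ceph orch ps gedaopl02
ceph log last cephadm
cephadm ls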
Looking at the cluster log in the web UI I see the following error:
Failed to apply osd.dashboard-admin-1606745745154 spec
DriveGroupSpec(name=dashboard-admin-1606745745154->placement=PlacementSpec(host_pattern='*'),
service_id='dashboard-admin-1606745745154', service_type='osd',
data_devices=DeviceSelection(size='223.6GB', rotational=False, all=False),
osd_id_claims={}, unmanaged=False, filter_logic='AND', preview_only=False): No filters applied
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 2108, in _apply_all_services
    if self._apply_service(spec):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 2005, in _apply_service
    self.osd_service.create_from_spec(cast(DriveGroupSpec, spec))
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 43, in create_from_spec
    ret = create_from_spec_one(self.prepare_drivegroup(drive_group))
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 127, in prepare_drivegroup
    drive_selection = DriveSelection(drive_group, inventory_for_host)
  File "/lib/python3.6/site-packages/ceph/deployment/drive_selection/selector.py", line 32, in __init__
    self._data = self.assign_devices(self.spec.data_devices)
  File "/lib/python3.6/site-packages/ceph/deployment/drive_selection/selector.py", line 138, in assign_devices
    if not all(m.compare(disk) for m in FilterGenerator(device_filter)):
  File "/lib/python3.6/site-packages/ceph/deployment/drive_selection/selector.py", line 138, in <genexpr>
    if not all(m.compare(disk) for m in FilterGenerator(device_filter)):
  File "/lib/python3.6/site-packages/ceph/deployment/drive_selection/matchers.py", line 410, in compare
    raise Exception("No filters applied")
Exception: No filters applied
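The service_id 'dashboard-admin-1606745745154' in that spec looks like the OSD service that was
created through the dashboard (with a size filter on the data devices), and with
host_pattern='*' it is presumably re-applied to all hosts. If it matters, the stored OSD specs
can be dumped with:

ceph orch ls osd --export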
I also see a "pgs undersized" warning; maybe this is causing trouble too?
[root@gedasvl02 ~]# ceph -s
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config
/var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
cluster:
id: d0920c36-2368-11eb-a5de-005056b703af
health: HEALTH_WARN
Degraded data redundancy: 13142/39426 objects degraded (33.333%), 176 pgs
degraded, 225 pgs undersized
services:
mon: 1 daemons, quorum gedasvl02 (age 2w)
mgr: gedasvl02.vqswxg(active, since 2w), standbys: gedaopl02.yrwzqh
mds: cephfs:1 {0=cephfs.gedaopl01.zjuhem=up:active} 1 up:standby
osd: 3 osds: 2 up (since 4d), 2 in (since 94m)
task status:
scrub status:
mds.cephfs.gedaopl01.zjuhem: idle
data:
pools: 7 pools, 225 pgs
objects: 13.14k objects, 77 GiB
usage: 148 GiB used, 299 GiB / 447 GiB avail
pgs: 13142/39426 objects degraded (33.333%)
176 active+undersized+degraded
49 active+undersized
io:
client: 0 B/s rd, 6.1 KiB/s wr, 0 op/s rd, 0 op/s wr
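(For completeness: the pools are presumably replicated with size 3, so I assume the undersized
PGs are simply the effect of only 2 of 3 OSDs being in. The pool sizes can be checked with:

ceph osd pool ls detail
)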
Best Regards,
Oliver