I have a test one-node Ceph cluster with 4 OSDs; the plan was to add a second node just
before going to production.
Linux 4.19.0-6-amd64 - Debian 10 - Ceph version 12.2.11
Unfortunately, the system drive failed before that happened.
I recovered the system from a full backup.
Since no changes had been made to the cluster configuration after that backup, I hoped
it would work.
For reasons I can't understand, for the first few seconds after boot the ceph status
was OK (134 active+clean, 2 active+clean+scrubbing+deep), but a minute later the status
changed to:
# ceph status
  cluster:
    id:     e02f2885-946b-46c8-91d5-146dd724ecaf
    health: HEALTH_WARN
            1 filesystem is degraded
            2 osds down
            1 slice (2 osds) down
            Reduced data availability: 136 pgs inactive, 15 pgs peering

  services:
    mon: 1 daemons, quorum rbd0
    mgr: rbd0(active)
    mds: fs-1/1/1 up {0=rbd0=up:replay}
    osd: 5 osds: 1 up, 3 in

  data:
    pools:   2 pools, 136 pgs
    objects: 118.53k objects, 429GiB
    usage:   7.15TiB used, 3.77TiB / 10.9TiB avail
    pgs:     88.971% pgs unknown
             11.029% pgs not active
             121 unknown
             15  peering
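If more output would help, I can attach the results of the usual detail commands, for
example (standard ceph CLI, output not included here):

```shell
# Per-problem breakdown of the HEALTH_WARN above
ceph health detail
# List the PGs that are stuck inactive/peering
ceph pg dump_stuck inactive
# CRUSH tree, to see which OSDs sit under which bucket
ceph osd tree
```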
# ceph osd dump
epoch 1983
fsid e02f2885-946b-46c8-91d5-146dd724ecaf
created 2019-08-16 15:14:07.783009
modified 2020-02-29 13:55:39.212461
flags sortbitwise,recovery_deletes,purged_snapdirs
crush_version 27
full_ratio 0.97
backfillfull_ratio 0.94
nearfull_ratio 0.85
require_min_compat_client jewel
min_compat_client jewel
require_osd_release luminous
pool 1 'fs_data' replicated size 2 min_size 1 crush_rule 1 object_hash rjenkins
pg_num 128 pgp_num 128 last_change 1595 flags hashpspool stripe_width 0 application
cephfs
pool 2 'fs_meta' replicated size 2 min_size 1 crush_rule 1 object_hash rjenkins
pg_num 8 pgp_num 8 last_change 1595 flags hashpspool stripe_width 0 application cephfs
max_osd 5
osd.0 down out weight 0 up_from 1970 up_thru 1973 down_at 1975 last_clean_interval
[1949,1963) 192.168.101.111:6806/440 192.168.101.111:6807/440 192.168.101.111:6808/440
192.168.101.111:6809/440 autoout,exists 78eaeb63-47c9-4962-b8ff-46607921f4f6
osd.1 down in weight 1 up_from 1970 up_thru 1970 down_at 1975 last_clean_interval
[1952,1963) 192.168.101.111:6801/439 192.168.101.111:6810/439 192.168.101.111:6811/439
192.168.101.111:6812/439 exists c4c4c85d-f537-4199-823b-b7ab01c78f03
osd.2 down in weight 1 up_from 1969 up_thru 1975 down_at 1976 last_clean_interval
[1946,1963) 192.168.101.111:6802/441 192.168.101.111:6803/441 192.168.101.111:6804/441
192.168.101.111:6805/441 exists bd66a9c3-bfa4-4352-816e-2e4cd86389f3
osd.3 down out weight 0 up_from 1617 up_thru 1619 down_at 1631 last_clean_interval
[1602,1610) 192.168.101.111:6805/933 192.168.101.111:6806/933 192.168.101.111:6807/933
192.168.101.111:6808/933 exists f247115b-c6d5-49b1-9b0e-e799c50be379
osd.4 up in weight 1 up_from 1973 up_thru 1973 down_at 1972 last_clean_interval
[1956,1963) 192.168.101.111:6813/442 192.168.101.111:6814/442 192.168.101.111:6815/442
192.168.101.111:6816/442 exists,up c208221e-1228-4247-a742-0c16ce01d38f
blacklist 192.168.101.111:6800/2636437603 expires 2020-03-01 13:26:01.809132
"ceph pg query" of any PG didn't response.
I can't find any errors in journalctl or in /var/log/ceph/*
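For completeness, my log search so far has been along these lines (unit names assume the
default Debian ceph packaging; adjust the OSD ids as needed):

```shell
# Daemon state and recent journal output for each OSD
# (ceph-osd@N are the default systemd units on Debian)
systemctl status ceph-osd@0 ceph-osd@1 ceph-osd@2 ceph-osd@3 ceph-osd@4
journalctl -u ceph-osd@1 --since "1 hour ago" --no-pager

# On-disk daemon logs
grep -iE "error|fail|abort" /var/log/ceph/ceph-osd.*.log
```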
I'm wondering why only osd.4 is up, what "autoout" means, why 15 PGs are stuck peering,
where to look for more detailed information, and whether there is a way to restore the
data.
Please help me understand what happened and how to restore the data, if that is possible.