Hi all,
We run OpenStack + Ceph (Hammer) in production. There are 22 OSDs per host, and each
group of 11 OSDs shares one SSD for the OSD journal.
Unfortunately, one of the SSDs failed, so the 11 OSDs behind it went down. The OSD log
shows:
-1> 2019-09-19 11:35:52.681142 7fcab5354700 1 -- xxxxxxxxxxxx:6831/16460 -->
xxxxxxxxxxxx:0/14304 -- osd_ping(ping_reply e6152 stamp 2019-09-19 11:35:52.679939) v2 --
?+0 0x20af8400 con 0x20a4b340
0> 2019-09-19 11:35:52.682578 7fcabed3c700 -1 os/FileJournal.cc: In function
'void FileJournal::write_finish_thread_entry()' thread 7fcabed3c700 time
2019-09-19 11:35:52.640294
os/FileJournal.cc: 1426: FAILED assert(0 == "unexpected aio error")
ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85)
[0xbc8b55]
2: (FileJournal::write_finish_thread_entry()+0x695) [0xa795c5]
3: (FileJournal::WriteFinisher::entry()+0xd) [0x91cecd]
4: (()+0x7dc5) [0x7fcacb81cdc5]
5: (clone()+0x6d) [0x7fcaca2fd1cd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this
As the log shows, the OSDs went down on 09-19, but the PGs are still degraded and
remapped; recovery seems to be stuck.
ceph -s
cluster 26cc714c-ed78-4d62-9435-db3e87509c5f
health HEALTH_WARN
681 pgs degraded
681 pgs stuck degraded
797 pgs stuck unclean
681 pgs stuck undersized
681 pgs undersized
recovery 132155/11239182 objects degraded (1.176%)
recovery 22056/11239182 objects misplaced (0.196%)
monmap e1: 3 mons at
{ctrl01=xxx.xxx.xxx.xxx:6789/0,ctrl02=xxx.xxx.xxx.xxx:6789/0,ctrl03=xxx.xxx.xxx.xxx:6789/0}
election epoch 122, quorum 0,1,2 ctrl01,ctrl02,ctrl03
osdmap e6590: 324 osds: 313 up, 313 in; 116 remapped pgs
pgmap v40849600: 21504 pgs, 6 pools, 14048 GB data, 3658 kobjects
41661 GB used, 279 TB / 319 TB avail
132155/11239182 objects degraded (1.176%)
22056/11239182 objects misplaced (0.196%)
20707 active+clean
681 active+undersized+degraded
116 active+remapped
client io 121 MB/s rd, 144 MB/s wr, 1029 op/s
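For reference, this is how I list the down OSD ids, a minimal sketch that filters the
`ceph osd tree` output, assuming the up/down state is the fourth column of each osd line:

```shell
# Filter down OSD ids out of `ceph osd tree` output.
# On the cluster: ceph osd tree | awk '$4 == "down" {print $1}'
down_osds() {
    awk '$4 == "down" {print $1}'
}

# Example with two sample osd-tree lines:
printf '0 3.64 osd.0 up 1\n1 3.64 osd.1 down 0\n' | down_osds   # prints: 1
```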
I queried one of the PGs; its recovery_state is "started" [1].
I also found that the PG has no third OSD mapped, as shown below:
[root@ctrl01 ~]# ceph pg map 4.75f
osdmap e6590 pg 4.75f (4.75f) -> up [34,106] acting [34,106]
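Unless someone advises otherwise, my plan is to remove the 11 dead OSDs from the cluster
so that CRUSH can choose a new third OSD for these PGs and backfill. A sketch of the
commands I intend to run (the OSD ids below are placeholders; the real ids come from
`ceph osd tree`):

```shell
# Print the removal commands for review before actually running them.
# The OSD ids are placeholders -- substitute the ids of the 11 down OSDs.
removal_plan() {
    for id in "$@"; do
        echo "ceph osd out $id"                 # stop placing data on it
        echo "ceph osd crush remove osd.$id"    # drop it from the CRUSH map
        echo "ceph auth del osd.$id"            # delete its cephx key
        echo "ceph osd rm $id"                  # remove it from the osdmap
    done
}

removal_plan 100 101 102    # placeholder ids; review the output, then run each line
```

Removing the dead OSDs from CRUSH should let the 681 undersized PGs pick a new third
replica and recover, assuming the remaining hosts have enough capacity. Is this safe
here, or is there a better way?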
The crushmap is at [2].
How should I get the cluster back to a healthy state?
Can someone help me? Many thanks.
[1]
https://github.com/rongzhen-zhan/myfile/blob/master/pgquery
[2]
https://github.com/rongzhen-zhan/myfile/blob/master/crushmap