Hi all,
We run OpenStack + Ceph (Hammer) in production. There are 22 OSDs per host, and each
group of 11 OSDs shares one SSD for the OSD journal.
Unfortunately, one of the SSDs failed, so the 11 OSDs behind it went down. The OSD log
shows:
-1> 2019-09-19 11:35:52.681142 7fcab5354700 1 -- xxxxxxxxxxxx:6831/16460 -->
xxxxxxxxxxxx:0/14304 -- osd_ping(ping_reply e6152 stamp 2019-09-19 11:35:52.679939) v2 --
?+0 0x20af8400 con 0x20a4b340
0> 2019-09-19 11:35:52.682578 7fcabed3c700 -1 os/FileJournal.cc: In function
'void FileJournal::write_finish_thread_entry()' thread 7fcabed3c700 time
2019-09-19 11:35:52.640294
os/FileJournal.cc: 1426: FAILED assert(0 == "unexpected aio error")
ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85)
[0xbc8b55]
2: (FileJournal::write_finish_thread_entry()+0x695) [0xa795c5]
3: (FileJournal::WriteFinisher::entry()+0xd) [0x91cecd]
4: (()+0x7dc5) [0x7fcacb81cdc5]
5: (clone()+0x6d) [0x7fcaca2fd1cd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this
As the log shows, the OSDs went down on 09-19, but the PGs are still degraded and
remapped; recovery seems to be stuck.
ceph -s
cluster 26cc714c-ed78-4d62-9435-db3e87509c5f
health HEALTH_WARN
681 pgs degraded
681 pgs stuck degraded
797 pgs stuck unclean
681 pgs stuck undersized
681 pgs undersized
recovery 132155/11239182 objects degraded (1.176%)
recovery 22056/11239182 objects misplaced (0.196%)
monmap e1: 3 mons at
{ctrl01=xxx.xxx.xxx.xxx:6789/0,ctrl02=xxx.xxx.xxx.xxx:6789/0,ctrl03=xxx.xxx.xxx.xxx:6789/0}
election epoch 122, quorum 0,1,2 ctrl01,ctrl02,ctrl03
osdmap e6590: 324 osds: 313 up, 313 in; 116 remapped pgs
pgmap v40849600: 21504 pgs, 6 pools, 14048 GB data, 3658 kobjects
41661 GB used, 279 TB / 319 TB avail
132155/11239182 objects degraded (1.176%)
22056/11239182 objects misplaced (0.196%)
20707 active+clean
681 active+undersized+degraded
116 active+remapped
client io 121 MB/s rd, 144 MB/s wr, 1029 op/s
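For reference, this is how I list the down OSD ids, a minimal sketch that filters the
`ceph osd tree` output, assuming the up/down state is the fourth column of each osd line:

```shell
# Filter down OSD ids out of `ceph osd tree` output.
# On the cluster: ceph osd tree | awk '$4 == "down" {print $1}'
down_osds() {
    awk '$4 == "down" {print $1}'
}

# Example with two sample osd-tree lines:
printf '0 3.64 osd.0 up 1\n1 3.64 osd.1 down 0\n' | down_osds   # prints: 1
```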
I queried one of the PGs; its recovery_state is "started" [1].
I also found that the PG has no third OSD mapped, as shown below:
[root@ctrl01 ~]# ceph pg map 4.75f
osdmap e6590 pg 4.75f (4.75f) -> up [34,106] acting [34,106]
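Unless someone advises otherwise, my plan is to remove the 11 dead OSDs from the cluster
so that CRUSH can choose a new third OSD for these PGs and backfill. A sketch of the
commands I intend to run (the OSD ids below are placeholders; the real ids come from
`ceph osd tree`):

```shell
# Print the removal commands for review before actually running them.
# The OSD ids are placeholders -- substitute the ids of the 11 down OSDs.
removal_plan() {
    for id in "$@"; do
        echo "ceph osd out $id"                 # stop placing data on it
        echo "ceph osd crush remove osd.$id"    # drop it from the CRUSH map
        echo "ceph auth del osd.$id"            # delete its cephx key
        echo "ceph osd rm $id"                  # remove it from the osdmap
    done
}

removal_plan 100 101 102    # placeholder ids; review the output, then run each line
```

Removing the dead OSDs from CRUSH should let the 681 undersized PGs pick a new third
replica and recover, assuming the remaining hosts have enough capacity. Is this safe
here, or is there a better way?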
The crushmap is at [2].
How should I get the cluster back to a healthy state?
Can someone help me? Many thanks.
[1]
https://github.com/rongzhen-zhan/myfile/blob/master/pgquery
[2]
https://github.com/rongzhen-zhan/myfile/blob/master/crushmap