On 19 Feb 2021, at 6:01 PM, Konstantin Shalygin
<k0ste(a)k0ste.ru> wrote:
Please paste your `nvme smart-log /dev/nvme0n1` output
k
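For reference, a sketch of how to pull that log with the nvme-cli package (the device name and distro are assumptions; adjust for your system):

```shell
# Install nvme-cli if it is missing (assumes an apt-based distro)
sudo apt-get install -y nvme-cli

# Dump the controller SMART/health log for the suspect device
sudo nvme smart-log /dev/nvme0n1

# Fields worth checking: critical_warning, media_errors,
# num_err_log_entries, and temperature (thermal throttling
# can manifest as command timeouts)

# The controller's own error-log entries are also useful here
sudo nvme error-log /dev/nvme0n1
```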
On 19 Feb 2021, at 12:53, zxcs
<zhuxiongcs(a)163.com> wrote:
I have a Ceph cluster running Nautilus 14.2.10; each node has 3 SSDs and 4 HDDs,
plus two NVMe drives used as cache (nvme0n1 caches SSDs 0-2 and nvme1n1 caches
HDDs 3-7).
On one node, nvme0n1 keeps hitting the I/O timeout errors below (`nvme ... I/O ...
timeout, aborting`), and then the device suddenly disappears.
After that I have to reboot the node to recover it.
Has anyone hit the same issue, and how did you solve it? Any suggestions are
welcome. Thanks in advance!
I googled the issue earlier and found the link below, but it didn't help:
https://askubuntu.com/questions/981657/cannot-suspend-with-nvme-m-2-ssd
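For what it's worth, the workaround usually discussed for that class of problem is to restrict NVMe autonomous power-state transitions (APST) via a kernel parameter, since some controllers hang when entering deep power-saving states. A sketch, assuming GRUB and that your controller is affected (the value 0 keeps the drive out of all power-saving states and costs power):

```shell
# Add nvme_core.default_ps_max_latency_us=0 to the default kernel
# command line; back up the file first
sudo cp /etc/default/grub /etc/default/grub.bak
sudo sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT="/&nvme_core.default_ps_max_latency_us=0 /' /etc/default/grub

# Regenerate the GRUB config and reboot for it to take effect
sudo update-grub
sudo reboot
```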
From syslog
Feb 19 01:31:52 ip kernel: [1275313.393211] nvme 0000:03:00.0: I/O 949 QID 12 timeout, aborting
Feb 19 01:31:53 ip kernel: [1275314.389232] nvme 0000:03:00.0: I/O 728 QID 5 timeout, aborting
Feb 19 01:31:53 ip kernel: [1275314.389247] nvme 0000:03:00.0: I/O 515 QID 7 timeout, aborting
Feb 19 01:31:53 ip kernel: [1275314.389252] nvme 0000:03:00.0: I/O 516 QID 7 timeout, aborting
Feb 19 01:31:53 ip kernel: [1275314.389257] nvme 0000:03:00.0: I/O 517 QID 7 timeout, aborting
Feb 19 01:31:53 ip kernel: [1275314.389263] nvme 0000:03:00.0: I/O 82 QID 9 timeout, aborting
Feb 19 01:31:53 ip kernel: [1275314.389271] nvme 0000:03:00.0: I/O 853 QID 13 timeout, aborting
Feb 19 01:31:53 ip kernel: [1275314.389275] nvme 0000:03:00.0: I/O 854 QID 13 timeout, aborting
Feb 19 01:32:23 ip kernel: [1275344.401708] nvme 0000:03:00.0: I/O 728 QID 5 timeout, reset controller
Feb 19 01:32:52 ip kernel: [1275373.394112] nvme 0000:03:00.0: I/O 0 QID 0 timeout, reset controller
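The `I/O ... timeout, aborting` messages mean commands exceeded the NVMe driver's per-command timeout, and the later `reset controller` lines show the abort itself failed. As a stopgap (it will not fix a failing drive), the timeout can be inspected and raised via the `nvme_core` module parameter — a sketch, assuming a kernel recent enough to expose it:

```shell
# Current per-I/O timeout in seconds (kernel default is 30)
cat /sys/module/nvme_core/parameters/io_timeout

# Raise it at runtime; to persist across reboots, add
# nvme_core.io_timeout=255 to the kernel command line instead
echo 255 | sudo tee /sys/module/nvme_core/parameters/io_timeout
```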
Feb 19 01:33:53 ip ceph-osd[3179]: /build/ceph-14.2.10/src/common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, ceph::time_detail::coarse_mono_clock::rep)' thread 7f36c03fb700 time 2021-02-19 01:33:53.436018
Feb 19 01:33:53 ip ceph-osd[3179]: /build/ceph-14.2.10/src/common/HeartbeatMap.cc: 82: ceph_abort_msg("hit suicide timeout")
Feb 19 01:33:53 ip ceph-osd[3179]: ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)
Feb 19 01:33:53 ip ceph-osd[3179]: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xdf) [0x83eb8c]
Feb 19 01:33:53 ip ceph-osd[3179]: 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, unsigned long)+0x4a5) [0xec56f5]
Feb 19 01:33:53 ip ceph-osd[3179]: 3: (ceph::HeartbeatMap::is_healthy()+0x106) [0xec6846]
Feb 19 01:33:53 ip ceph-osd[3179]: 4: (OSD::handle_osd_ping(MOSDPing*)+0x67c) [0x8aaf0c]
Feb 19 01:33:53 ip ceph-osd[3179]: 5: (OSD::heartbeat_dispatch(Message*)+0x1eb) [0x8b3f4b]
Feb 19 01:33:53 ip ceph-osd[3179]: 6: (DispatchQueue::fast_dispatch(boost::intrusive_ptr<Message> const&)+0x27d) [0x12456bd]
Feb 19 01:33:53 ip ceph-osd[3179]: 7: (ProtocolV2::handle_message()+0x9d6) [0x129b4e6]
Feb 19 01:33:53 ip ceph-osd[3179]: 8: (ProtocolV2::handle_read_frame_dispatch()+0x160) [0x12ad330]
Feb 19 01:33:53 ip ceph-osd[3179]: 9: (ProtocolV2::handle_read_frame_epilogue_main(std::unique_ptr<ceph::buffer::v14_2_0::ptr_node, ceph::buffer::v14_2_0::ptr_node::disposer>&&, int)+0x178) [0x12ad598]
Feb 19 01:33:53 ip ceph-osd[3179]: 10: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x34) [0x12956b4]
Feb 19 01:33:53 ip ceph-osd[3179]: 11: (AsyncConnection::process()+0x186) [0x126f446]
Feb 19 01:33:53 ip ceph-osd[3179]: 12: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x7cd) [0x10b14cd]
Feb 19 01:33:53 ip ceph-osd[3179]: 13: /usr/bin/ceph-osd() [0x10b3fd8]
Feb 19 01:33:53 ip ceph-osd[3179]: 14: /usr/bin/ceph-osd() [0x162b59f]
Feb 19 01:33:53 ip ceph-osd[3179]: 15: (()+0x76ba) [0x7f36c2ed46ba]
Feb 19 01:33:53 ip ceph-osd[3179]: 16: (clone()+0x6d) [0x7f36c24db4dd]
Feb 19 01:33:53 ip ceph-osd[3179]: *** Caught signal (Aborted) **
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io