Hi,

We are facing an issue where an OSD crashes after a reboot of the server it runs on.
We rebooted the servers in our Ceph cluster for patching, and after the reboot two OSDs
were crashing. One of them eventually recovered, but the other is still down.
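For context, this is how we check which OSD is still down (standard ceph CLI; on our side the crashing daemon is osd.0, as seen in the log further below):

# ceph osd tree down     # list only the OSDs that are down, together with their host
# ceph health detail     # expand each item behind HEALTH_ERR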
The cluster is currently rebalancing objects:
# ceph status
  cluster:
    id:     62d303dc-e46b-4863-93b3-7ee995594dd1
    health: HEALTH_ERR
            clients are using insecure global_id reclaim
            mons are allowing insecure global_id reclaim
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged
            mons ac,ae,v are low on available space
            4/1100134 objects unfound (0.000%)
            1 osds down
            1 host (1 osds) down
            1 nearfull osd(s)
            Reduced data availability: 9 pgs inactive, 8 pgs down, 1 pg incomplete
            Possible data damage: 4 pgs recovery_unfound
            Degraded data redundancy: 12/2204629 objects degraded (0.001%), 5 pgs degraded, 14 pgs undersized
            13 pool(s) nearfull
            236 daemons have recently crashed

  services:
    mon: 3 daemons, quorum v,ac,ae (age 13h)
    mgr: a(active, since 24m)
    mds: myfs:0/1 2 up:standby, 1 damaged
    osd: 7 osds: 6 up (since 68s), 7 in (since 18m); 131 remapped pgs
    rgw: 1 daemon active (harbor.object.store.a)

  task status:

  data:
    pools:   13 pools, 337 pgs
    objects: 1.10M objects, 2.3 TiB
    usage:   5.6 TiB used, 3.0 TiB / 8.5 TiB avail
    pgs:     2.671% pgs not active
             12/2204629 objects degraded (0.001%)
             616501/2204629 objects misplaced (27.964%)
             4/1100134 objects unfound (0.000%)
             179 active+clean
              78 active+clean+remapped
              52 active+remapped+backfill_wait
              13 active+undersized
               8 down
               4 active+recovery_unfound+degraded
               1 active+undersized+degraded
               1 incomplete
               1 active+remapped+backfilling

  io:
    client:   8.0 KiB/s wr, 0 op/s rd, 0 op/s wr
    recovery: 1.4 MiB/s, 7 objects/s
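If it helps, we can also pull the crash reports that Ceph has stored for the 236 recent crashes (standard `ceph crash` commands; the crash ID below is a placeholder):

# ceph crash ls                   # list recently recorded crashes with their IDs
# ceph crash info <crash-id>      # full backtrace/metadata for one crash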
These are the last lines from the crashing OSD's log. We are not sure why it is crashing :(
4(1) r=-1 lpr=69148 pi=[69137,69148)/2 crt=7010'401 mlcod 0'0 remapped NOTIFY
m=276 mbc={}] exit Started/Stray 1.009572 7 0.001960
-19> 2023-07-15T06:58:40.723+0000 7f444a0b0700 3 osd.0 69149 handle_osd_map epochs
[69149,69149], i have 69149, src has [65617,69149]
-18> 2023-07-15T06:58:40.723+0000 7f444109e700 5 osd.0 pg_epoch: 69149 pg[10.14( v
9188'79 (0'0,9188'79] local-lis/les=69148/69149 n=4 ec=8825/8825
lis/c=69148/69143 les/c/f=69149/69144/0 sis=69148) [0,3] r=0 lpr=69148 pi=[66972,69148)/5
crt=9188'79 lcod 0'0 mlcod 0'0 active mbc={}] enter
Started/Primary/Active/Clean
-17> 2023-07-15T06:58:40.723+0000 7f444009c700 5 osd.0 pg_epoch: 69149 pg[8.7s0( v
7010'401 lc 0'0 (0'0,7010'401] local-lis/les=0/0 n=151 ec=77/77
lis/c=69144/69144 les/c/f=69145/69145/0 sis=69148) [0,4,6]/[NONE,4,6]p4(1) r=-1 lpr=69148
pi=[69137,69148)/2 crt=7010'401 mlcod 0'0 remapped NOTIFY m=276 mbc={}] enter
Started/ReplicaActive
-16> 2023-07-15T06:58:40.723+0000 7f444009c700 5 osd.0 pg_epoch: 69149 pg[8.7s0( v
7010'401 lc 0'0 (0'0,7010'401] local-lis/les=0/0 n=151 ec=77/77
lis/c=69144/69144 les/c/f=69145/69145/0 sis=69148) [0,4,6]/[NONE,4,6]p4(1) r=-1 lpr=69148
pi=[69137,69148)/2 crt=7010'401 mlcod 0'0 remapped NOTIFY m=276 mbc={}] enter
Started/ReplicaActive/RepNotRecovering
-15> 2023-07-15T06:58:40.723+0000 7f444a0b0700 3 osd.0 69149 handle_osd_map epochs
[69149,69149], i have 69149, src has [65617,69149]
-14> 2023-07-15T06:58:40.724+0000 7f444089d700 5 osd.0 pg_epoch: 69149 pg[10.1f( v
9092'7 lc 8831'4 (0'0,9092'7] local-lis/les=69148/69149 n=2 ec=8825/8825
lis/c=69148/69144 les/c/f=69149/69145/0 sis=69148) [6,0] r=1 lpr=69148 pi=[69137,69148)/2
luod=0'0 crt=9092'7 lcod 0'0 mlcod 0'0 active m=2 mbc={}] exit
Started/ReplicaActive/RepNotRecovering 0.001805 4 0.000066
-13> 2023-07-15T06:58:40.724+0000 7f444089d700 5 osd.0 pg_epoch: 69149 pg[10.1f( v
9092'7 lc 8831'4 (0'0,9092'7] local-lis/les=69148/69149 n=2 ec=8825/8825
lis/c=69148/69144 les/c/f=69149/69145/0 sis=69148) [6,0] r=1 lpr=69148 pi=[69137,69148)/2
luod=0'0 crt=9092'7 lcod 0'0 mlcod 0'0 active m=2 mbc={}] enter
Started/ReplicaActive/RepWaitRecoveryReserved
-12> 2023-07-15T06:58:40.724+0000 7f444089d700 5 osd.0 pg_epoch: 69149 pg[10.1f( v
9092'7 lc 8831'4 (0'0,9092'7] local-lis/les=69148/69149 n=2 ec=8825/8825
lis/c=69148/69144 les/c/f=69149/69145/0 sis=69148) [6,0] r=1 lpr=69148 pi=[69137,69148)/2
luod=0'0 crt=9092'7 lcod 0'0 mlcod 0'0 active m=2 mbc={}] exit
Started/ReplicaActive/RepWaitRecoveryReserved 0.000043 1 0.000053
-11> 2023-07-15T06:58:40.724+0000 7f444089d700 5 osd.0 pg_epoch: 69149 pg[10.1f( v
9092'7 lc 8831'4 (0'0,9092'7] local-lis/les=69148/69149 n=2 ec=8825/8825
lis/c=69148/69144 les/c/f=69149/69145/0 sis=69148) [6,0] r=1 lpr=69148 pi=[69137,69148)/2
luod=0'0 crt=9092'7 lcod 0'0 mlcod 0'0 active m=2 mbc={}] enter
Started/ReplicaActive/RepRecovering
-10> 2023-07-15T06:58:40.725+0000 7f444009c700 5 osd.0 pg_epoch: 69149 pg[7.2( v
268'5 (79'3,268'5] lb MIN local-lis/les=69140/69141 n=0 ec=71/68
lis/c=69147/69143 les/c/f=69148/69144/0 sis=69147) [0,2]/[2,5] r=-1 lpr=69147
pi=[69137,69147)/1 luod=0'0 crt=268'5 lcod 0'0 mlcod 0'0 active+remapped
mbc={}] exit Started/ReplicaActive/RepNotRecovering 1.010921 6 0.000077
-9> 2023-07-15T06:58:40.725+0000 7f444009c700 5 osd.0 pg_epoch: 69149 pg[7.2( v
268'5 (79'3,268'5] lb MIN local-lis/les=69140/69141 n=0 ec=71/68
lis/c=69147/69143 les/c/f=69148/69144/0 sis=69147) [0,2]/[2,5] r=-1 lpr=69147
pi=[69137,69147)/1 luod=0'0 crt=268'5 lcod 0'0 mlcod 0'0 active+remapped
mbc={}] enter Started/ReplicaActive/RepWaitBackfillReserved
-8> 2023-07-15T06:58:40.726+0000 7f444a0b0700 3 osd.0 69149 handle_osd_map epochs
[69149,69149], i have 69149, src has [65617,69149]
-7> 2023-07-15T06:58:40.726+0000 7f444009c700 5 osd.0 pg_epoch: 69149 pg[12.c( v
65902'349780 (65642'347213,65902'349780] lb
12:30005f02:::1000060a6a4.00000000:head local-lis/les=67222/67223 n=1903 ec=9097/9097
lis/c=69147/69143 les/c/f=69148/69144/0 sis=69147) [0,5]/[5,1] r=-1 lpr=69147
pi=[67170,69147)/2 luod=0'0 crt=65902'349780 mlcod 0'0 active+remapped mbc={}]
exit Started/ReplicaActive/RepRecovering 0.996833 5 0.000101
-6> 2023-07-15T06:58:40.726+0000 7f444009c700 5 osd.0 pg_epoch: 69149 pg[12.c( v
65902'349780 (65642'347213,65902'349780] lb
12:30005f02:::1000060a6a4.00000000:head local-lis/les=67222/67223 n=1903 ec=9097/9097
lis/c=69147/69143 les/c/f=69148/69144/0 sis=69147) [0,5]/[5,1] r=-1 lpr=69147
pi=[67170,69147)/2 luod=0'0 crt=65902'349780 mlcod 0'0 active+remapped mbc={}]
enter Started/ReplicaActive/RepNotRecovering
-5> 2023-07-15T06:58:40.726+0000 7f444009c700 5 osd.0 pg_epoch: 69149 pg[8.7s0( v
7010'401 lc 0'0 (0'0,7010'401] local-lis/les=0/0 n=151 ec=77/77
lis/c=69148/69144 les/c/f=69149/69145/0 sis=69148) [0,4,6]/[NONE,4,6]p4(1) r=-1 lpr=69148
pi=[69137,69148)/2 luod=0'0 crt=7010'401 mlcod 0'0 active+remapped m=276
mbc={}] exit Started/ReplicaActive/RepNotRecovering 0.002781 3 0.000076
-4> 2023-07-15T06:58:40.726+0000 7f444009c700 5 osd.0 pg_epoch: 69149 pg[8.7s0( v
7010'401 lc 0'0 (0'0,7010'401] local-lis/les=0/0 n=151 ec=77/77
lis/c=69148/69144 les/c/f=69149/69145/0 sis=69148) [0,4,6]/[NONE,4,6]p4(1) r=-1 lpr=69148
pi=[69137,69148)/2 luod=0'0 crt=7010'401 mlcod 0'0 active+remapped m=276
mbc={}] enter Started/ReplicaActive/RepWaitRecoveryReserved
-3> 2023-07-15T06:58:40.727+0000 7f444089d700 5 osd.0 pg_epoch: 69149 pg[12.1( v
65904'388229 (65626'385027,65904'388229] lb MIN local-lis/les=67170/67171
n=1876 ec=9097/9097 lis/c=69147/69143 les/c/f=69148/69144/0 sis=69147) [5,0]/[5,3] r=-1
lpr=69147 pi=[67170,69147)/3 luod=0'0 crt=65904'388229 lcod 0'0 mlcod 0'0
active+remapped mbc={}] exit Started/ReplicaActive/RepNotRecovering 1.009973 6 0.000367
-2> 2023-07-15T06:58:40.727+0000 7f444089d700 5 osd.0 pg_epoch: 69149 pg[12.1( v
65904'388229 (65626'385027,65904'388229] lb MIN local-lis/les=67170/67171
n=1876 ec=9097/9097 lis/c=69147/69143 les/c/f=69148/69144/0 sis=69147) [5,0]/[5,3] r=-1
lpr=69147 pi=[67170,69147)/3 luod=0'0 crt=65904'388229 lcod 0'0 mlcod 0'0
active+remapped mbc={}] enter Started/ReplicaActive/RepWaitBackfillReserved
-1> 2023-07-15T06:58:40.729+0000 7f444109e700 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.13/rpm/el8/BUILD/ceph-15.2.13/src/osd/PGLog.cc:
In function 'void PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t,
pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)' thread 7f444109e700
time 2023-07-15T06:58:40.727106+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.13/rpm/el8/BUILD/ceph-15.2.13/src/osd/PGLog.cc:
369: FAILED ceph_assert(log.head >= olog.tail && olog.head >= log.tail)
ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158)
[0x563620b4dbd8]
2: (()+0x507df2) [0x563620b4ddf2]
3: (PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&,
PGLog::LogEntryHandler*, bool&, bool&)+0x1ca1) [0x563620d121f1]
4: (PeeringState::merge_log(ceph::os::Transaction&, pg_info_t&, pg_log_t&,
pg_shard_t)+0x75) [0x563620e982c5]
5: (PeeringState::Stray::react(MLogRec const&)+0xcc) [0x563620ed308c]
6: (boost::statechart::simple_state<PeeringState::Stray, PeeringState::Started,
boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
const&, void const*)+0xa5) [0x563620efefb5]
7: (boost::statechart::state_machine<PeeringState::PeeringMachine,
PeeringState::Initial, std::allocator<boost::statechart::none>,
boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base
const&)+0x5b) [0x563620cf22ab]
8: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PeeringCtx&)+0x2d1)
[0x563620ce48a1]
9: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>,
ThreadPool::TPHandle&)+0x29c) [0x563620c5bc7c]
10: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*,
boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x563620e8d906]
11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x12ef)
[0x563620c4e92f]
12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x56362128ef84]
13: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x563621291be4]
14: (()+0x814a) [0x7f446112c14a]
15: (clone()+0x43) [0x7f445fe63f23]
0> 2023-07-15T06:58:40.733+0000 7f444109e700 -1 *** Caught signal (Aborted) **
in thread 7f444109e700 thread_name:tp_osd_tp
ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)
1: (()+0x12b20) [0x7f4461136b20]
2: (gsignal()+0x10f) [0x7f445fd9e7ff]
3: (abort()+0x127) [0x7f445fd88c35]
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9)
[0x563620b4dc29]
5: (()+0x507df2) [0x563620b4ddf2]
6: (PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&,
PGLog::LogEntryHandler*, bool&, bool&)+0x1ca1) [0x563620d121f1]
7: (PeeringState::merge_log(ceph::os::Transaction&, pg_info_t&, pg_log_t&,
pg_shard_t)+0x75) [0x563620e982c5]
8: (PeeringState::Stray::react(MLogRec const&)+0xcc) [0x563620ed308c]
9: (boost::statechart::simple_state<PeeringState::Stray, PeeringState::Started,
boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
const&, void const*)+0xa5) [0x563620efefb5]
10: (boost::statechart::state_machine<PeeringState::PeeringMachine,
PeeringState::Initial, std::allocator<boost::statechart::none>,
boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base
const&)+0x5b) [0x563620cf22ab]
11: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PeeringCtx&)+0x2d1)
[0x563620ce48a1]
12: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>,
ThreadPool::TPHandle&)+0x29c) [0x563620c5bc7c]
13: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*,
boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x563620e8d906]
14: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x12ef)
[0x563620c4e92f]
15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x56362128ef84]
16: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x563621291be4]
17: (()+0x814a) [0x7f446112c14a]
18: (clone()+0x43) [0x7f445fe63f23]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 rbd_rwl
0/ 5 journaler
0/ 5 objectcacher
0/ 5 immutable_obj_cache
0/ 5 client
1/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 0 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 rgw_sync
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 fuse
1/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
1/ 5 prioritycache
0/ 5 test
-2/-2 (syslog threshold)
99/99 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
7f443e899700 / osd_srv_heartbt
7f443f09a700 / tp_osd_tp
7f443f89b700 / tp_osd_tp
7f444009c700 / tp_osd_tp
7f444089d700 / tp_osd_tp
7f444109e700 / tp_osd_tp
7f444a0b0700 / ms_dispatch
7f444b0b2700 / rocksdb:dump_st
7f444beae700 / fn_anonymous
7f444ceb0700 / cfin
7f444e28c700 / safe_timer
7f444f28e700 / ms_dispatch
7f4451eba700 / bstore_mempool
7f44570ca700 / fn_anonymous
7f44588cd700 / safe_timer
7f445a141700 / safe_timer
7f445a942700 / signal_handler
7f445b944700 / admin_socket
7f445c145700 / service
7f445c946700 / msgr-worker-2
7f445d147700 / msgr-worker-1
7f445d948700 / msgr-worker-0
7f44633cef00 / ceph-osd
max_recent 10000
max_new 1000
log_file
/var/lib/ceph/crash/2023-07-15T06:58:40.734237Z_21e01469-d6d6-4be4-b913-f9cc55a7ab22/log
--- end dump of recent events ---
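From the backtrace, the OSD aborts on the ceph_assert in PGLog::merge_log (log.head >= olog.tail && olog.head >= log.tail) while a PG in the Stray state merges its log during peering. Before we try anything destructive, we are thinking of taking backups of the PGs on the down OSD with ceph-objectstore-tool, roughly like this (a sketch only; the data path and the PG ID 8.7s0 are from our setup, adjust as needed):

# systemctl stop ceph-osd@0
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 8.7s0 --op export --file /root/pg-8.7s0.export
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 8.7s0 --op log    # dump the PG log/info that merge_log is comparing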
We would appreciate any help pointing us to a possible troubleshooting path :)
Thanks a lot and kind regards,