Hi,

We are facing an issue where an OSD crashes after a reboot of the server it runs on.
We rebooted the servers in our Ceph cluster for patching, and after the reboot two OSDs
were crashing. One of them eventually recovered, but the other is still down.
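For context, this is how we check which OSD is still down (standard ceph CLI; on our side the crashing daemon is osd.0, as seen in the log further below):

# ceph osd tree down     # list only the OSDs that are down, together with their host
# ceph health detail     # expand each item behind HEALTH_ERR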
The cluster is currently rebalancing objects:
# ceph status
  cluster:
    id:     62d303dc-e46b-4863-93b3-7ee995594dd1
    health: HEALTH_ERR
            clients are using insecure global_id reclaim
            mons are allowing insecure global_id reclaim
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged
            mons ac,ae,v are low on available space
            4/1100134 objects unfound (0.000%)
            1 osds down
            1 host (1 osds) down
            1 nearfull osd(s)
            Reduced data availability: 9 pgs inactive, 8 pgs down, 1 pg incomplete
            Possible data damage: 4 pgs recovery_unfound
            Degraded data redundancy: 12/2204629 objects degraded (0.001%), 5 pgs degraded, 14 pgs undersized
            13 pool(s) nearfull
            236 daemons have recently crashed

  services:
    mon: 3 daemons, quorum v,ac,ae (age 13h)
    mgr: a(active, since 24m)
    mds: myfs:0/1 2 up:standby, 1 damaged
    osd: 7 osds: 6 up (since 68s), 7 in (since 18m); 131 remapped pgs
    rgw: 1 daemon active (harbor.object.store.a)

  task status:

  data:
    pools:   13 pools, 337 pgs
    objects: 1.10M objects, 2.3 TiB
    usage:   5.6 TiB used, 3.0 TiB / 8.5 TiB avail
    pgs:     2.671% pgs not active
             12/2204629 objects degraded (0.001%)
             616501/2204629 objects misplaced (27.964%)
             4/1100134 objects unfound (0.000%)
             179 active+clean
              78 active+clean+remapped
              52 active+remapped+backfill_wait
              13 active+undersized
               8 down
               4 active+recovery_unfound+degraded
               1 active+undersized+degraded
               1 incomplete
               1 active+remapped+backfilling

  io:
    client:   8.0 KiB/s wr, 0 op/s rd, 0 op/s wr
    recovery: 1.4 MiB/s, 7 objects/s
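If it helps, we can also pull the crash reports that Ceph has stored for the 236 recent crashes (standard `ceph crash` commands; the crash ID below is a placeholder):

# ceph crash ls                   # list recently recorded crashes with their IDs
# ceph crash info <crash-id>      # full backtrace/metadata for one crash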
These are the last lines from the crashing OSD's log. We are not sure why it is crashing :(
4(1) r=-1 lpr=69148 pi=[69137,69148)/2 crt=7010'401 mlcod 0'0 remapped NOTIFY
m=276 mbc={}] exit Started/Stray 1.009572 7 0.001960
-19> 2023-07-15T06:58:40.723+0000 7f444a0b0700 3 osd.0 69149 handle_osd_map epochs
[69149,69149], i have 69149, src has [65617,69149]
-18> 2023-07-15T06:58:40.723+0000 7f444109e700 5 osd.0 pg_epoch: 69149 pg[10.14( v
9188'79 (0'0,9188'79] local-lis/les=69148/69149 n=4 ec=8825/8825
lis/c=69148/69143 les/c/f=69149/69144/0 sis=69148) [0,3] r=0 lpr=69148 pi=[66972,69148)/5
crt=9188'79 lcod 0'0 mlcod 0'0 active mbc={}] enter
Started/Primary/Active/Clean
-17> 2023-07-15T06:58:40.723+0000 7f444009c700 5 osd.0 pg_epoch: 69149 pg[8.7s0( v
7010'401 lc 0'0 (0'0,7010'401] local-lis/les=0/0 n=151 ec=77/77
lis/c=69144/69144 les/c/f=69145/69145/0 sis=69148) [0,4,6]/[NONE,4,6]p4(1) r=-1 lpr=69148
pi=[69137,69148)/2 crt=7010'401 mlcod 0'0 remapped NOTIFY m=276 mbc={}] enter
Started/ReplicaActive
-16> 2023-07-15T06:58:40.723+0000 7f444009c700 5 osd.0 pg_epoch: 69149 pg[8.7s0( v
7010'401 lc 0'0 (0'0,7010'401] local-lis/les=0/0 n=151 ec=77/77
lis/c=69144/69144 les/c/f=69145/69145/0 sis=69148) [0,4,6]/[NONE,4,6]p4(1) r=-1 lpr=69148
pi=[69137,69148)/2 crt=7010'401 mlcod 0'0 remapped NOTIFY m=276 mbc={}] enter
Started/ReplicaActive/RepNotRecovering
-15> 2023-07-15T06:58:40.723+0000 7f444a0b0700 3 osd.0 69149 handle_osd_map epochs
[69149,69149], i have 69149, src has [65617,69149]
-14> 2023-07-15T06:58:40.724+0000 7f444089d700 5 osd.0 pg_epoch: 69149 pg[10.1f( v
9092'7 lc 8831'4 (0'0,9092'7] local-lis/les=69148/69149 n=2 ec=8825/8825
lis/c=69148/69144 les/c/f=69149/69145/0 sis=69148) [6,0] r=1 lpr=69148 pi=[69137,69148)/2
luod=0'0 crt=9092'7 lcod 0'0 mlcod 0'0 active m=2 mbc={}] exit
Started/ReplicaActive/RepNotRecovering 0.001805 4 0.000066
-13> 2023-07-15T06:58:40.724+0000 7f444089d700 5 osd.0 pg_epoch: 69149 pg[10.1f( v
9092'7 lc 8831'4 (0'0,9092'7] local-lis/les=69148/69149 n=2 ec=8825/8825
lis/c=69148/69144 les/c/f=69149/69145/0 sis=69148) [6,0] r=1 lpr=69148 pi=[69137,69148)/2
luod=0'0 crt=9092'7 lcod 0'0 mlcod 0'0 active m=2 mbc={}] enter
Started/ReplicaActive/RepWaitRecoveryReserved
-12> 2023-07-15T06:58:40.724+0000 7f444089d700 5 osd.0 pg_epoch: 69149 pg[10.1f( v
9092'7 lc 8831'4 (0'0,9092'7] local-lis/les=69148/69149 n=2 ec=8825/8825
lis/c=69148/69144 les/c/f=69149/69145/0 sis=69148) [6,0] r=1 lpr=69148 pi=[69137,69148)/2
luod=0'0 crt=9092'7 lcod 0'0 mlcod 0'0 active m=2 mbc={}] exit
Started/ReplicaActive/RepWaitRecoveryReserved 0.000043 1 0.000053
-11> 2023-07-15T06:58:40.724+0000 7f444089d700 5 osd.0 pg_epoch: 69149 pg[10.1f( v
9092'7 lc 8831'4 (0'0,9092'7] local-lis/les=69148/69149 n=2 ec=8825/8825
lis/c=69148/69144 les/c/f=69149/69145/0 sis=69148) [6,0] r=1 lpr=69148 pi=[69137,69148)/2
luod=0'0 crt=9092'7 lcod 0'0 mlcod 0'0 active m=2 mbc={}] enter
Started/ReplicaActive/RepRecovering
-10> 2023-07-15T06:58:40.725+0000 7f444009c700 5 osd.0 pg_epoch: 69149 pg[7.2( v
268'5 (79'3,268'5] lb MIN local-lis/les=69140/69141 n=0 ec=71/68
lis/c=69147/69143 les/c/f=69148/69144/0 sis=69147) [0,2]/[2,5] r=-1 lpr=69147
pi=[69137,69147)/1 luod=0'0 crt=268'5 lcod 0'0 mlcod 0'0 active+remapped
mbc={}] exit Started/ReplicaActive/RepNotRecovering 1.010921 6 0.000077
-9> 2023-07-15T06:58:40.725+0000 7f444009c700 5 osd.0 pg_epoch: 69149 pg[7.2( v
268'5 (79'3,268'5] lb MIN local-lis/les=69140/69141 n=0 ec=71/68
lis/c=69147/69143 les/c/f=69148/69144/0 sis=69147) [0,2]/[2,5] r=-1 lpr=69147
pi=[69137,69147)/1 luod=0'0 crt=268'5 lcod 0'0 mlcod 0'0 active+remapped
mbc={}] enter Started/ReplicaActive/RepWaitBackfillReserved
-8> 2023-07-15T06:58:40.726+0000 7f444a0b0700 3 osd.0 69149 handle_osd_map epochs
[69149,69149], i have 69149, src has [65617,69149]
-7> 2023-07-15T06:58:40.726+0000 7f444009c700 5 osd.0 pg_epoch: 69149 pg[12.c( v
65902'349780 (65642'347213,65902'349780] lb
12:30005f02:::1000060a6a4.00000000:head local-lis/les=67222/67223 n=1903 ec=9097/9097
lis/c=69147/69143 les/c/f=69148/69144/0 sis=69147) [0,5]/[5,1] r=-1 lpr=69147
pi=[67170,69147)/2 luod=0'0 crt=65902'349780 mlcod 0'0 active+remapped mbc={}]
exit Started/ReplicaActive/RepRecovering 0.996833 5 0.000101
-6> 2023-07-15T06:58:40.726+0000 7f444009c700 5 osd.0 pg_epoch: 69149 pg[12.c( v
65902'349780 (65642'347213,65902'349780] lb
12:30005f02:::1000060a6a4.00000000:head local-lis/les=67222/67223 n=1903 ec=9097/9097
lis/c=69147/69143 les/c/f=69148/69144/0 sis=69147) [0,5]/[5,1] r=-1 lpr=69147
pi=[67170,69147)/2 luod=0'0 crt=65902'349780 mlcod 0'0 active+remapped mbc={}]
enter Started/ReplicaActive/RepNotRecovering
-5> 2023-07-15T06:58:40.726+0000 7f444009c700 5 osd.0 pg_epoch: 69149 pg[8.7s0( v
7010'401 lc 0'0 (0'0,7010'401] local-lis/les=0/0 n=151 ec=77/77
lis/c=69148/69144 les/c/f=69149/69145/0 sis=69148) [0,4,6]/[NONE,4,6]p4(1) r=-1 lpr=69148
pi=[69137,69148)/2 luod=0'0 crt=7010'401 mlcod 0'0 active+remapped m=276
mbc={}] exit Started/ReplicaActive/RepNotRecovering 0.002781 3 0.000076
-4> 2023-07-15T06:58:40.726+0000 7f444009c700 5 osd.0 pg_epoch: 69149 pg[8.7s0( v
7010'401 lc 0'0 (0'0,7010'401] local-lis/les=0/0 n=151 ec=77/77
lis/c=69148/69144 les/c/f=69149/69145/0 sis=69148) [0,4,6]/[NONE,4,6]p4(1) r=-1 lpr=69148
pi=[69137,69148)/2 luod=0'0 crt=7010'401 mlcod 0'0 active+remapped m=276
mbc={}] enter Started/ReplicaActive/RepWaitRecoveryReserved
-3> 2023-07-15T06:58:40.727+0000 7f444089d700 5 osd.0 pg_epoch: 69149 pg[12.1( v
65904'388229 (65626'385027,65904'388229] lb MIN local-lis/les=67170/67171
n=1876 ec=9097/9097 lis/c=69147/69143 les/c/f=69148/69144/0 sis=69147) [5,0]/[5,3] r=-1
lpr=69147 pi=[67170,69147)/3 luod=0'0 crt=65904'388229 lcod 0'0 mlcod 0'0
active+remapped mbc={}] exit Started/ReplicaActive/RepNotRecovering 1.009973 6 0.000367
-2> 2023-07-15T06:58:40.727+0000 7f444089d700 5 osd.0 pg_epoch: 69149 pg[12.1( v
65904'388229 (65626'385027,65904'388229] lb MIN local-lis/les=67170/67171
n=1876 ec=9097/9097 lis/c=69147/69143 les/c/f=69148/69144/0 sis=69147) [5,0]/[5,3] r=-1
lpr=69147 pi=[67170,69147)/3 luod=0'0 crt=65904'388229 lcod 0'0 mlcod 0'0
active+remapped mbc={}] enter Started/ReplicaActive/RepWaitBackfillReserved
-1> 2023-07-15T06:58:40.729+0000 7f444109e700 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.13/rpm/el8/BUILD/ceph-15.2.13/src/osd/PGLog.cc:
In function 'void PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t,
pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)' thread 7f444109e700
time 2023-07-15T06:58:40.727106+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.13/rpm/el8/BUILD/ceph-15.2.13/src/osd/PGLog.cc:
369: FAILED ceph_assert(log.head >= olog.tail && olog.head >= log.tail)
ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158)
[0x563620b4dbd8]
2: (()+0x507df2) [0x563620b4ddf2]
3: (PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&,
PGLog::LogEntryHandler*, bool&, bool&)+0x1ca1) [0x563620d121f1]
4: (PeeringState::merge_log(ceph::os::Transaction&, pg_info_t&, pg_log_t&,
pg_shard_t)+0x75) [0x563620e982c5]
5: (PeeringState::Stray::react(MLogRec const&)+0xcc) [0x563620ed308c]
6: (boost::statechart::simple_state<PeeringState::Stray, PeeringState::Started,
boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
const&, void const*)+0xa5) [0x563620efefb5]
7: (boost::statechart::state_machine<PeeringState::PeeringMachine,
PeeringState::Initial, std::allocator<boost::statechart::none>,
boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base
const&)+0x5b) [0x563620cf22ab]
8: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PeeringCtx&)+0x2d1)
[0x563620ce48a1]
9: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>,
ThreadPool::TPHandle&)+0x29c) [0x563620c5bc7c]
10: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*,
boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x563620e8d906]
11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x12ef)
[0x563620c4e92f]
12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x56362128ef84]
13: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x563621291be4]
14: (()+0x814a) [0x7f446112c14a]
15: (clone()+0x43) [0x7f445fe63f23]
0> 2023-07-15T06:58:40.733+0000 7f444109e700 -1 *** Caught signal (Aborted) **
in thread 7f444109e700 thread_name:tp_osd_tp
ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)
1: (()+0x12b20) [0x7f4461136b20]
2: (gsignal()+0x10f) [0x7f445fd9e7ff]
3: (abort()+0x127) [0x7f445fd88c35]
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9)
[0x563620b4dc29]
5: (()+0x507df2) [0x563620b4ddf2]
6: (PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&,
PGLog::LogEntryHandler*, bool&, bool&)+0x1ca1) [0x563620d121f1]
7: (PeeringState::merge_log(ceph::os::Transaction&, pg_info_t&, pg_log_t&,
pg_shard_t)+0x75) [0x563620e982c5]
8: (PeeringState::Stray::react(MLogRec const&)+0xcc) [0x563620ed308c]
9: (boost::statechart::simple_state<PeeringState::Stray, PeeringState::Started,
boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
const&, void const*)+0xa5) [0x563620efefb5]
10: (boost::statechart::state_machine<PeeringState::PeeringMachine,
PeeringState::Initial, std::allocator<boost::statechart::none>,
boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base
const&)+0x5b) [0x563620cf22ab]
11: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PeeringCtx&)+0x2d1)
[0x563620ce48a1]
12: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>,
ThreadPool::TPHandle&)+0x29c) [0x563620c5bc7c]
13: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*,
boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x563620e8d906]
14: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x12ef)
[0x563620c4e92f]
15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x56362128ef84]
16: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x563621291be4]
17: (()+0x814a) [0x7f446112c14a]
18: (clone()+0x43) [0x7f445fe63f23]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 rbd_rwl
0/ 5 journaler
0/ 5 objectcacher
0/ 5 immutable_obj_cache
0/ 5 client
1/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 0 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 rgw_sync
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 fuse
1/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
1/ 5 prioritycache
0/ 5 test
-2/-2 (syslog threshold)
99/99 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
7f443e899700 / osd_srv_heartbt
7f443f09a700 / tp_osd_tp
7f443f89b700 / tp_osd_tp
7f444009c700 / tp_osd_tp
7f444089d700 / tp_osd_tp
7f444109e700 / tp_osd_tp
7f444a0b0700 / ms_dispatch
7f444b0b2700 / rocksdb:dump_st
7f444beae700 / fn_anonymous
7f444ceb0700 / cfin
7f444e28c700 / safe_timer
7f444f28e700 / ms_dispatch
7f4451eba700 / bstore_mempool
7f44570ca700 / fn_anonymous
7f44588cd700 / safe_timer
7f445a141700 / safe_timer
7f445a942700 / signal_handler
7f445b944700 / admin_socket
7f445c145700 / service
7f445c946700 / msgr-worker-2
7f445d147700 / msgr-worker-1
7f445d948700 / msgr-worker-0
7f44633cef00 / ceph-osd
max_recent 10000
max_new 1000
log_file
/var/lib/ceph/crash/2023-07-15T06:58:40.734237Z_21e01469-d6d6-4be4-b913-f9cc55a7ab22/log
--- end dump of recent events ---
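From the backtrace, the OSD aborts on the ceph_assert in PGLog::merge_log (log.head >= olog.tail && olog.head >= log.tail) while a PG in the Stray state merges its log during peering. Before we try anything destructive, we are thinking of taking backups of the PGs on the down OSD with ceph-objectstore-tool, roughly like this (a sketch only; the data path and the PG ID 8.7s0 are from our setup, adjust as needed):

# systemctl stop ceph-osd@0
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 8.7s0 --op export --file /root/pg-8.7s0.export
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 8.7s0 --op log    # dump the PG log/info that merge_log is comparing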
We would appreciate any help pointing us to a possible troubleshooting path :)
Thanks a lot and kind regards,