I have a cluster where I increased the number of PGs because the
autoscaler wasn't working as expected. It's recovering the misplaced
objects, but an OSD just failed and refuses to come back up. The device is
readable to the OS, and the two other OSDs on the same node are still
online. I've searched online but haven't found anything relevant.
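(To clarify what I mean by "readable": raw reads from the underlying
device complete without I/O errors, e.g. with /dev/sdX as a placeholder
for whichever drive backs the failed OSD:

dd if=/dev/sdX of=/dev/null bs=1M count=1024

so this doesn't look like a dead disk.)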
This is the end of the OSD log:
-3> 2023-03-30T21:21:19.641+0000 7fcb026413c0 1 bluefs mount
-2> 2023-03-30T21:21:19.641+0000 7fcb026413c0 1 bluefs _init_alloc
shared, id 1, capacity 0x4affc00000, block size 0x10000
-1> 2023-03-30T21:21:19.673+0000 7fcb026413c0 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.3/rpm/el8/BUILD/ceph-17.2.3/src/os/bluestore/BlueFS.cc:
In function 'int BlueFS::_replay(bool, bool)' thread 7fcb026413c0 time
2023-03-30T21:21:19.665811+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.3/rpm/el8/BUILD/ceph-17.2.3/src/os/bluestore/BlueFS.cc:
1419: FAILED ceph_assert(r == q->second->file_map.end())
ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy
(stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x152) [0x56525ddd8954]
2: /usr/bin/ceph-osd(+0x5d8b75) [0x56525ddd8b75]
3: (BlueFS::_replay(bool, bool)+0x599c) [0x56525e5590ec]
4: (BlueFS::mount()+0x120) [0x56525e559530]
5: (BlueStore::_open_bluefs(bool, bool)+0x94) [0x56525e4160b4]
6: (BlueStore::_prepare_db_environment(bool, bool,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> >*, std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> >*)+0x6e1) [0x56525e417211]
7: (BlueStore::_open_db(bool, bool, bool)+0x159) [0x56525e449f69]
8: (BlueStore::_open_db_and_around(bool, bool)+0x2b4) [0x56525e493f14]
9: (BlueStore::_mount()+0x1ae) [0x56525e4970fe]
10: (OSD::init()+0x403) [0x56525df16f23]
11: main()
12: __libc_start_main()
13: _start()
0> 2023-03-30T21:21:19.681+0000 7fcb026413c0 -1 *** Caught signal
(Aborted) **
in thread 7fcb026413c0 thread_name:ceph-osd
ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy
(stable)
1: /lib64/libpthread.so.0(+0x12ce0) [0x7fcb00844ce0]
2: gsignal()
3: abort()
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x1b0) [0x56525ddd89b2]
5: /usr/bin/ceph-osd(+0x5d8b75) [0x56525ddd8b75]
6: (BlueFS::_replay(bool, bool)+0x599c) [0x56525e5590ec]
7: (BlueFS::mount()+0x120) [0x56525e559530]
8: (BlueStore::_open_bluefs(bool, bool)+0x94) [0x56525e4160b4]
9: (BlueStore::_prepare_db_environment(bool, bool,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> >*, std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> >*)+0x6e1) [0x56525e417211]
10: (BlueStore::_open_db(bool, bool, bool)+0x159) [0x56525e449f69]
11: (BlueStore::_open_db_and_around(bool, bool)+0x2b4) [0x56525e493f14]
12: (BlueStore::_mount()+0x1ae) [0x56525e4970fe]
13: (OSD::init()+0x403) [0x56525df16f23]
14: main()
15: __libc_start_main()
16: _start()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 rbd_pwl
0/ 5 journaler
0/ 5 objectcacher
0/ 5 immutable_obj_cache
0/ 5 client
1/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 0 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 rgw_sync
1/ 5 rgw_datacache
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 fuse
2/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
1/ 5 prioritycache
0/ 5 test
0/ 5 cephfs_mirror
0/ 5 cephsqlite
0/ 5 seastore
0/ 5 seastore_onode
0/ 5 seastore_odata
0/ 5 seastore_omap
0/ 5 seastore_tm
0/ 5 seastore_cleaner
0/ 5 seastore_lba
0/ 5 seastore_cache
0/ 5 seastore_journal
0/ 5 seastore_device
0/ 5 alienstore
1/ 5 mclock
-2/-2 (syslog threshold)
99/99 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
7fcafb03b700 / admin_socket
7fcafb83c700 / msgr-worker-2
7fcb026413c0 / ceph-osd
max_recent 10000
max_new 10000
log_file /var/log/ceph/ceph-osd.5.log
--- end dump of recent events ---
I'd like to recover this OSD if possible. Does anyone have any suggestions?
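Would a read-only consistency check be a sane first step before trying
anything destructive? This is roughly what I had in mind, with the osd.5
mount path guessed from the log_file line above (it may differ for
containerized deployments):

# make sure systemd doesn't restart the OSD during the check
systemctl stop ceph-osd@5

# read-only BlueStore/BlueFS consistency check; any repair would be a
# separate, explicit step afterwards
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-5

Since fsck shouldn't write anything, it seems like a safe way to confirm
whether this is BlueFS log-replay corruption hitting at mount time.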