I have a cluster where I increased the number of PGs because the
autoscaler wasn't working as expected. It's recovering the misplaced
objects, but an OSD just failed and refuses to come back up. The device is
readable to the OS, and the two other OSDs on the same node are still
online. I've searched online but haven't found anything relevant.
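(To clarify what I mean by "readable": raw reads from the underlying
device complete without I/O errors, e.g. with /dev/sdX as a placeholder
for whichever drive backs the failed OSD:

dd if=/dev/sdX of=/dev/null bs=1M count=1024

so this doesn't look like a dead disk.)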
This is the end of the OSD log:
-3> 2023-03-30T21:21:19.641+0000 7fcb026413c0 1 bluefs mount
-2> 2023-03-30T21:21:19.641+0000 7fcb026413c0 1 bluefs _init_alloc
shared, id 1, capacity 0x4affc00000, block size 0x10000
-1> 2023-03-30T21:21:19.673+0000 7fcb026413c0 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.3/rpm/el8/BUILD/ceph-17.2.3/src/os/bluestore/BlueFS.cc:
In function 'int BlueFS::_replay(bool, bool)' thread 7fcb026413c0 time
2023-03-30T21:21:19.665811+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.3/rpm/el8/BUILD/ceph-17.2.3/src/os/bluestore/BlueFS.cc:
1419: FAILED ceph_assert(r == q->second->file_map.end())
ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy
(stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x152) [0x56525ddd8954]
2: /usr/bin/ceph-osd(+0x5d8b75) [0x56525ddd8b75]
3: (BlueFS::_replay(bool, bool)+0x599c) [0x56525e5590ec]
4: (BlueFS::mount()+0x120) [0x56525e559530]
5: (BlueStore::_open_bluefs(bool, bool)+0x94) [0x56525e4160b4]
6: (BlueStore::_prepare_db_environment(bool, bool,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> >*, std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> >*)+0x6e1) [0x56525e417211]
7: (BlueStore::_open_db(bool, bool, bool)+0x159) [0x56525e449f69]
8: (BlueStore::_open_db_and_around(bool, bool)+0x2b4) [0x56525e493f14]
9: (BlueStore::_mount()+0x1ae) [0x56525e4970fe]
10: (OSD::init()+0x403) [0x56525df16f23]
11: main()
12: __libc_start_main()
13: _start()
0> 2023-03-30T21:21:19.681+0000 7fcb026413c0 -1 *** Caught signal
(Aborted) **
in thread 7fcb026413c0 thread_name:ceph-osd
ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy
(stable)
1: /lib64/libpthread.so.0(+0x12ce0) [0x7fcb00844ce0]
2: gsignal()
3: abort()
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x1b0) [0x56525ddd89b2]
5: /usr/bin/ceph-osd(+0x5d8b75) [0x56525ddd8b75]
6: (BlueFS::_replay(bool, bool)+0x599c) [0x56525e5590ec]
7: (BlueFS::mount()+0x120) [0x56525e559530]
8: (BlueStore::_open_bluefs(bool, bool)+0x94) [0x56525e4160b4]
9: (BlueStore::_prepare_db_environment(bool, bool,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> >*, std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> >*)+0x6e1) [0x56525e417211]
10: (BlueStore::_open_db(bool, bool, bool)+0x159) [0x56525e449f69]
11: (BlueStore::_open_db_and_around(bool, bool)+0x2b4) [0x56525e493f14]
12: (BlueStore::_mount()+0x1ae) [0x56525e4970fe]
13: (OSD::init()+0x403) [0x56525df16f23]
14: main()
15: __libc_start_main()
16: _start()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 rbd_pwl
0/ 5 journaler
0/ 5 objectcacher
0/ 5 immutable_obj_cache
0/ 5 client
1/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 0 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 rgw_sync
1/ 5 rgw_datacache
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 fuse
2/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
1/ 5 prioritycache
0/ 5 test
0/ 5 cephfs_mirror
0/ 5 cephsqlite
0/ 5 seastore
0/ 5 seastore_onode
0/ 5 seastore_odata
0/ 5 seastore_omap
0/ 5 seastore_tm
0/ 5 seastore_cleaner
0/ 5 seastore_lba
0/ 5 seastore_cache
0/ 5 seastore_journal
0/ 5 seastore_device
0/ 5 alienstore
1/ 5 mclock
-2/-2 (syslog threshold)
99/99 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
7fcafb03b700 / admin_socket
7fcafb83c700 / msgr-worker-2
7fcb026413c0 / ceph-osd
max_recent 10000
max_new 10000
log_file /var/log/ceph/ceph-osd.5.log
--- end dump of recent events ---
I'd like to recover this OSD if possible. Does anyone have any suggestions?
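Would a read-only consistency check be a sane first step before trying
anything destructive? This is roughly what I had in mind, with the osd.5
mount path guessed from the log_file line above (it may differ for
containerized deployments):

# make sure systemd doesn't restart the OSD during the check
systemctl stop ceph-osd@5

# read-only BlueStore/BlueFS consistency check; any repair would be a
# separate, explicit step afterwards
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-5

Since fsck shouldn't write anything, it seems like a safe way to confirm
whether this is BlueFS log-replay corruption hitting at mount time.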