Hi,
Any ideas for resolving an issue where an OSD crashes on start-up?
I have one (large HDD) OSD that will no longer start: it crashes while loading PGs. See the
attached log file; an excerpt is below:
2019-08-02 10:08:21.021207 7fea86d7be00 0 osd.1 1844 load_pgs
2019-08-02 10:08:39.370112 7fea86d7be00 -1 *** Caught signal (Aborted) **
in thread 7fea86d7be00 thread_name:ceph-osd
ceph version 12.2.12 (39cfebf25a7011204a9876d2950e4b28aba66d11) luminous (stable)
1: (()+0xa59c94) [0x55b835a6dc94]
2: (()+0x110e0) [0x7fea843800e0]
3: (gsignal()+0xcf) [0x7fea83347fff]
4: (abort()+0x16a) [0x7fea8334942a]
5: (__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x7fea83c600ad]
6: (()+0x8f066) [0x7fea83c5e066]
7: (()+0x8f0b1) [0x7fea83c5e0b1]
8: (()+0x8f2c9) [0x7fea83c5e2c9]
9: (pg_log_entry_t::decode_with_checksum(ceph::buffer::list::iterator&)+0x156) [0x55b8356f57c6]
10: (void PGLog::read_log_and_missing<pg_missing_set<true> >(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t const&, PGLog::IndexedLog&, pg_missing_set<true>&, bool, std::__cxx11::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >&, bool, bool*, DoutPrefixProvider const*, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*, bool)+0x1ab4) [0x55b8355a6584]
11: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x38b) [0x55b83554b7eb]
12: (OSD::load_pgs()+0x8b8) [0x55b835496678]
13: (OSD::init()+0x2237) [0x55b8354b75c7]
14: (main()+0x3092) [0x55b8353bf1c2]
15: (__libc_start_main()+0xf1) [0x7fea833352e1]
16: (_start()+0x2a) [0x55b83544b8ca]
I have ensured that kernel.pid_max is set to a high value: sysctl reports
kernel.pid_max = 4194304.
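For reference, the same value can also be read directly from procfs. This is only a minimal sketch, assuming a Linux host that exposes /proc/sys/kernel/pid_max; the helper names parse_pid_max and read_pid_max are illustrative and not part of any Ceph tooling:

```python
from pathlib import Path

# Standard procfs location of the setting on Linux (assumed to be present).
PID_MAX_PATH = Path("/proc/sys/kernel/pid_max")


def parse_pid_max(text: str) -> int:
    """Parse the file contents, which are a single decimal integer."""
    return int(text.strip())


def read_pid_max(path: Path = PID_MAX_PATH) -> int:
    """Read and parse the current kernel.pid_max value."""
    return parse_pid_max(path.read_text())


if __name__ == "__main__":
    print("kernel.pid_max =", read_pid_max())
```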
This issue arose following an expansion of the Ceph cluster:
https://forum.proxmox.com/threads/unable-to-start-osd-crashes-while-loading…
In summary: I added a third node, with extra OSDs, and increased pg_num and pgp_num for
one pool before the cluster had settled. The cluster has since settled, and I no longer
have the global setting mon_max_pg_per_osd = 1000 in place.
Only the issue with the OSD that will not start remains.
Best regards,
Jesper Stemann Andersen
Lead Software Engineer, R&D
IHP Systems A/S
+45 26 25 23 91