Hi,

 

Any ideas for resolving an issue where an OSD crashes on start-up?

 

I have one OSD (on a large HDD) that will no longer start: it crashes while loading PGs. See the attached log file; an excerpt is below:

2019-08-02 10:08:21.021207 7fea86d7be00  0 osd.1 1844 load_pgs
2019-08-02 10:08:39.370112 7fea86d7be00 -1 *** Caught signal (Aborted) **
in thread 7fea86d7be00 thread_name:ceph-osd

ceph version 12.2.12 (39cfebf25a7011204a9876d2950e4b28aba66d11) luminous (stable)
1: (()+0xa59c94) [0x55b835a6dc94]
2: (()+0x110e0) [0x7fea843800e0]
3: (gsignal()+0xcf) [0x7fea83347fff]
4: (abort()+0x16a) [0x7fea8334942a]
5: (__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x7fea83c600ad]
6: (()+0x8f066) [0x7fea83c5e066]
7: (()+0x8f0b1) [0x7fea83c5e0b1]
8: (()+0x8f2c9) [0x7fea83c5e2c9]
9: (pg_log_entry_t::decode_with_checksum(ceph::buffer::list::iterator&)+0x156) [0x55b8356f57c6]
10: (void PGLog::read_log_and_missing<pg_missing_set<true> >(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t const&, PGLog::IndexedLog&, pg_missing_set<true>&, bool, std::__cxx11::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >&, bool, bool*, DoutPrefixProvider const*, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*, bool)+0x1ab4) [0x55b8355a6584]
11: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x38b) [0x55b83554b7eb]
12: (OSD::load_pgs()+0x8b8) [0x55b835496678]
13: (OSD::init()+0x2237) [0x55b8354b75c7]
14: (main()+0x3092) [0x55b8353bf1c2]
15: (__libc_start_main()+0xf1) [0x7fea833352e1]
16: (_start()+0x2a) [0x55b83544b8ca]

I have ensured that kernel.pid_max is set to a high value: sysctl reports kernel.pid_max = 4194304.
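
For anyone wanting to double-check the same setting on their own node, here is a minimal sketch in Python that reads the value straight from /proc. The helper names are my own invention, and the 4194304 ceiling is, as far as I know, the maximum a 64-bit Linux kernel accepts for pid_max:

```python
from pathlib import Path

# Assumption: 4194304 (2**22) is the upper bound for pid_max on 64-bit kernels.
PID_MAX_CEILING = 4194304

# Standard procfs location backing the kernel.pid_max sysctl.
PID_MAX_PATH = "/proc/sys/kernel/pid_max"


def read_pid_max(path=PID_MAX_PATH):
    """Return the current kernel.pid_max value as an integer."""
    return int(Path(path).read_text().strip())


def pid_max_is_at_ceiling(path=PID_MAX_PATH):
    """Return True if pid_max is already at the assumed 64-bit ceiling."""
    return read_pid_max(path) >= PID_MAX_CEILING


if __name__ == "__main__":
    print("kernel.pid_max =", read_pid_max())
    print("at ceiling:", pid_max_is_at_ceiling())
```

The path is a parameter only so that the check can be pointed at a different file when trying it out away from the affected host.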

 

This issue arose following an expansion of the Ceph cluster: https://forum.proxmox.com/threads/unable-to-start-osd-crashes-while-loading-pgs.56597/

In summary: I added a third node with extra OSDs, and increased pg_num and pgp_num for one pool before the cluster had settled. The cluster has since settled, and I no longer have the global setting mon_max_pg_per_osd = 1000 in place.

Only the issue with the OSD that will not start remains.

 

Best regards,

Jesper Stemann Andersen

Lead Software Engineer, R&D

IHP Systems A/S

+45 26 25 23 91