Luminous, OSDs down: "osd init failed" and "failed to load OSD map for epoch ... got 0 bytes" - ceph-users

22 May 2020

Hallo all,
      hope you can help me with very strange problems which arose
suddenly today. Tried to search, also in this mailing list, but could
not find anything relevant.

At some point today, without any action from my side, I noticed some
OSDs in my production cluster would go down and never come up.
I am on Luminous 12.2.13, CentOS7, kernel 3.10: my setup is non-standard
as OSD disks are served off a SAN (which is for sure OK now, although I
cannot exclude some glitch).
Tried to reboot OSD servers a few times, ran "activate --all", added
bluestore_ignore_data_csum=true in the [osd] section in ceph.conf...
the number of "down" OSDs changed for a while but now seems rather stable.

There are actually two classes of problems (bit more details right below):
- ERROR: osd init failed: (5) Input/output error
- failed to load OSD map for epoch 141282, got 0 bytes

*First problem*
This affects 50 OSDs (all disks of this kind, on all but one server):
these OSDs are reserved for object storage but I am not yet using them
so I may in principle recreate them. But would be interested in
understanding what the problem is, and learn how to solve it for future
reference.
Here is what I see in logs:
.....
2020-05-21 21:17:48.661348 7fa2e9a95ec0  1 bluefs add_block_device bdev
1 path /var/lib/ceph/osd/cephpa1-72/block size 14.5TiB
2020-05-21 21:17:48.661428 7fa2e9a95ec0  1 bluefs mount
2020-05-21 21:17:48.662040 7fa2e9a95ec0  1 bluefs _init_alloc id 1
alloc_size 0x10000 size 0xe83a3400000
2020-05-21 21:52:43.858464 7fa2e9a95ec0 -1 bluefs mount failed to replay
log: (5) Input/output error
2020-05-21 21:52:43.858589 7fa2e9a95ec0  1 fbmap_alloc 0x55c6bba92e00
shutdown
2020-05-21 21:52:43.858728 7fa2e9a95ec0 -1
bluestore(/var/lib/ceph/osd/cephpa1-72) _open_db failed bluefs mount:
(5) Input/output error
2020-05-21 21:52:43.858790 7fa2e9a95ec0  1 bdev(0x55c6bbdb6600
/var/lib/ceph/osd/cephpa1-72/block) close
2020-05-21 21:52:44.103536 7fa2e9a95ec0  1 bdev(0x55c6bbdb8600
/var/lib/ceph/osd/cephpa1-72/block) close
2020-05-21 21:52:44.352899 7fa2e9a95ec0 -1 osd.72 0 OSD:init: unable to
mount object store
2020-05-21 21:52:44.352956 7fa2e9a95ec0 -1 ESC[0;31m ** ERROR: osd init
failed: (5) Input/output errorESC[0m

*Second problem*
This affects 11 OSDs, which I use *in production* for Cinder block
storage: looks like all PGs for this pool are currently OK.
Here is the excerpt from the logs.
.....
      -5> 2020-05-21 20:52:06.756469 7fd2ccc19ec0  0 _get_class not
permitted to load kvs
      -4> 2020-05-21 20:52:06.759686 7fd2ccc19ec0  1 <cls>
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/cls/rgw/cls_rgw.cc:3869:

Loaded rgw class!
      -3> 2020-05-21 20:52:06.760021 7fd2ccc19ec0  1 <cls>
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/cls/log/cls_log.cc:299:

Loaded log class!
      -2> 2020-05-21 20:52:06.760730 7fd2ccc19ec0  1 <cls>
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/cls/replica_log/cls_replica_log.cc:135:

Loaded replica log class!
      -1> 2020-05-21 20:52:06.760873 7fd2ccc19ec0 -1 osd.63 0 failed to
load OSD map for epoch 141282, got 0 bytes
       0> 2020-05-21 20:52:06.763277 7fd2ccc19ec0 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/osd/OSD.h:

In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fd2ccc19ec0
time 2020-05-21 20:52:06.760916
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/osd/OSD.h:

994: FAILED assert(ret)

    Has anyone any idea how I could fix these problems, or what I could
do to try and shed some light? And also, what caused them, and whether
there is some magic configuration flag I could use to protect my cluster?

    Thanks a lot for your help!

              Fulvio