Hi,
We currently use Ceph Pacific 16.2.10, deployed with Cephadm, on this
storage cluster. Last night, one of our OSDs died. Since its backing
storage is an SSD, we ran hardware checks and found no issue with the
SSD itself. However, if we try to start the service again, the
container crashes about one second after booting up. Looking at the
logs, there's no error: the OSD starts up normally, and the last line
before the crash is:
debug 2023-04-05T18:32:57.433+0000 7f8078e0c700 1 osd.87 pg_epoch:
207175 pg[2.99s3( v 207174'218628609 (207134'218623666,207174'218628609]
local-lis/les=207140/207141 n=38969 ec=41966/315 lis/c=207140/207049
les/c/f=207141/207050/0 sis=207175 pruub=11.464111328s)
[5,228,217,NONE,17,25,167,114,158,178,159]/[5,228,217,87,17,25,167,114,158,178,159]p5(0)
r=3 lpr=207175 pi=[207049,207175)/1 crt=207174'218628605 mlcod 0'0
remapped NOTIFY pruub 12054.601562500s@ mbc={}] state<Start>:
transitioning to Stray
I don't really see how this line could cause the OSD to crash. Systemd
just writes:
Stopping Ceph osd.83 for (uuid)
What could cause this OSD to boot up and then suddenly die? Apart from
the Ceph daemon logs and the systemd logs, is there another way I could
get more information?
--
Jean-Philippe Méthot
Senior Openstack system administrator
PlanetHoster inc.
Hi J-P Methot,
perhaps my response is a bit late, but this to some degree reminds me
of an issue we were facing yesterday.
First of all, you might want to set debug-osd to 20 for this specific
OSD and see if the log is more helpful. Please share it if possible.
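For instance (a rough sketch only; osd.87 is assumed from the log line
above, so adjust the daemon id to yours), setting it through the mon
config store makes the change survive the crash/restart cycle:

  ceph config set osd.87 debug_osd 20/20
  ceph orch daemon restart osd.87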
Secondly, I'm curious whether the last reported PG (2.99s3) is always
the same before the crash? If so, you might want to remove it from the
OSD using ceph-objectstore-tool's export-remove command; in our case
this helped to bring the OSD up. The exported PG can be loaded onto
another OSD or (if that's a single problematic OSD) just thrown away
and repaired by scrubbing...
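Roughly like this, assuming a cephadm deployment (the tool runs inside
the daemon's container, the OSD must be stopped first, and the ids and
paths below are taken from your log line, so adjust as needed):

  # open a shell with the stopped OSD's data mounted
  cephadm shell --name osd.87
  # inside the shell: export the PG to a file and remove it from the OSD
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-87 \
      --pgid 2.99s3 --op export-remove --file /tmp/pg.2.99s3.export

If you want to keep the data, the exported file can later be loaded
onto another (stopped) OSD with --op import --file
/tmp/pg.2.99s3.export.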
Thanks,
Igor