Hi,
We currently use Ceph Pacific 16.2.10, deployed with Cephadm, on this
storage cluster. Last night, one of our OSDs died. Since its backing
storage is an SSD, we ran hardware checks and found no issue with the
SSD itself. However, if we try to start the service again, the
container crashes about one second after booting up. Looking at the
logs, there's no error: the OSD starts up normally, and the last line
before the crash is:
debug 2023-04-05T18:32:57.433+0000 7f8078e0c700 1 osd.87 pg_epoch:
207175 pg[2.99s3( v 207174'218628609 (207134'218623666,207174'218628609]
local-lis/les=207140/207141 n=38969 ec=41966/315 lis/c=207140/207049
les/c/f=207141/207050/0 sis=207175 pruub=11.464111328s)
[5,228,217,NONE,17,25,167,114,158,178,159]/[5,228,217,87,17,25,167,114,158,178,159]p5(0)
r=3 lpr=207175 pi=[207049,207175)/1 crt=207174'218628605 mlcod 0'0
remapped NOTIFY pruub 12054.601562500s@ mbc={}] state<Start>:
transitioning to Stray
I don't really see how this line could cause the OSD to crash. Systemd
just writes:
Stopping Ceph osd.83 for (uuid)
What could cause this OSD to boot up and then suddenly die? Apart from
the Ceph daemon logs and the systemd logs, is there another way I could
get more information?
--
Jean-Philippe Méthot
Senior Openstack system administrator
PlanetHoster inc.
Hi J-P Methot,
perhaps my response is a bit late, but this to some degree reminds me
of an issue we were facing yesterday.
First of all, you might want to set debug-osd to 20 for this specific
OSD and see if the log is more helpful. Please share it if possible.
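For instance (a rough sketch only; osd.87 is assumed from the log line
above, so adjust the daemon id to yours), setting it through the mon
config store makes the change survive the crash/restart cycle:

  ceph config set osd.87 debug_osd 20/20
  ceph orch daemon restart osd.87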
Secondly, I'm curious whether the last reported PG (2.99s3) is always
the same before the crash? If so, you might want to remove it from the
OSD using ceph-objectstore-tool's export-remove command; in our case
this helped to bring the OSD up. The exported PG can be loaded onto
another OSD or (if that's a single problematic OSD) just thrown away
and repaired by scrubbing...
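Roughly like this, assuming a cephadm deployment (the tool runs inside
the daemon's container, the OSD must be stopped first, and the ids and
paths below are taken from your log line, so adjust as needed):

  # open a shell with the stopped OSD's data mounted
  cephadm shell --name osd.87
  # inside the shell: export the PG to a file and remove it from the OSD
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-87 \
      --pgid 2.99s3 --op export-remove --file /tmp/pg.2.99s3.export

If you want to keep the data, the exported file can later be loaded
onto another (stopped) OSD with --op import --file
/tmp/pg.2.99s3.export.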
Thanks,
Igor