Hi Milan,
given the log output mentioning 32768 spanning blobs, I believe you're
facing
https://tracker.ceph.com/issues/48216
The root cause is still unknown, but the PR attached to the
ticket makes it possible to fix the issue using the objectstore's fsck/repair.
Hence, if you're able to deploy a custom build for this specific OSD, you
can both fix your OSD and help troubleshoot the root cause - by sharing
the fsck log ;)
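For reference, roughly how the offline check and repair would be run once the patched build is in place (OSD id 12 and the default data path are placeholders; adjust for the affected OSD, and make sure the daemon is stopped first):

```shell
# Stop the affected OSD before touching its store (OSD id is hypothetical)
systemctl stop ceph-osd@12

# Read-only consistency check first; this produces the fsck log
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-12

# Then attempt the actual repair (requires the patched build from the ticket)
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-12

# Bring the OSD back up afterwards
systemctl start ceph-osd@12
```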
Please let me know if you need more assistance along the way.
Thanks,
Igor
On 11/24/2020 9:55 AM, Milan Kupcevic wrote:
> Hello,
>
> Three OSD daemons crash at the same time while processing the same
> object located in an rbd ec4+2 pool leaving a placement group in
> inactive down state. Soon after I start the osd daemons back up they
> crash again choking on the same object.
>
> ----------------------------8<------------------------------------
> _dump_onode 0x5605a27ca000
> 4#7:8565da11:::rbd_data.6.a8a8356fd674f.00000000003dce34:head# nid
> 1889617 size 0x100000 (1048576) expected_object_size 0
> expected_write_size 0 in 8 shards, 32768 spanning blobs
> ----------------------------8<------------------------------------
>
> Please take a look at the attached log file.
>
>
> Ceph status reports:
>
> Reduced data availability: 1 pg inactive, 1 pg down
>
>
> Any hints on how to get this placement group back online would be
> greatly appreciated.
>
>
> Milan
>
>