Hi Milan
Please DO NOT delete the object for all the EC shards (i.e., on all three
OSDs)!!!!
Sorry, I missed that you have three shards crashing... Removing that
many object shards will cause data loss.
Theoretically, removing just a single shard of the object and then doing a
scrub might help. But honestly, I'm not 100% sure this is safe...
Hence I'd recommend using the mentioned patch if possible...
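
In case you do go the single-shard route, here is roughly what that could
look like. The OSD id, PG id and object spec below are placeholders, and
this is only a sketch of the workflow, not a verified procedure -- please
double-check every step against your cluster before running anything:

----------------------------8<------------------------------------
# stop only ONE of the affected OSDs; leave the other shards untouched
ceph osd set noout
systemctl stop ceph-osd@<osd-id>

# locate the exact object spec for the problematic object on that OSD
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<osd-id> \
    --pgid <pgid> --op list rbd_data.6.a8a8356fd674f.00000000003dce34

# remove that single shard (on this one OSD only!)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<osd-id> \
    --pgid <pgid> '<object-spec-from-list-output>' remove

# bring the OSD back and let a scrub/repair rebuild the missing shard
systemctl start ceph-osd@<osd-id>
ceph osd unset noout
ceph pg deep-scrub <pgid>
ceph pg repair <pgid>
----------------------------8<------------------------------------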
Thanks,
Igor
On 11/25/2020 6:17 AM, Milan Kupcevic wrote:
> Hi Igor,
>
> Thank you for quick and useful answer. We are looking at our options.
>
> Milan
>
>
>
> On 2020-11-24 06:49, Igor Fedotov wrote:
>> Another workaround would be to delete the object in question using
>> ceph-objectstore-tool and then do a scrub on the corresponding PG to fix
>> the absent object.
>>
>> But I would greatly appreciate it if we could dissect this case a bit...
>>
>>
>> On 11/24/2020 9:55 AM, Milan Kupcevic wrote:
>>> Hello,
>>>
>>> Three OSD daemons crash at the same time while processing the same
>>> object located in an rbd ec4+2 pool, leaving a placement group in an
>>> inactive down state. Soon after I start the OSD daemons back up, they
>>> crash again, choking on the same object.
>>>
>>> ----------------------------8<------------------------------------
>>> _dump_onode 0x5605a27ca000
>>> 4#7:8565da11:::rbd_data.6.a8a8356fd674f.00000000003dce34:head# nid
>>> 1889617 size 0x100000 (1048576) expected_object_size 0
>>> expected_write_size 0 in 8 shards, 32768 spanning blobs
>>> ----------------------------8<------------------------------------
>>>
>>> Please take a look at the attached log file.
>>>
>>>
>>> Ceph status reports:
>>>
>>> Reduced data availability: 1 pg inactive, 1 pg down
>>>
>>>
>>> Any hints on how to get this placement group back online would be
>>> greatly appreciated.
>>>
>>>
>>> Milan
>>>
>>>
>