Hi Igor,
The only thing abnormal about this osdstore is that it was created by
Mimic 13.2.8, and I can see that the OSD sizes on this osdstore are not
the same as the others in the cluster (while they should be exactly
the same size).
Can it be
https://tracker.ceph.com/issues/39151 ?
Hmm, maybe... Did you change the hardware for this OSD's node at some
point, as happened in the ticket?
And it's still unclear to me if the issue is reproducible for you.
Could you please also run fsck first and then repair for this OSD,
and collect the logs?
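For reference, a sketch of the fsck-then-repair sequence being requested (the OSD id, path, and systemd unit below are examples, not taken from this thread; stop the OSD before touching its store):

```shell
# Example only: substitute your real OSD id. Stop the OSD first.
systemctl stop ceph-osd@2

# Read-only consistency check first, capturing a log to share:
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-2 \
    --log-file fsck.log --log-level 20

# Then attempt the repair, again capturing a log:
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-2 \
    --log-file repair.log --log-level 20
```

The `--log-file`/`--log-level` options make ceph-bluestore-tool write a detailed log that can be attached to the thread or a tracker ticket.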
Thanks,
Igor
Thanks!
Saber
CTO @PlanetHoster
On Sep 25, 2020, at 5:46 AM, Igor Fedotov <ifedotov@suse.de> wrote:
Hi Saber,
I don't think this is related. The new assertion happens along the write
path, while the original one occurred on allocator shutdown.
Unfortunately, there is not much information to troubleshoot this...
Are you able to reproduce the case?
Thanks,
Igor
On 9/25/2020 4:21 AM, Saber@PlanetHoster.info wrote:
> Hi Igor,
>
> We had an OSD crash a week after running Nautilus. I have attached
> the logs; is it related to the same bug?
>
>
>
>
> Thanks,
> Saber
> CTO @PlanetHoster
>
>> On Sep 14, 2020, at 10:22 AM, Igor Fedotov <ifedotov@suse.de> wrote:
>>
>> Thanks!
>>
>> I've now got the root cause. The fix is on its way...
>>
>> Meanwhile, you might want to try working around the issue by setting
>> "bluestore_hybrid_alloc_mem_cap" to 0, or by using a different
>> allocator, e.g. avl for bluestore_allocator (and optionally for
>> bluefs_allocator too).
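A sketch of that workaround using `ceph config set` (the option names are the ones from the message; verify they exist in your release before applying):

```shell
# Option A: disable the hybrid allocator's memory cap
ceph config set osd bluestore_hybrid_alloc_mem_cap 0

# Option B: switch BlueStore to the avl allocator instead
ceph config set osd bluestore_allocator avl
ceph config set osd bluefs_allocator avl   # optional, per the message

# Allocator changes take effect on OSD restart, e.g.:
# systemctl restart ceph-osd@<id>
```

The same settings can instead go into the `[osd]` section of ceph.conf if the cluster is not managed via the config database.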
>>
>>
>> Hope this helps,
>>
>> Igor.
>>
>>
>>
>> On 9/14/2020 5:02 PM, Jean-Philippe Méthot wrote:
>>> Alright, here’s the full log file.
>>>
>>>
>>>
>>>
>>>
>>> Jean-Philippe Méthot
>>> Senior Openstack system administrator
>>> Administrateur système Openstack sénior
>>> PlanetHoster inc.
>>> 4414-4416 Louis B Mayer
>>> Laval, QC, H7P 0G1, Canada
>>> TEL : +1.514.802.1644 - Poste : 2644
>>> FAX : +1.514.612.0678
>>> CA/US : 1.855.774.4678
>>> FR : 01 76 60 41 43
>>> UK : 0808 189 0423
>>>
>>>
>>>
>>>
>>>
>>>
>>>> On Sep 14, 2020, at 06:49, Igor Fedotov <ifedotov@suse.de> wrote:
>>>>
>>>> Well, I can see a duplicate admin socket command
>>>> registration/de-registration (and the second de-registration
>>>> asserts), but I don't understand how this could happen.
>>>>
>>>> Would you share the full log, please?
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Igor
>>>>
>>>> On 9/11/2020 7:26 PM, Jean-Philippe Méthot wrote:
>>>>> Here’s the out file, as requested.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Jean-Philippe Méthot
>>>>> Senior Openstack system administrator
>>>>> PlanetHoster inc.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> On Sep 11, 2020, at 10:38, Igor Fedotov <ifedotov@suse.de> wrote:
>>>>>>
>>>>>> Could you please run:
>>>>>>
>>>>>> CEPH_ARGS="--log-file log --debug-asok 5" ceph-bluestore-tool repair --path <...> ; cat log | grep asok > out
>>>>>>
>>>>>> and share 'out' file.
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Igor
>>>>>>
>>>>>> On 9/11/2020 5:15 PM, Jean-Philippe Méthot wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> We’re upgrading our cluster, OSD node by OSD node, from Mimic
>>>>>>> to Nautilus. Some release notes recommended running the
>>>>>>> following command to fix stats after an upgrade:
>>>>>>>
>>>>>>> ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-0
>>>>>>>
>>>>>>> However, running that command gives us the following error
>>>>>>> message:
>>>>>>>
>>>>>>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc: In function 'virtual Allocator::SocketHook::~SocketHook()' thread 7f1a6467eec0 time 2020-09-10 14:40:25.872353
>>>>>>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc: 53: FAILED ceph_assert(r == 0)
>>>>>>>> ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus (stable)
>>>>>>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x7f1a5a823025]
>>>>>>>> 2: (()+0x25c1ed) [0x7f1a5a8231ed]
>>>>>>>> 3: (()+0x3c7a4f) [0x55b33537ca4f]
>>>>>>>> 4: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517]
>>>>>>>> 5: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082]
>>>>>>>> 6: (BlueStore::_close_db_and_around(bool)+0x2f8) [0x55b335274528]
>>>>>>>> 7: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1) [0x55b3352749a1]
>>>>>>>> 8: (main()+0x10b3) [0x55b335187493]
>>>>>>>> 9: (__libc_start_main()+0xf5) [0x7f1a574aa555]
>>>>>>>> 10: (()+0x1f9b5f) [0x55b3351aeb5f]
>>>>>>>> 2020-09-10 14:40:25.873 7f1a6467eec0 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc: In function 'virtual Allocator::SocketHook::~SocketHook()' thread 7f1a6467eec0 time 2020-09-10 14:40:25.872353
>>>>>>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc: 53: FAILED ceph_assert(r == 0)
>>>>>>>>
>>>>>>>> ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus (stable)
>>>>>>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x7f1a5a823025]
>>>>>>>> 2: (()+0x25c1ed) [0x7f1a5a8231ed]
>>>>>>>> 3: (()+0x3c7a4f) [0x55b33537ca4f]
>>>>>>>> 4: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517]
>>>>>>>> 5: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082]
>>>>>>>> 6: (BlueStore::_close_db_and_around(bool)+0x2f8) [0x55b335274528]
>>>>>>>> 7: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1) [0x55b3352749a1]
>>>>>>>> 8: (main()+0x10b3) [0x55b335187493]
>>>>>>>> 9: (__libc_start_main()+0xf5) [0x7f1a574aa555]
>>>>>>>> 10: (()+0x1f9b5f) [0x55b3351aeb5f]
>>>>>>>> *** Caught signal (Aborted) **
>>>>>>>> in thread 7f1a6467eec0 thread_name:ceph-bluestore-
>>>>>>>> ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus (stable)
>>>>>>>> 1: (()+0xf630) [0x7f1a58cf0630]
>>>>>>>> 2: (gsignal()+0x37) [0x7f1a574be387]
>>>>>>>> 3: (abort()+0x148) [0x7f1a574bfa78]
>>>>>>>> 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x7f1a5a823074]
>>>>>>>> 5: (()+0x25c1ed) [0x7f1a5a8231ed]
>>>>>>>> 6: (()+0x3c7a4f) [0x55b33537ca4f]
>>>>>>>> 7: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517]
>>>>>>>> 8: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082]
>>>>>>>> 9: (BlueStore::_close_db_and_around(bool)+0x2f8) [0x55b335274528]
>>>>>>>> 10: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1) [0x55b3352749a1]
>>>>>>>> 11: (main()+0x10b3) [0x55b335187493]
>>>>>>>> 12: (__libc_start_main()+0xf5) [0x7f1a574aa555]
>>>>>>>> 13: (()+0x1f9b5f) [0x55b3351aeb5f]
>>>>>>>> 2020-09-10 14:40:25.874 7f1a6467eec0 -1 *** Caught signal (Aborted) **
>>>>>>>> in thread 7f1a6467eec0 thread_name:ceph-bluestore-
>>>>>>>>
>>>>>>>> ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus (stable)
>>>>>>>> 1: (()+0xf630) [0x7f1a58cf0630]
>>>>>>>> 2: (gsignal()+0x37) [0x7f1a574be387]
>>>>>>>> 3: (abort()+0x148) [0x7f1a574bfa78]
>>>>>>>> 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x7f1a5a823074]
>>>>>>>> 5: (()+0x25c1ed) [0x7f1a5a8231ed]
>>>>>>>> 6: (()+0x3c7a4f) [0x55b33537ca4f]
>>>>>>>> 7: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517]
>>>>>>>> 8: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082]
>>>>>>>> 9: (BlueStore::_close_db_and_around(bool)+0x2f8) [0x55b335274528]
>>>>>>>> 10: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1) [0x55b3352749a1]
>>>>>>>> 11: (main()+0x10b3) [0x55b335187493]
>>>>>>>> 12: (__libc_start_main()+0xf5) [0x7f1a574aa555]
>>>>>>>> 13: (()+0x1f9b5f) [0x55b3351aeb5f]
>>>>>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>>>>
>>>>>>> What could be the source of this error? I haven’t found much
>>>>>>> of anything about it online.
>>>>>>>
>>>>>>>
>>>>>>> Jean-Philippe Méthot
>>>>>>> Senior Openstack system administrator
>>>>>>> PlanetHoster inc.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> ceph-users mailing list -- ceph-users@ceph.io
>>>>>>> To unsubscribe send an email to ceph-users-leave@ceph.io
>>>>>
>>>
>