Given the symptoms, the high CPU usage within RocksDB and the
corresponding slowdown were presumably caused by RocksDB fragmentation.
A temporary workaround would be to do a manual DB compaction using
ceph-kvstore-tool's compact command.
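
For example (a minimal sketch, assuming OSD id 12 and the default data
path, both to be adjusted for your deployment; the OSD must be stopped
while the tool runs):

    systemctl stop ceph-osd@12
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-12 compact
    systemctl start ceph-osd@12
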
Thanks,
Igor
On 4/13/2020 1:01 AM, Jack wrote:
> Yep, I am
>
> The issue is solved now .. and by solved, brace yourselves, I mean I had
> to recreate all OSDs
>
> As the cluster would not heal itself (because of the original
> issue), I had to drop every rados pool, stop all OSDs, and destroy &
> recreate them ..
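>
> Roughly (a sketch, with placeholder pool/id/device names, repeated per
> pool and per OSD):
>
>     ceph osd pool delete <pool> <pool> --yes-i-really-really-mean-it
>     systemctl stop ceph-osd@<id>
>     ceph osd purge <id> --yes-i-really-mean-it
>     ceph-volume lvm zap --destroy /dev/<dev>
>     ceph-volume lvm create --data /dev/<dev>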
> Yeah, well, hum
>
> There is definitely an underlying issue there
> Those OSDs were created under Luminous and upgraded ever since
>
> I have no more clues about the bug
> Sadly, there is only so much downtime I can afford on this cluster
>
> Anyway ..
>
> On 4/9/20 4:51 AM, Ashley Merrick wrote:
>> Are you sure you're not being hit by:
>>
>>
>>
>> ceph config set osd bluestore_fsck_quick_fix_on_mount false
>> (see https://docs.ceph.com/docs/master/releases/octopus/)
>>
>> Have all your OSDs successfully completed the fsck?
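>>
>> One quick way to check (a rough sketch, from any mon/admin node) is to
>> look at the health detail, which should list the OSDs still reporting
>> the legacy stats:
>>
>>     ceph health detail | grep -i legacy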
>>
>> The reason I say that is that I can see "20 OSD(s) reporting legacy (not
>> per-pool) BlueStore omap usage stats"
>>
>> ---- On Thu, 09 Apr 2020 02:15:02 +0800 Jack <ceph@jack.fr.eu.org>
>> wrote ----
>>
>> Just to confirm, this does not get better:
>>
>> root@backup1:~# ceph status
>> cluster:
>> id: 9cd41f0f-936d-4b59-8e5d-9b679dae9140
>> health: HEALTH_WARN
>> 20 OSD(s) reporting legacy (not per-pool) BlueStore omap
>> usage stats
>> 4/50952060 objects unfound (0.000%)
>> nobackfill,norecover,noscrub,nodeep-scrub flag(s) set
>> 1 osds down
>> 3 nearfull osd(s)
>> Reduced data availability: 826 pgs inactive, 616 pgs down,
>> 185 pgs peering, 158 pgs stale
>> Low space hindering backfill (add storage if this doesn't
>> resolve itself): 93 pgs backfill_toofull
>> Degraded data redundancy: 13285415/101904120 objects
>> degraded (13.037%), 706 pgs degraded, 696 pgs undersized
>> 989 pgs not deep-scrubbed in time
>> 378 pgs not scrubbed in time
>> 10 pool(s) nearfull
>> 2216 slow ops, oldest one blocked for 13905 sec, daemons
>> [osd.1,osd.11,osd.20,osd.24,osd.25,osd.29,osd.31,osd.37,osd.4,osd.5]...
>> have slow ops.
>>
>> services:
>> mon: 1 daemons, quorum backup1 (age 8d)
>> mgr: backup1(active, since 8d)
>> osd: 37 osds: 26 up (since 9m), 27 in (since 2h); 626 remapped pgs
>> flags nobackfill,norecover,noscrub,nodeep-scrub
>> rgw: 1 daemon active (backup1.odiso.net)
>>
>> task status:
>>
>> data:
>> pools: 10 pools, 2785 pgs
>> objects: 50.95M objects, 92 TiB
>> usage: 121 TiB used, 39 TiB / 160 TiB avail
>> pgs: 29.659% pgs not active
>> 13285415/101904120 objects degraded (13.037%)
>> 433992/101904120 objects misplaced (0.426%)
>> 4/50952060 objects unfound (0.000%)
>> 840 active+clean+snaptrim_wait
>> 536 down
>> 490 active+undersized+degraded+remapped+backfilling
>> 326 active+clean
>> 113 peering
>> 88 active+undersized+degraded
>> 83 active+undersized+degraded+remapped+backfill_toofull
>> 79 stale+down
>> 63 stale+peering
>> 51 active+clean+snaptrim
>> 24 activating
>> 22 active+recovering+degraded
>> 19 active+remapped+backfilling
>> 13 stale+active+undersized+degraded
>> 9 remapped+peering
>> 9 active+undersized+remapped+backfilling
>> 9 active+undersized+degraded+remapped+backfill_wait+backfill_toofull
>> 2 stale+active+clean+snaptrim
>> 2 active+undersized
>> 1 stale+active+clean+snaptrim_wait
>> 1 active+remapped+backfill_toofull
>> 1 active+clean+snaptrim_wait+laggy
>> 1 active+recovering+undersized+remapped
>> 1 down+remapped
>> 1 activating+undersized+degraded+remapped
>> 1 active+recovering+laggy
>>
>> On 4/8/20 3:27 PM, Jack wrote:
>>> The CPU is used by userspace, not kernelspace
>>>
>>> Here is the perf top, see attachment
>>>
>>> Rocksdb eats everything :/
>>>
>>>
>>> On 4/8/20 3:14 PM, Paul Emmerich wrote:
>>>> What's the CPU busy with while spinning at 100%?
>>>>
>>>> Check "perf top" for a quick overview
>>>>
>>>>
>>>> Paul
>>>>
>>>
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-leave@ceph.io