Dan,
Thank you for the suggestion. Changing osd_max_pg_per_osd_hard_ratio to
10 and also setting mon_max_pg_per_osd to 500 allowed me to resume IO (I
did have to restart the OSDs with stuck slow ops).
I'll have to do some reading into why our PG count appears so high, and
if it's safe to leave these values set.
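
If I'm reading the docs right, the hard limit at which an OSD stops
accepting new PGs works out to mon_max_pg_per_osd *
osd_max_pg_per_osd_hard_ratio, so with the values above that should now
be 500 * 10 = 5000 PGs per OSD. Something like the following should
show where we actually sit:

    ceph osd df tree         # the PGS column shows the current PG count on each OSD
    ceph osd pool ls detail  # pg_num and size per pool, to estimate sum(pg_num * size) / num_osds
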
Thanks again for your help!
Justin Goetz
Systems Engineer, TeraSwitch Inc.
jgoetz(a)teraswitch.com
412-945-7045 (NOC) | 412-459-7945 (Direct)
On 6/23/21 10:41 AM, Dan van der Ster wrote:
Hi,
Stuck activating could be an old known issue: if the cluster has many
(>100) PGs per OSD, the OSDs may temporarily need to hold more than the
max (300), and PGs therefore get stuck activating.
We always use this option as a workaround:
osd max pg per osd hard ratio = 10.0
I suggest giving this a try -- it can't hurt much.
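
On a recent release you can probably set this at runtime from the
config database (I'm not sure it is picked up without a restart, so
restarting the affected OSDs may still be needed), roughly:

    ceph config set osd osd_max_pg_per_osd_hard_ratio 10.0
    # or, for already-running daemons / older releases:
    ceph tell osd.* injectargs '--osd_max_pg_per_osd_hard_ratio 10.0'
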
Cheers, Dan
On Wed, Jun 23, 2021 at 4:29 PM Justin Goetz <jgoetz(a)teraswitch.com> wrote:
> Hello!
>
> We are in the process of expanding our Ceph cluster (by both adding OSD
> hosts and replacing smaller-sized HDDs on our existing hosts). So far we
> have gone host by host, removing the old OSDs, swapping the physical
> HDDs, and re-adding them. This process has gone smoothly, aside from one
> issue: after any action taken on the cluster (adding new OSDs, replacing
> old ones, etc.), PGs get stuck "activating", which leaves around 3.5% of
> PGs inactive and causes IO to stop.
>
> Here is a current look at our ceph -s command:
>
> cluster:
> id: e8ffe2eb-f8fc-4110-a4bc-1715e878fb7b
> health: HEALTH_WARN
> Reduced data availability: 166 pgs inactive
> Degraded data redundancy: 137153907/3658405707 objects
> degraded (3.749%), 930 pgs degraded, 928 pgs undersized
> 10 pgs not deep-scrubbed in time
> 33709 slow ops, oldest one blocked for 35956 sec, daemons
> [osd.103,osd.104,osd.105,osd.106,osd.107,osd.109,osd.111,osd.112,osd.113,osd.114]...
> have slow ops.
>
> services:
> mon: 3 daemons, quorum lb3,lb2,lb1 (age 8w)
> mgr: lb1(active, since 6w), standbys: lb3, lb2
> osd: 117 osds: 117 up (since 15m), 117 in (since 10h); 2033
> remapped pgs
> rgw: 3 daemons active (lb1.rgw0, lb2.rgw0, lb3.rgw0)
>
> task status:
>
> data:
> pools: 8 pools, 5793 pgs
> objects: 609.74M objects, 169 TiB
> usage: 308 TiB used, 430 TiB / 738 TiB avail
> pgs: 2.866% pgs not active
> 137153907/3658405707 objects degraded (3.749%)
> 262215404/3658405707 objects misplaced (7.167%)
> 3754 active+clean
> 963 active+remapped+backfill_wait
> 892 active+undersized+degraded+remapped+backfill_wait
> 136 activating+remapped
> 27 activating+undersized+degraded+remapped
> 8 active+undersized+degraded+remapped+backfilling
> 6 active+clean+scrubbing+deep
> 3 activating+degraded+remapped
> 3 active+remapped+backfilling
> 1 active+undersized+remapped+backfill_wait
>
> io:
> client: 94 KiB/s rd, 94 op/s rd, 0 op/s wr
> recovery: 112 MiB/s, 372 objects/s
>
> progress:
> Rebalancing after osd.20 marked in (10h)
> [............................] (remaining: 11d)
> Rebalancing after osd.41 marked in (10h)
> [=...........................] (remaining: 8d)
> Rebalancing after osd.30 marked in (10h)
> [=...........................] (remaining: 9d)
> Rebalancing after osd.1 marked in (10h)
> [=======.....................] (remaining: 2h)
> Rebalancing after osd.10 marked in (10h)
> [............................] (remaining: 12d)
> Rebalancing after osd.50 marked in (10h)
> [............................] (remaining: 2w)
> Rebalancing after osd.71 marked out (10h)
> [==..........................] (remaining: 5d)
>
> What you may find interesting are the "slow ops" warnings. This is where
> our inactive PGs become stuck. Once the cluster gets into this state,
> I'm usually able to recover IO by restarting the OSDs with slow ops.
> However, what's extremely strange is that this workaround only works
> about 12 hours after the last OSD addition. Restarting the slow-ops OSDs
> any earlier than that results in the slow ops returning immediately.
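>
> If it helps with diagnosis, these are the commands I believe should
> show what those ops are waiting on (run via the admin socket on the
> host carrying the affected OSD):
>
>      ceph daemon osd.103 dump_ops_in_flight   # ops currently blocked on this OSD
>      ceph daemon osd.103 dump_historic_ops    # recently completed ops, with per-stage timings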
>
> Our first thought was hardware issues; however, we ruled this out after
> the slow ops warnings appeared on brand-new HDDs and OSD hosts.
> Monitoring the IO saturation of the OSDs reporting slow ops shows actual
> usage nowhere near saturation, and no hardware issues are present on the
> drives themselves.
>
> Looking at the journalctl logs of one of the affected OSDs above, we see
> the following repeated multiple times:
>
> osd.103 56934 get_health_metrics reporting 2 slow ops, oldest is
> osd_op(client.467952.0:1520304537 8.6fbs0 8.1e6826fb (undecoded)
> ondisk+retry+write+known_if_redirected e56923
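>
> If I'm reading the pgid in that op correctly (8.6fbs0 being shard 0 of
> PG 8.6fb), I should be able to gather more detail on that PG with
> something like:
>
>      ceph pg 8.6fb query   # peering/activation state of the PG named in the slow op
>      ceph pg map 8.6fb     # which OSDs that PG currently maps to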
>
> So far, my procedure for the disk swaps has been as follows (rough
> commands are sketched after the list):
>
> 1. Set noout, norebalance, and norecover on the cluster.
> 2. Use ceph-ansible to remove the old disk OSD IDs
> 3. Swap physical HDDs, re-add with ceph-ansible
> 4. Unset noout, norebalance, and norecover.
>
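> In terms of commands, steps 1 and 4 are roughly the following
> (ceph-ansible drives the actual OSD removal and redeployment in steps
> 2 and 3):
>
>      ceph osd set noout; ceph osd set norebalance; ceph osd set norecover
>      # ... remove old OSD IDs, swap HDDs, redeploy with ceph-ansible ...
>      ceph osd unset noout; ceph osd unset norebalance; ceph osd unset norecover
>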
> I should note that this issue appears even with simple OSD additions (not
> removals): we added 2 brand-new hosts to the cluster and saw the same
> behavior.
>
> I've been trying to think of any possible cause of this issue. I should
> mention that our cluster is messy hardware-wise at the moment (we have a
> mix of 7T, 4T, and 10T HDDs - we're moving to all 10T HDDs, but the
> swap process has been taking a while). One warning I've noticed during
> the old disk removals is "too many PGs per OSD"; however, that warning
> clears once the new OSDs are added, which I assume is to be expected.
>
> If anyone would be willing to provide any hints on where to look, it
> would be much appreciated!
>
> Thanks for your time.
> --
>
> Justin Goetz
> Systems Engineer, TeraSwitch Inc.
> jgoetz(a)teraswitch.com
> 412-945-7045 (NOC) | 412-459-7945 (Direct)
>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io