New subject: Very slow backfilling/remapping of EC pool PGs

21 Mar 2023

(adding back the list)

On Tue, Mar 21, 2023 at 11:25 AM Joachim Kraftmayer <
joachim.kraftmayer(a)clyso.com&gt; wrote:

...
  i added the questions and answers below.

 ___________________________________
 Best Regards,
 Joachim Kraftmayer
 CEO | Clyso GmbH

 Clyso GmbH
 p: +49 89 21 55 23 91 2
 a: Loristraße 8 | 80335 München | Germany
 w: https://clyso.com | e: joachim.kraftmayer(a)clyso.com

 We are hiring: https://www.clyso.com/jobs/
 ---
 CEO: Dipl. Inf. (FH) Joachim Kraftmayer
 Unternehmenssitz: Utting am Ammersee
 Handelsregister beim Amtsgericht: Augsburg
 Handelsregister-Nummer: HRB 25866
 USt. ID-Nr.: DE275430677

 Am 21.03.23 um 11:14 schrieb Gauvain Pocentek:

 Hi Joachim,

 On Tue, Mar 21, 2023 at 10:13 AM Joachim Kraftmayer <
 joachim.kraftmayer(a)clyso.com&gt; wrote:

  Which Ceph version are you running, is mclock
active?

  We're using Quincy (17.2.5), upgraded step by step from Luminous if I
 remember correctly.

 did you recreate the osds? if yes, at which version?

I actually don't remember all the history, but I think we added the HDD
nodes while running Pacific.

...

 mlock seems active, set to high_client_ops profile. HDD OSDs have very
 different settings for max capacity iops:

 osd.137        basic     osd_mclock_max_capacity_iops_hdd
  929.763899
 osd.161        basic     osd_mclock_max_capacity_iops_hdd
  4754.250946
 osd.222        basic     osd_mclock_max_capacity_iops_hdd
  540.016984
 osd.281        basic     osd_mclock_max_capacity_iops_hdd
  1029.193945
 osd.282        basic     osd_mclock_max_capacity_iops_hdd
  1061.762870
 osd.283        basic     osd_mclock_max_capacity_iops_hdd
  462.984562

 We haven't set those explicitly, could they be the reason of the slow
 recovery?

 i recommend to disable mclock for now, and yes we have seen slow recovery
 caused by mclock.

Stupid question: how do you do that? I've looked through the docs but could
only find information about changing the settings.

...

 Bonus question: does ceph set that itself?

 yes and if you have a setup with HDD + SSD (db & wal) the discovery works
 not in the right way.

Good to know!

Gauvain

...

 Thanks!

 Gauvain

  Joachim

 ___________________________________
 Clyso GmbH - Ceph Foundation Member

 Am 21.03.23 um 06:53 schrieb Gauvain Pocentek:
  Hello all,

 We have an EC (4+2) pool for RGW data, with HDDs + SSDs for WAL/DB. This
 pool has 9 servers with each 12 disks of 16TBs. About 10 days ago we  lost a
  server and we've removed its OSDs from the
cluster. Ceph has started to
 remap and backfill as expected, but the process has been getting slower  and
  slower. Today the recovery rate is around 12
MiB/s and 10 objects/s. All
 the remaining unclean PGs are backfilling:

    data:
      volumes: 1/1 healthy
      pools:   14 pools, 14497 pgs
      objects: 192.38M objects, 380 TiB
      usage:   764 TiB used, 1.3 PiB / 2.1 PiB avail
      pgs:     771559/1065561630 objects degraded (0.072%)
               1215899/1065561630 objects misplaced (0.114%)
               14428 active+clean
               50    active+undersized+degraded+remapped+backfilling
               18    active+remapped+backfilling
               1     active+clean+scrubbing+deep

 We've checked the health of the remaining servers, and everything looks
 like (CPU/RAM/network/disks).

 Any hints on what could be happening?

 Thank you,
 Gauvain
 _______________________________________________
 ceph-users mailing list -- ceph-users(a)ceph.io
 To unsubscribe send an email to ceph-users-leave(a)ceph.io 

Re: Very slow backfilling/remapping of EC pool PGs