> Yeah, there seems to be a fear that attempting to repair those will negatively impact
> performance even more. I disagree and think we should do them immediately.
There shouldn’t really be too much of a noticeable performance hit.
Some good documentation here
<https://docs.ceph.com/en/pacific/rados/operations/pg-repair/#more-information-on-placement-group-repair>.
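At a high level the flow in that doc is roughly the following (the pool name and PG id
below are placeholders, your own `ceph health detail` output will have the real ones):

    # find which PGs have the scrub errors
    ceph health detail | grep inconsistent
    rados list-inconsistent-pg <pool-name>

    # see exactly what the deep-scrub flagged for a given PG
    rados list-inconsistent-obj <pg-id> --format=json-pretty

    # kick off a repair of that PG
    ceph pg repair <pg-id>

The repair itself is essentially another scrub of that one PG, so the impact should be
on the order of a normal deep-scrub of a single PG, not something cluster-wide.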
> The general feeling is that we're stuck on luminous and that it's destructive to
> upgrade to anything else.
> I refuse to believe that is true.
> At least if we upgraded everything to 12.2.3 we'd have the 'balancer' stuff that came
> with I think 12.2.2...
Upgrades are definitely not destructive, however, they also aren't trivial.
You can upgrade two releases at a time, but the distros those packages are built for may
vary from release to release.
For example, if you wanted to get to Quincy from Luminous, you should be able to step
from Luminous (12) to Nautilus (14), then to Pacific (16), and on to Quincy (17).
However, your Luminous install may be on Ubuntu 14.04 or 16.04, from which you can move
to Nautilus directly.
To get to Pacific, you'll then need to move to Ubuntu 18.04 (which Nautilus also
supports), and then on to Pacific.
If you then want Quincy, you'll need to upgrade to Ubuntu 20.04 before moving on to
Quincy on 20.04.
This probably sounds daunting, and it is certainly non-trivial, but it is definitely
doable if you take things in small steps, and it should be possible with no downtime if
planned out carefully.
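The per-release upgrade notes are the authority on the exact order and any extra steps,
but every hop looks roughly the same; as a rough sketch (the release name below is just
the Luminous -> Nautilus example):

    # check what versions every daemon is running before/after each step
    ceph versions

    # keep OSDs from being marked out while daemons restart
    ceph osd set noout

    # upgrade packages, then restart daemons in order, one node at a time,
    # waiting for HEALTH_OK in between: mons, then mgrs, then OSDs, then MDS/RGW
    systemctl restart ceph-mon.target
    systemctl restart ceph-mgr.target
    systemctl restart ceph-osd.target

    # once everything reports the new release, pin the minimum OSD release
    ceph osd require-osd-release nautilus
    ceph osd unset noout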
> Also, there seems to be a belief that bluestore is an 'all-or-nothing' proposition
> Yet I see that you can have a mixture of both in your deployments
You can mix filestore and bluestore OSDs in your cluster, however —
> […] and that it's impossible to migrate from
> filestore to bluestore.
> […] and it's indeed possible to migrate from
> filestore to bluestore.
If you have filestore OSDs, the only way to migrate them to bluestore is by destroying the
OSD, and recreating it as bluestore, see here
<https://docs.ceph.com/en/quincy/rados/operations/bluestore-migration/>.
This can be a time-consuming process if you drain an OSD, let it backfill off, blow it
away, recreate it, and then bring the data back.
It can also prove IO-expensive if your ceph cluster is already IO-saturated, due to all
of the backfill IO on top of the client IO.
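As a very rough sketch of the per-OSD dance (the device path and OSD id below are
placeholders, and the migration doc above has the authoritative steps):

    # drain the filestore OSD and wait for backfill to finish
    ceph osd out <osd-id>
    # ...wait until `ceph -s` shows the data has moved off...

    # tear it down and wipe the old device
    systemctl stop ceph-osd@<osd-id>
    ceph osd destroy <osd-id> --yes-i-really-mean-it
    ceph-volume lvm zap /dev/sdX --destroy

    # recreate it as bluestore, reusing the same OSD id, and let it backfill back
    ceph-volume lvm create --bluestore --data /dev/sdX --osd-id <osd-id>

Doing one OSD (or one failure domain) at a time keeps the backfill IO bounded, which
matters if the cluster is already IO-saturated.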
> TL;DR -- there is a *lot* of fear of touching this thing because nobody is truly an
> 'expert' in it atm.
> But not touching it is why we've gotten ourselves into a situation with broken stuff
> and horrendous performance.
Given how critical (and brittle) this infrastructure sounds for your org, it might be
best to pull in some experts <https://croit.io/consulting>, and I think most on the
mailing list would likely recommend Croit as a good place to start outside of any existing
support contracts.
Hope that's helpful,
Reed
> On Feb 28, 2023, at 1:11 PM, Dave Ingram <dave(a)adaptable.sh> wrote:
>
>
> On Tue, Feb 28, 2023 at 12:56 PM Reed Dier <reed.dier(a)focusvq.com
<mailto:reed.dier@focusvq.com>> wrote:
> I think a few other things that could help would be `ceph osd df tree` which will
show the hierarchy across different crush domains.
>
> Good idea:
https://pastebin.com/y07TKt52
>
> And if you’re doing something like erasure coded pools, or something other than
replication 3, maybe `ceph osd crush rule dump` may provide some further context with the
tree output.
>
> No erasure coded pools - all replication.
>
>
> Also, the cluster is running Luminous (12) which went EOL 3 years ago tomorrow
<https://docs.ceph.com/en/latest/releases/index.html#archived-releases>.
> So there are also likely a good bit of improvements all around under the hood to be
gained by moving forward from Luminous.
>
> Yes, nobody here wants to touch upgrading this at all - too terrified of breaking
things. This ceph deployment is serving several hundred VMs.
>
> The general feeling is that we're stuck on luminous and that it's destructive
to upgrade to anything else. I refuse to believe that is true. At least if we upgraded
everything to 12.2.3 we'd have the 'balancer' stuff that came with I think
12.2.2...
>
> What would you recommend upgrading luminous to?
>
> Though, I would say take care of the scrub errors prior to doing any major upgrades,
as well as checking your upgrade path (can only upgrade two releases at a time, if you
have filestore OSDs, etc).
>
> Yeah, there seems to be a fear that attempting to
> repair those will negatively impact performance even more. I disagree and think we should
> do them immediately.
>
> Also, there seems to be a belief that bluestore is an 'all-or-nothing'
proposition and that it's impossible to migrate from filestore to bluestore. Yet I see
that you can have a mixture of both in your deployments and it's indeed possible to
migrate from filestore to bluestore.
>
> TL;DR -- there is a *lot* of fear of touching this thing because nobody is truly an
'expert' in it atm. But not touching it is why we've gotten ourselves into a
situation with broken stuff and horrendous performance.
>
> Thanks Reed!
> -Dave
>
>
> -Reed
>
>> On Feb 28, 2023, at 11:12 AM, Dave Ingram <dave(a)adaptable.sh
<mailto:dave@adaptable.sh>> wrote:
>>
>> There is a
>> lot of variability in drive sizes - two different sets of admins added
>> disks sized between 6TB and 16TB and I suspect this and imbalanced
>> weighting is to blame.
>>
>> CEPH OSD DF:
>>
>> (not going to paste that all in here):
https://pastebin.com/CNW5RKWx
>>
>> What else am I missing in terms of what to share with you all?
>>
>> Thanks all,
>> -Dave
>> _______________________________________________
>> ceph-users mailing list -- ceph-users(a)ceph.io <mailto:ceph-users@ceph.io>
>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
<mailto:ceph-users-leave@ceph.io>
>
>