(Ceph Octopus) Repairing a neglected Ceph cluster - Degraded Data Reduncancy, all PGs degraded, undersized, not scrubbed in time - ceph-users

List overview All Threads
Download

newer

(Ceph Octopus) Repairing a neglected Ceph cluster - Degraded Data Reduncancy, all PGs degraded, undersized, not scrubbed in time

older

Accessing Ceph Storage Data via...

Ceph RBD - High IOWait during the...

seffyroff＠gmail.com

7 Nov 2020 7 Nov '20

12:14 a.m.

I've inherited a Ceph Octopus cluster that seems like it needs urgent maintenance before data loss begins to happen. I'm the guy with the most Ceph experience on hand and that's not saying much. I'm experiencing most of the ops and repair tasks for the first time here. Ceph health output looks like this: HEALTH_WARN Degraded data redundancy: 3640401/8801868 objects degraded (41.359%), 128 pgs degraded, 128 pgs undersized; 128 pgs not deep-scrubbed in time; 128 pgs not scrubbed in time Ceph -s output: https://termbin.com/i06u The crush rule 'cephfs.media' is here: https://termbin.com/2klmq So, it seems like all PGs are in a 'warning' state for the main pool, which is erasure coded and 11TiB across 4 OSDs, of which around 6.4TiB is used. The Ceph services themselves seem happy, they're stable and have Quorum. I'm able to access the web panel fine also. The block devices are of different sizes and types (2 large, different sized spinners, and 2 identical SSDs) I would welcome any pointers on what my steps to bring this up to full health may be. If it's undersized, can I simply add another block device/OSD? Or perhaps adjusting config somewhere will get it to rebalance successfully? (the rebalance jobs have been stuck at 0% for weeks) Thank you for your time reading this message.

Show replies by date

Robert Sander

11 Nov 11 Nov

9:29 a.m.

Am 07.11.20 um 01:14 schrieb seffyroff(a)gmail.com:

...

My condolences. Get the data from that cluster and put the cluster down. In the current setup it will never work. Regards -- Robert Sander Heinlein Support GmbH Schwedter Str. 8/9b, 10119 Berlin http://www.heinlein-support.de Tel: 030 / 405051-43 Fax: 030 / 405051-19 Zwangsangaben lt. §35a GmbHG: HRB 93818 B / Amtsgericht Berlin-Charlottenburg, Geschäftsführer: Peer Heinlein -- Sitz: Berlin

Hans van den Bogert

10:20 a.m.

Hoping to learn from this myself, why will the current setup never work? On 11/11/20 10:29 AM, Robert Sander wrote: > Am 07.11.20 um 01:14 schrieb seffyroff(a)gmail.com: >> I've inherited a Ceph Octopus cluster that seems like it needs urgent maintenance before data loss begins to happen. I'm the guy with the most Ceph experience on hand and that's not saying much. I'm experiencing most of the ops and repair tasks for the first time here. > > My condolences. Get the data from that cluster and put the cluster down. > > In the current setup it will never work. > > Regards > > > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io >

Robert Sander

11:22 a.m.

Hi, Am 11.11.20 um 11:20 schrieb Hans van den Bogert:

...

Hoping to learn from this myself, why will the current setup never work?

There are only 4 OSDs in the cluster, with a mix of HDD and SSD. And they try to use erasure coding on that small setup. Erasure coding starts to work with at least 7 to 10 nodes and a corresponding number of OSDs. This cluster is too small to do any amount of "real" work. Regards -- Robert Sander Heinlein Support GmbH Schwedter Str. 8/9b, 10119 Berlin http://www.heinlein-support.de Tel: 030 / 405051-43 Fax: 030 / 405051-19 Zwangsangaben lt. §35a GmbHG: HRB 93818 B / Amtsgericht Berlin-Charlottenburg, Geschäftsführer: Peer Heinlein -- Sitz: Berlin

Anthony D'Atri

12 Nov 12 Nov

12:45 a.m.

...

Am 11.11.20 um 11:20 schrieb Hans van den Bogert: > Hoping to learn from this myself, why will the current setup never work?

That was a bit harsh to have said. Without seeing your EC profile and the topology, it’s hard to say for sure, but I suspect that adding another node with at least one larger OSD might help. The info that Hans asked for would help us say for sure.

...

There are only 4 OSDs in the cluster, with a mix of HDD and SSD. And they try to use erasure coding on that small setup.

Agreed on both points, neither is ideal, but it’s not clear that even the OP thinks it is.

...

Erasure coding starts to work with at least 7 to 10 nodes and a corresponding number of OSDs. This cluster is too small to do any amount of "real" work.

Agreed, but again we don’t *know* that they expect it to. Is this just a PoC / demo cluster cobbled together out of whatever was laying around? The OSDs, by chance, aren’t running on top of RAID volumes are they? How many nodes? Is it possible that this cluster previously had more nodes/OSDs that were removed? I’ll speculate that the 4 OSDs are spread across a total of 2 nodes? — aad

Phil Merricks

10:18 p.m.

Thanks for the reply Robert. Could you briefly explain the issue with the current setup and "what good looks like" here, or point me to some documentation that would help me figure that out myself? I'm guessing here it has something to do with the different sizes and types of dial, and possibly the EC crush rule setup? Best regards Phil Merricks On Wed., Nov. 11, 2020, 1:30 a.m. Robert Sander, < r.sander(a)heinlein-support.de> wrote:

...

Am 07.11.20 um 01:14 schrieb seffyroff(a)gmail.com:

I've inherited a Ceph Octopus cluster that seems like it needs urgent

maintenance before data loss begins to happen. I'm the guy with the most Ceph experience on hand and that's not saying much. I'm experiencing most of the ops and repair tasks for the first time here. My condolences. Get the data from that cluster and put the cluster down. In the current setup it will never work. Regards -- Robert Sander Heinlein Support GmbH Schwedter Str. 8/9b, 10119 Berlin http://www.heinlein-support.de Tel: 030 / 405051-43 Fax: 030 / 405051-19 Zwangsangaben lt. §35a GmbHG: HRB 93818 B / Amtsgericht Berlin-Charlottenburg, Geschäftsführer: Peer Heinlein -- Sitz: Berlin _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

Robert Sander

16 Nov 16 Nov

9:12 a.m.

Am 12.11.20 um 23:18 schrieb Phil Merricks:

...

The cluster is just too small to do anything useful. It can be used to learn Ceph but really not for much more. Erasure Coding needs a lot of CPU, this is usually achieved with a large number of nodes (more than 10) and a proportional number of OSDs. Mixed HDDs and SSDs in one pool is not good practice as a pool should have OSDs of the same speed. Kindest Regards -- Robert Sander Heinlein Support GmbH Schwedter Str. 8/9b, 10119 Berlin http://www.heinlein-support.de Tel: 030 / 405051-43 Fax: 030 / 405051-19 Zwangsangaben lt. §35a GmbHG: HRB 93818 B / Amtsgericht Berlin-Charlottenburg, Geschäftsführer: Peer Heinlein -- Sitz: Berlin

Hans van den Bogert

11 Nov 11 Nov

11:46 a.m.

Can you show a `ceph osd tree` ? On 11/7/20 1:14 AM, seffyroff(a)gmail.com wrote: > I've inherited a Ceph Octopus cluster that seems like it needs urgent maintenance before data loss begins to happen. I'm the guy with the most Ceph experience on hand and that's not saying much. I'm experiencing most of the ops and repair tasks for the first time here. > > Ceph health output looks like this: > > HEALTH_WARN Degraded data redundancy: 3640401/8801868 objects degraded (41.359%), > 128 pgs degraded, 128 pgs undersized; 128 pgs not deep-scrubbed in time; > 128 pgs not scrubbed in time > > Ceph -s output: https://termbin.com/i06u > > The crush rule 'cephfs.media' is here: https://termbin.com/2klmq > > So, it seems like all PGs are in a 'warning' state for the main pool, which is erasure coded and 11TiB across 4 OSDs, of which around 6.4TiB is used. The Ceph services themselves seem happy, they're stable and have Quorum. I'm able to access the web panel fine also. The block devices are of different sizes and types (2 large, different sized spinners, and 2 identical SSDs) > > I would welcome any pointers on what my steps to bring this up to full health may be. If it's undersized, can I simply add another block device/OSD? Or perhaps adjusting config somewhere will get it to rebalance successfully? (the rebalance jobs have been stuck at 0% for weeks) > > Thank you for your time reading this message. > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io

Hans van den Bogert

12:05 p.m.

And also the erasure coded profile, so an example on my cluster would be: $ ceph osd pool get objects.rgw.buckets.data erasure_code_profile erasure_code_profile: objects_ecprofile $ ceph osd erasure-code-profile get objects_ecprofile crush-device-class= crush-failure-domain=host crush-root=default jerasure-per-chunk-alignment=false k=2 m=1 plugin=jerasure technique=reed_sol_van w=8 On 11/11/20 12:46 PM, Hans van den Bogert wrote: > Can you show a `ceph osd tree` ? > > On 11/7/20 1:14 AM, seffyroff(a)gmail.com wrote: >> I've inherited a Ceph Octopus cluster that seems like it needs urgent >> maintenance before data loss begins to happen. I'm the guy with the >> most Ceph experience on hand and that's not saying much. I'm >> experiencing most of the ops and repair tasks for the first time here. >> >> Ceph health output looks like this: >> >> HEALTH_WARN Degraded data redundancy: 3640401/8801868 objects degraded >> (41.359%), >> 128 pgs degraded, 128 pgs undersized; 128 pgs not deep-scrubbed in >> time; >> 128 pgs not scrubbed in time >> >> Ceph -s output: https://termbin.com/i06u >> >> The crush rule 'cephfs.media' is here: https://termbin.com/2klmq >> >> So, it seems like all PGs are in a 'warning' state for the main pool, >> which is erasure coded and 11TiB across 4 OSDs, of which around 6.4TiB >> is used. The Ceph services themselves seem happy, they're stable and >> have Quorum. I'm able to access the web panel fine also. The block >> devices are of different sizes and types (2 large, different sized >> spinners, and 2 identical SSDs) >> >> I would welcome any pointers on what my steps to bring this up to full >> health may be. If it's undersized, can I simply add another block >> device/OSD? Or perhaps adjusting config somewhere will get it to >> rebalance successfully? (the rebalance jobs have been stuck at 0% for >> weeks) >> >> Thank you for your time reading this message. >> _______________________________________________ >> ceph-users mailing list -- ceph-users(a)ceph.io >> To unsubscribe send an email to ceph-users-leave(a)ceph.io

Robert Sander

16 Nov 16 Nov

9:14 a.m.

Am 11.11.20 um 13:05 schrieb Hans van den Bogert:

...

And also the erasure coded profile, so an example on my cluster would be: k=2 m=1

With this profile you can only loose one OSD at a time, which is really not that redundant. Regards -- Robert Sander Heinlein Support GmbH Schwedter Str. 8/9b, 10119 Berlin http://www.heinlein-support.de Tel: 030 / 405051-43 Fax: 030 / 405051-19 Zwangsangaben lt. §35a GmbHG: HRB 93818 B / Amtsgericht Berlin-Charlottenburg, Geschäftsführer: Peer Heinlein -- Sitz: Berlin

Hans van den Bogert

9:53 a.m.

...

With this profile you can only loose one OSD at a time, which is really not that redundant.

That's rather situation dependent. I don't have really large disks, so the repair time isn't that large. Further, my SLO isn't that high that I need 99.xxx% uptime, if 2 disks break in the same repair window, that would be unfortunate, but I'd just grab a backup from a mirroring cluster. Looking at it from another perspective, I came from a single host RAID5 scenario, I'd argue this is better since I can survive a host failure. Also this is a sliding problem right? Someone with K+3 could argue K+2 is not enough as well. Hans

Janne Johansson

10:31 a.m.

Den mån 16 nov. 2020 kl 10:54 skrev Hans van den Bogert < hansbogert(a)gmail.com>gt;:

...

With this profile you can only loose one OSD at a time, which is really not that redundant.

There are a few situations like when you are moving data or when a scrub found a bad PG where you are suddenly out of copies in case something bad happens. I think Raid5 operators also found this out, when your cold spare disk kicks in, you find that old undetected error on one of the other disks and think repairs are bad or stress your raid too much. As with raids, the cheapest resource is often the actual disks and not operator time, restore-wait-times and so on, so that is why many on this list advocates for K+2-or-more, or Repl=3 because we have seen the errors one normally didn't expect. Yes, a double surprise of two disks failing in the same night after running for years is uncommon, but it is not as uncommon to resize pools, move PGs around or find a scrub error or two some day. So while one could always say "one more drive is better than your amount", there are people losing data with repl=2 or K+1 because some more normal operation was in flight and _then_ a single surprise happens. So you can have a weird reboot, causing those PGs needing backfill later, and if one of the uptodate hosts have any single surprise during the recovery, the cluster will lack some of the current data even if two disks were never down at the same time. Drive manufacturers print Mean Time Between Failures, storage admins count Mean Time Between Surprises.. -- May the most significant bit of your life be positive.

Hans van den Bogert

11:59 a.m.

I think we're deviating from the original thread quite a bit and I would never argue that in a production environment with plenty OSDs you should go for R=2 or K+1, so my example cluster which happens to be 2+1 is a bit unlucky. However I'm interested in the following On 11/16/20 11:31 AM, Janne Johansson wrote:

...

So while one could always say "one more drive is better than your amount", there are people losing data with repl=2 or K+1 because some more normal operation was in flight and _then_ a single surprise happens. So you can have a weird reboot, causing those PGs needing backfill later, and if one of the uptodate hosts have any single surprise during the recovery, the cluster will lack some of the current data even if two disks were never down at the same time.

Frank Schilder

12:17 p.m.

To throw in my 5 cents. Choosing m in k+m EC replication is not random and the argument that anyone with larger m could always say lower m is wrong is also not working. Why are people recommending m>=2 for production (or R>=3 replicas)? Its very simple. What is forgotten below is maintenance. Whenever you do maintenance on ceph, there will be longer episodes of degraded redundancy as OSDs are down. However, on production storage systems, writes *always* need to go to redundant storage. Hence, minimum redundancy under maintenance is the keyword here. With m=1 (R=2) one could never do any maintenance without down time as shutting down just 1 OSD would imply writes to non-redundant storage, which in turn would mean data loss in case a disk dies during maintenance. Basically, with m parity shards you can do maintenance on m-1 failure domains at the same time without downtime or non-redundant writes. With R copies you can do maintenance on R-2 failure domains without downtime. If your SLAs require higher minimum redundancy at all times, m (R) need to be large enough to allow maintenance unless you do downtime. However, the latter would be odd, because one of the key features of ceph is its ability to provides infinite uptime while hardware gets renewed all the time. Best regards, ================= Frank Schilder AIT Risø Campus Bygning 109, rum S14 ________________________________________ From: Hans van den Bogert <hansbogert(a)gmail.com> Sent: 16 November 2020 12:59:31 Cc: ceph-users Subject: [ceph-users] Re: (Ceph Octopus) Repairing a neglected Ceph cluster - Degraded Data Reduncancy, all PGs degraded, undersized, not scrubbed in time I think we're deviating from the original thread quite a bit and I would never argue that in a production environment with plenty OSDs you should go for R=2 or K+1, so my example cluster which happens to be 2+1 is a bit unlucky. However I'm interested in the following On 11/16/20 11:31 AM, Janne Johansson wrote:

...

I'm not sure I follow, from a logical perspective they *are* down at the same time right? In your scenario 1 up-to-date replica was left, but even that had a surprise. Okay well that's the risk you take with R=2, but it's not intrinsically different than R=3. _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

Janne Johansson

1:36 p.m.

...

However I'm interested in the following On 11/16/20 11:31 AM, Janne Johansson wrote:

I was trying to describe something like this http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/013237.h… There are more posts from ceph consultants that get called in after someone with "only" R=2/EC K+1 seeing data loss, but I didn't dig them all up. Ie, a kind of split-brain scenario where a small fault/outage on one of the drives and later a bigger fault on another will hurt you in R=2 or K+1 scenarios, even if you don't have two full faults, only one that is temporarily out and the second being the one "real" fault which could be "disk died" or something which we usually imagine as the scenario to handle for raids or replication sizes. Not trying to say you don't understand this, but rather that people who run small ceph clusters tend to start out with R=2 or K+1 EC because the larger faults are easier to imagine. When you have R=3 and you move one of the 3 PG copies for disk resize or something, then you are temporarily reduced to two copies (at least two up-to-date copies if writes are happening during the move), so you can still bear one surprise until this is completed without losing data. With R=2/EC K+1 not so much. Also, me calling it small and large faults mean that there is a huge difference in "a few PGs with issues" and "disk completely broken", but if the PGs are for a pool with disk images, then all images on the pool are prone to have errors and not just "we lost 1% of the files, we can get only those back", but rather the disk images are all having random holes in them. -- May the most significant bit of your life be positive.

Hans van den Bogert

2:14 p.m.

All good points (also replying to Frank Schilder) On 11/16/20 2:36 PM, Janne Johansson wrote:

...

Not trying to say you don't understand this, but rather that people who run small ceph clusters tend to start out with R=2 or K+1 EC because the larger faults are easier to imagine.

TBH, I think I did kind of underestimate this with EC. The implications were clear-cut for me in the case of 2xReplication. My wrong rationale was, I never had down time with my 2+1 EC with min_size=2, even when doing maintenance so I'm doing this right! But the min_size should of course be '3' for data integrity, for reasons that you and Frank have described so well in this thread. Thanks.

Phil Merricks

17 Nov 17 Nov

12:51 a.m.

Thanks for all the replies folks. I think it's testament to the versatility of Ceph that there are some differences of opinion and experience here. With regards to the purpose of this cluster, it is providing distributed storage for stateful workloads of containers. The data produced is somewhat immutable, it can be regenerated over time, however that does cause some slowdown for the teams that use the data as part of their development pipeline. To the best of my understanding the goals here were to provide a data loss safety net but still make efficient use of the block devices assigned to the cluster, which is I imagine where the EC direction came from. The cluster is 3 nodes with the OSDs themselves mainly housed in two of those. Additionally there was an initiative to 'use what we have' (or as I like to put it, 'cobble it together') with commodity hardware that was immediately available to hand. The departure of my predecessor has left some unanswered questions so I am not going to bother second guessing beyond what I already know. As I understand it my steps are: 1: Move off the data and scrap the cluster as it stands currently. (already under way) 2: Group the block devices into pools of the same geometry and type (and maybe do some tiering?) 3. Spread the OSDs across all 3 nodes so recovery scope isn't so easily compromised by a loss at the bare metal level 4. Add more hosts/OSDs if EC is the right solution (this may be outside of the scope of this implementation, but I'll keep a-cobblin'!) The additional ceph outputs follow: ceph osd tree <https://termbin.com/vq63> ceph osd erasure-code-profile get cephfs-media-ec <https://termbin.com/h33h> I am fully prepared to do away with EC to keep things simple and efficient in terms of CPU occupancy. On Mon, 16 Nov 2020 at 02:32, Janne Johansson <icepic.dz(a)gmail.com> wrote:

...

Den mån 16 nov. 2020 kl 10:54 skrev Hans van den Bogert < hansbogert(a)gmail.com>gt;:

With this profile you can only loose one OSD at a time, which is really not that redundant.

Robert Sander

12:52 p.m.

Hi Phil, thanks for the background info. Am 17.11.20 um 01:51 schrieb Phil Merricks:

...

1: Move off the data and scrap the cluster as it stands currently. (already under way) 2: Group the block devices into pools of the same geometry and type (and maybe do some tiering?) 3. Spread the OSDs across all 3 nodes so recovery scope isn't so easily compromised by a loss at the bare metal level 4. Add more hosts/OSDs if EC is the right solution (this may be outside of the scope of this implementation, but I'll keep a-cobblin'!)

This looks like a plan.

...

The additional ceph outputs follow: ceph osd tree <https://termbin.com/vq63> ceph osd erasure-code-profile get cephfs-media-ec <https://termbin.com/h33h>

Your EC profile will not work on two hosts: crush-device-class= crush-failure-domain=host crush-root=default k=2 m=2 You need k+m=4 independent hosts for the EC parts, but your CRUSH map only shows two hosts. This is why all your PGs are undersized and degraded. Regards -- Robert Sander Heinlein Support GmbH Schwedter Str. 8/9b, 10119 Berlin http://www.heinlein-support.de Tel: 030 / 405051-43 Fax: 030 / 405051-19 Zwangsangaben lt. §35a GmbHG: HRB 93818 B / Amtsgericht Berlin-Charlottenburg, Geschäftsführer: Peer Heinlein -- Sitz: Berlin

DHilsbos＠performair.com

6:10 p.m.

Phil; I'm probably going to get crucified for this, but I put a year of testing into this before determining it was sufficient to the needs of my organization... If the primary concerns are capability and cost (not top of the line performance), then I can tell you that we have had great success utilizing Intel Atom C3000 series CPUs. We have built 2 clusters with capacities on the order of 130TiB, for less than $30,000 each. The initial clusters cost $20,000 each, for half the capacity. Our testing cluster cost $8,000 to build, and most of that hardware could have been wrapped into the first production cluster build. For those keeping track, no that is not the lowest cost / unit space. Thank you, Dominic L. Hilsbos, MBA Director – Information Technology Perform Air International Inc. DHilsbos(a)PerformAir.com www.PerformAir.com -----Original Message----- From: Phil Merricks [mailto:seffyroff@gmail.com] Sent: Monday, November 16, 2020 5:52 PM To: Janne Johansson Cc: Hans van den Bogert; ceph-users Subject: [ceph-users] Re: (Ceph Octopus) Repairing a neglected Ceph cluster - Degraded Data Reduncancy, all PGs degraded, undersized, not scrubbed in time Thanks for all the replies folks. I think it's testament to the versatility of Ceph that there are some differences of opinion and experience here. With regards to the purpose of this cluster, it is providing distributed storage for stateful workloads of containers. The data produced is somewhat immutable, it can be regenerated over time, however that does cause some slowdown for the teams that use the data as part of their development pipeline. To the best of my understanding the goals here were to provide a data loss safety net but still make efficient use of the block devices assigned to the cluster, which is I imagine where the EC direction came from. The cluster is 3 nodes with the OSDs themselves mainly housed in two of those. Additionally there was an initiative to 'use what we have' (or as I like to put it, 'cobble it together') with commodity hardware that was immediately available to hand. The departure of my predecessor has left some unanswered questions so I am not going to bother second guessing beyond what I already know. As I understand it my steps are: 1: Move off the data and scrap the cluster as it stands currently. (already under way) 2: Group the block devices into pools of the same geometry and type (and maybe do some tiering?) 3. Spread the OSDs across all 3 nodes so recovery scope isn't so easily compromised by a loss at the bare metal level 4. Add more hosts/OSDs if EC is the right solution (this may be outside of the scope of this implementation, but I'll keep a-cobblin'!) The additional ceph outputs follow: ceph osd tree <https://termbin.com/vq63> ceph osd erasure-code-profile get cephfs-media-ec <https://termbin.com/h33h> I am fully prepared to do away with EC to keep things simple and efficient in terms of CPU occupancy. On Mon, 16 Nov 2020 at 02:32, Janne Johansson <icepic.dz(a)gmail.com> wrote:

...

Den mån 16 nov. 2020 kl 10:54 skrev Hans van den Bogert < hansbogert(a)gmail.com>gt;:

With this profile you can only loose one OSD at a time, which is really not that redundant.

_______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

Anthony D'Atri

6:18 p.m.

...

I'm probably going to get crucified for this

Naw. The <> in your From: header, though …. ;)

1279

days inactive

1289

days old

ceph-users@ceph.io

Manage subscription

19 comments

8 participants

tags (0)

participants (8)

Anthony D'Atri
DHilsbos＠performair.com
Frank Schilder
Hans van den Bogert
Janne Johansson
Phil Merricks
Robert Sander
seffyroff＠gmail.com