Disks are expected to fail, and every once in a while I'll lose one, so
that was expected and didn't come as any surprise to me. Are you
suggesting failed drives almost always stay down and out?
On Thu, Sep 5, 2019 at 11:13 AM Ashley Merrick <singapore(a)amerrick.co.uk>
wrote:
I would suggest checking the logs and seeing the exact reason it's being
marked out.
If the disk is being hit hard and there are heavy I/O delays, then Ceph may
see that as a delayed reply outside of the set window and mark it as out.
There are some variables that can be changed to give an OSD more time to
reply to a heartbeat, but I would definitely suggest checking the OSD log
at the time of the disk being marked out to see exactly what's going on.
The last thing you want to do is just patch over an actual issue if there
is one.
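For reference, a rough sketch of what I'd look at (paths, the <id>
placeholder, and the grace value are illustrative; check your release's
defaults):

    # follow the OSD's own log around the time it was marked down/out
    less /var/log/ceph/ceph-osd.<id>.log
    # or, on systemd hosts:
    journalctl -u ceph-osd@<id> --since "2019-09-05 10:00"

    # the cluster log shows which peers reported the OSD and why
    ceph log last 100

    # give OSDs longer to answer heartbeats before peers report them down
    # (default is 20 seconds; 60 here is just an example)
    ceph config set osd osd_heartbeat_grace 60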
---- On Fri, 06 Sep 2019 02:11:06 +0800 * solarflow99(a)gmail.com
<solarflow99(a)gmail.com> * wrote ----
No, I mean Ceph sees it as a failure and marks it out for a while
On Thu, Sep 5, 2019 at 11:00 AM Ashley Merrick <singapore(a)amerrick.co.uk>
wrote:
Is your HD actually failing and vanishing from the OS and then coming back
shortly?
Or do you just mean your OSD is crashing and then restarting itself
shortly later?
---- On Fri, 06 Sep 2019 01:55:25 +0800 * solarflow99(a)gmail.com
<solarflow99(a)gmail.com> * wrote ----
One of the things I've come to notice is that when HDD drives fail, they often
recover in a short time and get added back to the cluster. This causes the
data to rebalance back and forth, and if I set the noout flag I get a
health warning. Is there a better way to avoid this?
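For reference, the flag and warning in question, plus a per-OSD variant
(the add-noout/rm-noout form assumes Mimic or later, and osd.12 is just an
example id):

    # cluster-wide: stop marking down OSDs out, but this raises a HEALTH_WARN
    ceph osd set noout
    ceph osd unset noout

    # per-OSD alternative, so only the flapping disk is excluded
    ceph osd add-noout osd.12
    ceph osd rm-noout osd.12

    # or raise how long a down OSD may stay down before the monitors mark it out
    # (default is 600 seconds)
    ceph config set mon mon_osd_down_out_interval 1800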