Hello XuYun,
In my experience, I would always disable swap; it won't do any good.
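For example, on a typical Linux OSD host that could look roughly like the following (a minimal sketch; the swappiness value is just an illustration, adapt paths and values to your distribution):

    # turn swap off immediately
    swapoff -a
    # comment out any swap entries in /etc/fstab so it stays off after a reboot
    # or, as a softer alternative, reduce how aggressively the kernel swaps:
    sysctl vm.swappiness=10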
--
Martin Verges
Managing director
Mobile: +49 174 9335695
E-Mail: martin.verges(a)croit.io
croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
On Thu, 7 May 2020 at 12:07, XuYun <yunxu(a)me.com> wrote:
We had some back/front ping problems after upgrading from filestore to bluestore. It turned out to be related to insufficient memory/swap usage.
On 6 May 2020, at 22:08, Frank Schilder <frans(a)dtu.dk> wrote:
To answer some of my own questions:
1) Setting
ceph osd set noout
ceph osd set nodown
ceph osd set norebalance
before the restart/re-deployment did no harm. I don't know whether it helped, because I did not retry the procedure that led to OSDs going down. See also point 3 below.
2) A peculiarity of this specific deployment of 2 OSDs was that it was a mix of OSD deployment and restart after a reboot. I'm working on getting this sorted out, and that is a different story. For anyone who finds themselves in a situation where some OSDs are temporarily down/out with PGs remapped and objects degraded for whatever reason while new OSDs come up, the way to have ceph rescan the down/out OSDs after they come back up is as follows (a command sketch follows the list):
- "ceph osd crush move" the new OSDs temporarily to a location outside the crush subtree covering any pools (I keep such a parking space in the crush hierarchy for easy draining and parking of disks)
- bring up the down/out OSDs
- at this point, the cluster will fall back to the original crush map that was in place when the OSDs went down/out
- the cluster will now find all shards that went orphan, and health will be restored very quickly
- once the cluster is healthy, "ceph osd crush move" the new OSDs back to their desired location
- now you will see remapped PGs/misplaced objects, but no degraded objects
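A minimal sketch of those commands, assuming the new OSDs are osd.100 and osd.101, the previously down/out OSD is osd.57, the parking location is a separate crush root named "parking", and the target host is "host-a" (all IDs and names here are made up for illustration):

    # park the new OSDs outside any pool's crush subtree
    ceph osd crush move osd.100 root=parking
    ceph osd crush move osd.101 root=parking
    # bring the down/out OSDs back up on their host, e.g.:
    systemctl start ceph-osd@57
    # wait until the cluster is healthy again, then move the new OSDs to their final location
    ceph osd crush move osd.100 host=host-a
    ceph osd crush move osd.101 host=host-a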
3) I still don't have an answer as to why long heartbeat ping times were observed. There seems to be a more serious issue, and this will continue in its own thread, "Cluster outage due to client IO", to be opened soon.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder <frans(a)dtu.dk>
Sent: 25 April 2020 15:34:25
To: ceph-users
Subject: [ceph-users] Data loss by adding 2 OSDs causing long heartbeat ping times
Dear all,
Two days ago I added a few disks to a ceph cluster and ran into a problem I have never seen before when doing that. The entire cluster was deployed with mimic 13.2.2 and recently upgraded to 13.2.8. This is the first time I have added OSDs under 13.2.8.
I had a few hosts that I needed to add 1 or 2 OSDs to, and I started with one that needed 1. The procedure was as usual (a sketch of the corresponding commands follows below):
ceph osd set norebalance
deploy additional OSD
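For reference, a minimal sketch of what that looks like as commands, assuming the OSD is created with ceph-volume on the target host (/dev/sdX is a placeholder for the actual data device; the deployment tooling in your cluster may differ):

    ceph osd set norebalance
    # on the OSD host: create and start the new bluestore OSD
    ceph-volume lvm create --data /dev/sdX
    # after the new OSD is up and peering has finished:
    ceph osd unset norebalance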
The OSD came up and PGs started peering; so far so good. To my surprise, however, I started seeing health warnings about slow ping times:
Long heartbeat ping times on back interface seen, longest is 1171.910 msec
Long heartbeat ping times on front interface seen, longest is 1180.764 msec
After peering it looked like it got better, and I waited it out until the messages were gone. This took a really long time, at least 5-10 minutes.
I went on to the next host and deployed 2 new OSDs this time. Same as above, but with much worse consequences. Apparently, the ping times exceeded a timeout for a very short moment and an OSD was marked out for ca. 2 seconds. Now all hell broke loose. I got health errors with the dreaded "backfill_toofull", undersized PGs and a large number of degraded objects. I don't know what caused what, but I ended up with data loss by just adding 2 disks.
We have dedicated network hardware, and each of the OSD hosts has 20 GBit front and 40 GBit back network capacity (LACP trunking). There are currently no more than 16 disks per server. The disks were added to an SSD pool. There was no traffic nor any other exceptional load on the system. I have ganglia resource monitoring on all nodes and cannot see a single curve going up. Network, CPU utilisation, load: everything is below measurement accuracy. The hosts and network are quite overpowered and dimensioned to host many more OSDs (in future expansions).
I have three questions, ordered by how urgently I need an answer:
1) I need to add more disks next week and need a workaround. Will something like this help avoid the heartbeat time-out (a sketch of the full set/unset sequence follows below):
ceph osd set noout
ceph osd set nodown
ceph osd set norebalance
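Spelled out, the sequence I have in mind looks roughly like this (the unset commands are simply the obvious counterparts once the new OSDs are up and peered):

    ceph osd set noout
    ceph osd set nodown
    ceph osd set norebalance
    # deploy/restart the OSDs and wait for peering to finish, then:
    ceph osd unset nodown
    ceph osd unset noout
    ceph osd unset norebalance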
2) The "lost" shards of the degraded objects were obviously still on the
cluster somewhere. Is there any way to force the cluster to rescan OSDs for
the shards that went orphan during the incident?
3) This smells a bit like a bug that requires attention. I was probably just lucky that I only lost 1 shard per PG. Has something similar been reported before? Is this fixed in 13.2.10? Is it something new? Any settings that need to be looked at? If logs need to be collected, I can do so during my next attempt. However, I cannot risk the data integrity of a production cluster and will therefore probably not run the original procedure again.
Many thanks for your help and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io