Hi,
Actually, I left the mds managing the damaged filesystem as it is, because
the files can still be read (despite the warnings and errors). So I
restarted the rsyncs to transfer everything to the new filesystem (hence
onto different PGs, since it's a different cephfs with different pools),
but without deleting the old files, to avoid killing the old mds and the
old fs for good. The number of segments is now more or less stable (very
high, ~123611, but not increasing much).
I guess that we will have enough space to copy the remaining data (it
will be tight, but I think it will fit). Once everything is transferred
and checked, I will destroy the old FS and the damaged pool.
F.
On 09/06/2020 at 19:50, Frank Schilder wrote:
Looks like an answer to your other thread is taking its time.
Is it a possible option for you to
- copy all readable files using this PG to some other storage,
- remove or clean up the broken PG and
- copy the files back in?
This might lead to a healthy cluster. I don't know a proper procedure, though. Somehow
the ceph fs must play along, as files using this PG will also use other PGs and be
partly broken.
Have you found other options?
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Francois Legrand<fleg(a)lpnhe.in2p3.fr>
Sent: 08 June 2020 16:38:18
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted
I already had some discussion on the list about this problem, but I
should ask again.
We really lost some objects and there are not enough shards left to
reconstruct them (it's an erasure-coded data pool)... so it cannot be
fixed anymore and we know we have data loss ! I did not mark the PG
out because there are still some parts (objects) present,
and we hope to be able to copy them and save a few bytes more ! It would
be great to be able to flush only the broken objects, but I don't know
how to do that, or even if it's possible !
I thus ran some cephfs-data-scan pg_files to identify the files with
data on this pg, and then I ran a grep -q -m 1 "."
"/path_to_damaged_file"
to identify the ones which are really empty (we tested different ways to
do this and this seems to be the fastest).
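The grep test above can be wrapped in a small helper for running over the paths that cephfs-data-scan pg_files prints (only the grep invocation is from this thread; the helper name and the loop are mine):

```shell
# Hedged helper around the grep test quoted above: grep -q -m 1 "." exits 0
# as soon as it finds any non-empty line, so a file whose data is fully
# lost (zero readable bytes) is reported as EMPTY.
classify_file() {
    if grep -q -m 1 "." "$1"; then
        echo "OK $1"
    else
        echo "EMPTY $1"
    fi
}
# e.g. (hypothetical): cephfs-data-scan pg_files / <pgid> | \
#     while IFS= read -r f; do classify_file "$f"; done
```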
F.
On 08/06/2020 at 16:07, Frank Schilder wrote:
> OK, now we are talking. It is very well possible that trimming will not start until
> this operation is completed.
>
> If there are enough shards/copies to recover the lost objects, you should try a pg
> repair first. If you lost too many replicas, there are ways to flush this PG out of
> the system. You will lose data this way. I don't know how to repair or flush only
> broken objects out of a PG, but would hope that this is possible.
>
> Before you do anything destructive, open a new thread in this list specifically for
> how to repair/remove this PG with the least possible damage.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Francois Legrand<fleg(a)lpnhe.in2p3.fr>
> Sent: 08 June 2020 16:00:28
> To: Frank Schilder; ceph-users
> Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted
>
> There is no recovery going on, but indeed we have a damaged pg (with
> some lost objects due to a major crash a few weeks ago)... and there are
> some shards of this pg on osd 27 !
> That's also why we are migrating all the data out of this FS !
> It's certainly related, and I guess that it's trying to remove some
> data that is already lost and it gets stuck ! I don't know if there is
> a way to tell ceph to forget about these ops ! I guess not.
> I thus think that there is not much to do, apart from reading as
> much data as we can to save as much as possible.
> F.
>
> On 08/06/2020 at 15:48, Frank Schilder wrote:
>> That's strange. Maybe there is another problem. Do you have any other health
>> warnings that might be related? Is there some recovery/rebalancing going on?
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Francois Legrand<fleg(a)lpnhe.in2p3.fr>
>> Sent: 08 June 2020 15:27:59
>> To: Frank Schilder; ceph-users
>> Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted
>>
>> Thanks again for the hint !
>> Indeed, I did a
>> ceph daemon mds.lpnceph-mds02.in2p3.fr objecter_requests
>> and it seems that osd 27 is more or less stuck, with an op of age 34987.5
>> (while the other osds have ages < 1).
>> I tried a ceph osd down 27, which resulted in resetting the age, but I can
>> see that the age of osd.27's ops is rising again.
>> I think I will restart it (btw our osd servers and mds are different
>> machines).
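To watch whether osd.27's op age keeps climbing, the age fields can be pulled out of the objecter_requests output with something like this (the "age" field name matches the dump quoted above; the parsing itself is my assumption and uses plain grep/sort in case jq is not installed):

```shell
# Hedged sketch: extract the largest "age" value from the JSON printed by
#   ceph daemon mds.<name> objecter_requests
max_op_age() {
    # pick out every "age": <number> field, keep the numbers, sort
    # numerically and print the largest one
    grep -o '"age": *[0-9.]*' | grep -o '[0-9.]*' | sort -g | tail -n 1
}
# e.g.: ceph daemon mds.lpnceph-mds02.in2p3.fr objecter_requests | max_op_age
```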
>> F.
>>
>> On 08/06/2020 at 15:01, Frank Schilder wrote:
>>> Hi Francois,
>>>
>>> this sounds great. At least it's operational. I guess it is still using a lot
>>> of swap while trying to replay operations.
>>>
>>> I would cleanly disconnect all clients if you didn't do so already, even
>>> any read-only clients. Any extra load will just slow down recovery. My best guess is
>>> that the MDS is replaying some operations, which is very slow due to swap. While doing
>>> so, the segments to trim will probably keep increasing for a while until it can start
>>> trimming.
>>>
>>> The slow meta-data IO is an operation hanging in some OSD. You should check
>>> which OSD it is (ceph health detail) and check if you can see the operation in the
>>> OSD's OPS queue. I would expect this OSD to have a really long OPS queue. I have seen
>>> meta-data operations hang for a long time. In case this OSD runs on the same server as
>>> your MDS, you will probably have to sit it out.
>>>
>>> If the meta-data operation is the only operation in the queue, the OSD might
>>> need a restart. But be careful, if in doubt ask the list first.
>>>
>>> Best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
>>>
>>> ________________________________________
>>> From: Francois Legrand<fleg(a)lpnhe.in2p3.fr>
>>> Sent: 08 June 2020 14:45:13
>>> To: Frank Schilder; ceph-users
>>> Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted
>>>
>>> Hi Franck,
>>> Finally I did :
>>> ceph config set global mds_beacon_grace 600000
>>> and create /etc/sysctl.d/sysctl-ceph.conf with
>>> vm.min_free_kbytes=4194303
>>> and then
>>> sysctl --system
>>>
>>> After that, the mds went to rejoin for a very long time (almost 24
>>> hours) with errors like :
>>> 2020-06-07 04:10:36.802 7ff866e2e700 1 heartbeat_map is_healthy
>>> 'MDSRank' had timed out after 15
>>> 2020-06-07 04:10:36.802 7ff866e2e700 0
>>> mds.beacon.lpnceph-mds02.in2p3.fr Skipping beacon heartbeat to monitors
>>> (last acked 14653.8s ago); MDS internal heartbeat is not healthy!
>>> 2020-06-07 04:10:37.021 7ff868e32700 -1 monclient: _check_auth_rotating
>>> possible clock skew, rotating keys expired way too early (before
>>> 2020-06-07 03:10:37.022271)
>>> and also
>>> 2020-06-07 04:10:44.942 7ff86d63b700 0 auth: could not find secret_id=10363
>>> 2020-06-07 04:10:44.942 7ff86d63b700 0 cephx: verify_authorizer could
>>> not get service secret for service mds secret_id=10363
>>>
>>> but at the end the mds went active ! :-)
>>> I left it at rest from Sunday afternoon until this morning.
>>> Indeed I was able to connect clients (in read-only for now) and read the
>>> data.
>>> I checked the clients connected with ceph tell
>>> mds.lpnceph-mds02.in2p3.fr client ls
>>> and disconnected the few clients still there (with umount) and checked
>>> that they were not connected anymore with the same command.
>>> But I still have the following warnings
>>> MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
>>> mdslpnceph-mds02.in2p3.fr(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 75372 secs
>>> MDS_TRIM 1 MDSs behind on trimming
>>> mdslpnceph-mds02.in2p3.fr(mds.0): Behind on trimming (122836/128)
>>> max_segments: 128, num_segments: 122836
>>>
>>> and the number of segments is still rising (slowly).
>>> F.
>>>
>>>
>>> On 08/06/2020 at 12:00, Frank Schilder wrote:
>>>> Hi Francois,
>>>>
>>>> did you manage to get any further with this?
>>>>
>>>> Best regards,
>>>> =================
>>>> Frank Schilder
>>>> AIT Risø Campus
>>>> Bygning 109, rum S14
>>>>
>>>> ________________________________________
>>>> From: Frank Schilder<frans(a)dtu.dk>
>>>> Sent: 06 June 2020 15:21:59
>>>> To: ceph-users;fleg(a)lpnhe.in2p3.fr
>>>> Subject: [ceph-users] Re: mds behind on trimming - replay until memory exhausted
>>>>
>>>> I think you have a problem similar to one I have. The priority of beacons
>>>> seems very low. As soon as something gets busy, beacons are ignored or not sent. This
>>>> was part of your log messages from the MDS. It stopped reporting to the MONs due to a
>>>> laggy connection. This lagginess is a result of swapping:
>>>>
>>>>> 2020-06-05 21:39:06.015 7f251bfe6700 1 mds.0.322900 skipping upkeep
>>>>> work because connection to Monitors appears laggy
>>>> Hence, during the (entire) time you are trying to get the MDS back using
>>>> swap, it will almost certainly stop sending beacons. Therefore, you need to disable
>>>> the time-out temporarily, otherwise the MON will always kill it for no real reason.
>>>> The time-out should be long enough to cover the entire recovery period.
>>>>
>>>> Best regards,
>>>> =================
>>>> Frank Schilder
>>>> AIT Risø Campus
>>>> Bygning 109, rum S14
>>>>
>>>> ________________________________________
>>>> From: Francois Legrand<fleg(a)lpnhe.in2p3.fr>
>>>> Sent: 06 June 2020 11:11
>>>> To: Frank Schilder; ceph-users
>>>> Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted
>>>>
>>>> Thanks for the tip,
>>>> I will try that. For now vm.min_free_kbytes = 90112
>>>> Indeed, yesterday after your last mail I set mds_beacon_grace to 240.0
>>>> but this didn't change anything...
>>>> -27> 2020-06-06 06:15:07.373 7f83e3626700 1
>>>> mds.beacon.lpnceph-mds04.in2p3.fr MDS connection to Monitors appears to
>>>> be laggy; 332.044s since last acked beacon
>>>> Which is the same time since last acked beacon I had before changing the
>>>> parameter.
>>>> As the mds beacon interval is 4 s, setting mds_beacon_grace to 240 should
>>>> lead to 960 s (16 min). Thus I think that the bottleneck is elsewhere.
>>>> F.
>>>>
>>>>
>>>> On 06/06/2020 at 09:47, Frank Schilder wrote:
>>>>> Hi Francois,
>>>>>
>>>>> there is actually one more parameter you might consider changing in
>>>>> case the MDS gets kicked out again. For a system under such high memory pressure,
>>>>> the value for the kernel parameter vm.min_free_kbytes might need adjusting. You can
>>>>> check the current value with
>>>>>
>>>>> sysctl vm.min_free_kbytes
>>>>>
>>>>> In your case with heavy swap usage, this value should probably be
>>>>> somewhere between 2-4GB.
>>>>>
>>>>> Careful, do not change this value while memory is in high demand. If
>>>>> not enough memory is available, setting this will immediately OOM kill your machine.
>>>>> Make sure that plenty of pages are unused. Drop page cache if necessary or reboot
>>>>> the machine before setting this value.
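Given that warning, a cautious way to apply the change is to check free memory first. A sketch (the helper name and the 2x safety margin are my assumptions; the 4 GB target is the value used later in this thread):

```shell
# Hedged sketch: only raise vm.min_free_kbytes when there is comfortably
# more free memory than the new reserve, to avoid the OOM risk noted above.
safe_to_raise_min_free() {
    free_kb=$1      # current MemFree in kB (e.g. from /proc/meminfo)
    target_kb=$2    # desired vm.min_free_kbytes
    # require at least twice the target to be free (assumed margin)
    [ "$free_kb" -gt $(( 2 * target_kb )) ]
}

# Example use (root required for the sysctl itself):
# free_kb=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
# safe_to_raise_min_free "$free_kb" 4194303 && sysctl -w vm.min_free_kbytes=4194303
```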
>>>>>
>>>>> Best regards,
>>>>> =================
>>>>> Frank Schilder
>>>>> AIT Risø Campus
>>>>> Bygning 109, rum S14
>>>>>
>>>>> ________________________________________
>>>>> From: Frank Schilder<frans(a)dtu.dk>
>>>>> Sent: 06 June 2020 00:36:13
>>>>> To: ceph-users;fleg(a)lpnhe.in2p3.fr
>>>>> Subject: [ceph-users] Re: mds behind on trimming - replay until memory exhausted
>>>>>
>>>>> Hi Francois,
>>>>>
>>>>> yes, the beacon grace needs to be higher due to the latency of swap.
>>>>> Not sure if 60s will do. For this particular recovery operation, you might want to
>>>>> go much higher (1h) and watch the cluster health closely.
>>>>>
>>>>> Good luck and best regards,
>>>>> =================
>>>>> Frank Schilder
>>>>> AIT Risø Campus
>>>>> Bygning 109, rum S14
>>>>>
>>>>> ________________________________________
>>>>> From: Francois Legrand<fleg(a)lpnhe.in2p3.fr>
>>>>> Sent: 05 June 2020 23:51:04
>>>>> To: Frank Schilder; ceph-users
>>>>> Subject: Re: [ceph-users] mds behind on trimming - replay until memory exhausted
>>>>>
>>>>> Hi,
>>>>> Unfortunately adding swap did not solve the problem !
>>>>> I added 400 GB of swap. It used about 18GB of swap after consuming all
>>>>> the ram, and then stopped with the following logs :
>>>>>
>>>>> 2020-06-05 21:33:31.967 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324691 from mon.1
>>>>> 2020-06-05 21:33:40.355 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324692 from mon.1
>>>>> 2020-06-05 21:33:59.787 7f251b7e5700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
>>>>> 2020-06-05 21:33:59.787 7f251b7e5700 0 mds.beacon.lpnceph-mds04.in2p3.fr Skipping beacon heartbeat to monitors (last acked 3.99979s ago); MDS internal heartbeat is not healthy!
>>>>> 2020-06-05 21:34:00.287 7f251b7e5700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
>>>>> 2020-06-05 21:34:00.287 7f251b7e5700 0 mds.beacon.lpnceph-mds04.in2p3.fr Skipping beacon heartbeat to monitors (last acked 4.49976s ago); MDS internal heartbeat is not healthy!
>>>>> ....
>>>>> 2020-06-05 21:39:05.991 7f251bfe6700 1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
>>>>> 2020-06-05 21:39:06.015 7f251bfe6700 1 mds.beacon.lpnceph-mds04.in2p3.fr MDS connection to Monitors appears to be laggy; 310.228s since last acked beacon
>>>>> 2020-06-05 21:39:06.015 7f251bfe6700 1 mds.0.322900 skipping upkeep work because connection to Monitors appears laggy
>>>>> 2020-06-05 21:39:19.838 7f251bfe6700 1 mds.0.322900 skipping upkeep work because connection to Monitors appears laggy
>>>>> 2020-06-05 21:39:19.869 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324694 from mon.1
>>>>> 2020-06-05 21:39:19.869 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Map removed me (mds.-1 gid:210070681) from cluster due to lost contact; respawning
>>>>> 2020-06-05 21:39:19.870 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr respawn!
>>>>> --- begin dump of recent events ---
>>>>> -9999> 2020-06-05 19:28:07.982 7f25217f1700 5 mds.beacon.lpnceph-mds04.in2p3.fr received beacon reply up:replay seq 2131 rtt 0.930951
>>>>> -9998> 2020-06-05 19:28:11.053 7f251b7e5700 5 mds.beacon.lpnceph-mds04.in2p3.fr Sending beacon up:replay seq 2132
>>>>> -9997> 2020-06-05 19:28:11.053 7f251b7e5700 10 monclient: _send_mon_message to mon.lpnceph-mon02 at v2:134.158.152.210:3300/0
>>>>> -9996> 2020-06-05 19:28:12.176 7f25217f1700 5 mds.beacon.lpnceph-mds04.in2p3.fr received beacon reply up:replay seq 2132 rtt 1.12294
>>>>> -9995> 2020-06-05 19:28:12.176 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 323302 from mon.1
>>>>> -9994> 2020-06-05 19:28:14.290 7f251d7e9700 10 monclient: tick
>>>>> -9993> 2020-06-05 19:28:14.290 7f251d7e9700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2020-06-05 19:27:44.290995)
>>>>> ...
>>>>> 2020-06-05 21:39:31.092 7f3c4d5e3700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324749 from mon.1
>>>>> 2020-06-05 21:39:35.257 7f3c4d5e3700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324750 from mon.1
>>>>> 2020-06-05 21:39:35.257 7f3c4d5e3700 1 mds.lpnceph-mds04.in2p3.fr Map has assigned me to become a standby
>>>>>
>>>>> However, the mons don't seem particularly loaded !
>>>>> So I am trying to set mds_beacon_grace to 60.0 to see if it helps (I did
>>>>> it both for the mds and mon daemons because it seems to be present in
>>>>> both confs).
>>>>> I will tell you if it works.
>>>>>
>>>>> Any other clue ?
>>>>> F.
>>>>>
>>>>> On 05/06/2020 at 14:44, Frank Schilder wrote:
>>>>>> Hi Francois,
>>>>>>
>>>>>> thanks for the link. The option "mds dump cache after rejoin" is for
>>>>>> debugging purposes only. It will write the cache after rejoin to a file, but not
>>>>>> drop the cache. This will not help you. I think this was implemented recently to
>>>>>> make it possible to send a cache dump file to developers after an MDS crash, before
>>>>>> the restarting MDS changes the cache.
>>>>>>
>>>>>> In your case, I would set osd_op_queue_cut_off during the next
>>>>>> regular cluster service or upgrade.
>>>>>>
>>>>>> My best bet right now is to try to add swap. Maybe someone else
>>>>>> reading this has a better idea or you find a hint in one of the other threads.
>>>>>>
>>>>>> Good luck!
>>>>>> =================
>>>>>> Frank Schilder
>>>>>> AIT Risø Campus
>>>>>> Bygning 109, rum S14
>>>>>>
>>>>>> ________________________________________
>>>>>> From: Francois Legrand<fleg(a)lpnhe.in2p3.fr>
>>>>>> Sent: 05 June 2020 14:34:06
>>>>>> To: Frank Schilder; ceph-users
>>>>>> Subject: Re: [ceph-users] mds behind on trimming - replay until memory exhausted
>>>>>>
>>>>>> On 05/06/2020 at 14:18, Frank Schilder wrote:
>>>>>>> Hi Francois,
>>>>>>>
>>>>>>>> I was also wondering if setting mds dump cache after rejoin could help ?
>>>>>>> Haven't heard of that option. Is there some documentation?
>>>>>> I found it on :
>>>>>> https://docs.ceph.com/docs/nautilus/cephfs/mds-config-ref/
>>>>>> mds dump cache after rejoin
>>>>>> Description
>>>>>> Ceph will dump MDS cache contents to a file after rejoining the cache
>>>>>> (during recovery).
>>>>>> Type
>>>>>> Boolean
>>>>>> Default
>>>>>> false
>>>>>>
>>>>>> but I don't think it can help in my case, because rejoin occurs after
>>>>>> replay, and in my case replay never ends !
>>>>>>
>>>>>>>> I have :
>>>>>>>> osd_op_queue=wpq
>>>>>>>> osd_op_queue_cut_off=low
>>>>>>>> I can try to set osd_op_queue_cut_off to high, but it will be useful
>>>>>>>> only if the mds gets active, true ?
>>>>>>> I think so. If you have no clients connected, there should not be queue
>>>>>>> priority issues. Maybe it is best to wait until your cluster is healthy again, as
>>>>>>> you will need to restart all daemons. Make sure you set this in [global]. When I
>>>>>>> applied that change, after re-starting all OSDs my MDSes had reconnect issues until
>>>>>>> I set it on them too. I think all daemons use that option (the prefix osd_ is
>>>>>>> misleading).
>>>>>> For sure I would prefer not to restart all daemons, because the second
>>>>>> filesystem is up and running (with production clients).
>>>>>>
>>>>>>>> For now, the mds_cache_memory_limit is set to 8 589 934 592 (so 8GB,
>>>>>>>> which seems reasonable for an mds server with 32/48GB).
>>>>>>> This sounds bad. 8GB should not cause any issues. Maybe you are hitting
>>>>>>> a bug; I believe there is a regression in Nautilus. There were recent threads on
>>>>>>> absurdly high memory use by MDSes. Maybe it's worth searching for these in the list.
>>>>>> I will have a look.
>>>>>>
>>>>>>>> I already forced the clients to unmount (and even rebooted the ones
>>>>>>>> from which the rsync and the rmdir .snaps were launched).
>>>>>>> I don't know when the MDS acknowledges this. If it was a clean unmount
>>>>>>> (i.e. without -f or forced by reboot) the MDS should have dropped the clients
>>>>>>> already. If it was an unclean unmount it might not be that easy to get the stale
>>>>>>> client session out. However, I don't know about that.
>>>>>> Moreover, when I did that, the mds was already not active but in replay,
>>>>>> so for sure the unmount was not acknowledged by any mds !
>>>>>>
>>>>>>>> I think that providing more swap may be the solution ! I will try that
>>>>>>>> if I cannot find another way to fix it.
>>>>>>> If the memory overrun is somewhat limited, this should allow the MDS to
>>>>>>> trim the logs. It will take a while, but it will get there eventually.
>>>>>>>
>>>>>>> Best regards,
>>>>>>> =================
>>>>>>> Frank Schilder
>>>>>>> AIT Risø Campus
>>>>>>> Bygning 109, rum S14
>>>>>>>
>>>>>>> ________________________________________
>>>>>>> From: Francois Legrand<fleg(a)lpnhe.in2p3.fr>
>>>>>>> Sent: 05 June 2020 13:46:03
>>>>>>> To: Frank Schilder; ceph-users
>>>>>>> Subject: Re: [ceph-users] mds behind on trimming - replay until memory exhausted
>>>>>>>
>>>>>>> I was also wondering if setting mds dump cache after rejoin could help ?
>>>>>>>
>>>>>>>
>>>>>>> On 05/06/2020 at 12:49, Frank Schilder wrote:
>>>>>>>> Out of interest, I did the same on a mimic cluster a few months ago,
>>>>>>>> running up to 5 parallel rsync sessions without any problems. I moved about 120TB.
>>>>>>>> Each rsync was running on a separate client with its own cache. I made sure that
>>>>>>>> the sync dirs were all disjoint (no overlap of files/directories).
>>>>>>>>
>>>>>>>> How many rsync processes are you running in parallel?
>>>>>>>> Do you have these settings enabled:
>>>>>>>>
>>>>>>>> osd_op_queue=wpq
>>>>>>>> osd_op_queue_cut_off=high
>>>>>>>>
>>>>>>>> WPQ should be default, but osd_op_queue_cut_off=high might not be.
>>>>>>>> Setting the latter removed any behind-on-trimming problems we have seen before.
>>>>>>>>
>>>>>>>> You are in a somewhat peculiar situation. I think you need to trim
>>>>>>>> client caches, which requires an active MDS. If your MDS becomes active for at
>>>>>>>> least some time, you could try the following (I'm not an expert here, so take it
>>>>>>>> with a grain of scepticism):
>>>>>>>>
>>>>>>>> - reduce the MDS cache memory limit to force recall of caps much earlier than now
>>>>>>>> - reduce the client cache size
>>>>>>>> - set "osd_op_queue_cut_off=high" if not already done so; I think this requires a
>>>>>>>>   restart of the OSDs, so be careful
>>>>>>>>
>>>>>>>> At this point, you could watch your restart cycle to see if things improve
>>>>>>>> already. Maybe nothing more is required.
>>>>>>>>
>>>>>>>> If you have good SSDs, you could try to provide temporarily some swap space. It
>>>>>>>> saved me once. This will be very slow, but at least it might allow you to move
>>>>>>>> forward.
>>>>>>>>
>>>>>>>> Harder measures:
>>>>>>>>
>>>>>>>> - stop all I/O from the FS clients, throw users out if necessary
>>>>>>>> - ideally, try to cleanly (!) shut down clients or force trimming the cache by
>>>>>>>>   either
>>>>>>>>   * umount or
>>>>>>>>   * sync; echo 3 > /proc/sys/vm/drop_caches
>>>>>>>>   Either of these might hang for a long time. Do not interrupt, and do not do
>>>>>>>>   this on more than one client at a time.
>>>>>>>>
>>>>>>>> At some point, your active MDS should be able to hold a full session. You should
>>>>>>>> then tune the cache and other parameters such that the MDSes can handle your
>>>>>>>> rsync sessions.
>>>>>>>>
>>>>>>>> My experience is that MDSes overrun their cache limits quite a lot. Since I
>>>>>>>> reduced mds_cache_memory_limit to not more than half of what is physically
>>>>>>>> available, I have not had any problems again.
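That half-of-RAM rule of thumb could be computed like this (the helper name is mine, and the ceph command is commented out since it needs a live cluster):

```shell
# Hedged sketch: compute half of physical RAM in bytes, following the
# advice above to keep mds_cache_memory_limit at no more than half of
# what is physically available. Linux-only: reads /proc/meminfo.
half_of_ram_bytes() {
    mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    echo $(( mem_kb / 2 * 1024 ))
}
# ceph config set mds mds_cache_memory_limit "$(half_of_ram_bytes)"
```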
>>>>>>>>
>>>>>>>> Hope that helps.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> =================
>>>>>>>> Frank Schilder
>>>>>>>> AIT Risø Campus
>>>>>>>> Bygning 109, rum S14
>>>>>>>>
>>>>>>>> ________________________________________
>>>>>>>> From: Francois Legrand<fleg(a)lpnhe.in2p3.fr>
>>>>>>>> Sent: 05 June 2020 11:42:42
>>>>>>>> To: ceph-users
>>>>>>>> Subject: [ceph-users] mds behind on trimming - replay until memory exhausted
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>> We have a ceph nautilus cluster (14.2.8) with two cephfs filesystems and
>>>>>>>> 3 mds (1 active for each fs + one failover).
>>>>>>>> We are transferring all the data (~600M files) from one FS (which was in
>>>>>>>> EC 3+2) to the other FS (in R3).
>>>>>>>> On the old FS we first removed the snapshots (to avoid stray problems
>>>>>>>> when removing files) and then ran some rsyncs, deleting the files after
>>>>>>>> the transfer.
>>>>>>>> The operation should take a few more weeks to complete.
>>>>>>>> But a few days ago, we started to get some "mds behind on trimming"
>>>>>>>> warnings from the mds managing the old FS.
>>>>>>>> Yesterday, I restarted the active mds service to force the takeover by
>>>>>>>> the standby mds (basically because the standby is more powerful and
>>>>>>>> has more memory, i.e. 48GB instead of 32).
>>>>>>>> The standby mds took rank 0 and started to replay... the "mds behind on
>>>>>>>> trimming" warning came back and the number of segments rose, as well as
>>>>>>>> the memory usage of the server. Finally, it exhausted the memory of the
>>>>>>>> mds and the service stopped; then the previous mds took rank 0 and
>>>>>>>> started to replay... until memory exhaustion and a new switch of mds,
>>>>>>>> etc.
>>>>>>>> It thus seems that we are in a never-ending loop ! And of course, as the
>>>>>>>> mds is always in replay, the data are not accessible and the transfers
>>>>>>>> are blocked.
>>>>>>>> I stopped all the rsyncs and unmounted the clients.
>>>>>>>>
>>>>>>>> My questions are :
>>>>>>>> - Does the mds trim during the replay, so we could hope that after a
>>>>>>>> while it will purge everything and the mds will be able to become active
>>>>>>>> at the end ?
>>>>>>>> - Is there a way to accelerate the operation or to fix this situation ?
>>>>>>>>
>>>>>>>> Thanks for your help.
>>>>>>>> F.
>>>>>>>> _______________________________________________
>>>>>>>> ceph-users mailing list -- ceph-users(a)ceph.io
>>>>>>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io