Hi,
Actually, I left the mds managing the damaged filesystem as it is, because
the files can still be read (despite the warnings and errors). So I
restarted the rsyncs to transfer everything to the new filesystem (hence
onto different PGs, since it's a different cephfs with different pools),
but without deleting the old files, to avoid killing the old mds and the
old fs for good. The number of segments is now more or less stable (very
high, ~123611, but not increasing much).
I guess that we will have enough space to copy the remaining data (it
will be tight, but I think it will fit). Once everything is transferred
and checked, I will destroy the old FS and the damaged pool.
F.
On 09/06/2020 at 19:50, Frank Schilder wrote:
Looks like an answer to your other thread is taking its time.
Is it a possible option for you to
- copy all readable files using this PG to some other storage,
- remove or clean up the broken PG and
- copy the files back in?
This might lead to a healthy cluster. I don't know a proper procedure, though. Somehow
the ceph fs must play along, as files using this PG will also use other PGs and be
partly broken.
Have you found other options?
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Francois Legrand<fleg(a)lpnhe.in2p3.fr>
Sent: 08 June 2020 16:38:18
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted
I already had some discussion on the list about this problem, but I
should ask again.
We really lost some objects and there are not enough shards left to
reconstruct them (it's an erasure-coded data pool)... so it cannot be
fixed anymore and we know we have data loss ! I did not mark the PG
out because there are still some parts (objects) present,
and we hope to be able to copy them and save a few bytes more ! It would
be great to be able to flush only the broken objects, but I don't know
how to do that, or even if it's possible !
I thus ran some cephfs-data-scan pg_files to identify the files with
data on this pg, and then I ran a grep -q -m 1 "."
"/path_to_damaged_file"
to identify the ones which are really empty (we tested different ways to
do this and this seems to be the fastest).
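The grep test above can be wrapped in a small helper for running over the paths that cephfs-data-scan pg_files prints (only the grep invocation is from this thread; the helper name and the loop are mine):

```shell
# Hedged helper around the grep test quoted above: grep -q -m 1 "." exits 0
# as soon as it finds any non-empty line, so a file whose data is fully
# lost (zero readable bytes) is reported as EMPTY.
classify_file() {
    if grep -q -m 1 "." "$1"; then
        echo "OK $1"
    else
        echo "EMPTY $1"
    fi
}
# e.g. (hypothetical): cephfs-data-scan pg_files / <pgid> | \
#     while IFS= read -r f; do classify_file "$f"; done
```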
F.
On 08/06/2020 at 16:07, Frank Schilder wrote:
> OK, now we are talking. It is very well possible that trimming will not start until
> this operation is completed.
>
> If there are enough shards/copies to recover the lost objects, you should try a pg
> repair first. If you lost too many replicas, there are ways to flush this PG out of
> the system. You will lose data this way. I don't know how to repair or flush only
> broken objects out of a PG, but would hope that this is possible.
>
> Before you do anything destructive, open a new thread in this list specifically for
> how to repair/remove this PG with the least possible damage.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Francois Legrand<fleg(a)lpnhe.in2p3.fr>
> Sent: 08 June 2020 16:00:28
> To: Frank Schilder; ceph-users
> Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted
>
> There is no recovery going on, but indeed we have a damaged pg (with
> some lost objects due to a major crash a few weeks ago)... and there are
> some shards of this pg on osd 27 !
> That's also why we are migrating all the data out of this FS !
> It's certainly related, and I guess that it's trying to remove some
> data that is already lost and it gets stuck ! I don't know if there is
> a way to tell ceph to forget about these ops ! I guess not.
> I thus think that there is not much to do, apart from reading as
> much data as we can to save as much as possible.
> F.
>
> On 08/06/2020 at 15:48, Frank Schilder wrote:
>> That's strange. Maybe there is another problem. Do you have any other health
>> warnings that might be related? Is there some recovery/rebalancing going on?
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Francois Legrand<fleg(a)lpnhe.in2p3.fr>
>> Sent: 08 June 2020 15:27:59
>> To: Frank Schilder; ceph-users
>> Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted
>>
>> Thanks again for the hint !
>> Indeed, I did a
>> ceph daemon mds.lpnceph-mds02.in2p3.fr objecter_requests
>> and it seems that osd 27 is more or less stuck, with an op of age 34987.5
>> (while the other osds have ages < 1).
>> I tried a ceph osd down 27, which resulted in resetting the age, but I can
>> see that the age of osd.27's ops is rising again.
>> I think I will restart it (btw our osd servers and mds are different
>> machines).
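To watch whether osd.27's op age keeps climbing, the age fields can be pulled out of the objecter_requests output with something like this (the "age" field name matches the dump quoted above; the parsing itself is my assumption and uses plain grep/sort in case jq is not installed):

```shell
# Hedged sketch: extract the largest "age" value from the JSON printed by
#   ceph daemon mds.<name> objecter_requests
max_op_age() {
    # pick out every "age": <number> field, keep the numbers, sort
    # numerically and print the largest one
    grep -o '"age": *[0-9.]*' | grep -o '[0-9.]*' | sort -g | tail -n 1
}
# e.g.: ceph daemon mds.lpnceph-mds02.in2p3.fr objecter_requests | max_op_age
```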
>> F.
>>
>> On 08/06/2020 at 15:01, Frank Schilder wrote:
>>> Hi Francois,
>>>
>>> this sounds great. At least it's operational. I guess it is still using a lot
>>> of swap while trying to replay operations.
>>>
>>> I would cleanly disconnect all clients if you didn't do so already, even
>>> any read-only clients. Any extra load will just slow down recovery. My best guess is
>>> that the MDS is replaying some operations, which is very slow due to swap. While doing
>>> so, the segments to trim will probably keep increasing for a while until it can start
>>> trimming.
>>>
>>> The slow meta-data IO is an operation hanging in some OSD. You should check
>>> which OSD it is (ceph health detail) and check if you can see the operation in the
>>> OSD's OPS queue. I would expect this OSD to have a really long OPS queue. I have seen
>>> meta-data operations hang for a long time. In case this OSD runs on the same server as
>>> your MDS, you will probably have to sit it out.
>>>
>>> If the meta-data operation is the only operation in the queue, the OSD might
>>> need a restart. But be careful, if in doubt ask the list first.
>>>
>>> Best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
>>>
>>> ________________________________________
>>> From: Francois Legrand<fleg(a)lpnhe.in2p3.fr>
>>> Sent: 08 June 2020 14:45:13
>>> To: Frank Schilder; ceph-users
>>> Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted
>>>
>>> Hi Franck,
>>> Finally I did :
>>> ceph config set global mds_beacon_grace 600000
>>> and create /etc/sysctl.d/sysctl-ceph.conf with
>>> vm.min_free_kbytes=4194303
>>> and then
>>> sysctl --system
>>>
>>> After that, the mds went to rejoin for a very long time (almost 24
>>> hours) with errors like :
>>> 2020-06-07 04:10:36.802 7ff866e2e700 1 heartbeat_map is_healthy
>>> 'MDSRank' had timed out after 15
>>> 2020-06-07 04:10:36.802 7ff866e2e700 0
>>> mds.beacon.lpnceph-mds02.in2p3.fr Skipping beacon heartbeat to monitors
>>> (last acked 14653.8s ago); MDS internal heartbeat is not healthy!
>>> 2020-06-07 04:10:37.021 7ff868e32700 -1 monclient: _check_auth_rotating
>>> possible clock skew, rotating keys expired way too early (before
>>> 2020-06-07 03:10:37.022271)
>>> and also
>>> 2020-06-07 04:10:44.942 7ff86d63b700 0 auth: could not find secret_id=10363
>>> 2020-06-07 04:10:44.942 7ff86d63b700 0 cephx: verify_authorizer could
>>> not get service secret for service mds secret_id=10363
>>>
>>> but at the end the mds went active ! :-)
>>> I left it at rest from Sunday afternoon until this morning.
>>> Indeed I was able to connect clients (in read-only for now) and read the
>>> data.
>>> I checked the clients connected with ceph tell
>>> mds.lpnceph-mds02.in2p3.fr client ls
>>> and disconnected the few clients still there (with umount) and checked
>>> that they were not connected anymore with the same command.
>>> But I still have the following warnings
>>> MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
>>> mdslpnceph-mds02.in2p3.fr(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 75372 secs
>>> MDS_TRIM 1 MDSs behind on trimming
>>> mdslpnceph-mds02.in2p3.fr(mds.0): Behind on trimming (122836/128)
>>> max_segments: 128, num_segments: 122836
>>>
>>> and the number of segments is still rising (slowly).
>>> F.
>>>
>>>
>>> On 08/06/2020 at 12:00, Frank Schilder wrote:
>>>> Hi Francois,
>>>>
>>>> did you manage to get any further with this?
>>>>
>>>> Best regards,
>>>> =================
>>>> Frank Schilder
>>>> AIT Risø Campus
>>>> Bygning 109, rum S14
>>>>
>>>> ________________________________________
>>>> From: Frank Schilder<frans(a)dtu.dk>
>>>> Sent: 06 June 2020 15:21:59
>>>> To: ceph-users;fleg(a)lpnhe.in2p3.fr
>>>> Subject: [ceph-users] Re: mds behind on trimming - replay until memory exhausted
>>>>
>>>> I think you have a problem similar to one I have. The priority of beacons
>>>> seems very low. As soon as something gets busy, beacons are ignored or not sent. This
>>>> was part of your log messages from the MDS. It stopped reporting to the MONs due to a
>>>> laggy connection. This lagginess is a result of swapping:
>>>>
>>>>> 2020-06-05 21:39:06.015 7f251bfe6700 1 mds.0.322900 skipping upkeep
>>>>> work because connection to Monitors appears laggy
>>>> Hence, during the (entire) time you are trying to get the MDS back using
>>>> swap, it will almost certainly stop sending beacons. Therefore, you need to disable
>>>> the time-out temporarily, otherwise the MON will always kill it for no real reason.
>>>> The time-out should be long enough to cover the entire recovery period.
>>>>
>>>> Best regards,
>>>> =================
>>>> Frank Schilder
>>>> AIT Risø Campus
>>>> Bygning 109, rum S14
>>>>
>>>> ________________________________________
>>>> From: Francois Legrand<fleg(a)lpnhe.in2p3.fr>
>>>> Sent: 06 June 2020 11:11
>>>> To: Frank Schilder; ceph-users
>>>> Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted
>>>>
>>>> Thanks for the tip,
>>>> I will try that. For now vm.min_free_kbytes = 90112
>>>> Indeed, yesterday after your last mail I set mds_beacon_grace to 240.0
>>>> but this didn't change anything...
>>>> -27> 2020-06-06 06:15:07.373 7f83e3626700 1
>>>> mds.beacon.lpnceph-mds04.in2p3.fr MDS connection to Monitors appears to
>>>> be laggy; 332.044s since last acked beacon
>>>> Which is the same time since last acked beacon I had before changing the
>>>> parameter.
>>>> As the mds beacon interval is 4 s, setting mds_beacon_grace to 240 should
>>>> lead to 960 s (16 min). Thus I think that the bottleneck is elsewhere.
>>>> F.
>>>>
>>>>
>>>> On 06/06/2020 at 09:47, Frank Schilder wrote:
>>>>> Hi Francois,
>>>>>
>>>>> there is actually one more parameter you might consider changing in
>>>>> case the MDS gets kicked out again. For a system under such high memory pressure,
>>>>> the value for the kernel parameter vm.min_free_kbytes might need adjusting. You can
>>>>> check the current value with
>>>>>
>>>>> sysctl vm.min_free_kbytes
>>>>>
>>>>> In your case with heavy swap usage, this value should probably be
>>>>> somewhere between 2-4GB.
>>>>>
>>>>> Careful, do not change this value while memory is in high demand. If
>>>>> not enough memory is available, setting this will immediately OOM kill your machine.
>>>>> Make sure that plenty of pages are unused. Drop page cache if necessary or reboot
>>>>> the machine before setting this value.
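Given that warning, a cautious way to apply the change is to check free memory first. A sketch (the helper name and the 2x safety margin are my assumptions; the 4 GB target is the value used later in this thread):

```shell
# Hedged sketch: only raise vm.min_free_kbytes when there is comfortably
# more free memory than the new reserve, to avoid the OOM risk noted above.
safe_to_raise_min_free() {
    free_kb=$1      # current MemFree in kB (e.g. from /proc/meminfo)
    target_kb=$2    # desired vm.min_free_kbytes
    # require at least twice the target to be free (assumed margin)
    [ "$free_kb" -gt $(( 2 * target_kb )) ]
}

# Example use (root required for the sysctl itself):
# free_kb=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
# safe_to_raise_min_free "$free_kb" 4194303 && sysctl -w vm.min_free_kbytes=4194303
```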
>>>>>
>>>>> Best regards,
>>>>> =================
>>>>> Frank Schilder
>>>>> AIT Risø Campus
>>>>> Bygning 109, rum S14
>>>>>
>>>>> ________________________________________
>>>>> From: Frank Schilder<frans(a)dtu.dk>
>>>>> Sent: 06 June 2020 00:36:13
>>>>> To: ceph-users;fleg(a)lpnhe.in2p3.fr
>>>>> Subject: [ceph-users] Re: mds behind on trimming - replay until memory exhausted
>>>>>
>>>>> Hi Francois,
>>>>>
>>>>> yes, the beacon grace needs to be higher due to the latency of swap.
>>>>> Not sure if 60s will do. For this particular recovery operation, you might want to
>>>>> go much higher (1h) and watch the cluster health closely.
>>>>>
>>>>> Good luck and best regards,
>>>>> =================
>>>>> Frank Schilder
>>>>> AIT Risø Campus
>>>>> Bygning 109, rum S14
>>>>>
>>>>> ________________________________________
>>>>> From: Francois Legrand<fleg(a)lpnhe.in2p3.fr>
>>>>> Sent: 05 June 2020 23:51:04
>>>>> To: Frank Schilder; ceph-users
>>>>> Subject: Re: [ceph-users] mds behind on trimming - replay until memory exhausted
>>>>>
>>>>> Hi,
>>>>> Unfortunately adding swap did not solve the problem !
>>>>> I added 400 GB of swap. It used about 18GB of swap after consuming all
>>>>> the ram, and then stopped with the following logs :
>>>>>
>>>>> 2020-06-05 21:33:31.967 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324691 from mon.1
>>>>> 2020-06-05 21:33:40.355 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324692 from mon.1
>>>>> 2020-06-05 21:33:59.787 7f251b7e5700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
>>>>> 2020-06-05 21:33:59.787 7f251b7e5700 0 mds.beacon.lpnceph-mds04.in2p3.fr Skipping beacon heartbeat to monitors (last acked 3.99979s ago); MDS internal heartbeat is not healthy!
>>>>> 2020-06-05 21:34:00.287 7f251b7e5700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
>>>>> 2020-06-05 21:34:00.287 7f251b7e5700 0 mds.beacon.lpnceph-mds04.in2p3.fr Skipping beacon heartbeat to monitors (last acked 4.49976s ago); MDS internal heartbeat is not healthy!
>>>>> ....
>>>>> 2020-06-05 21:39:05.991 7f251bfe6700 1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
>>>>> 2020-06-05 21:39:06.015 7f251bfe6700 1 mds.beacon.lpnceph-mds04.in2p3.fr MDS connection to Monitors appears to be laggy; 310.228s since last acked beacon
>>>>> 2020-06-05 21:39:06.015 7f251bfe6700 1 mds.0.322900 skipping upkeep work because connection to Monitors appears laggy
>>>>> 2020-06-05 21:39:19.838 7f251bfe6700 1 mds.0.322900 skipping upkeep work because connection to Monitors appears laggy
>>>>> 2020-06-05 21:39:19.869 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324694 from mon.1
>>>>> 2020-06-05 21:39:19.869 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Map removed me (mds.-1 gid:210070681) from cluster due to lost contact; respawning
>>>>> 2020-06-05 21:39:19.870 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr respawn!
>>>>> --- begin dump of recent events ---
>>>>> -9999> 2020-06-05 19:28:07.982 7f25217f1700 5 mds.beacon.lpnceph-mds04.in2p3.fr received beacon reply up:replay seq 2131 rtt 0.930951
>>>>> -9998> 2020-06-05 19:28:11.053 7f251b7e5700 5 mds.beacon.lpnceph-mds04.in2p3.fr Sending beacon up:replay seq 2132
>>>>> -9997> 2020-06-05 19:28:11.053 7f251b7e5700 10 monclient: _send_mon_message to mon.lpnceph-mon02 at v2:134.158.152.210:3300/0
>>>>> -9996> 2020-06-05 19:28:12.176 7f25217f1700 5 mds.beacon.lpnceph-mds04.in2p3.fr received beacon reply up:replay seq 2132 rtt 1.12294
>>>>> -9995> 2020-06-05 19:28:12.176 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 323302 from mon.1
>>>>> -9994> 2020-06-05 19:28:14.290 7f251d7e9700 10 monclient: tick
>>>>> -9993> 2020-06-05 19:28:14.290 7f251d7e9700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2020-06-05 19:27:44.290995)
>>>>> ...
>>>>> 2020-06-05 21:39:31.092 7f3c4d5e3700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324749 from mon.1
>>>>> 2020-06-05 21:39:35.257 7f3c4d5e3700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324750 from mon.1
>>>>> 2020-06-05 21:39:35.257 7f3c4d5e3700 1 mds.lpnceph-mds04.in2p3.fr Map has assigned me to become a standby
>>>>>
>>>>> However, the mons don't seem particularly loaded !
>>>>> So I am trying to set mds_beacon_grace to 60.0 to see if it helps (I did
>>>>> it both for the mds and mon daemons because it seems to be present in
>>>>> both confs).
>>>>> I will tell you if it works.
>>>>>
>>>>> Any other clue ?
>>>>> F.
>>>>>
>>>>> On 05/06/2020 at 14:44, Frank Schilder wrote:
>>>>>> Hi Francois,
>>>>>>
>>>>>> thanks for the link. The option "mds dump cache after rejoin" is for
>>>>>> debugging purposes only. It will write the cache after rejoin to a file, but not
>>>>>> drop the cache. This will not help you. I think this was implemented recently to
>>>>>> make it possible to send a cache dump file to developers after an MDS crash, before
>>>>>> the restarting MDS changes the cache.
>>>>>>
>>>>>> In your case, I would set osd_op_queue_cut_off during the next
>>>>>> regular cluster service or upgrade.
>>>>>>
>>>>>> My best bet right now is to try to add swap. Maybe someone else
>>>>>> reading this has a better idea or you find a hint in one of the other threads.
>>>>>>
>>>>>> Good luck!
>>>>>> =================
>>>>>> Frank Schilder
>>>>>> AIT Risø Campus
>>>>>> Bygning 109, rum S14
>>>>>>
>>>>>> ________________________________________
>>>>>> From: Francois Legrand<fleg(a)lpnhe.in2p3.fr>
>>>>>> Sent: 05 June 2020 14:34:06
>>>>>> To: Frank Schilder; ceph-users
>>>>>> Subject: Re: [ceph-users] mds behind on trimming - replay until memory exhausted
>>>>>>
>>>>>> On 05/06/2020 at 14:18, Frank Schilder wrote:
>>>>>>> Hi Francois,
>>>>>>>
>>>>>>>> I was also wondering if setting mds dump cache after rejoin could help ?
>>>>>>> Haven't heard of that option. Is there some documentation?
>>>>>> I found it on :
>>>>>> https://docs.ceph.com/docs/nautilus/cephfs/mds-config-ref/
>>>>>> mds dump cache after rejoin
>>>>>> Description
>>>>>> Ceph will dump MDS cache contents to a file after rejoining the cache
>>>>>> (during recovery).
>>>>>> Type
>>>>>> Boolean
>>>>>> Default
>>>>>> false
>>>>>>
>>>>>> but I don't think it can help in my case, because rejoin occurs after
>>>>>> replay, and in my case replay never ends !
>>>>>>
>>>>>>>> I have :
>>>>>>>> osd_op_queue=wpq
>>>>>>>> osd_op_queue_cut_off=low
>>>>>>>> I can try to set osd_op_queue_cut_off to high, but it will be useful
>>>>>>>> only if the mds gets active, true ?
>>>>>>> I think so. If you have no clients connected, there should not be queue
>>>>>>> priority issues. Maybe it is best to wait until your cluster is healthy again, as
>>>>>>> you will need to restart all daemons. Make sure you set this in [global]. When I
>>>>>>> applied that change, after re-starting all OSDs my MDSes had reconnect issues until
>>>>>>> I set it on them too. I think all daemons use that option (the prefix osd_ is
>>>>>>> misleading).
>>>>>> For sure I would prefer not to restart all daemons, because the second
>>>>>> filesystem is up and running (with production clients).
>>>>>>
>>>>>>>> For now, the mds_cache_memory_limit is set to 8 589 934 592 (so 8GB,
>>>>>>>> which seems reasonable for an mds server with 32/48GB).
>>>>>>> This sounds bad. 8GB should not cause any issues. Maybe you are hitting
>>>>>>> a bug; I believe there is a regression in Nautilus. There were recent threads on
>>>>>>> absurdly high memory use by MDSes. Maybe it's worth searching for these in the list.
>>>>>> I will have a look.
>>>>>>
>>>>>>>> I already forced the clients to unmount (and even rebooted the ones
>>>>>>>> from which the rsync and the rmdir .snaps were launched).
>>>>>>> I don't know when the MDS acknowledges this. If it was a clean unmount
>>>>>>> (i.e. without -f or forced by reboot) the MDS should have dropped the clients
>>>>>>> already. If it was an unclean unmount it might not be that easy to get the stale
>>>>>>> client session out. However, I don't know about that.
>>>>>> Moreover, when I did that, the mds was already not active but in replay,
>>>>>> so for sure the unmount was not acknowledged by any mds !
>>>>>>
>>>>>>>> I think that providing more swap may be the solution ! I will try that
>>>>>>>> if I cannot find another way to fix it.
>>>>>>> If the memory overrun is somewhat limited, this should allow the MDS to
>>>>>>> trim the logs. It will take a while, but it will get there eventually.
>>>>>>>
>>>>>>> Best regards,
>>>>>>> =================
>>>>>>> Frank Schilder
>>>>>>> AIT Risø Campus
>>>>>>> Bygning 109, rum S14
>>>>>>>
>>>>>>> ________________________________________
>>>>>>> From: Francois Legrand<fleg(a)lpnhe.in2p3.fr>
>>>>>>> Sent: 05 June 2020 13:46:03
>>>>>>> To: Frank Schilder; ceph-users
>>>>>>> Subject: Re: [ceph-users] mds behind on trimming - replay until memory exhausted
>>>>>>>
>>>>>>> I was also wondering if setting mds dump cache after rejoin could help ?
>>>>>>>
>>>>>>>
>>>>>>> On 05/06/2020 at 12:49, Frank Schilder wrote:
>>>>>>>> Out of interest, I did the same on a mimic cluster a few months ago,
>>>>>>>> running up to 5 parallel rsync sessions without any problems. I moved about 120TB.
>>>>>>>> Each rsync was running on a separate client with its own cache. I made sure that
>>>>>>>> the sync dirs were all disjoint (no overlap of files/directories).
>>>>>>>>
>>>>>>>> How many rsync processes are you running in parallel?
>>>>>>>> Do you have these settings enabled:
>>>>>>>>
>>>>>>>> osd_op_queue=wpq
>>>>>>>> osd_op_queue_cut_off=high
>>>>>>>>
>>>>>>>> WPQ should be default, but osd_op_queue_cut_off=high might not be.
>>>>>>>> Setting the latter removed any behind-on-trimming problems we have seen before.
>>>>>>>>
>>>>>>>> You are in a somewhat peculiar situation. I think you need to trim
>>>>>>>> client caches, which requires an active MDS. If your MDS becomes active for at
>>>>>>>> least some time, you could try the following (I'm not an expert here, so take it
>>>>>>>> with a grain of scepticism):
>>>>>>>>
>>>>>>>> - reduce the MDS cache memory limit to force recall of caps much earlier than now
>>>>>>>> - reduce the client cache size
>>>>>>>> - set "osd_op_queue_cut_off=high" if not already done so; I think this requires a
>>>>>>>>   restart of the OSDs, so be careful
>>>>>>>>
>>>>>>>> At this point, you could watch your restart cycle to see if things improve
>>>>>>>> already. Maybe nothing more is required.
>>>>>>>>
>>>>>>>> If you have good SSDs, you could try to provide temporarily some swap space. It
>>>>>>>> saved me once. This will be very slow, but at least it might allow you to move
>>>>>>>> forward.
>>>>>>>>
>>>>>>>> Harder measures:
>>>>>>>>
>>>>>>>> - stop all I/O from the FS clients, throw users out if necessary
>>>>>>>> - ideally, try to cleanly (!) shut down clients or force trimming the cache by
>>>>>>>>   either
>>>>>>>>   * umount or
>>>>>>>>   * sync; echo 3 > /proc/sys/vm/drop_caches
>>>>>>>>   Either of these might hang for a long time. Do not interrupt, and do not do
>>>>>>>>   this on more than one client at a time.
>>>>>>>>
>>>>>>>> At some point, your active MDS should be able to hold a full session. You should
>>>>>>>> then tune the cache and other parameters such that the MDSes can handle your
>>>>>>>> rsync sessions.
>>>>>>>>
>>>>>>>> My experience is that MDSes overrun their cache limits quite a lot. Since I
>>>>>>>> reduced mds_cache_memory_limit to not more than half of what is physically
>>>>>>>> available, I have not had any problems again.
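That half-of-RAM rule of thumb could be computed like this (the helper name is mine, and the ceph command is commented out since it needs a live cluster):

```shell
# Hedged sketch: compute half of physical RAM in bytes, following the
# advice above to keep mds_cache_memory_limit at no more than half of
# what is physically available. Linux-only: reads /proc/meminfo.
half_of_ram_bytes() {
    mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    echo $(( mem_kb / 2 * 1024 ))
}
# ceph config set mds mds_cache_memory_limit "$(half_of_ram_bytes)"
```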
>>>>>>>>
>>>>>>>> Hope that helps.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> =================
>>>>>>>> Frank Schilder
>>>>>>>> AIT Risø Campus
>>>>>>>> Bygning 109, rum S14
>>>>>>>>
>>>>>>>> ________________________________________
>>>>>>>> From: Francois Legrand<fleg(a)lpnhe.in2p3.fr>
>>>>>>>> Sent: 05 June 2020 11:42:42
>>>>>>>> To: ceph-users
>>>>>>>> Subject: [ceph-users] mds behind on trimming - replay until memory exhausted
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>> We have a ceph nautilus cluster (14.2.8) with two cephfs filesystems and
>>>>>>>> 3 mds (1 active for each fs + one failover).
>>>>>>>> We are transferring all the data (~600M files) from one FS (which was in
>>>>>>>> EC 3+2) to the other FS (in R3).
>>>>>>>> On the old FS we first removed the snapshots (to avoid stray problems
>>>>>>>> when removing files) and then ran some rsyncs, deleting the files after
>>>>>>>> the transfer.
>>>>>>>> The operation should take a few more weeks to complete.
>>>>>>>> But a few days ago, we started to get some "mds behind on trimming"
>>>>>>>> warnings from the mds managing the old FS.
>>>>>>>> Yesterday, I restarted the active mds service to force the takeover by
>>>>>>>> the standby mds (basically because the standby is more powerful and
>>>>>>>> has more memory, i.e. 48GB instead of 32).
>>>>>>>> The standby mds took rank 0 and started to replay... the "mds behind on
>>>>>>>> trimming" warning came back and the number of segments rose, as well as
>>>>>>>> the memory usage of the server. Finally, it exhausted the memory of the
>>>>>>>> mds and the service stopped; then the previous mds took rank 0 and
>>>>>>>> started to replay... until memory exhaustion and a new switch of mds,
>>>>>>>> etc.
>>>>>>>> It thus seems that we are in a never-ending loop ! And of course, as the
>>>>>>>> mds is always in replay, the data are not accessible and the transfers
>>>>>>>> are blocked.
>>>>>>>> I stopped all the rsyncs and unmounted the clients.
>>>>>>>>
>>>>>>>> My questions are :
>>>>>>>> - Does the mds trim during the replay, so we could hope that after a
>>>>>>>> while it will purge everything and the mds will be able to become active
>>>>>>>> at the end ?
>>>>>>>> - Is there a way to accelerate the operation or to fix this situation ?
>>>>>>>>
>>>>>>>> Thanks for your help.
>>>>>>>> F.
>>>>>>>> _______________________________________________
>>>>>>>> ceph-users mailing list -- ceph-users(a)ceph.io
>>>>>>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io