Hi Erich,
great that you recovered from this.
It sounds like you had the same problem I had a few months ago.
mds crashes after up:replay state - ceph-users - lists.ceph.io
<https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/IRV6K74GWE2SWAWQZVUDQAPSMY4J4R4D/#UBPVYXO5ODABKUZ436HN4WBX7QJUXY3P>
Kind regards,
Lars
Lars Köppel
Developer
Email: lars.koeppel(a)ariadne.ai
Phone: +49 6221 5993580
ariadne.ai (Germany) GmbH
Häusserstraße 3, 69115 Heidelberg
Amtsgericht Mannheim, HRB 744040
Geschäftsführer: Dr. Fabian Svara
On Mon, Apr 22, 2024 at 11:31 PM Sake Ceph <ceph(a)paulusma.eu> wrote:
100 GB of RAM! Damn, that's a lot for a filesystem in my opinion, or am I
wrong?
Kind regards,
Sake
On 22-04-2024 21:50 CEST, Erich Weiler <weiler(a)soe.ucsc.edu> wrote:
I was able to start another MDS daemon on another node that had 512GB
RAM, and then the active MDS eventually migrated there, and went through
the replay (which consumed about 100 GB of RAM), and then things
recovered. Phew. I guess I need significantly more RAM in my MDS
servers... I had no idea the MDS daemon could require that much RAM.
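For anyone who ends up in the same spot: with a cephadm-managed cluster, one
way to do this is to widen the MDS placement spec so a daemon lands on the
bigger box, and (if the rank doesn't move on its own) fail it over to a
standby. A rough sketch only, assuming cephadm; "big-node-01" is a made-up
hostname, and the fs name and daemon count need to match your setup:

# widen the MDS placement to include the high-memory host
# ('big-node-01' is a placeholder for the 512GB box)
ceph orch apply mds slugfs --placement="3 pr-md-01 pr-md-02 big-node-01"
# once the new daemon registers as a standby, fail the struggling rank so
# a standby (ideally the one on the big node) takes over and replays there
ceph mds fail slugfs:0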
-erich
On 4/22/24 11:41 AM, Erich Weiler wrote:
> Possibly, but it would be pretty time consuming and difficult...
>
> Is it maybe a RAM issue since my MDS RAM is filling up? Should I maybe
> bring up another MDS on another server with a huge amount of RAM and move
> the MDS there in hopes it will have enough RAM to complete the replay?
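>
> (For reference, per-daemon MDS memory can be checked like this -- a sketch,
> assuming a cephadm deployment and that the daemon names match:
>
> # per-daemon memory as the orchestrator reports it (MEM USE / MEM LIM)
> ceph orch ps --daemon-type mds
> # or just watch the raw ceph-mds processes on the MDS host
> top -p "$(pgrep -d, ceph-mds)"
> )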
>
> On 4/22/24 11:37 AM, Sake Ceph wrote:
>> Just a question: is it possible to block or disable all clients? Just
>> to prevent load on the system.
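>>
>> (On recent releases there is a filesystem flag for refusing new client
>> sessions; whether it helps while the MDS is still in replay I'm not sure,
>> so treat this as a sketch:
>>
>> # reject new CephFS client sessions for this filesystem
>> ceph fs set slugfs refuse_client_session true
>> # evicting already-connected sessions needs an active MDS, e.g.:
>> ceph tell mds.slugfs.pr-md-01.xdtppo client ls
>> ceph tell mds.slugfs.pr-md-01.xdtppo client evict id=<client-id>
>> and remember to flip refuse_client_session back to false afterwards.)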
>>
>> Kind regards,
>> Sake
>>> On 22-04-2024 20:33 CEST, Erich Weiler <weiler(a)soe.ucsc.edu> wrote:
>>>
>>> I also see this from 'ceph health detail':
>>>
>>> # ceph health detail
>>> HEALTH_WARN 1 filesystem is degraded; 1 MDSs report oversized cache; 1 MDSs behind on trimming
>>> [WRN] FS_DEGRADED: 1 filesystem is degraded
>>> fs slugfs is degraded
>>> [WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
>>> mds.slugfs.pr-md-01.xdtppo(mds.0): MDS cache is too large
>>> (19GB/8GB); 0 inodes in use by clients, 0 stray files
>>> [WRN] MDS_TRIM: 1 MDSs behind on trimming
>>> mds.slugfs.pr-md-01.xdtppo(mds.0): Behind on trimming (127084/250) max_segments: 250, num_segments: 127084
>>>
>>> MDS cache too large? The mds process is taking up 22GB right now and
>>> starting to push my server into swap, so maybe it somehow is too large....
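>>>
>>> If I understand it right, the "8GB" in that warning is
>>> mds_cache_memory_limit, and the warning fires once usage passes
>>> mds_health_cache_threshold times that limit. A sketch for checking and
>>> temporarily raising it -- the 32 GiB value is only an example, and the
>>> MDS can still overshoot the limit during replay:
>>>
>>> # current cache limit, in bytes
>>> ceph config get mds mds_cache_memory_limit
>>> # raise it, e.g. to 32 GiB, for all MDS daemons
>>> ceph config set mds mds_cache_memory_limit 34359738368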
>>>
>>> On 4/22/24 11:17 AM, Erich Weiler wrote:
>>>> Hi All,
>>>>
>>>> We have a somewhat serious situation where we have a cephfs filesystem
>>>> (18.2.1), and 2 active MDSs (one standby). I tried to restart one of the
>>>> active daemons to unstick a bunch of blocked requests, and the standby
>>>> went into 'replay' for a very long time, then RAM on that MDS server
>>>> filled up, and it just stayed there for a while, then eventually
>>>> appeared to give up and switched to the standby, but the cycle started
>>>> again. So I restarted that MDS, and now I'm in a situation where I see
>>>> this:
>>>>
>>>> # ceph fs status
>>>> slugfs - 29 clients
>>>> ======
>>>> RANK   STATE              MDS               ACTIVITY   DNS     INOS    DIRS   CAPS
>>>>  0     replay    slugfs.pr-md-01.xdtppo               3958k   57.1k   12.2k    0
>>>>  1     resolve   slugfs.pr-md-02.sbblqq                   0       3       1    0
>>>> POOL TYPE USED AVAIL
>>>> cephfs_metadata metadata 997G 2948G
>>>> cephfs_md_and_data data 0 87.6T
>>>> cephfs_data data 773T 175T
>>>> STANDBY MDS
>>>> slugfs.pr-md-03.mclckv
>>>> MDS version: ceph version 18.2.1
>>>> (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)
>>>>
>>>> It just stays there indefinitely. All my clients are hung. I tried
>>>> restarting all MDS daemons and they just went back to this state after
>>>> coming back up.
>>>>
>>>> Is there any way I can somehow escape this state of indefinite
>>>> replay/resolve?
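>>>>
>>>> The only hint I can find on whether replay is even moving is the
>>>> journal positions from the MDS admin socket (run wherever the daemon's
>>>> admin socket is reachable; I'm not sure I'm reading the fields right,
>>>> so take it as a sketch):
>>>>
>>>> # mds_log perf counters: rdpos should creep toward wrpos while replay
>>>> # is making progress
>>>> ceph daemon mds.slugfs.pr-md-01.xdtppo perf dump mds_log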
>>>>
>>>> Thanks so much! I'm kinda nervous since none of my clients have
>>>> filesystem access at the moment...
>>>>
>>>> cheers,
>>>> erich
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io