I guess the mailing list didn't preserve the embedded image. Here's an
Imgur link:
I checked the logs as far back as we have them. The issue started
appearing only after my last Ceph upgrade on 2 May, which introduced the
new corruption assertion.
On 06/06/2023 09:16, Janek Bevendorff wrote:
I checked our Prometheus logs, and the number of log events on individual
MONs does indeed start to increase dramatically, seemingly at random. I
attached a picture of the curves.

The first incident you can see there was when our metadata store filled
up entirely. The second, smaller one was the more controlled fill-up. The
last instance, with only one runaway MDS, is what I have just reported.

My unqualified wild guess is that the new safeguard preventing the MDS
from committing corrupt dentries is holding up the queue, so events
suddenly start to pile up until the store is full.
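To check whether events really are piling up, it should be enough to watch
the MDS journal perf counters per rank, roughly like this (counter names
may differ slightly between releases):

    ceph daemon mds.<name> perf dump mds_log

and keep an eye on mds_log.ev (journal events) and mds_log.seg (segments).
If those keep climbing while the metadata pool grows, that would support
the theory.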
On 05/06/2023 18:03, Janek Bevendorff wrote:
> That said, our MON store size has also been growing slowly from 900MB
> to 5.4GB. But we also have a few remapped PGs right now. Not sure if
> that would have an influence.
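>
> (For reference, the MON store size can also be checked directly on the
> MON hosts and compacted manually if needed; roughly like this, assuming
> the mon id matches the short hostname and with the path depending on how
> the MONs are deployed:
>
> du -sh /var/lib/ceph/mon/ceph-$(hostname -s)/store.db
> ceph tell mon.$(hostname -s) compact
>
> The second command forces a compaction of the store.)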
>
>
> On 05/06/2023 17:48, Janek Bevendorff wrote:
>> Hi Patrick, hi Dan!
>>
>> I got the MDS back and I think the issue is connected to the "newly
>> corrupt dentry" bug [1]. Even though I couldn't see any particular
>> reason for the SIGABRT at first, I then noticed one of these awfully
>> familiar stack traces.
>>
>> I rescheduled the two broken MDS ranks on two machines with 1.5TB
>> RAM each (just to make sure it's not that) and then let them do
>> their thing. The routine goes as follows: both replay the journal,
>> then rank 4 goes into the "resolve" state, but as soon as rank 3
>> also starts resolving, they both crash.
>>
>> Then I set
>>
>> ceph config set mds mds_abort_on_newly_corrupt_dentry false
>> ceph config set mds mds_go_bad_corrupt_dentry false
>>
>> and this time I was able to recover the ranks, even though "resolve"
>> and "clientreplay" took forever. I uploaded a compressed log of rank
>> 3 using ceph-post-file [2]. It's a log of several crash cycles,
>> including the final successful attempt after changing the settings.
>> The log decompresses to 815MB. I didn't censor any paths and they
>> are not super-secret, but please don't share.
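>>
>> (In case someone wants to do the same: the upload itself is just
>> something like "ceph-post-file -d 'mds rank 3 crash log' <logfile>",
>> which prints an upload ID like the one in [2]; the description string
>> is of course arbitrary.)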
>>
>> While writing this, the metadata pool size has reduced from 6TiB
>> back to 440GiB. I am starting to think that the fill-ups may also be
>> connected to the corruption issue. I also noticed that the ranks 3
>> and 4 always have huge journals. An inspection using
>> cephfs-journal-tool takes forever and consumes 50GB of memory in the
>> process. Listing the events in the journal is impossible without
>> running out of RAM. Ranks 0, 1, and 2 don't have this problem and
>> this wasn't a problem for ranks 3 and 4 either before the fill-ups
>> started happening.
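>>
>> (For reference, the inspection was along these lines, assuming the file
>> system is called "storage" as the pool names suggest; adjust fs name and
>> rank as needed:
>>
>> cephfs-journal-tool --rank=storage:3 journal inspect
>> cephfs-journal-tool --rank=storage:3 header get
>> cephfs-journal-tool --rank=storage:3 event get list
>>
>> The last one is the step that runs out of RAM on these huge journals.)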
>>
>> Hope that helps getting to the bottom of this. I reset the guardrail
>> settings in the meantime.
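>>
>> (For completeness, resetting can mean either removing the overrides
>> again or setting them back to true explicitly, e.g.:
>>
>> ceph config rm mds mds_abort_on_newly_corrupt_dentry
>> ceph config rm mds mds_go_bad_corrupt_dentry
>>
>> and ceph config get mds mds_abort_on_newly_corrupt_dentry shows what is
>> currently in effect.)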
>>
>> Cheers
>> Janek
>>
>>
>> [1] "Newly corrupt dentry" ML link:
>>
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/JNZ6V5WSYKQ…
>>
>> [2] ceph-post-file ID: 7c039483-49fd-468c-ba40-fb10337aa7d6
>>
>>
>>
>> On 05/06/2023 16:08, Janek Bevendorff wrote:
>>> I just had the problem again that MDS were constantly reporting
>>> slow metadata IO and the pool was slowly growing. Hence I restarted
>>> the MDS and now ranks 4 and 5 don't come up again.
>>>
>>> Every time, they get to the resolve stage, then crash with a SIGABRT
>>> without an error message (not even at debug_mds = 20). Any idea
>>> what the reason could be? I checked whether they have enough RAM,
>>> which seems to be the case (unless they try to allocate tens of GB
>>> in one allocation).
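>>>
>>> (Even when the MDS log itself shows nothing, the abort should leave a
>>> crash report behind; ceph crash ls and then ceph crash info <crash-id>
>>> would be my next place to look for the backtrace.)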
>>>
>>> Janek
>>>
>>>
>>> On 31/05/2023 21:57, Janek Bevendorff wrote:
>>>> Hi Dan,
>>>>
>>>> Sorry, I meant Pacific. The version number was correct, the name
>>>> wasn’t. ;-)
>>>>
>>>> Yes, I have five active MDS and five hot standbys. Static pinning
>>>> isn’t really an option for our directory structure, so we’re
>>>> using ephemeral pins.
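>>>>
>>>> (For context, the ephemeral pins are just the usual vxattr on the
>>>> top-level directories, something like
>>>>
>>>> setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/storage
>>>>
>>>> where the mount path here is made up.)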
>>>>
>>>> Janek
>>>>
>>>>
>>>>> On 31. May 2023, at 18:44, Dan van der Ster
>>>>> <dan.vanderster(a)clyso.com> wrote:
>>>>>
>>>>> Hi Janek,
>>>>>
>>>>> A few questions and suggestions:
>>>>> - Do you have multi-active MDS? In my experience back in nautilus if
>>>>> something went wrong with mds export between mds's, the mds
>>>>> log/journal could grow unbounded like you observed until that export
>>>>> work was done. Static pinning could help if you are not using it
>>>>> already.
>>>>> - You definitely should disable the pg autoscaling on the mds metadata
>>>>> pool (and other pools imho) -- decide the correct number of PGs for
>>>>> your pools and leave it (see the example below).
>>>>> - Which version are you running? You said nautilus but wrote 16.2.12
>>>>> which is pacific... If you're running nautilus v14 then I recommend
>>>>> disabling pg autoscaling completely -- IIRC it does not have a fix for
>>>>> the OSD memory growth "pg dup" issue which can occur during PG
>>>>> splitting/merging.
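>>>>>
>>>>> For the metadata pool that would be something along the lines of:
>>>>>
>>>>> ceph osd pool set cephfs.storage.meta pg_autoscale_mode off
>>>>> ceph osd pool set cephfs.storage.meta pg_num 512
>>>>>
>>>>> (pool name taken from the pool ls detail output quoted further down;
>>>>> 512 is only a placeholder, pick whatever PG count fits your OSDs.)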
>>>>>
>>>>> Cheers, Dan
>>>>>
>>>>> ______________________________
>>>>> Clyso GmbH | https://www.clyso.com
>>>>>
>>>>>
>>>>> On Wed, May 31, 2023 at 4:03 AM Janek Bevendorff
>>>>> <janek.bevendorff(a)uni-weimar.de> wrote:
>>>>>> I checked our logs from yesterday; the PG scaling only started today,
>>>>>> perhaps triggered by the snapshot trimming. I disabled it, but it
>>>>>> didn't change anything.
>>>>>>
>>>>>> What did change something was restarting the MDS one by one, which had
>>>>>> got far behind with trimming their caches and had a bunch of stuck
>>>>>> ops. After restarting them, the pool size decreased quickly to 600GiB.
>>>>>> I noticed the same behaviour yesterday, though yesterday it was more
>>>>>> extreme: restarting the MDS took about an hour and I had to increase
>>>>>> the heartbeat timeout. This time, it took only half a minute per MDS,
>>>>>> probably because it wasn't that extreme yet and I had reduced the
>>>>>> maximum cache size. Still looks like a bug to me.
>>>>>>
>>>>>>
>>>>>> On 31/05/2023 11:18, Janek Bevendorff wrote:
>>>>>>> Another thing I just noticed is that the auto-scaler is trying to
>>>>>>> scale the pool down to 128 PGs. That could also result in large
>>>>>>> fluctuations, but this big?? In any case, it looks like a bug to me.
>>>>>>> Whatever is happening here, there should be safeguards with regard
>>>>>>> to the pool's capacity.
>>>>>>>
>>>>>>> Here's the current state of the pool in ceph osd pool ls detail:
>>>>>>>
>>>>>>> pool 110 'cephfs.storage.meta' replicated size 4 min_size 3
>>>>>>> crush_rule 5 object_hash rjenkins pg_num 495 pgp_num 471
>>>>>>> pg_num_target 128 pgp_num_target 128 autoscale_mode on
>>>>>>> last_change 1359013 lfor 0/1358620/1358618 flags
>>>>>>> hashpspool,nodelete stripe_width 0 expected_num_objects 3000000
>>>>>>> recovery_op_priority 5 recovery_priority 2 application cephfs
>>>>>>>
>>>>>>> Janek
>>>>>>>
>>>>>>>
>>>>>>> On 31/05/2023 10:10, Janek Bevendorff wrote:
>>>>>>>> Forgot to add: We are still on Nautilus (16.2.12).
>>>>>>>>
>>>>>>>>
>>>>>>>> On 31/05/2023 09:53, Janek Bevendorff wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Perhaps this is a known issue and I was simply too dumb to find
>>>>>>>>> it, but we are having problems with our CephFS metadata pool
>>>>>>>>> filling up overnight.
>>>>>>>>>
>>>>>>>>> Our cluster has a small SSD pool of around 15TB which hosts our
>>>>>>>>> CephFS metadata pool. Usually, that's more than enough. The normal
>>>>>>>>> size of the pool ranges between 200 and 800GiB (which is quite a
>>>>>>>>> lot of fluctuation already). Yesterday, the pool suddenly filled
>>>>>>>>> up entirely and the only way to fix it was to add more capacity.
>>>>>>>>> I increased the pool size to 18TB by adding more SSDs and could
>>>>>>>>> resolve the problem. After a couple of hours of reshuffling, the
>>>>>>>>> pool size finally went back to 230GiB.
>>>>>>>>>
>>>>>>>>> But then we had another fill-up tonight to 7.6TiB. Luckily, I had
>>>>>>>>> adjusted the weights so that not all disks could fill up entirely
>>>>>>>>> like last time, so it ended there.
>>>>>>>>>
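>>>>>>>>> (Adjusting the weights here just means lowering the weight of the
>>>>>>>>> fullest metadata SSDs a little, e.g. via ceph osd reweight
>>>>>>>>> osd.<id> 0.9 or ceph osd crush reweight osd.<id> <weight>; the
>>>>>>>>> ids and values are placeholders.)
>>>>>>>>>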
>>>>>>>>> I wasn't really able to identify the problem yesterday, but in
>>>>>>>>> the more controlled scenario today I could check the MDS logs at
>>>>>>>>> debug_mds=10, and to me it seems like the problem is caused by
>>>>>>>>> snapshot trimming. The logs contain a lot of snapshot-related
>>>>>>>>> messages for paths that haven't been touched in a long time. The
>>>>>>>>> messages all look something like this:
>>>>>>>>>
>>>>>>>>> May 31 09:16:48 XXX ceph-mds[2947525]: 2023-05-31T09:16:48.292+0200
>>>>>>>>> 7f7ce1bd9700 10 mds.1.cache.ino(0x1000b3c3670) add_client_cap first
>>>>>>>>> cap, joining realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b
>>>>>>>>> cps 2 snaps={185f=snap(185f 0x10000000000 'monthly_20221201'
>>>>>>>>> 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000
>>>>>>>>> 'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941
>>>>>>>>> 0x10000000000 ...
>>>>>>>>>
>>>>>>>>> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.396+0200
>>>>>>>>> 7f0e6a6ca700 10 mds.0.cache | |______ 3 rep [dir
>>>>>>>>> 0x100000218fe.101111101* /storage/REDACTED/| ptrwaiter=0 request=0
>>>>>>>>> child=0 frozen=0 subtree=1 replicated=0 dirty=0 waiter=0 authpin=0
>>>>>>>>> tempexporting=0 0x5607759d9600]
>>>>>>>>>
>>>>>>>>> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.452+0200
>>>>>>>>> 7f0e6a6ca700 10 mds.0.cache | | |____ 4 rep [dir
>>>>>>>>> 0x100000ff904.100111101010* /storage/REDACTED/| ptrwaiter=0
>>>>>>>>> request=0 child=0 frozen=0 subtree=1 importing=0 replicated=0
>>>>>>>>> waiter=0 authpin=0 tempexporting=0 0x56034ed25200]
>>>>>>>>>
>>>>>>>>> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.716+0200
>>>>>>>>> 7f0e6becd700 10 mds.0.server set_trace_dist snaprealm
>>>>>>>>> snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2
>>>>>>>>> snaps={185f=snap(185f 0x10000000000 'monthly_20221201'
>>>>>>>>> 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000
>>>>>>>>> 'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941
>>>>>>>>> 0x10000000000 'monthly_20230201'
>>>>>>>>> 2023-02-01T00:00:01.854059+0100),19a6=snap(19a6 0x10000000000
>>>>>>>>> 'monthly_20230301'
>>>>>>>>> 2023-03-01T00:00:01.215197+0100),1a24=snap(1a24 0x10000000000
>>>>>>>>> 'monthly_20230401' ...) len=384
>>>>>>>>>
>>>>>>>>> May 31 09:25:36 deltaweb055 ceph-mds[3268481]:
>>>>>>>>> 2023-05-31T09:25:36.076+0200 7f0e6becd700 10
>>>>>>>>> mds.0.cache.ino(0x10004d74911) remove_client_cap last cap, leaving
>>>>>>>>> realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2
>>>>>>>>> snaps={185f=snap(185f 0x10000000000 'monthly_20221201'
>>>>>>>>> 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000
>>>>>>>>> 'monthly_20230101' ...)
>>>>>>>>>
>>>>>>>>> The daily_*, monthly_*, etc. names are the names of our regular
>>>>>>>>> snapshots.
>>>>>>>>>
>>>>>>>>> I posted a larger log file snippet using ceph-post-file with the
>>>>>>>>> ID: da0eb93d-f340-4457-8a3f-434e8ef37d36
>>>>>>>>>
>>>>>>>>> Is it possible that the MDS are trimming old snapshots without
>>>>>>>>> taking care not to fill up the entire metadata pool?
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>> Janek
>>>>>> --
>>>>>>
>>>>>> Bauhaus-Universität Weimar
>>>>>> Bauhausstr. 9a, R308
>>>>>> 99423 Weimar, Germany
>>>>>>
>>>>>> Phone: +49 3643 58 3577
>>>>>>
>>>>>> www.webis.de
>>>
--
Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany
Phone: +49 3643 58 3577