CephFS metadata pool grows by two orders of magnitude while trimming (?) snapshots

List overview All Threads
Download

newer

older

Transmit rate metric based per...

Starting v17.2.5 RGW SSE with...

Janek Bevendorff

31 May 2023 31 May '23

8:53 a.m.

Hi, Perhaps this is a known issue and I was simply too dumb to find it, but we are having problems with our CephFS metadata pool filling up over night. Our cluster has a small SSD pool of around 15TB which hosts our CephFS metadata pool. Usually, that's more than enough. The normal size of the pool ranges between 200 and 800GiB (which is quite a lot of fluctuation already). Yesterday, we had suddenly had the pool fill up entirely and they only way to fix it was to add more capacity. I increased the pool size to 18TB by adding more SSDs and could resolve the problem. After a couple of hours of reshuffling, the pool size finally went back to 230GiB. But then we had another fill-up tonight to 7.6TiB. Luckily, I had adjusted the weights so that not all disks could fill up entirely like last time, so it ended there. I wasn't really able to identify the problem yesterday, but under the more controllable scenario today, I could check the MDS logs at debug_mds=10 and to me it seems like the problem is caused by snapshot trimming. The logs contain a lot of snapshot-related messages for paths that haven't been touched in a long time. The messages all look something like this: May 31 09:16:48 XXX ceph-mds[2947525]: 2023-05-31T09:16:48.292+0200 7f7ce1bd9700 10 mds.1.cache.ino(0x1000b3c3670) add_client_cap first cap, joining realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1 b1b cps 2 snaps={185f=snap(185f 0x10000000000 'monthly_20221201' 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 0x10000000000 ... May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.396+0200 7f0e6a6ca700 10 mds.0.cache | |______ 3 rep [dir 0x100000218fe.101111101* /storage/REDACTED/| ptrwaiter=0 request=0 child=0 frozen=0 subtree=1 replicated=0 dirty=0 waiter=0 authpin=0 tempexporting=0 0x5607759d9600] May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.452+0200 7f0e6a6ca700 10 mds.0.cache | | |____ 4 rep [dir 0x100000ff904.100111101010* /storage/REDACTED/| ptrwaiter=0 request=0 child=0 frozen=0 subtree=1 importing=0 replicated=0 waiter=0 authpin=0 tempexporting=0 0x56034ed25200] May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.716+0200 7f0e6becd700 10 mds.0.server set_trace_dist snaprealm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2 snaps={185f=snap(185f 0x10000000000 'monthly_20221201' 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 0x10000000000 'monthly_20230201' 2023-02-01T00:00:01.854059+0100),19a6=snap(19a6 0x10000000000 'monthly_20230301' 2023-03-01T00:00:01.215197+0100),1a24=snap(1a24 0x10000000000 'monthly_20230401' ...) len=384 May 31 09:25:36 deltaweb055 ceph-mds[3268481]: 2023-05-31T09:25:36.076+0200 7f0e6becd700 10 mds.0.cache.ino(0x10004d74911) remove_client_cap last cap, leaving realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2 snaps={185f=snap(185f 0x10000000000 'monthly_20221201' 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 'monthly_20230101' ...) The daily_*, montly_* etc. names are the names of our regular snapshots. I posted a larger log file snippet using ceph-post-file with the ID: da0eb93d-f340-4457-8a3f-434e8ef37d36 Is it possible that the MDS are trimming old snapshots without taking care not to fill up the entire metadata pool? Cheers Janek

Show replies by date

Janek Bevendorff

31 May 31 May

9:10 a.m.

Forgot to add: We are still on Nautilus (16.2.12). On 31/05/2023 09:53, Janek Bevendorff wrote: > Hi, > > Perhaps this is a known issue and I was simply too dumb to find it, > but we are having problems with our CephFS metadata pool filling up > over night. > > Our cluster has a small SSD pool of around 15TB which hosts our CephFS > metadata pool. Usually, that's more than enough. The normal size of > the pool ranges between 200 and 800GiB (which is quite a lot of > fluctuation already). Yesterday, we had suddenly had the pool fill up > entirely and they only way to fix it was to add more capacity. I > increased the pool size to 18TB by adding more SSDs and could resolve > the problem. After a couple of hours of reshuffling, the pool size > finally went back to 230GiB. > > But then we had another fill-up tonight to 7.6TiB. Luckily, I had > adjusted the weights so that not all disks could fill up entirely like > last time, so it ended there. > > I wasn't really able to identify the problem yesterday, but under the > more controllable scenario today, I could check the MDS logs at > debug_mds=10 and to me it seems like the problem is caused by snapshot > trimming. The logs contain a lot of snapshot-related messages for > paths that haven't been touched in a long time. The messages all look > something like this: > > May 31 09:16:48 XXX ceph-mds[2947525]: 2023-05-31T09:16:48.292+0200 > 7f7ce1bd9700 10 mds.1.cache.ino(0x1000b3c3670) add_client_cap first > cap, joining realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1 > b1b cps 2 snaps={185f=snap(185f 0x10000000000 'monthly_20221201' > 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 > 'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 > 0x10000000000 ... > > May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.396+0200 > 7f0e6a6ca700 10 mds.0.cache | |______ 3 rep [dir > 0x100000218fe.101111101* /storage/REDACTED/| ptrwaiter=0 request=0 > child=0 frozen=0 subtree=1 replicated=0 dirty=0 waiter=0 authpin=0 > tempexporting=0 0x5607759d9600] > > May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.452+0200 > 7f0e6a6ca700 10 mds.0.cache | | |____ 4 rep [dir > 0x100000ff904.100111101010* /storage/REDACTED/| ptrwaiter=0 request=0 > child=0 frozen=0 subtree=1 importing=0 replicated=0 waiter=0 authpin=0 > tempexporting=0 0x56034ed25200] > > May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.716+0200 > 7f0e6becd700 10 mds.0.server set_trace_dist snaprealm > snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2 > snaps={185f=snap(185f 0x10000000000 'monthly_20221201' > 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 > 'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 > 0x10000000000 'monthly_20230201' > 2023-02-01T00:00:01.854059+0100),19a6=snap(19a6 0x10000000000 > 'monthly_20230301' 2023-03-01T00:00:01.215197+0100),1a24=snap(1a24 > 0x10000000000 'monthly_20230401' ...) len=384 > > May 31 09:25:36 deltaweb055 ceph-mds[3268481]: > 2023-05-31T09:25:36.076+0200 7f0e6becd700 10 > mds.0.cache.ino(0x10004d74911) remove_client_cap last cap, leaving > realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2 > snaps={185f=snap(185f 0x10000000000 'monthly_20221201' > 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 > 'monthly_20230101' ...) > > The daily_*, montly_* etc. names are the names of our regular snapshots. > > I posted a larger log file snippet using ceph-post-file with the ID: > da0eb93d-f340-4457-8a3f-434e8ef37d36 > > Is it possible that the MDS are trimming old snapshots without taking > care not to fill up the entire metadata pool? > > Cheers > Janek > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io

Janek Bevendorff

10:18 a.m.

Another thing I just noticed is that the auto-scaler is trying to scale the pool down to 128 PGs. That could also result in large fluctuations, but this big?? In any case, it looks like a bug to me. Whatever is happening here, there should be safeguards with regard to the pool's capacity. Here's the current state of the pool in ceph osd pool ls detail: pool 110 'cephfs.storage.meta' replicated size 4 min_size 3 crush_rule 5 object_hash rjenkins pg_num 495 pgp_num 471 pg_num_target 128 pgp_num_target 128 autoscale_mode on last_change 1359013 lfor 0/1358620/1358618 flags hashpspool,nodelete stripe_width 0 expected_num_objects 3000000 recovery_op_priority 5 recovery_priority 2 application cephfs Janek On 31/05/2023 10:10, Janek Bevendorff wrote: > Forgot to add: We are still on Nautilus (16.2.12). > > > On 31/05/2023 09:53, Janek Bevendorff wrote: >> Hi, >> >> Perhaps this is a known issue and I was simply too dumb to find it, >> but we are having problems with our CephFS metadata pool filling up >> over night. >> >> Our cluster has a small SSD pool of around 15TB which hosts our >> CephFS metadata pool. Usually, that's more than enough. The normal >> size of the pool ranges between 200 and 800GiB (which is quite a lot >> of fluctuation already). Yesterday, we had suddenly had the pool fill >> up entirely and they only way to fix it was to add more capacity. I >> increased the pool size to 18TB by adding more SSDs and could resolve >> the problem. After a couple of hours of reshuffling, the pool size >> finally went back to 230GiB. >> >> But then we had another fill-up tonight to 7.6TiB. Luckily, I had >> adjusted the weights so that not all disks could fill up entirely >> like last time, so it ended there. >> >> I wasn't really able to identify the problem yesterday, but under the >> more controllable scenario today, I could check the MDS logs at >> debug_mds=10 and to me it seems like the problem is caused by >> snapshot trimming. The logs contain a lot of snapshot-related >> messages for paths that haven't been touched in a long time. The >> messages all look something like this: >> >> May 31 09:16:48 XXX ceph-mds[2947525]: 2023-05-31T09:16:48.292+0200 >> 7f7ce1bd9700 10 mds.1.cache.ino(0x1000b3c3670) add_client_cap first >> cap, joining realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1 >> b1b cps 2 snaps={185f=snap(185f 0x10000000000 'monthly_20221201' >> 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 >> 'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 >> 0x10000000000 ... >> >> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.396+0200 >> 7f0e6a6ca700 10 mds.0.cache | |______ 3 rep [dir >> 0x100000218fe.101111101* /storage/REDACTED/| ptrwaiter=0 request=0 >> child=0 frozen=0 subtree=1 replicated=0 dirty=0 waiter=0 authpin=0 >> tempexporting=0 0x5607759d9600] >> >> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.452+0200 >> 7f0e6a6ca700 10 mds.0.cache | | |____ 4 rep [dir >> 0x100000ff904.100111101010* /storage/REDACTED/| ptrwaiter=0 request=0 >> child=0 frozen=0 subtree=1 importing=0 replicated=0 waiter=0 >> authpin=0 tempexporting=0 0x56034ed25200] >> >> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.716+0200 >> 7f0e6becd700 10 mds.0.server set_trace_dist snaprealm >> snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2 >> snaps={185f=snap(185f 0x10000000000 'monthly_20221201' >> 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 >> 'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 >> 0x10000000000 'monthly_20230201' >> 2023-02-01T00:00:01.854059+0100),19a6=snap(19a6 0x10000000000 >> 'monthly_20230301' 2023-03-01T00:00:01.215197+0100),1a24=snap(1a24 >> 0x10000000000 'monthly_20230401' ...) len=384 >> >> May 31 09:25:36 deltaweb055 ceph-mds[3268481]: >> 2023-05-31T09:25:36.076+0200 7f0e6becd700 10 >> mds.0.cache.ino(0x10004d74911) remove_client_cap last cap, leaving >> realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2 >> snaps={185f=snap(185f 0x10000000000 'monthly_20221201' >> 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 >> 'monthly_20230101' ...) >> >> The daily_*, montly_* etc. names are the names of our regular snapshots. >> >> I posted a larger log file snippet using ceph-post-file with the ID: >> da0eb93d-f340-4457-8a3f-434e8ef37d36 >> >> Is it possible that the MDS are trimming old snapshots without taking >> care not to fill up the entire metadata pool? >> >> Cheers >> Janek >> _______________________________________________ >> ceph-users mailing list -- ceph-users(a)ceph.io >> To unsubscribe send an email to ceph-users-leave(a)ceph.io > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io

Janek Bevendorff

12:02 p.m.

I checked our logs from yesterday, the PG scaling only started today, perhaps triggered by the snapshot trimming. I disabled it, but it didn't change anything. What did change something was restarting the MDS one by one, which had got far behind with trimming their caches and with a bunch of stuck ops. After restarting them, the pool size decreased quickly to 600GiB. I noticed the same behaviour yesterday, though yesterday is was more extreme and restarting the MDS took about an hour and I had to increase the heartbeat timeout. This time, it took only half a minute per MDS, probably because it wasn't that extreme yet and I had reduced the maximum cache size. Still looks like a bug to me. On 31/05/2023 11:18, Janek Bevendorff wrote:

...

Forgot to add: We are still on Nautilus (16.2.12). On 31/05/2023 09:53, Janek Bevendorff wrote:

_______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

-- Bauhaus-Universität Weimar Bauhausstr. 9a, R308 99423 Weimar, Germany Phone: +49 3643 58 3577 www.webis.de

Dan van der Ster

5:44 p.m.

Hi Janek, A few questions and suggestions: - Do you have multi-active MDS? In my experience back in nautilus if something went wrong with mds export between mds's, the mds log/journal could grow unbounded like you observed until that export work was done. Static pinning could help if you are not using it already. - You definitely should disable the pg autoscaling on the mds metadata pool (and other pools imho) -- decide the correct number of PGs for your pools and leave it. - Which version are you running? You said nautilus but wrote 16.2.12 which is pacific... If you're running nautilus v14 then I recommend disabling pg autoscaling completely -- IIRC it does not have a fix for the OSD memory growth "pg dup" issue which can occur during PG splitting/merging. Cheers, Dan ______________________________ Clyso GmbH | https://www.clyso.com On Wed, May 31, 2023 at 4:03 AM Janek Bevendorff <janek.bevendorff(a)uni-weimar.de> wrote: > > I checked our logs from yesterday, the PG scaling only started today, > perhaps triggered by the snapshot trimming. I disabled it, but it didn't > change anything. > > What did change something was restarting the MDS one by one, which had > got far behind with trimming their caches and with a bunch of stuck ops. > After restarting them, the pool size decreased quickly to 600GiB. I > noticed the same behaviour yesterday, though yesterday is was more > extreme and restarting the MDS took about an hour and I had to increase > the heartbeat timeout. This time, it took only half a minute per MDS, > probably because it wasn't that extreme yet and I had reduced the > maximum cache size. Still looks like a bug to me. > > > On 31/05/2023 11:18, Janek Bevendorff wrote: > > Another thing I just noticed is that the auto-scaler is trying to > > scale the pool down to 128 PGs. That could also result in large > > fluctuations, but this big?? In any case, it looks like a bug to me. > > Whatever is happening here, there should be safeguards with regard to > > the pool's capacity. > > > > Here's the current state of the pool in ceph osd pool ls detail: > > > > pool 110 'cephfs.storage.meta' replicated size 4 min_size 3 crush_rule > > 5 object_hash rjenkins pg_num 495 pgp_num 471 pg_num_target 128 > > pgp_num_target 128 autoscale_mode on last_change 1359013 lfor > > 0/1358620/1358618 flags hashpspool,nodelete stripe_width 0 > > expected_num_objects 3000000 recovery_op_priority 5 recovery_priority > > 2 application cephfs > > > > Janek > > > > > > On 31/05/2023 10:10, Janek Bevendorff wrote: > >> Forgot to add: We are still on Nautilus (16.2.12). > >> > >> > >> On 31/05/2023 09:53, Janek Bevendorff wrote: > >>> Hi, > >>> > >>> Perhaps this is a known issue and I was simply too dumb to find it, > >>> but we are having problems with our CephFS metadata pool filling up > >>> over night. > >>> > >>> Our cluster has a small SSD pool of around 15TB which hosts our > >>> CephFS metadata pool. Usually, that's more than enough. The normal > >>> size of the pool ranges between 200 and 800GiB (which is quite a lot > >>> of fluctuation already). Yesterday, we had suddenly had the pool > >>> fill up entirely and they only way to fix it was to add more > >>> capacity. I increased the pool size to 18TB by adding more SSDs and > >>> could resolve the problem. After a couple of hours of reshuffling, > >>> the pool size finally went back to 230GiB. > >>> > >>> But then we had another fill-up tonight to 7.6TiB. Luckily, I had > >>> adjusted the weights so that not all disks could fill up entirely > >>> like last time, so it ended there. > >>> > >>> I wasn't really able to identify the problem yesterday, but under > >>> the more controllable scenario today, I could check the MDS logs at > >>> debug_mds=10 and to me it seems like the problem is caused by > >>> snapshot trimming. The logs contain a lot of snapshot-related > >>> messages for paths that haven't been touched in a long time. The > >>> messages all look something like this: > >>> > >>> May 31 09:16:48 XXX ceph-mds[2947525]: 2023-05-31T09:16:48.292+0200 > >>> 7f7ce1bd9700 10 mds.1.cache.ino(0x1000b3c3670) add_client_cap first > >>> cap, joining realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1 > >>> b1b cps 2 snaps={185f=snap(185f 0x10000000000 'monthly_20221201' > >>> 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 > >>> 'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 > >>> 0x10000000000 ... > >>> > >>> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.396+0200 > >>> 7f0e6a6ca700 10 mds.0.cache | |______ 3 rep [dir > >>> 0x100000218fe.101111101* /storage/REDACTED/| ptrwaiter=0 request=0 > >>> child=0 frozen=0 subtree=1 replicated=0 dirty=0 waiter=0 authpin=0 > >>> tempexporting=0 0x5607759d9600] > >>> > >>> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.452+0200 > >>> 7f0e6a6ca700 10 mds.0.cache | | |____ 4 rep [dir > >>> 0x100000ff904.100111101010* /storage/REDACTED/| ptrwaiter=0 > >>> request=0 child=0 frozen=0 subtree=1 importing=0 replicated=0 > >>> waiter=0 authpin=0 tempexporting=0 0x56034ed25200] > >>> > >>> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.716+0200 > >>> 7f0e6becd700 10 mds.0.server set_trace_dist snaprealm > >>> snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2 > >>> snaps={185f=snap(185f 0x10000000000 'monthly_20221201' > >>> 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 > >>> 'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 > >>> 0x10000000000 'monthly_20230201' > >>> 2023-02-01T00:00:01.854059+0100),19a6=snap(19a6 0x10000000000 > >>> 'monthly_20230301' 2023-03-01T00:00:01.215197+0100),1a24=snap(1a24 > >>> 0x10000000000 'monthly_20230401' ...) len=384 > >>> > >>> May 31 09:25:36 deltaweb055 ceph-mds[3268481]: > >>> 2023-05-31T09:25:36.076+0200 7f0e6becd700 10 > >>> mds.0.cache.ino(0x10004d74911) remove_client_cap last cap, leaving > >>> realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2 > >>> snaps={185f=snap(185f 0x10000000000 'monthly_20221201' > >>> 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 > >>> 'monthly_20230101' ...) > >>> > >>> The daily_*, montly_* etc. names are the names of our regular > >>> snapshots. > >>> > >>> I posted a larger log file snippet using ceph-post-file with the ID: > >>> da0eb93d-f340-4457-8a3f-434e8ef37d36 > >>> > >>> Is it possible that the MDS are trimming old snapshots without > >>> taking care not to fill up the entire metadata pool? > >>> > >>> Cheers > >>> Janek > >>> _______________________________________________ > >>> ceph-users mailing list -- ceph-users(a)ceph.io > >>> To unsubscribe send an email to ceph-users-leave(a)ceph.io > >> _______________________________________________ > >> ceph-users mailing list -- ceph-users(a)ceph.io > >> To unsubscribe send an email to ceph-users-leave(a)ceph.io > > _______________________________________________ > > ceph-users mailing list -- ceph-users(a)ceph.io > > To unsubscribe send an email to ceph-users-leave(a)ceph.io > > -- > > Bauhaus-Universität Weimar > Bauhausstr. 9a, R308 > 99423 Weimar, Germany > > Phone: +49 3643 58 3577 > www.webis.de > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io

Janek Bevendorff

8:57 p.m.

Hi Dan, Sorry, I meant Pacific. The version number was correct, the name wasn’t. ;-) Yes, I have five active MDS and five hot standbys. Static pinning isn’t really an options for our directory structure, so we’re using ephemeral pins. Janek

...

On 31. May 2023, at 18:44, Dan van der Ster <dan.vanderster(a)clyso.com> wrote: Hi Janek, A few questions and suggestions: - Do you have multi-active MDS? In my experience back in nautilus if something went wrong with mds export between mds's, the mds log/journal could grow unbounded like you observed until that export work was done. Static pinning could help if you are not using it already. - You definitely should disable the pg autoscaling on the mds metadata pool (and other pools imho) -- decide the correct number of PGs for your pools and leave it. - Which version are you running? You said nautilus but wrote 16.2.12 which is pacific... If you're running nautilus v14 then I recommend disabling pg autoscaling completely -- IIRC it does not have a fix for the OSD memory growth "pg dup" issue which can occur during PG splitting/merging. Cheers, Dan ______________________________ Clyso GmbH | https://www.clyso.com On Wed, May 31, 2023 at 4:03 AM Janek Bevendorff <janek.bevendorff(a)uni-weimar.de> wrote: > > I checked our logs from yesterday, the PG scaling only started today, > perhaps triggered by the snapshot trimming. I disabled it, but it didn't > change anything. > > What did change something was restarting the MDS one by one, which had > got far behind with trimming their caches and with a bunch of stuck ops. > After restarting them, the pool size decreased quickly to 600GiB. I > noticed the same behaviour yesterday, though yesterday is was more > extreme and restarting the MDS took about an hour and I had to increase > the heartbeat timeout. This time, it took only half a minute per MDS, > probably because it wasn't that extreme yet and I had reduced the > maximum cache size. Still looks like a bug to me. > > > On 31/05/2023 11:18, Janek Bevendorff wrote: >> Another thing I just noticed is that the auto-scaler is trying to >> scale the pool down to 128 PGs. That could also result in large >> fluctuations, but this big?? In any case, it looks like a bug to me. >> Whatever is happening here, there should be safeguards with regard to >> the pool's capacity. >> >> Here's the current state of the pool in ceph osd pool ls detail: >> >> pool 110 'cephfs.storage.meta' replicated size 4 min_size 3 crush_rule >> 5 object_hash rjenkins pg_num 495 pgp_num 471 pg_num_target 128 >> pgp_num_target 128 autoscale_mode on last_change 1359013 lfor >> 0/1358620/1358618 flags hashpspool,nodelete stripe_width 0 >> expected_num_objects 3000000 recovery_op_priority 5 recovery_priority >> 2 application cephfs >> >> Janek >> >> >> On 31/05/2023 10:10, Janek Bevendorff wrote: >>> Forgot to add: We are still on Nautilus (16.2.12). >>> >>> >>> On 31/05/2023 09:53, Janek Bevendorff wrote: >>>> Hi, >>>> >>>> Perhaps this is a known issue and I was simply too dumb to find it, >>>> but we are having problems with our CephFS metadata pool filling up >>>> over night. >>>> >>>> Our cluster has a small SSD pool of around 15TB which hosts our >>>> CephFS metadata pool. Usually, that's more than enough. The normal >>>> size of the pool ranges between 200 and 800GiB (which is quite a lot >>>> of fluctuation already). Yesterday, we had suddenly had the pool >>>> fill up entirely and they only way to fix it was to add more >>>> capacity. I increased the pool size to 18TB by adding more SSDs and >>>> could resolve the problem. After a couple of hours of reshuffling, >>>> the pool size finally went back to 230GiB. >>>> >>>> But then we had another fill-up tonight to 7.6TiB. Luckily, I had >>>> adjusted the weights so that not all disks could fill up entirely >>>> like last time, so it ended there. >>>> >>>> I wasn't really able to identify the problem yesterday, but under >>>> the more controllable scenario today, I could check the MDS logs at >>>> debug_mds=10 and to me it seems like the problem is caused by >>>> snapshot trimming. The logs contain a lot of snapshot-related >>>> messages for paths that haven't been touched in a long time. The >>>> messages all look something like this: >>>> >>>> May 31 09:16:48 XXX ceph-mds[2947525]: 2023-05-31T09:16:48.292+0200 >>>> 7f7ce1bd9700 10 mds.1.cache.ino(0x1000b3c3670) add_client_cap first >>>> cap, joining realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1 >>>> b1b cps 2 snaps={185f=snap(185f 0x10000000000 'monthly_20221201' >>>> 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 >>>> 'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 >>>> 0x10000000000 ... >>>> >>>> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.396+0200 >>>> 7f0e6a6ca700 10 mds.0.cache | |______ 3 rep [dir >>>> 0x100000218fe.101111101* /storage/REDACTED/| ptrwaiter=0 request=0 >>>> child=0 frozen=0 subtree=1 replicated=0 dirty=0 waiter=0 authpin=0 >>>> tempexporting=0 0x5607759d9600] >>>> >>>> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.452+0200 >>>> 7f0e6a6ca700 10 mds.0.cache | | |____ 4 rep [dir >>>> 0x100000ff904.100111101010* /storage/REDACTED/| ptrwaiter=0 >>>> request=0 child=0 frozen=0 subtree=1 importing=0 replicated=0 >>>> waiter=0 authpin=0 tempexporting=0 0x56034ed25200] >>>> >>>> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.716+0200 >>>> 7f0e6becd700 10 mds.0.server set_trace_dist snaprealm >>>> snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2 >>>> snaps={185f=snap(185f 0x10000000000 'monthly_20221201' >>>> 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 >>>> 'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 >>>> 0x10000000000 'monthly_20230201' >>>> 2023-02-01T00:00:01.854059+0100),19a6=snap(19a6 0x10000000000 >>>> 'monthly_20230301' 2023-03-01T00:00:01.215197+0100),1a24=snap(1a24 >>>> 0x10000000000 'monthly_20230401' ...) len=384 >>>> >>>> May 31 09:25:36 deltaweb055 ceph-mds[3268481]: >>>> 2023-05-31T09:25:36.076+0200 7f0e6becd700 10 >>>> mds.0.cache.ino(0x10004d74911) remove_client_cap last cap, leaving >>>> realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2 >>>> snaps={185f=snap(185f 0x10000000000 'monthly_20221201' >>>> 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 >>>> 'monthly_20230101' ...) >>>> >>>> The daily_*, montly_* etc. names are the names of our regular >>>> snapshots. >>>> >>>> I posted a larger log file snippet using ceph-post-file with the ID: >>>> da0eb93d-f340-4457-8a3f-434e8ef37d36 >>>> >>>> Is it possible that the MDS are trimming old snapshots without >>>> taking care not to fill up the entire metadata pool? >>>> >>>> Cheers >>>> Janek >>>> _______________________________________________ >>>> ceph-users mailing list -- ceph-users(a)ceph.io >>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io >>> _______________________________________________ >>> ceph-users mailing list -- ceph-users(a)ceph.io >>> To unsubscribe send an email to ceph-users-leave(a)ceph.io >> _______________________________________________ >> ceph-users mailing list -- ceph-users(a)ceph.io >> To unsubscribe send an email to ceph-users-leave(a)ceph.io > > -- > > Bauhaus-Universität Weimar > Bauhausstr. 9a, R308 > 99423 Weimar, Germany > > Phone: +49 3643 58 3577 > www.webis.de > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io

Janek Bevendorff

5 Jun 5 Jun

3:08 p.m.

I just had the problem again that MDS were constantly reporting slow metadata IO and the pool was slowly growing. Hence I restarted the MDS and now ranks 4 and 5 don't come up again. Every time, they get to the resolve stage, the crash with a SIGABRT without an error message (not even at debug_mds = 20). Any idea what the reason could be? I checked whether they have enough RAM, which seems to be the case (unless they try to allocate tens of GB in one allocation). Janek On 31/05/2023 21:57, Janek Bevendorff wrote:

...

On 31. May 2023, at 18:44, Dan van der Ster <dan.vanderster(a)clyso.com> wrote: Hi Janek, A few questions and suggestions: - Do you have multi-active MDS? In my experience back in nautilus if something went wrong with mds export between mds's, the mds log/journal could grow unbounded like you observed until that export work was done. Static pinning could help if you are not using it already. - You definitely should disable the pg autoscaling on the mds metadata pool (and other pools imho) -- decide the correct number of PGs for your pools and leave it. - Which version are you running? You said nautilus but wrote 16.2.12 which is pacific... If you're running nautilus v14 then I recommend disabling pg autoscaling completely -- IIRC it does not have a fix for the OSD memory growth "pg dup" issue which can occur during PG splitting/merging. Cheers, Dan ______________________________ Clyso GmbH | https://www.clyso.com On Wed, May 31, 2023 at 4:03 AM Janek Bevendorff <janek.bevendorff(a)uni-weimar.de> wrote: > I checked our logs from yesterday, the PG scaling only started today, > perhaps triggered by the snapshot trimming. I disabled it, but it didn't > change anything. > > What did change something was restarting the MDS one by one, which had > got far behind with trimming their caches and with a bunch of stuck ops. > After restarting them, the pool size decreased quickly to 600GiB. I > noticed the same behaviour yesterday, though yesterday is was more > extreme and restarting the MDS took about an hour and I had to increase > the heartbeat timeout. This time, it took only half a minute per MDS, > probably because it wasn't that extreme yet and I had reduced the > maximum cache size. Still looks like a bug to me. > > > On 31/05/2023 11:18, Janek Bevendorff wrote: >> Another thing I just noticed is that the auto-scaler is trying to >> scale the pool down to 128 PGs. That could also result in large >> fluctuations, but this big?? In any case, it looks like a bug to me. >> Whatever is happening here, there should be safeguards with regard to >> the pool's capacity. >> >> Here's the current state of the pool in ceph osd pool ls detail: >> >> pool 110 'cephfs.storage.meta' replicated size 4 min_size 3 crush_rule >> 5 object_hash rjenkins pg_num 495 pgp_num 471 pg_num_target 128 >> pgp_num_target 128 autoscale_mode on last_change 1359013 lfor >> 0/1358620/1358618 flags hashpspool,nodelete stripe_width 0 >> expected_num_objects 3000000 recovery_op_priority 5 recovery_priority >> 2 application cephfs >> >> Janek >> >> >> On 31/05/2023 10:10, Janek Bevendorff wrote: >>> Forgot to add: We are still on Nautilus (16.2.12). >>> >>> >>> On 31/05/2023 09:53, Janek Bevendorff wrote: >>>> Hi, >>>> >>>> Perhaps this is a known issue and I was simply too dumb to find it, >>>> but we are having problems with our CephFS metadata pool filling up >>>> over night. >>>> >>>> Our cluster has a small SSD pool of around 15TB which hosts our >>>> CephFS metadata pool. Usually, that's more than enough. The normal >>>> size of the pool ranges between 200 and 800GiB (which is quite a lot >>>> of fluctuation already). Yesterday, we had suddenly had the pool >>>> fill up entirely and they only way to fix it was to add more >>>> capacity. I increased the pool size to 18TB by adding more SSDs and >>>> could resolve the problem. After a couple of hours of reshuffling, >>>> the pool size finally went back to 230GiB. >>>> >>>> But then we had another fill-up tonight to 7.6TiB. Luckily, I had >>>> adjusted the weights so that not all disks could fill up entirely >>>> like last time, so it ended there. >>>> >>>> I wasn't really able to identify the problem yesterday, but under >>>> the more controllable scenario today, I could check the MDS logs at >>>> debug_mds=10 and to me it seems like the problem is caused by >>>> snapshot trimming. The logs contain a lot of snapshot-related >>>> messages for paths that haven't been touched in a long time. The >>>> messages all look something like this: >>>> >>>> May 31 09:16:48 XXX ceph-mds[2947525]: 2023-05-31T09:16:48.292+0200 >>>> 7f7ce1bd9700 10 mds.1.cache.ino(0x1000b3c3670) add_client_cap first >>>> cap, joining realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1 >>>> b1b cps 2 snaps={185f=snap(185f 0x10000000000 'monthly_20221201' >>>> 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 >>>> 'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 >>>> 0x10000000000 ... >>>> >>>> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.396+0200 >>>> 7f0e6a6ca700 10 mds.0.cache | |______ 3 rep [dir >>>> 0x100000218fe.101111101* /storage/REDACTED/| ptrwaiter=0 request=0 >>>> child=0 frozen=0 subtree=1 replicated=0 dirty=0 waiter=0 authpin=0 >>>> tempexporting=0 0x5607759d9600] >>>> >>>> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.452+0200 >>>> 7f0e6a6ca700 10 mds.0.cache | | |____ 4 rep [dir >>>> 0x100000ff904.100111101010* /storage/REDACTED/| ptrwaiter=0 >>>> request=0 child=0 frozen=0 subtree=1 importing=0 replicated=0 >>>> waiter=0 authpin=0 tempexporting=0 0x56034ed25200] >>>> >>>> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.716+0200 >>>> 7f0e6becd700 10 mds.0.server set_trace_dist snaprealm >>>> snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2 >>>> snaps={185f=snap(185f 0x10000000000 'monthly_20221201' >>>> 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 >>>> 'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 >>>> 0x10000000000 'monthly_20230201' >>>> 2023-02-01T00:00:01.854059+0100),19a6=snap(19a6 0x10000000000 >>>> 'monthly_20230301' 2023-03-01T00:00:01.215197+0100),1a24=snap(1a24 >>>> 0x10000000000 'monthly_20230401' ...) len=384 >>>> >>>> May 31 09:25:36 deltaweb055 ceph-mds[3268481]: >>>> 2023-05-31T09:25:36.076+0200 7f0e6becd700 10 >>>> mds.0.cache.ino(0x10004d74911) remove_client_cap last cap, leaving >>>> realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2 >>>> snaps={185f=snap(185f 0x10000000000 'monthly_20221201' >>>> 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 >>>> 'monthly_20230101' ...) >>>> >>>> The daily_*, montly_* etc. names are the names of our regular >>>> snapshots. >>>> >>>> I posted a larger log file snippet using ceph-post-file with the ID: >>>> da0eb93d-f340-4457-8a3f-434e8ef37d36 >>>> >>>> Is it possible that the MDS are trimming old snapshots without >>>> taking care not to fill up the entire metadata pool? >>>> >>>> Cheers >>>> Janek >>>> _______________________________________________ >>>> ceph-users mailing list -- ceph-users(a)ceph.io >>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io >>> _______________________________________________ >>> ceph-users mailing list -- ceph-users(a)ceph.io >>> To unsubscribe send an email to ceph-users-leave(a)ceph.io >> _______________________________________________ >> ceph-users mailing list -- ceph-users(a)ceph.io >> To unsubscribe send an email to ceph-users-leave(a)ceph.io > -- > > Bauhaus-Universität Weimar > Bauhausstr. 9a, R308 > 99423 Weimar, Germany > > Phone: +49 3643 58 3577 > www.webis.de > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io

_______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

-- Bauhaus-Universität Weimar Bauhausstr. 9a, R308 99423 Weimar, Germany Phone: +49 3643 58 3577 www.webis.de

Janek Bevendorff

4:48 p.m.

...

On 31. May 2023, at 18:44, Dan van der Ster <dan.vanderster(a)clyso.com> wrote: Hi Janek, A few questions and suggestions: - Do you have multi-active MDS? In my experience back in nautilus if something went wrong with mds export between mds's, the mds log/journal could grow unbounded like you observed until that export work was done. Static pinning could help if you are not using it already. - You definitely should disable the pg autoscaling on the mds metadata pool (and other pools imho) -- decide the correct number of PGs for your pools and leave it. - Which version are you running? You said nautilus but wrote 16.2.12 which is pacific... If you're running nautilus v14 then I recommend disabling pg autoscaling completely -- IIRC it does not have a fix for the OSD memory growth "pg dup" issue which can occur during PG splitting/merging. Cheers, Dan ______________________________ Clyso GmbH | https://www.clyso.com On Wed, May 31, 2023 at 4:03 AM Janek Bevendorff <janek.bevendorff(a)uni-weimar.de> wrote: > I checked our logs from yesterday, the PG scaling only started today, > perhaps triggered by the snapshot trimming. I disabled it, but it > didn't > change anything. > > What did change something was restarting the MDS one by one, which had > got far behind with trimming their caches and with a bunch of stuck > ops. > After restarting them, the pool size decreased quickly to 600GiB. I > noticed the same behaviour yesterday, though yesterday is was more > extreme and restarting the MDS took about an hour and I had to > increase > the heartbeat timeout. This time, it took only half a minute per MDS, > probably because it wasn't that extreme yet and I had reduced the > maximum cache size. Still looks like a bug to me. > > > On 31/05/2023 11:18, Janek Bevendorff wrote: >> Another thing I just noticed is that the auto-scaler is trying to >> scale the pool down to 128 PGs. That could also result in large >> fluctuations, but this big?? In any case, it looks like a bug to me. >> Whatever is happening here, there should be safeguards with regard to >> the pool's capacity. >> >> Here's the current state of the pool in ceph osd pool ls detail: >> >> pool 110 'cephfs.storage.meta' replicated size 4 min_size 3 >> crush_rule >> 5 object_hash rjenkins pg_num 495 pgp_num 471 pg_num_target 128 >> pgp_num_target 128 autoscale_mode on last_change 1359013 lfor >> 0/1358620/1358618 flags hashpspool,nodelete stripe_width 0 >> expected_num_objects 3000000 recovery_op_priority 5 recovery_priority >> 2 application cephfs >> >> Janek >> >> >> On 31/05/2023 10:10, Janek Bevendorff wrote: >>> Forgot to add: We are still on Nautilus (16.2.12). >>> >>> >>> On 31/05/2023 09:53, Janek Bevendorff wrote: >>>> Hi, >>>> >>>> Perhaps this is a known issue and I was simply too dumb to find it, >>>> but we are having problems with our CephFS metadata pool filling up >>>> over night. >>>> >>>> Our cluster has a small SSD pool of around 15TB which hosts our >>>> CephFS metadata pool. Usually, that's more than enough. The normal >>>> size of the pool ranges between 200 and 800GiB (which is quite a >>>> lot >>>> of fluctuation already). Yesterday, we had suddenly had the pool >>>> fill up entirely and they only way to fix it was to add more >>>> capacity. I increased the pool size to 18TB by adding more SSDs and >>>> could resolve the problem. After a couple of hours of reshuffling, >>>> the pool size finally went back to 230GiB. >>>> >>>> But then we had another fill-up tonight to 7.6TiB. Luckily, I had >>>> adjusted the weights so that not all disks could fill up entirely >>>> like last time, so it ended there. >>>> >>>> I wasn't really able to identify the problem yesterday, but under >>>> the more controllable scenario today, I could check the MDS logs at >>>> debug_mds=10 and to me it seems like the problem is caused by >>>> snapshot trimming. The logs contain a lot of snapshot-related >>>> messages for paths that haven't been touched in a long time. The >>>> messages all look something like this: >>>> >>>> May 31 09:16:48 XXX ceph-mds[2947525]: 2023-05-31T09:16:48.292+0200 >>>> 7f7ce1bd9700 10 mds.1.cache.ino(0x1000b3c3670) add_client_cap first >>>> cap, joining realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1 >>>> b1b cps 2 snaps={185f=snap(185f 0x10000000000 'monthly_20221201' >>>> 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 >>>> 'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 >>>> 0x10000000000 ... >>>> >>>> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.396+0200 >>>> 7f0e6a6ca700 10 mds.0.cache | |______ 3 rep [dir >>>> 0x100000218fe.101111101* /storage/REDACTED/| ptrwaiter=0 request=0 >>>> child=0 frozen=0 subtree=1 replicated=0 dirty=0 waiter=0 authpin=0 >>>> tempexporting=0 0x5607759d9600] >>>> >>>> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.452+0200 >>>> 7f0e6a6ca700 10 mds.0.cache | | |____ 4 rep [dir >>>> 0x100000ff904.100111101010* /storage/REDACTED/| ptrwaiter=0 >>>> request=0 child=0 frozen=0 subtree=1 importing=0 replicated=0 >>>> waiter=0 authpin=0 tempexporting=0 0x56034ed25200] >>>> >>>> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.716+0200 >>>> 7f0e6becd700 10 mds.0.server set_trace_dist snaprealm >>>> snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2 >>>> snaps={185f=snap(185f 0x10000000000 'monthly_20221201' >>>> 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 >>>> 'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 >>>> 0x10000000000 'monthly_20230201' >>>> 2023-02-01T00:00:01.854059+0100),19a6=snap(19a6 0x10000000000 >>>> 'monthly_20230301' 2023-03-01T00:00:01.215197+0100),1a24=snap(1a24 >>>> 0x10000000000 'monthly_20230401' ...) len=384 >>>> >>>> May 31 09:25:36 deltaweb055 ceph-mds[3268481]: >>>> 2023-05-31T09:25:36.076+0200 7f0e6becd700 10 >>>> mds.0.cache.ino(0x10004d74911) remove_client_cap last cap, leaving >>>> realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2 >>>> snaps={185f=snap(185f 0x10000000000 'monthly_20221201' >>>> 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 >>>> 'monthly_20230101' ...) >>>> >>>> The daily_*, montly_* etc. names are the names of our regular >>>> snapshots. >>>> >>>> I posted a larger log file snippet using ceph-post-file with the >>>> ID: >>>> da0eb93d-f340-4457-8a3f-434e8ef37d36 >>>> >>>> Is it possible that the MDS are trimming old snapshots without >>>> taking care not to fill up the entire metadata pool? >>>> >>>> Cheers >>>> Janek >>>> _______________________________________________ >>>> ceph-users mailing list -- ceph-users(a)ceph.io >>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io >>> _______________________________________________ >>> ceph-users mailing list -- ceph-users(a)ceph.io >>> To unsubscribe send an email to ceph-users-leave(a)ceph.io >> _______________________________________________ >> ceph-users mailing list -- ceph-users(a)ceph.io >> To unsubscribe send an email to ceph-users-leave(a)ceph.io > -- > > Bauhaus-Universität Weimar > Bauhausstr. 9a, R308 > 99423 Weimar, Germany > > Phone: +49 3643 58 3577 > www.webis.de > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io

_______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

-- Bauhaus-Universität Weimar Bauhausstr. 9a, R308 99423 Weimar, Germany Phone: +49 3643 58 3577 www.webis.de

Janek Bevendorff

5:03 p.m.

...

Hi Patrick, hi Dan! I got the MDS back and I think the issue is connected to the "newly corrupt dentry" bug [1]. Even though I couldn't see any particular reason for the SIGABRT at first, I then noticed one of these awfully familiar stack traces. I rescheduled the two broken MDS ranks on two machines with 1.5TB RAM each (just to make sure it's not that) and then let them do their thing. The routine goes as follows: both replay the journal, then rank 4 goes into the "resolve" state, but as soon as rank 3 also starts resolving, they both crash. Then I set ceph config mds mds_abort_on_newly_corrupt_dentry false ceph config mds mds_go_bad_corrupt_dentry false and this time I was able to recover the ranks, even though "resolve" and "clientreplay" took forever. I uploaded a compressed log of rank 3 using ceph-post-file [2]. It's a log of several crash cycles, including the final successful attempt after changing the settings. The log decompresses to 815MB. I didn't censor any paths and they are not super-secret, but please don't share. While writing this, the metadata pool size has reduced from 6TiB back to 440GiB. I am starting to think that the fill-ups may also be connected to the corruption issue. I also noticed that the ranks 3 and 4 always have huge journals. An inspection using ceph-journal-tool takes forever and consumes 50GB of memory in the process. Listing the events in the journal is impossible without running out of RAM. Ranks 0, 1, and 2 don't have this problem and this wasn't a problem for ranks 3 and 4 either before the fill-ups started happening. Hope that helps getting to the bottom of this. I reset the guardrail settings in the meantime. Cheers Janek [1] "Newly corrupt dentry" ML link: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/JNZ6V5WSYKQ… [2] ceph-post-file ID: 7c039483-49fd-468c-ba40-fb10337aa7d6 On 05/06/2023 16:08, Janek Bevendorff wrote: > I just had the problem again that MDS were constantly reporting slow > metadata IO and the pool was slowly growing. Hence I restarted the > MDS and now ranks 4 and 5 don't come up again. > > Every time, they get to the resolve stage, the crash with a SIGABRT > without an error message (not even at debug_mds = 20). Any idea what > the reason could be? I checked whether they have enough RAM, which > seems to be the case (unless they try to allocate tens of GB in one > allocation). > > Janek > > > On 31/05/2023 21:57, Janek Bevendorff wrote: >> Hi Dan, >> >> Sorry, I meant Pacific. The version number was correct, the name >> wasn’t. ;-) >> >> Yes, I have five active MDS and five hot standbys. Static pinning >> isn’t really an options for our directory structure, so we’re using >> ephemeral pins. >> >> Janek >> >> >>> On 31. May 2023, at 18:44, Dan van der Ster >>> <dan.vanderster(a)clyso.com> wrote: >>> >>> Hi Janek, >>> >>> A few questions and suggestions: >>> - Do you have multi-active MDS? In my experience back in nautilus if >>> something went wrong with mds export between mds's, the mds >>> log/journal could grow unbounded like you observed until that export >>> work was done. Static pinning could help if you are not using it >>> already. >>> - You definitely should disable the pg autoscaling on the mds metadata >>> pool (and other pools imho) -- decide the correct number of PGs for >>> your pools and leave it. >>> - Which version are you running? You said nautilus but wrote 16.2.12 >>> which is pacific... If you're running nautilus v14 then I recommend >>> disabling pg autoscaling completely -- IIRC it does not have a fix for >>> the OSD memory growth "pg dup" issue which can occur during PG >>> splitting/merging. >>> >>> Cheers, Dan >>> >>> ______________________________ >>> Clyso GmbH | https://www.clyso.com >>> >>> >>> On Wed, May 31, 2023 at 4:03 AM Janek Bevendorff >>> <janek.bevendorff(a)uni-weimar.de> wrote: >>>> I checked our logs from yesterday, the PG scaling only started today, >>>> perhaps triggered by the snapshot trimming. I disabled it, but it >>>> didn't >>>> change anything. >>>> >>>> What did change something was restarting the MDS one by one, which >>>> had >>>> got far behind with trimming their caches and with a bunch of >>>> stuck ops. >>>> After restarting them, the pool size decreased quickly to 600GiB. I >>>> noticed the same behaviour yesterday, though yesterday is was more >>>> extreme and restarting the MDS took about an hour and I had to >>>> increase >>>> the heartbeat timeout. This time, it took only half a minute per MDS, >>>> probably because it wasn't that extreme yet and I had reduced the >>>> maximum cache size. Still looks like a bug to me. >>>> >>>> >>>> On 31/05/2023 11:18, Janek Bevendorff wrote: >>>>> Another thing I just noticed is that the auto-scaler is trying to >>>>> scale the pool down to 128 PGs. That could also result in large >>>>> fluctuations, but this big?? In any case, it looks like a bug to me. >>>>> Whatever is happening here, there should be safeguards with >>>>> regard to >>>>> the pool's capacity. >>>>> >>>>> Here's the current state of the pool in ceph osd pool ls detail: >>>>> >>>>> pool 110 'cephfs.storage.meta' replicated size 4 min_size 3 >>>>> crush_rule >>>>> 5 object_hash rjenkins pg_num 495 pgp_num 471 pg_num_target 128 >>>>> pgp_num_target 128 autoscale_mode on last_change 1359013 lfor >>>>> 0/1358620/1358618 flags hashpspool,nodelete stripe_width 0 >>>>> expected_num_objects 3000000 recovery_op_priority 5 >>>>> recovery_priority >>>>> 2 application cephfs >>>>> >>>>> Janek >>>>> >>>>> >>>>> On 31/05/2023 10:10, Janek Bevendorff wrote: >>>>>> Forgot to add: We are still on Nautilus (16.2.12). >>>>>> >>>>>> >>>>>> On 31/05/2023 09:53, Janek Bevendorff wrote: >>>>>>> Hi, >>>>>>> >>>>>>> Perhaps this is a known issue and I was simply too dumb to find >>>>>>> it, >>>>>>> but we are having problems with our CephFS metadata pool >>>>>>> filling up >>>>>>> over night. >>>>>>> >>>>>>> Our cluster has a small SSD pool of around 15TB which hosts our >>>>>>> CephFS metadata pool. Usually, that's more than enough. The normal >>>>>>> size of the pool ranges between 200 and 800GiB (which is quite >>>>>>> a lot >>>>>>> of fluctuation already). Yesterday, we had suddenly had the pool >>>>>>> fill up entirely and they only way to fix it was to add more >>>>>>> capacity. I increased the pool size to 18TB by adding more SSDs >>>>>>> and >>>>>>> could resolve the problem. After a couple of hours of reshuffling, >>>>>>> the pool size finally went back to 230GiB. >>>>>>> >>>>>>> But then we had another fill-up tonight to 7.6TiB. Luckily, I had >>>>>>> adjusted the weights so that not all disks could fill up entirely >>>>>>> like last time, so it ended there. >>>>>>> >>>>>>> I wasn't really able to identify the problem yesterday, but under >>>>>>> the more controllable scenario today, I could check the MDS >>>>>>> logs at >>>>>>> debug_mds=10 and to me it seems like the problem is caused by >>>>>>> snapshot trimming. The logs contain a lot of snapshot-related >>>>>>> messages for paths that haven't been touched in a long time. The >>>>>>> messages all look something like this: >>>>>>> >>>>>>> May 31 09:16:48 XXX ceph-mds[2947525]: >>>>>>> 2023-05-31T09:16:48.292+0200 >>>>>>> 7f7ce1bd9700 10 mds.1.cache.ino(0x1000b3c3670) add_client_cap >>>>>>> first >>>>>>> cap, joining realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1 >>>>>>> b1b cps 2 snaps={185f=snap(185f 0x10000000000 'monthly_20221201' >>>>>>> 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 >>>>>>> 'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 >>>>>>> 0x10000000000 ... >>>>>>> >>>>>>> May 31 09:25:03 XXX ceph-mds[3268481]: >>>>>>> 2023-05-31T09:25:03.396+0200 >>>>>>> 7f0e6a6ca700 10 mds.0.cache | |______ 3 rep [dir >>>>>>> 0x100000218fe.101111101* /storage/REDACTED/| ptrwaiter=0 request=0 >>>>>>> child=0 frozen=0 subtree=1 replicated=0 dirty=0 waiter=0 authpin=0 >>>>>>> tempexporting=0 0x5607759d9600] >>>>>>> >>>>>>> May 31 09:25:03 XXX ceph-mds[3268481]: >>>>>>> 2023-05-31T09:25:03.452+0200 >>>>>>> 7f0e6a6ca700 10 mds.0.cache | | |____ 4 rep [dir >>>>>>> 0x100000ff904.100111101010* /storage/REDACTED/| ptrwaiter=0 >>>>>>> request=0 child=0 frozen=0 subtree=1 importing=0 replicated=0 >>>>>>> waiter=0 authpin=0 tempexporting=0 0x56034ed25200] >>>>>>> >>>>>>> May 31 09:25:03 XXX ceph-mds[3268481]: >>>>>>> 2023-05-31T09:25:03.716+0200 >>>>>>> 7f0e6becd700 10 mds.0.server set_trace_dist snaprealm >>>>>>> snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2 >>>>>>> snaps={185f=snap(185f 0x10000000000 'monthly_20221201' >>>>>>> 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 >>>>>>> 'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 >>>>>>> 0x10000000000 'monthly_20230201' >>>>>>> 2023-02-01T00:00:01.854059+0100),19a6=snap(19a6 0x10000000000 >>>>>>> 'monthly_20230301' 2023-03-01T00:00:01.215197+0100),1a24=snap(1a24 >>>>>>> 0x10000000000 'monthly_20230401' ...) len=384 >>>>>>> >>>>>>> May 31 09:25:36 deltaweb055 ceph-mds[3268481]: >>>>>>> 2023-05-31T09:25:36.076+0200 7f0e6becd700 10 >>>>>>> mds.0.cache.ino(0x10004d74911) remove_client_cap last cap, leaving >>>>>>> realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2 >>>>>>> snaps={185f=snap(185f 0x10000000000 'monthly_20221201' >>>>>>> 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 >>>>>>> 'monthly_20230101' ...) >>>>>>> >>>>>>> The daily_*, montly_* etc. names are the names of our regular >>>>>>> snapshots. >>>>>>> >>>>>>> I posted a larger log file snippet using ceph-post-file with >>>>>>> the ID: >>>>>>> da0eb93d-f340-4457-8a3f-434e8ef37d36 >>>>>>> >>>>>>> Is it possible that the MDS are trimming old snapshots without >>>>>>> taking care not to fill up the entire metadata pool? >>>>>>> >>>>>>> Cheers >>>>>>> Janek >>>>>>> _______________________________________________ >>>>>>> ceph-users mailing list -- ceph-users(a)ceph.io >>>>>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io >>>>>> _______________________________________________ >>>>>> ceph-users mailing list -- ceph-users(a)ceph.io >>>>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io >>>>> _______________________________________________ >>>>> ceph-users mailing list -- ceph-users(a)ceph.io >>>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io >>>> -- >>>> >>>> Bauhaus-Universität Weimar >>>> Bauhausstr. 9a, R308 >>>> 99423 Weimar, Germany >>>> >>>> Phone: +49 3643 58 3577 >>>> www.webis.de >>>> _______________________________________________ >>>> ceph-users mailing list -- ceph-users(a)ceph.io >>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io >> _______________________________________________ >> ceph-users mailing list -- ceph-users(a)ceph.io >> To unsubscribe send an email to ceph-users-leave(a)ceph.io >

-- Bauhaus-Universität Weimar Bauhausstr. 9a, R308 99423 Weimar, Germany Phone: +49 3643 58 3577 www.webis.de

Janek Bevendorff

6 Jun 6 Jun

8:16 a.m.

...

That said, our MON store size has also been growing slowly from 900MB to 5.4GB. But we also have a few remapped PGs right now. Not sure if that would have an influence. On 05/06/2023 17:48, Janek Bevendorff wrote: > Hi Patrick, hi Dan! > > I got the MDS back and I think the issue is connected to the "newly > corrupt dentry" bug [1]. Even though I couldn't see any particular > reason for the SIGABRT at first, I then noticed one of these awfully > familiar stack traces. > > I rescheduled the two broken MDS ranks on two machines with 1.5TB RAM > each (just to make sure it's not that) and then let them do their > thing. The routine goes as follows: both replay the journal, then > rank 4 goes into the "resolve" state, but as soon as rank 3 also > starts resolving, they both crash. > > Then I set > > ceph config mds mds_abort_on_newly_corrupt_dentry false > ceph config mds mds_go_bad_corrupt_dentry false > > and this time I was able to recover the ranks, even though "resolve" > and "clientreplay" took forever. I uploaded a compressed log of rank > 3 using ceph-post-file [2]. It's a log of several crash cycles, > including the final successful attempt after changing the settings. > The log decompresses to 815MB. I didn't censor any paths and they are > not super-secret, but please don't share. > > While writing this, the metadata pool size has reduced from 6TiB back > to 440GiB. I am starting to think that the fill-ups may also be > connected to the corruption issue. I also noticed that the ranks 3 > and 4 always have huge journals. An inspection using > ceph-journal-tool takes forever and consumes 50GB of memory in the > process. Listing the events in the journal is impossible without > running out of RAM. Ranks 0, 1, and 2 don't have this problem and > this wasn't a problem for ranks 3 and 4 either before the fill-ups > started happening. > > Hope that helps getting to the bottom of this. I reset the guardrail > settings in the meantime. > > Cheers > Janek > > > [1] "Newly corrupt dentry" ML link: > https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/JNZ6V5WSYKQ… > > [2] ceph-post-file ID: 7c039483-49fd-468c-ba40-fb10337aa7d6 > > > > On 05/06/2023 16:08, Janek Bevendorff wrote: >> I just had the problem again that MDS were constantly reporting slow >> metadata IO and the pool was slowly growing. Hence I restarted the >> MDS and now ranks 4 and 5 don't come up again. >> >> Every time, they get to the resolve stage, the crash with a SIGABRT >> without an error message (not even at debug_mds = 20). Any idea what >> the reason could be? I checked whether they have enough RAM, which >> seems to be the case (unless they try to allocate tens of GB in one >> allocation). >> >> Janek >> >> >> On 31/05/2023 21:57, Janek Bevendorff wrote: >>> Hi Dan, >>> >>> Sorry, I meant Pacific. The version number was correct, the name >>> wasn’t. ;-) >>> >>> Yes, I have five active MDS and five hot standbys. Static pinning >>> isn’t really an options for our directory structure, so we’re using >>> ephemeral pins. >>> >>> Janek >>> >>> >>>> On 31. May 2023, at 18:44, Dan van der Ster >>>> <dan.vanderster(a)clyso.com> wrote: >>>> >>>> Hi Janek, >>>> >>>> A few questions and suggestions: >>>> - Do you have multi-active MDS? In my experience back in nautilus if >>>> something went wrong with mds export between mds's, the mds >>>> log/journal could grow unbounded like you observed until that export >>>> work was done. Static pinning could help if you are not using it >>>> already. >>>> - You definitely should disable the pg autoscaling on the mds >>>> metadata >>>> pool (and other pools imho) -- decide the correct number of PGs for >>>> your pools and leave it. >>>> - Which version are you running? You said nautilus but wrote 16.2.12 >>>> which is pacific... If you're running nautilus v14 then I recommend >>>> disabling pg autoscaling completely -- IIRC it does not have a fix >>>> for >>>> the OSD memory growth "pg dup" issue which can occur during PG >>>> splitting/merging. >>>> >>>> Cheers, Dan >>>> >>>> ______________________________ >>>> Clyso GmbH | https://www.clyso.com >>>> >>>> >>>> On Wed, May 31, 2023 at 4:03 AM Janek Bevendorff >>>> <janek.bevendorff(a)uni-weimar.de> wrote: >>>>> I checked our logs from yesterday, the PG scaling only started >>>>> today, >>>>> perhaps triggered by the snapshot trimming. I disabled it, but it >>>>> didn't >>>>> change anything. >>>>> >>>>> What did change something was restarting the MDS one by one, >>>>> which had >>>>> got far behind with trimming their caches and with a bunch of >>>>> stuck ops. >>>>> After restarting them, the pool size decreased quickly to 600GiB. I >>>>> noticed the same behaviour yesterday, though yesterday is was more >>>>> extreme and restarting the MDS took about an hour and I had to >>>>> increase >>>>> the heartbeat timeout. This time, it took only half a minute per >>>>> MDS, >>>>> probably because it wasn't that extreme yet and I had reduced the >>>>> maximum cache size. Still looks like a bug to me. >>>>> >>>>> >>>>> On 31/05/2023 11:18, Janek Bevendorff wrote: >>>>>> Another thing I just noticed is that the auto-scaler is trying to >>>>>> scale the pool down to 128 PGs. That could also result in large >>>>>> fluctuations, but this big?? In any case, it looks like a bug to >>>>>> me. >>>>>> Whatever is happening here, there should be safeguards with >>>>>> regard to >>>>>> the pool's capacity. >>>>>> >>>>>> Here's the current state of the pool in ceph osd pool ls detail: >>>>>> >>>>>> pool 110 'cephfs.storage.meta' replicated size 4 min_size 3 >>>>>> crush_rule >>>>>> 5 object_hash rjenkins pg_num 495 pgp_num 471 pg_num_target 128 >>>>>> pgp_num_target 128 autoscale_mode on last_change 1359013 lfor >>>>>> 0/1358620/1358618 flags hashpspool,nodelete stripe_width 0 >>>>>> expected_num_objects 3000000 recovery_op_priority 5 >>>>>> recovery_priority >>>>>> 2 application cephfs >>>>>> >>>>>> Janek >>>>>> >>>>>> >>>>>> On 31/05/2023 10:10, Janek Bevendorff wrote: >>>>>>> Forgot to add: We are still on Nautilus (16.2.12). >>>>>>> >>>>>>> >>>>>>> On 31/05/2023 09:53, Janek Bevendorff wrote: >>>>>>>> Hi, >>>>>>>> >>>>>>>> Perhaps this is a known issue and I was simply too dumb to >>>>>>>> find it, >>>>>>>> but we are having problems with our CephFS metadata pool >>>>>>>> filling up >>>>>>>> over night. >>>>>>>> >>>>>>>> Our cluster has a small SSD pool of around 15TB which hosts our >>>>>>>> CephFS metadata pool. Usually, that's more than enough. The >>>>>>>> normal >>>>>>>> size of the pool ranges between 200 and 800GiB (which is quite >>>>>>>> a lot >>>>>>>> of fluctuation already). Yesterday, we had suddenly had the pool >>>>>>>> fill up entirely and they only way to fix it was to add more >>>>>>>> capacity. I increased the pool size to 18TB by adding more >>>>>>>> SSDs and >>>>>>>> could resolve the problem. After a couple of hours of >>>>>>>> reshuffling, >>>>>>>> the pool size finally went back to 230GiB. >>>>>>>> >>>>>>>> But then we had another fill-up tonight to 7.6TiB. Luckily, I had >>>>>>>> adjusted the weights so that not all disks could fill up entirely >>>>>>>> like last time, so it ended there. >>>>>>>> >>>>>>>> I wasn't really able to identify the problem yesterday, but under >>>>>>>> the more controllable scenario today, I could check the MDS >>>>>>>> logs at >>>>>>>> debug_mds=10 and to me it seems like the problem is caused by >>>>>>>> snapshot trimming. The logs contain a lot of snapshot-related >>>>>>>> messages for paths that haven't been touched in a long time. The >>>>>>>> messages all look something like this: >>>>>>>> >>>>>>>> May 31 09:16:48 XXX ceph-mds[2947525]: >>>>>>>> 2023-05-31T09:16:48.292+0200 >>>>>>>> 7f7ce1bd9700 10 mds.1.cache.ino(0x1000b3c3670) add_client_cap >>>>>>>> first >>>>>>>> cap, joining realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1 >>>>>>>> b1b cps 2 snaps={185f=snap(185f 0x10000000000 'monthly_20221201' >>>>>>>> 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 >>>>>>>> 'monthly_20230101' >>>>>>>> 2023-01-01T00:00:04.657252+0100),1941=snap(1941 >>>>>>>> 0x10000000000 ... >>>>>>>> >>>>>>>> May 31 09:25:03 XXX ceph-mds[3268481]: >>>>>>>> 2023-05-31T09:25:03.396+0200 >>>>>>>> 7f0e6a6ca700 10 mds.0.cache | |______ 3 rep [dir >>>>>>>> 0x100000218fe.101111101* /storage/REDACTED/| ptrwaiter=0 >>>>>>>> request=0 >>>>>>>> child=0 frozen=0 subtree=1 replicated=0 dirty=0 waiter=0 >>>>>>>> authpin=0 >>>>>>>> tempexporting=0 0x5607759d9600] >>>>>>>> >>>>>>>> May 31 09:25:03 XXX ceph-mds[3268481]: >>>>>>>> 2023-05-31T09:25:03.452+0200 >>>>>>>> 7f0e6a6ca700 10 mds.0.cache | | |____ 4 rep [dir >>>>>>>> 0x100000ff904.100111101010* /storage/REDACTED/| ptrwaiter=0 >>>>>>>> request=0 child=0 frozen=0 subtree=1 importing=0 replicated=0 >>>>>>>> waiter=0 authpin=0 tempexporting=0 0x56034ed25200] >>>>>>>> >>>>>>>> May 31 09:25:03 XXX ceph-mds[3268481]: >>>>>>>> 2023-05-31T09:25:03.716+0200 >>>>>>>> 7f0e6becd700 10 mds.0.server set_trace_dist snaprealm >>>>>>>> snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2 >>>>>>>> snaps={185f=snap(185f 0x10000000000 'monthly_20221201' >>>>>>>> 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 >>>>>>>> 'monthly_20230101' >>>>>>>> 2023-01-01T00:00:04.657252+0100),1941=snap(1941 >>>>>>>> 0x10000000000 'monthly_20230201' >>>>>>>> 2023-02-01T00:00:01.854059+0100),19a6=snap(19a6 0x10000000000 >>>>>>>> 'monthly_20230301' >>>>>>>> 2023-03-01T00:00:01.215197+0100),1a24=snap(1a24 >>>>>>>> 0x10000000000 'monthly_20230401' ...) len=384 >>>>>>>> >>>>>>>> May 31 09:25:36 deltaweb055 ceph-mds[3268481]: >>>>>>>> 2023-05-31T09:25:36.076+0200 7f0e6becd700 10 >>>>>>>> mds.0.cache.ino(0x10004d74911) remove_client_cap last cap, >>>>>>>> leaving >>>>>>>> realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2 >>>>>>>> snaps={185f=snap(185f 0x10000000000 'monthly_20221201' >>>>>>>> 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 >>>>>>>> 'monthly_20230101' ...) >>>>>>>> >>>>>>>> The daily_*, montly_* etc. names are the names of our regular >>>>>>>> snapshots. >>>>>>>> >>>>>>>> I posted a larger log file snippet using ceph-post-file with >>>>>>>> the ID: >>>>>>>> da0eb93d-f340-4457-8a3f-434e8ef37d36 >>>>>>>> >>>>>>>> Is it possible that the MDS are trimming old snapshots without >>>>>>>> taking care not to fill up the entire metadata pool? >>>>>>>> >>>>>>>> Cheers >>>>>>>> Janek >>>>>>>> _______________________________________________ >>>>>>>> ceph-users mailing list -- ceph-users(a)ceph.io >>>>>>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io >>>>>>> _______________________________________________ >>>>>>> ceph-users mailing list -- ceph-users(a)ceph.io >>>>>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io >>>>>> _______________________________________________ >>>>>> ceph-users mailing list -- ceph-users(a)ceph.io >>>>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io >>>>> -- >>>>> >>>>> Bauhaus-Universität Weimar >>>>> Bauhausstr. 9a, R308 >>>>> 99423 Weimar, Germany >>>>> >>>>> Phone: +49 3643 58 3577 >>>>> www.webis.de >>>>> _______________________________________________ >>>>> ceph-users mailing list -- ceph-users(a)ceph.io >>>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io >>> _______________________________________________ >>> ceph-users mailing list -- ceph-users(a)ceph.io >>> To unsubscribe send an email to ceph-users-leave(a)ceph.io >>

-- Bauhaus-Universität Weimar Bauhausstr. 9a, R308 99423 Weimar, Germany Phone: +49 3643 58 3577 www.webis.de

Janek Bevendorff

8:26 a.m.

I guess the mailing list didn't preserve the embedded image. Here's an Imgur link: https://imgur.com/a/WSmAOaG I checked the logs as far back as we have them. The issue started appearing only after my last Ceph upgrade on 2 May, which introduced the new corruption assertion. On 06/06/2023 09:16, Janek Bevendorff wrote:

...

I checked our Prometheus logs and the number of log events of individual MONs are indeed randomly starting to increase dramatically all of a sudden. I attached a picture of the curves. The first incidence you see there was when our metadata store filled up entirely. The second, smaller one was the more controlled fill-up. The last instance with only one runaway MDS is what I have just reported. My unqualified wild guess is that the new safeguard to prevent the MDS from committing corrupt dentries is holding up the queue, so all of a sudden events are starting to pile up until the store is full. On 05/06/2023 18:03, Janek Bevendorff wrote: > That said, our MON store size has also been growing slowly from 900MB > to 5.4GB. But we also have a few remapped PGs right now. Not sure if > that would have an influence. > > > On 05/06/2023 17:48, Janek Bevendorff wrote: >> Hi Patrick, hi Dan! >> >> I got the MDS back and I think the issue is connected to the "newly >> corrupt dentry" bug [1]. Even though I couldn't see any particular >> reason for the SIGABRT at first, I then noticed one of these awfully >> familiar stack traces. >> >> I rescheduled the two broken MDS ranks on two machines with 1.5TB >> RAM each (just to make sure it's not that) and then let them do >> their thing. The routine goes as follows: both replay the journal, >> then rank 4 goes into the "resolve" state, but as soon as rank 3 >> also starts resolving, they both crash. >> >> Then I set >> >> ceph config mds mds_abort_on_newly_corrupt_dentry false >> ceph config mds mds_go_bad_corrupt_dentry false >> >> and this time I was able to recover the ranks, even though "resolve" >> and "clientreplay" took forever. I uploaded a compressed log of rank >> 3 using ceph-post-file [2]. It's a log of several crash cycles, >> including the final successful attempt after changing the settings. >> The log decompresses to 815MB. I didn't censor any paths and they >> are not super-secret, but please don't share. >> >> While writing this, the metadata pool size has reduced from 6TiB >> back to 440GiB. I am starting to think that the fill-ups may also be >> connected to the corruption issue. I also noticed that the ranks 3 >> and 4 always have huge journals. An inspection using >> ceph-journal-tool takes forever and consumes 50GB of memory in the >> process. Listing the events in the journal is impossible without >> running out of RAM. Ranks 0, 1, and 2 don't have this problem and >> this wasn't a problem for ranks 3 and 4 either before the fill-ups >> started happening. >> >> Hope that helps getting to the bottom of this. I reset the guardrail >> settings in the meantime. >> >> Cheers >> Janek >> >> >> [1] "Newly corrupt dentry" ML link: >> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/JNZ6V5WSYKQ… >> >> [2] ceph-post-file ID: 7c039483-49fd-468c-ba40-fb10337aa7d6 >> >> >> >> On 05/06/2023 16:08, Janek Bevendorff wrote: >>> I just had the problem again that MDS were constantly reporting >>> slow metadata IO and the pool was slowly growing. Hence I restarted >>> the MDS and now ranks 4 and 5 don't come up again. >>> >>> Every time, they get to the resolve stage, the crash with a SIGABRT >>> without an error message (not even at debug_mds = 20). Any idea >>> what the reason could be? I checked whether they have enough RAM, >>> which seems to be the case (unless they try to allocate tens of GB >>> in one allocation). >>> >>> Janek >>> >>> >>> On 31/05/2023 21:57, Janek Bevendorff wrote: >>>> Hi Dan, >>>> >>>> Sorry, I meant Pacific. The version number was correct, the name >>>> wasn’t. ;-) >>>> >>>> Yes, I have five active MDS and five hot standbys. Static pinning >>>> isn’t really an options for our directory structure, so we’re >>>> using ephemeral pins. >>>> >>>> Janek >>>> >>>> >>>>> On 31. May 2023, at 18:44, Dan van der Ster >>>>> <dan.vanderster(a)clyso.com> wrote: >>>>> >>>>> Hi Janek, >>>>> >>>>> A few questions and suggestions: >>>>> - Do you have multi-active MDS? In my experience back in nautilus if >>>>> something went wrong with mds export between mds's, the mds >>>>> log/journal could grow unbounded like you observed until that export >>>>> work was done. Static pinning could help if you are not using it >>>>> already. >>>>> - You definitely should disable the pg autoscaling on the mds >>>>> metadata >>>>> pool (and other pools imho) -- decide the correct number of PGs for >>>>> your pools and leave it. >>>>> - Which version are you running? You said nautilus but wrote 16.2.12 >>>>> which is pacific... If you're running nautilus v14 then I recommend >>>>> disabling pg autoscaling completely -- IIRC it does not have a >>>>> fix for >>>>> the OSD memory growth "pg dup" issue which can occur during PG >>>>> splitting/merging. >>>>> >>>>> Cheers, Dan >>>>> >>>>> ______________________________ >>>>> Clyso GmbH | https://www.clyso.com >>>>> >>>>> >>>>> On Wed, May 31, 2023 at 4:03 AM Janek Bevendorff >>>>> <janek.bevendorff(a)uni-weimar.de> wrote: >>>>>> I checked our logs from yesterday, the PG scaling only started >>>>>> today, >>>>>> perhaps triggered by the snapshot trimming. I disabled it, but >>>>>> it didn't >>>>>> change anything. >>>>>> >>>>>> What did change something was restarting the MDS one by one, >>>>>> which had >>>>>> got far behind with trimming their caches and with a bunch of >>>>>> stuck ops. >>>>>> After restarting them, the pool size decreased quickly to 600GiB. I >>>>>> noticed the same behaviour yesterday, though yesterday is was more >>>>>> extreme and restarting the MDS took about an hour and I had to >>>>>> increase >>>>>> the heartbeat timeout. This time, it took only half a minute per >>>>>> MDS, >>>>>> probably because it wasn't that extreme yet and I had reduced the >>>>>> maximum cache size. Still looks like a bug to me. >>>>>> >>>>>> >>>>>> On 31/05/2023 11:18, Janek Bevendorff wrote: >>>>>>> Another thing I just noticed is that the auto-scaler is trying to >>>>>>> scale the pool down to 128 PGs. That could also result in large >>>>>>> fluctuations, but this big?? In any case, it looks like a bug >>>>>>> to me. >>>>>>> Whatever is happening here, there should be safeguards with >>>>>>> regard to >>>>>>> the pool's capacity. >>>>>>> >>>>>>> Here's the current state of the pool in ceph osd pool ls detail: >>>>>>> >>>>>>> pool 110 'cephfs.storage.meta' replicated size 4 min_size 3 >>>>>>> crush_rule >>>>>>> 5 object_hash rjenkins pg_num 495 pgp_num 471 pg_num_target 128 >>>>>>> pgp_num_target 128 autoscale_mode on last_change 1359013 lfor >>>>>>> 0/1358620/1358618 flags hashpspool,nodelete stripe_width 0 >>>>>>> expected_num_objects 3000000 recovery_op_priority 5 >>>>>>> recovery_priority >>>>>>> 2 application cephfs >>>>>>> >>>>>>> Janek >>>>>>> >>>>>>> >>>>>>> On 31/05/2023 10:10, Janek Bevendorff wrote: >>>>>>>> Forgot to add: We are still on Nautilus (16.2.12). >>>>>>>> >>>>>>>> >>>>>>>> On 31/05/2023 09:53, Janek Bevendorff wrote: >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> Perhaps this is a known issue and I was simply too dumb to >>>>>>>>> find it, >>>>>>>>> but we are having problems with our CephFS metadata pool >>>>>>>>> filling up >>>>>>>>> over night. >>>>>>>>> >>>>>>>>> Our cluster has a small SSD pool of around 15TB which hosts our >>>>>>>>> CephFS metadata pool. Usually, that's more than enough. The >>>>>>>>> normal >>>>>>>>> size of the pool ranges between 200 and 800GiB (which is >>>>>>>>> quite a lot >>>>>>>>> of fluctuation already). Yesterday, we had suddenly had the pool >>>>>>>>> fill up entirely and they only way to fix it was to add more >>>>>>>>> capacity. I increased the pool size to 18TB by adding more >>>>>>>>> SSDs and >>>>>>>>> could resolve the problem. After a couple of hours of >>>>>>>>> reshuffling, >>>>>>>>> the pool size finally went back to 230GiB. >>>>>>>>> >>>>>>>>> But then we had another fill-up tonight to 7.6TiB. Luckily, I >>>>>>>>> had >>>>>>>>> adjusted the weights so that not all disks could fill up >>>>>>>>> entirely >>>>>>>>> like last time, so it ended there. >>>>>>>>> >>>>>>>>> I wasn't really able to identify the problem yesterday, but >>>>>>>>> under >>>>>>>>> the more controllable scenario today, I could check the MDS >>>>>>>>> logs at >>>>>>>>> debug_mds=10 and to me it seems like the problem is caused by >>>>>>>>> snapshot trimming. The logs contain a lot of snapshot-related >>>>>>>>> messages for paths that haven't been touched in a long time. The >>>>>>>>> messages all look something like this: >>>>>>>>> >>>>>>>>> May 31 09:16:48 XXX ceph-mds[2947525]: >>>>>>>>> 2023-05-31T09:16:48.292+0200 >>>>>>>>> 7f7ce1bd9700 10 mds.1.cache.ino(0x1000b3c3670) add_client_cap >>>>>>>>> first >>>>>>>>> cap, joining realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1 >>>>>>>>> b1b cps 2 snaps={185f=snap(185f 0x10000000000 'monthly_20221201' >>>>>>>>> 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 >>>>>>>>> 'monthly_20230101' >>>>>>>>> 2023-01-01T00:00:04.657252+0100),1941=snap(1941 >>>>>>>>> 0x10000000000 ... >>>>>>>>> >>>>>>>>> May 31 09:25:03 XXX ceph-mds[3268481]: >>>>>>>>> 2023-05-31T09:25:03.396+0200 >>>>>>>>> 7f0e6a6ca700 10 mds.0.cache | |______ 3 rep [dir >>>>>>>>> 0x100000218fe.101111101* /storage/REDACTED/| ptrwaiter=0 >>>>>>>>> request=0 >>>>>>>>> child=0 frozen=0 subtree=1 replicated=0 dirty=0 waiter=0 >>>>>>>>> authpin=0 >>>>>>>>> tempexporting=0 0x5607759d9600] >>>>>>>>> >>>>>>>>> May 31 09:25:03 XXX ceph-mds[3268481]: >>>>>>>>> 2023-05-31T09:25:03.452+0200 >>>>>>>>> 7f0e6a6ca700 10 mds.0.cache | | |____ 4 rep [dir >>>>>>>>> 0x100000ff904.100111101010* /storage/REDACTED/| ptrwaiter=0 >>>>>>>>> request=0 child=0 frozen=0 subtree=1 importing=0 replicated=0 >>>>>>>>> waiter=0 authpin=0 tempexporting=0 0x56034ed25200] >>>>>>>>> >>>>>>>>> May 31 09:25:03 XXX ceph-mds[3268481]: >>>>>>>>> 2023-05-31T09:25:03.716+0200 >>>>>>>>> 7f0e6becd700 10 mds.0.server set_trace_dist snaprealm >>>>>>>>> snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2 >>>>>>>>> snaps={185f=snap(185f 0x10000000000 'monthly_20221201' >>>>>>>>> 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 >>>>>>>>> 'monthly_20230101' >>>>>>>>> 2023-01-01T00:00:04.657252+0100),1941=snap(1941 >>>>>>>>> 0x10000000000 'monthly_20230201' >>>>>>>>> 2023-02-01T00:00:01.854059+0100),19a6=snap(19a6 0x10000000000 >>>>>>>>> 'monthly_20230301' >>>>>>>>> 2023-03-01T00:00:01.215197+0100),1a24=snap(1a24 >>>>>>>>> 0x10000000000 'monthly_20230401' ...) len=384 >>>>>>>>> >>>>>>>>> May 31 09:25:36 deltaweb055 ceph-mds[3268481]: >>>>>>>>> 2023-05-31T09:25:36.076+0200 7f0e6becd700 10 >>>>>>>>> mds.0.cache.ino(0x10004d74911) remove_client_cap last cap, >>>>>>>>> leaving >>>>>>>>> realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2 >>>>>>>>> snaps={185f=snap(185f 0x10000000000 'monthly_20221201' >>>>>>>>> 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 >>>>>>>>> 'monthly_20230101' ...) >>>>>>>>> >>>>>>>>> The daily_*, montly_* etc. names are the names of our regular >>>>>>>>> snapshots. >>>>>>>>> >>>>>>>>> I posted a larger log file snippet using ceph-post-file with >>>>>>>>> the ID: >>>>>>>>> da0eb93d-f340-4457-8a3f-434e8ef37d36 >>>>>>>>> >>>>>>>>> Is it possible that the MDS are trimming old snapshots without >>>>>>>>> taking care not to fill up the entire metadata pool? >>>>>>>>> >>>>>>>>> Cheers >>>>>>>>> Janek >>>>>>>>> _______________________________________________ >>>>>>>>> ceph-users mailing list -- ceph-users(a)ceph.io >>>>>>>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io >>>>>>>> _______________________________________________ >>>>>>>> ceph-users mailing list -- ceph-users(a)ceph.io >>>>>>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io >>>>>>> _______________________________________________ >>>>>>> ceph-users mailing list -- ceph-users(a)ceph.io >>>>>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io >>>>>> -- >>>>>> >>>>>> Bauhaus-Universität Weimar >>>>>> Bauhausstr. 9a, R308 >>>>>> 99423 Weimar, Germany >>>>>> >>>>>> Phone: +49 3643 58 3577 >>>>>> www.webis.de >>>>>> _______________________________________________ >>>>>> ceph-users mailing list -- ceph-users(a)ceph.io >>>>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io >>>> _______________________________________________ >>>> ceph-users mailing list -- ceph-users(a)ceph.io >>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io >>>

-- Bauhaus-Universität Weimar Bauhausstr. 9a, R308 99423 Weimar, Germany Phone: +49 3643 58 3577 www.webis.de

Patrick Donnelly

8 Jun 8 Jun

8:52 p.m.

On Mon, Jun 5, 2023 at 11:48 AM Janek Bevendorff <janek.bevendorff(a)uni-weimar.de> wrote:

...

Probably only ceph config mds mds_go_bad_corrupt_dentry false was necessary for recovery. You don't have any logs showing it hit those asserts? I'm afraid your ceph-post-file logs were lost to the nether. AFAICT, our ceph-post-file storage has been non-functional since the beginning of the lab outage last year. We're looking into it.

...

While writing this, the metadata pool size has reduced from 6TiB back to 440GiB. I am starting to think that the fill-ups may also be connected to the corruption issue.

Extremely unlikely.

...

I also noticed that the ranks 3 and 4 always have huge journals. An inspection using ceph-journal-tool takes forever and consumes 50GB of memory in the process. Listing the events in the journal is impossible without running out of RAM. Ranks 0, 1, and 2 don't have this problem and this wasn't a problem for ranks 3 and 4 either before the fill-ups started happening.

So clearly (a) an incredible number of journal events are being logged and (b) trimming is slow or unable to make progress. I'm looking into why but you can help by running the attached script when the problem is occurring so I can investigate. I'll need a tarball of the outputs. Also, in the off-chance this is related to the MDS balancer, please disable it since you're using ephemeral pinning: ceph config set mds mds_bal_interval 0 -- Patrick Donnelly, Ph.D. He / Him / His Red Hat Partner Engineer IBM, Inc. GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

Janek Bevendorff

9 Jun 9 Jun

8:27 a.m.

Hi Patrick,

...

I'm afraid your ceph-post-file logs were lost to the nether. AFAICT, our ceph-post-file storage has been non-functional since the beginning of the lab outage last year. We're looking into it.

I have it here still. Any other way I can send it to you?

...

Extremely unlikely.

Okay, taking your word for it. But something seems to be stalling journal trimming. We had a similar thing yesterday evening, but at much smaller scale without noticeable pool size increase. I only got an alert that the ceph_mds_log_ev Prometheus metric starting going up again for a single MDS. It grew past 1M events, so I restarted it. I also restarted the other MDS and they all immediately jumped to above 5M events and stayed there. They are, in fact, still there and have decreased only very slightly in the morning. The pool size is totally within a normal range, though, at 290GiB.

...

How do I send it to you if not via ceph-post-file?

...

Also, in the off-chance this is related to the MDS balancer, please disable it since you're using ephemeral pinning: ceph config set mds mds_bal_interval 0

Done. Thanks for your help! Janek -- Bauhaus-Universität Weimar Bauhausstr. 9a, R308 99423 Weimar, Germany Phone: +49 3643 58 3577 www.webis.de

Patrick Donnelly

10 Jun 10 Jun

2:23 p.m.

On Fri, Jun 9, 2023 at 3:27 AM Janek Bevendorff <janek.bevendorff(a)uni-weimar.de> wrote:

...

Hi Patrick,

I'm afraid your ceph-post-file logs were lost to the nether. AFAICT, our ceph-post-file storage has been non-functional since the beginning of the lab outage last year. We're looking into it.

I have it here still. Any other way I can send it to you?

Nevermind, I found the machine it was stored on. It was a misconfiguration caused by post-lab-outage rebuilds.

...

Extremely unlikely.

Please keep monitoring it. I think you're not the only cluster to experience this.

...

How do I send it to you if not via ceph-post-file?

It should work soon next week. We're moving the drop.ceph.com service to a standalone VM soonish.

...

Also, in the off-chance this is related to the MDS balancer, please disable it since you're using ephemeral pinning: ceph config set mds mds_bal_interval 0

Done. Thanks for your help! Janek -- Bauhaus-Universität Weimar Bauhausstr. 9a, R308 99423 Weimar, Germany Phone: +49 3643 58 3577 www.webis.de

-- Patrick Donnelly, Ph.D. He / Him / His Red Hat Partner Engineer IBM, Inc. GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

Janek Bevendorff

12 Jun 12 Jun

10:31 a.m.

Good news: We haven't had any new fill-ups so far. On the contrary, the pool size is as small as it's ever been (200GiB). Bad news: The MDS are still acting strangely. I have very uneven session load and I don't know where it comes from. ceph_mds_sessions_total_load reports a number of 1.4 million on mds.3, whereas all the others are mostly idle. I checked the client list on that rank, but the heaviest client has about 8k caps, which isn't very much at all. Most have 0 or 1. I don't see any blocked ops in flight. I don't think this is to do with the disabled balancer, because I've seen this pattern before. The event log size of 3/5 MDS is also very high, still. mds.1, mds.3, and mds.4 report between 4 and 5 million events, mds.0 around 1.4 million and mds.2 between 0 and 200,000. The numbers have been constant since my last MDS restart four days ago. I ran your ceph-gather.sh script a couple of times, but dumps only mds.0. Should I modify it to dump mds.3 instead so you can have a look? Janek On 10/06/2023 15:23, Patrick Donnelly wrote:

...

On Fri, Jun 9, 2023 at 3:27 AM Janek Bevendorff <janek.bevendorff(a)uni-weimar.de> wrote:

Hi Patrick,

I'm afraid your ceph-post-file logs were lost to the nether. AFAICT, our ceph-post-file storage has been non-functional since the beginning of the lab outage last year. We're looking into it.

I have it here still. Any other way I can send it to you?

Nevermind, I found the machine it was stored on. It was a misconfiguration caused by post-lab-outage rebuilds.

Extremely unlikely.

Please keep monitoring it. I think you're not the only cluster to experience this.

How do I send it to you if not via ceph-post-file?

It should work soon next week. We're moving the drop.ceph.com service to a standalone VM soonish.

Also, in the off-chance this is related to the MDS balancer, please disable it since you're using ephemeral pinning: ceph config set mds mds_bal_interval 0

Done. Thanks for your help! Janek -- Bauhaus-Universität Weimar Bauhausstr. 9a, R308 99423 Weimar, Germany Phone: +49 3643 58 3577 www.webis.de

-- Bauhaus-Universität Weimar Bauhausstr. 9a, R308 99423 Weimar, Germany Phone: +49 3643 58 3577 www.webis.de

Patrick Donnelly

16 Jun 16 Jun

7:37 p.m.

Hi Janek, On Mon, Jun 12, 2023 at 5:31 AM Janek Bevendorff <janek.bevendorff(a)uni-weimar.de> wrote:

...

Good news: We haven't had any new fill-ups so far. On the contrary, the pool size is as small as it's ever been (200GiB).

Great!

...

Bad news: The MDS are still acting strangely. I have very uneven session load and I don't know where it comes from. ceph_mds_sessions_total_load reports a number of 1.4 million on mds.3, whereas all the others are mostly idle. I checked the client list on that rank, but the heaviest client has about 8k caps, which isn't very much at all. Most have 0 or 1. I don't see any blocked ops in flight. I don't think this is to do with the disabled balancer, because I've seen this pattern before.

That's interesting... I don't have an explanation.

...

The event log size of 3/5 MDS is also very high, still. mds.1, mds.3, and mds.4 report between 4 and 5 million events, mds.0 around 1.4 million and mds.2 between 0 and 200,000. The numbers have been constant since my last MDS restart four days ago. I ran your ceph-gather.sh script a couple of times, but dumps only mds.0. Should I modify it to dump mds.3 instead so you can have a look?

Yes, please.

...

Janek On 10/06/2023 15:23, Patrick Donnelly wrote:

On Fri, Jun 9, 2023 at 3:27 AM Janek Bevendorff <janek.bevendorff(a)uni-weimar.de> wrote:

Hi Patrick,

I'm afraid your ceph-post-file logs were lost to the nether. AFAICT, our ceph-post-file storage has been non-functional since the beginning of the lab outage last year. We're looking into it.

I have it here still. Any other way I can send it to you?

Nevermind, I found the machine it was stored on. It was a misconfiguration caused by post-lab-outage rebuilds.

Extremely unlikely.

Please keep monitoring it. I think you're not the only cluster to experience this.

How do I send it to you if not via ceph-post-file?

It should work soon next week. We're moving the drop.ceph.com service to a standalone VM soonish.

Also, in the off-chance this is related to the MDS balancer, please disable it since you're using ephemeral pinning: ceph config set mds mds_bal_interval 0

Done. Thanks for your help! Janek -- Bauhaus-Universität Weimar Bauhausstr. 9a, R308 99423 Weimar, Germany Phone: +49 3643 58 3577 www.webis.de

-- Bauhaus-Universität Weimar Bauhausstr. 9a, R308 99423 Weimar, Germany Phone: +49 3643 58 3577 www.webis.de

-- Patrick Donnelly, Ph.D. He / Him / His Red Hat Partner Engineer IBM, Inc. GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

Janek Bevendorff

19 Jun 19 Jun

4:48 p.m.

Hi Patrick,

...

Yes, please.

The session load on mds.3 had already resolved itself after a few days, so I cannot reproduce it any more. Right now, mds.0 has the highest load and a steadily growing event log, but it's not crazy (yet). Nonetheless, I've sent you my dumps with upload ID b95ee882-21e1-4ea1-a419-639a86acc785. The older dumps are from when mds.3 was under load, but they are all from mds.0. I also attached a newer batch, which I created just a few minutes ago. Janek -- Bauhaus-Universität Weimar Bauhausstr. 9a, R308 99423 Weimar, Germany Phone: +49 3643 58 3577 www.webis.de

334

days inactive

353

days old

ceph-users@ceph.io

Manage subscription

16 comments

3 participants

tags (0)

participants (3)

Dan van der Ster
Janek Bevendorff
Patrick Donnelly