On Sun, Apr 12, 2020 at 9:33 PM Dan van der Ster <dan@vanderster.com> wrote:
Hi John,
Did you make any progress on investigating this?
Today I also saw disproportionately large buffer_anon usage on our two
active MDSs running 14.2.8:
"mempool": {
"by_pool": {
"bloom_filter": {
"items": 2322,
"bytes": 2322
},
...
"buffer_anon": {
"items": 4947214,
"bytes": 19785847411
},
...
"osdmap": {
"items": 4036,
"bytes": 89488
},
...
"mds_co": {
"items": 9248718,
"bytes": 157725128
},
...
},
"total": {
"items": 14202290,
"bytes": 19943664349
}
}
That MDS has `mds cache memory limit = 15353442304`, yet there was no
health warning about the MDS memory usage exceeding the limit.
(I only noticed because some other cron jobs on the MDS hosts were going OOM.)
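(For anyone wanting to cross-check on their own cluster, something like
the following shows the configured limit next to the mempool accounting.
"mds.myhost" is a placeholder; use your daemon's name. If I understand
correctly, the limit and the cache-oversized health check track the
mds_co pool rather than buffer_anon, which would explain the silence.)

  # run on the MDS host, against the daemon's admin socket
  ceph daemon mds.myhost config get mds_cache_memory_limit
  # cache status reports usage from the mds_co mempool
  ceph daemon mds.myhost cache status
  # buffer_anon is accounted separately in the mempool dump
  ceph daemon mds.myhost dump_mempools | jq .mempool.by_pool.buffer_anon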
Patrick: is there any known memory leak in the Nautilus MDS?
I restarted one MDS with ms_type = simple, and that MDS maintained a
normal amount of buffer_anon for several hours, while the other active
MDS (still on the async ms type) saw its buffer_anon grow by some ~10GB
overnight.
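(For completeness, the messenger switch is just the following in
ceph.conf, assuming per-daemon config files rather than the centralized
config store, plus a daemon restart since the messenger type is only
read at startup:)

  [mds]
  # use the legacy SimpleMessenger; the Nautilus default is async+posix
  ms type = simple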
So, it seems there are still memory leaks with ms_type = async in 14.2.8.
OTOH, the whole cluster is kinda broken now due to
https://tracker.ceph.com/issues/45080, which may be related to the
ms_type=simple change... I'm still debugging.
Cheers, Dan
> Any tips to debug this further?
>
> Cheers, Dan
>
> On Wed, Mar 4, 2020 at 8:38 PM John Madden <jmadden.com@gmail.com> wrote:
> >
> > Though it appears potentially(?) better, I'm still having issues with
> > this on 14.2.8. Kick off the ~20 threads sequentially reading ~1M
> > files and buffer_anon still grows apparently without bound.
> >
> > mds.1 tcmalloc heap stats:------------------------------------------------
> > MALLOC: 53710413656 (51222.2 MiB) Bytes in use by application
> > MALLOC: + 0 ( 0.0 MiB) Bytes in page heap freelist
> > MALLOC: + 334028128 ( 318.6 MiB) Bytes in central cache freelist
> > MALLOC: + 11210608 ( 10.7 MiB) Bytes in transfer cache freelist
> > MALLOC: + 11105240 ( 10.6 MiB) Bytes in thread cache freelists
> > MALLOC: + 77525152 ( 73.9 MiB) Bytes in malloc metadata
> > MALLOC: ------------
> > MALLOC: = 54144282784 (51636.0 MiB) Actual memory used (physical + swap)
> > MALLOC: + 49963008 ( 47.6 MiB) Bytes released to OS (aka unmapped)
> > MALLOC: ------------
> > MALLOC: = 54194245792 (51683.7 MiB) Virtual address space used
> > MALLOC:
> > MALLOC: 262021 Spans in use
> > MALLOC: 18 Thread heaps in use
> > MALLOC: 8192 Tcmalloc page size
> > ------------------------------------------------
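> > (For reference, stats like the above come from tcmalloc's admin
> > interface; assuming the daemons are built against tcmalloc, they can
> > be queried at runtime:)
> >
> >   # dump allocator stats for daemon mds.1
> >   ceph tell mds.1 heap stats
> >   # ask tcmalloc to hand free pages back to the OS
> >   ceph tell mds.1 heap release
> >
> > (Though with nearly everything "in use by application" above, a heap
> > release wouldn't reclaim much.)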
> >
> > The byte count appears to grow even as the item count drops, though
> > the trend is for both to increase over the life of the workload:
> > ceph daemon mds.1 dump_mempools | jq .mempool.by_pool.buffer_anon:
> >
> > {
> > "items": 28045,
> > "bytes": 24197601109
> > }
> > {
> > "items": 27132,
> > "bytes": 24262495865
> > }
> > {
> > "items": 27105,
> > "bytes": 24262537939
> > }
> > {
> > "items": 33309,
> > "bytes": 29754507505
> > }
> > {
> > "items": 36160,
> > "bytes": 31803033733
> > }
> > {
> > "items": 56772,
> > "bytes": 51062350351
> > }
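> > (For the record, a minimal loop to collect a series like the above,
> > assuming jq is available and the admin socket is at its default path:)
> >
> >   # sample buffer_anon once a minute; the interval is arbitrary
> >   while true; do
> >     ceph daemon mds.1 dump_mempools | jq -c .mempool.by_pool.buffer_anon
> >     sleep 60
> >   done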
> >
> > Is there further data/debug I can retrieve to help track this down?
> >
> >
> > On Wed, Feb 19, 2020 at 4:38 PM John Madden <jmadden.com@gmail.com> wrote:
> > >
> > > Ah, no, I hadn't seen that. Patiently awaiting .8 then. Thanks!
> > >
> > > On Mon, Feb 17, 2020 at 8:52 AM Dan van der Ster <dan@vanderster.com> wrote:
> > > >
> > > > On Mon, Feb 10, 2020 at 8:31 PM John Madden <jmadden.com@gmail.com> wrote:
> > > > >
> > > > > Upgraded to 14.2.7, doesn't appear to have affected the behavior. As requested:
> > > >
> > > > In case it wasn't clear -- the fix that Patrick mentioned was
> > > > postponed to 14.2.8.
> > > >
> > > > -- dan