Dear Mark and Dan,
I'm in the process of restarting all OSDs and could use some quick advice on bluestore
cache settings. My plan is to set higher minimum values and deal with accumulated excess
usage via regular restarts. Looking at the documentation
(https://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/), I find
the following relevant options (with defaults):
# Automatic Cache Sizing
osd_memory_target    {4294967296}  # 4GB
osd_memory_base      {805306368}   # 768MB
osd_memory_cache_min {134217728}   # 128MB
# Manual Cache Sizing
bluestore_cache_meta_ratio {.4}             # 40% ?
bluestore_cache_kv_ratio   {.4}             # 40% ?
bluestore_cache_kv_max     {512*1024*1024}  # 512MB
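To make the plan concrete, it could be applied as a ceph.conf fragment along these lines; the 1 GiB value is a hypothetical example for illustration, not a tested recommendation:

```ini
[osd]
# Example values only - illustrative, not tested recommendations.
osd_memory_target    = 4294967296   # keep the default 4 GiB overall target
osd_memory_cache_min = 1073741824   # raise the cache floor from 128 MiB to 1 GiB
```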
Q1) If I increase osd_memory_cache_min, should I also increase osd_memory_base by the same
or some other amount?
Q2) The cache ratio options are shown under the section "Manual Cache Sizing".
Do they also apply when cache auto tuning is enabled? If so, is it worth changing these
defaults for higher values of osd_memory_cache_min?
Many thanks for your help with this. I can't find answers to these questions in the
docs.
There might be two reasons for the high osdmap memory usage. One is that our OSDs seem to
hold a large number of OSD maps:
OSD.208, uptime 113-13:44:32:
# ceph daemon osd.208 status
{
"cluster_fsid": "e4ece518-f2cb-4708-b00f-b6bf511e91d9",
"osd_fsid": "16891b16-b4d8-418b-a6ba-34b85921e809",
"whoami": 208,
"state": "active",
"oldest_map": 162084,
"newest_map": 162766,
"num_pgs": 96
}
OSD.211, uptime 2-18:56:49:
# ceph daemon osd.211 status
{
"cluster_fsid": "e4ece518-f2cb-4708-b00f-b6bf511e91d9",
"osd_fsid": "81cb5da4-bf12-42a3-b9d4-9d4fba2a58fd",
"whoami": 211,
"state": "active",
"oldest_map": 162084,
"newest_map": 162766,
"num_pgs": 98
}
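As a quick sanity check, the number of retained osdmap epochs can be computed directly from the two status dumps above (a small sketch; the figures are copied from the output):

```python
# osdmap ranges reported by "ceph daemon osd.N status" above
status_208 = {"oldest_map": 162084, "newest_map": 162766}
status_211 = {"oldest_map": 162084, "newest_map": 162766}

def maps_held(status):
    """Number of osdmap epochs the OSD currently retains."""
    return status["newest_map"] - status["oldest_map"] + 1

print(maps_held(status_208))  # 683
print(maps_held(status_211))  # 683
```

Both OSDs retain the same 683 epochs, so the retained map count alone cannot explain the difference in memory usage.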
A long-running OSD and a freshly restarted one hold the same number of osdmaps. However,
the long-running one has accumulated more than 1 GiB of extra memory usage (3067.6 MiB
versus 1951.4 MiB):
# ceph daemon osd.208 heap stats
osd.208 tcmalloc heap stats:------------------------------------------------
MALLOC: 2356637072 ( 2247.5 MiB) Bytes in use by application
MALLOC: + 5742592 ( 5.5 MiB) Bytes in page heap freelist
MALLOC: + 822018216 ( 783.9 MiB) Bytes in central cache freelist
MALLOC: + 491520 ( 0.5 MiB) Bytes in transfer cache freelist
MALLOC: + 11850184 ( 11.3 MiB) Bytes in thread cache freelists
MALLOC: + 19922944 ( 19.0 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 3216662528 ( 3067.6 MiB) Actual memory used (physical + swap)
MALLOC: + 249790464 ( 238.2 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 3466452992 ( 3305.9 MiB) Virtual address space used
MALLOC:
MALLOC: 293267 Spans in use
MALLOC: 36 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
------------------------------------------------
# ceph daemon osd.208 dump_mempools
{
"mempool": {
"by_pool": {
"bloom_filter": {
"items": 0,
"bytes": 0
},
"bluestore_alloc": {
"items": 4691828,
"bytes": 37534624
},
"bluestore_cache_data": {
"items": 0,
"bytes": 0
},
"bluestore_cache_onode": {
"items": 19,
"bytes": 10792
},
"bluestore_cache_other": {
"items": 5720554,
"bytes": 46039919
},
"bluestore_fsck": {
"items": 0,
"bytes": 0
},
"bluestore_txc": {
"items": 18,
"bytes": 12384
},
"bluestore_writing_deferred": {
"items": 214,
"bytes": 26514641
},
"bluestore_writing": {
"items": 64,
"bytes": 10206779
},
"bluefs": {
"items": 9735,
"bytes": 188984
},
"buffer_anon": {
"items": 292345,
"bytes": 67468304
},
"buffer_meta": {
"items": 562,
"bytes": 35968
},
"osd": {
"items": 96,
"bytes": 1115904
},
"osd_mapbl": {
"items": 80,
"bytes": 8501746
},
"osd_pglog": {
"items": 328703,
"bytes": 117673864
},
"osdmap": {
"items": 12101478,
"bytes": 210941392
},
"osdmap_mapping": {
"items": 0,
"bytes": 0
},
"pgmap": {
"items": 0,
"bytes": 0
},
"mds_co": {
"items": 0,
"bytes": 0
},
"unittest_1": {
"items": 0,
"bytes": 0
},
"unittest_2": {
"items": 0,
"bytes": 0
}
},
"total": {
"items": 23145696,
"bytes": 526245301
}
}
}
# ceph daemon osd.211 heap stats
osd.211 tcmalloc heap stats:------------------------------------------------
MALLOC: 1727399344 ( 1647.4 MiB) Bytes in use by application
MALLOC: + 532480 ( 0.5 MiB) Bytes in page heap freelist
MALLOC: + 262860912 ( 250.7 MiB) Bytes in central cache freelist
MALLOC: + 11693568 ( 11.2 MiB) Bytes in transfer cache freelist
MALLOC: + 29694944 ( 28.3 MiB) Bytes in thread cache freelists
MALLOC: + 14024704 ( 13.4 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 2046205952 ( 1951.4 MiB) Actual memory used (physical + swap)
MALLOC: + 229212160 ( 218.6 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 2275418112 ( 2170.0 MiB) Virtual address space used
MALLOC:
MALLOC: 145115 Spans in use
MALLOC: 32 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
------------------------------------------------
# ceph daemon osd.211 dump_mempools
{
"mempool": {
"by_pool": {
"bloom_filter": {
"items": 0,
"bytes": 0
},
"bluestore_alloc": {
"items": 4691828,
"bytes": 37534624
},
"bluestore_cache_data": {
"items": 894,
"bytes": 163053568
},
"bluestore_cache_onode": {
"items": 165536,
"bytes": 94024448
},
"bluestore_cache_other": {
"items": 33936718,
"bytes": 233428234
},
"bluestore_fsck": {
"items": 0,
"bytes": 0
},
"bluestore_txc": {
"items": 110,
"bytes": 75680
},
"bluestore_writing_deferred": {
"items": 38,
"bytes": 6061245
},
"bluestore_writing": {
"items": 0,
"bytes": 0
},
"bluefs": {
"items": 9956,
"bytes": 189640
},
"buffer_anon": {
"items": 293298,
"bytes": 59950954
},
"buffer_meta": {
"items": 1005,
"bytes": 64320
},
"osd": {
"items": 98,
"bytes": 1139152
},
"osd_mapbl": {
"items": 80,
"bytes": 8501690
},
"osd_pglog": {
"items": 350517,
"bytes": 132253139
},
"osdmap": {
"items": 633498,
"bytes": 10866360
},
"osdmap_mapping": {
"items": 0,
"bytes": 0
},
"pgmap": {
"items": 0,
"bytes": 0
},
"mds_co": {
"items": 0,
"bytes": 0
},
"unittest_1": {
"items": 0,
"bytes": 0
},
"unittest_2": {
"items": 0,
"bytes": 0
}
},
"total": {
"items": 40083576,
"bytes": 747143054
}
}
}
Same disk type, same memory_target.
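One way to quantify the difference is to subtract the mempool totals from tcmalloc's "Bytes in use by application"; whatever remains is memory not tracked by any mempool. A small sketch using the numbers from the dumps above:

```python
MiB = 1024 * 1024

# "Bytes in use by application" from the tcmalloc heap stats above
heap_app = {"osd.208": 2356637072, "osd.211": 1727399344}
# "total" -> "bytes" from the corresponding dump_mempools output
mempool_total = {"osd.208": 526245301, "osd.211": 747143054}

for osd in ("osd.208", "osd.211"):
    untracked = heap_app[osd] - mempool_total[osd]
    print(f"{osd}: {untracked / MiB:.1f} MiB not tracked by any mempool")

# The long-running osd.208 also holds ~211 MB in its osdmap mempool for
# only 683 retained epochs, i.e. roughly 300 KiB per map:
print(f"{210941392 / 683 / 1024:.0f} KiB per retained osdmap")
```

On these figures the long-running osd.208 has about 1.7 GiB of untracked memory versus about 0.9 GiB for the freshly restarted osd.211, consistent with the extra ~1 GiB visible in the heap stats.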
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder <frans@dtu.dk>
Sent: 16 July 2020 09:11
To: Mark Nelson; Dan van der Ster
Cc: ceph-users
Subject: [ceph-users] Re: OSD memory leak?
Dear Dan, cc Mark,
this sounds exactly like the scenario I'm looking at. We have rolling snapshots on the RBD
images of currently ca. 200 VMs, and that number is increasing. Snapshots are taken daily
with different retention periods.
We have two pools with separate hardware backing RBD and cephfs. The memory stats I sent
are from an OSD backing cephfs, which currently does not have any snapshots. So the
snapshots on other OSDs influence the memory usage of OSDs that have nothing to do with
the RBD images. I also noticed a significant drop in memory usage across the cluster after
restarting the OSDs on just one host. I'm not sure whether that is expected either.
It looks like the OSDs collect dead baggage quite fast, and osd_memory_target shrinks the
caches in an attempt to accommodate it. The fact that the kernel swaps this memory out in
favour of disk buffers - on a system with low swappiness whose only local disk access is
syslog - indicates memory that is allocated but never touched, i.e. a quite massive leak.
It currently looks like the leakage exceeds the memory target after only a couple of days.
I don't want my operations team to deal with the occasional OOM kill. For now I will
probably adopt the reverse strategy: give up on osd_memory_target doing something useful,
increase the minimum cache limits to ensure at least some caching, let swap absorb the
leak, and restart the OSDs regularly (every 2-3 months).
Would be good if this could be looked at. Please let me know if there is some data I can
provide.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Mark Nelson <mnelson@redhat.com>
Sent: 15 July 2020 18:36:06
To: Dan van der Ster
Cc: ceph-users
Subject: [ceph-users] Re: OSD memory leak?
On 7/15/20 9:58 AM, Dan van der Ster wrote:
Hi Mark,
On Mon, Jul 13, 2020 at 3:42 PM Mark Nelson <mnelson@redhat.com> wrote:
Hi Frank,
So the osd_memory_target code will basically shrink the size of the
bluestore and rocksdb caches to attempt to keep the overall mapped (not
rss!) memory of the process below the target. It's sort of "best
effort" in that it can't guarantee the process will fit within a given
target, it will just (assuming we are over target) shrink the caches up
to some minimum value and that's it. 2GB per OSD is a pretty ambitious
target. It's the lowest osd_memory_target we recommend setting. I'm a
little surprised the OSD is consuming this much memory with a 2GB target
though.
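The autotuner behaviour described here can be paraphrased as a toy model (a simplification for illustration only, with invented names, not the actual Ceph code):

```python
def autotune_cache(mapped_bytes, target, cache_bytes, cache_min):
    """Toy model of osd_memory_target as described: if mapped memory
    exceeds the target, shrink the cache budget by the overshoot,
    but never below the configured minimum."""
    if mapped_bytes > target:
        over = mapped_bytes - target
        return max(cache_min, cache_bytes - over)
    return cache_bytes

GiB, MiB = 2**30, 2**20
# 1 GiB over target: cache budget drops from 2 GiB to 1 GiB
print(autotune_cache(5 * GiB, 4 * GiB, 2 * GiB, 128 * MiB) // GiB)  # 1
# Far over target: the cache bottoms out at the minimum and stays there,
# which is why non-cache memory growth cannot be compensated further.
print(autotune_cache(7 * GiB, 4 * GiB, 1 * GiB, 128 * MiB) // MiB)  # 128
```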
Looking at your mempool dump I see very little memory allocated to the
caches. In fact the majority is taken up by osdmap (looks like you have
a decent number of OSDs) and pglog. That indicates that the memory
Do you know if
this high osdmap usage is known already?
Our big block storage cluster generates a new osdmap every few seconds
(due to rbd snap trimming) and we see the osdmap mempool usage growing
over a few months until osds start getting OOM killed.
Today we proactively restarted them because the osdmap_mempool was
using close to 700MB.
So it seems that whatever is supposed to be trimming is not working.
(This is observed with nautilus 14.2.8 but iirc it has been the same
even when we were running luminous and mimic too)
Cheers, Dan
Hrm, it hasn't been on my radar, though looking back through the mailing
list there appear to have been various reports of high usage over the years
(some of which have theoretically been fixed). Maybe submit a tracker
issue? 700MB seems quite high for osdmap, but I don't really know the
retention rules so someone else who knows that code better will have to
chime in.
> autotuning is probably working but simply can't do anything more to
> help. Something else is taking up the memory. Figure you've got a
> little shy of 500MB for the mempools. RocksDB will take up more (and
> potentially quite a bit more if you have memtables backing up waiting to
> be flushed to L0) and potentially some other things in the OSD itself
> that could take up memory. If you feel comfortable experimenting, you
> could try changing the rocksdb WAL/memtable settings. By default we
> have up to 4 256MB WAL buffers. Instead you could try something like 2
> 64MB buffers, but be aware this could cause slow performance or even
> temporary write stalls if you have fast storage. Still, this would only
> give you up to ~0.9GB back. Since you are on mimic, you might also want
> to check what your kernel's transparent huge pages configuration is. I
> don't remember if we backported Patrick's fix to always avoid THP for
> ceph processes. If your kernel is set to "always", you might consider
> trying it with "madvise".
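For reference, the active THP mode is shown in brackets in /sys/kernel/mm/transparent_hugepage/enabled (e.g. "always [madvise] never"). A minimal sketch of checking it; the sample string is an assumption, not output from this cluster:

```python
def active_thp_mode(enabled_text):
    """Return the bracketed (active) mode from the sysfs 'enabled' file,
    e.g. 'always [madvise] never' -> 'madvise'."""
    return enabled_text.split("[", 1)[1].split("]", 1)[0]

# On a live system, read the real value with:
#   open("/sys/kernel/mm/transparent_hugepage/enabled").read()
sample = "always [madvise] never"
print(active_thp_mode(sample))  # madvise
```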
>
> Alternately, have you tried the built-in tcmalloc heap profiler? You
> might be able to get a better sense of where memory is being used with
> that as well.
>
>
> Mark
>
>
> On 7/13/20 7:07 AM, Frank Schilder wrote:
>> Hi all,
>>
>> on a mimic 13.2.8 cluster I observe a gradual increase of memory usage by OSD
daemons, in particular under heavy load. For our spinners I use osd_memory_target=2G. The
daemons overrun the 2G in virt size rather quickly and grow to something like 4G virtual,
while the real memory consumption stays more or less around the 2G target. There are some
overshoots, but these go down again during periods with less load.
>>
>> What I observe now is that the actual memory consumption slowly grows and OSDs
start using more than 2G virtual memory. I see this as slowly growing swap usage despite
having more RAM available (swappiness=10). This indicates allocated but unused memory or
memory not accessed for a long time, usually a leak. Here some heap stats:
>>
>> Before restart:
>> osd.101 tcmalloc heap stats:------------------------------------------------
>> MALLOC: 3438940768 ( 3279.6 MiB) Bytes in use by application
>> MALLOC: + 5611520 ( 5.4 MiB) Bytes in page heap freelist
>> MALLOC: + 257307352 ( 245.4 MiB) Bytes in central cache freelist
>> MALLOC: + 357376 ( 0.3 MiB) Bytes in transfer cache freelist
>> MALLOC: + 6727368 ( 6.4 MiB) Bytes in thread cache freelists
>> MALLOC: + 25559040 ( 24.4 MiB) Bytes in malloc metadata
>> MALLOC: ------------
>> MALLOC: = 3734503424 ( 3561.5 MiB) Actual memory used (physical + swap)
>> MALLOC: + 575946752 ( 549.3 MiB) Bytes released to OS (aka unmapped)
>> MALLOC: ------------
>> MALLOC: = 4310450176 ( 4110.8 MiB) Virtual address space used
>> MALLOC:
>> MALLOC: 382884 Spans in use
>> MALLOC: 35 Thread heaps in use
>> MALLOC: 8192 Tcmalloc page size
>> ------------------------------------------------
>> # ceph daemon osd.101 dump_mempools
>> {
>> "mempool": {
>> "by_pool": {
>> "bloom_filter": {
>> "items": 0,
>> "bytes": 0
>> },
>> "bluestore_alloc": {
>> "items": 4691828,
>> "bytes": 37534624
>> },
>> "bluestore_cache_data": {
>> "items": 0,
>> "bytes": 0
>> },
>> "bluestore_cache_onode": {
>> "items": 51,
>> "bytes": 28968
>> },
>> "bluestore_cache_other": {
>> "items": 5761276,
>> "bytes": 46292425
>> },
>> "bluestore_fsck": {
>> "items": 0,
>> "bytes": 0
>> },
>> "bluestore_txc": {
>> "items": 67,
>> "bytes": 46096
>> },
>> "bluestore_writing_deferred": {
>> "items": 208,
>> "bytes": 26037057
>> },
>> "bluestore_writing": {
>> "items": 52,
>> "bytes": 6789398
>> },
>> "bluefs": {
>> "items": 9478,
>> "bytes": 183720
>> },
>> "buffer_anon": {
>> "items": 291450,
>> "bytes": 28093473
>> },
>> "buffer_meta": {
>> "items": 546,
>> "bytes": 34944
>> },
>> "osd": {
>> "items": 98,
>> "bytes": 1139152
>> },
>> "osd_mapbl": {
>> "items": 78,
>> "bytes": 8204276
>> },
>> "osd_pglog": {
>> "items": 341944,
>> "bytes": 120607952
>> },
>> "osdmap": {
>> "items": 10687217,
>> "bytes": 186830528
>> },
>> "osdmap_mapping": {
>> "items": 0,
>> "bytes": 0
>> },
>> "pgmap": {
>> "items": 0,
>> "bytes": 0
>> },
>> "mds_co": {
>> "items": 0,
>> "bytes": 0
>> },
>> "unittest_1": {
>> "items": 0,
>> "bytes": 0
>> },
>> "unittest_2": {
>> "items": 0,
>> "bytes": 0
>> }
>> },
>> "total": {
>> "items": 21784293,
>> "bytes": 461822613
>> }
>> }
>> }
>>
>> Right after restart + health_ok:
>> osd.101 tcmalloc heap stats:------------------------------------------------
>> MALLOC: 1173996280 ( 1119.6 MiB) Bytes in use by application
>> MALLOC: + 3727360 ( 3.6 MiB) Bytes in page heap freelist
>> MALLOC: + 25493688 ( 24.3 MiB) Bytes in central cache freelist
>> MALLOC: + 17101824 ( 16.3 MiB) Bytes in transfer cache freelist
>> MALLOC: + 20301904 ( 19.4 MiB) Bytes in thread cache freelists
>> MALLOC: + 5242880 ( 5.0 MiB) Bytes in malloc metadata
>> MALLOC: ------------
>> MALLOC: = 1245863936 ( 1188.1 MiB) Actual memory used (physical + swap)
>> MALLOC: + 20488192 ( 19.5 MiB) Bytes released to OS (aka unmapped)
>> MALLOC: ------------
>> MALLOC: = 1266352128 ( 1207.7 MiB) Virtual address space used
>> MALLOC:
>> MALLOC: 54160 Spans in use
>> MALLOC: 33 Thread heaps in use
>> MALLOC: 8192 Tcmalloc page size
>> ------------------------------------------------
>>
>> Am I looking at a memory leak here or are these heap stats expected?
>>
>> I don't mind the swap usage; it has no impact. I'm just
>> wondering if I need to restart OSDs regularly. The "leakage" above occurred
>> within only 2 months.
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-leave@ceph.io
>>