Looks like the image attachment got removed. Please find it here:
https://imgur.com/a/3tabzCN
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder <frans(a)dtu.dk>
Sent: 31 August 2020 14:42
To: Mark Nelson; Dan van der Ster; ceph-users
Subject: [ceph-users] Re: OSD memory leak?
Hi Dan and Mark,
sorry, it took a bit longer. I uploaded a new archive (https://files.dtu.dk/u/jb0uS6U9LlCfvS5L/heap_profiling-2020-08-31.tgz?l - valid 60 days) containing files with the following naming:
- osd.195.profile.*.heap - raw heap dump file
- osd.195.profile.*.heap.txt - output of conversion with --text
- osd.195.profile.*.heap-base0001.txt - output of conversion with --text against the first dump as base
- osd.195.*.heap_stats - output of "ceph daemon osd.195 heap stats", every hour
- osd.195.*.mempools - output of "ceph daemon osd.195 dump_mempools", every hour
- osd.195.*.perf - output of "ceph daemon osd.195 perf dump", every hour; the counters are reset after each dump
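For reference, hourly snapshots like the ones above can be collected with a simple loop over the admin socket. This is a minimal sketch, not the script actually used: the admin-socket commands are standard Ceph, but the OSD id, output directory, file naming, and interval are assumptions.

```shell
#!/bin/sh
# Sketch: hourly memory diagnostics for one OSD (adjust OSD id and output path).
OSD=195
OUT=/var/log/ceph-memprof       # hypothetical output directory
mkdir -p "$OUT"
while true; do
    TS=$(date +%Y-%m-%d_%H%M)
    # tcmalloc heap statistics from the admin socket
    ceph daemon "osd.$OSD" heap stats    > "$OUT/osd.$OSD.$TS.heap_stats"
    # OSD/bluestore mempool accounting
    ceph daemon "osd.$OSD" dump_mempools > "$OUT/osd.$OSD.$TS.mempools"
    # performance counters, then reset so each file covers roughly one hour
    ceph daemon "osd.$OSD" perf dump     > "$OUT/osd.$OSD.$TS.perf"
    ceph daemon "osd.$OSD" perf reset all
    sleep 3600
done
```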
Converted files are included only for the last couple of days; post-conversion of everything simply takes too long.
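The conversion step behind the .heap.txt files can be reproduced with google-perftools' pprof. A sketch, assuming the dumps were taken against /usr/bin/ceph-osd; the dump numbers are placeholders:

```shell
# Convert a raw tcmalloc heap dump to text
pprof --text /usr/bin/ceph-osd osd.195.profile.0001.heap \
    > osd.195.profile.0001.heap.txt
# The same dump, but showing allocations relative to the first dump,
# which highlights what has grown since profiling started
pprof --text --base=osd.195.profile.0001.heap /usr/bin/ceph-osd \
    osd.195.profile.0002.heap > osd.195.profile.0002.heap-base0001.txt
```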
Please also find attached a recording of memory usage on one of the affected OSD nodes. I marked restarts of all OSDs/the host with vertical red lines. What is worrying is the self-amplifying nature of the leak: it is not a linear process, it looks at least quadratic if not exponential. Given the comparably short uptime, what we are looking for is probably still in the lower percentages of total memory, but growing at an increasing rate. The OSDs have just started to overrun their limit:
top - 14:38:49 up 155 days, 19:17, 1 user, load average: 5.99, 4.59, 4.59
Tasks: 684 total, 1 running, 293 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.9 us, 0.9 sy, 0.0 ni, 89.6 id, 7.6 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 65727628 total, 6937548 free, 41921260 used, 16868820 buff/cache
KiB Swap: 93532160 total, 90199040 free, 3333120 used. 6740136 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4099023 ceph 20 0 5918704 3.8g 9700 S 1.7 6.1 378:37.01 /usr/bin/ceph-osd --cluster ceph -f -i 35 --setuser cep+
4097639 ceph 20 0 5340924 3.0g 11428 S 87.1 4.7 14636:30 /usr/bin/ceph-osd --cluster ceph -f -i 195 --setuser ce+
4097974 ceph 20 0 3648188 2.3g 9628 S 8.3 3.6 1375:58 /usr/bin/ceph-osd --cluster ceph -f -i 201 --setuser ce+
4098322 ceph 20 0 3478980 2.2g 9688 S 5.3 3.6 1426:05 /usr/bin/ceph-osd --cluster ceph -f -i 223 --setuser ce+
4099374 ceph 20 0 3446784 2.2g 9252 S 4.6 3.5 1142:14 /usr/bin/ceph-osd --cluster ceph -f -i 205 --setuser ce+
4098679 ceph 20 0 3832140 2.2g 9796 S 6.6 3.5 1248:26 /usr/bin/ceph-osd --cluster ceph -f -i 132 --setuser ce+
4100782 ceph 20 0 3641608 2.2g 9652 S 7.9 3.5 1278:10 /usr/bin/ceph-osd --cluster ceph -f -i 207 --setuser ce+
4095944 ceph 20 0 3375672 2.2g 8968 S 7.3 3.5 1250:02 /usr/bin/ceph-osd --cluster ceph -f -i 108 --setuser ce+
4096956 ceph 20 0 3509376 2.2g 9456 S 7.9 3.5 1157:27 /usr/bin/ceph-osd --cluster ceph -f -i 203 --setuser ce+
4099731 ceph 20 0 3563652 2.2g 8972 S 3.6 3.5 1421:48 /usr/bin/ceph-osd --cluster ceph -f -i 61 --setuser cep+
4096262 ceph 20 0 3531988 2.2g 9040 S 9.9 3.5 1600:15 /usr/bin/ceph-osd --cluster ceph -f -i 121 --setuser ce+
4100442 ceph 20 0 3359736 2.1g 9804 S 4.3 3.4 1185:53 /usr/bin/ceph-osd --cluster ceph -f -i 226 --setuser ce+
4096617 ceph 20 0 3443060 2.1g 9432 S 5.0 3.4 1449:29 /usr/bin/ceph-osd --cluster ceph -f -i 199 --setuser ce+
4097298 ceph 20 0 3483532 2.1g 9600 S 5.6 3.3 1265:28 /usr/bin/ceph-osd --cluster ceph -f -i 97 --setuser cep+
4100093 ceph 20 0 3428348 2.0g 9568 S 3.3 3.2 1298:53 /usr/bin/ceph-osd --cluster ceph -f -i 197 --setuser ce+
4095630 ceph 20 0 3440160 2.0g 8976 S 3.6 3.2 1451:35 /usr/bin/ceph-osd --cluster ceph -f -i 62 --setuser cep+
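One way to check the "at least quadratic, if not exponential" impression against the recorded RSS samples: exponential growth is a straight line on a semi-log plot (log(mem) vs t), while a power law such as quadratic growth is a straight line on a log-log plot (log(mem) vs log(t)). A minimal sketch with synthetic data; the sample values are made up, not taken from the recording:

```python
import math

def straightness(xs, ys):
    # Pearson correlation; values near 1 mean the points lie close to a straight line.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (sx * sy)

# Hypothetical hourly RSS samples in MiB (synthetic, exactly exponential: 100 * e^(0.05 t))
t = list(range(1, 49))
mem = [100 * math.exp(0.05 * ti) for ti in t]

# Exponential growth -> log(mem) vs t is a straight line
r_exp = straightness(t, [math.log(m) for m in mem])
# Power-law growth (e.g. quadratic) -> log(mem) vs log(t) is a straight line
r_pow = straightness([math.log(ti) for ti in t], [math.log(m) for m in mem])

print(f"semi-log fit r={r_exp:.4f}, log-log fit r={r_pow:.4f}")
```

Whichever fit is closer to a straight line indicates the growth regime; on real samples, neither will be perfect.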
Generally speaking, increasing the cache minimum seems to help with keeping important
information in RAM. Unfortunately, it also means that swap usage starts much earlier.
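For the cache-minimum tuning mentioned above, the relevant Ceph options are osd_memory_target (the per-daemon budget the cache autotuner aims for) and osd_memory_cache_min (the floor below which it will not shrink the caches). The values below are placeholders for illustration, not recommendations:

```shell
# Per-OSD memory budget the autotuner tries to stay under (placeholder: 4 GiB)
ceph config set osd osd_memory_target 4294967296
# Lower bound below which the cache autotuner will not shrink the caches (placeholder: 2 GiB)
ceph config set osd osd_memory_cache_min 2147483648
```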
Best regards and thanks for your help,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14