Theoretically we shouldn't be spiking memory as much these days during
recovery, but the code is complicated and it's tough to reproduce these
kinds of issues in-house. If you happen to catch it in the act, do you
see the pglog mempool stats also spiking up?
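
For reference, one way to watch that is to dump the mempool stats from
the OSD's admin socket while it is recovering (a minimal sketch; osd.N
is a placeholder for the affected OSD's id):

    # query the admin socket of the affected OSD
    ceph daemon osd.N dump_mempools

In the JSON output, compare the "osd_pglog" and "buffer_anon" pools
over time to see which one is growing.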
Mark
On 10/21/20 2:34 AM, Dan van der Ster wrote:
Hi,
This might be the pglog issue which has been coming up a few times
on the list.
If the OSD cannot boot without going OOM, you might have success by
trimming the pglog, e.g. search this list for "ceph-objectstore-tool
--op trim-pg-log" for some recipes. The thread "OSDs taking too much
memory, for pglog" in particular might help.
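
As a rough sketch of what those recipes look like (assumptions: the OSD
must be stopped first, and the OSD id N and <pgid> below are
placeholders you need to fill in for your deployment; see the list
threads for the full procedure):

    # stop the OSD that cannot boot
    systemctl stop ceph-osd@N

    # trim the pg log for one PG in that OSD's store
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-N \
        --pgid <pgid> --op trim-pg-log

Repeat per affected PG, then start the OSD again and watch its memory.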
Cheers, Dan
On Tue, Oct 20, 2020 at 11:57 PM Ing. Luis Felipe Domínguez Vega
<luis.dominguez(a)desoft.cu> wrote:
> Hi, today my infra provider had a blackout. Ceph then tried to
> recover but is stuck in an inconsistent state, because many OSDs
> cannot recover: the kernel kills them via OOM. Even now an OSD that
> was fine is going down, OOM-killed.
>
> Even on a server with 32 GB of RAM the OSD uses ALL of it and never
> recovers; I think this could be a memory leak. Ceph version is
> Octopus 15.2.3.
>
> In https://pastebin.pl/view/59089adc you can see that buffer_anon
> reaches 32 GB, but why? My whole cluster is down because of that.
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io