[adding dev]
On Wed, 9 Oct 2019, Aaron Johnson wrote:
> Hi all,
>
> I have a smallish test cluster (14 servers, 84 OSDs) running 14.2.4.
> Monthly OS patching, and the reboots that go along with it, have left
> the cluster very unwell.
>
> Many of the servers in the cluster are OOM-killing the ceph-osd
> processes when they try to start (6 OSDs per server, running on
> FileStore). strace shows the ceph-osd processes spending hours
> reading through the 220k osdmap files after being started.
Is the process size growing during this time? There should be a cap on
the size of the OSDMap cache; perhaps there is a regression there.
One common thing to do here is 'ceph osd set noup' and restart the OSD,
and then monitor the OSD's progress catching up on maps with 'ceph daemon
osd.NN status' (compare the epoch to what you get from 'ceph osd dump |
head'). This will take a while if you really are 220k maps (!!!) behind,
but memory usage during that period should be relatively constant.
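A minimal sketch of that epoch comparison, in case it helps. The two
ceph invocations are left as comments and replaced with hypothetical
sample numbers so the sketch runs without a cluster (the exact JSON
field name from the admin socket may vary by release):

```shell
#!/bin/sh
# Sketch: compare one OSD's newest map epoch to the cluster epoch.
# On a live cluster the values would come from:
#   osd_epoch=$(ceph daemon osd.NN status | jq .newest_map)
#   cluster_epoch=$(ceph osd dump | awk 'NR==1 {print $2}')
# Hypothetical sample numbers stand in here for illustration.
osd_epoch=183000
cluster_epoch=403000

behind=$((cluster_epoch - osd_epoch))
if [ "$behind" -gt 0 ]; then
    echo "osd is $behind maps behind"
else
    echo "osd is caught up"
fi
```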
> This behavior started after we recently made it about 72% full to see
> how things behaved. We also upgraded it to Nautilus 14.2.2 at about
> the same time.
>
> I've tried starting just one OSD per server at a time in hopes of
> avoiding the OOM killer. I also tried setting noin, rebooting the
> whole cluster, waiting a day, then marking each of the OSDs in
> manually. The end result is the same either way: about 60% of PGs are
> still down, 30% are peering, and the rest are in worse shape.
In instances like this in the past, getting all OSDs caught up on maps
and then unsetting 'noup' has let them all come up and peer at the same
time. What usually goes wrong is that many of the OSDs are not actually
caught up, and it's not immediately obvious, so PGs don't peer. Setting
noup and waiting for all OSDs to catch up (as per 'ceph daemon osd.NNN
status') first generally helps.
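Concretely, the wait-for-everyone step might look like this sketch. The
per-OSD query is stubbed with a hypothetical epoch so it runs without a
cluster; on a real host the commented lines replace the stubs:

```shell
#!/bin/sh
# Wait until every OSD's newest map matches the cluster epoch before
# unsetting noup. OSD ids and epochs are stubbed for illustration;
# on a live cluster:
#   cluster_epoch=$(ceph osd dump | awk 'NR==1 {print $2}')
#   epoch=$(ceph daemon osd.$id status | jq .newest_map)
cluster_epoch=403000
all_caught_up=yes
for id in 0 1 2; do
    epoch=403000   # stub for the per-OSD admin socket query above
    if [ "$epoch" -lt "$cluster_epoch" ]; then
        all_caught_up=no
        echo "osd.$id is still at epoch $epoch"
    fi
done
if [ "$all_caught_up" = yes ]; then
    echo "all OSDs caught up; safe to run: ceph osd unset noup"
fi
```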
But none of that explains why you're seeing OOM, so I'm curious what you
see with memory usage while OSDs are catching up...
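For watching that memory usage, a tiny sketch of sampling an OSD's
resident set size (Linux /proc). It reads the current shell's own RSS
so it runs anywhere; the commented pgrep line is a hypothetical way to
find the real ceph-osd pid:

```shell
#!/bin/sh
# Sample a process's resident set size from /proc (Linux).
# For a real OSD something like this would find the pid:
#   pid=$(pgrep -f 'ceph-osd.*--id NN')
# Here we read this shell's own RSS so the sketch is self-contained.
pid=$$
rss_kb=$(awk '/^VmRSS/ {print $2}' /proc/"$pid"/status)
echo "pid $pid RSS: ${rss_kb} kB"
```

Sampling that in a loop (e.g. every few seconds) while an OSD chews
through maps should show whether the process is growing or flat.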
Thanks!
sage
> Anyone out there have suggestions about how I should go about getting
> this cluster healthy again? Any ideas appreciated.
>
> Thanks!
> - Aaron