On 2020-03-26T16:31:29, Blaine Gardner <BlGardner(a)suse.com> wrote:
Hi Blaine,
thanks for bringing this up.
> Advice I got from Joao: In the case of Ceph monitors, they are more
> likely to be experiencing memory over-use during recovery scenarios,
> and killing mons during this due to exceeding a limit may make the
> problem much worse. The best practice I have here is to only set
> memory requests for Ceph mons, ideally 4GB.
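In Rook terms, that advice would look roughly like the following stanza in the CephCluster CR. The 4Gi figure is the value suggested above; treat the rest as an illustrative sketch, not a tested manifest:

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  resources:
    # Request memory for mons but set no limit, so Kubernetes
    # reserves them ~4Gi without OOM-killing them during recovery.
    mon:
      requests:
        memory: "4Gi"
```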
> In the case of OSDs, things are a little more complex. OSDs will read
> the POD_MEMORY_REQUEST and POD_MEMORY_LIMIT environment variables,
> which are set by Rook inside Kubernetes pods, and OSDs will tune
> their memory usage to meet this.
And that's great (though at that point, we probably ought to prevent
the manual setting of the memory target, since it's based on this
external setting?).
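For reference, environment variables like these are typically injected via the Kubernetes downward API's resourceFieldRef mechanism; a sketch of what effectively ends up in the OSD container spec (the exact Rook-generated spec may differ):

```yaml
env:
  # Expose the container's own memory request/limit to the Ceph
  # process, which can use them to derive osd_memory_target.
  - name: POD_MEMORY_REQUEST
    valueFrom:
      resourceFieldRef:
        resource: requests.memory
  - name: POD_MEMORY_LIMIT
    valueFrom:
      resourceFieldRef:
        resource: limits.memory
```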
> [...] the risks of setting (or not setting) Pod Memory Limits on
> OSDs, knowing that if the limit is set too low or if the OSDs begin
> to leak memory, they will be terminated and restarted by Kubernetes?
>
> - One risk I can imagine is that if OSDs are all started at nearly
>   the same time and experience similar loads, they might be likely to
>   leak memory at similar rates and be killed by Kubernetes at about
>   the same time. Stampeding herds of OSD memory leaks followed by
>   memory-limit terminations might occur, which could ripple out and
>   cause other OSDs to become unstable.
> - Not setting a limit might mean that OSDs leak memory and cause OOM
>   situations for other daemons, or for the Kubernetes kubelet if the
>   system settings don't guarantee the kubelet some amount of
>   resources.
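The kubelet-starvation risk in that last point is usually mitigated with node-level reservations rather than pod limits; a sketch via KubeletConfiguration (the values are illustrative, not recommendations):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Reserve memory/CPU for OS daemons and for the kubelet itself, so a
# leaking pod triggers eviction before it can starve the node agents.
systemReserved:
  memory: "1Gi"
  cpu: "500m"
kubeReserved:
  memory: "1Gi"
  cpu: "500m"
evictionHard:
  memory.available: "500Mi"
```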
I think the risk of accidentally killing OSDs on a falsely triggered
threshold is too high; the impact can be that other pods fail (we're
providing storage to them, after all).
The same can be said for anything that's in the IO path.
Warning and alerting, yes. But unless the memory leaks are *really*
severe (and that's hard to quantify: 150%, 200% of the expected max?),
keeping the storage stack in service is probably still sensible. The
ripple effect of killing these daemons is massive.
The kernel OOM killer will still go after processes that run completely
amok if the total capacity of the system is exceeded.
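An alert along those lines (warn a human when a Ceph pod's working set passes ~150% of its request, instead of letting a limit kill it) could be sketched as a PrometheusRule. The metric names assume cAdvisor and kube-state-metrics are being scraped; the 1.5 factor is the kind of threshold discussed above:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ceph-memory-overuse
spec:
  groups:
    - name: ceph-memory
      rules:
        - alert: CephDaemonMemoryOveruse
          # Working set above 150% of the declared request for 10
          # minutes: alert, rather than terminate the daemon.
          expr: |
            container_memory_working_set_bytes{namespace="rook-ceph"}
              > 1.5 * kube_pod_container_resource_requests{namespace="rook-ceph",resource="memory"}
          for: 10m
          labels:
            severity: warning
```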
> Is it good to kill daemons if they exceed a limit in order to prevent
> memory leaks from affecting the rest of the system? MDS? RGW? MGR?
> NFS-Ganesha?
It might not matter as much with the mgr, but a misconfigured memory
limit repeatedly killing that one is likely painful too.
Killing an MDS/NFS instance might stop client systems from being able to
flush their dirty buffers, making the overall IO/memory situation worse.
> If anyone has knowledgeable recommendations about any daemons, I'd
> love your input. Please reply-all so that I get replies straight to
> my inbox.
Based on experience with HA stacks, I'd be very, very careful with
killing storage-path components. At the very least, those limits, be
they timeouts or resource caps, need to be extremely generous, because
of the ripple effects.

Typically, the storage system is given minimum resource guarantees and
the other workloads are limited so they don't interfere, not vice
versa.
Again, warning/alerting is sensible.
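One way to express "storage gets guarantees, other workloads get limits" in this setup: give the Ceph daemons requests plus a high PriorityClass, and put limits or quotas on the consumers instead. A sketch (the class name here is made up for illustration; Rook's CephCluster CR does accept per-daemon priorityClassNames):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: storage-critical   # hypothetical name
value: 1000000
description: "Ceph daemons are preempted/evicted last."
---
# Then referenced from the CephCluster spec, e.g.:
#   priorityClassNames:
#     mon: storage-critical
#     osd: storage-critical
```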
Regards,
Lars
--
SUSE Software Solutions Germany GmbH, MD: Felix Imendörffer, HRB 36809 (AG Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli
Zbinden)