(Please use reply-all so that people I've explicitly tagged and I can continue to get
direct email replies)
I have a conundrum in developing best practices for Rook/Ceph clusters: specifically, OSD
memory targets versus hardware recommendations for OSDs. I want to lay down some bulleted
notes first.
- 'osd_memory_target' defaults to 4 GB
- OSDs attempt to keep memory allocation to 'osd_memory_target'
- BUT... this is only best-effort
- AND... there is no guarantee that the kernel will actually reclaim memory that OSDs
release/unmap
- Therefore, we (SUSE) have developed a recommendation that ...
Total OSD node RAM required = (num OSDs) x (1 GB + osd_memory_target) + 16 GB
(worked example after this list)
- In Rook/Kubernetes, Ceph OSDs will read the POD_MEMORY_REQUEST and POD_MEMORY_LIMIT env
vars to infer a new default value for 'osd_memory_target' (see the sketch after this list)
- POD_MEMORY_REQUEST translates directly to 'osd_memory_target' 1:1
- POD_MEMORY_LIMIT (if REQUEST is unset) will set 'osd_memory_target' using the
formula ( LIMIT x osd_memory_target_cgroup_limit_ratio )
- if both are set, the default 'osd_memory_target' will be min(REQUEST, LIMIT x ratio)
- Lars has suggested that setting limits is not a best practice for Ceph; when limits are
hit, Ceph is likely already in a failure state, and killing daemons could result in a
"thundering herd" distributed systems problem
As you can see, there is a self-referential problem here. The OSD hardware recommendation
should inform how we set k8s resource requests/limits for OSDs; however, doing so affects
osd_memory_target, which alters the recommendation, which in turn alters our k8s resource
settings, and so on in a circle forever.
We can currently address this issue with a semi-workaround: set osd_memory_target
explicitly in Ceph's config, and set an appropriate k8s resource request matching
(osd_memory_target + 1 GB + some extra) to meet the hardware recommendation (sketch
below). However, this means that the Ceph feature of setting osd_memory_target based on
resource requests isn't really used, because it doesn't conform to actual best practices.
Setting a realistic k8s resource request is still useful, though, since Kubernetes then
won't schedule more daemons onto a node than the node can realistically support.
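In numbers, the semi-workaround looks something like this (the "some extra" headroom
below is a placeholder value of my choosing):

    GiB = 1024 ** 3

    # Semi-workaround: choose osd_memory_target explicitly, then derive the
    # k8s resource request from it, rather than the other way around.
    osd_memory_target = 4 * GiB      # set explicitly in Ceph's config
    extra = 1 * GiB                  # "some extra" headroom; placeholder
    k8s_memory_request = osd_memory_target + 1 * GiB + extra

    print(k8s_memory_request / GiB)  # -> 6.0, i.e. a 6 GB request per OSD pod
                                     # (and per Lars, leave the limit unset)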
Long-term, I wonder if it would be good to add into Ceph a computation that
[[ osd_memory_target = REQUEST - osd_memory_request_overhead ]], where
osd_memory_request_overhead defaults to 1 GB or somewhat higher.
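That would close the loop: the operator sizes the resource request from the hardware
recommendation, and Ceph derives the memory target back out of it. A sketch of what I
mean (my guess at the computation; not an existing Ceph option):

    GiB = 1024 ** 3
    osd_memory_request_overhead = 1 * GiB   # proposed default

    def proposed_osd_memory_target(pod_memory_request):
        return pod_memory_request - osd_memory_request_overhead

    # A 5 GB request would yield the familiar 4 GB default target.
    print(proposed_osd_memory_target(5 * GiB) / GiB)   # -> 4.0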
Please discuss, and let me know if it seems like I've gotten anything wrong here, or if
there are other options I haven't seen.
Cheers, and happy Tuesday!
Blaine