On Tue, 2021-04-06 at 12:32 +0200, Dan van der Ster wrote:
On Mon, Apr 5, 2021 at 8:33 PM Jeff Layton
<jlayton(a)redhat.com> wrote:
On Thu, 2021-04-01 at 11:04 +0200, Dan van der Ster wrote:
Hi,
Context: one of our users is mounting 350 ceph kernel PVCs per 30GB VM
and they notice "memory pressure".
Manifested how?
Our users lost the monitoring, so we are going to try to reproduce to
get more details.
Do you know any way to see how much memory is used by the kernel
clients? (Aside from the ceph_inode_info and ceph_dentry_info which I
see in slabtop).
Nothing simple, I'm afraid, and even those don't tell you the full
picture. ceph_dentry_info is a separate allocation from the actual
dentry.
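
If you just want a rough number for the slab side of things, you can sum
up the ceph-related caches from /proc/slabinfo. A quick sketch (untested,
needs root, and it only covers slab-allocated structures like inodes,
dentries and caps -- not the maps or other kmalloc'd data; some caches may
also be merged and not show up under their own name):

#!/usr/bin/env python3
# Rough total of memory held by ceph-related slab caches.
# Only covers slab allocations (ceph_inode_info, ceph_dentry_info, ...),
# not osdmaps or other kmalloc'd data. Needs root to read /proc/slabinfo.
total = 0.0
with open("/proc/slabinfo") as f:
    next(f)  # "slabinfo - version: ..." line
    next(f)  # column header line
    for line in f:
        fields = line.split()
        if not fields or not fields[0].startswith("ceph_"):
            continue
        name, num_objs, objsize = fields[0], int(fields[2]), int(fields[3])
        kib = num_objs * objsize / 1024
        total += kib
        print(f"{name:24s} {kib:10.1f} KiB")
print(f"{'total':24s} {total:10.1f} KiB")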
I see that the osd_client keeps just one copy of the osdmap, so that's
going to be only ~256kB * num_clients on this particular cluster.
Do we also need to kmalloc something the size of the pg map? That
would be ~4MB * num_clients here.
Are there any other large data structures, even for idle mounts?
Almost certainly, but it's not trivial to measure them. You might start
by looking at net/ceph/osdmap.c in the kernel sources and consider
instrumenting it to report how large its allocations are. We simply
don't keep those sorts of detailed stats of allocations that the client
does.
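
That said, just taking the figures in this thread at face value (~256kB of
osdmap plus ~4MB of pg map per client instance -- numbers I haven't
verified), the back-of-the-envelope math for 350 separate client instances
on one VM looks like this:

# Back-of-the-envelope estimate using the (unverified) figures above:
# ~256 kB osdmap + ~4 MB pg map per libceph client instance.
osdmap_kb = 256
pgmap_kb = 4 * 1024
clients = 350
per_client_mb = (osdmap_kb + pgmap_kb) / 1024
print(f"~{per_client_mb:.2f} MB per client, "
      f"~{clients * per_client_mb:.0f} MB for {clients} clients")
# -> ~4.25 MB per client, ~1488 MB for 350 clients

So a bit under 1.5GB of a 30GB VM just for duplicated maps, before you
count inodes, dentries and caps.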
When planning for k8s hosts, what would be a reasonable limit on the
number of ceph kernel PVCs to mount per host?
This seems like a really difficult thing to gauge. It depends on a
number of different factors including amount of RAM and CPUs on the box,
mount options, workload and applications, etc...
If one kernel mounts the same cephfs several times (with different
prefixes), we observed that
this is a unique client session. But does the ceph module globally
share a single copy of cluster metadata, e.g. osdmaps, or is that all
duplicated per session?
One copy per-cluster client, which should generally be shared between
mounts to the same cluster, provided that you're using similar-enough
mount options for the kernel to do that.
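
If you want to check whether your mounts really are sharing a client
instance, the kernel creates one directory per libceph client under
/sys/kernel/debug/ceph (named <fsid>.client<id>), so counting those on a
host is a cheap sanity check. A small sketch, assuming debugfs is mounted
in the usual place:

#!/usr/bin/env python3
# Count libceph client instances on this host by listing the per-client
# debugfs directories ("<fsid>.client<id>").
# Assumes debugfs is mounted at /sys/kernel/debug; needs root.
import os

base = "/sys/kernel/debug/ceph"
try:
    clients = sorted(os.listdir(base))
except FileNotFoundError:
    raise SystemExit("no ceph debugfs dir -- is debugfs mounted?")
print(f"{len(clients)} libceph client instance(s):")
for c in clients:
    print("  " + c)

If that count stays well below the number of mounts, the mounts are
being shared.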
As Sage suspected, we have a unique cephx user per PVC mounted.
We're using the Manila CSI driver, which indeed invokes mgr/volumes to create
the shares. They look like this, for reference:
"client_metadata": {
"features": "0x0000000000007bff",
"entity_id": "pvc-691d1f23-da81-4a08-a6e7-d16f44e5f2a0",
"hostname": "paas-standard-avz-b-6qvn6",
"kernel_version": "5.10.19-200.fc33.x86_64",
"root":
"/volumes/_nogroup/dbe3dbbf-e8d6-4f13-aac4-7a116d9a6772"
}
It's good to know that by using the same cephx user, we could
optimize the clients on a given host.
Also, k8s makes it trivial for a user to mount a single PVC from
hundreds or thousands of clients. Suppose we wanted to be able to
limit the number of clients per PVC -- Do you think a new
`max_sessions=N` cephx cap would be the best approach for this?
Why do you want to limit the number of clients per PVC? I'm not sure
that would really solve anything.
Mounting from a huge number of clients can easily overload the MDSs.
But Manila only lets us hand out CephFS quotas by rbytes or # shares.
So if we could similarly limit the number of sessions per cephx user
(i.e. per share), then we could prevent these overloads.
The problem there is that you'll end up with clients that just suddenly
start failing to mount because you hit your arbitrary capacity limits,
and it'll almost certainly be first-come, first-served. This is a
different matter from applying quotas because it potentially affects you
at mount time.
FWIW, I'm not a fan of solutions that end up with clients pooping
themselves because they get back some esoteric error due to exceeding a
limit when trying to mount or something.
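
If the real concern is MDS overload, counting sessions per cephx entity
and alerting on that probably gets you most of the benefit without
hard-failing anyone's mount. A rough sketch of the idea (the exact JSON
layout of "session ls" differs a bit between releases, and "mds.a" below
is just a placeholder for a real MDS name):

#!/usr/bin/env python3
# Count CephFS client sessions per cephx entity_id by asking an MDS.
# Sketch only: "mds.a" is a placeholder, and the JSON layout of
# "session ls" varies somewhat between Ceph releases.
import json
import subprocess
from collections import Counter

out = subprocess.check_output(
    ["ceph", "tell", "mds.a", "session", "ls", "--format=json"])
sessions = json.loads(out)
counts = Counter(
    s.get("client_metadata", {}).get("entity_id", "<unknown>")
    for s in sessions)
for entity, n in counts.most_common():
    print(f"{n:6d}  {entity}")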
--
Jeff Layton <jlayton(a)redhat.com>