On Thu, 2021-04-01 at 11:04 +0200, Dan van der Ster wrote:
> Hi,
>
> Context: one of our users is mounting 350 ceph kernel PVCs per 30GB VM
> and they notice "memory pressure".

Manifested how?

> When planning for k8s hosts, what would be a reasonable limit on the
> number of ceph kernel PVCs to mount per host?
This seems like a really difficult thing to gauge. It depends on a
number of different factors, including the amount of RAM and CPU on the
box, mount options, workload and applications, etc.
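As a starting point for that kind of audit, you can see how many cephfs mounts a host is carrying and what options each uses straight from /proc/mounts (kernel cephfs mounts have fstype "ceph"); a minimal sketch:

```shell
#!/bin/sh
# List current kernel cephfs mounts (fstype "ceph" in /proc/mounts)
# with their mount options, then count them. Prints nothing but the
# total if the host has no cephfs mounts.
awk '$3 == "ceph" { print $2, $4 }' /proc/mounts
echo "total ceph mounts: $(awk '$3 == "ceph"' /proc/mounts | wc -l)"
```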
> If one kernel mounts the same cephfs several times (with different
> prefixes), we observed that each mount shows up as a unique client
> session. But does the ceph module globally share a single copy of
> cluster metadata, e.g. osdmaps, or is that all duplicated per session?

One copy per cluster client, which should generally be shared between
mounts to the same cluster, provided that you're using similar-enough
mount options for the kernel to do that.
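One way to see that sharing in practice (a sketch; assumes debugfs is mounted at the usual /sys/kernel/debug, which needs root): each in-kernel ceph client instance appears as one <fsid>.client<id> directory there, so multiple mounts that share a client contribute only a single entry.

```shell
#!/bin/sh
# Count distinct in-kernel ceph client instances. Each directory under
# /sys/kernel/debug/ceph is one <fsid>.client<id> instance; mounts that
# share a client share one entry. Requires root and CONFIG_DEBUG_FS.
if [ -d /sys/kernel/debug/ceph ]; then
    echo "client instances: $(ls /sys/kernel/debug/ceph | wc -l)"
else
    echo "client instances: unknown (debugfs unavailable or no ceph mounts)"
fi
```

If two mounts with identical options to the same cluster produce one directory, they're sharing a client; different options can force a second instance.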
> Can anyone estimate how much memory is consumed by each mount (assuming
> it is a client of an O(1k) osd ceph cluster)?
Again, hard to tell, and somewhat nebulous. Each mount will get its own
superblock, but most of the client info is shared, so the overhead from
an additional mount itself should be fairly trivial.
The big question mark is how many inodes and dentries you have in core
at the time, and how much data (particularly, dirty data) you have in
the pagecache.
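To make that concrete, here's a back-of-envelope sketch. The per-object sizes below are illustrative assumptions only, not authoritative figures; read the real object sizes for your kernel from the ceph_inode_info and dentry rows of /proc/slabinfo.

```shell
#!/bin/sh
# Rough metadata-cache footprint estimate for one host. The struct sizes
# are assumed values for illustration; check /proc/slabinfo (rows
# ceph_inode_info and dentry, needs root) for your kernel's real sizes.
INODES=100000     # ceph inodes currently in core (assumed count)
DENTRIES=100000   # dentries currently in core (assumed count)
INODE_SZ=1100     # assumed bytes per ceph_inode_info
DENTRY_SZ=192     # assumed bytes per dentry
echo "$(( (INODES * INODE_SZ + DENTRIES * DENTRY_SZ) / 1048576 )) MiB of cache metadata"
```

Dirty data sitting in the pagecache comes on top of this, and is bounded by the usual vm.dirty_* writeback knobs rather than anything ceph-specific.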
> Also, k8s makes it trivial for a user to mount a single PVC from
> hundreds or thousands of clients. Suppose we wanted to be able to
> limit the number of clients per PVC -- do you think a new
> `max_sessions=N` cephx cap would be the best approach for this?
Why do you want to limit the number of clients per PVC? I'm not sure
that would really solve anything.
FWIW, I'm not a fan of solutions that end up with clients pooping
themselves because they get back some esoteric error due to exceeding a
limit when trying to mount or something.
--
Jeff Layton <jlayton(a)redhat.com>