Okay, I adjusted rgw to register under gid like the others, and
changed the cephadm logic around to cope.
I also cleaned up and simplified the 'ceph -s' output:
  services:
    mon: 1 daemons, quorum a (age 15s)
    mgr: x(active, since 92s)
    osd: 1 osds: 1 up (since 71s), 1 in (since 88s)
    cephfs-mirror: 1 daemon active (1 hosts)
    rbd-mirror: 2 daemons active (1 hosts)
    rgw: 2 daemons active (1 hosts, 1 zones)
- don't list individual daemon ids (won't scale for large clusters)
- present any groupings we can identify (currently just distinct hosts
  and rgw zones; if there are reasonable groupings for cephfs-mirror,
  rbd-mirror, or iscsi, let's add those too)
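
For reference, the grouping boils down to something like this (a
simplified Python sketch, not the actual mgr code; it assumes each
daemon's metadata carries a 'hostname' key, plus 'zone_name' for rgw,
as in the servicemap dump quoted below):

    def summarize(service, daemons):
        # the servicemap 'daemons' dict also carries a 'summary'
        # string; look only at the real daemon entries
        infos = [d for d in daemons.values() if isinstance(d, dict)]
        hosts = {d['metadata'].get('hostname') for d in infos}
        groups = ['%d hosts' % len(hosts)]
        if service == 'rgw':
            zones = {d['metadata'].get('zone_name') for d in infos}
            groups.append('%d zones' % len(zones))
        return '%s: %d daemons active (%s)' % (
            service, len(infos), ', '.join(groups))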
s
On Thu, Mar 18, 2021 at 8:26 PM Jason Dillaman <jdillama(a)redhat.com> wrote:
>
> On Thu, Mar 18, 2021 at 9:00 PM Sage Weil <sage(a)newdream.net> wrote:
> >
> > Hi everyone,
> >
> > The non-core daemon registrations in servicemap vs cephadm came up
> > twice in the last couple of weeks:
> >
> > First, https://github.com/ceph/ceph/pull/40035 changed rgw to register
> > as rgw.$id.$gid and made cephadm complain about stray unmanaged
> > daemons. The motivation was that the PR allows multiple radosgw
> > daemons to share the same auth name + key and still show up in the
> > servicemap.
> >
> > Then, today, I noticed that cephfs-mirror caused the same cephadm
> > error because it was registering as cephfs-mirror.$gid instead of the
> > cephfs-mirror.$id that cephadm expected. I went to fix that in
> > cephfs-mirror, but noticed that the behavior was copied from
> > rbd-mirror.. which wasn't causing any cephadm error. It turns out
> > that cephadm has some special code for rbd-mirror to identify daemons
> > in the servicemap:
> >
> > https://github.com/ceph/ceph/blob/master/src/pybind/mgr/cephadm/serve.py#L4…
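> >
> > The gist of that special case, paraphrased (not the exact serve.py
> > code; 'get_metadata' stands in for however the daemon metadata is
> > fetched):
> >
> >     # rbd-mirror's servicemap key is a gid, so map it back to the
> >     # 'id' that cephadm's inventory knows the daemon by.
> >     if daemon_type == 'rbd-mirror':
> >         metadata = get_metadata('rbd-mirror', daemon_id)
> >         if metadata and 'id' in metadata:
> >             daemon_id = metadata['id']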
> >
> > So to fix cephfs-mirror, I opted to keep the existing behavior and
> > adjust cephadm:
> >
> > https://github.com/ceph/ceph/pull/40220/commits/30d87f3746ff9daf219366354f2…
> >
> > For now, at least, that solves the problem. But, as things stand rgw
> > and {cephfs,rbd}-mirror are behaving a bit differently with
> > servicemap. The registrations look like so:
> >
> > {
> >     "epoch": 538,
> >     "modified": "2021-03-18T17:28:12.500356-0400",
> >     "services": {
> >         "cephfs-mirror": {
> >             "daemons": {
> >                 "summary": "",
> >                 "4220": {
> >                     "start_epoch": 501,
> >                     "start_stamp": "2021-03-18T12:49:32.929888-0400",
> >                     "gid": 4220,
> >                     "addr": "10.3.64.25:0/3521332238",
> >                     "metadata": {
> >                         ...
> >                         "id": "dael.csfspq",
> >                         "instance_id": "4220",
> >                         ...
> >                     },
> >                     "task_status": {}
> >                 }
> >             }
> >         },
> >         "rbd-mirror": {
> >             "daemons": {
> >                 "summary": "",
> >                 "4272": {
> >                     "start_epoch": 531,
> >                     "start_stamp": "2021-03-18T16:31:26.540108-0400",
> >                     "gid": 4272,
> >                     "addr": "10.3.64.25:0/2576541551",
> >                     "metadata": {
> >                         ...
> >                         "id": "dael.kfenmm",
> >                         "instance_id": "4272",
> >                         ...
> >                     },
> >                     "task_status": {}
> >                 },
> >                 "4299": {
> >                     "start_epoch": 534,
> >                     "start_stamp": "2021-03-18T16:52:59.027580-0400",
> >                     "gid": 4299,
> >                     "addr": "10.3.64.25:0/600966616",
> >                     "metadata": {
> >                         ...
> >                         "id": "dael.yfhmmq",
> >                         "instance_id": "4299",
> >                         ...
> >                     },
> >                     "task_status": {}
> >                 }
> >             }
> >         },
> >         "rgw": {
> >             "daemons": {
> >                 "summary": "",
> >                 "foo.dael.hwyogi": {
> >                     "start_epoch": 537,
> >                     "start_stamp": "2021-03-18T17:27:58.998535-0400",
> >                     "gid": 4319,
> >                     "addr": "10.3.64.25:0/3084463187",
> >                     "metadata": {
> >                         ...
> >                         "zone_id": "6321d54d-d780-43f3-af53-ce52aed2ef8a",
> >                         "zone_name": "default",
> >                         "zonegroup_id": "e8453745-84a7-4d58-9aa9-9bfaf1ce9a7f",
> >                         "zonegroup_name": "default"
> >                     },
> >                     "task_status": {}
> >                 },
> >                 "foo.dael.pyvurh": {
> >                     "start_epoch": 537,
> >                     "start_stamp": "2021-03-18T17:27:58.999620-0400",
> >                     "gid": 4318,
> >                     "addr": "10.3.64.25:0/2303221705",
> >                     "metadata": {
> >                         ...
> >                         "zone_id": "6321d54d-d780-43f3-af53-ce52aed2ef8a",
> >                         "zone_name": "default",
> >                         "zonegroup_id": "e8453745-84a7-4d58-9aa9-9bfaf1ce9a7f",
> >                         "zonegroup_name": "default"
> >                     },
> >                     "task_status": {}
> >                 },
> >                 "foo.dael.rqipjp": {
> >                     "start_epoch": 538,
> >                     "start_stamp": "2021-03-18T17:28:10.866327-0400",
> >                     "gid": 4330,
> >                     "addr": "10.3.64.25:0/4039152887",
> >                     "metadata": {
> >                         ...
> >                         "zone_id": "6321d54d-d780-43f3-af53-ce52aed2ef8a",
> >                         "zone_name": "default",
> >                         "zonegroup_id": "e8453745-84a7-4d58-9aa9-9bfaf1ce9a7f",
> >                         "zonegroup_name": "default"
> >                     },
> >                     "task_status": {}
> >                 }
> >             }
> >         }
> >     }
> > }
> >
> > With the *-mirror approach, the servicemap "key" is always the gid,
> > and you have to look at the "id" to see how the daemon is
> > named/authenticated. With rgw, the name is the key and there is no
> > "id" key.
> >
> > I'm inclined to just go with the gid-as-key for rgw too and add the
> > "id" key so that we are behaving consistently. This would have the
> > side-effect of also solving the original goal of allowing many rgw
> > daemons to share the same auth identity and still show up in the
> > servicemap.
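> >
> > Concretely, the rgw entries would then look like the *-mirror ones
> > above, e.g. (hypothetical, reworking the first rgw daemon from the
> > dump):
> >
> >     "4319": {
> >         ...
> >         "gid": 4319,
> >         "metadata": {
> >             ...
> >             "id": "foo.dael.hwyogi",
> >             ...
> >         },
> >         ...
> >     }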
>
> Just wanted to throw another variation into this model while we are
> talking about it. tcmu-runner for the Ceph iSCSI gateway registers as
> "<node-name>:<pool-name>/<image-name>" [1]. Its implementation
> predates all of these other ones.
>
> > The downside is that interpreting the service for the running daemons
> > is a bit more work. For example, currently ceph -s shows
> >
> >   services:
> >     mon: 1 daemons, quorum a (age 2d)
> >     mgr: x(active, since 58m)
> >     osd: 1 osds: 1 up (since 2d), 1 in (since 2d)
> >     cephfs-mirror: 1 daemon active (4220)
> >     rbd-mirror: 2 daemons active (4272, 4299)
> >     rgw: 2 daemons active (foo.dael.rqipjp, foo.dael.sajkvh)
> >
> > Showing the gids there is clearly not what we want. But similarly,
> > showing the daemon names is probably also a bad idea since it won't
> > scale beyond ~3 or so; we probably just want a simple count.
>
> tcmu-runner really hit this scaling issue and Xiubo just added the
> ability to programmatically fold these together via optional
> "daemon_type" and "daemon_prefix" metadata values [2][3] so that
> "ceph -s" will show something like:
>
> ... snip ...
> tcmu-runner: 3 portals active (gateway0, gateway1, gateway2)
> ... snip ...
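>
> The folding itself is conceptually simple; something like this Python
> sketch of the grouping (not the actual ServiceMap.cc logic, and the
> 'portals' label is just illustrative):
>
>     # Fold daemons that share a 'daemon_prefix' metadata value into
>     # one group; daemons without it stand alone under their own key.
>     groups = {}
>     for key, daemon in daemons.items():
>         prefix = daemon['metadata'].get('daemon_prefix', key)
>         groups.setdefault(prefix, []).append(key)
>     # -> tcmu-runner: 3 portals active (gateway0, gateway1, gateway2)
>     summary = '%d portals active (%s)' % (
>         len(groups), ', '.join(sorted(groups)))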
>
> > Reasonable?
> > sage
> >
>
> [1] https://github.com/open-iscsi/tcmu-runner/blob/master/rbd.c#L190
> [2] https://github.com/open-iscsi/tcmu-runner/blob/master/rbd.c#L202
> [3] https://github.com/ceph/ceph/blob/master/src/mgr/ServiceMap.cc#L83
>
> --
> Jason
>