On Thu, Apr 22, 2021 at 9:41 PM Ilya Dryomov <idryomov(a)gmail.com> wrote:
On Thu, Apr 22, 2021 at 2:59 PM Kefu Chai <kchai(a)redhat.com> wrote:
On Wed, Apr 21, 2021 at 5:48 PM Ilya Dryomov <idryomov(a)gmail.com> wrote:
>
> On Wed, Apr 21, 2021 at 9:59 AM Kefu Chai <kchai(a)redhat.com> wrote:
> >
> > hi folks,
> >
> > while looking at https://github.com/ceph/ceph/pull/32422, i think a
> > probably safer approach is to make the monitor more efficient. currently,
> > the monitor is more or less a single-threaded application: quite a few of
> > its critical code paths are protected by Monitor::lock, among them:
> >
> > - the periodic task performed by tick(), which is in turn called by
> >   SafeTimer. the "safety" of SafeTimer is ensured by Monitor::lock.
> > - Monitor::_ms_dispatch, which is also called with Monitor::lock
> >   acquired. in the case of https://github.com/ceph/ceph/pull/32422, one or
> >   more kcephfs clients are even able to slow down the whole cluster by
> >   asking for the latest osdmap while holding an ancient one, if the cluster
> >   is able to rebalance/recover quickly and accumulates lots of osdmaps in a
> >   short time.
> >
> > a typical scary use case is:
> >
> > 1. an all-flash cluster has just completed a rebalance/recovery. the
> >    rebalance completed quickly, and it leaves the cluster with a ton of
> >    osdmaps before some of the clients have had a chance to pick up these
> >    updated maps.
> > 2. (kcephfs) clients holding ancient osdmaps wake up randomly, and they
> >    want the latest osdmap!
> > 3. the monitors are occupied with loading the maps from rocksdb and
> >    encoding them in very large batches (when discussing with the author of
> >    https://github.com/ceph/ceph/pull/32422, he mentioned that the total
> >    size of the inc osdmaps could be up to 200~300 MiB).
> > 4. and the cluster is basically unresponsive.
> >
> > so, does it sound like the right way to improve the monitor's performance
> > when serving CPU-intensive workloads to dissect the data dependencies in
> > the monitor and explore the possibility of making it more
> > multi-threaded?
Another thing to explore in addition to making the monitor more
efficient might be the concept of osdmap for clients as opposed to
a full osdmap. The osdmap already has a client section and an OSD
section (everything else, really), but there is no way to indicate
interest in just the client section so we always encode the entire
thing. The kernel client for example doesn't even decode the OSD
section so a good chunk of CPU cycles spent on encoding is wasted.
There are a couple of fields in the OSD section that can be of
interest to clients. One example is require_osd_release and the
features of the OSD -- used in Objecter::_calc_target() for an
optimization. If we identify these and move or duplicate them
in the client section, implementing client-section-only osdmap
subscription should be pretty easy and would save both CPU cycles
and network bandwidth.
thank you Ilya. to understand which portions the client is interested in, i
am looking at osdmap_decode() in net/ceph/osdmap.c. they are:
===== client-used data =======
+ fsid, epoch, created, modified
+ pools:
  + type, size, crush_ruleset, object_hash, pg_num, pgp_num
  - lpg_num, lpgp_num, last_change, snap_seq, snap_epoch, snaps,
    removed_snaps, auid
  + flags
  - crash_replay_interval // we always encode 0 here, BTW
  + min_size
  - quota_max_bytes, quota_max_objects
  - tiers, tier_of
  + read_tier, write_tier
  - properties
  - hit_set_params, hit_set_period, hit_set_count
  - stripe_width
  - target_max_bytes, target_max_objects,
    cache_target_dirty_ratio_micro, cache_target_full_ratio_micro,
    cache_min_flush_age, cache_min_evict_age
  - erasure_code_profile
  + last_force_request_resend (last_force_op_resend_preluminous)
  - min_read_recency_for_promote
  - expected_num_objects
  - cache_target_dirty_high_ratio_micro
  - min_write_recency_for_promote
  - use_gmt_hitset
  - fast_read
  - hit_set_grade_decay_rate, hit_set_search_last_n
  - opts
  + last_force_op_resend_prenautilus
  - ... // ignore the rest
+ pool_names
+ pool_max
+ flags
+ max_osd
+ osd_state
+ osd_weight
+ osd_addrs->client_addrs
+ pg_temp
+ crush
- erasure_code_profiles
+ pg_upmap, pg_upmap_items
- crush_version, new_removed_snaps, new_purged_snaps, last_up_change,
  last_in_change
===== osd-specific data =====
....
* osd_xinfo
* features
* require_osd_release
in which,
- "+" marks the bits that are decoded by the kernel
- "-" marks the bits that are ignored by the kernel
- "*" marks the bits that the client would be interested in
so, in other words, we could introduce a variant of osdmap which *only*
includes the "+" and "*" fields, and allow the client to subscribe to the
variant. probably we can do this in 2 phases:
Hi Kefu,
I don't think basing it on the kernel client alone is a good idea.
It's certainly a good start, but I think we are likely to over-discard
because the kernel client omits quite a lot of things.
ahh, very true. i will need to audit the librados client as well.
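
To make the proposed variant concrete, here is a minimal sketch that models an osdmap as a plain Python dict and keeps only the "+" and "*" entries listed earlier. It is an illustration only: the real encoding is done in C++ by the monitor, the dict layout, `client_view()` and the flattened `client_addrs` key are hypothetical, and the allow-lists come from the kernel-client audit alone, so they would likely grow once the librados client is audited too.

```python
# Hypothetical model: an osdmap as a nested dict. Not Ceph's real data layout.
# Allow-lists mirror the "+" and "*" entries from the kernel-client audit.
CLIENT_TOP_LEVEL = {
    "fsid", "epoch", "created", "modified", "pools", "pool_names", "pool_max",
    "flags", "max_osd", "osd_state", "osd_weight",
    "client_addrs",  # stands in for osd_addrs->client_addrs (flattened here)
    "pg_temp", "crush", "pg_upmap", "pg_upmap_items",
    # "*" entries that currently live in the OSD section
    "osd_xinfo", "features", "require_osd_release",
}
CLIENT_POOL_FIELDS = {
    "type", "size", "crush_ruleset", "object_hash", "pg_num", "pgp_num",
    "flags", "min_size", "read_tier", "write_tier",
    "last_force_request_resend", "last_force_op_resend_prenautilus",
}


def client_view(osdmap):
    """Return a copy of `osdmap` restricted to the client-relevant fields."""
    view = {k: v for k, v in osdmap.items() if k in CLIENT_TOP_LEVEL}
    if "pools" in osdmap:
        # Filter each pool's fields with the per-pool allow-list.
        view["pools"] = {
            pool_id: {k: v for k, v in pool.items() if k in CLIENT_POOL_FIELDS}
            for pool_id, pool in osdmap["pools"].items()
        }
    return view


# Example with made-up values.
full = {
    "epoch": 42,
    "pools": {1: {"size": 3, "pg_num": 128, "quota_max_bytes": 0, "stripe_width": 4096}},
    "erasure_code_profiles": {},
    "require_osd_release": "pacific",
    "crush_version": 7,
}
print(client_view(full))
```

An allow-list (rather than a deny-list) is used here so that any field added to the full map later stays out of the client variant until someone decides clients need it.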
the first phase:
0. let OSDMonitor discriminate between different peers by checking the
   peer's entity type: if the peer is a client, it sends the client-used
   portion of the osdmap, otherwise it sends the full version.
I think discriminating by entity type wouldn't work. For example, the
"ceph" tool would be a "client", but it does need to show and be able
to modify stuff in the OSD section.
i see. wanted to have a minimal change as the first step. so it seems like
a dead end then.
the second phase:
1. add a new feature named FEATURE_OSDMAP_CLIENT to ceph::features::mon,
   so the monitors in quincy will be able to serve a new osdmap variant.
2. introduce yet another "what" in MMonSubscribe, as an extension of
   "osdmap": in addition to the all-in-one "osdmap", the monitor can serve
   maps optimized for the client side through a subscription to "osdmap.c",
   where "c" stands for "client". the full and inc maps it sends should
   only contain the "+" and "*" bits listed above.
3. OSDMonitor will be able to encode the client-only osdmaps on demand
   for the "osdmap.c" subscribers.
4. if the monmap implies that the monitor supports the
   FEATURE_OSDMAP_CLIENT feature, the client sends an "osdmap.c"
   MMonSubscribe to the monitor for the client-only osdmap.
Yeah, something along these lines. Opt-in subscription independent of
the entity type.
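
As a rough illustration of the opt-in idea, the sketch below picks the subscription name from what the peer needs and what the monitor advertises, not from the peer's entity type. Only "osdmap", "osdmap.c" and the FEATURE_OSDMAP_CLIENT name come from this thread; `pick_subscription()`, its parameters and the placeholder feature value are assumptions made up for the example.

```python
# Hypothetical sketch of the opt-in negotiation discussed above.
# The value below is a placeholder, not a real Ceph feature bit.
FEATURE_OSDMAP_CLIENT = "osdmap_client"


def pick_subscription(mon_features, wants_osd_section):
    """Return the MMonSubscribe "what" a peer would ask for.

    mon_features: set of feature names the monitors advertise (assumed shape).
    wants_osd_section: True for peers that need the full map, e.g. the
    "ceph" tool, even though its entity type is "client".
    """
    if wants_osd_section:
        return "osdmap"
    if FEATURE_OSDMAP_CLIENT in mon_features:
        return "osdmap.c"
    # Older monitors only know the all-in-one map.
    return "osdmap"


# A kernel-style client talking to new monitors opts in to the small map.
print(pick_subscription({FEATURE_OSDMAP_CLIENT}, wants_osd_section=False))
# The "ceph" tool keeps the full map regardless of monitor features.
print(pick_subscription({FEATURE_OSDMAP_CLIENT}, wants_osd_section=True))
```

Falling back to "osdmap" when the feature is absent keeps such a client working against monitors that predate the new variant.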
thank you again, Ilya. i created a copy on the etherpad and linked it at
. will bring
this up in the next CDM for more inputs.