in multisite, these async notifications are http messages that get
periodically broadcast to peer zones as new entries are added to a
shard of the mdlog or datalog. on the destination zones, they serve
two purposes:
* wake up the coroutines that were processing the given log shards, in
case they were sleeping because there was nothing to do the last time
they polled
* for data sync only, these messages also carry the keys of each new
datalog entry so we can trigger sync on the related bucket shards (in
addition to the buckets we're already syncing from the datalog itself)
these notifications have been around since jewel. as i understand it,
the goal was to make replication feel more responsive to updates, but
the model has two major flaws:
* it doesn't scale to more than one gateway per zone. when
broadcasting these notifications, we choose one radosgw endpoint from
each peer zone - but we have no way to know which one of those is
actually processing the log shards we're trying to notify. on receipt,
data sync will cache all of these keys in a map of 'modified_shards',
and the entries will just pile up in memory for the shards it isn't
processing (a rough sketch of this follows the list)
* it reduces the apparent latency of sync on some buckets at the
expense of overall sync throughput. not only does it prioritize sync
of 'hot' buckets over buckets in the backlog, but for every bucket we
sync via a notification, we'll sync it again when we get to its
entry in the log. i don't think this tradeoff is a good one
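to make the pile-up concrete, here's a minimal standalone sketch of the
problem - this is not the actual rgw code; the container and the
handle_notify/consume_shard helpers are simplified stand-ins:

// minimal sketch (not the actual rgw code; names/types are stand-ins):
// per-shard notification keys pile up on a gateway that only ever
// consumes a few of the shards
#include <iostream>
#include <map>
#include <set>
#include <string>

// stand-in for data sync's cache of keys carried by datalog notifications
std::map<int, std::set<std::string>> modified_shards;

// on receipt of a notification, every key is cached under its datalog shard
void handle_notify(int shard, const std::string& key) {
  modified_shards[shard].insert(key);
}

// only the shards this gateway actually processes ever get drained
std::set<std::string> consume_shard(int shard) {
  std::set<std::string> keys;
  keys.swap(modified_shards[shard]);
  return keys;
}

int main() {
  // notifications arrive for 128 shards, but this gateway only runs shard 0
  for (int shard = 0; shard < 128; ++shard) {
    handle_notify(shard, "bucket:instance:" + std::to_string(shard));
  }
  consume_shard(0);
  size_t stuck = 0;
  for (const auto& entry : modified_shards) {
    if (!entry.second.empty()) {
      ++stuck;
    }
  }
  // prints 127: keys for shards nobody here polls are never consumed
  std::cout << stuck << " shards still holding keys nobody will consume\n";
  return 0;
}

the real code is coroutine-driven, of course, but the shape of the
problem is the same: keys cached for shards this gateway never polls
are never consumed.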
what does everyone else think? are there other reasons to keep sending these?
Hi Folks,
The performance meeting will be starting in about 40 minutes at 8AM
PST! Today I would like to talk about my work writing a simple omap
benchmark based on our existing gtest suite and some of the behaviors I
observed in bluestore. We will also be discussing cache locking issues
in bluestore and possibly tradeoffs between simplification of the
locking behavior vs performance impact. Hope to see you there!
Etherpad:
https://pad.ceph.com/p/performance_weekly
Bluejeans:
https://bluejeans.com/908675367
Mark
Hi,
The CephFS watercooler talks [1] are short technical discussions on
topics usually related to CephFS. For the past few weeks, however,
we've had talks on Modern C++ features based on the book, "Effective
Modern C++: 42 Specific Ways to Improve Your Use of C++11 and C++14"
by Scott Meyers.
On March 16, Patrick Donnelly gave a talk on standard library smart
pointers. The recording is here,
https://bluejeans.com/s/7zTXYjspSuk/
On March 23, Rishabh Dave gave a talk on rvalue references, move
semantics and perfect forwarding. The recording is here,
https://bluejeans.com/s/cSFT0DZGg46
The recording links of the past watercooler talks are available at,
https://pad.ceph.com/p/CephFS_watercooler_recordings
Tomorrow (March 30), we'll have Venky Shankar talk about lambda expressions.
-Ramana
[1] CephFS water cooler talks are usually on Tuesdays at
https://bluejeans.com/4828642302 from 9:45-10:15 AM ET. It's on the
'Ceph Community' calendar.
hi folks,
i want to draw your attention to the tracker ticket
https://tracker.ceph.com/issues/48909, and discuss a better solution
with you.
some context first: back in https://github.com/ceph/ceph/pull/18614,
changes were made so that slow requests were reported to the mgr, to
move that burden from the monitor to the mgr. with that change, all
health related reports are sent to the mgr, which composes the
aggregated version and sends it to the monitor. i think that helps
improve the scalability of a Ceph cluster. moreover, IIUC, letting the
mgr take part of the monitor's load was one of the reasons the mgr was
introduced in the first place.
in https://tracker.ceph.com/issues/43975, it's reported that slow ops
were no longer recorded in the cluster log after mimic. as a fix,
https://github.com/ceph/ceph/pull/33328 was created to send slow ops and
their types to the cluster log.
in https://tracker.ceph.com/issues/43975, it's noted that this fix
actually worsens the performance of a cluster suffering from slow ops,
by adding more load to the monitor. hence
https://github.com/ceph/ceph/pull/39199 was created to throttle this.
i am wondering if we can make better use of the health reporting
machinery, instead of pouring health warnings into the clog, when slow
ops are observed?
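to illustrate the difference, here's a rough standalone sketch - not
ceph code, the types and functions are made up for illustration -
comparing a clog line per slow op with a single aggregated health
report per reporting interval:

// rough sketch (not ceph code): per-event cluster-log lines vs. one
// aggregated health report per reporting interval
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct SlowOp {
  std::string type;
  uint64_t age_sec;
};

// per-event path: one clog line per slow op, so monitor load grows
// with the number of slow ops
void report_to_clog(const std::vector<SlowOp>& ops) {
  for (const auto& op : ops) {
    std::cout << "clog WRN: slow " << op.type
              << ", blocked " << op.age_sec << "s\n";
  }
}

// aggregated path: one summary per daemon per tick, regardless of count
void report_health(const std::vector<SlowOp>& ops) {
  if (ops.empty()) {
    return;
  }
  uint64_t oldest = 0;
  for (const auto& op : ops) {
    oldest = std::max(oldest, op.age_sec);
  }
  std::cout << "health report: SLOW_OPS count=" << ops.size()
            << " oldest=" << oldest << "s\n";
}

int main() {
  std::vector<SlowOp> ops = {{"osd_op", 45}, {"osd_op", 62}, {"delete", 90}};
  report_to_clog(ops);  // three lines land in the monitor's cluster log
  report_health(ops);   // one summary goes through the mgr's health machinery
  return 0;
}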
what do you think?
cheers,
Hi all,
We’re running into what seems to be a recurring bug in RGW when handling multipart uploads. The RGWs seem to be orphaning upload parts, which then take up space and can be difficult to find as they do not show up using client-side s3 tools. RGW shows the pieces from the multipart upload are essentially orphaned and still stored in the cluster even after the upload has finished and the complete object has been reassembled. We’re currently running Octopus version 2.8 and are able to reliably reproduce the bug.
Interestingly, when looking at the cluster through s3cmd or boto, it shows the correct bucket usage with just the completed multipart object in the bucket, and none of the smaller parts from the upload appearing. The bug seems to be related to bucket index sharding, as it is fixable by setting the shards to 0 and running a bucket check command. But running that command on buckets with sharding enabled doesn’t do anything and the orphans remain in the cluster.
Looking at the issues backlog it seems like this was a problem even in much earlier releases, dating all the way back to Hammer: https://tracker.ceph.com/issues/16767
We can confirm that this bug is still present in Octopus and Nautilus. A current manual workaround is to reset bucket sharding to 0 and run a bucket check command. However, this is impractical since one would need to know which bucket is affected (which can only be done through RGW, since the s3 tools don’t show the orphaned pieces), and bucket sharding would need to be set to 0 for the fix to work.
Has anyone else come across this bug? The comments on the issue ticket show it’s been a consistent problem through the years, but unfortunately with no movement. The bug was assigned 3 years ago but it looks like a fix was never implemented.
- Gavin
The tests over the weekend looked good. If you're aware of any
remaining blockers for the initial pacific release, speak now!
There is still some missing content in the release notes for RBD and RGW.
Thanks-
s
There are just a couple remaining issues before the final release.
Please test it out and report any bugs.
The full release notes are in progress here [0].
Notable Changes
---------------
* New ``bluestore_rocksdb_options_annex`` config
parameter. Complements ``bluestore_rocksdb_options`` and allows
setting rocksdb options without repeating the existing defaults.
* CephFS adds two new CDentry tags, 'I' --> 'i' and 'L' --> 'l',
and the on-RADOS metadata is no longer backwards compatible after
upgrading to Pacific or a later release.
* $pid expansion in config paths like ``admin_socket`` will now
properly expand to the daemon pid for commands like ``ceph-mds`` or
``ceph-osd``. Previously only ``ceph-fuse``/``rbd-nbd`` expanded
``$pid`` with the actual daemon pid.
* The allowable options for some ``radosgw-admin`` commands have been
changed.
  * ``mdlog-list``, ``datalog-list``, and ``sync-error-list`` no longer
    accept start and end dates, but do accept a single optional start
    marker.
  * ``mdlog-trim``, ``datalog-trim``, and ``sync-error-trim`` only
    accept a single marker giving the end of the trimmed range.
  * Similarly, the date ranges and marker ranges have been removed from
    the RESTful DATALog and MDLog list and trim operations.
* ceph-volume: The ``lvm batch`` subcommand received a major
rewrite. This closes a number of bugs and improves usability in
terms of size specification and calculation, as well as idempotency
behaviour and disk replacement process. Please refer to
https://docs.ceph.com/en/latest/ceph-volume/lvm/batch/ for more
detailed information.
* Configuration variables for permitted scrub times have changed. The
legal values for ``osd_scrub_begin_hour`` and ``osd_scrub_end_hour``
are 0 - 23. The use of 24 is now illegal. Specifying ``0`` for
both values causes every hour to be allowed. The legal values for
``osd_scrub_begin_week_day`` and ``osd_scrub_end_week_day`` are 0 -
6. The use of 7 is now illegal. Specifying ``0`` for both values
causes every day of the week to be allowed.
* Multiple file systems in a single Ceph cluster is now stable. New
Ceph clusters enable support for multiple file systems by
default. Existing clusters must still set the "enable_multiple" flag
on the fs. Please see the CephFS documentation for more information.
* volume/nfs: The "ganesha-" prefix was recently removed from the
cluster id and the nfs-ganesha common config object, to ensure a
consistent namespace across different orchestrator backends. Please
delete any existing nfs-ganesha clusters prior to upgrading and
redeploy new clusters after upgrading to Pacific.
* A new health check, DAEMON_OLD_VERSION, will warn if different
versions of Ceph are running on daemons. It will generate a health
error if multiple versions are detected. This condition must exist
for over mon_warn_older_version_delay (set to 1 week by default) in
order for the health condition to be triggered. This allows most
upgrades to proceed without falsely seeing the warning. If the
upgrade is paused for an extended time period, health mute can be
used like this: "ceph health mute DAEMON_OLD_VERSION --sticky". In
this case, after the upgrade has finished, use "ceph health unmute
DAEMON_OLD_VERSION".
* MGR: progress module can now be turned on/off, using the commands:
``ceph progress on`` and ``ceph progress off``.
* An AWS-compliant API: "GetTopicAttributes" was added to replace the
existing "GetTopic" API. The new API should be used to fetch
information about topics used for bucket notifications.
* librbd: The shared, read-only parent cache's config option
``immutable_object_cache_watermark`` has been updated to properly
reflect the upper cache utilization before space is reclaimed. The
default ``immutable_object_cache_watermark`` is now ``0.9``. If the
capacity reaches 90%, the daemon will delete cold cache entries.
* OSD: the option ``osd_fast_shutdown_notify_mon`` has been introduced
to allow the OSD to notify the monitor it is shutting down even if
``osd_fast_shutdown`` is enabled. This helps with the monitor logs on
larger clusters, which may otherwise get many 'osd.X reported
immediately failed by osd.Y' messages that confuse tools.
[0] https://github.com/ceph/ceph/pull/40265
Hi, I'm reading the function 'get_object_key' in src/os/bluestore/BlueStore.cc, and trying to understand why the onode key follows this order:
- shard_id
- hobj.pool
- hobj.hash_reverse_bits
- hobj.nspace
...
Would it be reasonable to change this order for a new cluster?
I only know that RocksDB stores omap and lists objects using the 'O' prefix.
So if I move 'hobj.nspace' to the head, will it be faster to list objects in a namespace using 'rados ls -N {namespace}'?
===================================================
template<typename S>
static void get_object_key(CephContext *cct, const ghobject_t& oid, S *key)
{
key->clear();
size_t max_len = ENCODED_KEY_PREFIX_LEN +
(oid.hobj.nspace.length() * 3 + 1) +
(oid.hobj.get_key().length() * 3 + 1) +
1 + // for '<', '=', or '>'
(oid.hobj.oid.name.length() * 3 + 1) +
8 + 8 + 1;
key->reserve(max_len);
_key_encode_prefix(oid, key);
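// note: _key_encode_prefix writes shard_id, pool, and the reversed-bit hash,
// so within a pool the keys interleave namespaces in hash order; nspace only
// orders entries that share the same hash value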
append_escaped(oid.hobj.nspace, key);
if (oid.hobj.get_key().length()) {
// is a key... could be < = or >.
append_escaped(oid.hobj.get_key(), key);
// (ASCII chars < = and > sort in that order, yay)
int r = oid.hobj.get_key().compare(oid.hobj.oid.name);
if (r) {
key->append(r > 0 ? ">" : "<");
append_escaped(oid.hobj.oid.name, key);
} else {
// same as no key
key->append("=");
}
} else {
// no key
append_escaped(oid.hobj.oid.name, key);
key->append("=");
}
_key_encode_u64(oid.hobj.snap, key);
_key_encode_u64(oid.generation, key);
key->push_back(ONODE_KEY_SUFFIX);
}
Hey folks, to help spread awareness we're going to start sending
summaries of this meeting to dev(a)ceph.io. Anyone involved in the project
is welcome to join. The time, link, and notes are here [0]. It's
also on the ceph community calendar [1].
This week we discussed pacific readiness - there are just a few
outstanding issues at this point, which we expect to have fixed
this week.
As part of the release, we're preparing release notes by adding
to the branch here [2] and updating the trello board for anything
targeted at pacific [3].
Finally, ceph developer summit for Quincy is coming up in a couple
weeks, so folks are encouraged to add to the agenda [4].
Josh
[0] https://pad.ceph.com/p/clt-weekly-minutes
[1]
https://calendar.google.com/calendar/embed?src=9ts9c7lt7u1vic2ijvvqqlfpo0%4…
[2] https://github.com/ceph/ceph/pull/40265
[3] https://trello.com/b/ugTc2QFH/ceph-backlog
[4] https://pad.ceph.com/p/cds-quincy