in multisite, these async notifications are http messages that get
periodically broadcast to peer zones as new entries are added to a
shard of the mdlog or datalog. on the destination zones, they serve
two purposes:
* wake up the coroutines that were processing the given log shards, in
case they were sleeping because there was nothing to do the last time
they polled
* for data sync only, these messages also carry the keys of each new
datalog entry so we can trigger sync on the related bucket shards (in
addition to the buckets we're already syncing from the datalog itself)
these notifications have been around since jewel. as i understand it,
the goal was to make replication feel more responsive to updates, but
the model has two major flaws:
* it doesn't scale to more than one gateway per zone. when
broadcasting these notifications, we choose one radosgw endpoint from
each peer zone - but we have no way to know which one of those is
actually processing the log shards we're trying to notify. on receipt,
data sync will cache all of these keys in a map of 'modified_shards',
and the entries will just pile up in memory for the shards it isn't
processing (a rough sketch of this follows the list)
* it reduces the apparent latency of sync on some buckets at the
expense of overall sync throughput. not only does it prioritize sync
of 'hot' buckets over buckets in the backlog, but for every bucket we
sync via a notification, we'll sync it again when we get to its
entry in the log. i don't think this tradeoff is a good one
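to make the pile-up concrete, here's a minimal standalone sketch of the
problem - this is not the actual rgw code; the container and the
handle_notify/consume_shard helpers are simplified stand-ins:

// minimal sketch (not the actual rgw code; names/types are stand-ins):
// per-shard notification keys pile up on a gateway that only ever
// consumes a few of the shards
#include <iostream>
#include <map>
#include <set>
#include <string>

// stand-in for data sync's cache of keys carried by datalog notifications
std::map<int, std::set<std::string>> modified_shards;

// on receipt of a notification, every key is cached under its datalog shard
void handle_notify(int shard, const std::string& key) {
  modified_shards[shard].insert(key);
}

// only the shards this gateway actually processes ever get drained
std::set<std::string> consume_shard(int shard) {
  std::set<std::string> keys;
  keys.swap(modified_shards[shard]);
  return keys;
}

int main() {
  // notifications arrive for 128 shards, but this gateway only runs shard 0
  for (int shard = 0; shard < 128; ++shard) {
    handle_notify(shard, "bucket:instance:" + std::to_string(shard));
  }
  consume_shard(0);
  size_t stuck = 0;
  for (const auto& entry : modified_shards) {
    if (!entry.second.empty()) {
      ++stuck;
    }
  }
  // prints 127: keys for shards nobody here polls are never consumed
  std::cout << stuck << " shards still holding keys nobody will consume\n";
  return 0;
}

the real code is coroutine-driven, of course, but the shape of the
problem is the same: keys cached for shards this gateway never polls
are never consumed.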
what does everyone else think? are there other reasons to keep sending these?
Hi Folks,
The performance meeting will be starting in about 40 minutes at 8AM
PST! Today I would like to talk about my work writing a simple omap
benchmark based on our existing gtest suite and some of the behaviors I
observed in bluestore. We will also be discussing cache locking issues
in bluestore and possibly tradeoffs between simplification of the
locking behavior vs performance impact. Hope to see you there!
Etherpad:
https://pad.ceph.com/p/performance_weekly
Bluejeans:
https://bluejeans.com/908675367
Mark
Hi,
The CephFS watercooler talks [1] are short technical discussions on
topics usually related to CephFS. For the past few weeks, however,
we've had talks on Modern C++ features based on the book, "Effective
Modern C++: 42 Specific Ways to Improve Your Use of C++11 and C++14"
by Scott Meyers.
On March 16, Patrick Donnelly gave a talk on standard library smart
pointers. The recording is here,
https://bluejeans.com/s/7zTXYjspSuk/
On March 23, Rishabh Dave gave a talk on rvalue references, move
semantics and perfect forwarding. The recording is here,
https://bluejeans.com/s/cSFT0DZGg46
The recording links of the past watercooler talks are available at,
https://pad.ceph.com/p/CephFS_watercooler_recordings
Tomorrow (March 30), we'll have Venky Shankar talk about lambda expressions.
-Ramana
[1] CephFS water cooler talks are usually on Tuesdays at
https://bluejeans.com/4828642302 from 9:45-10:15 AM ET. It's on the
'Ceph Community' calendar.
hi folks,
i want to draw your attention to the tracker ticket
https://tracker.ceph.com/issues/48909, and discuss a better solution
with you.
some context first: back in https://github.com/ceph/ceph/pull/18614,
changes were made so that slow requests were reported to the mgr, to
move that burden from the monitor to the mgr. with that change, all
health related reports are sent to the mgr, which composes the
aggregated version and sends it to the monitor. i think that helps
improve the scalability of a Ceph cluster. moreover, IIUC, letting the
mgr take part of the monitor's load was one of the reasons the mgr was
introduced in the first place.
in https://tracker.ceph.com/issues/43975, it's reported that slow ops
were no longer recorded in the cluster log after mimic. as a fix,
https://github.com/ceph/ceph/pull/33328 was created to send slow ops and
their types to the cluster log.
in https://tracker.ceph.com/issues/43975, it's noted that this fix
actually worsens the performance of a cluster suffering from slow ops,
by adding more load to the monitor. hence
https://github.com/ceph/ceph/pull/39199 was created to throttle this.
i am wondering if we can make better use of the health reporting
machinery, instead of pouring health warnings into the clog, when slow
ops are observed?
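to illustrate the difference, here's a rough standalone sketch - not
ceph code, the types and functions are made up for illustration -
comparing a clog line per slow op with a single aggregated health
report per reporting interval:

// rough sketch (not ceph code): per-event cluster-log lines vs. one
// aggregated health report per reporting interval
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct SlowOp {
  std::string type;
  uint64_t age_sec;
};

// per-event path: one clog line per slow op, so monitor load grows
// with the number of slow ops
void report_to_clog(const std::vector<SlowOp>& ops) {
  for (const auto& op : ops) {
    std::cout << "clog WRN: slow " << op.type
              << ", blocked " << op.age_sec << "s\n";
  }
}

// aggregated path: one summary per daemon per tick, regardless of count
void report_health(const std::vector<SlowOp>& ops) {
  if (ops.empty()) {
    return;
  }
  uint64_t oldest = 0;
  for (const auto& op : ops) {
    oldest = std::max(oldest, op.age_sec);
  }
  std::cout << "health report: SLOW_OPS count=" << ops.size()
            << " oldest=" << oldest << "s\n";
}

int main() {
  std::vector<SlowOp> ops = {{"osd_op", 45}, {"osd_op", 62}, {"delete", 90}};
  report_to_clog(ops);  // three lines land in the monitor's cluster log
  report_health(ops);   // one summary goes through the mgr's health machinery
  return 0;
}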
what do you think?
cheers,
Hi all,
We’re running into what seems to be a recurring bug in RGW when handling multipart uploads. The RGWs seem to be orphaning upload parts, which then take up space and can be difficult to find as they do not show up using client-side s3 tools. RGW shows the pieces from the multipart upload are essentially orphaned and still stored in the cluster even after the upload has finished and the complete object has been reassembled. We’re currently running Octopus version 2.8 and are able to reliably reproduce the bug.
Interestingly, when looking at the cluster through s3cmd or boto, it shows the correct bucket usage with just the completed multipart object in the bucket, and none of the smaller parts from the upload appearing. The bug seems to be related to bucket index sharding, as it is fixable by setting the shards to 0 and running a bucket check command. But running that command on buckets with sharding enabled doesn’t do anything and the orphans remain in the cluster.
Looking at the issues backlog it seems like this was a problem even in much earlier releases, dating all the way back to Hammer: https://tracker.ceph.com/issues/16767
We can confirm that this bug is still present in Octopus and Nautilus. A current manual workaround is to reset bucket sharding to 0 and run a bucket check command. However, this is impractical since one would need to know which bucket is affected (which can only be done through RGW, since the s3 tools don’t show the orphaned pieces), and bucket sharding would need to be set to 0 for the fix to work.
Has anyone else come across this bug? The comments on the issue ticket show it’s been a consistent problem through the years, but unfortunately with no movement. The bug was assigned 3 years ago but it looks like a fix was never implemented.
- Gavin
The tests over the weekend looked good. If you're aware of any
remaining blockers for the initial pacific release, speak now!
There is still some missing content in the release notes for RBD and RGW.
Thanks-
s
There are just a couple remaining issues before the final release.
Please test it out and report any bugs.
The full release notes are in progress here [0].
Notable Changes
---------------
* New ``bluestore_rocksdb_options_annex`` config
parameter. Complements ``bluestore_rocksdb_options`` and allows
setting rocksdb options without repeating the existing defaults.
* CephFS adds two new CDentry tags, 'I' --> 'i' and 'L' --> 'l',
and the on-RADOS metadata is no longer backwards compatible after
upgrading to Pacific or a later release.
* $pid expansion in config paths like ``admin_socket`` will now
properly expand to the daemon pid for commands like ``ceph-mds`` or
``ceph-osd``. Previously only ``ceph-fuse``/``rbd-nbd`` expanded
``$pid`` with the actual daemon pid.
* The allowable options for some ``radosgw-admin`` commands have been
changed.
  * ``mdlog-list``, ``datalog-list``, and ``sync-error-list`` no longer
    accept start and end dates, but do accept a single optional start
    marker.
  * ``mdlog-trim``, ``datalog-trim``, and ``sync-error-trim`` only
    accept a single marker giving the end of the trimmed range.
  * Similarly, the date ranges and marker ranges have been removed from
    the RESTful DATALog and MDLog list and trim operations.
* ceph-volume: The ``lvm batch`` subcommand received a major
rewrite. This closes a number of bugs and improves usability in
terms of size specification and calculation, as well as idempotency
behaviour and disk replacement process. Please refer to
https://docs.ceph.com/en/latest/ceph-volume/lvm/batch/ for more
detailed information.
* Configuration variables for permitted scrub times have changed. The
legal values for ``osd_scrub_begin_hour`` and ``osd_scrub_end_hour``
are 0 - 23. The use of 24 is now illegal. Specifying ``0`` for
both values causes every hour to be allowed. The legal values for
``osd_scrub_begin_week_day`` and ``osd_scrub_end_week_day`` are 0 -
6. The use of 7 is now illegal. Specifying ``0`` for both values
causes every day of the week to be allowed.
* Multiple file systems in a single Ceph cluster is now stable. New
Ceph clusters enable support for multiple file systems by
default. Existing clusters must still set the "enable_multiple" flag
on the fs. Please see the CephFS documentation for more information.
* volume/nfs: The "ganesha-" prefix was recently removed from the
cluster id and the nfs-ganesha common config object, to ensure a
consistent namespace across different orchestrator backends. Please
delete any existing nfs-ganesha clusters prior to upgrading and
redeploy new clusters after upgrading to Pacific.
* A new health check, DAEMON_OLD_VERSION, will warn if different
versions of Ceph are running on daemons. It will generate a health
error if multiple versions are detected. This condition must exist
for over mon_warn_older_version_delay (set to 1 week by default) in
order for the health condition to be triggered. This allows most
upgrades to proceed without falsely seeing the warning. If the
upgrade is paused for an extended time period, health mute can be
used like this: "ceph health mute DAEMON_OLD_VERSION --sticky". In
this case, after the upgrade has finished, use "ceph health unmute
DAEMON_OLD_VERSION".
* MGR: progress module can now be turned on/off, using the commands:
``ceph progress on`` and ``ceph progress off``.
* An AWS-compliant API: "GetTopicAttributes" was added to replace the
existing "GetTopic" API. The new API should be used to fetch
information about topics used for bucket notifications.
* librbd: The shared, read-only parent cache's config option
``immutable_object_cache_watermark`` has been updated to properly
reflect the upper cache utilization before space is reclaimed. The
default ``immutable_object_cache_watermark`` is now ``0.9``. If the
capacity reaches 90%, the daemon will delete cold cache entries.
* OSD: the option ``osd_fast_shutdown_notify_mon`` has been introduced
to allow the OSD to notify the monitor it is shutting down even if
``osd_fast_shutdown`` is enabled. This helps with the monitor logs on
larger clusters, which may otherwise get many 'osd.X reported
immediately failed by osd.Y' messages that confuse tools.
[0] https://github.com/ceph/ceph/pull/40265
Hi, I'm reading the function 'get_object_key' in src/os/bluestore/BlueStore.cc, and trying to understand why the onode key follows this order:
- shard_id
- hobj.pool
- hobj.hash_reverse_bits
- hobj.nspace
...
Would it be reasonable to change this order for a new cluster?
I only know that RocksDB stores omap and lists objects using the 'O' prefix.
So if I move 'hobj.nspace' to the head, will it be faster to list objects in a namespace using 'rados ls -N {namespace}'?
===================================================
template<typename S>
static void get_object_key(CephContext *cct, const ghobject_t& oid, S *key)
{
key->clear();
size_t max_len = ENCODED_KEY_PREFIX_LEN +
(oid.hobj.nspace.length() * 3 + 1) +
(oid.hobj.get_key().length() * 3 + 1) +
1 + // for '<', '=', or '>'
(oid.hobj.oid.name.length() * 3 + 1) +
8 + 8 + 1;
key->reserve(max_len);
_key_encode_prefix(oid, key);
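// note: _key_encode_prefix writes shard_id, pool, and the reversed-bit hash,
// so within a pool the keys interleave namespaces in hash order; nspace only
// orders entries that share the same hash value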
append_escaped(oid.hobj.nspace, key);
if (oid.hobj.get_key().length()) {
// is a key... could be < = or >.
append_escaped(oid.hobj.get_key(), key);
// (ASCII chars < = and > sort in that order, yay)
int r = oid.hobj.get_key().compare(oid.hobj.oid.name);
if (r) {
key->append(r > 0 ? ">" : "<");
append_escaped(oid.hobj.oid.name, key);
} else {
// same as no key
key->append("=");
}
} else {
// no key
append_escaped(oid.hobj.oid.name, key);
key->append("=");
}
_key_encode_u64(oid.hobj.snap, key);
_key_encode_u64(oid.generation, key);
key->push_back(ONODE_KEY_SUFFIX);
}
Hey folks, to help spread awareness we're going to start sending
summaries of this meeting to dev(a)ceph.io. Anyone involved in the project
is welcome to join. The time, link, and notes are here [0]. It's
also on the ceph community calendar [1].
This week we discussed pacific readiness - there are just a few
outstanding issues at this point, which we expect to have fixed
this week.
As part of the release, we're preparing release notes by adding
to the branch here [2] and updating the trello board for anything
targeted at pacific [3].
Finally, ceph developer summit for Quincy is coming up in a couple
weeks, so folks are encouraged to add to the agenda [4].
Josh
[0] https://pad.ceph.com/p/clt-weekly-minutes
[1]
https://calendar.google.com/calendar/embed?src=9ts9c7lt7u1vic2ijvvqqlfpo0%4…
[2] https://github.com/ceph/ceph/pull/40265
[3] https://trello.com/b/ugTc2QFH/ceph-backlog
[4] https://pad.ceph.com/p/cds-quincy