Hi all,
Does anyone know how to trace a write/read request from the client
through rbd/rados to the OSD?
Is there any useful document besides the doc/dev/blkin?
I'm wondering how to trace a large distributed storage system such as
Ceph in order to observe/monitor it.
B.R.
Changcheng
On Sat, Mar 28, 2020 at 1:29 AM Mazzystr <mazzystr(a)gmail.com> wrote:
>
> What about the missing dependencies for octopus on el8? (looking at yoooou
> ceph-mgr!)
FWIW, leveldb for el8 is pending review at
https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2020-3171aba6be,
if you could help test it, that'd be great!
>
> On Fri, Mar 27, 2020 at 7:15 AM Sage Weil <sage(a)newdream.net> wrote:
>
> > One word of caution: there is one known upgrade issue if you
> >
> > - upgrade from luminous to nautilus, and then
> > - run nautilus for a very short period of time (hours), and then
> > - upgrade from nautilus to octopus
> >
> > that prevents OSDs from starting. We have a fix that will be in 15.2.1,
> > but until that is out, I would recommend against the double-upgrade. If
> > you have been running nautilus for a while (days) you should be fine.
> >
> > sage
> >
> >
> > https://tracker.ceph.com/issues/44770
--
Regards
Kefu Chai
Hi everyone,
I am taking time off from the Ceph project and from Red Hat, starting in
April and extending through the US election in November. I will initially
be working with an organization focused on voter registration and turnout
and combating voter suppression and disinformation campaigns.
During this time I will maintain some involvement in the Ceph community,
primarily around strategic planning for Pacific and the Ceph Foundation,
but most of my time will be focused elsewhere.
Most decision making around Ceph will remain in the capable hands of the
Ceph Leadership Team and component leads--I have the utmost confidence in
their judgement and abilities. Yehuda Sadeh and Josh Durgin will be
filling in to provide high-level guidance where needed.
I’ll be participating in the Pacific planning meetings planned for next
week, which will be important in kicking off development for Pacific:
https://ceph.io/cds/ceph-developer-summit-pacific/
I am extremely proud of what we have accomplished with the Octopus
release, and I believe the Ceph community will continue to do great things
with Pacific! I look forward to returning at the end of the year to help
wrap up the release and (hopefully) get things ready for Cephalocon next
March.
Most of all, I am excited to become engaged in another effort that I feel
strongly about--one that will have a very real impact on my kids’
futures--and that will be easier to explain to lay people! :)
Thanks!
sage
I am trying to develop some best practices around setting Kubernetes Pod Memory Requests and Memory Limits for Ceph daemons.
Setting a Pod Memory Request will control how Kubernetes schedules a pod. Setting a Pod Memory Limit will mean that the container may be killed if it exceeds the limit.
Advice I got from Joao: Ceph monitors are most likely to over-use memory during recovery scenarios, and killing mons at that point because they exceed a limit may make the problem much worse. The best practice I have here is to set only a memory request for Ceph mons, ideally 4GB.
In the case of OSDs, things are a little more complex. OSDs will read the POD_MEMORY_REQUEST and POD_MEMORY_LIMIT environment variables, which are set by Rook inside Kubernetes pods, and will tune their memory usage to meet them. They target the minimum of POD_MEMORY_REQUEST and [POD_MEMORY_LIMIT * 0.8]; a worked example follows the list below. OSDs, to my understanding, aggressively try to stay within their targets. What are the risks of setting (or not setting) Pod Memory Limits on OSDs, knowing that if the limit is set too low or if the OSDs begin to leak memory, they will be terminated and restarted by Kubernetes?
- One risk I can imagine is that if OSDs are all started at nearly the same time and experience similar loads, they might leak memory at similar rates and be killed by Kubernetes at about the same time. Stampeding herds of OSD memory leaks followed by memory-limit terminations might occur, which could ripple out and cause other OSDs to become unstable.
- Not setting a limit might mean that OSDs experience memory leaks and cause OOM situations for other daemons, or for the Kubernetes kubelet if the system settings don't guarantee the kubelet some amount of resources.
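To make the target computation concrete, here is a hypothetical worked example (assuming, as I understand it, that the OSD maps the result onto its osd_memory_target option):

    # POD_MEMORY_REQUEST=4Gi, POD_MEMORY_LIMIT=6Gi
    # target = min(4GiB, 0.8 * 6GiB) = min(4GiB, 4.8GiB) = 4GiB
    # roughly equivalent to:
    ceph config set osd osd_memory_target 4294967296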
What are the risks of killing other daemons past a particular limit? Is it good to kill daemons if they exceed a limit in order to prevent memory leaks from affecting the rest of the system? MDS? RGW? MGR? NFS-Ganesha?
If anyone has knowledgeable recommendations about any daemons, I'd love your input. Please reply-all so that I get replies straight to my inbox.
Blaine
We're happy to announce the first stable release of Octopus v15.2.0.
There are a lot of changes and new features added, we advise everyone to
read the release notes carefully, and in particular the upgrade notes,
before upgrading. Please refer to the official blog entry
https://ceph.io/releases/v15-2-0-octopus-released/ for a detailed
version with links & changelog.
This release wouldn't have been possible without the support of the
community, this release saw contributions from over 330 developers & 80
organizations, and we thank everyone for making this release happen.
Major Changes from Nautilus
---------------------------
General
~~~~~~~
* A new deployment tool called **cephadm** has been introduced that
integrates Ceph daemon deployment and management via containers
into the orchestration layer.
* Health alerts can now be muted, either temporarily or permanently.
* Health alerts are now raised for recent Ceph daemon crashes.
* A simple 'alerts' module has been introduced to send email
health alerts for clusters deployed without the benefit of an
existing external monitoring infrastructure.
* Packages are built for the following distributions:
- CentOS 8
- CentOS 7 (partial--see below)
- Ubuntu 18.04 (Bionic)
- Debian Buster
- Container images (based on CentOS 8)
Note that the dashboard, prometheus, and restful manager modules
will not work on the CentOS 7 build due to Python 3 module
dependencies that are missing in CentOS 7.
In addition, packages built by the community will also be available for
the following distros:
- Fedora (33/rawhide)
- openSUSE (15.2, Tumbleweed)
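As a rough sketch of how a few of the features above are exercised from
the CLI (command forms as documented for Octopus; the IP address, SMTP
host, and email addresses below are placeholders)::

    # bootstrap a new single-host cluster with cephadm
    cephadm bootstrap --mon-ip 10.0.0.1

    # mute a health alert for 4 hours, or unmute it again
    ceph health mute OSD_DOWN 4h
    ceph health unmute OSD_DOWN

    # enable the simple 'alerts' module and point it at an SMTP relay
    ceph mgr module enable alerts
    ceph config set mgr mgr/alerts/smtp_host smtp.example.com
    ceph config set mgr mgr/alerts/smtp_destination ops@example.com
    ceph config set mgr mgr/alerts/smtp_sender ceph-alerts@example.com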
Dashboard
~~~~~~~~~
The mgr-dashboard has gained a lot of new features and functionality:
* UI Enhancements
- New vertical navigation bar
- New unified sidebar: better background task and events notification
- Shows all progress mgr module notifications
- Multi-select on tables to perform bulk operations
* Dashboard user account security enhancements
- Disabling/enabling existing user accounts
- Clone an existing user role
- Users can change their own password
- Configurable password policies: Minimum password complexity/length
requirements
- Configurable password expiration
- Change password after first login
New and enhanced management of Ceph features/services:
* OSD/device management
- List all disks associated with an OSD
- Add support for blinking enclosure LEDs via the orchestrator
- List all hosts known by the orchestrator
- List all disks and their properties attached to a node
- Display disk health information (health prediction and SMART data)
- Deploy new OSDs on new disks/hosts
- Display and allow sorting by an OSD's default device class in the OSD
table
- Explicitly set/change the device class of an OSD, display and sort OSDs by
device class
* Pool management
- Viewing and setting pool quotas
- Define and change per-pool PG autoscaling mode
* RGW management enhancements
- Enable bucket versioning
- Enable MFA support
- Select placement target on bucket creation
* CephFS management enhancements
- CephFS client eviction
- CephFS snapshot management
- CephFS quota management
- Browse CephFS directory
* iSCSI management enhancements
- Show iSCSI GW status on landing page
- Prevent deletion of IQNs with open sessions
- Display iSCSI "logged in" info
* Prometheus alert management
- List configured Prometheus alerts
RADOS
~~~~~
* Objects can now be brought in sync during recovery by copying only
the modified portion of the object, reducing tail latencies during
recovery.
* Ceph will allow recovery below *min_size* for erasure-coded pools,
wherever possible.
* The PG autoscaler feature introduced in Nautilus is enabled for
new pools by default, allowing new clusters to autotune *pg num*
without any user intervention. The default values for new pools
and RGW/CephFS metadata pools have also been adjusted to perform
well for most users.
* BlueStore has received several improvements and performance
updates, including improved accounting for "omap" (key/value)
object data by pool, improved cache memory management, and a
reduced allocation unit size for SSD devices. (Note that by
default, the first time each OSD starts after upgrading to octopus
it will trigger a conversion that may take from a few minutes to a
few hours, depending on the amount of stored "omap" data.)
* Snapshot trimming metadata is now managed in a more efficient and
scalable fashion.
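For example, the autoscaler's recommendations and a pool's mode can be
inspected and adjusted with (`mypool` is a placeholder name)::

    ceph osd pool autoscale-status
    ceph osd pool set mypool pg_autoscale_mode on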
RBD block storage
~~~~~~~~~~~~~~~~~
* Mirroring now supports a new snapshot-based mode that no longer requires
the journaling feature and its related impacts in exchange for the loss
of point-in-time consistency (it remains crash consistent).
* Clone operations now preserve the sparseness of the underlying RBD image.
* The trash feature has been improved to (optionally) automatically
move old parent images to the trash when their children are all
deleted or flattened.
* The trash can be configured to automatically purge on a defined schedule.
* Images can be online re-sparsified to reduce the usage of zeroed extents.
* The `rbd-nbd` tool has been improved to use more modern kernel interfaces.
* Caching has been improved to be more efficient and performant.
* `rbd-mirror` automatically adjusts its per-image memory usage based
upon its memory target.
* A new persistent read-only caching daemon is available to offload reads from
shared parent images.
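A brief sketch of a few of the RBD features above, assuming a pool named
`mypool` and an image named `myimage` (command forms as documented for
Octopus)::

    # enable snapshot-based mirroring for an image (no journaling needed)
    rbd mirror image enable mypool/myimage snapshot

    # create mirror snapshots automatically every hour
    rbd mirror snapshot schedule add --pool mypool 1h

    # purge the pool's trash once a day
    rbd trash purge schedule add --pool mypool 1d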
RGW object storage
~~~~~~~~~~~~~~~~~~
* New `Multisite Sync Policy` primitives for per-bucket replication. (EXPERIMENTAL)
* S3 feature support:
- Bucket Replication (EXPERIMENTAL)
- `Bucket Notifications`_ via HTTP/S, AMQP and Kafka
- Bucket Tagging
- Object Lock
- Public Access Block for buckets
* Bucket sharding:
- Significantly improved listing performance on buckets with many shards.
- Dynamic resharding prefers prime shard counts for improved distribution.
- Raised the default number of bucket shards to 11.
* Added `HashiCorp Vault Integration`_ for SSE-KMS.
* Added Keystone token cache for S3 requests.
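As a sketch, the Vault integration is configured through new `rgw crypt`
options (the address and token file below are placeholders; see the
radosgw documentation for the full set)::

    rgw crypt s3 kms backend = vault
    rgw crypt vault auth = token
    rgw crypt vault addr = http://vault.example.com:8200
    rgw crypt vault token file = /etc/ceph/vault.token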
CephFS distributed file system
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* Inline data support in CephFS has been deprecated and will likely be
removed in a future release.
* MDS daemons can now be assigned to manage a particular file system via the
new `mds_join_fs` option.
* MDS now aggressively asks idle clients to trim caps, which improves stability
when file system load changes.
* The mgr volumes plugin has received numerous improvements to support CephFS
via CSI, including snapshots and cloning.
* cephfs-shell has had numerous incremental improvements and bug fixes.
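For example, an MDS daemon can be pinned to a particular file system with
the new option (the daemon and file system names are placeholders)::

    ceph config set mds.a mds_join_fs cephfs_a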
Upgrading from Mimic or Nautilus
--------------------------------
You can monitor the progress of your upgrade at each stage with the
`ceph versions` command, which will tell you what ceph version(s) are
running for each type of daemon.
Instructions
~~~~~~~~~~~~
#. Make sure your cluster is stable and healthy (no down or
recovering OSDs). (Optional, but recommended.)
#. Set the `noout` flag for the duration of the upgrade. (Optional,
but recommended.)::
# ceph osd set noout
#. Upgrade monitors by installing the new packages and restarting the
monitor daemons. For example, on each monitor host,::
# systemctl restart ceph-mon.target
Once all monitors are up, verify that the monitor upgrade is
complete by looking for the `octopus` string in the mon
map. The command::
# ceph mon dump | grep min_mon_release
should report::
min_mon_release 15 (octopus)
If it doesn't, that implies that one or more monitors haven't been
upgraded and restarted and/or the quorum does not include all monitors.
#. Upgrade `ceph-mgr` daemons by installing the new packages and
restarting all manager daemons. For example, on each manager host,::
# systemctl restart ceph-mgr.target
Verify the `ceph-mgr` daemons are running by checking `ceph
-s`::
# ceph -s
...
services:
mon: 3 daemons, quorum foo,bar,baz
mgr: foo(active), standbys: bar, baz
...
#. Upgrade all OSDs by installing the new packages and restarting the
ceph-osd daemons on all OSD hosts::
# systemctl restart ceph-osd.target
Note that the first time each OSD starts, it will do a format
conversion to improve the accounting for "omap" data. This may
take a few minutes to as much as a few hours (for an HDD with lots
of omap data). You can disable this automatic conversion with::
# ceph config set osd bluestore_fsck_quick_fix_on_mount false
You can monitor the progress of the OSD upgrades with the
`ceph versions` or `ceph osd versions` commands::
# ceph osd versions
{
"ceph version 13.2.5 (...) mimic (stable)": 12,
"ceph version 15.2.0 (...) octopus (stable)": 22,
}
#. Upgrade all CephFS MDS daemons. For each CephFS file system,
#. Reduce the number of ranks to 1. (Make note of the original
number of MDS daemons first if you plan to restore it later.)::
# ceph status
# ceph fs set <fs_name> max_mds 1
#. Wait for the cluster to deactivate any non-zero ranks by
periodically checking the status::
# ceph status
#. Take all standby MDS daemons offline on the appropriate hosts with::
# systemctl stop ceph-mds@<daemon_name>
#. Confirm that only one MDS is online and is rank 0 for your FS::
# ceph status
#. Upgrade the last remaining MDS daemon by installing the new
packages and restarting the daemon::
# systemctl restart ceph-mds.target
#. Restart all standby MDS daemons that were taken offline::
# systemctl start ceph-mds.target
#. Restore the original value of `max_mds` for the volume::
# ceph fs set <fs_name> max_mds <original_max_mds>
#. Upgrade all radosgw daemons by upgrading packages and restarting
daemons on all hosts::
# systemctl restart ceph-radosgw.target
#. Complete the upgrade by disallowing pre-Octopus OSDs and enabling
all new Octopus-only functionality::
# ceph osd require-osd-release octopus
#. If you set `noout` at the beginning, be sure to clear it with::
# ceph osd unset noout
#. Verify the cluster is healthy with `ceph health`.
If your CRUSH tunables are older than Hammer, Ceph will now issue a
health warning. If you see a health alert to that effect, you can
revert this change with::
ceph config set mon mon_crush_min_required_version firefly
If Ceph does not complain, however, then we recommend you also
switch any existing CRUSH buckets to straw2, which was added back
in the Hammer release. If you have any 'straw' buckets, this will
result in a modest amount of data movement, but generally nothing
too severe.::
ceph osd getcrushmap -o backup-crushmap
ceph osd crush set-all-straw-buckets-to-straw2
If there are problems, you can easily revert with::
ceph osd setcrushmap -i backup-crushmap
Moving to 'straw2' buckets will unlock a few recent features, like
the `crush-compat` :ref:`balancer <balancer>` mode added back in Luminous.
#. If you are upgrading from Mimic, or did not already do so when you
upgraded to Nautilus, we recommend you enable the new :ref:`v2
network protocol <msgr2>`. Issue the following command::
ceph mon enable-msgr2
This will instruct all monitors that bind to the old default port
6789 for the legacy v1 protocol to also bind to the new 3300 v2
protocol port. To see if all monitors have been updated,::
ceph mon dump
and verify that each monitor has both a `v2:` and `v1:` address
listed.
#. Consider enabling the :ref:`telemetry module <telemetry>` to send
anonymized usage statistics and crash information to the Ceph
upstream developers. To see what would be reported (without actually
sending any information to anyone),::
ceph mgr module enable telemetry
ceph telemetry show
If you are comfortable with the data that is reported, you can opt-in to
automatically report the high-level cluster metadata with::
ceph telemetry on
For more information about the telemetry module, see :ref:`the
documentation <telemetry>`.
Upgrading from pre-Mimic releases (like Luminous)
-------------------------------------------------
You *must* first upgrade to Mimic (13.2.z) or Nautilus (14.2.z) before
upgrading to Octopus.
Upgrade compatibility notes
---------------------------
* Starting with Octopus, there is now a separate repository directory
for each version on `download.ceph.com` (e.g., `rpm-15.2.0` and
`debian-15.2.0`). The traditional package directory that is named
after the release (e.g., `rpm-octopus` and `debian-octopus`) is
now a symlink to the most recent bug fix version for that release.
We no longer generate a single repository that combines all bug fix
versions for a single named release.
* The RGW "num_rados_handles" option has been removed.
If you were using a value of "num_rados_handles" greater than 1,
multiply your current "objecter_inflight_ops" and
"objecter_inflight_op_bytes" parameters by the old
"num_rados_handles" to get the same throttle behavior.
* Ceph now packages python bindings for python3.6 instead of
python3.4, because python3 in EL7/EL8 is now using python3.6
as the native python3. See the `announcement`_
for more details on the background of this change.
* librbd now uses a write-around cache policy by default,
replacing the previous write-back cache policy default.
This cache policy allows librbd to immediately complete
write IOs while they are still in-flight to the OSDs.
Subsequent flush requests will ensure all in-flight
write IOs are completed before the flush completes. The
librbd cache policy can be controlled via a new
"rbd_cache_policy" configuration option.
* librbd now includes a simple IO scheduler which attempts to
batch together multiple IOs against the same backing RBD
data block object. The librbd IO scheduler policy can be
controlled via a new "rbd_io_scheduler" configuration
option.
* RGW: radosgw-admin introduces two subcommands that allow managing
expire-stale objects that might be left behind after a
bucket reshard in earlier versions of RGW. One subcommand lists such
objects and the other deletes them. Read the troubleshooting section
of the dynamic resharding docs for details.
* RGW: Bucket naming restrictions have changed and are likely to cause
InvalidBucketName errors. We recommend setting the
`rgw_relaxed_s3_bucket_names` option to true as a workaround.
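For example (assuming your radosgw daemons read options from the
`client.rgw` config section)::

    ceph config set client.rgw rgw_relaxed_s3_bucket_names true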
* In the Zabbix Mgr Module there was a typo in the key being sent
to Zabbix for PGs in the backfill_wait state. The key that was sent
was 'wait_backfill' and the correct name is 'backfill_wait'.
Update your Zabbix template accordingly so that it accepts the
new key being sent to Zabbix.
* The zabbix plugin for the ceph manager now includes OSD and pool
discovery. An update of zabbix_template.xml is needed
to receive per-pool (read/write throughput, disk space usage)
and per-OSD (latency, status, PGs) statistics.
* The format of all date + time stamps has been modified to fully
conform to ISO 8601. The old format (`YYYY-MM-DD
HH:MM:SS.ssssss`) excluded the `T` separator between the date and
time and was rendered using the local time zone without any explicit
indication. The new format includes the separator as well as a
`+nnnn` or `-nnnn` suffix to indicate the time zone, or a `Z`
suffix if the time is UTC. For example,
`2019-04-26T18:40:06.225953+0100`.
Any code or scripts that were previously parsing date and/or time
values from the JSON or XML structure CLI output should be checked
to ensure it can handle ISO 8601 conformant values. Any code
parsing date or time values from the unstructured human-readable
output should be modified to parse the structured output instead, as
the human-readable output may change without notice.
* The `bluestore_no_per_pool_stats_tolerance` config option has been
replaced with `bluestore_fsck_error_on_no_per_pool_stats`
(default: false). The overall default behavior has not changed:
fsck will warn but not fail on legacy stores, and repair will
convert to per-pool stats.
* The disaster-recovery related 'ceph mon sync force' command has been
replaced with 'ceph daemon <...> sync_force'.
* The `osd_recovery_max_active` option now has
`osd_recovery_max_active_hdd` and `osd_recovery_max_active_ssd`
variants, each with different default values for HDD and SSD-backed
OSDs, respectively. By default `osd_recovery_max_active` now
defaults to zero, which means that the OSD will conditionally use
the HDD or SSD option values. Administrators who have customized
this value may want to consider whether they have set this to a
value similar to the new defaults (3 for HDDs and 10 for SSDs) and,
if so, remove the option from their configuration entirely.
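For example, a previously customized value can be dropped in favor of
the new per-device-type defaults with::

    ceph config rm osd osd_recovery_max_active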
* Monitors now have a `ceph osd info` command that provides information
on all OSDs, or on specified OSDs, simplifying the process of
parsing `osd dump` for the same information.
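For example (`osd.0` is a placeholder id)::

    ceph osd info
    ceph osd info osd.0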
* The structured output of `ceph status` or `ceph -s` is now more
concise, particularly the `mgrmap` and `monmap` sections, and the
structure of the `osdmap` section has been cleaned up.
* A health warning is now generated if the average osd heartbeat ping
time exceeds a configurable threshold for any of the intervals
computed. The OSD computes 1 minute, 5 minute and 15 minute
intervals with average, minimum and maximum values. New
configuration option `mon_warn_on_slow_ping_ratio` specifies a
percentage of `osd_heartbeat_grace` to determine the threshold. A
value of zero disables the warning. A new configuration option
`mon_warn_on_slow_ping_time`, specified in milliseconds, overrides
the computed value and causes a warning when OSD heartbeat pings take
longer than the specified amount. A new admin command `ceph daemon
mgr.# dump_osd_network [threshold]` will list all
connections with a ping time longer than the specified threshold or
the value determined by the config options, for the average of any of
the 3 intervals. Another new admin command `ceph daemon osd.#
dump_osd_network [threshold]` will do the same, but only include
heartbeats initiated by the specified OSD.
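For example, to list connections whose average ping time exceeds 100
milliseconds (assuming the active mgr is `mgr.a`)::

    ceph daemon mgr.a dump_osd_network 100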
* Inline data support for CephFS has been deprecated. When setting the flag,
users will see a warning to that effect, and enabling it now requires the
`--yes-i-really-really-mean-it` flag. If the MDS is started on a
filesystem that has it enabled, a health warning is generated. Support for
this feature will be removed in a future release.
* `ceph {set,unset} full` is no longer supported. We have been using the
`full` and `nearfull` flags in the OSD map for tracking the fullness status
of a cluster since the Hammer release; if the OSD map is marked `full`,
all write operations are blocked until the flag is removed. In the
Infernalis release and the Linux kernel 4.7 client, we introduced per-pool
full/nearfull flags to track the status with finer granularity, so
clients hold write operations if either the cluster-wide `full`
flag or the per-pool `full` flag is set. This was a compromise, as we
needed to support clusters with and without per-pool `full` flag
support, but it practically defeated the purpose of introducing the
per-pool flags. So, in the Mimic release, the new flags finally took the
place of their cluster-wide counterparts, as the monitor started removing
these two flags from the OSD map. Clients from Infernalis onward benefit
from this change, as they are no longer blocked by full pools they are
not writing to. In this release, `ceph {set,unset} full` is considered
an invalid command, and clients will continue honoring both the
cluster-wide and per-pool flags to remain backward compatible with
pre-Infernalis clusters.
* The telemetry module now reports more information.
First, there is a new 'device' channel, enabled by default, that
will report anonymized hard disk and SSD health metrics to
telemetry.ceph.com in order to build and improve device failure
prediction algorithms. If you are not comfortable sharing device
metrics, you can disable that channel first before re-opting-in::
ceph config set mgr mgr/telemetry/channel_device false
Second, we now report more information about CephFS file systems,
including:
- how many MDS daemons (in total and per file system)
- which features are (or have been) enabled
- how many data pools
- approximate file system age (year + month of creation)
- how many files, bytes, and snapshots
- how much metadata is being cached
We have also added:
- which Ceph release the monitors are running
- whether msgr v1 or v2 addresses are used for the monitors
- whether IPv4 or IPv6 addresses are used for the monitors
- whether RADOS cache tiering is enabled (and which mode)
- whether pools are replicated or erasure coded, and
which erasure code profile plugin and parameters are in use
- how many hosts are in the cluster, and how many hosts have each type of daemon
- whether a separate OSD cluster network is being used
- how many RBD pools and images are in the cluster, and how many pools have RBD mirroring enabled
- how many RGW daemons, zones, and zonegroups are present; which RGW frontends are in use
- aggregate stats about the CRUSH map, like which algorithms are used, how
big buckets are, how many rules are defined, and what tunables are in
use
If you had telemetry enabled, you will need to re-opt-in with::
ceph telemetry on
You can view exactly what information will be reported first with::
$ ceph telemetry show # see everything
$ ceph telemetry show basic # basic cluster info (including all of the new info)
* The following invalid settings are no longer tolerated
by the command `ceph osd erasure-code-profile set xxx`:
* invalid `m` for the "reed_sol_r6_op" erasure technique
* invalid `m` and invalid `w` for the "liber8tion" erasure technique
* A new OSD daemon command, dump_recovery_reservations, reveals the
recovery locks held (in_progress) and waiting in priority queues.
* A new OSD daemon command, dump_scrub_reservations, reveals the
scrub reservations held for local (primary) and remote (replica) PGs.
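For example (assuming an OSD with id 0, run on that OSD's host)::

    ceph daemon osd.0 dump_recovery_reservations
    ceph daemon osd.0 dump_scrub_reservations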
* Previously, `ceph tell mgr ...` could be used to call commands
implemented by mgr modules. This is no longer supported. Since
luminous, using `tell` has not been necessary: those same commands
are also accessible without the `tell mgr` portion (e.g., `ceph
tell mgr influx foo` is the same as `ceph influx foo`). `ceph
tell mgr ...` will now call admin commands--the same set of
commands accessible via `ceph daemon ...` when you are logged into
the appropriate host.
* The `ceph tell` and `ceph daemon` commands have been unified,
such that all such commands are accessible via either interface.
Note that ceph-mgr tell commands are accessible via either `ceph
tell mgr ...` or `ceph tell mgr.<id> ...`, and it is only
possible to send tell commands to the active daemon (the standbys do
not accept incoming connections over the network).
* Ceph will now issue a health warning if a RADOS pool has a `pg_num`
value that is not a power of two. This can be fixed by adjusting
the pool to a nearby power of two::
ceph osd pool set <pool-name> pg_num <new-pg-num>
Alternatively, the warning can be silenced with::
ceph config set global mon_warn_on_pool_pg_num_not_power_of_two false
* The format of MDSs in `ceph fs dump` has changed.
* The `mds_cache_size` config option is completely removed. Since luminous,
the `mds_cache_memory_limit` config option has been preferred to configure
the MDS's cache limits.
* The `pg_autoscale_mode` is now set to `on` by default for newly
created pools, which means that Ceph will automatically manage the
number of PGs. To change this behavior, or to learn more about PG
autoscaling, see :ref:`pg-autoscaler`. Note that existing pools in
upgraded clusters will still be set to `warn` by default.
* The `upmap_max_iterations` config option of mgr/balancer has been
renamed to `upmap_max_optimizations` to better match its behaviour.
* `mClockClientQueue` and `mClockClassQueue` OpQueue
implementations have been removed in favor of a single
`mClockScheduler` implementation of a simpler OSD interface.
Accordingly, the `osd_op_queue_mclock*` family of config options
has been removed in favor of the `osd_mclock_scheduler*` family
of options.
* The config subsystem now searches dot ('.') delimited prefixes for
options. That means for an entity like `client.foo.bar`, its
overall configuration will be a combination of the global options,
`client`, `client.foo`, and `client.foo.bar`. Previously,
only global, `client`, and `client.foo.bar` options would apply.
This change may affect the configuration for clients that include a
`.` in their name.
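A hypothetical example of the new lookup order::

    ceph config set client rbd_cache true        # applies to all clients
    ceph config set client.foo rbd_cache false   # now also applies to client.foo.bar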
Getting Ceph
------------
* Git at git://github.com/ceph/ceph.git
* Tarball at http://download.ceph.com/tarballs/ceph-15.2.0.tar.gz
* For packages, see http://docs.ceph.com/docs/master/install/get-packages/
* Release git sha1: dc6a0b5c3cbf6a5e1d6d4f20b5ad466d76b96247
--
Abhishek Lekshmanan
SUSE Software Solutions Germany GmbH
GF: Felix Imendörffer
Hello,
I use "make-debs.sh" script to build debian packages, but instead of
getting *-dbg packages, I get packages like
"ceph-base-dbgsym_14.2.7-75-g14b67dbc11-1_amd64.ddeb", which cannot be
imported by reprepro, with errors like:
Unknown file type: 'a2e592c703334bc9f55a62f6fac267d8 135503940 debug
optional ceph-base-dbgsym_14.2.7-75-g14b67dbc11-1_amd64.ddeb',
assuming source format...
Warning: File 'ceph-base-dbgsym_14.2.7-75-g14b67dbc11-1_amd64.ddeb'
looks like source but does not start with 'ceph_'!
I must have missed something about the package building procedure.
What is the right way to build deb packages?
Best,
Can Zhang
On Tue, 24 Mar 2020, Igor Fedotov wrote:
> Hi Sage,
>
> We've got another occurrence for the ticket:
> https://tracker.ceph.com/issues/40300
>
> Now I'm trying to understand what's happening in BlueFS when it occurs.
> Unfortunately the customer applied the suggested workaround and hence likely
> killed the original sst layout.
>
> So I'm wondering if the bluefs part of the issue (which prevents OSD from
> restart) is caused by a very long single read (>4GB len) from BlueFS.
>
> If so BlueRocksSequentialFile::Read implementation seems to be broken due to
> int usage:
>
> rocksdb::Status Read(size_t n, rocksdb::Slice* result, char* scratch) override {
>   int r = fs->read(h, &h->buf, h->buf.pos, n, NULL, scratch);
>   ceph_assert(r >= 0);
>   *result = rocksdb::Slice(scratch, r);
>   ...
> Please note that sizeof(int) is 4!
size_t is long, so we could change int here (and for _read, and so on
down the stack) to ssize_t...
> Also I'm wondering if we're obliged to return exactly the requested amount of
> data from Read to RocksDB. Couldn't capping the read size in this
> function be a simple solution?
I think the easiest way to answer that is to look at the PosixStack (or
whatever it's called) implementation in the rocksdb tree and see if it
ever returns short, or whether it wraps read(2) in a loop.
Unfortunately that's not entirely clear...
if (r < n) {
if (feof(file_)) {
// We leave status as ok if we hit the end of the file
// We also clear the error so that the reads can continue
// if a new data is written to the file
clearerr(file_);
} else {
// A partial read with an error: return a non-ok status
s = IOError("While reading file sequentially", filename_, errno);
}
}
But, I think the only real reason we'd want to do a short read is if
the read is long due to readahead, in which case the problem is
more that readahead was kludged into the stack at the wrong
point.
sage
Interesting, thanks for the details. Adding the dev(a)ceph.io list.
On Thu, Mar 26, 2020 at 12:07 PM <akmd(a)metaswitch.com> wrote:
>
> Hello,
>
> I am observing non-intuitive results for a performance test using the S3 API to RGW. I am wondering if others have similar experiences or knowledge here.
>
> Our application is using the “if-none-match” header on S3-API requests. This header is set by the application if it already has a copy of the object in question but wishes to check if there is a newer version. If the etag of the current object matches then RGW sends a 304 response, and if it doesn’t it sends the updated content of the object.
>
> We’re observing that the response time of requests resulting in “304 Not Modified” is typically slower than those for normal object retrieval. This wasn’t intuitive to me – in the 304 case there is no content to transfer over the network and I would expect the request can be satisfied just by looking at the RGW index (I was under the impression that metadata including etag is in the index). Anecdotally, HEAD requests see similar results but I haven't yet analysed in full.
The bucket index is generally only consulted when listing objects. For
GET/HEAD requests on objects, we read the etag and other object
attributes directly from the associated 'head object' in the
buckets.data pool.
For small objects (ie. non-multipart uploads), this head object also
stores the first 4MB of object data. For GET requests, radosgw will
prefetch this data in the same osd request that reads the object's
attributes, even in the 304 case where no data is returned to the
client.
The 'head object' concept works differently for versioned buckets,
because each object version has its own head object, and the 'current'
version's head object (internally called the 'object logical head', or
OLH) acts like a symlink to one of those. So a GET/HEAD request that
doesn't specify a ?versionId will require (at least*) two rados reads
- one on the OLH object to determine the current version, and another
on that version's head object.
* OLH performance isn't necessarily deterministic, because it may
involve replaying a journal of racing writes of new versions.
>
>
> Does anyone else have data or experience about expected performance of this scenario? Are there any potential avenues for optimization of the configuration? What kind of commands can I use to debug this further?
Given the above, I would expect the performance to be very similar
with and without If-None-Match. I'd suggest running radosgw with
'--debug-ms=1' and comparing the number of osd_op requests it sends
between the 206 and 304 cases.
>
>
> Some details of the current setup:
> => ceph version 14.2.5 (ad5bd132e1492173c85fda2cc863152730b16a92) nautilus (stable)
> => Objects are typically 80-100KB.
> => Versioning is enabled on the bucket.
> => Our requests specify a Range header (hence will generate 206 not 200).
> => Multisite features are enabled.
> => Bucket has 20 shards – I’ve put a dump of "bucket limits" below.
>
> Performance results
>
> Response, Request Count, Median, 75th percentile, 90th percentile, 95th percentile,
> 206 Partial, 20473, 3, 3, 16, 129, 1200
> 304 Not Modified, 15644, 9, 16, 46, 212, 1192
>
> Bucket details
> {
> "bucket": "albansstack-scsdata",
> "tenant": "",
> "num_objects": 465780,
> "num_shards": 20,
> "objects_per_shard": 23289,
> "fill_status": "OK"
> },
>
> Many thanks,
>
> Alistair.
Hi Folks,
Perf meeting for today starts in 15 minutes! Today we will briefly
discuss changing the default for bluefs_buffered_io back to disabled,
along with updates regarding other recent work. See you there!
Etherpad:
https://pad.ceph.com/p/performance_weekly
Bluejeans:
https://bluejeans.com/908675367
Thanks,
Mark