Hello Samuel,
we at croit.io don't use NFS to boot servers. We copy the OS directly
into RAM (approximately 0.5-1 GB). Think of it like a container: you
start it and throw it away when you no longer need it.
This way we free up the drive slots otherwise used by OS disks to add more
storage per node and reduce overall costs, as 1 GB of RAM is cheaper than an
OS disk and consumes less power.
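To give an idea of the mechanism (this is only a generic illustration, not
our actual boot script; host and file names are placeholders, and it assumes
a Debian live-boot style initramfs for the boot=live/fetch= parameters), an
iPXE entry that pulls a kernel, initrd and squashfs image over HTTP from the
management node and runs entirely from RAM could look like:

    #!ipxe
    dhcp
    kernel http://mgmt.example.com/boot/vmlinuz boot=live fetch=http://mgmt.example.com/boot/rootfs.squashfs
    initrd http://mgmt.example.com/boot/initrd.img
    boot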
If our management node is down, nothing happens to the cluster: no
impact, no downtime. However, you do need the mgmt node to boot up the
cluster. So after a very rare total power outage, the first system to come
up would be the mgmt node and then the cluster itself. But again, if you
configure your systems correctly, no manual work is required to recover from
that. For everything else, it is possible (but definitely not needed) to
deploy our mgmt node in active/passive HA.
We have several hundred installations worldwide in production
environments. Our strong PXE knowledge comes from more than 20 years of
datacenter hosting experience, and it has never failed us in over 10
years.
The main benefits of this approach:
- Immutable OS, freshly booted: every host runs exactly the same version,
the same libraries, kernel, and Ceph release.
- OS heavily tested by us: every croit deployment uses exactly the same
image, so we can find errors much faster and hit far fewer of them.
- Easy updates: updating the OS, Ceph or anything else is just a node reboot.
No cluster downtime, no service impact, fully automatic handling by our mgmt
software.
- No OS installation: no maintenance costs, no labor required, no separate
OS management needed.
- Centralized logs/stats: since the OS is booted in memory, all logs and
statistics are collected in a central place for easy access.
- Easy to scale: it doesn't matter whether you boot 3 or 300 nodes, they all
boot the exact same image in a few seconds.
... and lots more.
Please do not hesitate to contact us directly. We always try to offer an
excellent service and are strongly customer oriented.
--
Martin Verges
Managing director
Mobile: +49 174 9335695
E-Mail: martin.verges(a)croit.io
Chat: https://t.me/MartinVerges
croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx
On Sat, 21 Mar 2020 at 13:53, huxiaoyu(a)horebdata.cn <
huxiaoyu(a)horebdata.cn> wrote:
> Hello, Martin,
>
> I notice that croit advocates running Ceph clusters without OS disks,
> using PXE boot instead.
>
> Do you use an NFS server to serve the root file system for each node, e.g.
> hosting configuration files, users and passwords, log files, etc.? My
> question is: will the NFS server be a single point of failure? If the NFS
> server goes down or the network experiences an outage, Ceph nodes may not be
> able to write to their local file systems, possibly leading to a service outage.
>
> How do you deal with the above potential issues in production? I am a bit
> worried...
>
> best regards,
>
> samuel
>
>
>
>
> ------------------------------
> huxiaoyu(a)horebdata.cn
>
>
>
I have a hybrid environment and need to share with both Linux and
Windows clients. For my previous iterations of file storage, I exported
nfs and samba shares directly from my monolithic file server. All Linux
clients used nfs and all Windows clients used samba. Now that I've
switched to ceph, things are a bit more complicated. I built a gateway
to export nfs and samba as needed, and connect that as a client to my
ceph cluster.
After having file locking problems with kernel nfs, I made the switch to
nfs-ganesha, which has helped immensely. For Linux clients that have
high I/O needs, like desktops and some web servers, I connect to ceph
directly for those shares. For all other Linux needs, I use nfs from the
gateway. For all Windows clients (desktops and a small number of
servers), I use samba exported from the gateway.
Since my ceph cluster went live in August, I have had some kind of
strange (to me) error at least once a week, almost always related to the
gateway client. Last night, it was MDS_CLIENT_OLDEST_TID. Since we're on
Spring Break at my university and not very busy, I decided to
unmount/remount the ceph share, requiring stopping nfs and samba
services. Stopping nfs-ganesha took a while, but it finally completed
with no complaints from the ceph cluster. Stopping samba took longer and
gave me MDS_SLOW_REQUEST and MDS_CLIENT_LATE_RELEASE on the mds. It
finally finished, and I was able to unmount/remount the ceph share and
that finally cleared all the errors.
This is leading me to believe that samba on the gateway and all the
clients attaching to that is putting a strain on the connection back to
ceph. Which finally brings me to my question: is there a better way to
export samba to my clients using the ceph back end? Or is this as good
as it gets and I just have to put up with the seemingly frequent errors?
I can live with the errors and have been able to handle them so far, but
I know people who have much bigger clusters and many more clients than
me (by an order of magnitude) and don't see nearly as many errors as I
do. Which is why I'm trying to figure out what is special about my setup.
All my ceph nodes are running the latest Nautilus on CentOS 7 (I just
updated last week to 14.2.8), as is the gateway host. I'm mounting CephFS
directly on the gateway (via the kernel client, not rados/rbd) to a single
mount point and exporting from there.
My searches so far have not turned up anything extraordinarily useful,
so I'm asking for some guidance here. Any advice is welcome.
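If it matters, one alternative I'm wondering about but have not tried is
Samba's vfs_ceph module, which would talk to CephFS through libcephfs
instead of re-exporting the kernel mount. If I understand the docs, a
minimal share definition would look roughly like this (the share name and
the 'samba' cephx user are placeholders):

    [shared]
        path = /
        vfs objects = ceph
        ceph:config_file = /etc/ceph/ceph.conf
        ceph:user_id = samba
        kernel share modes = no
        read only = no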
Thanks.
Seth
--
Seth Galitzer
Systems Coordinator
Computer Science Department
Kansas State University
http://www.cs.ksu.edu/~sgsax
sgsax(a)ksu.edu
785-532-7790
Hi all,
I'm currently building a little Ceph cluster with embedded devices. My
OSD nodes are constrained in RAM (1 GB, but 5 SATA ports, please don't
kill me ;) ). Anyway, each of those nodes has 2x 256 GB SSDs and 2x 1 TB
HDDs.
I'm using Nautilus and deployed the whole thing with ceph-deploy. Now,
I'd like to add more memory (i.e. swap space, as RAID1 on either the
SSDs or HDDs) to those OSDs.
My question is: can I shrink a BlueStore LVM volume?
I've seen 'ceph-bluestore-tool bluefs-bdev-expand', which only suggests
expanding the underlying LVM volume.
It seems to me that I would have to remove the specific OSD and then add
it back manually after resizing the underlying LVM volume.
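Roughly the sequence I have in mind, if shrinking in place is not possible
(just a sketch; the OSD id, sizes, VG and LV names are placeholders):

    ceph osd out 3
    # wait until 'ceph -s' shows all PGs active+clean again, then:
    systemctl stop ceph-osd@3
    ceph osd purge 3 --yes-i-really-mean-it
    ceph-volume lvm zap /dev/vg-osd/osd3        # wipe the old bluestore data
    lvreduce -L 200G /dev/vg-osd/osd3           # shrink the LV
    lvcreate -L 8G -n swap0 vg-osd              # new swap LV from the freed space
    ceph-volume lvm create --bluestore --data vg-osd/osd3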
And yes, I know that Ceph needs RAM ;)
Cheers,
udo.
Hi,
How is the experience of upgrading from Mimic to Nautilus? Anything to
watch out for?
And is it a smooth transition for the ceph-fuse clients on Mimic when
interacting with a Nautilus MDS?
I'm reading the docs on
https://docs.ceph.com/docs/nautilus/install/upgrading-ceph/ and it looks
pretty straightforward.
I'm currently on Mimic 13.2.8, and have been putting off upgrading to
Nautilus, but I think it's time.
Thanks in advance,
-Paul Choi
Hi,
I am trying to understand the Straw2 bucket in Ceph's CRUSH algorithm.
Can someone please tell me the purpose of crush_ln? And what does
"compute 2^44*log2(input+1)" mean in the comment above the crush_ln
function?
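My current reading, which I would like to have confirmed: for a 16-bit
input x, crush_ln returns the fixed-point value 2^44 * log2(x+1), and the
straw2 bucket then computes, for each item i, something like

    draw_i = (crush_ln(hash_i & 0xffff) - 2^48) / weight_i

and selects the item with the largest draw. Is that correct?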
Thanks in advance
Dear All
One Ceph cluster is running with all daemons (mon, mgr, osd, rgw) at version 12.2.12.
Let's say we configure an additional radosgw instance with version 14.2.8, configured with the same Ceph cluster name, realm, zonegroup and zone as the existing instances.
Is it dangerous to run a different version of rgw concurrently with the older rgw?
Can both versions serve PUT & GET requests concurrently on the same buckets?
It is probably ok, because during an upgrade it is likewise assumed that different versions temporarily run concurrently...
So the question is kind of the same as asking if one may cancel the radosgw upgrade from Luminous to Nautilus in the middle, rolling back the radosgw packages.
Many thanks in advance.
Cheers
Francois Scheurer
Hi,
As the upgrade documentation tells:
> Note that the first time each OSD starts, it will do a format
> conversion to improve the accounting for “omap” data. This may
> take a few minutes to as much as a few hours (for an HDD with lots
> of omap data). You can disable this automatic conversion with:
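> # ceph config set osd bluestore_fsck_quick_fix_on_mount false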
What the documentation does not say is that this process takes a lot of
memory.
I am upgrading a rusty cluster from Nautilus; you can see the RAM
consumption in the attachment.
First, a 3 TB OSD conversion: it took ~15 min and 19 GB of memory.
Then, a larger 6 TB OSD conversion: it took more than 2 hours
and 35 GB of memory.
Finally, the largest 10 TB OSD: only 1h15, but 52 GB of memory.
We've run into a problem this afternoon on our test cluster, which is running Nautilus (14.2.2). It seems that any time PGs move on the cluster (from marking an OSD down, setting the primary-affinity to 0, or by using the balancer), a large number of the OSDs in the cluster peg the CPU cores they're running on for a while, which causes slow requests. From what I can tell, it appears to be related to slow peering caused by osd_pg_create() taking a long time.
This was seen on quite a few OSDs while waiting for peering to complete:
# ceph daemon osd.3 ops
{
"ops": [
{
"description": "osd_pg_create(e179061 287.7a:177739 287.9a:177739 287.e2:177739 287.e7:177739 287.f6:177739 287.187:177739 287.1aa:177739 287.216:177739 287.306:177739 287.3e6:177739)",
"initiated_at": "2019-08-27 14:34:46.556413",
"age": 318.25234538000001,
"duration": 318.25241895300002,
"type_data": {
"flag_point": "started",
"events": [
{
"time": "2019-08-27 14:34:46.556413",
"event": "initiated"
},
{
"time": "2019-08-27 14:34:46.556413",
"event": "header_read"
},
{
"time": "2019-08-27 14:34:46.556299",
"event": "throttled"
},
{
"time": "2019-08-27 14:34:46.556456",
"event": "all_read"
},
{
"time": "2019-08-27 14:35:12.456901",
"event": "dispatched"
},
{
"time": "2019-08-27 14:35:12.456903",
"event": "wait for new map"
},
{
"time": "2019-08-27 14:40:01.292346",
"event": "started"
}
]
}
},
...snip...
{
"description": "osd_pg_create(e179066 287.7a:177739 287.9a:177739 287.e2:177739 287.e7:177739 287.f6:177739 287.187:177739 287.1aa:177739 287.216:177739 287.306:177739 287.3e6:177739)",
"initiated_at": "2019-08-27 14:35:09.908567",
"age": 294.900191001,
"duration": 294.90068416899999,
"type_data": {
"flag_point": "delayed",
"events": [
{
"time": "2019-08-27 14:35:09.908567",
"event": "initiated"
},
{
"time": "2019-08-27 14:35:09.908567",
"event": "header_read"
},
{
"time": "2019-08-27 14:35:09.908520",
"event": "throttled"
},
{
"time": "2019-08-27 14:35:09.908617",
"event": "all_read"
},
{
"time": "2019-08-27 14:35:12.456921",
"event": "dispatched"
},
{
"time": "2019-08-27 14:35:12.456923",
"event": "wait for new map"
}
]
}
}
],
"num_ops": 6
}
That "wait for new map" message made us think something was getting hung up on the monitors, so we restarted them all without any luck.
I'll keep investigating, but so far my google searches aren't pulling anything up so I wanted to see if anyone else is running into this?
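In case it helps, the other admin-socket commands I plan to compare across
OSDs while this is happening (so far I've mainly used the ops dump above):

    ceph daemon osd.3 dump_blocked_ops
    ceph daemon osd.3 dump_historic_slow_ops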
Thanks,
Bryan
We have been benchmarking CephFS and comparing it to RADOS to see the performance difference and how much overhead CephFS has. However, we are getting odd results when using more than one OSD server (each OSD server has only one disk) with CephFS, while with RADOS everything appears normal. These tests are run on the same Ceph cluster.
OSDS    CephFS (16 threads)    Rados (16 threads)
1       289                    316
2       139                    546
3       143                    728
4       142                    844
CephFS is being benchmarked using: fio --name=seqwrite --rw=write --direct=1 --ioengine=libaio --bs=4M --numjobs=16 --size=1G --group_reporting
Rados is being benchmarked using: rados bench -p cephfs_data 10 write -t 16
If you could provide some help or insight into why this is happening or how to stop it, that would be much appreciated.
Kind regards,
Gabryel
We're happy to announce the first stable release of Octopus v15.2.0.
There are a lot of changes and new features added, we advise everyone to
read the release notes carefully, and in particular the upgrade notes,
before upgrading. Please refer to the official blog entry
https://ceph.io/releases/v15-2-0-octopus-released/ for a detailed
version with links & changelog.
This release wouldn't have been possible without the support of the
community. It saw contributions from over 330 developers and 80
organizations, and we thank everyone for making this release happen.
Major Changes from Nautilus
---------------------------
General
~~~~~~~
* A new deployment tool called **cephadm** has been introduced that
integrates Ceph daemon deployment and management via containers
into the orchestration layer. (A short usage sketch of cephadm and of
the new health-mute command follows the package list below.)
* Health alerts can now be muted, either temporarily or permanently.
* Health alerts are now raised for recent Ceph daemons crashes.
* A simple 'alerts' module has been introduced to send email
health alerts for clusters deployed without the benefit of an
existing external monitoring infrastructure.
* Packages are built for the following distributions:
- CentOS 8
- CentOS 7 (partial--see below)
- Ubuntu 18.04 (Bionic)
- Debian Buster
- Container images (based on CentOS 8)
Note that the dashboard, prometheus, and restful manager modules
will not work on the CentOS 7 build due to Python 3 module
dependencies that are missing in CentOS 7.
Besides these, packages built by the community will also be available for the
following distros:
- Fedora (33/rawhide)
- openSUSE (15.2, Tumbleweed)
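A minimal usage sketch of the two additions mentioned above (the IP address
and the alert code are placeholders, and cephadm takes many more options
than shown here)::
    # cephadm bootstrap --mon-ip 192.168.1.10
    # ceph health mute OSD_DOWN 1h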
Dashboard
~~~~~~~~~
The mgr-dashboard has gained a lot of new features and functionality:
* UI Enhancements
- New vertical navigation bar
- New unified sidebar: better background task and events notification
- Shows all progress mgr module notifications
- Multi-select on tables to perform bulk operations
* Dashboard user account security enhancements
- Disabling/enabling existing user accounts
- Clone an existing user role
- Users can change their own password
- Configurable password policies: Minimum password complexity/length
requirements
- Configurable password expiration
- Change password after first login
New and enhanced management of Ceph features/services:
* OSD/device management
- List all disks associated with an OSD
- Add support for blinking enclosure LEDs via the orchestrator
- List all hosts known by the orchestrator
- List all disks and their properties attached to a node
- Display disk health information (health prediction and SMART data)
- Deploy new OSDs on new disks/hosts
- Display and allow sorting by an OSD's default device class in the OSD
table
- Explicitly set/change the device class of an OSD, display and sort OSDs by
device class
* Pool management
- Viewing and setting pool quotas
- Define and change per-pool PG autoscaling mode
* RGW management enhancements
- Enable bucket versioning
- Enable MFA support
- Select placement target on bucket creation
* CephFS management enhancements
- CephFS client eviction
- CephFS snapshot management
- CephFS quota management
- Browse CephFS directory
* iSCSI management enhancements
- Show iSCSI GW status on landing page
- Prevent deletion of IQNs with open sessions
- Display iSCSI "logged in" info
* Prometheus alert management
- List configured Prometheus alerts
RADOS
~~~~~
* Objects can now be brought in sync during recovery by copying only
the modified portion of the object, reducing tail latencies during
recovery.
* Ceph will allow recovery below *min_size* for Erasure coded pools,
wherever possible.
* The PG autoscaler feature introduced in Nautilus is enabled for
new pools by default, allowing new clusters to autotune *pg num*
without any user intervention. The default values for new pools
and RGW/CephFS metadata pools have also been adjusted to perform
well for most users (see the short example after this list).
* BlueStore has received several improvements and performance
updates, including improved accounting for "omap" (key/value)
object data by pool, improved cache memory management, and a
reduced allocation unit size for SSD devices. (Note that by
default, the first time each OSD starts after upgrading to octopus
it will trigger a conversion that may take from a few minutes to a
few hours, depending on the amount of stored "omap" data.)
* Snapshot trimming metadata is now managed in a more efficient and
scalable fashion.
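A small sketch of inspecting and controlling the autoscaler on an existing
pool (the pool name is a placeholder)::
    # ceph osd pool autoscale-status
    # ceph osd pool set mypool pg_autoscale_mode on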
RBD block storage
~~~~~~~~~~~~~~~~~
* Mirroring now supports a new snapshot-based mode that no longer requires
the journaling feature and its related impacts, in exchange for the loss
of point-in-time consistency (it remains crash consistent). A short
example follows this list.
* Clone operations now preserve the sparseness of the underlying RBD image.
* The trash feature has been improved to (optionally) automatically
move old parent images to the trash when their children are all
deleted or flattened.
* The trash can be configured to automatically purge on a defined schedule.
* Images can be online re-sparsified to reduce the usage of zeroed extents.
* The `rbd-nbd` tool has been improved to use more modern kernel interfaces.
* Caching has been improved to be more efficient and performant.
* `rbd-mirror` automatically adjusts its per-image memory usage based
upon its memory target.
* A new persistent read-only caching daemon is available to offload reads from
shared parent images.
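A minimal sketch of switching a single image to the snapshot-based
mirroring mode and taking a mirror snapshot on demand (pool and image names
are placeholders)::
    # rbd mirror image enable mypool/myimage snapshot
    # rbd mirror image snapshot mypool/myimage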
RGW object storage
~~~~~~~~~~~~~~~~~~
* New `Multisite Sync Policy` primitives for per-bucket replication. (EXPERIMENTAL)
* S3 feature support:
- Bucket Replication (EXPERIMENTAL)
- `Bucket Notifications`_ via HTTP/S, AMQP and Kafka
- Bucket Tagging
- Object Lock
- Public Access Block for buckets
* Bucket sharding:
- Significantly improved listing performance on buckets with many shards.
- Dynamic resharding prefers prime shard counts for improved distribution.
- Raised the default number of bucket shards to 11.
* Added `HashiCorp Vault Integration`_ for SSE-KMS.
* Added Keystone token cache for S3 requests.
CephFS distributed file system
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* Inline data support in CephFS has been deprecated and will likely be
removed in a future release.
* MDS daemons can now be assigned to manage a particular file system via the
new `mds_join_fs` option (see the example after this list).
* MDS now aggressively asks idle clients to trim caps which improves stability
when file system load changes.
* The mgr volumes plugin has received numerous improvements to support CephFS
via CSI, including snapshots and cloning.
* cephfs-shell has had numerous incremental improvements and bug fixes.
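A short sketch of the `mds_join_fs` option (the daemon and file system
names are placeholders)::
    # ceph config set mds.a mds_join_fs cephfs_a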
Upgrading from Mimic or Nautilus
--------------------------------
You can monitor the progress of your upgrade at each stage with the
`ceph versions` command, which will tell you what ceph version(s) are
running for each type of daemon.
Instructions
~~~~~~~~~~~~
#. Make sure your cluster is stable and healthy (no down or
recovering OSDs). (Optional, but recommended.)
#. Set the `noout` flag for the duration of the upgrade. (Optional,
but recommended.)::
# ceph osd set noout
#. Upgrade monitors by installing the new packages and restarting the
monitor daemons. For example, on each monitor host,::
# systemctl restart ceph-mon.target
Once all monitors are up, verify that the monitor upgrade is
complete by looking for the `octopus` string in the mon
map. The command::
# ceph mon dump | grep min_mon_release
should report::
min_mon_release 15 (octopus)
If it doesn't, that implies that one or more monitors hasn't been
upgraded and restarted and/or the quorum does not include all monitors.
#. Upgrade `ceph-mgr` daemons by installing the new packages and
restarting all manager daemons. For example, on each manager host,::
# systemctl restart ceph-mgr.target
Verify the `ceph-mgr` daemons are running by checking `ceph
-s`::
# ceph -s
...
services:
mon: 3 daemons, quorum foo,bar,baz
mgr: foo(active), standbys: bar, baz
...
#. Upgrade all OSDs by installing the new packages and restarting the
ceph-osd daemons on all OSD hosts::
# systemctl restart ceph-osd.target
Note that the first time each OSD starts, it will do a format
conversion to improve the accounting for "omap" data. This may
take a few minutes to as much as a few hours (for an HDD with lots
of omap data). You can disable this automatic conversion with::
# ceph config set osd bluestore_fsck_quick_fix_on_mount false
You can monitor the progress of the OSD upgrades with the
`ceph versions` or `ceph osd versions` commands::
# ceph osd versions
{
"ceph version 13.2.5 (...) mimic (stable)": 12,
"ceph version 15.2.0 (...) octopus (stable)": 22,
}
#. Upgrade all CephFS MDS daemons. For each CephFS file system,
#. Reduce the number of ranks to 1. (Make note of the original
number of MDS daemons first if you plan to restore it later.)::
# ceph status
# ceph fs set <fs_name> max_mds 1
#. Wait for the cluster to deactivate any non-zero ranks by
periodically checking the status::
# ceph status
#. Take all standby MDS daemons offline on the appropriate hosts with::
# systemctl stop ceph-mds@<daemon_name>
#. Confirm that only one MDS is online and is rank 0 for your FS::
# ceph status
#. Upgrade the last remaining MDS daemon by installing the new
packages and restarting the daemon::
# systemctl restart ceph-mds.target
#. Restart all standby MDS daemons that were taken offline::
# systemctl start ceph-mds.target
#. Restore the original value of `max_mds` for the volume::
# ceph fs set <fs_name> max_mds <original_max_mds>
#. Upgrade all radosgw daemons by upgrading packages and restarting
daemons on all hosts::
# systemctl restart ceph-radosgw.target
#. Complete the upgrade by disallowing pre-Octopus OSDs and enabling
all new Octopus-only functionality::
# ceph osd require-osd-release octopus
#. If you set `noout` at the beginning, be sure to clear it with::
# ceph osd unset noout
#. Verify the cluster is healthy with `ceph health`.
If your CRUSH tunables are older than Hammer, Ceph will now issue a
health warning. If you see a health alert to that effect, you can
revert this change with::
ceph config set mon mon_crush_min_required_version firefly
If Ceph does not complain, however, then we recommend you also
switch any existing CRUSH buckets to straw2, which was added back
in the Hammer release. If you have any 'straw' buckets, this will
result in a modest amount of data movement, but generally nothing
too severe.::
ceph osd getcrushmap -o backup-crushmap
ceph osd crush set-all-straw-buckets-to-straw2
If there are problems, you can easily revert with::
ceph osd setcrushmap -i backup-crushmap
Moving to 'straw2' buckets will unlock a few recent features, like
the `crush-compat` :ref:`balancer <balancer>` mode added back in Luminous.
#. If you are upgrading from Mimic, or did not already do so when you
upgraded to Nautilus, we recommend you enable the new :ref:`v2
network protocol <msgr2>`. Issue the following command::
ceph mon enable-msgr2
This will instruct all monitors that bind to the old default port
6789 for the legacy v1 protocol to also bind to the new 3300 v2
protocol port. To see if all monitors have been updated,::
ceph mon dump
and verify that each monitor has both a `v2:` and `v1:` address
listed.
#. Consider enabling the :ref:`telemetry module <telemetry>` to send
anonymized usage statistics and crash information to the Ceph
upstream developers. To see what would be reported (without actually
sending any information to anyone),::
ceph mgr module enable telemetry
ceph telemetry show
If you are comfortable with the data that is reported, you can opt-in to
automatically report the high-level cluster metadata with::
ceph telemetry on
For more information about the telemetry module, see :ref:`the
documentation <telemetry>`.
Upgrading from pre-Mimic releases (like Luminous)
-------------------------------------------------
You *must* first upgrade to Mimic (13.2.z) or Nautilus (14.2.z) before
upgrading to Octopus.
Upgrade compatibility notes
---------------------------
* Starting with Octopus, there is now a separate repository directory
for each version on `download.ceph.com` (e.g., `rpm-15.2.0` and
`debian-15.2.0`). The traditional package directory that is named
after the release (e.g., `rpm-octopus` and `debian-octopus`) is
now a symlink to the most recent bug-fix version for that release.
We no longer generate a single repository that combines all bug-fix
versions for a single named release.
* The RGW "num_rados_handles" has been removed.
If you were using a value of "num_rados_handles" greater than 1
multiply your current "objecter_inflight_ops" and
"objecter_inflight_op_bytes" paramaeters by the old
"num_rados_handles" to get the same throttle behavior.
* Ceph now packages python bindings for python3.6 instead of
python3.4, because python3 in EL7/EL8 now uses python3.6
as the native python3. See the `announcement`_
for more details on the background of this change.
* librbd now uses a write-around cache policy by default,
replacing the previous write-back cache policy default.
This cache policy allows librbd to immediately complete
write IOs while they are still in-flight to the OSDs.
Subsequent flush requests will ensure all in-flight
write IOs are completed prior to completing. The
librbd cache policy can be controlled via a new
"rbd_cache_policy" configuration option.
* librbd now includes a simple IO scheduler which attempts to
batch together multiple IOs against the same backing RBD
data block object. The librbd IO scheduler policy can be
controlled via a new "rbd_io_scheduler" configuration
option.
* RGW: radosgw-admin introduces two subcommands that allow the
managing of expire-stale objects that might be left behind after a
bucket reshard in earlier versions of RGW. One subcommand lists such
objects and the other deletes them. Read the troubleshooting section
of the dynamic resharding docs for details.
* RGW: Bucket naming restrictions have changed and are likely to cause
InvalidBucketName errors. We recommend setting the `rgw_relaxed_s3_bucket_names`
option to true as a workaround.
* In the Zabbix Mgr Module there was a typo in the key being sent
to Zabbix for PGs in backfill_wait state. The key that was sent
was 'wait_backfill' and the correct name is 'backfill_wait'.
Update your Zabbix template accordingly so that it accepts the
new key being sent to Zabbix.
* The Zabbix plugin for the ceph manager now includes OSD and pool
discovery. An update of zabbix_template.xml is needed
to receive per-pool (read/write throughput, diskspace usage)
and per-osd (latency, status, pgs) statistics.
* The format of all date + time stamps has been modified to fully
conform to ISO 8601. The old format (`YYYY-MM-DD
HH:MM:SS.ssssss`) excluded the `T` separator between the date and
time and was rendered using the local time zone without any explicit
indication. The new format includes the separator as well as a
`+nnnn` or `-nnnn` suffix to indicate the time zone, or a `Z`
suffix if the time is UTC. For example,
`2019-04-26T18:40:06.225953+0100`.
Any code or scripts that were previously parsing date and/or time
values from the JSON or XML structured CLI output should be checked
to ensure they can handle ISO 8601 conformant values. Any code
parsing date or time values from the unstructured human-readable
output should be modified to parse the structured output instead, as
the human-readable output may change without notice.
* The `bluestore_no_per_pool_stats_tolerance` config option has been
replaced with `bluestore_fsck_error_on_no_per_pool_stats`
(default: false). The overall default behavior has not changed:
fsck will warn but not fail on legacy stores, and repair will
convert to per-pool stats.
* The disaster-recovery related 'ceph mon sync force' command has been
replaced with 'ceph daemon <...> sync_force'.
* The `osd_recovery_max_active` option now has
`osd_recovery_max_active_hdd` and `osd_recovery_max_active_ssd`
variants, each with different default values for HDD and SSD-backed
OSDs, respectively. By default `osd_recovery_max_active` now
defaults to zero, which means that the OSD will conditionally use
the HDD or SSD option values. Administrators who have customized
this value may want to consider whether they have set this to a
value similar to the new defaults (3 for HDDs and 10 for SSDs) and,
if so, remove the option from their configuration entirely.
* Monitors now have a `ceph osd info` command that will provide information
on all OSDs, or on the specified OSDs, thus simplifying the process of having to
parse `osd dump` for the same information.
* The structured output of `ceph status` or `ceph -s` is now more
concise, particularly the `mgrmap` and `monmap` sections, and the
structure of the `osdmap` section has been cleaned up.
* A health warning is now generated if the average osd heartbeat ping
time exceeds a configurable threshold for any of the intervals
computed. The OSD computes 1 minute, 5 minute and 15 minute
intervals with average, minimum and maximum values. New
configuration option `mon_warn_on_slow_ping_ratio` specifies a
percentage of `osd_heartbeat_grace` to determine the threshold. A
value of zero disables the warning. The new configuration option
`mon_warn_on_slow_ping_time`, specified in milliseconds, overrides
the computed value and causes a warning when OSD heartbeat pings take
longer than the specified amount. The new admin command `ceph daemon
mgr.# dump_osd_network [threshold]` will list all
connections with a ping time longer than the specified threshold or
the value determined by the config options, for the average of any of
the 3 intervals. The new admin command `ceph daemon osd.#
dump_osd_network [threshold]` will do the same but only including
heartbeats initiated by the specified OSD.
* Inline data support for CephFS has been deprecated. When setting the flag,
users will see a warning to that effect, and enabling it now requires the
`--yes-i-really-really-mean-it` flag. If the MDS is started on a
filesystem that has it enabled, a health warning is generated. Support for
this feature will be removed in a future release.
* `ceph {set,unset} full` is not supported anymore. We have been using
`full` and `nearfull` flags in OSD map for tracking the fullness status
of a cluster back since the Hammer release, if the OSD map is marked `full`
all write operations will be blocked until this flag is removed. In the
Infernalis release and Linux kernel 4.7 client, we introduced the per-pool
full/nearfull flags to track the status for a finer-grained control, so the
clients will hold the write operations if either the cluster-wide `full`
flag or the per-pool `full` flag is set. This was a compromise, as we
needed to support the cluster with and without per-pool `full` flags
support. But this practically defeated the purpose of introducing the
per-pool flags. So, in the Mimic release, the new flags finally took the
place of their cluster-wide counterparts, as the monitor started removing
these two flags from OSD map. So the clients of Infernalis and up can benefit
from this change, as they won't be blocked by the full pools which they are
not writing to. In this release, `ceph {set,unset} full` is now considered
as an invalid command. And the clients will continue honoring both the
cluster-wide and per-pool flags to be backward compatible with pre-Infernalis
clusters.
* The telemetry module now reports more information.
First, there is a new 'device' channel, enabled by default, that
will report anonymized hard disk and SSD health metrics to
telemetry.ceph.com in order to build and improve device failure
prediction algorithms. If you are not comfortable sharing device
metrics, you can disable that channel first before re-opting-in::
ceph config set mgr mgr/telemetry/channel_device false
Second, we now report more information about CephFS file systems,
including:
- how many MDS daemons (in total and per file system)
- which features are (or have been) enabled
- how many data pools
- approximate file system age (year + month of creation)
- how many files, bytes, and snapshots
- how much metadata is being cached
We have also added:
- which Ceph release the monitors are running
- whether msgr v1 or v2 addresses are used for the monitors
- whether IPv4 or IPv6 addresses are used for the monitors
- whether RADOS cache tiering is enabled (and which mode)
- whether pools are replicated or erasure coded, and
which erasure code profile plugin and parameters are in use
- how many hosts are in the cluster, and how many hosts have each type of daemon
- whether a separate OSD cluster network is being used
- how many RBD pools and images are in the cluster, and how many pools have RBD mirroring enabled
- how many RGW daemons, zones, and zonegroups are present; which RGW frontends are in use
- aggregate stats about the CRUSH map, like which algorithms are used, how
big buckets are, how many rules are defined, and what tunables are in
use
If you had telemetry enabled, you will need to re-opt-in with::
ceph telemetry on
You can view exactly what information will be reported first with::
$ ceph telemetry show # see everything
$ ceph telemetry show basic # basic cluster info (including all of the new info)
* The following invalid settings are no longer tolerated
for the command `ceph osd erasure-code-profile set xxx`:
* invalid `m` for the "reed_sol_r6_op" erasure technique
* invalid `m` and invalid `w` for the "liber8tion" erasure technique
* New OSD daemon command dump_recovery_reservations which reveals the
recovery locks held (in_progress) and waiting in priority queues.
* New OSD daemon command dump_scrub_reservations which reveals the
scrub reservations that are held for local (primary) and remote (replica) PGs.
* Previously, `ceph tell mgr ...` could be used to call commands
implemented by mgr modules. This is no longer supported. Since
Luminous, using `tell` has not been necessary: those same commands
are also accessible without the `tell mgr` portion (e.g., `ceph
tell mgr influx foo` is the same as `ceph influx foo`). `ceph
tell mgr ...` will now call admin commands--the same set of
commands accessible via `ceph daemon ...` when you are logged into
the appropriate host.
* The `ceph tell` and `ceph daemon` commands have been unified,
such that all such commands are accessible via either interface.
Note that ceph-mgr tell commands are accessible via either `ceph
tell mgr ...` or `ceph tell mgr.<id> ...`, and it is only
possible to send tell commands to the active daemon (the standbys do
not accept incoming connections over the network).
* Ceph will now issue a health warning if a RADOS pool has a `pg_num`
value that is not a power of two. This can be fixed by adjusting
the pool to a nearby power of two::
ceph osd pool set <pool-name> pg_num <new-pg-num>
Alternatively, the warning can be silenced with::
ceph config set global mon_warn_on_pool_pg_num_not_power_of_two false
* The format of MDSs in `ceph fs dump` has changed.
* The `mds_cache_size` config option is completely removed. Since luminous,
the `mds_cache_memory_limit` config option has been preferred to configure
the MDS's cache limits.
* The `pg_autoscale_mode` is now set to `on` by default for newly
created pools, which means that Ceph will automatically manage the
number of PGs. To change this behavior, or to learn more about PG
autoscaling, see :ref:`pg-autoscaler`. Note that existing pools in
upgraded clusters will still be set to `warn` by default.
* The `upmap_max_iterations` config option of mgr/balancer has been
renamed to `upmap_max_optimizations` to better match its behaviour.
* `mClockClientQueue` and `mClockClassQueue` OpQueue
implementations have been removed in favor of a single
`mClockScheduler` implementation of a simpler OSD interface.
Accordingly, the `osd_op_queue_mclock*` family of config options
has been removed in favor of the `osd_mclock_scheduler*` family
of options.
* The config subsystem now searches dot ('.') delimited prefixes for
options. That means for an entity like `client.foo.bar`, its
overall configuration will be a combination of the global options,
`client`, `client.foo`, and `client.foo.bar`. Previously,
only global, `client`, and `client.foo.bar` options would apply.
This change may affect the configuration for clients that include a
`.` in their name.
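For example (an illustrative sketch; the entity names and the option used
are placeholders)::
    ceph config set client debug_client 10        # applies to all clients
    ceph config set client.foo debug_client 20    # now also inherited by client.foo.bar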
Getting Ceph
------------
* Git at git://github.com/ceph/ceph.git
* Tarball at http://download.ceph.com/tarballs/ceph-15.2.0.tar.gz
* For packages, see http://docs.ceph.com/docs/master/install/get-packages/
* Release git sha1: dc6a0b5c3cbf6a5e1d6d4f20b5ad466d76b96247
--
Abhishek Lekshmanan
SUSE Software Solutions Germany GmbH
GF: Felix Imendörffer