For developers submitting jobs using teuthology, we now have
recommendations on what priority level to use:
https://docs.ceph.com/docs/master/dev/developer_guide/#testing-priority
--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
We're happy to announce the fourth bugfix release in the Octopus series.
In addition to a security fix in RGW, this release brings a range of fixes
across all components. We recommend that all Octopus users upgrade to this
release. For detailed release notes with links and a changelog, please
refer to the official blog entry at https://ceph.io/releases/v15-2-4-octopus-released
Notable Changes
---------------
* CVE-2020-10753: rgw: sanitize newlines in s3 CORSConfiguration's ExposeHeader
(William Bowling, Adam Mohammed, Casey Bodley)
* Cephadm: There were a lot of small usability improvements and bug fixes:
* Grafana when deployed by Cephadm now binds to all network interfaces.
* `cephadm check-host` now prints all detected problems at once.
* Cephadm now calls `ceph dashboard set-grafana-api-ssl-verify false`
when generating an SSL certificate for Grafana.
* The Alertmanager is now correctly pointed to the Ceph Dashboard.
* `cephadm adopt` now supports adopting an Alertmanager.
* `ceph orch ps` now supports filtering by service name.
* `ceph orch host ls` now marks hosts as offline if they are not
accessible.
* Cephadm can now deploy NFS Ganesha services. For example, to deploy NFS with
a service id of mynfs that will use the RADOS pool nfs-ganesha and namespace
nfs-ns::
    ceph orch apply nfs mynfs nfs-ganesha nfs-ns
* Cephadm: `ceph orch ls --export` now returns all service specifications in
yaml representation that is consumable by `ceph orch apply`. In addition,
the commands `orch ps` and `orch ls` now support `--format yaml` and
`--format json-pretty`.
* Cephadm: `ceph orch apply osd` supports a `--preview` flag that prints a preview of
the OSD specification before deploying OSDs. This makes it possible to
verify that the specification is correct, before applying it.
* RGW: The `radosgw-admin` sub-commands dealing with orphans --
`radosgw-admin orphans find`, `radosgw-admin orphans finish`, and
`radosgw-admin orphans list-jobs` -- have been deprecated. They have
not been actively maintained and they store intermediate results on
the cluster, which could fill a nearly-full cluster. They have been
replaced by a tool, currently considered experimental,
`rgw-orphan-list`.
* RBD: The name of the rbd pool object that is used to store the
rbd trash purge schedule has changed from "rbd_trash_trash_purge_schedule"
to "rbd_trash_purge_schedule". Users that have already started using the
`rbd trash purge schedule` functionality and have per-pool or per-namespace
schedules configured should copy the "rbd_trash_trash_purge_schedule"
object to "rbd_trash_purge_schedule" before the upgrade and remove
"rbd_trash_trash_purge_schedule" using the following commands in every RBD
pool and namespace where a trash purge schedule was previously
configured::
    rados -p <pool-name> [-N namespace] cp rbd_trash_trash_purge_schedule rbd_trash_purge_schedule
    rados -p <pool-name> [-N namespace] rm rbd_trash_trash_purge_schedule
or use any other convenient way to restore the schedule after the
upgrade.
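  As a convenience, the rename could be scripted along these lines (a
  minimal sketch, assuming the objects live in the default namespace of
  each pool; pools with per-namespace schedules need an extra pass with
  `-N <namespace>`, e.g. over `rbd namespace ls <pool-name>`)::
      for pool in $(ceph osd pool ls); do
          # skip pools that never had a per-pool schedule object
          rados -p "$pool" stat rbd_trash_trash_purge_schedule 2>/dev/null || continue
          rados -p "$pool" cp rbd_trash_trash_purge_schedule rbd_trash_purge_schedule
          rados -p "$pool" rm rbd_trash_trash_purge_schedule
      done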
Getting Ceph
------------
* Git at git://github.com/ceph/ceph.git
* Tarball at http://download.ceph.com/tarballs/ceph-15.2.4.tar.gz
* For packages, see http://docs.ceph.com/docs/master/install/get-packages/
* Release git sha1: 7447c15c6ff58d7fce91843b705a268a1917325c
--
David Galloway
Systems Administrator, RDU
Ceph Engineering
IRC: dgalloway
Hi,
I am trying to use Ceph dmclock to see how it works for QoS control.
In particular, I want to set "osd_op_queue" to "mclock_client" to configure
different [r, w, l] values for each client. The Ceph version I use is Nautilus
14.2.9.
I noticed that the "OSD Config Reference" section of the Ceph documentation
states that "the mClock based ClientQueue (mclock_client) also incorporates
the client identifier in order to promote fairness between clients.", so I
believe librados can support per-client configurations right now. I wonder
how I can set up the Ceph configuration to configure different (r, w, l) values
for different clients using this "client identifier"? Thanks.
Best,
Zhenbo Qiao
Hi everyone,
We would like to share with you our thoughts on RocksDB in Ceph, with a
main focus on the efficient use of fast storage. Storage space provisioning
for RocksDB is quite complex in Ceph, as it involves several layers of
abstraction. We highlight some problems while explaining how things work
under the hood. At the end, we propose several solutions to
alleviate these problems.
RocksDB in Ceph: column families, levels' size and spillover
https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b…
We'd love to hear your feedback.
Thanks
--
Kajetan Janiak
C++ Engineer
CloudFerro sp. z o.o.
office: Fabryczna 5A    m: +48 500 191 166
00-446 Warszawa, Poland    e: kjaniak(a)cloudferro.com
<https://cloudferro.com/>
Hi.
I'm seeing a difference between the `ceph-volume lvm list` output and
the symlinks in `/var/lib/ceph/osd/*` for the db path.
Can someone help with this?
ceph-volume lvm list output:
[block]
/dev/ceph-bcce3074-3095-481f-bf89-5bc746bb5b8f/osd-block-0c2f092a-4d92-4b0a-85e6-b65221cff791
db device /dev/nvme0n1p1
in /var/lib/ceph/osd:
lrwxrwxrwx 1 ceph ceph 93 Jun 16 03:19 block ->
/dev/ceph-bcce3074-3095-481f-bf89-5bc746bb5b8f/osd-block-0c2f092a-4d92-4b0a-85e6-b65221cff791
lrwxrwxrwx 1 ceph ceph 14 Jun 16 03:19 block.db -> /dev/nvme1n1p1
Hi all,
I would like to share a comparison of 3 replication models based on a
Pech OSD [1] cluster, which supports the minimum needed to replicate
transactions from OSD to OSD and keeps all data mutations in memory
(memstore).
My goal was to compare the "primary-copy", "chain" and "client-based"
replication models and answer the question of how each model affects
network performance.
For this evaluation I chose to implement my own minimal OSD (laborious
but worth it), whose design is similar to Crimson OSD but whose core is
based on sources from the kernel libceph implementation
(i.e. messenger, osdmap, mon_client, etc.), and is thus written in pure C.
-- What Pech OSD supports and what it does not --
Comparing the network response under different replication
scenarios does not require fail-over (we assume that during testing
storing data in memory never fails, hosts never crash, etc.), so to
ease development Pech OSD does not support peering and fail-over in the
current state of the code. Object modifications are replicated on each
mutation, but the cluster is not able to return to a consistent state
after an error.
Pech OSD supports RBD images, so an image can be accessed from
userspace librbd or mapped by the kernel RBD block device. That is the
bare minimum I need to run FIO workloads and test network behavior.
-- What I test --
Originally my goal was to compare performance under the same loads but
using different replication models: "client-based", "primary-copy" and
"chain". I want to see what numbers the different models can deliver in
terms of network bandwidth, latency and IOPS (and when comparing
replication models, the network is the only factor that impacts the
overall performance).
Briefly, about the replication models:
"client-based" - the client itself is responsible for sending requests to
replicas. To test this model the OSD client code was modified on the
userspace [2] and kernel [3] sides. Pros: savings on network hops,
which reduces latency. Cons: the replication algorithm becomes more
complicated when a PG is not healthy or when there is concurrent
access to the same object from diverse clients, and the client
network should be fat enough.
"chain" - the client sends the write request to the primary, the primary
forwards it to the next secondary, and so on. The final ACK from the
last replica in the chain reaches the primary or the client directly.
Pros: each OSD sends a request only once, which reduces the network
load on a particular node and spreads the load; overall bandwidth
should increase. Cons: sequential request processing, which should
impact latency.
"primary-copy" - the default and only model in Ceph: the client
accesses the primary replica, and the primary replica fans out data to
the secondaries. Pros: already implemented. Cons: higher latency
compared to "client-based", lower bandwidth compared to "chain".
The above is the theory that motivated me to prove or disprove it with
numbers on a real cluster.
-- How I test --
I have a cluster at my disposal with 5 hosts on a 100 Gbit/s network for
OSDs and 8 client hosts on a 25 Gbit/s network.
Each OSD host has 24 CPUs, so for obvious reasons each host runs 24
OSDs, i.e. 120 (24x5) Pech OSDs for the whole cluster setup.
There is one fully declustered pool with 1024 PGs (I want to spread
the load as much as possible). The pool is created with a 3x replication
factor.
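A roughly equivalent pool setup with the standard CLI would look like
this (a minimal sketch; the pool name "rbd" is a placeholder and the
actual setup may differ):
    ceph osd pool create rbd 1024 1024 replicated   # 1024 PGs, fully declustered
    ceph osd pool set rbd size 3                    # 3x replication factor
    rbd pool init rbd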
Each client starts 16 FIO jobs doing random writes to 16 RBD images
(userspace RBD client) with various block sizes, i.e. one FIO job per
image and 128 (16x8) jobs in total. Each client host runs an FIO server;
all data from all servers is aggregated by the FIO client and stored in
JSON format. There is a convenient Python script [4] that generates
and runs the FIO jobs, parses the JSON results and outputs them in a
human-readable pretty table.
Major FIO options:
ioengine=rbd
clientname=admin
pool=rbd
rw=randwrite
size=256m
time_based=1
runtime=10
ramp_time=10
iodepth=32
numjobs=1
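Assembled into a job file, each job looks roughly like this (a minimal
sketch; the [image-00] section name, the rbdname value and the bs line
are placeholders, since the block size was varied per run):
    [global]
    ioengine=rbd
    clientname=admin
    pool=rbd
    rw=randwrite
    size=256m
    time_based=1
    runtime=10
    ramp_time=10
    iodepth=32
    numjobs=1
    ; block size varied per run: 4k ... 1m
    bs=4k

    ; one such job section per image, 16 per client host
    [image-00]
    rbdname=image-00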
During all the tests I collected almost 1 GB of JSON results. Quite
enough for a good analysis.
-- Results --
Firstly I would like to start comparing "primary-copy" and "chain"
on Pech OSD:
120OSDS/pech/primary-copy
write/iops write/bw write/clat_ns/mean
4k 365.89 K 1.40 GB/s 11.11 ms
8k 330.51 K 2.52 GB/s 12.22 ms
16k 274.06 K 4.19 GB/s 14.79 ms
32k 204.36 K 6.25 GB/s 19.95 ms
64k 141.78 K 8.68 GB/s 28.54 ms
128k 70.42 K 8.64 GB/s 58.99 ms
256k 37.75 K 9.30 GB/s 109.75 ms
512k 17.46 K 8.67 GB/s 216.53 ms
1m 8.56 K 8.65 GB/s 474.94 ms
120OSDS/pech/chain
write/iops write/bw write/clat_ns/mean
4k 380.29 K 1.45 GB/s 10.72 ms
8k 339.10 K 2.59 GB/s 11.99 ms
16k 280.28 K 4.28 GB/s 14.34 ms
32k 206.84 K 6.32 GB/s 19.64 ms
64k 131.57 K 8.05 GB/s 30.54 ms
128k 74.78 K 9.18 GB/s 54.25 ms
256k 39.82 K 9.81 GB/s 103.27 ms
512k 18.47 K 9.17 GB/s 213.78 ms
1m 8.98 K 9.08 GB/s 461.12 ms
There is a slight difference in bandwidth in favor of the "chain" model,
but I would rather attribute it to noise. Other runs of a similar
configuration show much the same results: there is a minor bandwidth
improvement, but it is not solid.
Client-based results are much more interesting:
120OSDS/pech/client-based
write/iops write/bw write/clat_ns/mean
4k 534.08 K 2.04 GB/s 7.62 ms
8k 471.78 K 3.60 GB/s 8.64 ms
16k 367.12 K 5.61 GB/s 11.11 ms
32k 242.56 K 7.41 GB/s 16.82 ms
64k 124.54 K 7.63 GB/s 32.98 ms
128k 62.45 K 7.67 GB/s 66.71 ms
256k 31.10 K 7.69 GB/s 135.36 ms
512k 15.41 K 7.71 GB/s 282.41 ms
1m 7.63 K 7.82 GB/s 567.63 ms
Small block sizes show a significant improvement: almost 40%, from
380 K IOPS to 534 K IOPS, with a corresponding drop in latency. Starting
from the 64k block size the 25 Gbit/s client network is saturated
("client-based" replication means the client is responsible for sending
the data to all replicas, so with a 3x replication factor each byte is
sent 3 times from each client host; at ~8 GB/s aggregate for 8 clients
each client sends ~1 GB/s, which with 3x replication is ~3 GB/s, i.e.
exactly the ~24 Gbit/s of the client network).
What is important to keep in mind about the Pech OSD design is that each
OSD process has only 1 OS thread, so when a request is received and its
handler is executed, no preemption happens and no other requests can be
handled in parallel (unless a special scheduling routine is called, which
it is not, at least in the current state of the code). So the various PGs
on a particular Pech OSD are handled sequentially.
The design is highly CPU bound, thus one simple trick can be used to
increase bandwidth: pinning each OSD to a CPU. Since we have 24 OSDs and
24 CPUs, CPU affinity is easy to apply (see the sketch below the table):
120OSDS-AFF/pech/primary-copy
write/iops write/bw write/clat_ns/mean
4k 324.15 K 1.24 GB/s 12.35 ms
8k 293.52 K 2.24 GB/s 13.43 ms
16k 235.53 K 3.60 GB/s 16.46 ms
32k 187.31 K 5.73 GB/s 20.77 ms
64k 170.60 K 10.43 GB/s 23.10 ms
128k 92.54 K 11.33 GB/s 34.48 ms
256k 47.69 K 11.73 GB/s 97.32 ms
512k 18.52 K 9.19 GB/s 252.26 ms
1m 9.20 K 9.28 GB/s 507.33 ms
Bandwidth looks better for bigger blocks.
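For reference, the pinning boils down to something like the following
(a minimal sketch; the pech_osd process name is a placeholder and the way
OSDs are matched to CPUs may differ):
    # pin the i-th already-running OSD process on this host to CPU i
    i=0
    for pid in $(pgrep -f pech_osd | sort -n); do
        taskset -c -p "$i" "$pid"
        i=$((i + 1))
    done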
In conclusion about the replication models: I did not notice any
significant difference between "primary-copy" and "chain". Perhaps it
makes sense to play with the replication factor.
In turn, "client-based" replication can be very promising for loads
in homogeneous networks where there is no concurrent access to
images. A simple example is a cluster with compute and storage nodes in a
private network, where VMs access their own images. For such setups
latency is a factor that plays a huge role.
--
Roman
[1] https://github.com/rouming/pech
[2] https://github.com/rouming/ceph/tree/pech-osd
[3] https://github.com/rouming/linux/tree/akpm--ceph-client-based-replication
[4] https://github.com/rouming/pech/blob/master/scripts/fio-runner.py
We would like to be ready for the next Nautilus point release within the
next week. If you have any outstanding PRs for it, please label them
accordingly so they get tested and included.
Thx
YuriW
Today I rebased my branch after 3 weeks and saw that req_info is now
newly created. Is this req_info unique to every request, like req_state?
If so, I will have to change my code accordingly.