Hi,
This is with v15.2 and v15.2.8.
Once an OSD service is applied, it can't be removed.
It always shows up in "ceph orch ls".
"ceph orch rm <osd service name>" only marks it "unmanaged";
it does not actually remove it.
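(For reference, the sequence looks like this; the service name
"osd.osd-spec" is just an example:)

$ ceph orch ls osd
$ ceph orch rm osd.osd-spec
$ ceph orch ls osd    # service is still listed, now flagged unmanaged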
Is this the expected behavior?
Thanks!
Tony
Hi,
I've been trying this with v15.2 and v15.2.8, with no luck.
I'm wondering whether this is actually supported, or has ever worked for
anyone.
Here is what I've done.
1) Create a cluster with 1 controller (mon and mgr) and 3 OSD nodes,
each of which has 1 SSD for DB and 8 HDDs for data.
2) OSD service spec.
service_type: osd
service_id: osd-spec
placement:
  hosts:
    - ceph-osd-1
    - ceph-osd-2
    - ceph-osd-3
spec:
  block_db_size: 92341796864
  data_devices:
    model: ST16000NM010G
  db_devices:
    model: KPM5XRUG960G
3) Add the OSD hosts and apply the OSD service spec (the exact commands
are shown below, after the log). 8 OSDs (data on HDD and DB on SSD) are
created on each host properly.
4) Run "orch osd rm 1 --replace --force". OSD is marked "destroyed" and
reweight is set to 0 in "osd tree". "pg dump" shows no PG on that OSD.
"orch ps" shows no daemon running for that OSD.
5) Run "orch device zap <host> <device>". VG and LV for HDD are removed.
LV for DB stays. "orch device ls" shows HDD device is available.
6) Cephadm finds the OSD claims and applies the OSD spec on the host.
Here is the message:
============================
cephadm [INF] Found osd claims -> {'ceph-osd-1': ['1']}
cephadm [INF] Found osd claims for drivegroup osd-spec -> {'ceph-osd-1': ['1']}
cephadm [INF] Applying osd-spec on host ceph-osd-1...
cephadm [INF] Applying osd-spec on host ceph-osd-2...
cephadm [INF] Applying osd-spec on host ceph-osd-3...
cephadm [INF] ceph-osd-1: lvm batch --no-auto /dev/sdc /dev/sdd
/dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj
--db-devices /dev/sdb --block-db-size 92341796864
--osd-ids 1 --yes --no-systemd
code: 0
out: ['']
err: ['/bin/docker:stderr --> passed data devices: 8 physical, 0 LVM',
'/bin/docker:stderr --> relative data size: 1.0',
'/bin/docker:stderr --> passed block_db devices: 1 physical, 0 LVM',
'/bin/docker:stderr --> 1 fast devices were passed, but none are available']
============================
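(The commands for step 3 were along these lines; the spec filename is
an example:)

$ ceph orch host add ceph-osd-1
$ ceph orch host add ceph-osd-2
$ ceph orch host add ceph-osd-3
$ ceph orch apply osd -i osd-spec.yml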
Q1. Is the DB LV on the SSD supposed to be deleted when replacing an OSD
whose data is on HDD and DB is on SSD?
Q2. If yes to Q1, is a new DB LV supposed to be created on the SSD, as
long as there is sufficient free space, when building the new OSD?
Q3. If no to Q1, since this is a replacement, is the old DB LV going to
be reused for the new OSD?
Again, is this actually supposed to work? Am I missing anything, or am I
just trying an unsupported feature?
Thanks!
Tony
Hi all,
In an OCS (Rook) environment, the workflow for RGW daemons is as follows.
Normally, to create a Ceph object store, Rook first creates the pools
for the RGW daemon with the specified configuration.
Then, depending on the number of instances, Rook creates a cephx user
and spawns the RGW daemon in a container (pod) using that ID,
with the following arguments for the radosgw binary:
Args:
--fsid=91501490-4b55-47db-b226-f9d9968774c1
--keyring=/etc/ceph/keyring-store/keyring
--log-to-stderr=true
--err-to-stderr=true
--mon-cluster-log-to-stderr=true
--log-stderr-prefix=debug
--default-log-to-file=false
--default-mon-cluster-log-to-file=false
--mon-host=$(ROOK_CEPH_MON_HOST)
--mon-initial-members=$(ROOK_CEPH_MON_INITIAL_MEMBERS)
--id=rgw.my.store.a
--setuser=ceph
--setgroup=ceph
--foreground
--rgw-frontends=beast port=8080
--host=$(POD_NAME)
--rgw-mime-types-file=/etc/ceph/rgw/mime.types
--rgw-realm=my-store
--rgw-zonegroup=my-store
--rgw-zone=my-store
Here the cephx user will be "client.rgw.my.store.a", and all the pools
for RGW will be created as my-store*. Normally, if another instance is
requested in the CephObjectStore config file for Rook [1], another user
"client.rgw.my.store.b" will be created by Rook and will consume the
same pools.
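(For reference, the relevant fragment of the object-store example in
[1], abridged from memory:)

apiVersion: ceph.rook.io/v1
kind: CephObjectStore
metadata:
  name: my-store
spec:
  gateway:
    port: 8080
    instances: 2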
There is a feature in Kubernetes known as autoscaling, in which pods can
be scaled automatically based on specified metrics. If we apply that
feature to the RGW pods, Kubernetes will automatically scale them (as
clones of the existing pod) with the same "--id" argument, based on the
metrics, but Ceph cannot distinguish those as different RGW daemons even
though multiple RGW pods are running simultaneously.
"ceph status" shows only one RGW daemon as well.
In vstart or ceph-ansible (Ali helped me figure this out), I can see
that a cephx user is created for each RGW daemon as well.
Is this behaviour intended, or am I hitting a corner case that was never
tested before?
There is no point in autoscaling the RGW pods if they are all considered
the same daemon: the S3 clients will talk to only one of the pods, and
the metrics provided by ceph-mgr can be incorrect as well, which can
affect the autoscale feature.
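(The kind of autoscaling I mean, for illustration; the deployment name
follows Rook's rook-ceph-rgw-<store>-<n> convention and is an
assumption:)

$ kubectl -n rook-ceph autoscale deployment rook-ceph-rgw-my-store-a \
    --min=1 --max=5 --cpu-percent=80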
I also opened an issue in Rook for the time being [2].
[1] https://github.com/rook/rook/blob/master/cluster/examples/kubernetes/ceph/o…
[2] https://github.com/rook/rook/issues/6943
Regards,
Jiffin
Hi,
The Ceph source code contains a script called vstart.sh which allows
developers to quickly test their code using a simple deployment on their
development system.
Here: https://docs.ceph.com/en/latest/dev/quick_guide/
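(For reference, the quick guide starts a small local cluster along these
lines, from the build directory:)

$ MON=3 OSD=3 MDS=0 ../src/vstart.sh -d -n -x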
I am really curious how far we can go with the vstart.sh script.
While my development cluster is running, I use tools like rados bench,
rbd, and rbd-nbd to benchmark simple workloads and test my code. Do we
have options to change the network settings in the fake cluster built
from the vstart script and then benchmark it? For example, comparing
1Gbit and 10Gbit Ethernet.
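(Not a vstart feature as far as I know, but one rough way to approximate
different link speeds on a single-node vstart cluster is to shape the
loopback interface with tc; this caps bandwidth only, not real NIC
latency:)

$ sudo tc qdisc add dev lo root netem rate 1gbit     # ~1Gbit cap
$ sudo tc qdisc change dev lo root netem rate 10gbit # switch to ~10Gbit
$ sudo tc qdisc del dev lo root                      # remove when done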
Thanks
(Sending this to the dev list as people there might know.)
Hi,
There are many talks and presentations out there about Ceph's
performance. Ceph is great when it comes to parallel I/O, large queue
depths and many applications sending I/O towards Ceph.
One thing where Ceph isn't the fastest is 4k blocks written at queue
depth 1 (qd=1).
Some applications benefit very much from high-performance/low-latency
I/O at qd=1, for example single-threaded applications writing small
files inside a VM running on RBD.
With some tuning you can get to a ~700us latency for a 4k write with
qd=1 (replication, size=3).
I benchmark this using fio:
$ fio --ioengine=rbd --bs=4k --iodepth=1 --direct=1 .. .. .. ..
A 700us latency means the result will be about ~1400 IOps (1000 / 0.7 ≈ 1430).
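(A complete invocation might look like this; the pool and image names
are placeholders:)

$ fio --name=qd1-write --ioengine=rbd --clientname=admin --pool=rbd \
    --rbdname=bench --bs=4k --iodepth=1 --direct=1 --rw=randwrite \
    --time_based --runtime=60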
When comparing this to, let's say, a BSD machine running ZFS, that's on
the low side. With ZFS+NVMe you'll be able to reach somewhere between
7,000 and 10,000 IOps; the latency is simply much lower.
My benchmarking / test setup for this:
- Ceph Nautilus/Octopus (doesn't make a big difference)
- 3x SuperMicro 1U with:
- AMD Epyc 7302P 16-core CPU
- 128GB DDR4
- 10x Samsung PM983 3.84TB
- 10Gbit Base-T networking
Things to configure/tune (example commands below):
- C-state pinning to 1
- CPU governor set to performance
- Turn off all logging in Ceph (debug_osd, debug_ms, debug_bluestore=0)
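(Roughly what that tuning looks like; the C-state mechanism varies per
platform, so treat this as a sketch:)

$ sudo cpupower frequency-set -g performance
$ sudo cpupower idle-set -D 1   # or boot with processor.max_cstate=1
$ ceph tell osd.* config set debug_osd 0/0
$ ceph tell osd.* config set debug_ms 0/0
$ ceph tell osd.* config set debug_bluestore 0/0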
Higher clock speeds (new AMD Epyc coming in March!) help to reduce the
latency, and going towards 25Gbit/100Gbit might help as well.
These are, however, only very small increments and might reduce the
latency by another 15% or so.
It doesn't bring us anywhere near the 10k IOps other applications can do.
And I totally understand that replication over a TCP/IP network takes
time and thus increases latency.
The Crimson project [0] is aiming to lower the latency with many things
like DPDK and SPDK, but this is far from finished and production ready.
In the meantime, am I overlooking something here? Can we reduce the
latency of the current OSDs any further?
Reaching a ~500us latency would already be great!
Thanks,
Wido
[0]: https://docs.ceph.com/en/latest/dev/crimson/crimson/
Hi everyone,
During CDM today Ilya pointed out that there is an open pull request
that adds on-wire compression to msgr v2 here:
https://github.com/ceph/ceph/pull/36517
Before we proceed there, though, we decided we should have a broader
discussion about how on-wire compression should be implemented.
The current pull request implements this purely in the msgr layer.
Benefits include that it applies to all messages--not just OSD
replication but also client/OSD traffic, inter-MDS traffic, and so on.
Downsides include that replicated writes are compressed multiple
times--once for each replica.
One alternate approach might be:
- expand the Message interface to allow set_data()/get_data() to
accept/expose compressed data (e.g., data + compression_disposition).
The message header could include a field indicating what codec was
used.
- OSD replication code could compress the data once and pass it to the
messages for both replicas
This would only capture the data portion of the message payload, but
that is probably the only part we really care about. It would also
require some special support for all the users that want to take
advantage of it... probably the osd replication backend and Objecter
to start. One could also imagine extending this to allow compressed
data to pass all the way through to bluestore, although that brings in
some additional concerns (bluestore has a max chunk size and some
alignment considerations, for instance).
Another possibility is integrating compression into bufferlist. I'm
not sure that represents a very compelling set of trade-offs, however.
Other thoughts?
sage
hi Mark and Radek,
i am sending this mail for further discussion on our recent perf tests on
crimson.
Chunmei also performed performance tests testing classic osd + memstore and
crimson osd + cyanstore using "rados bench" and fio. where
- only a single async_op thread and a single core were designated to
the classic osd,
- two rados bench instances were used when testing with "rados bench",
- two jobs were used when testing with fio,
- the server was not co-located with the client,
- a single osd instance was used.
see
- https://gist.github.com/liu-chunmei/4fd88fd0ff56d6849439a2df329aa80e
- https://gist.github.com/liu-chunmei/f696b9c4f31b123fb223cdd47f13c8ea
respectively.
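(for reference, "two rados bench instances" means something along these
lines; the pool name and runtime are assumptions:)

$ rados bench -p bench 60 write -t 16 --run-name client1 &
$ rados bench -p bench 60 write -t 16 --run-name client2 &
$ wait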
her findings are
- in the rados bench tests, crimson performs better than classic osd in
general.
- in the fio tests, the performance of crimson is almost on par with that
of classic osd. the cycles-per-op of crimson is significantly lower than
that of classic osd.
but in the last standup, she mentioned that her impression was that the
alien store does not benefit from adding more than 2 threads. this does
not match with your recent findings.
thoughts?
cheers,
Hi Folks,
I am working on GitHub action based automation for running teuthology
on some batch of PRs.
Following step describes this workflow:
Step 1: Once any PR passes necessary checks( make etc) a newly
introduced label `next` can be assigned to PR.
Step 2: The `next` labeled PRs will be batched together and scheduled
for the next teuthology run.
The collected labels from all the batched PRs will be used for
selecting the teuthology suite.
That said, there are two options for batching and scheduling a
teuthology run:
1. Batch all patches together
2. Batch patches component-wise
Both options have their pros and cons: a high number of teuthology jobs
per run if batched together, versus a larger number of teuthology runs
if batched component-wise.
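(Roughly what the scheduling step might invoke; the suite, branch, and
priority here are placeholders:)

$ teuthology-suite --suite rados --ceph <integration-branch> \
    --machine-type smithi --priority 100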
Please let me know which option is better to go ahead with, and whether
there are ways to improve this.
/sunny