While poking through one of our Nautilus clusters, I noticed OSDs have
heartbeat (HB) peers that are not sharing PGs.
Nautilus added OSDMap::get_random_up_osds_by_subtree() to select
random OSDs of type mon_osd_reporter_subtree_level even if
mon_osd_min_down_reporters is already met.
If you have multiple types of hardware mapped to different pools, OSDs
between these pools will HB each other, which is not necessarily
expected from an operations point of view. This also has the potential
to wrongly mark OSDs down if one type of hardware is having issues.
The more HB peers the better, but couldn't we increase the default for
mon_osd_min_down_reporters instead, and call
get_random_up_osds_by_subtree only if it isn't met? I initially made a
patch to exclude any OSD not part of the same crush root, but this
wouldn't work universally since it's possible to have a crush rule
spanning multiple trees. I'm not sure what other alternatives there are.
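The alternative could be sketched roughly like this (hypothetical Python pseudocode of the selection logic, not the actual OSDMap/OSD C++ code; the function name and arguments are my own illustration):

```python
# Hypothetical sketch of the proposed peer selection: prefer PG-sharing
# peers, and only fall back to random subtree peers when
# mon_osd_min_down_reporters is not yet met.
def select_hb_peers(pg_peers, random_subtree_peers, min_down_reporters):
    peers = list(dict.fromkeys(pg_peers))  # de-duplicate, keep order
    if len(peers) >= min_down_reporters:
        return peers
    for osd in random_subtree_peers:
        if osd not in peers:
            peers.append(osd)
        if len(peers) >= min_down_reporters:
            break
    return peers
```

With this shape, OSDs from unrelated pools would only show up as HB peers when an OSD doesn't already have enough PG-sharing reporters.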
Another bit from pre-Nautilus: OSD id-1 and id+1 are added to the HB
peers in order to have a "fully-connected set" [1]. I'm not sure I
understand that comment; could somebody briefly explain how this
creates a fully connected set, and what set we're talking about?
Thanks!
[1] https://github.com/ceph/ceph/blob/master/src/osd/OSD.cc#L5141
Hi Ceph maintainers and developers,
The objective of this email is to discuss our work on dmClock-based client QoS management for CephFS.
Our group at LINE maintains Ceph storage clusters (RGW, RBD, and CephFS) to internally support OpenStack and K8S based private cloud environments for various applications and platforms, including LINE messenger. We have seen that the RGW and RBD services can provide consistent performance to multiple active users, since RGW employs the dmClock QoS scheduler for S3 clients and hypervisors internally utilize an I/O throttler for VM block storage clients. Unfortunately, unlike RGW and RBD, CephFS clients can directly issue metadata requests to MDSs and file data requests to OSDs as they want. This situation happens occasionally (or frequently), and other clients' performance may be degraded by the noisy neighbor. In the end, consistent performance cannot be guaranteed in our environment. From this observation and motivation, we are now considering a client QoS scheduler using the dmClock library for CephFS.
A few things about how to realize the QoS scheduler.
- Per-subvolume QoS management. IOPS resources are shared only among the clients that mount the same root directory. QoS parameters can be easily configured through extended attributes (similar to quota). Each dmClock scheduler can manage clients' requests using client session information.
- MDS QoS management. Client metadata requests such as create and lookup are managed by a dmClock scheduler placed between the dispatcher and the main request handler (e.g., Server::handle_client_request()). We have observed that two active MDSs provide approximately 20 KIOPS. As this capacity is sometimes scarce for large numbers of clients, QoS management is needed for the MDS.
- OSD QoS management. We would like to reopen and improve the previous work available at https://github.com/ceph/ceph/pull/20235.
- Client QoS management. Each client maintains a dmClock tracker to keep track of both rho and delta, which are packed into client request messages.
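To make the rho/delta tracking above concrete, here is a simplified sketch of dmClock-style tag assignment, following the formulas in the dmClock paper (the function name and signature are our own illustration, not the ceph dmclock library's actual API):

```python
# Simplified dmClock-style tag assignment (illustration only).
# rho/delta are the per-client counters piggybacked on each request
# in the distributed scheme; reservation/weight/limit are the QoS
# parameters configured per subvolume.
def update_tags(prev, now, rho, delta, reservation, weight, limit):
    """Return the (R, P, L) tags for an incoming request."""
    prev_r, prev_p, prev_l = prev
    R = max(prev_r + rho / reservation, now)    # reservation tag
    P = max(prev_p + delta / weight, now)       # proportional-share tag
    L = max(prev_l + delta / limit, now)        # limit tag
    return (R, P, L)
```

The scheduler would then serve requests whose R tag is due first (to honor reservations), and otherwise pick by P tag among requests whose L tag permits service.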
In the case of the CLI, QoS parameters are configured using extended attributes on each subvolume directory. Specifically, separate QoS configurations are provided for MDSs and OSDs.
setfattr -n ceph.dmclock.mds_reservation -v 200 /volumes/_nogroup/fdffc126-7961-4bbc-add2-2675b9e35a55
setfattr -n ceph.dmclock.mds_weight -v 500 /volumes/_nogroup/fdffc126-7961-4bbc-add2-2675b9e35a55
setfattr -n ceph.dmclock.mds_limit -v 1000 /volumes/_nogroup/fdffc126-7961-4bbc-add2-2675b9e35a55
setfattr -n ceph.dmclock.osd_reservation -v 500 /volumes/_nogroup/fdffc126-7961-4bbc-add2-2675b9e35a55
setfattr -n ceph.dmclock.osd_weight -v 1000 /volumes/_nogroup/fdffc126-7961-4bbc-add2-2675b9e35a55
setfattr -n ceph.dmclock.osd_limit -v 2000 /volumes/_nogroup/fdffc126-7961-4bbc-add2-2675b9e35a55
Our QoS work kicked off last month. Our first step has been to go over the prior work and the dmClock algorithm/library. We are now actively checking the feasibility of our idea with some modifications to the MDS and ceph-fuse. Our development is planned as follows.
- dmClock scheduler will be integrated into MDS and ceph-fuse by December 2020.
- dmClock scheduler will be integrated into the OSD in the first half of next year.
Does the community have any plans to develop per-client QoS management? Are there any other issues related to our QoS work? We look forward to hearing your valuable comments and feedback at this early stage.
Thanks
Yongseok Oh
I'm happy to announce another release of the go-ceph API
bindings. This is a regular release following our every-two-months release
cadence.
https://github.com/ceph/go-ceph/releases/tag/v0.7.0
Changes in the release are detailed in the link above.
The bindings aim to play a similar role to the "pybind" python bindings in the
ceph tree but for the Go language. These API bindings require the use of cgo.
There are already a few consumers of this library in the wild, including the
ceph-csi project.
Specific questions, comments, bugs etc are best directed at our github issues
tracker.
--
John Mulligan
phlogistonjohn(a)asynchrono.us
jmulligan(a)redhat.com
Hi all,
Just out of curiosity: considering that vector machines are being used in HPC
applications to accelerate certain kernels, do you think there are some
workloads in Ceph that could be good candidates to be offloaded to and
accelerated on vector machines?
Thanks in advance!
BR
FOSDEM is a free software event that offers open source communities a place to
meet, share ideas and collaborate. It is well known for being highly
developer-oriented and in the past brought together 8000+ participants from all
over the world. Its home is in the city of Brussels (Belgium).
FOSDEM 2021 will take place as an online event during the weekend of February
6-7, 2021. More details about the event can be found at http://fosdem.org/
** Call For Participation
The Software Defined Storage devroom will go into its fifth round for talks
around Open Source Software Defined Storage projects, management tools
and real-world deployments.
Presentation topics could include, but are not limited to:
- Your work on a SDS project like Ceph, Gluster, OpenEBS, CORTX or Longhorn
- Your work on or with SDS related projects like OpenStack SWIFT or Container
Storage Interface
- Management tools for SDS deployments
- Monitoring tools for SDS clusters
** Important dates:
- Dec 27th 2020: submission deadline for talk proposals
- Dec 31st 2020: announcement of the final schedule
- Feb 6th 2021: Software Defined Storage dev room
Talk proposals will be reviewed by a steering committee:
- Niels de Vos (OpenShift Container Storage Developer - Red Hat)
- Jan Fajerski (Ceph Developer - SUSE)
- TBD
Use the FOSDEM 'pentabarf' tool to submit your proposal:
https://penta.fosdem.org/submission/FOSDEM21
- If necessary, create a Pentabarf account and activate it.
Please reuse your account from previous years if you have
already created it.
https://penta.fosdem.org/user/new_account/FOSDEM21
- In the "Person" section, provide First name, Last name
(in the "General" tab), Email (in the "Contact" tab)
and Bio ("Abstract" field in the "Description" tab).
- Submit a proposal by clicking on "Create event".
- If you plan to register your proposal in several tracks to increase your chances,
don't! Register your talk once, in the most accurate track.
- Presentations have to be pre-recorded before the event and will be streamed on
the event weekend.
- Important! Select the "Software Defined Storage devroom" track
(on the "General" tab).
- Provide the title of your talk ("Event title" in the "General" tab).
- Provide a description of the subject of the talk and the
intended audience (in the "Abstract" field of the "Description" tab)
- Provide a rough outline of the talk or goals of the session (a short
list of bullet points covering topics that will be discussed) in the
"Full description" field in the "Description" tab
- Provide an expected length of your talk in the "Duration" field.
We suggest a length between 15 and 45 minutes.
** For accepted talks
Once your proposal is accepted, we will assign you a volunteer deputy who will
help you produce the talk recording. The volunteer will also try to ensure
the recording is of good quality, help with uploading it to the system and
broadcasting it during the event, and moderate the Q&A session after the
broadcast. Please note that as a presenter you're expected to be available
online during and especially after the broadcast of your talk. The schedule will
be available under
https://fosdem.org/2021/schedule/track/software_defined_storage/
Hope to hear from you soon! And please forward this announcement.
If you have any further questions, please write to the mailing list at
storage-devroom(a)lists.fosdem.org and we will try to answer as soon as
possible.
Thanks!
Hi Folks,
The weekly performance meeting will start in approx 10 minutes! The only
topic we have for today so far is discussing the excessive PGLog memory
usage some folks on the mailing list have been reporting recently.
Please feel free to add your own topic as well.
Hope to see you there!
Etherpad:
https://pad.ceph.com/p/performance_weekly
Bluejeans:
https://bluejeans.com/908675367
Thanks,
Mark
Hi, everyone.
Recently, one of our online clusters encountered a problem. When we
tried to add machines to an existing crush root, multiple PGs got
stuck in unknown, activating, or peering states. For now, we have
worked around the problem by restarting the OSDs related to the
inactive PGs, but the root cause is still unknown.
On the other hand, we found that our online configuration contains the
entry "bluestore_min_alloc_size_hdd = 262144", while
"bluefs_shared_alloc_size" is not configured, which means it is at its
default value of 64K. Normally, this combination would trigger an error
when creating OSDs. However, our online systems run version 14.2.4,
which does not trigger that error.
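For reference, the combination in question would look like this in ceph.conf (the second line is commented out to reflect that it is unset in our deployment and therefore at its 64 KiB default):

```ini
[osd]
# Our online setting: 256 KiB minimum allocation unit on HDD.
bluestore_min_alloc_size_hdd = 262144
# Unset in our config, so it stays at the 64 KiB default; as noted
# above, newer releases reject this mismatch at OSD creation time,
# but 14.2.4 does not.
#bluefs_shared_alloc_size = 65536
```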
My question is: could this misconfiguration be the root cause of the
problem mentioned above? Thanks:-)
Hi all,
I'm studying Paxos in Ceph because I need to add a new PaxosService.
In Ceph, Paxos is based on a single proposer and multiple acceptors,
so the quorum needs to choose the single proposer (leader) first.
It seems that there are two ways to choose one monitor as leader:
I. Normal way:
Elector::handle_ack
  |--> logic.receive_ack(peer_rank, m->epoch);
  |--> declare_victory();
// Note: the pre-condition is:
electing_me && (acked_me.size() == elector->paxos_size())
II. Another way, when a timeout event happens:
Elector::_start() or Elector::_defer_to
  |--> reset_timer();
  |--> expire_event = mon->timer.add_event_after(
         g_conf()->mon_election_timeout + plus,
         new C_MonContext{
           mon, [this](int) {
             logic.end_election_period();
           }
         });
When the timeout fires:
ElectionLogic::end_election_period()
  |--> declare_victory();
// Note: the pre-condition is:
electing_me && acked_me.size() > (elector->paxos_size() / 2)
I'm curious why their pre-conditions are different. This doesn't block
my development work; I just want to know why.
Does anyone know the reason?
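For comparison, the two pre-conditions boil down to the following (a Python restatement of the quoted C++ for illustration only; paxos_size is the number of monitors in the map, acked_me the set of monitors that acked this candidate):

```python
# Normal path: every monitor in the map must have acked (unanimity).
def victory_normal(electing_me, acked_me, paxos_size):
    return electing_me and len(acked_me) == paxos_size

# Timeout path: a strict majority of the map is enough, mirroring
# C++ integer division in paxos_size() / 2.
def victory_timeout(electing_me, acked_me, paxos_size):
    return electing_me and len(acked_me) > paxos_size // 2
```

So the normal path requires all monitors to respond, while the timeout path settles for a majority once the election period expires.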
B.R.
Changcheng
https://github.com/ceph/ceph/pull/38403
I'm not sure what's causing make check and the API tests to fail,
since this is only a documentation update and the changes made don't
touch any of the API-related or make-check-related parts of the code in
the repo, so I figured it wouldn't be a terrible idea to report it here.
Is anyone else getting similar failures?
Zac
Docs
Ceph Upstream