For developers submitting jobs using teuthology, we now have
recommendations on what priority level to use:
https://docs.ceph.com/docs/master/dev/developer_guide/#testing-priority
--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
Hi Mark,
While trying to figure out a random failure in the mempool tests[0], introduced when fixing a bug in how mempool selects the shards holding the byte count of a given pool[1] earlier this year, I became intrigued by this "cache line ping pong" problem[2]. I wonder if you have some kind of benchmark, somewhere in your toolbox, that someone could use to demonstrate the problem. Maybe such code could be adapted to show the benefit of the optimization implemented in mempool?
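I don't know what you have in your toolbox, but for reference here is a minimal, self-contained sketch (plain C++17 with std::thread; all names made up) of the kind of micro-benchmark that makes the problem visible: two threads bumping counters that either share a cache line or are padded onto separate lines, which is essentially the contention the sharded mempool counters are meant to avoid.

// Minimal false-sharing ("cache line ping pong") demo: two threads increment
// counters that share a cache line, then counters padded onto separate lines.
// On typical multi-core hardware the padded variant is several times faster.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

constexpr long kIters = 50'000'000;

struct SameLine {                // a and b almost certainly share a cache line
  std::atomic<long> a{0};
  std::atomic<long> b{0};
};
struct PaddedLines {             // force a and b onto separate 64-byte lines
  alignas(64) std::atomic<long> a{0};
  alignas(64) std::atomic<long> b{0};
};

template <typename Counters>
double run() {
  Counters c;
  auto start = std::chrono::steady_clock::now();
  std::thread t1([&] { for (long i = 0; i < kIters; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
  std::thread t2([&] { for (long i = 0; i < kIters; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
  t1.join();
  t2.join();
  return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

int main() {
  std::printf("same cache line:      %.2fs\n", run<SameLine>());
  std::printf("separate cache lines: %.2fs\n", run<PaddedLines>());
  return 0;
}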
Cheers
[0] https://tracker.ceph.com/issues/49781#note-9
[1] https://github.com/ceph/ceph/pull/39057/files
[2] https://www.drdobbs.com/parallel/understanding-and-avoiding-memory-issues/2…
--
Loïc Dachary, Artisan Logiciel Libre
Hi everyone,
Mark your calendars for April 6 - 7 for the Ceph Developer Summit!
The plan is a virtual meeting format with a loose schedule of
technical, developer-focused discussions for the next software
release, Quincy. Each day (2-3 hours) will have development
discussions focused on a particular Ceph component.
The format for these sessions is primarily discussion, but
presentations are OK for things like visual diagrams. Birds of a
feather and other hallway-type tracks are welcome as well.
Please follow this etherpad and the Ceph Dev mailing list for further
updates on exact start times and meeting link. Session proposals will
also be collected on this etherpad:
https://pad.ceph.com/p/cds-quincy
--
Mike Perez
Hi everyone,
I cleaned up the CFP coordination etherpad with some upcoming events.
Please add other events where you think the community should consider
proposing content on Ceph or adjacent projects like Rook.
The KubeCon NA CFP, for example, closes on April 11. Take a look:
https://pad.ceph.com/p/cfp-coordination
I have also added this to our wiki for discovery.
https://tracker.ceph.com/projects/ceph/wiki/Community
--
Mike Perez
Hi all,
This email describes how to design:
1) ReplicaDaemon:
The daemon, running on a host with DCPMM and an RNIC (RDMA NIC),
and what kind of info it reports to the Ceph Monitor.
2) ReplicaMonitor:
ReplicaMonitor, a new PaxosService in the Ceph Monitor, which manages
the ReplicaDaemons' info and handles librbd's requests by selecting
the appropriate ReplicaDaemons' info to return to librbd.
This email doesn't cover:
After librbd gets the ReplicaDaemons' info, how librbd will communicate
with the ReplicaDaemon and how the replication is completed.
RFC PR: [WIP] aggregate client state and route info
https://github.com/ceph/ceph/pull/37931
Detail:
+-----------------------------------+ +-----------------------------------------------+
|+---------------------------------+| | +--------------------+|
|| ReplicaDaemonInfo: || | |PaxosServiceMessage ||
|| || |+---------------------------------------------+|
|| daemon_id; || ||MReplicaDaemonBlink(MSG_REPLICADAEMON_BLINK):||
|| rnic_bind_port; || || ||
|| rnic_addr; || ||ReplicaDaemonInfo; ||
|| free_size; || |+---------------------------------------------+|
|+---------------------------------+| | +--------------------+|
|+---------------------------------+| | |PaxosServiceMessage ||
|| ReqReplicaDaemonInfo: || |+---------------------------------------------+|
|| || ||MMonGetReplicaDaemonMap(CEPH_MSG_MON_GET_REPL||
|| replicas; || ||ICADAEMONMAP): ||
|| replica_size; || || ||
|+---------------------------------+| ||ReqReplicaDaemonInfo; ||
|+---------------------------------+| |+---------------------------------------------++
|| ReplicaDaemonMap: || | +-------+|
|| || | |Message||
|| std::vector<ReplicaDaemonInfo>; || |+---------------------------------------------+|
|+---------------------------------+| ||MReplicaDaemonMap(CEPH_MSG_REPLICADAEMON_MAP)||
| MetaData(need encode/decode) | || ||
| | || ||
| | ||ReplicaDaemonMap; ||
| | |+---------------------------------------------+|
| | | |
| | | Three messages defined for the MetaData |
+-----------------------------------+ +-----------------------------------------------+
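To make the "MetaData (need encode/decode)" part above more concrete, here is a rough sketch of how ReplicaDaemonInfo could be serialized with Ceph's usual versioned encoding macros from include/encoding.h. The field types and exact layout here are assumptions for illustration only; the authoritative definitions are in the RFC PR above.

// Sketch only: field types below are assumptions, not the PR's definitions.
#include <cstdint>
#include <string>
#include "include/encoding.h"   // ENCODE_START/FINISH, WRITE_CLASS_ENCODER, ...

struct ReplicaDaemonInfo {
  uint32_t daemon_id = 0;        // unique id of this ReplicaDaemon
  uint16_t rnic_bind_port = 0;   // port the RNIC listens on
  std::string rnic_addr;         // RDMA-reachable address
  uint64_t free_size = 0;        // free DCPMM capacity (bytes)

  void encode(ceph::buffer::list& bl) const {
    using ceph::encode;
    ENCODE_START(1, 1, bl);      // struct_v = 1, compat = 1
    encode(daemon_id, bl);
    encode(rnic_bind_port, bl);
    encode(rnic_addr, bl);
    encode(free_size, bl);
    ENCODE_FINISH(bl);
  }
  void decode(ceph::buffer::list::const_iterator& p) {
    using ceph::decode;
    DECODE_START(1, p);
    decode(daemon_id, p);
    decode(rnic_bind_port, p);
    decode(rnic_addr, p);
    decode(free_size, p);
    DECODE_FINISH(p);
  }
};
WRITE_CLASS_ENCODER(ReplicaDaemonInfo)

ReplicaDaemonMap would then just wrap std::vector<ReplicaDaemonInfo> (plus an epoch, presumably) with the same encode/decode pattern, so it can be carried in the three messages shown above.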
+--------+ +------------+
|Dispatch| |PaxosService|
+---------------------+ Update ReplicaDaemonInfo +---------------------------+
| ReplicaDaemon: | through | ReplicaMonitor: |
| | MReplicaDaemonBlink | |
| ReplicaDaemonInfo; -----------------------------------> ReplicaDaemonMap; |
| | | |
| ms_dispatch; | | //Need implement some APIs|
+---------------------+ +------^-------------|------+
Request ReplicaDaemonMap Feedback ReplicaDaemonMap
through | |through
MMonGetReplicaDaemonMap MReplicaDaemonMap
+------|-------------v------+
| librbd |
+---------------------------+
The ReplicaDaemon reports its ReplicaDaemonInfo to the ReplicaMonitor via the MReplicaDaemonBlink message.
The ReplicaMonitor stores all the ReplicaDaemonInfo into the ReplicaDaemonMap after it goes through Paxos.
The client (librbd) sends MMonGetReplicaDaemonMap to the ReplicaMonitor; the ReplicaMonitor
chooses the appropriate ReplicaDaemons, packs their info into a new ReplicaDaemonMap, and sends it
back to the client via the MReplicaDaemonMap message.
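On the monitor side, the "//Need implement some APIs" note in the ReplicaMonitor box would presumably map to the usual PaxosService split: read-only queries answered in preprocess_query(), state changes proposed through prepare_update() and committed via Paxos. A rough sketch of the routing, for illustration only (the helper names preprocess_get_replicadaemon_map() and prepare_replicadaemon_blink() are made up, not the PR's code):

// Illustrative routing only; the helper bodies are assumptions.
bool ReplicaMonitor::preprocess_query(MonOpRequestRef op)
{
  switch (op->get_req()->get_type()) {
  case CEPH_MSG_MON_GET_REPLICADAEMONMAP:
    // Read-only: select suitable daemons from the committed ReplicaDaemonMap
    // according to ReqReplicaDaemonInfo and reply with MReplicaDaemonMap.
    return preprocess_get_replicadaemon_map(op);
  case MSG_REPLICADAEMON_BLINK:
    return false;  // modifies state, let prepare_update() handle it
  default:
    return false;
  }
}

bool ReplicaMonitor::prepare_update(MonOpRequestRef op)
{
  switch (op->get_req()->get_type()) {
  case MSG_REPLICADAEMON_BLINK:
    // Merge the reported ReplicaDaemonInfo into the pending ReplicaDaemonMap;
    // clients see it once Paxos commits the proposal.
    return prepare_replicadaemon_blink(op);
  default:
    return false;
  }
}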
B.R.
Changcheng
Hi Ceph users and developers,
[Apologies if duplicated. I posted the same content to ceph-users two days ago, but it hasn't shown up so far.]
The objective of this email is to find a solution to the slow MDS recovery/rejoin process. Our group at LINE has provided a shared file system service based on CephFS to K8S and OpenStack users since last year. The performance and functionality of CephFS are good for us, whereas MDS availability is not satisfactory. In practice, restarting the active MDS(s) is needed for reasons such as version upgrades, daemon crashes, and parameter changes. In our experience, it takes from a few minutes to tens of minutes with hundreds of sessions and two active MDSs in a cluster, where mds_cache_memory_limit is set to more than 16GB. So we hesitate to restart MDSs, and our customers are also not satisfied with this situation.
To analyze and reproduce the slow MDS recovery process, some experiments have been conducted using our test environment as described below.
- CentOS 7.9 kernel 3.10.0-1160.11.1.el7.x86_64
- Nautilus 14.2.16
- mds_cache_memory_limit 16GB
- MDS (two active MDSs, two standby-replay MDSs, one single MDS)
- OSD (20 OSDs, each OSD maintains a 100GB virtual disk)
- 500 sessions using the kernel driver (among them, only 50 sessions generating workloads are considered active clients, while the other sessions are just mounted and rarely issue disk stat requests.)
- VDbench tool is employed to generate metadata-intensive workloads
In this experiment, each session has its own subvolume allocated for testing. VDbench for each session is configured with depth=1, width=10, files=24576, filesize=4K, and elapsed=10800s. Each VDbench instance on a session first creates directories and files. Once the predefined numbers of dirs/files have been created, it randomly issues getattr, create, and unlink operations for some hours.
While VDbench is running, our restart test procedure written in Python restarts an active MDS when the average numbers of inodes and caps are greater than 4,915,200 and 1,000,000 respectively. Recovery times from stopping an active MDS until it becomes active again were measured as listed below. The results show that the numbers vary from a few minutes to tens of minutes across runs.
recovery_count, recovery_time(s)
1, 1557
2, 1386
3, 846
4, 1012
5, 1119
6, 1272
A few things I have analyzed:
- The rejoin process consumes a considerable amount of time. That's a known issue. (Sometimes the MDS respawned; increasing mds_heartbeat_grace doesn't help.)
- Reducing caps and mds_cache_memory_limit has some impact on recovery performance. As the inode/dentry caches are reduced, reply latencies increase sharply.
- If all clients use ceph-fuse, the recovery time can be several times longer than with the kernel driver.
- Even though only one active MDS is restarted, another MDS is sometimes also abruptly restarted.
- Dropping the cache (e.g., ceph daemon mds.$(hostname) cache drop) helps reduce the recovery time, but it takes a few hours under active client workloads and latency spikes are inevitable.
A few of my questions:
- Why does MDS recovery take so long despite a graceful restart? (What does the time mainly depend on?)
- Are there solutions to this problem with many active sessions and a large MDS cache size? (A recovery time of 1-2 minutes would be satisfactory; that's our target value.)
- How can we make it deterministic in less than a few minutes?
Thanks
Yongseok
On 3/31/21 5:24 AM, Stefan Kooman wrote:
> On 3/30/21 10:28 PM, David Galloway wrote:
>> This is the 19th update to the Ceph Nautilus release series. This is a
>> hotfix release to prevent daemons from binding to loopback network
>> interfaces. All nautilus users are advised to upgrade to this release.
>
> Are Ceph Nautilus 14.2.19 AMD64 packages for Ubuntu Xenial still being
> built? I only see Arm64 packages in the repository.
>
> Gr. Stefan
>
They will be built and pushed hopefully today. We had a bug in our CI
after updating our builders to Ubuntu Focal.
The Ceph leadership team is composed primarily of team leads for the
various Ceph components. We meet weekly [1] to discuss project
direction and technical challenges, and to plan events. The meeting minutes
are recorded on the Ceph etherpad [0]. Anyone involved in the project
is welcome to join.
Today we discussed some build issues relating to the release of
Pacific. A container build for CentOS 8 is failing in Shaman but was built
for the release. nfs-ganesha is also being built for CentOS 7 but not 8. A
fix [5] is in the works. Following build discussions, a round of
Kraken rum was dispensed to celebrate the successful and imminent
release of Pacific.
There are also new plans to host a monthly meeting for the Ceph
Dashboard to synchronize with component developers, with the aim of
improving cross-talk between developers and reducing dashboard
breakage with new code. Future discussion on this will take place at
the CDS next week.
Quincy is in planning now. In lieu of Cephalocon, we are hosting a
Ceph Developer Summit (CDS) to synchronize with the community and discuss
plans for Quincy development. Some proposed topics are stored in
etherpad at [3]. A blog post is in the works [4].
[0] https://pad.ceph.com/p/clt-weekly-minutes
[1]
https://calendar.google.com/calendar/embed?src=9ts9c7lt7u1vic2ijvvqqlfpo0%4…
[3] https://pad.ceph.com/p/cds-quincy
[4] https://ceph.io/?p=13050&preview=1&_ppp=82ad2c4365
[5] https://github.com/ceph/ceph-build/pull/1781
--
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D