Details of this release are summarized here:
Please note that the number of suites has been reduced (fs now has all other
tests merged into it, and ceph-ansible has been removed). Let me know if some
tests are missing and we will add them.
Seeking approvals and trackers for known issues for:
rados - Neha ?
rgw - Casey ?
rbd - Ilya ?
fs - Patrick ?
upgrade-clients/client-upgrade-octopus-pacific - Josh, Neha ?
While looking at https://github.com/ceph/ceph/pull/32422, I think a
probably safer approach is to make the monitor more efficient. Currently, the
monitor is more or less a single-threaded application. Quite a few critical code
paths of the monitor are protected by Monitor::lock, among other things:
- the periodic task performed by tick(), which is in turn called by SafeTimer.
the "safety" of the SafeTimer is ensured by Monitor::lock.
- Monitor::_ms_dispatch is also called with Monitor::lock acquired. In
the case of https://github.com/ceph/ceph/pull/32422, one or
more kcephfs clients are even able to slow down the whole cluster by asking
for the latest osdmap with an ancient one in hand, if the cluster is
able to rebalance/recover quickly and accumulate lots of osdmaps in a
A typical scary use case is:
1. An all-flash cluster has just completed a rebalance/recovery. The rebalance
completed quickly, and it left the cluster with a ton of osdmaps before
some of the clients had a chance to pick up these updated maps.
2. (kcephfs) clients with ancient osdmaps in hand wake up randomly,
and they want the latest osdmap!
3. Monitors are occupied with loading the maps from rocksdb and encoding
them in very large batches (when discussing with the author of
https://github.com/ceph/ceph/pull/32422, he mentioned that the total size
of the incremental osdmaps could be up to 200~300 MiB).
4. And the cluster is basically unresponsive.
So, does it sound like the right way to improve its performance when serving
this CPU-intensive workload to dissect the data dependencies in the
monitor and to explore the possibility of making the monitor more
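
To make the question a bit more concrete, here is a minimal sketch (in Python, not the monitor's actual C++ code) of one way to shrink the critical section: collect references to the needed maps while holding the lock, then do the CPU-heavy encoding after releasing it. `MapStore`, `collect_since` and `encode_maps` are hypothetical names invented for this illustration, not Ceph APIs, and the "encoding" is a trivial stand-in.

```python
import threading


class MapStore:
    """Hypothetical stand-in for the monitor's map store; not a Ceph API."""

    def __init__(self):
        self.lock = threading.Lock()  # plays the role of Monitor::lock
        self.maps = {}                # epoch -> immutable bytes payload

    def add(self, epoch, payload):
        with self.lock:
            self.maps[epoch] = bytes(payload)

    def collect_since(self, epoch):
        # Short critical section: only gather references to immutable blobs.
        with self.lock:
            return [(e, self.maps[e]) for e in sorted(self.maps) if e > epoch]


def encode_maps(store, client_epoch):
    """Build a reply for a client whose newest map is `client_epoch`."""
    pending = store.collect_since(client_epoch)
    # The expensive part runs without the lock, so a tick or another
    # dispatch would not be blocked while the reply is assembled.
    lock_held_during_encode = store.lock.locked()
    reply = b"".join(e.to_bytes(4, "little") + blob for e, blob in pending)
    return reply, lock_held_during_encode
```

This only works in the sketch because the stored payloads are immutable. Whether the real monitor state can be split up this way is exactly the data-dependency question raised above.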
Just sharing my Sunday morning frustration of checking the build of my ports.
This occurs in ./src/test/encoding/check-generated.sh
In itself this type of problem is of course trivial to solve.
But in this case we use diff to compare the output, so there is
no easy way to fix this:
/tmp/typ-s31EUGoSy /tmp/typ-iLTjVqhpI differ: char 24, line 2
**** DecayCounter test 1 dump_json check failed ****
ceph-dencoder type DecayCounter select_test 1 dump_json > /tmp/typ-s31EUGoSy
ceph-dencoder type DecayCounter select_test 1 encode decode dump_json > /tmp/typ-iLTjVqhpI
< "value": 2.99990449484967,
> "value": 2.9999046414456356,
Probably the easiest is to exclude the test and go on with life as it is.
But the correct way is probably to shorten the representation of the printed float.
So we end up with '2.9999', or perhaps even shorter, which would make it '3.000'.
Is that something that is appropriate to do in the dump_json part, if I can single
out the DecayCounter as an "exception"?
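
For illustration, here is a small Python sketch of the two options using the two values from the failed check. It is not part of check-generated.sh or ceph-dencoder; `dumps_match` is a hypothetical helper, the one-field JSON documents are made up for the example, and the tolerance of 1e-6 is an arbitrary assumption.

```python
import json
import math


def dumps_match(left, right, rel_tol=1e-6):
    """Compare two JSON texts, allowing floats to differ within rel_tol."""
    return _match(json.loads(left), json.loads(right), rel_tol)


def _match(a, b, rel_tol):
    if isinstance(a, dict) and isinstance(b, dict):
        return a.keys() == b.keys() and all(_match(a[k], b[k], rel_tol) for k in a)
    if isinstance(a, list) and isinstance(b, list):
        return len(a) == len(b) and all(_match(x, y, rel_tol) for x, y in zip(a, b))
    if isinstance(a, float) and isinstance(b, float):
        return math.isclose(a, b, rel_tol=rel_tol)
    return a == b


# The two values reported by the failing DecayCounter check.
before = '{"value": 2.99990449484967}'
after = '{"value": 2.9999046414456356}'

# Option 1: print fewer digits, so a plain diff sees identical text.
short_before = format(2.99990449484967, ".4f")
short_after = format(2.9999046414456356, ".4f")

# Option 2: keep full precision and compare with a tolerance instead of diff.
same_within_tolerance = dumps_match(before, after)
```

Shortening the printed value can still fail when the two values straddle a rounding boundary, so a tolerance-based comparison may be the more robust of the two, at the cost of no longer using plain diff.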
On Mon, May 3, 2021 at 6:36 AM Lokendra Rathour
> Hi Team,
> I was setting up the ceph cluster with
> - Node Details: 3 Mon, 2 MDS, 2 Mgr, 2 RGW
> - Deployment Type: Active Standby
> - Testing Mode: Failover of MDS Node
> - Setup: Octopus (15.2.7)
> - OS: CentOS 8.3
> - Hardware: HP
> - RAM: 128 GB on each node
> - OSD: 2 (1 TB each)
> - Operation: Normal I/O with mkdir every 1 second.
> *Test Case: Power off any active MDS node for failover to happen*
> We have observed that whenever an active MDS node is down, it takes around
> *40 seconds* to activate the standby MDS node.
> On further checking the logs for the new-handover MDS node, we have seen
> a delay on the basis of the following inputs:
> 1. 10 second delay after which Mon calls for new Monitor election
> 1. [log] 0 log_channel(cluster) log [INF] : mon.cephnode1 calling
> monitor election
In the process of killing the active MDS, are you also killing a monitor?
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat Sunnyvale, CA