I have a CephFS in production based on 2 pools (data + metadata).
The data pool uses erasure coding with the following profile:
The metadata pool is replicated with size 3.
The CRUSH rules are as follows:
When we installed it, everything was in the same room, but we have since
split our cluster (6 servers, soon 8) across 2 rooms. We therefore updated
the crushmap by adding a room layer (with "ceph osd crush add-bucket
room1 room", etc.) and moved all our servers to the correct place in the
tree ("ceph osd crush move server1 room=room1", etc.).
Now we would like to change the rules to set the failure domain to room
instead of host (to be sure that, in case of a disaster in one of the
rooms, we still have a copy in the other).
What is the best strategy to do this?
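A sketch of the usual approach follows. The pool names, rule names, and EC parameters (k=4, m=2) below are illustrative placeholders, not taken from the thread; adjust them to the actual setup before running anything.

```shell
# For the replicated metadata pool: create a new rule whose failure
# domain is "room", then point the pool at it (triggers rebalancing):
ceph osd crush rule create-replicated replicated_room default room
ceph osd pool set cephfs_metadata crush_rule replicated_room

# For the EC data pool: the failure domain lives in the rule created
# from the erasure-code profile. Create a new profile and rule with the
# SAME k+m as the existing pool, then switch the pool's rule:
ceph osd erasure-code-profile set ec_room_profile \
    k=4 m=2 crush-failure-domain=room
ceph osd crush rule create-erasure ec_room_rule ec_room_profile
ceph osd pool set cephfs_data crush_rule ec_room_rule
```

One caveat: with only two rooms, an EC rule that chooses k+m independent rooms cannot place its chunks. In that case a hand-written rule (edit the decompiled crushmap) that first chooses the two rooms and then hosts within each room is typically needed. Also expect significant data movement when the rule changes.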
I have a query regarding objecter behaviour for homeless sessions. When
all OSDs holding copies of an object (say, with replication 3) are down,
the objecter assigns a homeless session (OSD=-1) to the client request.
Such a request makes a radosgw thread hang indefinitely, since the data
cannot be served while all required OSDs are down. With multiple similar
requests, all the radosgw threads get exhausted and hang indefinitely
waiting for the OSDs to come up. This causes complete service
unavailability, as no rgw threads are left to process valid requests
that could have been directed towards active PGs/OSDs.
I think the objecter or radosgw should terminate the request and return
early in the case of a homeless session. Let me know your thoughts on
this.
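Until such behaviour exists, a possible client-side mitigation is to bound how long a librados op may wait, so RGW threads error out instead of blocking forever. The option name below is the librados op timeout as I understand it (default 0, i.e. no timeout); verify it exists in your release and that the config section matches your RGW instances.

```shell
# Fail client ops after 30 seconds instead of waiting indefinitely
# on a homeless session (section/option names may need adjusting):
ceph config set client.rgw rados_osd_op_timeout 30
```

Note that a nonzero op timeout affects all ops for that client, so transiently slow (but recoverable) requests will also fail.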
I am trying to copy the contents of our storage server into a CephFS,
but am experiencing stability issues with my MDSs. The CephFS sits on
top of an erasure-coded pool with 5 MONs, 5 MDSs and a max_mds setting
of two. My Ceph cluster version is Nautilus, the client is Mimic and
uses the kernel module to mount the FS.
The index of filenames to copy is about 23GB and I am using 16 parallel
rsync processes over a 10G link to copy the files over to Ceph. This
works perfectly for a while, but then the MDSs start reporting oversized
caches (between 20 and 50GB, sometimes more) and an inode count between
1 and 4 million. The inode count in particular seems quite high to me:
each rsync job has 25k files to work with, so even if all 16 processes
opened all their files at the same time, I should not exceed 400k. Even if
I doubled this number to account for the client's page cache, I should get
nowhere near that number of inodes (a sync flush takes about 1 second).
Then after a few hours, my MDSs start failing with messages like this:
-21> 2019-07-22 14:00:05.877 7f67eacec700 1 heartbeat_map
is_healthy 'MDSRank' had timed out after 15
-20> 2019-07-22 14:00:05.877 7f67eacec700 0 mds.beacon.XXX Skipping
beacon heartbeat to monitors (last acked 24.0042s ago); MDS internal
heartbeat is not healthy!
The standby nodes try to take over, but take forever to become active
and will fail as well eventually.
During my research, I found this related topic:
I tried everything suggested there, from increasing and lowering my cache
size to changing the number of segments, etc. I also played around with
the number of active MDSs: two appears to work best, whereas one cannot
keep up with the load and three seems to be the worst of all choices.
Do you have any ideas how I can improve the stability of my MDS daemons
so that they handle the load properly? A single 10G link is a toy, and we
could query the cluster with many more requests per second, but it is
already yielding to 16 rsync processes.
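For this kind of many-small-files workload, the knobs that usually matter are the MDS cache limit and the per-client cap limit. The values below are illustrative only, not recommendations, and `<id>` is a placeholder for an actual MDS name:

```shell
# Give the MDS an explicit memory budget (e.g. 16 GiB) so the cache
# warning threshold matches what the host can actually afford:
ceph config set mds mds_cache_memory_limit 17179869184

# Limit how many caps one client session may hold, so a single rsync
# client cannot pin millions of inodes in the MDS cache:
ceph config set mds mds_max_caps_per_client 500000

# Inspect how many caps each client session is actually holding:
ceph daemon mds.<id> session ls
```

If `session ls` shows one or two clients holding millions of caps, the problem is cap recall rather than raw load, and the cap limit above is the more relevant knob.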
In this <https://ceph.io/community/the-first-telemetry-results-are-in/>
blog post I find this statement:
"So, in our ideal world so far (assuming equal size OSDs), every OSD now
has the same number of PGs assigned."
My issue is that across all pools the number of PGs per OSD is not equal,
and I conclude that this is causing very unbalanced data placement.
As a matter of fact, the data stored on my 1.6TB HDDs in the pool
"hdb_backup" covers a range starting with
osd.228 size: 1.6 usage: 52.61 reweight: 1.00000
and ending with
osd.145 size: 1.6 usage: 81.11 reweight: 1.00000
This heavily impacts the amount of data that can be stored in the cluster.
The Ceph balancer is enabled, but it is not solving this issue.
root@ld3955:~# ceph balancer status
Therefore I would like to ask you for suggestions on how to address this
unbalanced data placement. I have attached pastebins for
- ceph osd df sorted by usage <https://pastebin.com/QLQHjA9g>
- ceph osd df tree <https://pastebin.com/SvhP2hp5>
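Assuming all clients are Luminous or newer, the upmap balancer mode usually equalizes PGs per OSD far better than the default crush-compat mode; alternatively, upmaps can be computed offline per pool. A sketch (the pool name comes from the thread, everything else is standard CLI):

```shell
# Switch the online balancer to upmap mode (requires luminous+ clients):
ceph osd set-require-min-compat-client luminous
ceph balancer mode upmap
ceph balancer on

# Or compute upmap commands offline for the problematic pool and
# review them before applying:
ceph osd getmap -o osd.map
osdmaptool osd.map --upmap upmap.sh --upmap-pool hdb_backup
```

The offline route is useful here because it lets you inspect exactly which PGs would move before committing to the data migration.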
My cluster has multiple crush roots representing the different disk types.
In addition, I have defined multiple pools, one pool for each disk type:
hdd, ssd, nvme.
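As a side note, since Luminous the separate-crush-root-per-media-type layout is no longer necessary: device classes let a single root serve per-media rules, which also tends to play better with the balancer. A minimal sketch (rule names are examples):

```shell
# One rule per device class, all under the same "default" root:
ceph osd crush rule create-replicated rep_hdd default host hdd
ceph osd crush rule create-replicated rep_ssd default host ssd
ceph osd crush rule create-replicated rep_nvme default host nvme
```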
We have a Ceph + CephFS cluster running Nautilus version 14.2.4.
We have Debian buster / Ubuntu bionic clients mounting CephFS in kernel mode without problems.
We now want to mount CephFS from our new CentOS 8 clients. Unfortunately, ceph-common is needed, but there are no packages available for el8 (only el7), and there is no way to install the el7 packages on CentOS 8 (missing dependencies).
Thus, despite the fact that CentOS 8 has a 4.18 kernel (required to use quotas, snapshots, etc.), it seems impossible to mount in kernel mode (good performance) and we still have to use the much slower FUSE mode.
Is it possible to work around this problem? Or when is it planned to provide (even as beta) the Ceph packages for CentOS 8?
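One possible workaround: the kernel client itself ships with the distro kernel; ceph-common only provides the `mount.ceph` helper, which is convenient but not strictly required. A plain mount with the monitor address and secret passed directly should work. The address, user name, and key below are placeholders:

```shell
# Mount CephFS without the mount.ceph helper; the secret= option
# passes the CephX key inline (secretfile= needs the helper):
sudo mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs \
    -o name=cephfs_user,secret=AQD...base64key...
```

The downside is that the key ends up on the command line (visible in `ps` and shell history), so this is best wrapped in a root-only script or fstab entry.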
This is the seventh bugfix release of the Mimic v13.2.x long term stable
release series. We recommend all Mimic users upgrade.
For the full release notes, see
- Cache trimming is now throttled. Dropping the MDS cache via the “ceph
tell mds.<foo> cache drop” command or large reductions in the cache size
will no longer cause service unavailability.
- Behavior with recalling caps has been significantly improved to not
attempt recalling too many caps at once, leading to instability. MDS with
a large cache (64GB+) should be more stable.
- MDS now provides a config option “mds_max_caps_per_client” (default:
1M) to limit the number of caps a client session may hold. Long running
client sessions with a large number of caps have been a source of
instability in the MDS when all of these caps need to be processed during
certain session events. It is recommended to not unnecessarily increase
this limit.
- The “mds_recall_state_timeout” config parameter has been removed. Late
client recall warnings are now generated based on the number of caps the
MDS has recalled which have not been released. The new config parameters
“mds_recall_warning_threshold” (default: 32K) and
“mds_recall_warning_decay_rate” (default: 60s) set the threshold for this
warning.
- The “cache drop” admin socket command has been removed. The “ceph tell
mds.X cache drop” remains.
- A health warning is now generated if the average osd heartbeat ping
time exceeds a configurable threshold for any of the intervals computed.
The OSD computes 1 minute, 5 minute and 15 minute intervals with average,
minimum and maximum values. New configuration option
“mon_warn_on_slow_ping_ratio” specifies a percentage of
“osd_heartbeat_grace” to determine the threshold. A value of zero disables
the warning. A new configuration option “mon_warn_on_slow_ping_time”,
specified in milliseconds, overrides the computed value, causing a warning
when OSD heartbeat pings take longer than the specified amount. A new
admin command “ceph daemon mgr.# dump_osd_network [threshold]” lists all
connections with a ping time longer than the specified threshold or value
determined by the config options, for the average for any of the 3
intervals. A new admin command “ceph daemon osd.# dump_osd_network
[threshold]” does the same but only includes heartbeats initiated by the
specified OSD.
- The default value of the
“osd_deep_scrub_large_omap_object_key_threshold” parameter has been
lowered to detect an object with a large number of omap keys more easily.
- radosgw-admin introduces two subcommands for managing expire-stale
objects that might be left behind after a bucket reshard in earlier
versions of RGW. One subcommand lists such objects and the other
deletes them. Read the troubleshooting section of the dynamic resharding
docs for details.
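The new network-ping dump described in the notes above can be exercised like this (the daemon ids are placeholders for real mgr/OSD names):

```shell
# List all connections whose average heartbeat ping exceeds 100 ms,
# across the whole cluster, as seen by the mgr:
ceph daemon mgr.a dump_osd_network 100

# Same report, but restricted to heartbeats initiated by one OSD:
ceph daemon osd.0 dump_osd_network 100
```

With no threshold argument, the value derived from the config options (“mon_warn_on_slow_ping_ratio” / “mon_warn_on_slow_ping_time”) is used.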
To better understand how our current users utilize Ceph, we conducted a
public community survey. This information guides the community in deciding
where to spend contribution effort for future development. The survey
results will remain anonymous and will appear only in aggregated form in
future Ceph Foundation publications to the community.
I'm pleased to announce, after much discussion on the Ceph dev mailing
list, that the community has formed the Ceph Survey for 2019.
Because the survey went out later than we would have liked, the deadline
will be January 31st, 2020 at 11:59 PT.
We have discussed using the Ceph telemetry module to collect this data in
the future, to save time for our users. Please let me know of any mistakes
that need to be corrected on the survey. Thanks!
Ceph Community Manager
494C 5D25 2968 D361 65FB 3829 94BC D781 ADA8 8AEA
@Thingee <https://twitter.com/thingee> Thingee
I might have found the reason why several of our clusters (and maybe
Bryan's too) are getting stuck not trimming osdmaps.
It seems that when an osd fails, the min_last_epoch_clean gets stuck
forever (even long after HEALTH_OK), until the ceph-mons are
I've updated the ticket: https://tracker.ceph.com/issues/41154
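A quick way to check whether a cluster is affected, i.e. whether osdmap trimming is stuck, is to compare the oldest and newest committed osdmap epochs the mons are holding; a very large spread long after HEALTH_OK suggests trimming has stalled:

```shell
# Both fields appear in the output of "ceph report"; a healthy cluster
# keeps first_committed within a few hundred epochs of last_committed:
ceph report 2>/dev/null | jq '.osdmap_first_committed, .osdmap_last_committed'
```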