I have a query regarding objecter behaviour for homeless sessions. When
all OSDs containing copies of an object (say, with replication 3) are
down, the objecter assigns a homeless session (OSD=-1) to the client
request. This request makes the radosgw thread hang indefinitely, as the
data can never be served while all required OSDs are down. With multiple
similar requests, all the radosgw threads get exhausted and hang
indefinitely waiting for the OSDs to come up. This creates complete
service unavailability, as no rgw threads are left to process valid
requests that could have been directed to active PGs/OSDs.
I think we should have behaviour in the objecter or radosgw to terminate
the request and return early in case of a homeless session. Let me know
your thoughts on this.
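To make the proposal concrete, here is a minimal sketch of the "return early" behaviour. It is not the Objecter code: the function names, the fail_fast flag and the choice of ENXIO are all assumptions made for illustration, and real primary selection goes through CRUSH rather than a simple scan.

```python
import errno

HOMELESS_OSD = -1  # marker used above for "no OSD can serve this op"


def pick_target(acting_osds, up_osds):
    """Return the first acting OSD that is up, or HOMELESS_OSD if none is.

    Simplification: a real client maps the object through CRUSH; this
    scan only models the "all copies are down" case.
    """
    for osd in acting_osds:
        if osd in up_osds:
            return osd
    return HOMELESS_OSD


def submit(acting_osds, up_osds, fail_fast=True):
    """Hypothetical submit path.

    With fail_fast, a homeless op returns a negative errno right away
    instead of parking the calling thread until an OSD comes back.
    """
    target = pick_target(acting_osds, up_osds)
    if target == HOMELESS_OSD and fail_fast:
        # Assumption: any error the gateway can map to a 5xx would do.
        return -errno.ENXIO
    return target
```

With something like this, a request for an object whose OSDs are all down would fail quickly and free the rgw thread, while requests to healthy PGs proceed as before.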
Hi, @Yehuda Sadeh-Weinraub <yehuda(a)redhat.com> @Casey Bodley
<cbodley(a)redhat.com> @Matt Benjamin <mbenjamin(a)redhat.com> and Cephers
We hit a problem with the Elasticsearch sync module: a bucket with custom
placement cannot be synced to the target zone. The target zone tries to
create a bucket instance based on the placement name, but the target zone
does not have the placement. The logs are as follows:
meta sync: ERROR: can't store key: bucket.instance: *
data sync: ERROR: failed to fetch bucket instance info for *
ERROR: select_bucket_placement() returned -22
We are going to fix this issue. Here are some questions:
1. Should we sync those buckets to the ES zone? It does not seem necessary.
2. Should we sync placements? If we sync placements to the target zone, we
also need to create rados pools in the target zone.
I might have found the reason why several of our clusters (and maybe
Bryan's too) are getting stuck not trimming osdmaps.
It seems that when an osd fails, the min_last_epoch_clean gets stuck
forever (even long after HEALTH_OK), until the ceph-mons are
I've updated the ticket: https://tracker.ceph.com/issues/41154
(This is an early update, some tests are still running, as we are
trying to release this point next week before the US holidays, and
have more time to review results)
Details of this release summarized here:
rados - approved by Neha
rgw - approved by Casey
rbd - need approval Jason
krbd - need approval Jason, Ilya
fs - need approval Patrick, Ramana
kcephfs - need approval Patrick, Ramana
multimds - need approval Patrick, Ramana
ceph-deploy - FAILED Sage, Alfredo ?
ceph-disk - N/A
upgrade/client-upgrade-hammer (nautilus) - N/A
upgrade/client-upgrade-jewel (nautilus) - PASSED
upgrade/client-upgrade-mimic (nautilus) - FAILED
upgrade/luminous-p2p - in progress
powercycle - in progress
ceph-ansible - Brad is fixing
upgrade/luminous-x (nautilus) - in progress
upgrade/mimic-x (nautilus) - in progress
ceph-volume - Jan fixing
(please speak up if something is missing)
Currently we have these open statuses:
Need More Info
It seems to me that many of these are mostly unused, making their presence
confusing to newcomers. I propose we prune these down to:
New: default for new trackers; ideally this list should be short and
regularly looked at.
Triaged: it's been looked at by PTL/team member and could be assigned out.
Need More Info: can't be worked on without more information
In Progress: assignee is working on the ticket.
Need Review: upstream PR ready for review
Pending Backport: upstream PR merged; backports are pending.
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
On Thu, 21 Nov 2019, Muhammad Ahmad wrote:
> While trying to research how crush maps are used/modified I stumbled
> upon these device classes.
> I wanted to highlight that having nvme as a separate class will
> eventually break and should be removed.
> There is already a push within the industry to consolidate future
> command sets and NVMe will likely be it. In other words, NVMe HDDs are
> not too far off. In fact, the recent October OCP F2F discussed this
> topic in detail.
> If the classification is based on performance then command set
> (SATA/SAS/NVMe) is probably not the right classification.
I opened a PR that does this:
I can't remember seeing 'nvme' as a device class on any real cluster; the
exception is my basement one, and I think the only reason it ended up that
way was because I deployed bluestore *very* early on (with ceph-disk) and
the is_nvme() detection helper doesn't work with LVM. That's my theory at
least.. can anybody with bluestore on NVMe devices confirm? Does anybody
see class 'nvme' devices in their cluster?
Adding dev list. We haven't talked through much of this in any detail in
the orchestrator calls yet aside from a vague discussion about what
should/shouldn't be in scope.
On Thu, 28 Nov 2019, Paul Cuzner wrote:
> On Thu, Nov 28, 2019 at 2:37 AM Sage Weil <sweil(a)redhat.com> wrote:
> > On Wed, 27 Nov 2019, Paul Cuzner wrote:
> > > Hi,
> > >
> > > I've got a working gist for the add/remove of the monitoring solution.
> > > https://gist.github.com/pcuzner/ac542ce3fa9a4699bb9310b1fd5095d0
> > >
> > > I'm out for the next couple of days, but will get a PR raised next week
> > to
> > > get this started properly.
> > For some reason it won't let me comment on that gist.
> > - I don't think we should install anything on the host outside of the unit
> > file and /var/lib/ceph/$fsid/$thing. I suggest $thing be 'prometheus',
> > 'alertmanager', 'node-exporter', 'grafana'. We could combine all but
> > node-exporter into a single 'monitoring' thing but i'm worried this
> > obscures things too much when, for example, the user might have an
> > external prometheus but still need alertmanager, and so on.
> > So all the configs should live in
> > /var/lib/ceph/$fsid/$thing/prometheus.yml and so on, and then bound to the
> > right /etc/whatever location by the container config.
> I struggle with this one. Channelling my inner sysadmin: "I expect config
> settings to be in /etc and data to be in /var/lib - that's what FHS says
> and that's how other systems look that I have to manage, so why does Ceph
> have to do things differently?"
1. Because it's a containerized service. Things are in /etc inside the
container, not outside. Sprinkling these configs in /etc mixes
containerized service configs with the *host*'s configs, which seems very
untidy to me.
2. Putting it all in /var/lib/ceph/whatever means it's find and
> I'm also not sure of the value of fsid in the dir names. I can see the
> value if a host has to support multiple ceph clusters - but outside dev is
> that something that the community or our customers actually want?
Most deployments won't need it, but it will avoid a whole range of
problems when they do. Especially when it becomes trivial to bootstrap
clusters, you also make it trivial to make multiple clusters overlap on
the same host.
And, like above, it keeps things tidy.
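As a rough illustration of the layout argued for above, the sketch below builds the per-cluster directory for each monitoring component. The helper name data_dir and the validation are hypothetical; only the /var/lib/ceph/$fsid/$thing scheme and the four component names come from this thread.

```python
from pathlib import PurePosixPath

# The four "$thing" names suggested earlier in the thread.
MONITORING_THINGS = ("prometheus", "alertmanager", "node-exporter", "grafana")


def data_dir(fsid, thing, base="/var/lib/ceph"):
    """Return /var/lib/ceph/$fsid/$thing for a monitoring component.

    Hypothetical helper: it only shows that keying on the fsid keeps
    two clusters on one host from sharing a directory.
    """
    if thing not in MONITORING_THINGS:
        raise ValueError(f"unknown monitoring component: {thing}")
    return PurePosixPath(base) / fsid / thing
```

Two clusters on the same host then get disjoint trees, e.g. data_dir("aaaa", "prometheus") and data_dir("bbbb", "prometheus"), and nothing lands in the host's /etc.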
> The gist downloads the separate containers we need in parallel - which I
> think is a good thing! reduces time
Sure... that's something we could do regardless of whether it's a separate
script or part of ceph-daemon. Probably what we actually want is for the
ssh 'host add' command to kick off some prestaging of containers in the
background so that the first daemon deployment doesn't wait for a
container download at all.
> IMO, having monitoring-add deploy grafana/prom and alert manager together
> by default is the way to go. TBH, when I started this, I was putting them
> all in the same pod under podman for management and treat them as a single
> unit - but having to support 'legacy' docker put an end to that :)
> If a user wishes to use a separate prometheus, that will normally have its
> own alertmanager too. Which alertmanager a prometheus server uses is defined
> in the prometheus.yml. With external prometheus, rules, alerts and receiver
> definitions are going to be an exercise for the reader. We'll need to
> document the settings, but the admin will need to apply them - in this
> scenario, we could possibly generate sample files that the admin can pick
> up and apply? To my mind deployment of monitoring has two pathways;
> default - "monitoring add" yields prom/grafana/alertmanager containers
> deployed to machine
> external-prom - "monitoring add" just deploys grafana, and points its
> default data source at the external prom url. We're also making an
> assumption here that the prometheus server is open and doesn't require auth
> (OCP's prometheus for example has auth enabled)
I think it makes sense to focus on the out-of-the-box opinionated easy
scenario vs the DIY case, in general at least. But I have a few
- In the DIY case, does it make sense to leave the node-exporter to the
reader too? Or might it make sense for us to help deploy the
node-exporter, but they run the external/existing prometheus instance?
- Likewise, the alertmanager is going to have a bunch of ceph-specific
alerts configured, right? Might they want their own prom but we deploy
our alerts? (Is there any dependency in the dashboard on a particular set
of alerts in prometheus?)
I'm guessing you think no in both these cases...
> > - Let's teach ceph-daemon how to do this, so that you do 'ceph-daemon
> > deploy --fsid ... --name prometheus.foo -i input.json'. ceph-daemon
> > has the framework for opening firewall ports etc now... just add ports
> > based on the daemon type.
> TBH, I'd keep the monitoring containers away from the ceph daemons. They
> require different parameters, config files etc so why not keep them
> separate and keep the ceph logic clean. This also allows us to change
> monitoring without concerns over logic changes to normal ceph daemon
Okay, but mgr/ssh is still going to be wired up to deploy these. And to do
so on a per-cluster, containerized basis... which means all of the infra
in ceph-daemon will still be useful. It seems easiest to just add it
Your points above seem to point toward simplifying the containers we
deploy to just two containers, one that's one-per-cluster for
prom+alertmanager+grafana, and one that's per-host for the node-exporter.
But I think making it fit in nicely with the other ceph containers (e.g.,
/var/lib/ceph/$fsid/$thing) makes sense. Esp since we can just deploy
these during bootstrap by default (unless some --external-prometheus is
passed) and this all happens without the admin having to think about it.
> > WDYT?
> I'm sure a lot of the above has already been discussed at length with the
> SuSE folks, so apologies for going over ground that you've already covered.
Not yet! :)
On Wed, Apr 18, 2018 at 6:27 AM Nathan Cutler <ncutler(a)suse.cz> wrote:
> > That would be at odds to what Nathan is suggesting though, which is a
> > hard change to Python 3.
> Hm, not sure what hard/soft means in this context. For any given script,
> either it runs with Python 3, or it doesn't. And this is determined by
> the shebang. (Unless the shebang is omitted, of course.)
> I was very surprised to find out that, in SLES and openSUSE, the symlink
> /usr/bin/python -> /usr/bin/python2 will not be changed even when the
> migration of the underlying distro to Python 3 is complete.
> But then my colleagues explained why that is, and I "saw the light".
> Since every single script in the distro has to be audited for Python 3
> compatibility, anyway, it makes sense to have the shebang be an explicit
> declaration of said compatibility.
> By retaining the symlink as it is, all scripts start out the migration
> process with an explicit declaration that they are compatible with
> Python 2. Compatibility with Python 3 is signalled not by saying "it's
> OK with Python 3, we tried it". It's signalled by changing the shebang.
> And this isn't unique to SUSE. Fedora is treating the shebang in the
> same way, apparently. 
Seems that if you only have python3 installed in Fedora31 this is
*not* the case.
# python --version
# /usr/bin/python --version
# ls -l /usr/bin/python
lrwxrwxrwx. 1 root root 9 Nov 18 00:57 /usr/bin/python -> ./python3
"there is no /usr/bin/python"
So the two distros are quite divergent in their approach apparently?
> It may be true that a given script is fine with Python 3, but as long as
> the shebang says "python" (i.e. python2), there's no way to really find
> out, is there? (Barring things like Josh's suggestion of changing the
> shebang on the fly via a teuthology task/workunit, which is fine if we
> decide we need a transition period, which it looks like we will.)
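To illustrate the "shebang as an explicit declaration" idea discussed above, here is a small sketch that reads the interpreter a script declares on its first line. The function names and the regular expression are my own, not part of any distro tooling, and treating a bare "python" as "not yet declared Python 3" follows the SUSE convention described above rather than Fedora's.

```python
import re

# Matches "#!/usr/bin/python", "#!/usr/bin/env python3", "#!/usr/bin/python2.7", ...
SHEBANG_RE = re.compile(
    r"^#!\s*(?:/usr/bin/env\s+)?(?:\S*/)?(python[0-9.]*)\b"
)


def declared_interpreter(first_line):
    """Return the python interpreter named in a shebang line, or None."""
    match = SHEBANG_RE.match(first_line)
    return match.group(1) if match else None


def declares_python3(first_line):
    """True only if the shebang explicitly names a python3 interpreter."""
    interpreter = declared_interpreter(first_line)
    return interpreter is not None and interpreter.startswith("python3")
```

Run over the first line of each script in a tree, this gives a rough list of what has and has not been explicitly marked as Python 3 compatible.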
We have a Ceph cluster (version 12.2.4) with 10 hosts, and there are 21
OSDs on each host.
An EC pool is created with the following commands:
ceph osd erasure-code-profile set profile_jerasure_4_3_reed_sol_van \
ceph osd pool create pool_jerasure_4_3_reed_sol_van 2048 2048 erasure
Here are my questions:
1. The EC pool is created using k=4, m=3, and crush-device-class=hdd, so
we just disable the network interfaces of some hosts (using "ifdown"
command) to verify the functionality of the EC pool while performing ‘rados
However, the IO rate drops immediately to 0 when a single host goes
offline, and it takes a long time (~100 seconds) for the IO rate becoming
As far as I know, the default value of min_size is k+1, i.e. 5, which means
that the EC pool can still serve IO even if two hosts are offline.
Is there something wrong with my understanding?
2. According to our observations, it seems that the IO rate becomes
normal when Ceph detects all OSDs corresponding to the failed host.
Is there any way to reduce the time needed for Ceph to detect all failed
Thanks for any help.
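For reference, the arithmetic behind question 1 can be written out as below. This is only a worked example: the function name is made up, min_size = k+1 is taken from the message above, and the sketch assumes each host holds at most one shard of any PG (a host failure domain), which the truncated profile command does not confirm.

```python
def ec_io_tolerance(k, m, min_size=None):
    """Return how many shard losses an EC pool tolerates.

    Assumes min_size defaults to k + 1, as stated in the message, and
    that one offline host costs each PG at most one shard.
    """
    size = k + m
    if min_size is None:
        min_size = k + 1
    return {
        "size": size,
        "min_size": min_size,
        # PGs stay active while at least min_size shards are available.
        "failures_with_io": size - min_size,
        # Data stays recoverable while at least k shards survive.
        "failures_without_data_loss": m,
    }
```

For k=4, m=3 this gives size 7, min_size 5, and IO continuing with up to two hosts offline, so the arithmetic matches the expectation in question 1. It does not explain the ~100 second stall, which depends on how quickly the failed OSDs are detected and marked down.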
Recently I encountered a situation that requires reliable file storage with
CephFS, and the point is that the data must not be modified or
After some reading I found that the WORM (write once, read many) feature is
exactly what I need. Unfortunately, as far as I know, there is no WORM
feature in CephFS.
So I was wondering: is there any plan or design for this feature?