November 2019 - Dev - lists.ceph.io

by Patrick Donnelly

For developers submitting jobs using teuthology, we now have recommendations on what priority level to use:

https://docs.ceph.com/docs/master/dev/developer_guide/#testing-priority

--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

1 year, 1 month

Radosgw/Objecter behaviour for homeless session

by Biswajeet Patra

Hi All,

I have a query regarding objecter behaviour for a homeless session.

When all OSDs containing copies of an object (say, with replication 3) are down, the objecter assigns a homeless session (OSD=-1) to the client request. The request makes the radosgw thread hang indefinitely, because the data can never be served while all the required OSDs are down. With multiple similar requests, all the radosgw threads are exhausted and hang indefinitely, waiting for the OSDs to come up. This makes the service completely unavailable, as no rgw threads are left to process valid requests that could have been directed to active PGs/OSDs.

I think we should have behaviour in the objecter or radosgw to terminate the request and return early in case of a homeless session. Let me know your thoughts on this.

Regards,
Biswajeet
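
The "return early" behaviour proposed in this message can be illustrated with a minimal sketch. This is not Ceph code: `HomelessOp` and `handle_request` are hypothetical names, and the HTTP status values are only an assumption about what a gateway might return. The idea shown is simply that a worker waits on the operation with a bound and reports an error instead of blocking forever. Ceph's `rados_osd_op_timeout` client option appears to target this kind of situation, though whether it covers the homeless-session path is not established here.

```python
import errno
import threading


class HomelessOp:
    """Hypothetical stand-in for an op parked on a homeless session (OSD=-1)."""

    def __init__(self):
        self._done = threading.Event()
        self.result = None

    def complete(self, result):
        # Called when some OSD eventually serves the op.
        self.result = result
        self._done.set()

    def wait(self, timeout=None):
        # Return the op result, or -ETIMEDOUT if nothing served it in time.
        if self._done.wait(timeout):
            return self.result
        return -errno.ETIMEDOUT


def handle_request(op, timeout):
    """Hypothetical gateway handler: free the worker thread after `timeout`."""
    rc = op.wait(timeout)
    if rc == -errno.ETIMEDOUT:
        return 503  # assumed mapping: service unavailable
    return 200


# An op that is never completed models "all OSDs for this object are down".
stuck_status = handle_request(HomelessOp(), timeout=0.05)

# An op that has been served returns normally.
served = HomelessOp()
served.complete(0)
served_status = handle_request(served, timeout=0.05)
```

With a bound like this, a burst of requests for unavailable objects ties up each worker for at most the timeout, so requests for healthy PGs can still be scheduled.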

4 years, 2 months

Multi-site sync failed when buckets uses not default placement in source zone

by liuchang0812

Hi, @Yehuda Sadeh-Weinraub <yehuda(a)redhat.com> @Casey Bodley <cbodley(a)redhat.com> @Matt Benjamin <mbenjamin(a)redhat.com> and Cephers,

We hit a problem with the Elasticsearch sync module: a bucket with a custom placement could not be synced to the target zone. The target zone tries to create a bucket instance based on the placement name, but the target zone does not have that placement. The logs are as follows:

  meta sync: ERROR: can't store key: bucket.instance: *
  data sync: ERROR: failed to fetch bucket instance info for *
  ERROR: select_bucket_placement() returned -22

We are going to fix this issue. Here are some questions:

1. Should we sync those buckets to the ES zone? It does not seem necessary.
2. Should we sync placements? If we sync placements to the target zone, we also need to create rados pools in the target zone.

Thanks,
Chang Liu
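
The failure described in this message can be modelled in a few lines. The sketch below is not the RGW implementation: `model_select_bucket_placement`, the zone dictionaries, and the pool names are all hypothetical. It only illustrates why a placement name that exists in the source zone but not in the target zone would surface as -22, which is -EINVAL.

```python
import errno


def model_select_bucket_placement(zone_placements, requested_rule):
    """Hypothetical model: look up a placement rule by name in a zone.

    Returns (0, pools) if the zone defines the rule, else (-EINVAL, None).
    """
    pools = zone_placements.get(requested_rule)
    if pools is None:
        return -errno.EINVAL, None
    return 0, pools


# Assumed example data: the source zone has a custom placement, the target
# (sync module) zone only has the default one.
source_zone = {
    "default-placement": {"data_pool": "src.buckets.data"},
    "custom-placement": {"data_pool": "src.custom.data"},
}
target_zone = {
    "default-placement": {"data_pool": "dst.buckets.data"},
}

rc_source, _ = model_select_bucket_placement(source_zone, "custom-placement")
rc_target, _ = model_select_bucket_placement(target_zone, "custom-placement")
```

In this model the two options raised in the questions correspond to either skipping such buckets on the target, or adding the missing key (and its pools) to the target zone.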

4 years, 4 months

osdmaps not trimmed until ceph-mon's restarted (if cluster has a down osd)

by Dan van der Ster

Hi Joao,

I might have found the reason why several of our clusters (and maybe Bryan's too) are getting stuck not trimming osdmaps. It seems that when an osd fails, the min_last_epoch_clean gets stuck forever (even long after HEALTH_OK), until the ceph-mons are restarted.

I've updated the ticket: https://tracker.ceph.com/issues/41154

Cheers,
Dan
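
One way to picture the symptom reported above is the following sketch. It rests on an assumption that is not stated in the message: that the monitor's trim floor is the minimum over a PG-derived lower bound and a per-OSD table of last reported epochs. The names `trim_floor` and `prune_down_osds` are hypothetical. Under that assumption, a stale entry left behind by a failed OSD would pin the floor until the table is rebuilt.

```python
def trim_floor(osd_epochs, pg_lower_bound):
    """Hypothetical model: osdmaps older than the returned epoch may be trimmed."""
    floor = pg_lower_bound
    for epoch in osd_epochs.values():
        floor = min(floor, epoch)
    return floor


def prune_down_osds(osd_epochs, up_osds):
    """Drop table entries for OSDs that are no longer up (assumed fix)."""
    return {osd: e for osd, e in osd_epochs.items() if osd in up_osds}


# Assumed example: osd.2 failed at epoch 120 and never reported again,
# while the rest of the cluster is clean at epoch 500.
epochs = {0: 500, 1: 500, 2: 120}

stuck = trim_floor(epochs, pg_lower_bound=500)
unstuck = trim_floor(prune_down_osds(epochs, up_osds={0, 1}), pg_lower_bound=500)
```

If the table lives only in monitor memory, a restart would rebuild it without the stale entry, which would be consistent with the observation that restarting the ceph-mons unblocks trimming.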

4 years, 4 months

14.2.5 QE Nautilus validation status

by Yuri Weinstein

(This is an early update. Some tests are still running, as we are trying to release this point release next week, before the US holidays, and have more time to review results.)

Details of this release are summarized here:

https://tracker.ceph.com/issues/42839#note-3

rados - approved by Neha
rgw - approved by Casey
rbd - need approval Jason
krbd - need approval Jason, Ilya
fs - need approval Patrick, Ramana
kcephfs - need approval Patrick, Ramana
multimds - need approval Patrick, Ramana
ceph-deploy - FAILED Sage, Alfredo ?
ceph-disk - N/A
upgrade/client-upgrade-hammer (nautilus) - N/A
upgrade/client-upgrade-jewel (nautilus) - PASSED
upgrade/client-upgrade-mimic (nautilus) - FAILED
upgrade/luminous-p2p - in progress
powercycle - in progress
ceph-ansible - Brad is fixing
upgrade/luminous-x (nautilus) - in progress
upgrade/mimic-x (nautilus) - in progress
ceph-volume - Jan fixing

(please speak up if something is missing)

Thx
YuriW

4 years, 4 months

Simplifying Ceph Project Redmine Open Statuses

by Patrick Donnelly

Currently we have these open statuses:

New
Triaged
Verified
Need More Info
In Progress
Feedback
Need Review
Need Test
Testing
Pending Backport
Pending Upstream

It seems to me many of these are mostly unused, making their presence confusing to newcomers. I propose we prune these down to:

New: default for new trackers; ideally this list should be short and regularly looked at.
Triaged: it's been looked at by PTL/team member and could be assigned out.
Need More Info: can't be worked on without more information.
In Progress: assignee is working on the ticket.
Need Review: upstream PR ready for review.
Pending Backport: upstream PR merged; backports are pending.

--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

4 years, 4 months

Re: device class : nvme

by Sage Weil

Adding dev(a)ceph.io

On Thu, 21 Nov 2019, Muhammad Ahmad wrote:
> While trying to research how crush maps are used/modified I stumbled
> upon these device classes.
> https://ceph.io/community/new-luminous-crush-device-classes/
>
> I wanted to highlight that having nvme as a separate class will
> eventually break and should be removed.
>
> There is already a push within the industry to consolidate future
> command sets and NVMe will likely be it. In other words, NVMe HDDs are
> not too far off. In fact, the recent October OCP F2F discussed this
> topic in detail.
>
> If the classification is based on performance then command set
> (SATA/SAS/NVMe) is probably not the right classification.

I opened a PR that does this:

https://github.com/ceph/ceph/pull/31796

I can't remember seeing 'nvme' as a device class on any real cluster; the exception is my basement one, and I think the only reason it ended up that way was because I deployed bluestore *very* early on (with ceph-disk) and the is_nvme() detection helper doesn't work with LVM. That's my theory at least... can anybody with bluestore on NVMe devices confirm? Does anybody see class 'nvme' devices in their cluster?

Thanks!
sage
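
The argument in the quoted message, that the command set is the wrong axis for classification, can be shown with a small sketch. This is not the code from the PR above: `Device` and `device_class` are hypothetical, and the sketch assumes the class is derived only from whether the media is rotational (on Linux this kind of flag is exposed under /sys/block/<dev>/queue/rotational).

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    rotational: bool  # True for spinning media
    transport: str    # "sata", "sas" or "nvme" (the command set / interface)


def device_class(dev):
    """Classify by media behaviour only; the transport is deliberately ignored."""
    return "hdd" if dev.rotational else "ssd"


# Assumed example devices, including the hypothetical "NVMe HDD" from the thread.
devices = [
    Device("sda", rotational=True, transport="sata"),
    Device("sdb", rotational=False, transport="sas"),
    Device("nvme0n1", rotational=False, transport="nvme"),
    Device("nvme1n1", rotational=True, transport="nvme"),
]

classes = {d.name: device_class(d) for d in devices}
```

A classifier keyed on `transport` would put both NVMe devices in one class even though one of them behaves like a spinning disk, which is the breakage the original poster anticipates.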

4 years, 4 months

Re: monitoring

by Sage Weil

Adding dev list. We haven't talked through much of this in any detail in the orchestrator calls yet aside from a vague discussion about what should/shouldn't be in scope.

On Thu, 28 Nov 2019, Paul Cuzner wrote:
> On Thu, Nov 28, 2019 at 2:37 AM Sage Weil <sweil(a)redhat.com> wrote:
>
> > On Wed, 27 Nov 2019, Paul Cuzner wrote:
> > > Hi,
> > >
> > > I've got a working gist for the add/remove of the monitoring solution.
> > > https://gist.github.com/pcuzner/ac542ce3fa9a4699bb9310b1fd5095d0
> > >
> > > I'm out for the next couple of days, but will get a PR raised next week to
> > > get this started properly.
> >
> > For some reason it won't let me comment on that gist.
> >
> > - I don't think we should install anything on the host outside of the unit
> > file and /var/lib/ceph/$fsid/$thing. I suggest $thing be 'prometheus',
> > 'alertmanager', 'node-exporter', 'grafana'. We could combine all but
> > node-exporter into a single 'monitoring' thing but i'm worried this
> > obscures things too much when, for example, the user might have an
> > external prometheus but still need alertmanager, and so on.
> >
> > So all the configs should live in
> > /var/lib/ceph/$fsid/$thing/prometheus.yml and so on, and then bound to the
> > right /etc/whatever location by the container config.
> >
>
> I struggle with this one. Channelling my inner sysadmin: "I expect config
> settings to be in /etc and data to be in /var/lib - that's what FHS says
> and that's how other systems look that I have to manage, so why does Ceph
> have to do things differently?"

1. Because it's a containerized service. Things are in /etc inside the container, not outside. Sprinkling these configs in /etc mixes containerized service configs with the *host*'s configs, which seems very untidy to me.

2. Putting it all in /var/lib/ceph/whatever means it's easy to find and clean up.

> I'm also not sure of the value of fsid in the dir names. I can see the
> value if a host has to support multiple ceph clusters - but outside dev is
> that something that the community or our customers actually want?

Most deployments won't need it, but it will avoid a whole range of problems when they do. Especially when it becomes trivial to bootstrap clusters, you also make it trivial to make multiple clusters overlap on the same host. And, like above, it keeps things tidy.

> The gist downloads the separate containers we need in parallel - which I
> think is a good thing! reduces time

Sure... that's something we could do regardless of whether it's a separate script or part of ceph-daemon. Probably what we actually want is for the ssh 'host add' command to kick off some prestaging of containers in the background so that the first daemon deployment doesn't wait for a container download at all.

> IMO, having monitoring-add deploy grafana/prom and alert manager together
> by default is the way to go. TBH, when I started this, I was putting them
> all in the same pod under podman for management and treat them as a single
> unit - but having to support 'legacy' docker put an end to that :)
>
> If a user wishes to use a separate prometheus, that will normally have its
> own alertmanager too. Which alertmanager a prometheus server uses is defined
> in the prometheus.yml. With external prometheus, rules, alerts and receiver
> definitions are going to be an exercise for the reader. We'll need to
> document the settings, but the admin will need to apply them - in this
> scenario, we could possibly generate sample files that the admin can pick
> up and apply? To my mind deployment of monitoring has two pathways;
> default - "monitoring add" yields prom/grafana/alertmanager containers
> deployed to machine
> external-prom - "monitoring add" just deploys grafana, and points its
> default data source at the external prom url. We're also making an
> assumption here that the prometheus server is open and doesn't require auth
> (OCP's prometheus for example has auth enabled)

I think it makes sense to focus on the out-of-the-box opinionated easy scenario vs the DIY case, in general at least. But I have a few questions...

- In the DIY case, does it make sense to leave the node-exporter to the reader too? Or might it make sense for us to help deploy the node-exporter, but they run the external/existing prometheus instance?

- Likewise, the alertmanager is going to have a bunch of ceph-specific alerts configured, right? Might they want their own prom but we deploy our alerts? (Is there any dependency in the dashboard on a particular set of alerts in prometheus?)

I'm guessing you think no in both these cases...

> > - Let's teach ceph-daemon how to do this, so that you do 'ceph-daemon
> > deploy --fsid ... --name prometheus.foo -i input.json'. ceph-daemon
> > has the framework for opening firewall ports etc now... just add ports
> > based on the daemon type.
> >
>
> TBH, I'd keep the monitoring containers away from the ceph daemons. They
> require different parameters, config files etc so why not keep them
> separate and keep the ceph logic clean. This also allows us to change
> monitoring without concerns over logic changes to normal ceph daemon
> management.

Okay, but mgr/ssh is still going to be wired up to deploy these. And to do so on a per-cluster, containerized basis... which means all of the infra in ceph-daemon will still be useful. It seems easiest to just add it there.

Your points above seem to point toward simplifying the containers we deploy to just two containers, one that's one-per-cluster for prom+alertmanager+grafana, and one that's per-host for the node-exporter. But I think making it fit in nicely with the other ceph containers (e.g., /var/lib/ceph/$fsid/$thing) makes sense. Esp since we can just deploy these during bootstrap by default (unless some --external-prometheus is passed) and this all happens without the admin having to think about it.

> > WDYT?
> >
>
> I'm sure a lot of the above has already been discussed at length with the
> SuSE folks, so apologies for going over ground that you've already covered.

Not yet! :)

sage
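
The /var/lib/ceph/$fsid/$thing layout discussed in this message can be sketched as follows. This is not ceph-daemon code: `data_dir` and `bind_mount` are hypothetical helpers, the fsid is a placeholder, and the in-container path /etc/prometheus/prometheus.yml is an assumption about where the container expects its config. The sketch only shows how the host-side path is namespaced per cluster and per service, then bound to a location inside the container.

```python
from pathlib import PurePosixPath

# Service names suggested in the thread for $thing.
THINGS = ("prometheus", "alertmanager", "node-exporter", "grafana")


def data_dir(fsid, thing):
    """Host directory for one monitoring service of one cluster."""
    if thing not in THINGS:
        raise ValueError(f"unknown monitoring service: {thing}")
    return PurePosixPath("/var/lib/ceph") / fsid / thing


def bind_mount(fsid, thing, filename, container_path):
    """Build a 'host:container' bind-mount spec for a config file."""
    host_path = data_dir(fsid, thing) / filename
    return f"{host_path}:{container_path}"


fsid = "00000000-0000-0000-0000-000000000000"  # placeholder cluster id

prom_dir = data_dir(fsid, "prometheus")
prom_mount = bind_mount(
    fsid, "prometheus", "prometheus.yml",
    "/etc/prometheus/prometheus.yml",  # assumed in-container location
)
```

Because the fsid is part of the path, two clusters on one host get disjoint directories, and removing a cluster's monitoring state is a matter of removing one subtree.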

4 years, 4 months

Python 2 exodus is happening now [Was: Re: [sepia] Transition to Python 3]

by Brad Hubbard

On Wed, Apr 18, 2018 at 6:27 AM Nathan Cutler <ncutler(a)suse.cz> wrote:
>
> > That would be at odds to what Nathan is suggesting though, which is a
> > hard change to Python 3.
>
> Hm, not sure what hard/soft means in this context. For any given script,
> either it runs with Python 3, or it doesn't. And this is determined by
> the shebang. (Unless the shebang is omitted, of course.)
>
> I was very surprised to find out that, in SLES and openSUSE, the symlink
> /usr/bin/python -> /usr/bin/python2 will not be changed even when the
> migration of the underlying distro to Python 3 is complete.
>
> But then my colleagues explained why that is, and I "saw the light".
> Since every single script in the distro has to be audited for Python 3
> compatibility, anyway, it makes sense to have the shebang be an explicit
> declaration of said compatibility.
>
> By retaining the symlink as it is, all scripts start out the migration
> process with an explicit declaration that they are compatible with
> Python 2. Compatibility with Python 3 is signalled not by saying "it's
> OK with Python 3, we tried it". It's signalled by changing the shebang.
>
> And this isn't unique to SUSE. Fedora is treating the shebang in the
> same way, apparently. [2]

Seems that if you only have python3 installed in Fedora31 this is *not* the case.

# python --version
Python 3.7.5
# /usr/bin/python --version
Python 3.7.5
# ls -l /usr/bin/python
lrwxrwxrwx. 1 root root 9 Nov 18 00:57 /usr/bin/python -> ./python3

See https://lists.fedoraproject.org/archives/list/devel-announce@lists.fedorapr… and https://fedoraproject.org/wiki/Changes/RetirePython2#The_python27_package "there is no /usr/bin/python"

So the two distros are quite divergent in their approach apparently?

>
> It may be true that a given script is fine with Python 3, but as long as
> the shebang says "python" (i.e. python2), there's no way to really find
> out, is there? (Barring things like Josh's suggestion of changing the
> shebang on the fly via a teuthology task/workunit, which is fine if we
> decide we need a transition period, which it looks like we will.)
>
> Nathan
>
> [1]
> https://github.com/kubernetes-incubator/external-storage/blob/master/ceph/c…
> [2]
> https://fedoraproject.org/wiki/FinalizingFedoraSwitchtoPython3#.2Fusr.2Fbin…

--
Cheers,
Brad
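
The idea that the shebang acts as an explicit compatibility declaration can be checked mechanically. The sketch below is an illustration only: `declared_python` is a hypothetical helper, not part of teuthology or any distro tooling, and it handles only the plain `#!/path/python*` and `#!/usr/bin/env python*` forms.

```python
import posixpath
import re


def declared_python(first_line):
    """Return 'python2', 'python3', 'python' (unversioned) or None.

    None means the line is not a shebang or does not name a Python interpreter.
    """
    if not first_line.startswith("#!"):
        return None
    tokens = first_line[2:].split()
    if not tokens:
        return None
    prog = posixpath.basename(tokens[0])
    # '#!/usr/bin/env python3' names the interpreter in the second token.
    if prog == "env" and len(tokens) > 1:
        prog = posixpath.basename(tokens[1])
    match = re.fullmatch(r"python(\d)?(\.\d+)?", prog)
    if match is None:
        return None
    return "python" + (match.group(1) or "")


examples = {
    "#!/usr/bin/python": declared_python("#!/usr/bin/python"),
    "#!/usr/bin/env python3": declared_python("#!/usr/bin/env python3"),
    "#!/usr/bin/python2.7": declared_python("#!/usr/bin/python2.7"),
    "#!/bin/sh": declared_python("#!/bin/sh"),
}
```

The unversioned `python` result is the ambiguous case in this thread: per the messages above it means Python 2 on SLES and openSUSE, while on a Fedora 31 host with only python3 installed it resolves to Python 3.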

4 years, 4 months

Questions about the EC pool

by majia xiao

Hello,

We have a Ceph cluster (version 12.2.4) with 10 hosts, and there are 21 OSDs on each host.

An EC pool is created with the following commands:

  ceph osd erasure-code-profile set profile_jerasure_4_3_reed_sol_van \
    plugin=jerasure \
    k=4 \
    m=3 \
    technique=reed_sol_van \
    packetsize=2048 \
    crush-device-class=hdd \
    crush-failure-domain=host

  ceph osd pool create pool_jerasure_4_3_reed_sol_van 2048 2048 erasure profile_jerasure_4_3_reed_sol_van

Here are my questions:

1. The EC pool is created using k=4, m=3, and crush-device-class=hdd, so we simply disabled the network interfaces of some hosts (using the "ifdown" command) to verify the functionality of the EC pool while running the 'rados bench' command. However, the IO rate drops to 0 immediately when a single host goes offline, and it takes a long time (~100 seconds) for the IO rate to return to normal. As far as I know, the default value of min_size is k+1, i.e. 5, which means that the EC pool should still work even with two hosts offline. Is there something wrong with my understanding?

2. According to our observations, the IO rate seems to return to normal once Ceph has detected all the OSDs belonging to the failed host. Is there any way to reduce the time Ceph needs to detect all failed OSDs?

Thanks for any help.

Best regards,
Majia Xiao
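
The arithmetic behind question 1 can be written out as a short check. It takes the poster's figures at face value, in particular the assumption that min_size defaults to k+1 on this version; `ec_tolerance` is a hypothetical helper, not a Ceph API. With crush-failure-domain=host, each of the k+m shards of a PG sits on a different host, so losing one host removes at most one shard per PG.

```python
def ec_tolerance(k, m, min_size=None):
    """Return (total_shards, min_size, hosts_that_can_fail_with_io_continuing).

    Assumes one shard per host (crush-failure-domain=host) and, unless given,
    the min_size = k + 1 default stated in the question.
    """
    total = k + m
    if min_size is None:
        min_size = k + 1
    if not k <= min_size <= total:
        raise ValueError("min_size must be between k and k + m")
    return total, min_size, total - min_size


total_shards, min_size, tolerated = ec_tolerance(k=4, m=3)
```

For k=4 and m=3 this gives 7 shards, a min_size of 5, and 2 tolerated host failures, which matches the poster's understanding. The calculation covers steady-state availability only; it says nothing about the roughly 100-second stall, which by the poster's own observation in question 2 tracks how long it takes for the failed OSDs to be detected.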

4 years, 4 months
