Hi Xie,
Pull request https://github.com/ceph/ceph/pull/31774 includes changes to
the Balancer that I would like you to look at even though it has
already merged. During testing I also uncovered an issue, for which I
filed tracker https://tracker.ceph.com/issues/43124. Please have a look
at that as well.
Thanks
David
Adding dev list. We haven't talked through much of this in any detail in
the orchestrator calls yet aside from a vague discussion about what
should/shouldn't be in scope.
On Thu, 28 Nov 2019, Paul Cuzner wrote:
> On Thu, Nov 28, 2019 at 2:37 AM Sage Weil <sweil(a)redhat.com> wrote:
>
> > On Wed, 27 Nov 2019, Paul Cuzner wrote:
> > > Hi,
> > >
> > > I've got a working gist for the add/remove of the monitoring solution.
> > > https://gist.github.com/pcuzner/ac542ce3fa9a4699bb9310b1fd5095d0
> > >
> > > I'm out for the next couple of days, but will get a PR raised next week
> > > to get this started properly.
> >
> > For some reason it won't let me comment on that gist.
> >
> > - I don't think we should install anything on the host outside of the unit
> > file and /var/lib/ceph/$fsid/$thing. I suggest $thing be 'prometheus',
> > 'alertmanager', 'node-exporter', 'grafana'. We could combine all but
> > node-exporter into a single 'monitoring' thing but i'm worried this
> > obscures things too much when, for example, the user might have an
> > external prometheus but still need alertmanager, and so on.
> >
> > So all the configs should live in
> > /var/lib/ceph/$fsid/$thing/prometheus.yml and so on, and then bound to the
> > right /etc/whatever location by the container config.
> >
>
> I struggle with this one. Channelling my inner sysadmin: "I expect config
> settings to be in /etc and data to be in /var/lib - that's what FHS says
> and that's how other systems look that I have to manage, so why does Ceph
> have to do things differently?"
1. Because it's a containerized service. Things are in /etc *inside* the
container, not outside. Sprinkling these configs in /etc mixes the
containerized service's configs with the *host*'s configs, which seems very
untidy to me.
2. Putting it all in /var/lib/ceph/whatever means it's easy to find and
clean up.
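To make that concrete, here is a rough sketch of how the bind mount could map
the per-cluster config into the path the container expects (paths, port, and
image are illustrative, not a decided layout):
  fsid=$(ceph fsid)    # or however the fsid gets plumbed through
  podman run -d --name ceph-$fsid-prometheus \
    -v /var/lib/ceph/$fsid/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro \
    -v /var/lib/ceph/$fsid/prometheus/data:/prometheus \
    -p 9090:9090 \
    docker.io/prom/prometheus
So the config stays under /var/lib/ceph/$fsid/prometheus on the host, but
shows up at the usual /etc/prometheus path inside the container.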
> I'm also not sure of the value of fsid in the dir names. I can see the
> value if a host has to support multiple ceph clusters - but outside dev is
> that something that the community or our customers actually want?
Most deployments won't need it, but it will avoid a whole range of
problems when they do. Especially once it becomes trivial to bootstrap
clusters, it also becomes trivial to end up with multiple clusters
overlapping on the same host.
And, like above, it keeps things tidy.
> The gist downloads the separate containers we need in parallel - which I
> think is a good thing! It reduces time.
Sure... that's something we could do regardless of whether it's a separate
script or part of ceph-daemon. Probably what we actually want is for the
ssh 'host add' command to kick off some prestaging of containers in the
background so that the first daemon deployment doesn't wait for a
container download at all.
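Roughly (the image list here is purely illustrative):
  for image in docker.io/ceph/ceph docker.io/prom/prometheus \
               docker.io/prom/alertmanager docker.io/prom/node-exporter \
               docker.io/grafana/grafana; do
    podman pull "$image" &   # pull in parallel, in the background
  done
  wait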
> IMO, having monitoring-add deploy grafana/prom and alertmanager together
> by default is the way to go. TBH, when I started this, I was putting them
> all in the same pod under podman for management and treating them as a
> single unit - but having to support 'legacy' docker put an end to that :)
>
> If a user wishes to use a separate prometheus, that will normally have its
> own alertmanager too. Which alertmanager a prometheus server uses is defined
> in its prometheus.yml. With an external prometheus, rules, alerts and receiver
> definitions are going to be an exercise for the reader. We'll need to
> document the settings, but the admin will need to apply them - in this
> scenario, we could possibly generate sample files that the admin can pick
> up and apply? To my mind deployment of monitoring has two pathways:
> default - "monitoring add" yields prom/grafana/alertmanager containers
> deployed to the machine
> external-prom - "monitoring add" just deploys grafana, and points its
> default data source at the external prom url. We're also making an
> assumption here that the prometheus server is open and doesn't require auth
> (OCP's prometheus, for example, has auth enabled)
I think it makes sense to focus on the out-of-the-box opinionated easy
scenario vs the DIY case, in general at least. But I have a few
questions...
- In the DIY case, does it make sense to leave the node-exporter to the
reader too? Or might it make sense for us to help deploy the
node-exporter, but they run the external/existing prometheus instance?
- Likewise, the alertmanager is going to have a bunch of ceph-specific
alerts configured, right? Might they want their own prom but we deploy
our alerts? (Is there any dependency in the dashboard on a particular set
of alerts in prometheus?)
I'm guessing you think no in both these cases...
> > - Let's teach ceph-daemon how to do this, so that you do 'ceph-daemon
> > deploy --fsid ... --name prometheus.foo -i input.json'. ceph-daemon
> > has the framework for opening firewall ports etc now... just add ports
> > based on the daemon type.
> >
>
> TBH, I'd keep the monitoring containers away from the ceph daemons. They
> require different parameters, config files etc so why not keep them
> separate and keep the ceph logic clean. This also allows us to change
> monitoring without concerns over logic changes to normal ceph daemon
> management.
Okay, but mgr/ssh is still going to be wired up to deploy these. And to do
so on a per-cluster, containerized basis... which means all of the infra
in ceph-daemon will still be useful. It seems easiest to just add it
there.
Your points above seem to point toward simplifying the containers we
deploy to just two containers, one that's one-per-cluster for
prom+alertmanager+grafana, and one that's per-host for the node-exporter.
But I think making it fit in nicely with the other ceph containers (e.g.,
/var/lib/ceph/$fsid/$thing) makes sense. Especially since we can just deploy
these during bootstrap by default (unless some --external-prometheus is
passed) and this all happens without the admin having to think about it.
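For instance (--external-prometheus is just the placeholder name from above,
not an existing flag), a sketch:
  # proposed default: bootstrap also stands up prometheus, alertmanager,
  # grafana, and node-exporter
  ceph-daemon bootstrap --mon-ip 10.0.0.1
  # hypothetical: skip our prometheus and point grafana/dashboard at an
  # existing instance instead
  ceph-daemon bootstrap --mon-ip 10.0.0.1 \
    --external-prometheus http://prom.example.org:9090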
> > WDYT?
> >
> >
> I'm sure a lot of the above has already been discussed at length with the
> SuSE folks, so apologies for going over ground that you've already covered.
Not yet! :)
sage
I am updating the Ceph documentation. Included in this email is a proposed
change to the documentation and a request for information pertaining to
that proposed change.
If you know about the issue behind the proposed change and you have
information pertinent to it that you would like to enshrine in the
documentation, reply to this email and tell me.
Documentation Link: https://docs.ceph.com/docs/master/radosgw/STSLite/
Proposed Change: The page at the address specified by "Documentation Link"
does not inform the reader that Ceph Nautilus STS supports WebIdentity
OpenAuth OpenID. This information should be added to this page in a place
that fits.
Zac's Request: Someone who is familiar with the STS-Ceph interface should
tell me about how this works, and I'd like to include some examples in the
documentation as well, so anyone who has used OpenID to interact with Ceph
Nautilus is encouraged to send me their examples.
Tracking Information (this can be ignored by everyone but Zac)
Bug # 1 here: https://pad.ceph.com/p/Report_Documentation_Bugs
Hi all,
I am backporting the async recovery feature to Hammer :( --- this is
a sad story, but we can't upgrade to Luminous or a newer version in
our production environments.
The teuthology tests caught an osd crash after backporting
https://github.com/ceph/ceph/pull/19811,
but I have reproduced it with master branch code. The details are
described below:
code version: master branch with latest commit
656c8e8049c2c1acd363143c842c2edf1fe09b64
config of vstart:
[client.vstart.sh]
num mon = 1
num osd = 5
num mds = 0
num mgr = 1
num rgw = 0
config of osd/mon/mgr:
[osd]
osd_async_recovery_min_cost = 0 # only add this one, others are
default with vstart
vstart command:
MON=1 OSD=5 MDS=0 MGR=1 RGW=0 ../src/vstart.sh --debug --new -X
--localhost --bluestore --without-dashboard
pool info:
pool 1 'rbd' replicated size 3 min_size 1 crush_rule 0 object_hash
rjenkins pg_num 256 pgp_num 256 autoscale_mode off last_change 31
flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
osd log:
2019-12-02T09:32:28.779+0000 7f8ff8c82700 5 osd.2 pg_epoch: 783
pg[1.c2( v 657'1488 (0'0,657'1488] local-lis/les=0/0 n=13 ec=25/25
lis/c=0/449 les/c/f=0/450/0 sis=783) [2,1,0] r=0 lpr=783
pi=[449,783)/1 crt=657'1488 mlcod 0'0 unknown mbc={}] enter
Started/Primary
2019-12-02T09:32:28.779+0000 7f8ff8c82700 5 osd.2 pg_epoch: 783
pg[1.c2( v 657'1488 (0'0,657'1488] local-lis/les=0/0 n=13 ec=25/25
lis/c=0/449 les/c/f=0/450/0 sis=783) [2,1,0] r=0 lpr=783
pi=[449,783)/1 crt=657'1488 mlcod 0'0 creating mbc={}] enter
Started/Primary/Peering ### pg state is creating
2019-12-02T09:32:28.779+0000 7f8ff8c82700 5 osd.2 pg_epoch: 783
pg[1.c2( v 657'1488 (0'0,657'1488] local-lis/les=0/0 n=13 ec=25/25
lis/c=0/449 les/c/f=0/450/0 sis=783) [2,1,0] r=0 lpr=783
pi=[449,783)/1 crt=657'1488 mlcod 0'0 creating+peering mbc={}] enter
Started/Primary/Peering/GetInfo ### pg state is
creating+peering
Reproduce steps:
1. create a pool, wait for active+clean
2. write to an rbd image with fio or another tool during the steps below
3. ceph osd reweight 2 0.1
4. wait several minutes, make sure pgs are moved to other osds (grep
"on_removal" osd.2.log)
5. ceph osd reweight 2 1
6. wait several minutes, make sure pgs are moved back to the original
osd.2 (grep "_make_pg" osd.2.log)
7. find an up_primary pg (such as 1.c2 in my log) on osd.2 which was
moved out and back during steps 3~6; it should enter async
recovery after step 6
8. wait for the pg to become active+clean; you can then find in the log
that it had entered the creating+peering state.
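For reference, steps 1-6 roughly correspond to the following commands (pool
and image names and the fio invocation are illustrative):
  ceph osd pool create rbd 256 256
  rbd create rbd/test --size 10G
  # step 2: keep random writes going during the steps below
  fio --name=bg --ioengine=rbd --clientname=admin --pool=rbd --rbdname=test \
      --rw=randwrite --bs=4k --time_based --runtime=600 &
  ceph osd reweight 2 0.1                    # step 3
  sleep 300; grep on_removal osd.2.log       # step 4
  ceph osd reweight 2 1                      # step 5
  sleep 300; grep _make_pg osd.2.log         # step 6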
The root cause may be:
1. after step 6, osd.2 will be the async recovery target in pg 1.c2, and
its pg will be recreated after the reweight back to 1
2. after pg->init, local_les=0 while history.les=450
2019-12-02T09:21:47.991+0000 7f8ff8c82700 10 osd.2 543 _make_pg 1.c2
2019-12-02T09:21:47.991+0000 7f8ff8c82700 10 osd.2 pg_epoch: 543
pg[1.c2( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c=0/0
les/c/f=0/0/0 sis=0) [] r=-1 lpr=0 crt=0'0 unknown mbc={}] init role 0
up [2,1,0] acting [2,1,0] history ec=25/25 lis/c=449/449
les/c/f=450/450/0 sis=543 pruub=14.367934206s past_intervals
([449,542] all_participants=0,1,3 intervals=([449,542] acting 0,1,3))
2019-12-02T09:21:47.991+0000 7f8ff8c82700 20 osd.2 pg_epoch: 543
pg[1.c2( empty local-lis/les=0/0 n=0 ec=25/25 lis/c=449/449
les/c/f=450/450/0 sis=543 pruub=14.367934206s) [2,1,0] r=0 lpr=0
pi=[449,543)/1 crt=0'0 mlcod 0'0 unknown mbc={}] on_new_interval
2019-12-02T09:21:48.063+0000 7f8ff8c82700 20 osd.2 pg_epoch: 543
pg[1.c2( empty local-lis/les=0/0 n=0 ec=25/25 lis/c=449/449
les/c/f=450/450/0 sis=543) [2,1,0] r=0 lpr=543 pi=[449,543)/1 crt=0'0
mlcod 0'0 peering mbc={}] choose_async_recovery_replicated result
want=[0,1] async_recovery=2
2019-12-02T09:21:48.983+0000 7f8ff8c82700 20 osd.2 pg_epoch: 544
pg[1.c2( empty local-lis/les=0/0 n=0 ec=25/25 lis/c=449/449
les/c/f=450/450/0 sis=543) [2,1,0] r=0 lpr=544 pi=[449,543)/1 crt=0'0
mlcod 0'0 unknown mbc={}] new interval newup [2,1,0] newacting [0,1]
## osd.2 is async recovery target in pg 1.c2
acting_primary osd.0 log:
2019-12-02T09:21:49.195+0000 7f2f01060700 20 osd.0 pg_epoch: 544
pg[1.c2( v 542'940 (0'0,542'940] local-lis/les=449/450 n=13 ec=25/25
lis/c=449/449 les/c/f=450/450/0 sis=544) [2,1,0]/[0,1] r=0 lpr=544
pi=[449,544)/1 crt=542'940 lcod 542'939 mlcod 0'0 remapped+peering
mbc={}] choose_async_recovery_replicated result want=[0,1]
async_recovery=2
3. when a repop arrives at osd.2 for pg 1.c2, the append_log func will
find local_les (=0) != history.les (=450), and it will use local_les as
the new history.les, so history.les becomes 0:
void PeeringState::append_log(
const vector<pg_log_entry_t>& logv,
eversion_t trim_to,
eversion_t roll_forward_to,
ObjectStore::Transaction &t,
bool transaction_applied,
bool async)
{
/* The primary has sent an info updating the history, but it may not
* have arrived yet. We want to make sure that we cannot remember this
* write without remembering that it happened in an interval which went
* active in epoch history.last_epoch_started.
*/
if (info.last_epoch_started != info.history.last_epoch_started) {
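// in the scenario described above, the recreated PG has last_epoch_started=0
// while history.last_epoch_started=450, so the history value is overwritten
// with 0 here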
info.history.last_epoch_started = info.last_epoch_started;
}
...
}
4. when the async recovery of osd.2 in pg 1.c2 is over, osd.2 changes back
to acting_primary, and the pg state will be set to PG_STATE_CREATING:
PeeringState::Primary::Primary(my_context ctx)
: my_base(ctx),
NamedState(context< PeeringMachine >().state_history, "Started/Primary")
{
context< PeeringMachine >().log_enter(state_name);
DECLARE_LOCALS;
ceph_assert(ps->want_acting.empty());
// set CREATING bit until we have peered for the first time.
if (ps->info.history.last_epoch_started == 0) {
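// in the scenario described above, history.last_epoch_started was reset to 0
// by append_log(), so the recreated PG takes this branch and is marked
// CREATING again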
ps->state_set(PG_STATE_CREATING);
...
}
So my question is: is this PG_STATE_CREATING state after async
recovery expected or not?
If it is, I guess this creating state may result in an osd crash if the
acting_primary changes during the creating+peering state. The process
may be:
1. osd.2 reports pg stats to the mon
2. the mon will add this pg to creating_pgs/creating_pgs_by_osd_epoch
void PGMap::stat_pg_add(const pg_t &pgid, const pg_stat_t &s,
bool sameosds)
{
auto pool = pgid.pool();
pg_sum.add(s);
num_pg++;
num_pg_by_state[s.state]++;
num_pg_by_pool_state[pgid.pool()][s.state]++;
num_pg_by_pool[pool]++;
if ((s.state & PG_STATE_CREATING) &&
s.parent_split_bits == 0) {
creating_pgs.insert(pgid);
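// a pg reported with a stale CREATING bit (as in the case above) still lands
// here and is tracked as a pg waiting to be created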
if (s.acting_primary >= 0) {
creating_pgs_by_osd_epoch[s.acting_primary][s.mapping_epoch].insert(pgid);
}
}
...
}
3. when the acting_primary changes to a new one, the new acting_primary
will receive an MOSDPGCreate/MOSDPGCreate2 msg with a very old pg
created epoch (the real created epoch is 5)
4. the new acting_primary will get the osdmap for that pg created epoch; this
map was trimmed long ago, so the osd will crash at:
void OSD::build_initial_pg_history(
spg_t pgid,
epoch_t created,
utime_t created_stamp,
pg_history_t *h,
PastIntervals *pi)
{
dout(10) << __func__ << " " << pgid << " created " << created << dendl;
*h = pg_history_t(created, created_stamp);
OSDMapRef lastmap = service.get_map(created);
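// 'created' here is the (very old) epoch carried by the create message; if
// that map has already been trimmed, get_map() will assert (see below)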
...
}
OSDMapRef get_map(epoch_t e) {
OSDMapRef ret(try_get_map(e));
ceph_assert(ret); // crash here because map is trimmed
return ret;
}
I found a related issue here: https://tracker.ceph.com/issues/14592 ("osd
crashes when handling a stale pg-create message (hammer)"),
but I'm not sure whether they have the same root cause.
Thanks for your attention; feel free to ask me for more details.
On Wed, Apr 18, 2018 at 6:27 AM Nathan Cutler <ncutler(a)suse.cz> wrote:
> > That would be at odds with what Nathan is suggesting though, which is a
> > hard change to Python 3.
>
> Hm, not sure what hard/soft means in this context. For any given script,
> either it runs with Python 3, or it doesn't. And this is determined by
> the shebang. (Unless the shebang is omitted, of course.)
>
> I was very surprised to find out that, in SLES and openSUSE, the symlink
> /usr/bin/python -> /usr/bin/python2 will not be changed even when the
> migration of the underlying distro to Python 3 is complete.
>
> But then my colleagues explained why that is, and I "saw the light".
> Since every single script in the distro has to be audited for Python 3
> compatibility, anyway, it makes sense to have the shebang be an explicit
> declaration of said compatibility.
>
> By retaining the symlink as it is, all scripts start out the migration
> process with an explicit declaration that they are compatible with
> Python 2. Compatibility with Python 3 is signalled not by saying "it's
> OK with Python 3, we tried it". It's signalled by changing the shebang.
>
> And this isn't unique to SUSE. Fedora is treating the shebang in the
> same way, apparently. [2]
It seems that if you only have python3 installed in Fedora 31 this is
*not* the case.
# python --version
Python 3.7.5
# /usr/bin/python --version
Python 3.7.5
# ls -l /usr/bin/python
lrwxrwxrwx. 1 root root 9 Nov 18 00:57 /usr/bin/python -> ./python3
See https://lists.fedoraproject.org/archives/list/devel-announce@lists.fedorapr…
and https://fedoraproject.org/wiki/Changes/RetirePython2#The_python27_package
"there is no /usr/bin/python"
So the two distros are quite divergent in their approach apparently?
>
> It may be true that a given script is fine with Python 3, but as long as
> the shebang says "python" (i.e. python2), there's no way to really find
> out, is there? (Barring things like Josh's suggestion of changing the
> shebang on the fly via a teuthology task/workunit, which is fine if we
> decide we need a transition period, which it looks like we will.)
>
> Nathan
>
> [1]
> https://github.com/kubernetes-incubator/external-storage/blob/master/ceph/c…
> [2]
> https://fedoraproject.org/wiki/FinalizingFedoraSwitchtoPython3#.2Fusr.2Fbin…
--
Cheers,
Brad
Hi Ali,
I was thinking a bit about this and I think we should actually extend the
ObjectStore CRD first so that we have parity between the ssh orch and
rook. That will let us update the multisite documentation and standardize
the way rgws are deployed and configured. Automating the multisite via
the Realm CRD can come right after that.
In particular,
- ssh orch now deploys/names RGWs like client.rgw.$realm.$zone[.$id]; we
should make the rook CRD do the same. There is probably some
weirdness with the extra .$id part at the end since I think k8s might
be adding this?
- Config options can/should then go into the client.rgw.$realm.$zone
config section so that rook isn't passing lots of random stuff on the
command line (ideally, nothing at all except the -n name; see the sketch
after this list).
- Then we can update all of the RGW docs (including multisite) to suggest
deploying the gateways via the ceph orchestrator rgw ... command(s), with
a call-out on how to do it manually.
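On the config-section point above, a rough sketch (realm/zone names, port,
and option values are just examples):
  ceph config set client.rgw.myrealm.myzone rgw_realm myrealm
  ceph config set client.rgw.myrealm.myzone rgw_zone myzone
  ceph config set client.rgw.myrealm.myzone rgw_frontends "beast port=8000"
  # the container then only needs the daemon name, e.g.
  #   radosgw -n client.rgw.myrealm.myzone -f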
Then do the multisite CRD...
What do you think?
sage
Hi everyone,
We have our regularly scheduled developer call this Wednesday, 1730 UTC:
https://bluejeans.com/908675367
The agenda is here:
https://tracker.ceph.com/projects/ceph/wiki/CDM_04-DEC-2019
So far, everything is orchestration related.
1. MDS affinity to a particular fs. This was removed a while back, but
the orchestration layer assumes MDS daemons are deployed in
clusters/subclusters/sets that are tied to a particular fs. We need to
decide how to proceed.
2- mgr/ssh and ceph-daemon install wizard. We'd like to scope out what
an install wizard should look like and do for a clean install
experience, starting from a ceph-daemon bootstrap command and then jumping
immediately to the dashboard to finish the cluster setup.
3. Orchestrator upgrades. How should we divide responsibility between the
shared orch code and the orch implementation (rook or ssh)?
4. Out-of-the-box monitoring. How much of the monitoring stack
(prometheus, grafana, etc.) should the ssh orch layer know how to deploy
to get a fully functional cluster up and running (for cases where the
user isn't doing it themselves)?
sage