Hi Xie,
Pull request https://github.com/ceph/ceph/pull/31774 includes changes to
the Balancer that I would like you to look at even though it has
already merged. During testing I also uncovered an issue, for which I
filed tracker https://tracker.ceph.com/issues/43124. Please have a look
at that as well.
Thanks
David
Adding dev list. We haven't talked through much of this in any detail in
the orchestrator calls yet aside from a vague discussion about what
should/shouldn't be in scope.
On Thu, 28 Nov 2019, Paul Cuzner wrote:
> On Thu, Nov 28, 2019 at 2:37 AM Sage Weil <sweil(a)redhat.com> wrote:
>
> > On Wed, 27 Nov 2019, Paul Cuzner wrote:
> > > Hi,
> > >
> > > I've got a working gist for the add/remove of the monitoring solution.
> > > https://gist.github.com/pcuzner/ac542ce3fa9a4699bb9310b1fd5095d0
> > >
> > > I'm out for the next couple of days, but will get a PR raised next week
> > > to get this started properly.
> >
> > For some reason it won't let me comment on that gist.
> >
> > - I don't think we should install anything on the host outside of the unit
> > file and /var/lib/ceph/$fsid/$thing. I suggest $thing be 'prometheus',
> > 'alertmanager', 'node-exporter', 'grafana'. We could combine all but
> > node-exporter into a single 'monitoring' thing but i'm worried this
> > obscures things too much when, for example, the user might have an
> > external prometheus but still need alertmanager, and so on.
> >
> > So all the configs should live in
> > /var/lib/ceph/$fsid/$thing/prometheus.yml and so on, and then bound to the
> > right /etc/whatever location by the container config.
> >
>
> I struggle with this one. Channelling my inner sysadmin: "I expect config
> settings to be in /etc and data to be in /var/lib - that's what FHS says
> and that's how other systems look that I have to manage, so why does Ceph
> have to do things differently?"
1. Because it's a containerized service. Things are in /etc *inside* the
container, not outside. Sprinkling these configs in /etc mixes the
containerized service's configs with the *host*'s configs, which seems very
untidy to me.
2. Putting it all in /var/lib/ceph/whatever means it's easy to find and
clean up.
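To make that concrete, here is a rough sketch of how the bind mount could map
the per-cluster config into the path the container expects (paths, port, and
image are illustrative, not a decided layout):
  fsid=$(ceph fsid)    # or however the fsid gets plumbed through
  podman run -d --name ceph-$fsid-prometheus \
    -v /var/lib/ceph/$fsid/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro \
    -v /var/lib/ceph/$fsid/prometheus/data:/prometheus \
    -p 9090:9090 \
    docker.io/prom/prometheus
So the config stays under /var/lib/ceph/$fsid/prometheus on the host, but
shows up at the usual /etc/prometheus path inside the container.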
> I'm also not sure of the value of fsid in the dir names. I can see the
> value if a host has to support multiple ceph clusters - but outside dev is
> that something that the community or our customers actually want?
Most deployments won't need it, but it will avoid a whole range of
problems when they do. Especially once it becomes trivial to bootstrap
clusters, it also becomes trivial to end up with multiple clusters
overlapping on the same host.
And, like above, it keeps things tidy.
> The gist downloads the separate containers we need in parallel - which I
> think is a good thing! It reduces time.
Sure... that's something we could do regardless of whether it's a separate
script or part of ceph-daemon. Probably what we actually want is for the
ssh 'host add' command to kick off some prestaging of containers in the
background so that the first daemon deployment doesn't wait for a
container download at all.
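Roughly (the image list here is purely illustrative):
  for image in docker.io/ceph/ceph docker.io/prom/prometheus \
               docker.io/prom/alertmanager docker.io/prom/node-exporter \
               docker.io/grafana/grafana; do
    podman pull "$image" &   # pull in parallel, in the background
  done
  wait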
> IMO, having monitoring-add deploy grafana/prom and alertmanager together
> by default is the way to go. TBH, when I started this, I was putting them
> all in the same pod under podman for management and treating them as a
> single unit - but having to support 'legacy' docker put an end to that :)
>
> If a user wishes to use a separate prometheus, that will normally have its
> own alertmanager too. Which alertmanager a prometheus server uses is defined
> in its prometheus.yml. With an external prometheus, rules, alerts and receiver
> definitions are going to be an exercise for the reader. We'll need to
> document the settings, but the admin will need to apply them - in this
> scenario, we could possibly generate sample files that the admin can pick
> up and apply? To my mind deployment of monitoring has two pathways:
> default - "monitoring add" yields prom/grafana/alertmanager containers
> deployed to the machine
> external-prom - "monitoring add" just deploys grafana, and points its
> default data source at the external prom url. We're also making an
> assumption here that the prometheus server is open and doesn't require auth
> (OCP's prometheus, for example, has auth enabled)
I think it makes sense to focus on the out-of-the-box opinionated easy
scenario vs the DIY case, in general at least. But I have a few
questions...
- In the DIY case, does it make sense to leave the node-exporter to the
reader too? Or might it make sense for us to help deploy the
node-exporter, but they run the external/existing prometheus instance?
- Likewise, the alertmanager is going to have a bunch of ceph-specific
alerts configured, right? Might they want their own prom but we deploy
our alerts? (Is there any dependency in the dashboard on a particular set
of alerts in prometheus?)
I'm guessing you think no in both these cases...
> > - Let's teach ceph-daemon how to do this, so that you do 'ceph-daemon
> > deploy --fsid ... --name prometheus.foo -i input.json'. ceph-daemon
> > has the framework for opening firewall ports etc now... just add ports
> > based on the daemon type.
> >
>
> TBH, I'd keep the monitoring containers away from the ceph daemons. They
> require different parameters, config files etc so why not keep them
> separate and keep the ceph logic clean. This also allows us to change
> monitoring without concerns over logic changes to normal ceph daemon
> management.
Okay, but mgr/ssh is still going to be wired up to deploy these. And to do
so on a per-cluster, containerized basis... which means all of the infra
in ceph-daemon will still be useful. It seems easiest to just add it
there.
Your points above seem to point toward simplifying the containers we
deploy to just two containers, one that's one-per-cluster for
prom+alertmanager+grafana, and one that's per-host for the node-exporter.
But I think making it fit in nicely with the other ceph containers (e.g.,
/var/lib/ceph/$fsid/$thing) makes sense. Especially since we can just deploy
these during bootstrap by default (unless some --external-prometheus is
passed) and this all happens without the admin having to think about it.
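For instance (--external-prometheus is just the placeholder name from above,
not an existing flag), a sketch:
  # proposed default: bootstrap also stands up prometheus, alertmanager,
  # grafana, and node-exporter
  ceph-daemon bootstrap --mon-ip 10.0.0.1
  # hypothetical: skip our prometheus and point grafana/dashboard at an
  # existing instance instead
  ceph-daemon bootstrap --mon-ip 10.0.0.1 \
    --external-prometheus http://prom.example.org:9090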
> > WDYT?
> >
> >
> I'm sure a lot of the above has already been discussed at length with the
> SuSE folks, so apologies for going over ground that you've already covered.
Not yet! :)
sage
I am updating the Ceph documentation. Included in this email is a proposed
change to the documentation and a request for information pertaining to
that proposed change.
If you know about the issue behind the proposed change and you have
information pertinent to it that you would like to enshrine in the
documentation, reply to this email and tell me.
Documentation Link: https://docs.ceph.com/docs/master/radosgw/STSLite/
Proposed Change: The page at the address specified by "Documentation Link"
does not inform the reader that Ceph Nautilus STS supports WebIdentity
OpenAuth OpenID. This information should be added to this page in a place
that fits.
Zac's Request: Someone who is familiar with the STS-Ceph interface should
tell me about how this works, and I'd like to include some examples in the
documentation as well, so anyone who has used OpenID to interact with Ceph
Nautilus is encouraged to send me their examples.
Tracking Information (this can be ignored by everyone but Zac)
Bug # 1 here: https://pad.ceph.com/p/Report_Documentation_Bugs
Hi all,
I am backporting the async recovery feature to Hammer :( --- this is
a sad story, but we can't upgrade to Luminous or a newer version in
our production environments.
The teuthology tests caught an osd crash after backporting
https://github.com/ceph/ceph/pull/19811,
but I have reproduced it with master branch code. The details are
described below:
code version: master branch with latest commit
656c8e8049c2c1acd363143c842c2edf1fe09b64
config of vstart:
[client.vstart.sh]
num mon = 1
num osd = 5
num mds = 0
num mgr = 1
num rgw = 0
config of osd/mon/mgr:
[osd]
osd_async_recovery_min_cost = 0 # only add this one, others are
default with vstart
vstart command:
MON=1 OSD=5 MDS=0 MGR=1 RGW=0 ../src/vstart.sh --debug --new -X
--localhost --bluestore --without-dashboard
pool info:
pool 1 'rbd' replicated size 3 min_size 1 crush_rule 0 object_hash
rjenkins pg_num 256 pgp_num 256 autoscale_mode off last_change 31
flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
osd log:
2019-12-02T09:32:28.779+0000 7f8ff8c82700 5 osd.2 pg_epoch: 783
pg[1.c2( v 657'1488 (0'0,657'1488] local-lis/les=0/0 n=13 ec=25/25
lis/c=0/449 les/c/f=0/450/0 sis=783) [2,1,0] r=0 lpr=783
pi=[449,783)/1 crt=657'1488 mlcod 0'0 unknown mbc={}] enter
Started/Primary
2019-12-02T09:32:28.779+0000 7f8ff8c82700 5 osd.2 pg_epoch: 783
pg[1.c2( v 657'1488 (0'0,657'1488] local-lis/les=0/0 n=13 ec=25/25
lis/c=0/449 les/c/f=0/450/0 sis=783) [2,1,0] r=0 lpr=783
pi=[449,783)/1 crt=657'1488 mlcod 0'0 creating mbc={}] enter
Started/Primary/Peering ### pg state is creating
2019-12-02T09:32:28.779+0000 7f8ff8c82700 5 osd.2 pg_epoch: 783
pg[1.c2( v 657'1488 (0'0,657'1488] local-lis/les=0/0 n=13 ec=25/25
lis/c=0/449 les/c/f=0/450/0 sis=783) [2,1,0] r=0 lpr=783
pi=[449,783)/1 crt=657'1488 mlcod 0'0 creating+peering mbc={}] enter
Started/Primary/Peering/GetInfo ### pg state is
creating+peering
Reproduce steps:
1. create a pool, wait for active+clean
2. write to an rbd image with fio or another tool during the steps below
3. ceph osd reweight 2 0.1
4. wait several minutes, make sure pgs are moved to other osds (grep
"on_removal" osd.2.log)
5. ceph osd reweight 2 1
6. wait several minutes, make sure pgs are moved back to the original
osd.2 (grep "_make_pg" osd.2.log)
7. find an up_primary pg (such as 1.c2 in my log) on osd.2 which was
moved out and back during steps 3~6; it should enter async
recovery after step 6
8. wait for the pg to become active+clean; you can then find in the log
that it had entered the creating+peering state.
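For reference, steps 1-6 roughly correspond to the following commands (pool
and image names and the fio invocation are illustrative):
  ceph osd pool create rbd 256 256
  rbd create rbd/test --size 10G
  # step 2: keep random writes going during the steps below
  fio --name=bg --ioengine=rbd --clientname=admin --pool=rbd --rbdname=test \
      --rw=randwrite --bs=4k --time_based --runtime=600 &
  ceph osd reweight 2 0.1                    # step 3
  sleep 300; grep on_removal osd.2.log       # step 4
  ceph osd reweight 2 1                      # step 5
  sleep 300; grep _make_pg osd.2.log         # step 6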
The root cause may be:
1. after step 6, osd.2 will be the async recovery target in pg 1.c2, and
its pg will be recreated after the reweight back to 1
2. after pg->init, local_les=0 while history.les=450
2019-12-02T09:21:47.991+0000 7f8ff8c82700 10 osd.2 543 _make_pg 1.c2
2019-12-02T09:21:47.991+0000 7f8ff8c82700 10 osd.2 pg_epoch: 543
pg[1.c2( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c=0/0
les/c/f=0/0/0 sis=0) [] r=-1 lpr=0 crt=0'0 unknown mbc={}] init role 0
up [2,1,0] acting [2,1,0] history ec=25/25 lis/c=449/449
les/c/f=450/450/0 sis=543 pruub=14.367934206s past_intervals
([449,542] all_participants=0,1,3 intervals=([449,542] acting 0,1,3))
2019-12-02T09:21:47.991+0000 7f8ff8c82700 20 osd.2 pg_epoch: 543
pg[1.c2( empty local-lis/les=0/0 n=0 ec=25/25 lis/c=449/449
les/c/f=450/450/0 sis=543 pruub=14.367934206s) [2,1,0] r=0 lpr=0
pi=[449,543)/1 crt=0'0 mlcod 0'0 unknown mbc={}] on_new_interval
2019-12-02T09:21:48.063+0000 7f8ff8c82700 20 osd.2 pg_epoch: 543
pg[1.c2( empty local-lis/les=0/0 n=0 ec=25/25 lis/c=449/449
les/c/f=450/450/0 sis=543) [2,1,0] r=0 lpr=543 pi=[449,543)/1 crt=0'0
mlcod 0'0 peering mbc={}] choose_async_recovery_replicated result
want=[0,1] async_recovery=2
2019-12-02T09:21:48.983+0000 7f8ff8c82700 20 osd.2 pg_epoch: 544
pg[1.c2( empty local-lis/les=0/0 n=0 ec=25/25 lis/c=449/449
les/c/f=450/450/0 sis=543) [2,1,0] r=0 lpr=544 pi=[449,543)/1 crt=0'0
mlcod 0'0 unknown mbc={}] new interval newup [2,1,0] newacting [0,1]
## osd.2 is async recovery target in pg 1.c2
acting_primary osd.0 log:
2019-12-02T09:21:49.195+0000 7f2f01060700 20 osd.0 pg_epoch: 544
pg[1.c2( v 542'940 (0'0,542'940] local-lis/les=449/450 n=13 ec=25/25
lis/c=449/449 les/c/f=450/450/0 sis=544) [2,1,0]/[0,1] r=0 lpr=544
pi=[449,544)/1 crt=542'940 lcod 542'939 mlcod 0'0 remapped+peering
mbc={}] choose_async_recovery_replicated result want=[0,1]
async_recovery=2
3. when a repop arrives at osd.2 for pg 1.c2, the append_log func will
find local_les (=0) != history.les (=450), and it will use local_les as
the new history.les, so history.les becomes 0:
void PeeringState::append_log(
const vector<pg_log_entry_t>& logv,
eversion_t trim_to,
eversion_t roll_forward_to,
ObjectStore::Transaction &t,
bool transaction_applied,
bool async)
{
/* The primary has sent an info updating the history, but it may not
* have arrived yet. We want to make sure that we cannot remember this
* write without remembering that it happened in an interval which went
* active in epoch history.last_epoch_started.
*/
if (info.last_epoch_started != info.history.last_epoch_started) {
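// in the scenario described above, the recreated PG has last_epoch_started=0
// while history.last_epoch_started=450, so the history value is overwritten
// with 0 here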
info.history.last_epoch_started = info.last_epoch_started;
}
...
}
4. when the async recovery of osd.2 in pg 1.c2 is over, osd.2 changes back
to acting_primary, and the pg state will be set to PG_STATE_CREATING:
PeeringState::Primary::Primary(my_context ctx)
: my_base(ctx),
NamedState(context< PeeringMachine >().state_history, "Started/Primary")
{
context< PeeringMachine >().log_enter(state_name);
DECLARE_LOCALS;
ceph_assert(ps->want_acting.empty());
// set CREATING bit until we have peered for the first time.
if (ps->info.history.last_epoch_started == 0) {
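// in the scenario described above, history.last_epoch_started was reset to 0
// by append_log(), so the recreated PG takes this branch and is marked
// CREATING again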
ps->state_set(PG_STATE_CREATING);
...
}
So my question is: is this PG_STATE_CREATING state after async
recovery expected or not?
If it is, I guess this creating state may result in an osd crash if the
acting_primary changes during the creating+peering state. The process
may be:
1. osd.2 reports pg stats to the mon
2. the mon will add this pg to creating_pgs/creating_pgs_by_osd_epoch
void PGMap::stat_pg_add(const pg_t &pgid, const pg_stat_t &s,
bool sameosds)
{
auto pool = pgid.pool();
pg_sum.add(s);
num_pg++;
num_pg_by_state[s.state]++;
num_pg_by_pool_state[pgid.pool()][s.state]++;
num_pg_by_pool[pool]++;
if ((s.state & PG_STATE_CREATING) &&
s.parent_split_bits == 0) {
creating_pgs.insert(pgid);
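// a pg reported with a stale CREATING bit (as in the case above) still lands
// here and is tracked as a pg waiting to be created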
if (s.acting_primary >= 0) {
creating_pgs_by_osd_epoch[s.acting_primary][s.mapping_epoch].insert(pgid);
}
}
...
}
3. when the acting_primary changes to a new one, the new acting_primary
will receive an MOSDPGCreate/MOSDPGCreate2 msg with a very old pg
created epoch (the real created epoch is 5)
4. the new acting_primary will get the osdmap for that pg created epoch; this
map was trimmed long ago, so the osd will crash at:
void OSD::build_initial_pg_history(
spg_t pgid,
epoch_t created,
utime_t created_stamp,
pg_history_t *h,
PastIntervals *pi)
{
dout(10) << __func__ << " " << pgid << " created " << created << dendl;
*h = pg_history_t(created, created_stamp);
OSDMapRef lastmap = service.get_map(created);
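// 'created' here is the (very old) epoch carried by the create message; if
// that map has already been trimmed, get_map() will assert (see below)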
...
}
OSDMapRef get_map(epoch_t e) {
OSDMapRef ret(try_get_map(e));
ceph_assert(ret); // crash here because map is trimmed
return ret;
}
I found a related issue here: https://tracker.ceph.com/issues/14592 ("osd
crashes when handling a stale pg-create message (hammer)"),
but I'm not sure whether they have the same root cause.
Thanks for your attention; feel free to ask me for more details.
On Wed, Apr 18, 2018 at 6:27 AM Nathan Cutler <ncutler(a)suse.cz> wrote:
> > That would be at odds with what Nathan is suggesting though, which is a
> > hard change to Python 3.
>
> Hm, not sure what hard/soft means in this context. For any given script,
> either it runs with Python 3, or it doesn't. And this is determined by
> the shebang. (Unless the shebang is omitted, of course.)
>
> I was very surprised to find out that, in SLES and openSUSE, the symlink
> /usr/bin/python -> /usr/bin/python2 will not be changed even when the
> migration of the underlying distro to Python 3 is complete.
>
> But then my colleagues explained why that is, and I "saw the light".
> Since every single script in the distro has to be audited for Python 3
> compatibility, anyway, it makes sense to have the shebang be an explicit
> declaration of said compatibility.
>
> By retaining the symlink as it is, all scripts start out the migration
> process with an explicit declaration that they are compatible with
> Python 2. Compatibility with Python 3 is signalled not by saying "it's
> OK with Python 3, we tried it". It's signalled by changing the shebang.
>
> And this isn't unique to SUSE. Fedora is treating the shebang in the
> same way, apparently. [2]
It seems that if you only have python3 installed in Fedora 31 this is
*not* the case.
# python --version
Python 3.7.5
# /usr/bin/python --version
Python 3.7.5
# ls -l /usr/bin/python
lrwxrwxrwx. 1 root root 9 Nov 18 00:57 /usr/bin/python -> ./python3
See https://lists.fedoraproject.org/archives/list/devel-announce@lists.fedorapr…
and https://fedoraproject.org/wiki/Changes/RetirePython2#The_python27_package
"there is no /usr/bin/python"
So the two distros are quite divergent in their approach apparently?
>
> It may be true that a given script is fine with Python 3, but as long as
> the shebang says "python" (i.e. python2), there's no way to really find
> out, is there? (Barring things like Josh's suggestion of changing the
> shebang on the fly via a teuthology task/workunit, which is fine if we
> decide we need a transition period, which it looks like we will.)
>
> Nathan
>
> [1]
> https://github.com/kubernetes-incubator/external-storage/blob/master/ceph/c…
> [2]
> https://fedoraproject.org/wiki/FinalizingFedoraSwitchtoPython3#.2Fusr.2Fbin…
--
Cheers,
Brad
Hi Ali,
I was thinking a bit about this and I think we should actually extend the
ObjectStore CRD first so that we have parity between the ssh orch and
rook. That will let us update the multisite documentation and standardize
the way rgws are deployed and configured. Automating the multisite via
the Realm CRD can come right after that.
In particular,
- ssh orch now deploys/names RGWs like client.rgw.$realm.$zone[.$id]; we
should make the rook CRD do the same. There is probably some
weirdness with the extra .$id part at the end since I think k8s might
be adding this?
- Config options can/should then go into the client.rgw.$realm.$zone
config section so that rook isn't passing lots of random stuff on the
command line (ideally, nothing at all except the -n name; see the sketch
after this list).
- Then we can update all of the RGW docs (including multisite) to suggest
deploying the gateways via the ceph orchestrator rgw ... command(s), with
a call-out on how to do it manually.
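On the config-section point above, a rough sketch (realm/zone names, port,
and option values are just examples):
  ceph config set client.rgw.myrealm.myzone rgw_realm myrealm
  ceph config set client.rgw.myrealm.myzone rgw_zone myzone
  ceph config set client.rgw.myrealm.myzone rgw_frontends "beast port=8000"
  # the container then only needs the daemon name, e.g.
  #   radosgw -n client.rgw.myrealm.myzone -f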
Then do the multisite CRD...
What do you think?
sage
Hi everyone,
We have our regularly scheduled developer call this Wednesday, 1730 UTC:
https://bluejeans.com/908675367
The agenda is here:
https://tracker.ceph.com/projects/ceph/wiki/CDM_04-DEC-2019
So far, everything is orchestration related.
1. MDS affinity to a particular fs. This was removed a while back, but
the orchestration layer assumes MDS daemons are deployed in
clusters/subclusters/sets that are tied to a particular fs. We need to
decide how to proceed.
2- mgr/ssh and ceph-daemon install wizard. We'd like to scope out what
an install wizard should look like and do for a clean install
experience, starting from a ceph-daemon bootstrap command and then jumping
immediately to the dashboard to finish the cluster setup.
3. Orchestrator upgrades. How should we divide responsibility between the
shared orch code and the orch implementation (rook or ssh)?
4. Out-of-the-box monitoring. How much of the monitoring stack
(prometheus, grafana, etc.) should the ssh orch layer know how to deploy
to get a fully functional cluster up and running (for cases where the
user isn't doing it themselves)?
sage