Hi everyone,
I realized today that I didn't tell everyone about the cephadm conversion
on the lab cluster, so it seemed like a good time for an overall status
update.
~3 weeks ago I converted the lab cluster to use cephadm. This helped
shake out a number of issues with the upgrade process, and I also ran into
some exciting snags. The main issue was that cephadm only works with
bluestore OSDs, and I had forgotten that there were lots of old filestore
OSDs still in the cluster. To get around this, I ended up dist-upgrading
several hosts from xenial to bionic so that the host packages could be
installed (this all happened mid-upgrade and an upgrade bug was preventing
the OSDs from peering). Once things had upgraded and stabilized, I
removed most of the old mira nodes from the cluster (the ones that had
all or mostly filestore OSDs) and rebalanced. Then I finished the cephadm
conversion.
Current status:
- everything is cephadm and container-based
- all OSDs are bluestore
- there are 2 remaining mira nodes in the cluster
- most of the hosts are still running xenial.
One of the nice things about cephadm is that there are few OS
dependencies--we just need podman or docker, python3, and LVM. So it's
/mostly/ fine that these machines are running xenial. But it's not ideal.
- We still have (and want) ceph-common on the host so that the ceph CLI
works. But we don't build octopus for xenial, so the nautilus packages
are still installed. There are a few new CLI-side changes in octopus, so
we should fix this at some point. I think this just means we should
dist-upgrade these machines to bionic. (A quick way to see the version
mismatch is sketched just after this list.)
- There are still two mira nodes in the cluster that we may want to remove at
some point...
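As mentioned above, a quick way to see the nautilus/octopus mismatch is to
compare the installed CLI with what the cluster is actually running (both
are standard commands; exact output will vary):

  ceph --version    # version of the ceph-common package on the host
  ceph versions     # versions reported by the running daemons

The first should still say nautilus on the xenial hosts while the second
reports octopus.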
A few other changes:
- We have a few SSD-based OSDs, but the crush rules were still spreading
data across all devices. I updated the data pools to use HDDs only, and the
cephfs metadata pool is now on SSDs only. This should have sped things up
a bit!
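For anyone curious, device-class-based rules are set up with something like
the following (rule and pool names here are illustrative, not necessarily
exactly what's on the lab cluster):

  # create crush rules restricted to a device class
  ceph osd crush rule create-replicated replicated_hdd default host hdd
  ceph osd crush rule create-replicated replicated_ssd default host ssd

  # point the pools at the new rules
  ceph osd pool set cephfs_data crush_rule replicated_hdd
  ceph osd pool set cephfs_metadata crush_rule replicated_ssd

Changing a pool's crush rule triggers a rebalance, so some data movement is
expected.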
Cephadm stuff...
For a crash course on cephadm, see the new docs at
https://docs.ceph.com/docs/master/cephadm/
The main thing is that all of the daemons are running in containers. If
you're on the host and want to stop/start things, it's
systemctl stop/start ceph-$fsid@$name
For example,
systemctl restart ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@mon.reesi002
(The nice thing is that tab completion works for the unit name.)
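If you don't remember the exact daemon names on a host, you can list the
ceph units first (the fsid below is this cluster's, from the example above):

  systemctl list-units 'ceph-*'
  systemctl status ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@mon.reesi002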
There is also a ceph.target that will stop or start *all* ceph daemons,
either for the cluster (ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a.target)
or all clusters (ceph.target).
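For example, using this cluster's fsid:

  # stop/start every ceph daemon for this cluster only
  systemctl stop ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a.target
  systemctl start ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a.target

  # or every ceph daemon for every cluster on the host
  systemctl stop ceph.target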
The cluster is configured to log to files like traditional ceph
deployments. The only difference is that logs are in /var/log/ceph/$fsid.
Again, tab completion is your friend (esp when you remember that the fsid
for this cluster starts with "2").
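For example, to follow the mon log on reesi002 (the exact file name may
differ slightly):

  tail -f /var/log/ceph/28f7427e-5558-4ffd-ae1a-51ec3042759a/ceph-mon.reesi002.log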
I've been upgrading this cluster regularly (almost every day, if
not more) using the cephadm automated upgrades. That command is
just
ceph orch upgrade start --image quay.io/ceph-ci/ceph:octopus
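You can check on a running upgrade with

  ceph orch upgrade status

and (if I remember right) progress also shows up in the regular 'ceph -s'
output.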
Cephadm can automatically decide where to deploy daemons based on specs
you provide about placement, count, etc. You can view that with
root@reesi002:~# ceph orch ls
NAME          RUNNING  REFRESHED  AGE  PLACEMENT  IMAGE NAME                                 IMAGE ID
alertmanager  1/1      9m ago     2d   count:1    docker.io/prom/alertmanager:latest         0881eb8f169f
crash         9/9      9m ago     2d   *          quay.io/ceph-ci/ceph:octopus               a94ff4985406
grafana       1/1      67s ago    2d   count:1    docker.io/pcuzner/ceph-grafana-el8:latest  f77afcf0bcf6
mds.cephfs    4/4      9m ago     2d   label:mds  quay.io/ceph-ci/ceph:octopus               d97834ddee42
mgr           2/2      7m ago     2d   count:2    quay.io/ceph-ci/ceph:octopus               a94ff4985406
mon           5/5      9m ago     2d   count:5    quay.io/ceph-ci/ceph:octopus               a94ff4985406
prometheus    1/1      9m ago     2d   count:1    docker.io/prom/prometheus:latest           e935122ab143
You can see the actual daemons with 'ceph orch ps'.
You can see that the mds.cephfs service (cephfs == the fs name) is tied to
label 'mds'. You can see host labels with
root@reesi002:~# ceph orch host ls
HOST      ADDR      LABELS   STATUS
mira055   mira055
mira060   mira060
mira093   mira093
reesi001  reesi001  mon mds
reesi002  reesi002  mon mds
reesi003  reesi003  mon mds
reesi004  reesi004  mon mgr
reesi005  reesi005  mon mgr
reesi006  reesi006  mgr mds
(There are other labels set there that aren't getting used at the moment.)
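If we want to move a service around later, it should just be a matter of
adjusting labels and reapplying the placement spec, roughly like this
(illustrative only; see the cephadm docs for the exact spec syntax):

  # add the 'mds' label to another host
  ceph orch host label add mira055 mds

  # (re)apply the mds service bound to that label
  ceph orch apply mds cephfs --placement=label:mds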
TODOs...
- dist-upgrade everything to bionic.
- Upgrade the host packages. This is most easily done with
./cephadm add-repo --release octopus
./cephadm install cephadm ceph-common
which will install the packaged cephadm (so it's in the path and doesn't
have to be curled manually; the manual fetch is shown after this list for
reference) and ceph-common (which has all the important
CLI commands). We should probably uninstall the other ceph
packages.
- We aren't deploying the full monitoring (prometheus, etc.) stack via
cephadm because some of those components are already installed
(node-exporter I think?) and I'm not sure how that was done or the right
way to remove them (or use them as is). Also at the moment there are a
few bugs in the config files for alertmanager and grafana that cephadm is
generating.
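(For reference, the manual fetch mentioned in the TODO above looks something
like this, per the cephadm docs:

  curl --silent --remote-name --location https://github.com/ceph/ceph/raw/octopus/src/cephadm/cephadm
  chmod +x cephadm

so having it packaged is definitely nicer.)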
sage
If you keep your teuthology checkout in ~/src, this will clean up
checkouts older than 90 days.
find ~/src -mindepth 1 -maxdepth 1 -mtime +90 -exec rm -rvf {} \;
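If you want to see what it would delete before running it for real, the same
find with -print instead of -exec is a safe dry run:

  find ~/src -mindepth 1 -maxdepth 1 -mtime +90 -print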
--
David Galloway
Systems Administrator, RDU
Ceph Engineering
IRC: dgalloway
Hey all,
I've gotten a couple of requests in the past 24 hours asking how to "lock"
the new dev machines: https://wiki.sepia.ceph.com/doku.php?id=hardware:vossi
These systems aren't in paddles so `teuthology-lock` isn't going to work
here. Is that something you all want?
My understanding is that, historically, the rex and senta hosts have been
shared machines where there is a chance devs can step on each other's toes. I
get the desire to have exclusive use of a machine, but I don't want to
have to be the one to police machine-hogging.
--
David Galloway
Systems Administrator, RDU
Ceph Engineering
IRC: dgalloway