Hi folks,
We just migrated ceph:teuthology and all of the tests under qa/ in
ceph:ceph to Python 3. From now on, teuthology-worker runs in a
Python 3 environment by default unless told otherwise with
"--teuthology-branch py2".
This means:
- tests in master need to be written in Python 3 from now on,
- teuthology should be Python 3 compatible, and
- teuthology bug fixes should be backported to the "py2" branch.
If you run into any Python 3 related issues caused by these changes,
please let me know and I will try to fix them ASAP.
Currently, the tests under the qa/ directory in the ceph:ceph master
branch are compatible with both Python 2 and Python 3, but now that
we've moved to Python 3 there is no need to stay Python 2 compatible
anymore. Since the Sepia lab is still running Ubuntu xenial, we cannot
use features introduced in Python 3.6 yet. We do plan to upgrade the OS
to bionic soon; until that happens, the tests need to stay compatible
with Python 3.5.
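For example (an illustrative snippet only; the variable names are made
up), xenial ships Python 3.5, so f-strings and other 3.6+ syntax have
to be avoided in qa/ until the lab moves to bionic:

    # Hypothetical qa/ helper code, shown only to illustrate the 3.5 constraint.
    role = 'client.0'
    path = '/var/log/ceph'

    # NOT OK on xenial -- f-strings were only added in Python 3.6:
    #   cmd = f'ls {path} on {role}'

    # OK on Python 3.5 -- use str.format() (or %-formatting) instead:
    cmd = 'ls {} on {}'.format(path, role)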
The next steps are to:
- drop Python 2 support in the ceph:ceph master branch (see the sketch
  after this list),
- drop Python 2 support in ceph:teuthology master, and
- backport the Python 3 compatible changes to octopus and nautilus to
  ease the pain of future backports.
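To give a rough idea of what dropping Python 2 support looks like in
practice (a hypothetical example, not a specific patch), the usual
py2/py3 compatibility shims can simply be removed once Python 3 is the
only target:

    # Before (py2/py3 compatible), the kind of shims that can go away:
    #   from __future__ import print_function
    #   import six
    #   from six.moves import StringIO
    #
    # After (py3 only):
    from io import StringIO

    def summarize(remote_output: bytes) -> str:
        """Decode raw command output; no six needed for bytes/str handling."""
        data = remote_output.decode('utf-8')
        buf = StringIO()
        print(data, file=buf)   # print() is always a function in Python 3
        return buf.getvalue()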
--
Regards
Kefu Chai
Around 18JUN2020 0700 UTC, an errant `sudo rm -rf ceph` from the root
directory on a senta unfortunately wiped out almost all data on the Ceph
cluster in our upstream Sepia lab (AKA Long Running Cluster or LRC).
Only teuthology job logs were preserved.
My guess is that because teuthology workers were actively writing job logs
and other files, the /teuthology-archive directory didn't get entirely wiped out.
Here is a list of directories we lost:
bz
cephdrop (drop.ceph.com)
cephfs-perf
chacra (chacra.ceph.com)
containers (quay.ceph.io)
dgalloway
diskprediction_config.txt
doug-is-great
el8
filedump.ceph.com
firmware
home.backup01
home.gitbuilder-archive
job1.0.0
jspray.senta02.home.tar.gz
old.repos
post (files submitted using ceph-post-file)
sftp (drop.ceph.com/qa)
shaman
signer (signed upstream release packages)
tmp
traces
While I /did/ have backups of chacra.ceph.com binaries, the amount of
data (> 1TB) backed up was too much to keep snapshots of. My daily
backup script performs an `rsync --delete-delay` so if files are gone on
the source, they get deleted from the backup. This is fine (and
preferred) for backups we have snapshots of. However, the backup script
ran *after* the errant `rm -rf` so unfortunately everything on
chacra.ceph.com is gone. I have patched the backup script to *not*
--delete-delay backups that we don't keep snapshots of.
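For anyone unfamiliar with the flag: --delete-delay makes rsync remove
files from the destination (after the transfer finishes) when they no
longer exist on the source, i.e. the backup is an exact mirror. A rough
sketch of the patched behaviour (not the actual backup script; the paths
and snapshot list are made up) could look like:

    #!/usr/bin/env python3
    # Illustrative sketch only: mirror each source, but only pass
    # --delete-delay for trees that also have point-in-time snapshots,
    # so a deletion on the source cannot erase the sole remaining copy.
    import subprocess

    BACKUPS = {
        '/example/source-a/': '/backup/source-a/',   # hypothetical paths
        '/example/source-b/': '/backup/source-b/',
    }
    SNAPSHOTTED = {'/example/source-a/'}  # trees we keep snapshots of

    for src, dest in BACKUPS.items():
        cmd = ['rsync', '-a']
        if src in SNAPSHOTTED:
            # Safe to mirror deletions: older copies survive in snapshots.
            cmd.append('--delete-delay')
        cmd += [src, dest]
        subprocess.run(cmd, check=True)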
I restored the vagrant and valgrind chacra.ceph.com repos because I saw
teuthology jobs failing due to those missing repos. Kefu also
rebuilt and pushed ceph-libboost 1.72. (THANK YOU, KEFU!)
We started using the quay.ceph.io registry (instead of quay.io) on June
17. Containers pushed to that registry were stored on the LRC as well,
so I had to delete the repo and start over this morning. Anything you
see in the web UI should pull without issue:
https://quay.ceph.io/repository/ceph-ci/ceph?tab=tags
To prevent data loss in the future, Patrick graciously set up new
filesystems and client credentials on the LRC. Because senta{02..04}
are considered developer playgrounds, all users have sudo access. The
sentas now mount /teuthology-archive read-only at /teuthology. If you
need to unzip and inspect log files on a senta, you can do so in
/scratch (another new filesystem on the LRC).
It will likely take weeks of "where did X go" e-mails to mailing lists,
job and build failures, bugs filed, IRC pings, etc. for me to find and
restore everything that was used on a regular basis. I appreciate your
patience and understanding in the meantime.
Take care & be well,
--
David Galloway
Systems Administrator, RDU
Ceph Engineering
IRC: dgalloway
Hey all,
The Sepia LRC got up to 96% full. As you may or may not recall, a full
cluster results in the lab hanging, lost jobs, and other nasty side effects.
So I started to manually clean up some old teuthology logs after
checking with Josh.
Everything from 2016, 2017, and Jan-May of 2018 is gone, but this still
didn't give us the reduction I was hoping for.
Please go through your old logs on /a and remove the .preserve sentinel
files from jobs you no longer need so the prune script can clean them up.
This will list the job directories you are still preserving:
$ find /a/$(whoami)-* -name .preserve -exec dirname {} \;
Thanks,
--
David Galloway
Systems Administrator, RDU
Ceph Engineering
IRC: dgalloway