Hi folks,
We just migrated ceph:teuthology and all of the tests under qa/ in
ceph:ceph to Python 3. From now on, teuthology-worker runs in a
Python 3 environment by default unless told otherwise with
"--teuthology-branch py2".
This means:
- tests in master need to be written in Python 3 from now on,
- teuthology should be Python 3 compatible, and
- teuthology bug fixes should be backported to the "py2" branch.
If you run into any Python 3 related issues caused by these changes,
please let me know and I will try to fix them ASAP.
Currently, the tests under the qa/ directory in the ceph:ceph master
branch are compatible with both Python 2 and Python 3, but now that
we've moved to Python 3 there is no need to stay Python 2 compatible
anymore. Since the Sepia lab is still running Ubuntu xenial, we cannot
use features introduced in Python 3.6 yet. We do plan to upgrade the OS
to bionic soon; until that happens, the tests need to stay compatible
with Python 3.5.
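For example (an illustrative snippet only; the variable names are made
up), xenial ships Python 3.5, so f-strings and other 3.6+ syntax have
to be avoided in qa/ until the lab moves to bionic:

    # Hypothetical qa/ helper code, shown only to illustrate the 3.5 constraint.
    role = 'client.0'
    path = '/var/log/ceph'

    # NOT OK on xenial -- f-strings were only added in Python 3.6:
    #   cmd = f'ls {path} on {role}'

    # OK on Python 3.5 -- use str.format() (or %-formatting) instead:
    cmd = 'ls {} on {}'.format(path, role)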
The next steps are to:
- drop Python 2 support in the ceph:ceph master branch (see the sketch
  after this list),
- drop Python 2 support in ceph:teuthology master, and
- backport the Python 3 compatible changes to octopus and nautilus to
  ease the pain of future backports.
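To give a rough idea of what dropping Python 2 support looks like in
practice (a hypothetical example, not a specific patch), the usual
py2/py3 compatibility shims can simply be removed once Python 3 is the
only target:

    # Before (py2/py3 compatible), the kind of shims that can go away:
    #   from __future__ import print_function
    #   import six
    #   from six.moves import StringIO
    #
    # After (py3 only):
    from io import StringIO

    def summarize(remote_output: bytes) -> str:
        """Decode raw command output; no six needed for bytes/str handling."""
        data = remote_output.decode('utf-8')
        buf = StringIO()
        print(data, file=buf)   # print() is always a function in Python 3
        return buf.getvalue()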
--
Regards
Kefu Chai
Around 18JUN2020 0700 UTC, an errant `sudo rm -rf ceph` from the root
directory on a senta unfortunately wiped out almost all data on the Ceph
cluster in our upstream Sepia lab (AKA Long Running Cluster or LRC).
Only teuthology job logs were preserved.
My guess is that because teuthology workers were actively writing job logs
and other files, the /teuthology-archive directory didn't get entirely wiped out.
Here is a list of directories we lost:
bz
cephdrop (drop.ceph.com)
cephfs-perf
chacra (chacra.ceph.com)
containers (quay.ceph.io)
dgalloway
diskprediction_config.txt
doug-is-great
el8
filedump.ceph.com
firmware
home.backup01
home.gitbuilder-archive
job1.0.0
jspray.senta02.home.tar.gz
old.repos
post (files submitted using ceph-post-file)
sftp (drop.ceph.com/qa)
shaman
signer (signed upstream release packages)
tmp
traces
While I /did/ have backups of chacra.ceph.com binaries, the amount of
data (> 1TB) backed up was too much to keep snapshots of. My daily
backup script performs an `rsync --delete-delay` so if files are gone on
the source, they get deleted from the backup. This is fine (and
preferred) for backups we have snapshots of. However, the backup script
ran *after* the errant `rm -rf` so unfortunately everything on
chacra.ceph.com is gone. I have patched the backup script to *not*
--delete-delay backups that we don't keep snapshots of.
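For anyone unfamiliar with the flag: --delete-delay makes rsync remove
files from the destination (after the transfer finishes) when they no
longer exist on the source, i.e. the backup is an exact mirror. A rough
sketch of the patched behaviour (not the actual backup script; the paths
and snapshot list are made up) could look like:

    #!/usr/bin/env python3
    # Illustrative sketch only: mirror each source, but only pass
    # --delete-delay for trees that also have point-in-time snapshots,
    # so a deletion on the source cannot erase the sole remaining copy.
    import subprocess

    BACKUPS = {
        '/example/source-a/': '/backup/source-a/',   # hypothetical paths
        '/example/source-b/': '/backup/source-b/',
    }
    SNAPSHOTTED = {'/example/source-a/'}  # trees we keep snapshots of

    for src, dest in BACKUPS.items():
        cmd = ['rsync', '-a']
        if src in SNAPSHOTTED:
            # Safe to mirror deletions: older copies survive in snapshots.
            cmd.append('--delete-delay')
        cmd += [src, dest]
        subprocess.run(cmd, check=True)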
I restored the vagrant and valgrind chacra.ceph.com repos because I saw
teuthology jobs failing due to those missing repos. Kefu also
rebuilt and pushed ceph-libboost 1.72. (THANK YOU, KEFU!)
We started using the quay.ceph.io registry (instead of quay.io) on June
17. Containers pushed to that registry were stored on the LRC as well,
so I had to delete the repo and start over this morning. Anything you
see in the web UI should pull without issue:
https://quay.ceph.io/repository/ceph-ci/ceph?tab=tags
To prevent data loss in the future, Patrick graciously set up new
filesystems and client credentials on the LRC. Because senta{02..04}
are considered developer playgrounds, all users have sudo access. The
sentas now mount /teuthology-archive read-only at /teuthology. If you
need to unzip and inspect log files on a senta, you can do so in
/scratch (another new filesystem on the LRC).
It will likely take weeks of "where did X go" e-mails to mailing lists,
job and build failures, bugs filed, IRC pings, etc. for me to find and
restore everything that was used on a regular basis. I appreciate your
patience and understanding in the meantime.
Take care & be well,
--
David Galloway
Systems Administrator, RDU
Ceph Engineering
IRC: dgalloway
Hey all,
The Sepia LRC got up to 96% full. As you may or may not recall, a full
cluster results in the lab hanging, lost jobs, and other nasty side effects.
So I started to manually clean up some old teuthology logs after
checking with Josh.
Everything from 2016, 2017, and Jan-May of 2018 is gone, but this still
didn't give us the reduction I was hoping for.
Please go through your old logs on /a and remove the .preserve sentinel
files from jobs you no longer need so the prune script can clean them up.
This will list the job directories you are still preserving:
$ find /a/$(whoami)-* -name .preserve -exec dirname {} \;
Thanks,
--
David Galloway
Systems Administrator, RDU
Ceph Engineering
IRC: dgalloway