As you may know, the Sepia Long Running Cluster has been hitting
capacity limits over the past week or so. This has resulted in service
disruptions to teuthology runs, chacra.ceph.com,
docker-mirror.front.sepia.ceph.com, and quay.ceph.io.
We've been able to get by so far by deleting and compressing logs more
aggressively, but that's neither ideal nor sustainable.
Patrick has created a new erasure-coded pool/filesystem that will allow
us to retain the same logs while using less raw space. In order to have
teuthology workers start writing logs to that pool, we need to take an
outage.
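For context, backing a CephFS directory with an erasure-coded pool
generally looks something like the sketch below. The pool, filesystem,
and path names here are illustrative assumptions, not the lab's actual
configuration:

```shell
# Sketch only: pool/fs/path names are assumptions, not the real setup.
# Create an EC pool and enable the partial overwrites CephFS needs:
ceph osd pool create cephfs_logs_ec erasure
ceph osd pool set cephfs_logs_ec allow_ec_overwrites true
# Attach it to the filesystem as an additional data pool:
ceph fs add_data_pool cephfs cephfs_logs_ec
# Direct new files under the log directory to the EC pool:
setfattr -n ceph.dir.layout.pool -v cephfs_logs_ec /teuthology-archive
```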
At 0400 UTC 19AUG2020, I will instruct all teuthology workers to die
after their running jobs finish. At 1300 UTC, I will kill any jobs that
are still running. This gives the lab 9 hours to gracefully shut down.
At that point, we will switch the mountpoint on teuthology.front over to
the new EC pool and start storing new logs there.
At the same time, Patrick will start migrating logs on the existing/old
pool to the new pool. This means that logs from 7/20 through 8/19 will
be unavailable (you'll see 404s) via the Pulpito web UI and qa-proxy
URLs until they're migrated to the new EC pool.
Let me know if you have any questions/concerns.
Thanks,
--
David Galloway
Systems Administrator, RDU
Ceph Engineering
IRC: dgalloway
Hi all,
Due to increases in the amount of testing and the size of logs, the
(Long Running) Ceph cluster in the Sepia lab has been at 95-98% capacity
over the past few days. Since almost everything else on the cluster was
already deleted a few months ago, I need to reduce the amount of test
logs we keep on hand.
Currently we:
- Keep 14 days of passed job logs
- Compress failed job logs older than 30 days
- Delete failed job logs older than 365 days
We will now be deleting failed job logs older than 300 days. We may be
able to increase the cluster's capacity with the purchase of additional
hardware, which I will discuss with the appropriate stakeholders.
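The retention policy above (with the new 300-day cutoff) can be sketched
as a small shell routine. The directory layout, with `pass`/`fail`
appearing in the log paths, is an assumption for illustration, not the
lab's actual structure:

```shell
#!/bin/sh
# Sketch of the retention policy; the 'pass'/'fail' path layout is an
# assumption, not the lab's actual directory structure.
prune_logs() {
    root=$1
    # Keep only 14 days of passed-job logs.
    find "$root" -type f -path '*pass*' -mtime +14 -delete
    # Delete failed-job logs older than 300 days (previously 365).
    find "$root" -type f -path '*fail*' -mtime +300 -delete
    # Compress remaining failed-job logs older than 30 days.
    find "$root" -type f -path '*fail*' -mtime +30 ! -name '*.gz' \
        -exec gzip {} +
}
```

Running compression after the 300-day purge avoids gzipping files that
are about to be deleted anyway.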