RHEL 8.4 went GA on Tuesday, and we now have FOG images for it in the lab.
The qa yamls should be updated accordingly.
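For anyone updating the qa yamls, the change is presumably just bumping the version fields, along the lines of the fragment below (the exact file paths and keys depend on the suite, so treat this as a sketch):

```yaml
# Hypothetical qa yaml fragment; exact file and keys depend on the suite.
os_type: rhel
os_version: "8.4"
```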
As a side note, I'm going to start cleaning up old versions more
aggressively. I had incorrectly assumed nothing was using the RHEL 8.0
or 8.1 images anymore so I deleted those repos from the Satellite
server. RHEL 8.0, for example, is over 2 years old.
The latest distros we have and should be targeting now are:
- Ubuntu 20.04 (Focal)
- CentOS 8.3
- RHEL 8.4
- CentOS 8 and 9 Stream
--
David Galloway
Senior Systems Administrator
Ceph Engineering
After getting CentOS 8 working, I spent Friday adding the latest CentOS 9 Stream compose to the Sepia lab.
Same deal as with CentOS 8: we have FOG images for the smithi and gibba machine types.
teuthology-lock --lock-many 1 -m smithi --os-type centos --os-version 9.stream
Some important differences:
- Missing packages: podman-docker, ant, libev-devel, python3-nose, python3-virtualenv, dbench, iozone
- No lab-extras
- No EPEL9
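If a suite depends on any of those packages, it may be worth flagging them before scheduling. A hedged sketch (`list_missing` is a hypothetical helper, not part of teuthology):

```shell
# Hypothetical helper: print each name for which the query command fails.
# On a CentOS 9 Stream node the query would be "rpm -q".
list_missing() {
    query="$1"; shift
    for pkg in "$@"; do
        $query "$pkg" >/dev/null 2>&1 || echo "$pkg"
    done
}
# Example (on a test node):
#   list_missing "rpm -q" podman-docker ant libev-devel python3-nose \
#       python3-virtualenv dbench iozone
```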
Not sure how useful it'll be, but have fun.
--
David Galloway
Senior Systems Administrator
Ceph Engineering
I have no idea what moving parts need to be adjusted in teuthology (if any), but we have CentOS 8 Stream FOG images for the smithi and gibba machine types.
I tried to make it easy to adapt to by simply using 'stream' as the minor version number.
teuthology-lock --lock-many 1 -m smithi --os-type centos --os-version 8.stream
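Assuming teuthology splits the os-version string on the first dot (an assumption about its internals, not verified), the 'stream' trick should parse the same way a numeric minor version does:

```shell
# Sketch: "8.stream" splits into major/minor just like "8.3" would.
os_version="8.stream"
major="${os_version%%.*}"   # "8"
minor="${os_version#*.}"    # "stream"
```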
Still working on CentOS 9.
Let me know if you run into any issues.
--
David Galloway
Senior Systems Administrator
Ceph Engineering
Hi all,
I wanted to provide an RCA for the outage you may have been affected by yesterday. Some services that went down:
- All CI/testing
- quay.ceph.io
- telemetry.ceph.com (your cluster may have gone into HEALTH_WARN if you report telemetry data)
- lists.ceph.io (so all mailing lists)
All of our critical infra runs in a Red Hat Virtualization (RHV) instance backed by Red Hat Gluster Storage (RHGS). Before you go, "wait... Gluster?" Yes, this cluster was set up before Ceph was supported as backend storage for RHV/RHEV.
The root cause of the outage is that the Gluster volumes filled to 100%. Once no writes were possible, RHV paused all the VMs.
Why didn't monitoring catch this? I honestly don't know.
# grep ssdstore01 nagios-05-*2021* | grep Disk
nagios-05-01-2021-00.log:[1619740800] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-02-2021-00.log:[1619827200] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-03-2021-00.log:[1619913600] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-04-2021-00.log:[1620000000] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-05-2021-00.log:[1620086400] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-06-2021-00.log:[1620172800] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-07-2021-00.log:[1620259200] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-08-2021-00.log:[1620345600] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-09-2021-00.log:[1620432000] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-10-2021-00.log:[1620518400] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-11-2021-00.log:[1620604800] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
Yet RHV knew we were running out of space. I don't have e-mail notifications set up in RHV, however.
# zgrep "disk space" engine*202105*.gz | cut -d ',' -f4 | head -n 10
Low disk space. hosted_storage domain has 24 GB of free space.
Low disk space. hosted_storage domain has 24 GB of free space.
Low disk space. hosted_storage domain has 23 GB of free space.
Low disk space. hosted_storage domain has 23 GB of free space.
Low disk space. hosted_storage domain has 23 GB of free space.
Low disk space. hosted_storage domain has 23 GB of free space.
Low disk space. hosted_storage domain has 23 GB of free space.
Low disk space. hosted_storage domain has 21 GB of free space.
Low disk space. hosted_storage domain has 20 GB of free space.
Low disk space. hosted_storage domain has 11 GB of free space.
Our Nagios instance runs this script to check disk space: https://github.com/ceph/ceph-cm-ansible/blob/master/roles/common/files/libe…
You can ignore the comment about it only working for EXT2.
[root@ssdstore01 ~]# /usr/libexec/diskusage.pl 90 95
Disks are OK now
I then ran the check manually on one of the storage hosts and intentionally set the critical threshold below the current usage percentage:
[root@ssdstore01 ~]# df -h | grep 'Size\|gluster'
Filesystem Size Used Avail Use% Mounted on
/dev/md124 8.8T 6.7T 2.1T 77% /gluster
[root@ssdstore01 ~]# /usr/libexec/diskusage.pl 95 70
/gluster is at 77%
[root@ssdstore01 ~]# echo $?
2
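For reference, that exit code follows the standard Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL). A minimal sketch of that convention, not the actual diskusage.pl logic:

```shell
# Minimal Nagios-style threshold check: usage%, warn%, crit%.
# Return codes: 0 = OK, 1 = WARNING, 2 = CRITICAL.
check_usage() {
    use=$1; warn=$2; crit=$3
    if [ "$use" -ge "$crit" ]; then
        echo "CRITICAL: ${use}%"; return 2
    elif [ "$use" -ge "$warn" ]; then
        echo "WARNING: ${use}%"; return 1
    fi
    echo "OK: ${use}%"; return 0
}
# check_usage 77 95 70 returns 2, matching the manual run above.
```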
When I logged in to the storage hosts yesterday morning, the /gluster mount was at 100%, so Nagios should have caught it.
How'd it get fixed? I happened to have some large-capacity drives lying around that fit the storage nodes. They're slated for a different project soon, but in the meantime I was able to install them, add "bricks" to the Gluster storage, and rebalance the data. Once that was done, I restarted all the VMs and deleted old VMs and snapshots I no longer needed.
How do we keep this from happening again? Well, as you may have deduced, we were losing space at a rate of 1-10 GB/day. As the df output above shows, the Gluster volume now has 2.1 TB free, so even if we grew by 10 GB/day again, we'd be okay for 200ish days.
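The headroom math above, spelled out (treating 2.1 TB as roughly 2100 GB):

```shell
# Days of headroom at the worst observed burn rate.
free_gb=2100            # ~2.1 TB free on /gluster
burn_gb_per_day=10      # worst case from the RHV warnings
days=$(( free_gb / burn_gb_per_day ))
echo "${days} days"     # 210 days
```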
I aim to have some (if not all) of these services moved off this platform and onto an OpenShift cluster backed by Ceph this year. Sadly, I just don't think I have enough logging enabled to nail down exactly what happened.
--
David Galloway
Senior Systems Administrator
Ceph Engineering