RHEL 8.4 went GA on Tuesday, and we now have FOG images for it in the lab.
The qa yamls should be updated accordingly.
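For anyone updating the qa yamls, the change is presumably just bumping the version fields, along the lines of the fragment below (the exact file paths and keys depend on the suite, so treat this as a sketch):

```yaml
# Hypothetical qa yaml fragment; exact file and keys depend on the suite.
os_type: rhel
os_version: "8.4"
```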
As a side note, I'm going to start cleaning up old versions more
aggressively. I had incorrectly assumed nothing was using the RHEL 8.0
or 8.1 images anymore so I deleted those repos from the Satellite
server. RHEL 8.0, for example, is over 2 years old.
The latest distros we have and should be targeting now are:
- Ubuntu 20.04 (Focal)
- CentOS 8.3
- RHEL 8.4
- CentOS 8 and 9 Stream
--
David Galloway
Senior Systems Administrator
Ceph Engineering
After getting CentOS 8 working, I spent Friday adding the latest CentOS 9 Stream compose to the Sepia lab.
Same deal as with CentOS 8: we have FOG images for the smithi and gibba machine types.
teuthology-lock --lock-many 1 -m smithi --os-type centos --os-version 9.stream
Some important differences:
- Missing packages: podman-docker, ant, libev-devel, python3-nose, python3-virtualenv, dbench, iozone
- No lab-extras
- No EPEL9
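If a suite depends on any of those packages, it may be worth flagging them before scheduling. A hedged sketch (`list_missing` is a hypothetical helper, not part of teuthology):

```shell
# Hypothetical helper: print each name for which the query command fails.
# On a CentOS 9 Stream node the query would be "rpm -q".
list_missing() {
    query="$1"; shift
    for pkg in "$@"; do
        $query "$pkg" >/dev/null 2>&1 || echo "$pkg"
    done
}
# Example (on a test node):
#   list_missing "rpm -q" podman-docker ant libev-devel python3-nose \
#       python3-virtualenv dbench iozone
```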
Not sure how useful it'll be, but have fun.
--
David Galloway
Senior Systems Administrator
Ceph Engineering
I have no idea what moving parts need to be adjusted in teuthology (if any), but we have CentOS 8 Stream FOG images for the smithi and gibba machine types.
I tried to make it easy to adapt to by simply using 'stream' as the minor version number.
teuthology-lock --lock-many 1 -m smithi --os-type centos --os-version 8.stream
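Assuming teuthology splits the os-version string on the first dot (an assumption about its internals, not verified), the 'stream' trick should parse the same way a numeric minor version does:

```shell
# Sketch: "8.stream" splits into major/minor just like "8.3" would.
os_version="8.stream"
major="${os_version%%.*}"   # "8"
minor="${os_version#*.}"    # "stream"
```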
Still working on CentOS 9.
Let me know if you run into any issues.
--
David Galloway
Senior Systems Administrator
Ceph Engineering
Hi all,
I wanted to provide an RCA for the outage you may have been affected by yesterday. Some services that went down:
- All CI/testing
- quay.ceph.io
- telemetry.ceph.com (your cluster may have gone into HEALTH_WARN if you report telemetry data)
- lists.ceph.io (so all mailing lists)
All of our critical infra runs in a Red Hat Virtualization (RHV) instance backed by Red Hat Gluster Storage (RHGS). Before you go, "wait... Gluster?" Yes, this cluster was set up before Ceph was supported as backend storage for RHV/RHEV.
The root cause of the outage is that the Gluster volumes filled to 100%. Once no writes were possible, RHV paused all the VMs.
Why didn't monitoring catch this? I honestly don't know.
# grep ssdstore01 nagios-05-*2021* | grep Disk
nagios-05-01-2021-00.log:[1619740800] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-02-2021-00.log:[1619827200] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-03-2021-00.log:[1619913600] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-04-2021-00.log:[1620000000] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-05-2021-00.log:[1620086400] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-06-2021-00.log:[1620172800] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-07-2021-00.log:[1620259200] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-08-2021-00.log:[1620345600] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-09-2021-00.log:[1620432000] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-10-2021-00.log:[1620518400] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-11-2021-00.log:[1620604800] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
Yet RHV knew we were running out of space. I don't have e-mail notifications set up in RHV, however.
# zgrep "disk space" engine*202105*.gz | cut -d ',' -f4 | head -n 10
Low disk space. hosted_storage domain has 24 GB of free space.
Low disk space. hosted_storage domain has 24 GB of free space.
Low disk space. hosted_storage domain has 23 GB of free space.
Low disk space. hosted_storage domain has 23 GB of free space.
Low disk space. hosted_storage domain has 23 GB of free space.
Low disk space. hosted_storage domain has 23 GB of free space.
Low disk space. hosted_storage domain has 23 GB of free space.
Low disk space. hosted_storage domain has 21 GB of free space.
Low disk space. hosted_storage domain has 20 GB of free space.
Low disk space. hosted_storage domain has 11 GB of free space.
Our Nagios instance runs this script to check disk space: https://github.com/ceph/ceph-cm-ansible/blob/master/roles/common/files/libe…
You can ignore the comment about it only working for EXT2.
[root@ssdstore01 ~]# /usr/libexec/diskusage.pl 90 95
Disks are OK now
I then ran the check manually on one of the storage hosts and intentionally set the critical threshold below the current usage percentage:
[root@ssdstore01 ~]# df -h | grep 'Size\|gluster'
Filesystem Size Used Avail Use% Mounted on
/dev/md124 8.8T 6.7T 2.1T 77% /gluster
[root@ssdstore01 ~]# /usr/libexec/diskusage.pl 95 70
/gluster is at 77%
[root@ssdstore01 ~]# echo $?
2
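For reference, that exit code follows the standard Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL). A minimal sketch of that convention, not the actual diskusage.pl logic:

```shell
# Minimal Nagios-style threshold check: usage%, warn%, crit%.
# Return codes: 0 = OK, 1 = WARNING, 2 = CRITICAL.
check_usage() {
    use=$1; warn=$2; crit=$3
    if [ "$use" -ge "$crit" ]; then
        echo "CRITICAL: ${use}%"; return 2
    elif [ "$use" -ge "$warn" ]; then
        echo "WARNING: ${use}%"; return 1
    fi
    echo "OK: ${use}%"; return 0
}
# check_usage 77 95 70 returns 2, matching the manual run above.
```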
When I logged in to the storage hosts yesterday morning, the /gluster mount was at 100%, so Nagios should have caught it.
How'd it get fixed? I happened to have some large-capacity drives lying around that fit the storage nodes. They're slated for a different project soon, but in the meantime I was able to install them, add "bricks" to the Gluster storage, and rebalance the data. Once that was done, I restarted all the VMs and deleted old VMs and snapshots I no longer needed.
How do we keep this from happening again? Well, as you may have deduced, we were losing space at a rate of 1-10 GB/day. As the df output above shows, the Gluster volume now has 2.1 TB free, so even if we grew by 10 GB/day again, we'd be okay for 200ish days.
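The headroom math above, spelled out (treating 2.1 TB as roughly 2100 GB):

```shell
# Days of headroom at the worst observed burn rate.
free_gb=2100            # ~2.1 TB free on /gluster
burn_gb_per_day=10      # worst case from the RHV warnings
days=$(( free_gb / burn_gb_per_day ))
echo "${days} days"     # 210 days
```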
I aim to have some (if not all) of these services moved off this platform and onto an OpenShift cluster backed by Ceph this year. Sadly, I just don't think I have enough logging enabled to nail down exactly what happened.
--
David Galloway
Senior Systems Administrator
Ceph Engineering