I have no idea what moving parts need to be adjusted in teuthology (if any), but we have CentOS 8 Stream FOG images for the smithi and gibba machine types.
I tried to make it simple to adapt to by just using 'stream' as the minor version number.
teuthology-lock --lock-many 1 -m smithi --os-type centos --os-version 8.stream
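The same invocation should work for the other machine type mentioned above; only the -m argument changes, e.g.:
teuthology-lock --lock-many 1 -m gibba --os-type centos --os-version 8.stream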
Still working on CentOS 9.
Let me know if you run into any issues.
--
David Galloway
Senior Systems Administrator
Ceph Engineering
Hi all,
I wanted to provide an RCA for the outage you may have been affected by yesterday. Some services that went down:
- All CI/testing
- quay.ceph.io
- telemetry.ceph.com (your cluster may have gone into HEALTH_WARN if you report telemetry data)
- lists.ceph.io (so all mailing lists)
All of our critical infra is running in a Red Hat Virtualization (RHV) instance backed by Red Hat Gluster Storage (RHGS). Before you go, "wait... Gluster?" Yes, this cluster was set up before Ceph was supported as a storage backend for RHV/RHEV.
The root cause of the outage is that the Gluster volumes got 100% full. Once no writes were possible, RHV paused all the VMs.
Why didn't monitoring catch this? I honestly don't know.
# grep ssdstore01 nagios-05-*2021* | grep Disk
nagios-05-01-2021-00.log:[1619740800] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-02-2021-00.log:[1619827200] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-03-2021-00.log:[1619913600] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-04-2021-00.log:[1620000000] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-05-2021-00.log:[1620086400] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-06-2021-00.log:[1620172800] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-07-2021-00.log:[1620259200] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-08-2021-00.log:[1620345600] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-09-2021-00.log:[1620432000] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-10-2021-00.log:[1620518400] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-11-2021-00.log:[1620604800] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
Yet RHV knew we were running out of space. I don't have e-mail notifications set up in RHV, however.
# zgrep "disk space" engine*202105*.gz | cut -d ',' -f4 | head -n 10
Low disk space. hosted_storage domain has 24 GB of free space.
Low disk space. hosted_storage domain has 24 GB of free space.
Low disk space. hosted_storage domain has 23 GB of free space.
Low disk space. hosted_storage domain has 23 GB of free space.
Low disk space. hosted_storage domain has 23 GB of free space.
Low disk space. hosted_storage domain has 23 GB of free space.
Low disk space. hosted_storage domain has 23 GB of free space.
Low disk space. hosted_storage domain has 21 GB of free space.
Low disk space. hosted_storage domain has 20 GB of free space.
Low disk space. hosted_storage domain has 11 GB of free space.
Our nagios instances run this to check disk space: https://github.com/ceph/ceph-cm-ansible/blob/master/roles/common/files/libe…
You can ignore the comment about it only working for EXT2.
[root@ssdstore01 ~]# /usr/libexec/diskusage.pl 90 95
Disks are OK now
I ran this manually on one of the storage hosts and intentionally set an alert threshold to a number lower than the current usage percentage.
[root@ssdstore01 ~]# df -h | grep 'Size\|gluster'
Filesystem Size Used Avail Use% Mounted on
/dev/md124 8.8T 6.7T 2.1T 77% /gluster
[root@ssdstore01 ~]# /usr/libexec/diskusage.pl 95 70
/gluster is at 77%
[root@ssdstore01 ~]# echo $?
2
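For context, the check just compares each filesystem's use percentage from df against the two thresholds and exits with the standard nagios plugin codes (0 OK, 1 WARNING, 2 CRITICAL). A rough shell equivalent, purely for illustration and not the actual diskusage.pl, looks something like:

WARN=90
CRIT=95
df -P | awk -v warn="$WARN" -v crit="$CRIT" '
NR > 1 {
    use = substr($5, 1, length($5) - 1) + 0   # "77%" -> 77
    if (use >= crit) { printf "%s is at %d%%\n", $6, use; status = 2 }
    else if (use >= warn) { printf "%s is at %d%%\n", $6, use; if (status < 2) status = 1 }
}
END {
    if (status == 0) print "Disks are OK now"
    exit status
}'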
When I logged in to the storage hosts yesterday morning, the /gluster mount was at 100%. So nagios should have known.
How'd it get fixed? I happened to have some large-capacity drives lying around that fit the storage nodes. They're being installed in a different project soon, but in the meantime I was able to add these drives, add "bricks" to the Gluster storage, then rebalance the data. Once that was done, I was able to restart all the VMs and delete old VMs and snapshots I no longer needed.
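For anyone curious, the brick and rebalance steps are standard Gluster volume operations. Roughly (the volume name and new brick path below are made up for illustration; on a replicated volume bricks also have to be added in multiples of the replica count):

gluster volume add-brick <volname> ssdstore01:/gluster/newbrick
gluster volume rebalance <volname> start
gluster volume rebalance <volname> status    # poll until the rebalance completes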
How do we keep this from happening again? Well, as you may have been able to deduce... we were running out of space at a rate of 1-10 GB/day. As you can see now, the Gluster volume has 2.1TB of space left. So even if we grew by 10GB/day again, we'd be okay for 200ish days.
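(Roughly: 2.1 TB is about 2,100 GB of headroom, and 2,100 GB / 10 GB per day is about 210 days.)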
I aim to have some (if not all) of these services moved off this platform and into an OpenShift cluster backed by Ceph this year. Sadly, I just don't think I have enough logging enabled to nail down exactly what happened.
--
David Galloway
Senior Systems Administrator
Ceph Engineering
Hi everyone,
In June 2021, we're hosting a month of Ceph presentations, lightning
talks, and unconference sessions such as BOFs. There is no
registration or cost to attend this event.
The CFP is now open until May 12th.
https://ceph.io/events/ceph-month-june-2021/cfp
Speakers will receive confirmation that their presentation is accepted
and further instructions for scheduling by May 16th.
The schedule will be available on May 19th.
Join the Ceph community as we discuss how Ceph, the massively
scalable, open-source, software-defined storage system, can radically
improve the economics and management of data storage for your
enterprise.
--
Mike Perez
Hello,
I'm digging way back here - in
https://tracker.ceph.com/issues/12405#note-12, Sage said:
> 2017-11-29
> Newer kernels fix syncfs(2) to use a dirty inode list for this. (The fix isn't in the latest el7 kernel(s) yet, though.)
Try as I might, I can't seem to narrow down which upstream kernel
version gained this improvement. Any chance that someone remembers?
Thanks!
Josh
Bonjour,
Here is a proposal for onboarding new members of the Ceph Stable Releases team[0]:
* Say Hi! on the IRC channel[1]
* Kindly ask to be appointed "Backporter" in the tracker[2]
* Read how to submit backports[3]
* Pick a backport in Octopus that has status "New"[4] and work on it.
If you're lucky, the "work on it" part will consist of running a single command and watching the pull request pass all tests. You can then move on to the next backport, until you're not so lucky and there is a conflict or the tests fail. That's when the real work begins and a human brain is useful.
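To give a concrete feel for the easy case: a no-conflict backport boils down to cherry-picking the original commit(s) onto the stable branch and opening a pull request. A sketch, with a made-up tracker issue number and branch names, and assuming a remote called "ceph" pointing at the upstream repository ([3] describes the actual workflow and tooling):

git checkout -b wip-12345-octopus ceph/octopus   # 12345 is a hypothetical tracker issue
git cherry-pick -x <sha-of-the-original-commit>
git push <your-fork> wip-12345-octopus
# ...then open a pull request targeting the octopus branch.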
Cheers
P.S. A word of advice: if you're exceptionally lucky and manage to submit ten backports that have no conflict and are all green, it is advisable to stop there. There may be unexpected problems later on (when teuthology tests are run[5]) that will require your attention. And you may be in trouble if you have too many pull requests in flight and not enough time to address them all.
[0] https://tracker.ceph.com/projects/ceph-releases
[1] irc://oftc.net/#ceph-backports
[2] https://tracker.ceph.com/projects/ceph
[3] https://github.com/ceph/ceph/blob/master/SubmittingPatches-backports.rst
[4] https://tracker.ceph.com/projects/ceph/issues?query_id=199
[5] https://github.com/ceph/ceph/blob/master/qa/crontab/teuthology-cronjobs etc.
--
Loïc Dachary, Artisan Logiciel Libre
The purpose of this email is to trigger a discussion on how we rectify
the following situation so client commands are executed by the monitor
only once and in the order they are submitted.
https://tracker.ceph.com/issues/49428 describes a scenario where we
can end up with commands executed more than once and out of order
according to the client.
In that tracker a client sends an 'erasure-code-profile rm' to mon.c
and immediately receives an injected connection failure. The client
then connects to mon.a and reissues the 'rm' command, which returns the
expected 'does not exist' result. The client code then issues an
'erasure-code-profile set' command. Shortly after this the original
'rm' command is forwarded from mon.c to mon.a; as a result the 'set'
command is cancelled, and both the command and the test fail.
From the client side, the commands executed look like this:
erasure-code-profile rm
erasure-code-profile set
From the mon side, it looks like this:
erasure-code-profile rm
erasure-code-profile set
erasure-code-profile rm
I appreciate any feedback on the best way to tackle this one.
--
Cheers,
Brad