I have no idea what moving parts need to be adjusted in teuthology (if any), but we have CentOS 8 Stream FOG images for the smithi and gibba machine types.
I tried to make it simple to adapt to by just using 'stream' as the minor version number.
teuthology-lock --lock-many 1 -m smithi --os-type centos --os-version 8.stream
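The same invocation should work for the other machine type mentioned above; only the -m argument changes, e.g.:
teuthology-lock --lock-many 1 -m gibba --os-type centos --os-version 8.stream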
Still working on CentOS 9.
Let me know if you run into any issues.
--
David Galloway
Senior Systems Administrator
Ceph Engineering
Hi all,
I wanted to provide an RCA for the outage you may have been affected by yesterday. Some services that went down:
- All CI/testing
- quay.ceph.io
- telemetry.ceph.com (your cluster may have gone into HEALTH_WARN if you report telemetry data)
- lists.ceph.io (so all mailing lists)
All of our critical infra is running in a Red Hat Virtualization (RHV) instance backed by Red Hat Gluster Storage (RHGS). Before you go, "wait... Gluster?" Yes, this cluster was set up before Ceph was supported as a storage backend for RHV/RHEV.
The root cause of the outage is that the Gluster volumes got 100% full. Once no writes were possible, RHV paused all the VMs.
Why didn't monitoring catch this? I honestly don't know.
# grep ssdstore01 nagios-05-*2021* | grep Disk
nagios-05-01-2021-00.log:[1619740800] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-02-2021-00.log:[1619827200] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-03-2021-00.log:[1619913600] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-04-2021-00.log:[1620000000] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-05-2021-00.log:[1620086400] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-06-2021-00.log:[1620172800] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-07-2021-00.log:[1620259200] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-08-2021-00.log:[1620345600] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-09-2021-00.log:[1620432000] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-10-2021-00.log:[1620518400] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-11-2021-00.log:[1620604800] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
Yet RHV knew we were running out of space. I don't have e-mail notifications set up in RHV, however.
# zgrep "disk space" engine*202105*.gz | cut -d ',' -f4 | head -n 10
Low disk space. hosted_storage domain has 24 GB of free space.
Low disk space. hosted_storage domain has 24 GB of free space.
Low disk space. hosted_storage domain has 23 GB of free space.
Low disk space. hosted_storage domain has 23 GB of free space.
Low disk space. hosted_storage domain has 23 GB of free space.
Low disk space. hosted_storage domain has 23 GB of free space.
Low disk space. hosted_storage domain has 23 GB of free space.
Low disk space. hosted_storage domain has 21 GB of free space.
Low disk space. hosted_storage domain has 20 GB of free space.
Low disk space. hosted_storage domain has 11 GB of free space.
Our nagios instances run this to check disk space: https://github.com/ceph/ceph-cm-ansible/blob/master/roles/common/files/libe…
You can ignore the comment about it only working for EXT2.
[root@ssdstore01 ~]# /usr/libexec/diskusage.pl 90 95
Disks are OK now
I ran this manually on one of the storage hosts and intentionally set an alert threshold to a number lower than the current usage percentage.
[root@ssdstore01 ~]# df -h | grep 'Size\|gluster'
Filesystem Size Used Avail Use% Mounted on
/dev/md124 8.8T 6.7T 2.1T 77% /gluster
[root@ssdstore01 ~]# /usr/libexec/diskusage.pl 95 70
/gluster is at 77%
[root@ssdstore01 ~]# echo $?
2
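For context, the check just compares each filesystem's use percentage from df against the two thresholds and exits with the standard nagios plugin codes (0 OK, 1 WARNING, 2 CRITICAL). A rough shell equivalent, purely for illustration and not the actual diskusage.pl, looks something like:

WARN=90
CRIT=95
df -P | awk -v warn="$WARN" -v crit="$CRIT" '
NR > 1 {
    use = substr($5, 1, length($5) - 1) + 0   # "77%" -> 77
    if (use >= crit) { printf "%s is at %d%%\n", $6, use; status = 2 }
    else if (use >= warn) { printf "%s is at %d%%\n", $6, use; if (status < 2) status = 1 }
}
END {
    if (status == 0) print "Disks are OK now"
    exit status
}'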
When I logged in to the storage hosts yesterday morning, the /gluster mount was at 100%. So nagios should have known.
How'd it get fixed? I happened to have some large-capacity drives lying around that fit the storage nodes. They're being installed in a different project soon, but in the meantime I was able to add these drives, add "bricks" to the Gluster storage, then rebalance the data. Once that was done, I was able to restart all the VMs and delete old VMs and snapshots I no longer needed.
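For anyone curious, the brick and rebalance steps are standard Gluster volume operations. Roughly (the volume name and new brick path below are made up for illustration; on a replicated volume bricks also have to be added in multiples of the replica count):

gluster volume add-brick <volname> ssdstore01:/gluster/newbrick
gluster volume rebalance <volname> start
gluster volume rebalance <volname> status    # poll until the rebalance completes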
How do we keep this from happening again? Well, as you may have been able to deduce... we were running out of space at a rate of 1-10 GB/day. As you can see now, the Gluster volume has 2.1TB of space left. So even if we grew by 10GB/day again, we'd be okay for 200ish days.
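(Roughly: 2.1 TB is about 2,100 GB of headroom, and 2,100 GB / 10 GB per day is about 210 days.)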
I aim to have some (if not all) of these services moved off this platform and into an OpenShift cluster backed by Ceph this year. Sadly, I just don't think I have enough logging enabled to nail down exactly what happened.
--
David Galloway
Senior Systems Administrator
Ceph Engineering
Hi everyone,
In June 2021, we're hosting a month of Ceph presentations, lightning
talks, and unconference sessions such as BOFs. There is no
registration or cost to attend this event.
The CFP is now open until May 12th.
https://ceph.io/events/ceph-month-june-2021/cfp
Speakers will receive confirmation that their presentation is accepted
and further instructions for scheduling by May 16th.
The schedule will be available on May 19th.
Join the Ceph community as we discuss how Ceph, the massively
scalable, open-source, software-defined storage system, can radically
improve the economics and management of data storage for your
enterprise.
--
Mike Perez
Hello,
I'm digging way back here - in
https://tracker.ceph.com/issues/12405#note-12, Sage said:
> 2017-11-29
> Newer kernels fix syncfs(2) to use a dirty inode list for this. (The fix isn't in the latest el7 kernel(s) yet, though.)
Try as I might, I can't seem to narrow down which upstream kernel
version gained this improvement. Any chance that someone remembers?
Thanks!
Josh
Bonjour,
Here is a proposal for onboarding new members of the Ceph Stable Releases team[0]:
* Say Hi! on the IRC channel[1]
* Kindly ask to be appointed "Backporter" in the tracker[2]
* Read how to submit backports[3]
* Pick a backport in Octopus that has status "New"[4] and work on it.
If you're lucky, the "work on it" part will consist of running a single command and watching the pull request pass all tests. You can then move on to the next backport, until you're not so lucky and there is a conflict or the tests fail. That's when the real work begins and a human brain is useful.
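To give a concrete feel for the easy case: a no-conflict backport boils down to cherry-picking the original commit(s) onto the stable branch and opening a pull request. A sketch, with a made-up tracker issue number and branch names, and assuming a remote called "ceph" pointing at the upstream repository ([3] describes the actual workflow and tooling):

git checkout -b wip-12345-octopus ceph/octopus   # 12345 is a hypothetical tracker issue
git cherry-pick -x <sha-of-the-original-commit>
git push <your-fork> wip-12345-octopus
# ...then open a pull request targeting the octopus branch.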
Cheers
P.S. A word of advice: if you're exceptionally lucky and manage to submit ten backports that have no conflict and are all green, it is advisable to stop there. There may be unexpected problems later on (when teuthology tests are run[5]) that will require your attention. And you may be in trouble if you have too many pull requests in flight and not enough time to address them all.
[0] https://tracker.ceph.com/projects/ceph-releases
[1] irc://oftc.net/#ceph-backports
[2] https://tracker.ceph.com/projects/ceph
[3] https://github.com/ceph/ceph/blob/master/SubmittingPatches-backports.rst
[4] https://tracker.ceph.com/projects/ceph/issues?query_id=199
[5] https://github.com/ceph/ceph/blob/master/qa/crontab/teuthology-cronjobs etc.
--
Loïc Dachary, Artisan Logiciel Libre
The purpose of this email is to trigger a discussion on how we rectify
the following situation so client commands are executed by the monitor
only once and in the order they are submitted.
https://tracker.ceph.com/issues/49428 describes a scenario where we
can end up with commands executed more than once and out of order
according to the client.
In that tracker a client sends an 'erasure-code-profile rm' to mon.c
and immediately receives an injected connection failure. The client
then connects to mon.a and reissues the 'rm' command, which returns the
expected 'does not exist' result. The client code then issues an
'erasure-code-profile set' command. Shortly after this the original
'rm' command is forwarded from mon.c to mon.a; as a result the 'set'
command is cancelled, and both the command and the test fail.
From the client side, the commands executed look like this:
erasure-code-profile rm
erasure-code-profile set
From the mon side, it looks like this:
erasure-code-profile rm
erasure-code-profile set
erasure-code-profile rm
I appreciate any feedback on the best way to tackle this one.
--
Cheers,
Brad