As you may know, the Sepia Long Running Cluster has been hitting
capacity limits over the past week or so. This has resulted in service
disruptions to teuthology runs, chacra.ceph.com,
docker-mirror.front.sepia.ceph.com, and quay.ceph.io.
We've been able to get by by deleting and compressing logs more
aggressively, but that's neither ideal nor sustainable.
Patrick has created a new erasure coded pool/filesystem that will allow
us to keep the same amount of logs but use less space. In order to have
teuthology workers start writing logs to that pool, we need to take an
outage.
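For context, creating an EC data pool and attaching it to a CephFS
filesystem looks roughly like this; the profile, pool, and filesystem
names below are placeholders for illustration, not the ones actually
used in Sepia:

$ ceph osd erasure-code-profile set teuth-ec k=4 m=2
$ ceph osd pool create cephfs.teuthology.data-ec 64 64 erasure teuth-ec
$ ceph osd pool set cephfs.teuthology.data-ec allow_ec_overwrites true  # required before CephFS can use an EC pool
$ ceph fs add_data_pool teuthology cephfs.teuthology.data-ec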
At 0400 UTC 19AUG2020, I will instruct all teuthology workers to die
after their running jobs finish. At 1300 UTC, I will kill any jobs that
are still running. This gives the lab 9 hours to gracefully shut down.
At that point, we will switch the mountpoint on teuthology.front over to
the new EC pool and start storing new logs there.
At the same time, Patrick will start migrating logs on the existing/old
pool to the new pool. This means that logs from 7/20 through 8/19 will
be unavailable (you'll see 404s) via the Pulpito web UI and qa-proxy
URLs until they're migrated to the new EC pool.
Let me know if you have any questions/concerns.
Thanks,
--
David Galloway
Systems Administrator, RDU
Ceph Engineering
IRC: dgalloway
Hi everyone,
Some of you may be encountering failures due to "MDS_ALL_DOWN" or
"MDS_UP_LESS_THAN_MAX" in teuthology runs. That's because
https://github.com/ceph/ceph/pull/36527 and
https://github.com/ceph/teuthology/pull/1545 have now been merged. Please
update your local teuthology repo to pick up the teuthology change; you
should not see these failures after that.
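If it helps, picking up the change is just a pull of the upstream
branch; something along these lines, assuming your remote for
ceph/teuthology is named "origin" and you're on master:

$ cd teuthology
$ git pull --ff-only origin master
$ ./bootstrap   # re-run to refresh the virtualenv if dependencies changed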
Thanks,
Neha
On November 1st, docker.io will rate limit anonymous pulls to 100 per 6
hours [1]. Which upstream Ceph CI jobs might be affected by this? E.g.
the ceph-ansible CI.
The OpenStack TripleO project has CI jobs pulling the Ceph container
from Docker Hub [2]. TripleO is already affected by the current rate
limits, but the problem will get worse in November, and the project is
considering ideas [3], including local container builds for each job.
At the moment those builds would only cover the OpenStack containers,
and I'm thinking about how to keep the Ceph container in the TripleO CI.
Is anyone in the Ceph community thinking about this?
Thanks,
John
[1] https://www.docker.com/blog/scaling-docker-to-serve-millions-more-developer…
[2] https://hub.docker.com/r/ceph/daemon/
[3] http://lists.openstack.org/pipermail/openstack-discuss/2020-July/016116.html
There is a general documentation meeting called the "DocuBetter Meeting",
and it is held every two weeks. The next DocuBetter Meeting will be on
August 26, 2020 at 1800 PDT, and will run for thirty minutes. Everyone with
a documentation-related request or complaint is invited. The meeting will
be held here: https://bluejeans.com/908675367
This meeting will continue the discussion of the reorganization of the
docs.ceph.com website.
Send documentation-related requests and complaints to me by replying to
this email and CCing me at zac.dover@gmail.com.
The next DocuBetter meeting is scheduled for:
26 Aug 2020 1800 PDT
27 Aug 2020 0100 UTC
27 Aug 2020 1100 AEST
Etherpad: https://pad.ceph.com/p/Ceph_Documentation
Meeting: https://bluejeans.com/908675367
Thanks, everyone.
Zac Dover
tl;dr: archived runs on pulpito/qa-proxy should work better
The recent transition of teuthology log storage to an EC pool carried
out by dgalloway and batrick (thank you both!) involved compressing
some of the run history files, like the teuthology.log file, to save space.
This meant that pulpito result pages (also shown at qa-proxy.ceph.com)
had broken links, as they referred to "teuthology.log" rather than
"teuthology.log.gz".
It turns out there's a simple nginx config option that causes it to
first look for the non-gz file and, if that's not found, look for the
.gz file; if the client indicates it can handle compressed content, the
compressed file is returned as-is. I've implemented that.
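For reference, here's a minimal sketch of what such a config can look
like; the location and root are placeholders, and it uses nginx's
gzip_static/gunzip modules as one way to get roughly that behavior, not
necessarily the exact directives now in place on qa-proxy:

    location /teuthology-archive/ {
        root /data;
        # Serve the precompressed ".gz" variant when it exists,
        # labeled with "Content-Encoding: gzip" ...
        gzip_static always;
        # ... and decompress it on the fly for clients that don't
        # advertise gzip support.
        gunzip on;
    }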
Mostly what this means is that links from old runs should just work; it
also means that we can adopt storing the logs in .gz form as a matter of
course going forward, so I'll also be filing an RFE for teuthology to
compress what it can when storing the archives.
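For what it's worth, the on-disk side of that is nothing more exotic
than gzipping the per-run logs; with a made-up archive path, something
like:

$ find /path/to/teuthology-archive -name teuthology.log -exec gzip -9 {} +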
Questions/comments to me please, and enjoy.
Hi all,
There appears to be a performance regression going from 15.2.4 to HEAD. I
first realized this when testing my patches to Ceph on an 8-node cluster,
but it is easily reproducible on *vanilla* Ceph with vstart as well, using
the following steps:
$ git clone https://github.com/ceph/ceph.git && cd ceph
$ ./do_cmake.sh -DCMAKE_BUILD_TYPE=RelWithDebInfo -DWITH_MANPAGE=OFF \
    -DWITH_BABELTRACE=OFF -DWITH_MGR_DASHBOARD_FRONTEND=OFF \
    && cd build && make -j32 vstart
$ MON=1 OSD=1 MDS=0 ../src/vstart.sh --debug --new --localhost --bluestore \
    --bluestore-devs /dev/xxx
$ sudo ./bin/ceph osd pool create foo 32 32
$ sudo ./bin/rados bench -p foo 100 write --no-cleanup
With the old hard drive that I have (Hitachi HUA72201), I'm getting an
average throughput of 60 MiB/s. When I switch to v15.2.4 (git checkout
v15.2.4), rebuild, and repeat the experiment, I get an average
throughput of 90 MiB/s. I've reliably reproduced a similar difference
between 15.2.4 and HEAD by building release packages and running them on an
8-node cluster.
Is this expected or is this a performance regression?
Thanks!