Hello,
- gibba nodes are used inefficiently
- used a lot closer to the end of the major release cycle (or for
specific projects, e.g. mclock), but largely idle in the middle of
the release cycle
- a considerable waste of hardware resources if used only to exercise
upgrading to some (currently reef) backport releases
- proposal to release gibba nodes for teuthology (Patrick)
- for special-purpose suites where jobs require more nodes and/or
more time than usual (e.g. running for 10h with 6-8 nodes)?
- run tests for different components on the same cluster
concurrently, this is lacking today except for a few bits in
upgrade suites
- ... or even just existing suites (Casey)
- need Neha to weigh in as gibba cluster caretaker
- 18.2.1 blockers
- MDS crashing on old kernel clients
- https://github.com/ceph/ceph/pull/54677 is a temporary stop-gap
change in smoke and powercycle suites needed for reproducing the crash
- increases the number of jobs in reef (scheduling with --subset
would defeat the purpose of the change)
- needs ack from core
- https://github.com/ceph/ceph/pull/54407 is the fix
- Venky to test with amended smoke suite, merge and hand off to
Yuri for LRC upgrade
- discussion on test suite changes would be held separately
- https://tracker.ceph.com/issues/63618 (next item)
- potential data corruption in bluestore (!!!)
- can occur under heavy fragmentation if db is co-located with the
main device or after bluefs spillover to the main device, when the
main device is configured with 64k alloc size
- affects OSDs that were upgraded without redeploying from octopus
and earlier releases
- a crash on ceph_assert(available >= allocated) during OSD startup
is an indicator
- more likely than actual data corruption? (Igor)
- Laura to check telemetry for instances of this assert
- assumed to be caused by
https://github.com/ceph/ceph/pull/48854
which shipped in 18.2.0 and was backported to 16.2.14 and 17.2.6,
meaning that all release streams are vulnerable
- tracked in
https://tracker.ceph.com/issues/63618 (hit on 17.2.7)
- https://tracker.ceph.com/issues/62282 was hit by Adam on 17.2.6;
Igor believes the root cause to be the same
- for now, this is a blocker for 16.2.15 and 18.2.1
- might necessitate hot fixes (also for quincy)
- regression for RHEL tests on main ("nothing provides lua-devel")
- https://tracker.ceph.com/issues/63672
- 42 pacific PRs left to be triaged
- https://github.com/ceph/ceph/pulls?q=is%3Aopen+is%3Apr+milestone%3Apacific
- move to v16.2.15 milestone or close PR and reject backport
Thanks,
Ilya