Hi Everyone,
David Orman mentioned in the CLT meeting this morning that there are a
number of people on the mailing list asking about performance
regressions in Pacific+ vs older releases. I want to document a couple
of the bigger ones that we know about for the community's benefit. I
want to be clear that Pacific does have a number of performance
improvements over previous releases, and we do have tests showing
improvement relative to Nautilus (especially RBD on NVMe drives). Some
of these regressions are going to have a bigger effect for some users
than others. Having said that, let's get into them.
********** Regression #1: RocksDB Log File Recycling **********
Effects: More metadata updates to the underlying FS, higher
write amplification (observed by DigitalOcean), and slower performance,
especially when the WAL device is saturated.
When bluestore was created back in 2015, Sage implemented an optimization
in RocksDB that allowed WAL log files to be recycled. The idea is that
instead of deleting logs when they are flushed, RocksDB can simply reuse
them. The benefit here is that records can be written and
fdatasync can be called without touching the inode for every IO. Sage
did a pretty good job of explaining the benefit in the PR available here:
https://github.com/facebook/rocksdb/pull/746
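To make the inode point concrete, here is a small illustrative Python
sketch (my own, not Ceph or RocksDB code): overwriting records inside a
preallocated, fixed-size log lets fdatasync() persist the data without
forcing a metadata commit, whereas appending grows the file and drags
the inode into every sync.

```python
import os
import tempfile

def write_recycled(path, record, offset):
    """Overwrite in place within a preallocated 'recycled' log file.

    Because the file size never changes, fdatasync() only has to flush
    the data blocks on most filesystems, not the inode.
    """
    fd = os.open(path, os.O_WRONLY)
    try:
        os.pwrite(fd, record, offset)
        os.fdatasync(fd)  # data only; size/inode unchanged
    finally:
        os.close(fd)

# Preallocate a 1 MiB "log file" once, then keep reusing it.
path = os.path.join(tempfile.mkdtemp(), "wal.log")
with open(path, "wb") as f:
    f.truncate(1024 * 1024)

write_recycled(path, b"record-1", 0)
write_recycled(path, b"record-2", 4096)
assert os.path.getsize(path) == 1024 * 1024  # size never changed
```

An append-based log, by contrast, extends the file on every write, so
each fdatasync has to commit journaled metadata as well.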
After much discussion, that PR was merged and received a couple of bug
fixes over the years:
Locking bug fix from Somnath back in 2016:
https://github.com/facebook/rocksdb/pull/1313
Another bug fix from ajkr in 2020:
https://github.com/facebook/rocksdb/pull/5900
In 2020, the RocksDB folks discovered there is a fundamental flaw in the
way that the original PR works. It turns out that the feature to
recycle log files is incompatible with RocksDB's kPointInTimeRecovery,
kAbsoluteConsistency, and kTolerateCorruptedTailRecords recovery modes.
One of the later PRs included a very good and concise description of
the problem:
"The two features are naturally incompatible. WAL recycling expects the
recovery to succeed upon encountering a corrupt record at the point
where new data ends and recycled data remains at the tail. However,
WALRecoveryMode::kTolerateCorruptedTailRecords must fail upon
encountering any such corrupt record, as it cannot differentiate between
this and a real corruption, which would cause committed updates to be
truncated."
More background discussion on the RocksDB side available in these PRs
and comments:
https://github.com/facebook/rocksdb/pull/6351
https://github.com/facebook/rocksdb/pull/6351#issuecomment-672838284
https://github.com/facebook/rocksdb/pull/7252
https://github.com/facebook/rocksdb/pull/7271
On the Ceph side, there was a PR to try to re-enable the old behavior
which we rejected as unsafe based on the analysis by the RocksDB folks
(which we agree with):
https://github.com/ceph/ceph/pull/36579
Sage also commented about a potential way forward:
https://github.com/ceph/ceph/pull/36579#issuecomment-870884583
"tbh I think the best approach would be to create a new WAL file format
that (1) is 4k block aligned and (2) has a header for each block that
indicates the generation # for that log file (so we can see whether what
we read is from a previous pass or corruption). That would be a fair bit
of effort, though."
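As a toy illustration of that idea (purely hypothetical, not the
proposed on-disk format): tagging each 4K block with the log file's
generation number lets recovery tell stale recycled data apart from
real corruption, instead of failing on the first mismatched record.

```python
import struct

BLOCK = 4096
HDR = struct.Struct("<Q")  # hypothetical 8-byte generation # per block

def write_block(buf, idx, gen, payload):
    """Write one 4K block tagged with the log file's generation number."""
    block = HDR.pack(gen) + payload
    block += b"\x00" * (BLOCK - len(block))
    buf[idx * BLOCK:(idx + 1) * BLOCK] = block

def read_log(buf, gen):
    """Replay blocks; stop cleanly at the first block from an older pass."""
    out = []
    for idx in range(len(buf) // BLOCK):
        block = bytes(buf[idx * BLOCK:(idx + 1) * BLOCK])
        (block_gen,) = HDR.unpack(block[:HDR.size])
        if block_gen != gen:  # stale data from a previous pass,
            break             # not corruption: recovery can succeed
        out.append(block[HDR.size:].rstrip(b"\x00"))
    return out

# Simulate a recycled log: generation 1 filled all four blocks, then
# generation 2 overwrote only the first block before a crash.
log = bytearray(4 * BLOCK)
for i in range(4):
    write_block(log, i, 1, b"old-%d" % i)
write_block(log, 0, 2, b"new-0")

assert read_log(log, 2) == [b"new-0"]  # old tail is recognized as stale
```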
On a side note, Igor also tried to disable WAL file recycling as a
backport to Octopus but was thwarted by a BlueFS bug. That PR was
eventually reverted, leaving the old (dangerous!) behavior in place:
https://github.com/ceph/ceph/pull/45040
https://github.com/ceph/ceph/pull/47053
The gist of it is that releases of Ceph older than Pacific are
benefiting from the speed improvement of log file recycling but may be
vulnerable to the issue as described above. This is likely one of the
more impactful regressions that people upgrading to Pacific or later
releases are seeing.
Josh Baergen from DigitalOcean followed up that there is a slew of
additional information on this issue in the following tracker as well:
https://tracker.ceph.com/issues/58530
********** Regression #1 Potential Fixes **********
Josh Baergen also mentioned that the write-amplification effect that was
observed due to this issue is mitigated by
https://github.com/ceph/ceph/pull/48915 which was merged into 16.2.11
back in December. That however does not improve write IOPS amplification.
Beyond that, we could follow Sage's idea and try to implement a new WAL
file format. The risks here are that it could be a lot of work and we
don't know if there is really any appetite on the RocksDB side to merge
something like this upstream. My personal take is that we're already
kind of abusing the RocksDB WAL for short lived PG log updates and I'm
not thrilled about trying to add further code into RocksDB to try and
support our use cases (though there is benefit here that goes beyond
Ceph). We already maintain a custom version of RocksDB's LRU cache in
our code to tie into our memory autotuning system but it would be really
nice to avoid custom code like that in the future.
One alternative: Igor Fedotov implemented a prototype WAL inside
bluestore itself and we saw very good initial results from it with the
RocksDB WAL disabled. These can be seen on slide 24 of my performance
deck from Cephalocon 2023:
https://www.linkedin.com/in/markhpc/overlay/experience/2113859303/multiple-…
If Igor (or others) want to continue this work, I personally would be in
favor of trying to move the WAL into Bluestore itself. I suspect we can
make better decisions about PG log life cycles and have better BlueFS
integration than what RocksDB provides us. Igor probably has a better
idea of the pitfalls here though so I think we should hear out his
thoughts on whether this is the right path forward. Igor also mentioned
that he is continuing to work on his Bluestore WAL prototype with
promising results, but that PG Log will (as expected) likely require a
different solution that looks more like a specialized ring buffer. I
think moving the WAL out of RocksDB is a good step toward that eventual
goal.
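For flavor, here is a minimal sketch of the ring-buffer shape hinted at
for PG log entries (hypothetical, not Igor's prototype): a
fixed-capacity buffer where new entries displace the oldest, so expired
records never need explicit deletion or compaction.

```python
from collections import deque

class RingLog:
    """Hypothetical fixed-capacity log for short-lived entries.

    Once capacity is reached, each append silently overwrites the
    oldest entry, so trimming is free and storage never grows.
    """

    def __init__(self, capacity):
        self.entries = deque(maxlen=capacity)

    def append(self, entry):
        self.entries.append(entry)  # drops the oldest entry when full

    def replay(self):
        return list(self.entries)

log = RingLog(capacity=3)
for i in range(5):
    log.append("pglog-%d" % i)

# Only the newest `capacity` entries survive; the rest aged out for free.
assert log.replay() == ["pglog-2", "pglog-3", "pglog-4"]
```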
********** Regression #2: (re-)Enabling BlueFS Buffered IO **********
Effects: Works around unexpected readahead behavior in RocksDB by
utilizing underlying kernel page cache. Hurts write performance on fast
devices.
We're stuck between a bit of a rock and a hard place here. Over the
years we have see-sawed back and forth regarding when we should or
should not use buffered IO:
https://github.com/ceph/ceph/pull/11012
https://github.com/ceph/ceph/pull/11059
https://github.com/ceph/ceph/pull/18172
https://github.com/ceph/ceph/pull/20542
https://github.com/ceph/ceph/pull/34224
https://github.com/ceph/ceph/pull/38044 <-- lots of discussion here
The gist of it is that there are upsides and downsides to having
bluefs_buffered_io=true. Direct IO is faster in some scenarios,
especially more recent write tests on NVMe drives. The trade-off is
that RocksDB really seems to benefit from kernel buffer cache and there
are other scenarios where bluefs_buffered_io is a big win. 2 years ago
Adam and I did a walkthrough of the RocksDB code to try to understand
the behavior regarding RocksDB readahead and we couldn't understand why
it was re-reading data from the file system so often (or in the case of
buffered IO the page cache!). I wrote up our walkthrough of the code here:
https://github.com/ceph/ceph/pull/38044#issuecomment-790157415
********** Regression #2 Potential Fixes **********
In a recent discussion with Mark Callaghan (of MyRocks/RocksDB
performance tuning fame), he pointed out that RocksDB has an option to
pre-populate the block cache with the data from SSTs created by memtable
flush and that might help when O_DIRECT is used:
https://github.com/facebook/rocksdb/blob/main/include/rocksdb/table.h#L600
We may want to experiment to see if this helps keep the block cache
pre-populated after compaction and avoid (re)reads from the disk during
iteration. We also might want to revisit this topic in general with the
compact-on-iteration feature that was recently added and backported to
Pacific in 16.2.13. I'm still a little concerned, however, that we were
seeing repeated overlapping reads for the same ranges during iteration
that I would have expected to be cached by RocksDB on a previous read.
Ultimately I think many of us would prefer to move entirely to direct IO,
but there's more work to do to figure this one out.
Josh Baergen provided further advice here: They have had good luck
enabling buffered IO for rgw bucket index OSDs and disabling it
everywhere else. This assumes that bucket indexes are on their own
dedicated OSDs though, and personally I am a bit wary of hitting slow
cases in RocksDB even on "regular" OSDs, but this might be something to
consider as they've had good luck with this configuration for over a year.
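Josh's split configuration can be expressed with ceph config masks; a
sketch, assuming the bucket-index OSDs sit in their own dedicated CRUSH
device class (the class name "rgw-index" here is a placeholder):

```shell
# Sketch only: disable buffered IO globally, then re-enable it just for
# the OSDs serving RGW bucket indexes (assumed device class "rgw-index").
ceph config set osd bluefs_buffered_io false
ceph config set osd/class:rgw-index bluefs_buffered_io true
# Verify what an individual OSD actually picked up:
ceph config show osd.0 bluefs_buffered_io
```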
********** Regression #3: RadosGW Coroutine and Request Timeout Changes
**********
Effects: Higher RadosGW CPU usage, lower performance, especially for
small object workloads
Back when Pacific was released it was observed that RadosGW was showing
much higher CPU usage and lower performance vs Nautilus for small (4KB)
objects. It's likely that larger objects may be affected, though to a
lesser degree. A git bisection was performed and the results are
summarized in the introduction section of the following RGW performance
analysis blog post:
https://ceph.io/en/news/blog/2023/reef-freeze-rgw-performance/
The bisection uncovered two primary PRs that were causing performance
regression:
https://github.com/ceph/ceph/pull/31580
https://github.com/ceph/ceph/pull/35355
The good news is that once those PRs were identified, the RGW team
started working to improve things, especially for #35355:
https://github.com/ceph/ceph/pull/43761 <-- Fixes issues introduced in
#35355, backported to Pacific in 2022
********** Regression #3 Potential Fixes **********
Quincy (and due to the backport likely Pacific) is showing significantly
better behavior in recent tests due to PR #43761. The effects of #31580
are still present, but are considered a necessary trade-off. Other
improvements since then may be helping, but we'll need to continue to
make up the difference in other areas and start really investigating
where we are spending cycles/time, especially in Reef.
********** Regression #4: Gradually slowing down OSDs **********
Effects: Significant slowdown after 1-2 weeks of OSD runtime
Igor Fedotov pointed this one out in discussion earlier today:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/OXAUOK7CQXW…
This one is pretty new and there is not much there yet other than
perhaps low memory (and cache?) usage despite a regular IO workload.
Onode misses can absolutely cause performance degradation, but it's not
clear yet whether this is a memory-related issue or something else. More
investigation needed. Hopefully we'll get perf data from the users who
encountered it to help diagnose what's going on here.
********** Conclusion **********
There may be other performance issues that I'm not remembering, but
these are the big ones I can think of off the top of my head at the
moment. Hopefully this helps clarify what's going on if people are
seeing a regression, what to look for, and, if they are hitting one,
the why behind it.
Thanks,
Mark
--
Best Regards,
Mark Nelson
Head of R&D (USA)
Clyso GmbH
p: +49 89 21552391 12
a: Loristraße 8 | 80335 München | Germany
w: https://clyso.com | e: mark.nelson(a)clyso.com
We are hiring: https://www.clyso.com/jobs/
Hi Cephers,
These are the topics that we just covered in today's meeting:
- *Issues recording our meetings in Jitsi (Mike Perez)*
- David Orman suggested using a self-hosted Jitsi instance:
https://jitsi.github.io/handbook/docs/devops-guide/. Tested on a
single container with 4 cores for 4-5 attendees per call.
- Help needed from Dan Mick & Adam Kraitman.
- Ilya: this might scale for smaller meetings (dailies), but not for
the larger ones (CDM, User-Devs, etc.).
- *Reef release status review*
- Josh: CentOS 9 blocked by Dashboard/Cephadm Python deps missing in
CentOS 9. Casey: It's blocking teuthology testing.
- Already discussed in devs mailing list a couple of months ago,
for Quincy + Centos 9 (the missing packages would be the same).
- Ken Dreyer preferred to keep the legacy approach (distro
packages) instead of embedding Python deps.
- Casey, Matt & Ernesto to resume that discussion with Ken Dreyer.
- Other issues:
- Radek:
- Performance issue with RocksDB config.
- Mismatching client-server features during upgrade: msgr
encoder-decoder issues; a feature bit was introduced for Squid
(https://github.com/ceph/ceph/commit/1049d3e5eff0b7fa4fc9e5853494cb21c10b290a).
Performance concerns to be further discussed at Mark's Perf
meeting (incl. Yuval).
- Paul Cuzner perf testing Reef: higher CPU usage in Reef vs
Quincy (more time spent in RocksDB get calls)
- Target GA remains this June. Missing CentOS 9 packages becoming a
blocker issue for the release.
- *Perf regression in Pacific vs Nautilus (David Orman)*
- https://tracker.ceph.com/issues/58530
- Mark: Missing change in upstream RocksDB project.
- David Orman: Is this degradation still happening in newer RocksDB
versions (Reef)? Mark: No reason to think otherwise.
Kind Regards,
Ernesto
Details of this release are summarized here:
https://tracker.ceph.com/issues/59542#note-1
Release Notes - TBD
Seeking approvals for:
smoke - Radek, Laura
rados - Radek, Laura
rook - Sébastien Han
cephadm - Adam K
dashboard - Ernesto
rgw - Casey
rbd - Ilya
krbd - Ilya
fs - Venky, Patrick
upgrade/octopus-x (pacific) - Laura (looks the same as in 16.2.8)
upgrade/pacific-p2p - Laura
powercycle - Brad (SELinux denials)
ceph-volume - Guillaume, Adam K
Thx
YuriW
Dear Developers,
I am excited to announce that I will be presenting a demo of Perf CI, a
project that I have been working on, at our next Performance meeting.
Perf CI is a dashboard that provides an easy and efficient way to display
and compare performance results from CBT between Teuthology runs on
different branches or against baseline test results.
The Perf CI dashboard not only offers a comprehensive overview of
performance metrics, but also allows us to identify trends and pinpoint
specific points in time when performance was affected.
This will enable us to quickly diagnose and resolve performance issues,
ultimately leading to improved product quality and user satisfaction.
During the demo, I will be showcasing the key features of Perf CI and
explaining its potential impact on our organization.
I believe that Perf CI has the potential to bring significant benefits to
our team, and I am eager to get your feedback on it.
The demo will take place at our next Performance meeting, which is
scheduled for May 11th at 3:00 PM UTC.
Please mark your calendars and make sure to attend. If you have any
questions or concerns, please don't hesitate to contact me.
Thank you, and I look forward to seeing you all soon.
Best regards, Nitzan
We want to do the next urgent point release for pacific 16.2.13 ASAP.
The tip of the current pacific branch will be used as a base for this
release and we will build it later today.
Dev leads - if you have any outstanding PRs that must be included,
please merge them now.
Thx
YuriW
Hi Folks,
The weekly performance meeting will be starting in approximately 20
minutes at 8AM PST! Today we'll catch up on pull requests and various
topics since Cephalocon. Please feel free to add your own topic as well!
Etherpad:
https://pad.ceph.com/p/performance_weekly
Meeting URL:
https://meet.jit.si/ceph-performance
Mark
Hi everyone,
The May CDM is coming up tomorrow, *Wednesday, May 3rd @ 1:00 UTC*. See
more meeting details below. Note that we are now meeting on Jitsi.
Please add any topics you'd like to discuss to the agenda:
https://tracker.ceph.com/projects/ceph/wiki/CDM_03-MAY-2023
- Laura Flores
Meeting Link:
https://meet.jit.si/ceph-dev-monthly
UTC: Thursday, May 4, 1:00 UTC
Mountain View, CA, US: Wednesday, May 3, 18:00 PDT
Phoenix, AZ, US: Wednesday, May 3, 18:00 MST
Denver, CO, US: Wednesday, May 3, 19:00 MDT
Huntsville, AL, US: Wednesday, May 3, 20:00 CDT
Raleigh, NC, US: Wednesday, May 3, 21:00 EDT
London, England: Thursday, May 4, 2:00 BST
Paris, France: Thursday, May 4, 3:00 CEST
Helsinki, Finland: Thursday, May 4, 4:00 EEST
Tel Aviv, Israel: Thursday, May 4, 4:00 IDT
Pune, India: Thursday, May 4, 6:30 IST
Brisbane, Australia: Thursday, May 4, 11:00 AEST
Singapore, Asia: Thursday, May 4, 9:00 +08
Auckland, New Zealand: Thursday, May 4, 13:00 NZST
--
Laura Flores
She/Her/Hers
Software Engineer, Ceph Storage <https://ceph.io>
Chicago, IL
lflores(a)ibm.com | lflores(a)redhat.com <lflores(a)redhat.com>
M: +17087388804