Hi Everyone,
David Orman mentioned in the CLT meeting this morning that there are a
number of people on the mailing list asking about performance
regressions in Pacific+ vs older releases. I want to document a couple
of the bigger ones that we know about for the community's benefit. I
want to be clear that Pacific does have a number of performance
improvements over previous releases, and we do have tests showing
improvement relative to Nautilus (especially RBD on NVMe drives). Some
of these regressions are going to have a bigger effect for some users
than others. Having said that, let's get into them.
********** Regression #1: RocksDB Log File Recycling **********
Effects: More metadata updates to the underlying FS, higher
write amplification (observed by DigitalOcean), and slower performance,
especially when the WAL device is saturated.
When bluestore was created back in 2015, Sage implemented an optimization
in RocksDB that allowed WAL log files to be recycled. The idea is that
instead of deleting logs when they are flushed, RocksDB can simply reuse
them. The benefit here is that records can be written and
fdatasync can be called without touching the inode for every IO. Sage
did a pretty good job of explaining the benefit in the PR available here:
https://github.com/facebook/rocksdb/pull/746
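To make the inode point concrete, here is a small illustrative Python
sketch (my own, not Ceph or RocksDB code): overwriting records inside a
preallocated, fixed-size log lets fdatasync() persist the data without
forcing a metadata commit, whereas appending grows the file and drags
the inode into every sync.

```python
import os
import tempfile

def write_recycled(path, record, offset):
    """Overwrite in place within a preallocated 'recycled' log file.

    Because the file size never changes, fdatasync() only has to flush
    the data blocks on most filesystems, not the inode.
    """
    fd = os.open(path, os.O_WRONLY)
    try:
        os.pwrite(fd, record, offset)
        os.fdatasync(fd)  # data only; size/inode unchanged
    finally:
        os.close(fd)

# Preallocate a 1 MiB "log file" once, then keep reusing it.
path = os.path.join(tempfile.mkdtemp(), "wal.log")
with open(path, "wb") as f:
    f.truncate(1024 * 1024)

write_recycled(path, b"record-1", 0)
write_recycled(path, b"record-2", 4096)
assert os.path.getsize(path) == 1024 * 1024  # size never changed
```

An append-based log, by contrast, extends the file on every write, so
each fdatasync has to commit journaled metadata as well.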
After much discussion, that PR was merged and received a couple of bug
fixes over the years:
Locking bug fix from Somnath back in 2016:
https://github.com/facebook/rocksdb/pull/1313
Another bug fix from ajkr in 2020:
https://github.com/facebook/rocksdb/pull/5900
In 2020, the RocksDB folks discovered there is a fundamental flaw in the
way that the original PR works. It turns out that the feature to
recycle log files is incompatible with RocksDB's kPointInTimeRecovery,
kAbsoluteConsistency, and kTolerateCorruptedTailRecords recovery modes.
One of the later PRs included a very good and concise description of
the problem:
"The two features are naturally incompatible. WAL recycling expects the
recovery to succeed upon encountering a corrupt record at the point
where new data ends and recycled data remains at the tail. However,
WALRecoveryMode::kTolerateCorruptedTailRecords must fail upon
encountering any such corrupt record, as it cannot differentiate between
this and a real corruption, which would cause committed updates to be
truncated."
More background discussion on the RocksDB side available in these PRs
and comments:
https://github.com/facebook/rocksdb/pull/6351
https://github.com/facebook/rocksdb/pull/6351#issuecomment-672838284
https://github.com/facebook/rocksdb/pull/7252
https://github.com/facebook/rocksdb/pull/7271
On the Ceph side, there was a PR to try to re-enable the old behavior
which we rejected as unsafe based on the analysis by the RocksDB folks
(which we agree with):
https://github.com/ceph/ceph/pull/36579
Sage also commented about a potential way forward:
https://github.com/ceph/ceph/pull/36579#issuecomment-870884583
"tbh I think the best approach would be to create a new WAL file format
that (1) is 4k block aligned and (2) has a header for each block that
indicates the generation # for that log file (so we can see whether what
we read is from a previous pass or corruption). That would be a fair bit
of effort, though."
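As a toy illustration of that idea (purely hypothetical, not the
proposed on-disk format): tagging each 4K block with the log file's
generation number lets recovery tell stale recycled data apart from
real corruption, instead of failing on the first mismatched record.

```python
import struct

BLOCK = 4096
HDR = struct.Struct("<Q")  # hypothetical 8-byte generation # per block

def write_block(buf, idx, gen, payload):
    """Write one 4K block tagged with the log file's generation number."""
    block = HDR.pack(gen) + payload
    block += b"\x00" * (BLOCK - len(block))
    buf[idx * BLOCK:(idx + 1) * BLOCK] = block

def read_log(buf, gen):
    """Replay blocks; stop cleanly at the first block from an older pass."""
    out = []
    for idx in range(len(buf) // BLOCK):
        block = bytes(buf[idx * BLOCK:(idx + 1) * BLOCK])
        (block_gen,) = HDR.unpack(block[:HDR.size])
        if block_gen != gen:  # stale data from a previous pass,
            break             # not corruption: recovery can succeed
        out.append(block[HDR.size:].rstrip(b"\x00"))
    return out

# Simulate a recycled log: generation 1 filled all four blocks, then
# generation 2 overwrote only the first block before a crash.
log = bytearray(4 * BLOCK)
for i in range(4):
    write_block(log, i, 1, b"old-%d" % i)
write_block(log, 0, 2, b"new-0")

assert read_log(log, 2) == [b"new-0"]  # old tail is recognized as stale
```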
On a side note, Igor also tried to disable WAL file recycling as a
backport to Octopus but was thwarted by a BlueFS bug. That PR was
eventually reverted, leaving the old (dangerous!) behavior in place:
https://github.com/ceph/ceph/pull/45040
https://github.com/ceph/ceph/pull/47053
The gist of it is that releases of Ceph older than Pacific are
benefiting from the speed improvement of log file recycling but may be
vulnerable to the issue as described above. This is likely one of the
more impactful regressions that people upgrading to Pacific or later
releases are seeing.
Josh Baergen from DigitalOcean followed up that there is a slew of
additional information on this issue in the following tracker as well:
https://tracker.ceph.com/issues/58530
********** Regression #1 Potential Fixes **********
Josh Baergen also mentioned that the write-amplification effect that was
observed due to this issue is mitigated by
https://github.com/ceph/ceph/pull/48915 which was merged into 16.2.11
back in December. That however does not improve write IOPS amplification.
Beyond that, we could follow Sage's idea and try to implement a new WAL
file format. The risks here are that it could be a lot of work and we
don't know if there is really any appetite on the RocksDB side to merge
something like this upstream. My personal take is that we're already
kind of abusing the RocksDB WAL for short lived PG log updates and I'm
not thrilled about trying to add further code into RocksDB to try and
support our use cases (though there is benefit here that goes beyond
Ceph). We already maintain a custom version of RocksDB's LRU cache in
our code to tie into our memory autotuning system but it would be really
nice to avoid custom code like that in the future.
One alternative: Igor Fedotov implemented a prototype WAL inside
bluestore itself and we saw very good initial results from it with the
RocksDB WAL disabled. These can be seen on slide 24 of my performance
deck from Cephalocon 2023:
https://www.linkedin.com/in/markhpc/overlay/experience/2113859303/multiple-…
If Igor (or others) want to continue this work, I personally would be in
favor of trying to move the WAL into Bluestore itself. I suspect we can
make better decisions about PG log life cycles and have better BlueFS
integration than what RocksDB provides us. Igor probably has a better
idea of the pitfalls here though so I think we should hear out his
thoughts on whether this is the right path forward. Igor also mentioned
that he is continuing to work on his Bluestore WAL prototype with
promising results, but that PG Log will (as expected) likely require a
different solution that looks more like a specialized ring buffer. I
think moving the WAL out of RocksDB is a good step toward that eventual
goal.
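For flavor, here is a minimal sketch of the ring-buffer shape hinted at
for PG log entries (hypothetical, not Igor's prototype): a
fixed-capacity buffer where new entries displace the oldest, so expired
records never need explicit deletion or compaction.

```python
from collections import deque

class RingLog:
    """Hypothetical fixed-capacity log for short-lived entries.

    Once capacity is reached, each append silently overwrites the
    oldest entry, so trimming is free and storage never grows.
    """

    def __init__(self, capacity):
        self.entries = deque(maxlen=capacity)

    def append(self, entry):
        self.entries.append(entry)  # drops the oldest entry when full

    def replay(self):
        return list(self.entries)

log = RingLog(capacity=3)
for i in range(5):
    log.append("pglog-%d" % i)

# Only the newest `capacity` entries survive; the rest aged out for free.
assert log.replay() == ["pglog-2", "pglog-3", "pglog-4"]
```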
********** Regression #2: (re-)Enabling BlueFS Buffered IO **********
Effects: Works around unexpected readahead behavior in RocksDB by
utilizing underlying kernel page cache. Hurts write performance on fast
devices.
We're stuck between a bit of a rock and a hard place here. Over the
years we have see-sawed back and forth regarding when we should or
should not use buffered IO:
https://github.com/ceph/ceph/pull/11012
https://github.com/ceph/ceph/pull/11059
https://github.com/ceph/ceph/pull/18172
https://github.com/ceph/ceph/pull/20542
https://github.com/ceph/ceph/pull/34224
https://github.com/ceph/ceph/pull/38044 <-- lots of discussion here
The gist of it is that there are upsides and downsides to having
bluefs_buffered_io=true. Direct IO is faster in some scenarios,
especially more recent write tests on NVMe drives. The trade-off is
that RocksDB really seems to benefit from kernel buffer cache and there
are other scenarios where bluefs_buffered_io is a big win. 2 years ago
Adam and I did a walkthrough of the RocksDB code to try to understand
the behavior regarding RocksDB readahead and we couldn't understand why
it was re-reading data from the file system so often (or in the case of
buffered IO the page cache!). I wrote up our walkthrough of the code here:
https://github.com/ceph/ceph/pull/38044#issuecomment-790157415
********** Regression #2 Potential Fixes **********
In a recent discussion with Mark Callaghan (of MyRocks/RocksDB
performance tuning fame), he pointed out that RocksDB has an option to
pre-populate the block cache with the data from SSTs created by memtable
flush and that might help when O_DIRECT is used:
https://github.com/facebook/rocksdb/blob/main/include/rocksdb/table.h#L600
We may want to experiment to see if this helps keep the block cache
pre-populated after compaction and avoid (re)reads from the disk during
iteration. We also might want to revisit this topic in general with the
compact-on-iteration feature that was recently added and backported to
Pacific in 16.2.13. I'm still a little concerned, however, that we were
seeing repeated overlapping reads for the same ranges during iteration
that I would have expected to be cached by RocksDB on a previous read.
Ultimately I think many of us would prefer to move entirely to direct IO,
but there's more work to do to figure this one out.
Josh Baergen provided further advice here: They have had good luck
enabling buffered IO for rgw bucket index OSDs and disabling it
everywhere else. This assumes that bucket indexes are on their own
dedicated OSDs though, and personally I am a bit wary of hitting slow
cases in RocksDB even on "regular" OSDs, but this might be something to
consider as they've had good luck with this configuration for over a year.
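Josh's split configuration can be expressed with ceph config masks; a
sketch, assuming the bucket-index OSDs sit in their own dedicated CRUSH
device class (the class name "rgw-index" here is a placeholder):

```shell
# Sketch only: disable buffered IO globally, then re-enable it just for
# the OSDs serving RGW bucket indexes (assumed device class "rgw-index").
ceph config set osd bluefs_buffered_io false
ceph config set osd/class:rgw-index bluefs_buffered_io true
# Verify what an individual OSD actually picked up:
ceph config show osd.0 bluefs_buffered_io
```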
********** Regression #3: RadosGW Coroutine and Request Timeout Changes
**********
Effects: Higher RadosGW CPU usage, lower performance, especially for
small object workloads
Back when Pacific was released it was observed that RadosGW was showing
much higher CPU usage and lower performance vs Nautilus for small (4KB)
objects. It's likely that larger objects may be affected, though to a
lesser degree. A git bisection was performed and the results are
summarized in the introduction section of the following RGW performance
analysis blog post:
https://ceph.io/en/news/blog/2023/reef-freeze-rgw-performance/
The bisection uncovered two primary PRs that were causing performance
regression:
https://github.com/ceph/ceph/pull/31580
https://github.com/ceph/ceph/pull/35355
The good news is that once those PRs were identified, the RGW team
started working to improve things, especially for #35355:
https://github.com/ceph/ceph/pull/43761 <-- Fixes issues introduced in
#35355, backported to Pacific in 2022
********** Regression #3 Potential Fixes **********
Quincy (and due to the backport likely Pacific) is showing significantly
better behavior in recent tests due to PR #43761. The effects of #31580
are still present, but are considered a necessary trade-off. Other
improvements since then may be helping, but we'll need to continue to
make up the difference in other areas and start really investigating
where we are spending cycles/time, especially in Reef.
********** Regression #4: Gradually slowing down OSDs **********
Effects: Significant slowdown after 1-2 weeks of OSD runtime
Igor Fedotov pointed this one out in discussion earlier today:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/OXAUOK7CQXW…
This one is pretty new and there is not much there yet other than
perhaps low memory (and cache?) usage despite a regular IO workload.
Onode misses can absolutely cause performance degradation, but it's not
clear yet whether this is a memory-related issue or something else. More
investigation needed. Hopefully we'll get perf data from the users who
encountered it to help diagnose what's going on here.
********** Conclusion **********
There may be other performance issues that I'm not remembering, but
these are the big ones I can think of off the top of my head at the
moment. Hopefully this helps clarify what's going on if people are
seeing a regression, what to look for, and, if they are hitting one,
the why behind it.
Thanks,
Mark
--
Best Regards,
Mark Nelson
Head of R&D (USA)
Clyso GmbH
p: +49 89 21552391 12
a: Loristraße 8 | 80335 München | Germany
w: https://clyso.com | e: mark.nelson(a)clyso.com
We are hiring: https://www.clyso.com/jobs/
Hi Cephers,
These are the topics that we just covered in today's meeting:
- *Issues recording our meetings in Jitsi (Mike Perez)*
- David Orman suggested using a self-hosted Jitsi instance:
https://jitsi.github.io/handbook/docs/devops-guide/. Tested on a
single container with 4 cores for 4-5 attendees per call.
- Help needed from Dan Mick & Adam Kraitman.
- Ilya: this might scale for smaller meetings (dailies), but not for
the larger ones (CDM, User-Devs, etc.).
- *Reef release status review*
- Josh: CentOS 9 blocked by Dashboard/Cephadm Python deps missing in
CentOS 9. Casey: It's blocking teuthology testing.
- Already discussed in devs mailing list a couple of months ago,
for Quincy + Centos 9 (the missing packages would be the same).
- Ken Dreyer preferred to keep the legacy approach (distro
packages) instead of embedding Python deps.
- Casey, Matt & Ernesto to resume that discussion with Ken Dreyer.
- Other issues:
- Radek:
- Performance issue with RocksDB config.
- Mismatching client-server features during upgrade: msgr
encoder-decoder issues; a feature bit was introduced for Squid
(https://github.com/ceph/ceph/commit/1049d3e5eff0b7fa4fc9e5853494cb21c10b290a).
Performance concerns to be further discussed at Mark's Perf
meeting (incl. Yuval).
- Paul Cuzner perf testing Reef: higher CPU usage in Reef vs
Quincy (more time spent in RocksDB get calls)
- Target GA remains this June. Missing CentOS 9 packages becoming a
blocker issue for the release.
- *Perf regression in Pacific vs Nautilus (David Orman)*
- https://tracker.ceph.com/issues/58530
- Mark: Missing change in upstream RocksDB project.
- David Orman: Is this degradation still happening in newer RocksDB
versions (Reef)? Mark: No reason to think otherwise.
Kind Regards,
Ernesto
Details of this release are summarized here:
https://tracker.ceph.com/issues/59542#note-1
Release Notes - TBD
Seeking approvals for:
smoke - Radek, Laura
rados - Radek, Laura
rook - Sébastien Han
cephadm - Adam K
dashboard - Ernesto
rgw - Casey
rbd - Ilya
krbd - Ilya
fs - Venky, Patrick
upgrade/octopus-x (pacific) - Laura (looks the same as in 16.2.8)
upgrade/pacific-p2p - Laura
powercycle - Brad (SELinux denials)
ceph-volume - Guillaume, Adam K
Thx
YuriW
Dear Developers,
I am excited to announce that I will be presenting a demo of Perf CI, a
project that I have been working on, at our next Performance meeting.
Perf CI is a dashboard that provides an easy and efficient way to display
and compare performance results from CBT between Teuthology runs on
different branches or against baseline test results.
The Perf CI dashboard not only offers a comprehensive overview of
performance metrics, but also allows us to identify trends and pinpoint
specific points in time when performance was affected.
This will enable us to quickly diagnose and resolve performance issues,
ultimately leading to improved product quality and user satisfaction.
During the demo, I will be showcasing the key features of Perf CI and
explaining its potential impact on our organization.
I believe that Perf CI has the potential to bring significant benefits to
our team, and I am eager to get your feedback on it.
The demo will take place at our next Performance meeting, which is
scheduled for May 11th at 3:00 PM UTC.
Please mark your calendars and make sure to attend. If you have any
questions or concerns, please don't hesitate to contact me.
Thank you, and I look forward to seeing you all soon.
Best regards, Nitzan
We want to do the next urgent point release for pacific 16.2.13 ASAP.
The tip of the current pacific branch will be used as a base for this
release and we will build it later today.
Dev leads - if you have any outstanding PRs that must be included,
please merge them now.
Thx
YuriW
Hi Folks,
The weekly performance meeting will be starting in approximately 20
minutes at 8AM PST! Today we'll catch up on pull requests and various
topics since Cephalocon. Please feel free to add your own topic as well!
Etherpad:
https://pad.ceph.com/p/performance_weekly
Meeting URL:
https://meet.jit.si/ceph-performance
Mark
Hi everyone,
The May CDM is coming up tomorrow, *Wednesday, May 3rd @ 1:00 UTC*. See
more meeting details below. Note that we are now meeting on Jitsi.
Please add any topics you'd like to discuss to the agenda:
https://tracker.ceph.com/projects/ceph/wiki/CDM_03-MAY-2023
- Laura Flores
Meeting Link:
https://meet.jit.si/ceph-dev-monthly
UTC: Thursday, May 4, 1:00 UTC
Mountain View, CA, US: Wednesday, May 3, 18:00 PDT
Phoenix, AZ, US: Wednesday, May 3, 18:00 MST
Denver, CO, US: Wednesday, May 3, 19:00 MDT
Huntsville, AL, US: Wednesday, May 3, 20:00 CDT
Raleigh, NC, US: Wednesday, May 3, 21:00 EDT
London, England: Thursday, May 4, 2:00 BST
Paris, France: Thursday, May 4, 3:00 CEST
Helsinki, Finland: Thursday, May 4, 4:00 EEST
Tel Aviv, Israel: Thursday, May 4, 4:00 IDT
Pune, India: Thursday, May 4, 6:30 IST
Brisbane, Australia: Thursday, May 4, 11:00 AEST
Singapore, Asia: Thursday, May 4, 9:00 +08
Auckland, New Zealand: Thursday, May 4, 13:00 NZST
--
Laura Flores
She/Her/Hers
Software Engineer, Ceph Storage <https://ceph.io>
Chicago, IL
lflores(a)ibm.com | lflores(a)redhat.com <lflores(a)redhat.com>
M: +17087388804