I would like to understand how the per-OSD data from "ceph osd perf"
(i.e. apply_latency, commit_latency) is generated. So far I couldn't
find documentation on this. "ceph osd perf" output is nice for a quick
glimpse, but is not very well suited for graphing. Output values are
from the most recent 5s-averages apparently.
With "ceph daemon osd.X perf dump" OTOH, you get quite a lot of latency
metrics, while it is just not obvious to me how they aggregate into
apply_latency and commit_latency. Or some comparably easy read latency
metric (something that is missing completely in "ceph osd perf").
Can somebody shed some light on this?
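For graphing, here is a minimal sketch. It assumes the latency counters in "ceph daemon osd.X perf dump" (for example op_r_latency and op_w_latency) are cumulative pairs of "avgcount" (number of ops) and "sum" (total seconds). Under that assumption, the average latency over an interval is the difference of the sums divided by the difference of the counts. The function name and sample values are made up for illustration, and this does not claim to reproduce how "ceph osd perf" derives apply_latency and commit_latency.

```python
def interval_avg_latency_ms(prev, curr):
    """Average latency in ms between two samples of one latency counter.

    Each sample is a dict with 'avgcount' (ops so far) and 'sum'
    (total seconds so far), as assumed for perf dump latency counters.
    Returns None if no ops completed in the interval or the counter
    was reset (for example after an OSD restart).
    """
    ops = curr["avgcount"] - prev["avgcount"]
    if ops <= 0:
        return None
    return 1000.0 * (curr["sum"] - prev["sum"]) / ops


# Hypothetical samples of one counter, taken some seconds apart.
prev_sample = {"avgcount": 1000, "sum": 2.0}
curr_sample = {"avgcount": 1500, "sum": 3.5}

print(interval_avg_latency_ms(prev_sample, curr_sample))
```

Sampling the same counter at a fixed interval and plotting this value gives a per-interval read or write latency, rather than the lifetime average since the OSD started.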
Today I did the first update from Octopus to Pacific, and it looks like the
average apply latency went up from 1ms to 2ms.
All 36 OSDs are 4TB SSDs and nothing else changed.
Does anyone know whether this is an issue, or am I just missing a config value?
We encountered a Ceph failure in which the cluster became unresponsive, with no IOPS or throughput, after one node failed. Upon investigation, it appears that the OSD process on one of the Ceph storage nodes was stuck while the node still responded to ping. During the failure, Ceph did not recognize the problematic node as failed, which resulted in all other OSDs in the cluster experiencing slow operations and no IOPS in the cluster at all.
Here's the timeline of the incident:
- At 10:40, an alert is triggered, indicating a problem with the OSD.
- After the alert, Ceph becomes unresponsive with no IOPS or throughput.
- At 11:26, an engineer discovers that there is a gradual OSD failure, with 6 out of 12 OSDs on the node being down.
- At 11:46, the Ceph engineer is unable to SSH into the faulty node and attempts a soft restart, but the "smartmontools" process is stuck while shutting down the server. Ping works during this time.
- After waiting for about one or two minutes, a hard restart is attempted for the server.
- At 11:57, after the Ceph node starts normally, service resumes as usual, indicating that the issue has been resolved.
Here is some basic information about our services:
- `Mon: 5 daemons, quorum host001, host002, host003, host004, host005 (age 4w)`
- `Mgr: host005 (active, since 4w), standbys: host001, host002, host003, host004`
- `Osd: 218 osds: 218 up (since 22h), 218 in (since 22h)`
We have a cluster with 19 nodes: 15 SSD nodes and 4 HDD nodes. In total, there are 218 OSDs. Each SSD node has 11 OSDs on Samsung EVO 870 SSDs, with the DB/WAL for each drive placed on a 1.6T NVMe drive. We are using Ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable).
Here is the health check detail:
[root@node21 ~]# ceph health detail
HEALTH_WARN 1 osds down; Reduced data availability: 12 pgs inactive, 12 pgs peering; Degraded data redundancy: 272273/43967625 objects degraded (0.619%), 88 pgs degraded, 5 pgs undersized; 18192 slow ops, oldest one blocked for 3730 sec, daemons [osd.0,osd.1,osd.101,osd.103,osd.107,osd.108,osd.109,osd.11,osd.111,osd.112]... have slow ops.
[WRN] OSD_DOWN: 1 osds down
osd.174 (root=default,host=hkhost031) is down
[WRN] PG_AVAILABILITY: Reduced data availability: 12 pgs inactive, 12 pgs peering
pg 2.dc is stuck peering for 49m, current state peering, last acting [87,95,172]
pg 2.e2 is stuck peering for 15m, current state peering, last acting [51,177,97]
pg 2.f7e is active+undersized+degraded, acting [10,214]
pg 2.f84 is active+undersized+degraded, acting [91,52]
[WRN] SLOW_OPS: 18192 slow ops, oldest one blocked for 3730 sec, daemons [osd.0,osd.1,osd.101,osd.103,osd.107,osd.108,osd.109,osd.11,osd.111,osd.112]... have slow ops.
I have the following questions:
1. Why couldn't Ceph detect the faulty node and automatically abandon its resources? Can anyone provide more troubleshooting guidance for this case?
2. What is Ceph's detection mechanism and where can I find related information? All of our production cloud machines were affected and suspended. If RBD is unstable, we cannot continue to use Ceph technology for our RBD source.
3. Did we miss any patches or bug fixes?
4. Is there anyone who can suggest improvements and how we can quickly detect and avoid similar issues in the future?
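Regarding question 4, one rough way to flag this situation early is to watch the SLOW_OPS line in the "ceph health detail" text shown above. The sketch below only parses that line; the function name is hypothetical and the alerting around it is left out. Note that Ceph truncates the daemon list with "...", so the parsed list may be incomplete.

```python
import re

# Matches the SLOW_OPS summary line as it appears in the output above.
SLOW_OPS_RE = re.compile(
    r"SLOW_OPS: (\d+) slow ops, oldest one blocked for (\d+) sec, "
    r"daemons \[([^\]]*)\]"
)


def slow_ops_summary(health_text):
    """Return (slow_op_count, oldest_blocked_sec, daemons) or None."""
    match = SLOW_OPS_RE.search(health_text)
    if match is None:
        return None
    daemons = [d for d in match.group(3).split(",") if d]
    return int(match.group(1)), int(match.group(2)), daemons


# Shortened copy of the line from the health output above.
sample = (
    "[WRN] SLOW_OPS: 18192 slow ops, oldest one blocked for 3730 sec, "
    "daemons [osd.0,osd.1,osd.101]... have slow ops."
)
print(slow_ops_summary(sample))
```

An "oldest one blocked" value that keeps growing across many daemons at once, as in the incident above, is a hint that the problem is a shared dependency such as one stuck node rather than the listed OSDs themselves.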
something is very wrong with my hardware it seems and i'm slowly turning
I'm trying to debug why ceph has incredibly poor performance for us.
- 3 EPYC 7713 dual-cpu systems
- datacenter nvme drives (3GB/s top)
- 100G infiniband
ceph does 800MB/s read max,
CPU is idle, network is idle, I/O nowhere near saturation
now while hunting for queues, i wanted to recompile ceph and poke around
but ./install-deps.sh is now running for 6 hours with the exact same
CPU is idle, network is idle, zero I/O
what am i doing wrong? is there another computer resource we're
bottlenecked on that i just dont know about?
How should I understand the statement "Improvements in releases beginning with Nautilus 14.2.12 and Octopus 15.2.6 enable better utilization of arbitrary DB device sizes, and the Pacific release brings experimental dynamic level support." in the documentation at https://docs.ceph.com/en/quincy/rados/configuration/bluestore-config-ref/#sizing
Is there a related patch?
I am currently testing some new disks, doing some benchmarks and stuff, and I would like to understand how the OSD bench works.
To quickly explain our setup: we have a small Ceph cluster, where our new disks are inserted. We have some pools with no replication at all and only 1 PG, up-mapped to those new disks, so I can run benchmarks on them.
The odd thing is that when I run tests with the fio tool, I get similar results on all disks, and the same is true for a 5-minute rados bench. But the OSD bench that runs at OSD startup, which mClock uses to configure osd_mclock_max_capacity_iops_hdd, gives me a very big difference between disks (600 vs 2200).
I am running Pacific on this test cluster.
Is there documentation anywhere on how this works? Or if anyone could explain, that would be great.
I did not find any documentation on how the OSD benchmark works, only on how to use it. But after playing with it a little, it seems the results we get are highly dependent on the block size we use. The same goes for rados bench: the results depend, at least in my tests, on the block size we use, which I found a little bit weird to be honest.
And as mClock depends on that, it has an impact on performance. On our cluster we can reach a lot better performance if we tweak those values, instead of letting the cluster do its own measurements. And this looks to impact certain disk vendors more than others.
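To illustrate why an IOPS figure depends so strongly on block size, here is a toy model. It is only an assumption for illustration, not how the OSD bench or mClock actually measure: each I/O is taken to cost a fixed per-operation overhead plus the transfer time at a fixed bandwidth. All parameter values are hypothetical.

```python
def modeled_iops(block_size, per_op_overhead_s, bandwidth_bytes_per_s):
    """IOPS under a toy model: time per op = overhead + size / bandwidth."""
    time_per_op = per_op_overhead_s + block_size / bandwidth_bytes_per_s
    return 1.0 / time_per_op


# Two hypothetical disks with the same bandwidth but different
# per-operation overhead (for example, different write cache behaviour).
BANDWIDTH = 200 * 1024 * 1024  # 200 MiB/s, made-up value
disks = {"disk_a": 0.0004, "disk_b": 0.0016}  # overhead in seconds

for block_size in (4096, 65536, 4 * 1024 * 1024):
    row = {
        name: round(modeled_iops(block_size, overhead, BANDWIDTH))
        for name, overhead in disks.items()
    }
    print(block_size, row)
```

In this model the two disks differ several-fold in IOPS at small block sizes, where the per-operation overhead dominates, and converge at large block sizes, where bandwidth dominates. That is one possible reason why large-block tests can look similar across disks while a small-block IOPS measurement does not.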
Stabilization Period: Monday, April 3rd - Friday, April 14th, 2023
Submission Deadline: Tuesday, May 16th, 2023 AoE
The IO500 is now accepting and encouraging submissions for the upcoming
12th semi-annual IO500 list, in conjunction with ISC23. Once again, we
are also accepting submissions to the 10 Client Node Challenge to
encourage the submission of small scale results. The new ranked lists
will be announced at the ISC23 BoF. We hope to see many new
1. Creation of Production and Research Lists - Starting with ISC'22, we
proposed a separation of the list into separate Production and Research
lists. This better reflects the important distinction between storage
systems that run in production environments and those that may use more
experimental hardware and software configurations. At ISC23, we will
formally create these two lists and users will be able to submit to
either of the two lists (and their 10 client-node counterparts). Please
see the requirements for each list on the IO500 rules page.
2. New Submission Tool - There is now a new IO500 submission tool that
improves the overall submission experience. Users can create accounts
and then update and manage all of their submissions through that
account. As part of this new tool, we have improved the submission
fields that describe the hardware and software of the system under test.
For reproducibility and analysis reasons, we have now made the easily
obtainable fields mandatory. Data from storage servers is often
difficult for users to obtain, so most of those fields remain optional.
As this is a new system, there may be quirks; please reach out on Slack
or the mailing list if you see any issues. Further details will be
released on the submission page.
3. Reproducibility - Every submission will now receive a reproducibility
score based upon the provided system details and the reproducibility
questionnaire. This score will inform the community on the amount of
details provided in the submission and the obtainability of the storage
system. Further, this score will be used to evaluate if a submission is
eligible for the Production list.
4. New Phases - We are continuing to evaluate the inclusion of optional
test phases for additional key workloads - split easy/hard find phases,
4KB and 1MB random read/write phases, and concurrent metadata
operations. This is called an extended run. At the moment, we collect
the information to verify that additional phases do not significantly
impact the results of a standard run and an extended run to facilitate
comparisons between the existing and new benchmark phases. In a future
release, we may include some or all of these results as part of the
standard benchmark. The extended results are not currently included in
the scoring of any ranked list.
The benchmark suite is designed to be easy to run and the community has
multiple active support channels to help with any questions. Please note
that submissions of all sizes are welcome; the site has customizable
sorting, so it is possible to submit on a small system and still get a
very good per-client score, for example. Additionally, the list is about
much more than just the raw rank; all submissions help the community by
collecting and publishing a wider corpus of data. More details below.
Following the success of the Top500 in collecting and analyzing
historical trends in supercomputer technology and evolution, the IO500
was created in 2017, published its first list at SC17, and has grown
continually since then. The need for such an initiative has long been
known within High-Performance Computing; however, defining appropriate
benchmarks has long been challenging. Despite this challenge, the
community, after long and spirited discussion, finally reached consensus
on a suite of benchmarks and a metric for resolving the scores into a
The multi-fold goals of the benchmark suite are as follows:
1. Maximizing simplicity in running the benchmark suite
2. Encouraging optimization and documentation of tuning parameters for
3. Allowing submitters to highlight their "hero run" performance numbers
4. Forcing submitters to simultaneously report performance for
challenging IO patterns.
Specifically, the benchmark suite includes a hero-run of both IOR and
mdtest configured however possible to maximize performance and establish
an upper-bound for performance. It also includes an IOR and mdtest run
with highly prescribed parameters in an attempt to determine a lower
performance bound. Finally, it includes a namespace search as this has
been determined to be a highly sought-after feature in HPC storage
systems that has historically not been well-measured. Submitters are
encouraged to share their tuning insights for publication.
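As a rough illustration of how the phase results combine into one number: the IO500 score is the geometric mean of a bandwidth score (the geometric mean of the IOR phases, in GiB/s) and a metadata score (the geometric mean of the mdtest and find phases, in kIOP/s). The sketch below uses made-up phase values and hypothetical function names; the official rules remain the authority on which phases count.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive numbers, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def io500_style_score(bandwidth_gib_s, metadata_kiops):
    """Square root of (bandwidth geo-mean times metadata geo-mean)."""
    bw_score = geometric_mean(bandwidth_gib_s)
    md_score = geometric_mean(metadata_kiops)
    return math.sqrt(bw_score * md_score)


# Hypothetical results for a small system.
bandwidth_phases = [12.0, 1.5, 15.0, 3.0]      # GiB/s, IOR phases
metadata_phases = [80.0, 20.0, 150.0, 60.0]    # kIOP/s, mdtest/find phases

print(round(io500_style_score(bandwidth_phases, metadata_phases), 2))
```

Because geometric means are used throughout, a very poor result in a single "hard" phase pulls the overall score down much more than it would with an arithmetic mean, which matches the stated goal of forcing submitters to report challenging IO patterns alongside hero runs.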
The goals of the community are also multi-fold:
1. Gather historical data for the sake of analysis and to aid
predictions of storage futures
2. Collect tuning information to share valuable performance
optimizations across the community
3. Encourage vendors and designers to optimize for workloads beyond
4. Establish bounded expectations for users, procurers, and
The IO500 follows a two-staged approach. First, there will be a two-week
stabilization period during which we encourage the community to verify
that the benchmark runs properly on a variety of storage systems. During
this period the benchmark may be updated based upon feedback from the
community. The final benchmark will then be released. We expect that
runs compliant with the rules made during the stabilization period will
be valid as a final submission unless a significant defect is found.
10 Client Node I/O Challenge
The 10 Client Node Challenge is conducted using the regular IO500
benchmark, however, with the rule that exactly 10 client nodes must be
used to run the benchmark. You may use any shared storage with any
number of servers. We will announce the results in the Production and
Research lists as well as in separate derived lists.
Once again, we encourage you to submit, to join our community, and to
attend the ISC23 BoF, where we will announce the new IO500
Production and Research lists and their 10 client node counterparts. The
current list includes results from twenty different storage system types
and 70 institutions. We hope that the upcoming list grows even more.
The IO500 Committee