Hi all. For about 2 days my Ceph cluster has been doing 1 million read IO/s on
the default.rgw.buckets.index pool. When this happens my PUT requests drop to
200 req/s, whereas I had 500 req/s before and there was no high IO/s on that pool.
When this happens my rgw nodes and OSDs go up to 100% CPU usage.
Do you have any idea what's going on here that makes this pool get 1 million IO/s?
Also, I have upgraded to 14.2.8 but the problem still persists.
Thanks for your help :)
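For anyone looking at something similar, here is a rough sketch of the commands I would check to see whether dynamic bucket resharding is behind the index-pool reads (pool name as above, radosgw-admin subcommands assume Nautilus; adjust for your setup):

    # Confirm the index pool is where the read load is going
    ceph osd pool stats default.rgw.buckets.index

    # Any bucket reshards queued or in progress?
    radosgw-admin reshard list

    # Buckets that are over (or near) the objects-per-shard limit
    radosgw-admin bucket limit check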
We're happy to announce that a couple of weeks ago, we've submitted a few Github pull requests adding initial Windows support. A big thank you to the people that have already reviewed the patches.
To bring some context about the scope and current status of our work: we're mostly targeting the client side, allowing Windows hosts to consume rados, rbd and cephfs resources.
We have Windows binaries capable of writing to rados pools. We're using mingw to build the ceph components, mostly because it requires the fewest changes to cross-compile ceph for Windows. However, we're soon going to switch to MSVC/Clang due to mingw limitations and long-standing bugs. Porting the unit tests is also something that we're currently working on.
The next step will be implementing a virtual miniport driver so that RBD volumes can be exposed to Windows hosts and Hyper-V guests. We're hoping to leverage librbd as much as possible as part of a daemon that will communicate with the driver. We're also aiming at cephfs and considering using Dokan, which is FUSE compatible.
Merging the open PRs would allow us to move forward, focusing on the drivers and avoiding rebase issues. Any help on that is greatly appreciated.
Last but not least, I'd like to thank Suse, who's sponsoring this effort!
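For a rough idea of what "writing to rados pools" means in practice, a smoke test with the rados CLI built for Windows would look just like on Linux (pool and object names below are placeholders, not part of the actual patches):

    # Placeholder pool/object names; run against an existing pool
    rados -p testpool put hello-object hello.txt
    rados -p testpool ls
    rados -p testpool get hello-object out.txt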
I often get this error on my PUT requests.
rgw.log-552376-2020-03-23 15:12:53.270 7fe5aa6da700 0 WARNING:
set_req_state_err err_no=5 resorting to 500
rgw.log-552377-2020-03-23 15:12:53.270 7fe5aa6da700 1 ====== req done
req=0x7fe5aa6d38c0 op status=-5 http_status=500 latency=59.9965s ======
My PUT requests are always under 1 s, but these requests take 1 minute!
Any logs I can read to find out what's going on with my cluster?
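One way to get more detail on those err_no=5 / 500 responses is to temporarily raise the rgw debug level and retry a failing PUT (the admin socket path below is only an example; adjust it for your rgw instance name):

    # Per-daemon, via the admin socket (example path)
    ceph daemon /var/run/ceph/ceph-client.rgw.myhost.asok config set debug_rgw 20
    ceph daemon /var/run/ceph/ceph-client.rgw.myhost.asok config set debug_ms 1

    # Or cluster-wide via the config database on Nautilus
    ceph config set client.rgw debug_rgw 20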
We're currently investigating setting up a Teuthology cluster to run the
Ceph integration test suite on IBM Z, to improve test coverage on our
platform. However, we're not sure what hardware resources are required to do
so. The target configuration should be large enough to comfortably support
running an instance of the full Ceph integration tests. Is there any data
available from your experience with such installations on how large this
cluster needs to be?
In particular, what number of nodes, how many CPUs and how much memory per
node, and what number (and type/size) of disks should be attached?
Thanks for any data / estimates you can provide!
We have been evaluating other cluster storage solutions and one of them is
just about as fast as Ceph, but only uses FUSE. They mentioned that recent
improvements in the FUSE code allow for performance similar to kernel code.
So, I'm doing some tests comparing the CephFS kernel client and FUSE, and that
does not appear to be true in the Ceph case.
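For reference, the side-by-side setup looks roughly like this (monitor address, mount points and fio parameters are placeholders, not the exact job I'm running):

    # Kernel client and ceph-fuse mounts of the same filesystem
    mount -t ceph 10.0.0.1:6789:/ /mnt/cephfs-kernel -o name=admin,secretfile=/etc/ceph/admin.secret
    ceph-fuse -n client.admin /mnt/cephfs-fuse

    # Same parallel random-read job against each mount point
    fio --name=randread --directory=/mnt/cephfs-kernel --rw=randread \
        --bs=4k --size=1G --numjobs=16 --direct=1 --group_reporting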
It seems that there is a lot of time spent in locks and polls. I'm
wondering if this was needed to be done in the past to get around some
deficiencies in FUSE, but are no longer needed. I don't know enough about
FUSE to figure it out on my own.
This is a very parallel workload running during these samples.
Running `perf top`, I'm seeing:
16.90% [kernel] [k] do_sys_poll
16.68% libopen-pal.so.20.10.1 [.] 0x0000000000082091
12.21% [kernel] [k] __fget
8.36% [kernel] [k] fput
7.01% [kernel] [k] tcp_poll
2.94% [kernel] [k] sock_poll
1.96% [vdso] [.] 0x0000000000000977
1.92% [kernel] [k] syscall_return_via_sysret
1.58% [kernel] [k] tcp_stream_memory_free
Annotating do_sys_poll, I get:
0.09 │ → callq poll_freewait
0.09 │ mov -0x3d8(%rbp),%rcx
│ lea -0x3b0(%rbp),%rsi
│ xor %r8d,%r8d
0.00 │3f3: mov 0x8(%rsi),%eax
0.09 │ lea 0xc(%rsi),%r9
0.00 │ test %eax,%eax
│ ↓ jle 4ce
│ xor %edx,%edx
│ ↓ jmp 416
2.03 │406: add $0x1,%edx
2.02 │ add $0x8,%rcx
6.33 │ cmp %edx,0x8(%rsi)
0.19 │ ↓ jle 4ce
0.09 │416: movslq %edx,%rax
1.99 │ movzwl 0x6(%r9,%rax,8),%edi
22.59 │ stac
2.01 │ mov %r8d,%eax
8.88 │ mov %di,0x6(%rcx)
26.62 │ clac
0.00 │ test %eax,%eax
2.12 │ ↑ je 406
│430: mov $0xfffffff2,%r13d
0.00 │436: mov -0x3b0(%rbp),%rdi
│ test %rdi,%rdi
0.09 │ ↓ je 452
│442: mov (%rdi),%rbx
│ → callq kfree
│ test %rbx,%rbx
│ mov %rbx,%rdi
│ ↑ jne 442
The libopen-pal.so.20.10.1 entries don't provide much info (because I'm not sure
how to load the symbols):
15.03% [.] 0x0000000000082091
0.62% [.] 0x0000000000082093
0.59% [.] opal_libevent2022_event_base_loop
0.50% [.] 0x00000000000820a0
0.47% [.] opal_progress
0.07% [.] 0x000000000006e41b
0.07% [.] opal_libevent2022_evutil_tv_to_msec
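In case it's useful: libopen-pal is part of Open MPI, so installing the matching debuginfo and recording with call graphs should resolve those addresses (package and tool names below assume an RPM-based distro; adjust for yours):

    # Pull in debug symbols for Open MPI (RPM-based distros)
    dnf debuginfo-install openmpi

    # Record with call graphs and browse with symbols resolved
    perf record -g -p <workload-pid> -- sleep 30
    perf report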
And in __fget
2.52 │ sbb %rax,%rax
0.12 │ mov 0x8(%rdx),%rdx
0.14 │ and %edi,%eax
0.21 │ lea (%rdx,%rax,8),%rax
5.45 │ mov (%rax),%rdx
0.45 │ test %rdx,%rdx
│ ↓ je 5c
19.25 │ test %esi,0x44(%rdx)
│ ↓ jne 76
3.15 │ mov 0x38(%rdx),%rax
2.33 │ test %rax,%rax
│ ↑ je 1c
0.00 │ lea 0x1(%rax),%rcx
0.19 │ lea 0x38(%rdx),%r10
58.09 │ lock cmpxchg %rcx,0x38(%rdx)
0.02 │ ↓ jne 61
2.31 │5c: mov %rdx,%rax
0.00 │ pop %rbp
0.00 │ ← retq
Does anyone know how to trace a write/read request from the client to the OSD?
Is there any useful document besides doc/dev/blkin?
I'm wondering how to trace a large distributed storage system such as Ceph in
order to observe/monitor it.
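One low-overhead way to follow individual requests on the OSD side, short of full blkin/LTTng tracing, is the op tracker exposed through the admin socket (osd.0 below is just an example):

    # Ops currently in flight on one OSD
    ceph daemon osd.0 dump_ops_in_flight

    # Recently completed ops with per-stage timestamps
    # (queued, reached_pg, started, commit_sent, done)
    ceph daemon osd.0 dump_historic_ops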