+1
1500 OSDs, mgr is at a constant 100% after upgrading from 14.2.2 to 14.2.5.
On Thu, Dec 19, 2019 at 11:06 AM Toby Darling <toby(a)mrc-lmb.cam.ac.uk> wrote:
>
> On 18/12/2019 22:40, Bryan Stillwell wrote:
> > That's how we noticed it too. Our graphs went silent after the upgrade
> > completed. Is your large cluster over 350 OSDs?
>
> A 'me too' on this - graphs have gone quiet, and mgr is using 100% CPU.
> This happened when we grew our 14.2.5 cluster from 328 to 436 OSDs.
>
> Cheers
> Toby
> --
> Toby Darling, Scientific Computing (2N249)
> MRC Laboratory of Molecular Biology
> Francis Crick Avenue
> Cambridge Biomedical Campus
> Cambridge CB2 0QH
> Phone 01223 267070
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
I'm working on adding some material to the PG-repair page on the docs
website, and after a bit of reading and watching videos, I have boiled it
down to a couple of commands:
Diagnose problems using this command:
$ sudo ceph pg dump --format=json-pretty
and then use the output of that command to discover the number of the
placement group that is inconsistent or broken, and run a command that
looks like this:
$ sudo ceph pg repair 1.4
where "1.4" is the number of the affected placement group.
This is a fine start, but I thought that I would ask everyone here for
their experiences with "pg repair", because I'd like the docs to be a bit
beefier than just a couple of commands.
Thanks in advance everyone.
Zac
(The documentation guy)
Hi,
I'm looking for somebody with more git-fu than I have.
When building ceph we load a ton of submodules.
But I do not need all of them. Some of them are provided as FreeBSD
packages, others are Linux-specific, and things like {s,d}pdk and seastar
would be nice to have once Ceph on FreeBSD needs them, but I'm not yet
that far.
Is there an "easy" way of specifying which ones I do not want, other than
hacking at the .gitmodules file after fetching the git repo?
I looked in `git help submodule`, but found little beyond setting
'active = false' before fetching the submodules.
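A rough sketch of the two approaches I can see (the submodule paths here
are just examples, and I'm assuming each submodule's name matches its path):

$ git clone --no-recurse-submodules https://github.com/ceph/ceph.git
$ cd ceph
$ # initialize only the submodules actually wanted; a plain
$ # 'git submodule update' then leaves the rest alone
$ git submodule init src/fmt src/rocksdb
$ git submodule update
$ # or mark an unwanted submodule inactive up front
$ git config submodule.src/seastar.active false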
--WjW
Submodule path 'ceph-erasure-code-corpus': checked out '2d7d78b9cc52e8a9529d8cc2d2954c7d375d5dd7'
Submodule path 'ceph-object-corpus': checked out 'e9bd1dbea014d62f6ada4d1535241ba4091a7b88'
Submodule path 'cephadm-adoption-corpus': checked out '80c2e76549e35651ac7d7ee17e6badaee42dc866'
Submodule path 'src/blkin': checked out 'f24ceec055ea236a093988237a9821d145f5f7c8'
Submodule path 'src/c-ares': checked out 'fd6124c74da0801f23f9d324559d8b66fb83f533'
Submodule path 'src/civetweb': checked out 'bb99e93da00c3fe8c6b6a98520fb17cf64710ce7'
Submodule path 'src/crypto/isa-l/isa-l_crypto': checked out '603529a4e06ac8a1662c13d6b31f122e21830352'
Submodule path 'src/dmclock': checked out '47703948cb73d3c858cdf0701b741bb82978020a'
Submodule path 'src/erasure-code/jerasure/gf-complete': checked out '7e61b44404f0ed410c83cfd3947a52e88ae044e1'
Submodule path 'src/erasure-code/jerasure/jerasure': checked out '96c76b89d661c163f65a014b8042c9354ccf7f31'
Submodule path 'src/fmt': checked out '7ad3015f5bc77eda28d52f820e6d89955bf0784a'
Submodule path 'src/googletest': checked out '4e29e48840e611ecbef33d10960d7480d2e9034a'
Submodule path 'src/isa-l': checked out '7e1a337433a340bc0974ed0f04301bdaca374af6'
Submodule path 'src/lua': checked out '1fce39c6397056db645718b8f5821571d97869a4'
Submodule path 'src/rapidjson': checked out 'f54b0e47a08782a6131cc3d60f94d038fa6e0a51'
Submodule path 'src/rapidjson/thirdparty/gtest': checked out '0a439623f75c029912728d80cb7f1b8b48739ca4'
Submodule path 'src/rocksdb': checked out '4c736f177851cbf9fb7a6790282306ffac5065f8'
Submodule path 'src/seastar': checked out 'fb4d559f1417edd44580a44ee90c25c3cb76ea6e'
Submodule path 'src/seastar/dpdk': checked out '7c29bbc804687fca5a2f71d05a120e81b2bd0066'
Submodule path 'src/spdk': checked out '06d09c1108b16197ea985ae4d67867ed672a1e18'
Submodule path 'src/spdk/dpdk': checked out 'cb4240afc36b5da057cd4940d33964f84d0512c8'
Submodule path 'src/spdk/intel-ipsec-mb': checked out '489ec6082a9d4a65d7569d1772dce64d2e96f5b5'
Submodule path 'src/spdk/isa-l': checked out '09e787231b31add1234ec9a3dfe718533f1c3bf4'
Submodule path 'src/spdk/ocf': checked out '515137f25ec71dca0c268fbd1437dd7d177e4f8d'
Submodule path 'src/xxHash': checked out '1f40c6511fa8dd9d2e337ca8c9bc18b3e87663c9'
Submodule path 'src/zstd': checked out '83b51e9f886be7c2a4d477b6e7bc6db831791d8d'
Hi,
Recently we met a requirement regarding the zlib windowBits used for
compression. In the source code, we found that the zlib windowBits is
hard-coded as -15 [1], while in zlib itself it is a parameter that can be
set to different values [2]. According to the zlib documentation,
windowBits can be set to -15..-8 for raw deflate, 8..15 for compression
with a zlib header and trailer, and 16 can be added for optional gzip
encoding. Now we want to set it to 15 in Ceph to satisfy our requirement.
Would it be possible for Ceph upstream to make it configurable, so that
users can change it for their different use cases? Or is there any other
way to support it?
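If it were made configurable, usage might look like the sketch below; note
that the option name compressor_zlib_winsize is only an assumption for
illustration here, not an existing knob:

$ ceph config set global compressor_zlib_winsize 15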
We hope to get the community's feedback. I have registered a feature
request [3] as well; feel free to leave any comments.
Thanks
Xiyuan Wang
[1]:
https://github.com/ceph/ceph/blob/master/src/compressor/zlib/ZlibCompressor…
[2]:
https://github.com/madler/zlib/blob/cacf7f1d4e3d44d871b605da3b647f07d718623…
[3]: https://tracker.ceph.com/issues/43324
After upgrading one of our clusters from Nautilus 14.2.2 to Nautilus 14.2.5 I'm seeing 100% CPU usage by a single ceph-mgr thread (found using 'top -H'). Attaching to the thread with strace shows a lot of mmap and munmap calls. Here's the distribution after watching it for a few minutes:
48.73% - mmap
49.48% - munmap
1.75% - futex
0.05% - madvise
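In case it helps anyone reproduce the diagnosis, the commands were roughly
the following (the thread ID is a placeholder):

$ top -H -p $(pidof ceph-mgr)    # find the TID of the busy thread
$ sudo strace -c -f -p <TID>     # let it run a while, then Ctrl-C for the summary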
I've upgraded 3 other clusters so far (120 OSDs, 30 OSDs, 200 OSDs), but this is the only one which has seen the problem (355 OSDs). Perhaps it has something to do with its size?
I was suspecting it might have to do with one of the modules misbehaving, so I disabled all of them:
# ceph mgr module ls | jq -r '.enabled_modules'
[]
But that didn't help (I restarted the mgrs after disabling the modules too).
I also tried setting debug_mgr and debug_mgrc to 20, but nothing popped out at me as being the cause of the problem.
It only seems to affect the active mgr. If I stop the active mgr the problem moves to one of the other mgrs.
Any guesses or tips on what next steps I should take to figure out what's going on?
Thanks,
Bryan
That's how we noticed it too. Our graphs went silent after the upgrade completed. Is your large cluster over 350 OSDs?
Bryan
On Dec 18, 2019, at 2:59 PM, Paul Mezzanini <pfmeec(a)rit.edu> wrote:
Just wanted to say that we are seeing the same thing on our large cluster. It manifested mainly in the form of Prometheus stats being totally broken (they take too long to return, if they return at all, so the requesting program just gives up).
--
Paul Mezzanini
Sr Systems Administrator / Engineer, Research Computing
Information & Technology Services
Finance & Administration
Rochester Institute of Technology
o:(585) 475-3245 | pfmeec(a)rit.edu
Sent from my phone. Please excuse any brevity or typoos.
________________________________
From: Bryan Stillwell <bstillwell(a)godaddy.com>
Sent: Wednesday, December 18, 2019 4:44:45 PM
To: Sage Weil <sage(a)newdream.net>
Cc: ceph-users <ceph-users(a)ceph.io>; dev(a)ceph.io
Subject: [ceph-users] Re: High CPU usage by ceph-mgr in 14.2.5
On Dec 18, 2019, at 11:58 AM, Sage Weil <sage(a)newdream.net> wrote:
On Wed, 18 Dec 2019, Bryan Stillwell wrote:
After upgrading one of our clusters from Nautilus 14.2.2 to Nautilus 14.2.5 I'm seeing 100% CPU usage by a single ceph-mgr thread (found using 'top -H'). Attaching to the thread with strace shows a lot of mmap and munmap calls. Here's the distribution after watching it for a few minutes:
48.73% - mmap
49.48% - munmap
1.75% - futex
0.05% - madvise
I've upgraded 3 other clusters so far (120 OSDs, 30 OSDs, 200 OSDs), but this is the only one which has seen the problem (355 OSDs). Perhaps it has something to do with its size?
I was suspecting it might have to do with one of the modules misbehaving, so I disabled all of them:
# ceph mgr module ls | jq -r '.enabled_modules'
[]
But that didn't help (I restarted the mgrs after disabling the modules too).
I also tried setting debug_mgr and debug_mgrc to 20, but nothing popped out at me as being the cause of the problem.
It only seems to affect the active mgr. If I stop the active mgr the problem moves to one of the other mgrs.
Any guesses or tips on what next steps I should take to figure out what's going on?
What are the balancer modes on the affected and unaffected cluster(s)?
Affected cluster has a balancer mode of "none".
The other three are "upmap", "none", and "upmap".
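(For anyone following along, the mode was checked with the balancer module;
a minimal example:)

$ ceph balancer status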
I don't know if you saw in ceph-users, but this bug report seems to point at the finisher-Mgr thread:
https://tracker.ceph.com/issues/43364
Thanks,
Bryan
On Tue, Dec 10, 2019 at 11:02 PM Li Wang <laurence.liwang(a)gmail.com> wrote:
>
> Hi Jason,
> Would it be possible to do the following optimization:
> (1) For writes, update the in-memory map first, then write the data and
> asynchronously update the map, so that we do not have the first-write
> performance problem.
> (2) For rbd open, after the exclusive lock is acquired and before loading
> the map, write a MAP_IN_USE flag into the rbd header.
> (3) Before releasing the exclusive lock, flush pending map writes and
> clear the flag.
> (4) For rbd open, if the flag exists before loading the map, discard and
> rebuild the map.
Changing the behaviour like this would break backwards compatibility
with older clients. Therefore, it would really need a new feature bit
to describe "object-map v2". Rebuilding the map on a large image is
not a "free" operation since you might have to loop through tens of
thousands of objects. That could be quite the unexpected surprise for
a user attempting to restart a failed VM.
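(For reference, turning the features off or rebuilding a suspect map from
the CLI looks roughly like this; the pool/image names are placeholders:)

$ rbd feature disable mypool/myimage fast-diff    # fast-diff depends on object-map
$ rbd feature disable mypool/myimage object-map
$ rbd object-map rebuild mypool/myimage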
> Cheers,
> Li Wang
>
> Jason Dillaman <jdillama(a)redhat.com> 于2019年12月9日周一 下午9:42写道:
> >
> > On Mon, Dec 9, 2019 at 8:19 AM Li Wang <laurence.liwang(a)gmail.com> wrote:
> > >
> > > Hi Jason,
> > > If, before the first write to an object, the object map is updated first
> > > to indicate that the object EXISTs, what happens if a crash occurs after
> > > the object map write but before the data write? Will the map wrongly
> > > indicate that an object EXISTs when in fact it does NOT exist? In other
> > > words, the map is subject to the following semantics: if an object
> >
> > That's not an issue that would result in an object leak or data
> > corruption. If the object-map flags the object as existing when it
> > doesn't due to an untimely crash, it will either do an unnecessary
> > read IO or delete request when removing the image.
> >
> > > is NONEXISTENT in the map, it really does not exist; if an object EXISTs
> > > in the map, it does not necessarily exist. A read/write to such an object
> > > will return ENOENT, and the client will read from the parent / copy up
> > > from the parent and then write, so that it is not a problem.
> > > If the above understanding is correct, how about diff computation: will
> > > the wrong indication
> >
> > Yes, it will be wrong for the affected object so your diff will
> > potentially include an extra object on the delta (but no data
> > corruption). The object-map can be re-built using the CLI, but there
> > really shouldn't be a need for such a corner case (that is just
> > slightly sub-optimal).
> >
> > > in the map cause a problem? Also, we are wondering what the negative
> > > impacts are of disabling the object map.
> > >
> > > Cheers,
> > > Li Wang
> > >
> > > Jason Dillaman <jdillama(a)redhat.com> 于2019年12月6日周五 下午9:56写道:
> > > >
> > > > On Thu, Dec 5, 2019 at 11:14 PM Li Wang <laurence.liwang(a)gmail.com> wrote:
> > > > >
> > > > > Hi Jason,
> > > > > We found that the synchronous object-map update, which as a result
> > > > > writes two objects on every write, greatly slows down the first-write
> > > > > performance of a newly created rbd image by up to 10x, which is not
> > > > > acceptable in our scenario. Could we do some optimizations on it, for
> > > > > example batching the map writes or lazily updating the map? Do we need
> > > > > to maintain accurate synchronization between the map and the data
> > > > > objects? After a glimpse at the librbd code, there seems to be no
> > > > > transactional design for the write to the two objects (map object and
> > > > > data object).
> > > >
> > > > If you don't update the object-map before issuing the first write to
> > > > the associated object, you could crash and therefore the object-map's
> > > > state is worthless since you couldn't trust it to tell the truth. The
> > > > cost of object-map is supposed to be amortized over time so the first
> > > > writes on a new image will incur the performance hits, but future
> > > > writes do not.
> > > >
> > > > The good news is that you are more than welcome to disable
> > > > object-map/fast-diff if the performance penalty is too great for your
> > > > application -- it's not a required feature of RBD.
> > > >
> > > > >
> > > > > Cheers,
> > > > > Li Wang
> > > > >
> > > >
> > > >
> > > > --
> > > > Jason
> > > >
> > >
> >
> >
> > --
> > Jason
> >
>
--
Jason
Adding the dev list since it seems like a bug in 14.2.5.
I was able to capture the output from perf top:
21.58% libceph-common.so.0 [.] ceph::buffer::v14_2_0::list::append
20.90% libstdc++.so.6.0.19 [.] std::getline<char, std::char_traits<char>, std::allocator<char> >
13.25% libceph-common.so.0 [.] ceph::buffer::v14_2_0::list::append
10.11% libstdc++.so.6.0.19 [.] std::istream::sentry::sentry
8.94% libstdc++.so.6.0.19 [.] std::basic_ios<char, std::char_traits<char> >::clear
3.24% libceph-common.so.0 [.] ceph::buffer::v14_2_0::ptr::unused_tail_length
1.69% libceph-common.so.0 [.] std::getline<char, std::char_traits<char>, std::allocator<char> >@plt
1.63% libstdc++.so.6.0.19 [.] std::istream::sentry::sentry@plt
1.21% [kernel] [k] __do_softirq
0.77% libpython2.7.so.1.0 [.] PyEval_EvalFrameEx
0.55% [kernel] [k] _raw_spin_unlock_irqrestore
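(For reference, the profile above came from something like the following;
the pidof lookup assumes a single busy ceph-mon on the host:)

$ sudo perf top -p $(pidof ceph-mon)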
I increased mon debugging to 20 and nothing stuck out to me.
Bryan
> On Dec 12, 2019, at 4:46 PM, Bryan Stillwell <bstillwell(a)godaddy.com> wrote:
>
> On our test cluster after upgrading to 14.2.5 I'm having problems with the mons pegging a CPU core while moving data around. I'm currently converting the OSDs from FileStore to BlueStore by marking the OSDs out in multiple nodes, destroying the OSDs, and then recreating them with ceph-volume lvm batch. This seems to get the ceph-mon process into a state where it pegs a CPU core on one of the mons:
>
> 1764450 ceph 20 0 4802412 2.1g 16980 S 100.0 28.1 4:54.72 ceph-mon
>
> Has anyone else run into this with 14.2.5 yet? I didn't see this problem while the cluster was running 14.2.4.
>
> Thanks,
> Bryan
Hi,
I've been seeing a mon segfault in current master which can be consistently
tripped from a kernel CephFS mount attempt against a vstart cluster:
Thread 14 "msgr-worker-2" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f42403bf700 (LWP 92639)]
0x00007f4247d2b960 in __lll_unlock_elision () from /lib64/libpthread.so.0
(gdb) bt
#0 0x00007f4247d2b960 in __lll_unlock_elision () from /lib64/libpthread.so.0
#1 0x00007f424b1b4f4f in __gthread_mutex_unlock (__mutex=0x5613b7e0a2c8) at /usr/include/c++/7/x86_64-suse-linux/bits/gthr-default.h:778
#2 0x00007f424b1baa8a in std::mutex::unlock (this=0x5613b7e0a2c8) at /usr/include/c++/7/bits/std_mutex.h:121
#3 0x00007f424b5664d0 in ProtocolV1::open (this=0x5613b83a7800, reply=..., authorizer_reply=...) at /home/david/ceph/src/msg/async/ProtocolV1.cc:2481
#4 0x00007f424b561e4a in ProtocolV1::handle_connect_message_2 (this=0x5613b83a7800) at /home/david/ceph/src/msg/async/ProtocolV1.cc:2055
#5 0x00007f424b55fd1f in ProtocolV1::handle_connect_message_1 (this=0x5613b83a7800, buffer=0x5613b8289000 "\332*\244j\270\217\001/\b", r=0)
at /home/david/ceph/src/msg/async/ProtocolV1.cc:1915
(rest in https://paste.opensuse.org/43958295 )
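The reproduction is roughly: spin up a vstart cluster, then attempt a
kernel mount. A sketch, with the mon address and secret as placeholders:

$ MON=1 OSD=1 MDS=1 ../src/vstart.sh -n -d    # from the ceph build directory
$ sudo mount -t ceph 192.168.0.1:40000:/ /mnt -o name=admin,secret=<key>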
Git bisect points at the following commit as the culprit:
c48a29b9edde3c6d3c msg/async: do not register lossy client connections
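The bisect itself was driven roughly as follows (the good ref and the test
script are placeholders):

$ git bisect start
$ git bisect bad HEAD
$ git bisect good <last-known-good-sha>
$ git bisect run ./mount-and-check.sh    # exits non-zero when the mon crashes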
I'll raise a ticket to track this, but just thought I'd ping the list to
see whether others were hitting it...
Cheers, David
This is the eighth backport release in the Ceph Mimic stable release
series. Its sole purpose is to fix a regression that found its way into
the previous release.
Notable Changes
---------------
* Due to a missed backport, clusters in the process of being upgraded from
13.2.6 to 13.2.7 might suffer an OSD crash in build_incremental_map_msg.
This regression was reported in https://tracker.ceph.com/issues/43106
and is fixed in 13.2.8 (this release). Users of 13.2.6 can upgrade to 13.2.8
directly - i.e., skip 13.2.7 - to avoid this.
Changelog
---------
* osd: fix sending incremental map messages (issue#43106 pr#32000, Sage Weil)
* tests: added missing point release versions (pr#32087, Yuri Weinstein)
* tests: rgw: add missing force-branch: ceph-mimic for swift tasks (pr#32033, Casey Bodley)
For a blog with links to PRs and issues please check out
https://ceph.io/releases/v13-2-8-mimic-released/
Getting Ceph
------------
* Git at git://github.com/ceph/ceph.git
* Tarball at http://download.ceph.com/tarballs/ceph-13.2.8.tar.gz
* For packages, see http://docs.ceph.com/docs/master/install/get-packages/
* Release git sha1: 5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0
--
Abhishek Lekshmanan
SUSE Software Solutions Germany GmbH