+1
1500 OSDs, mgr is at a constant 100% after upgrading from 14.2.2 to 14.2.5.
On Thu, Dec 19, 2019 at 11:06 AM Toby Darling <toby(a)mrc-lmb.cam.ac.uk> wrote:
>
> On 18/12/2019 22:40, Bryan Stillwell wrote:
> > That's how we noticed it too. Our graphs went silent after the upgrade
> > completed. Is your large cluster over 350 OSDs?
>
> A 'me too' on this - graphs have gone quiet, and mgr is using 100% CPU.
> This happened when we grew our 14.2.5 cluster from 328 to 436 OSDs.
>
> Cheers
> Toby
> --
> Toby Darling, Scientific Computing (2N249)
> MRC Laboratory of Molecular Biology
> Francis Crick Avenue
> Cambridge Biomedical Campus
> Cambridge CB2 0QH
> Phone 01223 267070
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
I'm working on adding some material to the PG-repair page on the docs
website, and after a bit of reading and watching videos, I have boiled it
down to a couple of commands:
Diagnose problems using this command:
$ sudo ceph pg dump --format=json-pretty
and then use the output of that command to discover the number of the
placement group that is inconsistent or broken, and run a command that
looks like this:
$ sudo ceph pg repair 1.4
where "1.4" is the number of the affected placement group.
This is a fine start, but I thought that I would ask everyone here for
their experiences with "pg repair", because I'd like the docs to be a bit
beefier than just a couple of commands.
Thanks in advance everyone.
Zac
(The documentation guy)
Hi,
I'm looking for somebody with more git-fu than I have.
When building ceph we load a ton of submodules.
But I do not need all of them. Some of them are provided as FreeBSD
packages, others are Linux-specific, and things like {s,d}pdk and seastar
would be nice to have once Ceph on FreeBSD needs them, but I'm not yet
that far.
Is there an "easy" way of specifying which ones I do not want, other than
hacking at the .gitmodules file after fetching the git repo?
I looked in `git help submodule`, but found little beyond setting
'active = false' before fetching the submodules.
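A rough sketch of the two approaches I can see (the submodule paths here
are just examples, and I'm assuming each submodule's name matches its path):

$ git clone --no-recurse-submodules https://github.com/ceph/ceph.git
$ cd ceph
$ # initialize only the submodules actually wanted; a plain
$ # 'git submodule update' then leaves the rest alone
$ git submodule init src/fmt src/rocksdb
$ git submodule update
$ # or mark an unwanted submodule inactive up front
$ git config submodule.src/seastar.active false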
--WjW
Submodule path 'ceph-erasure-code-corpus': checked out '2d7d78b9cc52e8a9529d8cc2d2954c7d375d5dd7'
Submodule path 'ceph-object-corpus': checked out 'e9bd1dbea014d62f6ada4d1535241ba4091a7b88'
Submodule path 'cephadm-adoption-corpus': checked out '80c2e76549e35651ac7d7ee17e6badaee42dc866'
Submodule path 'src/blkin': checked out 'f24ceec055ea236a093988237a9821d145f5f7c8'
Submodule path 'src/c-ares': checked out 'fd6124c74da0801f23f9d324559d8b66fb83f533'
Submodule path 'src/civetweb': checked out 'bb99e93da00c3fe8c6b6a98520fb17cf64710ce7'
Submodule path 'src/crypto/isa-l/isa-l_crypto': checked out '603529a4e06ac8a1662c13d6b31f122e21830352'
Submodule path 'src/dmclock': checked out '47703948cb73d3c858cdf0701b741bb82978020a'
Submodule path 'src/erasure-code/jerasure/gf-complete': checked out '7e61b44404f0ed410c83cfd3947a52e88ae044e1'
Submodule path 'src/erasure-code/jerasure/jerasure': checked out '96c76b89d661c163f65a014b8042c9354ccf7f31'
Submodule path 'src/fmt': checked out '7ad3015f5bc77eda28d52f820e6d89955bf0784a'
Submodule path 'src/googletest': checked out '4e29e48840e611ecbef33d10960d7480d2e9034a'
Submodule path 'src/isa-l': checked out '7e1a337433a340bc0974ed0f04301bdaca374af6'
Submodule path 'src/lua': checked out '1fce39c6397056db645718b8f5821571d97869a4'
Submodule path 'src/rapidjson': checked out 'f54b0e47a08782a6131cc3d60f94d038fa6e0a51'
Submodule path 'src/rapidjson/thirdparty/gtest': checked out '0a439623f75c029912728d80cb7f1b8b48739ca4'
Submodule path 'src/rocksdb': checked out '4c736f177851cbf9fb7a6790282306ffac5065f8'
Submodule path 'src/seastar': checked out 'fb4d559f1417edd44580a44ee90c25c3cb76ea6e'
Submodule path 'src/seastar/dpdk': checked out '7c29bbc804687fca5a2f71d05a120e81b2bd0066'
Submodule path 'src/spdk': checked out '06d09c1108b16197ea985ae4d67867ed672a1e18'
Submodule path 'src/spdk/dpdk': checked out 'cb4240afc36b5da057cd4940d33964f84d0512c8'
Submodule path 'src/spdk/intel-ipsec-mb': checked out '489ec6082a9d4a65d7569d1772dce64d2e96f5b5'
Submodule path 'src/spdk/isa-l': checked out '09e787231b31add1234ec9a3dfe718533f1c3bf4'
Submodule path 'src/spdk/ocf': checked out '515137f25ec71dca0c268fbd1437dd7d177e4f8d'
Submodule path 'src/xxHash': checked out '1f40c6511fa8dd9d2e337ca8c9bc18b3e87663c9'
Submodule path 'src/zstd': checked out '83b51e9f886be7c2a4d477b6e7bc6db831791d8d'
Hi,
Recently we met a requirement regarding the zlib windowBits used for
compression. In the source code, we found that the zlib windowBits is
hard-coded as -15 [1], while in zlib itself it is a parameter that can be
set to different values [2]. According to the zlib documentation,
windowBits can be set to -15..-8 for raw deflate, 8..15 for compression
with a zlib header and trailer, and 16 can be added for optional gzip
encoding. Now we want to set it to 15 in Ceph to satisfy our requirement.
Would it be possible for Ceph upstream to make it configurable, so that
users can change it for their different use cases? Or is there any other
way to support it?
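If it were made configurable, usage might look like the sketch below; note
that the option name compressor_zlib_winsize is only an assumption for
illustration here, not an existing knob:

$ ceph config set global compressor_zlib_winsize 15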
We hope to get the community's feedback. I have registered a feature
request [3] as well; feel free to leave any comments.
Thanks
Xiyuan Wang
[1]:
https://github.com/ceph/ceph/blob/master/src/compressor/zlib/ZlibCompressor…
[2]:
https://github.com/madler/zlib/blob/cacf7f1d4e3d44d871b605da3b647f07d718623…
[3]: https://tracker.ceph.com/issues/43324
After upgrading one of our clusters from Nautilus 14.2.2 to Nautilus 14.2.5 I'm seeing 100% CPU usage by a single ceph-mgr thread (found using 'top -H'). Attaching to the thread with strace shows a lot of mmap and munmap calls. Here's the distribution after watching it for a few minutes:
48.73% - mmap
49.48% - munmap
1.75% - futex
0.05% - madvise
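In case it helps anyone reproduce the diagnosis, the commands were roughly
the following (the thread ID is a placeholder):

$ top -H -p $(pidof ceph-mgr)    # find the TID of the busy thread
$ sudo strace -c -f -p <TID>     # let it run a while, then Ctrl-C for the summary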
I've upgraded 3 other clusters so far (120 OSDs, 30 OSDs, 200 OSDs), but this is the only one which has seen the problem (355 OSDs). Perhaps it has something to do with its size?
I was suspecting it might have to do with one of the modules misbehaving, so I disabled all of them:
# ceph mgr module ls | jq -r '.enabled_modules'
[]
But that didn't help (I restarted the mgrs after disabling the modules too).
I also tried setting debug_mgr and debug_mgrc to 20, but nothing popped out at me as being the cause of the problem.
It only seems to affect the active mgr. If I stop the active mgr the problem moves to one of the other mgrs.
Any guesses or tips on what next steps I should take to figure out what's going on?
Thanks,
Bryan
That's how we noticed it too. Our graphs went silent after the upgrade completed. Is your large cluster over 350 OSDs?
Bryan
On Dec 18, 2019, at 2:59 PM, Paul Mezzanini <pfmeec(a)rit.edu> wrote:
Just wanted to say that we are seeing the same thing on our large cluster. It manifested mainly in the form of Prometheus stats being totally broken (they take too long to return, if they return at all, so the requesting program just gives up).
--
Paul Mezzanini
Sr Systems Administrator / Engineer, Research Computing
Information & Technology Services
Finance & Administration
Rochester Institute of Technology
o:(585) 475-3245 | pfmeec(a)rit.edu
Sent from my phone. Please excuse any brevity or typoos.
________________________________
From: Bryan Stillwell <bstillwell(a)godaddy.com>
Sent: Wednesday, December 18, 2019 4:44:45 PM
To: Sage Weil <sage(a)newdream.net>
Cc: ceph-users <ceph-users(a)ceph.io>; dev(a)ceph.io
Subject: [ceph-users] Re: High CPU usage by ceph-mgr in 14.2.5
On Dec 18, 2019, at 11:58 AM, Sage Weil <sage(a)newdream.net> wrote:
On Wed, 18 Dec 2019, Bryan Stillwell wrote:
After upgrading one of our clusters from Nautilus 14.2.2 to Nautilus 14.2.5 I'm seeing 100% CPU usage by a single ceph-mgr thread (found using 'top -H'). Attaching to the thread with strace shows a lot of mmap and munmap calls. Here's the distribution after watching it for a few minutes:
48.73% - mmap
49.48% - munmap
1.75% - futex
0.05% - madvise
I've upgraded 3 other clusters so far (120 OSDs, 30 OSDs, 200 OSDs), but this is the only one which has seen the problem (355 OSDs). Perhaps it has something to do with its size?
I was suspecting it might have to do with one of the modules misbehaving, so I disabled all of them:
# ceph mgr module ls | jq -r '.enabled_modules'
[]
But that didn't help (I restarted the mgrs after disabling the modules too).
I also tried setting debug_mgr and debug_mgrc to 20, but nothing popped out at me as being the cause of the problem.
It only seems to affect the active mgr. If I stop the active mgr the problem moves to one of the other mgrs.
Any guesses or tips on what next steps I should take to figure out what's going on?
What are the balancer modes on the affected and unaffected cluster(s)?
Affected cluster has a balancer mode of "none".
The other three are "upmap", "none", and "upmap".
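(For anyone following along, the mode was checked with the balancer module;
a minimal example:)

$ ceph balancer status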
I don't know if you saw in ceph-users, but this bug report seems to point at the finisher-Mgr thread:
https://tracker.ceph.com/issues/43364
Thanks,
Bryan
On Tue, Dec 10, 2019 at 11:02 PM Li Wang <laurence.liwang(a)gmail.com> wrote:
>
> Hi Jason,
> Would it be possible to do the following optimization:
> (1) For writes, update the in-memory map first, then write the data and
> asynchronously update the map, so that we do not have the first-write
> performance problem.
> (2) For rbd open, after the exclusive lock is acquired and before loading
> the map, write a MAP_IN_USE flag into the rbd header.
> (3) Before releasing the exclusive lock, flush pending map writes and
> clear the flag.
> (4) For rbd open, if the flag exists before loading the map, discard and
> rebuild the map.
Changing the behaviour like this would break backwards compatibility
with older clients. Therefore, it would really need a new feature bit
to describe "object-map v2". Rebuilding the map on a large image is
not a "free" operation since you might have to loop through tens of
thousands of objects. That could be quite the unexpected surprise for
a user attempting to restart a failed VM.
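(For reference, turning the features off or rebuilding a suspect map from
the CLI looks roughly like this; the pool/image names are placeholders:)

$ rbd feature disable mypool/myimage fast-diff    # fast-diff depends on object-map
$ rbd feature disable mypool/myimage object-map
$ rbd object-map rebuild mypool/myimage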
> Cheers,
> Li Wang
>
> Jason Dillaman <jdillama(a)redhat.com> 于2019年12月9日周一 下午9:42写道:
> >
> > On Mon, Dec 9, 2019 at 8:19 AM Li Wang <laurence.liwang(a)gmail.com> wrote:
> > >
> > > Hi Jason,
> > > If, before the first write to an object, the object map is updated first
> > > to indicate that the object EXISTs, what happens if a crash occurs after
> > > the object map write but before the data write? Will the map wrongly
> > > indicate that an object EXISTs when in fact it does NOT exist? In other
> > > words, the map is subject to the following semantics: if an object
> >
> > That's not an issue that would result in an object leak or data
> > corruption. If the object-map flags the object as existing when it
> > doesn't due to an untimely crash, it will either do an unnecessary
> > read IO or delete request when removing the image.
> >
> > > is NONEXISTENT in the map, it really does not exist; if an object EXISTs
> > > in the map, it does not necessarily exist. A read/write to such an object
> > > will return ENOENT, and the client will read from the parent / copy up
> > > from the parent and then write, so that it is not a problem.
> > > If the above understanding is correct, how about diff computation: will
> > > the wrong indication
> >
> > Yes, it will be wrong for the affected object so your diff will
> > potentially include an extra object on the delta (but no data
> > corruption). The object-map can be re-built using the CLI, but there
> > really shouldn't be a need for such a corner case (that is just
> > slightly sub-optimal).
> >
> > > in the map cause a problem? Also, we are wondering what the negative
> > > impacts are of disabling the object map.
> > >
> > > Cheers,
> > > Li Wang
> > >
> > > Jason Dillaman <jdillama(a)redhat.com> 于2019年12月6日周五 下午9:56写道:
> > > >
> > > > On Thu, Dec 5, 2019 at 11:14 PM Li Wang <laurence.liwang(a)gmail.com> wrote:
> > > > >
> > > > > Hi Jason,
> > > > > We found that the synchronous object-map update, which as a result
> > > > > writes two objects on every write, greatly slows down the first-write
> > > > > performance of a newly created rbd image by up to 10x, which is not
> > > > > acceptable in our scenario. Could we do some optimizations on it, for
> > > > > example batching the map writes or lazily updating the map? Do we need
> > > > > to maintain accurate synchronization between the map and the data
> > > > > objects? After a glimpse at the librbd code, there seems to be no
> > > > > transactional design for the write to the two objects (map object and
> > > > > data object).
> > > >
> > > > If you don't update the object-map before issuing the first write to
> > > > the associated object, you could crash and therefore the object-map's
> > > > state is worthless since you couldn't trust it to tell the truth. The
> > > > cost of object-map is supposed to be amortized over time so the first
> > > > writes on a new image will incur the performance hits, but future
> > > > writes do not.
> > > >
> > > > The good news is that you are more than welcome to disable
> > > > object-map/fast-diff if the performance penalty is too great for your
> > > > application -- it's not a required feature of RBD.
> > > >
> > > > >
> > > > > Cheers,
> > > > > Li Wang
> > > > >
> > > >
> > > >
> > > > --
> > > > Jason
> > > >
> > >
> >
> >
> > --
> > Jason
> >
>
--
Jason
Adding the dev list since it seems like a bug in 14.2.5.
I was able to capture the output from perf top:
21.58% libceph-common.so.0 [.] ceph::buffer::v14_2_0::list::append
20.90% libstdc++.so.6.0.19 [.] std::getline<char, std::char_traits<char>, std::allocator<char> >
13.25% libceph-common.so.0 [.] ceph::buffer::v14_2_0::list::append
10.11% libstdc++.so.6.0.19 [.] std::istream::sentry::sentry
8.94% libstdc++.so.6.0.19 [.] std::basic_ios<char, std::char_traits<char> >::clear
3.24% libceph-common.so.0 [.] ceph::buffer::v14_2_0::ptr::unused_tail_length
1.69% libceph-common.so.0 [.] std::getline<char, std::char_traits<char>, std::allocator<char> >@plt
1.63% libstdc++.so.6.0.19 [.] std::istream::sentry::sentry@plt
1.21% [kernel] [k] __do_softirq
0.77% libpython2.7.so.1.0 [.] PyEval_EvalFrameEx
0.55% [kernel] [k] _raw_spin_unlock_irqrestore
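(For reference, the profile above came from something like the following;
the pidof lookup assumes a single busy ceph-mon on the host:)

$ sudo perf top -p $(pidof ceph-mon)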
I increased mon debugging to 20 and nothing stuck out to me.
Bryan
> On Dec 12, 2019, at 4:46 PM, Bryan Stillwell <bstillwell(a)godaddy.com> wrote:
>
> On our test cluster after upgrading to 14.2.5 I'm having problems with the mons pegging a CPU core while moving data around. I'm currently converting the OSDs from FileStore to BlueStore by marking the OSDs out in multiple nodes, destroying the OSDs, and then recreating them with ceph-volume lvm batch. This seems to get the ceph-mon process into a state where it pegs a CPU core on one of the mons:
>
> 1764450 ceph 20 0 4802412 2.1g 16980 S 100.0 28.1 4:54.72 ceph-mon
>
> Has anyone else run into this with 14.2.5 yet? I didn't see this problem while the cluster was running 14.2.4.
>
> Thanks,
> Bryan
Hi,
I've been seeing a mon segfault in current master which can be consistently
tripped from a kernel CephFS mount attempt against a vstart cluster:
Thread 14 "msgr-worker-2" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f42403bf700 (LWP 92639)]
0x00007f4247d2b960 in __lll_unlock_elision () from /lib64/libpthread.so.0
(gdb) bt
#0 0x00007f4247d2b960 in __lll_unlock_elision () from /lib64/libpthread.so.0
#1 0x00007f424b1b4f4f in __gthread_mutex_unlock (__mutex=0x5613b7e0a2c8) at /usr/include/c++/7/x86_64-suse-linux/bits/gthr-default.h:778
#2 0x00007f424b1baa8a in std::mutex::unlock (this=0x5613b7e0a2c8) at /usr/include/c++/7/bits/std_mutex.h:121
#3 0x00007f424b5664d0 in ProtocolV1::open (this=0x5613b83a7800, reply=..., authorizer_reply=...) at /home/david/ceph/src/msg/async/ProtocolV1.cc:2481
#4 0x00007f424b561e4a in ProtocolV1::handle_connect_message_2 (this=0x5613b83a7800) at /home/david/ceph/src/msg/async/ProtocolV1.cc:2055
#5 0x00007f424b55fd1f in ProtocolV1::handle_connect_message_1 (this=0x5613b83a7800, buffer=0x5613b8289000 "\332*\244j\270\217\001/\b", r=0)
at /home/david/ceph/src/msg/async/ProtocolV1.cc:1915
(rest in https://paste.opensuse.org/43958295 )
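The reproduction is roughly: spin up a vstart cluster, then attempt a
kernel mount. A sketch, with the mon address and secret as placeholders:

$ MON=1 OSD=1 MDS=1 ../src/vstart.sh -n -d    # from the ceph build directory
$ sudo mount -t ceph 192.168.0.1:40000:/ /mnt -o name=admin,secret=<key>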
Git bisect points at the following commit as the culprit:
c48a29b9edde3c6d3c msg/async: do not register lossy client connections
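The bisect itself was driven roughly as follows (the good ref and the test
script are placeholders):

$ git bisect start
$ git bisect bad HEAD
$ git bisect good <last-known-good-sha>
$ git bisect run ./mount-and-check.sh    # exits non-zero when the mon crashes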
I'll raise a ticket to track this, but just thought I'd ping the list to
see whether others were hitting it...
Cheers, David
This is the eighth backport release in the Ceph Mimic stable release
series. Its sole purpose is to fix a regression that found its way into
the previous release.
Notable Changes
---------------
* Due to a missed backport, clusters in the process of being upgraded from
13.2.6 to 13.2.7 might suffer an OSD crash in build_incremental_map_msg.
This regression was reported in https://tracker.ceph.com/issues/43106
and is fixed in 13.2.8 (this release). Users of 13.2.6 can upgrade to 13.2.8
directly - i.e., skip 13.2.7 - to avoid this.
Changelog
---------
* osd: fix sending incremental map messages (issue#43106 pr#32000, Sage Weil)
* tests: added missing point release versions (pr#32087, Yuri Weinstein)
* tests: rgw: add missing force-branch: ceph-mimic for swift tasks (pr#32033, Casey Bodley)
For a blog with links to PRs and issues please check out
https://ceph.io/releases/v13-2-8-mimic-released/
Getting Ceph
------------
* Git at git://github.com/ceph/ceph.git
* Tarball at http://download.ceph.com/tarballs/ceph-13.2.8.tar.gz
* For packages, see http://docs.ceph.com/docs/master/install/get-packages/
* Release git sha1: 5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0
--
Abhishek Lekshmanan
SUSE Software Solutions Germany GmbH