Hi,
Like I said in an earlier mail to this list, we rebalanced ~60% of the
CephFS metadata pool to NVMe-backed devices: roughly 422 M objects (1.2
billion including replicas), with 512 PGs allocated to the pool. While
rebalancing we suffered from quite a few SLOW_OPS. Memory, CPU and
device IOPS capacity were not a limiting factor as far as we can see
(plenty of headroom ... nowhere near max capacity). Many of the slow
ops showed the following events:
"time": "2019-12-19 09:41:02.712010",
"event": "reached_pg"
},
{
"time": "2019-12-19 09:41:02.712014",
"event": "waiting for rw locks"
},
{
"time": "2019-12-19 09:41:02.881939",
"event": "reached_pg"
... and this repeated hundreds of times, taking ~30 seconds to complete.
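For reference, these per-op event timelines can be pulled from the OSD admin
socket; a minimal sketch (the OSD id is just a placeholder):

# recent ops flagged as slow, with their full event timelines
ceph daemon osd.12 dump_historic_slow_ops

# ops currently in flight, useful while the slowdown is happening
ceph daemon osd.12 dump_ops_in_flight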
Does this indicate PG lock contention?
If so ... would we need to provide more PGs to the metadata pool to avoid this?
The metadata pool is only ~166 MiB in size ... but with loads of OMAP data ...
Most advice on PG planning is concerned with the _amount_ of data ... but the
metadata pool (and this might also be true for RGW index pools) seems to be a
special case.
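Should more PGs turn out to be the answer, I assume it would be the usual
procedure (pool name and target count below are only examples):

# check the current PG count of the metadata pool
ceph osd pool get cephfs_metadata pg_num

# raise it; on pre-Nautilus releases pgp_num has to be bumped as well
ceph osd pool set cephfs_metadata pg_num 1024
ceph osd pool set cephfs_metadata pgp_num 1024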
Thanks for your insights,
Gr. Stefan
--
| BIT BV https://www.bit.nl/ Kamer van Koophandel 09090351
| GPG: 0xD14839C6 +31 318 648 688 / info(a)bit.nl
Hi,
After the upgrade to 13.2.8, deep-scrub has a big impact on client IO:
loads of SLOW_OPS and high latency. We hardly ever had SLOW_OPS before, but
since the upgrade the impact is so big that we even have OSDs marking
each other out (OSD op thread timeout) multiple times during the scrub
window. There is plenty of CPU / RAM / IOPS left and hardly any load on these
OSD servers. Has anything changed in this release that could explain
this behaviour?
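To rule out a configuration change on our side, the scrub-related settings
can be dumped per running OSD; a sketch (the OSD id is a placeholder):

# show the scrub settings the running OSD is actually using
ceph daemon osd.0 config show | grep -E 'osd_scrub|osd_max_scrubs|osd_deep_scrub'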
Besides this, the impact of rebalancing is very severe as well. With only
the balancer remapping a couple of PGs at a time there are loads of
(MDS_)SLOW_OPS. This morning the CephFS metadata pool got rebalanced ...
and that triggered a lot of SLOW_OPS. One particular OSD was pegged at
1000% CPU for more than half an hour (while not doing that much IO): that's 10
cores going full throttle! After a restart this issue was gone.
Thanks,
Stefan
--
| BIT BV https://www.bit.nl/ Kamer van Koophandel 09090351
| GPG: 0xD14839C6 +31 318 648 688 / info(a)bit.nl
After upgrading one of our clusters from Nautilus 14.2.2 to Nautilus 14.2.5, I'm seeing 100% CPU usage by a single ceph-mgr thread (found using 'top -H'). Attaching to the thread with strace shows a lot of mmap and munmap calls. Here's the distribution after watching it for a few minutes:
48.73% - mmap
49.48% - munmap
1.75% - futex
0.05% - madvise
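For anyone who wants to reproduce the measurement, this is roughly how the
distribution above can be collected (the thread id is a placeholder):

# find the busy ceph-mgr thread
top -H -p $(pidof ceph-mgr)

# attach to that thread and summarise its syscalls;
# let it run for a minute, then Ctrl-C to print the summary table
strace -c -p <TID>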
I've upgraded 3 other clusters so far (120 OSDs, 30 OSDs, 200 OSDs), but this is the only one which has seen the problem (355 OSDs). Perhaps it has something to do with its size?
I suspected it might have to do with one of the modules misbehaving, so I disabled all of them:
# ceph mgr module ls | jq -r '.enabled_modules'
[]
But that didn't help (I restarted the mgrs after disabling the modules too).
I also tried setting debug_mgr and debug_mgrc to 20, but nothing popped out at me as being the cause of the problem.
It only seems to affect the active mgr. If I stop the active mgr the problem moves to one of the other mgrs.
Any guesses or tips on what next steps I should take to figure out what's going on?
Thanks,
Bryan
Hi,
I am trying to copy the contents of our storage server into a CephFS,
but am experiencing stability issues with my MDSs. The CephFS sits on
top of an erasure-coded pool with 5 MONs, 5 MDSs and a max_mds setting
of two. My Ceph cluster version is Nautilus, the client is Mimic and
uses the kernel module to mount the FS.
The index of filenames to copy is about 23GB and I am using 16 parallel
rsync processes over a 10G link to copy the files over to Ceph. This
works perfectly for a while, but then the MDSs start reporting oversized
caches (between 20 and 50GB, sometimes more) and an inode count between
1 and 4 million. The inode count in particular seems quite high to me.
Each rsync job has 25k files to work with, so even if all 16 processes opened
all their files at the same time, I should not exceed 400k. Even if I
double this number to account for the client's page cache, I should get
nowhere near that number of inodes (a sync flush takes about 1 second).
Then after a few hours, my MDSs start failing with messages like this:
-21> 2019-07-22 14:00:05.877 7f67eacec700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
-20> 2019-07-22 14:00:05.877 7f67eacec700 0 mds.beacon.XXX Skipping beacon heartbeat to monitors (last acked 24.0042s ago); MDS internal heartbeat is not healthy!
The standby nodes try to take over, but take forever to become active
and eventually fail as well.
During my research, I found this related topic:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-January/015959.html,
but I have tried everything in there, from increasing and lowering my cache
size to changing the number of segments, etc. I also played around with the
number of active MDSs: two appears to work best, whereas one cannot keep
up with the load and three seems to be the worst of all choices.
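For reference, these are the kinds of settings I have been varying; a sketch
with example values only (the file system name "cephfs" is a placeholder):

# MDS cache size limit (8 GiB here, purely as an example)
ceph config set mds mds_cache_memory_limit 8589934592

# number of journal segments
ceph config set mds mds_log_max_segments 128

# number of active MDS ranks
ceph fs set cephfs max_mds 2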
Do you have any ideas how I can improve the stability of my MDS daemons
so they handle the load properly? A single 10G link is a toy and we could
query the cluster with a lot more requests per second, but it is already
struggling with 16 rsync processes.
Thanks
Please file a tracker issue with the symptom and examples, and attach your
OSDMap (ceph osd getmap > osdmap.bin).
Note that https://github.com/ceph/ceph/pull/31956 has the Nautilus
version of improved upmap code. It also changes osdmaptool to match the
mgr behavior, so that one can observe the behavior of the upmap balancer
offline.
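For the record, the offline workflow looks roughly like this (the pool name
and change limit are placeholders):

# grab the current OSDMap from the cluster
ceph osd getmap -o osdmap.bin

# let osdmaptool compute upmap entries offline, mimicking the mgr balancer
osdmaptool osdmap.bin --upmap out.txt --upmap-pool mypool --upmap-max 10

# out.txt now contains the "ceph osd pg-upmap-items ..." commands the
# balancer would issue; review before running them
cat out.txt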
Thanks
David
On 12/8/19 11:04 AM, Philippe D'Anjou wrote:
> It's only getting worse after raising PGs now.
>
> Anything between:
> 96 hdd 9.09470 1.00000 9.1 TiB 4.9 TiB 4.9 TiB 97 KiB 13 GiB 4.2 TiB 53.62 0.76 54 up
>
> and
>
> 89 hdd 9.09470 1.00000 9.1 TiB 8.1 TiB 8.1 TiB 88 KiB 21 GiB 1001 GiB 89.25 1.27 87 up
>
> How is that possible? I don't know how much more proof I need to
> present that there's a bug.
Hi,
in this <https://ceph.io/community/the-first-telemetry-results-are-in/>
blog post I find this statement:
"So, in our ideal world so far (assuming equal size OSDs), every OSD now
has the same number of PGs assigned."
My issue is that across all pools the number of PGs per OSD is not equal,
and I conclude that this is causing very unbalanced data placement.
As a matter of fact, the utilisation of my 1.6 TB HDDs in the specific pool
"hdb_backup" ranges from
osd.228 size: 1.6 usage: 52.61 reweight: 1.00000
to
osd.145 size: 1.6 usage: 81.11 reweight: 1.00000
This heavily impacts the amount of data that can be stored in the cluster.
The Ceph balancer is enabled, but it is not solving this issue.
root@ld3955:~# ceph balancer status
{
"active": true,
"plans": [],
"mode": "upmap"
}
Therefore I would like to ask for suggestions on how to address this
unbalanced data distribution.
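As far as I understand, a plan can also be computed and applied manually
(the plan name "myplan" is arbitrary), although the automatic balancer should
already be doing the equivalent:

# score the current distribution (lower is better)
ceph balancer eval

# compute a plan, inspect it and its expected score, then apply it
ceph balancer optimize myplan
ceph balancer show myplan
ceph balancer eval myplan
ceph balancer execute myplan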
I have attached pastebin links for
- ceph osd df sorted by usage <https://pastebin.com/QLQHjA9g>
- ceph osd df tree <https://pastebin.com/SvhP2hp5>
My cluster has multiple CRUSH roots representing the different disk types.
In addition I have defined multiple pools, one pool for each disk type:
hdd, ssd, nvme.
THX
Hi,
We had a strange problem with some buckets. After an s3cmd sync, some objects got ETags with the suffix "#x0e". This rendered the XML output of "GET /" (e.g. s3cmd du) invalid. Unfortunately, this behaviour was not reproducible, but it could be fixed by "GET /{object}" + "PUT /{object}" (s3cmd get + s3cmd put).
I am not sure how this appeared or how to avoid it. Right now we have Nautilus MONs and OSDs with Jewel radosgws. At the time of the first appearance a Nautilus gateway had also been online, but the requests had been handled by both types of gateway.
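In case someone wants to check their own objects, the stored ETag can be
inspected from both sides; bucket and object names below are placeholders:

# S3 side: shows the MD5/ETag as returned to clients
s3cmd info s3://mybucket/myobject

# RGW side: dumps the object's metadata, including the etag attribute
radosgw-admin object stat --bucket=mybucket --object=myobject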
Any ideas?
best regards,
Ingo
--
Ingo Reimann
Teamleiter Technik
[ https://www.dunkel.de/ ]
Dunkel GmbH
Philipp-Reis-Straße 2
65795 Hattersheim
Fon: +49 6190 889-100
Fax: +49 6190 889-399
eMail: support(a)dunkel.de
http://www.Dunkel.de/ Amtsgericht Frankfurt/Main
HRB: 37971
Geschäftsführer: Axel Dunkel
Ust-ID: DE 811622001
Hi,
We have seen several issues (mailed about them earlier to this list)
after the upgrade to Mimic 13.2.8. We decided to downgrade the OSD
servers to 13.2.6 to check whether the issues disappear. However, we ran into
issues with that ...
We have been using the bitmap allocator since Luminous 12.2.12 to combat
latency issues on the OSDs, and also used it successfully on Mimic 13.2.6:
bluestore_allocator = bitmap
bluefs_allocator = bitmap
When downgrading to 13.2.6 we hit the following assert:
2019-12-27 14:14:16.409 7f2ed2dcce00 1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-0/block size 3.5 TiB
2019-12-27 14:14:16.409 7f2ed2dcce00 1 bluefs mount
2019-12-27 14:14:16.413 7f2ed2dcce00 -1 /build/ceph-13.2.6/src/os/bluestore/fastbmap_allocator_impl.h: In function 'void AllocatorLevel02<T>::_mark_allocated(uint64_t, uint64_t) [with L1 = AllocatorLevel01Loose; uint64_t = long unsigned int]' thread 7f2ed2dcce00 time 2019-12-27 14:14:16.414793
/build/ceph-13.2.6/src/os/bluestore/fastbmap_allocator_impl.h: 749: FAILED assert(available >= allocated)
ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7f2eca0a497e]
2: (()+0x2fab07) [0x7f2eca0a4b07]
3: (BitmapAllocator::init_rm_free(unsigned long, unsigned long)+0x44d) [0xc91dbd]
4: (BlueFS::mount()+0x260) [0xc6e6c0]
5: (BlueStore::_open_db(bool, bool)+0x17cd) [0xb8f50d]
6: (BlueStore::_mount(bool, bool)+0x4b7) [0xbbfb77]
7: (OSD::init()+0x295) [0x761fc5]
8: (main()+0x367b) [0x64f23b]
9: (__libc_start_main()+0xf0) [0x7f2ec7c3f830]
10: (_start()+0x29) [0x718929]
We upgraded the node back to 13.2.8 again, which started without issues.
We did a "downgrade test" on a test cluster ... that cluster did not
suffer from this issue. It turned out that the cluster was not using
the bitmap allocator ... after enabling the bitmap allocator there on a
13.2.6 node (that had previously been downgraded but had never run
with the bitmap allocator) and restarting the node, it came online just
fine. However, an upgrade to 13.2.8 with the bitmap allocator enabled,
followed by a downgrade again to 13.2.6, would trigger the same assert.
Switching back to the default (stupid) allocator would initially work
for 2 out of 3 OSDs. One would fail right away with RocksDB corruption:
2019-12-27 15:10:50.945 7fc77fbcbe00 20 osd.6 1952 register_pg 2.16 0x990c800
2019-12-27 15:10:50.945 7fc77fbcbe00 10 osd.6:2._attach_pg 2.16 0x990c800
2019-12-27 15:10:50.945 7fc77fbcbe00 10 osd.6 1952 pgid 2.0 coll 2.0_head
2019-12-27 15:10:50.945 7fc77fbcbe00 10 osd.6 1952 _make_pg 2.0
2019-12-27 15:10:50.945 7fc77fbcbe00 5 osd.6 pg_epoch: 1952 pg[2.0(unlocked)] enter Initial
2019-12-27 15:10:50.945 7fc77fbcbe00 20 osd.6 pg_epoch: 1952 pg[2.0(unlocked)] enter NotTrimming
2019-12-27 15:10:50.945 7fc77fbcbe00 -1 abort: Corruption: block checksum mismatch: expected 1122551773, got 2333355710 in db/000397.sst offset 57741 size 4044
2019-12-27 15:10:50.949 7fc77fbcbe00 -1 *** Caught signal (Aborted) **
in thread 7fc77fbcbe00 thread_name:ceph-osd
ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
1: (()+0x11390) [0x7fc775520390]
2: (gsignal()+0x38) [0x7fc774a53428]
3: (abort()+0x16a) [0x7fc774a5502a]
4: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::list*)+0x4a8) [0xbff498]
5: (BlueStore::omap_get_values(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::list> > >*)+0x201) [0xb852e1]
6: (PG::read_info(ObjectStore*, spg_t, coll_t const&, pg_info_t&, PastIntervals&, unsigned char&)+0x16b) [0x7ecc8b]
7: (PG::read_state(ObjectStore*)+0x56) [0x81aff6]
8: (OSD::load_pgs()+0x566) [0x759516]
9: (OSD::init()+0xcd3) [0x762a03]
10: (main()+0x367b) [0x64f23b]
11: (__libc_start_main()+0xf0) [0x7fc774a3e830]
12: (_start()+0x29) [0x718929]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
And after a restart we got rocksdb messages like this:
...
2019-12-27 15:11:32.322 7fd1c0fbbe00 4 rocksdb: [/build/ceph-13.2.6/src/rocksdb/db/version_set.cc:3088] Recovering from manifest file: MANIFEST-000402
...
...
-352> 2019-12-27 15:11:11.598 7fa0f0558e00 -1 abort: Corruption: Bad table magic number: expected 9863518390377041911, found 11124 in db/000397.sst
...
After we set osd.6 out ... osd.7 crashed after a while (while backfilling) and
would fail to restart again with the following message:
2019-12-27 15:27:10.833 7f1bb4701e00 4 rocksdb: [/build/ceph-13.2.6/src/rocksdb/db/db_impl.cc:252] Shutdown: canceling all background work
2019-12-27 15:27:10.833 7f1bb4701e00 4 rocksdb: [/build/ceph-13.2.6/src/rocksdb/db/db_impl.cc:397] Shutdown complete
2019-12-27 15:27:10.833 7f1bb4701e00 -1 rocksdb: Corruption: CURRENT file does not end with newline
2019-12-27 15:27:10.833 7f1bb4701e00 -1 bluestore(/var/lib/ceph/osd/ceph-7) _open_db erroring opening db:
2019-12-27 15:27:10.833 7f1bb4701e00 1 bluefs umount
2019-12-27 15:27:10.833 7f1bb4701e00 1 stupidalloc 0x0x325aee0 shutdown
2019-12-27 15:27:10.833 7f1bb4701e00 1 bdev(0x380a380 /var/lib/ceph/osd/ceph-7/block) close
2019-12-27 15:27:11.093 7f1bb4701e00 1 bdev(0x380a000 /var/lib/ceph/osd/ceph-7/block) close
2019-12-27 15:27:11.345 7f1bb4701e00 -1 osd.7 0 OSD:init: unable to mount object store
2019-12-27 15:27:11.345 7f1bb4701e00 -1 ESC[0;31m ** ERROR: osd init failed: (5) Input/output errorESC[0m
A restart / reboot of the node would not help.
For those of you still running 13.2.6 ... I would not recommend upgrading to
13.2.8 (at least not for storage nodes ... mon / mds still seem to work fine).
Does the bitmap allocator modify the OSD's on-disk data in some way? Are you
supposed to be able to switch between the different allocators?
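For reference, this is how we verify which allocator an OSD is actually
running with (the OSD id is a placeholder):

# query the running OSD for its allocator settings
ceph daemon osd.0 config get bluestore_allocator
ceph daemon osd.0 config get bluefs_allocator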
Thanks,
Stefan
--
| BIT BV https://www.bit.nl/ Kamer van Koophandel 09090351
| GPG: 0xD14839C6 +31 318 648 688 / info(a)bit.nl
I am seeing the following errors on an RGW multisite slave:
1. ERROR: failed to fetch mdlog info
2. failed to fetch local sync status: (5) Input/output error
Data seems to be replicating, but metadata is not. Does anyone have any
ideas on what may be wrong?
-----
# radosgw-admin sync status
realm 8f7fd3fd-f72d-411d-b06b-7b4b579f5f2f (prod)
zonegroup 60a2cb75-6978-46a3-b830-061c8be9dc75 (prod)
zone ffce148e-3b24-462d-98bf-8c212de31de5 (us-east-1)
2019-12-27 12:29:13.329597 7f71c4ec9dc0 0 meta sync: ERROR: failed to
fetch mdlog info
metadata sync syncing
full sync: 0/64 shards
failed to fetch local sync status: (5) Input/output error
data sync source: 7fe96e52-d6f7-4ad6-b66e-ecbbbffbc18e (us-east-2)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is behind on 12 shards
behind shards: [29,31,33,45,46,48,54,76,87,113,120,127]
oldest incremental change not applied:
2019-12-27 12:28:58.0.107159s
23 shards are recovering
recovering shards:
[1,2,24,26,29,33,35,37,40,41,42,46,48,51,54,66,76,95,100,101,122,123,127]
-----
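For reference, I assume these are the relevant commands on the secondary for
inspecting metadata sync in more detail and, as a last resort, restarting it
(the init triggers a full metadata resync):

# detailed metadata sync state on the secondary zone
radosgw-admin metadata sync status

# list metadata log entries (per shard)
radosgw-admin mdlog list

# last resort: reinitialise metadata sync, then restart the radosgw
# daemons in the secondary zone
radosgw-admin metadata sync init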