sorry for the double post. forwarding to the correct list.
On Fri, Jul 12, 2019 at 9:16 PM kefu chai <tchaikov(a)gmail.com> wrote:
>
> hi Sam,
>
> as you know, i am working on replicated writes, and i think we need to
> make them an optional phase in ClientRequest::PGPipeline. but i am not
> quite sure how we should represent this as a blocking step using
> dump_detail().
>
> currently, my plan is to add a Blocker wrapping 1 local_txn + m
> replicated_txns, and to print them as
> - a tid,
> - a local return code
> - an array of {peer:pg_shard_t, last_complete_on_disk:eversion}
>
> does this make sense to you?
>
> --
> Regards
> Kefu Chai
--
Regards
Kefu Chai
Hi everyone,
All current Nautilus releases have an issue where deploying a single new
(Nautilus) BlueStore OSD on an upgraded cluster (i.e. one that was
originally deployed pre-Nautilus) breaks the pool utilization stats
reported by ``ceph df``. Until all OSDs have been reprovisioned or
updated (via ``ceph-bluestore-tool repair``), the pool stats will show
values that are lower than the true value. A fix is in the works but will
not appear until 14.2.3. Users who have upgraded to Nautilus (or are
considering upgrading) may want to delay provisioning new OSDs until the
fix is available in the next release.
This issue will only affect you if:
- You started with a pre-Nautilus cluster and upgraded
- You then provision one or more new BlueStore OSDs, or run
'ceph-bluestore-tool repair' on an upgraded OSD.
The symptom is that the pool stats from 'ceph df' are too small. For
example, the pre-upgrade stats on our test cluster were
...
POOLS:
POOL ID STORED OBJECTS USED %USED MAX AVAIL
data 0 63 TiB 44.59M 63 TiB 30.21 48 TiB
...
but when one OSD was updated it changed to
POOLS:
POOL ID STORED OBJECTS USED %USED MAX AVAIL
data 0 558 GiB 43.50M 1.7 TiB 1.22 45 TiB
The root cause is that, starting with Nautilus, BlueStore maintains
per-pool usage stats, but it requires a slight on-disk format change;
upgraded OSDs won't have the new stats until you run a ceph-bluestore-tool
repair. The problem is that the mon starts using the new stats as soon as
*any* OSDs are reporting per-pool stats (instead of waiting until *all*
OSDs are doing so).
To avoid the issue, either
- do not provision new BlueStore OSDs after the upgrade, or
- update all OSDs to keep new per-pool stats. An existing BlueStore
OSD can be converted with
systemctl stop ceph-osd@$N
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-$N
systemctl start ceph-osd@$N
Note that FileStore does not support the new per-pool stats at all, so
if there are FileStore OSDs in your cluster there is no workaround
that doesn't involve replacing them with BlueStore.
A fix[1] is working its way through QA and will appear in 14.2.3; it
won't quite make the 14.2.2 release.
sage
[1] https://github.com/ceph/ceph/pull/28978
Hi, I'm Jongyul Kim, and I'm interested in the performance of Ceph.
I tried to figure out the advantage of using two MDS daemons instead of a
single MDS under massive metadata operations (renames), but the result was
that two MDS daemons performed worse than a single MDS daemon. I'd like to
ask your advice on why this happens.
Here is what I did.
I wrote a micro benchmark in which each process 1) creates a file, 2) writes
4KB to the file, and 3) renames it to another directory. I measured the
throughput (operations/sec) of Ceph while increasing the number of processes
in the benchmark. The experimental setup is described below.
[Ceph and HW configuration]
- Ceph version: 14.2.1
- Configured as Filestore
- NVM with ext4-dax was used as a storage device
- IPoIB with 40Gbps Infiniband NIC
- Sufficient cores and memory (96 cores with hyperthreading, about 300GB
DRAM)
[Base setup]
- There are two nodes: Node A and Node B
- Node A: 1 MON, 1 MGR, 1 OSD, 1 MDS and the micro benchmark run.
- Node B: 1 OSD and the micro benchmark run.
- Each process does 1,000 operations(OPs).
- Each process has its own source directory and target directory for
renaming. So, there is no contention between rename requests of different
processes; Each process renames 1,000 files from its own source directory
to its own target directory.
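The per-process loop described above can be sketched roughly like this. It is a minimal sketch: the 1,000-op count, 4KB writes, and per-process source/target directories are from the description; the function name, directory naming, and timing details are assumptions.

```python
import os
import time


def run_process(worker_id: int, root: str, n_ops: int = 1000) -> float:
    """One benchmark process: create a file, write 4 KiB to it, then
    rename it into this process's own target directory.
    Returns the achieved throughput in operations/sec."""
    # Each process gets its own source and target directory, so rename
    # requests of different processes never contend with each other.
    src = os.path.join(root, f"src-{worker_id}")
    dst = os.path.join(root, f"dst-{worker_id}")
    os.makedirs(src, exist_ok=True)
    os.makedirs(dst, exist_ok=True)
    payload = b"\0" * 4096  # 4 KiB written per file

    start = time.monotonic()
    for i in range(n_ops):
        path = os.path.join(src, f"f{i}")
        with open(path, "wb") as f:              # 1) create
            f.write(payload)                     # 2) write 4 KiB
        os.rename(path, os.path.join(dst, f"f{i}"))  # 3) rename
    return n_ops / (time.monotonic() - start)
```

To reproduce the setup, `root` would point into a CephFS mount and one such process would be launched per slot on each node, summing the per-process throughputs.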
As I increased the number of benchmark processes, Ceph stopped
scaling (in terms of throughput) at around 8 processes per node.
I suspect the bottleneck is the MDS daemon, because rename requests took
more time as the number of processes increased (i.e., the rename portion of
total execution time grew from 10% with 1 process per node to 24% with 8
processes per node).
To achieve higher throughput, I added one more active MDS daemon on Node B
(so there are now two MDS daemons, one on Node A and the other on Node B).
Additionally, directories were pinned to one of the two MDSes to shard the
metadata operations: directories accessed by processes on Node A were
pinned to the MDS daemon running on Node A (and likewise for Node B). The
result was, as I mentioned at the beginning, that 2 MDS achieved lower
throughput, about 50~60% of the 1-MDS case's. The rename portion of the
total execution time also increased (50% with 1 process per node and 88%
with 8 processes per node).
I found that a rename request to the MDS on Node B takes much longer than
a request to the MDS on Node A. (The MDS on Node A has authority over the
'/' directory, so it seems to act as a master MDS.) I checked the logs and
confirmed that a rename MDS request on Node B was re-dispatched three times
in the MDS Server to acquire permissions for renaming (once for "pin
inode" and twice for "scatter locks"; I'm not sure why the scatter locks
have to be requested twice), whereas a rename MDS request on Node A was
never re-dispatched.
Although directories were pinned to the MDS on Node B, this MDS continually
requested permissions from the MDS on Node A on every rename request. As a
result, directory pinning for metadata-operation sharding was useless, and
2 MDS got worse performance.
Why does this happen? Why does the second MDS need to re-acquire
permissions, the "pin inode" and the scatter locks, on every rename
request, even though it has authority over those directories (via directory
pinning)? Performance would improve if an authoritative MDS (the MDS on
Node B in this case) kept the locks or the "pin inode" permission for its
directories until a revocation is actually required; that would eliminate
the ping-ponging of locks and permissions between MDSes. But the current
Ceph MDS implementation does not work this way. I'd like to ask about the
rationale behind this design point.
Any comments and advice will be appreciated. Thanks.
Sincerely,
Jongyul Kim
Hi Folks,
Most of the Red Hat people are double booked with another meeting today,
so we'll cancel the perf meeting. Hopefully third time's a charm and
we'll be able to meet next week!
Thanks,
Mark
... and now it compiles in godbolt:
https://godbolt.org/z/xPMzP5
(I've also made some changes to the code, to better demonstrate the
possible error-handling paths)
Comments/corrections anyone?
Thanks
Ronen
On Thu, Jul 4, 2019 at 5:34 PM Ronen Friedman <rfriedma(a)redhat.com> wrote:
> Hi Kefu,
>
> Like I said in the meeting: I really think this is the right direction.
> I am not sure I fully understand the specifics you've proposed. Here is my
> take (which may be exactly what you meant):
>
> Suppose this is the "happy path":
> [image: outcome_1.jpg]
>
>
> In the face of an error we may want to have this behavior (1'st option):
> [image: outcome_2.jpg]
>
>
>    - let's assume that the values of type T are wrapped in
>    outcome::outcome<T, error-code, exception-code> (I'd rather use that over
>    outcome::result);
> - Thus, f1() should be written to return the wrapper type
> (outcome<T1>),
> - f2() is now a function from outcome<T1> to outcome<T2>.
>    - as we may want to have error codes "pass thru" f2(), without
>    requiring f2() to be modified with the boilerplate code for the pass-thru,
>    let us wrap f2() into pass_thru<T1,T2>(f2). I have attached code that
>    presents one way of doing just that.
>
> note that while f2() is just a "pass thru" for an incoming error code, we
> still need to "re-code" the error value in the incoming outcome<T1> into
> the same value in outcome<T2>.
>
>
> Another desired behaviour, as you mention in the email, is to split the
> logic into a separate "happy days" and "error path":
>
> [image: outcome_3.jpg]
>
> We can use a separate wrapper around the two paths (note that both paths
> are expected to return optional<T2>).
>
> Here is some code that demonstrates what I mean (compiled with g++9, but
> the link in Godbolt is only as a display mechanism):
>
> https://godbolt.org/z/i-d4TU <- use the new link above, not this one
>
> Note that the code is not polished yet. For example: I think I would be
> able to replace the call
>
> return WpassR<Oint, Ofloat>(of2_ps, x)
>
>
> with
>
>
> return WpassR(of2_ps, x)
>
>
> using some magic (deduction guides? that will require building some
> structs around) - if we think this direction is worth our time.
>
> Ronen
>
>
>
> On Wed, Jul 3, 2019 at 7:13 AM kefu chai <tchaikov(a)gmail.com> wrote:
>
>> hi guys,
>>
>> i just came across boost::outcome[0]. it reminded me of the discussion
>> we had back in Barcelona regarding error handling in crimson.
>> well, strictly speaking, it's not limited to errors; it covers
>> non-error handling as well.
>>
>> the question is: shall we start prototyping the crimson variant of
>> outcome<> now? if yes, probably we can leverage boost::outcome<>?
>>
>> a little bit background:
>>
>> seastar uses exceptions for propagating errors. but this incurs
>> runtime overhead, because to throw an exception, the libstdc++
>> runtime needs to acquire a global lock.
>>
>> well, some of us might want to argue, why not just return a
>> future<Result, Error>? let me use an example here, imagine we are
>> handling a write request in OSD. we might need to go through following
>> steps:
>>
>> 1. perform some sanity tests, for instance, to see if the OSD is ready
>> for handling the write request
>> 2. try to read the object info of the object from local storage to see
>> if it already exists
>> 3. write to the object to the local storage, and send write requests
>> to replica OSDs (assuming it's in a replicated pool), wait for the
>> completions of these write ops.
>> 4. update the statistics
>> 5. reply to the client
>>
>> and it's intuitive to structure these steps using chained continuation
>> like
>>
>> do_with(std::move(request), [this](auto request) {
>> return perform_tests(request->object_id).then([request, this] {
>> return read_object_info(request->object_id);
>> }).then([request, this](optional<object_info> object_info) {
>> return when_all(
>> write_local(request->object_id, request->offset, request->data),
>> parallel_for_each(replica_osds, [request](auto replica_osd) {
>> return replica_osd->write_remote(request->object_id,
>> request->offset, request->data);
>> }));
>> }).then([request, write_size=request->data.size(), this] {
>> update_statistics(write_size);
>> return reply_to(reply_t::success, request);
>> });
>> }).handle_exception([](auto exception) {
>> return reply_to(reply_t::failure, exception.error_code, request);
>> });
>>
>> in which, if any test fails in step#1, we either need to wait until
>> the OSD is ready, or just need to bail out, and skip the following
>> steps. the "handle_exception()" clause is used to handle the "bail
>> out" case, where we cannot do anything to serve the request. for
>> instance, the request is invalid.
>>
>> we want to differentiate two types of errors. one of them are actually
>> exceptions which does not happen often in real world, and we don't
>> need/want to optimize for this case. but the other case could be
>> normal. for instance, it's fairly normal that an object does not exist
>> yet, when we are trying to write to it. and we do want to be
>> performant when handling these "errors" in this category, and also, we
>> want to do this in a convenient way just like handling exceptions.
>>
>> because, we need an efficient way to convey the message to caller that
>> "please skip the following continuations, and i would go to this
>> handling route instead". if my memory serves me correctly, we think
>> that we need to create a wrapper around seastar::future<> to allow the
>> caller to do something like
>>
>> // a helper to run func or skip it
>> template<typename Func>
>> auto ignore_on_error(Func&& f) {
>> return [f=std::move(f)](auto&& t) {
>> return t.is_value() ? f(t.value()) : t;
>> };
>> }
>>
>> return read_object_info(oid).then(
>> ignore_on_error([](object_info& oi) {
>> return handle_write_with_object_info(std::move(oi));
>> })
>> ).then([](auto t) {
>> return handle_write_without_object_info();
>> });
>>
>> in the example above, i assume we will do something very different
>> depending on whether the object exists.
>>
>>
>> cheers,
>>
>> ---
>> [0]
>> https://www.boost.org/doc/libs/1_70_0/libs/outcome/doc/html/index.html
>>
>> --
>> Regards
>> Kefu Chai
>>
>