sorry for the double post. forwarding to the correct list.
On Fri, Jul 12, 2019 at 9:16 PM kefu chai <tchaikov(a)gmail.com> wrote:
>
> hi Sam,
>
> as you know, i am working on replicated writes, and i think we need to
> make them an optional phase in ClientRequest::PGPipeline. but i am not
> quite sure how we should represent this as a blocking step using
> dump_detail().
>
> currently, my plan is to add a Blocker wrapping 1 local_txn + m
> replicated_txns, and to print them as
> - a tid,
> - a local return code
> - an array of {peer:pg_shard_t, last_complete_on_disk:eversion}
>
> does this make sense to you?
>
> --
> Regards
> Kefu Chai
--
Regards
Kefu Chai
Hi everyone,
All current Nautilus releases have an issue where deploying a single new
(Nautilus) BlueStore OSD on an upgraded cluster (i.e. one that was
originally deployed pre-Nautilus) breaks the pool utilization stats
reported by ``ceph df``. Until all OSDs have been reprovisioned or
updated (via ``ceph-bluestore-tool repair``), the pool stats will show
values that are lower than the true value. A fix is in the works but will
not appear until 14.2.3. Users who have upgraded to Nautilus (or are
considering upgrading) may want to delay provisioning new OSDs until the
fix is available in the next release.
This issue will only affect you if:
- You started with a pre-Nautilus cluster and upgraded
- You then provision one or more new BlueStore OSDs, or run
'ceph-bluestore-tool repair' on an upgraded OSD.
The symptom is that the pool stats from 'ceph df' are too small. For
example, the pre-upgrade stats on our test cluster were
...
POOLS:
POOL ID STORED OBJECTS USED %USED MAX AVAIL
data 0 63 TiB 44.59M 63 TiB 30.21 48 TiB
...
but when one OSD was updated it changed to
POOLS:
POOL ID STORED OBJECTS USED %USED MAX AVAIL
data 0 558 GiB 43.50M 1.7 TiB 1.22 45 TiB
The root cause is that, starting with Nautilus, BlueStore maintains
per-pool usage stats, but it requires a slight on-disk format change;
upgraded OSDs won't have the new stats until you run a ceph-bluestore-tool
repair. The problem is that the mon starts using the new stats as soon as
*any* OSDs are reporting per-pool stats (instead of waiting until *all*
OSDs are doing so).
To avoid the issue, either
- do not provision new BlueStore OSDs after the upgrade, or
- update all OSDs to keep new per-pool stats. An existing BlueStore
OSD can be converted with
systemctl stop ceph-osd@$N
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-$N
systemctl start ceph-osd@$N
Note that FileStore does not support the new per-pool stats at all, so
if there are FileStore OSDs in your cluster there is no workaround
that doesn't involve replacing them with BlueStore.
A fix[1] is working its way through QA and will appear in 14.2.3; it
won't quite make the 14.2.2 release.
sage
[1] https://github.com/ceph/ceph/pull/28978
Hi, I'm Jongyul Kim, and I'm interested in the performance of Ceph.
I tried to figure out the advantage of using two MDS daemons instead of a
single MDS under massive metadata operations (renames), but the result was
that two MDS daemons performed worse than a single MDS daemon. I'd like to
ask your advice on why this happens.
Here is what I did.
I wrote a micro benchmark in which each process 1) creates a file, 2) writes
4KB to the file, and 3) renames it to another directory. I measured the
throughput (operations/sec) of Ceph while increasing the number of processes
in the benchmark. The experimental setup is described below.
[Ceph and HW configuration]
- Ceph version: 14.2.1
- Configured as Filestore
- NVM with ext4-dax was used as a storage device
- IPoIB with 40Gbps Infiniband NIC
- Sufficient cores and memory (96 cores with hyperthreading, about 300GB
DRAM)
[Base setup]
- There are two nodes: Node A and Node B
- Node A: 1 MON, 1 MGR, 1 OSD, 1 MDS and the micro benchmark run.
- Node B: 1 OSD and the micro benchmark run.
- Each process does 1,000 operations(OPs).
- Each process has its own source directory and target directory for
renaming. So, there is no contention between rename requests of different
processes; Each process renames 1,000 files from its own source directory
to its own target directory.
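The per-process loop described above can be sketched roughly like this. It is a minimal sketch: the 1,000-op count, 4KB writes, and per-process source/target directories are from the description; the function name, directory naming, and timing details are assumptions.

```python
import os
import time


def run_process(worker_id: int, root: str, n_ops: int = 1000) -> float:
    """One benchmark process: create a file, write 4 KiB to it, then
    rename it into this process's own target directory.
    Returns the achieved throughput in operations/sec."""
    # Each process gets its own source and target directory, so rename
    # requests of different processes never contend with each other.
    src = os.path.join(root, f"src-{worker_id}")
    dst = os.path.join(root, f"dst-{worker_id}")
    os.makedirs(src, exist_ok=True)
    os.makedirs(dst, exist_ok=True)
    payload = b"\0" * 4096  # 4 KiB written per file

    start = time.monotonic()
    for i in range(n_ops):
        path = os.path.join(src, f"f{i}")
        with open(path, "wb") as f:              # 1) create
            f.write(payload)                     # 2) write 4 KiB
        os.rename(path, os.path.join(dst, f"f{i}"))  # 3) rename
    return n_ops / (time.monotonic() - start)
```

To reproduce the setup, `root` would point into a CephFS mount and one such process would be launched per slot on each node, summing the per-process throughputs.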
As I increased the number of benchmark processes, Ceph stopped
scaling (in terms of throughput) at around 8 processes per node.
I suspect the bottleneck is the MDS daemon, because rename requests took
more time as the number of processes increased (i.e., the rename portion of
total execution time grew from 10% with 1 process per node to 24% with 8
processes per node).
To achieve higher throughput, I added one more active MDS daemon on Node B
(so there are now two MDS daemons, one on Node A and the other on Node B).
Additionally, directories were pinned to one of the two MDSes to shard the
metadata operations: directories accessed by processes on Node A were
pinned to the MDS daemon running on Node A (and likewise for Node B). The
result was, as I mentioned at the beginning, that 2 MDS achieved lower
throughput, about 50~60% of the 1-MDS case's. The rename portion of the
total execution time also increased (50% with 1 process per node and 88%
with 8 processes per node).
I found that a rename request to the MDS on Node B takes much longer than
a request to the MDS on Node A. (The MDS on Node A has authority over the
'/' directory, so it seems to act as a master MDS.) I checked the logs and
confirmed that a rename MDS request on Node B was re-dispatched three times
in the MDS Server to acquire permissions for renaming (once for "pin
inode" and twice for "scatter locks"; I'm not sure why the scatter locks
have to be requested twice), whereas a rename MDS request on Node A was
never re-dispatched.
Although directories were pinned to the MDS on Node B, this MDS continually
requested permissions from the MDS on Node A on every rename request. As a
result, directory pinning for metadata-operation sharding was useless, and
2 MDS got worse performance.
Why does this happen? Why does the second MDS need to re-acquire
permissions, the "pin inode" and the scatter locks, on every rename
request, even though it has authority over those directories (via directory
pinning)? Performance would improve if an authoritative MDS (the MDS on
Node B in this case) kept the locks or the "pin inode" permission for its
directories until a revocation is actually required; that would eliminate
the ping-ponging of locks and permissions between MDSes. But the current
Ceph MDS implementation does not work this way. I'd like to ask about the
rationale behind this design point.
Any comments and advice will be appreciated. Thanks.
Sincerely,
Jongyul Kim
Hi Folks,
Most of the Red Hat people are double booked with another meeting today,
so we'll cancel the perf meeting. Hopefully third time's a charm and
we'll be able to meet next week!
Thanks,
Mark
... and now it compiles in godbolt:
https://godbolt.org/z/xPMzP5
(I've also made some changes to the code, to better demonstrate the
possible error-handling paths)
Comments/corrections anyone?
Thanks
Ronen
On Thu, Jul 4, 2019 at 5:34 PM Ronen Friedman <rfriedma(a)redhat.com> wrote:
> Hi Kefu,
>
> Like I said in the meeting: I really think this is the right direction.
> I am not sure I fully understand the specifics you've proposed. Here is my
> take (which may be exactly what you meant):
>
> Suppose this is the "happy path":
> [image: outcome_1.jpg]
>
>
> In the face of an error we may want to have this behavior (1'st option):
> [image: outcome_2.jpg]
>
>
>    - let's assume that the values of type T are wrapped in
>    outcome::outcome<T, error-code, exception-code> (I'd rather use that over
>    outcome::result);
> - Thus, f1() should be written to return the wrapper type
> (outcome<T1>),
> - f2() is now a function from outcome<T1> to outcome<T2>.
>    - as we may want to have error codes "pass thru" f2(), without
>    requiring f2() to be modified with the boilerplate code for the pass-thru,
>    let us wrap f2() into pass_thru<T1,T2>(f2). I have attached code that
>    presents one way of doing just that.
>
> note that while f2() is just a "pass thru" for an incoming error code, we
> still need to "re-code" the error value in the incoming outcome<T1> into
> the same value in outcome<T2>.
>
>
> Another desired behaviour, as you mention in the email, is to split the
> logic into a separate "happy days" and "error path":
>
> [image: outcome_3.jpg]
>
> We can use a separate wrapper around the two paths (note that both paths
> are expected to return optional<T2>).
>
> Here is some code that demonstrates what I mean (compiled with g++9, but
> the link in Godbolt is only as a display mechanism):
>
> https://godbolt.org/z/i-d4TU <- use the new link above, not this one
>
> Note that the code is not polished yet. For example: I think I would be
> able to replace the call
>
> return WpassR<Oint, Ofloat>(of2_ps, x)
>
>
> with
>
>
> return WpassR(of2_ps, x)
>
>
> using some magic (deduction guides? that will require building some
> structs around) - if we think this direction is worth our time.
>
> Ronen
>
>
>
> On Wed, Jul 3, 2019 at 7:13 AM kefu chai <tchaikov(a)gmail.com> wrote:
>
>> hi guys,
>>
>> i just came across boost::outcome[0]. it reminded me of the discussion
>> we had back in Barcelona regarding error handling in crimson.
>> well, strictly speaking, it's not limited to errors; it covers
>> non-error handling as well.
>>
>> the question is: shall we start prototyping the crimson variant of
>> outcome<> now? if yes, probably we can leverage boost::outcome<>?
>>
>> a little bit background:
>>
>> seastar uses exceptions for propagating errors. but this incurs
>> runtime overhead, because to throw an exception, the libstdc++
>> runtime needs to acquire a global lock.
>>
>> well, some of us might want to argue, why not just return a
>> future<Result, Error>? let me use an example here, imagine we are
>> handling a write request in OSD. we might need to go through following
>> steps:
>>
>> 1. perform some sanity tests, for instance, to see if the OSD is ready
>> for handling the write request
>> 2. try to read the object info of the object from local storage to see
>> if it already exists
>> 3. write to the object to the local storage, and send write requests
>> to replica OSDs (assuming it's in a replicated pool), wait for the
>> completions of these write ops.
>> 4. update the statistics
>> 5. reply to the client
>>
>> and it's intuitive to structure these steps using chained continuation
>> like
>>
>> do_with(std::move(request), [this](auto request) {
>> return perform_tests(request->object_id).then([request, this] {
>> return read_object_info(request->object_id);
>> }).then([request, this](optional<object_info> object_info) {
>> return when_all(
>> write_local(request->object_id, request->offset, request->data),
>> parallel_for_each(replica_osds, [request](auto replica_osd) {
>> return replica_osd->write_remote(request->object_id,
>> request->offset, request->data);
>> }));
>> }).then([request, write_size=request->data.size(), this] {
>> update_statistics(write_size);
>> return reply_to(reply_t::success, request);
>> });
>> }).handle_exception([](auto exception) {
>> return reply_to(reply_t::failure, exception.error_code, request);
>> });
>>
>> in which, if any test fails in step#1, we either need to wait until
>> the OSD is ready, or just need to bail out, and skip the following
>> steps. the "handle_exception()" clause is used to handle the "bail
>> out" case, where we cannot do anything to serve the request. for
>> instance, the request is invalid.
>>
>> we want to differentiate two types of errors. one of them are actually
>> exceptions which does not happen often in real world, and we don't
>> need/want to optimize for this case. but the other case could be
>> normal. for instance, it's fairly normal that an object does not exist
>> yet, when we are trying to write to it. and we do want to be
>> performant when handling these "errors" in this category, and also, we
>> want to do this in a convenient way just like handling exceptions.
>>
>> because, we need an efficient way to convey the message to caller that
>> "please skip the following continuations, and i would go to this
>> handling route instead". if my memory serves me correctly, we think
>> that we need to create a wrapper around seastar::future<> to allow the
>> caller to do something like
>>
>> // a helper to run func or skip it
>> template<typename Func>
>> auto ignore_on_error(Func&& f) {
>> return [f=std::move(f)](auto&& t) {
>> return t.is_value() ? f(t.value()) : t;
>> };
>> }
>>
>> return read_object_info(oid).then(
>> ignore_on_error([](object_info& oi) {
>> return handle_write_with_object_info(std::move(oi));
>> })
>> ).then([](auto t) {
>> return handle_write_without_object_info();
>> });
>>
>> in the example above, i assume we will do something very different
>> depending on whether the object exists.
>>
>>
>> cheers,
>>
>> ---
>> [0]
>> https://www.boost.org/doc/libs/1_70_0/libs/outcome/doc/html/index.html
>>
>> --
>> Regards
>> Kefu Chai
>>
>