Hi,
Are there any plans to implement a per-client throttle on mds client requests?
We just had an interesting case where a new cephfs user was hammering
an mds from several hosts. In the end we found that their code was
doing:
while d := getafewbytesofdata():
    f = open("file.dat", "a")
    f.write(d)
    f.close()
By changing their code to:
f = open("file.dat", "a")
while d := getafewbytesofdata():
    f.write(d)
f.close()
it completely removes their load on the MDS, since every open/close cycle
is a metadata round trip to the MDS, while writes to an already-open file
go straight to the OSDs.
In a multi-user environment it's hard to scrutinize every user's
application, so we'd prefer to just throttle down the client req rates
(and let them suffer from the poor performance).
Thoughts?
Thanks,
Dan
Hi everyone,
The target release date for Octopus is March 1, 2020.
The freeze will be January 1, 2020. As a practical matter, that means any
features need to be in before people leave for the holidays, so that they
land in time and so that we can run tests over the holidays while the test
lab is relatively idle.
We plan to stick to a 12-month cadence going forward, so the P release
target would be March 1, 2021 (regardless of whether Octopus is early or
late).
Thanks!
sage
We created dev@ceph.io several weeks back. There has been plenty of
time now for everyone to get subscribed, so please now direct all dev
discussion for Ceph proper to dev@ceph.io and use this list for
Ceph kernel client development only. Avoid copying both lists unless the
discussion is relevant both for userspace and the kernel.
https://lists.ceph.io/postorius/lists/dev.ceph.io/
Thanks!
sage
Hi everyone,
I am currently working on a project where the Rados Gateway SSE-KMS
feature is required;
I cannot rely on the Barbican-based solution, and Vault is the KMS of choice.
For these reasons, here [1] is a proposal to abstract the key management
service and an initial sketch of a refactoring strategy to support
HashiCorp Vault.
I am currently not planning on adding any new SSE strategy (such as
SSE-S3), although the refactoring might simplify its implementation.
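To make the idea a bit more concrete, here is a rough sketch of the kind of
abstraction I have in mind; the names (SecretEngine, get_key,
VaultSecretEngine) are placeholders for illustration only and not the
interface actually proposed in [1]:

// Illustrative sketch only: class and method names are placeholders,
// not the interface proposed in [1].
#include <cerrno>
#include <string>
#include <utility>

// a backend-agnostic interface for resolving the actual encryption key
// referenced by an SSE-KMS request
class SecretEngine {
public:
  virtual ~SecretEngine() = default;
  // returns 0 and fills actual_key on success, or a negative errno
  virtual int get_key(const std::string& key_id, std::string& actual_key) = 0;
};

// one implementation per KMS backend; the rest of the gateway only
// ever talks to SecretEngine
class VaultSecretEngine final : public SecretEngine {
public:
  VaultSecretEngine(std::string addr, std::string token)
    : addr(std::move(addr)), token(std::move(token)) {}
  int get_key(const std::string& key_id, std::string& actual_key) override {
    // a real implementation would query the Vault secrets engine over HTTP
    // here; this sketch only reports "not supported"
    (void)key_id;
    (void)actual_key;
    return -EOPNOTSUPP;
  }
private:
  std::string addr;
  std::string token;
};

A Barbican implementation would then just be another subclass of the same
interface, and choosing the backend becomes a configuration decision rather
than a separate code path.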
Thanks.
[1] https://pad.ceph.com/p/rgw_sse-kms
--
Andrea Baglioni
Hello everyone,
Recently I was trying to expand an OSD disk using bluefs-bdev-expand.
Since this OSD runs on a virtual machine managed by oVirt, I first resized
the virtual disk of the VM. The result from 'lsblk' was:
vdb    252:16   0  200G  0 disk
└─ceph--bc94ec07--2ac3--4965--8750--bb9e42ec670f-osd--block--aa7de90e--0442--4cd9--9927--a17dd666ea74    253:2   0  100G  0 lvm
As you can see, the block device /dev/vdb is now 200G but the logical volume
is still 100G. I then ran the following:
lvextend -L+100G
/dev/ceph--bc94ec07--2ac3--4965--8750--bb9e42ec670f/osd--block--aa7de90e--0442--4cd9--9927--a17dd666ea74
After using lvextend I then ran:
# ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-7/
inferring bluefs devices from bluestore path
{
    "/var/lib/ceph/osd/ceph-7//block": {
        "osd_uuid": "aa7de90e-0442-4cd9-9927-a17dd666ea74",
        "size": 107372085248,
        "btime": "2019-07-02 13:56:58.589154",
        "description": "main",
        "bluefs": "1",
        "ceph_fsid": "6effd8df-d109-4ef3-9cfa-c68f9756a54b",
        "kv_backend": "rocksdb",
        "magic": "ceph osd volume v026",
        "mkfs_done": "yes",
        "osd_key": "AQCJRhtdZZgTEBAA7G7fzTyj0d2r4RRa/uxaZQ==",
        "ready": "ready",
        "whoami": "7"
    }
}
Running ceph-bluestore-tool bluefs-bdev-expand --path
/var/lib/ceph/osd/ceph-7 then fails with an error that I unfortunately
cannot reproduce right now; the bottom line is that the device does not get
expanded.
Is bluefs-bdev-expand supported on Mimic? Is there a clean way to
expand an OSD? Right now I'm running the following with ceph-deploy:
# ceph-deploy disk zap vm1-osd1 /dev/vdb
# ceph-deploy osd create vm1-osd1 --data /dev/vdb
The above deletes everything and recreates it which is really not ideal.
Any suggestion?
Thanks in advance.
--
Met vriendelijke groeten,
Valentin Bajrami
Target Holding
hi guys,
i just came across boost::outcome[0]. it reminded me of the discussion we
had back in Barcelona regarding error handling in crimson.
well, strictly speaking, it's not limited to errors; it covers non-error
handling as well.
the question is: shall we start prototyping the crimson variant of
outcome<> now? and if yes, can we leverage boost::outcome<> for it?
a little bit background:
seastar uses exceptions for propagating errors, but this incurs runtime
overhead: to throw an exception, the libstdc++ runtime needs to acquire a
global lock.
well, some of us might want to argue: why not just return a
future<Result, Error>? let me use an example here. imagine we are
handling a write request in the OSD; we might need to go through the
following steps:
1. perform some sanity tests, for instance, to see if the OSD is ready
for handling the write request
2. try to read the object info of the object from local storage to see
if it already exists
3. write the object to the local storage, and send write requests
to the replica OSDs (assuming it's in a replicated pool), then wait for the
completions of these write ops.
4. update the statistics
5. reply to the client
and it's intuitive to structure these steps using chained continuations, like:
do_with(std::move(request), [this](auto& request) {
  return perform_tests(request->object_id).then([&request, this] {
    return read_object_info(request->object_id);
  }).then([&request, this](optional<object_info> object_info) {
    return when_all(
      write_local(request->object_id, request->offset, request->data),
      parallel_for_each(replica_osds, [&request](auto replica_osd) {
        return replica_osd->write_remote(request->object_id,
                                         request->offset, request->data);
      }));
  }).then([&request, this] {
    update_statistics(request->data.size());
    return reply_to(reply_t::success, request);
  }).handle_exception([&request](auto exception) {
    return reply_to(reply_t::failure, exception, request);
  });
});
in which, if any test fails in step #1, we either need to wait until
the OSD is ready, or we need to bail out and skip the following
steps. the "handle_exception()" clause is used to handle the "bail
out" case, where we cannot do anything to serve the request, for
instance when the request is invalid.
we want to differentiate between two types of errors. one type is actual
exceptional conditions, which do not happen often in the real world and
which we don't need/want to optimize for. but the other type can be
perfectly normal: for instance, it's fairly normal that an object does not
exist yet when we are trying to write to it. we do want to be performant
when handling the "errors" in this category, and we also want to handle
them in a way that is as convenient as handling exceptions. that means we
need an efficient way to tell the caller "please skip the following
continuations, i will take this handling route instead". if my memory
serves me correctly, we thought we would need to create a wrapper around
seastar::future<> to allow the caller to do something like
// a helper to run func or skip it
template<typename Func>
auto ignore_on_error(Func&& f) {
  return [f=std::forward<Func>(f)](auto&& t) {
    return t.is_value() ? f(t.value()) : t;
  };
}

return read_object_info(oid).then(
  ignore_on_error([](object_info& oi) {
    return handle_write_with_object_info(std::move(oi));
  })
).then([](auto t) {
  return handle_write_without_object_info();
});
in the example above, i assume we will do something very different
depending on whether the object exists.
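to make the boost::outcome<> idea a bit more concrete, here is a minimal,
synchronous sketch. it is not crimson code: read_object_info_sync(),
object_info and oid_t are made-up stand-ins. it just shows how
outcome::result<> lets the "object does not exist" case travel by value
instead of being thrown:

#include <boost/outcome.hpp>
#include <system_error>

namespace outcome = boost::outcome_v2;

struct object_info { /* ... */ };
using oid_t = unsigned;

outcome::result<object_info, std::error_code>
read_object_info_sync(oid_t oid)
{
  const bool exists = (oid % 2 == 0);    // stand-in for a local lookup
  if (!exists) {
    // the "normal" error: nothing is thrown, the error code is returned by value
    return std::make_error_code(std::errc::no_such_file_or_directory);
  }
  return object_info{};
}

int handle_write(oid_t oid)
{
  auto r = read_object_info_sync(oid);
  if (r) {
    // handle_write_with_object_info(r.value());
    return 0;
  }
  // handle_write_without_object_info();
  return r.error().value();
}

the wrapper around seastar::future<> we discussed would essentially be the
asynchronous analogue of the if/else above, with the error carried inside
the future instead of thrown across the continuation chain.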
cheers,
---
[0] https://www.boost.org/doc/libs/1_70_0/libs/outcome/doc/html/index.html
--
Regards
Kefu Chai
The next Ceph Developer Monthly falls on this Wednesday, July 3. Since
this is adjacent to a US holiday, it's likely many people won't make it.
More importantly, we failed to send out an agenda last week.
Let's delay this until next week, Jul 10 9PM ET (Jul 11 0100 UTC).
Thanks!
sage
hi Mark,
i am working on using cbt for testing crimson. as you might know,
crimson-osd is currently using a variant of memstore as its object
store backend, so it'd be very easy for crimson-osd to run out of
memory, as the default run "time" of cbt radosbench is 300 seconds.
currently, each radosbench run is composed of 3 steps:
1. prefill // optional, enabled if "prefill_time" or "prefill_objects" is set
2. write
3. read // optional, enabled if "write_only" is not set
the pain point is that the run times for the write and read steps are
specified using the same setting -- "time".
so, i am wondering if it's okay to add an option named "read_only" to
skip the "write" step and let the prefill step prepare the testbed for
the read test, so that we can specify the time for prefilling and the time
for reading separately.
as an alternative, we could have a "write_time" option which defaults
to "time" if not specified but takes precedence over "time" when it is;
and if it's "0", the "write" step will be skipped.
what do you think?
--
Regards
Kefu Chai