sharing a design for feedback. please let me know if you spot any other
races, issues or optimizations!
current resharding steps:
1) copy the 'source' bucket instance into a new 'target' bucket instance
with a new instance id
2) flag all source bucket index shards with RESHARD_IN_PROGRESS
3) flag the source bucket instance with RESHARD_IN_PROGRESS
4) list all omap entries in the source bucket index shards
(cls_rgw_bi_list) and write each entry to its target bucket index shard
(cls_rgw_bi_put) - see the sketch after these steps
5a) on success: link target bucket instance, delete source bucket index
shards, delete source bucket instance
5b) on failure: reset RESHARD_IN_PROGRESS flag on source bucket index
shards, delete target bucket index shards, delete target bucket instance
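here's a rough, self-contained sketch of the copy in step (4).
list_source_entries, put_target_entry and the hash-based shard mapping
are hypothetical placeholders standing in for the cls_rgw_bi_list /
cls_rgw_bi_put plumbing and radosgw's real shard placement, just to
illustrate the shape of the loop:

  #include <functional>
  #include <string>
  #include <vector>

  struct BiEntry {
    std::string key;    // omap key in the bucket index shard
    std::string value;  // serialized index entry
  };

  // placeholder: would wrap cls_rgw_bi_list() on one source shard
  std::vector<BiEntry> list_source_entries(int source_shard);

  // placeholder: would wrap cls_rgw_bi_put() on one target shard
  void put_target_entry(int target_shard, const BiEntry& entry);

  // assumption: entries are re-distributed by hashing the key into
  // the new shard count
  int target_shard_for(const std::string& key, int num_target_shards) {
    return std::hash<std::string>{}(key) % num_target_shards;
  }

  // step (4): copy every omap entry from every source shard into the
  // target shard it maps to under the new shard count
  void copy_index(int num_source_shards, int num_target_shards) {
    for (int s = 0; s < num_source_shards; ++s) {
      for (const auto& e : list_source_entries(s)) {
        put_target_entry(target_shard_for(e.key, num_target_shards), e);
      }
    }
  }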
the current blocking strategy is enforced on the source bucket index
shards. any write operations received by cls_rgw while the
RESHARD_IN_PROGRESS flag is set are rejected with ERR_BUSY_RESHARDING.
radosgw handles these errors by waiting/polling until the reshard
finishes, then it resends the operation to the new target bucket index
shard.
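roughly like this, where send_index_op, reshard_in_progress and
refresh_bucket_instance are placeholders for the real radosgw plumbing
(the error value is made up too), just to show the wait/resend shape:

  #include <chrono>
  #include <thread>

  constexpr int ERR_BUSY_RESHARDING = -2300;  // placeholder value

  int send_index_op();             // placeholder: send the index op
  bool reshard_in_progress();      // placeholder: poll reshard status
  void refresh_bucket_instance();  // placeholder: re-read instance info

  // on ERR_BUSY_RESHARDING, wait for the reshard to finish, re-read the
  // bucket instance to pick up the new target shards, then resend
  int send_with_reshard_retry() {
    int r = send_index_op();
    while (r == ERR_BUSY_RESHARDING) {
      while (reshard_in_progress()) {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
      }
      refresh_bucket_instance();
      r = send_index_op();
    }
    return r;
  }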
to avoid blocking write ops during a reshard, we could instead apply
their bucket index operations to both the source and target bucket index
shards in parallel. this includes both the 'prepare' op to start the
transaction, and the asynchronous 'complete' to commit. allowing both
buckets to mutate during reshard introduces several new races:
I) between steps (2) and (3), radosgw doesn't yet see the
RESHARD_IN_PROGRESS flag in the bucket instance info, so doesn't know to
send the extra index operations to the target bucket index shard
II) operations applied on the target bucket index shards could be
overwritten by the omap entries copied from the source bucket index
shards in step (4)
III) radosgw sends a 'prepare' op to the source bucket index shard
before step (2), then sends the async 'complete' op to the source bucket
index shard after (2). before step (5), this complete op would fail with
ERR_BUSY_RESHARDING. after step (5), it would fail with ENOENT. since
the complete is async, and we've already replied to the client, it's too
late for any recovery
IV) radosgw sends an operation to both the source and target bucket
index shards that races with (5) and fails with ENOENT on either the
source shard (5a) or the target shard (5b)
introducing a new generation number or 'reshard_epoch' to each bucket
that increments on a reshard attempt can help to resolve these races. so
in step (2), the call to cls_rgw_set_bucket_resharding() would also
increment the bucket index shard's reshard_epoch. similarly, step (3)
would increment the bucket instance's reshard_epoch.
to resolve the race in (I), cls_rgw would reject bucket index operations
with a reshard_epoch older than the one stored in the bucket index
shard. this ERR_BUSY_RESHARDING error would direct radosgw to re-read
its bucket instance, detect the reshard in progress, and resend the
operation to both the source and target bucket index shards with the
updated reshard_epoch
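on the cls_rgw side, that check could look something like the following
(the header layout and names here are hypothetical, not the actual
cls_rgw structures):

  #include <cstdint>

  // hypothetical per-shard header state
  struct ShardHeader {
    uint64_t reshard_epoch = 0;  // bumped by cls_rgw_set_bucket_resharding()
    bool resharding = false;     // set while RESHARD_IN_PROGRESS
  };

  constexpr int ERR_BUSY_RESHARDING = -2300;  // placeholder value

  // reject index ops tagged with a stale reshard_epoch; radosgw reacts
  // by re-reading the bucket instance and resending with the new epoch
  // to both the source and target shards
  int check_reshard_epoch(const ShardHeader& header, uint64_t op_epoch) {
    if (op_epoch < header.reshard_epoch) {
      return ERR_BUSY_RESHARDING;
    }
    return 0;
  }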
to resolve the race in (II), cls_rgw_bi_put() would have to test whether
the given key exists before overwriting
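in other words, the copy becomes a create-if-absent, so an entry already
written by a racing op on the target shard wins over the copied one. a
tiny sketch of the rule, using a plain std::map in place of the shard's
omap:

  #include <map>
  #include <string>

  // only write the copied entry if the key isn't already present
  void bi_put_if_absent(std::map<std::string, std::string>& omap,
                        const std::string& key, const std::string& value) {
    omap.emplace(key, value);  // emplace is a no-op if the key exists
  }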
the race in (III) is benign, because the 'prepared' entry was reliably
stored in the source shard before reshard, so we're guaranteed to see a
copy on the target shard. even though the 'complete' operation isn't
applied, the dir_suggest mechanism will detect the incomplete
transaction and repair the index the next time the target bucket is listed
the race in (IV) can be treated as a success if the operation succeeds
on the target bucket index shard. if it fails on the target shard,
radosgw needs to re-read the bucket entrypoint and instance to retarget
the operation
one thing this strategy cannot handle is versioned buckets. some index
operations for versioning (namely cls_rgw_bucket_link_olh and
cls_rgw_bucket_unlink_instance) involve writes to two or more related
omap entries. because step (4) copies over single omap entries, it can't
preserve the consistency of these relationships once we allow mutation.
so we'd need to stick with the blocking strategy for versioned buckets
Background: In nautilus, bluestore started maintaining usage stats on a
per-pool basis. BlueStore OSDs created before nautilus lack these stats.
Running a ceph-bluestore-tool repair can calculate the usage so that
the OSD can maintain and report them going forward.
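For example, something like this with the OSD stopped (the path shown is
just the usual default location):

  ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-0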
There are two options:
- bluestore_warn_on_legacy_statfs (bool, default: true), which makes the
cluster issue a health warning when there are OSDs that have legacy stats.
- bluestore_no_per_pool_stats_tolerance (enum enforce, until_fsck,
until_repair, default: until_repair).
  'until_fsck' will tolerate the legacy stats but fsck will fail
  'until_repair' will tolerate the legacy stats but fsck will pass
  'enforce' will tolerate the legacy stats but disable the warning
The octopus addition of per-pool omap usage tracking presents an identical
problem: a new tracking capability in bluestore that requires a conversion
after upgrade to enable it.
I think that we can simplify these settings and make them less confusing,
still with two options:
- bluestore_fsck_error_on_no_per_pool_omap (bool, default: false). During
fsck, we can either generate a 'warning' about non-per-pool omap, or an
error. Generate a warning by default, which means that the fsck return
code can indicate success.
- bluestore_warn_on_no_per_pool_omap (bool, default: true). At runtime, we
can generate a health warning if the OSD is using the legacy non-per-pool
omap.
The overall default behavior is the same as we have with the
legacy_statfs: OSDs still work, fsck passes, and we generate a health
warning.
Setting bluestore_warn_on_no_per_pool_omap=false is the same, AFAICS, as
setting bluestore_no_per_pool_stats_tolerance=enforce. (Except maybe
repair won't do the conversion? I don't see why we'd ever not want to
do the conversion, though.)
Setting bluestore_fsck_error_on_no_per_pool_omap=true is the same, AFAICS,
as bluestore_no_per_pool_stats_tolerance=until_fsck.
Overall, this seems simpler and easier for a user to understand.
Realistically, the only option I expect a user will ever change is
bluestore_warn_on_no_per_pool_omap=false to make the health warning go
away after an upgrade.
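E.g. something along the lines of

  ceph config set osd bluestore_warn_on_no_per_pool_omap false

(assuming we go with that option name).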
What do you think? Should I convert the legacy_statfs to behave the same
way?
sage
Hi Guys,
Earlier this week I was working on investigating the impact of OMAP
performance on RGW and wanted to see if putting rocksdb on ramdisk would
help speed up bucket index updates. While running tests I found out
that the benchmark tool I was using consumed roughly 15 cores of CPU to
push 4K puts/second to RGW from 128 threads. That wasn't really viable,
so I started looking for alternate S3 benchmarking tools. COSBench is
sort of the most well known choice out there, but it's a bit cumbersome
if you just want to run some quick tests from the command-line.
I managed to find a simple yet very nice benchmark called s3-benchmark,
written in Go by Wasabi Inc. While it works well, it
really only targets single buckets and is designed more for AWS testing
than for Ceph. I forked the project and pretty much refactored the
whole thing to be more useful for the kind of testing I want to do.
It's now at the point where I think it might be ready to experiment
with. S3 benchmarking has been a semi-recurring topic on the list so I
figured other folks might be interested in trying it too and hopefully
providing feedback.
The new benchmark is called hsbench and it's available here:
https://github.com/markhpc/hsbench
See the README.md for some of the advantages over the original
s3-benchmark program it was forked from. I consider this release to be
alpha level quality and disclaim all responsibility if it breaks and
deletes every object in all of your buckets. Consider yourself warned. :)
Mark
Hello,
we've been trying to get Ceph to work on IBM Z, a big-endian system,
and have been running into various serious issues relating to endian
conversion code.
The main issue we've been seeing is that while the old-style
decode/encode machinery in include/encoding.h automatically
byte-swaps all integers on big-endian systems (to ensure the
serialized format is always little endian), the *new-style*
machinery in include/denc.h does not.
This seemed confusing at first since there is quite a bit of code
there that appears intended to perform exactly that function, e.g.
  template<typename T>
  struct ExtType<T, std::enable_if_t<std::is_same_v<T, int32_t> ||
                                     std::is_same_v<T, uint32_t>>> {
    using type = __le32;
  };
However, it turns out that at this point __le32 is actually
just an alias for __u32, so this whole machinery doesn't
really do anything at all.
Looking at the old code in encoding.h, I notice that it works
similarly, but uses ceph_le32 instead of __le32. The former
is a C++ class that actually does perform byte-swap on access.
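Roughly along these lines, if I understand it correctly -- this is just
an illustration of the byte-swap-on-access idea, not the actual
ceph_le32 definition:

  #include <cstdint>

  struct le32_wrapper {
    uint32_t v;  // always stored little-endian in memory

    static uint32_t swab(uint32_t x) {
  #if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
      return __builtin_bswap32(x);  // byte-swap on big-endian hosts
  #else
      return x;                     // no-op on little-endian hosts
  #endif
    }

    operator uint32_t() const { return swab(v); }            // read access
    le32_wrapper& operator=(uint32_t x) { v = swab(x); return *this; }
  };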
Even more confusing, there is this code in include/types.h:
// temporarily remap __le* to ceph_le* for benefit of shared kernel/userland headers
#define __le16 ceph_le16
#define __le32 ceph_le32
#define __le64 ceph_le64
#include "ceph_fs.h"
#include "ceph_frag.h"
#include "rbd_types.h"
#undef __le16
#undef __le32
#undef __le64
which --sometimes-- redefines __le32 as ceph_le32, but those
redefines are not active at the point denc.h is included.
So it would appear that the usage of __le32 in denc.h is incorrect,
and this code should be using ceph_le32 instead. Is this right?
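i.e. something like the following, assuming ceph_le32 is visible at that
point (just sketching the change we have in mind, not a tested patch):

  template<typename T>
  struct ExtType<T, std::enable_if_t<std::is_same_v<T, int32_t> ||
                                     std::is_same_v<T, uint32_t>>> {
    using type = ceph_le32;  // byte-swapping wrapper instead of plain __u32
  };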
But even so, grepping for __le32 throughout the code base shows
quite a few additional places where it is used, most of which
also appear to assume that the byte-swaps happen automatically.
In addition, there appear to be some places where e.g.
ceph_fs.h is included directly, without going via types.h -- and
in those places, we suddenly no longer get the byte-swaps ...
Now I was wondering whether the best way forward might be to just
have __le32 always be defined as ceph_le32 when compiling user
space code. But then I noticed that it used to be that way,
but that was deliberately changed by this commit back in 2010:
commit 737b5043576153817a6b4195b292672585df10d3
Author: Sage Weil <sage(a)newdream.net>
Date: Fri May 7 13:45:00 2010 -0700
endian: simplify __le* type hackery
Instead of preventing linux/types.h from being included, instead name
our types ceph_le*, and remap using #define _only_ when including the
shared kernel/userspace headers.
So I'm a bit at a loss to understand how all this is supposed to
be working. Any suggestions would be welcome -- we'd be willing
to implement whatever's needed, but would like some guidance as
to what the solution should look like ...
Bye,
Ulrich
Hi folks,
The next Ceph Developer Monthly will be:
Wed Sept 4 at 9PM ET, or
Thu Sept 5 at 0100 UTC
The agenda:
https://tracker.ceph.com/projects/ceph/wiki/CDM_04-SEP-2019
Current topics are:
- wandering log (FIFO striped over rados objects)
- rados: return bufferlist attached to write operations
- github actions workflows for Ceph
Feel free to add more to the wiki.
See you there!
Josh