sharing a design for feedback. please let me know if you spot any other
races, issues or optimizations!
current resharding steps:
1) copy the 'source' bucket instance into a new 'target' bucket instance
with a new instance id
2) flag all source bucket index shards with RESHARD_IN_PROGRESS
3) flag the source bucket instance with RESHARD_IN_PROGRESS
4) list all omap entries in the source bucket index shards
(cls_rgw_bi_list) and write each entry to its target bucket index shard
(cls_rgw_bi_put) - see the sketch after these steps
5a) on success: link target bucket instance, delete source bucket index
shards, delete source bucket instance
5b) on failure: reset RESHARD_IN_PROGRESS flag on source bucket index
shards, delete target bucket index shards, delete target bucket instance
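here's a rough, self-contained sketch of the copy in step (4).
list_source_entries, put_target_entry and the hash-based shard mapping
are hypothetical placeholders standing in for the cls_rgw_bi_list /
cls_rgw_bi_put plumbing and radosgw's real shard placement, just to
illustrate the shape of the loop:

  #include <functional>
  #include <string>
  #include <vector>

  struct BiEntry {
    std::string key;    // omap key in the bucket index shard
    std::string value;  // serialized index entry
  };

  // placeholder: would wrap cls_rgw_bi_list() on one source shard
  std::vector<BiEntry> list_source_entries(int source_shard);

  // placeholder: would wrap cls_rgw_bi_put() on one target shard
  void put_target_entry(int target_shard, const BiEntry& entry);

  // assumption: entries are re-distributed by hashing the key into
  // the new shard count
  int target_shard_for(const std::string& key, int num_target_shards) {
    return std::hash<std::string>{}(key) % num_target_shards;
  }

  // step (4): copy every omap entry from every source shard into the
  // target shard it maps to under the new shard count
  void copy_index(int num_source_shards, int num_target_shards) {
    for (int s = 0; s < num_source_shards; ++s) {
      for (const auto& e : list_source_entries(s)) {
        put_target_entry(target_shard_for(e.key, num_target_shards), e);
      }
    }
  }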
the current blocking strategy is enforced on the source bucket index
shards. any write operations received by cls_rgw while the
RESHARD_IN_PROGRESS flag is set are rejected with ERR_BUSY_RESHARDING.
radosgw handles these errors by waiting/polling until the reshard
finishes, then it resends the operation to the new target bucket index
shard.
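roughly like this, where send_index_op, reshard_in_progress and
refresh_bucket_instance are placeholders for the real radosgw plumbing
(the error value is made up too), just to show the wait/resend shape:

  #include <chrono>
  #include <thread>

  constexpr int ERR_BUSY_RESHARDING = -2300;  // placeholder value

  int send_index_op();             // placeholder: send the index op
  bool reshard_in_progress();      // placeholder: poll reshard status
  void refresh_bucket_instance();  // placeholder: re-read instance info

  // on ERR_BUSY_RESHARDING, wait for the reshard to finish, re-read the
  // bucket instance to pick up the new target shards, then resend
  int send_with_reshard_retry() {
    int r = send_index_op();
    while (r == ERR_BUSY_RESHARDING) {
      while (reshard_in_progress()) {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
      }
      refresh_bucket_instance();
      r = send_index_op();
    }
    return r;
  }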
to avoid blocking write ops during a reshard, we could instead apply
their bucket index operations to both the source and target bucket index
shards in parallel. this includes both the 'prepare' op to start the
transaction, and the asynchronous 'complete' to commit. allowing both
buckets to mutate during reshard introduces several new races:
I) between steps (2) and (3), radosgw doesn't yet see the
RESHARD_IN_PROGRESS flag in the bucket instance info, so doesn't know to
send the extra index operations to the target bucket index shard
II) operations applied on the target bucket index shards could be
overwritten by the omap entries copied from the source bucket index
shards in step (4)
III) radosgw sends a 'prepare' op to the source bucket index shard
before step (2), then sends the async 'complete' op to the source bucket
index shard after (2). before step (5), this complete op would fail with
ERR_BUSY_RESHARDING. after step (5), it would fail with ENOENT. since
the complete is async, and we've already replied to the client, it's too
late for any recovery
IV) radosgw sends an operation to both the source and target bucket
index shards that races with (5) and fails with ENOENT on either the
source shard (5a) or the target shard (5b)
introducing a new generation number or 'reshard_epoch' to each bucket
that increments on a reshard attempt can help to resolve these races. so
in step (2), the call to cls_rgw_set_bucket_resharding() would also
increment the bucket index shard's reshard_epoch. similarly, step (3)
would increment the bucket instance's reshard_epoch.
to resolve the race in (I), cls_rgw would reject bucket index operations
with a reshard_epoch older than the one stored in the bucket index
shard. this ERR_BUSY_RESHARDING error would direct radosgw to re-read
its bucket instance, detect the reshard in progress, and resend the
operation to both the source and target bucket index shards with the
updated reshard_epoch
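on the cls_rgw side, that check could look something like the following
(the header layout and names here are hypothetical, not the actual
cls_rgw structures):

  #include <cstdint>

  // hypothetical per-shard header state
  struct ShardHeader {
    uint64_t reshard_epoch = 0;  // bumped by cls_rgw_set_bucket_resharding()
    bool resharding = false;     // set while RESHARD_IN_PROGRESS
  };

  constexpr int ERR_BUSY_RESHARDING = -2300;  // placeholder value

  // reject index ops tagged with a stale reshard_epoch; radosgw reacts
  // by re-reading the bucket instance and resending with the new epoch
  // to both the source and target shards
  int check_reshard_epoch(const ShardHeader& header, uint64_t op_epoch) {
    if (op_epoch < header.reshard_epoch) {
      return ERR_BUSY_RESHARDING;
    }
    return 0;
  }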
to resolve the race in (II), cls_rgw_bi_put() would have to test whether
the given key exists before overwriting
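in other words, the copy becomes a create-if-absent, so an entry already
written by a racing op on the target shard wins over the copied one. a
tiny sketch of the rule, using a plain std::map in place of the shard's
omap:

  #include <map>
  #include <string>

  // only write the copied entry if the key isn't already present
  void bi_put_if_absent(std::map<std::string, std::string>& omap,
                        const std::string& key, const std::string& value) {
    omap.emplace(key, value);  // emplace is a no-op if the key exists
  }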
the race in (III) is benign, because the 'prepared' entry was reliably
stored in the source shard before reshard, so we're guaranteed to see a
copy on the target shard. even though the 'complete' operation isn't
applied, the dir_suggest mechanism will detect the incomplete
transaction and repair the index the next time the target bucket is listed
the race in (IV) can be treated as a success if the operation succeeds
on the target bucket index shard. if it fails on the target shard,
radosgw needs to re-read the bucket entrypoint and instance to retarget
the operation
one thing this strategy cannot handle is versioned buckets. some index
operations for versioning (namely cls_rgw_bucket_link_olh and
cls_rgw_bucket_unlink_instance) involve writes to two or more related
omap entries. because step (4) copies over single omap entries, it can't
preserve the consistency of these relationships once we allow mutation.
so we'd need to stick with the blocking strategy for versioned buckets
Background: In nautilus, bluestore started maintaining usage stats on a
per-pool basis. BlueStore OSDs created before nautilus lack these stats.
Running a ceph-bluestore-tool repair can calculate the usage so that
the OSD can maintain and report them going forward.
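For example, something like this with the OSD stopped (the path shown is
just the usual default location):

  ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-0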
There are two options:
- bluestore_warn_on_legacy_statfs (bool, default: true), which makes the
cluster issue a health warning when there are OSDs that have legacy stats.
- bluestore_no_per_pool_stats_tolerance (enum enforce, until_fsck,
until_repair, default: until_repair).
  'until_fsck' will tolerate the legacy stats but fsck will fail
  'until_repair' will tolerate the legacy stats but fsck will pass
  'enforce' will tolerate the legacy stats but disable the warning
The octopus addition of per-pool omap usage tracking presents an identical
problem: a new tracking capability in bluestore that requires a conversion
after upgrade to enable it.
I think that we can simplify these settings and make them less confusing,
still with two options:
- bluestore_fsck_error_on_no_per_pool_omap (bool, default: false). During
fsck, we can either generate a 'warning' about non-per-pool omap, or an
error. Generate a warning by default, which means that the fsck return
code can indicate success.
- bluestore_warn_on_no_per_pool_omap (bool, default: true). At runtime, we
can generate a health warning if the OSD is using the legacy non-per-pool
omap.
The overall default behavior is the same as we have with the
legacy_statfs: OSDs still work, fsck passes, and we generate a health
warning.
Setting bluestore_warn_on_no_per_pool_omap=false is the same, AFAICS, as
setting bluestore_no_per_pool_stats_tolerance=enforce. (Except maybe
repair won't do the conversion? I don't see why we'd ever not want to
do the conversion, though.)
Setting bluestore_fsck_error_on_no_per_pool_omap=true is the same, AFAICS,
as bluestore_no_per_pool_stats_tolerance=until_fsck.
Overall, this seems simpler and easier for a user to understand.
Realistically, the only option I expect a user will ever change is
bluestore_warn_on_no_per_pool_omap=false to make the health warning go
away after an upgrade.
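E.g. something along the lines of

  ceph config set osd bluestore_warn_on_no_per_pool_omap false

(assuming we go with that option name).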
What do you think? Should I convert the legacy_statfs to behave the same
way?
sage
Hi Guys,
Earlier this week I was working on investigating the impact of OMAP
performance on RGW and wanted to see if putting rocksdb on ramdisk would
help speed up bucket index updates. While running tests I found out
that the benchmark tool I was using consumed roughly 15 cores of CPU to
push 4K puts/second to RGW from 128 threads. That wasn't really viable,
so I started looking for alternate S3 benchmarking tools. COSBench is
sort of the most well known choice out there, but it's a bit cumbersome
if you just want to run some quick tests from the command-line.
I managed to find a simple yet very nice benchmark called s3-benchmark,
written in Go by Wasabi Inc. While it works well, it
really only targets single buckets and is designed more for AWS testing
than for Ceph. I forked the project and pretty much refactored the
whole thing to be more useful for the kind of testing I want to do.
It's now at the point where I think it might be ready to experiment
with. S3 benchmarking has been a semi-recurring topic on the list so I
figured other folks might be interested in trying it too and hopefully
providing feedback.
The new benchmark is called hsbench and it's available here:
https://github.com/markhpc/hsbench
See the README.md for some of the advantages over the original
s3-benchmark program it was forked from. I consider this release to be
alpha level quality and disclaim all responsibility if it breaks and
deletes every object in all of your buckets. Consider yourself warned. :)
Mark
Hello,
we've been trying to get Ceph to work on IBM Z, a big-endian system,
and have been running into various serious issues relating to endian
conversion code.
The main issue we've been seeing is that while the old-style
decode/encode machinery in include/encoding.h automatically
byte-swaps all integers on big-endian systems (to ensure the
serialized format is always little endian), the *new-style*
machinery in include/denc.h does not.
This seemed confusing at first since there is quite a bit of code
there that appears intended to perform exactly that function, e.g.
  template<typename T>
  struct ExtType<T, std::enable_if_t<std::is_same_v<T, int32_t> ||
                                     std::is_same_v<T, uint32_t>>> {
    using type = __le32;
  };
However, it turns out that at this point __le32 is actually
just an alias for __u32, so this whole machinery doesn't
really do anything at all.
Looking at the old code in encoding.h, I notice that it works
similarly, but uses ceph_le32 instead of __le32. The former
is a C++ class that actually does perform byte-swap on access.
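Roughly along these lines, if I understand it correctly -- this is just
an illustration of the byte-swap-on-access idea, not the actual
ceph_le32 definition:

  #include <cstdint>

  struct le32_wrapper {
    uint32_t v;  // always stored little-endian in memory

    static uint32_t swab(uint32_t x) {
  #if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
      return __builtin_bswap32(x);  // byte-swap on big-endian hosts
  #else
      return x;                     // no-op on little-endian hosts
  #endif
    }

    operator uint32_t() const { return swab(v); }            // read access
    le32_wrapper& operator=(uint32_t x) { v = swab(x); return *this; }
  };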
Even more confusing, there is this code in include/types.h:
// temporarily remap __le* to ceph_le* for benefit of shared kernel/userland headers
#define __le16 ceph_le16
#define __le32 ceph_le32
#define __le64 ceph_le64
#include "ceph_fs.h"
#include "ceph_frag.h"
#include "rbd_types.h"
#undef __le16
#undef __le32
#undef __le64
which --sometimes-- redefines __le32 as ceph_le32, but those
redefines are not active at the point denc.h is included.
So it would appear that the usage of __le32 in denc.h is incorrect,
and this code should be using ceph_le32 instead. Is this right?
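i.e. something like the following, assuming ceph_le32 is visible at that
point (just sketching the change we have in mind, not a tested patch):

  template<typename T>
  struct ExtType<T, std::enable_if_t<std::is_same_v<T, int32_t> ||
                                     std::is_same_v<T, uint32_t>>> {
    using type = ceph_le32;  // byte-swapping wrapper instead of plain __u32
  };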
But even so, grepping for __le32 throughout the code base shows
quite a few additional places where it is used, most of which
also appear to assume that the byte-swaps happen automatically.
In addition, there appear to be some places where e.g.
ceph_fs.h is included directly, without going via types.h -- and
in those places, we suddenly no longer get the byte-swaps ...
Now I was wondering whether the best way forward might be to just
have __le32 always be defined as ceph_le32 when compiling user
space code. But then I noticed that it used to be that way,
but that was deliberately changed by this commit back in 2010:
commit 737b5043576153817a6b4195b292672585df10d3
Author: Sage Weil <sage(a)newdream.net>
Date: Fri May 7 13:45:00 2010 -0700
endian: simplify __le* type hackery
Instead of preventing linux/types.h from being included, instead name
our types ceph_le*, and remap using #define _only_ when including the
shared kernel/userspace headers.
So I'm a bit at a loss to understand how all this is supposed to
be working. Any suggestions would be welcome -- we'd be willing
to implement whatever's needed, but would like some guidance as
to what the solution should look like ...
Bye,
Ulrich
Hi folks,
The next Ceph Developer Monthly will be:
Wed Sept 4 at 9PM ET, or
Thu Sept 5 at 0100 UTC
The agenda:
https://tracker.ceph.com/projects/ceph/wiki/CDM_04-SEP-2019
Current topics are:
- wandering log (FIFO striped over rados objects)
- rados: return bufferlist attached to write operations
- github actions workflows for Ceph
Feel free to add more to the wiki.
See you there!
Josh