This is the fourth release in the Ceph Nautilus stable release series. Its sole
purpose is to fix a regression that found its way into the previous release.
Notable Changes
---------------
The ceph-volume tool in Nautilus v14.2.3 was found to contain a serious
regression, described in https://tracker.ceph.com/issues/41660, which
prevented deployment tools like ceph-ansible, DeepSea, Rook, etc. from
deploying/removing OSDs.
Getting Ceph
------------
* Git at git://github.com/ceph/ceph.git
* Tarball at http://download.ceph.com/tarballs/ceph-14.2.4.tar.gz
* For packages, see http://docs.ceph.com/docs/master/install/get-packages/
* Release git sha1: 75f4de193b3ea58512f204623e6c5a16e6c1e1ba
--
Abhishek Lekshmanan
SUSE Software Solutions Germany GmbH
sharing a design for feedback. please let me know if you spot any other
races, issues or optimizations!
current resharding steps:
1) copy the 'source' bucket instance into a new 'target' bucket instance
with a new instance id
2) flag all source bucket index shards with RESHARD_IN_PROGRESS
3) flag the source bucket instance with RESHARD_IN_PROGRESS
4) list all omap entries in the source bucket index shards
(cls_rgw_bi_list) and write each entry to its target bucket index shard
(cls_rgw_bi_put)
5a) on success: link target bucket instance, delete source bucket index
shards, delete source bucket instance
5b) on failure: reset RESHARD_IN_PROGRESS flag on source bucket index
shards, delete target bucket index shards, delete target bucket instance
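
to make that sequence concrete, here is a minimal python sketch of
steps (1)-(5); every name on the 'cluster' object is a hypothetical
stand-in for the real rgw/cls_rgw machinery, not an actual interface:

    def reshard(cluster, source):
        target = cluster.copy_instance(source)           # (1) new instance id
        cluster.flag_shards(source, in_progress=True)    # (2) RESHARD_IN_PROGRESS
        cluster.flag_instance(source, in_progress=True)  # (3)
        try:
            # (4) cls_rgw_bi_list each source shard, cls_rgw_bi_put to target
            cluster.copy_omap_entries(source, target)
        except Exception:
            # (5b) failure: unblock the source, discard the partial target
            cluster.flag_shards(source, in_progress=False)
            cluster.delete_shards(target)
            cluster.delete_instance(target)
            raise
        # (5a) success: the target becomes the live index
        cluster.link_instance(target)
        cluster.delete_shards(source)
        cluster.delete_instance(source)
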
the current blocking strategy is enforced on the source bucket index
shards. any write operations received by cls_rgw while the
RESHARD_IN_PROGRESS flag is set are rejected with ERR_BUSY_RESHARDING.
radosgw handles these errors by waiting/polling until the reshard
finishes, then resending the operation to the new target bucket index
shard.
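
the wait/poll loop on the radosgw side looks roughly like this
(BusyResharding and the bucket/shard helpers are made-up stand-ins for
ERR_BUSY_RESHARDING and the real gateway code):

    import time

    class BusyResharding(Exception):
        """stand-in for cls_rgw's ERR_BUSY_RESHARDING error"""

    def apply_index_op(bucket, op, poll_interval=0.1, timeout=60.0):
        deadline = time.monotonic() + timeout
        while True:
            # re-resolve the shard on each attempt; once the reshard
            # finishes this maps to the new target bucket index shard
            shard = bucket.index_shard_for(op.key)
            try:
                return shard.send(op)
            except BusyResharding:
                if time.monotonic() >= deadline:
                    raise
                time.sleep(poll_interval)  # wait for the reshard to finish
                bucket.refresh_instance()  # pick up the new bucket instance
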
to avoid blocking write ops during a reshard, we could instead apply
their bucket index operations to both the source and target bucket index
shards in parallel. this includes both the 'prepare' op to start the
transaction, and the asynchronous 'complete' to commit. allowing both
buckets to mutate during reshard introduces several new races:
I) between steps (2) and (3), radosgw doesn't yet see the
RESHARD_IN_PROGRESS flag in the bucket instance info, so doesn't know to
send the extra index operations to the target bucket index shard
II) operations applied on the target bucket index shards could be
overwritten by the omap entries copied from the source bucket index
shards in step (4)
III) radosgw sends a 'prepare' op to the source bucket index shard
before step (2), then sends the async 'complete' op to the source bucket
index shard after (2). before step (5), this complete op would fail with
ERR_BUSY_RESHARDING. after step (5), it would fail with ENOENT. since
the complete is async, and we've already replied to the client, it's too
late for any recovery
IV) radosgw sends an operation to both the source and target bucket
index shards that races with (5) and fails with ENOENT on either the
source shard (5a) or the target shard (5b)
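
for reference, the dual-apply idea above in sketch form (same made-up
helpers as before); each of races (I)-(IV) is a window where this
picture breaks down:

    def prepare(bucket, op):
        # mirror the index transaction onto both copies of the index so
        # neither one goes stale while the reshard copies entries
        shards = [bucket.source_shard_for(op.key)]
        if bucket.reshard_in_progress:
            shards.append(bucket.target_shard_for(op.key))
        for shard in shards:
            shard.prepare(op)    # start the transaction on each copy
        return shards

    def complete(shards, op):
        # the async commit goes to every shard that saw the prepare
        for shard in shards:
            shard.complete(op)
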
introducing a new generation number or 'reshard_epoch' to each bucket
that increments on a reshard attempt can help to resolve these races. so
in step (2), the call to cls_rgw_set_bucket_resharding() would also
increment the bucket index shard's reshard_epoch. similarly, step (3)
would increment the bucket instance's reshard_epoch.
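
sketched with made-up field and function names:

    class ShardHeader:
        def __init__(self):
            self.reshard_in_progress = False
            self.reshard_epoch = 0

    def set_bucket_resharding(header):
        # step (2): flagging a shard also advances its reshard_epoch
        header.reshard_in_progress = True
        header.reshard_epoch += 1

    def set_instance_resharding(instance_info):
        # step (3): the bucket instance epoch is bumped the same way
        instance_info.reshard_in_progress = True
        instance_info.reshard_epoch += 1
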
to resolve the race in (I), cls_rgw would reject bucket index operations
with a reshard_epoch older than the one stored in the bucket index
shard. this ERR_BUSY_RESHARDING error would direct radosgw to re-read
its bucket instance, detect the reshard in progress, and resend the
operation to both the source and target bucket index shards with the
updated reshard_epoch
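
roughly, with the same stand-ins (the epoch check would run inside
cls_rgw, the retry inside radosgw):

    class BusyResharding(Exception):
        """stand-in for ERR_BUSY_RESHARDING"""

    def check_epoch(shard_header, op):
        # cls_rgw side: reject any op tagged with a stale reshard_epoch
        if op.reshard_epoch < shard_header.reshard_epoch:
            raise BusyResharding()

    def send_index_op(bucket, op):
        # radosgw side: a stale-epoch rejection means a reshard started
        # since we last cached the bucket instance
        try:
            bucket.source_shard_for(op.key).send(op)
        except BusyResharding:
            bucket.refresh_instance()                # observe the reshard
            op.reshard_epoch = bucket.reshard_epoch  # updated epoch
            for shard in bucket.shards_for(op.key):  # source and target
                shard.send(op)
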
to resolve the race in (II), cls_rgw_bi_put() would have to test whether
the given key exists before overwriting
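
i.e. the copy in step (4) becomes a create-if-absent; modeling the
shard's omap as a plain dict:

    def bi_put_if_absent(target_omap, key, value):
        # race (II): an entry already written by a live op on the target
        # shard must win over the stale copy taken from the source shard
        if key not in target_omap:
            target_omap[key] = value
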
the race in (III) is benign, because the 'prepared' entry was reliably
stored in the source shard before reshard, so we're guaranteed to see a
copy on the target shard. even though the 'complete' operation isn't
applied, the dir_suggest mechanism will detect the incomplete
transaction and repair the index the next time the target bucket is listed
the race in (IV) can be treated as a success if the operation succeeds
on the target bucket index shard. if it fails on the target shard,
radosgw needs to re-read the bucket entrypoint and instance to retarget
the operation
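
in sketch form (FileNotFoundError standing in for ENOENT):

    def send_dual(bucket, op):
        # race (IV): a concurrent (5a)/(5b) may delete one of the two
        # shards out from under us
        try:
            bucket.source_shard_for(op.key).send(op)
        except FileNotFoundError:
            pass  # ENOENT on the source is fine if the target accepts it
        try:
            bucket.target_shard_for(op.key).send(op)
        except FileNotFoundError:
            # ENOENT on the target: the reshard was rolled back (5b), so
            # re-read the entrypoint + instance and retarget the op
            bucket.refresh_entrypoint_and_instance()
            bucket.index_shard_for(op.key).send(op)
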
one thing this strategy cannot handle is versioned buckets. some index
operations for versioning (namely cls_rgw_bucket_link_olh and
cls_rgw_bucket_unlink_instance) involve writes to two or more related
omap entries. because step (4) copies over single omap entries, it can't
preserve the consistency of these relationships once we allow mutation.
so we'd need to stick with the blocking strategy for versioned buckets
Hello,
While working on CephFS Quick Start guide[1], the major issue that I
came across was choosing the value for pg_num for the pools that will
serve CephFS. I've tried the values from 4 to 128 for both data and
metadata pools and have always got "undersized+peered" instead of
"active+clean". Copying pg_num values from the cluster setup by
vstart.sh (8 for data and 16 for metadata pools) gave me the same
result.
About the cluster: I had a single node running Fedora 29 with 1 MON, 1
MGR, 1 MDS and 3 OSDs each with a disk size of 10 GB. Thinking that
disk size might have a role to play, I changed the number of OSDs to 2
each with 20 GB disks and later with 50 GB disks but neither helped. I
used dnf to install ceph and ceph-deploy to set up the cluster.
I've copied the cluster status after every attempt here[2] in case
that helps. Any suggestions about which pg_num values I should choose,
and which values would work well for a user looking to get started
quickly with CephFS?
[1] https://docs.ceph.com/docs/master/start/quick-cephfs/
[2] https://paste.fedoraproject.org/paste/Q-WH8VWtwu6JwF7eW2JmnA
Thanks,
- Rishabh