Recently (or not so recently, it's been almost 2 years), the nfs-ganesha
project implemented the capability to use asynchronous non-blocking I/O
to storage backends to prevent thread starvation. The assumption is that
the backend provides non-blocking I/O with a callback mechanism to
notify nfs-ganesha when the I/O is complete so that nfs-ganesha can
subsequently asynchronously respond to the client indicating I/O completion.
Ceph looks like it is structured to allow for this, with Context objects
having finish and complete methods that let the I/O path signal
completion. In general libcephfs seems to use some form of condition
variable Context to block and wait for this notification. This would be
relatively easy to replace with a callback Context.
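For illustration, the two patterns might look roughly like this (a minimal
sketch with simplified stand-ins for Ceph's Context and the condition-variable
waiter; the backend and callback names are hypothetical, not actual libcephfs
code):

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <thread>

    struct Context {                      // simplified stand-in for Ceph's Context
      virtual ~Context() = default;
      virtual void finish(int r) = 0;
      virtual void complete(int r) { finish(r); delete this; }
    };

    struct C_BlockingCond : Context {     // stand-in for the condition-variable waiter
      std::mutex m;
      std::condition_variable cv;
      bool done = false;
      int result = 0;
      void finish(int r) override {
        std::lock_guard<std::mutex> l(m);
        result = r; done = true; cv.notify_all();
      }
      void complete(int r) override { finish(r); }  // stack-allocated, so no delete
      int wait() {
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [this] { return done; });
        return result;
      }
    };

    struct C_Callback : Context {         // callback Context for the async path
      std::function<void(int)> cb;
      explicit C_Callback(std::function<void(int)> f) : cb(std::move(f)) {}
      void finish(int r) override { cb(r); }  // e.g. queue the NFS reply here
    };

    // hypothetical backend write that completes from another thread
    void start_write(Context* onfinish) {
      std::thread([onfinish] { onfinish->complete(0); }).detach();
    }

    void write_blocking() {               // current pattern: the caller's thread stalls
      C_BlockingCond cond;
      start_write(&cond);
      (void)cond.wait();
    }

    void write_async(std::function<void(int)> on_done) {  // desired pattern
      start_write(new C_Callback(std::move(on_done)));    // returns immediately
    }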
However, libcephfs does use ObjectCacher and sets the
block_writes_upfront flag, which seems to make any writes that go
through ObjectCacher block on an internal condition variable rather
than use the onfreespace Context object (which maybe should have been
named onfinish?).
I'm wondering what the implications of setting block_writes_upfront to
false would be for libcephfs, beyond needing to ensure an onfreespace
Context object is passed.
Thanks
Frank Filz
Hi all,
Does anyone know how to claim-append part of a bufferlist? After the head part has been claim-appended, the remaining part can still be claim-appended later.
I know there's the bufferlist::claim_append API. However, it claim-appends the whole bufferlist.
The background is:
I use one function, handle_io_am_write_request, to receive the network data sent by the peer node, then append it into the cache data space (e.g. the bufferlist object recv_pending_bl).
Then I trigger the upper software layer to read the received data in recv_pending_bl.
The code is below (you can also click the above link to read the code, no more than 40 lines).
There are several bugs in the code below. I'm looking for an efficient method to receive the data and trigger the upper software layer to read it in the right way.
Any suggestion on how to do this efficiently is welcome.
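For illustration, one way to hand just the head of the pending data to the
upper layer is bufferlist::splice(), which moves a byte range out of one list
into another. A rough sketch (assumes a build inside the Ceph tree;
handle_data and the byte count n are hypothetical names):

    #include <utility>
    #include "include/buffer.h"   // ceph::bufferlist

    // hypothetical upper-layer hook that consumes the data handed to it
    void handle_data(ceph::bufferlist&& bl);

    // Move the first n bytes of recv_pending_bl into a separate list and
    // pass them up; whatever is left in recv_pending_bl can be spliced off
    // the same way once more data arrives.
    void consume_head(ceph::bufferlist& recv_pending_bl, unsigned n)
    {
      ceph::bufferlist head;
      recv_pending_bl.splice(0, n, &head);  // claims [0, n) out of recv_pending_bl
      handle_data(std::move(head));
    }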
B.R.
Changcheng
Hi,
The crush rule min_size property is easily confused with pool min_size.
One could imagine a data loss scenario where an operator "fixed" a
misconfigured cluster by setting the crush rule min_size to 2 (but
left a pool min_size at 1).
Should we rename one of them (... the crush one)? ... e.g. min_osds/max_osds ?
Going further, do we even have a use-case for manually changing a
crush rule's min_size/max_size? Could we simply hide them and hardcode
them internally to min_size=1 and max_size=100?
Cheers, Dan
hi Adam,
while looking at the Hybrid Allocator [0] and the newly introduced Btree
Allocator [1], i am wondering if we still need the bitmap allocator for
capping the memory usage caused by the large overhead of the AVL
allocator, since a btree is much more space-efficient than an AVL tree.
cheers
---
[0] https://github.com/ceph/ceph/pull/33365
[1] https://github.com/ceph/ceph/pull/41828
So, it turns out ceph-volume has two modes: raw and lvm. Rook uses the
raw mode, while typical non-Rook clusters use the lvm mode of
ceph-volume. And as the lvm mode doesn't use lsblk to list devices,
we're only seeing this in Rook-deployed clusters.
Is there a way we can fix the raw mode here?
Best,
Sebastian
On 23.06.21 at 01:57, Blaine Gardner wrote:
> From what I have been reading, it's actually quite easy for this to
> happen because AHDI partitions are so loosely defined. I suspect Ubuntu
> has the kernel built with this enabled to allow for users emulating
> Atari systems.
>
> See here for reference:
> https://bugs.launchpad.net/ubuntu/+source/util-linux/+bug/1531404/comments/…
>
> I suspect that Ceph's Bluestore layout has a high probability of
> matching these loose AHDI partitions at least in some cases. It seems
> very good that Centos 7 isn't built with AHDI support.
>
> Blaine
>
> On Tue, 22 Jun 2021 at 16:53, Steven Ellis <sellis@redhat.com> wrote:
>
> I'm rather puzzled as to why a disk is showing an AHDI partition
> layout. It isn't something that should happen on a modern disk.
>
> On Wed, 23 Jun 2021 at 10:07, Blaine Gardner <brgardne@redhat.com> wrote:
>
> Hi All,
>
> We have had a strange issue crop up in Rook, and I wanted to bring it
> up for discussion more widely since I think it has the potential to
> affect cephadm deployments as well as Rook.
>
> Here it is if you want to look through it:
> https://github.com/rook/rook/issues/7940#issuecomment-865005220
>
> In Rook, we list disks with `lsblk` to be provisioned. I'm not sure
> how ceph-volume does this internally or how cephadm does this. In Rook
> (especially with 500+GB SSDs for some reason), many users are starting
> to see unexpected partitions appear. One user finally found out that
> this is because the Ubuntu kernel they're using has support for ATARI
> (AHDI) partitions. The kernel detects these as partitions, but lsblk
> does not understand the ATARI type and thus sees them as empty
> partitions. (It's unclear to me yet if `blkid` is able to see the
> ATARI type.)
>
> In Rook, this is really only a problem for OSDs that are provisioned
> on nodes (not a downstream-supported use-case). On PVCs (our
> downstream use-case), we only run the detection/provisioning step once
> at first setup, and so I believe Rook's downstream isn't affected.
> When provisioning OSDs on nodes, disk detection and OSD provisioning
> both run each time we reconcile. I'm not sure if that is true for
> cephadm or not.
>
> Additionally, the upstream user reports that ATARI/AHDI disk support
> is disabled for CentOS 7. I hope this means we don't have to worry
> about this with CentOS 8 or any RHEL we ship, but it might be good to
> confirm this for the RHEL versions we ship to customers.
>
> I'll leave the discussion of Rook-specific ideas for me, Seb, and
> Travis, but here are the questions I have for what we might need to
> think about for Ceph more broadly:
>
> Do we need to check CentOS/RHEL versions for ATARI/AHDI disk support?
>
> Do we need to communicate to Ceph users that their bluestore OSD disks
> might appear as ATARI partitions on their host systems?
>
> Do we need to modify ceph-volume to do anything special if it finds
> ATARI/AHDI partitions?
>
> Are there other considerations I'm missing?
>
> Thanks again! And feel free to chime in with questions if I haven't
> explained things very well.
> Blaine
>
>
>
> --
> Steven Ellis
> Red Hat Asia Pacific - Auckland NZ
>
Hi Folks,
The performance meeting will be canceled this week due to a conflicting
meeting that affects many of the participants. We'll reconvene next week.
Thanks,
Mark
multisite's metadata sync relies on some abstractions in order to
transfer metadata between zones in json format. it also requires that
any time we write changes to a metadata object, we also write to the
metadata log so other zones know when/what they need to sync
the main abstraction is the RGWMetadataHandler, which we implement for
each kind of metadata. so we have a RGWBucketInstanceMetadataHandler
that knows how to json-encode/decode and read/write bucket instances,
a RGWUserMetadataHandler for users, etc etc
in addition to json and reading/writing, these handlers also have some
important logic in them. for example, when
RGWBucketInstanceMetadataHandler writes a new bucket instance we
haven't seen before, it will also create and initialize its bucket
index objects
the 'archive zone' uses special handler wrappers like
RGWArchiveBucketMetadataHandler and
RGWArchiveBucketInstanceMetadataHandler to force-enable object
versioning, preserve deleted buckets, etc
in https://github.com/ceph/ceph/pull/28679, we made a lot of changes
to move the actual rados reads/writes into 'metadata backends'. Yehuda
did a good job documenting the motivations and design decisions in the
PR description, so i'd encourage everyone to read through that
this predated the zipper work, but shared a similar goal of non-rados
backends. however, this metadata backend stuff is complicated without
providing any tangible benefits, while also making it significantly
more difficult to add new types of metadata (see Abhishek's work to
support role metadata in https://github.com/ceph/ceph/pull/37679). i'd
like to see these backends reimagined in terms of zipper to allow
metadata sync between different stores, but i think there are some
open questions here:
* zipper just has a Bucket interface - the distinction between the
'bucket entrypoint' and 'bucket instance' metadata is specific to the
rados store
* the split between MetadataHandler logic and the backends doesn't
seem right, at least in the case of the
RGWBucketInstanceMetadataHandler creating index objects, where the
bucket index itself is a detail of the rados store
i feel like a better approach would be for each zipper store to
implement the MetadataHandlers itself instead of trying to maintain
the separation between the metadata handlers and backends. so each
store would have functions like create_bucket_metadata_handler(),
create_user_metadata_handler(), etc
then for the archive zone, its handlers would wrap the handlers it
gets from zipper, and use generic zipper APIs to implement the extra
stuff like enabling versioning
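as a rough sketch of that shape (simplified, hypothetical interfaces here,
not the real RGWMetadataHandler or zipper classes), each store would hand out
its own handlers and the archive zone would just wrap whatever it gets:

    #include <memory>
    #include <string>

    struct MetadataHandler {                  // knows how to json-encode/decode and
      virtual ~MetadataHandler() = default;   // read/write one kind of metadata
      virtual int get(const std::string& key, std::string* json_out) = 0;
      virtual int put(const std::string& key, const std::string& json) = 0;
    };

    struct Store {                            // zipper store (sketch)
      virtual ~Store() = default;
      virtual std::unique_ptr<MetadataHandler> create_bucket_metadata_handler() = 0;
      virtual std::unique_ptr<MetadataHandler> create_user_metadata_handler() = 0;
      // a rados-backed store's bucket-instance handler would also create and
      // initialize bucket index objects on first write, since the index is a
      // detail of the rados store rather than of the generic handler
    };

    // archive zone: wrap whatever handler the store hands out and layer the
    // extra behavior (force-enable versioning, preserve deleted buckets, ...)
    // on top through generic zipper calls
    struct ArchiveBucketMetadataHandler : MetadataHandler {
      std::unique_ptr<MetadataHandler> inner;
      explicit ArchiveBucketMetadataHandler(std::unique_ptr<MetadataHandler> h)
        : inner(std::move(h)) {}
      int get(const std::string& key, std::string* json_out) override {
        return inner->get(key, json_out);
      }
      int put(const std::string& key, const std::string& json) override {
        // e.g. tweak the decoded bucket info to force versioning before the write
        return inner->put(key, json);
      }
    };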
Dear Community,
Currently, there are several events that could be sent as bucket
notifications but are missing some critical information from their objects:
* "Post" event sent at the beginning of a multipart upload - when multipart
start there is no information on the size of the object, its etag, and
other content related attributes
* "Put" event for a part during a multipart upload - there is only
information o the specific part being uploaded, which is usually less
useful for the recipient of the notification. In addition, this could be
confusing, since the final object does not exist yet
* "Delete" event when a deletion marker is being deleted. Unlike the case
of creation of a deletion marker (where there is information on the version
that was deleted) in the case of deletion of a deletion marker no
information on the deleted object exists
So, the plan is not to send notifications in the above cases. However,
since this would be a behavior change in RGW, I would like to make sure
that it won't break existing integrations.
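For illustration only (hypothetical types and field names, not the actual RGW
notification code), the proposed change amounts to a small filter before
publishing:

    #include <string>

    // hypothetical event descriptor, just enough to express the three cases above
    struct NotificationEvent {
      std::string type;                   // "Post", "Put", "Delete", ...
      bool multipart_init = false;        // "Post" that begins a multipart upload
      bool multipart_part = false;        // "Put" of an individual part
      bool deletes_delete_marker = false; // "Delete" whose target is a delete marker
    };

    // skip publishing in the cases where size/etag/version info would be
    // missing or misleading; everything else keeps its notification
    bool should_publish(const NotificationEvent& e) {
      if (e.multipart_init) return false;         // no size/etag yet
      if (e.multipart_part) return false;         // final object does not exist yet
      if (e.deletes_delete_marker) return false;  // no info on the deleted object
      return true;
    }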
Your feedback is welcome!
Yuval