Recently (or not so recently, it's been almost 2 years), the nfs-ganesha
project implemented the capability to use asynchronous non-blocking I/O
to storage backends to prevent thread starvation. The assumption is that
the backend provides non-blocking I/O with a callback mechanism to
notify nfs-ganesha when the I/O is complete so that nfs-ganesha can
subsequently asynchronously respond to the client indicating I/O completion.
Ceph looks like it is structured to allow for this, with Context objects
having finish and complete methods that let the I/O path signal
completion. In general libcephfs seems to use some form of condition
variable Context to block and wait for this notification. This would be
relatively easy to replace with a callback Context.
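For illustration, the two patterns might look roughly like this (a minimal
sketch with simplified stand-ins for Ceph's Context and the condition-variable
waiter; the backend and callback names are hypothetical, not actual libcephfs
code):

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <thread>

    struct Context {                      // simplified stand-in for Ceph's Context
      virtual ~Context() = default;
      virtual void finish(int r) = 0;
      virtual void complete(int r) { finish(r); delete this; }
    };

    struct C_BlockingCond : Context {     // stand-in for the condition-variable waiter
      std::mutex m;
      std::condition_variable cv;
      bool done = false;
      int result = 0;
      void finish(int r) override {
        std::lock_guard<std::mutex> l(m);
        result = r; done = true; cv.notify_all();
      }
      void complete(int r) override { finish(r); }  // stack-allocated, so no delete
      int wait() {
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [this] { return done; });
        return result;
      }
    };

    struct C_Callback : Context {         // callback Context for the async path
      std::function<void(int)> cb;
      explicit C_Callback(std::function<void(int)> f) : cb(std::move(f)) {}
      void finish(int r) override { cb(r); }  // e.g. queue the NFS reply here
    };

    // hypothetical backend write that completes from another thread
    void start_write(Context* onfinish) {
      std::thread([onfinish] { onfinish->complete(0); }).detach();
    }

    void write_blocking() {               // current pattern: the caller's thread stalls
      C_BlockingCond cond;
      start_write(&cond);
      (void)cond.wait();
    }

    void write_async(std::function<void(int)> on_done) {  // desired pattern
      start_write(new C_Callback(std::move(on_done)));    // returns immediately
    }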
However, libcephfs does use ObjectCacher and sets the
block_writes_upfront flag, which seems to make any writes that go
through ObjectCacher block on an internal condition variable rather
than use the onfreespace Context object (which maybe should have been
named onfinish?).
I'm wondering what the implications of setting block_writes_upfront to
false would be for libcephfs, beyond needing to ensure an onfreespace
Context object is passed.
Thanks
Frank Filz
Hi all,
Does anyone know how to claim-append part of a bufferlist? After the head part has been claim-appended, the remaining part can still be claim-appended later.
I know there's the bufferlist::claim_append API. However, it claim-appends the whole bufferlist.
The background is:
I use one function, handle_io_am_write_request, to receive the network data sent by the peer node, then append it into the cache data space (e.g. the bufferlist object recv_pending_bl).
Then I trigger the upper software layer to read the received data in recv_pending_bl.
The code is below (you can also click the above link to read the code, no more than 40 lines).
There are several bugs in the code below. I'm looking for an efficient method to receive the data and trigger the upper software layer to read it in the right way.
Any suggestion on how to do this efficiently is welcome.
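For illustration, one way to hand just the head of the pending data to the
upper layer is bufferlist::splice(), which moves a byte range out of one list
into another. A rough sketch (assumes a build inside the Ceph tree;
handle_data and the byte count n are hypothetical names):

    #include <utility>
    #include "include/buffer.h"   // ceph::bufferlist

    // hypothetical upper-layer hook that consumes the data handed to it
    void handle_data(ceph::bufferlist&& bl);

    // Move the first n bytes of recv_pending_bl into a separate list and
    // pass them up; whatever is left in recv_pending_bl can be spliced off
    // the same way once more data arrives.
    void consume_head(ceph::bufferlist& recv_pending_bl, unsigned n)
    {
      ceph::bufferlist head;
      recv_pending_bl.splice(0, n, &head);  // claims [0, n) out of recv_pending_bl
      handle_data(std::move(head));
    }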
B.R.
Changcheng
Hi,
The crush rule min_size property is easily confused with pool min_size.
One could imagine a data loss scenario where an operator "fixed" a
misconfigured cluster by setting the crush rule min_size to 2 (but
left a pool min_size at 1).
Should we rename one of them (... the crush one)? ... e.g. min_osds/max_osds ?
Going further, do we even have a use-case for manually changing a
crush rule's min_size/max_size? Could we simply hide them and hardcode
them internally to min_size=1 and max_size=100?
Cheers, Dan
hi Adam,
while looking at the Hybrid Allocator [0] and the newly introduced Btree
Allocator [1], i am wondering if we still need the bitmap allocator for
capping the memory usage caused by the large overhead of the AVL
allocator, since a btree is much more space-efficient than an AVL tree.
cheers
---
[0] https://github.com/ceph/ceph/pull/33365
[1] https://github.com/ceph/ceph/pull/41828
So, it turns out ceph-volume has two modes: raw and lvm. Rook uses the
raw mode, while typical non-Rook clusters use the lvm mode of
ceph-volume. And as the lvm mode doesn't use lsblk to list devices,
we're only seeing this in Rook-deployed clusters.
Is there a way we can fix the raw mode here?
Best,
Sebastian
On 23.06.21 at 01:57, Blaine Gardner wrote:
> From what I have been reading, it's actually quite easy for this to
> happen because AHDI partitions are so loosely defined. I suspect Ubuntu
> has the kernel built with this enabled to allow for users emulating
> Atari systems.
>
> See here for reference:
> https://bugs.launchpad.net/ubuntu/+source/util-linux/+bug/1531404/comments/…
>
> I suspect that Ceph's Bluestore layout has a high probability of
> matching these loose AHDI partitions at least in some cases. It seems
> very good that Centos 7 isn't built with AHDI support.
>
> Blaine
>
> On Tue, 22 Jun 2021 at 16:53, Steven Ellis <sellis@redhat.com> wrote:
>
> I'm rather puzzled as to why a disk is showing an AHDI partition
> layout. It isn't something that should happen on a modern disk.
>
> On Wed, 23 Jun 2021 at 10:07, Blaine Gardner <brgardne@redhat.com> wrote:
>
> Hi All,
>
> We have had a strange issue crop up in Rook, and I wanted to bring it
> up for discussion more widely since I think it has the potential to
> affect cephadm deployments as well as Rook.
>
> Here it is if you want to look through it:
> https://github.com/rook/rook/issues/7940#issuecomment-865005220
>
> In Rook, we list disks with `lsblk` to be provisioned. I'm not sure
> how ceph-volume does this internally or how cephadm does this. In Rook
> (especially with 500+GB SSDs for some reason), many users are starting
> to see unexpected partitions appear. One user finally found out that
> this is because the Ubuntu kernel they're using has support for ATARI
> (AHDI) partitions. The kernel detects these as partitions, but lsblk
> does not understand the ATARI type and thus sees them as empty
> partitions. (It's unclear to me yet if `blkid` is able to see the
> ATARI type.)
>
> In Rook, this is really only a problem for OSDs that are provisioned
> on nodes (not a downstream-supported use-case). On PVCs (our
> downstream use-case), we only run the detection/provisioning step once
> at first setup, and so I believe Rook's downstream isn't affected.
> When provisioning OSDs on nodes, disk detection and OSD provisioning
> both run each time we reconcile. I'm not sure if that is true for
> cephadm or not.
>
> Additionally, the upstream user reports that ATARI/AHDI disk support
> is disabled for CentOS 7. I hope this means we don't have to worry
> about this with CentOS 8 or any RHEL we ship, but it might be good to
> confirm this for the RHEL versions we ship to customers.
>
> I'll leave the discussion of Rook-specific ideas for me, Seb, and
> Travis, but here are the questions I have for what we might need to
> think about for Ceph more broadly:
>
> Do we need to check CentOS/RHEL versions for ATARI/AHDI disk support?
>
> Do we need to communicate to Ceph users that their bluestore OSD disks
> might appear as ATARI partitions on their host systems?
>
> Do we need to modify ceph-volume to do anything special if it finds
> ATARI/AHDI partitions?
>
> Are there other considerations I'm missing?
>
> Thanks again! And feel free to chime in with questions if I haven't
> explained things very well.
> Blaine
>
>
>
> --
> Steven Ellis
> Red Hat Asia Pacific - Auckland NZ
>
Hi Folks,
The performance meeting will be canceled this week due to a conflicting
meeting that affects many of the participants. We'll reconvene next week.
Thanks,
Mark
multisite's metadata sync relies on some abstractions in order to
transfer metadata between zones in json format. it also requires that
any time we write changes to a metadata object, we also write to the
metadata log so other zones know when/what they need to sync
the main abstraction is the RGWMetadataHandler, which we implement for
each kind of metadata. so we have a RGWBucketInstanceMetadataHandler
that knows how to json-encode/decode and read/write bucket instances,
a RGWUserMetadataHandler for users, etc etc
in addition to json and reading/writing, these handlers also have some
important logic in them. for example, when
RGWBucketInstanceMetadataHandler writes a new bucket instance we
haven't seen before, it will also create and initialize its bucket
index objects
the 'archive zone' uses special handler wrappers like
RGWArchiveBucketMetadataHandler and
RGWArchiveBucketInstanceMetadataHandler to force-enable object
versioning, preserve deleted buckets, etc
in https://github.com/ceph/ceph/pull/28679, we made a lot of changes
to move the actual rados reads/writes into 'metadata backends'. Yehuda
did a good job documenting the motivations and design decisions in the
PR description, so i'd encourage everyone to read through that
this predated the zipper work, but shared a similar goal of non-rados
backends. however, this metadata backend stuff is complicated without
providing any tangible benefits, while also making it significantly
more difficult to add new types of metadata (see Abhishek's work to
support role metadata in https://github.com/ceph/ceph/pull/37679). i'd
like to see these backends reimagined in terms of zipper to allow
metadata sync between different stores, but i think there are some
open questions here:
* zipper just has a Bucket interface - the distinction between the
'bucket entrypoint' and 'bucket instance' metadata is specific to the
rados store
* the split between MetadataHandler logic and the backends doesn't
seem right, at least in the case of the
RGWBucketInstanceMetadataHandler creating index objects, where the
bucket index itself is a detail of the rados store
i feel like a better approach would be for each zipper store to
implement the MetadataHandlers itself instead of trying to maintain
the separation between the metadata handlers and backends. so each
store would have functions like create_bucket_metadata_handler(),
create_user_metadata_handler(), etc
then for the archive zone, its handlers would wrap the handlers it
gets from zipper, and use generic zipper APIs to implement the extra
stuff like enabling versioning
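as a rough sketch of that shape (simplified, hypothetical interfaces here,
not the real RGWMetadataHandler or zipper classes), each store would hand out
its own handlers and the archive zone would just wrap whatever it gets:

    #include <memory>
    #include <string>

    struct MetadataHandler {                  // knows how to json-encode/decode and
      virtual ~MetadataHandler() = default;   // read/write one kind of metadata
      virtual int get(const std::string& key, std::string* json_out) = 0;
      virtual int put(const std::string& key, const std::string& json) = 0;
    };

    struct Store {                            // zipper store (sketch)
      virtual ~Store() = default;
      virtual std::unique_ptr<MetadataHandler> create_bucket_metadata_handler() = 0;
      virtual std::unique_ptr<MetadataHandler> create_user_metadata_handler() = 0;
      // a rados-backed store's bucket-instance handler would also create and
      // initialize bucket index objects on first write, since the index is a
      // detail of the rados store rather than of the generic handler
    };

    // archive zone: wrap whatever handler the store hands out and layer the
    // extra behavior (force-enable versioning, preserve deleted buckets, ...)
    // on top through generic zipper calls
    struct ArchiveBucketMetadataHandler : MetadataHandler {
      std::unique_ptr<MetadataHandler> inner;
      explicit ArchiveBucketMetadataHandler(std::unique_ptr<MetadataHandler> h)
        : inner(std::move(h)) {}
      int get(const std::string& key, std::string* json_out) override {
        return inner->get(key, json_out);
      }
      int put(const std::string& key, const std::string& json) override {
        // e.g. tweak the decoded bucket info to force versioning before the write
        return inner->put(key, json);
      }
    };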
Dear Community,
Currently, there are several events that could be sent as bucket
notifications but are missing some critical information from their objects:
* "Post" event sent at the beginning of a multipart upload - when multipart
start there is no information on the size of the object, its etag, and
other content related attributes
* "Put" event for a part during a multipart upload - there is only
information o the specific part being uploaded, which is usually less
useful for the recipient of the notification. In addition, this could be
confusing, since the final object does not exist yet
* "Delete" event when a deletion marker is being deleted. Unlike the case
of creation of a deletion marker (where there is information on the version
that was deleted) in the case of deletion of a deletion marker no
information on the deleted object exists
So, the plan is not to send notifications in the above cases. However,
since this would be a behavior change in RGW, I would like to make sure
that it won't break existing integrations.
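For illustration only (hypothetical types and field names, not the actual RGW
notification code), the proposed change amounts to a small filter before
publishing:

    #include <string>

    // hypothetical event descriptor, just enough to express the three cases above
    struct NotificationEvent {
      std::string type;                   // "Post", "Put", "Delete", ...
      bool multipart_init = false;        // "Post" that begins a multipart upload
      bool multipart_part = false;        // "Put" of an individual part
      bool deletes_delete_marker = false; // "Delete" whose target is a delete marker
    };

    // skip publishing in the cases where size/etag/version info would be
    // missing or misleading; everything else keeps its notification
    bool should_publish(const NotificationEvent& e) {
      if (e.multipart_init) return false;         // no size/etag yet
      if (e.multipart_part) return false;         // final object does not exist yet
      if (e.deletes_delete_marker) return false;  // no info on the deleted object
      return true;
    }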
Your feedback is welcome!
Yuval