On Tue, Jan 21, 2020 at 11:19 PM Casey Bodley <cbodley@redhat.com> wrote:

On 1/17/20 3:48 AM, Yuval Lifshitz wrote:
>
>
> On Thu, Jan 16, 2020 at 10:06 PM Matt Benjamin
> <mbenjami@redhat.com> wrote:
>
>     Well, the need for reservation is general, arising from need for
>     reliable delivery.
>
>
> Agreed, reservations are needed for any queue in order to allow push
> back from failed endpoints.
> The reserve/commit mechanism would allow us to reliably signal the
> client that notifications triggered by the operation would fail.

To be more specific, this reserve/commit mechanism is only a requirement
for a strictly-bounded queue. And while a strict bound may be desirable
for several reasons, I don't see it as a requirement for reliable
delivery. Reliable delivery just means that we either a) notify the
endpoint directly and receive an ack or b) record an intent to notify
before responding to the client.


My intent is to use the bounded-queue push back as a mechanism to signal the client that notifications can no longer be sent.
Consider the case where an endpoint acks more slowly than the clients' request rate: the queue will fill up, and at some point an error will be returned to the client.
The same applies when an endpoint is down for so long that the notifications are no longer valuable.
A bounded queue is a good way to provide this push back without being overly sensitive to transient failures or slowdowns.
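
To make the intended flow concrete, here is a minimal sketch of the
reserve/commit idea on a bounded queue (illustrative only; the class,
names, and bound are made up for the example, and the real bookkeeping
would live in cls_queue, probably in the queue head, rather than in RGW
memory):

#include <cstdint>
#include <deque>
#include <iostream>
#include <string>

class BoundedNotificationQueue {
  const uint64_t max_size;    // e.g. bounded by osd_max_object_size
  uint64_t reserved = 0;      // bytes reserved but not yet committed
  uint64_t used = 0;          // bytes of committed notifications
  std::deque<std::string> entries;

 public:
  explicit BoundedNotificationQueue(uint64_t max) : max_size(max) {}

  // called before the operation (e.g. "put object") is performed;
  // failure here lets us reject the client request up front
  bool reserve(uint64_t size) {
    if (used + reserved + size > max_size) {
      return false;  // queue full -> push back on the client
    }
    reserved += size;
    return true;
  }

  // called after the operation succeeded: turn the reservation
  // into an actual queue entry
  void commit(std::string notification) {
    reserved -= notification.size();
    used += notification.size();
    entries.push_back(std::move(notification));
  }

  // called if the operation failed: release the reserved space
  void abort(uint64_t size) { reserved -= size; }
};

int main() {
  BoundedNotificationQueue q(1024);  // tiny bound for demonstration
  std::string n(700, 'x');           // ~700-byte notification
  if (q.reserve(n.size())) {
    // ... perform the "put object" here; on success:
    q.commit(n);
  } else {
    std::cout << "503: notification queue full\n";
  }
}
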
Separately, I have some concerns with separating the 'reserve' request
from the one that actually 'commits' a notification to this queue.
Mainly, the osd request to reserve might succeed, but the request to
commit may fail or hang - for example, if the object's pg is no longer
available.

In these cases we will just record the errors. However, the main purpose of the queue is to overcome endpoint issues and RGW failures.
If this becomes an issue and the cls queue proves not to be reliable enough, we can consider a different implementation (e.g. a memory-mapped file would survive SW failures but not HW failures).
 
Or maybe radosgw crashes before it's able to commit, and
(depending on implementation) the reserved size may leak. But the point
is that a successful reservation can't guarantee a successful commit.


Agreed. We would need a timeout mechanism for the reservations.
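
Something along these lines could work for expiring stale reservations,
assuming the queue head keeps per-reservation timestamps (just a sketch;
the names and the timeout value are placeholders, not an actual cls
interface):

#include <chrono>
#include <cstdint>
#include <map>

using Clock = std::chrono::steady_clock;

struct PendingReservation {
  uint64_t size;
  Clock::time_point created;
};

class ReservationTable {
  std::map<uint64_t, PendingReservation> pending;  // id -> reservation
  uint64_t reserved_bytes = 0;
  uint64_t next_id = 0;

 public:
  uint64_t add(uint64_t size) {
    pending.emplace(next_id, PendingReservation{size, Clock::now()});
    reserved_bytes += size;
    return next_id++;
  }

  // drop the reservation once it is committed or explicitly aborted
  void remove(uint64_t id) {
    auto it = pending.find(id);
    if (it == pending.end()) {
      return;
    }
    reserved_bytes -= it->second.size;
    pending.erase(it);
  }

  // periodic sweep: a reservation whose owner crashed before committing
  // is released after the timeout, so its space cannot leak forever
  void expire(std::chrono::seconds timeout) {
    const auto now = Clock::now();
    for (auto it = pending.begin(); it != pending.end();) {
      if (now - it->second.created > timeout) {
        reserved_bytes -= it->second.size;
        it = pending.erase(it);
      } else {
        ++it;
      }
    }
  }

  uint64_t reserved() const { return reserved_bytes; }
};

int main() {
  ReservationTable table;
  const uint64_t id = table.add(700);        // reserve ~700 bytes
  table.expire(std::chrono::seconds(300));   // nothing is stale yet
  table.remove(id);                          // normal commit/abort path
  return table.reserved() == 0 ? 0 : 1;
}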

>     You could substitute the fifo abstraction if 128M
>     is too small.
>
>     Matt
>
>     On Thu, Jan 16, 2020 at 1:23 PM Casey Bodley
>     <cbodley@redhat.com> wrote:
>     >
>     >
>     > On 1/16/20 12:13 PM, Yuval Lifshitz wrote:
>     > > two updates on the design (after some discussions):
>     > >
>     > > (1) "best effort queue" (stretch goal) is probably not needed:
>     > >  - cls queue performance should be high enough when put on
>     a fast media pool
>     > >  - the "acl level" settings allow for existing mechanism to
>     perform as
>     > > "best effort" and non-blocking for topics that does not need
>     delivery
>     > > guarantees
>     > >
>     > > (2) since the cls queue does not allow for random access (without
>     > > linear search) the retries will have to be implemented based
>     only on
>     > > the end of the queue. This means that we must assume that the
>     acks or
>     > > nacks arrive in the same order in which the notifications were sent.
>     > > This is true only for a specific endpoint (e.g. a specific kafka
>     > > broker) which means that there will have to be a separate cls
>     queue
>     > > instance for each endpoint
>     > >
>     > >
>     > > On Tue, Jan 14, 2020 at 3:47 PM Yuval Lifshitz
>     > > <ylifshit@redhat.com> wrote:
>     > >
>     > >     Dear Community,
>     > >     Would like to share some design ideas around the above topic.
>     > >     Feedback is welcomed!
>     > >
>     > >     Current State
>     > >
>     > >     - in "pull mode" [1] we have the same guarantees as the
>     multisite
>     > >     syncing mechanism (guarantee against HW/SW failures). On
>     top of
>     > >     that, if writing the event to RADOS fails, this trickles
>     back as a
>     > >     sync failure, which means that the master zone will try to
>     sync
>     > >     the pubsub zone
>     > >
>     > >     - in "push mode" [2] we send the notification from the ops
>     context
>     > >     that triggered the notification. The original operation is
>     blocked
>     > >     until we get a reply from the endpoint. As part of the
>     > >     configuration for the endpoint, we also configure the "ack
>     level",
>     > >     indicating whether we block until we get a reply from the
>     endpoint
>     > >     or not.
>     > >     Since the operation response is not sent back to the
>     client until
>     > >     the endpoint acks, this method guarantees against any
>     failure in
>     > >     the radosgw (at the cost of adding latency to the operation).
>     > >     This, however, does not guarantee delivery if the endpoint
>     is down
>     > >     or disconnected. The endpoints we interact with (rabbitmq,
>     kafka)
>     > >     usually have built-in redundancy mechanisms, but this does not
>     > >     cover the case where there is a network disconnect between our
>     > >     gateways and these systems.
>     > >     In some cases we can get a nack from the endpoint,
>     indicating that
>     > >     our message would never reach the endpoint. But we can
>     only log
>     > >     these cases:
>     > >     - we cannot fail the operation that triggered us, because
>     we send
>     > >     the notification only after the actual operation (e.g. "put
>     > >     object") was done (=no atomicity)
>     > >     - no retry mechanism (in theory, we can add one)
>     > >
>     > >     Next Phase Requirements
>     > >
>     > >     We would like to add delivery guarantee to "push mode" for
>     > >     endpoint failures. For that we would use a message queue
>     with the
>     > >     following features:
>     > >     - rados backed, so it would survive HW/SW failures
>     > >     - blocking only on local reads/writes (so it introduces lower
>     > >     latency than over-the-wire endpoint acks)
>     > >     - has reserve/commit semantics, so we can "reserve" before the
>     > >     operation (e.g. "put object") was done, and fail it if we
>     cannot
>     > >     reserve a slot on the queue, and commit the notification
>     to the
>     > >     queue only after the operation was successful (and
>     unreserve if
>     > >     the operation failed)
>     > >
>     > I guess this reservation piece is only a requirement because of the
>     > choice of cls_queue, which resides in a single rados object and so
>     > enforces a bound on the total space used. The maximum size is
>     > configurable, but can't exceed osd_max_object_size=128M. How many
>     > notifications could we fit within that 128M limit?
>
>
> Attached is an example of a notification, which is 700 bytes long. If
> longer bucket and object names are used, together with metadata it may
> grow to 1KB, which would allow for 128K notifications (per endpoint).
> I'm not sure what the official operations-per-second figure we support
> per RGW (for small objects) is, but if we assume 3.5K requests/sec [1]
> we can handle a 35 sec disconnect before we start pushing back on the
> client. What would probably make sense here is to define multiple
> queues per endpoint that would scale with the number of RGWs.
>
> [1]
> https://www.slideshare.net/alohamora/ceph-object-storage-performance-secrets-and-ceph-data-lake-solution-81010759
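
For reference, the arithmetic behind the 128K / ~35 sec estimate quoted
above, with the thread's rough numbers plugged in (purely illustrative):

#include <cstdio>

int main() {
  const double max_object_bytes = 128.0 * 1024 * 1024;  // osd_max_object_size
  const double notification_bytes = 1024;               // ~1KB worst case
  const double requests_per_sec = 3500;                 // assumed per-RGW rate

  const double capacity = max_object_bytes / notification_bytes;
  std::printf("entries per queue: %.0f\n", capacity);   // ~131072 (~128K)
  std::printf("seconds of buffering before push back: %.0f\n",
              capacity / requests_per_sec);             // ~37 sec, i.e. the
                                                        // ~35 sec above
  // multiple queues per endpoint (scaling with the number of RGWs)
  // would grow this window proportionally
}
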
>
>     I worry that
>     > clusters at a sufficient scale could fill that pretty quickly if the
>     > notification endpoint is unavailable or slow, and that would leave
>     > radosgw unable to satisfy any requests that would generate a
>     notification.
>     >
>     > >     - we would have a retry mechanism based on the queue,
>     which means
>     > >     that if a notification was successfully pushed into the
>     queue, we
>     > >     can assume it would (eventually) be successfully delivered
>     to the
>     > >     endpoint
>     > >
>     > >     Proposed Solution
>     > >
>     > >     - use the cls_queue [3] (cls_queue is not omap based,
>     hence, no
>     > >     builtin iops limitations)
>     > >     - add reserve/commit functionality (probably store that
>     info in
>     > >     the queue head)
>     > >     - dedicated thread(s) should be reading requests from
>     the queue,
>     > >     sending the notifications to the endpoints, and waiting for
>     > >     the replies (if needed) - this should be done via coroutines
>     > >     - acked requests are removed from the queue, nacked or
>     > >     timed-out requests should be retried (at least for a while)
>     > >     - both mechanisms would coexist, as this would be
>     configurable per
>     > >     topic
>     > >     - as a stretch goal, we may add a "best effort queue".
>     This would
>     > >     be similar to the cls_queue solution, but won't address
>     > >     radosgw failures (as the queue would be in-memory), only
>     endpoint
>     > >     failures/disconnects
>     > >     - for now, this mechanism won't be supported for pushing
>     events
>     > >     from the pubsub zone (="pull+push mode"), but might be
>     added if
>     > >     users would find it useful
>     > >
>     > >     Yuval
>     > >
>     > >     [1] https://docs.ceph.com/docs/master/radosgw/pubsub-module/
>     > >     [2] https://docs.ceph.com/docs/master/radosgw/notifications/
>     > >     [3] https://github.com/ceph/ceph/tree/master/src/cls/queue
>     > >
>     > >
>
>
>
>     --
>
>     Matt Benjamin
>     Red Hat, Inc.
>     315 West Huron Street, Suite 140A
>     Ann Arbor, Michigan 48103
>
>     http://www.redhat.com/en/technologies/storage
>
>     tel.  734-821-5101
>     fax.  734-769-8938
>     cel.  734-216-5309
>