On Thu, Jan 28, 2021 at 7:34 PM Schoonjans, Tom (RFI,RAL,-) <
Tom.Schoonjans(a)rfi.ac.uk> wrote:
Hi Yuval,
Together with Tom Byrne I ran some more tests today while keeping an eye
on the logs as well.
We immediately noticed that the nodes were logging errors when uploading
files like:
2021-01-28 16:10:45.825 7f56ff5cf700 1 ====== starting new request req=0x7f56ff5c87f0 =====
2021-01-28 16:10:45.828 7f5721e14700 1 AMQP connect: exchange mismatch
2021-01-28 16:10:45.828 7f5721e14700 1 ERROR: failed to create push endpoint:
amqp://<username>:<password>@<my.rabbitmq.server>:5672 due to: pubsub
endpoint configuration error: AMQP: failed to create connection to:
amqp://<username>:<password>@<my.rabbitmq.server>:5672
2021-01-28 16:10:45.828 7f571ee0e700 1 ====== req done req=0x7f571ee077f0 op status=0 http_status=200 latency=0.0569997s ======
Which resulted in no connections being established to the RabbitMQ server.
Tom then restarted the Ceph services on one gateway node, which led to
events being sent to RabbitMQ without blocking, but only if the round-robin
DNS happened to pick this particular node for the boto3 upload request.
Restarting the Ceph service on all nodes fixed the problem and I got a
nice steady stream of events to my consumer Python script!
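Since the gateways sit behind round-robin DNS, a per-node check like the one described above is easier to repeat if each gateway is addressed directly instead of through the shared name. A minimal sketch, assuming boto3 credentials come from the environment; the helper names, gateway URLs, bucket and key are all hypothetical:

```python
def upload_via_each_gateway(make_client, endpoints, bucket, key, body):
    """Upload one small object through every gateway and report the outcome."""
    results = {}
    for endpoint in endpoints:
        try:
            client = make_client(endpoint)
            client.put_object(Bucket=bucket, Key=key, Body=body)
            results[endpoint] = "ok"
        except Exception as exc:  # record the failure and keep checking the others
            results[endpoint] = f"failed: {exc}"
    return results


def boto3_s3_client(endpoint):
    """Build an S3 client pinned to a single gateway (requires boto3)."""
    import boto3

    return boto3.client("s3", endpoint_url=endpoint)
```

It could be called as `upload_via_each_gateway(boto3_s3_client, ["http://gw1.example:8080", "http://gw2.example:8080"], "my-bucket", "probe.txt", b"probe")`. A gateway that is still waiting for a broker ack would stall the loop at that endpoint, which is itself a useful signal.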
we should fix it. no restart should be needed if one of the connection
parameters was wrong
I did notice that any events that were sent while my consumer script was
not running are lost, as they are not picked up after I restart the script.
Any thoughts on this?
this is strange. in our code [1] we don't require immediate transfer of
messages.
how is the exchange declared?
can you check if this is happening when you send messages from a python
producer as well?
[1]
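One thing that could explain lost events, assuming the consumer script declares an exclusive or auto-delete queue, is that the queue disappears when the script stops, and a topic exchange discards messages that match no bound queue. The Pika sketch here binds a durable, named queue to the existing exchange and publishes a test message from a Python producer, so the same experiment can be run without RGW. The exchange `ceph-test` and topic `ceph-topic-test` come from the original report; the queue name `ceph-events`, the broker URL, and the use of the topic name as routing key are assumptions.

```python
def declare_durable_binding(channel, exchange, queue, routing_key):
    """Bind a durable, named queue to an exchange that already exists."""
    # passive=True only checks that the exchange exists; it does not
    # redeclare it, so it cannot clash with the exchange's real settings.
    channel.exchange_declare(exchange=exchange, passive=True)
    # A durable, non-exclusive queue stays in place while no consumer is
    # attached, so the broker has somewhere to hold routed messages.
    channel.queue_declare(queue=queue, durable=True)
    channel.queue_bind(queue=queue, exchange=exchange, routing_key=routing_key)


def publish_test_message(channel, exchange, routing_key, body, properties=None):
    """Publish a single message, as a stand-in for an RGW notification."""
    channel.basic_publish(
        exchange=exchange, routing_key=routing_key, body=body, properties=properties
    )


def run_against_broker(url="amqp://user:pass@localhost:5672/%2F"):
    """Run both steps against a real broker (requires pika; URL is hypothetical)."""
    import pika

    connection = pika.BlockingConnection(pika.URLParameters(url))
    channel = connection.channel()
    declare_durable_binding(channel, "ceph-test", "ceph-events", "ceph-topic-test")
    publish_test_message(
        channel,
        "ceph-test",
        "ceph-topic-test",
        b'{"test": "python producer"}',
        # delivery_mode=2 marks the message as persistent
        properties=pika.BasicProperties(delivery_mode=2),
    )
    connection.close()
```

Calling `run_against_broker()` while the consumer is stopped and then starting the consumer on `ceph-events` shows whether the message was held by the broker in the meantime.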
Many thanks!!
Best,
Tom
Dr Tom Schoonjans
Research Software Engineer - HPC and Cloud
Rosalind Franklin Institute
Harwell Science & Innovation Campus
Didcot
Oxfordshire
OX11 0FA
United Kingdom
https://www.rfi.ac.uk
The Rosalind Franklin Institute is a registered charity in England and
Wales, No. 1179810 Company Limited by Guarantee Registered in England
and Wales, No.11266143. Funded by UK Research and Innovation through
the Engineering and Physical Sciences Research Council.
On 27 Jan 2021, at 16:21, Yuval Lifshitz <ylifshit(a)redhat.com> wrote:
On Wed, Jan 27, 2021 at 5:34 PM Schoonjans, Tom (RFI,RAL,-) <
Tom.Schoonjans(a)rfi.ac.uk> wrote:
Looks like there’s already a ticket open for AMQP SSL support:
https://tracker.ceph.com/issues/42902 (you opened it ;-))
I will give a try myself if I have some time, but don’t hold your breath
with lockdown and home schooling. Also I am not much of a C++ coder.
I need to go over the logs with Tom Byrne to see why it is not working
properly. And perhaps I will be able to come up with a fix then.
However this is what I have run into so far today:
1. After configuring a bucket with a topic using the non-SSL port, I
tried a couple of uploads to this bucket. They all hung, which suggested
something was very wrong, so I Ctrl-C’ed every time. After some time I
figured out from the RabbitMQ admin UI that Ceph was indeed connecting to
it, and the connections remained, so I killed them from the UI.
sending the notification to the rabbitmq server is synchronous with the
upload to the bucket. so, if the server is slow or not acking the
notification, the upload request would hang. note that the upload itself is
done first, but the reply to the client does not happen until the rabbitmq
server acks.
would be great if you can share the radosgw logs.
maybe the issue is related to the user/password method we use? we use:
AMQP_SASL_METHOD_PLAIN
one possible workaround would be to set "amqp-ack-level" to "none". in
this case the radosgw does not wait for an ack.
in "pacific" you could use "persistent topics", where the notifications are
sent asynchronously to the endpoint.
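The two options above map onto attributes passed when the topic is created through RGW's SNS-compatible API. A minimal sketch, assuming the attribute names documented for Ceph bucket notifications (`push-endpoint`, `amqp-exchange`, `amqp-ack-level`, and `persistent` from "pacific" onwards); the helper names are hypothetical, and the exchange and topic names are taken from the original report:

```python
def topic_attributes(push_endpoint, exchange, ack_level="broker", persistent=False):
    """Build the attribute map for an AMQP topic."""
    attrs = {
        "push-endpoint": push_endpoint,
        "amqp-exchange": exchange,
        # "none" means the radosgw does not wait for an ack from the broker
        "amqp-ack-level": ack_level,
    }
    if persistent:
        # assumption: only understood by "pacific" and later
        attrs["persistent"] = "true"
    return attrs


def create_topic(sns_client, name, attrs):
    """Create (or update) the topic and return its ARN."""
    response = sns_client.create_topic(Name=name, Attributes=attrs)
    return response["TopicArn"]
```

Here `sns_client` would be something like `boto3.client("sns", endpoint_url=<gateway URL>, region_name=<zonegroup>)`, with both values depending on the deployment. For the workaround, `create_topic(sns_client, "ceph-topic-test", topic_attributes("amqp://user:pass@host:5672", "ceph-test", ack_level="none"))` trades delivery confirmation for uploads that no longer wait on the broker.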
2. I then wrote a python script with Pika to consume the events, hoping
that would stop the blocking. I had some minor success with this. Usually
the first three or four uploaded files would generate events that I could
consume with my script.
the radosgw is waiting for an ack from the broker, not the end consumer,
so this should not have mattered...
did you actually see any notifications delivered to the consumer?
However, the rest would block forever. I repeated this a couple of times
but always got the same result. I noticed that after I stopped uploading,
removed the bucket and the topic, the connection from Ceph in the RabbitMQ
UI remained. I killed it but it came back seconds later from another port
on the Ceph cluster. I ended up playing whack-a-mole with this until no
more connections would be established from Ceph to RabbitMQ. I probably
killed 100 or so of them.
once you remove the bucket, no new notifications can be sent. if you
create the bucket again you may see notifications again (this is fixed in
"pacific").
either way, even if the connection to the rabbitmq server is still open,
no new notifications should be sent there. just having the connection
should not be an issue, but it would be nice to fix that as well:
https://tracker.ceph.com/issues/49033
3. After this I couldn’t get any events sent anymore. There is no more
blocking when uploading, files get written but nothing else happens. No
connections are made anymore from Ceph to RabbitMQ.
Hope this helps…
yes, this is very helpful!
Best,
Tom
On 27 Jan 2021, at 13:04, Yuval Lifshitz <ylifshit(a)redhat.com> wrote:
On Wed, Jan 27, 2021 at 11:33 AM Schoonjans, Tom (RFI,RAL,-) <
Tom.Schoonjans(a)rfi.ac.uk> wrote:
Hi Yuval,
Switching to non-SSL connections to RabbitMQ allowed us to get things
working, although currently it’s not very reliable.
can you please add more about that? what reliability issues did you see?
I will open a new ticket over this if we can’t fix things ourselves.
this would be great. we have ssl support for kafka and http endpoint, so,
if you decide to give it a try you can look at them as examples.
and let me know if you have questions or need help.
I will open an issue on the tracker as soon as my account request has
been approved :-)
Best,
Tom
On 26 Jan 2021, at 20:02, Yuval Lifshitz <ylifshit(a)redhat.com> wrote:
On Tue, Jan 26, 2021 at 9:48 PM Schoonjans, Tom (RFI,RAL,-) <
Tom.Schoonjans(a)rfi.ac.uk> wrote:
Hi Yuval,
I worked on this earlier today with Tom Byrne and I think I may be able
to provide some more information.
I set up the RabbitMQ server myself, and created the exchange with type
’topic’ before configuring the bucket.
Not sure if this matters, but the RabbitMQ endpoint is reached over
SSL, using certificates generated with Letsencrypt.
it actually does. we don't support amqp over ssl.
feel free to open a tracker for that - as we should probably support
that!
but note that it would probably be backported only to later versions
than nautilus.
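The URI in the original report uses port 5671, which is the port conventionally registered for AMQP over TLS (plain AMQP uses 5672), so a mismatch like this can be spotted before it reaches the gateway. A small sketch using only the Python standard library; the helper name `expects_tls` is hypothetical:

```python
from urllib.parse import urlsplit

AMQP_PLAIN_PORT = 5672  # conventional port for plain AMQP
AMQP_TLS_PORT = 5671    # conventional port for AMQP over TLS


def expects_tls(uri):
    """Return True if the URI scheme or port suggests a TLS listener."""
    parts = urlsplit(uri)
    return parts.scheme == "amqps" or parts.port == AMQP_TLS_PORT
```

For `amqp://user:pass@host:5671` this returns `True`, a hint that the endpoint should be switched to the non-SSL listener for now. A broker can be configured with non-default ports, so the result is only a heuristic.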
Many thanks,
Tom
On 26 Jan 2021, at 19:37, Yuval Lifshitz <ylifshit(a)redhat.com> wrote:
Hi Tom,
Did you create the exchange in rabbitmq? The RGW does not create it and
assumes it already exists.
Could you increase the log level in RGW and see if there are more log
messages that have "AMQP" in them?
Thanks,
Yuval
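Once the log level is raised, the relevant entries can be pulled out of a busy gateway log with a few lines of Python. This is only a sketch: the function name and the log path in the usage note are hypothetical, and it matches on the literal string "AMQP" as suggested above.

```python
def amqp_log_lines(lines):
    """Return the log lines that mention AMQP, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if "AMQP" in line]
```

For example, `with open("/var/log/ceph/rgw.log") as f: print("\n".join(amqp_log_lines(f)))`, where the path depends on the deployment.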
On Tue, Jan 26, 2021 at 7:33 PM Byrne, Thomas (STFC,RAL,SC) <
tom.byrne(a)stfc.ac.uk> wrote:
> Hi all,
>
> We've been trying to get RGW Bucket notifications working with a
> RabbitMQ endpoint on our Nautilus 14.2.15 cluster. The gateway host can
> communicate with the rabbitMQ server just fine, but when RGW tries to send
> a message to the endpoint, the message never appears in the queue, and we
> get this error in the RGW logs:
>
> 2021-01-26 16:28:17.271 7f0468b1f700 1 push to endpoint AMQP(0.9.1)
> Endpoint
> URI: amqp://user:pass@host:5671
> Topic: ceph-topic-test
> Exchange: ceph-test
> Ack Level: broker failed, with error: -4098
>
> We've confirmed the URI is correct, and that the gateway host can send
> messages to the RabbitMQ via a standalone script (using the same
> information as in the URI). Does anyone have any hints about how to dig
> into this?
>
> Cheers,
> Tom