I have been doing some testing with RBD-Mirror Snapshots to a remote Ceph cluster.
Does anyone know if the images on the remote cluster can be utilized in any way? Would love the ability to clone them, or even read-only access would be nice.
Hi -
We keep on getting errors like these on specific OSDs with Nautilus (14.2.16):
2021-01-29 06:14:19.174 7fbeaab92c00 -1 osd.8 12568359 unable to obtain rotating service keys; retrying
2021-01-29 06:14:49.173 7fbeaab92c00 0 monclient: wait_auth_rotating timed out after 30
2021-01-29 06:14:49.173 7fbeaab92c00 -1 osd.8 12568359 unable to obtain rotating service keys; retrying
2021-01-29 06:15:19.173 7fbeaab92c00 0 monclient: wait_auth_rotating timed out after 30
2021-01-29 06:15:19.173 7fbeaab92c00 -1 osd.8 12568359 unable to obtain rotating service keys; retrying
2021-01-29 06:15:49.174 7fbeaab92c00 0 monclient: wait_auth_rotating timed out after 30
2021-01-29 06:15:49.174 7fbeaab92c00 -1 osd.8 12568359 unable to obtain rotating service keys; retrying
2021-01-29 06:15:49.174 7fbeaab92c00 -1 osd.8 12568359 init wait_auth_rotating timed out
From googling it seems like it could be a variety of things. We do think time is in sync. It is particularly perplexing as we'll have a single OSD get this error while all other OSDs on the same node are fine.
It seems exactly like this:
https://tracker.ceph.com/issues/17170
Stopping the managers and restarting the mons fixes it temporarily.
From this old thread we do have msgr2 enabled:
https://www.spinics.net/lists/ceph-users/msg60631.html
This blog seems to point to storage slowness being the root cause in their environment:
http://www.florentflament.com/blog/ceph-monitor-status-switching-due-to-slo…
Any advice for sorting out what is causing this?
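For what it's worth, a quick sanity check along these lines (a rough sketch; the commands are the standard mon/auth ones, but the JSON field names may differ slightly between releases, and the OSD id is just the one from the log above) would confirm whether the monitors really see no clock skew and whether osd.8's key is still present in the mon auth database:

import json, subprocess

def ceph_json(*args):
    # Thin wrapper around the ceph CLI; assumes client.admin access on this node.
    out = subprocess.run(["ceph", *args, "--format", "json"],
                         check=True, capture_output=True, text=True).stdout
    return json.loads(out)

# Clock skew as the monitors see it (field names as emitted by Nautilus; verify on your build).
sync = ceph_json("time-sync-status")
for mon, status in sync.get("time_skew_status", {}).items():
    print(mon, "skew:", status.get("skew"), "latency:", status.get("latency"))

# Confirm the OSD's key is still registered with the monitors.
print(subprocess.run(["ceph", "auth", "get", "osd.8"],
                     check=True, capture_output=True, text=True).stdout)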
Thanks,
Will
Hi.
I am not sure why this is not working, but I am now unable to use the ceph command on any of my hosts.
When I try to launch ceph, I get the following response:
[errno 13] RADOS permission denied (error connecting to the cluster)
The web management interface is working fine.
I have a suspicion that this started after trying to recreate an NFS cluster.
I first removed the existing one with: ceph nfs cluster delete <id>
And then tried to create it again with: ceph nfs cluster create cephfs <id>
The command seemed to hang, and after several hours I ended the command with ctrl-c.
Since then I have been unable to use the ceph command.
This is fortunately a test environment, and it is running Octopus 15.2.8.
Does anyone have an idea on how I can get access again?
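In case it helps to narrow it down, a minimal python-rados sketch like the one below (conffile, keyring path and user name are assumptions, adjust to your setup) would show whether the client.admin key itself can still connect; if this also fails with errno 13, the problem is the key or its caps rather than the ceph CLI:

import rados

# Adjust conffile, keyring path and user name to your environment (assumptions).
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf",
                      name="client.admin",
                      conf={"keyring": "/etc/ceph/ceph.client.admin.keyring"})
cluster.connect(timeout=5)
print("connected, fsid:", cluster.get_fsid())
cluster.shutdown()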
Regards
Jens Hyllegaard
Hi, I have tried to enable RGW management in the dashboard.
The dashboard works fine, and I tried to add a new system user:
radosgw-admin user create --uid=some-user --display-name="User for
dashboard" --system
and set the access key and secret:
ceph dashboard set-rgw-api-access-key access-key
ceph dashboard set-rgw-api-secret-key secret-key
However, after adding these I get this in the dashboard, when I try to
access the Object Gateway:
"key data not in dict [u'user1', u'user2', u'some-user']"
If I try to verify that the credentials are correct with:
ceph dashboard get-rgw-api-access-key
ceph dashboard get-rgw-api-secret-key secret-key
and compare them with the key and secret from:
radosgw-admin user info --uid=some-user
they match.
Running Ceph 14.2.16
To me it looks like it cannot find the user, or cannot match the credentials, but
I cannot see what I have done wrong.
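One way to isolate it would be to test the same access/secret pair directly against the RGW endpoint, e.g. with a small boto3 sketch like this (endpoint URL and keys are placeholders); if this lists buckets fine, the keys themselves are good and the problem is more likely on the dashboard side (rgw-api-host/port settings, or the user not being recognised as a system user):

import boto3

# Placeholders: point these at your RGW endpoint and the keys from "radosgw-admin user info".
s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:8080",
    aws_access_key_id="access-key",
    aws_secret_access_key="secret-key",
)
# If the credentials are valid, this returns the user's buckets.
print([b["Name"] for b in s3.list_buckets().get("Buckets", [])])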
--
Med venlig hilsen
*Troels Hansen*
Senior Linux Konsulent
Tlf.: 22 43 71 57
tha(a)miracle.dk
www.miracle.dk
On Fri, Jan 29, 2021 at 9:18 AM Schoonjans, Tom (RFI,RAL,-) <
Tom.Schoonjans(a)rfi.ac.uk> wrote:
> Hi Yuval,
>
>
> What do I need to do if I want to switch to using a different exchange on
> the RabbitMQ endpoint? Or change the amqp-ack-level option that was used?
> Would you expect the same problem again? Will the existing connections to
> the RabbitMQ server be cleanly terminated?
>
I think that changing the ack level would take effect on the next publish
(as this is not a feature of the connection, but of the calling code), but to
change the exchange (or any other parameter of the connection itself), or
even to create a new topic to the same endpoint with a different exchange,
you would need a restart :-(
(tracking this here: https://tracker.ceph.com/issues/46127)
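for reference, recreating the topic against the SNS-compatible API with the new attributes would look roughly like this (endpoint, credentials, topic and exchange names are placeholders; the attribute keys are the ones documented for bucket notifications), keeping in mind the restart caveat above:

import boto3
from botocore.client import Config

# Placeholders: RGW endpoint, credentials, topic and exchange names.
sns = boto3.client(
    "sns",
    endpoint_url="http://rgw.example.com:8080",
    aws_access_key_id="access-key",
    aws_secret_access_key="secret-key",
    region_name="default",
    # Older radosgw releases expect SigV2-style signing on the topic API.
    config=Config(signature_version="s3"),
)
# Recreate the topic with the new connection parameters; per the discussion above,
# the radosgw still needs a restart before the new exchange is actually used.
sns.create_topic(
    Name="ceph-topic-test",
    Attributes={
        "push-endpoint": "amqp://user:pass@my.rabbitmq.server:5672",
        "amqp-exchange": "new-exchange",
        "amqp-ack-level": "broker",
    },
)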
> I tried the topic example
> <https://www.rabbitmq.com/tutorials/tutorial-five-python.html> from the
> RabbitMQ tutorial and I actually got the same behaviour as with Ceph:
> messages sent before the consumer queue is attached are lost. From what I
> understand this is a *feature* of this type of exchange. See also this
> <https://stackoverflow.com/questions/6148381/rabbitmq-persistent-message-wit…> stackoverflow
> post.
>
>
this is interesting. and looks like a better model than what we
currently have.
we should declare our own fanout "gateway" exchange, connected to an "eat
all" queue. the users may then connect the consumers directly to it, or via
a topic exchange they declare. that would actually fix 2 issues:
- durability of messages before clients are connected
- exchange name configuration issues
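to make the idea concrete, a rough pika sketch of that model (all exchange/queue names below are made up for illustration): a durable fanout "gateway" exchange the radosgw would publish to, a durable "eat all" queue bound to it so nothing is lost while no consumer is attached, and a consumer-declared topic exchange bound behind it:

import pika

# Hypothetical names, for illustration only.
GATEWAY_EXCHANGE = "ceph-gateway"          # fanout exchange the radosgw would publish to
CATCH_ALL_QUEUE = "ceph-gateway-eat-all"   # durable queue so messages survive consumer downtime
CONSUMER_EXCHANGE = "ceph-notifications"   # topic exchange a consumer declares for itself

connection = pika.BlockingConnection(pika.ConnectionParameters("my.rabbitmq.server"))
channel = connection.channel()

# Durable fanout "gateway" exchange plus a durable "eat all" queue bound to it.
channel.exchange_declare(exchange=GATEWAY_EXCHANGE, exchange_type="fanout", durable=True)
channel.queue_declare(queue=CATCH_ALL_QUEUE, durable=True)
channel.queue_bind(queue=CATCH_ALL_QUEUE, exchange=GATEWAY_EXCHANGE)

# Consumers can read from the catch-all queue directly, or hang their own topic
# exchange off the gateway and bind selective queues to that.
channel.exchange_declare(exchange=CONSUMER_EXCHANGE, exchange_type="topic", durable=True)
channel.exchange_bind(destination=CONSUMER_EXCHANGE, source=GATEWAY_EXCHANGE)

connection.close()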
> Best,
>
> Tom
>
>
> Dr Tom Schoonjans
>
> Research Software Engineer - HPC and Cloud
>
> Rosalind Franklin Institute
> Harwell Science & Innovation Campus
> Didcot
> Oxfordshire
> OX11 0FA
> United Kingdom
>
> https://www.rfi.ac.uk
>
> The Rosalind Franklin Institute is a registered charity in England and
> Wales, No. 1179810 Company Limited by Guarantee Registered in England
> and Wales, No.11266143. Funded by UK Research and Innovation through
> the Engineering and Physical Sciences Research Council.
>
> On 28 Jan 2021, at 18:16, Yuval Lifshitz <ylifshit(a)redhat.com> wrote:
>
>
>
> On Thu, Jan 28, 2021 at 7:34 PM Schoonjans, Tom (RFI,RAL,-) <
> Tom.Schoonjans(a)rfi.ac.uk> wrote:
>
>> Hi Yuval,
>>
>>
>> Together with Tom Byrne I ran some more tests today while keeping an eye
>> on the logs as well.
>>
>> We immediately noticed that the nodes were logging errors when uploading
>> files like:
>>
>> 2021-01-28 16:10:45.825 7f56ff5cf700 1 ====== starting new request req=0x7f56ff5c87f0 =====
>> 2021-01-28 16:10:45.828 7f5721e14700 1 AMQP connect: exchange mismatch
>> 2021-01-28 16:10:45.828 7f5721e14700 1 ERROR: failed to create push endpoint: amqp://<username>:<password>@<my.rabbitmq.server>:5672 due to: pubsub endpoint configuration error: AMQP: failed to create connection to: amqp://<username>:<password>@<my.rabbitmq.server>:5672
>> 2021-01-28 16:10:45.828 7f571ee0e700 1 ====== req done req=0x7f571ee077f0 op status=0 http_status=200 latency=0.0569997s ======
>>
>>
>> Which resulted in no connections being established to the RabbitMQ server.
>>
>> Tom then restarted the Ceph services on one gateway node, which led to
>> events being sent to RabbitMQ without blocking, but only if this particular
>> node was picked up by the boto3 upload request in the round-robin DNS.
>>
>> Restarting the Ceph service on all nodes fixed the problem and I got a
>> nice steady stream of events to my consumer Python script!
>>
>>
> we should fix it. no restart should be needed if one of the connection
> parameters was wrong
>
>
>
>> I did notice that any events that were sent while my consumer script was
>> not running are lost, as they are not picked up after I restart the script.
>> Any thoughts on this?
>>
>>
> this is strange. in our code [1] we don't require immediate transfer of
> messages.
> how is the exchange declared?
> can you check if this is happening when you send messages from a python
> producer as well?
>
> [1] https://github.com/ceph/ceph/blob/master/src/rgw/rgw_amqp.cc#L575
>
>
>
>> Many thanks!!
>>
>> Best,
>>
>> Tom
>>
>>
>>
>> Dr Tom Schoonjans
>>
>> Research Software Engineer - HPC and Cloud
>>
>> Rosalind Franklin Institute
>> Harwell Science & Innovation Campus
>> Didcot
>> Oxfordshire
>> OX11 0FA
>> United Kingdom
>>
>> https://www.rfi.ac.uk
>>
>> The Rosalind Franklin Institute is a registered charity in England and
>> Wales, No. 1179810 Company Limited by Guarantee Registered in England
>> and Wales, No.11266143. Funded by UK Research and Innovation through
>> the Engineering and Physical Sciences Research Council.
>>
>> On 27 Jan 2021, at 16:21, Yuval Lifshitz <ylifshit(a)redhat.com> wrote:
>>
>>
>> On Wed, Jan 27, 2021 at 5:34 PM Schoonjans, Tom (RFI,RAL,-) <
>> Tom.Schoonjans(a)rfi.ac.uk> wrote:
>>
>>> Looks like there’s already a ticket open for AMQP SSL support:
>>> https://tracker.ceph.com/issues/42902 (you opened it ;-))
>>>
>>> I will give a try myself if I have some time, but don’t hold your breath
>>> with lockdown and home schooling. Also I am not much of a C++ coder.
>>>
>>> I need to go over the logs with Tom Byrne to see why it is not working
>>> properly. And perhaps I will be able to come up with a fix then.
>>>
>>> However this is what I have run into so far today:
>>>
>>> 1. After configuring a bucket with a topic using the non-SSL port, I
>>> tried a couple of uploads to this bucket. They all hung, which seemed
>>> like something was very wrong, so I Ctrl-C’ed every time. After some time I
>>> figured out from the RabbitMQ admin UI that Ceph was indeed connecting to
>>> it, and the connections remained so I killed them from the UI.
>>>
>>
>> sending the notification to the rabbitmq server is synchronous with the
>> upload to the bucket. so, if the server is slow or not acking the
>> notification, the upload request would hang. note that the upload itself is
>> done first, but the reply to the client does not happen until rabbitmq
>> server acks.
>>
>> would be great if you can share the radosgw logs.
>> maybe the issue is related to the user/password method we use? we use:
>> AMQP_SASL_METHOD_PLAIN
>>
>> one possible workaround would be to set "amqp-ack-level" to "none". in
>> this case the radosgw does not wait for an ack
>>
>> in "pacific" you could use "persistent topics" where the notifications
>> are sent asynchronously to the endpoint.
>>
>> 2. I then wrote a python script with Pika to consume the events, hoping
>>> that would stop the blocking. I had some minor success with this. Usually
>>> the first three or four uploaded files would generate events that I could
>>> consume with my script.
>>>
>>
>> the radosgw is waiting for an ack from the broker, not the end consumer,
>> so this should not have mattered...
>> did you actually see any notifications delivered to the consumer?
>>
>>
>>> However, the rest would block for ever. I repeated this a couple of
>>> times but always the same result. I noticed that after I stopped uploading,
>>> removed the bucket and the topic, the connection from Ceph in the RabbitMQ
>>> UI remained. I killed it but it came back seconds later from another port
>>> on the Ceph cluster. I ended up playing whack-a-mole with this until no
>>> more connections would be established from Ceph to RabbitMQ. I probably
>>> killed a 100 or so of them.
>>>
>>
>> once you remove the bucket there cannot be new notification sent. if you
>> create the bucket again you may see notifications again (this is fixed in
>> "pacific").
>> either way, even if the connection to the rabbitmq server would still be
>> open, but no new notifications should be sent there. just having the
>> connection should not be an issue but would be nice to fix that as well:
>> https://tracker.ceph.com/issues/49033
>>
>> 3. After this I couldn’t get any events sent anymore. There is no more
>>> blocking when uploading, files get written but nothing else happens. No
>>> connections are made anymore from Ceph to RabbitMQ.
>>>
>>> Hope this helps…
>>>
>>
>> yes, this is very helpful!
>>
>>
>>> Best,
>>>
>>> Tom
>>>
>>>
>>>
>>>
>>> Dr Tom Schoonjans
>>>
>>> Research Software Engineer - HPC and Cloud
>>>
>>> Rosalind Franklin Institute
>>> Harwell Science & Innovation Campus
>>> Didcot
>>> Oxfordshire
>>> OX11 0FA
>>> United Kingdom
>>>
>>> https://www.rfi.ac.uk
>>>
>>> The Rosalind Franklin Institute is a registered charity in England and
>>> Wales, No. 1179810 Company Limited by Guarantee Registered in England
>>> and Wales, No.11266143. Funded by UK Research and Innovation through
>>> the Engineering and Physical Sciences Research Council.
>>>
>>> On 27 Jan 2021, at 13:04, Yuval Lifshitz <ylifshit(a)redhat.com> wrote:
>>>
>>>
>>>
>>> On Wed, Jan 27, 2021 at 11:33 AM Schoonjans, Tom (RFI,RAL,-) <
>>> Tom.Schoonjans(a)rfi.ac.uk> wrote:
>>>
>>>> Hi Yuval,
>>>>
>>>>
>>>> Switching to non-SSL connections to RabbitMQ allowed us to get things
>>>> working, although currently it’s not very reliable.
>>>>
>>>
>>> can you please add more about that? what reliability issues did you see?
>>>
>>>
>>>> I will open a new ticket over this if we can’t fix things ourselves.
>>>>
>>>>
>>> this would be great. we have ssl support for kafka and http endpoint,
>>> so, if you decide to give it a try you can look at them as examples.
>>> and let me know if you have questions or need help.
>>>
>>>
>>>
>>>> I will open an issue on the tracker as soon as my account request has
>>>> been approved :-)
>>>>
>>>> Best,
>>>>
>>>> Tom
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Dr Tom Schoonjans
>>>>
>>>> Research Software Engineer - HPC and Cloud
>>>>
>>>> Rosalind Franklin Institute
>>>> Harwell Science & Innovation Campus
>>>> Didcot
>>>> Oxfordshire
>>>> OX11 0FA
>>>> United Kingdom
>>>>
>>>> https://www.rfi.ac.uk
>>>>
>>>> The Rosalind Franklin Institute is a registered charity in England and
>>>> Wales, No. 1179810 Company Limited by Guarantee Registered in England
>>>> and Wales, No.11266143. Funded by UK Research and Innovation through
>>>> the Engineering and Physical Sciences Research Council.
>>>>
>>>> On 26 Jan 2021, at 20:02, Yuval Lifshitz <ylifshit(a)redhat.com> wrote:
>>>>
>>>>
>>>>
>>>> On Tue, Jan 26, 2021 at 9:48 PM Schoonjans, Tom (RFI,RAL,-) <
>>>> Tom.Schoonjans(a)rfi.ac.uk> wrote:
>>>>
>>>>> Hi Yuval,
>>>>>
>>>>>
>>>>> I worked on this earlier today with Tom Byrne and I think I may be
>>>>> able to provide some more information.
>>>>>
>>>>> I set up the RabbitMQ server myself, and created the exchange with
>>>>> type ’topic’ before configuring the bucket.
>>>>>
>>>>> Not sure if this matters, but the RabbitMQ endpoint is reached over
>>>>> SSL, using certificates generated with Letsencrypt.
>>>>>
>>>>>
>>>> it actually does. we don't support amqp over ssl.
>>>> feel free to open a tracker for that - as we should probably support
>>>> that!
>>>> but note that it would probably be backported only to later versions
>>>> than nautilus.
>>>>
>>>>
>>>>
>>>>> Many thanks,
>>>>>
>>>>> Tom
>>>>>
>>>>>
>>>>>
>>>>> Dr Tom Schoonjans
>>>>>
>>>>> Research Software Engineer - HPC and Cloud
>>>>>
>>>>> Rosalind Franklin Institute
>>>>> Harwell Science & Innovation Campus
>>>>> Didcot
>>>>> Oxfordshire
>>>>> OX11 0FA
>>>>> United Kingdom
>>>>>
>>>>> https://www.rfi.ac.uk
>>>>>
>>>>> The Rosalind Franklin Institute is a registered charity in England and
>>>>> Wales, No. 1179810 Company Limited by Guarantee Registered in England
>>>>> and Wales, No.11266143. Funded by UK Research and Innovation through
>>>>> the Engineering and Physical Sciences Research Council.
>>>>>
>>>>> On 26 Jan 2021, at 19:37, Yuval Lifshitz <ylifshit(a)redhat.com> wrote:
>>>>>
>>>>> Hi Tom,
>>>>> Did you create the exchange in rabbitmq? The RGW does not create it
>>>>> and assumes it is already created.
>>>>> Could you increase the log level in RGW and see if there are more log
>>>>> messages that have "AMQP" in them?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Yuval
>>>>>
>>>>> On Tue, Jan 26, 2021 at 7:33 PM Byrne, Thomas (STFC,RAL,SC) <
>>>>> tom.byrne(a)stfc.ac.uk> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> We've been trying to get RGW Bucket notifications working with a
>>>>>> RabbitMQ endpoint on our Nautilus 14.2.15 cluster. The gateway host can
>>>>>> communicate with the rabbitMQ server just fine, but when RGW tries to send
>>>>>> a message to the endpoint, the message never appears in the queue, and we
>>>>>> get this error from in the RGW logs:
>>>>>>
>>>>>> 2021-01-26 16:28:17.271 7f0468b1f700 1 push to endpoint AMQP(0.9.1)
>>>>>> Endpoint
>>>>>> URI: amqp://user:pass@host:5671
>>>>>> Topic: ceph-topic-test
>>>>>> Exchange: ceph-test
>>>>>> Ack Level: broker failed, with error: -4098
>>>>>>
>>>>>> We've confirmed the URI is correct, and that the gateway host can
>>>>>> send messages to the RabbitMQ via a standalone script (using the same
>>>>>> information as in the URI). Does anyone have any hints about how to dig
>>>>>> into this?
>>>>>>
>>>>>> Cheers,
>>>>>> Tom
>>>>>>
>>>>>> This email and any attachments are intended solely for the use of the
>>>>>> named recipients. If you are not the intended recipient you must not use,
>>>>>> disclose, copy or distribute this email or any of its attachments and
>>>>>> should notify the sender immediately and delete this email from your
>>>>>> system. UK Research and Innovation (UKRI) has taken every reasonable
>>>>>> precaution to minimise risk of this email or any attachments containing
>>>>>> viruses or malware but the recipient should carry out its own virus and
>>>>>> malware checks before opening the attachments. UKRI does not accept any
>>>>>> liability for any losses or damages which the recipient may sustain due to
>>>>>> presence of any viruses. Opinions, conclusions or other information in this
>>>>>> message and attachments that are not related directly to UKRI business are
>>>>>> solely those of the author and do not represent the views of UKRI.
>>>>>>
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list -- ceph-users(a)ceph.io
>>>>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>>>>>
>>>>>
>
Hello all,
"ceph daemon osd.0 heap stats" and "ceph daemon osd.0 dump_mempools" can both be used to analyse the memory used by an OSD.
The "Virtual address space used" value in the "ceph daemon osd.0 heap stats" output is larger than the total bytes reported by "ceph daemon osd.0 dump_mempools".
I dug into the source code and found that the memory reported by "ceph daemon osd.0 heap stats" is allocated by malloc. Does it include the memory accounted for in the dump_mempools output? What is the difference between them?
Thanks for any reply.
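A small sketch (assuming the script runs on the OSD node with access to the admin socket) that puts the two views side by side: the mempool total is pulled out of the dump_mempools JSON, while the tcmalloc heap stats are just printed raw for manual comparison, since their text format varies between releases:

import json, subprocess

def admin_socket(*args):
    # Run "ceph daemon osd.0 <command>" and return its stdout as text.
    return subprocess.run(["ceph", "daemon", "osd.0", *args],
                          check=True, capture_output=True, text=True).stdout

# dump_mempools emits JSON; walk it defensively since the layout differs slightly across releases.
def find_total_bytes(node):
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "total_bytes":
                return value
            found = find_total_bytes(value)
            if found is not None:
                return found
    return None

mempools = json.loads(admin_socket("dump_mempools"))
print("mempool total_bytes:", find_total_bytes(mempools))

# heap stats is plain tcmalloc text (some releases print part of it to stderr); show it as-is.
print(admin_socket("heap", "stats"))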
Hello all,
One of our clusters running nautilus release 14.2.15 is reporting health
error. It reports that there are inconsistent PGs. However, when I inspect
each of the reported PGs, I don't see any inconsistencies. Any inputs on
what's going on?
$ sudo ceph health detail
HEALTH_ERR 3 scrub errors; Possible data damage: 3 pgs inconsistent
OSD_SCRUB_ERRORS 3 scrub errors
PG_DAMAGED Possible data damage: 3 pgs inconsistent
pg 2.a4 is active+clean+inconsistent, acting [2,60,73]
pg 2.2b3 is active+clean+inconsistent, acting [15,3,38]
pg 2.758 is active+clean+inconsistent, acting [4,40,35]
$ rados list-inconsistent-obj 2.758 --format=json-pretty
{
"epoch": 9211,
"inconsistents": []
}
$ rados list-inconsistent-obj 2.a4 --format=json-pretty
{
"epoch": 9213,
"inconsistents": []
}
$ rados list-inconsistent-obj 2.758 --format=json-pretty
{
"epoch": 9211,
"inconsistents": []
}
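For completeness, a small sketch (a subprocess wrapper around the same commands, with the PG IDs taken from the health detail above; note the paste queries 2.758 twice and never 2.2b3) that re-queries all three PGs and, where the list comes back empty, kicks off a fresh deep-scrub. An empty list while scrub errors are still flagged can simply mean the recorded per-object scrub results have gone stale (e.g. after peering) until the PG is deep-scrubbed again:

import json, subprocess

# PG IDs from the health detail output above.
DAMAGED_PGS = ["2.a4", "2.2b3", "2.758"]

for pg in DAMAGED_PGS:
    result = subprocess.run(["rados", "list-inconsistent-obj", pg, "--format=json"],
                            capture_output=True, text=True)
    if result.returncode != 0:
        # e.g. "No scrub information available" - nothing recorded since the last scrub/peering.
        print(pg, result.stderr.strip())
        continue
    inconsistents = json.loads(result.stdout).get("inconsistents", [])
    print(pg, "object inconsistencies:", len(inconsistents))
    if not inconsistents:
        # Kick off a fresh deep scrub so the details are repopulated, then re-check
        # list-inconsistent-obj (and list-inconsistent-snapset) once it completes.
        subprocess.run(["ceph", "pg", "deep-scrub", pg], check=True)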
Regards,
Shridhar
Hello Everyone,
We seem to be having a problem on one of our ceph clusters post the OS
patch and reboot of one of the nodes. The three other nodes are showing
OSD fill rates of 77%-81%, but the 60 OSDs contained in the host that was
just rebooted are varying between 64% and 90% since the reboot occurred.
The three other nodes have not yet been patched or rebooted.
The result is:
health: HEALTH_WARN
15 nearfull osd(s)
7 pool(s) nearfull
Low space hindering backfill (add storage if this doesn't
resolve itself): 15 pgs backfill_toofull
Degraded data redundancy: 170940/1437684990 objects degraded
(0.012%), 4 pgs degraded, 4 pgs undersized
services:
mon: 3 daemons, quorum prdceph01,prdceph02,prdceph03 (age 6h)
mgr: prdceph01(active, since 5w), standbys: prdceph02, prdceph03,
prdceph04
mds: ArchiveRepository:1 {0=prdceph01=up:active} 3 up:standby
osd: 240 osds: 240 up (since 6h), 240 in (since 27h); 16 remapped pgs
task status:
scrub status:
mds.prdceph01: idle
data:
pools: 7 pools, 8384 pgs
objects: 479.23M objects, 557 TiB
usage: 1.7 PiB used, 454 TiB / 2.1 PiB avail
pgs: 170940/1437684990 objects degraded (0.012%)
4155186/1437684990 objects misplaced (0.289%)
8332 active+clean
36 active+clean+scrubbing+deep
11 active+remapped+backfill_toofull
2 active+undersized+degraded+remapped+backfill_toofull
2 active+forced_recovery+undersized+degraded+remapped+forced_backfill+backfill_toofull
1 active+remapped+backfilling
io:
client: 9.6 MiB/s rd, 820 KiB/s wr, 1.02k op/s rd, 189 op/s wr
recovery: 0 B/s, 25 keys/s, 10 objects/s
Any suggestions would be greatly appreciated, as currently it is not able
to complete the repair, nor will it backfill, even when attempting to force.
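One thing that might help while backfill is stuck on backfill_toofull is to temporarily rein in the fullest OSDs (or, cautiously, raise the backfillfull ratio). A hedged sketch that lists reweight candidates, assuming the JSON field names "ceph osd df" emits on Nautilus (verify on your cluster before acting on it):

import json, subprocess

# Parse "ceph osd df" and show the fullest OSDs as candidates for "ceph osd reweight".
out = subprocess.run(["ceph", "osd", "df", "--format", "json"],
                     check=True, capture_output=True, text=True).stdout
nodes = json.loads(out).get("nodes", [])
for osd in sorted(nodes, key=lambda n: n.get("utilization", 0), reverse=True)[:15]:
    print(f"osd.{osd['id']:<4} {osd.get('utilization', 0):6.2f}% reweight={osd.get('reweight')}")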
Many thanks in advance.
Marco
Hi,
There is a potential that my Ceph RGW multi-site solution may be down for an extended time (2 weeks?) for a physical relocation. Some questions, particularly in regard to RGW:
1. Is there any limit on downtime after which I might have to restart an entire sync? I want to still be able to write data during the downtime to the remaining site.
2. How do I change the master zone and/or zonegroup? My cluster is a 2-site config with one zonegroup (so I won't have to change that), with 2 zones. If I am physically moving the current master, how do I change the master so I can continue to do bucket metadata ops (new buckets, etc.) while the old master site is in transition? (See the sketch after this list.)
And also:
3. My site is mostly RGW, but also has some CephFS and RBD. Any issues with those? I am not replicating across sites for those.
4. How do I re-IP said cluster if need be?
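For question 2, I assume the documented failover approach applies: promote the surviving zone to master on its own cluster, commit a new period, and then restart the gateways there. A minimal sketch of those steps (the zone name "site-b" is a placeholder), run on the site that stays up; is this the right sequence?

import subprocess

# Placeholder zone name; run this on the cluster that remains online.
SURVIVING_ZONE = "site-b"

# Promote the surviving zone to master (and default) ...
subprocess.run(["radosgw-admin", "zone", "modify",
                "--rgw-zone=" + SURVIVING_ZONE, "--master", "--default"], check=True)
# ... and commit a new period so the change takes effect across the realm.
subprocess.run(["radosgw-admin", "period", "update", "--commit"], check=True)
# Finally, restart the radosgw instances on this site so they pick up the new period.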
Thanks for any info.
-Chris