Hi,
There is a potential that my Ceph RGW multi-site solution may be down for an extended time (2 weeks?) for a physical relocation. Some questions, particularly in regard to RGW:
1. Is there any limit on downtime after which I might have to restart an entire sync? I want to still be able to write data during the downtime to the remaining site.
2. How do I change the master zone and/or zonegroup? My cluster is a 2-site config with one zonegroup (so I won't have to change that), with 2 zones. If I am physically moving the current master, how do I change the master so I can continue to do bucket metadata ops (new buckets, etc.) while the old master site is in transition?
And also:
3. My site is mostly RGW, but also has some CephFS and RBD. Any issues with those? I am not replicating across sites for those.
4. How do I re-IP said cluster if need be?
Thanks for any info.
-Chris
On Thu, Jan 28, 2021 at 7:34 PM Schoonjans, Tom (RFI,RAL,-) <
Tom.Schoonjans(a)rfi.ac.uk> wrote:
> Hi Yuval,
>
>
> Together with Tom Byrne I ran some more tests today while keeping an eye
> on the logs as well.
>
> We immediately noticed that the nodes were logging errors when uploading
> files like:
>
> 2021-01-28 16:10:45.825 7f56ff5cf700 1 ====== starting new request req=0x7f56ff5c87f0 =====
> 2021-01-28 16:10:45.828 7f5721e14700 1 AMQP connect: exchange mismatch
> 2021-01-28 16:10:45.828 7f5721e14700 1 ERROR: failed to create push endpoint: amqp://<username>:<password>@<my.rabbitmq.server>:5672 due to: pubsub endpoint configuration error: AMQP: failed to create connection to: amqp://<username>:<password>@<my.rabbitmq.server>:5672
> 2021-01-28 16:10:45.828 7f571ee0e700 1 ====== req done req=0x7f571ee077f0 op status=0 http_status=200 latency=0.0569997s ======
>
>
> Which resulted in no connections being established to the RabbitMQ server.
>
> Tom then restarted the Ceph services on one gateway node, which led to
> events being sent to RabbitMQ without blocking, but only if this particular
> node was picked by the round-robin DNS for the boto3 upload request.
>
> Restarting the Ceph service on all nodes fixed the problem and I got a
> nice steady stream of events to my consumer Python script!
>
>
we should fix it. no restart should be needed if one of the connection
parameters was wrong
> I did notice that any events that were sent while my consumer script was
> not running are lost, as they are not picked up after I restart the script.
> Any thoughts on this?
>
>
this is strange. in our code [1] we don't require immediate transfer of
messages.
how is the exchange declared?
can you check if this is happening when you send messages from a python
producer as well?
[1] https://github.com/ceph/ceph/blob/master/src/rgw/rgw_amqp.cc#L575
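A minimal standalone producer sketch for those two checks (pika assumed installed; the host, credentials and the durable flag are placeholders/guesses and should match the real setup). Re-declaring an existing exchange with different arguments makes RabbitMQ close the channel with a PRECONDITION_FAILED error, so this also shows how the exchange is currently declared:

import pika

params = pika.URLParameters("amqp://<username>:<password>@<my.rabbitmq.server>:5672/")
connection = pika.BlockingConnection(params)
channel = connection.channel()

# Declare the exchange with the arguments it is expected to have; if it already
# exists with a different type or durable/auto_delete flags, RabbitMQ closes the
# channel with a PRECONDITION_FAILED error (pika.exceptions.ChannelClosedByBroker).
channel.exchange_declare(exchange="ceph-test", exchange_type="topic", durable=False)

# Publish a test message the way a standalone producer would.
channel.basic_publish(exchange="ceph-test", routing_key="ceph-topic-test",
                      body=b"test message from standalone producer")
connection.close()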
> Many thanks!!
>
> Best,
>
> Tom
>
>
>
> Dr Tom Schoonjans
>
> Research Software Engineer - HPC and Cloud
>
> Rosalind Franklin Institute
> Harwell Science & Innovation Campus
> Didcot
> Oxfordshire
> OX11 0FA
> United Kingdom
>
> https://www.rfi.ac.uk
>
> The Rosalind Franklin Institute is a registered charity in England and
> Wales, No. 1179810 Company Limited by Guarantee Registered in England
> and Wales, No.11266143. Funded by UK Research and Innovation through
> the Engineering and Physical Sciences Research Council.
>
> On 27 Jan 2021, at 16:21, Yuval Lifshitz <ylifshit(a)redhat.com> wrote:
>
>
> On Wed, Jan 27, 2021 at 5:34 PM Schoonjans, Tom (RFI,RAL,-) <
> Tom.Schoonjans(a)rfi.ac.uk> wrote:
>
>> Looks like there’s already a ticket open for AMQP SSL support:
>> https://tracker.ceph.com/issues/42902 (you opened it ;-))
>>
>> I will give it a try myself if I have some time, but don’t hold your breath
>> with lockdown and home schooling. Also I am not much of a C++ coder.
>>
>> I need to go over the logs with Tom Byrne to see why it is not working
>> properly. And perhaps I will be able to come up with a fix then.
>>
>> However this is what I have run into so far today:
>>
>> 1. After configuring a bucket with a topic using the non-SSL port, I
>> tried a couple of uploads to this bucket. They all hung, which seemed
>> like something was very wrong, so I Ctrl-C’ed every time. After some time I
>> figured out from the RabbitMQ admin UI that Ceph was indeed connecting to
>> it, and the connections remained so I killed them from the UI.
>>
>
> sending the notification to the rabbitmq server is synchronous with the
> upload to the bucket. so, if the server is slow or not acking the
> notification, the upload request would hang. note that the upload itself is
> done first, but the reply to the client does not happen until rabbitmq
> server acks.
>
> would be great if you can share the radosgw logs.
> maybe the issue is related to the user/password method we use? we use:
> AMQP_SASL_METHOD_PLAIN
>
> one possible workaround would be to set "amqp-ack-level" to "none". in
> this case the radosgw does not wait for an ack
>
> in "pacific" you could use "persistent topics" where the notifications are
> sent asynchronously to the endpoint.
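For reference, a sketch of what the "amqp-ack-level": "none" workaround might look like when (re)creating the topic through the SNS-style API with boto3. The endpoint, credentials and names are placeholders, and whether the signature_version setting is needed may depend on the RGW version:

import boto3
from botocore.client import Config

sns = boto3.client("sns",
                   endpoint_url="http://<rgw.endpoint>",      # the RGW, not AWS
                   aws_access_key_id="<access-key>",
                   aws_secret_access_key="<secret-key>",
                   region_name="default",
                   config=Config(signature_version="s3"))

attributes = {
    "push-endpoint": "amqp://<username>:<password>@<my.rabbitmq.server>:5672",
    "amqp-exchange": "ceph-test",
    "amqp-ack-level": "none",   # do not wait for the broker to ack the message
}
sns.create_topic(Name="ceph-topic-test", Attributes=attributes)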
>
>> 2. I then wrote a python script with Pika to consume the events, hoping
>> that would stop the blocking. I had some minor success with this. Usually
>> the first three or four uploaded files would generate events that I could
>> consume with my script.
>>
>
> the radosgw is waiting for an ack from the broker, not the end consumer,
> so this should not have mattered...
> did you actually see any notifications delivered to the consumer?
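For comparison, a minimal pika consumer sketch (the exchange name is taken from this thread, everything else is a placeholder). One thing worth checking for the lost-events question above: in AMQP, messages routed to an exchange with no bound queue are simply dropped, so a named durable queue that outlives the consumer is needed if events published while the script is down should be kept:

import pika

conn = pika.BlockingConnection(
    pika.URLParameters("amqp://<username>:<password>@<my.rabbitmq.server>:5672/"))
ch = conn.channel()

# A named, durable, non-auto-delete queue survives consumer restarts, so events
# published while the consumer is offline accumulate here instead of being dropped.
ch.queue_declare(queue="ceph-events", durable=True)
ch.queue_bind(queue="ceph-events", exchange="ceph-test", routing_key="#")

def on_message(channel, method, properties, body):
    print(method.routing_key, body)

ch.basic_consume(queue="ceph-events", on_message_callback=on_message, auto_ack=True)
ch.start_consuming()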
>
>
>> However, the rest would block forever. I repeated this a couple of times
>> but always the same result. I noticed that after I stopped uploading,
>> removed the bucket and the topic, the connection from Ceph in the RabbitMQ
>> UI remained. I killed it but it came back seconds later from another port
>> on the Ceph cluster. I ended up playing whack-a-mole with this until no
>> more connections would be established from Ceph to RabbitMQ. I probably
>> killed 100 or so of them.
>>
>
> once you remove the bucket, no new notifications can be sent. if you
> create the bucket again you may see notifications again (this is fixed in
> "pacific").
> either way, even if the connection to the rabbitmq server stays open, no
> new notifications should be sent over it. just having the connection open
> should not be an issue, but it would be nice to fix that as well:
> https://tracker.ceph.com/issues/49033
>
>> 3. After this I couldn’t get any events sent anymore. There is no more
>> blocking when uploading, files get written but nothing else happens. No
>> connections are made anymore from Ceph to RabbitMQ.
>>
>> Hope this helps…
>>
>
> yes, this is very helpful!
>
>
>> Best,
>>
>> Tom
>>
>>
>>
>>
>>
>> On 27 Jan 2021, at 13:04, Yuval Lifshitz <ylifshit(a)redhat.com> wrote:
>>
>>
>>
>> On Wed, Jan 27, 2021 at 11:33 AM Schoonjans, Tom (RFI,RAL,-) <
>> Tom.Schoonjans(a)rfi.ac.uk> wrote:
>>
>>> Hi Yuval,
>>>
>>>
>>> Switching to non-SSL connections to RabbitMQ allowed us to get things
>>> working, although currently it’s not very reliable.
>>>
>>
>> can you please add more about that? what reliability issues did you see?
>>
>>
>>> I will open a new ticket over this if we can’t fix things ourselves.
>>>
>>>
>> this would be great. we have ssl support for kafka and http endpoint, so,
>> if you decide to give it a try you can look at them as examples.
>> and let me know if you have questions or need help.
>>
>>
>>
>>> I will open an issue on the tracker as soon as my account request has
>>> been approved :-)
>>>
>>> Best,
>>>
>>> Tom
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 26 Jan 2021, at 20:02, Yuval Lifshitz <ylifshit(a)redhat.com> wrote:
>>>
>>>
>>>
>>> On Tue, Jan 26, 2021 at 9:48 PM Schoonjans, Tom (RFI,RAL,-) <
>>> Tom.Schoonjans(a)rfi.ac.uk> wrote:
>>>
>>>> Hi Yuval,
>>>>
>>>>
>>>> I worked on this earlier today with Tom Byrne and I think I may be able
>>>> to provide some more information.
>>>>
>>>> I set up the RabbitMQ server myself, and created the exchange with type
>>>> ’topic’ before configuring the bucket.
>>>>
>>>> Not sure if this matters, but the RabbitMQ endpoint is reached over
>>>> SSL, using certificates generated with Letsencrypt.
>>>>
>>>>
>>> it actually does. we don't support amqp over ssl.
>>> feel free to open a tracker for that - as we should probably support
>>> that!
>>> but note that it would probably be backported only to later versions
>>> than nautilus.
>>>
>>>
>>>
>>>> Many thanks,
>>>>
>>>> Tom
>>>>
>>>>
>>>>
>>>>
>>>> On 26 Jan 2021, at 19:37, Yuval Lifshitz <ylifshit(a)redhat.com> wrote:
>>>>
>>>> Hi Tom,
>>>> Did you create the exchange in rabbitmq? The RGW does not create it and
>>>> assumes it is already created.
>>>> Could you increase the log level in RGW and see if there are more log
>>>> messages that have "AMQP" in them?
>>>>
>>>> Thanks,
>>>>
>>>> Yuval
>>>>
>>>> On Tue, Jan 26, 2021 at 7:33 PM Byrne, Thomas (STFC,RAL,SC) <
>>>> tom.byrne(a)stfc.ac.uk> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> We've been trying to get RGW Bucket notifications working with a
>>>>> RabbitMQ endpoint on our Nautilus 14.2.15 cluster. The gateway host can
>>>>> communicate with the RabbitMQ server just fine, but when RGW tries to send
>>>>> a message to the endpoint, the message never appears in the queue, and we
>>>>> get this error in the RGW logs:
>>>>>
>>>>> 2021-01-26 16:28:17.271 7f0468b1f700 1 push to endpoint AMQP(0.9.1)
>>>>> Endpoint
>>>>> URI: amqp://user:pass@host:5671
>>>>> Topic: ceph-topic-test
>>>>> Exchange: ceph-test
>>>>> Ack Level: broker failed, with error: -4098
>>>>>
>>>>> We've confirmed the URI is correct, and that the gateway host can send
>>>>> messages to the RabbitMQ server via a standalone script (using the same
>>>>> information as in the URI). Does anyone have any hints about how to dig
>>>>> into this?
>>>>>
>>>>> Cheers,
>>>>> Tom
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> ceph-users mailing list -- ceph-users(a)ceph.io
>>>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>>>>>
>>>>>
>>>>
>>>
>>
>
Hi,
I have a ceph nautilus (14.2.9) cluster with 10 nodes. Each node has
19x16TB disks attached.
I created the radosgw pools. The secondaryzone.rgw.buckets.data pool is
configured as EC 8+2 (jerasure).
ceph df shows 2.1PiB MAX AVAIL space.
Then I configured radosgw as a secondary zone and 100TiB of S3 data was
replicated.
But weirdly enough, ceph df now shows 1.8PiB MAX AVAIL for the same pool, even
though only 100TiB of data has been written (ceph df confirms this). I cannot
figure out where the other 200TiB of capacity has gone.
Would someone please tell me what I am missing?
Thanks.
Hi,
We have a pool where the user had 2 images.
They cleaned up the images, with no snapshots in them, but when I look at ceph df detail it still shows 458GB in the first column.
Why?
Thanks
Hi *,
is there any correlation between multi-site clusters and inconsistent PGs?
One customer has two Octopus clusters (fresh install a few months ago)
which have been expanded recently with new disks. Before that they had
one single occurence of inconsistent PGs during a deep-scrub which
could be fixed.
Since the expansion they get them on a regular basis, something like
two times a day or so (I don't have access). The latest occurrence was
reported as an omap_digest mismatch, which could be fixed with the pg
repair command.
There are no signs of failing disks or anything suspicious. And when
searching this list I noticed a couple of other threads (one example
is [1]) with this problem; they all had rgw multi-site in
common. Is it just a coincidence? Does anyone have some insight or
more experience on this?
Thanks!
Eugen
[1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-April/034155.html
Hi Marc,
Thanks for participating. At first I thought this was an incorrect report and that maybe I needed to upgrade for a bugfix.
But I couldn't find such a report, so I asked here.
From the experiences people shared, there appear to be two possible causes: unbalanced OSDs or storage amplification.
As far as I understand, this is most likely storage amplification. Unbalancing seems less relevant since this is a fresh cluster, or I might be misinterpreting ceph osd df. https://pastebin.ubuntu.com/p/ZmQZsGYpr7/ <https://pastebin.ubuntu.com/p/7C9zpXYntR/>
So I am trying to figure out the best way to change bluestore_min_alloc_size_hdd. I also think pool compression can be a quick solution for future data writes, but I am not 100% sure. Any idea is more than welcome.
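A rough back-of-the-envelope sketch of the amplification hypothesis, assuming the nautilus default of 64KiB bluestore_min_alloc_size_hdd, the EC 8+2 profile above and a 4MiB RGW stripe size. The real on-disk layout differs (head objects, multipart, compression), so treat the numbers as indicative only; note also that min_alloc_size is baked in when an OSD is created, so changing the config option only takes effect on OSDs that are redeployed afterwards:

import math

K, M = 8, 2                   # EC profile 8+2
MIN_ALLOC = 64 * 1024         # nautilus default for HDD (assumption)
STRIPE = 4 * 1024 * 1024      # default RGW object stripe size (assumption)

def raw_usage(size):
    """Very rough raw-space estimate for one S3 object of `size` bytes."""
    raw, remaining = 0, size
    while remaining > 0:
        chunk = min(remaining, STRIPE)
        shard = math.ceil(chunk / K)                         # per-OSD shard size
        raw += (K + M) * math.ceil(shard / MIN_ALLOC) * MIN_ALLOC
        remaining -= chunk
    return raw

for size in (16 * 1024, 256 * 1024, 4 * 1024 * 1024, 64 * 1024 * 1024):
    ideal = size * (K + M) / K                               # pure EC overhead, no rounding
    raw = raw_usage(size)
    print(f"{size // 1024:>8} KiB object: raw {raw // 1024:>8} KiB, "
          f"ideal {ideal / 1024:>8.0f} KiB, amplification x{raw / ideal:.1f}")

With numbers like these (a 16KiB object ends up using 640KiB of raw space here), a workload with many small objects can easily consume several times its logical size in raw space, which would show up exactly as MAX AVAIL shrinking much faster than the amount of data written.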
> On 28 Jan 2021, at 12:29, Marc Roos <Marc(a)f1-outsourcing.eu> wrote:
>
> Hi George,
>
> Sorry for asking, maybe I skipped an email, but what eventually caused the 'incorrect' report on available storage?
>
>
>
Hi Everyone,
I have also seen this inconsistent PG with empty output when you do list-inconsistent-obj:
$ sudo ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent; 1
pgs not deep-scrubbed in time
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 17.7ff is active+clean+inconsistent, acting [232,242,34,280,266,21]
PG_NOT_DEEP_SCRUBBED 1 pgs not deep-scrubbed in time
pg 17.1c2 not deep-scrubbed since 2021-01-15 02:46:16.271811
$ sudo rados list-inconsistent-obj 17.7ff --format=json-pretty
{
    "epoch": 183807,
    "inconsistents": []
}
Usually these are caused by read errors on the disks, but I've checked
all the hosts with OSDs in this PG and there are no SMART or dmesg
errors.
Rich
------------------------------
>
> Date: Sun, 17 Jan 2021 14:00:01 +0330
> From: Seena Fallah <seenafallah(a)gmail.com>
> Subject: [ceph-users] Re: PG inconsistent with empty inconsistent
> objects
> To: "Alexander E. Patrakov" <patrakov(a)gmail.com>
> Cc: ceph-users <ceph-users(a)ceph.io>
> Message-ID:
> <CAK3+OmXvdC_x2R-Kox-ui4K3oSDvXh4o8ZeqYbztBUmqMYEAZw(a)mail.gmail.com>
> Content-Type: text/plain; charset="UTF-8"
>
> It was a long time ago and I don't have the `ceph health detail` output!
>
> On Sat, Jan 16, 2021 at 9:42 PM Alexander E. Patrakov <patrakov(a)gmail.com>
> wrote:
>
> > For a start, please post the "ceph health detail" output.
> >
> > Sat, 19 Dec 2020 at 23:48, Seena Fallah <seenafallah(a)gmail.com>:
> > >
> > > Hi,
> > >
> > > I'm facing something strange! One of the PGs in my pool got inconsistent
> > > and when I run `rados list-inconsistent-obj $PG_ID --format=json-pretty`
> > > the `inconsistents` key was empty! What is this? Is it a bug in Ceph
> > or..?
> > >
> > > Thanks.
> > > _______________________________________________
> > > ceph-users mailing list -- ceph-users(a)ceph.io
> > > To unsubscribe send an email to ceph-users-leave(a)ceph.io
> >
> >
> >
> > --
> > Alexander E. Patrakov
> > CV: http://u.pc.cd/wT8otalK
> >
>
I upgraded our ceph cluster (6 bare metal nodes, 3 rgw VMs) from v13.2.4 to
v15.2.8. The mon, mgr, mds and osd daemons were all upgraded successfully,
everything looked good.
After the radosgw daemons were upgraded, they refused to work; the log
messages are at the end of this e-mail.
Here are the things I tried:
1. I moved aside the pools for the rgw service, started from scratch
(creating realm, zonegroup, zone, users), but when I tried to run
'radosgw-admin user create ...', it appeared to be stuck and never
returned; other commands like 'radosgw-admin period update --commit' also
got stuck.
2. I rolled back radosgw to the old version v13.2.4, then everything worked
great again.
What am I missing here? Is there anything extra that needs to be done for
rgw after upgrading from mimic to octopus?
Please kindly help. Thanks.
---------------------
2021-01-24T09:24:10.192-0500 7f638f79f9c0 0 deferred set uid:gid to
64045:64045 (ceph:ceph)
2021-01-24T09:24:10.192-0500 7f638f79f9c0 0 ceph version 15.2.8
(bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable), process
radosgw, pid 898
2021-01-24T09:24:10.192-0500 7f638f79f9c0 0 framework: civetweb
2021-01-24T09:24:10.192-0500 7f638f79f9c0 0 framework conf key: port, val:
80
2021-01-24T09:24:10.192-0500 7f638f79f9c0 0 framework conf key:
num_threads, val: 1024
2021-01-24T09:24:10.192-0500 7f638f79f9c0 0 framework conf key:
request_timeout_ms, val: 50000
2021-01-24T09:24:10.192-0500 7f638f79f9c0 1 radosgw_Main not setting numa
affinity
2021-01-24T09:29:10.195-0500 7f638cbcd700 -1 Initialization timeout, failed
to initialize
2021-01-24T09:29:10.367-0500 7f4c213ba9c0 0 deferred set uid:gid to
64045:64045 (ceph:ceph)
2021-01-24T09:29:10.367-0500 7f4c213ba9c0 0 ceph version 15.2.8
(bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable), process
radosgw, pid 1541
2021-01-24T09:29:10.367-0500 7f4c213ba9c0 0 framework: civetweb
2021-01-24T09:29:10.367-0500 7f4c213ba9c0 0 framework conf key: port, val:
80
2021-01-24T09:29:10.367-0500 7f4c213ba9c0 0 framework conf key:
num_threads, val: 1024
2021-01-24T09:29:10.367-0500 7f4c213ba9c0 0 framework conf key:
request_timeout_ms, val: 50000
2021-01-24T09:29:10.367-0500 7f4c213ba9c0 1 radosgw_Main not setting numa
affinity
2021-01-24T09:29:25.883-0500 7f4c213ba9c0 1 robust_notify: If at first you
don't succeed: (110) Connection timed out
2021-01-24T09:29:25.883-0500 7f4c213ba9c0 0 ERROR: failed to distribute
cache for coredumps.rgw.log:meta.history
2021-01-24T09:32:27.754-0500 7fcdac2bf9c0 0 deferred set uid:gid to
64045:64045 (ceph:ceph)
2021-01-24T09:32:27.754-0500 7fcdac2bf9c0 0 ceph version 15.2.8
(bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable), process
radosgw, pid 978
2021-01-24T09:32:27.758-0500 7fcdac2bf9c0 0 framework: civetweb
2021-01-24T09:32:27.758-0500 7fcdac2bf9c0 0 framework conf key: port, val:
80
2021-01-24T09:32:27.758-0500 7fcdac2bf9c0 0 framework conf key:
num_threads, val: 1024
2021-01-24T09:32:27.758-0500 7fcdac2bf9c0 0 framework conf key:
request_timeout_ms, val: 50000
2021-01-24T09:32:27.758-0500 7fcdac2bf9c0 1 radosgw_Main not setting numa
affinity
2021-01-24T09:32:44.719-0500 7fcdac2bf9c0 1 robust_notify: If at first you
don't succeed: (110) Connection timed out
2021-01-24T09:32:44.719-0500 7fcdac2bf9c0 0 ERROR: failed to distribute
cache for coredumps.rgw.log:meta.history
Nope!
On 27/01/2021 at 17:40, Anthony D'Atri wrote:
> Do you have any override reweights set to values less than 1.0?
>
> The REWEIGHT column when you run `ceph osd df`
>
>> On Jan 27, 2021, at 8:15 AM, Francois Legrand <fleg(a)lpnhe.in2p3.fr> wrote:
>>
>> Hi all,
>> I have a cluster with 116 disks (24 new disks of 16TB added in December and the rest of 8TB) running nautilus 14.2.16.
>> I moved (8 months ago) from crush_compat to upmap balancing.
>> But the cluster does not seem well balanced, with the number of pgs on the 8TB disks varying from 26 to 52! And utilization from 35 to 69%.
>> The recent 16TB disks are more homogeneous, with 48 to 61 pgs and usage between 30 and 43%.
>> Last week, I realized that some OSDs were maybe not using upmap because I did a ceph osd crush weight-set ls and got (compat) as the result.
>> Thus I ran a ceph osd crush weight-set rm-compat, which triggered some rebalancing. Now there has been no recovery for 2 days, but the cluster is still unbalanced.
>> As far as I understand, upmap is supposed to reach an equal number of pgs on all the disks (I guess weighted by their capacity).
>> Thus I would expect more or less 30 pgs on the 8TB disks and 60 on the 16TB, and around 50% usage on all, which is not the case (by far).
>> The problem is that it impacts the free available space in the pools (264Ti while there is more than 578Ti free in the cluster) because free space seems to be based on the space available before the first OSD becomes full!
>> Is this normal? Did I miss something? What could I do?
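One way to quantify that imbalance is a small sketch like the following (the field names are assumed from the nautilus 'ceph osd df -f json' output and may need adjusting):

import json, subprocess

nodes = json.loads(subprocess.check_output(["ceph", "osd", "df", "-f", "json"]))["nodes"]

rows = []
for n in nodes:
    if n.get("crush_weight", 0) > 0:
        # PGs per unit of CRUSH weight should be roughly equal across OSDs
        # once the upmap balancer has converged.
        rows.append((n["pgs"] / n["crush_weight"], n["name"], n["pgs"], n["utilization"]))

rows.sort()
print("least loaded:", rows[0])
print("most loaded: ", rows[-1])
print(f"PG-per-weight spread: x{rows[-1][0] / rows[0][0]:.2f} (ideally close to 1.0)")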
>>
>> F.
>> _______________________________________________
>> ceph-users mailing list -- ceph-users(a)ceph.io
>> To unsubscribe send an email to ceph-users-leave(a)ceph.io