Hi, I'm hoping someone could help me get to the bottom of a particular issue I'm having.
I have Ceph Octopus installed using ceph-ansible.
I have 3 MDS servers running and one client connected to the active MDS. I'm storing a very large encrypted container (8 TB) on the CephFS file system, and I'm writing data into it from the client host.
Recently I have noticed a severe impact on performance: the time taken to process a file within the container has increased from 1 minute to 11 minutes.
In the Ceph dashboard, when I look at the Performance tab on the file system page, the Write Ops are increasing rapidly over time.
Around the 22nd of April I had 49 Write Ops on the performance page for the MDS daemons. This is now at 266467 Write Ops and still increasing.
The Client Requests figure has also gone from 14 to 67 to 117 and is now at 283.
Would someone be able to help me make sense of why performance has decreased, and what is going on with the client requests and write operations?
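In case it is useful for the diagnosis: the numbers above come from the dashboard, but I believe the same counters can be read directly on the MDS host via the admin socket (the daemon name below is just a placeholder for my active MDS):

# ceph fs status
# ceph daemon mds.<active-mds-name> perf dump

I am assuming the dashboard's "Write Ops" and "Client Requests" graphs are derived from these MDS perf counters; please correct me if that is wrong.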
Kind regards,
Kyle
Hi everyone,
Today is the last day to get your proposal in for the Ceph Month in June
event! The types of talks include:
* Lightning talk - 5 minutes
* Presentation - 20 minutes with Q&A
* Unconference (BoF) - 40 minutes
We will confirm dates/times with speakers by May 16th.
https://ceph.io/events/ceph-month-june-2021/cfp
On Wed, Apr 21, 2021 at 6:30 AM Mike Perez <thingee(a)redhat.com> wrote:
>
> Hi everyone,
>
> We're looking for presentations, lightning talks, and BoFs to schedule
> for Ceph Month in June 2021. Please submit your proposals before May
> 12th:
>
> https://ceph.io/events/ceph-month-june-2021/cfp
>
> On Wed, Apr 14, 2021 at 12:35 PM Mike Perez <thingee(a)redhat.com> wrote:
> >
> > Hi everyone,
> >
> > In June 2021, we're hosting a month of Ceph presentations, lightning
> > talks, and unconference sessions such as BOFs. There is no
> > registration or cost to attend this event.
> >
> > The CFP is now open until May 12th.
> >
> > https://ceph.io/events/ceph-month-june-2021/cfp
> >
> > Speakers will receive confirmation that their presentation is accepted
> > and further instructions for scheduling by May 16th.
> >
> > The schedule will be available on May 19th.
> >
> > Join the Ceph community as we discuss how Ceph, the massively
> > scalable, open-source, software-defined storage system, can radically
> > improve the economics and management of data storage for your
> > enterprise.
> >
> > --
> > Mike Perez
Hello,
I'm trying to deploy my test Ceph cluster and enable stretch mode
(https://docs.ceph.com/en/latest/rados/operations/stretch-mode/). My problem
is with enabling stretch mode itself.
----------------------------------------------------
$ ceph mon enable_stretch_mode ceph-node-05 stretch_rule datacenter
Error EINVAL: Could not find location entry for datacenter on monitor ceph-node-05
----------------------------------------------------
ceph-node-05 is the tiebreaker monitor.
I also tried creating a third datacenter and putting the tiebreaker there, but got
the following error:
----------------------------------------------------
root@ceph-node-01:/home/clouduser# ceph mon enable_stretch_mode ceph-node-05 stretch_rule datacenter
Error EINVAL: there are 3datacenter's in the cluster but stretch mode currently only works with 2!
----------------------------------------------------
Some additional info:
----------------------------------------------------
Setup method: cephadm (https://docs.ceph.com/en/latest/cephadm/install/)
# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.03998 root default
-11 0.01999 datacenter site1
-5 0.00999 host ceph-node-01
0 hdd 0.00999 osd.0 up 1.00000 1.00000
-3 0.00999 host ceph-node-02
1 hdd 0.00999 osd.1 up 1.00000 1.00000
-12 0.01999 datacenter site2
-9 0.00999 host ceph-node-03
3 hdd 0.00999 osd.3 up 1.00000 1.00000
-7 0.00999 host ceph-node-04
2 hdd 0.00999 osd.2 up 1.00000 1.00000
The stretch_rule has been added to the CRUSH map (a rough sketch of the rule is below).
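For reference, the rule I compiled is roughly along these lines (a sketch from memory; the id and exact steps in my actual map may differ):

rule stretch_rule {
        id 1
        type replicated
        step take default
        step choose firstn 0 type datacenter
        step chooseleaf firstn 2 type host
        step emit
}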
# ceph mon set_location ceph-node-01 datacenter=site1
# ceph mon set_location ceph-node-02 datacenter=site1
# ceph mon set_location ceph-node-03 datacenter=site2
# ceph mon set_location ceph-node-04 datacenter=site2
# ceph versions
{
    "mon": {
        "ceph version 16.2.1 (afb9061ab4117f798c858c741efa6390e48ccf10) pacific (stable)": 5
    },
    "mgr": {
        "ceph version 16.2.1 (afb9061ab4117f798c858c741efa6390e48ccf10) pacific (stable)": 2
    },
    "osd": {
        "ceph version 16.2.1 (afb9061ab4117f798c858c741efa6390e48ccf10) pacific (stable)": 4
    },
    "mds": {},
    "overall": {
        "ceph version 16.2.1 (afb9061ab4117f798c858c741efa6390e48ccf10) pacific (stable)": 11
    }
}
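One more data point: the mon location map above only covers four of the five monitors. If the tiebreaker also needs its own location entry before enable_stretch_mode is run, the missing step might be something like the following (the bucket name "site3" is just an assumption on my part, and as far as I understand it should not also be created as a datacenter bucket in the CRUSH map):

# ceph mon set_location ceph-node-05 datacenter=site3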
Thank you for your support.
--
Best regards,
There will be a DocuBetter Meeting held on 12 May 2021 at 1730 UTC.
This is the monthly DocuBetter Meeting that is more convenient for
European and North American Ceph contributors than the other meeting,
which is convenient for people in Australia and Asia (and which is very
rarely attended).
At this meeting I plan to discuss the continuing cleanup of the cephadm
documentation, as well as an ambitious plan, of virtually Alexandrian hubris,
to create a roughly 10-page Ceph Overview document (a long-term, tedious
project that will involve a dozen people, so don't get too excited about it).
Bring your docs complaints and requests to this meeting.
Meeting: https://bluejeans.com/908675367
Etherpad: https://pad.ceph.com/p/Ceph_Documentation
Good call. I just restarted the whole cluster, but the problem still persists.
I don't think it is a problem with RADOS itself, but with radosgw.
But I am still struggling to pin down the issue.
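If it helps, my next idea (just a guess on my side) is to re-run the slow command with verbose logging to see where the time is spent, e.g.:

[root@s3db1 ~]# time radosgw-admin user create --uid test-bb-user2 --display-name=test-bb-user2 --debug-rgw=20 --debug-ms=1

The --debug-rgw/--debug-ms overrides and the test uid are assumptions on my part; the interesting part would be what happens around the "robust_notify" / "failed to distribute cache" messages.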
On Tue, 11 May 2021 at 10:45, Thomas Schneider <Thomas.Schneider-q2p(a)ruhr-uni-bochum.de> wrote:
> Hey all,
>
> we had slow RGW access when some OSDs were slow due to an OSD bug unknown to
> us that made PG access either slow or impossible. (It showed itself through
> slowness of the mgr as well, but nothing beyond that.)
> We restarted all OSDs that held RGW data and the problem was gone.
> I have no good way to debug the problem since it never occurred again after
> we restarted the OSDs.
>
> Kind regards,
> Thomas
>
>
> On 11 May 2021 08:47:06 CEST, Boris Behrens <bb(a)kervyn.de> wrote:
> >Hi Amit,
> >
> >I just pinged the mons from every system and they are all available.
> >
> >On Mon, 10 May 2021 at 21:18, Amit Ghadge <amitg.b14(a)gmail.com> wrote:
> >
> >> We have seen slowness when one of the mgr services was unreachable; your case
> >> may be different. You can check the monmap / the mon entries in ceph.conf and
> >> then verify that all nodes can be pinged successfully.
> >>
> >>
> >> -AmitG
> >>
> >>
> >> On Tue, 11 May 2021 at 12:12 AM, Boris Behrens <bb(a)kervyn.de> wrote:
> >>
> >>> Hi guys,
> >>>
> >>> Does anyone have any idea?
> >>>
> >>> On Wed, 5 May 2021 at 16:16, Boris Behrens <bb(a)kervyn.de> wrote:
> >>>
> >>> > Hi,
> >>> > For the past couple of days we have been experiencing strange slowness on
> >>> > some radosgw-admin operations.
> >>> > What is the best way to debug this?
> >>> >
> >>> > For example creating a user takes over 20s.
> >>> > [root@s3db1 ~]# time radosgw-admin user create --uid test-bb-user
> >>> > --display-name=test-bb-user
> >>> > 2021-05-05 14:08:14.297 7f6942286840 1 robust_notify: If at first you don't succeed: (110) Connection timed out
> >>> > 2021-05-05 14:08:14.297 7f6942286840 0 ERROR: failed to distribute cache for eu-central-1.rgw.users.uid:test-bb-user
> >>> > 2021-05-05 14:08:24.335 7f6942286840 1 robust_notify: If at first you don't succeed: (110) Connection timed out
> >>> > 2021-05-05 14:08:24.335 7f6942286840 0 ERROR: failed to distribute cache for eu-central-1.rgw.users.keys:****
> >>> > {
> >>> > "user_id": "test-bb-user",
> >>> > "display_name": "test-bb-user",
> >>> > ....
> >>> > }
> >>> > real 0m20.557s
> >>> > user 0m0.087s
> >>> > sys 0m0.030s
> >>> >
> >>> > First I thought that rados operations might be slow, but adding and
> >>> > deleting objects in RADOS is as fast as usual (at least from my perspective).
> >>> > Also uploading to buckets is fine.
> >>> >
> >>> > We changed some things and I think it might have to do with this:
> >>> > * We have a HAProxy that distributes via leastconn between the 3
> >>> > radosgw's (this did not change)
> >>> > * We had the same daemon name "eu-central-1" running three times (on the
> >>> > 3 radosgw's)
> >>> > * Because this might have led to our data duplication problem, we have
> >>> > split that up, so now the daemons are named per host (eu-central-1-s3db1,
> >>> > eu-central-1-s3db2, eu-central-1-s3db3)
> >>> > * We also added dedicated rgw daemons for garbage collection, because the
> >>> > current ones were not able to keep up.
> >>> > * So basically ceph status went from "rgw: 1 daemon active (eu-central-1)"
> >>> > to "rgw: 14 daemons active (eu-central-1-s3db1, eu-central-1-s3db2,
> >>> > eu-central-1-s3db3, gc-s3db12, gc-s3db13...)
> >>> >
> >>> >
> >>> > Cheers
> >>> > Boris
> >>> >
> >>>
> >>>
> >>> --
> >>> The self-help group "UTF-8 problems" will meet in the large hall this
> >>> time, as an exception.
> >>> _______________________________________________
> >>> ceph-users mailing list -- ceph-users(a)ceph.io
> >>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
> >>>
> >>
> >
>
> --
> Thomas Schneider
> IT.SERVICES
> Wissenschaftliche Informationsversorgung Ruhr-Universität Bochum | 44780
> Bochum
> Telefon: +49 234 32 23939
> http://www.it-services.rub.de/
>
--
The self-help group "UTF-8 problems" will meet in the large hall this time, as an exception.
Hi all,
I would like to "pair" MonSession with TCP connection to get real process, which is using that session. I need it to identify processes with
old ceph features.
MonSession looks like
MonSession(client.84324148 [..IP...]:0/3096235764 is open allow *, features 0x27018fb86aa42ada (jewel))
What do client.NUMBER and 0/3096235764 mean?
How can I match client.NUMBER, or that /NUMBER, to a particular TCP session? I have many processes on that server (on that IP) with different
features.
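For context, I can already dump the sessions on the monitor via the admin socket (the mon name below is just a placeholder), but I still don't see how to map an entry to a local process:

# ceph daemon mon.<mon-id> sessions

Is the number after the slash a per-connection nonce that could somehow be correlated with a TCP port, or is it unrelated?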
Thank you
--
============
Ing. Jan Pekař
jan.pekar(a)imatic.cz
----
Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz | +420326555326
============
--
On Mon, May 3, 2021 at 12:24 PM Magnus Harlander <magnus(a)harlan.de> wrote:
>
> On 03.05.21 at 11:22, Ilya Dryomov wrote:
>
> There is a 6th osd directory on both machines, but it's empty
>
> [root@s0 osd]# ll
> total 0
> drwxrwxrwt. 2 ceph ceph 200 2. Mai 16:31 ceph-1
> drwxrwxrwt. 2 ceph ceph 200 2. Mai 16:31 ceph-3
> drwxrwxrwt. 2 ceph ceph 200 2. Mai 16:31 ceph-4
> drwxrwxrwt. 2 ceph ceph 200 2. Mai 16:31 ceph-5
> drwxr-xr-x. 2 ceph ceph 6 3. Apr 19:50 ceph-8 <===
> drwxrwxrwt. 2 ceph ceph 200 2. Mai 16:31 ceph-9
> [root@s0 osd]# pwd
> /var/lib/ceph/osd
>
> [root@s1 osd]# ll
> total 0
> drwxrwxrwt 2 ceph ceph 200 May 2 15:39 ceph-0
> drwxr-xr-x. 2 ceph ceph 6 Mar 13 17:54 ceph-1 <===
> drwxrwxrwt 2 ceph ceph 200 May 2 15:39 ceph-2
> drwxrwxrwt 2 ceph ceph 200 May 2 15:39 ceph-6
> drwxrwxrwt 2 ceph ceph 200 May 2 15:39 ceph-7
> drwxrwxrwt 2 ceph ceph 200 May 2 15:39 ceph-8
> [root@s1 osd]# pwd
> /var/lib/ceph/osd
>
> The bogus directories are empty here, and on the other machine the same
> directory name belongs to a real OSD!
>
> How can that be?
>
> Should I remove them and restart ceph.target?
I don't think empty directories matter at this point. You may not have
had 12 OSDs at any point in time, but the max_osd value appears to have
gotten bumped when you were replacing those disks.
Note that max_osd being greater than the number of OSDs is not a big
problem by itself. The osdmap is going to be larger and require more
memory but that's it. You can test by setting it back to 12 and trying
to mount -- it should work. The issue is specific to how those OSDs
were replaced -- something went wrong and the osdmap somehow ended up
with rather bogus addrvec entries. Not sure if it's ceph-deploy's
fault, something weird in ceph.conf (back then), or an actual Ceph bug.
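(For completeness, the command I have in mind for that test is "ceph osd setmaxosd 12"; treat this as a suggestion and double-check it against your cluster before running it.)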
Thanks,
Ilya
Hi,
I'm thinking of using 2:2 so I can tolerate the loss of 2 hosts, but if I just want to tolerate the loss of 1 host, which one is better, 3:2 or 4:1?
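For context, the profiles I am comparing would be created roughly like this (the profile names and the host failure domain are my assumptions):

ceph osd erasure-code-profile set ec-3-2 k=3 m=2 crush-failure-domain=host
ceph osd erasure-code-profile set ec-4-1 k=4 m=1 crush-failure-domain=host

My understanding is that m is the number of coding chunks, so with a host failure domain k=3/m=2 should survive two host losses and k=4/m=1 only one -- please correct me if that is wrong.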
Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo(a)agoda.com<mailto:istvan.szabo@agoda.com>
---------------------------------------------------
________________________________
Hi Amit,
it is the same physical interface but different VLANs. I checked all IP
addresses on all systems and everything is directly connected, without any
gateway hops.
On Tue, 11 May 2021 at 10:59, Amit Ghadge <amitg.b14(a)gmail.com> wrote:
> Are you using a single network interface for both the public and cluster networks?
>
> On Tue, May 11, 2021 at 2:15 PM Thomas Schneider <
> Thomas.Schneider-q2p(a)ruhr-uni-bochum.de> wrote:
>
>> Hey all,
>>
>> we had slow RGW access when some OSDs were slow due to an OSD bug unknown to
>> us that made PG access either slow or impossible. (It showed itself through
>> slowness of the mgr as well, but nothing beyond that.)
>> We restarted all OSDs that held RGW data and the problem was gone.
>> I have no good way to debug the problem since it never occurred again
>> after we restarted the OSDs.
>>
>> Kind regards,
>> Thomas
>>
>>
>> On 11 May 2021 08:47:06 CEST, Boris Behrens <bb(a)kervyn.de> wrote:
>> >Hi Amit,
>> >
>> >I just pinged the mons from every system and they are all available.
>> >
>> >On Mon, 10 May 2021 at 21:18, Amit Ghadge <amitg.b14(a)gmail.com> wrote:
>> >
>> >> We have seen slowness when one of the mgr services was unreachable; your case
>> >> may be different. You can check the monmap / the mon entries in ceph.conf and
>> >> then verify that all nodes can be pinged successfully.
>> >>
>> >>
>> >> -AmitG
>> >>
>> >>
>> >> On Tue, 11 May 2021 at 12:12 AM, Boris Behrens <bb(a)kervyn.de> wrote:
>> >>
>> >>> Hi guys,
>> >>>
>> >>> Does anyone have any idea?
>> >>>
>> >>> On Wed, 5 May 2021 at 16:16, Boris Behrens <bb(a)kervyn.de> wrote:
>> >>>
>> >>> > Hi,
>> >>> > For the past couple of days we have been experiencing strange slowness on
>> >>> > some radosgw-admin operations.
>> >>> > What is the best way to debug this?
>> >>> >
>> >>> > For example creating a user takes over 20s.
>> >>> > [root@s3db1 ~]# time radosgw-admin user create --uid test-bb-user
>> >>> > --display-name=test-bb-user
>> >>> > 2021-05-05 14:08:14.297 7f6942286840 1 robust_notify: If at first you don't succeed: (110) Connection timed out
>> >>> > 2021-05-05 14:08:14.297 7f6942286840 0 ERROR: failed to distribute cache for eu-central-1.rgw.users.uid:test-bb-user
>> >>> > 2021-05-05 14:08:24.335 7f6942286840 1 robust_notify: If at first you don't succeed: (110) Connection timed out
>> >>> > 2021-05-05 14:08:24.335 7f6942286840 0 ERROR: failed to distribute cache for eu-central-1.rgw.users.keys:****
>> >>> > {
>> >>> > "user_id": "test-bb-user",
>> >>> > "display_name": "test-bb-user",
>> >>> > ....
>> >>> > }
>> >>> > real 0m20.557s
>> >>> > user 0m0.087s
>> >>> > sys 0m0.030s
>> >>> >
>> >>> > First I thought that rados operations might be slow, but adding and
>> >>> > deleting objects in RADOS is as fast as usual (at least from my perspective).
>> >>> > Also uploading to buckets is fine.
>> >>> >
>> >>> > We changed some things and I think it might have to do with this:
>> >>> > * We have a HAProxy that distributes via leastconn between the 3
>> >>> > radosgw's (this did not change)
>> >>> > * We had the same daemon name "eu-central-1" running three times (on the
>> >>> > 3 radosgw's)
>> >>> > * Because this might have led to our data duplication problem, we have
>> >>> > split that up, so now the daemons are named per host (eu-central-1-s3db1,
>> >>> > eu-central-1-s3db2, eu-central-1-s3db3)
>> >>> > * We also added dedicated rgw daemons for garbage collection, because the
>> >>> > current ones were not able to keep up.
>> >>> > * So basically ceph status went from "rgw: 1 daemon active (eu-central-1)"
>> >>> > to "rgw: 14 daemons active (eu-central-1-s3db1, eu-central-1-s3db2,
>> >>> > eu-central-1-s3db3, gc-s3db12, gc-s3db13...)
>> >>> >
>> >>> >
>> >>> > Cheers
>> >>> > Boris
>> >>> >
>> >>>
>> >>>
>> >>> --
>> >>> The self-help group "UTF-8 problems" will meet in the large hall this
>> >>> time, as an exception.
>> >>> _______________________________________________
>> >>> ceph-users mailing list -- ceph-users(a)ceph.io
>> >>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>> >>>
>> >>
>> >
>>
>> --
>> Thomas Schneider
>> IT.SERVICES
>> Wissenschaftliche Informationsversorgung Ruhr-Universität Bochum | 44780
>> Bochum
>> Telefon: +49 234 32 23939
>> http://www.it-services.rub.de/
>>
>
--
The self-help group "UTF-8 problems" will meet in the large hall this time, as an exception.