May 2021 - ceph-users - lists.ceph.io

Write Ops on CephFS Increasing exponentially

by Kyle Dean

Hi, hoping someone could help me get to the bottom of this particular issue I'm having. I have ceph octopus installed using ceph-ansible. Currently, I have 3 MDS servers running, and one client connected to the active MDS. I'm currently storing a very large encrypted container on the CephFS file system, 8TB worth, and I'm writing data into it from the client host. recently I have noticed a severe impact on performance, and the time take to do processing on file within the container has increased from 1 minute to 11 minutes. in the ceph dashboard, when I take a look at the performance tab on the file system page, the Write Ops are increasing exponentially over time. At the end of April around the 22nd I had 49 write Ops on the performance page for the MDS deamons. This is now at 266467 Write Ops and increasing. Also the client requests have gone from 14 to 67 to 117 and is now at 283 would someone be able to help me make sense of why the performance has decreased and what is going on with the client requests and write operations. Kind regards, kyle

2 years, 11 months

2
2
0 0

Re: Ceph Month June 2021 Event

by Mike Perez

Hi everyone, Today is the last day to get your proposal in for the Ceph June Month event! The types of talks include: * Lightning talk - 5 minutes * Presentation - 20 minutes with q/a * Unconference (Bof) - 40 minutes We will be confirming with speakers for the date/time by May 16th. https://ceph.io/events/ceph-month-june-2021/cfp On Wed, Apr 21, 2021 at 6:30 AM Mike Perez <thingee(a)redhat.com> wrote: > > Hi everyone, > > We're looking for presentations, lightning talks, and BoFs to schedule > for Ceph Month in June 2021. Please submit your proposals before May > 12th: > > https://ceph.io/events/ceph-month-june-2021/cfp > > On Wed, Apr 14, 2021 at 12:35 PM Mike Perez <thingee(a)redhat.com> wrote: > > > > Hi everyone, > > > > In June 2021, we're hosting a month of Ceph presentations, lightning > > talks, and unconference sessions such as BOFs. There is no > > registration or cost to attend this event. > > > > The CFP is now open until May 12th. > > > > https://ceph.io/events/ceph-month-june-2021/cfp > > > > Speakers will receive confirmation that their presentation is accepted > > and further instructions for scheduling by May 16th. > > > > The schedule will be available on May 19th. > > > > Join the Ceph community as we discuss how Ceph, the massively > > scalable, open-source, software-defined storage system, can radically > > improve the economics and management of data storage for your > > enterprise. > > > > -- > > Mike Perez

2 years, 11 months

1
0
0 0

Ceph stretch mode enabling

by Felix O

Hello, I'm trying to deploy my test ceph cluster and enable stretch mode ( https://docs.ceph.com/en/latest/rados/operations/stretch-mode/). My problem is enabling the stretch mode. ---------------------------------------------------- $ ceph mon enable_stretch_mode ceph-node-05 stretch_rule datacenter Error EINVAL: Could not find location entry for datacenter on monitor ceph-node-05 ---------------------------------------------------- ceph-node-5 is the tiebreaker monitor I tried to create the third datacenter and put the tiebreaker there but got the following error: ---------------------------------------------------- root@ceph-node-01:/home/clouduser# ceph mon enable_stretch_mode ceph-node-05 stretch_rule datacenter Error EINVAL: there are 3datacenter's in the cluster but stretch mode currently only works with 2! ---------------------------------------------------- An additional info: ---------------------------------------------------- Setup method: cephadm (https://docs.ceph.com/en/latest/cephadm/install/) # ceph osd tree ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF -1 0.03998 root default -11 0.01999 datacenter site1 -5 0.00999 host ceph-node-01 0 hdd 0.00999 osd.0 up 1.00000 1.00000 -3 0.00999 host ceph-node-02 1 hdd 0.00999 osd.1 up 1.00000 1.00000 -12 0.01999 datacenter site2 -9 0.00999 host ceph-node-03 3 hdd 0.00999 osd.3 up 1.00000 1.00000 -7 0.00999 host ceph-node-04 2 hdd 0.00999 osd.2 up 1.00000 1.00000 stretch_rule is added to the crush # ceph mon set_location ceph-node-01 datacenter=site1 # ceph mon set_location ceph-node-02 datacenter=site1 # ceph mon set_location ceph-node-03 datacenter=site2 # ceph mon set_location ceph-node-04 datacenter=site2 # ceph versions { "mon": { "ceph version 16.2.1 (afb9061ab4117f798c858c741efa6390e48ccf10) pacific (stable)": 5 }, "mgr": { "ceph version 16.2.1 (afb9061ab4117f798c858c741efa6390e48ccf10) pacific (stable)": 2 }, "osd": { "ceph version 16.2.1 (afb9061ab4117f798c858c741efa6390e48ccf10) pacific (stable)": 4 }, "mds": {}, "overall": { "ceph version 16.2.1 (afb9061ab4117f798c858c741efa6390e48ccf10) pacific (stable)": 11 } } Thank you for your support. -- Best regards,

2 years, 11 months

2
1
0 0

cephfs mds issues

by Mazzystr

I did a simple os update and reboot. Now mds is stuck in replay. I'm running octapus debug mds = 20 shows some pretty lame logs # tail -f ceph-mds.bridge.log 2021-05-11T18:24:04.859-0700 7f41314a1700 20 mds.0.cache upkeep thread waiting interval 1s 2021-05-11T18:24:05.860-0700 7f41314a1700 10 mds.0.cache cache not ready for trimming 2021-05-11T18:24:05.860-0700 7f41314a1700 20 mds.0.cache upkeep thread waiting interval 1s 2021-05-11T18:24:06.859-0700 7f4133ca6700 20 mds.0.2898629 get_task_status 2021-05-11T18:24:06.859-0700 7f4133ca6700 20 mds.0.2898629 send_task_status: updating 1 status keys 2021-05-11T18:24:06.859-0700 7f4133ca6700 20 mds.0.2898629 schedule_update_timer_task 2021-05-11T18:24:06.859-0700 7f41314a1700 10 mds.0.cache cache not ready for trimming 2021-05-11T18:24:06.859-0700 7f41314a1700 20 mds.0.cache upkeep thread waiting interval 1s 2021-05-11T18:24:07.859-0700 7f41314a1700 10 mds.0.cache cache not ready for trimming 2021-05-11T18:24:07.859-0700 7f41314a1700 20 mds.0.cache upkeep thread waiting interval 1s # cephfs-journal-tool event recover_dentries summary gets stuck on an object and stays stuck. I tried to run rados -p cephfs_metadata_pool rmomapkey per https://tracker.ceph.com/issues/38452 but the cmd ran for hours and never completes. # cephfs-journal-tool --rank cephfs:0 journal reset 2021-05-11T18:31:26.860-0700 7f2e9c2a9700 -1 NetHandler create_socket couldn't create socket (97) Address family not supported by protocol 2021-05-11T18:31:26.860-0700 7f2f2989ba80 4 waiting for MDS map... 2021-05-11T18:31:26.860-0700 7f2f2989ba80 4 Got MDS map 2898629 2021-05-11T18:31:26.861-0700 7f2f2989ba80 10 main: JournalTool::main 2021-05-11T18:31:26.861-0700 7f2f2989ba80 4 main: JournalTool: connecting to RADOS... 2021-05-11T18:31:26.863-0700 7f2f2989ba80 4 main: JournalTool: resolving pool 1 2021-05-11T18:31:26.863-0700 7f2f2989ba80 4 main: JournalTool: creating IoCtx.. 2021-05-11T18:31:26.863-0700 7f2f2989ba80 4 main: Executing for rank 0 2021-05-11T18:31:26.864-0700 7f2edc2aa700 -1 NetHandler create_socket couldn't create socket (97) Address family not supported by protocol 2021-05-11T18:31:26.864-0700 7f2f2989ba80 4 waiting for MDS map... 2021-05-11T18:31:26.865-0700 7f2f2989ba80 4 Got MDS map 2898629 2021-05-11T18:31:26.865-0700 7f2f2989ba80 4 client.2024650.journalpointer Reading journal pointer '400.00000000' 2021-05-11T18:31:26.865-0700 7f2f2989ba80 1 client.2024650.journaler.resetter(ro) recover start 2021-05-11T18:31:26.865-0700 7f2f2989ba80 1 client.2024650.journaler.resetter(ro) read_head 2021-05-11T18:31:26.865-0700 7f291c293700 1 client.2024650.journaler.resetter(ro) _finish_read_head loghead(trim 14172553216, expire 14174788378, write 14400838791, stream_format 1). probing for end of log (from 14400838791)... 2021-05-11T18:31:26.865-0700 7f291c293700 1 client.2024650.journaler.resetter(ro) probing for end of the log I've been stuck here for hours # strace -f -p 10357 [pid 10360] <... sendmsg resumed>) = 9 [pid 10361] read(14, <unfinished ...> [pid 10360] epoll_wait(7, <unfinished ...> [pid 10361] <... read resumed>0x55e95d982000, 4096) = -1 EAGAIN (Resource temporarily unavailable) [pid 10360] <... epoll_wait resumed>[{EPOLLIN, {u32=16, u64=16}}, {EPOLLIN, {u32=18, u64=18}}], 5000, 30000) = 2 [pid 10361] epoll_wait(10, <unfinished ...> [pid 10360] read(16, "\23\1\10\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\354^\340;"..., 4096) = 57 [pid 10360] read(16, 0x55e95d9a8000, 4096) = -1 EAGAIN (Resource temporarily unavailable) [pid 10360] read(18, "\17\264R\233`\327\275\222+", 4096) = 9 [pid 10360] read(18, 0x55e95d9f4000, 4096) = -1 EAGAIN (Resource temporarily unavailable) [pid 10360] epoll_wait(7, ^X <unfinished ...> [pid 10370] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out) [pid 10381] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out) [pid 10370] clock_gettime(CLOCK_REALTIME, <unfinished ...> [pid 10389] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out) [pid 10381] clock_gettime(CLOCK_REALTIME, <unfinished ...> [pid 10370] <... clock_gettime resumed>{tv_sec=1620791989, tv_nsec=731038214}) = 0 [pid 10389] clock_gettime(CLOCK_REALTIME, <unfinished ...> [pid 10381] <... clock_gettime resumed>{tv_sec=1620791989, tv_nsec=731105584}) = 0 [pid 10389] <... clock_gettime resumed>{tv_sec=1620791989, tv_nsec=731125991}) = 0 [pid 10370] clock_gettime(CLOCK_REALTIME, <unfinished ...> [pid 10381] clock_gettime(CLOCK_REALTIME, <unfinished ...> [pid 10389] clock_gettime(CLOCK_REALTIME, <unfinished ...> [pid 10370] <... clock_gettime resumed>{tv_sec=1620791989, tv_nsec=731162065}) = 0 [pid 10389] <... clock_gettime resumed>{tv_sec=1620791989, tv_nsec=731184311}) = 0 [pid 10381] <... clock_gettime resumed>{tv_sec=1620791989, tv_nsec=731174345}) = 0 [pid 10370] futex(0x55e95d97c2d8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 10381] futex(0x55e95d8a5320, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 10370] <... futex resumed>) = 0 [pid 10389] futex(0x55e95d97fad8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 10381] <... futex resumed>) = 0 [pid 10370] futex(0x55e95d97c31c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 17805, {tv_sec=1620791990, tv_nsec=731161399}, 0xffffffff <unfinished ...> [pid 10389] <... futex resumed>) = 0 [pid 10381] futex(0x55e95d8a5364, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 17805, {tv_sec=1620791990, tv_nsec=731173986}, 0xffffffff <unfinished ...> [pid 10389] futex(0x55e95d97fb1c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 17805, {tv_sec=1620791990, tv_nsec=731183618}, 0xffffffff^Cstrace: Process 10357 detached Any help would be great. Thanks, /C

2 years, 11 months

1
1
0 0

DocuBetter Meeting -- 12 May 2021 1730 UTC

by John Zachary Dover

There will be a DocuBetter Meeting held on 12 May 2021 at 1730 UTC. This is the monthly DocuBetter Meeting that is more convenient for European and North American Ceph contributors than the other meeting, which is convenient for people in Australia and Asia (and which is very rarely attended). I plan to discuss at this meeting the continuing cleaning of the cephadm documentation, and to discuss an ambitious plan of virtually Alexandrian hubris to create a 10ish-page Ceph Overview document (a long-term, tedious project that will involve a dozen people, so don't get too excited about it). Bring your docs complaints and requests to this meeting. Meeting: https://bluejeans.com/908675367 Etherpad: https://pad.ceph.com/p/Ceph_Documentation

2 years, 11 months

1
0
0 0

Re: radosgw-admin user create takes a long time (with failed to distribute cache message)

by Boris Behrens

Good call. I just restarted the whole cluster, but the problem still persists. I don't think it is a problem with the rados, but with the radosgw. But I still struggle to pin the issue. Am Di., 11. Mai 2021 um 10:45 Uhr schrieb Thomas Schneider < Thomas.Schneider-q2p(a)ruhr-uni-bochum.de>: > Hey all, > > we had slow RGW access when some OSDs were slow due to an (to us) unknown > OSD bug that made PG access either slow or impossible. (It showed itself > through slowness of the mgr as well, but nothing other than that). > We restarted all OSDs that held RGW data and the problem was gone. > I have no good way to debug the problem since it never occured again after > we restarted the OSDs. > > Kind regards, > Thomas > > > Am 11. Mai 2021 08:47:06 MESZ schrieb Boris Behrens <bb(a)kervyn.de>: > >Hi Amit, > > > >I just pinged the mons from every system and they are all available. > > > >Am Mo., 10. Mai 2021 um 21:18 Uhr schrieb Amit Ghadge < > amitg.b14(a)gmail.com>: > > > >> We seen slowness due to unreachable one of them mgr service, maybe here > >> are different, you can check monmap/ ceph.conf mon entry and then verify > >> all nodes are successfully ping. > >> > >> > >> -AmitG > >> > >> > >> On Tue, 11 May 2021 at 12:12 AM, Boris Behrens <bb(a)kervyn.de> wrote: > >> > >>> Hi guys, > >>> > >>> does someone got any idea? > >>> > >>> Am Mi., 5. Mai 2021 um 16:16 Uhr schrieb Boris Behrens <bb(a)kervyn.de>: > >>> > >>> > Hi, > >>> > since a couple of days we experience a strange slowness on some > >>> > radosgw-admin operations. > >>> > What is the best way to debug this? > >>> > > >>> > For example creating a user takes over 20s. > >>> > [root@s3db1 ~]# time radosgw-admin user create --uid test-bb-user > >>> > --display-name=test-bb-user > >>> > 2021-05-05 14:08:14.297 7f6942286840 1 robust_notify: If at first > you > >>> > don't succeed: (110) Connection timed out > >>> > 2021-05-05 14:08:14.297 7f6942286840 0 ERROR: failed to distribute > >>> cache > >>> > for eu-central-1.rgw.users.uid:test-bb-user > >>> > 2021-05-05 14:08:24.335 7f6942286840 1 robust_notify: If at first > you > >>> > don't succeed: (110) Connection timed out > >>> > 2021-05-05 14:08:24.335 7f6942286840 0 ERROR: failed to distribute > >>> cache > >>> > for eu-central-1.rgw.users.keys:**** > >>> > { > >>> > "user_id": "test-bb-user", > >>> > "display_name": "test-bb-user", > >>> > .... > >>> > } > >>> > real 0m20.557s > >>> > user 0m0.087s > >>> > sys 0m0.030s > >>> > > >>> > First I thought that rados operations might be slow, but adding and > >>> > deleting objects in rados are fast as usual (at least from my > >>> perspective). > >>> > Also uploading to buckets is fine. > >>> > > >>> > We changed some things and I think it might have to do with this: > >>> > * We have a HAProxy that distributes via leastconn between the 3 > >>> radosgw's > >>> > (this did not change) > >>> > * We had three times a daemon with the name "eu-central-1" running > (on > >>> the > >>> > 3 radosgw's) > >>> > * Because this might have led to our data duplication problem, we > have > >>> > split that up so now the daemons are named per host > (eu-central-1-s3db1, > >>> > eu-central-1-s3db2, eu-central-1-s3db3) > >>> > * We also added dedicated rgw daemons for garbage collection, because > >>> the > >>> > current one were not able to keep up. > >>> > * So basically ceph status went from "rgw: 1 daemon active > >>> (eu-central-1)" > >>> > to "rgw: 14 daemons active (eu-central-1-s3db1, eu-central-1-s3db2, > >>> > eu-central-1-s3db3, gc-s3db12, gc-s3db13...) > >>> > > >>> > > >>> > Cheers > >>> > Boris > >>> > > >>> > >>> > >>> -- > >>> Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend > im > >>> groÃƒ¼en Saal. > >>> _______________________________________________ > >>> ceph-users mailing list -- ceph-users(a)ceph.io > >>> To unsubscribe send an email to ceph-users-leave(a)ceph.io > >>> > >> > > > > -- > Thomas Schneider > IT.SERVICES > Wissenschaftliche Informationsversorgung Ruhr-Universität Bochum | 44780 > Bochum > Telefon: +49 234 32 23939 > http://www.it-services.rub.de/ > -- Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im groÃƒ¼en Saal.

2 years, 11 months

1
2
0 0

MonSession vs TCP connection

by Jan Pekař - Imatic

Hi all, I would like to "pair" MonSession with TCP connection to get real process, which is using that session. I need it to identify processes with old ceph features. MonSession looks like MonSession(client.84324148 [..IP...]:0/3096235764 is open allow *, features 0x27018fb86aa42ada (jewel)) What does client.NUMBER and 0/3096235764 means? How can I resolve client.NUMBER or that /NUMBER with certain TCP session. I have many processes on that server (on that IP) with different features. Thank you -- ============ Ing. Jan Pekař jan.pekar(a)imatic.cz ---- Imatic | Jagellonská 14 | Praha 3 | 130 00 http://www.imatic.cz | +420326555326 ============ --

2 years, 11 months

1
0
0 0

Re: cephfs mount problems with 5.11 kernel - not a ipv6 problem

by Ilya Dryomov

On Mon, May 3, 2021 at 12:24 PM Magnus Harlander <magnus(a)harlan.de> wrote: > > Am 03.05.21 um 11:22 schrieb Ilya Dryomov: > > There is a 6th osd directory on both machines, but it's empty > > [root@s0 osd]# ll > total 0 > drwxrwxrwt. 2 ceph ceph 200 2. Mai 16:31 ceph-1 > drwxrwxrwt. 2 ceph ceph 200 2. Mai 16:31 ceph-3 > drwxrwxrwt. 2 ceph ceph 200 2. Mai 16:31 ceph-4 > drwxrwxrwt. 2 ceph ceph 200 2. Mai 16:31 ceph-5 > drwxr-xr-x. 2 ceph ceph 6 3. Apr 19:50 ceph-8 <=== > drwxrwxrwt. 2 ceph ceph 200 2. Mai 16:31 ceph-9 > [root@s0 osd]# pwd > /var/lib/ceph/osd > > [root@s1 osd]# ll > total 0 > drwxrwxrwt 2 ceph ceph 200 May 2 15:39 ceph-0 > drwxr-xr-x. 2 ceph ceph 6 Mar 13 17:54 ceph-1 <=== > drwxrwxrwt 2 ceph ceph 200 May 2 15:39 ceph-2 > drwxrwxrwt 2 ceph ceph 200 May 2 15:39 ceph-6 > drwxrwxrwt 2 ceph ceph 200 May 2 15:39 ceph-7 > drwxrwxrwt 2 ceph ceph 200 May 2 15:39 ceph-8 > [root@s1 osd]# pwd > /var/lib/ceph/osd > > The bogus directories are empty and they are > used on the other machine for a real osd! > > How is that? > > Should I remove them and restart ceph.target? I don't think empty directories matter at this point. You may not have had 12 OSDs at any point in time, but the max_osd value appears to have gotten bumped when you were replacing those disks. Note that max_osd being greater than the number of OSDs is not a big problem by itself. The osdmap is going to be larger and require more memory but that's it. You can test by setting it back to 12 and trying to mount -- it should work. The issue is specific to how to those OSDs were replaced -- something went wrong and the osdmap somehow ended up with rather bogus addrvec entries. Not sure if it's ceph-deploy's fault, something weird in ceph.conf (back then) or a an actual ceph bug. Thanks, Ilya

2 years, 11 months

2
3
0 0

Which EC-code for 6 servers?

by Szabo, Istvan (Agoda)

Hi, Thinking to have 2:2 so I can tolerate 2 hosts loss, but if I just want to tolerate 1 host loss, which one better, 3:2 or 4:1? Istvan Szabo Senior Infrastructure Engineer --------------------------------------------------- Agoda Services Co., Ltd. e: istvan.szabo(a)agoda.com<mailto:istvan.szabo@agoda.com> --------------------------------------------------- ________________________________ This message is confidential and is for the sole use of the intended recipient(s). It may also be privileged or otherwise protected by copyright or other legal rules. If you have received it by mistake please let us know by reply email and delete it from your system. It is prohibited to copy this message or disclose its content to anyone. Any confidentiality or privilege is not waived or lost by any mistaken delivery or unauthorized disclosure of the message. All messages sent to and from Agoda may be monitored to ensure compliance with company policies, to protect the company's interests and to remove potential malware. Electronic messages may be intercepted, amended, lost or deleted, or contain viruses.

2 years, 11 months

2
2
0 0

Re: radosgw-admin user create takes a long time (with failed to distribute cache message)

by Boris Behrens

Hi Amit, it is the same physical interface but different VLANs. I checked all IP adresses from all systems and everything is direct connected, without any gateway hops. Am Di., 11. Mai 2021 um 10:59 Uhr schrieb Amit Ghadge <amitg.b14(a)gmail.com>: > I hope you are using a single network interface for the public and cluster? > > On Tue, May 11, 2021 at 2:15 PM Thomas Schneider < > Thomas.Schneider-q2p(a)ruhr-uni-bochum.de> wrote: > >> Hey all, >> >> we had slow RGW access when some OSDs were slow due to an (to us) unknown >> OSD bug that made PG access either slow or impossible. (It showed itself >> through slowness of the mgr as well, but nothing other than that). >> We restarted all OSDs that held RGW data and the problem was gone. >> I have no good way to debug the problem since it never occured again >> after we restarted the OSDs. >> >> Kind regards, >> Thomas >> >> >> Am 11. Mai 2021 08:47:06 MESZ schrieb Boris Behrens <bb(a)kervyn.de>: >> >Hi Amit, >> > >> >I just pinged the mons from every system and they are all available. >> > >> >Am Mo., 10. Mai 2021 um 21:18 Uhr schrieb Amit Ghadge < >> amitg.b14(a)gmail.com>: >> > >> >> We seen slowness due to unreachable one of them mgr service, maybe here >> >> are different, you can check monmap/ ceph.conf mon entry and then >> verify >> >> all nodes are successfully ping. >> >> >> >> >> >> -AmitG >> >> >> >> >> >> On Tue, 11 May 2021 at 12:12 AM, Boris Behrens <bb(a)kervyn.de> wrote: >> >> >> >>> Hi guys, >> >>> >> >>> does someone got any idea? >> >>> >> >>> Am Mi., 5. Mai 2021 um 16:16 Uhr schrieb Boris Behrens <bb(a)kervyn.de >> >: >> >>> >> >>> > Hi, >> >>> > since a couple of days we experience a strange slowness on some >> >>> > radosgw-admin operations. >> >>> > What is the best way to debug this? >> >>> > >> >>> > For example creating a user takes over 20s. >> >>> > [root@s3db1 ~]# time radosgw-admin user create --uid test-bb-user >> >>> > --display-name=test-bb-user >> >>> > 2021-05-05 14:08:14.297 7f6942286840 1 robust_notify: If at first >> you >> >>> > don't succeed: (110) Connection timed out >> >>> > 2021-05-05 14:08:14.297 7f6942286840 0 ERROR: failed to distribute >> >>> cache >> >>> > for eu-central-1.rgw.users.uid:test-bb-user >> >>> > 2021-05-05 14:08:24.335 7f6942286840 1 robust_notify: If at first >> you >> >>> > don't succeed: (110) Connection timed out >> >>> > 2021-05-05 14:08:24.335 7f6942286840 0 ERROR: failed to distribute >> >>> cache >> >>> > for eu-central-1.rgw.users.keys:**** >> >>> > { >> >>> > "user_id": "test-bb-user", >> >>> > "display_name": "test-bb-user", >> >>> > .... >> >>> > } >> >>> > real 0m20.557s >> >>> > user 0m0.087s >> >>> > sys 0m0.030s >> >>> > >> >>> > First I thought that rados operations might be slow, but adding and >> >>> > deleting objects in rados are fast as usual (at least from my >> >>> perspective). >> >>> > Also uploading to buckets is fine. >> >>> > >> >>> > We changed some things and I think it might have to do with this: >> >>> > * We have a HAProxy that distributes via leastconn between the 3 >> >>> radosgw's >> >>> > (this did not change) >> >>> > * We had three times a daemon with the name "eu-central-1" running >> (on >> >>> the >> >>> > 3 radosgw's) >> >>> > * Because this might have led to our data duplication problem, we >> have >> >>> > split that up so now the daemons are named per host >> (eu-central-1-s3db1, >> >>> > eu-central-1-s3db2, eu-central-1-s3db3) >> >>> > * We also added dedicated rgw daemons for garbage collection, >> because >> >>> the >> >>> > current one were not able to keep up. >> >>> > * So basically ceph status went from "rgw: 1 daemon active >> >>> (eu-central-1)" >> >>> > to "rgw: 14 daemons active (eu-central-1-s3db1, eu-central-1-s3db2, >> >>> > eu-central-1-s3db3, gc-s3db12, gc-s3db13...) >> >>> > >> >>> > >> >>> > Cheers >> >>> > Boris >> >>> > >> >>> >> >>> >> >>> -- >> >>> Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend >> im >> >>> groÃƒ¼en Saal. >> >>> _______________________________________________ >> >>> ceph-users mailing list -- ceph-users(a)ceph.io >> >>> To unsubscribe send an email to ceph-users-leave(a)ceph.io >> >>> >> >> >> > >> >> -- >> Thomas Schneider >> IT.SERVICES >> Wissenschaftliche Informationsversorgung Ruhr-Universität Bochum | 44780 >> Bochum >> Telefon: +49 234 32 23939 >> http://www.it-services.rub.de/ >> > -- Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im groÃƒ¼en Saal.

2 years, 12 months

1
0
0 0

2024

2023

2022

2021

2020

2019

ceph-users May 2021