Hi,
Two years after my issue https://tracker.ceph.com/issues/22928, a related problem has come back.
The problem:
Old buckets have their index and data in the rgw.buckets pool:
root@cephrgw01:~# radosgw-admin metadata get bucket:testtesttesty
{
    "key": "bucket:testtesttesty",
    "ver": {
        "tag": "_E_OHNhD28Zu1DeuvyGq8Q8b",
        "ver": 1
    },
    "mtime": "2013-11-11 09:25:56.000000Z",
    "data": {
        "bucket": {
            "name": "testtesttesty",
            "marker": "default.2542971.19",
            "bucket_id": "default.2542971.19",
            "tenant": "",
            "explicit_placement": {
                "data_pool": "rgw.buckets",
                "data_extra_pool": "",
                "index_pool": "rgw.buckets"
            }
        },
        "owner": "123",
        "creation_time": "2013-11-11 09:25:56.000000Z",
        "linked": "true",
        "has_bucket_info": "false"
    }
}
After upgrading from Luminous to Nautilus, I get 400 (InvalidArgument) and "NOTICE: invalid dest placement" in the radosgw log when accessing these buckets.
My zone is defined as follows:
root@cephrgw01:~# radosgw-admin zone get
{
    "id": "default",
    "name": "default",
    "domain_root": ".rgw",
    "control_pool": ".rgw.control",
    "gc_pool": ".rgw.gc",
    "lc_pool": ".log:lc",
    "log_pool": ".log",
    "intent_log_pool": ".intent-log",
    "usage_log_pool": ".usage",
    "reshard_pool": ".log:reshard",
    "user_keys_pool": ".users",
    "user_email_pool": ".users.email",
    "user_swift_pool": ".users.swift",
    "user_uid_pool": ".users.uid",
    "otp_pool": "default.rgw.otp",
    "system_key": {
        "access_key": "",
        "secret_key": ""
    },
    "placement_pools": [
        {
            "key": "default-placement",
            "val": {
                "index_pool": "rgw.buckets.index",
                "storage_classes": {
                    "STANDARD": {
                        "data_pool": "rgw.buckets"
                    }
                },
                "data_extra_pool": "rgw.buckets.non-ec",
                "index_type": 0
            }
        }
    ],
    "metadata_heap": ".rgw.meta",
    "realm_id": "*********************c"
}
Now I am a little bit lost. I added a new placement target to my zone and zonegroup:
radosgw-admin zonegroup placement add --rgw-zonegroup default --placement-id pre-jewel
radosgw-admin zone placement add --rgw-zonegroup default --placement-id pre-jewel --data-pool rgw.buckets --index-pool rgw.buckets --data-extra-pool ""
radosgw-admin period update --commit
root@cephrgw01:~# radosgw-admin zone get
{
    "id": "default",
    "name": "default",
    "domain_root": ".rgw",
    "control_pool": ".rgw.control",
    "gc_pool": ".rgw.gc",
    "lc_pool": ".log:lc",
    "log_pool": ".log",
    "intent_log_pool": ".intent-log",
    "usage_log_pool": ".usage",
    "reshard_pool": ".log:reshard",
    "user_keys_pool": ".users",
    "user_email_pool": ".users.email",
    "user_swift_pool": ".users.swift",
    "user_uid_pool": ".users.uid",
    "otp_pool": "default.rgw.otp",
    "system_key": {
        "access_key": "",
        "secret_key": ""
    },
    "placement_pools": [
        {
            "key": "default-placement",
            "val": {
                "index_pool": "rgw.buckets.index",
                "storage_classes": {
                    "STANDARD": {
                        "data_pool": "rgw.buckets"
                    }
                },
                "data_extra_pool": "rgw.buckets.non-ec",
                "index_type": 0
            }
        },
        {
            "key": "pre-jewel",
            "val": {
                "index_pool": "rgw.buckets",
                "storage_classes": {
                    "STANDARD": {
                        "data_pool": "rgw.buckets"
                    }
                },
                "data_extra_pool": "",
                "index_type": 0
            }
        }
    ],
    "metadata_heap": ".rgw.meta",
    "realm_id": "****************c"
}
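For what it's worth, the zone output above can be sanity-checked mechanically. A small sketch in Python (the helper name and the trimmed JSON are mine) that confirms the pre-jewel target points index and data at the same pool as the old buckets' explicit_placement:

```python
import json

# Trimmed from the `radosgw-admin zone get` output above.
ZONE_JSON = """
{
  "placement_pools": [
    {"key": "default-placement",
     "val": {"index_pool": "rgw.buckets.index",
             "storage_classes": {"STANDARD": {"data_pool": "rgw.buckets"}},
             "data_extra_pool": "rgw.buckets.non-ec"}},
    {"key": "pre-jewel",
     "val": {"index_pool": "rgw.buckets",
             "storage_classes": {"STANDARD": {"data_pool": "rgw.buckets"}},
             "data_extra_pool": ""}}
  ]
}
"""

def placement_pools(zone):
    """Map placement-id -> (index_pool, data_pool) from a parsed zone."""
    out = {}
    for entry in zone["placement_pools"]:
        val = entry["val"]
        data = val["storage_classes"]["STANDARD"]["data_pool"]
        out[entry["key"]] = (val["index_pool"], data)
    return out

pools = placement_pools(json.loads(ZONE_JSON))
# The pre-jewel target must point both index and data at rgw.buckets,
# matching the explicit_placement of the old bucket.
assert pools["pre-jewel"] == ("rgw.buckets", "rgw.buckets")
print(pools)
```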
Nevertheless, only the Luminous gateways can list my old buckets. As far as I can see, I can only change the placement_rule for new buckets. Is there any way to make radosgw find the old indices so I can complete the upgrade to Nautilus?
Many thanks,
Ingo
--
Ingo Reimann
[ https://www.dunkel.de/ ]
Dunkel GmbH
Philipp-Reis-Straße 2
65795 Hattersheim
Fon: +49 6190 889-100
Fax: +49 6190 889-399
eMail: support(a)dunkel.de
http://www.Dunkel.de/
Amtsgericht Frankfurt/Main
HRB: 37971
Geschäftsführer: Axel Dunkel
Ust-ID: DE 811622001
Hi Bryan and Dan,
I have had some similar observations and wanted a few data points from you as well, if possible.
1. When you say down OSD, is it down and in, or down and out?
2. I see the OSDs accumulating map ranges while continuous recovery is going on due to OSD flaps (i.e. some portion of the PGs in the cluster are not in the active+clean state).
3. Another observation: although only one of the pools may see the OSD churn, the map ranges seem to increase across all OSDs in all pools.
4. I use the admin socket (asok) to dump the status on the OSD daemon, which shows the map ranges.
5. Restarting the mons one after another also seemed to reduce the space used, but only sporadically.
6. Once there was some breathing room (OSDs no longer flapping and all PGs in the active+clean state), the usage trimmed down to almost optimal.
7. I found some pointers at https://docs.ceph.com/docs/master/dev/mon-osdmap-prune/
8. The behaviour I am describing is on the 12.x Luminous release.
9. Would you care to share which commands you use to dump osdmap usage info from a space perspective?
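On point 4, here is the sort of thing I do: read oldest_map/newest_map from the OSD's admin-socket status dump. A minimal sketch (field names are what I see on Luminous; treat them as an assumption on other releases):

```python
import json

def osdmap_range(status):
    """Given parsed output of the OSD admin-socket `status` command,
    return the number of osdmaps the OSD is holding."""
    return status["newest_map"] - status["oldest_map"] + 1

# In practice the status comes from the admin socket, e.g.:
#   ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok status
# Abridged example output:
sample = json.loads('{"whoami": 0, "state": "active", '
                    '"oldest_map": 105000, "newest_map": 213456, '
                    '"num_pgs": 187}')
print(osdmap_range(sample))  # 108457 maps still held by this OSD
```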
Thanks
Romit
On Mon, 9 Dec 2019, 22:56 , <ceph-users-request(a)ceph.io> wrote:
> Send ceph-users mailing list submissions to
> ceph-users(a)ceph.io
>
> To subscribe or unsubscribe via email, send a message with subject or
> body 'help' to
> ceph-users-request(a)ceph.io
>
> You can reach the person managing the list at
> ceph-users-owner(a)ceph.io
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of ceph-users digest..."
>
> Today's Topics:
>
> 1. RGW listing millions of objects takes too much time (Arash Shams)
> 2. Re: RGW listing millions of objects takes too much time
> (Robert LeBlanc)
> 3. ceph mgr daemon multiple ip addresses (Frank R)
> 4. Re: osdmaps not trimmed until ceph-mon's restarted (if cluster has a
> down osd)
> (Bryan Stillwell)
>
>
> ----------------------------------------------------------------------
>
> Date: Mon, 9 Dec 2019 15:46:04 +0000
> From: Arash Shams <ara4sh(a)hotmail.com>
> Subject: [ceph-users] RGW listing millions of objects takes too much
> time
> To: "ceph-users(a)ceph.io" <ceph-users(a)ceph.io>
>
> Dear All,
>
> I have almost 30 million objects and I want to list them and index them
> somewhere else,
> Im using boto3 with continuation Marker but it takes almost 9 hours
>
> can I run it in multiple threads to make it faster? what solution do you
> suggest to speedup this process,
>
>
> Thanks
>
>
>
> ------------------------------
>
> Date: Mon, 9 Dec 2019 08:23:55 -0800
> From: Robert LeBlanc <robert(a)leblancnet.us>
> Subject: [ceph-users] Re: RGW listing millions of objects takes too
> much time
> To: Arash Shams <ara4sh(a)hotmail.com>
> Cc: "ceph-users(a)ceph.io" <ceph-users(a)ceph.io>
>
> On Mon, Dec 9, 2019 at 7:47 AM Arash Shams <ara4sh(a)hotmail.com> wrote:
>
> > Dear All,
> >
> > I have almost 30 million objects and I want to list them and index them
> > somewhere else,
> > Im using boto3 with continuation Marker but it takes almost 9 hours
> >
> > can I run it in multiple threads to make it faster? what solution do you
> > suggest to speedup this process,
> >
> >
> > Thanks
> >
>
> I've thought about indexing objects elsewhere as well. One thought I had
> was hooking into the HTTP flow where a PUT or DEL would update the objects
> in some kind of database (async of course). We could also gather stats with
> GET and POST. Initially, my thoughts were to hook into haproxy since we
> already use it, but possibly RGW if that is an option. That way it would
> always be up to date and not have to do big scans on the buckets (our
> buckets would not perform well with this). I haven't actually gotten to the
> implementation phase of this idea.
>
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
>
> ------------------------------
>
> Date: Mon, 9 Dec 2019 11:54:10 -0500
> From: Frank R <frankaritchie(a)gmail.com>
> Subject: [ceph-users] ceph mgr daemon multiple ip addresses
> To: ceph-users <ceph-users(a)ceph.com>
>
> Hi all,
>
> Does anyone know what possible issues can arise if the ceph mgr daemon is
> running on a mon node that has 2 ips in the public net range (1 is a
> loopback address).
>
> As I understand the it. mgr will bind to all ips
>
> FYI - I am not sure why the loopback is there, I am trying to find out.
>
> thx
> Frank
>
>
>
>
> mlovell - ceph anycast
>
>
> ------------------------------
>
> Date: Mon, 9 Dec 2019 17:24:27 +0000
> From: Bryan Stillwell <bstillwell(a)godaddy.com>
> Subject: [ceph-users] Re: osdmaps not trimmed until ceph-mon's
> restarted (if cluster has a down osd)
> To: Dan van der Ster <dan(a)vanderster.com>
> Cc: Joao Eduardo Luis <joao(a)suse.de>, "dev(a)ceph.io" <dev(a)ceph.io>,
> ceph-users <ceph-users(a)ceph.io>
>
> On Nov 18, 2019, at 8:12 AM, Dan van der Ster <dan(a)vanderster.com> wrote:
> >
> > On Fri, Nov 15, 2019 at 4:45 PM Joao Eduardo Luis <joao(a)suse.de> wrote:
> >>
> >> On 19/11/14 11:04AM, Gregory Farnum wrote:
> >>> On Thu, Nov 14, 2019 at 8:14 AM Dan van der Ster <dan(a)vanderster.com>
> wrote:
> >>>>
> >>>> Hi Joao,
> >>>>
> >>>> I might have found the reason why several of our clusters (and maybe
> >>>> Bryan's too) are getting stuck not trimming osdmaps.
> >>>> It seems that when an osd fails, the min_last_epoch_clean gets stuck
> >>>> forever (even long after HEALTH_OK), until the ceph-mons are
> >>>> restarted.
> >>>>
> >>>> I've updated the ticket: https://tracker.ceph.com/issues/41154
> >>>
> >>> Wrong ticket, I think you meant
> https://tracker.ceph.com/issues/37875#note-7
> >>
> >> I've seen this behavior a long, long time ago, but stopped being able to
> >> reproduce it consistently enough to ensure the patch was working
> properly.
> >>
> >> I think I have a patch here:
> >>
> >> https://github.com/ceph/ceph/pull/19076/commits
> >>
> >> If you are feeling adventurous, and want to give it a try, let me know.
> I'll
> >> be happy to forward port it to whatever you are running.
> >
> > Thanks Joao, this patch is what I had in mind.
> >
> > I'm trying to evaluate how adventurous this would be -- Is there any
> > risk that if a huge number of osds are down all at once (but
> > transiently), it would trigger the mon to trim too many maps?
> > I would expect that the remaining up OSDs will have a safe, low,
> osd_epoch ?
> >
> > And anyway I guess that your proposed get_min_last_epoch_clean patch
> > is equivalent to what we have today if we restart the ceph-mon leader
> > while an osd is down.
>
> Joao,
>
> I ran into this again today and found over 100,000 osdmaps on all 1,000
> OSDs (~50 TiB of disk space used just by osdmaps). There were down OSDs
> (pretty regular occurrence with ~1,000 OSDs) so that matches up with what
> Dan found. Then when I restarted all the mon nodes twice the osdmaps
> started cleaning up.
>
> I believe the steps to reproduce would look like this:
>
> 1. Start with a cluster with at least 1 down osd
> 2. Expand the cluster (the bigger the expansion, the more osdmaps that
> pile up)
> 3. Notice that after the expansion completes and the cluster is healthy
> that the old osdmaps aren't cleaned up
>
> I would be willing to test the fix on our test cluster after 14.2.5 comes
> out. Could you make a build based on that release?
>
> Thanks,
> Bryan
>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>
>
> ------------------------------
>
> End of ceph-users Digest, Vol 83, Issue 35
> ******************************************
>
Hi Robert and Arash,
A couple of pointers and questions that might help:
1. Can you point to the code you are using for listing the buckets?
2. Which release is the cluster running?
3. How many shards are configured in the bucket index for the mentioned bucket?
4. Have you tried timing the listing process to see where the slowdown begins: after 1 million entries, after 2 million entries, or is the listing slow at the per-key level?
5. You could profile the timing for every 10,000 or 100,000 entries.
6. Using an additional metadata store (for example Redis as an in-memory index) could be a solution, but adding more moving parts quickly becomes a headache.
7. Let's look at the tunables first, to see where the bottleneck is.
8. Another good practice, if possible in your deployment, would be to restrict objects per bucket to, say, 0.5 or 1 million, and use a client-side hash/placement algorithm for writing into the buckets.
9. If these 30 million entries are static, then prefetching the markers and running a multithreaded flow makes sense. But then, if they are static, why not store them in a single blob altogether? (Contradicting my own point.)
10. Also look through the Ceph lists and mailing archives for ordered vs. unordered listing (I remember seeing some discussion around it).
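On points 8 and 9, the usual way to parallelise a flat listing is to shard the keyspace by prefix and walk each shard with its own continuation marker. A self-contained sketch (the bucket contents and fetch_page are simulated; in real code fetch_page would wrap something like boto3's list_objects_v2 with Prefix and ContinuationToken):

```python
from concurrent.futures import ThreadPoolExecutor

# Simulated bucket: 4 prefixes ("00/" .. "03/") with 5 objects each.
FAKE_BUCKET = sorted(f"{i:02d}/obj-{j}" for i in range(4) for j in range(5))

def fetch_page(prefix, token=None, page_size=2):
    """Simulated paginated listing: returns (keys, next_token)."""
    keys = [k for k in FAKE_BUCKET if k.startswith(prefix)]
    start = token or 0
    page = keys[start:start + page_size]
    next_token = start + page_size if start + page_size < len(keys) else None
    return page, next_token

def list_prefix(prefix):
    """Walk one prefix to exhaustion using a continuation token."""
    out, token = [], None
    while True:
        page, token = fetch_page(prefix, token)
        out.extend(page)
        if token is None:
            return out

# One worker per prefix shard; each shard pages independently.
prefixes = [f"{i:02d}/" for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    listed = [k for chunk in pool.map(list_prefix, prefixes) for k in chunk]
assert sorted(listed) == FAKE_BUCKET
print(len(listed))  # 20
```

The same shape works against RGW as long as the object names distribute evenly across whatever prefixes you choose.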
Thanks
Romit Misra
> out. Could you make a build based on that release?
>
> Thanks,
> Bryan
>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>
>
> ------------------------------
>
> End of ceph-users Digest, Vol 83, Issue 35
> ******************************************
>
Hi,
our Ceph 14.2.3 cluster has so far run smoothly with replicated and EC pools, but for a couple of days one of the dedicated replication nodes has been consuming up to 99% of its swap and staying at that level. The other two replication nodes use about 50-60% of swap.
All 24 NVMe OSDs per node are BlueStore with default settings, and each node has 128 GB RAM. vm.swappiness is set to 10.
Do you have any suggestions on how to handle/reduce the swap usage?
Thanks for feedback and regards, Götz
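A first step is usually to find out which processes actually hold the swap. A minimal sketch, not Ceph-specific, that sums the VmSwap field from /proc/<pid>/status on Linux (field names per proc(5)):

```python
import glob
import re

def vmswap_kib(status_text):
    """Parse the VmSwap line out of a /proc/<pid>/status blob; 0 if absent."""
    m = re.search(r"^VmSwap:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(m.group(1)) if m else 0

def top_swappers(n=10):
    """Return the n processes using the most swap, as (kib, name) pairs."""
    rows = []
    for path in glob.glob("/proc/[0-9]*/status"):
        try:
            text = open(path).read()
        except OSError:
            continue  # process exited while we were scanning
        kib = vmswap_kib(text)
        if kib:
            name = re.search(r"^Name:\s+(\S+)", text, re.MULTILINE).group(1)
            rows.append((kib, name))
    return sorted(rows, reverse=True)[:n]

if __name__ == "__main__":
    for kib, name in top_swappers():
        print(f"{kib:>10} kB  {name}")
```

If the ceph-osd processes dominate the list, the usual suspects are the BlueStore cache targets (osd_memory_target) relative to the 128 GB of RAM across 24 OSDs.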
Hi all,
Does anyone know what possible issues can arise if the ceph mgr daemon is
running on a mon node that has 2 ips in the public net range (1 is a
loopback address).
As I understand it, the mgr will bind to all IPs.
FYI - I am not sure why the loopback is there, I am trying to find out.
thx
Frank
mlovell - ceph anycast
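One way to take the ambiguity out of the picture while the loopback is investigated is to pin the daemon to the intended address. A sketch of a ceph.conf fragment; the section name `mgr.monhost1` and the address are placeholders, and this assumes your release honors `public_addr` for the mgr section:

```ini
# Pin the mgr to the intended NIC address rather than letting it
# choose among all addresses that fall in the public network range.
# "monhost1" is a placeholder for the actual mgr daemon id.
[mgr.monhost1]
    public_addr = 192.0.2.10
```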
On Mon, Dec 9, 2019 at 7:47 AM Arash Shams <ara4sh(a)hotmail.com> wrote:
> Dear All,
>
> I have almost 30 million objects and I want to list them and index them
> somewhere else,
> I'm using boto3 with a continuation marker, but it takes almost 9 hours.
>
> Can I run it in multiple threads to make it faster? What solution do you
> suggest to speed up this process?
>
>
> Thanks
>
I've thought about indexing objects elsewhere as well. One thought I had
was hooking into the HTTP flow where a PUT or DEL would update the objects
in some kind of database (async of course). We could also gather stats with
GET and POST. Initially, my thoughts were to hook into haproxy since we
already use it, but possibly RGW if that is an option. That way it would
always be up to date and not have to do big scans on the buckets (our
buckets would not perform well with this). I haven't actually gotten to the
implementation phase of this idea.
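At its simplest, the async hook could be a consumer that replays access-log lines into a local database. A sketch under heavy assumptions: the log format shown is invented for illustration, sqlite stands in for whatever database you would actually use, and GET/POST stats are left out:

```python
import re
import sqlite3

# Hypothetical, simplified access-log pattern: "<METHOD> /<bucket>/<key> <status>"
LOG_RE = re.compile(r"^(PUT|DELETE|GET|POST) /([^/\s]+)/(\S+) (\d{3})$")

def apply_line(db, line):
    """Upsert the object on a successful PUT, drop it on a successful DELETE."""
    m = LOG_RE.match(line)
    if not m or not m.group(4).startswith("2"):
        return
    method, bucket, key = m.group(1), m.group(2), m.group(3)
    if method == "PUT":
        db.execute("INSERT OR REPLACE INTO objects(bucket, key) VALUES (?, ?)",
                   (bucket, key))
    elif method == "DELETE":
        db.execute("DELETE FROM objects WHERE bucket = ? AND key = ?",
                   (bucket, key))

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE objects (bucket TEXT, key TEXT, PRIMARY KEY(bucket, key))")
apply_line(db, "PUT /photos/cat.jpg 200")
apply_line(db, "PUT /photos/dog.jpg 200")
apply_line(db, "DELETE /photos/cat.jpg 204")
print(db.execute("SELECT key FROM objects").fetchall())
```

The index then stays current without ever scanning the bucket, at the cost of needing a backfill once at the start and replay on gaps.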
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
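On the original listing question: a single continuation-marker loop is inherently sequential, but the keyspace can be split by prefix and the shards listed concurrently. A sketch; the assumption that keys start with [a-z0-9] is a placeholder to adjust to the real key layout:

```python
from concurrent.futures import ThreadPoolExecutor
import string

def shard_prefixes(alphabet=string.ascii_lowercase + string.digits):
    """One listing shard per leading key character; assumes keys start
    with a lowercase letter or digit -- adjust to your key layout."""
    return list(alphabet)

def list_shard(bucket, prefix):
    """List every key under one prefix with the list_objects_v2 paginator."""
    import boto3  # imported here so shard_prefixes() works without boto3
    s3 = boto3.client("s3")
    keys = []
    for page in s3.get_paginator("list_objects_v2").paginate(
            Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys

def list_bucket_parallel(bucket, workers=16):
    """Fan the shards out over a thread pool and merge the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda p: list_shard(bucket, p), shard_prefixes())
    return [key for chunk in results for key in chunk]
```

The speedup is bounded by how evenly keys spread over the prefixes, and RGW still pays the index cost per shard, so very hot buckets may want coarser sharding.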
Hi,
I had a failure on 2 of 7 OSD nodes.
This caused a server reboot, and unfortunately the cluster network failed to come up, which resulted in many OSDs being down.
I decided to stop all services (OSD, MGR, MON) and start them sequentially.
Now I have multiple OSDs marked as down although the service is running.
None of these down OSDs is connected to the 2 nodes that failed.
In the OSD logs I can see multiple entries like this:
2019-12-09 11:13:10.378 7f9a372fb700 1 osd.374 pg_epoch: 493189
pg[11.1992( v 457986'92619 (303558'88266,457986'92619]
local-lis/les=466724/466725 n=4107 ec=8346/8346 lis/c 466724/466724
les/c/f 466725/466725/176266 468956/493184/468423) [203,412] r=-1
lpr=493184 pi=[466724,493184)/1 crt=457986'92619 lcod 0'0 unknown NOTIFY
mbc={}] state<Start>: transitioning to Stray
I tried to restart the impacted OSDs without success; the relevant OSDs are still marked as down.
Is there a procedure to overcome this issue, i.e., to get all OSDs up again?
THX
Hello,
I'm trying to wrap my head around how a multi-site setup (two zones in one zonegroup) with multiple placement targets would work when only some placement targets should be replicated.
Can you set up a zonegroup with two zones so that it only replicates the placement targets that the other side also has?
For example, if there are three placement targets in the zonegroup, one zone has all three placement targets, and the other zone has only two of them, would only those two be replicated?
Best regards