Re: RGW listing millions of objects takes too much time - ceph-users

10 Dec 2019

Hi Robert and Arash,
A few pointers and questions that might help.

1. Can you point to the code you are using for listing the buckets?
2. Which release is the cluster running?
3. How many shards are configured in the bucket index for the bucket in
question?
4. Have you tried timing the listing process to see where the slowdown
begins, in other words after 1 million entries, 2 million entries, or
whether the listing is slow at the per-key level?
5. You could profile the timing for every 10,000 or 100,000 entries.
6. An additional metadata store, such as an in-memory Redis store, could be
a solution, but adding that many moving parts becomes a headache.
7. Let's look at the tunables first, to see where the bottleneck is.
8. Another good practice, if possible in your deployment, would be to
restrict the number of objects per bucket to, say, 0.5 million or 1 million,
and use a client-side hash/placement algorithm for writing into the buckets.
9. If, and only if, these 30 million entries are static, then prefetching
the markers and running a multithreaded flow makes sense. But then again, if
they are static, why would you not store them in a single blob altogether?
So I am contradicting my own point.
10. Also do look through the Ceph lists and mailers for ordered vs.
unordered listing (I remember seeing a query about it).
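
For points 4 and 5, here is a minimal sketch of how the timing could be
collected. It is not tied to your code: `profile_listing` and its parameters
are names I made up, and it accepts any iterable of ListObjects-style
response pages, so it should work with a boto3 paginator as well as with
fake data.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def profile_listing(
    pages: Iterable[dict],
    report_every: int = 10_000,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[Tuple[int, float]]:
    """Yield (keys_seen_so_far, seconds_since_last_report).

    A report is emitted each time at least `report_every` new keys have
    been listed. `pages` is any iterable of S3 ListObjects-style
    responses, i.e. dicts with an optional "Contents" list.
    """
    seen = 0
    last_report = 0
    batch_start = clock()
    for page in pages:
        seen += len(page.get("Contents", []))
        if seen - last_report >= report_every:
            now = clock()
            yield seen, now - batch_start
            batch_start = now
            last_report = seen


# Usage against a real cluster (needs boto3; endpoint and bucket name
# below are placeholders, not values from this thread):
#
#   import boto3
#   s3 = boto3.client("s3", endpoint_url="http://rgw.example:8080")
#   pages = s3.get_paginator("list_objects").paginate(Bucket="my-bucket")
#   for seen, secs in profile_listing(pages):
#       print(f"{seen:>10} keys, last batch took {secs:.2f}s")
```

If the per-batch time stays flat, the cost is per key; if it grows as the
listing progresses, the slowdown depends on how far into the bucket you are.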

Thanks
Romit Misra

On Mon, 9 Dec 2019, 22:56, <ceph-users-request(a)ceph.io> wrote:

 Today's Topics:

    1. RGW listing millions of objects takes too much time (Arash Shams)
    2. Re: RGW listing millions of objects takes too much time
       (Robert LeBlanc)
    3. ceph mgr daemon multiple ip addresses (Frank R)
    4. Re: osdmaps not trimmed until ceph-mon's restarted (if cluster has a
       down osd) (Bryan Stillwell)

 ----------------------------------------------------------------------

 Date: Mon, 9 Dec 2019 15:46:04 +0000
 From: Arash Shams <ara4sh(a)hotmail.com>
 Subject: [ceph-users] RGW listing millions of objects takes too much
         time
 To: "ceph-users(a)ceph.io" <ceph-users(a)ceph.io>
 Message-ID: <CWXP265MB128679DE9E6B2A6CA289B16492580(a)CWXP265MB1286.GBRP265.PROD.OUTLOOK.COM>

 Dear All,

 I have almost 30 million objects and I want to list them and index them
 somewhere else,
 I'm using boto3 with continuation Marker but it takes almost 9 hours

 can I run it in multiple threads to make it faster? what solution do you
 suggest to speed up this process,

 Thanks

 ------------------------------

 Date: Mon, 9 Dec 2019 08:23:55 -0800
 From: Robert LeBlanc <robert(a)leblancnet.us>
 Subject: [ceph-users] Re: RGW listing millions of objects takes too
         much time
 To: Arash Shams <ara4sh(a)hotmail.com>
 Cc: "ceph-users(a)ceph.io" <ceph-users(a)ceph.io>
 Message-ID:
         <CAANLjFoZu8LE0eH2vB9QArAOZJf4Ofm8V=jAW-+ZPOFS-f9Oag(a)mail.gmail.com>

 On Mon, Dec 9, 2019 at 7:47 AM Arash Shams <ara4sh(a)hotmail.com> wrote:

 > Dear All,
 >
 > I have almost 30 million objects and I want to list them and index them
 > somewhere else,
 > I'm using boto3 with continuation Marker but it takes almost 9 hours
 >
 > can I run it in multiple threads to make it faster? what solution do you
 > suggest to speed up this process,
 >
 > Thanks

 I've thought about indexing objects elsewhere as well. One thought I had
 was hooking into the HTTP flow where a PUT or DEL would update the objects
 in some kind of database (async of course). We could also gather stats with
 GET and POST. Initially, my thoughts were to hook into haproxy since we
 already use it, but possibly RGW if that is an option. That way it would
 always be up to date and not have to do big scans on the buckets (our
 buckets would not perform well with this). I haven't actually gotten to the
 implementation phase of this idea.

 ----------------
 Robert LeBlanc
 PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

 ------------------------------

 Date: Mon, 9 Dec 2019 11:54:10 -0500
 From: Frank R <frankaritchie(a)gmail.com>
 Subject: [ceph-users] ceph mgr daemon multiple ip addresses
 To: ceph-users <ceph-users(a)ceph.com>
 Message-ID:
         <CAMuVLDOhax+8RwPEK_52UfzmOc8dv5=YBH7Etsm5VmiO_WFEPg(a)mail.gmail.com>

 Hi all,

 Does anyone know what possible issues can arise if the ceph mgr daemon is
 running on a mon node that has 2 IPs in the public net range (1 is a
 loopback address)?

 As I understand it, mgr will bind to all IPs.

 FYI - I am not sure why the loopback is there, I am trying to find out.

 thx
 Frank

 mlovell - ceph anycast

 ------------------------------

 Date: Mon, 9 Dec 2019 17:24:27 +0000
 From: Bryan Stillwell <bstillwell(a)godaddy.com>
 Subject: [ceph-users] Re: osdmaps not trimmed until ceph-mon's
         restarted (if cluster has a down osd)
 To: Dan van der Ster <dan(a)vanderster.com>
 Cc: Joao Eduardo Luis <joao(a)suse.de>, "dev(a)ceph.io" <dev(a)ceph.io>,
         ceph-users <ceph-users(a)ceph.io>
 Message-ID: <9B145B17-6665-4254-8D6A-04A9B37389C3(a)godaddy.com>

 On Nov 18, 2019, at 8:12 AM, Dan van der Ster <dan(a)vanderster.com> wrote:

 On Fri, Nov 15, 2019 at 4:45 PM Joao Eduardo Luis <joao(a)suse.de> wrote:
>
> On 19/11/14 11:04AM, Gregory Farnum wrote:
>> On Thu, Nov 14, 2019 at 8:14 AM Dan van der Ster <dan(a)vanderster.com> wrote:
>>>
>>> Hi Joao,
>>>
>>> I might have found the reason why several of our clusters (and maybe
>>> Bryan's too) are getting stuck not trimming osdmaps.
>>> It seems that when an osd fails, the min_last_epoch_clean gets stuck
>>> forever (even long after HEALTH_OK), until the ceph-mons are
>>> restarted.
>>>
>>> I've updated the ticket: https://tracker.ceph.com/issues/41154
>>
>> Wrong ticket, I think you meant https://tracker.ceph.com/issues/37875#note-7
>
> I've seen this behavior a long, long time ago, but stopped being able to
> reproduce it consistently enough to ensure the patch was working properly.
>
> I think I have a patch here:
>
>  https://github.com/ceph/ceph/pull/19076/commits
>
> If you are feeling adventurous, and want to give it a try, let me know. I'll
> be happy to forward port it to whatever you are running.

 Thanks Joao, this patch is what I had in mind.

 I'm trying to evaluate how adventurous this would be -- Is there any
 risk that if a huge number of osds are down all at once (but
 transiently), it would trigger the mon to trim too many maps?
 I would expect that the remaining up OSDs will have a safe, low, osd_epoch?

 And anyway I guess that your proposed get_min_last_epoch_clean patch
 is equivalent to what we have today if we restart the ceph-mon leader
 while an osd is down.

 Joao,

 I ran into this again today and found over 100,000 osdmaps on all 1,000
 OSDs (~50 TiB of disk space used just by osdmaps).  There were down OSDs
 (pretty regular occurrence with ~1,000 OSDs) so that matches up with what
 Dan found.  Then when I restarted all the mon nodes twice the osdmaps
 started cleaning up.

 I believe the steps to reproduce would look like this:

 1. Start with a cluster with at least 1 down osd
 2. Expand the cluster (the bigger the expansion, the more osdmaps that
 pile up)
 3. Notice that after the expansion completes and the cluster is healthy
 that the old osdmaps aren't cleaned up

 I would be willing to test the fix on our test cluster after 14.2.5 comes
 out.  Could you make a build based on that release?

 Thanks,
 Bryan

 ------------------------------

 End of ceph-users Digest, Vol 83, Issue 35
 ******************************************
