Hello,
We have a freshly installed multisite setup across 3 geographic locations, running Octopus
upgraded from 15.2.5 to 15.2.7.
We have 6 OSD nodes and 3 mon/mgr/rgw nodes in each DC, all SSD, with every 3 SSDs sharing
1 NVMe for journaling. Each zone is backed by 3 RGWs, one on each mon/mgr node.
The goal is to replicate 2 (currently) big buckets within the zonegroup, but replication
only works if I disable and re-enable the bucket sync.
By big buckets I mean: one bucket is presharded to 9000 shards (9 billion objects), and the
2nd bucket, the one I'm detailing in this ticket, to 24000 shards (24 billion objects).
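For scale, here is a rough sketch of the objects-per-shard arithmetic implied by these
numbers. It assumes the object counts are the totals each bucket is expected to hold and
that objects spread evenly across the index shards; the function name and labels are only
illustrative.

```python
def objects_per_shard(total_objects: int, num_shards: int) -> float:
    """Average number of objects per index shard, assuming an even spread."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    return total_objects / num_shards


# Figures taken from the description above; labels are illustrative.
buckets = {
    "first bucket": (9_000_000_000, 9_000),
    "second bucket (this ticket)": (24_000_000_000, 24_000),
}

for name, (objects, shards) in buckets.items():
    print(f"{name}: {objects_per_shard(objects, shards):,.0f} objects per shard")
```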
Once sync has picked up the objects (not all of them, only the ones that were on the source
site at the time it was enabled), it slows down a lot, from 100,000 objects and 10 GB
inbound per 15 minutes to 50 objects per 4 hours.
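To put the slowdown in comparable units, here is a small sketch that converts the two
observed rates to per-second figures. It assumes "10GB" means 10 * 10^9 bytes; the helper
name is only illustrative.

```python
def rate_per_second(count: float, minutes: float) -> float:
    """Convert a count observed over a window of `minutes` into a per-second rate."""
    return count / (minutes * 60)


# Observed while sync is healthy: 100,000 objects and 10 GB per 15 minutes.
fast_objects = rate_per_second(100_000, 15)
fast_bytes = rate_per_second(10 * 10**9, 15)  # assumption: decimal gigabytes

# Observed after the slowdown: 50 objects per 4 hours.
slow_objects = rate_per_second(50, 4 * 60)

slowdown = fast_objects / slow_objects

print(f"healthy: {fast_objects:.1f} objects/s, {fast_bytes / 10**6:.1f} MB/s")
print(f"slow:    {slow_objects:.5f} objects/s")
print(f"slowdown factor: {slowdown:,.0f}x")
```

With these figures the object rate drops by a factor of about 32,000.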
Once it has synchronized after the disable/enable, it maxes out the NVMe/SSD drives of the
OSD nodes with some operation that I can't identify. Let me show you the symptoms below.
Let me summarize as much as I can.
We have 1 realm; in this realm we have 1 zonegroup (please help me check whether the sync
policies are OK), and in this zonegroup we have 1 cluster in the US, 1 in Hong Kong
(master) and 1 in Singapore.
Here are the realm, zonegroup and zone definitions:
https://pastebin.com/raw/pu66tqcf
Let me show you one disable/enable operation, where I disabled sync for the pix-bucket on
the HKG master site and then enabled it again.
In this screenshot:
https://i.ibb.co/WNC0gNQ/6nodes6day.png
the highlighted area is when the data sync is running after the disable/enable. You can
see almost no operations there. You can also see the periods when sync is not running; the
green and yellow lines are the NVMe RocksDB+WAL drives. The screenshot shows the SSD/NVMe
disk utilization of the 6 Singapore nodes. On the first node there is no green or yellow
in the last hours, because I reinstalled all the OSDs on that node to not use NVMe.
In the first screenshot below you can see the HKG object usage, where the user is
uploading the objects. The 2nd screenshot is the SGP one, where the highlighted area is
the disable/enable operation.
HKG, where the user uploads:
https://i.ibb.co/vj2VFYP/pixhkg6d.png
SGP, where the sync happened:
https://i.ibb.co/w41rmQT/pixsgp6d.png
Here are some troubleshooting logs: bucket sync status, cluster sync status, reshard list
(the entries there might be left over from previous testing) and sync error list:
https://pastebin.com/raw/TdwiZFC1
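For anyone who wants to gather the same four outputs on their own cluster, here is a
minimal sketch. It assumes the standard radosgw-admin subcommands for these reports and
uses the bucket name from this ticket; the exact invocations behind the pastebin above are
not shown, and the function names are only illustrative.

```python
import shutil
import subprocess


def build_commands(bucket: str) -> dict[str, list[str]]:
    """Return the radosgw-admin invocations for the four reports, keyed by label."""
    return {
        "bucket sync status": ["radosgw-admin", "bucket", "sync", "status", f"--bucket={bucket}"],
        "cluster sync status": ["radosgw-admin", "sync", "status"],
        "reshard list": ["radosgw-admin", "reshard", "list"],
        "sync error list": ["radosgw-admin", "sync", "error", "list"],
    }


def collect(bucket: str) -> dict[str, str]:
    """Run each command and return its combined stdout and stderr, keyed by label."""
    if shutil.which("radosgw-admin") is None:
        raise RuntimeError("radosgw-admin not found on PATH")
    outputs = {}
    for label, cmd in build_commands(bucket).items():
        # check=False so that one failing report does not abort the others.
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        outputs[label] = result.stdout + result.stderr
    return outputs
```

Calling collect("pix-bucket") on a node with an admin keyring would return the four reports
as strings, ready to be pasted or attached.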
The issue might be very similar to this one:
https://tracker.ceph.com/issues/21591
How should I move forward, and which additional logs can I provide to help with this?
Thank you in advance.