I think that from the 3rd time the database just goes into compaction maintenance
Can you share some more details about what exactly you mean? Do you
mean that if I restart a MON three times it goes into compaction
maintenance, and that it's not related to timing? We tried the same
on a different MON and only did two tests:
- stopping a MON for less than 5 minutes, starting it again: sync
happens immediately
- stopping a MON for more than 5 minutes, starting it again: sync
takes 15 minutes
This doesn't feel related to the payload size or keys options, but
rather to some timing option.
Quoting Eugen Block <eblock(a)nde.ag>:
Thanks, Dan!
Yes that sounds familiar from the luminous and mimic days.
The workaround for zillions of snapshot keys at that time was to use:
ceph config set mon mon_sync_max_payload_size 4096
I actually did search for mon_sync_max_payload_keys, not bytes, so I
missed your thread, it seems. Thanks for pointing that out. So these
seem to be the defaults in Octopus:
"mon_sync_max_payload_keys": "2000",
"mon_sync_max_payload_size": "1048576",
So it could be in your case that the sync payload is just too small
to efficiently move 42 million osd_snap keys? Using debug_paxos and
debug_mon you should be able to understand what is taking so long,
and tune mon_sync_max_payload_size and mon_sync_max_payload_keys
accordingly.
I'm confused: if the payload size is too small, why would decreasing
it help? Or am I misunderstanding something? But it probably won't
hurt to try it with 4096 and see if anything changes. If not, we can
still turn on debug logs and take a closer look.
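For reference, this is roughly what I have in mind, sketched as commands (the monstore path below is just an example, ceph-monstore-tool needs the MON stopped or a copy of its store directory, and the debug levels are only a suggestion):

```shell
# Dan's workaround: shrink the sync payload size (default 1048576)
ceph config set mon mon_sync_max_payload_size 4096

# Raise mon/paxos debug levels before the next sync test ...
ceph config set mon debug_mon 10
ceph config set mon debug_paxos 10
# ... reproduce the slow sync, check the MON log, then revert:
ceph config rm mon debug_mon
ceph config rm mon debug_paxos

# Count keys per prefix in a stopped (or copied) MON store;
# /var/lib/ceph/mon/ceph-a is an example path:
ceph-monstore-tool /var/lib/ceph/mon/ceph-a dump-keys | awk '{print $1}' | sort | uniq -c
```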
And in addition to Dan's suggestion, an HDD is not a good choice for
RocksDB, which is most likely the reason for this thread. I think
that from the 3rd time the database just goes into compaction
maintenance
Believe me, I know... but there's not much they can currently do
about it, quite a long story... I have been telling them that for
months now. Anyway, I will make some suggestions and report back
whether it worked in this case as well.
Thanks!
Eugen
Quoting Dan van der Ster <dan.vanderster(a)clyso.com>:
> Hi Eugen!
>
> Yes that sounds familiar from the luminous and mimic days.
>
> Check this old thread:
>
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/F3W2HXMYNF5…
> (that thread is truncated but I can tell you that it worked for Frank).
> Also the even older referenced thread:
>
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/M5ZKF7PTEO2…
>
> The workaround for zillions of snapshot keys at that time was to use:
> ceph config set mon mon_sync_max_payload_size 4096
>
> That said, that sync issue was supposed to be fixed by way of adding the
> new option mon_sync_max_payload_keys, which has been around since nautilus.
>
> So it could be in your case that the sync payload is just too small to
> efficiently move 42 million osd_snap keys? Using debug_paxos and debug_mon
> you should be able to understand what is taking so long, and tune
> mon_sync_max_payload_size and mon_sync_max_payload_keys accordingly.
>
> Good luck!
>
> Dan
>
> ______________________________________________________
> Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com
>
>
>
> On Thu, Jul 6, 2023 at 1:47 PM Eugen Block <eblock(a)nde.ag> wrote:
>
>> Hi *,
>>
>> I'm investigating an interesting issue on two customer clusters (used
>> for mirroring) I've not solved yet, but today we finally made some
>> progress. Maybe someone has an idea where to look next, I'd appreciate
>> any hints or comments.
>> These are two (latest) Octopus clusters, main usage currently is RBD
>> mirroring with snapshot mode (around 500 RBD images are synced every
>> 30 minutes). They noticed very long startup times of MON daemons after
>> reboot, times between 10 and 30 minutes (reboot time already
>> subtracted). These delays are present on both sites. Today we got a
>> maintenance window and started to check in more detail by just
>> restarting the MON service (it joins quorum within seconds), then
>> stopping the MON service and waiting a few minutes (it still joins
>> quorum within seconds). Then we stopped the service and waited for
>> more than 5 minutes, simulating a reboot, and we were able to
>> reproduce the issue. The sync then takes around 15 minutes; we
>> verified this with other MONs as well. The MON store is around 2 GB
>> in size (on HDD). I
>> understand that the sync itself can take some time, but what is the
>> threshold here? I tried to find a hint in the MON config, searching
>> for timeouts with 300 seconds, there were only a few matches
>> (mon_session_timeout is one of them), but I'm not sure if they can
>> explain this behavior.
>> Investigating the MON store (ceph-monstore-tool dump-keys) I noticed
>> that there were more than 42 million osd_snap keys, which is quite a
>> lot and would explain the size of the MON store. But I'm also not sure
>> if it's related to the long syncing process.
>> Does that sound familiar to anyone?
>>
>> Thanks,
>> Eugen
>> _______________________________________________
>> ceph-users mailing list -- ceph-users(a)ceph.io
>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>>