For anybody facing similar issues, we wrote a blog post about everything we faced, and how
we worked through it.
https://cloud.blog.csc.fi/2020/12/allas-november-2020-incident-details.html
Cheers,
Kalle
----- Original Message -----
From: "Kalle Happonen"
<kalle.happonen(a)csc.fi>
To: "Dan van der Ster" <dan(a)vanderster.com>, "ceph-users"
<ceph-users(a)ceph.io>
Sent: Monday, 14 December, 2020 10:25:32
Subject: [ceph-users] Re: osd_pglog memory hoarding - another case
> Hi all,
> Ok, so I have some updates on this.
>
> We noticed that we had a bucket with tons of RGW garbage collection pending. It
> was growing faster than we could clean it up.
>
> We suspect this was because users tried to do "s3cmd sync" operations on
> SWIFT-uploaded large files. This could logically cause issues, as S3 and
> SWIFT calculate md5sums differently on large objects.
>
> The following command lists the pending GC entries, and also shows which
> buckets are affected.
>
> radosgw-admin gc list | grep oid > garbagecollectionlist.txt
>
> Our total RGW GC backlog had grown to ~40 million entries.
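For anyone sizing a similar backlog, the per-bucket breakdown can be tallied straight from that dump. A minimal sketch; the `__shadow_` oid layout assumed here varies between Ceph versions, so treat the parsing as illustrative:

```python
import re
from collections import Counter

def gc_backlog_by_bucket(dump_text):
    """Tally pending-GC oids per bucket-marker prefix.

    Assumes lines like: "oid": "<marker>__shadow_<tail>"
    (the exact oid layout differs between Ceph versions).
    """
    counts = Counter()
    for match in re.finditer(r'"oid":\s*"([^"]+)"', dump_text):
        oid = match.group(1)
        # Everything before "__shadow_" identifies the bucket instance.
        marker = oid.split("__shadow_")[0]
        counts[marker] += 1
    return counts

# Hypothetical sample of grep'd output from `radosgw-admin gc list`.
sample = '''
"oid": "abc123.4__shadow_obj1_1",
"oid": "abc123.4__shadow_obj1_2",
"oid": "def456.7__shadow_obj9_1",
'''
print(gc_backlog_by_bucket(sample))  # Counter({'abc123.4': 2, 'def456.7': 1})
```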
>
> We stopped the main s3sync workflow that was driving the GC growth. Then we
> started running more aggressive radosgw garbage collection.
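For readers wanting to do the same, GC aggressiveness is governed by a handful of rgw options, plus manual runs of `radosgw-admin gc process --include-all`. An illustrative ceph.conf fragment; the values are examples to tune, not recommendations:

```ini
# Illustrative values only -- tune for your cluster.
[client.rgw]
# More GC shards allow more parallel GC work (default 32).
rgw_gc_max_objs = 64
# How long a GC pass may run, and how often passes start (seconds).
rgw_gc_processor_max_time = 3600
rgw_gc_processor_period = 3600
# How long deleted tail objects wait before becoming GC-eligible (seconds).
rgw_gc_obj_min_wait = 7200
```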
>
> This really helped with the memory use. It dropped a lot, and for now *knock
> on wood*, with the GC cleaned up, the memory has stayed at a lower, more
> stable level.
>
> So we hope we found the (or a) trigger for the problem.
>
> Hopefully this reveals another thread to pull for others debugging the same
> issue (and for us when we hit it again).
>
> Cheers,
> Kalle
>
> ----- Original Message -----
>> From: "Dan van der Ster" <dan(a)vanderster.com>
>> To: "Kalle Happonen" <kalle.happonen(a)csc.fi>
>> Cc: "ceph-users" <ceph-users(a)ceph.io>
>> Sent: Tuesday, 1 December, 2020 16:53:50
>> Subject: Re: [ceph-users] Re: osd_pglog memory hoarding - another case
>
>> Hi Kalle,
>>
>> Thanks for the update. Unfortunately I haven't made any progress on
>> understanding the root cause of this issue.
>> (We are still tracking our mempools closely in grafana and in our case
>> they are no longer exploding like in the incident.)
>>
>> Cheers, Dan
>>
>> On Tue, Dec 1, 2020 at 3:49 PM Kalle Happonen <kalle.happonen(a)csc.fi> wrote:
>>>
>>> Quick update: restarting OSDs is not enough for us to compact the db. So
>>> for each OSD we:
>>> 1. stop the OSD
>>> 2. run: ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$osd compact
>>> 3. start the OSD
>>>
>>> It seems to fix the spillover. Until it grows again.
>>>
>>> Cheers,
>>> Kalle
>>>
>>> ----- Original Message -----
>>> > From: "Kalle Happonen" <kalle.happonen(a)csc.fi>
>>> > To: "Dan van der Ster" <dan(a)vanderster.com>
>>> > Cc: "ceph-users" <ceph-users(a)ceph.io>
>>> > Sent: Tuesday, 1 December, 2020 15:09:37
>>> > Subject: [ceph-users] Re: osd_pglog memory hoarding - another case
>>>
>>> > Hi All,
>>> > back to this. Dan, it seems we're following exactly in your footsteps.
>>> >
>>> > We recovered from our large pg_log, and got the cluster running. A week
>>> > after our cluster was ok, we started seeing big memory increases again.
>>> > I don't know if we had buffer_anon issues before or if our big pg_logs
>>> > were masking it. But we started seeing bluefs spillover and buffer_anon
>>> > growth.
>>> >
>>> > This led to a whole other series of problems with OOM killing, which
>>> > probably resulted in mon node db growth that filled the disk, which
>>> > resulted in all mons going down, and a bigger mess of bringing everything
>>> > back up.
>>> >
>>> > However, we're back. And I think we can confirm the buffer_anon growth
>>> > and bluefs spillover.
>>> >
>>> > We now have a job that constantly writes 10k objects to a bucket and
>>> > deletes them.
>>> >
>>> > This may curb the memory growth, but I don't think it stops the problem.
>>> > We're just testing restarting OSDs, and while it takes a while, it seems
>>> > it may help. Of course this is not the greatest fix in production.
>>> >
>>> > Has anybody gleaned any new information on this issue? Things to tweak?
>>> > Fixes on the horizon? Other mitigations?
>>> >
>>> > Cheers,
>>> > Kalle
>>> >
>>> >
>>> > ----- Original Message -----
>>> >> From: "Kalle Happonen" <kalle.happonen(a)csc.fi>
>>> >> To: "Dan van der Ster" <dan(a)vanderster.com>
>>> >> Cc: "ceph-users" <ceph-users(a)ceph.io>
>>> >> Sent: Thursday, 19 November, 2020 13:56:37
>>> >> Subject: [ceph-users] Re: osd_pglog memory hoarding - another case
>>> >
>>> >> Hello,
>>> >> I thought I'd post an update.
>>> >>
>>> >> Setting the pg_log size to 500 and running the offline trim operation
>>> >> sequentially on all OSDs seems to help. With our current setup, it
>>> >> takes about 12-48h per node, depending on the PGs per OSD. The PG
>>> >> counts per OSD are ~180-750, with the majority around 200, and some
>>> >> nodes consistently at 500 per OSD. The limiting factor for the recovery
>>> >> time seems to be our NVMe, which we use for RocksDB for the OSDs.
>>> >>
>>> >> We haven't fully recovered yet, but we're working on it. Almost all
>>> >> our PGs are back up; we still have ~40/18000 PGs down, but I think
>>> >> we'll get there. Currently ~40/1200 OSDs are down.
>>> >>
>>> >> The previous mention of 32 kB per pg_log entry seems to be in the
>>> >> correct magnitude for us too. If we count 32 kB * 200 PGs * 3000 log
>>> >> entries, we're close to the ~20 GB per OSD process.
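That back-of-the-envelope estimate can be checked in a couple of lines (numbers taken from this thread):

```python
# Back-of-the-envelope pg_log memory estimate per OSD process.
entry_size_kb = 32          # ~32 kB per pg_log entry (from this thread)
pgs_per_osd = 200           # typical PG count per OSD on these nodes
log_entries_per_pg = 3000   # the default pg_log length

# kB -> GB (decimal): divide by 1e6.
total_gb = entry_size_kb * pgs_per_osd * log_entries_per_pg / 1e6
print(f"~{total_gb:.1f} GB per OSD")  # ~19.2 GB per OSD
```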
>>> >>
>>> >> For the nodes that have been trimmed, we're hovering around 100
>>> >> GB/node of memory use, or ~4 GB per OSD. So far this seems stable, but
>>> >> we don't have longer-term data on that, and we don't know exactly how
>>> >> it behaves when load is applied. However, if we're currently at the
>>> >> pg_log limit of 500, adding load should hopefully not increase pg_log
>>> >> memory consumption.
>>> >>
>>> >> Cheers,
>>> >> Kalle
>>> >>
>>> >> ----- Original Message -----
>>> >>> From: "Kalle Happonen" <kalle.happonen(a)csc.fi>
>>> >>> To: "Dan van der Ster" <dan(a)vanderster.com>
>>> >>> Cc: "ceph-users" <ceph-users(a)ceph.io>
>>> >>> Sent: Tuesday, 17 November, 2020 16:07:03
>>> >>> Subject: [ceph-users] Re: osd_pglog memory hoarding - another
case
>>> >>
>>> >>> Hi,
>>> >>>
>>> >>>> I don't think the default osd_min_pg_log_entries has changed
>>> >>>> recently. In https://tracker.ceph.com/issues/47775 I proposed that we
>>> >>>> limit the pg log length by memory -- if it is indeed possible for log
>>> >>>> entries to get into several MB, then this would be necessary IMHO.
>>> >>>
>>> >>> I've had a surprising crash course on pg_log in the last 36 hours.
>>> >>> But for the size of each entry, you're right. I counted pg_log * OSDs,
>>> >>> and did not factor in pg_log * OSDs * PGs per OSD. Still, the total
>>> >>> memory that an OSD process uses for pg_log was ~22 GB.
>>> >>>
>>> >>>
>>> >>>> But you said you were trimming PG logs with the offline tool? How
>>> >>>> long were those logs that needed to be trimmed?
>>> >>>
>>> >>> The logs we are trimming were ~3000 entries; we trimmed them to the
>>> >>> new size of 500. After restarting the OSDs, this dropped the pg_log
>>> >>> memory usage from ~22 GB to what we guess is 2-3 GB, but with the
>>> >>> cluster in this state it's hard to be specific.
>>> >>>
>>> >>> Cheers,
>>> >>> Kalle
>>> >>>
>>> >>>
>>> >>>
>>> >>>> -- dan
>>> >>>>
>>> >>>>
>>> >>>> On Tue, Nov 17, 2020 at 11:58 AM Kalle Happonen <kalle.happonen(a)csc.fi> wrote:
>>> >>>>>
>>> >>>>> Another idea, which I don't know if has any merit.
>>> >>>>>
>>> >>>>> If 8 MB is a realistic log size (or has this grown for some
>>> >>>>> reason?), did the enforcement (or default) of the minimum value
>>> >>>>> change lately (osd_min_pg_log_entries)?
>>> >>>>>
>>> >>>>> If the minimum were set to 1000, at 8 MB per log we would have
>>> >>>>> issues with memory.
>>> >>>>>
>>> >>>>> Cheers,
>>> >>>>> Kalle
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> ----- Original Message -----
>>> >>>>> > From: "Kalle Happonen" <kalle.happonen(a)csc.fi>
>>> >>>>> > To: "Dan van der Ster" <dan(a)vanderster.com>
>>> >>>>> > Cc: "ceph-users" <ceph-users(a)ceph.io>
>>> >>>>> > Sent: Tuesday, 17 November, 2020 12:45:25
>>> >>>>> > Subject: [ceph-users] Re: osd_pglog memory hoarding - another case
>>> >>>>>
>>> >>>>> > Hi Dan @ co.,
>>> >>>>> > Thanks for the support (moral and technical).
>>> >>>>> >
>>> >>>>> > That sounds like a good guess, but it seems like there is nothing
>>> >>>>> > alarming here. In all our pools, some pgs are a bit over 3100,
>>> >>>>> > but not at any exceptional values.
>>> >>>>> >
>>> >>>>> > cat pgdumpfull.txt | jq '.pg_map.pg_stats[] |
>>> >>>>> > select(.ondisk_log_size > 3100)' | egrep "pgid|ondisk_log_size"
>>> >>>>> > "pgid": "37.2b9",
>>> >>>>> > "ondisk_log_size": 3103,
>>> >>>>> > "pgid": "33.e",
>>> >>>>> > "ondisk_log_size": 3229,
>>> >>>>> > "pgid": "7.2",
>>> >>>>> > "ondisk_log_size": 3111,
>>> >>>>> > "pgid": "26.4",
>>> >>>>> > "ondisk_log_size": 3185,
>>> >>>>> > "pgid": "33.4",
>>> >>>>> > "ondisk_log_size": 3311,
>>> >>>>> > "pgid": "33.8",
>>> >>>>> > "ondisk_log_size": 3278,
>>> >>>>> >
>>> >>>>> > I also have no idea what the average size of a pg_log entry
>>> >>>>> > should be; in our case it seems it's around 8 MB (22 GB / 3000
>>> >>>>> > entries).
>>> >>>>> >
>>> >>>>> > Cheers,
>>> >>>>> > Kalle
>>> >>>>> >
>>> >>>>> > ----- Original Message -----
>>> >>>>> >> From: "Dan van der Ster" <dan(a)vanderster.com>
>>> >>>>> >> To: "Kalle Happonen" <kalle.happonen(a)csc.fi>
>>> >>>>> >> Cc: "ceph-users" <ceph-users(a)ceph.io>, "xie xingguo"
>>> >>>>> >> <xie.xingguo(a)zte.com.cn>, "Samuel Just" <sjust(a)redhat.com>
>>> >>>>> >> Sent: Tuesday, 17 November, 2020 12:22:28
>>> >>>>> >> Subject: Re: [ceph-users] osd_pglog memory hoarding - another case
>>> >>>>> >
>>> >>>>> >> Hi Kalle,
>>> >>>>> >>
>>> >>>>> >> Do you have active PGs now with huge pglogs?
>>> >>>>> >> You can do something like this to find them:
>>> >>>>> >>
>>> >>>>> >> ceph pg dump -f json | jq '.pg_map.pg_stats[] |
>>> >>>>> >> select(.ondisk_log_size > 3000)'
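If jq isn't at hand, the same filter is a few lines of Python over the `ceph pg dump -f json` output (a sketch using the field names shown in this thread):

```python
import json

def big_pglogs(pg_dump_json, threshold=3000):
    """Return (pgid, ondisk_log_size) pairs for PGs whose on-disk log
    exceeds `threshold` entries. Input: `ceph pg dump -f json` output."""
    stats = json.loads(pg_dump_json)["pg_map"]["pg_stats"]
    return [(pg["pgid"], pg["ondisk_log_size"])
            for pg in stats
            if pg["ondisk_log_size"] > threshold]

# Hypothetical miniature pg dump for illustration.
sample = json.dumps({"pg_map": {"pg_stats": [
    {"pgid": "37.2b9", "ondisk_log_size": 3103},
    {"pgid": "7.2", "ondisk_log_size": 2999},
]}})
print(big_pglogs(sample))  # [('37.2b9', 3103)]
```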
>>> >>>>> >>
>>> >>>>> >> If you find some, could you increase to debug_osd = 10 and then
>>> >>>>> >> share the osd log? I am interested in the debug lines from
>>> >>>>> >> calc_trim_to_aggressively (or calc_trim_to if you didn't enable
>>> >>>>> >> pglog_hardlimit), but the whole log might show other issues.
>>> >>>>> >>
>>> >>>>> >> Cheers, dan
>>> >>>>> >>
>>> >>>>> >>
>>> >>>>> >> On Tue, Nov 17, 2020 at 9:55 AM Dan van der Ster <dan(a)vanderster.com> wrote:
>>> >>>>> >>>
>>> >>>>> >>> Hi Kalle,
>>> >>>>> >>>
>>> >>>>> >>> Strangely and luckily, in our case the memory explosion didn't
>>> >>>>> >>> reoccur after that incident. So I can mostly only offer moral
>>> >>>>> >>> support.
>>> >>>>> >>>
>>> >>>>> >>> But if this bug indeed appeared between 14.2.8 and 14.2.13,
>>> >>>>> >>> then I think this is suspicious:
>>> >>>>> >>>
>>> >>>>> >>> b670715eb4 osd/PeeringState: do not trim pg log past
>>> >>>>> >>> last_update_ondisk
>>> >>>>> >>>
>>> >>>>> >>> https://github.com/ceph/ceph/commit/b670715eb4
>>> >>>>> >>>
>>> >>>>> >>> Given that it adds a case where the pg_log is not trimmed, I
>>> >>>>> >>> wonder if there could be an unforeseen condition where
>>> >>>>> >>> `last_update_ondisk` isn't being updated correctly, and
>>> >>>>> >>> therefore the osd stops trimming the pg_log altogether.
>>> >>>>> >>>
>>> >>>>> >>> Xie or Samuel: does that sound possible?
>>> >>>>> >>>
>>> >>>>> >>> Cheers, Dan
>>> >>>>> >>>
>>> >>>>> >>> On Tue, Nov 17, 2020 at 9:35 AM Kalle Happonen <kalle.happonen(a)csc.fi> wrote:
>>> >>>>> >>> >
>>> >>>>> >>> > Hello all,
>>> >>>>> >>> > wrt:
>>> >>>>> >>> > https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/7IMIWCKIHXN…
>>> >>>>> >>> >
>>> >>>>> >>> > Yesterday we hit a problem with osd_pglog memory, similar to
>>> >>>>> >>> > the thread above.
>>> >>>>> >>> >
>>> >>>>> >>> > We have a 56-node object storage (S3+SWIFT) cluster with 25
>>> >>>>> >>> > OSD disks per node. We run 8+3 EC for the data pool (metadata
>>> >>>>> >>> > is on a replicated NVMe pool).
>>> >>>>> >>> >
>>> >>>>> >>> > The cluster has been running fine, and (as relevant to the
>>> >>>>> >>> > post) the memory usage has been stable at 100 GB/node. We've
>>> >>>>> >>> > had the default pg_log size of 3000. The user traffic doesn't
>>> >>>>> >>> > seem to have been exceptional lately.
>>> >>>>> >>> >
>>> >>>>> >>> > Last Thursday we updated the OSDs from 14.2.8 -> 14.2.13. On
>>> >>>>> >>> > Friday the memory usage on the OSD nodes started to grow. On
>>> >>>>> >>> > each node it grew steadily, about 30 GB/day, until the servers
>>> >>>>> >>> > started OOM-killing OSD processes.
>>> >>>>> >>> >
>>> >>>>> >>> > After a lot of debugging we found that the pg_logs were huge.
>>> >>>>> >>> > Each OSD process's pg_log had grown to ~22 GB, which we
>>> >>>>> >>> > naturally didn't have memory for, and then the cluster was in
>>> >>>>> >>> > an unstable situation. This is significantly more than the
>>> >>>>> >>> > 1.5 GB in the post above. We do have ~20k PGs, which may
>>> >>>>> >>> > directly affect the size.
>>> >>>>> >>> >
>>> >>>>> >>> > We've reduced the pg_log to 500 and started offline trimming
>>> >>>>> >>> > it where we can, and also just waited. The pg_log size dropped
>>> >>>>> >>> > to ~1.2 GB on at least some nodes, but we're still recovering
>>> >>>>> >>> > and still have a lot of OSDs down and out.
>>> >>>>> >>> >
>>> >>>>> >>> > We're unsure if version 14.2.13 triggered this, or if the OSD
>>> >>>>> >>> > restarts triggered it (or something unrelated we don't see).
>>> >>>>> >>> >
>>> >>>>> >>> > This mail is mostly to ask whether there are good guesses as
>>> >>>>> >>> > to why the pg_log size per OSD process exploded. Any technical
>>> >>>>> >>> > (and moral) support is appreciated. Also, since we're not yet
>>> >>>>> >>> > sure if 14.2.13 triggered this, this is also to put a data
>>> >>>>> >>> > point out there for other debuggers.
>>> >>>>> >>> >
>>> >>>>> >>> > Cheers,
>>> >>>>> >>> > Kalle Happonen
>>> >>>>> >>> > _______________________________________________
>>> >>>>> >>> > ceph-users mailing list -- ceph-users(a)ceph.io
>>> >>>>> >>> > To unsubscribe send an email to ceph-users-leave(a)ceph.io