For anybody facing similar issues, we wrote a blog post about everything we faced, and how
we worked through it.
https://cloud.blog.csc.fi/2020/12/allas-november-2020-incident-details.html
Cheers,
Kalle
----- Original Message -----
From: "Kalle Happonen"
<kalle.happonen(a)csc.fi>
To: "Dan van der Ster" <dan(a)vanderster.com>, "ceph-users"
<ceph-users(a)ceph.io>
Sent: Monday, 14 December, 2020 10:25:32
Subject: [ceph-users] Re: osd_pglog memory hoarding - another case
> Hi all,
> Ok, so I have some updates on this.
>
> We noticed that we had a bucket with tons of RGW garbage collection pending. It
> was growing faster than we could clean it up.
>
> We suspect this was because users tried to do "s3cmd sync" operations on
> SWIFT-uploaded large files. This could logically cause issues, as S3 and
> SWIFT calculate md5sums differently on large objects.
>
> The following command lists the pending GC entries, and also shows which
> buckets are affected.
>
> radosgw-admin gc list | grep oid > garbagecollectionlist.txt
>
> Our total RGW GC backlog had grown to ~40 million entries.
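For anyone sizing a similar backlog, the per-bucket breakdown can be tallied straight from that dump. A minimal sketch; the `__shadow_` oid layout assumed here varies between Ceph versions, so treat the parsing as illustrative:

```python
import re
from collections import Counter

def gc_backlog_by_bucket(dump_text):
    """Tally pending-GC oids per bucket-marker prefix.

    Assumes lines like: "oid": "<marker>__shadow_<tail>"
    (the exact oid layout differs between Ceph versions).
    """
    counts = Counter()
    for match in re.finditer(r'"oid":\s*"([^"]+)"', dump_text):
        oid = match.group(1)
        # Everything before "__shadow_" identifies the bucket instance.
        marker = oid.split("__shadow_")[0]
        counts[marker] += 1
    return counts

# Hypothetical sample of grep'd output from `radosgw-admin gc list`.
sample = '''
"oid": "abc123.4__shadow_obj1_1",
"oid": "abc123.4__shadow_obj1_2",
"oid": "def456.7__shadow_obj9_1",
'''
print(gc_backlog_by_bucket(sample))  # Counter({'abc123.4': 2, 'def456.7': 1})
```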
>
> We stopped the main s3sync workflow that was driving the GC growth. Then we
> started running more aggressive radosgw garbage collection.
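For readers wanting to do the same, GC aggressiveness is governed by a handful of rgw options, plus manual runs of `radosgw-admin gc process --include-all`. An illustrative ceph.conf fragment; the values are examples to tune, not recommendations:

```ini
# Illustrative values only -- tune for your cluster.
[client.rgw]
# More GC shards allow more parallel GC work (default 32).
rgw_gc_max_objs = 64
# How long a GC pass may run, and how often passes start (seconds).
rgw_gc_processor_max_time = 3600
rgw_gc_processor_period = 3600
# How long deleted tail objects wait before becoming GC-eligible (seconds).
rgw_gc_obj_min_wait = 7200
```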
>
> This really helped with the memory use. It dropped a lot, and for now *knock
> on wood*, with the GC cleaned up, the memory has stayed at a lower, more
> stable level.
>
> So we hope we found the (or a) trigger for the problem.
>
> Hopefully this reveals another thread to pull for others debugging the same
> issue (and for us when we hit it again).
>
> Cheers,
> Kalle
>
> ----- Original Message -----
>> From: "Dan van der Ster" <dan(a)vanderster.com>
>> To: "Kalle Happonen" <kalle.happonen(a)csc.fi>
>> Cc: "ceph-users" <ceph-users(a)ceph.io>
>> Sent: Tuesday, 1 December, 2020 16:53:50
>> Subject: Re: [ceph-users] Re: osd_pglog memory hoarding - another case
>
>> Hi Kalle,
>>
>> Thanks for the update. Unfortunately I haven't made any progress on
>> understanding the root cause of this issue.
>> (We are still tracking our mempools closely in grafana and in our case
>> they are no longer exploding like in the incident.)
>>
>> Cheers, Dan
>>
>> On Tue, Dec 1, 2020 at 3:49 PM Kalle Happonen <kalle.happonen(a)csc.fi> wrote:
>>>
>>> Quick update: restarting OSDs is not enough for us to compact the db. So
>>> for each OSD we:
>>> 1. stop the OSD
>>> 2. run: ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$osd compact
>>> 3. start the OSD
>>>
>>> It seems to fix the spillover. Until it grows again.
>>>
>>> Cheers,
>>> Kalle
>>>
>>> ----- Original Message -----
>>> > From: "Kalle Happonen" <kalle.happonen(a)csc.fi>
>>> > To: "Dan van der Ster" <dan(a)vanderster.com>
>>> > Cc: "ceph-users" <ceph-users(a)ceph.io>
>>> > Sent: Tuesday, 1 December, 2020 15:09:37
>>> > Subject: [ceph-users] Re: osd_pglog memory hoarding - another case
>>>
>>> > Hi All,
>>> > back to this. Dan, it seems we're following exactly in your footsteps.
>>> >
>>> > We recovered from our large pg_log, and got the cluster running. A week
>>> > after our cluster was ok, we started seeing big memory increases again.
>>> > I don't know if we had buffer_anon issues before or if our big pg_logs
>>> > were masking it. But we started seeing bluefs spillover and buffer_anon
>>> > growth.
>>> >
>>> > This led to a whole other series of problems with OOM killing, which
>>> > probably resulted in mon node db growth that filled the disk, which
>>> > resulted in all mons going down, and a bigger mess of bringing everything
>>> > back up.
>>> >
>>> > However, we're back. And I think we can confirm the buffer_anon growth
>>> > and bluefs spillover.
>>> >
>>> > We now have a job that constantly writes 10k objects to a bucket and
>>> > deletes them.
>>> >
>>> > This may curb the memory growth, but I don't think it stops the problem.
>>> > We're just testing restarting OSDs, and while it takes a while, it seems
>>> > it may help. Of course this is not the greatest fix in production.
>>> >
>>> > Has anybody gleaned any new information on this issue? Things to tweak?
>>> > Fixes on the horizon? Other mitigations?
>>> >
>>> > Cheers,
>>> > Kalle
>>> >
>>> >
>>> > ----- Original Message -----
>>> >> From: "Kalle Happonen" <kalle.happonen(a)csc.fi>
>>> >> To: "Dan van der Ster" <dan(a)vanderster.com>
>>> >> Cc: "ceph-users" <ceph-users(a)ceph.io>
>>> >> Sent: Thursday, 19 November, 2020 13:56:37
>>> >> Subject: [ceph-users] Re: osd_pglog memory hoarding - another case
>>> >
>>> >> Hello,
>>> >> I thought I'd post an update.
>>> >>
>>> >> Setting the pg_log size to 500 and running the offline trim operation
>>> >> sequentially on all OSDs seems to help. With our current setup, it
>>> >> takes about 12-48h per node, depending on the PGs per OSD. The PG
>>> >> counts per OSD are ~180-750, with the majority around 200, and some
>>> >> nodes consistently at 500 per OSD. The limiting factor for the recovery
>>> >> time seems to be our NVMe, which we use for RocksDB for the OSDs.
>>> >>
>>> >> We haven't fully recovered yet, but we're working on it. Almost all
>>> >> our PGs are back up; we still have ~40/18000 PGs down, but I think
>>> >> we'll get there. Currently ~40/1200 OSDs are down.
>>> >>
>>> >> The previous mention of 32 kB per pg_log entry seems to be in the
>>> >> correct magnitude for us too. If we count 32 kB * 200 PGs * 3000 log
>>> >> entries, we're close to the ~20 GB per OSD process.
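That back-of-the-envelope estimate can be checked in a couple of lines (numbers taken from this thread):

```python
# Back-of-the-envelope pg_log memory estimate per OSD process.
entry_size_kb = 32          # ~32 kB per pg_log entry (from this thread)
pgs_per_osd = 200           # typical PG count per OSD on these nodes
log_entries_per_pg = 3000   # the default pg_log length

# kB -> GB (decimal): divide by 1e6.
total_gb = entry_size_kb * pgs_per_osd * log_entries_per_pg / 1e6
print(f"~{total_gb:.1f} GB per OSD")  # ~19.2 GB per OSD
```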
>>> >>
>>> >> For the nodes that have been trimmed, we're hovering around 100
>>> >> GB/node of memory use, or ~4 GB per OSD. So far this seems stable, but
>>> >> we don't have longer-term data on that, and we don't know exactly how
>>> >> it behaves when load is applied. However, if we're currently at the
>>> >> pg_log limit of 500, adding load should hopefully not increase pg_log
>>> >> memory consumption.
>>> >>
>>> >> Cheers,
>>> >> Kalle
>>> >>
>>> >> ----- Original Message -----
>>> >>> From: "Kalle Happonen" <kalle.happonen(a)csc.fi>
>>> >>> To: "Dan van der Ster" <dan(a)vanderster.com>
>>> >>> Cc: "ceph-users" <ceph-users(a)ceph.io>
>>> >>> Sent: Tuesday, 17 November, 2020 16:07:03
>>> >>> Subject: [ceph-users] Re: osd_pglog memory hoarding - another
case
>>> >>
>>> >>> Hi,
>>> >>>
>>> >>>> I don't think the default osd_min_pg_log_entries has changed
>>> >>>> recently. In https://tracker.ceph.com/issues/47775 I proposed that we
>>> >>>> limit the pg log length by memory -- if it is indeed possible for log
>>> >>>> entries to get into several MB, then this would be necessary IMHO.
>>> >>>
>>> >>> I've had a surprising crash course on pg_log in the last 36 hours.
>>> >>> But for the size of each entry, you're right. I counted pg_log * OSDs,
>>> >>> and did not factor in pg_log * OSDs * PGs per OSD. Still, the total
>>> >>> memory that an OSD process uses for pg_log was ~22 GB.
>>> >>>
>>> >>>
>>> >>>> But you said you were trimming PG logs with the offline tool? How
>>> >>>> long were those logs that needed to be trimmed?
>>> >>>
>>> >>> The logs we are trimming were ~3000 entries; we trimmed them to the
>>> >>> new size of 500. After restarting the OSDs, this dropped the pg_log
>>> >>> memory usage from ~22 GB to what we guess is 2-3 GB, but with the
>>> >>> cluster in this state it's hard to be specific.
>>> >>>
>>> >>> Cheers,
>>> >>> Kalle
>>> >>>
>>> >>>
>>> >>>
>>> >>>> -- dan
>>> >>>>
>>> >>>>
>>> >>>> On Tue, Nov 17, 2020 at 11:58 AM Kalle Happonen <kalle.happonen(a)csc.fi> wrote:
>>> >>>>>
>>> >>>>> Another idea, which I don't know if has any merit.
>>> >>>>>
>>> >>>>> If 8 MB is a realistic log size (or has this grown for some
>>> >>>>> reason?), did the enforcement (or default) of the minimum value
>>> >>>>> change lately (osd_min_pg_log_entries)?
>>> >>>>>
>>> >>>>> If the minimum were set to 1000, at 8 MB per log we would have
>>> >>>>> issues with memory.
>>> >>>>>
>>> >>>>> Cheers,
>>> >>>>> Kalle
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> ----- Original Message -----
>>> >>>>> > From: "Kalle Happonen" <kalle.happonen(a)csc.fi>
>>> >>>>> > To: "Dan van der Ster" <dan(a)vanderster.com>
>>> >>>>> > Cc: "ceph-users" <ceph-users(a)ceph.io>
>>> >>>>> > Sent: Tuesday, 17 November, 2020 12:45:25
>>> >>>>> > Subject: [ceph-users] Re: osd_pglog memory hoarding - another case
>>> >>>>>
>>> >>>>> > Hi Dan @ co.,
>>> >>>>> > Thanks for the support (moral and technical).
>>> >>>>> >
>>> >>>>> > That sounds like a good guess, but it seems like there is nothing
>>> >>>>> > alarming here. In all our pools, some pgs are a bit over 3100,
>>> >>>>> > but not at any exceptional values.
>>> >>>>> >
>>> >>>>> > cat pgdumpfull.txt | jq '.pg_map.pg_stats[] |
>>> >>>>> > select(.ondisk_log_size > 3100)' | egrep "pgid|ondisk_log_size"
>>> >>>>> > "pgid": "37.2b9",
>>> >>>>> > "ondisk_log_size": 3103,
>>> >>>>> > "pgid": "33.e",
>>> >>>>> > "ondisk_log_size": 3229,
>>> >>>>> > "pgid": "7.2",
>>> >>>>> > "ondisk_log_size": 3111,
>>> >>>>> > "pgid": "26.4",
>>> >>>>> > "ondisk_log_size": 3185,
>>> >>>>> > "pgid": "33.4",
>>> >>>>> > "ondisk_log_size": 3311,
>>> >>>>> > "pgid": "33.8",
>>> >>>>> > "ondisk_log_size": 3278,
>>> >>>>> >
>>> >>>>> > I also have no idea what the average size of a pg_log entry
>>> >>>>> > should be; in our case it seems it's around 8 MB (22 GB / 3000
>>> >>>>> > entries).
>>> >>>>> >
>>> >>>>> > Cheers,
>>> >>>>> > Kalle
>>> >>>>> >
>>> >>>>> > ----- Original Message -----
>>> >>>>> >> From: "Dan van der Ster" <dan(a)vanderster.com>
>>> >>>>> >> To: "Kalle Happonen" <kalle.happonen(a)csc.fi>
>>> >>>>> >> Cc: "ceph-users" <ceph-users(a)ceph.io>, "xie xingguo"
>>> >>>>> >> <xie.xingguo(a)zte.com.cn>, "Samuel Just" <sjust(a)redhat.com>
>>> >>>>> >> Sent: Tuesday, 17 November, 2020 12:22:28
>>> >>>>> >> Subject: Re: [ceph-users] osd_pglog memory hoarding - another case
>>> >>>>> >
>>> >>>>> >> Hi Kalle,
>>> >>>>> >>
>>> >>>>> >> Do you have active PGs now with huge pglogs?
>>> >>>>> >> You can do something like this to find them:
>>> >>>>> >>
>>> >>>>> >> ceph pg dump -f json | jq '.pg_map.pg_stats[] |
>>> >>>>> >> select(.ondisk_log_size > 3000)'
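If jq isn't at hand, the same filter is a few lines of Python over the `ceph pg dump -f json` output (a sketch using the field names shown in this thread):

```python
import json

def big_pglogs(pg_dump_json, threshold=3000):
    """Return (pgid, ondisk_log_size) pairs for PGs whose on-disk log
    exceeds `threshold` entries. Input: `ceph pg dump -f json` output."""
    stats = json.loads(pg_dump_json)["pg_map"]["pg_stats"]
    return [(pg["pgid"], pg["ondisk_log_size"])
            for pg in stats
            if pg["ondisk_log_size"] > threshold]

# Hypothetical miniature pg dump for illustration.
sample = json.dumps({"pg_map": {"pg_stats": [
    {"pgid": "37.2b9", "ondisk_log_size": 3103},
    {"pgid": "7.2", "ondisk_log_size": 2999},
]}})
print(big_pglogs(sample))  # [('37.2b9', 3103)]
```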
>>> >>>>> >>
>>> >>>>> >> If you find some, could you increase to debug_osd = 10 and then
>>> >>>>> >> share the osd log? I am interested in the debug lines from
>>> >>>>> >> calc_trim_to_aggressively (or calc_trim_to if you didn't enable
>>> >>>>> >> pglog_hardlimit), but the whole log might show other issues.
>>> >>>>> >>
>>> >>>>> >> Cheers, dan
>>> >>>>> >>
>>> >>>>> >>
>>> >>>>> >> On Tue, Nov 17, 2020 at 9:55 AM Dan van der Ster <dan(a)vanderster.com> wrote:
>>> >>>>> >>>
>>> >>>>> >>> Hi Kalle,
>>> >>>>> >>>
>>> >>>>> >>> Strangely and luckily, in our case the memory explosion didn't
>>> >>>>> >>> reoccur after that incident. So I can mostly only offer moral
>>> >>>>> >>> support.
>>> >>>>> >>>
>>> >>>>> >>> But if this bug indeed appeared between 14.2.8 and 14.2.13,
>>> >>>>> >>> then I think this is suspicious:
>>> >>>>> >>>
>>> >>>>> >>> b670715eb4 osd/PeeringState: do not trim pg log past
>>> >>>>> >>> last_update_ondisk
>>> >>>>> >>>
>>> >>>>> >>> https://github.com/ceph/ceph/commit/b670715eb4
>>> >>>>> >>>
>>> >>>>> >>> Given that it adds a case where the pg_log is not trimmed, I
>>> >>>>> >>> wonder if there could be an unforeseen condition where
>>> >>>>> >>> `last_update_ondisk` isn't being updated correctly, and
>>> >>>>> >>> therefore the osd stops trimming the pg_log altogether.
>>> >>>>> >>>
>>> >>>>> >>> Xie or Samuel: does that sound possible?
>>> >>>>> >>>
>>> >>>>> >>> Cheers, Dan
>>> >>>>> >>>
>>> >>>>> >>> On Tue, Nov 17, 2020 at 9:35 AM Kalle Happonen <kalle.happonen(a)csc.fi> wrote:
>>> >>>>> >>> >
>>> >>>>> >>> > Hello all,
>>> >>>>> >>> > wrt:
>>> >>>>> >>> > https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/7IMIWCKIHXN…
>>> >>>>> >>> >
>>> >>>>> >>> > Yesterday we hit a problem with osd_pglog memory, similar to
>>> >>>>> >>> > the thread above.
>>> >>>>> >>> >
>>> >>>>> >>> > We have a 56-node object storage (S3+SWIFT) cluster with 25
>>> >>>>> >>> > OSD disks per node. We run 8+3 EC for the data pool (metadata
>>> >>>>> >>> > is on a replicated NVMe pool).
>>> >>>>> >>> >
>>> >>>>> >>> > The cluster has been running fine, and (as relevant to the
>>> >>>>> >>> > post) the memory usage has been stable at 100 GB/node. We've
>>> >>>>> >>> > had the default pg_log size of 3000. The user traffic doesn't
>>> >>>>> >>> > seem to have been exceptional lately.
>>> >>>>> >>> >
>>> >>>>> >>> > Last Thursday we updated the OSDs from 14.2.8 -> 14.2.13. On
>>> >>>>> >>> > Friday the memory usage on the OSD nodes started to grow. On
>>> >>>>> >>> > each node it grew steadily, about 30 GB/day, until the servers
>>> >>>>> >>> > started OOM-killing OSD processes.
>>> >>>>> >>> >
>>> >>>>> >>> > After a lot of debugging we found that the pg_logs were huge.
>>> >>>>> >>> > Each OSD process's pg_log had grown to ~22 GB, which we
>>> >>>>> >>> > naturally didn't have memory for, and then the cluster was in
>>> >>>>> >>> > an unstable situation. This is significantly more than the
>>> >>>>> >>> > 1.5 GB in the post above. We do have ~20k PGs, which may
>>> >>>>> >>> > directly affect the size.
>>> >>>>> >>> >
>>> >>>>> >>> > We've reduced the pg_log to 500 and started offline trimming
>>> >>>>> >>> > it where we can, and also just waited. The pg_log size dropped
>>> >>>>> >>> > to ~1.2 GB on at least some nodes, but we're still recovering
>>> >>>>> >>> > and still have a lot of OSDs down and out.
>>> >>>>> >>> >
>>> >>>>> >>> > We're unsure if version 14.2.13 triggered this, or if the OSD
>>> >>>>> >>> > restarts triggered it (or something unrelated we don't see).
>>> >>>>> >>> >
>>> >>>>> >>> > This mail is mostly to ask whether there are good guesses as
>>> >>>>> >>> > to why the pg_log size per OSD process exploded. Any technical
>>> >>>>> >>> > (and moral) support is appreciated. Also, since we're not yet
>>> >>>>> >>> > sure if 14.2.13 triggered this, this is also to put a data
>>> >>>>> >>> > point out there for other debuggers.
>>> >>>>> >>> >
>>> >>>>> >>> > Cheers,
>>> >>>>> >>> > Kalle Happonen
>>> >>>>> >>> > _______________________________________________
>>> >>>>> >>> > ceph-users mailing list -- ceph-users(a)ceph.io
>>> >>>>> >>> > To unsubscribe send an email to ceph-users-leave(a)ceph.io