Also did you solve your log spam issue here?
https://tracker.ceph.com/issues/49161
Surely these things are related?
You might need to share more complete logs from the cluster, mon, osd, mds, and
mgr daemons so that we can help get to the bottom of this.
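
For example, something like this should capture the relevant pieces
(untested, and the systemd unit name depends on how your daemons were
deployed):

  ceph status > ceph-status.txt
  ceph versions > ceph-versions.txt
  ceph config dump > config-dump.txt
  # mon log via journald, since you log to syslog:
  journalctl -u ceph-mon@$(hostname -s) --since "1 hour ago" > mon.log
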
-- dan
On Thu, Feb 25, 2021 at 10:04 PM Janek Bevendorff
<janek.bevendorff(a)uni-weimar.de> wrote:
>
> Thanks, I’ll try that tomorrow.
>
>
> > On 25. Feb 2021, at 21:59, Dan van der Ster <dan(a)vanderster.com> wrote:
> >
> > Maybe the debugging steps in that insights tracker can be helpful
> > anyway: https://tracker.ceph.com/issues/39955
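> >
> > If nothing else, counting the mon store contents by key prefix usually
> > shows which map is growing. Roughly, from memory (run against a stopped
> > mon, or against a copy of store.db; path assumes the default data dir):
> >
> >   ceph-monstore-tool /var/lib/ceph/mon/ceph-$(hostname -s) dump-keys \
> >     | awk '{print $1}' | sort | uniq -c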
> >
> > -- dan
> >
> > On Thu, Feb 25, 2021 at 9:27 PM Janek Bevendorff
> > <janek.bevendorff(a)uni-weimar.de> wrote:
> >>
> >> Thanks for the tip, but I do not have degraded PGs and the module is already disabled.
> >>
> >>
> >> On 25. Feb 2021, at 21:17, Seena Fallah <seenafallah(a)gmail.com> wrote:
> >>
> >> I had the same problem in my cluster, and it was because of the insights mgr
> >> module storing lots of data in RocksDB while my cluster was degraded.
> >> If you have degraded PGs, try disabling the insights module.
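> >>
> >> Something like this, assuming a release where the module is just called
> >> "insights":
> >>
> >>   ceph mgr module ls | grep insights
> >>   ceph mgr module disable insights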
> >>
> >> On Thu, Feb 25, 2021 at 11:40 PM Dan van der Ster <dan(a)vanderster.com> wrote:
> >>>
> >>>> "source": "osd.104...
> >>>
> >>> What's happening on that osd? Is it something new which corresponds to when
> >>> your mon started growing? Are other OSDs also flooding the mons with logs?
> >>>
> >>> I'm mobile so can't check... Are those logging configs the defaults? If not,
> >>> revert to default...
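> >>>
> >>> (From memory, if those were set with `ceph config set`, reverting is just
> >>> removing each override, e.g.:
> >>>
> >>>   ceph config rm global clog_to_syslog_level
> >>>   ceph config rm global mon_cluster_log_to_syslog_level
> >>>
> >>> and so on for every log option shown in `ceph config dump | grep log`.)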
> >>>
> >>> BTW do your mons have stable quorum or are they flapping with this load?
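> >>>
> >>> (Quick check, from memory:
> >>>
> >>>   ceph mon stat
> >>>   ceph quorum_status -f json-pretty
> >>>
> >>> and see whether the quorum set stays the same between runs.)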
> >>>
> >>> .. dan
> >>>
> >>>
> >>>
> >>> On Thu, Feb 25, 2021, 8:58 PM Janek Bevendorff <janek.bevendorff(a)uni-weimar.de> wrote:
> >>>
> >>>> Thanks, Dan.
> >>>>
> >>>> On the first MON, the command doesn’t even return, but I was able to get a
> >>>> dump from the one I restarted most recently. The oldest ops look like this:
> >>>>
> >>>> {
> >>>>     "description": "log(1000 entries from seq 17876238 at 2021-02-25T15:13:20.306487+0100)",
> >>>>     "initiated_at": "2021-02-25T20:40:34.698932+0100",
> >>>>     "age": 183.762551121,
> >>>>     "duration": 183.762599201,
> >>>>     "type_data": {
> >>>>         "events": [
> >>>>             { "time": "2021-02-25T20:40:34.698932+0100", "event": "initiated" },
> >>>>             { "time": "2021-02-25T20:40:34.698636+0100", "event": "throttled" },
> >>>>             { "time": "2021-02-25T20:40:34.698932+0100", "event": "header_read" },
> >>>>             { "time": "2021-02-25T20:40:34.701407+0100", "event": "all_read" },
> >>>>             { "time": "2021-02-25T20:40:34.701455+0100", "event": "dispatched" },
> >>>>             { "time": "2021-02-25T20:40:34.701458+0100", "event": "mon:_ms_dispatch" },
> >>>>             { "time": "2021-02-25T20:40:34.701459+0100", "event": "mon:dispatch_op" },
> >>>>             { "time": "2021-02-25T20:40:34.701459+0100", "event": "psvc:dispatch" },
> >>>>             { "time": "2021-02-25T20:40:34.701490+0100", "event": "logm:wait_for_readable" },
> >>>>             { "time": "2021-02-25T20:40:34.701491+0100", "event": "logm:wait_for_readable/paxos" },
> >>>>             { "time": "2021-02-25T20:40:34.701496+0100", "event": "paxos:wait_for_readable" },
> >>>>             { "time": "2021-02-25T20:40:34.989198+0100", "event": "callback finished" },
> >>>>             { "time": "2021-02-25T20:40:34.989199+0100", "event": "psvc:dispatch" },
> >>>>             { "time": "2021-02-25T20:40:34.989208+0100", "event": "logm:preprocess_query" },
> >>>>             { "time": "2021-02-25T20:40:34.989208+0100", "event": "logm:preprocess_log" },
> >>>>             { "time": "2021-02-25T20:40:34.989278+0100", "event": "forward_request_leader" },
> >>>>             { "time": "2021-02-25T20:40:34.989344+0100", "event": "forwarded" },
> >>>>             { "time": "2021-02-25T20:41:58.658022+0100", "event": "resend forwarded message to leader" },
> >>>>             { "time": "2021-02-25T20:42:27.735449+0100", "event": "resend forwarded message to leader" }
> >>>>         ],
> >>>>         "info": {
> >>>>             "seq": 41550,
> >>>>             "src_is_mon": false,
> >>>>             "source": "osd.104 v2:XXX:6864/16579",
> >>>>             "forwarded_to_leader": true
> >>>>         }
> >>>>     }
> >>>> }
> >>>>
> >>>>
> >>>> Any idea what that might be about? Almost looks like this:
> >>>> https://tracker.ceph.com/issues/24180
> >>>> I set debug_mon to 0, but I keep getting a lot of log spill in the journals.
> >>>> It’s about 1-2 messages per second, mostly RocksDB stuff, but nothing that
> >>>> actually looks serious or even log-worthy. I had noticed before that, despite
> >>>> logging being set to warning level, the cluster log keeps being written to
> >>>> the MON log. But it shouldn’t cause such massive stability issues, should it?
> >>>> The date on the log op is also weird: 15:13+0100 was hours ago.
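> >>>>
> >>>> Since the stuck ops are all forwarded log messages, would it be reasonable
> >>>> to stop the daemons from sending cluster log entries to the MONs entirely,
> >>>> something like (untested, and I am not sure about the exact value syntax):
> >>>>
> >>>>   ceph config set global clog_to_monitors false
> >>>>
> >>>> Or would I lose too much visibility that way?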
> >>>>
> >>>> Here’s my log config:
> >>>>
> >>>> global  advanced  clog_to_syslog_level              warning
> >>>> global  basic     err_to_syslog                     true
> >>>> global  basic     log_to_file                       false
> >>>> global  basic     log_to_stderr                     false
> >>>> global  basic     log_to_syslog                     true
> >>>> global  advanced  mon_cluster_log_file_level        error
> >>>> global  advanced  mon_cluster_log_to_file           false
> >>>> global  advanced  mon_cluster_log_to_stderr         false
> >>>> global  advanced  mon_cluster_log_to_syslog         false
> >>>> global  advanced  mon_cluster_log_to_syslog_level   warning
> >>>>
> >>>>
> >>>>
> >>>> Ceph version is 15.2.8.
> >>>>
> >>>> Janek
> >>>>
> >>>>
> >>>> On 25. Feb 2021, at 20:33, Dan van der Ster <dan(a)vanderster.com> wrote:
> >>>>
> >>>> ceph daemon mon.`hostname -s` ops
> >>>>
> >>>> That should show you the accumulating ops.
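> >>>>
> >>>> If there are thousands of them, something like this (assuming jq is
> >>>> available and the usual "ops" JSON layout) gives a quick summary of what
> >>>> they are:
> >>>>
> >>>>   ceph daemon mon.`hostname -s` ops | jq -r '.ops[].description' \
> >>>>     | cut -c1-40 | sort | uniq -c | sort -rn | head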
> >>>>
> >>>> .. dan
> >>>>
> >>>>
> >>>> On Thu, Feb 25, 2021, 8:23 PM Janek Bevendorff <janek.bevendorff(a)uni-weimar.de> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> All of a sudden, we are experiencing very concerning MON behaviour. We
> >>>>> have five MONs, and all of them have thousands up to tens of thousands of
> >>>>> slow ops, with the oldest one blocking basically indefinitely (at least the
> >>>>> timer keeps creeping up). Additionally, the MON stores keep inflating
> >>>>> heavily. Under normal circumstances we have about 450-550 MB there; right
> >>>>> now it’s 27 GB and growing (rapidly).
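> >>>>>
> >>>>> (For reference, I am reading the store size off disk, roughly:
> >>>>>
> >>>>>   du -sh /var/lib/ceph/mon/*/store.db
> >>>>>
> >>>>> assuming the default data dir layout.)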
> >>>>>
> >>>>> I tried restarting all MONs, I disabled auto-scaling (just in case), and
> >>>>> checked the system load and hardware. I also restarted the MGR and MDS
> >>>>> daemons, but to no avail.
> >>>>>
> >>>>> Is there any way I can debug this properly? I can’t seem to find how I can
> >>>>> actually view what ops are causing this and what client (if any) may be
> >>>>> responsible for it.
> >>>>>
> >>>>> Thanks
> >>>>> Janek
> >>>>>
> >>>>
> >>>>
> >>
> >>
>