On 25. Feb 2021, at 21:59, Dan van der Ster
<dan(a)vanderster.com> wrote:
Maybe the debugging steps in that insights tracker can be helpful
anyway:
https://tracker.ceph.com/issues/39955
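
For example, a rough check for leftover insights data in the mon KV store
(a sketch; the "mgr/insights" key prefix is an assumption based on how mgr
modules persist their state):

    ceph config-key ls | grep 'mgr/insights'
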
-- dan
On Thu, Feb 25, 2021 at 9:27 PM Janek Bevendorff
<janek.bevendorff(a)uni-weimar.de> wrote:
>
> Thanks for the tip, but I do not have degraded PGs and the module is
> already disabled.
>
>
> On 25. Feb 2021, at 21:17, Seena Fallah <seenafallah(a)gmail.com> wrote:
>
> I had the same problem in my cluster, and it was because of the insights mgr
> module, which was storing lots of data in RocksDB because my cluster was degraded.
> If you have degraded PGs, try to disable the insights module.
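>
> Something like this should show whether the module is on and turn it off (a sketch):
>
>     ceph mgr module ls | grep -i insights
>     ceph mgr module disable insights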
>
> On Thu, Feb 25, 2021 at 11:40 PM Dan van der Ster <dan(a)vanderster.com> wrote:
>>
>>> "source": "osd.104...
>>
>> What's happening on that osd? Is it something new which corresponds to when
>> your mon started growing? Are other OSDs also flooding the mons with logs?
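>>
>> A quick way to check (a sketch, counting the sources of the pending log ops
>> in the dump you pasted):
>>
>>     ceph daemon mon.$(hostname -s) ops | grep '"source"' | sort | uniq -c | sort -rn | head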
>>
>> I'm mobile so can't check... Are those logging configs the defaults? If
>> not, revert to default...
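>>
>> E.g. (just a sketch; option names taken from the config list you pasted):
>>
>>     ceph config rm global clog_to_syslog_level
>>     ceph config rm global mon_cluster_log_file_level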
>>
>> BTW do your mons have stable quorum or are they flapping with this load?
>>
>> .. dan
>>
>>
>>
>> On Thu, Feb 25, 2021, 8:58 PM Janek Bevendorff <
>> janek.bevendorff(a)uni-weimar.de> wrote:
>>
>>> Thanks, Dan.
>>>
>>> On the first MON, the command doesn’t even return, but I was able to get a
>>> dump from the one I restarted most recently. The oldest ops look like this:
>>>
>>> {
>>>     "description": "log(1000 entries from seq 17876238 at 2021-02-25T15:13:20.306487+0100)",
>>>     "initiated_at": "2021-02-25T20:40:34.698932+0100",
>>>     "age": 183.762551121,
>>>     "duration": 183.762599201,
>>>     "type_data": {
>>>         "events": [
>>>             { "time": "2021-02-25T20:40:34.698932+0100", "event": "initiated" },
>>>             { "time": "2021-02-25T20:40:34.698636+0100", "event": "throttled" },
>>>             { "time": "2021-02-25T20:40:34.698932+0100", "event": "header_read" },
>>>             { "time": "2021-02-25T20:40:34.701407+0100", "event": "all_read" },
>>>             { "time": "2021-02-25T20:40:34.701455+0100", "event": "dispatched" },
>>>             { "time": "2021-02-25T20:40:34.701458+0100", "event": "mon:_ms_dispatch" },
>>>             { "time": "2021-02-25T20:40:34.701459+0100", "event": "mon:dispatch_op" },
>>>             { "time": "2021-02-25T20:40:34.701459+0100", "event": "psvc:dispatch" },
>>>             { "time": "2021-02-25T20:40:34.701490+0100", "event": "logm:wait_for_readable" },
>>>             { "time": "2021-02-25T20:40:34.701491+0100", "event": "logm:wait_for_readable/paxos" },
>>>             { "time": "2021-02-25T20:40:34.701496+0100", "event": "paxos:wait_for_readable" },
>>>             { "time": "2021-02-25T20:40:34.989198+0100", "event": "callback finished" },
>>>             { "time": "2021-02-25T20:40:34.989199+0100", "event": "psvc:dispatch" },
>>>             { "time": "2021-02-25T20:40:34.989208+0100", "event": "logm:preprocess_query" },
>>>             { "time": "2021-02-25T20:40:34.989208+0100", "event": "logm:preprocess_log" },
>>>             { "time": "2021-02-25T20:40:34.989278+0100", "event": "forward_request_leader" },
>>>             { "time": "2021-02-25T20:40:34.989344+0100", "event": "forwarded" },
>>>             { "time": "2021-02-25T20:41:58.658022+0100", "event": "resend forwarded message to leader" },
>>>             { "time": "2021-02-25T20:42:27.735449+0100", "event": "resend forwarded message to leader" }
>>>         ],
>>>         "info": {
>>>             "seq": 41550,
>>>             "src_is_mon": false,
>>>             "source": "osd.104 v2:XXX:6864/16579",
>>>             "forwarded_to_leader": true
>>>         }
>>>
>>>
>>> Any idea what that might be about? Almost looks like this:
>>> https://tracker.ceph.com/issues/24180
>>> I set debug_mon to 0, but I keep getting a lot of log spill in journals.
>>> It’s about 1-2 messages per second, mostly RocksDB stuff, but nothing that
>>> actually looks serious or even log-worthy. I had already noticed before
>>> that, despite logging being set to warning level, the cluster log keeps
>>> being written to the MON log. But that shouldn’t cause such massive
>>> stability issues, should it? The date on the log op is also weird: 15:13+0100 was
>>> hours ago.
>>>
>>> Here’s my log config:
>>>
>>> global  advanced  clog_to_syslog_level             warning
>>> global  basic     err_to_syslog                    true
>>> global  basic     log_to_file                      false
>>> global  basic     log_to_stderr                    false
>>> global  basic     log_to_syslog                    true
>>> global  advanced  mon_cluster_log_file_level       error
>>> global  advanced  mon_cluster_log_to_file          false
>>> global  advanced  mon_cluster_log_to_stderr        false
>>> global  advanced  mon_cluster_log_to_syslog        false
>>> global  advanced  mon_cluster_log_to_syslog_level  warning
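>>>
>>> For what it's worth, the values the MON is actually running with can be
>>> double-checked on the admin socket (a sketch):
>>>
>>>     ceph daemon mon.$(hostname -s) config show | grep -E 'clog|mon_cluster_log'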
>>>
>>>
>>>
>>> Ceph version is 15.2.8.
>>>
>>> Janek
>>>
>>>
>>> On 25. Feb 2021, at 20:33, Dan van der Ster <dan(a)vanderster.com>
wrote:
>>>
>>> ceph daemon mon.`hostname -s` ops
>>>
>>> That should show you the accumulating ops.
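>>>
>>> If the output is huge, something like this (a sketch) pulls out just the
>>> descriptions and ages:
>>>
>>>     ceph daemon mon.$(hostname -s) ops | grep -E '"description"|"age"' | head -40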
>>>
>>> .. dan
>>>
>>>
>>> On Thu, Feb 25, 2021, 8:23 PM Janek Bevendorff <
>>> janek.bevendorff(a)uni-weimar.de> wrote:
>>>
>>>> Hi,
>>>>
>>>> All of a sudden, we are experiencing very concerning MON behaviour. We
>>>> have five MONs, and all of them have thousands to tens of thousands of
>>>> slow ops, with the oldest one blocking basically indefinitely (at least
>>>> the timer keeps creeping up). Additionally, the MON stores keep inflating
>>>> heavily. Under normal circumstances we have about 450-550 MB there; right
>>>> now it's 27 GB and growing rapidly.
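>>>>
>>>> For reference, the store size above is from the mon data directory
>>>> (assuming the default path), roughly:
>>>>
>>>>     du -sh /var/lib/ceph/mon/ceph-$(hostname -s)/store.db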
>>>>
>>>> I tried restarting all MONs, I disabled auto-scaling (just in case) and
>>>> checked the system load and hardware. I also restarted the MGR and MDS
>>>> daemons, but to no avail.
>>>>
>>>> Is there any way I can debug this properly? I can’t seem to find how I
>>>> can actually view what ops are causing this and what client (if any) may
>>>> be responsible for it.
>>>>
>>>> Thanks
>>>> Janek
>>>>
>>>
>>>
>> _______________________________________________
>> ceph-users mailing list -- ceph-users(a)ceph.io
>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>
>