On 25. Feb 2021, at 21:17, Seena Fallah <seenafallah@gmail.com> wrote:
I had the same problem in my cluster, and it was because the insights mgr
module was storing lots of data in RocksDB while my cluster was degraded.
If you have degraded PGs, try disabling the insights module.
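If you aren't sure whether the module is enabled, something like this
should work (a sketch; insights prune-health is from the insights module
docs and takes the number of hours of history to keep, so 0 drops it all):

ceph mgr module ls | grep insights   # check whether the module is enabled
ceph insights prune-health 0         # optionally drop stored health history first
ceph mgr module disable insights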
On Thu, Feb 25, 2021 at 11:40 PM Dan van der Ster <dan@vanderster.com> wrote:
"source": "osd.104...
What's happening on that OSD? Is it something new that corresponds to when
your mon stores started growing? Are other OSDs also flooding the mons with
logs?
I'm on mobile so I can't check... Are those logging configs the defaults?
If not, revert to the defaults...
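For example, overrides made with ceph config set can usually be removed
again with ceph config rm (the option names below just mirror the config
dump quoted further down):

ceph config rm global clog_to_syslog_level
ceph config rm global mon_cluster_log_to_syslog_level
# ...and so on for each non-default log option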
BTW do your mons have stable quorum or are they flapping with this load?
.. dan
On Thu, Feb 25, 2021, 8:58 PM Janek Bevendorff <janek.bevendorff@uni-weimar.de> wrote:
Thanks, Dan.
On the first MON, the command doesn’t even return, but I was able to get a
dump from the one I restarted most recently. The oldest ops look like this:
{
    "description": "log(1000 entries from seq 17876238 at 2021-02-25T15:13:20.306487+0100)",
    "initiated_at": "2021-02-25T20:40:34.698932+0100",
    "age": 183.762551121,
    "duration": 183.762599201,
    "type_data": {
        "events": [
            {
                "time": "2021-02-25T20:40:34.698932+0100",
                "event": "initiated"
            },
            {
                "time": "2021-02-25T20:40:34.698636+0100",
                "event": "throttled"
            },
            {
                "time": "2021-02-25T20:40:34.698932+0100",
                "event": "header_read"
            },
            {
                "time": "2021-02-25T20:40:34.701407+0100",
                "event": "all_read"
            },
            {
                "time": "2021-02-25T20:40:34.701455+0100",
                "event": "dispatched"
            },
            {
                "time": "2021-02-25T20:40:34.701458+0100",
                "event": "mon:_ms_dispatch"
            },
            {
                "time": "2021-02-25T20:40:34.701459+0100",
                "event": "mon:dispatch_op"
            },
            {
                "time": "2021-02-25T20:40:34.701459+0100",
                "event": "psvc:dispatch"
            },
            {
                "time": "2021-02-25T20:40:34.701490+0100",
                "event": "logm:wait_for_readable"
            },
            {
                "time": "2021-02-25T20:40:34.701491+0100",
                "event": "logm:wait_for_readable/paxos"
            },
            {
                "time": "2021-02-25T20:40:34.701496+0100",
                "event": "paxos:wait_for_readable"
            },
            {
                "time": "2021-02-25T20:40:34.989198+0100",
                "event": "callback finished"
            },
            {
                "time": "2021-02-25T20:40:34.989199+0100",
                "event": "psvc:dispatch"
            },
            {
                "time": "2021-02-25T20:40:34.989208+0100",
                "event": "logm:preprocess_query"
            },
            {
                "time": "2021-02-25T20:40:34.989208+0100",
                "event": "logm:preprocess_log"
            },
            {
                "time": "2021-02-25T20:40:34.989278+0100",
                "event": "forward_request_leader"
            },
            {
                "time": "2021-02-25T20:40:34.989344+0100",
                "event": "forwarded"
            },
            {
                "time": "2021-02-25T20:41:58.658022+0100",
                "event": "resend forwarded message to leader"
            },
            {
                "time": "2021-02-25T20:42:27.735449+0100",
                "event": "resend forwarded message to leader"
            }
        ],
        "info": {
            "seq": 41550,
            "src_is_mon": false,
            "source": "osd.104 v2:XXX:6864/16579",
            "forwarded_to_leader": true
        }
    }
}
Any idea what that might be about? It almost looks like this:
https://tracker.ceph.com/issues/24180
I set debug_mon to 0, but I keep getting a lot of log spill in the
journals. It’s about 1-2 messages per second, mostly RocksDB stuff, but
nothing that actually looks serious or even log-worthy. I had noticed
before that, despite logging being set to warning level, the cluster log
keeps being written to the MON log. But that shouldn’t cause such massive
stability issues, should it? The date on the log op is also weird:
15:13+0100 was hours ago.
Here’s my log config:
global  advanced  clog_to_syslog_level             warning
global  basic     err_to_syslog                    true
global  basic     log_to_file                      false
global  basic     log_to_stderr                    false
global  basic     log_to_syslog                    true
global  advanced  mon_cluster_log_file_level       error
global  advanced  mon_cluster_log_to_file          false
global  advanced  mon_cluster_log_to_stderr        false
global  advanced  mon_cluster_log_to_syslog        false
global  advanced  mon_cluster_log_to_syslog_level  warning
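For reference, the values a mon is actually running with (as opposed to
what is stored in the config database) can be checked on a mon host with
something like:

ceph daemon mon.$(hostname -s) config show | grep -E 'clog|cluster_log'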
Ceph version is 15.2.8.
Janek
On 25. Feb 2021, at 20:33, Dan van der Ster <dan@vanderster.com> wrote:
ceph daemon mon.`hostname -s` ops
That should show you the accumulating ops.
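If you have jq handy, a quick way to see which daemons those ops come from
(a sketch; the source field is only present on forwarded ops, hence the
// empty fallback):

ceph daemon mon.$(hostname -s) ops \
    | jq -r '.ops[].type_data.info.source // empty' \
    | sort | uniq -c | sort -rn | head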
.. dan
On Thu, Feb 25, 2021, 8:23 PM Janek Bevendorff <janek.bevendorff@uni-weimar.de> wrote:
Hi,
All of a sudden, we are experiencing very concerning MON behaviour. We
have five MONs, and all of them have thousands to tens of thousands of
slow ops, with the oldest one blocking basically indefinitely (at least
the timer keeps creeping up). Additionally, the MON stores keep inflating
heavily. Under normal circumstances we have about 450-550 MB there; right
now it’s 27 GB and growing (rapidly).
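One way to track the store size over time (the path assumes a default mon
data directory and "ceph" as the cluster name):

du -sh /var/lib/ceph/mon/ceph-$(hostname -s)/store.db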
I tried restarting all MONs, disabled auto-scaling (just in case), and
checked the system load and hardware. I also restarted the MGR and MDS
daemons, but to no avail.
Is there any way I can debug this properly? I can’t seem to find out how
to actually view which ops are causing this and which client (if any) may
be responsible for them.
Thanks
Janek
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-leave@ceph.io