Hi,
Happy New Year to you!
I'm running a multinode cluster with 3 MGR nodes.
The issue I'm facing now is that ceph balancer <argument> commands run
for minutes or, in the worst case, hang.
I have documented the runtimes of the following executions:
root@ld3955:~# date && time ceph balancer status
Mon Dec 23 10:06:12 CET 2019
{
"active": true,
"plans": [],
"mode": "upmap"
}
real 1m45,045s
user 0m0,315s
sys 0m0,026s
root@ld3955:~# date && time ceph balancer status
Tue Jan 7 08:11:24 CET 2020
^CInterrupted
Traceback (most recent call last):
File "/usr/bin/ceph", line 1263, in <module>
retval = main()
File "/usr/bin/ceph", line 1194, in main
verbose)
File "/usr/bin/ceph", line 619, in new_style_command
ret, outbuf, outs = do_command(parsed_args, target, cmdargs,
sigdict, inbuf, verbose)
File "/usr/bin/ceph", line 593, in do_command
return ret, '', ''
UnboundLocalError: local variable 'ret' referenced before assignment
real 102m44,084s
user 0m2,404s
sys 0m1,065s
root@ld3955:~# date && time ceph balancer off
Tue Jan 7 09:57:36 CET 2020
real 1m45,371s
user 0m0,358s
sys 0m0,013s
root@ld3955:~# date && time ceph balancer on
Tue Jan 7 14:57:03 CET 2020
real 0m0,452s
user 0m0,284s
sys 0m0,020s
root@ld3955:~# date && time ceph balancer status
Tue Jan 7 14:57:11 CET 2020
{
"active": true,
"plans": [],
"mode": "upmap"
}
real 1m52,902s
user 0m0,301s
sys 0m0,042s
root@ld3955:~# date && time ceph balancer off
Wed Jan 8 08:49:26 CET 2020
^CInterrupted
Traceback (most recent call last):
File "/usr/bin/ceph", line 1263, in <module>
retval = main()
File "/usr/bin/ceph", line 1194, in main
verbose)
File "/usr/bin/ceph", line 619, in new_style_command
ret, outbuf, outs = do_command(parsed_args, target, cmdargs,
sigdict, inbuf, verbose)
File "/usr/bin/ceph", line 593, in do_command
return ret, '', ''
UnboundLocalError: local variable 'ret' referenced before assignment
real 14m29,097s
user 0m0,579s
sys 0m0,157s
(Note that the UnboundLocalError tracebacks above appear only after I
interrupt the hanging command with Ctrl-C.)
In parallel with this finding I have identified that the active MGR
node is using over 100% CPU, 108-120% to be precise.
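For reference, this is how the load can be observed on the active MGR
node (a minimal sketch; it assumes a single ceph-mgr process named
ceph-mgr, as in a standard package install):

# Show CPU usage of the manager daemon in one batch iteration
top -b -n 1 -p "$(pgrep -x ceph-mgr)"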
To work around this issue I must stop the active MGR service and wait
until another node becomes active, as sketched below.
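For completeness, the workaround as commands (a sketch; ld3955 stands
in for whichever node is currently active, and the systemd unit name
assumes the mgr id matches the hostname):

# On the active MGR node: stop the manager so a standby takes over
systemctl stop ceph-mgr@ld3955.service

# Watch the "mgr:" line until another node reports as active
ceph -s

# Once a standby has taken over, start the old daemon as standby
systemctl start ceph-mgr@ld3955.service

An alternative would be "ceph mgr fail ld3955", which forces the
failover without touching systemd.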
What's the issue with the MGR service here?
Should I open a bug report?
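If it helps with triage, this is what I could gather for a report (a
sketch; it assumes a release with the "ceph config" interface, and the
gdb call needs the ceph-mgr debug symbols installed):

# Record the exact daemon versions in the cluster
ceph versions

# Raise the mgr log level while reproducing the hang
ceph config set mgr debug_mgr 20

# On the active MGR node, dump the stacks of the busy process
gdb -batch -ex "thread apply all bt" -p "$(pgrep -x ceph-mgr)"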
Regards