Hi,
Happy New Year to you!
I'm running a multinode cluster with 3 MGR nodes.
The issue I'm facing now is that ceph balancer <argument> commands run
for minutes or, in the worst case, hang.
I have documented the runtimes of the following executions:
root@ld3955:~# date && time ceph balancer status
Mon Dec 23 10:06:12 CET 2019
{
"active": true,
"plans": [],
"mode": "upmap"
}
real 1m45,045s
user 0m0,315s
sys 0m0,026s
root@ld3955:~# date && time ceph balancer status
Tue Jan 7 08:11:24 CET 2020
^CInterrupted
Traceback (most recent call last):
File "/usr/bin/ceph", line 1263, in <module>
retval = main()
File "/usr/bin/ceph", line 1194, in main
verbose)
File "/usr/bin/ceph", line 619, in new_style_command
ret, outbuf, outs = do_command(parsed_args, target, cmdargs,
sigdict, inbuf, verbose)
File "/usr/bin/ceph", line 593, in do_command
return ret, '', ''
UnboundLocalError: local variable 'ret' referenced before assignment
real 102m44,084s
user 0m2,404s
sys 0m1,065s
root@ld3955:~# date && time ceph balancer off
Tue Jan 7 09:57:36 CET 2020
real 1m45,371s
user 0m0,358s
sys 0m0,013s
root@ld3955:~# date && time ceph balancer on
Tue Jan 7 14:57:03 CET 2020
real 0m0,452s
user 0m0,284s
sys 0m0,020s
root@ld3955:~# date && time ceph balancer status
Tue Jan 7 14:57:11 CET 2020
{
"active": true,
"plans": [],
"mode": "upmap"
}
real 1m52,902s
user 0m0,301s
sys 0m0,042s
root@ld3955:~# date && time ceph balancer off
Wed Jan 8 08:49:26 CET 2020
^CInterrupted
Traceback (most recent call last):
File "/usr/bin/ceph", line 1263, in <module>
retval = main()
File "/usr/bin/ceph", line 1194, in main
verbose)
File "/usr/bin/ceph", line 619, in new_style_command
ret, outbuf, outs = do_command(parsed_args, target, cmdargs,
sigdict, inbuf, verbose)
File "/usr/bin/ceph", line 593, in do_command
return ret, '', ''
UnboundLocalError: local variable 'ret' referenced before assignment
real 14m29,097s
user 0m0,579s
sys 0m0,157s
(Note that the UnboundLocalError tracebacks above appear only after I
interrupt the hanging command with Ctrl-C.)
In parallel with this finding I have identified that the active MGR
node is using over 100% CPU, 108-120% to be precise.
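For reference, this is how the load can be observed on the active MGR
node (a minimal sketch; it assumes a single ceph-mgr process named
ceph-mgr, as in a standard package install):

# Show CPU usage of the manager daemon in one batch iteration
top -b -n 1 -p "$(pgrep -x ceph-mgr)"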
To work around this issue I must stop the active MGR service and wait
until another node becomes active, as sketched below.
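For completeness, the workaround as commands (a sketch; ld3955 stands
in for whichever node is currently active, and the systemd unit name
assumes the mgr id matches the hostname):

# On the active MGR node: stop the manager so a standby takes over
systemctl stop ceph-mgr@ld3955.service

# Watch the "mgr:" line until another node reports as active
ceph -s

# Once a standby has taken over, start the old daemon as standby
systemctl start ceph-mgr@ld3955.service

An alternative would be "ceph mgr fail ld3955", which forces the
failover without touching systemd.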
What's the issue with the MGR service here?
Should I open a bug report?
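If it helps with triage, this is what I could gather for a report (a
sketch; it assumes a release with the "ceph config" interface, and the
gdb call needs the ceph-mgr debug symbols installed):

# Record the exact daemon versions in the cluster
ceph versions

# Raise the mgr log level while reproducing the hang
ceph config set mgr debug_mgr 20

# On the active MGR node, dump the stacks of the busy process
gdb -batch -ex "thread apply all bt" -p "$(pgrep -x ceph-mgr)"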
Regards