Thore,
Thank you for your reply.
Unless the issue was specifically with a Ceph telemetry server or its
subnet, we had no network issues at that time; at least none were reported
by monitoring or customers. It is very odd, unless the telemetry module has
a bug of some kind and hangs on its own. But this is the first time it has
happened, and the cluster has been up for a year.
On Fri, Feb 21, 2020 at 3:49 AM Thore Krüss <thore(a)scimeda.de> wrote:
On Fri, Feb 21, 2020 at 05:28:12AM -0000, alexander.v.litvak(a)gmail.com wrote:
This evening I was awakened by an error message:
  cluster:
    id:     9b4468b7-5bf2-4964-8aec-4b2f4bee87ad
    health: HEALTH_ERR
            Module 'telemetry' has failed: ('Connection aborted.', error(101, 'Network is unreachable'))
  services:
I have not seen any other problems with anything else on the cluster. I
disabled and then re-enabled the telemetry module, and health returned to
OK status. Any ideas on what could cause the issue? As far as I understand,
telemetry is a module that sends messages to an external Ceph server
outside of the network.
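For reference, the disable/enable workaround described above can be sketched as a small Python wrapper around the ceph CLI. This is only a minimal sketch and not part of the original message: the helper names `toggle_module_commands` and `toggle_module` and the `dry_run` flag are hypothetical, and it assumes the `ceph` binary is on the PATH with admin credentials. By default it only prints the commands.

```python
import subprocess


def toggle_module_commands(module="telemetry"):
    """Return the ceph CLI calls that disable and then re-enable a mgr module."""
    return [
        ["ceph", "mgr", "module", "disable", module],
        ["ceph", "mgr", "module", "enable", module],
    ]


def toggle_module(module="telemetry", dry_run=True):
    """Print each command; execute it only when dry_run is False (hypothetical flag)."""
    commands = toggle_module_commands(module)
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if ceph exits non-zero
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    toggle_module()  # dry run: prints the two commands without running them
```

Calling `toggle_module(dry_run=False)` on an admin node would actually run both commands; checking `ceph health detail` afterwards shows whether the error has cleared.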
Maybe an uplink issue? We saw similar behaviour when we had some trouble
with a core router.
You were able to disable and enable the module? For me (on Nautilus) this
failed, with the stated reason being that the module had failed. Restarting
all mgrs did help.
Still, I'm not sure why this is considered a HEALTH_ERR.
Best regards
Thore
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io