On Thu, 13 Jun 2019, Neha Ojha wrote:
> Hi everyone,
>
> There has been some interest in a feature that helps users to mute
> health warnings. There is a trello card[1] associated with it and
> we've had some discussion[2] in the past in a CDM about it. In
> general, we want to understand a few things:
>
> 1. what is the level of interest in this feature
> 2. for how long should we mute these warnings - should the period be
>    decided by us or the user
> 3. possible misuse of this feature and negative impacts of muting some warnings
>
> Let us know what you think.
>
> [1] https://trello.com/c/vINMkfTf/358-mute-health-warnings
> [2] https://pad.ceph.com/p/cephalocon-usability-brainstorming
What if we start with something like:
- a 'mute' targets a specific warning code (e.g., OSD_DOWN), as in
  'ceph health mute OSD_DOWN'
- the mute matches the alert code and the short description (e.g., "2 osds
  down")
  - this could be more specific, like matching the detail items too
  - or, it could be less specific, so that e.g., an OSD_DOWN going from 2
    to 1 osd won't unmute
  - or, individual detail items could be the things that get muted
  -> we might need to make alerts include more structured fields (besides
     a summary string and vector<string> of details) in order to make this
     work perfectly... but we can start simple (with just the
     summary string match?).
- the mute goes away if
  - the description changes
  - the alert resolves
  - the TTL/expiration time is reached
  - the user unmutes (the specific mute with 'ceph health unmute <code>', or
    all mutes with 'ceph health unmute')
- 'ceph -s' will say HEALTH_OK (if all alerts are muted), but *also* say
  how many muted alerts there are, e.g.

    cluster:
      id:     28f7427e-5558-4ffd-ae1a-51ec3042759a
      health: HEALTH_OK
              2 muted alerts: OSD_DOWN, TOO_MANY_PGS

    services:
      ...

- 'ceph health' will say HEALTH_OK (if all alerts are muted)
- 'ceph health detail' will say HEALTH_OK (if all alerts are muted), but
  will *also* show all of the muted alerts in a separate section (along
  with the mute TTL/expiration)
- the dashboard would show HEALTH_OK, plus some clear visual
  indication that there are one or more mutes, with an easy UI to
  mute/unmute
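
To make the "start simple" variant concrete, here is a rough Python sketch of
the mute lifecycle described above: a mute is keyed by alert code, remembers
the summary string it was set against, and is dropped when the summary
changes, the alert resolves, or the TTL passes. The names Mute, HealthMutes,
and report() are hypothetical and not existing Ceph code, and HEALTH_WARN is
only a stand-in for whatever severity the unmuted alerts would really carry.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Mute:
    code: str                 # alert code, e.g. "OSD_DOWN"
    summary: str              # summary string seen when the mute was set
    expires: Optional[float]  # absolute expiry time, or None for no TTL


@dataclass
class HealthMutes:
    """Hypothetical container for active mutes (illustration only)."""
    mutes: Dict[str, Mute] = field(default_factory=dict)

    def mute(self, code: str, summary: str,
             ttl: Optional[float] = None, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        expires = None if ttl is None else now + ttl
        self.mutes[code] = Mute(code, summary, expires)

    def unmute(self, code: Optional[str] = None) -> None:
        # With no code, clear every mute; otherwise clear just that one.
        if code is None:
            self.mutes.clear()
        else:
            self.mutes.pop(code, None)

    def report(self, alerts: Dict[str, str],
               now: Optional[float] = None) -> Tuple[str, List[str]]:
        """alerts maps code -> current summary string.

        Returns (overall status, sorted list of muted alert codes).
        """
        now = time.time() if now is None else now
        # Drop mutes that expired, whose alert resolved, or whose
        # summary string no longer matches.
        for code, m in list(self.mutes.items()):
            expired = m.expires is not None and now >= m.expires
            resolved = code not in alerts
            changed = not resolved and alerts[code] != m.summary
            if expired or resolved or changed:
                del self.mutes[code]
        muted = sorted(c for c in alerts if c in self.mutes)
        unmuted = [c for c in alerts if c not in self.mutes]
        # Assumption: any unmuted alert is reported as HEALTH_WARN.
        status = "HEALTH_OK" if not unmuted else "HEALTH_WARN"
        return status, muted
```

With this sketch, muting OSD_DOWN against "2 osds down" and then reporting
the same alert yields HEALTH_OK with one muted alert; once the summary
becomes "1 osds down" the mute is dropped and the status is HEALTH_WARN
again. A less specific match (so that 2 -> 1 does not unmute) would replace
the string comparison in report() with a looser check.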
sage