sentry is back! - Dev

13 Jan 2021

Thanks to David Galloway, there's a fresh version of sentry set up at
https://sentry.ceph.com. This should help us track teuthology failures
and make it easier to fix issues and keep the tests passing.

Sentry is a general tool for tracking errors, and it's hooked up to
teuthology. Whenever a job fails, teuthology reports the stacktrace and
a bunch of metadata to sentry (the metadata is customizable, so we can
add more fields easily [0]).
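To give a feel for what that metadata amounts to, here is a minimal
sketch in Python. It is not the code teuthology uses (see [0] for
that); the function name, the field names, and the shape of the job
config are assumptions made up for illustration. It simply picks a few
fields out of a job's config and turns them into string key/value
pairs of the kind sentry can filter on.

```python
def build_tags(job_config, fields=("suite", "branch", "os_type", "machine_type")):
    """Pick the fields to report alongside a failure.

    Both the default field names and the job_config layout are
    hypothetical; they are not taken from teuthology itself.
    """
    tags = {}
    for field in fields:
        value = job_config.get(field)
        if value is not None:
            # Store every value as a string so it can be searched on.
            tags[field] = str(value)
    return tags


if __name__ == "__main__":
    job_config = {"suite": "rados", "branch": "master", "os_type": "ubuntu"}
    print(build_tags(job_config))
```

Adding another field would then just mean extending the `fields`
tuple, which is the sort of small change meant by "customizable"
above.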

Sentry groups these failures ('events' in sentry terms) together into
a single 'issue', which you can search and filter by
suite/branch/os/etc.

You can log in with your GitHub account, add yourself to the ceph org,
and look around the sepia project to see all the failures from the
sepia lab. Pulpito also links directly to the sentry event for each
failure.

Each issue has a timeline of occurrences, so it's easy to see whether
it was recently introduced or is a long-present intermittent failure.
There's also a chart of the distribution of the metadata, so you can
see if there's a common element like OS, branch, etc. Together these
should help us pin down when an issue first appeared.

The way sentry groups failures is also customizable - the default is
based on the stacktrace from teuthology, which is too coarse in some
cases [1]. There are likely more tweaks here that will make it easier
to use. A few ideas:

* report ceph crashes rather than the teuthology traceback when
   there's a coredump

* parse test framework output and report the failed tests rather than
   the command that ran them

* differentiate between different kinds of timeouts

* some automation with tracker tickets (you can link to redmine from
   comments on sentry events, and vice versa today)
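
As a rough illustration of what custom grouping could look like, the
sketch below normalizes a failure message so that failures differing
only in numbers (node names, exit statuses, pids) end up with the same
grouping key. Sentry accepts a custom fingerprint, a list of strings,
per event. The function name, the normalization rule, and the sample
messages are assumptions for illustration, not what teuthology does
today.

```python
import re

# Crude, hypothetical rule: treat every run of digits as variable.
_NUMBER = re.compile(r"\d+")


def fingerprint(failure_reason):
    """Return a list of strings that similar failures would share."""
    return [_NUMBER.sub("<n>", failure_reason)]


if __name__ == "__main__":
    # Made-up failure messages that differ only in the node name.
    a = fingerprint("Command failed on smithi012 with status 1")
    b = fingerprint("Command failed on smithi187 with status 1")
    print(a == b)
```

A rule this blunt also merges failures with different exit statuses,
so a real version would need to be more selective about what it treats
as noise.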

Check it out, and let us know if there are ways to improve it!

Josh

[0] https://github.com/ceph/teuthology/blob/master/teuthology/run_tasks.py#L107…
[1] https://github.com/ceph/teuthology/pull/1593