Shared Sentry monitoring
How to organise the monitoring work across different teams working in the same application.
Problem Description
One of the main BOS issues is daily monitoring from the teams' point of view: many teams contribute to the BOS application, and we don’t have an easy way to assign the errors that appear and clean them up.
Background
Several teams at Ebury contribute to the BOS application (this is great!), but this has some consequences. One of them is that monitoring is shared and hard to organise: when an error relates to a new module, it is easy to identify the owner, but what happens when it relates to legacy code?
Solution
To organise the work, we created a Monitoring squad that divides the work between teams. It works as follows: the BOS release manager is the main person reviewing the Slack channel #bos-alerts, where new and recurring BOS errors appear. The BOS release manager contacts the person assigned in each team to check the issue, using this Confluence dashboard. These are the possible scenarios:
- False positive. The issue is marked as “Ignored”, but it still needs to be analysed so that it does not happen again. To do that, we create a JIRA spike with the "sentry_error" label (after this spike, the issue could move to one of the next scenarios).
- Not an ERROR type. Sometimes we receive errors that are part of a validation, where an issue level of ERROR does not make sense. In that case, we create a task to lower the issue level to WARNING, linked with the Sentry issue and with the JIRA label “sentry_error” to obtain metrics.
- ERROR type. The issue needs to be analysed or solved. In that case, we create a task in one of the BOS teams, linked with the Sentry issue and with the JIRA label “sentry_error” to obtain metrics.
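The three scenarios above can be sketched as a small decision function. This is only an illustration of the triage rules, not an existing tool: the names `Scenario`, `JiraTicket` and `plan_ticket` are hypothetical, the ticket summaries are invented wording, and nothing here calls the Sentry or JIRA APIs.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Scenario(Enum):
    FALSE_POSITIVE = "false_positive"
    NOT_ERROR_LEVEL = "not_error_level"
    ERROR = "error"


@dataclass
class JiraTicket:
    issue_type: str      # "Spike" or "Task"
    summary: str
    labels: List[str]
    sentry_url: str      # link back to the Sentry issue
    mark_ignored: bool   # whether the Sentry issue is set to "Ignored"


# Label used on every ticket so that metrics can be obtained later.
SENTRY_LABEL = "sentry_error"


def plan_ticket(scenario: Scenario, title: str, sentry_url: str) -> JiraTicket:
    """Return the ticket that the triage rules call for (hypothetical helper)."""
    if scenario is Scenario.FALSE_POSITIVE:
        # Ignored in Sentry, but a spike is still created to analyse the cause.
        # Assumption: the spike also links to the Sentry issue; the text only
        # states the link explicitly for the other two scenarios.
        return JiraTicket("Spike", f"Analyse false positive: {title}",
                          [SENTRY_LABEL], sentry_url, mark_ignored=True)
    if scenario is Scenario.NOT_ERROR_LEVEL:
        # Validation-style errors: lower the issue level to WARNING.
        return JiraTicket("Task", f"Lower level to WARNING: {title}",
                          [SENTRY_LABEL], sentry_url, mark_ignored=False)
    # Real error: a task in one of the BOS teams to analyse or solve it.
    return JiraTicket("Task", f"Analyse or solve: {title}",
                      [SENTRY_LABEL], sentry_url, mark_ignored=False)
```

For example, `plan_ticket(Scenario.FALSE_POSITIVE, "Timeout in legacy report", url)` yields a spike with the "sentry_error" label and `mark_ignored=True`, while the other two scenarios yield tasks.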
Apart from that, the Sentry panel should be reviewed to check the most common issues (e.g. the top 10), because in Slack we only receive the new ones. Using this panel, we need to create tasks to solve these issues and clean up the Sentry panel.
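As a rough sketch of the "top 10" review, the function below ranks issues by how often they occur. It assumes the event counts have already been read from the Sentry panel into a plain dictionary; `top_issues` and the issue identifiers are hypothetical, and ranking by event count is one possible reading of "most common".

```python
from typing import Dict, List, Tuple


def top_issues(event_counts: Dict[str, int], limit: int = 10) -> List[Tuple[str, int]]:
    """Return the `limit` most frequent issues as (issue_id, count) pairs."""
    # Sort by count, highest first; ties are broken by issue id so the
    # result is stable from one review to the next.
    ranked = sorted(event_counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


# Hypothetical counts, copied by hand from the panel.
counts = {"BOS-A": 5, "BOS-B": 12, "BOS-C": 5}
print(top_issues(counts, limit=2))  # [('BOS-B', 12), ('BOS-A', 5)]
```

Each entry in the result is a candidate for one of the tasks described above.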
Alternatives
At the moment, this plan works as long as we push the tasks in every sprint, so that issues are solved as soon as possible and newly appearing issues are cleaned up. Other ideas and suggestions are more than welcome.
Caveats
The main limitation is that only a few people are involved in monitoring, rather than everyone in the teams, but with this solution we’re trying to educate all the BOS team members. This is the first step; we want this setup to disappear over time, once everyone involved in BOS is aware of the monitoring.
Another possible issue is false positives, or fixes that generate other Sentry issues. In this [Confluence dashboard] we can find all the tasks related to Sentry fixes, so we can look up related tasks if necessary.
Operation
In short, we’ll have a squad to organise the monitoring work, made up of the release manager and one senior engineer per team. They’ll keep track of Sentry issues and create the corresponding tickets. After that, we need to push those tickets in the team refinements so that they are solved in the next sprints.
Security Impact
NA
Performance Impact
NA
Developer Impact
This will affect our sprints because we’ll need to reserve some capacity to solve Sentry issues, at least until we clean up the most common ones. In this way, the BOS Sentry issues will be cleaned up and the monitoring work will improve.
Data Consumer Impact
NA
Deployment
NA
Dependencies
NA
References
NA