Improvements to Error-Rate Alerting

In January 2017 Mux announced support for error-rate alerts in our analytics service. Fixed alerting thresholds are notoriously difficult to select and maintain over time, so one of our goals has been to calculate error-rate alert thresholds automatically.

Thresholds for key metrics can be very difficult to set without historical data. Any thresholds set early in a microservice’s lifecycle run the risk of either being useless or triggering too many alerts. (Susan Fowler, “Production Ready Microservices”)

We used statistical methods to calculate thresholds that represent highly unusual error rates, tailored to each error type across all Mux customers. Our adaptive thresholds change over time as error types become more or less frequent.
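
As a rough illustration only (not the method used in the Mux service), an adaptive threshold of this kind could be computed as the mean of recent error rates plus a few standard deviations. The function names, the `window` size, and the multiplier `k` below are all assumptions made for the sketch.

```python
from statistics import mean, stdev


def adaptive_threshold(error_rates, window=24, k=3.0):
    """Return mean + k * sample stdev of the most recent `window` error rates.

    `window` and `k` are hypothetical tuning knobs, not values from the post.
    """
    recent = error_rates[-window:]
    if len(recent) < 2:
        raise ValueError("need at least two observations")
    return mean(recent) + k * stdev(recent)


def is_unusual(current_rate, history, window=24, k=3.0):
    """True when the current rate exceeds the threshold derived from history."""
    return current_rate > adaptive_threshold(history, window, k)


# Hypothetical hourly error rates for one error type (fraction of views).
history = [0.010, 0.012, 0.011, 0.009, 0.013, 0.010, 0.011, 0.012]

print(is_unusual(0.020, history))  # well above the recent range
print(is_unusual(0.012, history))  # within the recent range
```

Because the threshold is recomputed from the latest window, it rises when an error type becomes more common and falls when it becomes rarer.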

But here’s the rub: things that are statistically unusual aren’t necessarily worth alerting on. Many alerts fired at very low thresholds, for errors that affected few users and were not actionable. Discussions with customers confirmed our suspicion that we needed to improve our alerting algorithm to reduce alert fatigue.

For timeseries with evolving patterns, thresholds should be calibrated to reflect their most recent state. Data-driven thresholds require periodic readjustment. This approach emphasizes the importance of monitoring as a process and yields high rates of accuracy. Having said that, it comes at an expense of added complexity. (Slawek Ligus, “Effective Monitoring and Alerting”)

We responded by using alert incident details from the previous seven months to train a binary classifier. The classifier identifies important alert conditions: those that match the characteristics of historical alerts that affected large numbers of viewers. All alerts created since mid-August 2017 make use of this feature. This has greatly reduced the volume of alerts and boosted the visibility of important, actionable alerts.
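
To make the idea concrete, here is a minimal sketch of a binary classifier trained on past alert incidents. It uses plain logistic regression as a stand-in, since the actual model and features are not described here. The two features (error rate and fraction of viewers affected), the training rows, and the labels are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(rows, labels, lr=0.5, epochs=2000):
    """Fit logistic regression with stochastic gradient descent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # gradient of the log loss with respect to z
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def predict(weights, bias, x):
    """True when the incident is classified as an important alert."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z) >= 0.5


# Hypothetical historical incidents: (error_rate, fraction_of_viewers_affected).
rows = [
    (0.02, 0.01), (0.05, 0.02), (0.03, 0.03), (0.10, 0.05),
    (0.40, 0.60), (0.55, 0.45), (0.30, 0.70), (0.80, 0.90),
]
# Label 1 marks incidents that affected many viewers, 0 marks low-impact ones.
labels = [0, 0, 0, 0, 1, 1, 1, 1]

weights, bias = train(rows, labels)
print(predict(weights, bias, (0.60, 0.80)))  # resembles high-impact incidents
print(predict(weights, bias, (0.04, 0.02)))  # resembles low-impact incidents
```

In a setup like this, the statistical threshold still detects unusual error rates, and the classifier decides whether an unusual condition is worth notifying anyone about.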

We encourage you to try the alerts feature in the Mux analytics service. Customer feedback is always appreciated, so please don’t hesitate to let us know how we might improve your experience!