RFA: Engineering Incidents

A proposal for structured incident management at Amplify — defining severity levels (SEV-1 through SEV-3), communication protocols, root cause analysis, and retrospective practices for when things go wrong in production.
Christian Battaglia

December 20, 2019

incident management
SRE
Amplify
severity levels
root cause analysis
retrospectives
DevOps

Current Incident Reporting at Amplify

A sample:

IncidentBot APP 1:32 PM

Issue: Science assessments Audio option is not appearing

Latest Update: Error rates have declined after removing possibly faulty server instances. We do not know the root cause, but users should now be able to access audio in assessments.

Status: Monitoring
Severity: SEV-3 (what's this?)
Affected products: Amplify Science
Response channel: #serious-badger-incident
Fact sheet: :clipboard: Fact Sheet (previous updates, etc.)
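The fields in the sample above suggest a simple record shape. As a minimal sketch only (the `IncidentUpdate` class, its field names, and the `render` method are hypothetical and are not the actual bot's implementation), an update could be modeled like this:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IncidentUpdate:
    """One status update for an incident (hypothetical structure)."""

    issue: str
    latest_update: str
    status: str
    severity: str
    affected_products: List[str]
    response_channel: str

    def render(self) -> str:
        """Format the update as plain text, one labelled field per line."""
        return "\n".join([
            f"Issue: {self.issue}",
            f"Latest Update: {self.latest_update}",
            f"Status: {self.status}",
            f"Severity: {self.severity}",
            f"Affected products: {', '.join(self.affected_products)}",
            f"Response channel: {self.response_channel}",
        ])


# Example built from the sample message above.
example = IncidentUpdate(
    issue="Science assessments Audio option is not appearing",
    latest_update="Error rates have declined after removing possibly faulty server instances.",
    status="Monitoring",
    severity="SEV-3",
    affected_products=["Amplify Science"],
    response_channel="#serious-badger-incident",
)
print(example.render())
```

Keeping every update in one fixed shape makes it harder to post an update that omits the status or severity.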

How to define severity? (Incident Management)

Definition:

An incident is any interruption in normal service to our customers or employees.

Our Incident Management process is how we organize our response and communications when an incident occurs.

Severity Levels

We use severity levels to indicate the scope of an incident, and to help guide the immediacy and scope of our response.

  • SEV-1
    • Total loss of availability (outage) for all customers, or PII data breach.
  • SEV-2
    • Disruption of normal classroom usage with no workaround, or loss of availability for a significant fraction of the user base.
  • SEV-3
    • Lessons can be completed and students keep their work, but functionality is degraded in a way that prevents users from completing typical tasks.

Please note that severity levels are not a popularity contest – we don't need to call something a SEV-1 just because we care about a particular customer or product a lot. They're just a guideline to clarify what is happening and how we are responding.
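The severity definitions above can be read as a small decision rule, checked from most to least severe. The following is a rough sketch under stated assumptions: the `Impact` flags and the `classify` function are hypothetical names introduced for illustration, and real triage will involve judgment that a handful of booleans cannot capture.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    SEV1 = "SEV-1"
    SEV2 = "SEV-2"
    SEV3 = "SEV-3"


@dataclass
class Impact:
    """Observed impact of an incident (hypothetical flags, one per criterion)."""

    total_outage: bool = False                       # all customers lose availability
    pii_breach: bool = False                         # PII data breach
    classroom_disrupted_no_workaround: bool = False  # classroom usage blocked, no workaround
    significant_fraction_unavailable: bool = False   # large share of users lose availability
    degraded_functionality: bool = False             # lessons completable, typical tasks blocked


def classify(impact: Impact) -> Optional[Severity]:
    """Return the highest severity whose criteria match, or None if none match."""
    if impact.total_outage or impact.pii_breach:
        return Severity.SEV1
    if impact.classroom_disrupted_no_workaround or impact.significant_fraction_unavailable:
        return Severity.SEV2
    if impact.degraded_functionality:
        return Severity.SEV3
    return None
```

Under this sketch, the sample incident (the audio option missing from assessments) sets only `degraded_functionality`, so `classify` returns `Severity.SEV3`, matching the severity the bot reported. Nothing about which customer or product is affected enters the rule, which is the point of the note above.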

About incident management

The primary goal of the incident management process is to restore normal service operation as quickly as possible and to minimize the impact on business operations, so that the best possible levels of service quality and availability are maintained.

In addition to the primary goal, there are several secondary goals that we would ideally achieve:

  • Ongoing communication to interested parties that a service interruption is occurring, along with current progress towards recovering the service.
  • Root cause analysis, so we understand why the incident occurred.
  • Retrospective analysis after the incident, so we understand how to prevent future incidents and improve our handling of them.