Quality Engineering Basics: Incident Reports
Part 10 of Quality Engineering Basics. Let's talk about incident reports!
Incident reports[1] allow us to examine production failures by reviewing what happened, identifying and analyzing the contributing factors, and determining process improvements to prevent or mitigate future events. They are meant to be blameless: the goal is to understand what went wrong and learn from the experience.
Additionally, incident reports can be used to update internal and external stakeholders, such as executive leadership or affected customers.
If done right, here’s what to expect from an effective incident report process:
Investing the time to write them (and follow through on the action items) now will result in fewer incidents in the future.
Less time spent putting out production fires. 🔥
Less time spent writing incident reports.
More time for other stuff.
Better quality!
Stakeholders feel informed about what’s happening and have the necessary information.
Less time spent digging up answers to questions like “How many users were impacted by that thing last month?” or “What’s this blip in our key metrics?”
Clear justification for working on process improvements or paying down technical debt.
If incident A damaged X, Y, and Z, and the cost of the preventive work is less than that damage, stakeholders are usually more willing to approve the work than they would be without the data. Over time, they will often recognize the value of doing this work preemptively.
The Challenge
Incident reports often get a bad reputation. A few years ago, I heard numerous complaints from developers when I rolled out a standardized process[2] requiring an incident report for any production bug at or above a specific priority. Some teams had significantly more incidents than others, and they felt the reports were an undue burden.
The company or organization has to be willing to allocate the time, with the expectation and understanding that the incident reports will lead to more work (action items), which will lead to better quality and freed-up time later on.
Some common complaints were:
This is too much work; we should be coding new features instead.
No one ever reads these, so why bother?
Nothing ever comes out of these; it’s a waste of time.
The Solution
There is no universal solution for getting people to love writing incident reports. The most important thing is to demonstrate the reports' effects and remember that they are meant to be a blameless opportunity to seek ways to do better in the future.
When I faced all those complaints, I started by addressing the fallacy of “too much work; we should be coding new features instead.” I pointed out that by identifying ways to prevent future issues, we’d have more time for feature work in the future and better quality.
I also ensured everyone knew that I read them—every. single. one. And I encouraged the appropriate engineering managers and VPs to read them and provide comments and feedback.
As for the lack of outcomes? Previously, when these were done, the action items sat in a list in a Confluence doc or email, never to be reviewed or prioritized. I addressed this by requiring each action item to be filed in Jira as part of writing the incident report. These were then reviewed and prioritized by the appropriate scrum team. Additionally, I created reports showing how many incident report action items were completed.
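To make the "how many action items were completed" report concrete, here is a minimal sketch. It assumes the action items have already been exported from the tracker into plain records; the `ActionItem` fields, the status names, and the sample keys are all hypothetical, not what Jira actually returns.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ActionItem:
    incident_id: str  # hypothetical incident identifier
    ticket_key: str   # hypothetical tracker key for the action item
    status: str       # assumed statuses: "open", "in_progress", "done"


def completion_summary(items):
    """Count action items per status and compute the share marked done."""
    counts = Counter(item.status for item in items)
    total = len(items)
    done_ratio = counts["done"] / total if total else 0.0
    return {"total": total, "by_status": dict(counts), "done_ratio": done_ratio}


# Made-up sample data, for illustration only.
items = [
    ActionItem("INC-1", "QE-101", "done"),
    ActionItem("INC-1", "QE-102", "open"),
    ActionItem("INC-2", "QE-110", "done"),
    ActionItem("INC-2", "QE-111", "in_progress"),
]
summary = completion_summary(items)
print(f"{summary['done_ratio']:.0%} of action items completed")  # 50% of action items completed
```

Even a number this simple answers the "nothing ever comes out of these" complaint, because it shows whether the follow-up work is actually getting done.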
The team that had to write 10x more reports than the others? They realized how much time they were spending fixing production issues and that they could regain some of it by dedicating time to understanding how to prevent and mitigate future problems. By doing this, they significantly improved their testing and release processes. These changes dramatically reduced both the number of high-priority production issues and the mean time to repair them.
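If you want to track mean time to repair yourself, one way to compute it is sketched below. It assumes each incident is recorded as a (detected, resolved) pair of timestamps; the function name and the sample data are made up for illustration, and your own definition of "detected" and "resolved" may differ.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average of (resolved - detected) over (detected, resolved) pairs."""
    if not incidents:
        return timedelta(0)
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(0),
    )
    return total / len(incidents)


# Made-up sample incidents, for illustration only.
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 11, 0)),     # 2 hours
    (datetime(2024, 1, 10, 14, 0), datetime(2024, 1, 10, 15, 0)),  # 1 hour
    (datetime(2024, 1, 20, 8, 0), datetime(2024, 1, 20, 11, 0)),   # 3 hours
]
mttr = mean_time_to_repair(incidents)
print(mttr)  # 2:00:00
```

Comparing this value before and after a process change is one way to show the kind of improvement described above.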
Incident Report Process
For an effective process, here are the key bits you’ll need to determine:
Criteria for required incident reports.
Person(s) responsible for:
writing the incident report(s).
collaboration on report(s).
reviewing the report(s).
Action item tracking and reporting.
How will you track and report action items?
Incident Report Template
Example
Criteria: Incident reports are required for any P0 production issue, failed deployment, or issue requiring immediate intervention. Optional for P1 issues, depending on the visibility of the problem.
Person(s) responsible:
Writing: A cross-functional team from dev, QA, and ops.
Collaboration: Team members as needed.
Reviewing: Team manager, VP of Engineering.
Action Items: Action items will be tracked in the same system as other engineering (product development) work.
Incident Report Template: example template (feel free to make a copy and customize it to work for you).
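The example criteria above can also be written down as a small rule, which makes them easy to apply consistently (for instance, in a triage script). This is only a sketch: the `Issue` fields and the `Decision` names are hypothetical, and it assumes the "Optional for P1" rule refers to P1 production issues.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    REQUIRED = "required"
    OPTIONAL = "optional"
    NOT_REQUIRED = "not required"


@dataclass
class Issue:
    priority: str  # assumed labels such as "P0", "P1", "P2"
    in_production: bool = True
    failed_deployment: bool = False
    needed_immediate_intervention: bool = False


def incident_report_decision(issue):
    """Apply the example criteria: P0 production issues, failed deployments,
    and issues needing immediate intervention require a report; P1 is optional."""
    if issue.failed_deployment or issue.needed_immediate_intervention:
        return Decision.REQUIRED
    if issue.in_production and issue.priority == "P0":
        return Decision.REQUIRED
    if issue.in_production and issue.priority == "P1":
        # Optional: depends on the visibility of the problem, a human call.
        return Decision.OPTIONAL
    return Decision.NOT_REQUIRED


print(incident_report_decision(Issue("P0")).value)                          # required
print(incident_report_decision(Issue("P1")).value)                          # optional
print(incident_report_decision(Issue("P2", failed_deployment=True)).value)  # required
print(incident_report_decision(Issue("P2")).value)                          # not required
```

The P1 branch deliberately returns "optional" rather than deciding, since visibility is something a person has to judge.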
Conversation Starters:
Have you ever written an incident report or postmortem? What did you learn?
Incident Reports are often the purview of development teams or operations. Why should quality engineers be involved?
Up Next
This is the penultimate post for Quality Engineering Basics. Next, we’ll examine how these ten topics can be a foundation for quality engineering best practices and processes.
xo,
Brie
[1] These can also be called postmortems or even retrospectives. However, since both postmortems and retros are often done at the project or release level, I prefer the term incident report as a more explicit, unique name.
[2] I borrowed heavily from Google’s postmortem process when I created my first standardized incident report process.