James Duffy

Designing incident response that doesn't break your team

The incident itself is rarely the problem. The technology almost always has a path to resolution. The hard part is the chaos that surrounds it.

Unclear ownership. Nobody knows who’s in charge. Five people have SSH’d into the same server. The product manager is asking for updates in one Slack channel, the CEO is asking in another, and your on-call engineer is trying to respond to both while also debugging.

That chaos is a design failure.

#Where this comes from

In high school I was part of the San Jose Fire Explorer Program — run through the fire department — where we trained in structure search, hose work, and ladder operations. Real drills at a real burn building. We trained under ICS without me fully knowing what ICS was.

When I started responding to production incidents years later, ICS wasn’t a foreign concept I had to learn. It was the obvious answer to a problem I had already seen solved.

The Incident Command System was built in the early 1970s after a series of catastrophic California wildfires in which firefighters died — not from the flames, but from coordination failures. Multiple agencies with incompatible communication systems. No unified command. Personnel duplicating effort in some areas while other areas had no coverage.

Does that sound familiar?

The U.S. government spent a decade building a response. The result is now the foundation of FEMA’s National Incident Management System, stress-tested across wildfire response, hurricane relief, and search and rescue. Its core principles are elegant in their simplicity.

#The four things that actually matter

If you take nothing else from this:

One Incident Commander. There is always exactly one person with decision-making authority. Not a committee. Not whoever speaks up first. One person, named and accountable.

Clear roles — so people know what they’re supposed to do and, crucially, what they’re not supposed to do. Scope creep in an incident is as dangerous as scope creep in a project.

Defined communication channels — one place for incident updates, one place for stakeholder communication. Not twelve parallel threads.

Explicit handoffs — so that when someone steps out, the knowledge they’re carrying doesn’t walk out with them.
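
None of this requires heavy tooling, but it helps to see the four properties as structure rather than good intentions. Here's a minimal sketch in Python; the `Incident` class, its fields, and the channel names are my illustrative assumptions, not the API of any real incident-management product:

```python
from dataclasses import dataclass, field

@dataclass
class Incident:
    """Illustrative incident record that bakes in the four ICS properties."""
    name: str
    commander: str                       # exactly one IC, named and accountable
    roles: dict[str, str] = field(default_factory=dict)   # role -> person, scoped
    updates_channel: str = ""            # one place for incident updates
    stakeholder_channel: str = ""        # one place for stakeholder comms
    handoff_log: list[str] = field(default_factory=list)  # explicit handoffs

    def hand_off_command(self, new_commander: str, summary: str) -> None:
        # A handoff is explicit: logged, announced, and swaps exactly one IC.
        self.handoff_log.append(f"IC {self.commander} -> {new_commander}: {summary}")
        self.commander = new_commander

# Usage: one named IC, two named channels, one logged handoff.
inc = Incident(
    name="2024-03-12-login-errors",
    commander="alice",
    updates_channel="#inc-login-errors",
    stakeholder_channel="#inc-login-errors-stakeholders",
)
inc.hand_off_command("bob", "Mitigation in progress; see pinned summary.")
```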

#The IC’s job is to direct, not do

This is the hardest thing to internalize for engineers who are used to being the strongest technical person in the room. The instinct when something is broken is to fix it — get your hands on the keyboard, pull up the logs, start solving.

But when you are the IC, that instinct is wrong.

The moment you go heads-down on a technical problem, you’ve stopped being the IC. You’ve created a command vacuum. And the incident will fill that vacuum with chaos.

The IC’s job is to maintain the view from 30,000 feet. Assign who does what. Make decisions when the team is stuck. Manage communication. Set priorities. That’s it.

The doing is someone else’s job.

#The command structure scales with the incident

For small incidents, the IC handles everything. But as incidents scale — more responders, more stakeholders, longer duration — the IC delegates four lead roles (a rough sketch in code follows the list):

  • Communications Lead: owns all messaging. Internal updates, status page, customer-facing communication. They draft, the IC approves.
  • Operations Lead: hands-on technical mitigation. Your strongest technical responder goes here, executing on the plan.
  • Planning Lead: tracks the timeline, documents decisions and next steps, is thinking two moves ahead. If the Ops Lead is in the present tense, the Planning Lead is in the future tense.
  • Logistics Lead: makes sure everyone has what they need — access, tooling, staffing. If someone needs a database credential reset at 3am and can’t reach anyone, that’s a logistics failure.
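
Here's that sketch. The responder-count and duration thresholds are invented for illustration; a real team would tune them to its own on-call reality:

```python
# Illustrative only: which lead roles to staff as an incident grows.
def roles_to_staff(responder_count: int, expected_hours: float) -> list[str]:
    """Small incidents: the IC wears every hat. Bigger ones: delegate."""
    if responder_count <= 2 and expected_hours < 1:
        return []                             # the IC handles everything
    staffed = ["operations"]                  # first delegation: hands-on work
    if responder_count > 3:
        staffed.append("communications")      # next: take messaging off the IC
    if expected_hours >= 2:
        staffed += ["planning", "logistics"]  # long incidents need a future tense
    return staffed

print(roles_to_staff(responder_count=6, expected_hours=3))
# -> ['operations', 'communications', 'planning', 'logistics']
```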

#Severity is based on customer impact, not technical complexity

A control plane in a degraded state with no customer-facing impact is not a SEV1. A login page returning errors for 10% of users is a SEV1, even if the root cause is a two-line config change.

Severity is always about customer and business impact. And it can change as you learn more — you start where you are and you adjust. The specific thresholds matter less than having them defined before an incident starts. The worst time to debate severity is during an active incident.
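
Having thresholds defined before an incident can be as simple as a policy anyone can apply at 2am. The sketch below is one hypothetical encoding; the percentage cutoffs are placeholders for whatever your team agrees on in advance:

```python
# Hypothetical severity policy: driven entirely by customer impact,
# never by how technically interesting the root cause is.
def classify_severity(customer_facing: bool, pct_users_affected: float) -> str:
    if not customer_facing:
        return "SEV3"              # degraded internals, customers unaffected
    if pct_users_affected >= 10:
        return "SEV1"              # broad customer impact: all hands
    if pct_users_affected >= 1:
        return "SEV2"
    return "SEV3"

# The two examples from above:
print(classify_severity(customer_facing=False, pct_users_affected=0))  # SEV3
print(classify_severity(customer_facing=True, pct_users_affected=10))  # SEV1
```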

#Sustainable on-call is a reliability problem

We’ve all heard “we should take care of our engineers.” While true, it doesn’t always move the needle in planning conversations. So let’s put it differently:

A fatigued engineer makes worse decisions. They miss things. They communicate poorly. They stay heads-down when they should be coordinating. They don’t hand off when they should. When they burn out entirely, you lose institutional knowledge that’s almost impossible to replace.

Responder health is not a people problem. It’s a reliability problem.

No one should serve as Incident Commander for more than two hours without a rotation plan. Two hours of sustained, high-stakes coordination is cognitively exhausting. For any incident running longer than two hours, the IC should be planning their handoff before the two-hour mark — not when they’re already burnt out.
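
Mechanically, this is just a timer attached to the IC role. A sketch of the reminders you might wire into an existing scheduler or paging bot; the 90-minute planning nudge is my assumption, not a standard:

```python
from datetime import datetime, timedelta

IC_SHIFT_LIMIT = timedelta(hours=2)        # the hard ceiling from above
PLAN_HANDOFF_AT = timedelta(minutes=90)    # assumed: plan well before the limit

def ic_reminders(shift_start: datetime) -> list[tuple[datetime, str]]:
    """When to nudge the current IC; feed these into a scheduler or bot."""
    return [
        (shift_start + PLAN_HANDOFF_AT,
         "Start preparing your IC handoff: pick a successor, draft the summary."),
        (shift_start + IC_SHIFT_LIMIT,
         "IC shift limit reached: hand off command now."),
    ]

for when, message in ic_reminders(datetime(2024, 3, 12, 2, 0)):
    print(when.isoformat(), "-", message)
```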

#The handoff is the reliability practice

Most teams treat the role handoff as an interruption to the real work — something to get through quickly so the next person can start fixing things. That’s wrong.

A well-executed handoff keeps institutional knowledge in the incident as individuals cycle out. A bad handoff — or no handoff — is how critical context gets lost and incident duration doubles.

Every handoff must include:

  • Current status and severity
  • What’s been tried and what worked or didn’t
  • Active mitigation steps
  • Known risks and open questions
  • Outstanding action items

Document it. Announce it in the incident channel. Don't pass it privately in a DM.
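
To make "document it" concrete: the five items fit in a short template. A sketch, with the fields simply restating the checklist:

```python
# Illustrative handoff note covering the five required items. Posting the
# rendered note to the incident channel is the announcement.
HANDOFF_TEMPLATE = """\
HANDOFF: {outgoing} -> {incoming} ({role})
Status / severity     : {status} / {severity}
Tried, and outcome    : {attempts}
Active mitigation     : {mitigation}
Risks / open questions: {risks}
Outstanding items     : {action_items}
"""

print(HANDOFF_TEMPLATE.format(
    outgoing="alice", incoming="bob", role="Incident Commander",
    status="mitigating", severity="SEV2",
    attempts="rolled back v213 (no effect); raised DB connection pool (helped)",
    mitigation="draining traffic away from us-east-1",
    risks="unclear whether replica lag will spike during the drain",
    action_items="confirm error rate < 0.1%, then consider downgrading severity",
))
```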

#Treat every incident like an incident

The ICS structure only works if you use it every single time. SEV2 bug? Declare an IC, log it. SEV3? Same thing.

The process should feel boring on the small ones — because boring means it’s muscle memory. And when the SEV1 hits at 2am, nobody is reading a runbook. They’re already in position.

Tabletop exercises help. Fire drills help. But nothing builds the reflex like actual reps. The teams that handle major incidents well aren’t calmer under pressure — they’ve done the process so many times on small incidents that the roles activate automatically.

#Blameless postmortems, or you’re wasting the incident

Every incident that reaches SEV2 or higher should have a postmortem. No exceptions.

And it must be genuinely blameless — not “we won’t fire anyone over this,” but analysis focused on system failures and process gaps, not individual performance. The moment people think they’ll be blamed, they stop being honest. And you lose the ability to learn.

A good postmortem produces specific, actionable follow-up items. Not “we should improve monitoring” — but “we need an alert on X metric with a threshold of Y by this date, owned by this person.”
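
One cheap way to keep follow-ups honest is to refuse to record an action item that lacks an owner or a due date. A sketch of that rule, with hypothetical field names:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    """A postmortem follow-up: a specific change, an owner, and a date."""
    description: str    # e.g. "alert on X metric with threshold Y"
    owner: str
    due: date

    def __post_init__(self) -> None:
        # The structure can force an owner and a deadline to exist;
        # keeping the description specific still takes human judgment.
        if not self.description.strip() or not self.owner.strip():
            raise ValueError("action items need a description and an owner")

print(ActionItem(
    description="Alert when checkout error rate exceeds 2% over 5 minutes",
    owner="bob",
    due=date(2024, 4, 1),
))
```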

The cost of the incident has already been paid. The downtime happened. The engineers were paged. The customers were affected. That cost is sunk.

The postmortem is how you get the return on that cost. Organizations that take it seriously reduce incident frequency, reduce mean time to resolution, and get more resilient. The organizations that skip it just pay the same cost again next quarter.

An incident without a postmortem is a lesson you’ll have to learn again.