Rollback or fix forward? An on-call decision playbook

An on-call decision playbook for choosing rollback vs fix-forward during an incident, with safe Kubernetes and ArgoCD rollback commands and verification steps.

By Intellira Engineering, Editorial team

The 30-second answer

Roll back when the failure is change-induced and the change is reversible — a bad image, a bad config, a regression in the latest deploy. Fix forward when rolling back is itself unsafe: a forward-only database migration ran, data has already been written in a new shape, or only part of a progressive rollout is affected. The decision rule is not "which is faster to type" — it is which path has the smaller, more predictable blast radius. And before either: mitigate. Google SRE's incident-management guidance is blunt about ordering — "Stop the bleeding, restore service, and preserve the evidence for root-causing." Restoring service is the job; the root cause can wait for the review.

Mitigate before you root-cause

The most expensive mistake on-call is debugging a live outage. You are not trying to understand the bug at 3 a.m.; you are trying to make the impact stop. Rollback and fix-forward are both mitigations — ways to restore service — and they sit ahead of root-cause analysis, not after it.

This is also why DORA counts the cost honestly. Change failure rate is "the ratio of deployments that require immediate intervention following a deployment" — explicitly including changes "likely resulting in a rollback of the changes or a 'hotfix' to quickly remediate any issues." A rollback is not a failure of the on-call engineer; it is the recovery mechanism the metric expects. The metric that rewards you for moving fast here is failed deployment recovery time — "the time it takes to recover from a deployment that fails and requires immediate intervention." Optimize for that number, not for being right about the cause.
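As a rough illustration of the first metric, the ratio is just failed deployments over total deployments. The sketch below is a hypothetical helper, not DORA tooling, and the counts in the usage line are made up for the example.

```shell
# change_failure_rate FAILED TOTAL
# Prints the percentage of deployments that needed immediate intervention.
# Hypothetical helper; both counts are supplied by the caller.
change_failure_rate() {
  local failed="$1" total="$2"
  if [ "$total" -le 0 ]; then
    echo "total deployments must be positive" >&2
    return 1
  fi
  # awk does the floating-point division that plain shell arithmetic lacks
  awk -v f="$failed" -v t="$total" 'BEGIN { printf "%.1f\n", f * 100 / t }'
}
```

For example, change_failure_rate 3 40 prints 7.5: three of forty deployments needed a rollback or hotfix.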

So the first question on-call is never "what broke?" It is "what changed, and can I take it back?"

Scope the decision: four signals that point to rollback

Roll back when most of these hold:

  1. Change-induced. The failure started at or just after a deploy, config push, or flag flip. The error rate, latency, or crash loop tracks the rollout timeline.
  2. Reversible. The previous version is still deployable as-is. No forward-only migration ran; no data was written in a shape the old code can't read.
  3. Fast. The rollback path is a single, well-understood operation (kubectl rollout undo, an ArgoCD sync to a prior revision) that completes in seconds to a couple of minutes.
  4. Bounded. You can predict exactly what reverts. Code and the config that shipped with it go back together; nothing else moves.

When those hold, rollback is the lowest-blast-radius option and you should take it immediately. You can find the root cause from the reverted commit at leisure.
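One way to check the first signal is to line up ReplicaSet creation times against the start of the alert. This is a minimal sketch: the function name is hypothetical, and the label selector (for example app=web-api) is an assumption about how your workloads are labeled.

```shell
# recent_rollouts DEPLOYMENT LABEL_SELECTOR
# Lists the Deployment's revisions, then its ReplicaSets oldest-first, so the
# newest rollout's creation time can be compared with when the errors began.
recent_rollouts() {
  local deploy="$1" selector="$2"
  kubectl rollout history "deployment/${deploy}"
  kubectl get replicasets -l "$selector" \
    --sort-by=.metadata.creationTimestamp \
    -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp,DESIRED:.spec.replicas
}
```

Usage might look like recent_rollouts web-api app=web-api. If the last CREATED timestamp sits just before the error spike, the failure is probably change-induced.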

When to fix forward instead

Rollback is a deploy in reverse — it carries its own risk, and in several cases that risk is higher than shipping a fix. Fix forward when:

  • A forward-only schema migration ran. Application rollback does not undo a migration. If the new version added a NOT NULL column, dropped one, or rewrote data, the old code may crash against the new schema — or worse, the old code runs against migrated data and corrupts it. Rolling the app back without rolling the schema back is the classic way to turn one incident into two.
  • Data has already been written in the new shape. Even without a migration, if the new release persisted records in a new format, queue payloads in a new schema, or cache entries the old code can't parse, going back means the old version chokes on data it never produced.
  • Only part of a progressive rollout is affected. With a canary or partial rollout, you may not need a full revert — pausing the rollout or scaling the canary to zero can be a smaller, faster mitigation than rolling the whole fleet.
  • The previous version was also broken. If the bug predates the last deploy, rolling back lands you on a different bad state, not a good one. Verify the prior revision was actually healthy before you trust it.

Fix forward means shipping a small, targeted change — a one-line config correction, a guard around the failing code path, a feature-flag kill switch — through your normal (expedited) deploy path. The tradeoff: it is slower to author and ships new code under pressure, which is exactly when regressions sneak in. Prefer it only when rollback is genuinely unsafe, and keep the forward change as small as the mitigation requires.
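For the partial-rollout case, the smaller mitigations can be sketched as below. The function names are hypothetical, and drain_canary assumes the canary runs as its own Deployment (for example web-api-canary), which may not match how your rollout tooling models canaries.

```shell
# pause_rollout DEPLOYMENT: freeze a rolling update where it currently stands.
pause_rollout() {
  kubectl rollout pause "deployment/$1"
}

# drain_canary CANARY_DEPLOYMENT: take the canary out of service entirely.
# Assumes the canary is a separate Deployment, not a slice of the main one.
drain_canary() {
  kubectl scale "deployment/$1" --replicas=0
}

# resume_rollout DEPLOYMENT: Kubernetes will not roll back a paused
# Deployment, so resume before running kubectl rollout undo.
resume_rollout() {
  kubectl rollout resume "deployment/$1"
}
```

Pausing is the gentler of the two: it stops the rollout from spreading but leaves the already-updated Pods serving, so it fits only when the affected slice is tolerable for a few minutes.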

How to roll back a Kubernetes Deployment safely

Kubernetes tracks every rollout at the ReplicaSet level — each unique Pod template (image, env, resources, probes) produces a new ReplicaSet — so a rollback is the controller scaling a prior ReplicaSet back up. A Deployment's rollout "is triggered if and only if the Deployment's Pod template (.spec.template) is changed", which is also why scaling alone never creates a new revision to roll back to.

First, see what you can roll back to:

kubectl rollout history deployment/web-api

Roll back to the immediately previous revision:

kubectl rollout undo deployment/web-api

Or target a specific known-good revision:

kubectl rollout undo deployment/web-api --to-revision=2

Then wait for it and confirm it converged — do not assume:

kubectl rollout status deployment/web-api

A successful run prints deployment "web-api" successfully rolled out.
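The steps above can be chained so a failed step stops the sequence and a stalled rollout does not hang your terminal. This is a minimal sketch, not a hardened tool: safe_rollback is a hypothetical name and the 120s default timeout is an assumption to tune for your workload.

```shell
# safe_rollback DEPLOYMENT [REVISION] [TIMEOUT]
# Rolls back, then blocks until the rollout converges or the timeout expires.
safe_rollback() {
  local deploy="$1" revision="${2:-}" timeout="${3:-120s}"
  if [ -n "$revision" ]; then
    # Print the target revision's Pod template so you can confirm the image
    # and config you are about to return to.
    kubectl rollout history "deployment/${deploy}" --revision="$revision" || return 1
    kubectl rollout undo "deployment/${deploy}" --to-revision="$revision" || return 1
  else
    kubectl rollout undo "deployment/${deploy}" || return 1
  fi
  # rollout status exits non-zero if the rollout has not finished in time
  if ! kubectl rollout status "deployment/${deploy}" --timeout="$timeout"; then
    echo "rollback of ${deploy} did not converge within ${timeout}" >&2
    return 1
  fi
}
```

Usage might look like safe_rollback web-api 2 180s. A non-zero exit means the rollback itself needs attention before you treat the incident as mitigated.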

One pitfall that bites teams mid-incident: you can only roll back to revisions Kubernetes still retains. .spec.revisionHistoryLimit controls how many old ReplicaSets are kept, and "by default, 10 old ReplicaSets will be kept." Set it to 0 and you keep none — a rollout "cannot be undone, since its revision history is cleaned up." Check this before an incident, not during one.
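A pre-incident check for this could look like the sketch below. Both function names are hypothetical, and the minimum of 3 retained revisions is an arbitrary threshold chosen for illustration, not a Kubernetes recommendation.

```shell
# revision_history_limit DEPLOYMENT
# Prints how many old ReplicaSets the Deployment retains.
revision_history_limit() {
  local limit
  limit="$(kubectl get deployment "$1" -o jsonpath='{.spec.revisionHistoryLimit}')" || return 1
  # The API server normally fills in the default; the fallback to the
  # documented default of 10 is only a guard against an empty response.
  echo "${limit:-10}"
}

# check_rollback_headroom DEPLOYMENT [MIN]
# Fails if fewer than MIN old revisions are retained (MIN defaults to 3,
# an assumed threshold).
check_rollback_headroom() {
  local deploy="$1" min="${2:-3}" limit
  limit="$(revision_history_limit "$deploy")" || return 1
  if [ "$limit" -lt "$min" ]; then
    echo "WARN: ${deploy} retains ${limit} old revisions, want at least ${min}" >&2
    return 1
  fi
  echo "OK: ${deploy} retains ${limit} old revisions"
}
```

Running check_rollback_headroom web-api from a scheduled job or a pre-deploy check surfaces a too-low limit while there is still time to raise it.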

How to roll back with ArgoCD (GitOps)

Under GitOps the cluster reflects Git, so a rollback has two layers: the live cluster and the source of truth. There is a sharp gotcha here. Per the ArgoCD docs, "Rollback cannot be performed against an application with automated sync enabled." If self-heal is on, ArgoCD will fight you — when the live state deviates from Git it re-syncs "after self-heal timeout (5 seconds by default)", dragging the broken version right back.

So the safe sequence is: disable automated sync, roll back, verify, then decide about Git.

Disable auto-sync for the app:

argocd app set web-api --sync-policy none

List deployment history to find the target ID:

argocd app history web-api

Roll back to a previous deployed version by its history ID (omit the ID to go to the immediately previous version):

argocd app rollback web-api 12
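The three commands can be wrapped so they always run in the safe order and stop at the first failure. This is a sketch under assumptions: argocd_rollback is a hypothetical name, the 120-second wait is an arbitrary default, and the history ID is required here (even though the CLI lets you omit it) so the target is always an explicit choice.

```shell
# argocd_rollback APP HISTORY_ID [TIMEOUT_SECONDS]
# Disables automated sync, rolls back to the given history ID, then waits
# for the application to report Healthy.
argocd_rollback() {
  local app="$1" history_id="${2:-}" timeout="${3:-120}"
  if [ -z "$history_id" ]; then
    echo "usage: argocd_rollback APP HISTORY_ID [TIMEOUT_SECONDS]" >&2
    return 1
  fi
  argocd app set "$app" --sync-policy none || return 1
  argocd app rollback "$app" "$history_id" || return 1
  # Wait on health only: Git still points at the bad commit at this stage,
  # so the app is expected to show as out of sync.
  argocd app wait "$app" --health --timeout "$timeout"
}
```

Usage might look like argocd_rollback web-api 12, with the ID taken from argocd app history web-api.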

The rollback restores the cluster to a prior revision, but Git still points at the bad commit. Until you reconcile that, anyone re-enabling sync, including the next person on-call, re-applies the failure. Two steps close the loop, in this order:

  • Revert in Git (the durable fix): git revert the offending commit, merge, and let ArgoCD sync the cluster back to a known-good source of truth. This keeps Git as the real source of truth.
  • Re-enable sync only after Git is good. Once Git reflects the rolled-back state, restore automation: argocd app set web-api --sync-policy automated.

The tradeoff between the two layers is the whole point of GitOps: a live-only rollback is faster but leaves a landmine in your source of truth; a Git revert is slightly slower but self-consistent. In an incident, do the live rollback to stop the bleeding, then immediately revert in Git so the system is coherent before you re-arm self-heal.
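The Git half of that sequence might be sketched as follows. The function names are hypothetical, the repository path and commit are whatever your GitOps repo uses, and pushing or merging the revert is left to your normal review path.

```shell
# revert_in_git REPO_DIR BAD_COMMIT
# Creates a new commit that undoes BAD_COMMIT. History is preserved, so the
# bad change stays available for the post-incident review.
revert_in_git() {
  local repo="$1" bad_commit="$2"
  git -C "$repo" revert --no-edit "$bad_commit"
}

# rearm_sync APP
# Restores automated sync. Run this only after the revert has been merged
# and Git describes the state you rolled back to.
rearm_sync() {
  argocd app set "$1" --sync-policy automated
}
```

Keeping the two functions separate is deliberate: there should be a human check between the revert landing and automation being switched back on.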

Verify the mitigation actually held

A rollback that completes is not a rollback that worked. Confirm with a signal, not a feeling. It is the same standard the incident commander checklist applies before declaring an incident resolved:

  • Workload is healthy. kubectl rollout status reports success and kubectl get pods shows the expected ready replicas with no crash loops.
  • The symptom is gone at the source. The error rate, latency percentile, or crash signal that triggered the incident is back to baseline — and has stayed there for a sustained window, not for one scrape interval.
  • No new symptom appeared. Rolling back can surface its own problems (the data-shape mismatch above is the common one). Watch for a different error spiking as the old version meets new data.
  • GitOps is coherent. If you used ArgoCD, confirm the app is Synced/Healthy against a Git state you trust, not against the bad commit with sync disabled.

Declare the mitigation successful only after the signal holds. Then the incident moves from "stop the bleeding" to root cause — on your schedule, from the reverted change.
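The workload half of that check can be scripted. The sketch below covers only Pod readiness; the error-rate and latency signals live in your monitoring system and are not shown. Function names, the five-check count, and the 30-second interval are all assumptions.

```shell
# replicas_ready DEPLOYMENT
# Succeeds only when every desired replica is ready and at least one exists.
replicas_ready() {
  local deploy="$1" ready desired
  ready="$(kubectl get deployment "$deploy" -o jsonpath='{.status.readyReplicas}')" || return 1
  desired="$(kubectl get deployment "$deploy" -o jsonpath='{.spec.replicas}')" || return 1
  # readyReplicas is omitted from status when no Pods are ready, so
  # treat an empty value as zero.
  [ "${desired:-0}" -gt 0 ] && [ "${ready:-0}" -eq "${desired:-0}" ]
}

# held_for DEPLOYMENT [CHECKS] [INTERVAL_SECONDS]
# Requires CHECKS consecutive passes, so one good sample is not enough.
held_for() {
  local deploy="$1" checks="${2:-5}" interval="${3:-30}" i=0
  while [ "$i" -lt "$checks" ]; do
    if ! replicas_ready "$deploy"; then
      echo "readiness check $((i + 1)) of ${checks} failed for ${deploy}" >&2
      return 1
    fi
    i=$((i + 1))
    if [ "$i" -lt "$checks" ]; then
      sleep "$interval"
    fi
  done
}
```

Usage might look like held_for web-api 5 30, which passes only if the Deployment stays fully ready across five samples taken over two minutes.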

Pitfalls that turn one incident into two

  • Rolling back code but not the schema. The single most damaging mistake. If a forward-only migration ran, an application-only rollback is not safe. Either fix forward, or roll back the migration deliberately and only if it is genuinely reversible — many are not.
  • Treating a config rollback like a code rollback. Config that lives outside your deploy artifact (a feature flag store, a ConfigMap edited out-of-band, a runtime parameter) does not revert when the image does. Identify whether the bad change was code or config, because they roll back through different mechanisms — and a kubectl rollout undo will not touch a flag flipped in a separate system.
  • Re-arming ArgoCD self-heal too early. If you disable auto-sync and roll back but leave Git pointing at the bad commit, self-heal re-applies the failure within seconds of being switched back on. Reconcile Git first.
  • Rolling back to a revision that was also broken. Confirm the target revision was actually healthy before you bet the recovery on it.
  • Skipping rollout status. Firing undo and walking away assumes convergence. Watch it finish.
  • Discovering revisionHistoryLimit is too low mid-incident. If the known-good ReplicaSet was already garbage-collected, you have no fast rollback at all. Set a sane limit ahead of time.

The decision in one line

Change-induced and reversible → roll back now, verify, then revert in Git. Irreversible migration, migrated data, or a partial rollout → fix forward with the smallest safe change. Either way, mitigate first and find the cause second. The slowest possible recovery is the one where you tried to understand the bug before you stopped the impact.

Knowing which change to take back is the hard part at 3 a.m. — and it is exactly the "what changed?" question an on-call engineer burns minutes on. Intellira correlates the incident with the commit, build, and deploy behind it, so you start the rollback-or-fix-forward decision from an evidence-backed timeline instead of a blank channel.

By Intellira Engineering. AI-assisted draft, reviewed by the Intellira engineering team; claims cited inline; last verified 2026-06-02.
