Building a Self-Healing Deployment Pipeline

Deployments break. Your site goes down at 2 AM. Users see generic errors while you scramble to roll back manually. Here's how I built a deployment system that heals itself.

The Problem

Traditional deployments fail silently. Code deploys, then breaks production hours later. Users see ugly Nginx errors. You SSH in and fix it by hand. There's a better way.

The Solution

1. Branded Error Pages

Dual-layer error handling:

- Django templates when Django works
- Static Nginx fallbacks when Django crashes
- Always branded, never generic

2. Health Checks with Timeouts

Four checks after each deployment:

- HTTP Response (30s max): Site responding?
- Gunicorn Service (10s): App server running?
- Django Check (20s): Configuration valid?
- Static Files (30s): Assets loading?

Timeouts keep a hung check from stalling the pipeline. Fail fast, roll back immediately.
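Here is a minimal sketch of how the four checks could share one timeout wrapper. It assumes GNU `timeout` (coreutils) is installed on the server. The names `run_check` and `run_all_checks` and both URL arguments are hypothetical and not part of the original pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run one health check under a hard time limit.
# Usage: run_check <label> <seconds> <command> [args...]
run_check() {
  local label="$1" limit="$2" status=0
  shift 2
  # GNU timeout kills the command after $limit seconds and exits with 124.
  timeout "$limit" "$@" || status=$?
  if [ "$status" -eq 0 ]; then
    echo "PASS $label"
  elif [ "$status" -eq 124 ]; then
    echo "FAIL $label (timed out after ${limit}s)"
  else
    echo "FAIL $label (exit $status)"
  fi
  return "$status"
}

# Run the four checks in order and stop at the first failure.
# Both URLs are placeholders supplied by the caller (assumption).
# Usage: run_all_checks <site_url> <static_asset_url>
run_all_checks() {
  local site_url="$1" static_url="$2"
  run_check "http"     30 curl --fail --silent --output /dev/null "$site_url" &&
  run_check "gunicorn" 10 systemctl is-active --quiet gunicorn &&
  run_check "django"   20 python manage.py check &&
  run_check "static"   30 curl --fail --silent --output /dev/null "$static_url"
}
```

Because the checks are chained with `&&`, the first failure or timeout makes `run_all_checks` return non-zero, which is the signal a later rollback step would react to.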

3. Last-Known-Good Rollback

The clever part: we don't roll back to the previous commit; we roll back to the last working commit.

Every successful deployment saves the commit SHA. On failure, restore that exact version. Two broken commits in a row? Still safe.
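The bookkeeping behind this can be sketched in a few lines of shell. The helper names `record_good` and `rollback_target` are hypothetical. The temp-file rename and the SHA format check are extra safeguards I am assuming here, not something the original pipeline is known to do.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for tracking the last verified commit.
# STATE_FILE defaults to the .last_good_commit file used by the workflow.
STATE_FILE=${STATE_FILE:-.last_good_commit}

# Call this only after every health check has passed.
record_good() {
  local sha="$1"
  # Write to a temp file, then rename, so a crash mid-write
  # cannot leave a truncated SHA behind.
  printf '%s\n' "$sha" > "$STATE_FILE.tmp" && mv "$STATE_FILE.tmp" "$STATE_FILE"
}

# Print the commit to restore. Fails if nothing has been recorded
# or the stored value does not look like a full 40-character SHA-1.
rollback_target() {
  local sha
  [ -r "$STATE_FILE" ] || return 1
  sha=$(head -n 1 "$STATE_FILE")
  printf '%s' "$sha" | grep -Eq '^[0-9a-f]{40}$' || return 1
  printf '%s\n' "$sha"
}
```

A failed deployment never calls `record_good`, so after two broken commits in a row `rollback_target` still prints the same verified SHA.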

Visual: The Deployment Flow

Impact

Before: 5-10 minute manual rollbacks. Users saw errors.

After: 30-second automatic rollback. Branded error pages during recovery.

The Code

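# GitHub Actions workflow
- name: Health Check
  run: |
    # --fail makes curl exit non-zero on HTTP 4xx/5xx responses
    curl --fail --silent --show-error --max-time 30 https://site.com
    timeout 10 systemctl is-active gunicorn
    timeout 20 python manage.py check

# Assumed step: save the SHA only after the health checks pass
- name: Record Last Good Commit
  if: success()
  run: git rev-parse HEAD > .last_good_commit

- name: Rollback
  if: failure()
  run: |
    # Restore the last commit that passed every health check
    GOOD=$(cat .last_good_commit)
    git reset --hard "$GOOD"
    systemctl restart gunicorn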
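Since the rollback step is production code too, it helps to rehearse it somewhere harmless. The sketch below is a hypothetical drill, not part of the original pipeline: it builds a throwaway Git repository, records one good commit, adds two broken ones, and then runs the same `git reset --hard` as the Rollback step. The function name `rollback_drill` and the file `app.txt` are made up for illustration, and the service restart is left out.

```shell
#!/usr/bin/env bash
# Hypothetical rollback drill: runs entirely in a throwaway repository.
rollback_drill() {
  local repo good status=0
  repo=$(mktemp -d) || return 1
  (
    cd "$repo" || exit 1
    git init --quiet
    git config user.email "drill@example.com"
    git config user.name "Rollback Drill"

    # One release that "passed" its health checks.
    echo "v1" > app.txt
    git add app.txt && git commit --quiet -m "good release"
    git rev-parse HEAD > .last_good_commit

    # Two broken releases in a row; neither updates the state file.
    echo "broken-1" > app.txt && git commit --quiet -am "broken release 1"
    echo "broken-2" > app.txt && git commit --quiet -am "broken release 2"

    # Same commands as the Rollback step, minus the service restart.
    good=$(cat .last_good_commit)
    git reset --hard --quiet "$good"

    # The drill passes only if both the file and HEAD are back at v1.
    [ "$(cat app.txt)" = "v1" ] && [ "$(git rev-parse HEAD)" = "$good" ]
  ) || status=$?
  rm -rf "$repo"
  return "$status"
}
```

Running `rollback_drill && echo "rollback OK"` in CI gives early warning if a change to the rollback logic stops restoring the recorded commit.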

Key Lessons

- Fail fast: Time out hung services immediately.
- Track working state: Not just HEAD~1, but the last verified working commit.
- Brand errors: Users see your brand even during failures.
- Test rollbacks: Your rollback is production code too.

Built with Django, GitHub Actions, and defensive programming. Deployments should heal themselves.