Building a Self-Healing Deployment Pipeline
Deployments break. Your site goes down at 2 AM. Users see generic errors while you scramble to roll back manually. Here's how I built a deployment system that heals itself automatically.
The Problem
Traditional deployments fail silently. Code deploys, then breaks production hours later. Users see ugly Nginx errors. You SSH in and fix it by hand. There's a better way.
The Solution
1. Branded Error Pages
Dual-layer error handling:
- Django templates when Django works
- Static Nginx fallbacks when Django crashes
- Always branded, never generic
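The static layer only works if the fallback files already exist on disk when Django is down. Here is a minimal sketch of one way to pre-render them at deploy time. The function name `write_fallback_pages`, the page wording, the set of status codes, and the output directory are all assumptions for illustration, not the actual setup.

```python
from __future__ import annotations

from pathlib import Path
from string import Template

# Hypothetical branded template; a real one would carry the site's styling.
PAGE = Template("""<!doctype html>
<html lang="en">
<head><meta charset="utf-8"><title>$status | $brand</title></head>
<body>
  <h1>$title</h1>
  <p>$message</p>
</body>
</html>
""")

# Assumed set of statuses Nginx would answer itself when the app server is down.
ERRORS = {
    500: ("Something went wrong", "We are working on a fix."),
    502: ("We'll be right back", "The application is restarting."),
    503: ("Down for maintenance", "Please try again in a minute."),
    504: ("That took too long", "Please try again in a minute."),
}


def write_fallback_pages(out_dir: Path, brand: str) -> list[Path]:
    """Render one static HTML file per status code into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for status, (title, message) in sorted(ERRORS.items()):
        path = out_dir / f"{status}.html"
        html = PAGE.substitute(status=status, brand=brand, title=title, message=message)
        path.write_text(html, encoding="utf-8")
        written.append(path)
    return written
```

Nginx can then point at these files with its `error_page` directive, while Django keeps serving its own `404.html` and `500.html` templates whenever it is healthy.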
2. Health Checks with Timeouts
Four checks after each deployment:
- HTTP Response (30s max): Site responding?
- Gunicorn Service (10s): App server running?
- Django Check (20s): Configuration valid?
- Static Files (30s): Assets loading?
Timeouts prevent hanging. Fail fast, roll back immediately.
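The same fail-fast idea can be sketched in Python: each check is a command with its own time budget, and the first failure or timeout stops the run. The names `Check`, `run_check`, and `first_failure` are hypothetical. The static-files check is left out because the post does not give an asset URL.

```python
from __future__ import annotations

import subprocess
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Check:
    name: str
    command: list[str]
    timeout_s: float


def run_check(check: Check) -> bool:
    """Return True only if the command exits 0 within its time budget."""
    try:
        result = subprocess.run(
            check.command,
            timeout=check.timeout_s,  # the child is killed once this elapses
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung or missing command counts as a failed check.
        return False
    return result.returncode == 0


def first_failure(checks: list[Check]) -> Optional[str]:
    """Run checks in order; return the name of the first one that fails."""
    for check in checks:
        if not run_check(check):
            return check.name
    return None


# Commands and limits taken from the post; nothing runs until first_failure is called.
CHECKS = [
    Check("http", ["curl", "--fail", "--silent", "--max-time", "30", "https://site.com"], 30),
    Check("gunicorn", ["systemctl", "is-active", "--quiet", "gunicorn"], 10),
    Check("django", ["python", "manage.py", "check"], 20),
]
```

A caller would trigger the rollback as soon as `first_failure(CHECKS)` returns a name instead of `None`.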
3. Last-Known-Good Rollback
The clever part: we don't roll back to the previous commit. We roll back to the last working commit.
Every successful deployment saves the commit SHA. On failure, restore that exact version. Two broken commits in a row? Still safe.
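A minimal sketch of that bookkeeping, assuming the SHA lives in a small text file such as the `.last_good_commit` file the workflow reads. The function names are hypothetical, and the write-then-rename step is an added precaution rather than something the post describes.

```python
from pathlib import Path
from typing import Optional


def record_good(state_file: Path, sha: str) -> None:
    """Save the SHA of a deployment that passed every health check."""
    tmp = state_file.with_name(state_file.name + ".tmp")
    tmp.write_text(sha + "\n")
    tmp.replace(state_file)  # rename so readers never see a half-written file


def last_good(state_file: Path) -> Optional[str]:
    """Return the recorded SHA, or None if nothing has been recorded yet."""
    try:
        sha = state_file.read_text().strip()
    except FileNotFoundError:
        return None
    return sha or None


def resolve_live_commit(state_file: Path, deployed_sha: str, healthy: bool) -> Optional[str]:
    """Return the commit that should be live after this deployment attempt."""
    if healthy:
        record_good(state_file, deployed_sha)
        return deployed_sha
    # Failed deployments never update the file, so consecutive broken
    # commits all resolve to the same verified SHA.
    return last_good(state_file)
```

Because only healthy deployments write the file, a second broken commit resolves to the same target as the first, which is the property `HEAD~1` cannot give you.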
Visual: The Deployment Flow
Impact
Before: 5-10 minute manual rollbacks. Users saw errors.
After: 30-second automatic rollback. Branded error pages during recovery.
The Code
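# GitHub Actions workflow (steps excerpt)
- name: Health Check
  run: |
    # --fail makes curl exit non-zero on HTTP 4xx/5xx instead of reporting success
    curl --fail --silent --show-error --max-time 30 --output /dev/null https://site.com
    timeout 10 systemctl is-active --quiet gunicorn
    timeout 20 python manage.py check
- name: Record Last Good Commit
  # Runs only if every previous step succeeded (the default step condition)
  run: git rev-parse HEAD > .last_good_commit
- name: Rollback
  if: failure()
  run: |
    # Restore the last commit that passed all health checks
    GOOD=$(cat .last_good_commit)
    git reset --hard "$GOOD"
    systemctl restart gunicorn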
Key Lessons
Fail fast: Time out hung services immediately.
Track working state: Not just HEAD~1, but the last verified working commit.
Brand errors: Users see your brand even during failures.
Test rollbacks: Your rollback is production code too.
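One way to act on that last lesson is to wrap the rollback commands in a function that accepts an injectable runner, so a unit test can assert on the exact commands without touching git or systemd. This is a sketch: `rollback`, its `run` parameter, and the test class are hypothetical names, and the commands mirror the workflow's Rollback step.

```python
import subprocess
import tempfile
import unittest
from pathlib import Path


def rollback(state_file, run=subprocess.run):
    """Reset to the recorded last-good commit, then restart the app server."""
    sha = Path(state_file).read_text().strip()
    if not sha:
        raise RuntimeError("no last-good commit recorded")
    run(["git", "reset", "--hard", sha], check=True)
    run(["systemctl", "restart", "gunicorn"], check=True)
    return sha


class RollbackTest(unittest.TestCase):
    def setUp(self):
        self.calls = []
        self.tmp = tempfile.TemporaryDirectory()
        self.addCleanup(self.tmp.cleanup)
        self.state = Path(self.tmp.name) / ".last_good_commit"

    def fake_run(self, cmd, check):
        # Record the command instead of executing it.
        self.calls.append(cmd)

    def test_resets_then_restarts(self):
        self.state.write_text("abc1234\n")
        self.assertEqual(rollback(self.state, run=self.fake_run), "abc1234")
        self.assertEqual(self.calls, [
            ["git", "reset", "--hard", "abc1234"],
            ["systemctl", "restart", "gunicorn"],
        ])

    def test_refuses_empty_state(self):
        self.state.write_text("")
        with self.assertRaises(RuntimeError):
            rollback(self.state, run=self.fake_run)
        self.assertEqual(self.calls, [])
```

Saved to a file, these tests run with `python -m unittest`. A staging drill that deploys a deliberately broken commit would complement them.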
Built with Django, GitHub Actions, and defensive programming. Deployments should heal themselves.