What to Do When AWS Goes Down: A Practical Guide

When AWS Goes Down, You Need a Plan

AWS powers roughly a third of the cloud infrastructure market. When it has an outage — and it does, every year — the ripple effects hit thousands of businesses simultaneously. The difference between a minor inconvenience and a full-blown crisis comes down to preparation.

Here's your practical playbook for the next AWS outage.

Step 1: Confirm It's Actually AWS

Before you blame AWS, verify the issue:

Check the AWS status page: health.aws.amazon.com (it is often slow to update)

Check ServiceAlert.ai: We monitor AWS and dozens of services that depend on it

Check your specific region: AWS outages are usually regional, not global

Test from outside your network: Use a VPN or external monitoring to rule out local issues

Important: The AWS status page has historically been slow to acknowledge issues. Social media (Twitter/X, Reddit) often surfaces reports 15-30 minutes before the official status page updates.
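
The regional check can be scripted and run from a machine outside your own network. The following is a minimal sketch, not a full monitoring setup: it assumes that a plain TCP connection to a public regional endpoint (hostnames of the form service.region.amazonaws.com) is a useful first signal, and the region list and function names are illustrative. A successful connection only shows the endpoint is reachable, not that the service behind it is healthy.

```python
import socket
from typing import Callable, Dict, Sequence

# Illustrative region list; replace with the regions you actually run in.
REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]


def can_connect(host: str, port: int = 443, timeout: float = 3.0,
                connect: Callable = socket.create_connection) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        conn = connect((host, port), timeout=timeout)
        conn.close()
        return True
    except OSError:  # covers DNS failures, refusals, and timeouts
        return False


def probe_regions(service: str = "ec2", regions: Sequence[str] = REGIONS,
                  check: Callable[[str], bool] = can_connect) -> Dict[str, bool]:
    """Map each region to whether its public endpoint for `service` is reachable."""
    return {r: check(f"{service}.{r}.amazonaws.com") for r in regions}


def summarize(results: Dict[str, bool]) -> str:
    """Turn per-region reachability into a rough first guess."""
    down = [r for r, ok in results.items() if not ok]
    if not down:
        return "all probed endpoints reachable"
    if len(down) == len(results):
        return "nothing reachable: suspect your own network first"
    return "unreachable regions: " + ", ".join(sorted(down))


if __name__ == "__main__":
    print(summarize(probe_regions()))
```

If every region looks unreachable from your office but fine from an external host, the problem is more likely local than an AWS incident.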

Step 2: Identify What's Affected

AWS has over 200 services. An outage rarely affects all of them. Quickly determine:

Which AWS services are impacted? (EC2, S3, RDS, Lambda, CloudFront, etc.)
Which region? (us-east-1 is the most common source of major incidents)
Which of YOUR services depend on the affected AWS services?

This is where dependency mapping pays off. If you haven't documented your AWS dependencies, start now — you'll thank yourself during the next incident.
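
A dependency map does not need special tooling to be useful. As a rough sketch, assuming you keep an inventory of which AWS service and region each of your own services relies on (the service names below are made up), a lookup like this answers the third question above in seconds:

```python
from typing import Dict, List, Set, Tuple

AwsDependency = Tuple[str, str]  # (AWS service, region)

# Hypothetical inventory: names are for illustration only.
DEPENDENCIES: Dict[str, Set[AwsDependency]] = {
    "checkout-api": {("RDS", "us-east-1"), ("EC2", "us-east-1")},
    "image-uploads": {("S3", "us-east-1"), ("Lambda", "us-east-1")},
    "reporting": {("RDS", "eu-west-1")},
}


def impacted_services(
    affected: Set[AwsDependency],
    deps: Dict[str, Set[AwsDependency]] = DEPENDENCIES,
) -> Dict[str, List[AwsDependency]]:
    """Return each of your services that relies on an affected AWS dependency,
    together with the specific dependencies that are affected."""
    hits: Dict[str, List[AwsDependency]] = {}
    for name, needs in deps.items():
        overlap = sorted(needs & affected)
        if overlap:
            hits[name] = overlap
    return hits


if __name__ == "__main__":
    outage = {("S3", "us-east-1"), ("RDS", "us-east-1")}
    for service, causes in impacted_services(outage).items():
        print(service, "is affected via", causes)
```

Keeping the inventory in version control next to your infrastructure code makes it more likely to stay current.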

Step 3: Communicate Proactively

Don't wait for customers to complain. Get ahead of it:

Internal: Alert your engineering, support, and leadership teams via a backup communication channel that does not itself depend on AWS
External: Update your own status page within 15 minutes
Support team: Prepare templated responses for incoming tickets

Sample Status Page Update

"We're aware of an issue affecting [specific functionality]. This is related to an ongoing AWS incident in [region]. We're actively monitoring the situation and will provide updates every 30 minutes. Our team is evaluating mitigation options."

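Filling in a template by hand under pressure invites mistakes, so it can help to script it. A minimal sketch, assuming the wording of the sample update above; the function name and its parameters are illustrative:

```python
STATUS_TEMPLATE = (
    "We're aware of an issue affecting {functionality}. "
    "This is related to an ongoing AWS incident in {region}. "
    "We're actively monitoring the situation and will provide updates "
    "every {interval_minutes} minutes. "
    "Our team is evaluating mitigation options."
)


def render_status(functionality: str, region: str, interval_minutes: int = 30) -> str:
    """Fill in the status template, refusing to publish blank placeholders."""
    if not functionality.strip() or not region.strip():
        raise ValueError("functionality and region must both be provided")
    return STATUS_TEMPLATE.format(
        functionality=functionality.strip(),
        region=region.strip(),
        interval_minutes=interval_minutes,
    )


if __name__ == "__main__":
    print(render_status("file uploads", "us-east-1"))
```

The validation step is there so that an update never goes out with an empty placeholder in it.
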
Step 4: Mitigate Where Possible

Depending on your architecture, you may have options:

Multi-region: Fail over to an unaffected region
Multi-cloud: Route traffic to your Azure/GCP backup
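CDN: If CloudFront is down, switch DNS to a secondary CDN such as Cloudflare or Fastly (this only works quickly if the secondary is already configured and your DNS TTLs are short)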
Static fallback: Serve a static version of critical pages
Queue and retry: For non-real-time operations, queue requests for processing after recovery
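
The queue-and-retry option can be sketched as follows. This is an in-memory illustration under simplifying assumptions: a real system would normally use a durable queue so that work survives a restart, and the function name, parameters, and the choice of exponential backoff are illustrative rather than prescribed.

```python
import time
from collections import deque
from typing import Any, Callable, Deque, List


def drain_queue(
    queue: Deque[Any],
    handler: Callable[[Any], None],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Any]:
    """Process queued items in order, retrying each with exponential backoff.

    Returns the items that still failed after max_attempts tries.
    """
    failed: List[Any] = []
    while queue:
        item = queue.popleft()
        for attempt in range(max_attempts):
            try:
                handler(item)
                break  # success: move on to the next item
            except Exception:  # broad on purpose for this sketch
                if attempt == max_attempts - 1:
                    failed.append(item)
                else:
                    # Wait 1s, 2s, 4s, ... capped at max_delay.
                    sleep(min(max_delay, base_delay * 2 ** attempt))
    return failed


if __name__ == "__main__":
    pending = deque(["email-1", "email-2"])
    leftovers = drain_queue(pending, handler=print)
    print("still failing:", leftovers)
```

Capping the delay keeps retries from backing off indefinitely during a long outage, and returning the failures lets you re-queue them or alert on them instead of silently dropping work.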

Step 5: Monitor for Recovery

AWS outages often resolve in stages rather than all at once, so track recovery carefully:

Set up ServiceAlert.ai recovery alerts to get notified the moment services come back

Don't rush to declare "all clear" — services often flap between degraded and operational

Verify YOUR services are working, not just AWS — cached errors, connection pool issues, and stale DNS can persist after AWS recovers
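
One way to avoid declaring "all clear" during a flap is to require several consecutive healthy checks of your own service before changing status. A small sketch, assuming you already run a periodic health check; the class name and the threshold of five are arbitrary choices for illustration:

```python
class RecoveryTracker:
    """Report recovery only after N consecutive healthy checks."""

    def __init__(self, required_consecutive: int = 5) -> None:
        if required_consecutive < 1:
            raise ValueError("required_consecutive must be at least 1")
        self.required = required_consecutive
        self.streak = 0

    def record(self, healthy: bool) -> bool:
        """Record one check result; return True once recovery looks stable."""
        # Any failed check resets the streak, so a flapping service
        # never reaches the threshold.
        self.streak = self.streak + 1 if healthy else 0
        return self.streak >= self.required


if __name__ == "__main__":
    tracker = RecoveryTracker(required_consecutive=3)
    for result in [True, True, False, True, True, True]:
        print(result, "->", "all clear" if tracker.record(result) else "keep watching")
```

Feed it results from checks that exercise your own endpoints end to end, since those are what catch stale DNS or exhausted connection pools.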

Step 6: Post-Incident Review

After the dust settles:

Timeline: Document when you detected the issue, what you did, and when you recovered

Impact: Quantify the business impact (revenue lost, users affected, SLA breaches)

Gaps: What could you have detected faster? What mitigation was missing?

Action items: What will you build or change before the next outage?
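
The timeline and impact numbers are easy to compute once you have three timestamps. The sketch below assumes a 30-day measurement period and a 99.9% availability target of your own; both are placeholder values, and the function and field names are illustrative.

```python
from datetime import datetime, timedelta
from typing import Any, Dict


def incident_metrics(
    started: datetime,
    detected: datetime,
    recovered: datetime,
    period: timedelta = timedelta(days=30),  # assumed measurement window
    sla_target: float = 0.999,               # assumed target, not an AWS figure
) -> Dict[str, Any]:
    """Summarize one incident: detection delay, downtime, and SLA impact."""
    if not (started <= detected <= recovered):
        raise ValueError("expected started <= detected <= recovered")
    downtime = recovered - started
    availability = 1 - downtime / period  # timedelta / timedelta is a float
    return {
        "time_to_detect": detected - started,
        "downtime": downtime,
        "availability": availability,
        "sla_breached": availability < sla_target,
    }


if __name__ == "__main__":
    report = incident_metrics(
        started=datetime(2024, 1, 10, 14, 0),
        detected=datetime(2024, 1, 10, 14, 12),
        recovered=datetime(2024, 1, 10, 16, 30),
    )
    for key, value in report.items():
        print(f"{key}: {value}")
```

This treats the incident as the only downtime in the period; if there were others, add their durations before comparing against the target.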

Building Long-Term Resilience

Automate monitoring: Use ServiceAlert.ai to track all your cloud dependencies, not just AWS
Design for failure: Assume any AWS service can go down at any time
Test regularly: Run chaos engineering exercises or game days
Multi-region at minimum: Never run production in a single AZ or region
Cache aggressively: CDN and application-level caching can keep you running during upstream issues
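
For application-level caching, one pattern that helps during upstream issues is to serve a stale copy when a refresh fails. The following is a minimal in-process sketch of that idea; the class name, the 60-second TTL, and the injected fetch function are illustrative assumptions rather than a reference to any particular library.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class StaleOnErrorCache:
    """Cache that falls back to an expired entry when the upstream call fails."""

    def __init__(self, fetch: Callable[[Hashable], Any], ttl: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.fetch = fetch
        self.ttl = ttl
        self.clock = clock
        self._store: Dict[Hashable, Tuple[Any, float]] = {}  # key -> (value, stored_at)

    def get(self, key: Hashable) -> Any:
        entry = self._store.get(key)
        now = self.clock()
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]  # still fresh
        try:
            value = self.fetch(key)
        except Exception:
            if entry is not None:
                return entry[0]  # upstream is failing: serve the stale copy
            raise  # nothing cached, so the error has to surface
        self._store[key] = (value, now)
        return value
```

Whether stale data is acceptable depends on the content: it is usually fine for product pages and rarely fine for account balances.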

The Reality Check

AWS has excellent overall uptime, and its published SLAs for many core services commit to between 99.9% and 99.99% monthly availability. But with millions of customers, even brief outages make headlines. The question isn't whether AWS will have another outage, but whether you'll be ready when it happens.
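
To put availability percentages in perspective, it helps to convert them into allowed downtime. A quick worked example, assuming a 365-day year (the function name is illustrative):

```python
def allowed_downtime_minutes(availability: float, days: float = 365.0) -> float:
    """Minutes of downtime permitted over `days` at a given availability (0 to 1)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    total_minutes = days * 24 * 60
    return (1.0 - availability) * total_minutes


if __name__ == "__main__":
    for target in (0.999, 0.9995, 0.9999):
        per_year = allowed_downtime_minutes(target)
        per_month = allowed_downtime_minutes(target, days=30)
        print(f"{target:.2%}: {per_year:.1f} min/year, {per_month:.1f} min per 30 days")
```

At 99.99%, the budget works out to roughly 53 minutes a year, so a single multi-hour regional incident can use up several years' worth of it.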
