When Giants Fall:
Analysis of the Cloudflare and AWS Outages That Stopped the Internet
"The Internet never sleeps" — or so they say. Yet, within a few weeks between October and November 2025, we witnessed something that seemed impossible: huge chunks of that "always on" simply went dark. First Amazon Web Services, then Cloudflare. Two separate incidents, two different causes, but one single, uncomfortable truth: the infrastructure that sustains our digital lives is far more fragile than we want to admit.
We're not talking about minor sites or niche services. We're talking about ChatGPT going silent. About X (formerly Twitter) displaying error pages. About Snapchat, Fortnite, Spotify, banking services, crypto platforms — all falling like dominoes. And while millions of users stared at screens showing the infamous "Error 500," one question kept surfacing: how did we get to this point?
This article isn't a digital war bulletin. It's an autopsy. A meticulous reconstruction of what went wrong, why it went wrong, and most importantly what it tells us about the future of the Internet. Because behind every "Error 500" there's a story of code, architectures, and decisions made years ago whose bill is only now coming due.
1. The Cloudflare Outage: November 18, 2025
1.1 The Brutal Awakening
On November 18, 2025, at 11:20 UTC, Cloudflare — one of the invisible pillars of the Internet — began returning HTTP 500 errors to millions of users worldwide. For those unfamiliar with technical jargon, a 500 error is the elegant way a server says: "I don't know what happened, but something went terribly wrong."
The paradox? Even Downdetector, the website that monitors internet outages, went down. It's like the ambulance getting a flat tire while racing to an emergency.
1.2 The Scope of the Impact
The list of affected services reads like a Who's Who of the modern web:
Category | Impacted Services
Social Media | X (Twitter), Letterboxd
AI & Productivity | ChatGPT/OpenAI, Canva
Gaming | League of Legends, various multiplayer games
Streaming | Spotify
Crypto | Arbiscan, DefiLlama, BitMEX, Toncoin
Finance | PayPal, Uber Eats (payments)
Communication | HubSpot, Zoom (partial)
And let's not forget the supreme irony: Cloudflare's own status page went offline for a period, showing users an error message that read "We can't connect to the server for this app or website at this time."
1.3 Incident Timeline
Let's reconstruct the chronology of the event based on Cloudflare's official post-mortem:
Time (UTC) | Event
11:05 | A permission change is applied to the ClickHouse database
11:20 | First 5xx errors detected on the Cloudflare network
11:28 | Impact reaches customer environments
11:31-11:32 | Automated tests detect the problem, investigation begins
11:35 | War room created for incident management
11:48 | Cloudflare publishes first official update
13:05 | Bypass implemented for Workers KV and Access — impact reduced
13:37 | Team focused on rolling back the Bot Management configuration file
14:24 | Creation and propagation of new configuration files stopped
14:30 | Corrected file distributed globally — main traffic restored
17:06 | All services fully operational
The total outage lasted approximately 6 hours, with peak disruption concentrated in the first 3 hours.
1.4 The Technical Cause: When a File Doubles in Size
And here we are at the heart of the problem. The cause wasn't a DDoS attack. It wasn't a hacker. It wasn't even a misconfiguration in the classic sense. It was something far more subtle and, in some ways, more disturbing.
The Bot Management System
Cloudflare uses a Bot Management system to distinguish human traffic from automated traffic. This system relies on a machine learning model that analyzes every request passing through the network. To function, the model needs a "feature file" — a configuration file containing the characteristics (features) used to make predictions.
This file is updated every few minutes and distributed across Cloudflare's entire global network. Speed is essential: malicious bots change tactics rapidly, and the system must adapt.
The Bug in ClickHouse Database
The problem started with a permission change to the ClickHouse database that generates this feature file. ClickHouse is a distributed analytical database, organized into "shards" (fragments). Queries are executed through distributed tables in the "default" database, which in turn query underlying tables in the "r0" database.
Before the change, users only saw metadata for tables in the "default" database. The change was intended to make access to tables in "r0" explicit — a sensible thing from a security and traceability perspective.
The problem? A query in the feature file generation system looked like this:
SELECT name, type FROM system.columns WHERE table = 'http_requests_features' ORDER BY name;
Notice what's missing? The database filter. The query didn't specify which database to query. Before the change, it only returned columns from the "default" database. After the change, it returned columns from both databases — "default" and "r0" — effectively doubling the rows in the result. The remedy is as simple as the bug: an explicit filter (for example, AND database = 'default') or deduplication of the result set.
The Domino Effect
The feature file, normally with about 60 features, suddenly found itself with over 200 features (all duplicates). But the software reading this file had a hardcoded limit of 200 features — a limit set for performance and memory allocation reasons.
When the corrupted file was propagated to servers, the Rust code processing it did what good Rust code does when it encounters an impossible condition: panic. In non-technical terms, the software threw up its hands and refused to continue.
thread fl2_worker_thread panicked: called Result::unwrap() on an Err value
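To make the mechanism concrete, here is a minimal Python sketch, an illustration of the pattern rather than Cloudflare's actual Rust code (names, limits, and the fallback logic are assumptions). The strict loader treats a violated limit as impossible and crashes, roughly what happened; the defensive one validates the file like untrusted input and keeps the last known-good version, which is the direction Cloudflare's own corrective actions point toward.

# Minimal sketch of a feature-file loader with a hardcoded limit.
# Illustrative only: names, limits and the fallback logic are assumptions,
# not Cloudflare's production (Rust) code.

MAX_FEATURES = 200  # capacity preallocated for performance reasons

def load_features_strict(lines):
    """Mirror of the failing behavior: a violated limit is treated as impossible."""
    features = [line.strip() for line in lines if line.strip()]
    if len(features) > MAX_FEATURES:
        raise RuntimeError("feature count exceeds hardcoded limit")  # the analogue of the panic
    return features

def load_features_defensive(lines, last_known_good):
    """Validate like untrusted input; keep serving with the previous file on failure."""
    features = list(dict.fromkeys(line.strip() for line in lines if line.strip()))  # drop duplicates
    if not features or len(features) > MAX_FEATURES:
        return last_known_good  # reject the new file instead of taking the proxy down
    return features

if __name__ == "__main__":
    good_file = [f"feature_{i}" for i in range(60)]    # the normal case: about 60 features
    corrupted = good_file * 4                          # duplicated rows push the count past the limit
    print(len(load_features_defensive(corrupted, good_file)))  # 60: duplicates collapse, service continues
    try:
        load_features_strict(corrupted)
    except RuntimeError as err:
        print("strict loader gave up:", err)           # the crash path

The point is not the Python; it is that a file generated by your own pipeline deserves the same validation as one uploaded by a stranger.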
The Intermittent Behavior
One thing that made diagnosis particularly difficult: the error was intermittent. The file was regenerated every 5 minutes. But the permission change had been applied gradually to the ClickHouse cluster. So, depending on which cluster node executed the query, the file could be correct or corrupted.
For several minutes, the network oscillated between working and failing, confusing the response teams. Initially, they even suspected an Aisuru-type DDoS attack — a series of attacks that had hit various providers in the preceding weeks.
1.5 Impact on Cloudflare Services
Service | Type of Impact
Core CDN and Security | HTTP 5xx errors for all traffic
Turnstile | Unable to load (Cloudflare's CAPTCHA)
Workers KV | High 5xx errors
Dashboard | Login impossible (Turnstile dependency)
Access | Authentication failed for most users
Email Security | Temporary reduction in anti-spam accuracy
1.6 Response and Corrective Actions
Matthew Prince, CEO and co-founder of Cloudflare, published a public apology, calling it the company's worst outage since 2019. Cloudflare announced several corrective measures:
Configuration file hardening — Treat internally generated files with the same caution as user input
Global kill-switches — Implement emergency switches to quickly disable problematic features (a minimal sketch of this pattern follows the list)
Debug resource limits — Prevent core dumps and error reports from overloading the system
Error condition review — Complete audit of failure modes in all proxy modules
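For the second item in the list above, a global kill-switch, here is a hedged Python sketch (the flag name, default, and scoring stub are invented for illustration): the non-critical scoring path consults a centrally distributed flag and degrades to a neutral result instead of failing the request.

# Sketch of a global kill-switch around a non-critical feature.
# The flag name, default and scoring stub are illustrative assumptions.

KILL_SWITCHES = {"bot_management_scoring": False}  # in practice, pushed from a central control plane

def bot_score(request) -> int:
    """Stand-in for the ML scoring step; the detail doesn't matter for the pattern."""
    return 30 if "bot" in request.get("user_agent", "").lower() else 99

def handle_request(request):
    if KILL_SWITCHES["bot_management_scoring"]:
        score = None  # feature disabled in an emergency: degrade, don't 5xx
    else:
        try:
            score = bot_score(request)
        except Exception:
            score = None  # a scoring failure must not take the data path down
    return {"status": 200, "bot_score": score}

if __name__ == "__main__":
    print(handle_request({"user_agent": "Mozilla/5.0"}))
    KILL_SWITCHES["bot_management_scoring"] = True     # an operator flips the switch
    print(handle_request({"user_agent": "BadBot/1.0"}))

The design choice worth copying is the failure direction: when the optional feature is off or broken, the request still succeeds.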
2. The AWS Outage: October 20, 2025
2.1 A Month Earlier, Same Script (But Different)
Less than a month before the Cloudflare incident, another infrastructure giant had fallen. Amazon Web Services — the world's largest cloud provider with about 30% of the market — had suffered a massive outage in its most critical region: US-EAST-1 (Northern Virginia).
To understand the severity: US-EAST-1 is not a region like the others. It's the first AWS region ever created, the default choice for many services, and is estimated to host 30-40% of all AWS workloads globally. The Northern Virginia corridor is so dense with data centers that approximately 70% of worldwide Internet traffic passes through it.
2.2 The Scope of the Impact
The AWS outage had an even broader echo than Cloudflare's:
Category | Impacted Services
Social & Entertainment | Snapchat, Reddit, Disney+, Hulu, Roblox, Fortnite
Finance & Payments | Coinbase, Venmo, UK banks (Lloyds, Halifax)
AI & Productivity | Perplexity AI, Signal
Smart Home | Amazon Ring, Alexa, Eight Sleep
Transportation | United Airlines app, Delta
Gaming | Pokémon GO, various services
E-commerce | Amazon.com itself, Prime Video
Government | HMRC (UK tax agency)
Downdetector recorded over 17 million reports during the incident.
2.3 Incident Timeline
Time (PDT/UTC) | Event
23:48 PDT (06:48 UTC) | First DynamoDB API errors detected
00:26 PDT (07:26 UTC) | DNS problem identified
01:15 PDT (08:15 UTC) | First temporary mitigations applied
02:25 PDT (09:25 UTC) | DynamoDB DNS restored
02:25 - 13:50 PDT | Cascading effects on EC2, NLB, Lambda
05:30 - 14:09 PDT | Network Load Balancer with health check failures
13:50 PDT | EC2 fully recovered
14:20 PDT | Most services restored
October 21, 04:05 PDT | Last Redshift clusters restored
The incident lasted a total of over 15 hours, with residual impacts extending into the following day.
2.4 The Technical Cause: A Race Condition in DNS
DynamoDB's DNS Architecture
DynamoDB, AWS's NoSQL database, uses an automated system to manage its DNS records. This system consists of two independent components (for redundancy reasons):
DNS Planner: Monitors load balancer status and creates "DNS plans" — instructions on where to direct traffic
DNS Enactor: Applies the plans by updating Amazon Route 53 (AWS's DNS service)
The DNS Enactor runs redundantly in three different Availability Zones to ensure high availability.
The Race Condition
A race condition is a bug that occurs when system behavior depends on the temporal order of events that are not adequately synchronized. It's like two people trying to enter through the same door simultaneously — normally it works, but occasionally they get stuck.
Here's what happened:
The DNS Planner generates a plan (let's call it PLAN_OLD) with updated IPs for DynamoDB
Enactor A starts applying PLAN_OLD but experiences an anomalous delay
Meanwhile, the Planner generates newer plans, up to PLAN_NEW
Enactor B applies PLAN_NEW and initiates cleanup of obsolete plans
At that precise moment, the delayed Enactor A finally finishes applying PLAN_OLD, a now-obsolete plan, overwriting the newer records
The cleanup process, seeing that the plan now in place is far older than the latest one, deletes it
The result? DynamoDB's regional endpoint found itself with an empty DNS record. No IP. Nothing. As if someone had erased the phone number from the directory.
To understand the impact: DNS is the Internet's phone book. When an application wants to connect to DynamoDB, it first asks DNS "what IP address do I find dynamodb.us-east-1.amazonaws.com at?" With an empty record, the answer was: no address found.
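The interleaving is easier to see in code than in prose. The toy Python reconstruction below follows AWS's public description but simplifies aggressively: Route 53 becomes a dictionary, plan versions become integers, and every name is invented. The two ingredients that matter are an enactor that does not re-check plan freshness before writing, and a cleanup step keyed on the newest plan it knows about.

# Toy reconstruction of the DynamoDB DNS race condition described by AWS.
# All names and data structures are simplified assumptions for illustration.

route53 = {}        # hostname -> list of IPs ("the phone book")
applied_plan = {}   # hostname -> (plan_id, version) currently in effect

ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

def enact(plan_id, version, ips):
    """A DNS Enactor applying a plan. The bug: it never checks whether a newer
    plan has already been applied, so a delayed enactor can overwrite it."""
    route53[ENDPOINT] = ips
    applied_plan[ENDPOINT] = (plan_id, version)

def cleanup(latest_version, keep_margin=1):
    """Cleanup removes plans considered stale. If the plan currently in effect
    is much older than the latest one, its records are deleted with it."""
    _plan_id, version = applied_plan[ENDPOINT]
    if version < latest_version - keep_margin:
        del route53[ENDPOINT]       # the active record vanishes
        del applied_plan[ENDPOINT]

# The interleaving AWS describes:
enact("PLAN_NEW", version=5, ips=["10.0.0.2"])   # Enactor B applies the newest plan
enact("PLAN_OLD", version=1, ips=["10.0.0.1"])   # delayed Enactor A finally applies an old plan: overwrite
cleanup(latest_version=5)                        # cleanup sees a "stale" active plan and deletes it

print(route53.get(ENDPOINT))   # None: an empty DNS answer for the regional endpoint

AWS's fix (disabling the automation and adding safeguards before re-enabling it) amounts to adding exactly the freshness and safety checks this sketch leaves out.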
The Cascade Effect
DynamoDB isn't just a database for AWS customers. It's the database on which AWS itself builds its internal services. When DynamoDB became unreachable:
EC2's DropletWorkflow Manager (DWFM) — the system that manages the physical servers hosting EC2 instances — couldn't renew its "leases" with DynamoDB. New EC2 instances couldn't be launched.
Network Load Balancer (NLB) — dependent on EC2 — started seeing health check failures. Its automatic failover systems began removing "unhealthy" nodes that were actually perfectly functional.
Lambda, ECS, EKS, Fargate — all dependent on EC2 — couldn't start new functions or containers.
Even the AWS Support Center went haywire because one of its subsystems was providing incorrect responses about accounts, preventing legitimate users from accessing it.
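Seen from a distance, the cascade is a graph problem: when a node fails, everything that transitively depends on it degrades. The Python sketch below encodes a simplified version of the dependencies named above (the edges are illustrative, not AWS's real internal service map) and computes the blast radius of a DynamoDB failure.

from collections import deque

# Simplified dependency graph inspired by the cascade described above.
# The edges are illustrative assumptions, not AWS's actual service map.
DEPENDS_ON = {
    "DWFM": ["DynamoDB"],
    "EC2 launches": ["DWFM"],
    "Network Load Balancer": ["EC2 launches"],
    "Lambda": ["EC2 launches"],
    "ECS/EKS/Fargate": ["EC2 launches"],
    "Customer applications": ["DynamoDB", "Lambda", "Network Load Balancer"],
}

def blast_radius(failed_service):
    """Return every service that transitively depends on the failed one."""
    impacted = set()
    queue = deque([failed_service])
    while queue:
        current = queue.popleft()
        for service, dependencies in DEPENDS_ON.items():
            if current in dependencies and service not in impacted:
                impacted.add(service)
                queue.append(service)
    return impacted

if __name__ == "__main__":
    print(sorted(blast_radius("DynamoDB")))
    # ['Customer applications', 'DWFM', 'EC2 launches', 'ECS/EKS/Fargate',
    #  'Lambda', 'Network Load Balancer']

Running the same computation on your own architecture is a sobering exercise; it is usually the shared, unglamorous components that sit at the bottom of the graph.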
The Recovery Problem
But the worst part? Restoring DNS didn't immediately fix everything.
When DynamoDB came back online at 02:25, DWFM tried to re-establish leases for the entire EC2 fleet all at once. The result was what AWS called "congestive collapse" — such an overload that retries kept accumulating faster than the system could process them.
Engineers had to intervene manually, applying throttling (traffic limiting) and restarting parts of DWFM. EC2 became fully operational only at 13:50 — over 11 hours after DynamoDB was restored.
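The underlying pattern is general: any fleet whose clients retry aggressively after an outage risks drowning the very system that is trying to recover. A standard mitigation, sketched below in generic Python (this is not AWS's internal code), is capped exponential backoff with jitter, so that a million expired leases do not all come back in the same second.

import random
import time

# Generic retry helper with capped exponential backoff and full jitter.
# Not AWS's code: a standard pattern against the congestive collapse
# described above, where synchronized retries overwhelm a recovering system.

def retry_with_backoff(operation, max_attempts=8, base=0.5, cap=30.0):
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the exponential ceiling,
            # spreading clients out instead of letting them retry in lockstep.
            time.sleep(random.uniform(0, min(cap, base * (2 ** attempt))))

if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_lease_renewal():
        calls["count"] += 1
        if calls["count"] < 4:
            raise ConnectionError("endpoint still recovering")
        return "lease renewed"

    print(retry_with_backoff(flaky_lease_renewal))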
2.5 AWS's Corrective Actions
AWS announced several measures:
Global disabling of automated DNS Planner and DNS Enactor
Race condition fix before re-enabling automation
Velocity control for NLB — limits on capacity that a single NLB can remove during failover
New test suite for DWFM — specific testing of recovery workflows
Improved EC2 throttling — rate limiting based on queue size
3. The Centralization Paradox: The Internet Is Less Decentralized Than You Think
3.1 The Original Promise
When the Internet was born in the 1960s and '70s, its architecture was inherently decentralized. The TCP/IP protocol was designed to survive a nuclear attack — if one node was destroyed, packets would automatically find another route. No central point of control. No single point of failure.
This promise of resilience was one of the philosophical and technical pillars on which we built our trust in the Internet.
3.2 The Reality of 2025
Let's look at the numbers:
Provider | Cloud Market Share (2025)
AWS | ~30%
Microsoft Azure | ~24%
Google Cloud | ~11%
Others | ~35% (fragmented)
Three companies control nearly two-thirds of global cloud infrastructure. But the concentration doesn't end there:
Cloudflare manages proxy and CDN for millions of websites
Akamai serves a significant portion of internet content
AWS's US-EAST-1 region alone hosts 30-40% of global AWS workloads
Northern Virginia sees approximately 70% of global Internet traffic pass through it
3.3 The House Analogy
Imagine the Internet as a residential neighborhood. The original promise was: every house has its own power supply, its own plumbing, its own systems. If one house has a problem, the others continue functioning.
The reality? Almost all houses in the neighborhood are connected to the same electrical transformer (AWS US-EAST-1), the same central water system (few DNS providers), and share the same security guards at the entrance (Cloudflare, Akamai).
When the transformer blows, it's not one house that goes dark. It's the entire neighborhood.
3.4 Why Did This Happen?
Centralization didn't happen by chance or bad intent. It's the result of precise economic and technical forces:
Economies of Scale
AWS, Azure, and Google can offer services at prices that an internal corporate datacenter cannot match. They have purchasing power for hardware, energy, and can distribute R&D costs across millions of customers.
Operational Complexity
Managing infrastructure is difficult. It requires specialized skills, 24/7 operational teams, and continuous investment. For most companies, it makes more sense to delegate this task to those who do it professionally.
Network Effects
The more customers use a platform, the more services are built on it, the more convenient it becomes to use. AWS has over 200 services that integrate natively with each other. Building the same ecosystem elsewhere is nearly impossible.
The Geographic Redundancy Myth
Many companies believe they're protected because they use "multiple Availability Zones" within a region. But as the AWS outage demonstrates, AZs protect against single datacenter failures, not against regional or control plane failures.
It's like having different rooms in the same house. If one room floods, you move to another. But if the whole house floods, every room is underwater.
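At the application level, the minimal step beyond multi-AZ is a client that can fail over between regions instead of assuming one region always exists. The Python sketch below is deliberately generic: the endpoints are placeholders, and a real setup would also need health checks, data replication, and careful timeout tuning.

import urllib.error
import urllib.request

# Generic sketch of client-side region failover.
# The endpoints are placeholders, not real services; replication of the data
# layer, health checking and timeout policy are deliberately out of scope.

REGIONAL_ENDPOINTS = [
    "https://api.us-east-1.example.com/health",
    "https://api.eu-west-1.example.com/health",
]

def healthy_region(timeout=2.0):
    last_error = None
    for endpoint in REGIONAL_ENDPOINTS:
        try:
            with urllib.request.urlopen(endpoint, timeout=timeout) as response:
                if response.status == 200:
                    return endpoint   # first healthy region wins
        except (urllib.error.URLError, OSError) as exc:
            last_error = exc          # region unreachable or unhealthy: try the next one
    raise RuntimeError(f"all regions failed, last error: {last_error}")

if __name__ == "__main__":
    try:
        print("serving from:", healthy_region())
    except RuntimeError as err:
        print(err)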
3.5 The Cost of Convenience
Centralization has brought undeniable benefits:
Reduced costs
Ease of deployment
Access to advanced technologies
Immediate scalability
But the hidden price is systemic fragility. When Fortune 500 companies, startups, banks, governments, and consumer apps all depend on the same few providers, a single bug can have consequences worth hundreds of billions of dollars.
Lloyd's of London estimated back in 2018 that a 3-6 day outage of a top-3 cloud provider could cause damages between $6.9 and $14.7 billion. And that was before cloud adoption accelerated further during the pandemic.
3.6 Is Multi-Cloud the Answer?
The theoretical answer to centralization is multi-cloud: distributing workloads across multiple providers (AWS, Azure, GCP) so that if one falls, the others hold.
Reality is more complicated:
Technical complexity: Each provider has its own APIs, its own services, its own peculiarities. Building truly portable applications requires enormous effort.
Costs: Managing infrastructure across multiple clouds means duplicating skills, tools, and often capacity.
Hidden dependencies: Even if your app runs on Azure, it might depend on a library that calls a service on AWS. Or your payment provider uses Cloudflare. Dependencies are often invisible until they break (the sketch after this list shows a first way to surface them).
Internet as a dependency: Even with perfect multi-cloud, you remain dependent on global DNS, BGP routing, CDN. As the Cloud Security Alliance notes: "The Internet itself is a single point of failure."
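Surfacing those hidden dependencies does not require sophisticated tooling to get started. The Python sketch below is a crude first pass (the hostnames are placeholders; a real audit would cover every API, CDN, auth, payment, and telemetry endpoint you and your vendors call): it resolves each dependency and prints its canonical name and addresses, which is often enough to reveal that several "independent" vendors sit behind the same provider.

import socket

# Crude dependency-mapping sketch: resolve the external hostnames your stack
# talks to and inspect canonical names and IPs for signs of shared providers.
# The hostnames below are placeholders for your own dependency list.

DEPENDENCIES = [
    "api.example.com",
    "cdn.example.com",
    "auth.example.com",
]

def describe(hostname):
    try:
        canonical, _aliases, addresses = socket.gethostbyname_ex(hostname)
    except socket.gaierror as exc:
        return f"{hostname}: resolution failed ({exc})"
    # A canonical name belonging to a CDN or cloud provider's domain is a
    # strong hint about who really serves this host.
    return f"{hostname} -> {canonical} ({', '.join(addresses)})"

if __name__ == "__main__":
    for host in DEPENDENCIES:
        print(describe(host))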
3.7 Toward a New Architecture?
Experts suggest some directions:
Strategy | Pros | Cons
Multi-region (same provider) | Relatively simple, good cost/benefit ratio | Doesn't protect against control plane outages
Active multi-cloud | Maximum theoretical resilience | High complexity and costs
Hybrid cloud (cloud + on-premise) | Local control for critical workloads | Requires internal expertise and investment
Distributed edge computing | Reduces dependence on central datacenters | Still immature for many use cases
The pragmatic recommendation for most organizations:
Start with multi-region within your primary provider
Identify truly critical workloads and evaluate backup on a second provider
Map dependencies — not just yours, but your vendors'
Implement chaos engineering — regularly test failure modes before they surprise you in production
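Chaos engineering can start far smaller than unplugging a datacenter. The Python sketch below (generic, not tied to any provider or framework) wraps a dependency call with probabilistic failure injection, so that timeouts and error paths get exercised in staging before an outage exercises them in production.

import random

# Minimal failure-injection wrapper: a starting point for chaos experiments,
# not a framework. Inject errors around a dependency call in a controlled
# environment and verify that the caller degrades the way you expect.

def chaotic(call, failure_rate=0.3, injected_exception=TimeoutError):
    def wrapped(*args, **kwargs):
        if random.random() < failure_rate:
            raise injected_exception("injected failure (chaos experiment)")
        return call(*args, **kwargs)
    return wrapped

def fetch_recommendations(user_id):
    return [f"item-{user_id}-{i}" for i in range(3)]   # stand-in for a remote call

if __name__ == "__main__":
    flaky_fetch = chaotic(fetch_recommendations, failure_rate=0.5)
    for _ in range(5):
        try:
            print(flaky_fetch(42))
        except TimeoutError as err:
            print("degraded gracefully:", err)         # the behavior under test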
4. Lessons Learned
4.1 For Engineers
Dormant Code Can Wake Up
Both incidents involve latent bugs — code defects that existed for a long time but only activated when specific conditions aligned.
In Cloudflare's case, the 200-feature limit was probably adequate when written. In AWS's case, the race condition might never have manifested until delays reached a certain threshold.
Takeaway: Code reviews aren't enough. Deep testing of failure modes, fuzzing, and chaos engineering are essential.
Complex Systems Fail in Complex Ways
Neither outage was caused by a single "stupid" mistake. Both were the result of unexpected interactions between multiple components:
Cloudflare: DB permission change → query returns duplicates → file exceeds limit → panic
AWS: race condition → empty DNS → DynamoDB down → cascade to EC2 → cascade to everything else
Takeaway: Distributed architecture introduces emergent complexity. Mental models that embrace failure as inevitable are needed.
Recovery Can Be Worse Than the Outage
AWS's experience with DWFM shows that restoring a service can trigger problems worse than the original failure. When millions of expired leases try to renew simultaneously, the system that was supposed to recover finds itself on its knees.
Takeaway: Test not only failovers but also recovery workflows. Implement rate limiting and circuit breakers even in restoration paths.
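A circuit breaker is the complementary pattern on the calling side: after repeated failures it stops hammering the dependency and fails fast for a cooldown period, giving the other system room to recover. Here is a minimal Python sketch of the idea (a generic illustration, not a library recommendation):

import time

# Minimal circuit breaker: after `threshold` consecutive failures the breaker
# opens and calls fail fast for `cooldown` seconds, instead of piling retries
# onto a dependency that is trying to recover.

class CircuitBreaker:
    def __init__(self, threshold=3, cooldown=10.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None        # cooldown elapsed: allow a trial call
            self.failures = 0
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.failures = 0                # success resets the failure count
        return result

if __name__ == "__main__":
    breaker = CircuitBreaker(threshold=2, cooldown=5.0)

    def broken_dependency():
        raise ConnectionError("dependency down")

    for _ in range(4):
        try:
            breaker.call(broken_dependency)
        except Exception as err:
            print(type(err).__name__, "-", err)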
4.2 For Decision Makers
Convenience Has a Hidden Price
Using a single provider, a single region, completely relying on one CDN is convenient and cost-effective. But the cost of "when it breaks" can exceed accumulated savings by orders of magnitude.
Takeaway: In every project budget, include a line item for resilience. It's not a nice-to-have.
Insurance Isn't Enough
AWS offers a 30% credit to impacted customers who fall within SLA terms. It's minimal consolation for a company that lost a day of business.
Takeaway: Cloud SLAs guarantee refunds, not operational continuity. The responsibility to stay online is yours.
4.3 For All of Us
Internet Is Not a Given Right
We treat connectivity like electricity or running water — something that's "always there." But unlike traditional utilities, the Internet has no regulator mandating uptime standards, redundancy, or transparency.
Takeaway: As a society, we should start discussing whether major cloud providers and infrastructure services should be treated as critical infrastructure, with corresponding responsibilities.
Concentration Is a Systemic Risk
When a bug in a database can bring down ChatGPT, British banks, and online games simultaneously, we have a problem that goes beyond any single company or incident.
Takeaway: Internet infrastructure diversification isn't just a technical matter. It's a question of social and economic resilience.
5. Conclusions: Building on Shifting Sands
Let's return to the initial question: how did we get to this point?
The answer is simple and complex at the same time. We got here one step at a time, each step perfectly rational. Using the cloud is convenient. Relying on a CDN is practical. Standardizing on one provider reduces operational complexity.
But the sum of individually sensible decisions has produced a globally fragile system. An infrastructure where the Internet's original promise — resilience through distribution — has been silently eroded in the name of efficiency.
The Cloudflare and AWS outages weren't isolated incidents. They're symptoms of an architecture that privileges convenience over robustness, centralization over diversity, cost optimization over resilience.
This doesn't mean we should abandon the cloud or return to the corporate datacenters of the 1990s. It means we should design our systems with the awareness that everything can fail — and probably will at the least opportune moment.
It means investing in redundancy before it becomes necessary. Mapping dependencies before they become evident (in the worst possible way). Testing failure modes before they manifest in production.
And perhaps, at the policy and regulatory level, starting to treat digital infrastructure with the same seriousness with which we treat bridges, highways, and electrical grids.
Because the next time giants fall — and there will be a next time — the impact will depend on how wise we were in preparing.
Are you ready to verify your infrastructure's resilience? Start by mapping the dependencies of your critical services. You might discover you depend on more "giants" than you thought. And the next time one of them falls, at least you won't be caught off guard.
Bibliography:
Official Post-Mortems
Cloudflare: official post-mortem of the November 18, 2025 outage
AWS: post-event summary of the October 20, 2025 US-EAST-1 service disruption
Technical Analysis
ThousandEyes: AWS Outage Analysis October 2025
InfoQ: Race Condition in DynamoDB DNS System
The Register: AWS Outage Post-Mortem Analysis
Resources on Internet Centralization
Cloud Security Alliance: Internet as a Single Point of Failure
Dark Reading: Why Cloud Service Providers Are a Single Point of Failure