
Special Breaking Analysis: The Hidden Fault Domain in AWS — Understanding Control Planes and Availability Zones

The AWS outage in the early morning ET hours of October 20–21, 2025 exposed a systemic vulnerability whereby even pristine multi-AZ designs don’t protect you from a shared control-plane and DNS path failure. Our understanding is that the incident stems from an architectural reliance on US-EAST-1 for identity, service discovery, and API orchestration. Specifically, while availability zones do exactly what they were designed to do – i.e. absorb intra-region physical failures – they don’t defend against logical, cross-AZ control-plane dependencies. Enterprises should not equate multi-AZ with business resilience, and must reframe their thinking around fault-domain isolation across regions (and, where justified, clouds), with clear strategies for DNS, identity, and service-discovery dependencies.

In this Special Breaking Analysis we review what happened and put forth architectural best practices to avoid such outages. We also introduce a case study in which approximately 300 Snowflake customers avoided the outage entirely by leveraging Snowgrid, a cross-cloud technology layer within Snowflake’s platform that improves resilience through a cross-region and cross-cloud (supercloud) topology.

What happened  

A DNS subsystem malfunction in US-EAST-1 broke endpoint resolution for core services such as DynamoDB, Identity and Access Management (IAM), and routing APIs. Because AWS’s global control plane is serviced out of US-EAST-1, dependencies across multiple regions couldn’t authenticate, discover, or call required APIs – even though compute and storage resources in other AZs and regions were healthy. As a result, downstream services outside US-EAST-1 experienced cascading errors due to the interdependency of shared control-plane and DNS services.
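To make the failure mode concrete, the minimal Python sketch below shows why a DNS-resolution fault stops API traffic before it ever reaches otherwise healthy infrastructure: the lookup fails at the client, so no request is issued at all. The specific hostnames are illustrative on our part, not a statement of exactly which endpoints were implicated.

```python
import socket

# Illustrative endpoints only -- which exact hostnames were implicated in the
# incident is an assumption for the purposes of this sketch.
ENDPOINTS = [
    "dynamodb.us-east-1.amazonaws.com",  # regional DynamoDB API endpoint
    "sts.amazonaws.com",                 # identity/token issuance
]

def can_resolve(hostname: str) -> bool:
    """A failed DNS lookup here blocks every downstream API call,
    regardless of how healthy the compute behind the endpoint is."""
    try:
        socket.getaddrinfo(hostname, 443)
        return True
    except socket.gaierror:
        return False

for host in ENDPOINTS:
    state = "resolves" if can_resolve(host) else "DOES NOT RESOLVE -- calls fail before reaching any AZ"
    print(f"{host}: {state}")
```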

Key Concepts

  • AZs do not equate to immunity from platform control-plane failure. Reports indicate the outage was triggered by a DNS resolution failure tied to the DynamoDB API endpoint and other global control-plane dependencies in US-EAST-1, affecting healthy infrastructure elsewhere.
  • Interdependencies exist beyond hardware. Global services (e.g., IAM, DNS, DynamoDB control) can cause failure across regions when they centralize state or routing through a single point of control.
  • Full resilience comes from region-level (or cloud-level) isolation. Multi-region architectures – ideally active/active with control-plane redundancy – may be required for business-level resilience during platform faults. However, users should understand this brings additional cost and complexity that must be weighed against the probability of such outages and their severity on the business. Integrated capabilities such as Snowflake’s Snowgrid can minimize complexity at reasonable cost.
  • True resilience is layered. This incident shows that DNS, identity, service discovery, secrets, and tenancy boundaries are first-class services, not just plumbing.

Why Availability Zones didn’t save you

Availability Zones are designed to isolate power, network, and hardware failures. This event was a logical, service-level control-plane fault that transcended AZ boundaries. Customers often treat a well-architected, multi-AZ deployment as a synonym for business resilience. The reality is that AZs share regional control systems, DNS, and global identity services. When those fail, all AZs can be negatively affected simultaneously.

The architectural takeaway

We believe the clear lesson is to treat control planes as a separate fault domain and design for independent recovery when operator control functions are disrupted. This implies:

  • Isolate by region (and sometimes by cloud) for control-plane-independent operations;
  • Decouple DNS/identity/service-discovery from a single provider path where business-critical;
  • Engineer data-plane-only modes that continue serving known traffic until shared control paths recover (see the sketch following this list).
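As a concrete illustration of the last point, here is a minimal Python sketch of a data-plane-only mode: the client resolves endpoints through normal DNS/service discovery when it can, persists the last-known-good answer locally, and keeps serving known traffic from that cache when the control-plane path fails. The file path, TTL, and naming are assumptions for illustration, not a reference implementation.

```python
import json
import socket
import time
from pathlib import Path

# All names and thresholds below are illustrative assumptions.
CACHE_PATH = Path("endpoint_cache.json")
CACHE_TTL_WARN_SECS = 3600


def discover_endpoint(service_host: str) -> str:
    """Resolve via live DNS/service discovery and refresh the local cache."""
    socket.getaddrinfo(service_host, 443)  # raises socket.gaierror on DNS failure
    cache = json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}
    cache[service_host] = {"endpoint": service_host, "refreshed_at": time.time()}
    CACHE_PATH.write_text(json.dumps(cache))
    return service_host


def resolve_with_fallback(service_host: str) -> str:
    """Data-plane-only mode: if the control-plane path (DNS/discovery) fails,
    keep serving known traffic from the last-known-good cached endpoint."""
    try:
        return discover_endpoint(service_host)
    except socket.gaierror:
        cache = json.loads(CACHE_PATH.read_text())
        entry = cache[service_host]
        age = time.time() - entry["refreshed_at"]
        if age > CACHE_TTL_WARN_SECS:
            print(f"WARNING: cached endpoint for {service_host} is {age:.0f}s old; "
                  "consider deferring high-risk operations until recovery")
        return entry["endpoint"]
```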

How to eliminate (or dramatically reduce) risk from this class of outage

A viable strategy is to make control planes a redundant service domain and engineer end-to-end independence across DNS, identity, service discovery, data, and operations so workloads degrade gracefully rather than fail. Alternatively, an integrated platform service such as Snowflake’s Snowgrid (see below) can mitigate such disruptions.

Broadly, this may mean building multi-region active/active capability with pre-provisioned resources and region-local control proxies. This can be expensive, so make sure the business case justifies the approach. For example, a tier-1 API might run active/active in two regions with redundant DNS services and alternate client discovery, so it continues serving reads and defers writes asynchronously during a DNS/IAM incident, as sketched below. Customers may also want to consider delaying or deferring high-risk operations until primary control planes recover.
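The Python sketch below illustrates that pattern. The regional endpoints, health path, and queue-and-replay approach are assumptions for illustration: reads are served from whichever region passes a cheap data-plane health probe, while writes are deferred to a local queue when the primary is impaired, to be replayed after recovery.

```python
import queue
import requests  # third-party: pip install requests

# Hypothetical regional endpoints for a tier-1 API -- adjust to your deployment.
REGIONS = {
    "primary":   "https://api.us-east-1.example.com",
    "secondary": "https://api.us-west-2.example.com",
}
deferred_writes: "queue.Queue[dict]" = queue.Queue()


def healthy(base_url: str) -> bool:
    """Cheap data-plane health probe that avoids control-plane round trips."""
    try:
        return requests.get(f"{base_url}/healthz", timeout=2).ok
    except requests.RequestException:
        return False


def read(path: str) -> requests.Response:
    """Serve reads from whichever region answers; order is primary-first."""
    for base in (REGIONS["primary"], REGIONS["secondary"]):
        if healthy(base):
            return requests.get(f"{base}{path}", timeout=5)
    raise RuntimeError("no healthy region for reads")


def write(path: str, payload: dict) -> None:
    """During a DNS/IAM incident, defer writes instead of failing them."""
    base = REGIONS["primary"]
    if healthy(base):
        requests.post(f"{base}{path}", json=payload, timeout=5).raise_for_status()
    else:
        deferred_writes.put({"path": path, "payload": payload})  # replay after recovery
```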

What this means for AWS and customers – Shared responsibility in full view

This incident underscores that AWS must reduce control-plane concentration risk and diversify global service dependencies to limit the impact of such outages. That said, operator actions are not a substitute for customer resilience. The enterprise mandate is to treat provider control planes as an at-risk dependency and engineer autonomous regional operations with provable and tested failover/failback for DNS, identity, and discovery.

Case Study: How Snowflake Snowgrid and “Supercloud” Architecture Mitigated the Outage

Monday’s outage had another side to the story. While many teams in industries such as financial services faced a pre-market frenzy, scrambling to reach AWS amid the DNS/control-plane chaos, approximately 300 Snowflake customers running Snowgrid largely treated the event as a non-event. In our view, the difference was architectural and procedural – i.e. cross-region (and cross-cloud) business continuity was predetermined, with human-in-the-loop failsafes, and executed in around a minute in most cases.

Notably, Snowflake reports 300+ customers failed over that morning, went about their day on the secondary platform, and later failed back cleanly. Existing pipelines, dashboards, and engines continued with only a brief blip – exactly the behavior enterprises want as AI workloads assume more mission-critical business processes.

Snowgrid works because its three primitives operate in lockstep and abstract away underlying AWS dependencies:

  • Transactionally consistent replication keeps a warm, point-in-time-accurate copy in another region or cloud. It efficiently tracks what changed in the primary (not naïve CDC), so secondaries are inexpensive to maintain and can meet ~minute-level RPOs (often 1–15+ minutes).
  • Managed, explicit failover lets humans decide whether incident severity warrants action and cut over with one command. This avoids oscillation during partial recoveries and enables symmetrical failback to the original state once stable.
  • Client redirect handles the last mile.

As an example, BI tools and applications follow the new endpoint automatically, so users experience a ~1-minute interruption and then resume normal operations. Because Snowgrid spans regions and clouds, it bypasses the control-plane/DNS blast radius that AZ redundancy can’t address. This is an example of what a “Supercloud” architecture can enable.
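For readers who want to see what the one-command cutover looks like in practice, here is a minimal sketch using the snowflake-connector-python driver. It assumes a failover group (here called prod_fg) and a client-redirect connection (prod_conn) were configured ahead of time on both accounts; the account and user values are placeholders, and the exact statements should be checked against Snowflake’s documentation for your edition.

```python
import snowflake.connector

# Connect directly to the secondary account (placeholder identifiers).
con = snowflake.connector.connect(
    account="myorg-secondary_account",
    user="dr_admin",
    authenticator="externalbrowser",
)
cur = con.cursor()

# 1) Promote the secondary replica so it serves reads and writes.
cur.execute("ALTER FAILOVER GROUP prod_fg PRIMARY;")

# 2) Repoint client redirect so BI tools and apps following the connection
#    URL land on the newly promoted account automatically.
cur.execute("ALTER CONNECTION prod_conn PRIMARY;")

con.close()
```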

Economically, Snowgrid is consumption-based. Continuous replication consumes credits (with a replication multiplier) and, when crossing regions or clouds, incurs egress charges; failover/failback uses compute and storage; and cross-cloud data sharing stores one copy on the provider’s side while consumers pay for their own query compute.
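To reason about those economics, a back-of-the-envelope model like the one below can help. Every rate in it is a placeholder input, not Snowflake pricing; the point is simply that replication volume, credit price, egress, and drill compute are the levers to tune.

```python
# Back-of-the-envelope cost model for a Snowgrid-style posture. All rates are
# placeholder inputs -- substitute your contracted credit price, replication
# multiplier, and cloud egress rates; nothing here reflects actual pricing.
def monthly_bc_cost(
    replicated_tb_per_month: float,
    credit_price_usd: float,            # contracted $/credit
    replication_credits_per_tb: float,  # effective credits consumed per TB replicated
    egress_usd_per_tb: float,           # cross-region / cross-cloud egress rate
    failover_drill_credits: float,      # compute for periodic failover/failback tests
) -> float:
    replication = replicated_tb_per_month * replication_credits_per_tb * credit_price_usd
    egress = replicated_tb_per_month * egress_usd_per_tb
    drills = failover_drill_credits * credit_price_usd
    return replication + egress + drills


# Example: 5 TB/month of changed data, with entirely hypothetical rates.
print(f"${monthly_bc_cost(5, 3.0, 10, 20, 50):,.2f} per month")
```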

In our opinion, customers should tune RPO/RTO by tier. In other words, pay for tight RPO windows where revenue or regulatory deadlines demand it, and relax them elsewhere, as illustrated in the sketch below. Snowgrid customer adoption runs ~20–25% overall, according to Snowflake, and is likely higher in regulated sectors. Notably, Snowgrid originated in 2017 and was hardened in 2018 under pressure from regulated customers, which explains the maturity on display Monday.
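One way to operationalize that tiering is to encode it as explicit policy that drives replication cadence. In the sketch below, tier names, RPO targets, and cadences are assumptions for illustration; the schedule strings follow the interval/cron style Snowflake uses for replication schedules.

```python
# Illustrative tiering policy: map workload tiers to replication cadence so
# tight RPOs are funded only where the business case demands them.
TIER_POLICY = {
    "tier1_revenue_critical":   {"rpo_target_min": 1,   "schedule": "1 MINUTE"},
    "tier2_regulatory":         {"rpo_target_min": 15,  "schedule": "15 MINUTE"},
    "tier3_internal_analytics": {"rpo_target_min": 240, "schedule": "USING CRON 0 */4 * * * UTC"},
}


def schedule_for(workload_tier: str) -> str:
    """Return the replication cadence a given workload tier should fund."""
    return TIER_POLICY[workload_tier]["schedule"]


print(schedule_for("tier1_revenue_critical"))  # -> "1 MINUTE"
```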

Snowflake emphasizes the competitive gap vis-à-vis Databricks. It claims that, by comparison, Databricks lacks equivalent native replication and managed failover primitives, and that recovery typically involves standing up a parallel environment with scripts and manual file movement, with no automatic, transactional failover. As AI expands its footprint in enterprise workflows, we believe this issue will remain a focal point. In other words, multi-AZ is table stakes; multi-region control-plane independence is the new architectural design point.

Key Takeaways from Snowflake Customer Experiences Monday

  • 300+ Snowflake customers failed over within minutes, operated normally, and failed back later;
  • Behavior was largely transparent to pipelines, dashboards, and apps due to client redirect;
  • Minute-level, transactionally consistent replicas enabled safe, fast cutovers;
  • Costs are predictable and tunable (replication credits + egress; failover compute/storage);
  • Higher adoption and preparedness in FS/regulatory environments drove smoother outcomes.

In our view, Monday’s performance is a relevant proof point in this era of heightened business resilience. Snowgrid showed itself to be more than marketing fluff and a sensible ingredient in BC/DR strategies. Its transactionally consistent replicas, human-gated failover, and client redirect combined to turn a platform-level incident into routine operations for hundreds of customers. The economics are adjustable and should be evaluated based on the probability of an incident, the degree of impact, and the associated expected loss. As AI takes on more core workflows, we believe Snowgrid’s region/cloud abstraction will become a reference model for continuity at scale.

Stepping back, the week underscored that, like the early confusion around the shared responsibility model, multi-AZ is good hygiene but only a piece of the business resilience story. The more complete picture is a multi-region (and sometimes multi-cloud) layer where DNS, identity, and service discovery are engineered for independence. Our belief is that the enterprises that pre-decide their BC posture, rehearse cutovers, and fund minute-level RPOs where revenue warrants will survive future operator faults with much improved customer experiences. In our opinion, the new bar is to treat provider control planes as at-risk dependencies, design more resilient data-plane models, and use region-level independence to protect customers, cash flow, and reputation.

In this age of AI, the risks are simply too high to ignore this mandate.
