Core dna Chronicles: How Core dna Stayed Online During the $650M AWS Outage
The real cost of the outage wasn't just the $650 million in estimated losses; it was the trust that evaporated in minutes as the biggest Amazon Web Services (AWS) outage of the decade took much of the internet down with it.
One of AWS’s most repeated best practices for high availability has always been to distribute workloads across multiple availability zones within the same region. Last week proved how fragile that assumption really is.
The 14-hour failure in AWS's US-EAST-1 region crippled thousands of websites and services worldwide: Shopify stores froze mid-transaction, and Snapchat, Fortnite, and Reddit all ground to a halt.
So what happened? What began as a DNS configuration error in DynamoDB, one of AWS's core database services, quickly spiraled into one of the most significant cloud outages in history, affecting more than 150 interconnected AWS services.
For most platforms, recovery took more than half a day. Core dna clients were back online in under 30 minutes thanks to a platform architecture designed and tested for moments exactly like this.
Key Takeaways
- Don't accept "multiple availability zones" as an answer; require your platform to run production environments across different geographic regions.
- Ask your provider how long recovery takes during AWS outages: 30 minutes vs. 14 hours is the difference between minimal and catastrophic revenue loss.
- If your platform hasn't actually executed its disaster recovery procedures, you're at risk when real outages hit.
- Choose providers that detect issues before cloud providers acknowledge them; early detection means faster response.
- Your checkout, payments, and login must stay operational even when background systems go down; confirm your platform separates these concerns.
The Cascading Failure That Broke the Internet
The AWS outage revealed a fundamental vulnerability in modern cloud infrastructure: internal dependencies. When DynamoDB's DNS configuration failed, it didn't just take down one service; it created a domino effect across AWS's ecosystem.
Simple Queue Service (SQS), Lambda functions, EC2 instance launches, and eventually over 150 additional services joined the casualty list. The problem?
Most companies had followed AWS's own best practices, architecting their systems across multiple availability zones within the same region. They believed this provided redundancy. The outage proved otherwise—when core regional services fail, availability zones become irrelevant.
How Core dna Detected and Responded in Real-Time
At Core dna, the story unfolded differently. Here's the timeline from our CTO, Dmitry:
3:55 AM ET: One of our clients had a major event running.
Early Morning: The client called to report website issues. The team quickly realized this wasn't an isolated incident; it was systemic. AWS's status page initially reported only DynamoDB issues, but Core dna's logs revealed the full scope: SQS endpoints were completely down, returning internal server errors.
Decision Point: Unlike companies forced to wait for AWS to resolve the issue, Core dna had options. Our disaster recovery plan includes multiple production environments across different regions with continuous cross-regional backups.
The Critical Advantage: Because SQS is not latency-sensitive the way database operations are, we could switch our SQS endpoints from US-EAST-1 to Canada Central without moving the rest of the infrastructure. This surgical approach restored full functionality within 30 minutes.
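To make the pattern concrete, here is a minimal sketch (Python with boto3) of that kind of surgical endpoint switch. The queue name and the simple fallback loop are illustrative assumptions, not our production code; the only real requirement is that the queue exists in both regions.

```python
import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

# Region preference: primary first, then the latency-tolerant fallback.
SQS_REGIONS = ["us-east-1", "ca-central-1"]

# Short timeouts so a dead regional endpoint fails fast instead of hanging.
FAST_FAIL = Config(connect_timeout=2, read_timeout=3, retries={"max_attempts": 1})


def publish_event(queue_name: str, body: str) -> str:
    """Send a message to the first healthy region and return the region used."""
    last_error = None
    for region in SQS_REGIONS:
        sqs = boto3.client("sqs", region_name=region, config=FAST_FAIL)
        try:
            queue_url = sqs.get_queue_url(QueueName=queue_name)["QueueUrl"]
            sqs.send_message(QueueUrl=queue_url, MessageBody=body)
            return region
        except (BotoCoreError, ClientError) as exc:
            last_error = exc  # this region is down or unreachable; try the next one
    raise RuntimeError("All SQS regions failed") from last_error


# Usage (queue name is hypothetical):
# publish_event("orchestration-events", '{"type": "order.created"}')
```

The point is that only queue traffic moves; database connections stay pinned to their home region.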
The Architecture That Made the Difference
Core dna's resilience during this outage wasn't accidental; it reflects years of architectural decisions that prioritized business continuity over convenience:
1. Multi-Region Production Environments
We maintain fully operational production environments in multiple AWS regions, not just availability zones. This means:
- Real infrastructure running 24/7, not cold backups waiting to spin up
- Continuous cross-regional data replication (a rough sketch follows this list)
- Ability to shift specific services without complete migration
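As a rough illustration of the replication idea, here is a sketch that mirrors backup objects from a US bucket into a Canadian one. The bucket names, and the choice of S3 itself, are assumptions for the example; in practice a managed feature such as S3 Cross-Region Replication usually does this job, so treat this as the concept rather than our pipeline.

```python
import boto3

# Hypothetical backup buckets, one per region.
SOURCE_BUCKET = "backups-us-east-1"     # primary region
DEST_BUCKET = "backups-ca-central-1"    # cross-region copy

src = boto3.client("s3", region_name="us-east-1")
dst = boto3.client("s3", region_name="ca-central-1")


def mirror_new_backups(prefix: str = "daily/") -> int:
    """Copy backup objects missing from the Canadian bucket; returns the count copied."""
    already_mirrored = {
        obj["Key"]
        for page in dst.get_paginator("list_objects_v2").paginate(Bucket=DEST_BUCKET, Prefix=prefix)
        for obj in page.get("Contents", [])
    }
    copied = 0
    for page in src.get_paginator("list_objects_v2").paginate(Bucket=SOURCE_BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"] not in already_mirrored:
                # Server-side copy: S3 moves the bytes between regions directly.
                dst.copy_object(
                    CopySource={"Bucket": SOURCE_BUCKET, "Key": obj["Key"]},
                    Bucket=DEST_BUCKET,
                    Key=obj["Key"],
                )
                copied += 1
    return copied
```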
2. Service-Level Failover Strategy
Instead of an all-or-nothing disaster recovery approach, we can selectively route specific services based on their latency requirements (a small routing sketch follows this list):
- Latency-sensitive services (databases): Must stay regional due to the compounding effect of network delays across hundreds of queries per page load
- Latency-tolerant services (message queues, search): Can route cross-region without performance impact
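Here is a sketch of how that split can be expressed as configuration. The service names, regions, and latency figures are illustrative assumptions; the short calculation at the end shows why cross-region database routing is off the table.

```python
# Per-service routing policy: latency-sensitive services stay pinned to their
# home region; latency-tolerant ones are allowed to fail over cross-region.
ROUTING_POLICY = {
    "database": {"latency_sensitive": True,  "regions": ["us-east-1"]},
    "queue":    {"latency_sensitive": False, "regions": ["us-east-1", "ca-central-1"]},
    "search":   {"latency_sensitive": False, "regions": ["us-east-1", "ca-central-1"]},
}


def candidate_regions(service: str) -> list[str]:
    """Regions a service may run in, primary region first."""
    return ROUTING_POLICY[service]["regions"]


# Why databases stay regional: an assumed ~15 ms cross-region round trip,
# compounded over hundreds of queries per page load, adds whole seconds.
EXTRA_RTT_MS = 15
QUERIES_PER_PAGE = 300
print(f"Added page latency if DBs went cross-region: ~{EXTRA_RTT_MS * QUERIES_PER_PAGE / 1000:.1f}s")
```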
3. Proactive Monitoring Beyond AWS Status
Our systems monitor actual service health, not just what cloud providers report. During this outage, AWS took hours to update their status page with the full list of affected services. We had already identified the issue and implemented our solution.
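As an illustration of checking real service health instead of waiting on a status page, here is a minimal probe sketch using boto3 with aggressive timeouts. The timeouts and the choice of SQS's ListQueues call are assumptions for the example, not our monitoring stack.

```python
import time
import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

# Fail fast: a hung endpoint should register as unhealthy within seconds,
# not after the SDK's default retries have quietly burned minutes.
PROBE_CONFIG = Config(connect_timeout=2, read_timeout=3, retries={"max_attempts": 1})


def probe_sqs(region: str) -> dict:
    """One active check against SQS in a region: did a trivial call succeed, and how fast?"""
    sqs = boto3.client("sqs", region_name=region, config=PROBE_CONFIG)
    started = time.monotonic()
    try:
        sqs.list_queues(MaxResults=1)  # cheap call that exercises the real endpoint
        return {"region": region, "healthy": True, "latency_s": time.monotonic() - started}
    except (BotoCoreError, ClientError) as exc:
        return {"region": region, "healthy": False, "error": str(exc)}


# Run on a short schedule, results like this can raise an alert well before a
# provider's status page catches up:
# probe_sqs("us-east-1") -> {"region": "us-east-1", "healthy": False, "error": "..."}
```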
4. Designed for Partial Failures
Core dna's architecture assumes that individual services will fail. When SQS went down, here's what happened:
- User registrations continued: Accounts were created, logins worked—users only saw errors in follow-up notifications
- eCommerce transactions processed: Orders completed, payments went through, inventory updated
- Admin operations persisted: While some audit logging and search indexing were delayed, critical business functions remained operational
The key insight: our orchestration layer failed gracefully. Events that couldn't be pushed to the queue didn't crash entire workflows.
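Here is a sketch of what that graceful failure can look like, assuming failed events are spooled to a local file for later replay; the path and helper names are hypothetical. The replay step corresponds to the manual reprocessing mentioned in the operational notes further down.

```python
import json
import pathlib

# Local spool for events that could not reach the queue during an outage.
SPOOL = pathlib.Path("/var/spool/orchestration-events.jsonl")  # hypothetical path


def emit_event(event: dict, publish) -> bool:
    """Try to publish an orchestration event; never let a queue failure break the request.

    `publish` is any callable that raises on failure (e.g. an SQS send wrapper).
    Returns True if delivered now, False if spooled for later replay.
    """
    try:
        publish(json.dumps(event))
        return True
    except Exception:
        # Degrade gracefully: the customer-facing action already succeeded,
        # so record the event for reprocessing instead of surfacing an error.
        with SPOOL.open("a") as fh:
            fh.write(json.dumps(event) + "\n")
        return False


def replay_spooled(publish) -> int:
    """Re-send spooled events once the queue is healthy again; returns the count replayed."""
    if not SPOOL.exists():
        return 0
    events = [line for line in SPOOL.read_text().splitlines() if line.strip()]
    for line in events:
        publish(line)
    SPOOL.unlink()  # all replayed successfully; clear the spool
    return len(events)
```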
What This Means for Your Business
The October 2025 AWS outage should fundamentally change how ecommerce leaders think about infrastructure:
For CMOs: Revenue Protection is Infrastructure Insurance
Most of the losses came from eCommerce operations. Core dna clients avoided extended downtime, not through luck, but through infrastructure investment that directly protects revenue.
Consider what 30 minutes of downtime costs your business during peak season. Now multiply that by the 12+ hours some companies experienced. The architecture investment pays for itself in a single incident.
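To make the comparison concrete with purely hypothetical numbers (substitute your own peak-season revenue rate):

```python
revenue_per_hour = 50_000                  # hypothetical peak-season revenue rate ($/hour)

loss_30_minutes = revenue_per_hour * 0.5   # $25,000
loss_12_hours = revenue_per_hour * 12      # $600,000

print(f"30-minute outage: ${loss_30_minutes:,.0f}")
print(f"12-hour outage:   ${loss_12_hours:,.0f}  ({loss_12_hours / loss_30_minutes:.0f}x the loss)")
```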
For CTOs: Multi-AZ ≠ Multi-Region
Many technical leaders believe they've solved for redundancy by using multiple AWS availability zones. The October outage proved this insufficient. True resilience requires:
- Geographic distribution beyond a single cloud region
- Service-level understanding of latency requirements
- Automated failover procedures that don't require manual intervention (see the sketch after this list)
- Regular testing of disaster recovery plans
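Building on the health-probe idea above, here is a minimal illustration of failover that triggers itself after repeated failures, so nobody has to be paged to run the switch by hand. The threshold, fallback map, and in-memory state are assumptions for the sketch.

```python
from collections import defaultdict

FAILURE_THRESHOLD = 3                      # consecutive failed probes before failover
FALLBACK = {"us-east-1": "ca-central-1"}   # where each primary fails over to

active_region = "us-east-1"
consecutive_failures = defaultdict(int)


def record_probe(region: str, healthy: bool) -> str:
    """Feed in each probe result; returns the region traffic should use right now."""
    global active_region
    if healthy:
        consecutive_failures[region] = 0
        return active_region
    consecutive_failures[region] += 1
    if region == active_region and consecutive_failures[region] >= FAILURE_THRESHOLD:
        # Automated failover: reroute latency-tolerant services without waiting
        # for a human, or for the provider's status page.
        active_region = FALLBACK.get(region, active_region)
    return active_region
```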
For eCommerce Managers: Operational Continuity During Crisis
While competitors explained downtime to customers, Core dna clients maintained normal operations. The only intervention required was manual reprocessing of orchestration events for the 30-minute window before failover, a minor backend task invisible to customers.
This operational continuity matters beyond revenue. Your brand reputation, customer trust, and competitive position all hang in the balance when systems fail.
Building for the Next Outage (Because There Will Be One)
This wasn't the first major cloud outage, and it won't be the last. The US-EAST-1 region has experienced three major incidents in five years, and Microsoft Azure suffered a similar outage just days after AWS recovered. The pattern is clear: as more infrastructure consolidates onto fewer providers, the impact radius of each failure grows.
Forward-thinking ecommerce platforms are already adopting Core dna's architectural principles:
1. Assume Failure at Every Layer
Design systems where any individual component can fail without cascading. This means:
- Graceful degradation paths for non-critical features
- Message queues that can redirect across regions
- Separation of synchronous (customer-facing) and asynchronous (background) operations
2. Invest in Operational Intelligence
The ability to detect and respond before official acknowledgment gave Core dna crucial extra time. Real-time monitoring of actual service health, not just status pages, is essential.
3. Test Disaster Recovery Regularly
Core dna's 30-minute response time came from having executed these procedures before. If your DR plan lives in a document that has never been tested, it's not a plan; it's wishful thinking.
4. Understand Service Latency Requirements
Not everything needs to run in the same data center. Core dna's insight that SQS could route cross-region while databases could not reflects the kind of architectural thinking most platforms lack.
The New Standard for Enterprise Ecommerce
The AWS outage didn't just disrupt services; it revealed which platforms were built for enterprise resilience and which were hoping problems would never happen.
Core dna's response during this crisis demonstrates what modern eCommerce infrastructure should look like:
- Multiple live production environments, not backup plans on paper
- Service-specific failover strategies, not one-size-fits-all disaster recovery
- Proactive monitoring and response, not reactive scrambling
- Architecture that assumes failure, not hopes for perfection
For CMOs, CTOs, and ecommerce managers evaluating platforms, the question isn't whether your provider uses AWS, Google Cloud, or Azure. The question is: What happens when that cloud fails?
With Core dna, the answer is simple: your business keeps running.
About the Technical Response
Core dna's approach to this outage reflects years of architectural investment in distributed systems. Our disaster recovery plan includes:
- Cross-regional backups running continuously across US and Canadian regions
- Service-specific routing strategies based on latency sensitivity
- Automated monitoring that detects issues before cloud providers acknowledge them
- Manual fallback procedures for full regional migration (tested regularly but unnecessary in this case)
The October 2025 outage proved these investments weren't over-engineering; they were essential infrastructure for any platform handling critical ecommerce operations.
