TL;DR
- Root Cause: An AWS cooling failure shut down racks, triggering widespread Coinbase service outages across trading and account actions.
- Compounding Issues: Matching‑engine quorum loss and a silent Kafka control‑plane defect considerably prolonged recovery time.
- Next Steps: Coinbase is improving cross‑zone resiliency, increasing failover testing, and upgrading Kafka deployments to prevent similar incidents.
The May 7 service disruption left customers across retail and institutional platforms facing hours of halted activity, and the company is now outlining how a localized AWS cooling failure escalated into a protracted, multi‑layer outage. Coinbase says the eight‑hour interruption, followed by a twelve‑hour full‑system recovery window, fell short of its standards and warranted a detailed accounting of the technical breakdowns that compounded the incident.
AWS Cooling Failure Triggered Widespread Shutdowns
According to the company, the chain of events began at 7:20 PM ET when multiple chiller units failed inside a single data hall in AWS’s us‑east‑1 region. The cooling loss forced a thermal‑safety shutdown of racks hosting EC2 instances and EBS volumes, affecting Coinbase systems alongside other major internet services. By 7:48 PM ET, nearly all trading on the platform had halted, leaving retail users unable to buy, sell, send, receive, deposit, or withdraw.
Institutional clients on Prime also saw order‑routing degradation as Coinbase Exchange markets went offline. Recovery progressed unevenly. The matching engine returned in cancel‑only mode at 2:25 AM ET on May 8, full trading resumed at 3:49 AM ET, and coinbase.com and the mobile app regained full functionality by 9:53 AM ET. Event‑streaming backlogs cleared by 2:00 PM ET. Coinbase also notified regulators within required windows and is completing formal impact assessments.
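For readers who want to check the intervals, the timestamps above can be turned into elapsed times. This is a minimal sketch, not part of Coinbase's report. The event labels are made up for illustration, and the year is a placeholder because the article does not state one (it does not affect the differences).

```python
from datetime import datetime

# Timestamps quoted in the article, all Eastern Time.
# ASSUMPTION: the year is not given in the article; 2025 is a placeholder.
YEAR = 2025
events = {
    "chillers_fail": datetime(YEAR, 5, 7, 19, 20),
    "trading_halted": datetime(YEAR, 5, 7, 19, 48),
    "cancel_only": datetime(YEAR, 5, 8, 2, 25),
    "full_trading": datetime(YEAR, 5, 8, 3, 49),
    "web_and_mobile_restored": datetime(YEAR, 5, 8, 9, 53),
    "backlogs_cleared": datetime(YEAR, 5, 8, 14, 0),
}


def elapsed(start: str, end: str) -> tuple[int, int]:
    """Return (hours, minutes) between two named events."""
    total_minutes = int((events[end] - events[start]).total_seconds() // 60)
    return divmod(total_minutes, 60)


print(elapsed("trading_halted", "full_trading"))     # (8, 1)
print(elapsed("chillers_fail", "backlogs_cleared"))  # (18, 40)
```

The gap between the trading halt and the return of full trading comes to roughly eight hours, which is consistent with the "eight‑hour" figure, although the article does not say which endpoints the company used for either of its stated windows.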

Two Architectural Weaknesses Extended the Outage
The company identified two core issues that turned a single‑zone AWS event into a multi‑hour platform disruption. First, the matching engine was pinned to a single building because of its Raft‑based cluster design, which prioritizes low‑latency co‑location. When AWS terminated instances at 9:29 PM ET, three of five nodes went down, eliminating quorum. With no automated cross‑zone failover, engineers had to ship an emergency code change, create a new node group, and restore a 3‑of‑5 quorum manually.
Second, AWS’s managed Kafka service failed silently. A defect in the MSK control plane prevented automatic partition‑leader reelection, leaving producers unable to write and blocking downstream systems, including fees, quoting, ledger components, payments, and data pipelines. Manual partition reassignments began at 3:00 AM ET, with priority topics restored by 9:30 AM ET. Coinbase says it is improving cross‑zone standby design for its matching engines, increasing failover testing, collaborating with AWS on the Kafka defect, improving internal tooling, and migrating the remaining 2‑AZ Kafka clusters to 3‑AZ deployments.
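The quorum loss described above follows from simple majority arithmetic: a Raft‑style cluster can only make progress while more than half of its nodes are reachable. The sketch below illustrates that rule under stated assumptions. The function names and zone layouts are hypothetical and do not describe Coinbase's actual topology or planned design.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest strict majority of a cluster: floor(n / 2) + 1."""
    return cluster_size // 2 + 1


def has_quorum(cluster_size: int, healthy_nodes: int) -> bool:
    """True if enough nodes remain to form a majority."""
    return healthy_nodes >= quorum_size(cluster_size)


def survives_any_single_zone_loss(nodes_per_zone: dict[str, int]) -> bool:
    """True if losing any one zone still leaves a majority of all nodes."""
    total = sum(nodes_per_zone.values())
    return all(has_quorum(total, total - lost) for lost in nodes_per_zone.values())


# The incident as reported: five nodes, three terminated.
print(quorum_size(5))        # 3
print(has_quorum(5, 5 - 3))  # False: two survivors cannot form a majority

# HYPOTHETICAL layouts, for illustration only.
print(survives_any_single_zone_loss({"zone-a": 5}))                            # False
print(survives_any_single_zone_loss({"zone-a": 3, "zone-b": 2}))               # False
print(survives_any_single_zone_loss({"zone-a": 2, "zone-b": 2, "zone-c": 1}))  # True
```

With five nodes the cluster tolerates two failures but not three, which matches the reported need to rebuild a 3‑of‑5 quorum by hand. The trade‑off the article points to is that spreading nodes across zones improves fault tolerance at the cost of the low‑latency co‑location the design favored.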

