What Outages Teach About Organizational Resilience and Leadership
The cost of oversight is enormous and can have long-term repercussions on a business. Major tech outages leave no room for excuses and push leaders to take responsibility.
Image credit: Diksha Mishra / MIT Sloan Management Review Middle East
When technology falters, even briefly, the consequences can cascade across industries. For example, the recent Iranian drone attacks on three Amazon Web Services facilities disrupted cloud services for an estimated 11 million users. But it isn’t always a cyberattack that reveals a company’s weaknesses; sometimes a routine update or maintenance job exposes hidden problems.
This became apparent during the 2024 Microsoft-CrowdStrike outage, in which a faulty software update crashed approximately 8.5 million Microsoft Windows systems. The global financial hit? Over $10 billion. Meanwhile, in 2025, AWS experienced one of the biggest and most widespread cloud outages of recent times, when a major disruption in its US-EAST-1 region sidelined services for over 4 million users and more than 1,000 companies, including Slack, Atlassian, and Snapchat.
Today, outages are the ultimate test for organizations, one defined by the decisions leaders make during the crisis.
“What defined us wasn’t that moment, it was everything that came next,” shared George Kurtz, founder and CEO of CrowdStrike, a year after the “Blue Screen of Death” global incident.
Major tech outages leave no room for excuses and force leaders to take responsibility for the results.
Dealing with Uncomfortable Truths
According to an Oxford Economics study, downtime—when a machine, especially a computer, is unavailable—costs organizations about $9,000 per minute, or $540,000 per hour. In short, the cost of oversight is enormous and can have long-term repercussions on a business.
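To put those numbers in context, here is a minimal back-of-the-envelope sketch in Python. It assumes only the Oxford Economics per-minute figure cited above; the example durations are illustrative, not drawn from any report.

```python
# Rough downtime-cost estimate based on the Oxford Economics figure
# cited above (~$9,000 per minute). The durations below are hypothetical.

COST_PER_MINUTE_USD = 9_000

def downtime_cost(minutes: float, cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Estimate the direct cost of an outage of the given duration."""
    return minutes * cost_per_minute

for minutes in (5, 60, 12 * 60):  # five minutes, one hour, a 12-hour outage
    print(f"{minutes:>5} min -> ${downtime_cost(minutes):>12,.0f}")
# Prints roughly $45,000 for five minutes, $540,000 for an hour,
# and $6,480,000 for a 12-hour outage.
```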
“I have learned more about real leadership at midnight during outages than in any strategy offsite,” shares Sergio Gago Huerta, CTO of Cloudera. In enterprise software, the reality is harsh: technology always fails.
Huerta adds that leaders often default to demanding an update every five minutes, not because it solves the problem, but to have ammunition for stakeholder spin. “While understandable, this instinct is poisonous.”
Recalling when an airline’s digital ticketing system went down and executives kept demanding updates, David Noël, Vice President Middle East and Africa at Dynatrace, said, “The pressure wasn’t just coming from the technology side. The Chief Revenue Officer was calling because every minute offline meant abandoned bookings and lost fares. That tells you a lot about where the real pressure sat.” As a CTO, Huerta’s role is clear: be the glue and the shield. “We must act as glue for those with partial context and a shield for the engineers who need space to restore service, rather than dealing with noise.”
Noël admits the experience was tough, “The entire response was manual. And when everything is manual, pressure amplifies fast. Too many people, too many hours, too much stress.”
It is in moments like these that leaders rise to the challenge. Uptime Institute’s 2024 Outage Analysis reveals that 10-20 high-profile IT outages or data center events occur annually.
Facing the Realities of Cybersecurity
Attackers tend to exploit tech outages to launch or amplify cyberattacks, capitalizing on chaos, reduced monitoring, and user confusion to turn initial disruptions into broader breaches.
During the 2024 CrowdStrike outage, attackers timed phishing and malware attacks to coincide with endpoint protection failures, leveraging spear-phishing and fake fix websites to breach networks. Similarly, the global Microsoft Windows outage triggered phishing surges that impersonated patches and updates, which deployed malware such as the Remcos RAT and data wipers.
Samer El Kodsi, Regional Vice President of Sales, Gulf & North Africa, at Palo Alto Networks, says company leaders now have a better grasp of cybersecurity. “The level to which business leaders prioritize technology risks varies widely between organizations. However, discussions regarding technology risks have reached the boardroom and are becoming more commonplace,” he noted.
A 2025 PwC report found that only 24% of organizations spent significantly more on proactive measures (e.g., monitoring, assessments, testing, controls) than reactive measures (incident response, fines, recovery). Meanwhile, 67% spent roughly equal amounts on both categories.
“Cybersecurity specialists will have seen the full gamut of reactions from organizations that have suffered a breach, ranging from sober realism and honesty to denial and spin. But thankfully, the latter is increasingly rare,” Kodsi says, adding that every outage or breach represents a learning opportunity to both cybersecurity specialists and organizations.
Today, boards take an active interest in cybersecurity, partly because well-publicized AI-based threats can cause major disruption, data loss, extortion attempts, and a complete loss of customer trust. Kodsi suggests treating cybersecurity as a team sport: sharing intelligence and maintaining open technical integrations are the best ways to stay ahead of sophisticated AI-driven threats.
According to a 2025 IBM report, the global average cost of a data breach is approximately $4.4 million per incident.
The Talk Around Failure
The 2024 AT&T outage in the US, caused by a misconfigured network element during a routine update, knocked out service for over 125 million devices nationwide for roughly 12 hours, blocking more than 92 million calls, including some 25,000 attempts to reach 911. Upon network restoration, CEO John Stankey addressed employees, acknowledging the failure and apologizing for the disruption. He framed the moment not as a crisis of competence but as a crisis of resilience. “This is not our first network outage, and it won’t be our last—unfortunately, it’s the reality of our business. What matters most is how we react, adapt, and improve to deliver the service our customers need and expect,” Stankey said.
How leaders respond during a crisis shows their character. But their response only matters as much as the culture that precedes it. Experts are clear on this: the goal is not a culture with zero failure. A culture that punishes every mistake doesn’t stop errors; it just makes people hide them.
“Engineers must be explicitly encouraged to call out ‘limping stools’—the shaky assumptions and silent failure modes—without fear of being labeled as negative. If your culture punishes bad news, you do not eliminate risk; you simply delay the moment it finds you,” Huerta says.
“What works is making escalations routine, not an exception. Weekly or daily stability stand-ups, where teams across the organization review system health together, create a culture where raising a concern early is expected and valued. Some organizations bring in their key vendors. When speaking up is normalized, problems surface before they compound,” Noël adds.
When employees feel safe reporting mistakes without fear, trust grows. This leads to better communication and quicker fixes. “Psychological safety doesn’t lower standards. It raises them. It means issues surface sooner, when they’re smaller and easier to fix,” Noël notes.
According to the 2024 IBM Cost of a Data Breach Report, organizations that detected breaches internally, using their own security teams and tools, saved an average of nearly $1 million compared to those discovered by attackers.
What the Post-Outage Scenario Looks Like
The meeting after the outage is as revealing as the outage itself. Is it one of deep, intense discussion about how to address the issue and prevent a recurrence, or one where a leader’s frustration lands on the first person in their line of sight?
The tech industry has a new mantra: the blameless postmortem. It’s now the gold standard.
“When outages inevitably happen, the post-incident review has one rule: blameless, or don’t bother. If the goal is finding a scapegoat, you will simply get quieter engineers and louder outages. We go forensic on systems, not people,” Huerta says.
According to him, the teams that build and run the code should stay focused on paying down technical debt and continuously improving the system.
A mature cybersecurity approach treats every outage as a system problem first. The focus is on identifying root causes within the system, not on blaming individuals.
“When organizations focus on human error as a final conclusion, they miss the underlying weaknesses that allowed that error to occur. True resilience is built through transparency. Follow-through must be a shared responsibility, but ultimate ownership resides with the executive leadership and the chief information security officer. Part of their role is to ensure that resources are available for technical remediations and that a culture of proactive risk management is integrated into the organization’s DNA,” says Kodsi.
Huerta agrees that follow-through is indispensable. “Mitigations are owned by the product team, linked to actual tickets, and tracked like real work, not treated as nice-to-have chores that disappear when the next shiny feature arrives.”
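As a rough illustration of what “tracked like real work” can look like, here is a minimal Python sketch of a postmortem record whose mitigations carry an owner, a ticket reference, and a due date. The structure and field names are hypothetical, not a depiction of Cloudera’s or CrowdStrike’s actual tooling.

```python
# A hypothetical sketch: postmortem follow-through modeled as structured
# work items with owners and ticket links, so mitigations can't quietly
# disappear when the next shiny feature arrives.

from dataclasses import dataclass, field
from datetime import date

@dataclass
class Mitigation:
    description: str
    owner_team: str    # the product team that builds and runs the code
    ticket_id: str     # reference to a backlog item, e.g., "OPS-1234" (hypothetical)
    due: date
    done: bool = False

@dataclass
class Postmortem:
    incident: str
    root_causes: list[str]  # systems and processes, never names
    mitigations: list[Mitigation] = field(default_factory=list)

    def open_work(self) -> list[Mitigation]:
        """Return mitigations still outstanding, the follow-through most teams drop."""
        return [m for m in self.mitigations if not m.done]
```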
CrowdStrike didn’t just take immediate steps after the incident; it made lasting changes that drove proactive resilience. Through its “Resilient by Design” strategic framework, introduced after the Falcon disruption, it brought meaningful structural changes, from self-healing sensors and scalable infrastructure to more flexible customer controls, enhanced testing, and a strengthened global operations center.
“What we’ve built drives greater resilience across our ecosystem and helps advance industry standards for cybersecurity,” says Kurtz.
Sometimes outages don’t just get fixed; they lead to new infrastructure capabilities. Microsoft took a similar approach with Azure.
Regaining and Reinstilling Customer Trust
Culture matters. Leadership response matters. But none of it means anything if customers lose faith in you. After all, trust is earned in public, not asserted in private.
“Following any cybersecurity incident, trust is best measured through a combination of behavioral data, showing how customers have behaved or reacted following the incident, and sentiment data, which is more about what customers have said following the incident. Both are important in helping a company to properly formulate its ongoing response to rebuild trust with customers,” Kodsi suggests.
Huerta and Cloudera measure customer sentiment with transparent metrics (SLAs, SLIs, and SLOs) and connect incidents to what actually matters: customer impact and business KPIs. “When trust is lost due to system wobbles, we win it back the only way that counts: fast, safe iteration. This means utilizing tight CI/CD pipelines, canary deployments, and a definition of done that signifies value delivered, not just code merged and prayed,” he adds.
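For readers unfamiliar with that vocabulary, the arithmetic behind an SLO is simple. Here is a minimal Python sketch of the error-budget calculation; the 99.9% target and the request counts are illustrative assumptions, not Cloudera figures.

```python
# Error-budget arithmetic behind an availability SLO. The target and
# traffic numbers below are hypothetical examples.

def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget left in the window (negative means the SLO is breached)."""
    allowed = (1.0 - slo_target) * total  # failures the SLO tolerates in this window
    return 1.0 - failed / allowed if allowed else 0.0

# A 99.9% availability SLO over 2,000,000 requests tolerates 2,000 failures.
print(error_budget_remaining(0.999, 2_000_000, 1_200))  # -> 0.4, i.e., 40% of budget left
```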
For Noël, the clearest signal is adoption. “When customers keep using digital channels after an incident, transacting more, returning consistently, that is trust in observable form. When those numbers stall or reverse, sentiment data won’t make up for it. Download an app, find it broken, delete it. That decision takes seconds. Recovering that customer takes months.”
Long-term Approach
In a crisis, leaders steady their teams by acting decisively, communicating transparently, and leading with empathy. Adopting an “all-hands” approach, they set clear priorities, move with agility, and stay calm under pressure.
As digital infrastructure becomes inseparable from daily life, powering hospitals, grids, governments, and emergency lines, the question is no longer whether disruptions will happen. They will. The question is how to keep disruption minimal by building systems that catch the next failure before a user feels it.
Digital acceleration often means deeper dependence on centralized cloud infrastructure. Noël notes that the boardroom conversations that often get skipped are not about awareness of that risk (“Most leaders are aware”) but about ownership. “Does the organization actually have visibility and control, or is it hoping nothing breaks? Mature organizations treat digital risk the same way they treat credit or regulatory risk, with real governance behind it. And as multi-cloud environments grow more complex, that governance has to include real-time visibility across every layer, not just periodic reporting.”