What the CrowdStrike global IT meltdown teaches us about technology risks, resilience and complexity

We are living with an interconnected digital nervous system. Nobody is even attempting to think about risk and resilience in our highly connected society. You would think everyone, from regulators to media to technology buyers, would become well-versed in understanding the underlying technologies. Nope. Faux outrage and jingoism rule the roost.

21 July 2024 (Los Angeles, California) — When I make these long trips back to the U.S. (this one will be 2 months) I always have a packed agenda. But (of course) “unexpected events” happen and you are called upon to reflect – inspired by my adoring public. Or inspired by my pompous navel gazing. Today’s was a little bit of both.

I had to cringe at this tweet from FTC Chair Lina Khan, who tried to make the current CrowdStrike/Microsoft global IT meltdown a crisis caused by “big tech” and “consolidation”. Noah Crisp, my cybersecurity reporter, covered this nicely earlier this week but I wanted to weigh in with some thoughts, too. Do read Noah’s piece. He has some spot-on quotes from an event he was covering.

Pardon my French, Lina, but how in fuck is it that you don’t understand systems, or their inherent complexity? Well, if you give someone a hammer, everything looks like a nail.

All of my regular readers know I’m a little skeptical (ok, COMPLETELY skeptical and cynical) of government regulators (including the DOJ, the FTC, the SEC, etc., etc.) being able to control and rein in Big Tech, and more importantly, bring about change that is timely, impactful and meaningful in the long run. This lack of understanding of the complexity of our modern technology-reliant, digital-first world is why all of these regulators need to rethink regulation and regulatory frameworks. That needs a more nuanced write-up, which I will save for another day. But herein a few points on last week’s IT meltdown.

CrowdStrike doesn’t quite fit the definition of a classic monopoly, but its impact, as shown by the widespread system failure, suggests otherwise. It isn’t the only service that helps guide the world in an invisible fashion. We need to rethink not only how we regulate but also what we regulate. Khan’s tweet, like many others I’ve seen in the wake of this week’s CrowdStrike/Microsoft “crisis”, only reinforces my hypothesis that with the abstraction of core technologies into buzzwords, most people involved in the technology ecosystem don’t quite understand the complexity of digital systems.

I see it every day in the ediscovery / information governance industry. In the last 25 years, the adtech industrial complex has built a vast inverted pyramid of complexity, obscurity, rent-seeking, arbitrage (and occasional fraud), a structure never designed for information governance or data privacy. Information services now sit within complex media ecologies, and networked platforms and infrastructures create complex interdependencies and path dependencies. The power dynamic has changed because data has become the crucial part of our infrastructure, enabling all commercial and social interactions. We live in a massively intermediated, platform-based data environment, with endless network effects, commercial layers, inference data points, and new paths to analysis. It cannot be regulated, it cannot be governed. That industry lives in “The Matrix”, but to their credit their job is to sell product, not tell the truth. There is no money in the truth.

And I see it every day in the cybersecurity industry. The biggest issue right now is that the increasing complexity of cloud, multi-cloud, and hybrid network environments has rapidly evolved into an almost-ungovernable system, showing the Achilles heel of traditional network cybersecurity defenses. There is a maxim in the cybersecurity industry, best articulated by Elio Grieco, one of my team’s brilliant, creative “must follow” chaps on LinkedIn. Elio has superb computer skills from programming to usage, and a deep knowledge of cybersecurity issues. As Elio has noted: “We’ll do anything to fix cybersecurity and network flaws – except build software correctly”.

I can certainly understand that a layperson doesn’t comprehend the complexity of technologies that touch their daily lives, but this lack of comprehension is widespread and deep among those who are the most important part of the technology ecosystem. Technical ineptitude afflicts everyone — from politicians and bureaucrats to company leaders and media (including the media covering technology).

Five years ago, in his essay about the need for systems thinking, Adam Bly pointed out that complexity is “an impossibly large network of interacting components, without central control, whose emergent behavior is much more elaborate than the sum of the behaviors of its individual parts”. He argued that we are nearly at “the point where every world problem is intractable in isolation”.

With technology, you could see it coming. A decade and a half ago, many bemoaned that Facebook Login was becoming a single point of failure for many internet services. Nobody was really being a Cassandra. Instead they were just highlighting that the ever-increasing risks in the systems would lead to disaster. In the Facebook example, Facebook Login failures have since disrupted internet services on multiple occasions.

With increasing frequency, we are reminded of this interconnected complexity and the growing magnitude of risk and failure. Fat fingers or a single line of errant code can bring down a telecom network. A misconfigured address on routers can bring the entire internet to a halt. When Amazon’s cloud goes offline, almost half the world’s internet offerings go on the blink. It seems that the timeless “six degrees of separation” adage applies to machines and software as well.
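For the technically inclined, here is a minimal sketch of what that “degrees of separation” point looks like when you write it down. It assumes you have some map of which services depend on which components; every name in the map below is hypothetical and invented purely for illustration. The function counts, for each component, how many other things fall over (directly or indirectly) if it fails.

```python
from collections import defaultdict


def transitive_dependents(depends_on):
    """For each component, return the set of components that rely on it,
    directly or through a chain of other components."""
    # Invert the map: component -> things that depend on it directly.
    reverse = defaultdict(set)
    for service, deps in depends_on.items():
        for dep in deps:
            reverse[dep].add(service)

    result = {}
    for node in set(depends_on) | set(reverse):
        seen = set()
        stack = [node]
        # Depth-first walk up the inverted graph.
        while stack:
            current = stack.pop()
            for parent in reverse.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        seen.discard(node)  # guard against cycles pointing back at the node
        result[node] = seen
    return result


# Hypothetical dependency map (illustrative only, not real-world data).
DEPENDS_ON = {
    "airline_checkin": ["endpoint_agent", "cloud_host"],
    "hospital_records": ["endpoint_agent"],
    "bank_teller": ["endpoint_agent", "sso_login"],
    "news_site": ["cloud_host", "sso_login"],
    "sso_login": ["cloud_host"],
    "endpoint_agent": [],
    "cloud_host": [],
}

blast_radius = transitive_dependents(DEPENDS_ON)
# Components ranked by how many others they can take down with them.
ranked = sorted(blast_radius, key=lambda n: (-len(blast_radius[n]), n))
```

In this toy map the components nobody thinks about sit at the top of `ranked`, which is roughly the problem: the nodes with the largest blast radius are often the least visible ones.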

I have more I’d like to write but I need to catch a flight back to D.C. so just a few more things.

To maximally insure against meltdowns like the CrowdStrike mess would require paying a price in economic efficiency. But we got fat and lazy. Trade-offs? We were not willing to make any.

And, yes, averting this particular disaster might not have cost a ton of money, to be clear. But the problem is CrowdStrike is only one of approximately one zillion points of possible failure in our thoroughly networked and globalized economy, as I noted above.

Over the past 50 years, the market’s relentless drive for efficiency and reach has made such mass failure nodes more numerous, more potentially catastrophic and harder to see before they fail — while also giving us instant access to all the world’s culture and most of its information, plus more, cheaper and better goods and services, and a global economy that every year lifts tens of millions more people out of poverty.

I came of age in the 1970s, when this transformation was but a glimmer and network complexity was slowly being understood. Very early days. But I was there just as “e-commerce” was beginning to take shape. The internet was really just beginning, showing its possibilities, and global trade was shifting manufacturing from the United States to countries that had cost advantages (such as China and Mexico) and countries that had special technical expertise (such as Germany and Japan).

And, at the same time, Americans were importing foreign techniques such as “just-in-time” manufacturing, which replaced inefficient local suppliers and mountains of spare parts with far-flung but streamlined supply chains that delivered inputs precisely when they were needed. Yes, this was often hard on people who worked for the local industries and it would lead to the political and social incohesion we are living through today. But it was a boon to workers in poor countries and to American consumers.

Compare any consumer product today with its 1970 or 1980 or 1990 equivalent. You’ll often find that either the quality has risen dramatically (cars) or the price has fallen precipitously (clothes). In the case of televisions and many other products, it’s both. The internet intensified this trend, because its network effects and economies of scale often drive markets toward a handful of players.

I’m a media/digital media guy and I have seen it in full fury. We moved from thousands of local newspapers, each competing for readership within a geographically limited market. Now a small number of newspapers are vying to be among a handful of daily news publications that serve the entire country, maybe even the entire English-speaking world. Enlarging markets and putting more eggs in fewer baskets has real benefits: a few big news publications can cover more beats, more deeply, than a lot of small ones can.

But as Megan McArdle, lead opinion writer for the Washington Post, recently said, “it also has real downsides, such as a higher risk that those few players will all miss some stories — notably, local news — or get one of them badly wrong”.

Similarly, as became obvious during the pandemic, cost-efficient just-in-time global supply chains will leave companies and their customers vulnerable when borders slam shut and governments hoard critical resources for their own citizens. In the case of software companies, it’s quite efficient for one firm to serve a large number of important customers, as CrowdStrike does — or even practically all the customers, as is the case with online search. In some ways, these concentrated players might provide greater reliability, because they develop a lot of expertise by serving many users, and they can invest more in R&D and security than Bob’s Friendly Local Software Co. can.

But when outages happen, they happen to seemingly everyone, everywhere, all at once, leaving users no alternatives. How best to try to manage the trade-off between efficiency and redundancy is a hard question for another day. My short answer? It is now impossible. The time to think about trade-offs was when the “Great Efficiency Drive” was underway. That train has left the station.

Nope. Faux outrage and jingoism rule the roost. Life will go on just as it did with SolarWinds. The “Mea Culpa” public message will be “Hey! Nothing to see here. This was not a cyberattack. Just a mistake, no reason for alarm”.

Note to readers: cyber attackers did take advantage. During the outage there was a rush to register domains like crowdstrike-bsod.com and crowdstrikefix.com, which signaled the danger of exploitation by bad actors preying on user desperation, and their intention to exploit DNS to redirect users to malicious domains. There is no better way to do this than by using compromised, blacklisted DNS servers.
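For readers who run their own DNS or proxy logs, here is a rough sketch of how that kind of lookalike registration can be flagged. It is a naive string check, not a real threat-intelligence feed: the allow-list of official domains is my assumption, the helper names are made up for this example, and it will miss typo-squats that misspell the brand.

```python
import re

BRAND = "crowdstrike"
# Assumption: the only domain treated as legitimate in this sketch.
OFFICIAL_DOMAINS = {"crowdstrike.com"}


def normalize(domain):
    """Lowercase the name and drop surrounding whitespace and a trailing dot."""
    return domain.strip().lower().rstrip(".")


def is_suspicious(domain, brand=BRAND, official=OFFICIAL_DOMAINS):
    """Flag domains that contain the brand name but are not the official
    domain or one of its subdomains."""
    d = normalize(domain)
    if d in official or any(d.endswith("." + o) for o in official):
        return False
    # Remove dots, hyphens and other separators so "crowd-strike" still matches.
    squashed = re.sub(r"[^a-z0-9]", "", d)
    return brand in squashed


candidates = [
    "crowdstrike-bsod.com",
    "crowdstrikefix.com",
    "crowdstrike.com",
    "falcon.crowdstrike.com",
    "example.org",
]
flagged = [d for d in candidates if is_suspicious(d)]
```

Run against the two domains named above, `flagged` contains both of them and leaves the official domain and its subdomain alone. Anything this simple will throw false positives, so treat a hit as a reason to look, not a verdict.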

Much like SolarWinds, where major initial concern waned over time and is now, fundamentally, forgotten, last week’s event will also be forgotten.

And these incidents reveal a systemic problem: arrogance in the industry, software vendors’ updates that lack rigorous quality control, and clients who naively assume the updates they are getting are secure. The “1st Principle of Human Nature” will take over. Humans are creatures of habit who follow simple reproducible patterns, are reluctant to change those patterns of behavior until the behavior becomes unproductive, and will follow the same behavioral patterns in the hopes they will work again.

Gregory Bufithis
