The preventable nature of the CrowdStrike outage (even on a minimal budget)
“It is the long history of humankind (and animal kind, too) that those who learned to collaborate and improvise most effectively have prevailed.” — Charles Darwin
The recent CrowdStrike outage on Windows machines was an engineering operations failure, not a failure of individual developers. It can happen to any organization, but it's not how you want your brand to be known. People who were unaware of CrowdStrike's existence learnt about it for all the wrong reasons.
So what happened?
CrowdStrike, an enterprise security provider, has a solution, Falcon, that runs on client systems for endpoint detection. As with any software, the product receives updates as part of its lifecycle. Being a security solution, it has access to the kernel and runs as part of the Windows OS; traditional applications can crash without impacting the OS, but in this case the update caused the OS itself to go down on client systems. This is where we heard the nightmares: Delta claimed it lost $500 million within 5 days, and losses to the Fortune 500 are estimated at $5.4 billion.
While the CEO, George Kurtz, highlighted the risk of not allowing third-party systems in an open letter, he didn't reference the issues that would cause such a failure. Below, I speculate on issues that might lead to such a disastrous scenario.
Development Lifecycles
Product development is a complex task involving multiple teams, including but not limited to developers, product managers, quality assurance, infrastructure and platform engineering, and DevOps. The processes for these teams vary greatly depending on the medium they are producing software for: web development lifecycles will differ from kernel development lifecycles.
As lean processes have been integrated into the development lifecycle over the past decade, it's commonly recommended to push code into repos and deploy frequently. The obvious benefits are that issues get resolved quickly, the product becomes more robust over time, and clients get the latest features.
What often gets missed in these discussions is that it depends on the product, its risk factors, and the impact on users. For instance, if Netflix goes down it's an inconvenience, but the impact is minimal; I might choose to read a book or go out and see what the world was like prior to streaming. For a product such as Falcon, it meant flights were missed and banking systems were offline; the impact is obviously significant.
I still believe in the value of deploying frequently, but for a product like this it should first be done internally. The quality assurance gates should be more thorough than for less impactful applications. Such teams might benefit from lab environments that test scenarios likely to cause failures, and from canary releases: rolling out to a subset of clients prior to the general populace so you can iron out any kinks early.
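The canary approach above can be sketched in a few lines. This is a minimal illustration, not CrowdStrike's actual process: the ring percentages, the error threshold, and the `ring_error_rate` telemetry stub are all hypothetical.

```python
import random

# Hypothetical staged rollout: each ring must stay healthy before
# the release advances to a larger fraction of the fleet.
ROLLOUT_RINGS = [0.01, 0.05, 0.25, 1.0]  # fraction of hosts per stage (illustrative)
MAX_ERROR_RATE = 0.001                   # abort threshold (illustrative)

def ring_error_rate(fraction: float) -> float:
    """Stand-in for real telemetry from the hosts in this ring."""
    return random.uniform(0.0, 0.0005)

def staged_rollout() -> float:
    """Advance ring by ring; halt and return the deployed fraction on failure."""
    deployed = 0.0
    for fraction in ROLLOUT_RINGS:
        deployed = fraction
        if ring_error_rate(fraction) > MAX_ERROR_RATE:
            print(f"Halting rollout at {fraction:.0%}: error budget exceeded")
            return deployed
    print("Rollout complete")
    return deployed

if __name__ == "__main__":
    staged_rollout()
```

The key property is that a bad update is caught while it affects 1% of hosts rather than 100%, which is exactly the failure mode the Falcon update exposed.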
Team Development
I've worked with many who would look to fire not just a developer but the whole team responsible. I would argue that CrowdStrike has spent a significant sum training that team; in fact, a retrospective of what happened would help avoid similar scenarios in the future, and other teams should adopt those lessons learnt.
Engineering is a team sport; it wasn't just the mistakes of one team that caused this. It was an issue across the organization and the processes that allowed a failed patch to make its way through. A first step would be to measure the development and release lifecycles, understanding the nuances and the team members involved, and to conduct an impact mapping exercise to align the various teams.
Market Factors
The cybersecurity market was once deemed unstoppable, but funding volatility has hit the space over the last few years. Cybersecurity deals dropped by 20% year over year (source: Cybersecurity Dive). This has a knock-on effect on hiring and retaining talent. While CrowdStrike hasn't had layoffs, other organizations in the sector have been impacted: Okta laid off 400 employees, while Proofpoint cut 280 this year (source: layoffs.fyi).
Looking at the broader tech market, there has been a strong narrative that GenAI has enabled automation, leading to pressure to reduce headcount. It's a great narrative for shareholders, but it isn't possible with the current technology (I discuss specifics in another post). GenAI cannot replace core teams such as quality assurance; even dedicated products such as dynamic application security testing tools have difficulty evaluating boundary scenarios.
The implication is that even though engineering and supporting departments may need critical resourcing to support growth, it might not be available, or it might be delayed. These delays can cause issues that show up later in the product cycle.
Should core services be held to a different standard?
Much like core public infrastructure, core systems that are used across industries and are vital to everyday services may need to be held to a more stringent standard. I'm not saying regulation is the answer, but stricter guidelines for core services such as cybersecurity systems, managed within the industry, should be adopted.
What if I have a minimal budget?
The truth of the matter is that many teams are operating with reduced budgets and still need to ensure quality. Limited budgets don't have to mean incomplete or low-quality releases; quality can be maintained, and the following are several steps you can adopt as part of your engineering efforts.
Sometimes we forget that teams produce what we prioritize; I often see organizations emphasizing speed of release, and that's exactly what gets produced. Incorporating quality needs to happen across the organization and be communicated as a priority. Practices such as behaviour-driven development and risk-based testing can be conducted by teams of any size.
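Risk-based testing, mentioned above, needs nothing more than a scoring convention. A minimal sketch, with hypothetical test names and an assumed impact-times-likelihood scoring scheme: the riskiest areas get tested first, so a small team's limited time goes where a failure would hurt most.

```python
# Hypothetical risk-based test ordering: score each test area by
# business impact and likelihood of failure, run the riskiest first.
# Names and scores are illustrative, not a real product's test suite.
tests = [
    {"name": "kernel_driver_load", "impact": 5, "likelihood": 4},
    {"name": "ui_theme_toggle",    "impact": 1, "likelihood": 2},
    {"name": "update_rollback",    "impact": 5, "likelihood": 3},
]

def risk_score(t: dict) -> int:
    """Simple risk model: impact (1-5) multiplied by likelihood (1-5)."""
    return t["impact"] * t["likelihood"]

# Highest-risk areas first.
ordered = sorted(tests, key=risk_score, reverse=True)
for t in ordered:
    print(t["name"], risk_score(t))
```

Even a spreadsheet version of this exercise forces the conversation about which failures are truly unacceptable, which is most of the value.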
Team structure can play a significant role in building quality releases. This can be achieved by building cross-functional teams, appointing quality champions, and embedding QA earlier in the development process, meaning you need less support later during release.
Tools and automation can stretch limited budgets further. Automating code reviews provides early feedback, using GenAI for test case and data generation can extend small teams, and setting up pipeline stages with quality gates and performance monitoring of code can provide valuable information long before a customer release.
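A quality gate of the kind described above is just a set of threshold checks the pipeline runs before promoting a build. The following sketch assumes hypothetical metric names and limits; any CI system can wire something equivalent into a pipeline stage.

```python
# Hypothetical pipeline quality gate: fail the build when any metric
# falls outside its threshold. Metric names and limits are illustrative.
THRESHOLDS = {
    "test_coverage":   (lambda v: v >= 0.80, "coverage below 80%"),
    "static_findings": (lambda v: v == 0,    "unresolved static-analysis findings"),
    "p95_latency_ms":  (lambda v: v <= 200,  "p95 latency regression"),
}

def evaluate_gate(metrics: dict) -> list:
    """Return the list of failed checks; an empty list means the gate passes."""
    failures = []
    for name, (check, message) in THRESHOLDS.items():
        if name in metrics and not check(metrics[name]):
            failures.append(message)
    return failures

# Example: a build that would be blocked before release.
print(evaluate_gate({"test_coverage": 0.72, "static_findings": 0, "p95_latency_ms": 150}))
```

The gate is cheap to build and run, which makes it one of the highest-leverage investments for a team on a minimal budget.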
Feedback loops set up across the engineering team are key to surfacing information early so issues can be resolved quickly. Even during releases, if you run canary releases based on client risk appetite and feed the results back to teams, issues will be resolved sooner.
Manage technical debt aggressively: I've often found that debt created by short-term business goals has a multiplying effect the longer it stays put. It needs to be prioritized and managed on a regular cadence; otherwise it's likely to be the area that causes the most quality issues.


