In this episode, Marc and Darren discuss the update that crashed eight million PCs, highlighting the importance of testing, transparency, and a learning culture. Join in the conversation at The DEVOPS Conference in Copenhagen and Stockholm and experience a fantastic group of speakers.

[Darren] (0:05 - 0:14)

If there is a kind of silver lining to be taken from the whole CrowdStrike situation, this should be the template for root cause analysis.

[Marc] (0:19 - 0:29)

Welcome to the DevOps Sauna, the podcast where we dive deep into the world where software security and platform engineering converge with the principles of DevOps.

[Darren] (0:30 - 0:33)

Enjoy some steam and let's forge some new ideas together.

[Marc] (0:41 - 1:09)

We're back in the sauna. Hello, Darren. 

Afternoon, Marc.

It's always afternoon somewhere, but it's turning into autumn. On the CrowdStrike situation, I think the dust has settled; the root cause analysis has now been published, and I think you've been studying it. So let's dive back into the situation and put the final nail in the coffin.

Would you like to remind us, Darren, what CrowdStrike is and basically what happened?

[Darren] (1:09 - 1:59)

So CrowdStrike, quite simply, is a security vendor. They put out various pieces of security software. The one we've all, I think, been most interested in over the last month is their Falcon sensor, which was essentially installed on a great deal of Windows endpoints around the world... I'm not exactly clear on the date.

Was it at the end of June or in July? Well, basically, an update was pushed on July 19th, 2024, and overnight, at least in Europe, we woke up and it seemed a large portion of the internet was down due to CrowdStrike. We saw some fairly wide-scale impact on airports and transit authorities across huge parts of the world.

[Marc] (1:59 - 2:34)

So essentially, 8 million Microsoft Windows computers blue-screened due to a configuration update from CrowdStrike. And you know, one of the things that made me feel bad about this is that you've got a company that is out there trying to make everybody's computers more secure, they have a certified product, and mistakes happen. But this mistake happened to not only paralyze 8 million computers, it also stranded people all over the world, and who knows how many different services impacted people in how many different ways.

[Darren] (2:35 - 4:24)

And it's one of those things where its impact has given it much more wide-scale visibility. The story actually goes back to 2011, because in 2011 there was this EU antitrust case against Microsoft, because Microsoft had the only piece of security software running in the Windows kernel space. The Windows Security Center was the kernel-level security system, and the EU said, no, you have to allow other programs to operate in that kernel space.

And that's completely valid. You can't lock security programs out of a specific area. It's problematic, but we actually see these kinds of crashes all the time.

The actual issue is that the Falcon sensor software expects 20 input fields, and the software update that was pushed gave data on 21. So it's a simple mismatch, and these things happen all the time. But when they happen in regular software, the software crashes: you get a little spinning beach ball if you're on Mac, or a little error message if you're on Windows saying Windows is searching for a solution to this problem, which it invariably doesn't find, but then you open the program again and go on with your day.

But because security software can operate in kernel space, this kind of error was causing these blue screens. So we have to kind of be on the side of CrowdStrike here: these things happen, and they happen all the time. They happen even to the most rigorous testers. In my opinion, there will always be some edge case that hasn't been considered, they'll hit it, and they'll think, well, why didn't we consider this before?

And it's like, well, hindsight is 20/20.
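
To make the mismatch Darren describes concrete, here is a minimal sketch, in Python with made-up names (the real Falcon sensor is native kernel-mode code, not Python): a parser that blindly copies update fields into a fixed 20-slot structure falls over when a 21st field arrives, while a version with basic input validation rejects the malformed content instead of crashing its host.

```python
# Hypothetical sketch only, not CrowdStrike's code: a parser that copies
# incoming update fields into a fixed-size, 20-slot structure.

EXPECTED_FIELDS = 20

def load_unchecked(fields):
    # No validation: the 21st value indexes past the end of the buffer.
    # In user space that's one crashed process; in kernel-mode code the
    # equivalent out-of-bounds access takes the whole machine down.
    buffer = [None] * EXPECTED_FIELDS
    for i, value in enumerate(fields):
        buffer[i] = value
    return buffer

def load_checked(fields):
    # Defensive version: validate the count before touching the buffer,
    # so malformed content becomes a recoverable error, not a crash.
    if len(fields) != EXPECTED_FIELDS:
        raise ValueError(f"expected {EXPECTED_FIELDS} fields, got {len(fields)}")
    return list(fields)

update = [f"field-{i}" for i in range(21)]  # the pushed update carries 21 fields

try:
    load_unchecked(update)
except IndexError:
    print("unchecked loader crashed on the malformed update")

try:
    load_checked(update)
except ValueError as err:
    print(f"checked loader rejected the update cleanly: {err}")
```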

[Marc] (4:24 - 5:27)

You know, a little anecdote here. When I talk to people about career choices and what they want to do with their lives, I often say there are two jobs I think everybody in the world should have. One of them is the person who stands in the supermarket or the shopping mall and tries to sell you something as you walk by: excuse me, excuse me, pardon me, would you like to hear about... because that teaches you persistence, and it teaches you about conversion rates and things like that. But software engineer is the other one, because the greatest software engineers in the world create bugs every day.

And of course the greatest ones create fewer than the less experienced ones. But one of the things, given how you described the problem space here, is that oftentimes services crash and just automatically restart, and you never even notice, unless it's running in the kernel space and doesn't allow the kernel to continue. So what happened here? I guess we've got the root cause analysis, so what could have been done differently in this case that would have produced a different result?

[Darren] (5:27 - 6:29)

Well, there are several things. The first thing I think we should address is their solution. Their fix went out extremely quickly, with, you know, the count set back to 20.

But what they actually suggested was just restarting your computer a bunch of times. They said, I think it was up to 15 times: please restart your computer up to 15 times. And the point of that was to see if your computer could pull the update before the system crashed again.

So they were basically saying, well, keep trying it, maybe it'll work. And I can't decide, and it's something I want your opinion on actually, whether this is a good solution, because in my opinion it acts like a kind of high-pass filter: it would have resolved some people's issues.

It wasn't an ineffective fix across the board, but it wasn't a fully effective one either. They reduced their number of support cases and did so with very little work, but it also left a lot of people feeling quite dumb.
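
The "keep rebooting" advice is easier to judge with a bit of arithmetic. Assuming, purely for illustration, a fixed chance per boot of pulling the corrected content before the machine blue-screens again, repeated reboots compound that chance, which is why the advice fixed some machines and left others stuck:

```python
# Illustrative numbers only: the per-boot probability depended on network
# speed and timing, and 0.2 here is a made-up value.

def chance_of_recovery(p_per_boot: float, reboots: int) -> float:
    # Probability that at least one of `reboots` attempts pulls the fix
    # before the faulty content is loaded and the machine crashes again.
    return 1.0 - (1.0 - p_per_boot) ** reboots

for reboots in (1, 5, 15):
    print(f"{reboots:>2} reboots: {chance_of_recovery(0.2, reboots):.0%}")
# With p = 0.2 per boot: roughly 20%, 67%, and 96% respectively.
```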

[Marc] (6:29 - 8:09)

So essentially we had a race condition: it was difficult to get the update in before the kernel would crash again with the existing software. One of the things about software that's so fascinating to me is that when you have something that's been up and running for a long time, you might have lots of variables and data in states that no developer could ever have foreseen. And a race condition is a really common problem that we've had in multi-threaded software.

I'm aware of these types of problems from back in the nineties, when we had to do all kinds of interesting things. And you know, how often do we talk about mutexes today and stuff like this? Not that much in my work. But if it's truly just a race condition, then they found that this type of issue now creates a new problem.
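
As a quick aside on the race conditions and mutexes Marc mentions, here is a textbook illustration in Python, generic and unrelated to the Falcon code: an unprotected read-modify-write that threads can interleave, and the mutex that serializes it.

```python
import threading

counter = 0
lock = threading.Lock()

def increment_unsafely(n):
    global counter
    for _ in range(n):
        counter += 1  # read-modify-write: interleaved threads can lose updates

def increment_safely(n):
    global counter
    for _ in range(n):
        with lock:     # the mutex serializes the read-modify-write
            counter += 1

threads = [threading.Thread(target=increment_safely, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000 with the lock; the unsafe version can come up short
```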

And the solution to that is: turn it off and on many times, and hopefully you will get through that one update. And I can imagine how many smart-ass managers around the world were like, oh yeah, turn it off and on.

I could have done that. And it's like, well, to actually understand that there was a race condition occurring, and that one way to beat it might be power cycling the machine at a certain rate in order to get the update... hey, if it gets some people back online, that's great.

I would expect that they took preventative measures against this in a lot of different ways. Do we know anything about those? What preventative measures are being put in place to prevent this type of thing from happening again?

And I don't just mean let's test it more and then do canary releases and things like that, which I expect they're doing anyway.

[Darren] (8:10 - 9:11)

Do you think they're doing this? I mean, there are like three different things; you mentioned testing and canary releases. I do think we should open those up a little bit, because if we're talking about preventative measures: if you expect something to have 20 inputs and it has 21 inputs, this is, in my opinion, kind of an obvious test case, don't you think?

You might, for example, test it with 99 inputs or 99 million inputs. These are all standard black-box testing methodologies. So the idea that they would receive 21 inputs and that it would cause a kernel panic, because the program crashed and so the kernel failed, is surprising.

So yeah, I think the first thing that was missing was error handling.

And it's kind of difficult to see why this didn't come up in a DevOps scenario. In my opinion, this is a very normal test case.
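
A sketch of the kind of boundary-value, black-box test cases being described, using pytest against a hypothetical 20-field parser (stand-in code, not CrowdStrike's): anything other than exactly 20 fields should fail loudly rather than crash.

```python
import pytest

EXPECTED_FIELDS = 20

def parse_fields(fields):
    # Stand-in for the content parser under test.
    if len(fields) != EXPECTED_FIELDS:
        raise ValueError(f"expected {EXPECTED_FIELDS} fields, got {len(fields)}")
    return list(fields)

# In practice you would also probe extreme sizes (the "99 million" case),
# empty input, and malformed field values, not just off-by-one counts.
@pytest.mark.parametrize("count", [0, 1, 19, 21, 99])
def test_wrong_field_counts_are_rejected(count):
    with pytest.raises(ValueError):
        parse_fields(["x"] * count)

def test_exact_field_count_is_accepted():
    assert len(parse_fields(["x"] * EXPECTED_FIELDS)) == EXPECTED_FIELDS
```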

[Marc] (9:11 - 10:41)

When we have test automation, we put gates into place in our CI/CD systems that prevent me from putting through a change that hasn't had at least my own oversight, a code review, and my unit tests, and preferably some other gates in the CI pipeline for integration, module, or contract testing. I can see this getting missed. I advocate a lot for ad hoc testing, or exploratory testing, to use the proper term.

And exploratory testing should result in requirements or backlog items for new test automation. So for all of those testers we have employed over the years, as we move towards more test automation, there's still a role. And that role is to perform exploratory testing, even when you're doing things like contract testing: try different values in APIs, try something really off the wall, put an image into a text file descriptor, things like that, in order to plug these holes.

So I'm not completely surprised. I would even expect to find this in a sophisticated organization with a high level of test automation, because we come to rely on those tests, and we only really maintain them or add tests for new functionality, rather than going back and revisiting: hey, did I really test this thing thoroughly enough with the parameters I put in before?

[Darren] (10:42 - 11:58)

I don't disagree. And I think there's actually a coefficient to this equation that people aren't really considering, which is the requirement for speed. They are a security company.

If, for example, a new log4j happened tomorrow, how quickly do you think people would want patches? They would want patches immediately. They would want patches right then and there.

And this happens every time we see a critical vulnerability: we will receive emails extremely quickly asking what the plans are and how quickly we can react. In a lot of situations, there's almost no time to even develop plans before people are expecting results.

So we can talk about testing, and the fact that it is strange that they didn't have any direct-to-machine testing, something like actually installing it on an endpoint, which from what I understood would have caused an immediate crash. But these things slow down the speed of delivery, and one of DevOps's core tenets has always been, in my opinion, speeding up delivery, and nowhere, in my opinion, is that more important than in security.

[Marc] (11:58 - 13:14)

In DevOps, we talk about vertical slices, we talk about end to end, and we talk about gates quite a lot. I've always been of the signal-through variety, which is that if I'm writing software that is supposed to run on a machine, then whatever software I write on day one should preferably enable testing directly on a target hardware, target architecture machine. I don't care if it's embedded in, you know, elevators or smart vehicles, or just a web app or a mobile app.

It should run on target hardware as soon as possible. And this level of smoke testing, you know, literally, where did smoke testing come from? Well, there's an awful lot of smoke inside of that integrated circuit chip or CPU.

And the goal is to keep the smoke inside, right? It's magic smoke. Once you let it out, it disappears.

It doesn't work anymore. So having this type of smoke testing end to end, which allows the signal to get through from the developer's desktop to a target architecture, is to me the most important core tenet of any type of DevOps that we have. And it's certainly a target in many of our transformation activities with our customers: to make sure that they facilitate this type of testing.
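
A sketch of the "signal through to target hardware" smoke test Marc is describing: install the real build on a disposable machine of the target architecture, reboot it, and confirm it comes back before anything ships. The functions here are hypothetical stand-ins for whatever lab or virtualization tooling a team actually has.

```python
def provision_target() -> str:
    """Hypothetical: boot a clean machine matching the target architecture."""
    raise NotImplementedError

def install_build(host: str, artifact: str) -> None:
    """Hypothetical: install the freshly built artifact on the target."""
    raise NotImplementedError

def reboot_and_wait(host: str, timeout_s: int = 300) -> bool:
    """Hypothetical: reboot the target and report whether it comes back up."""
    raise NotImplementedError

def smoke_test(artifact: str) -> bool:
    host = provision_target()
    install_build(host, artifact)
    # The whole point: if the build takes the machine down, the pipeline
    # fails here, before the content reaches a single real endpoint.
    return reboot_and_wait(host)
```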

[Darren] (13:14 - 13:48)

I think that's kind of vital, and I do think that's one of the places where it was missing. You say that's important, and I don't disagree, but I also think, you mentioned it already, that canary releases can act as, maybe not an alternative, but a kind of stop-gap method if you don't have that.

Because let's be fair, CrowdStrike pushed this to 8 million different endpoints all at once and crashed all of them. Do you expect that maybe there was some canary testing in place and this was the canary release? Would it have actually been much worse?
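
For reference, the staged rollout idea under discussion usually looks something like the sketch below: push to a small slice of the fleet, watch crash telemetry for a while, and halt before the blast radius grows. The fleet functions are hypothetical stand-ins, and the thresholds are made up.

```python
import time

STAGES = [0.001, 0.01, 0.1, 1.0]   # fraction of the fleet per wave
CRASH_THRESHOLD = 0.001            # abort if more than 0.1% of a wave crashes
SOAK_SECONDS = 3600                # let telemetry accumulate before widening

def deploy_to(fraction: float) -> list[str]:
    """Hypothetical: push the update to this fraction of endpoints."""
    raise NotImplementedError

def crash_rate(hosts: list[str]) -> float:
    """Hypothetical: fraction of these hosts reporting crashes."""
    raise NotImplementedError

def rollout(update_id: str) -> bool:
    for fraction in STAGES:
        wave = deploy_to(fraction)
        time.sleep(SOAK_SECONDS)   # soak before judging the wave
        if crash_rate(wave) > CRASH_THRESHOLD:
            print(f"{update_id}: halting rollout at {fraction:.1%} of the fleet")
            return False           # stop before the blast radius grows
    return True
```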

[Marc] (13:49 - 14:55)

I was thinking the same thing. Oh my God, what if their install base is a billion machines and, oh, 8 million is our canary?

The funny thing here, which you always remind me of, Darren, is that it's not only the speed and the pressure that every working developer in the world is under to take ideas, usually someone else's, and turn them into valuable code. When we're talking in terms of security, does the world understand what shutting down those machines on that one day really meant? And I think, when we were talking before, the last time we saw something kind of like this was McAfee in the 90s, was it 98 or something like that?

A security company crippling or breaking a lot of devices. This doesn't happen very often, and it is one of the costs of having secure software. The reality is these things can happen, and is it better to have this than to have malicious actors taking over 8 million machines and using them to divert resources into a very bad situation?

[Darren] (14:56 - 15:51)

It's the saying you hear at least once at every security convention: the only way to secure a machine is to switch it off and lock it in a cabinet. And security software is the balance we have against that. It's the tug of war we end up playing.

So to be able to use this security software, yeah, we have to be able to talk about the fact that software is never perfect and there are often edge cases. Weirdly, we can reference it with a Harry Potter quote, where Dumbledore says that, being smarter than most people, his mistakes are correspondingly huger. And that's the case here: the CrowdStrike software operates at a lower level of the operating system, so the damage it can do in these unfortunate failures is correspondingly larger.

[Marc] (15:51 - 16:12)

I think we brought this up once before, but it reminds me of our old friend Kelsey Hightower's no code. He has a repository on GitHub under his name, kelseyhightower/nocode. And the best way to write secure and reliable applications is to write no code whatsoever.

So that's another alternative other than disconnecting the machine in order to have a secure environment.

[Darren] (16:12 - 16:30)

Yeah, I think we're going to have to dive further into that one. I still haven't looked at that repository. So it's something I need to investigate.

I do think there is one thing here we should talk about, which CrowdStrike did extremely well. I don't know if you've had a chance to look through the full root cause analysis document that they put out.

[Marc] (16:30 - 16:42)

It's miraculous. It's really, really well done and it's very transparent. And you know, the fact that they did this and they published it and they talk about everything.

I think this was really, really well done.

[Darren] (16:43 - 17:45)

Yeah. And I don't think we can undersell this. They talk about everything.

And maybe this is just me being jaded from dealing with various vendors who think that one paragraph is an acceptable security release: we're fixing something about this thing, don't worry, we've got it all in hand.

There's this kind of troubling lack of transparency that I see extremely frequently in security. So CrowdStrike putting out a 12-page document, giving an exact timeline, showing the exact cause, going through and talking about basically everything that failed and what they are now improving, including a third-party review.

So it's not just an internal thing; they've also called in a third party. And then even going into the technical details, it's extremely refreshing. Like, if there is a kind of silver lining to be taken from the whole CrowdStrike situation, this should be the template for root cause analysis.

It's, in my opinion, that good.

[Marc] (17:46 - 18:36)

One of the things that I want to remind our listeners of is that root cause analysis does not mean you find the one root cause of something. I prefer the term contributing factors, contributing factors analysis, but root cause analysis it is. In this case, they listed six findings and mitigations in the CrowdStrike RCA.

So it's not just a single thing that causes something like this to happen; a single thing doesn't make an airplane fall out of the sky either. So they went into all of the different things that contributed to this fault.

And I mean, these guys went down to what is practically a core dump. Really, really well done. And this level of transparency is something that I think we need to have, especially in the secure software space, so that we can all learn how to prevent malicious actors in the future as best we can.

[Darren] (18:36 - 19:50)

Yeah, definitely. Right. We have this other term, which I think goes well with it: the whole blameless post-mortem.

If you look at root cause analysis, it's very much about targeting. It's about finding one reason, and that doesn't actually play well with a statistic we have in security, which suggests that a huge number of incidents are caused by human error. And if you run it as a blameful root cause analysis, you're basically looking for a person to point a finger at.

And if you're trying to blame people, information will not flow. People will clam up. If you're doing this kind of contributing factor analysis, if you're doing these blameless post-mortems, information is allowed to flow freely because people don't fear reprisal for their mistakes.

So yeah, I think you make an excellent point. We should probably get rid of the term root cause analysis and go for contributing factor analysis, because as we talked about, lack of error handling, lack of testing, lack of canary releases, these are all contributing factors. And as long as we can keep blame out of it, I think that's probably what they did, and probably why they were able to put out such a clear document stating exactly what had gone wrong.

[Marc] (19:51 - 21:41)

The Westrum organizational culture study looked at pathological, bureaucratic, and generative organizations, generative organizations being the ones that focus on learning, not only from their mistakes but from everything, as they go forward. And one of the tenets there is how the messenger is treated.

So we've all heard 'shoot the messenger', which is the pathological way of looking at things. And in the bureaucratic way of looking at things, well, the messenger needs to go to the back of the line. So I don't imagine there was a great deal of bureaucracy required for CrowdStrike to deliver this root cause analysis.

There was probably a hell of a lot of work and teamwork and coordination behind this, but probably not a heavily processed bureaucracy; it's clearly generative. And this is the type of thing that we see in so many organizations and people, in IT and beyond.

And understanding how you train the messenger is one of the most critical things. When something happens, how do you report on it? Who do you tell?

How do you work within the system in order to resolve it? And are you allowed to be as transparent as possible while doing these types of things? Okay, Darren, thank you a great deal for looking at this with me and talking about it today.

I'm always learning from these discussions. And thank you all for listening to our podcast on the CrowdStrike RCA. Thanks for having me, Marc.

It's always a pleasure. We'll now tell you a little bit about who we are. Hi, I'm Marc Dillon, lead consultant at Eficode in the advisory and coaching team.

And I specialize in enterprise transformations.

[Darren] (21:41 - 21:48)

Hey, I'm Darren Richardson, security architect at Eficode. And I work to ensure the security of our managed services offerings.

[Marc] (21:48 - 21:55)

If you like what you hear, please like, rate, and subscribe on your favorite podcast platform. It means the world to us.