That huge 'Microsoft outage' probably didn't affect you, but the next one might
How does this happen and how is the next one prevented?
On Friday, a whole lot of Microsoft Windows servers and the services running on them went down for a good portion of the morning. You probably weren't affected much (neither was I), but thousands of corporations and businesses were, including airlines and railways, bringing transportation and other services to a standstill.
Needless to say, it was messy and will end up costing the companies affected millions. Messy, expensive technical blunders are fascinating to me and one of the things I think is always worth exploring more. At the risk of sounding like the proverbial Monday morning quarterback, let's have a look at this one.
One of the web's longest-running tech columns, Android & Chill is your Saturday discussion of Android, Google, and all things tech.
While I think the overall blame must be laid at Microsoft's feet, the Redmond giant didn't cause this outage. An optional third-party Windows component from CrowdStrike—another Windows Security vendor—sent out an update that crashed the low-level systems of the affected computers and sent them into the famous Windows blue screen. The only thing Microsoft did wrong was build a system that allows this to happen, but this is also the most important part of what happened.
That should also be your biggest takeaway from this because the next time it happens—and there will be a next time—you could be affected, and it could be much worse. CrowdStrike may have caused this, but it was Microsoft's fault.
How does CrowdStrike factor into all of this?
Let's talk a little more about what CrowdStrike is and why so many big companies use their products. According to the company's website, CrowdStrike has "redefined security", securing "the most critical areas of risk – endpoints and cloud workloads, identity, and data." I am definitely not a Windows security professional but I can recognize a sales pitch when I see one.
I'm sure the software offers an important service. I'm equally sure that the decision to use what CrowdStrike offers is as much a financial one as a technical one, if not more so. Salesmen exist because they are good at selling a good or service, and if the service is legitimate, it's a lot easier to do.
I have no problem with an entrepreneur finding a way to get the corporate world to buy into their product. I do find two things very concerning here.
Firstly, and most importantly, if CrowdStrike offers something so important, why is it not already a part of Windows Server? Microsoft is one of the biggest, and dare I say best, software companies in the world. If there is a legitimate need for a product like the ones CrowdStrike offers, Microsoft could provide it themselves. With Windows Server licensing being so expensive, it probably should be provided.
My next concern is how an optional piece of software can get such low-level OS access and cripple a machine if it's corrupt or misconfigured. Microsoft should never allow software from another company to hijack its operating system this way.
This is why I'll place the blame for this particular outage on Microsoft even though the company did nothing to directly cause it. I'm always going to hold the best companies to higher standards.
Neither of these ideas is crazy or new. I guarantee that engineers at Microsoft knew this could happen, looked at how it could be prevented, and analyzed what the company needed to do to "fix" it. It's trendy to hate on the company, but Microsoft is one of the best companies in the world when it comes to computing, both at the edge and in the cloud. Even if you're not a fan of its products, you can easily see this. Critical infrastructure depends on Microsoft because it is so good at what it does.
What about next time?
Enough with the amateur analysis, though. This is all concerning because we got off easy this time. Yes, your flight got canceled if you were traveling today, and maybe you had no cell service on your new phone for a few hours this morning. If you were lucky, you got to slack off instead of work at your office this morning. If you're unlucky, you get to spend the weekend repairing the damage the outage caused to your IT department.
What if, the next time, the national power grid goes down? Imagine an entire country in the dark for an extended amount of time because of a misconfigured kernel module from a third-party vendor. I know there are multiple fail-safes in place to prevent anything like this, but you should never say never.
More realistically, what if the next global outage affects mobile devices? Forget the inconvenience of Gmail or iMessage going down and instead imagine every Android phone, iPhone, or Surface laptop crapping out for a few hours. It's easy to say it would be an opportunity to go outside and get some much-needed fresh air, but billions and billions of dollars would be lost, and entire companies would go bankrupt because of it.
I'm certain that incidents like what happened this week are great educational tools and help prevent a more serious incident from happening. I hope the right people—the ones who control the purse strings—use them as a learning opportunity.
Jerry is an amateur woodworker and struggling shade tree mechanic. There's nothing he can't take apart, but many things he can't reassemble. You'll find him writing and speaking his loud opinion on Android Central and occasionally on Threads.
Golfdriver97 I can honestly say this: I would hate to be the guy who developed the patch, or the guy who authorized its release. I'm sure both are going to have a rough Monday if they didn't already have a bad day Friday. I'm honestly surprised there was no testing of this patch.
I agree that MS should have some kind of software like this already in place.
davinp I agree that before Microsoft or any third-party vendor releases an update to Windows PCs, they should test it to prevent an incident like this from happening.
JudasD Any company impacted by this also failed at the corporate IT level. IT should have regression-tested the patch before deployment. My work did not deploy the patch; IT experienced issues and did not approve the update.
brian_m85 Microsoft DOES have a way to vet drivers before they are installed. They have to be tested and SIGNED by Microsoft. CrowdStrike bypassed this by having additional code downloaded and loaded by its driver.
Linux is just as vulnerable. Apparently CrowdStrike caused Debian and Rocky Linux to boot loop. When you load 3rd party code into ring 0 you're vulnerable. This is true for Windows and Linux.
SvenJ Note that the CrowdStrike Falcon Sensor module that is tagged as responsible for the MS issues has also been tagged as causing kernel panics in Linux. So, whether you allow low-level access to security packages (which is how they work) because of EU regulations and agreements (MS), or because everything is a lot more secure if everyone has access to everything (Linux), this is a CrowdStrike failure.