In July 2024, the cybersecurity firm CrowdStrike pushed a security software update to one of its products and caused a widespread IT outage that significantly affected a variety of industries, from airlines to hospitals and beyond.
The U.S. Cybersecurity & Infrastructure Security Agency provided real-time updates on the situation, but some of the organizations affected by the outage—and those organizations’ customers—continued to experience significant disruptions for days in what one cyber expert referred to as “the largest IT outage in history.”
So, what exactly happened, and what are the lessons learned? We connected with two University of Maryland Global Campus (UMGC) cyber and IT experts for a Q&A in which they shared their thoughts on the outage and their reflections on what it will mean not only for the cybersecurity and IT industries but also cybersecurity and IT education.
Meet the Cyber & IT Experts
Calvin Nobles, PhD, is dean of the School of Cybersecurity and Information Technology and portfolio vice president at UMGC. Nobles is a retired U.S. Navy officer and has more than 20 years of cybersecurity, military, and academic leadership experience, including serving as senior director of Cyber Support Operations for the U.S. Fleet Cyber Command, senior director and chief cryptologist and security officer for Expeditionary Strike Group 7, and senior director of Global Security Operations for Naval Information Operations Command.
Manish Patel is a collegiate professor with more than 20 years of experience teaching at UMGC. He has extensive experience in the cyber and IT industry and has a master’s degree from Mercer University and a bachelor’s degree from the Georgia Institute of Technology. He also has several industry certifications, including A+, Network+, Microsoft 365 Enterprise Administrator Expert, Microsoft Azure Solutions Architect, and MCSE.
What happened? How did the CrowdStrike software update take entire systems down?
CN: CrowdStrike is a cybersecurity company and Microsoft vendor. One of CrowdStrike’s products, Falcon, is a cloud-based platform used to defend against cyberattacks. CrowdStrike rolled out an update to Falcon that contained a bug, and that faulty update impacted 8.5 million devices. Many of those devices were critical systems used by organizations, so even though 8.5 million is a large number, the impact was significantly larger than the number suggests.
MP: The root cause of the CrowdStrike outage was a sensor configuration update to Windows systems. The outage had a significant impact on various sectors, including commercial flights, hospital operations, financial services, and media broadcasts. The systems primarily affected were those running the Falcon sensor on Windows 10 and above. The faulty update caused a system crash and a “blue screen of death” (BSOD) on affected machines.
What immediate steps did CrowdStrike take to mitigate the impact?
CN: They were able to discover that it was a software bug and inform the entire community of exactly what it was. That helped put everybody at ease, because a lot of people were thinking it was a global cyberattack, and in fact it wasn’t. They were very forthcoming and provided information. They’ve been very transparent, and their communications have been clear, frequent, and effective. A lot of times, organizations struggle to get the right communications out in a timely manner, but CrowdStrike has done an amazingly good job.
MP: CrowdStrike took several steps to address the issue. They engaged with other stakeholders, including Microsoft, Google Cloud Platform (GCP), and Amazon Web Services (AWS), to share awareness of the scope of the impact. They also provided customers with technical guidance and support to safely bring disrupted systems back online.
Did this make CrowdStrike vulnerable to cyberattacks, to malicious actors?
CN: Any time there is something major like a global outage or a cyberattack, you’re going to have malicious actors trying to capitalize on it and socially engineer people to take advantage of the situation. CrowdStrike was leading the way, warning that these websites were popping up and that they were not CrowdStrike-authorized sites. They clearly stated that their customers should work directly with them as their vendor and not use third-party sites or vendors.
MP: An outage like this can have serious security implications. It can expose vulnerabilities in critical infrastructure and demonstrate the potential for widespread disruption from even unintentional errors in cybersecurity software. Additionally, it provides an opportunity for threat actors to exploit the situation. The key takeaway for organizations is to adopt a more holistic approach to security, maintain robust incident response plans, and conduct regular security audits.
How can we prevent something like the CrowdStrike outage from happening again?
MP: It is important to emphasize that this incident was not caused by a cyberattack but rather by a routine update to a configuration file. To prevent such incidents, it is crucial to have robust testing and quality assurance processes in place for software development. In this particular scenario, CrowdStrike deviated from the industry’s established best practices for software updates, specifically the tiered-update process. The tiered-update process, also referred to as a phased or staged rollout, is a strategic approach to software deployment. Its primary objective is to mitigate the risks of deploying a new version by initially releasing the update to a limited user base before broader distribution. This methodical approach is designed to ensure stability and minimize potential disruptions.
To regain customer trust and prevent such incidents in the future, CrowdStrike must provide a comprehensive, transparent explanation of the incident that includes a root cause analysis, improve their testing and deployment processes, ensure robust disaster recovery mechanisms, and maintain transparent communication with their customers. Additionally, establishing strong partnerships with industry peers and sharing information about incident response best practices can lead to a more resilient cybersecurity ecosystem.
CN: I think the steps CrowdStrike is taking now, including the canary process, are key. With the canary process, instead of releasing software to all of your customers at once, you release it to a smaller group and see if that group has any problems with the update. If that group doesn’t have any problems, you release it to a larger group, and so on, until eventually it is rolled out to all of your customers. Taking a phased approach like that is significantly safer and the right thing to do.
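To make the phased approach concrete, here is a minimal sketch in Python of the kind of ring-based rollout logic described above. The ring names, ring sizes, failure threshold, and simulated update call are illustrative assumptions; this is not CrowdStrike’s actual deployment pipeline.

```python
"""Minimal sketch of a phased (canary) rollout. All names, sizes, and
thresholds are assumptions for illustration, not a vendor's real pipeline."""

import random

# Each ring: (name, cumulative fraction of the fleet allowed to have the update)
ROLLOUT_RINGS = [("canary", 0.01), ("early", 0.10), ("broad", 0.50), ("general", 1.00)]
MAX_FAILURE_RATE = 0.001  # halt the rollout if more than 0.1% of updated hosts crash


def push_update(host: str, update_id: str) -> bool:
    """Stand-in for the real agent update call; returns True if the host stays healthy."""
    return random.random() > 0.0005  # assumed per-host crash probability, simulation only


def phased_rollout(fleet: list[str], update_id: str) -> bool:
    """Release the update ring by ring, pausing to check health before expanding."""
    updated, crashed = 0, 0
    for ring, fraction in ROLLOUT_RINGS:
        target_size = int(len(fleet) * fraction)
        while updated < target_size:
            healthy = push_update(fleet[updated], update_id)
            updated += 1
            crashed += 0 if healthy else 1
        failure_rate = crashed / max(updated, 1)
        if failure_rate > MAX_FAILURE_RATE:
            # Stop here: only a small slice of the fleet has the bad update.
            print(f"Halting '{update_id}' at ring '{ring}': failure rate {failure_rate:.3%}")
            return False
        print(f"Ring '{ring}' healthy ({updated} hosts updated); expanding rollout.")
    return True


if __name__ == "__main__":
    phased_rollout([f"host-{i}" for i in range(100_000)], "falcon-content-update")
```

The value of this structure is that a faulty update is caught while it is still confined to a small portion of the fleet, rather than after it has reached every customer.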
But there’s something else they could have done, too. Releasing software updates directly to operational systems is a major flaw in the process. We should look to implement software updates first in a quarantined environment. We call it “the sandbox.” It allows you to go in and experiment with things without touching the systems that are used in the real world. Skipping that step isn’t the norm, but sometimes companies don’t use the sandbox because there’s a push to market, both in terms of providing better products and services and in terms of helping organizations defend against oncoming threats and vulnerabilities within their systems.
CrowdStrike has a capability called Rapid Response Content, which is how they push updates to the Falcon platform very quickly. That content did not go through the traditional quality assurance process. Now they have to go back and apply quality assurance practices to that rapid-response content, so this never happens again.
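As a rough illustration of the sandbox idea, the sketch below (again in Python, with hypothetical check names and a made-up content format) gates a content update behind validation that would run in a quarantined test environment before any customer system sees it.

```python
"""Hedged sketch of a pre-release "sandbox" gate for content updates.
The checks and the data format are assumptions for illustration; they are
not any vendor's internal validators."""

from dataclasses import dataclass


@dataclass
class ContentUpdate:
    name: str
    payload: bytes


def parses_cleanly(update: ContentUpdate) -> bool:
    """Assumed structural check: the payload must be non-empty and not obviously malformed."""
    return len(update.payload) > 0 and not update.payload.startswith(b"\x00")


def survives_sandbox_hosts(update: ContentUpdate) -> bool:
    """Stand-in for loading the update on quarantined test machines and confirming
    they boot and keep running; here it simply reuses the structural check."""
    return parses_cleanly(update)


def release_gate(update: ContentUpdate) -> bool:
    """Only content that passes every sandbox check becomes eligible for release."""
    for check in (parses_cleanly, survives_sandbox_hosts):
        if not check(update):
            print(f"Blocking {update.name}: failed {check.__name__}")
            return False
    print(f"{update.name} passed sandbox validation; eligible for a phased rollout.")
    return True


if __name__ == "__main__":
    release_gate(ContentUpdate("content-update-a", b"\x00\x00\x00\x00"))   # blocked
    release_gate(ContentUpdate("content-update-b", b"well-formed-rules"))  # allowed
```

Combined with a phased rollout, a gate like this gives a malformed update two chances to be caught before it ever reaches production systems.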
How can cyber and IT educators and professors and professionals learn from the CrowdStrike incident?
CN: On the academic side, we can learn from this by using it as a case study, teaching students the right steps to take and processes to follow to prevent a major update from rolling out without sufficient quality assurance. On the industry side, we can all learn from CrowdStrike and see what they come up with and how they approach it. You can be more prepared for incidents like this by looking at your own procedures and processes. What would we do if this happened to us? If we had a major outage, how would we respond? What is our emergency incident response to something of this magnitude? So we can all learn from it, and we can all show some empathy and sympathy toward CrowdStrike, because CrowdStrike is a really good company.
Typically, when an incident like this happens, unless it reaches the magnitude of the global IT outage we’ve just seen, the information is not made public. CrowdStrike has been making its information public to everybody. They quickly published their preliminary investigative report, in which they talk about what caused the outage and the things they are going to do better. They also are going to do a root cause analysis and make that public as well. This is an opportunity for all of us to pay attention to what happened.
MP: The incident underscores the critical need for educators to emphasize real-world scenarios in cybersecurity curricula. By dissecting this high-profile incident, faculty can illuminate the complex interplay of technology, human error, and organizational culture. Incorporating lessons on redundancy, system interdependencies, incident response planning, and continuous learning will help equip students with the practical skills necessary to mitigate risks and respond effectively to future threats. Additionally, fostering a culture of critical thinking and risk assessment will empower students to become more proactive cybersecurity professionals capable of preventing and addressing vulnerabilities.
Is there any concern that releasing all of this information could open organizations to cyber attacks?
CN: Absolutely. Malicious actors are listening, learning, and reading just like we are, but a lot of these individuals are already up to speed on how all of these things work in the first place.
Are these topics covered in UMGC programs and courses?
CN: We do have a class in our cybersecurity management and policy master’s curriculum called Human Factors in Cybersecurity. It’s the perfect course, though not the only course, to talk about this and really get into the details. In the course, we study the Human Factors Analysis and Classification System (HFACS). We don’t just look at the incident’s immediate cause. We look at it organizationally and systematically, because there’s a ton of research telling us that one incident should not lead to something catastrophic. So, what other failures were there within the organization that contributed to this becoming as chaotic as it was?
This is a course that attracts a lot of people from all areas—not just cybersecurity students, but people in psychology, business, marketing, and biotech—who don’t just want to accept that we have weaknesses or vulnerabilities but want to know how we can design around those so that we have a defensible mechanism in place.
You can look at the executives and see if they provided adequate resources, adequate policies. Have they implemented the proper oversight? Have they set the right culture? When you look at it from that perspective, you quickly learn that something catastrophic happening at the bottom or an incident happening at the bottom doesn’t happen in isolation. Human error is the start of the investigation, not the end. At UMGC, we are preparing students to go out and be part of the solutions to these issues, to be part of the teams that are helping develop solutions. In this type of community, everybody’s on the same side, even though we’re working for different companies or organizations. At the end of the day, our job is to protect the organizations and people we work for.