The prominent cybersecurity company CrowdStrike recently issued a public apology after a faulty update to its Falcon Sensor software caused a widespread IT outage that brought many systems to a standstill. The incident, on July 19, affected an estimated 8.5 million Windows PCs globally and disrupted industries including aviation, banking, and media.
CrowdStrike Acknowledges the Mistake
Adam Meyers, CrowdStrike’s Vice President for Counter-Adversary Operations, spoke to a U.S. congressional committee on September 24, acknowledging that the company let customers down because of a perfect storm of events. The update pushed out to the Falcon Sensor software had a fault that caused the blue screen of death (BSOD) on many affected devices. Meyers attributed the incident to a mismatch between input parameters and predefined rules in the new threat detection configurations. The Falcon sensor’s rules engine could not process these configurations properly, and systems malfunctioned until CrowdStrike replaced the problematic settings. The apology underscored the reputational damage to CrowdStrike, a company that has built its name on cybersecurity and system stability. George Kurtz, CrowdStrike’s CEO, was initially asked to testify before the U.S. House Committee on Homeland Security, but Meyers was sent instead to explain the situation once the incident was under control.
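The kind of mismatch Meyers described, where a rule expects inputs that the configuration does not actually supply, can be illustrated with a minimal Python sketch. This is not CrowdStrike's code: the `Rule` class, the field indexes, and both functions are hypothetical names invented for illustration. It only shows the general idea of checking a rule against its inputs before deployment and again at run time.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A hypothetical detection rule that reads input fields by index."""
    name: str
    field_indexes: tuple[int, ...]


def validate_rule(rule: Rule, supplied_field_count: int) -> list[str]:
    """Pre-deployment check: list every field the rule cannot read."""
    problems = []
    for index in rule.field_indexes:
        if index < 0 or index >= supplied_field_count:
            problems.append(
                f"{rule.name}: references field {index}, "
                f"but only {supplied_field_count} fields are supplied"
            )
    return problems


def evaluate_rule(rule: Rule, inputs: list[str]) -> list[str] | None:
    """Runtime check: skip a mismatched rule instead of reading out of range."""
    if validate_rule(rule, len(inputs)):
        return None
    return [inputs[i] for i in rule.field_indexes]
```

In this sketch a rule that asks for a field beyond those supplied is reported during validation and skipped during evaluation, rather than being allowed to fail while it runs.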
Impact on Customers and Industries
The update to CrowdStrike’s Falcon Sensor affected Windows operating systems worldwide, leading to widespread BSOD errors and operational downtime for businesses relying on those systems. Major airlines such as Delta, rail operators, financial institutions, healthcare services, and even media companies faced disruptions. Recovery was especially challenging because it required physical access to affected machines to reboot them and resolve the error. The situation also led to multiple lawsuits against CrowdStrike, including one from Delta Air Lines, which claimed the outage caused thousands of flight cancellations and resulted in an estimated $500 million in losses. CrowdStrike, while managing customer recovery efforts, faced questions about its responsibility and the scope of potential negligence.
Measures to Restore Systems and Prevent Future Incidents
In response to the widespread issues, CrowdStrike worked to restore systems quickly, using automated techniques by July 22 to accelerate remediation and deploying staff to assist directly. The company claims that by July 29, virtually all customer systems were back to normal.
To prevent such incidents from recurring, CrowdStrike has made several improvements:
- To ensure that updates do not disrupt systems, CrowdStrike has put new validation processes in place, aligning configurations and sensor rules before deploying any changes.
- The company has expanded its testing protocols, now covering a wider range of scenarios to detect any issues that may arise before updates are sent to customer devices.
- CrowdStrike has enabled greater customer control over when configuration updates are deployed, allowing them to decide the best time for implementation.
- CrowdStrike has implemented a phased approach to rolling out updates, deploying them gradually to reduce the risk associated with immediate global changes.
- New safeguards, including runtime checks, have been introduced to confirm that all data processed aligns with the system’s expected parameters.
- To improve its internal review process, CrowdStrike has involved independent third-party security experts to carry out code reviews of the Falcon sensor and evaluate the overall quality control mechanisms.
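As a rough illustration of the phased rollout idea in the list above, the following Python sketch assigns hosts to cumulative rollout stages and halts when a stage reports crashes. The stage fractions, function names, and crash threshold are all assumptions made for this example and do not describe CrowdStrike's actual process.

```python
from __future__ import annotations

import hashlib

# Hypothetical rollout stages: cumulative fraction of hosts at each stage.
STAGES = [0.01, 0.10, 0.50, 1.00]


def host_bucket(host_id: str) -> float:
    """Map a host ID to a stable value in [0, 1) using a hash."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    # Keep 53 bits so the division is exact and always below 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def hosts_for_stage(host_ids: list[str], stage: int) -> list[str]:
    """Return the hosts that should have the update at the given stage."""
    threshold = STAGES[stage]
    return [h for h in host_ids if host_bucket(h) < threshold]


def next_stage(stage: int, crash_reports: int, max_crashes: int = 0) -> int | None:
    """Advance only if the current stage stayed healthy; None halts the rollout."""
    if crash_reports > max_crashes:
        return None
    return min(stage + 1, len(STAGES) - 1)
```

Because each host's bucket is derived from a hash of its ID, every stage contains all hosts from the stage before it, so a fault that surfaces early reaches only a small share of machines before the rollout stops.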
Kernel Access Controversy and Future of Security Software
During his testimony, Meyers defended the need for software like Falcon Sensor to have kernel access within Microsoft Windows. The kernel is the core component of an operating system, and kernel access allows software to interact closely with hardware resources and processes. Meyers argued that kernel access is important for visibility, threat prevention, and anti-tampering. The incident has led Microsoft to consider moving some threat-detection updates to user mode, reducing the potential for such system-wide failures. Meyers stated that reducing kernel access could make CrowdStrike’s products less effective at preventing and responding to cyber threats. He pointed out that threat groups such as Scattered Spider, which is known for recent high-profile attacks, often exploit privileges to disable security tools. In his view, kernel-level access therefore remains a necessary tool for counteracting such adversarial tactics.
The CrowdStrike outage was a clear example of the challenge in balancing security and stability. The company acknowledged its fault, apologized, and introduced steps to improve how updates are tested and deployed. While these changes aim to regain trust, the situation shows how important reliable incident response is for systems that are involved in day-to-day operations.
Image credit: Ascannio, Adobe Stock