Lessons from the CrowdStrike Outage
On July 19, 2024 at 04:09 UTC, CrowdStrike released a content update for Windows hosts running Falcon sensor version 7.11 or above. The faulty update caused a system crash and the blue screen of death on affected Windows machines; Mac and Linux hosts running Falcon were untouched. CrowdStrike reverted the problem update at 05:27 UTC, but that was no help to the many Windows machines that had already pulled it down as an automated update during those 78 minutes or so. In effect it penalised the organisations following the golden rules of running the most up-to-date versions of software and installing updates automatically.
The CrowdStrike update was classified as a signature (content) file of urgent importance, so it bypassed the staging rules many organisations use to delay installs until a release is one or two versions behind the most recent (so-called N-1 or N-2 policies). The result: even these cautious operations were affected.
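To illustrate the policy that was bypassed, here is a minimal Python sketch of the kind of N-1/N-2 gate an administrator might configure for sensor releases; the version numbers and function name are hypothetical, and urgent content files took a separate path that never consulted such a check.

```python
# Illustrative sketch of an N-1/N-2 update policy gate.
# Version numbers and function name are hypothetical.

def should_install(candidate: int, latest: int, versions_behind: int = 1) -> bool:
    """Install only releases at least `versions_behind` behind the latest."""
    return candidate <= latest - versions_behind

latest_release = 7310        # most recent sensor build published
pinned_candidate = 7309      # one version behind (N-1)

print(should_install(pinned_candidate, latest_release))  # True: N-1 is allowed
print(should_install(latest_release, latest_release))    # False: latest is held back
# A content file classified as urgent skipped this check entirely.
```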
An affected machine might be recovered by a simple reboot, but if that failed the options were logging into safe mode and deleting the faulty file from the command line, a system restore, or booting from a dedicated USB recovery tool. If the machine was protected by Microsoft's BitLocker disk encryption then additional steps were needed to supply the recovery key, and those keys were of no use if they were stored on a machine locked out by the same outage.
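For reference, the widely circulated manual workaround was to boot into safe mode and remove the faulty channel file from the CrowdStrike driver folder. The sketch below shows the equivalent file operation in Python, assuming the default install path; it would need to be run from safe mode with administrator rights.

```python
# Sketch of the published manual workaround: remove the faulty
# channel file (C-00000291*.sys) from the CrowdStrike driver folder.
# Assumes the default install path; run from safe mode as administrator.
import glob
import os

driver_dir = r"C:\Windows\System32\drivers\CrowdStrike"

for path in glob.glob(os.path.join(driver_dir, "C-00000291*.sys")):
    print(f"Removing {path}")
    os.remove(path)
# Reboot normally afterwards so the sensor can fetch a corrected file.
```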
Although fiddly, this fix is achievable if the operator has physical access to the affected machine. Some remote access programs allow restarting a remote machine in safe mode, but this is not an out-of-the-box Windows 11 option. If the operating system runs as a virtual machine on a remote host, the operation depends on the host itself not being compromised.
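Where an administrator can still reach a command prompt on the machine (between crashes, or via a hypervisor console), safe boot can be flagged from the command line using Windows' standard bcdedit and shutdown tools; a hedged sketch, wrapped in Python for consistency:

```python
# Sketch: flag the next boot as Safe Mode from an existing admin shell,
# then restart. Assumes an administrator session on the affected machine.
import subprocess

# Tell the boot loader to start in minimal Safe Mode on the next boot...
subprocess.run(["bcdedit", "/set", "{default}", "safeboot", "minimal"], check=True)
# ...then restart immediately.
subprocess.run(["shutdown", "/r", "/t", "0"], check=True)
# After the fix, clear the flag: bcdedit /deletevalue {default} safeboot
```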
The problem is not so much how to fix the crash as who is authorised to do it. The more employees trained and available to apply fixes, the faster problems can be overcome. In this example an employee would need at least the password to put a machine into safe mode, and possibly other system passwords as well. If the fix were delivered by USB device, a user might also need to enter the BIOS (hopefully password protected) and change the boot options.

An organisation could use the same password for every machine and release it when required, but that has obvious security loopholes, so each operator would instead need the specific passwords for the machine they are working on. Those could be securely stored and released only when required, with someone then visiting every single affected machine to change the passwords again very soon after the restore. All of these options would be poor choices for all but the smallest organisation; the risk of system passwords being misused afterwards is too great. End users should certainly be trained in simple recovery procedures (beyond turning it off and on again) to ease some of the workload on technical staff, but in this case almost all the heavy lifting involved in fixing the crash would fall on the technical team.
CrowdStrike Falcon is endpoint detection software, designed to monitor system activity and identify possible malware or hacker attacks. It is unfortunate that a product designed to protect systems was responsible for taking them down. China, which is pursuing a policy of relying on home-grown software, was relatively safe from the outage. Many affected end users were not direct customers of CrowdStrike at all: a company might use a cloud-based software service that in turn runs on machines protected by CrowdStrike, so the original user lost access to that service despite having no direct dealings with CrowdStrike. Other systems were unaffected.
Although the disruption was considerable, the benefit of having some sort of foolproof fallback was obvious. A clear example of what not to do is the aftermath of the British Airways IT outage in 2017. Following the July 2024 CrowdStrike failure, some providers reverted to pen-and-paper solutions: at Indira Gandhi International Airport in Delhi, airline boarding passes and luggage tags were filled in manually, and in the UK patients had to get a hand-written prescription from their doctor to collect drugs from the chemist.
There are computer-based alternatives for when a system is down, but all come with additional overhead and security concerns, so they are best reserved for cases where keeping a system live is of the utmost importance; genuine life-or-death situations. If access to data is required and the data itself is still intact (unlike a ransomware lock-out), an alternate connection can be put in place: the data back end accessed through a different front end, and possibly different middleware as well, either web based or running on another operating system such as Linux. Any developer would need to decide whether this alternate route uses its own access keys and passwords or shares them with the regular system. If such a parallel access system is in place it needs regular testing and monitoring, because it could otherwise provide a backdoor for malicious actors.
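As a concrete but entirely hypothetical example, a minimal read-only fallback front end could expose the surviving data store using only Python's standard library; the database path, table, and port below are illustrative assumptions, not anyone's real system.

```python
# Hypothetical minimal read-only fallback front end: serves records from
# the existing back-end database over HTTP using only the standard library.
# The database path, table name, and port are illustrative assumptions.
import json
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer

DB_PATH = "backend.db"  # the surviving data store

class ReadOnlyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Open the database read-only so the fallback cannot modify data.
        conn = sqlite3.connect(f"file:{DB_PATH}?mode=ro", uri=True)
        rows = conn.execute("SELECT id, name FROM records LIMIT 50").fetchall()
        conn.close()
        body = json.dumps([{"id": r[0], "name": r[1]} for r in rows]).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Bind to localhost only; a real deployment would sit behind its own
    # authentication, separate from the primary system's credentials.
    HTTPServer(("127.0.0.1", 8080), ReadOnlyHandler).serve_forever()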
The CrowdStrike outage is a symptom of a global reliance on a few key players. Having a small number of major providers makes updates and compatibility easier, but it creates single points of failure that are almost impossible to avoid. The core lesson is to ensure that some sort of alternative system can be put into place with the minimum of disruption; a system that needs regular testing but will hopefully never see real-world use.