Massive global IT outage

Is there any ongoing issue with Windows globally …?

1 Like

Yes, there is.

1 Like

Yes, ongoing…

1 Like

Windows machines are crashing and showing the blue screen…

1 Like

I have win10 and it works fine.

1 Like

Because it doesn’t have CrowdStrike installed.

1 Like

As per the article, it affects CrowdStrike users only.

1 Like

Or might it be that only Windows 11 is having the issue?

1 Like

Current Action
CrowdStrike Engineering has identified a content deployment related to this issue and reverted those changes.

If hosts are still crashing and unable to stay online to receive the Channel File Changes, the following steps can be used to work around this issue (a scripted sketch of step 3 is included after the list):

Workaround Steps:

  1. Boot Windows into Safe Mode or the Windows Recovery Environment (you can do that by holding down the F8 key before the Windows logo flashes on screen)
  2. Navigate to the C:\Windows\System32\drivers\CrowdStrike directory
  3. Locate the file matching “C-00000291*.sys”, and delete it.
  4. Boot the host normally.
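For step 3, here is a minimal Python sketch of locating and deleting the matching channel file, assuming Python and administrator rights are available in the booted environment; in practice most admins simply deleted the file from the Safe Mode / WinRE command prompt.

```python
# Hedged sketch of workaround step 3: find and delete the faulty channel file.
# Assumes the host is booted into an environment where Python and admin rights
# are available; the official guidance just deletes the file by hand.
import glob
import os

DRIVER_DIR = r"C:\Windows\System32\drivers\CrowdStrike"

for path in glob.glob(os.path.join(DRIVER_DIR, "C-00000291*.sys")):
    print(f"Deleting {path}")
    os.remove(path)
```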
2 Likes

I spoke with our IT manager; he said they are trying not to do this workaround on servers, but they may have no choice…

1 Like

CrowdStrike has identified the issue and reverted the faulty update, but that doesn’t appear to help machines that have already been impacted.

1 Like

CERT-In Advisory: Microsoft Windows systems crippled by CrowdStrike Falcon sensor update

Read more At:
https://www.aninews.in/news/business/cert-in-advisory-microsoft-windows-systems-crippled-by-crowdstrike-falcon-sensor-update20240719145431/

1 Like

CrowdStrike Case Executive Summary:

On July 19, 2024, CrowdStrike experienced a significant issue that caused a global IT outage affecting many of its customers. Here’s a summary of what happened:

  1. The incident was caused by a defective content update for CrowdStrike’s Falcon sensor on Windows hosts.

  2. This update caused Windows systems to experience bugchecks or “blue screen of death” errors, rendering many devices inoperable.

  3. The issue primarily affected Windows hosts running Falcon sensor versions 7.15 and 7.16.

  4. CrowdStrike emphasized that this was not a security incident or cyber attack, but rather a software update problem.

  5. The company identified, isolated, and deployed a fix for the issue.

  6. As a workaround, CrowdStrike advised users to boot Windows in safe mode or the Windows Recovery Environment and delete a specific file.

  7. The incident highlighted the risks associated with relying heavily on a single vendor for cybersecurity solutions.

  8. CrowdStrike’s CEO, George Kurtz, stated that the company was actively working with impacted customers and that Mac and Linux hosts were not affected.

This event underscores the importance of having backup plans and the potential widespread impact of issues with widely-used security software.

1 Like

5 Learnings from the Microsoft Cloud Outage!

(1) Mass Upgrades are Dangerous & should be moderated through a gradual rollout process.

A software patch rollout should not be carried out at mass scale across all sites at once.

In carrier-grade telecom networks, for example, we follow the First Office Application (FOA) process, where the upgrade is first rolled out to a few clusters of the product at a SINGLE geographical or customer site.

Only after a 48-hour soak period do we apply the patch to the other instances at the same site, observing as we go, and that upgrade too is carried out in batches.

We apply the software upgrade to other sites only once there is confirmation of no collateral damage at the FOA site.

Automation of upgrades should not mean pushing the software to all global locations concurrently without a feedback loop.
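A minimal sketch of such a gated, batched rollout is below, assuming hypothetical `deploy_to()` and `crash_rate()` hooks into fleet tooling; none of this reflects any vendor’s actual pipeline.

```python
# Illustrative First Office Application (FOA) style rollout: deploy to one pilot
# site, soak, check telemetry, then continue to the rest of the fleet in batches.
# deploy_to() and crash_rate() are hypothetical stand-ins for real fleet tooling.
import time

SOAK_PERIOD_HOURS = 48
CRASH_RATE_THRESHOLD = 0.01  # halt if more than 1% of hosts at a site crash

def deploy_to(site: str, version: str) -> None:
    print(f"Deploying {version} to {site}")   # placeholder deployment call

def crash_rate(site: str) -> float:
    return 0.0                                # placeholder telemetry query

def staged_rollout(version: str, foa_site: str, other_sites: list[str], batch_size: int = 5) -> None:
    # 1. Upgrade the FOA site only and let it soak, with a feedback loop.
    deploy_to(foa_site, version)
    time.sleep(SOAK_PERIOD_HOURS * 3600)      # stand-in for a 48-hour soak window
    if crash_rate(foa_site) > CRASH_RATE_THRESHOLD:
        raise RuntimeError("FOA soak failed; halting rollout")
    # 2. Only then roll out to the remaining sites, in batches, checking between each.
    for i in range(0, len(other_sites), batch_size):
        batch = other_sites[i:i + batch_size]
        for site in batch:
            deploy_to(site, version)
        time.sleep(3600)                      # shorter observation window per batch
        if any(crash_rate(site) > CRASH_RATE_THRESHOLD for site in batch):
            raise RuntimeError(f"Crash spike in batch {batch}; halting rollout")
```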

(2) A First Office Application upgrade requires pre- and post-analysis for all software versions being upgraded.

It is possible that the software being upgraded is running at different versions on different geographical sites.

During the FOA process, there should be a proper pre- and post-upgrade analysis to ascertain the impact of the upgrade during the soak period.

This can catch issues such as intermittent crashes / core dumps (e.g., a Windows blue screen is accompanied by a crash dump) or memory leaks that lead to “slowdown” and “hanging” of the software.

This helps isolate side effects early and pause further upgrades when the software is faulty, preventing the problem from spreading.
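One way to frame the pre- and post-analysis is a simple metric diff over the soak window; `collect_metrics()` and the metric names below are hypothetical placeholders for whatever monitoring stack is in use.

```python
# Illustrative pre/post soak analysis: compare host health metrics captured before
# the upgrade with metrics captured at the end of the soak period, and flag regressions.
# collect_metrics() is a hypothetical stand-in for the real monitoring stack.

REGRESSION_TOLERANCE = 0.10  # flag any metric that worsens by more than 10%

def collect_metrics(site: str) -> dict[str, float]:
    # e.g. {"crash_count": 0.0, "mem_usage_mb": 512.0, "boot_failures": 0.0}
    return {}

def post_upgrade_regressions(pre: dict[str, float], post: dict[str, float]) -> dict[str, float]:
    """Return the metrics that degraded beyond tolerance during the soak period."""
    regressions = {}
    for name, before in pre.items():
        after = post.get(name, before)
        baseline = max(before, 1e-9)          # avoid division by zero on zero baselines
        if (after - before) / baseline > REGRESSION_TOLERANCE:
            regressions[name] = after
    return regressions

# Usage: capture metrics before the upgrade, soak, capture again, then decide.
pre = collect_metrics("foa-site")
# ... upgrade and soak happen here ...
post = collect_metrics("foa-site")
if post_upgrade_regressions(pre, post):
    print("Pause the rollout and investigate before touching other sites")
```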

(3) Automated Rollback after a crash via backed-up software.

Once crashes are discovered during an FOA upgrade, there needs to be an automated process to fall back to the original software version.

This requires backing up the entire application binary, configuration, and databases prior to attempting an upgrade.

The rollback should commence automatically as soon as a core dump is observed, or on human action at the central level (through automation).
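A hedged sketch of that rollback trigger: snapshot the current version before upgrading, then restore it automatically if a crash is observed during the soak. `snapshot()`, `restore()`, `upgrade()`, and `crash_detected()` are illustrative placeholders, not any real product’s API.

```python
# Illustrative automated-rollback wrapper around an upgrade: back up binaries,
# config, and databases first, then restore them if a crash is seen during soak.
# snapshot(), restore(), upgrade(), and crash_detected() are hypothetical helpers.
import time

def snapshot(site: str) -> str:
    return f"{site}-backup"        # placeholder: back up binaries, config, databases

def restore(site: str, backup_id: str) -> None:
    print(f"Rolling {site} back to {backup_id}")  # placeholder restore

def upgrade(site: str, version: str) -> None:
    print(f"Upgrading {site} to {version}")       # placeholder upgrade

def crash_detected(site: str) -> bool:
    return False                   # placeholder: query core-dump / bugcheck telemetry

def upgrade_with_rollback(site: str, version: str, soak_seconds: int = 3600) -> None:
    backup_id = snapshot(site)
    upgrade(site, version)
    deadline = time.time() + soak_seconds
    while time.time() < deadline:
        if crash_detected(site):
            restore(site, backup_id)      # automatic fallback, no human in the loop
            raise RuntimeError(f"Crash detected on {site}; rolled back to {backup_id}")
        time.sleep(60)
```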

(4) Real-Time fault management and anomaly detection via Machine Learning.

It is surprising that Microsoft and CrowdStrike learned about the issue only when customers reported the crashes.

As these upgrades were pushed in bulk, thousands of Windows machines would have gone down within a short period of time.

This should have been caught by their fault-management systems as a major anomaly before customers reported failures.
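Even a crude fleet-level check would flag the pattern: compare the crash count in the current minute against a rolling baseline. The telemetry feed here is a hypothetical stand-in for a real fault-management or ML pipeline; this only shows the idea.

```python
# Illustrative anomaly check on fleet crash telemetry: flag an anomaly when the
# latest crash count is far above the rolling baseline.
from collections import deque
from statistics import mean, pstdev

class CrashRateMonitor:
    def __init__(self, window: int = 60, sigma: float = 5.0):
        self.history = deque(maxlen=window)  # crash counts per minute
        self.sigma = sigma

    def observe(self, crashes_this_minute: int) -> bool:
        """Record one minute of crash telemetry; return True if it is anomalous."""
        anomalous = False
        if len(self.history) >= 10:
            baseline = mean(self.history)
            spread = pstdev(self.history) or 1.0
            anomalous = crashes_this_minute > baseline + self.sigma * spread
        self.history.append(crashes_this_minute)
        return anomalous

# Example: a sudden jump from single digits to thousands of crashes per minute.
monitor = CrashRateMonitor()
for minute, crashes in enumerate([3, 2, 4, 3, 2, 3, 4, 2, 3, 3, 2, 4100]):
    if monitor.observe(crashes):
        print(f"Anomaly at minute {minute}: {crashes} crashes/minute")
```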

(5) Disaster Recovery architecture and processes were missing.

Even while Microsoft and CrowdStrike were working to recover the primary systems, there was no process for moving mission-critical workloads to a disaster recovery site.

It is reported that emergency 911 services were down in several US states, and hospitals were also impacted.

Customers whose work saves lives should have been moved to a DR cloud site to restore their services immediately.
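At the level of a single workload, the failover decision can be sketched as below; `primary_healthy()` and `activate_dr()` are hypothetical hooks, and real DR also involves data replication, traffic steering, and runbooks well beyond this sketch.

```python
# Illustrative disaster-recovery failover loop for a mission-critical workload:
# if the primary site stays unhealthy past a short grace period, activate the DR site.
# primary_healthy() and activate_dr() are hypothetical stand-ins for real tooling.
import time

GRACE_PERIOD_SECONDS = 300  # don't fail over on a transient blip

def primary_healthy(workload: str) -> bool:
    return True                 # placeholder: health checks against the primary site

def activate_dr(workload: str) -> None:
    print(f"Activating DR site for {workload}")  # placeholder: promote DR replicas, switch traffic

def watch_and_failover(workload: str) -> None:
    unhealthy_since = None
    while True:
        if primary_healthy(workload):
            unhealthy_since = None
        else:
            unhealthy_since = unhealthy_since or time.time()
            if time.time() - unhealthy_since > GRACE_PERIOD_SECONDS:
                activate_dr(workload)
                return
        time.sleep(30)
```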

Link: :point_down: