Massive global IT outage

Is there any ongoing issue with Windows globally …?

1 Like

Yes, there is.

1 Like

Yes, ongoing…

1 Like

Windows machines are crashing and showing the blue screen…

1 Like

I have win10 and it works fine.

1 Like

Because it doesn’t have CrowdStrike installed.

1 Like

As per the article, it affects CrowdStrike users only.

1 Like

Or might it be that only Windows 11 is having the issue?

1 Like

Current Action
CrowdStrike Engineering has identified a content deployment related to this issue and reverted those changes.

If hosts are still crashing and unable to stay online to receive the Channel File Changes, the following steps can be used to work around this issue (a scripted sketch of step 3 is included after the list):

Workaround Steps:

  1. Boot Windows into Safe Mode or the Windows Recovery Environment (you can do that by holding down the F8 key before the Windows logo flashes on screen)
  2. Navigate to the C:\Windows\System32\drivers\CrowdStrike directory
  3. Locate the file matching “C-00000291*.sys”, and delete it.
  4. Boot the host normally.
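For step 3, here is a minimal Python sketch of locating and deleting the matching channel file, assuming Python and administrator rights are available in the booted environment; in practice most admins simply deleted the file from the Safe Mode / WinRE command prompt.

```python
# Hedged sketch of workaround step 3: find and delete the faulty channel file.
# Assumes the host is booted into an environment where Python and admin rights
# are available; the official guidance just deletes the file by hand.
import glob
import os

DRIVER_DIR = r"C:\Windows\System32\drivers\CrowdStrike"

for path in glob.glob(os.path.join(DRIVER_DIR, "C-00000291*.sys")):
    print(f"Deleting {path}")
    os.remove(path)
```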
2 Likes

I spoke with our IT manager; he said they are trying not to do this workaround on servers, but they may have no choice…

1 Like

CrowdStrike has identified the issue and reverted the faulty update, but that doesn’t appear to help machines that have already been impacted.

1 Like

CERT-In Advisory: Microsoft Windows systems crippled by CrowdStrike Falcon sensor update

Read more At:
https://www.aninews.in/news/business/cert-in-advisory-microsoft-windows-systems-crippled-by-crowdstrike-falcon-sensor-update20240719145431/

1 Like

CrowdStrike Case Executive Summary:

On July 19, 2024, CrowdStrike experienced a significant issue that caused a global IT outage affecting many of its customers. Here’s a summary of what happened:

  1. The incident was caused by a defective content update for CrowdStrike’s Falcon sensor on Windows hosts.

  2. This update caused Windows systems to experience bugchecks or “blue screen of death” errors, rendering many devices inoperable.

  3. The issue primarily affected Windows hosts running Falcon sensor versions 7.15 and 7.16.

  4. CrowdStrike emphasized that this was not a security incident or cyber attack, but rather a software update problem.

  5. The company identified, isolated, and deployed a fix for the issue.

  6. As a workaround, CrowdStrike advised users to boot Windows in safe mode or the Windows Recovery Environment and delete a specific file.

  7. The incident highlighted the risks associated with relying heavily on a single vendor for cybersecurity solutions.

  8. CrowdStrike’s CEO, George Kurtz, stated that the company was actively working with impacted customers and that Mac and Linux hosts were not affected.

This event underscores the importance of having backup plans and the potential widespread impact of issues with widely-used security software.

1 Like

5 Learnings from the Microsoft Cloud Outage!

(1) Mass Upgrades are Dangerous & should be moderated through a gradual rollout process.

A software patch rollout should not be carried out at mass scale across all sites at once.

In carrier-grade telecom networks, for example, we follow the First Office Application (FOA) process, where the upgrade is first rolled out to a few clusters of the product at a SINGLE geographical or customer site.

Only after a 48-hour soak period do we apply the patch to the other instances at the same site, observing as we go, and that upgrade too is carried out in batches.

We apply the software upgrade to other sites only once there is confirmation of no collateral damage at the FOA site.

Automation of upgrades should not mean pushing the software to all global locations concurrently without a feedback loop.
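A minimal sketch of such a gated, batched rollout is below, assuming hypothetical `deploy_to()` and `crash_rate()` hooks into fleet tooling; none of this reflects any vendor’s actual pipeline.

```python
# Illustrative First Office Application (FOA) style rollout: deploy to one pilot
# site, soak, check telemetry, then continue to the rest of the fleet in batches.
# deploy_to() and crash_rate() are hypothetical stand-ins for real fleet tooling.
import time

SOAK_PERIOD_HOURS = 48
CRASH_RATE_THRESHOLD = 0.01  # halt if more than 1% of hosts at a site crash

def deploy_to(site: str, version: str) -> None:
    print(f"Deploying {version} to {site}")   # placeholder deployment call

def crash_rate(site: str) -> float:
    return 0.0                                # placeholder telemetry query

def staged_rollout(version: str, foa_site: str, other_sites: list[str], batch_size: int = 5) -> None:
    # 1. Upgrade the FOA site only and let it soak, with a feedback loop.
    deploy_to(foa_site, version)
    time.sleep(SOAK_PERIOD_HOURS * 3600)      # stand-in for a 48-hour soak window
    if crash_rate(foa_site) > CRASH_RATE_THRESHOLD:
        raise RuntimeError("FOA soak failed; halting rollout")
    # 2. Only then roll out to the remaining sites, in batches, checking between each.
    for i in range(0, len(other_sites), batch_size):
        batch = other_sites[i:i + batch_size]
        for site in batch:
            deploy_to(site, version)
        time.sleep(3600)                      # shorter observation window per batch
        if any(crash_rate(site) > CRASH_RATE_THRESHOLD for site in batch):
            raise RuntimeError(f"Crash spike in batch {batch}; halting rollout")
```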

(2) A First Office Application upgrade requires pre- and post-analysis for all software versions being upgraded.

It is possible that the software being upgraded is running at different versions on different geographical sites.

During the FOA process, there should be a proper pre- and post-upgrade analysis to ascertain the impact of the upgrade during the soak period.

This can catch issues such as intermittent crashes / core dumps (e.g., a Windows blue screen is accompanied by a crash dump) or memory leaks that lead to “slowdown” and “hanging” of the software.

This helps isolate side effects early and pause further upgrades when the software is faulty, preventing the problem from spreading.
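One way to frame the pre- and post-analysis is a simple metric diff over the soak window; `collect_metrics()` and the metric names below are hypothetical placeholders for whatever monitoring stack is in use.

```python
# Illustrative pre/post soak analysis: compare host health metrics captured before
# the upgrade with metrics captured at the end of the soak period, and flag regressions.
# collect_metrics() is a hypothetical stand-in for the real monitoring stack.

REGRESSION_TOLERANCE = 0.10  # flag any metric that worsens by more than 10%

def collect_metrics(site: str) -> dict[str, float]:
    # e.g. {"crash_count": 0.0, "mem_usage_mb": 512.0, "boot_failures": 0.0}
    return {}

def post_upgrade_regressions(pre: dict[str, float], post: dict[str, float]) -> dict[str, float]:
    """Return the metrics that degraded beyond tolerance during the soak period."""
    regressions = {}
    for name, before in pre.items():
        after = post.get(name, before)
        baseline = max(before, 1e-9)          # avoid division by zero on zero baselines
        if (after - before) / baseline > REGRESSION_TOLERANCE:
            regressions[name] = after
    return regressions

# Usage: capture metrics before the upgrade, soak, capture again, then decide.
pre = collect_metrics("foa-site")
# ... upgrade and soak happen here ...
post = collect_metrics("foa-site")
if post_upgrade_regressions(pre, post):
    print("Pause the rollout and investigate before touching other sites")
```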

(3) Automated Rollback after a crash via backed-up software.

Once crashes are discovered during an FOA upgrade, there needs to be an automated process to fall back to the original software version.

This requires backing up the entire application binary, configuration, and databases prior to attempting an upgrade.

The rollback should commence automatically as soon as a core dump is observed, or on human action at the central level (through automation).
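A hedged sketch of that rollback trigger: snapshot the current version before upgrading, then restore it automatically if a crash is observed during the soak. `snapshot()`, `restore()`, `upgrade()`, and `crash_detected()` are illustrative placeholders, not any real product’s API.

```python
# Illustrative automated-rollback wrapper around an upgrade: back up binaries,
# config, and databases first, then restore them if a crash is seen during soak.
# snapshot(), restore(), upgrade(), and crash_detected() are hypothetical helpers.
import time

def snapshot(site: str) -> str:
    return f"{site}-backup"        # placeholder: back up binaries, config, databases

def restore(site: str, backup_id: str) -> None:
    print(f"Rolling {site} back to {backup_id}")  # placeholder restore

def upgrade(site: str, version: str) -> None:
    print(f"Upgrading {site} to {version}")       # placeholder upgrade

def crash_detected(site: str) -> bool:
    return False                   # placeholder: query core-dump / bugcheck telemetry

def upgrade_with_rollback(site: str, version: str, soak_seconds: int = 3600) -> None:
    backup_id = snapshot(site)
    upgrade(site, version)
    deadline = time.time() + soak_seconds
    while time.time() < deadline:
        if crash_detected(site):
            restore(site, backup_id)      # automatic fallback, no human in the loop
            raise RuntimeError(f"Crash detected on {site}; rolled back to {backup_id}")
        time.sleep(60)
```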

(4) Real-Time fault management and anomaly detection via Machine Learning.

It is surprising that Microsoft and CrowdStrike learned about the issue only when customers reported the crashes.

As these upgrades were pushed in bulk, thousands of Windows machines would have gone down within a short period of time.

This should have been caught by their fault-management systems as a major anomaly before customers reported failures.
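Even a crude fleet-level check would flag the pattern: compare the crash count in the current minute against a rolling baseline. The telemetry feed here is a hypothetical stand-in for a real fault-management or ML pipeline; this only shows the idea.

```python
# Illustrative anomaly check on fleet crash telemetry: flag an anomaly when the
# latest crash count is far above the rolling baseline.
from collections import deque
from statistics import mean, pstdev

class CrashRateMonitor:
    def __init__(self, window: int = 60, sigma: float = 5.0):
        self.history = deque(maxlen=window)  # crash counts per minute
        self.sigma = sigma

    def observe(self, crashes_this_minute: int) -> bool:
        """Record one minute of crash telemetry; return True if it is anomalous."""
        anomalous = False
        if len(self.history) >= 10:
            baseline = mean(self.history)
            spread = pstdev(self.history) or 1.0
            anomalous = crashes_this_minute > baseline + self.sigma * spread
        self.history.append(crashes_this_minute)
        return anomalous

# Example: a sudden jump from single digits to thousands of crashes per minute.
monitor = CrashRateMonitor()
for minute, crashes in enumerate([3, 2, 4, 3, 2, 3, 4, 2, 3, 3, 2, 4100]):
    if monitor.observe(crashes):
        print(f"Anomaly at minute {minute}: {crashes} crashes/minute")
```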

(5) Disaster Recovery architecture and processes were missing.

Even while Microsoft and CrowdStrike were working to recover the primary systems, there was no process for moving mission-critical workloads to a disaster recovery site.

It is reported that emergency 911 services were down in several US states, and hospitals were also impacted.

Customers whose work saves lives should have been moved to a DR cloud site to restore their services immediately.
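At the level of a single workload, the failover decision can be sketched as below; `primary_healthy()` and `activate_dr()` are hypothetical hooks, and real DR also involves data replication, traffic steering, and runbooks well beyond this sketch.

```python
# Illustrative disaster-recovery failover loop for a mission-critical workload:
# if the primary site stays unhealthy past a short grace period, activate the DR site.
# primary_healthy() and activate_dr() are hypothetical stand-ins for real tooling.
import time

GRACE_PERIOD_SECONDS = 300  # don't fail over on a transient blip

def primary_healthy(workload: str) -> bool:
    return True                 # placeholder: health checks against the primary site

def activate_dr(workload: str) -> None:
    print(f"Activating DR site for {workload}")  # placeholder: promote DR replicas, switch traffic

def watch_and_failover(workload: str) -> None:
    unhealthy_since = None
    while True:
        if primary_healthy(workload):
            unhealthy_since = None
        else:
            unhealthy_since = unhealthy_since or time.time()
            if time.time() - unhealthy_since > GRACE_PERIOD_SECONDS:
                activate_dr(workload)
                return
        time.sleep(30)
```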

Link: :point_down: