Filtering ZIP Files That Do Not Contain Specific Keywords

Umer Saeed

Senior RF Planning & Optimization Engineer

Introduction

Handling compressed files efficiently is an essential skill for managing data in bulk. Often, we encounter scenarios where we need to process ZIP files while ensuring they meet specific conditions. One such case is filtering ZIP files based on their contents. This article discusses how to identify ZIP files that do not contain a particular keyword in any of their internal filenames.

Understanding the Need for Filtering ZIP Files

ZIP files often contain multiple files, and in many cases, we need to analyze or extract only those that meet certain criteria. For example, an organization may receive daily reports in ZIP format, but only a subset of those reports is relevant to a specific analysis. Instead of extracting every file manually, we can automate the filtering process.

Key Steps in the Filtering Process

To achieve this, the process involves:

  1. Locating the ZIP Files: The first step is to specify the directory containing the ZIP files.

  2. Reading ZIP File Contents: Each ZIP file is opened and scanned to retrieve the list of filenames it contains.

  3. Applying the Filtering Condition: The filenames within each ZIP file are checked for the presence of a specific keyword.

  4. Listing the Filtered ZIP Files: Finally, the ZIP files that do not contain the keyword are displayed.

Real-World Applications

  • Automated Report Processing: Filtering ZIP files that contain only the required reports can save time in data analysis.

  • Data Quality Checks: Ensuring that required files exist within ZIP archives before further processing.

  • File Management & Cleanup: Identifying unnecessary ZIP files that do not contain relevant data.

:building_construction: Import Required Modules

import zipfile                        # 📦 Module for working with ZIP files
from pathlib import Path              # 🛤 Module for handling file paths

:open_file_folder: Define the ZIP Folder

# 📌 Set the path where ZIP files are stored
zip_folder = Path(r"E:\PRS_Email\Attachments")

:mag: Find All ZIP Files in the Folder

# 📋 Get a list of all ZIP files in the specified folder
zip_files = list(zip_folder.glob("*.zip"))

:rocket: Filter ZIP Files That Do Not Contain ‘LTE_DA’

# 📂 List to store ZIP files that do not contain 'LTE_DA' in any filename
filtered_zips = []

for zip_file in zip_files:
    with zipfile.ZipFile(zip_file, 'r') as zf:
        # 📜 Get the list of file names inside the ZIP
        file_names = zf.namelist()

        # 🔎 Keep the ZIP only if no internal file name contains 'LTE_DA'
        if not any("LTE_DA" in name for name in file_names):
            filtered_zips.append(zip_file.name)  # ✅ Add ZIP file to the filtered list

:loudspeaker: Display Filtered ZIP Files

# 🖨 Print ZIP files that do not contain any file with 'LTE_DA'
print("ZIP files that do not contain 'LTE_DA':")
for zip_name in filtered_zips:
    print(zip_name)  # 📜 Display the ZIP filenames

Output:

ZIP files that do not contain 'LTE_DA':
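
The loop above assumes that every archive in the folder can be opened and that the keyword's letter case matches exactly. As a minimal sketch, the same filter can be wrapped in a function that skips corrupt archives and optionally ignores case. The function name `zips_without_keyword` and its parameters are hypothetical, chosen only for illustration; they are not part of the original script.

```python
import zipfile
from pathlib import Path


def zips_without_keyword(folder, keyword, ignore_case=False):
    """Return (clean, unreadable).

    clean      -> names of ZIP files with no internal filename containing the keyword
    unreadable -> names of ZIP files that could not be opened as ZIP archives
    """
    needle = keyword.lower() if ignore_case else keyword
    clean, unreadable = [], []

    for zip_path in sorted(Path(folder).glob("*.zip")):
        try:
            with zipfile.ZipFile(zip_path, "r") as zf:
                names = zf.namelist()
        except zipfile.BadZipFile:
            # Corrupt or mislabeled file: record it instead of stopping the scan
            unreadable.append(zip_path.name)
            continue

        if ignore_case:
            names = [name.lower() for name in names]

        # Keep the archive only if no internal filename contains the keyword
        if not any(needle in name for name in names):
            clean.append(zip_path.name)

    return clean, unreadable
```

With the folder defined earlier, `zips_without_keyword(zip_folder, "LTE_DA")` would return the two lists, so unreadable archives can be reviewed separately instead of interrupting the run.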

Conclusion

By leveraging automation, filtering ZIP files based on their contents becomes a streamlined process, eliminating the need for manual inspections. This method can be particularly useful in handling large volumes of data efficiently.

For a complete implementation, you can check out the code on GitHub: [GitHub Repository Link]

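For .gz files, for example in folders that contain subfolders of documents, you can use zgrep. Note that it searches the contents of each compressed file rather than its name; -L lists the files with no match and -i ignores case:
zgrep -Li 'keyword' *.gz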
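
For readers who prefer to stay in Python, a rough equivalent of the zgrep command above can be sketched with the standard gzip module. The helper name `gz_without_keyword` is hypothetical. The sketch assumes the compressed files hold text that can be read as UTF-8 (undecodable bytes are replaced), treats the keyword as a plain substring rather than a regular expression, and uses rglob to descend into subfolders, which the shell pattern *.gz does not do on its own.

```python
import gzip
from pathlib import Path


def gz_without_keyword(folder, keyword):
    """Return paths of .gz files whose decompressed text never contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    result = []

    for gz_path in sorted(Path(folder).rglob("*.gz")):
        # Read line by line so large files are not loaded into memory at once
        with gzip.open(gz_path, "rt", encoding="utf-8", errors="replace") as fh:
            found = any(needle in line.lower() for line in fh)

        if not found:
            result.append(gz_path)

    return result
```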