File Metadata Extraction and Reporting

Umer Saeed

Sr. RF Planning & Optimization Engineer

BSc Telecommunications Engineering, School of Engineering

MS Data Science, School of Business and Economics

University of Management & Technology

Mobile: +923018412180

Email: umersaeed81@hotmail.com

Address: Dream Gardens, Defence Road, Lahore

This chapter presents Python techniques for inspecting the files in a directory. It starts with functions that summarize file characteristics: how sizes are distributed, which file types are present, and how old the files are. Such summaries help when organizing large collections of files and deciding what to keep, move, or archive.

The examples count files and total their sizes across predefined size bins, summarize file types and their sizes, and group files by age to show how long they have been on the system. The final example retrieves per-file details: name, size, creation and modification dates, extension, and directory path.

1. File Size Distribution Analysis

Analyze the distribution of file sizes in a directory to identify large files or categorize them into size ranges.

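# Import required libraries
import os
import pandas as pd
from glob import glob

def file_size_distribution(path):
    # Match names containing a dot in the top level of the folder, keeping files only
    file_list = [f for f in glob(f'{path}/*.*') if os.path.isfile(f)]
    sizes = [os.path.getsize(file) / 1024 for file in file_list]  # Size in KB

    # Define size bins (edges in KB)
    bins = [0, 100, 500, 1000, 5000, float('inf')]
    labels = ['0 - 100 KB', '101 - 500 KB', '501 - 1000 KB', '1001 - 5000 KB', '5001 KB and above']

    df = pd.DataFrame({'Size (KB)': sizes})
    # right=True makes each bin include its upper edge, matching the labels;
    # include_lowest=True keeps empty (0 KB) files in the first bin
    df['Size Range'] = pd.cut(df['Size (KB)'], bins=bins, labels=labels,
                              right=True, include_lowest=True)

    # observed=False keeps size ranges that contain no files in the summary
    size_summary = df.groupby('Size Range', observed=False).agg(
        Number_of_Files=('Size (KB)', 'count'),
        Total_Size_KB=('Size (KB)', 'sum')
    ).reset_index()

    return size_summary

# Example usage
df0 = file_size_distribution('D:/Copy/Umer_Saeed')
df0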
          Size Range  Number_of_Files  Total_Size_KB
0         0 - 100 KB               15     285.271484
1       101 - 500 KB                0       0.000000
2      501 - 1000 KB                0       0.000000
3     1001 - 5000 KB                3    7935.145508
4  5001 KB and above                0       0.000000
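
How `pd.cut` treats a value that sits exactly on a bin edge depends on its `right` argument, which matters when labels such as '0 - 100 KB' and '101 - 500 KB' are meant to be read inclusively. The sketch below compares the two settings on made-up sizes (not files from the example directory); the helper name `bin_sizes` is only illustrative.

```python
import pandas as pd

BINS = [0, 100, 500, 1000, 5000, float('inf')]
LABELS = ['0 - 100 KB', '101 - 500 KB', '501 - 1000 KB',
          '1001 - 5000 KB', '5001 KB and above']

def bin_sizes(sizes_kb, right):
    """Return the size-range label for each value (hypothetical helper)."""
    binned = pd.cut(pd.Series(sizes_kb), bins=BINS, labels=LABELS,
                    right=right, include_lowest=True)
    return binned.astype(str).tolist()

# Made-up sizes in KB, chosen to sit on or near the bin edges
sample = [0, 100, 100.5, 500, 5000]

right_closed = bin_sizes(sample, right=True)   # intervals like (100, 500]
left_closed = bin_sizes(sample, right=False)   # intervals like [100, 500)

for size, a, b in zip(sample, right_closed, left_closed):
    print(f'{size:>7} KB  right=True: {a:<18} right=False: {b}')
```

With `right=True` a file of exactly 100 KB is counted under '0 - 100 KB'; with `right=False` it moves to '101 - 500 KB', which no longer agrees with the label.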

2. File Type Summary Report

Summarize the types of files in a directory, including counts and sizes for each file type.

# Import required libraries
import os
import pandas as pd
from glob import glob

def file_type_summary(path):
    # Match names containing a dot in the top level of the folder, keeping files only
    file_list = [f for f in glob(f'{path}/*.*') if os.path.isfile(f)]
    # Lower-case the extension so that '.CSV' and '.csv' are grouped together
    extensions = [os.path.splitext(file)[1].lower() for file in file_list]
    sizes = [os.path.getsize(file) / 1024 for file in file_list]  # Size in KB

    df = pd.DataFrame({'Extension': extensions, 'Size (KB)': sizes})
    type_summary = df.groupby('Extension').agg(
        Number_of_Files=('Size (KB)', 'count'),
        Total_Size_KB=('Size (KB)', 'sum')
    ).reset_index()

    return type_summary

# Example usage
df0 = file_type_summary('D:/Copy/Umer_Saeed')
df0
   Extension  Number_of_Files  Total_Size_KB
0       .csv                2    3734.724609
1      .docx                1      45.729492
2       .txt                6       0.015625
3      .xlsx                9    4439.947266
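
The pattern `*.*` only matches names that contain a dot, so files without an extension are left out of the summary, while a sub-folder whose name contains a dot would be matched. A minimal sketch of one alternative, using only the standard library, is given below; the function name `summarize_extensions` and the `(no extension)` label are illustrative assumptions, not part of the code above.

```python
from collections import defaultdict
from pathlib import Path

def summarize_extensions(path):
    """Map each extension to (file count, total size in KB) for one folder."""
    counts = defaultdict(int)
    total_kb = defaultdict(float)
    for entry in Path(path).iterdir():
        if not entry.is_file():          # skip sub-folders, even dotted ones
            continue
        # Files without a suffix are grouped under a placeholder label
        ext = entry.suffix.lower() or '(no extension)'
        counts[ext] += 1
        total_kb[ext] += entry.stat().st_size / 1024
    return {ext: (counts[ext], total_kb[ext]) for ext in sorted(counts)}
```

Calling `summarize_extensions('D:/Copy/Umer_Saeed')` would return a dictionary that can be passed to `pd.DataFrame.from_dict(..., orient='index')` if a table is preferred.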

3. File Age Analysis

Determine the age of files based on their creation or last modification dates to identify old or unused files.

# Import required libraries
import os
import pandas as pd
from glob import glob
from datetime import datetime

def file_age_analysis(path):
    # Match names containing a dot in the top level of the folder, keeping files only
    file_list = [f for f in glob(f'{path}/*.*') if os.path.isfile(f)]
    # getctime is the creation time on Windows; on most Unix systems it is
    # the time of the last metadata change
    creation_dates = [os.path.getctime(file) for file in file_list]
    sizes = [os.path.getsize(file) / 1024 for file in file_list]  # Size in KB
    current_time = datetime.now().timestamp()

    ages = [(current_time - creation) / (60 * 60 * 24 * 365) for creation in creation_dates]  # Age in years

    df = pd.DataFrame({'Age (Years)': ages, 'Size (KB)': sizes})
    age_bins = [0, 1, 5, float('inf')]
    age_labels = ['Less than 1 year', '1 to 5 years', 'More than 5 years']

    # right=False gives the intervals [0, 1), [1, 5) and [5, inf)
    df['Age Category'] = pd.cut(df['Age (Years)'], bins=age_bins, labels=age_labels, right=False)

    # observed=False keeps age categories that contain no files in the summary
    age_summary = df.groupby('Age Category', observed=False).agg(
        Number_of_Files=('Age (Years)', 'count'),
        Total_Size_KB=('Size (KB)', 'sum')
    ).reset_index()

    return age_summary

# Example usage
df0 = file_age_analysis('D:/Copy/Umer_Saeed')
df0
        Age Category  Number_of_Files  Total_Size_KB
0   Less than 1 year               18    8220.416992
1       1 to 5 years                0       0.000000
2  More than 5 years                0       0.000000
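
`os.path.getctime` returns the creation time on Windows but the time of the last metadata change on most Unix systems, so an age computed from it depends on the platform. The sketch below is one possible variant that measures age from the last modification time instead, using only the standard library; `file_ages_by_mtime`, its `now` parameter, and `age_category` are hypothetical names introduced here.

```python
import os
import time

SECONDS_PER_YEAR = 60 * 60 * 24 * 365

def file_ages_by_mtime(path, now=None):
    """Return {file name: age in years}, measured from last modification."""
    now = time.time() if now is None else now
    ages = {}
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file():  # ignore sub-folders
                ages[entry.name] = (now - entry.stat().st_mtime) / SECONDS_PER_YEAR
    return ages

def age_category(age_years):
    """Apply the same three categories used in the age analysis."""
    if age_years < 1:
        return 'Less than 1 year'
    if age_years < 5:
        return '1 to 5 years'
    return 'More than 5 years'
```

Passing a fixed `now` makes the result reproducible, which is convenient when the report is regenerated or tested.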

4. Retrieve and Display File Metadata Including Name, Size, Creation Date, Modification Date, Extension, and Directory Path

The goal of this code is to gather and present detailed information about the files in a specific folder. It provides a list showing each file’s name (without the extension), size, when it was created, when it was last modified, its type, and the folder it is located in.

# Import required libraries
import os
import math
import pandas as pd
from glob import glob

def format_local_time(timestamp, timezone='Asia/Karachi'):
    # Interpret the epoch seconds as UTC, convert to the local timezone, and format
    return (pd.to_datetime(timestamp, unit='s', utc=True)
            .tz_convert(timezone)
            .strftime('%Y-%m-%d %I:%M %p'))

# User-defined function
def get_file_info(path):
    # Get the files in the folder and all of its sub-folders
    file_list = [f for f in glob(f'{path}/**/*.*', recursive=True) if os.path.isfile(f)]

    # Use a list comprehension to collect the details of each file
    file_info = [(os.path.splitext(os.path.basename(file_path))[0],  # File name without extension
                  math.ceil(os.path.getsize(file_path) / 1024),  # File size in KB, rounded up
                  format_local_time(os.path.getctime(file_path)),  # Creation time (on Windows)
                  format_local_time(os.path.getmtime(file_path)),  # Last modification time
                  os.path.splitext(file_path)[1].lower(),  # File extension
                  os.path.dirname(file_path)  # Directory that contains the file
                 )
                 for file_path in file_list]

    # Create a DataFrame
    df = pd.DataFrame(file_info, columns=['File Name', 'File Size (KB)',
                                          'Created Date', 'Modified Date',
                                          'File Extension', 'Directory Path'])
    return df

# Get file information for the example directory
df0 = get_file_info('D:/Copy/Umer_Saeed')

# Display the DataFrame
df0
              File Name  File Size (KB)         Created Date        Modified Date  File Extension           Directory Path
 0               03_PRS            1868  2024-06-12 06:36 PM  2024-03-21 05:11 AM            .csv       D:/Copy/Umer_Saeed
 1            1234_US_G               0  2024-06-13 05:02 AM  2024-06-13 05:02 AM            .txt       D:/Copy/Umer_Saeed
 2    15031984_AliSaeed               0  2024-09-01 02:31 PM  2024-09-01 02:31 PM            .txt       D:/Copy/Umer_Saeed
 3   19980802_UmerSaeed              30  2024-09-01 02:32 PM  2024-09-01 02:31 PM           .xlsx       D:/Copy/Umer_Saeed
 4            AS_123_US              30  2024-06-13 05:46 PM  2024-06-13 05:46 PM           .xlsx       D:/Copy/Umer_Saeed
 5           Babar_Azam              30  2024-09-01 02:19 PM  2024-09-01 02:16 PM           .xlsx       D:/Copy/Umer_Saeed
 6               Cor_US              46  2024-06-13 05:02 AM  2024-06-13 05:02 AM           .docx       D:/Copy/Umer_Saeed
 7                gmail               0  2024-06-13 05:24 AM  2024-06-13 05:24 AM            .txt       D:/Copy/Umer_Saeed
 8                 g_AS               0  2024-06-13 05:18 AM  2024-06-13 05:18 AM            .txt       D:/Copy/Umer_Saeed
 9                 g_US              30  2024-06-13 05:02 AM  2024-06-13 05:01 AM           .xlsx       D:/Copy/Umer_Saeed
10             Hello_US            1868  2024-06-13 04:55 PM  2024-03-21 05:11 AM            .csv       D:/Copy/Umer_Saeed
11           Ijlal_Khan              30  2024-09-01 02:17 PM  2024-09-01 02:16 PM           .xlsx       D:/Copy/Umer_Saeed
12  Pakistan_1947-08-14               0  2024-09-01 02:30 PM  2024-09-01 02:30 PM            .txt       D:/Copy/Umer_Saeed
13                 Test              30  2024-06-22 11:30 PM  2024-06-18 10:38 PM           .xlsx       D:/Copy/Umer_Saeed
14                Test1              31  2024-08-13 06:37 PM  2024-08-11 10:53 AM           .xlsx       D:/Copy/Umer_Saeed
15               US1234               1  2024-06-13 04:46 PM  2024-09-01 02:24 PM            .txt       D:/Copy/Umer_Saeed
16            US_123_AS              30  2024-06-13 05:46 PM  2024-06-13 05:45 PM           .xlsx       D:/Copy/Umer_Saeed
17              US_Test            4201  2024-06-13 05:53 PM  2024-06-10 12:20 PM           .xlsx       D:/Copy/Umer_Saeed
18                    g               0  2024-08-13 07:30 PM  2024-08-13 07:30 PM            .txt    D:/Copy/Umer_Saeed\BH
19                    a            1868  2024-06-12 06:37 PM  2024-03-21 05:11 AM            .csv    D:/Copy/Umer_Saeed\DA
20           AS_1234_US              30  2024-08-11 03:54 PM  2024-08-11 03:54 PM           .xlsx    D:/Copy/Umer_Saeed\DA
21                    b               1  2024-06-12 06:37 PM  2024-06-12 06:36 PM            .txt    D:/Copy/Umer_Saeed\DA
22                    c              30  2024-06-12 06:37 PM  2024-06-12 06:35 PM           .xlsx    D:/Copy/Umer_Saeed\DA
23                    d            1868  2024-08-11 11:43 AM  2024-03-21 05:11 AM            .csv    D:/Copy/Umer_Saeed\DA
24           Hello_t_US            1868  2024-08-11 03:32 PM  2024-03-21 05:11 AM            .csv    D:/Copy/Umer_Saeed\DA
25         US1234_Salam               0  2024-08-11 03:15 PM  2024-06-13 04:46 PM            .txt    D:/Copy/Umer_Saeed\DA
26      US_123_AS_Hello              30  2024-08-11 03:15 PM  2024-06-13 05:45 PM           .xlsx    D:/Copy/Umer_Saeed\DA
27           US_Test_Hi            4201  2024-08-11 03:15 PM  2024-06-10 12:20 PM           .xlsx    D:/Copy/Umer_Saeed\DA
28            1234_US_G               0  2024-09-01 03:31 PM  2024-06-13 05:02 AM            .txt  D:/Copy/Umer_Saeed\Umer
29    15031984_AliSaeed               0  2024-09-01 03:31 PM  2024-09-01 02:31 PM            .txt  D:/Copy/Umer_Saeed\Umer
30   19980802_UmerSaeed              30  2024-09-01 03:32 PM  2024-09-01 02:31 PM           .xlsx  D:/Copy/Umer_Saeed\Umer
31           Babar_Azam              30  2024-09-01 03:32 PM  2024-09-01 02:16 PM           .xlsx  D:/Copy/Umer_Saeed\Umer
32               Cor_US              46  2024-09-01 03:32 PM  2024-06-13 05:02 AM           .docx  D:/Copy/Umer_Saeed\Umer
33                gmail               0  2024-09-01 03:31 PM  2024-06-13 05:24 AM            .txt  D:/Copy/Umer_Saeed\Umer
34                 g_AS               0  2024-09-01 03:31 PM  2024-06-13 05:18 AM            .txt  D:/Copy/Umer_Saeed\Umer
35           Ijlal_Khan              30  2024-09-01 03:32 PM  2024-09-01 02:16 PM           .xlsx  D:/Copy/Umer_Saeed\Umer
36  Pakistan_1947-08-14               0  2024-09-01 03:31 PM  2024-09-01 02:30 PM            .txt  D:/Copy/Umer_Saeed\Umer
37               US1234               1  2024-09-01 03:31 PM  2024-09-01 02:24 PM            .txt  D:/Copy/Umer_Saeed\Umer
38            US_123_AS              30  2024-09-01 03:32 PM  2024-06-13 05:45 PM           .xlsx  D:/Copy/Umer_Saeed\Umer
39              US_Test            4201  2024-09-01 03:32 PM  2024-06-10 12:20 PM           .xlsx  D:/Copy/Umer_Saeed\Umer
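
The created and modified dates in the table come from treating the epoch seconds reported by the file system as UTC and converting them to `Asia/Karachi`. The same step is sketched in isolation below with the standard library's `zoneinfo` module (Python 3.9+) instead of pandas; `format_timestamp` is a hypothetical helper, and the epoch value is an arbitrary example rather than a timestamp from the files above.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def format_timestamp(epoch_seconds, tz_name='Asia/Karachi'):
    """Format epoch seconds as 'YYYY-MM-DD HH:MM AM/PM' in the given zone."""
    # Attach UTC explicitly so the result does not depend on the machine's clock settings
    utc_time = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    local_time = utc_time.astimezone(ZoneInfo(tz_name))
    return local_time.strftime('%Y-%m-%d %I:%M %p')

# 1_700_000_000 seconds after the epoch is 2023-11-14 22:13:20 UTC
print(format_timestamp(1_700_000_000))
```

Because Pakistan Standard Time is five hours ahead of UTC, the example prints `2023-11-15 03:13 AM`. Passing a different `tz_name` adapts the report to another region.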