Umer Saeed
Sr. RF Planning & Optimization Engineer
BSc Telecommunications Engineering, School of Engineering
MS Data Science, School of Business and Economics
University of Management & Technology
Mobile: +923018412180
Email: umersaeed81@hotmail.com
Address: Dream Gardens, Defence Road, Lahore
File Metadata Extraction and Reporting
In this chapter, we explore essential techniques for analyzing and managing file systems using Python. We start with functions that provide valuable insights into file characteristics, including size distribution, type categorization, and age analysis. These tools are crucial for organizing and understanding large datasets, helping users make informed decisions about file management and archival.
The chapter demonstrates practical applications through various examples, such as calculating the distribution of file sizes across predefined bins, summarizing file types and their sizes, and analyzing file ages to identify how long files have been in the system. Additionally, it covers detailed file information retrieval, including size, creation and modification dates, and directory paths. These techniques are designed to streamline file handling processes, making it easier to manage and interpret file system data effectively.
1. File Size Distribution Analysis
Analyze the distribution of file sizes in a directory to identify large files or categorize them into size ranges.
# Import required libraries
import os
import pandas as pd
from glob import glob

def file_size_distribution(path):
    # Keep regular files only; '*.*' can also match folders whose names contain a dot
    file_list = [f for f in glob(f'{path}/*.*') if os.path.isfile(f)]
    sizes = [os.path.getsize(file) / 1024 for file in file_list]  # Size in KB

    # Define size bins (upper edge inclusive, so exactly 100 KB falls in '0 - 100 KB')
    bins = [0, 100, 500, 1000, 5000, float('inf')]
    labels = ['0 - 100 KB', '101 - 500 KB', '501 - 1000 KB', '1001 - 5000 KB', '5001 KB and above']

    df = pd.DataFrame({'Size (KB)': sizes})
    df['Size Range'] = pd.cut(df['Size (KB)'], bins=bins, labels=labels, right=True, include_lowest=True)

    # observed=False keeps empty size ranges in the summary
    size_summary = df.groupby('Size Range', observed=False).agg(
        Number_of_Files=('Size (KB)', 'count'),
        Total_Size_KB=('Size (KB)', 'sum')
    ).reset_index()
    return size_summary

# Example usage
df0 = file_size_distribution('D:/Copy/Umer_Saeed')
df0
| | Size Range | Number_of_Files | Total_Size_KB |
|---|---|---|---|
| 0 | 0 - 100 KB | 15 | 285.271484 |
| 1 | 101 - 500 KB | 0 | 0.000000 |
| 2 | 501 - 1000 KB | 0 | 0.000000 |
| 3 | 1001 - 5000 KB | 3 | 7935.145508 |
| 4 | 5001 KB and above | 0 | 0.000000 |
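The size ranges above hinge on how `pd.cut` treats a value that sits exactly on a bin edge. The following is a minimal sketch of that behavior, not part of the original analysis: the sizes and the `low`/`mid`/`high` labels are made up for illustration and are not taken from the directory above.

```python
import pandas as pd

# Hypothetical sizes in KB, picked to sit on or next to the bin edges
sizes_kb = pd.Series([0, 99.9, 100, 100.1, 500])
bins = [0, 100, 500, float('inf')]
labels = ['low', 'mid', 'high']  # illustrative labels only

# right=False: intervals are [0, 100), [100, 500), [500, inf)
left_closed = pd.cut(sizes_kb, bins=bins, labels=labels, right=False)

# right=True with include_lowest=True: [0, 100], (100, 500], (500, inf]
right_closed = pd.cut(sizes_kb, bins=bins, labels=labels,
                      right=True, include_lowest=True)

comparison = pd.DataFrame({
    'Size (KB)': sizes_kb,
    'right=False': left_closed,
    'right=True': right_closed,
})
print(comparison)
```

A file of exactly 100 KB lands in the second bin with `right=False` and in the first bin with `right=True`, so labels such as '0 - 100 KB' are best worded to match whichever setting is in use.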
2. File Type Summary Report
Summarize the types of files in a directory, including counts and sizes for each file type.
# Import required libraries
import os
import pandas as pd
from glob import glob

def file_type_summary(path):
    # Keep regular files only; '*.*' can also match folders whose names contain a dot
    file_list = [f for f in glob(f'{path}/*.*') if os.path.isfile(f)]
    extensions = [os.path.splitext(file)[1].lower() for file in file_list]
    sizes = [os.path.getsize(file) / 1024 for file in file_list]  # Size in KB

    df = pd.DataFrame({'Extension': extensions, 'Size (KB)': sizes})

    # One row per extension with the file count and combined size
    type_summary = df.groupby('Extension').agg(
        Number_of_Files=('Size (KB)', 'count'),
        Total_Size_KB=('Size (KB)', 'sum')
    ).reset_index()
    return type_summary

# Example usage
df0 = file_type_summary('D:/Copy/Umer_Saeed')
df0
| | Extension | Number_of_Files | Total_Size_KB |
|---|---|---|---|
| 0 | .csv | 2 | 3734.724609 |
| 1 | .docx | 1 | 45.729492 |
| 2 | .txt | 6 | 0.015625 |
| 3 | .xlsx | 9 | 4439.947266 |
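The `*.*` pattern used above only matches names that contain a dot, so files without an extension never reach the summary, and `os.path.splitext` keeps only the last suffix of a multi-part name. The sketch below illustrates both points with the standard library alone; `count_by_extension`, the `(none)` label, and the file names are hypothetical and exist only for this demonstration.

```python
import os
import tempfile
from collections import Counter

def count_by_extension(path):
    """Count regular files directly under `path`, grouped by lowercased extension."""
    counts = Counter()
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file():  # skip subfolders, even if their names contain a dot
                ext = os.path.splitext(entry.name)[1].lower()
                counts[ext or '(none)'] += 1  # group extensionless files together
    return counts

# Build a throwaway directory with hypothetical file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ['report.CSV', 'archive.tar.gz', 'README', '.gitignore']:
        open(os.path.join(tmp, name), 'w').close()
    os.mkdir(os.path.join(tmp, 'backup.old'))  # a folder whose name contains a dot
    ext_counts = count_by_extension(tmp)

print(dict(ext_counts))
```

Here `archive.tar.gz` is counted under `.gz`, while `README` and `.gitignore` both have an empty extension according to `os.path.splitext` and end up under `(none)`.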
3. File Age Analysis
Determine the age of files based on their creation or last modification dates to identify old or unused files.
# Import required libraries
import os
import pandas as pd
from glob import glob
from datetime import datetime

def file_age_analysis(path):
    # Keep regular files only; '*.*' can also match folders whose names contain a dot
    file_list = [f for f in glob(f'{path}/*.*') if os.path.isfile(f)]
    # getctime is the creation time on Windows; on Unix it is the last metadata change
    creation_dates = [os.path.getctime(file) for file in file_list]
    sizes = [os.path.getsize(file) / 1024 for file in file_list]  # Size in KB
    current_time = datetime.now().timestamp()
    ages = [(current_time - creation) / (60 * 60 * 24 * 365) for creation in creation_dates]  # Age in years

    df = pd.DataFrame({'Age (Years)': ages, 'Size (KB)': sizes})
    age_bins = [0, 1, 5, float('inf')]
    age_labels = ['Less than 1 year', '1 to 5 years', 'More than 5 years']
    df['Age Category'] = pd.cut(df['Age (Years)'], bins=age_bins, labels=age_labels, right=False)

    # observed=False keeps empty age categories in the summary
    age_summary = df.groupby('Age Category', observed=False).agg(
        Number_of_Files=('Age (Years)', 'count'),
        Total_Size_KB=('Size (KB)', 'sum')  # Actual combined size, not a file count
    ).reset_index()
    return age_summary

# Example usage
df0 = file_age_analysis('D:/Copy/Umer_Saeed')
df0
| | Age Category | Number_of_Files | Total_Size_KB |
|---|---|---|---|
| 0 | Less than 1 year | 18 | 8220.416992 |
| 1 | 1 to 5 years | 0 | 0.000000 |
| 2 | More than 5 years | 0 | 0.000000 |
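The function above measures age from `os.path.getctime`, which the Python documentation describes as the creation time on Windows but the time of the last metadata change on Unix-like systems. Where "unused" is better judged by the last modification, a sketch along the following lines may help. The names `file_age_days` and `SECONDS_PER_DAY` and the sample file are hypothetical, and the file is backdated with `os.utime` purely for demonstration.

```python
import os
import tempfile
import time

SECONDS_PER_DAY = 60 * 60 * 24

def file_age_days(file_path, now=None):
    """Return the age of a file in days, measured from its last modification time."""
    now = time.time() if now is None else now
    return (now - os.path.getmtime(file_path)) / SECONDS_PER_DAY

# Demonstration on a temporary file whose modification time is set 400 days back
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, 'old_report.txt')  # hypothetical file name
    open(sample, 'w').close()
    now = time.time()
    backdated = now - 400 * SECONDS_PER_DAY
    os.utime(sample, (backdated, backdated))  # (access time, modification time)
    age_days = file_age_days(sample, now=now)

print(f'{age_days:.1f} days, {age_days / 365:.2f} years')
```

Passing `now` explicitly keeps every file in a batch measured against the same reference moment, which avoids small drifts when a large directory takes a while to scan.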
4. Retrieve and Display File Metadata Including Name, Size, Creation Date, Modification Date, Extension, and Directory Path
The goal of this code is to gather and present detailed information about the files in a specific folder and its subfolders. It provides a list showing each file’s name (without the extension), size, when it was created, when it was last modified, its type, and the folder it is located in.
# Import required libraries
import os
import math
import pandas as pd
from glob import glob

# User-defined function
def get_file_info(path):
    # Get the list of files, including those in subfolders (regular files only)
    file_list = [f for f in glob(f'{path}/**/*.*', recursive=True) if os.path.isfile(f)]
    # Use a list comprehension to collect file details
    file_info = [(os.path.splitext(os.path.basename(file_path))[0],  # File name without extension
                  math.ceil(os.path.getsize(file_path) / 1024),  # File size in KB, rounded up
                  pd.to_datetime(os.path.getctime(file_path), unit='s')  # File creation time (Windows), as UTC
                    .tz_localize('UTC')  # Localize to UTC
                    .tz_convert('Asia/Karachi')  # Convert to your local timezone
                    .strftime('%Y-%m-%d %I:%M %p'),  # Format as date and time
                  pd.to_datetime(os.path.getmtime(file_path), unit='s')  # Last modification time, as UTC
                    .tz_localize('UTC')  # Localize to UTC
                    .tz_convert('Asia/Karachi')  # Convert to your local timezone
                    .strftime('%Y-%m-%d %I:%M %p'),  # Format as date and time
                  os.path.splitext(file_path)[1].lower(),  # File extension
                  os.path.dirname(file_path)  # Directory path without the file name
                  )
                 for file_path in file_list]
    # Create a DataFrame
    df = pd.DataFrame(file_info, columns=['File Name', 'File Size (KB)',
                                          'Created Date', 'Modified Date',
                                          'File Extension', 'Directory Path'])
    return df

# Get file information from the directory and its subfolders
df0 = get_file_info('D:/Copy/Umer_Saeed')
# Display the DataFrame
df0
| | File Name | File Size (KB) | Created Date | Modified Date | File Extension | Directory Path |
|---|---|---|---|---|---|---|
| 0 | 03_PRS | 1868 | 2024-06-12 06:36 PM | 2024-03-21 05:11 AM | .csv | D:/Copy/Umer_Saeed |
| 1 | 1234_US_G | 0 | 2024-06-13 05:02 AM | 2024-06-13 05:02 AM | .txt | D:/Copy/Umer_Saeed |
| 2 | 15031984_AliSaeed | 0 | 2024-09-01 02:31 PM | 2024-09-01 02:31 PM | .txt | D:/Copy/Umer_Saeed |
| 3 | 19980802_UmerSaeed | 30 | 2024-09-01 02:32 PM | 2024-09-01 02:31 PM | .xlsx | D:/Copy/Umer_Saeed |
| 4 | AS_123_US | 30 | 2024-06-13 05:46 PM | 2024-06-13 05:46 PM | .xlsx | D:/Copy/Umer_Saeed |
| 5 | Babar_Azam | 30 | 2024-09-01 02:19 PM | 2024-09-01 02:16 PM | .xlsx | D:/Copy/Umer_Saeed |
| 6 | Cor_US | 46 | 2024-06-13 05:02 AM | 2024-06-13 05:02 AM | .docx | D:/Copy/Umer_Saeed |
| 7 | gmail | 0 | 2024-06-13 05:24 AM | 2024-06-13 05:24 AM | .txt | D:/Copy/Umer_Saeed |
| 8 | g_AS | 0 | 2024-06-13 05:18 AM | 2024-06-13 05:18 AM | .txt | D:/Copy/Umer_Saeed |
| 9 | g_US | 30 | 2024-06-13 05:02 AM | 2024-06-13 05:01 AM | .xlsx | D:/Copy/Umer_Saeed |
| 10 | Hello_US | 1868 | 2024-06-13 04:55 PM | 2024-03-21 05:11 AM | .csv | D:/Copy/Umer_Saeed |
| 11 | Ijlal_Khan | 30 | 2024-09-01 02:17 PM | 2024-09-01 02:16 PM | .xlsx | D:/Copy/Umer_Saeed |
| 12 | Pakistan_1947-08-14 | 0 | 2024-09-01 02:30 PM | 2024-09-01 02:30 PM | .txt | D:/Copy/Umer_Saeed |
| 13 | Test | 30 | 2024-06-22 11:30 PM | 2024-06-18 10:38 PM | .xlsx | D:/Copy/Umer_Saeed |
| 14 | Test1 | 31 | 2024-08-13 06:37 PM | 2024-08-11 10:53 AM | .xlsx | D:/Copy/Umer_Saeed |
| 15 | US1234 | 1 | 2024-06-13 04:46 PM | 2024-09-01 02:24 PM | .txt | D:/Copy/Umer_Saeed |
| 16 | US_123_AS | 30 | 2024-06-13 05:46 PM | 2024-06-13 05:45 PM | .xlsx | D:/Copy/Umer_Saeed |
| 17 | US_Test | 4201 | 2024-06-13 05:53 PM | 2024-06-10 12:20 PM | .xlsx | D:/Copy/Umer_Saeed |
| 18 | g | 0 | 2024-08-13 07:30 PM | 2024-08-13 07:30 PM | .txt | D:/Copy/Umer_Saeed\BH |
| 19 | a | 1868 | 2024-06-12 06:37 PM | 2024-03-21 05:11 AM | .csv | D:/Copy/Umer_Saeed\DA |
| 20 | AS_1234_US | 30 | 2024-08-11 03:54 PM | 2024-08-11 03:54 PM | .xlsx | D:/Copy/Umer_Saeed\DA |
| 21 | b | 1 | 2024-06-12 06:37 PM | 2024-06-12 06:36 PM | .txt | D:/Copy/Umer_Saeed\DA |
| 22 | c | 30 | 2024-06-12 06:37 PM | 2024-06-12 06:35 PM | .xlsx | D:/Copy/Umer_Saeed\DA |
| 23 | d | 1868 | 2024-08-11 11:43 AM | 2024-03-21 05:11 AM | .csv | D:/Copy/Umer_Saeed\DA |
| 24 | Hello_t_US | 1868 | 2024-08-11 03:32 PM | 2024-03-21 05:11 AM | .csv | D:/Copy/Umer_Saeed\DA |
| 25 | US1234_Salam | 0 | 2024-08-11 03:15 PM | 2024-06-13 04:46 PM | .txt | D:/Copy/Umer_Saeed\DA |
| 26 | US_123_AS_Hello | 30 | 2024-08-11 03:15 PM | 2024-06-13 05:45 PM | .xlsx | D:/Copy/Umer_Saeed\DA |
| 27 | US_Test_Hi | 4201 | 2024-08-11 03:15 PM | 2024-06-10 12:20 PM | .xlsx | D:/Copy/Umer_Saeed\DA |
| 28 | 1234_US_G | 0 | 2024-09-01 03:31 PM | 2024-06-13 05:02 AM | .txt | D:/Copy/Umer_Saeed\Umer |
| 29 | 15031984_AliSaeed | 0 | 2024-09-01 03:31 PM | 2024-09-01 02:31 PM | .txt | D:/Copy/Umer_Saeed\Umer |
| 30 | 19980802_UmerSaeed | 30 | 2024-09-01 03:32 PM | 2024-09-01 02:31 PM | .xlsx | D:/Copy/Umer_Saeed\Umer |
| 31 | Babar_Azam | 30 | 2024-09-01 03:32 PM | 2024-09-01 02:16 PM | .xlsx | D:/Copy/Umer_Saeed\Umer |
| 32 | Cor_US | 46 | 2024-09-01 03:32 PM | 2024-06-13 05:02 AM | .docx | D:/Copy/Umer_Saeed\Umer |
| 33 | gmail | 0 | 2024-09-01 03:31 PM | 2024-06-13 05:24 AM | .txt | D:/Copy/Umer_Saeed\Umer |
| 34 | g_AS | 0 | 2024-09-01 03:31 PM | 2024-06-13 05:18 AM | .txt | D:/Copy/Umer_Saeed\Umer |
| 35 | Ijlal_Khan | 30 | 2024-09-01 03:32 PM | 2024-09-01 02:16 PM | .xlsx | D:/Copy/Umer_Saeed\Umer |
| 36 | Pakistan_1947-08-14 | 0 | 2024-09-01 03:31 PM | 2024-09-01 02:30 PM | .txt | D:/Copy/Umer_Saeed\Umer |
| 37 | US1234 | 1 | 2024-09-01 03:31 PM | 2024-09-01 02:24 PM | .txt | D:/Copy/Umer_Saeed\Umer |
| 38 | US_123_AS | 30 | 2024-09-01 03:32 PM | 2024-06-13 05:45 PM | .xlsx | D:/Copy/Umer_Saeed\Umer |
| 39 | US_Test | 4201 | 2024-09-01 03:32 PM | 2024-06-10 12:20 PM | .xlsx | D:/Copy/Umer_Saeed\Umer |
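The function above repeats the same localize, convert, and format chain for the creation and modification times. One possible way to factor that into a helper is sketched below, assuming pandas is available with its timezone data; `format_timestamp` is a hypothetical name, and the epoch value is an illustrative timestamp rather than one read from the files listed above.

```python
import pandas as pd

def format_timestamp(epoch_seconds, tz='Asia/Karachi'):
    """Convert seconds since the Unix epoch to a formatted local-time string."""
    # utc=True yields a timezone-aware UTC timestamp in a single step
    utc_time = pd.to_datetime(epoch_seconds, unit='s', utc=True)
    return utc_time.tz_convert(tz).strftime('%Y-%m-%d %I:%M %p')

# Illustrative timestamp: 2024-06-13 00:02:00 UTC
example = format_timestamp(1718236920)
print(example)  # 2024-06-13 05:02 AM

# The two-step chain from the function above gives the same result
chained = (pd.to_datetime(1718236920, unit='s')
             .tz_localize('UTC')
             .tz_convert('Asia/Karachi')
             .strftime('%Y-%m-%d %I:%M %p'))
```

Inside `get_file_info`, such a helper could be called as `format_timestamp(os.path.getctime(file_path))`, and passing a different `tz` name would adapt the report to another region.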