In today’s data-driven world, managing and processing large datasets efficiently is more crucial than ever. Our Python program is designed to streamline your data workflows, enabling you to clean, analyze, and manage datasets with ease. Whether you’re working with MySQL databases, Prometheus data, or complex hierarchical data structures, this tool has you covered.
What Does the Program Do?
This Python script offers a comprehensive solution for:
- Data Cleaning: Automatically cleans input files to ensure they’re ready for processing.
- Database Interaction: Connects seamlessly with MySQL databases to query and retrieve data.
- File Parsing: Extracts critical information from structured text files such as mount names, NFS paths, and site locations.
- Data Matching: Compares data between multiple sources (ITSS, Prometheus, QTree) to identify matches and discrepancies.
- Output Generation: Produces a detailed CSV report summarizing data matches, ownership verification, and location-based insights.
Implementation
Dependencies and Configuration
import pandas as pd # Data manipulation and analysis
import subprocess # Execute shell commands
import json, time # Data serialization and timing operations
import os # File path handling (used by clean_data)
import mmap # Memory-mapped file access (used by clean_data)
import threading # Concurrent task execution
import random # Progress-indicator randomness (used by write_data_db)
from datetime import datetime # Date and time handling
import logging # Logging operations
from logging.handlers import SysLogHandler # System logging
from docopt import docopt # Command-line argument parsing
from prettytable import PrettyTable # Formatted table output
import smtplib # Email functionality
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
logging.basicConfig(level=logging.INFO,
filename='logs.log',
filemode='a',
format='%(message)s %(asctime)s',
datefmt="%Y-%m-%d %T")
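With this configuration, each log line is written with the message first and the timestamp appended. For example (the timestamp shown is illustrative):

logging.info('File data_db.csv Created')
# -> File data_db.csv Created 2024-11-05 14:02:31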
Command Line Interface
def get_args():
"""Function to get command line arguments."""
usage = """
Usage:
try.py --replace [--short]
try.py --prepare [--short]
try.py --dryrun [--short]
try.py --email -e
try.py --version
try.py -h | --help
"""
args = docopt(usage)
return args
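Once parsed, docopt returns a dictionary keyed by flag, which can drive a simple dispatch. A minimal sketch, where the run_* helpers are hypothetical stand-ins for the routines described below:

if __name__ == '__main__':
    args = get_args()
    # Each flag selects one mode; --short is an optional modifier.
    if args['--replace']:
        run_replace(short=args['--short'])   # hypothetical helper
    elif args['--prepare']:
        run_prepare(short=args['--short'])   # hypothetical helper
    elif args['--dryrun']:
        run_dryrun(short=args['--short'])    # hypothetical helper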
Core Functionality
The script builds mmvdisk command lists and runs them with subprocess.Popen to gather information about disk status:
all_not_ok = ['mmvdisk', 'pdisk', 'list', '--rg', 'all', '--not-ok']
replace = ['mmvdisk', 'pdisk', 'list', '--rg', 'all', '--replace']
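A minimal sketch of how one of these command lists can be executed and its output captured with subprocess.Popen (assuming the mmvdisk CLI is available on the host):

proc = subprocess.Popen(all_not_ok, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = proc.communicate()
if proc.returncode != 0:
    logging.error('mmvdisk reported an error: %s', err.decode())
disk_report = out.decode()  # raw text listing of pdisks that are not OK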
def clean_data(file):
    """Extract JSON records beginning with '{"metric":' from a raw export file."""
    try:
        result = []
        file_name = os.path.splitext(file)[0]
        substring = '{"metric":'
        filename = file_name + "_data.txt"
        with open(file, "r") as f, open(filename, "w") as newdata:
            mmapped_file = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            # Locate every occurrence of the marker and write out the span
            # between it and the next occurrence (or the end of the file).
            start = mmapped_file.find(substring.encode(), 0)
            while start != -1:
                result.append(start)
                index = mmapped_file.find(substring.encode(), start + 1)
                end = index if index != -1 else len(mmapped_file)
                newdata.write("\n" + mmapped_file[start:end].decode("utf-8"))
                start = index
        if len(result) == 0:
            print('The file that was created named ' + filename + ' is EMPTY! Check your data.')
        return filename
    except Exception as e:
        print(e)
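For example, running the cleaner on a raw export (the file name here is illustrative) writes the extracted records to a sibling _data.txt file and returns its name:

cleaned = clean_data('prometheus_export.json')   # writes prometheus_export_data.txt
print('Cleaned records written to', cleaned)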
4. Multi-threaded Processing
Using Python's threading module, the script performs concurrent operations, enabling multiple tasks to execute simultaneously. This significantly reduces processing time for I/O-bound work such as database queries and file parsing by making better use of available system resources.
prometheus_data = clean_data(prometheus_file)   # raw Prometheus export
qtree_data = clean_data(q3)                     # raw QTree export
file = get.get_data(prometheus_data)
q3_data = get.get_data_q3(qtree_data)
t1 = threading.Thread(target=search_filer, args=(table, file, site, q3_data))
t1.start()
t1.join()
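Starting a single thread and joining it immediately still runs one task at a time; the speed-up comes from launching several threads before waiting on any of them. A minimal sketch for matching several sites concurrently, assuming search_filer is safe to run in parallel (the site names are illustrative):

threads = []
for s in ['site-a', 'site-b', 'site-c']:   # illustrative site names
    t = threading.Thread(target=search_filer, args=(table, file, s, q3_data))
    t.start()
    threads.append(t)
for t in threads:
    t.join()   # wait until every site has been processed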
5. Data Matching
The search_filer function ties the sources together:
- Database Query: Executes a MySQL query on the specified table to fetch data relevant to the provided site.
- Cross-Referencing: Matches database entries against Prometheus and QTree data from the input files.
- Discrepancy Detection: Flags mismatches or missing entries, ensuring a comprehensive understanding of the data.
- Reporting: Prepares detailed logs for matched and unmatched items, streamlining the output generation process.
def search_filer(table, file, site, q3_data):
    """Match database entries for one site against Prometheus and QTree data."""
    data_filer = []
    connection = con()  # helper returning a MySQL connection (defined elsewhere)
    try:
        # The site value is parameterized; the table name cannot be a query
        # parameter, so it must come from a trusted source.
        query = "SELECT filer, path, nfspath, owner, location FROM " + table + " WHERE location = %s"
        resp = pd.read_sql(query, connection, params=(site,))
        for _, row in resp.iterrows():
            owner = row['owner']
            nfspath = row['nfspath']
            filer = row['filer']
            path = row['path']
            loc = row['location']
            print('Query filer Owner {} and Nfspath {}'.format(owner, nfspath))
            if owner == 'none' or owner == '':
                if path == 'none' or path == '':
                    # Neither owner nor path recorded: nothing to match.
                    pass
            else:
                data = get.check_val_db(owner, nfspath, file, site, q3_data)
                if data == 1:
                    # No corresponding entry found in either source.
                    data_filer.append([filer, nfspath, '', '', 'No Match', owner, '', 'No Match', loc])
                else:
                    for match in data:
                        data_filer.append([filer, match[0], match[1], match[2], match[3],
                                           match[4], match[5], match[6], loc])
        get.write_data_db(data_filer)
    except Exception as e:
        print(e)
6. Customizable Output
The final report is saved as a CSV file, turning the results of the data-matching process into a structured document that is easy to interpret and analyze.
- Data Structuring: Organizes matched and unmatched records into distinct columns for clarity.
- File Export: Writes the processed data into a CSV file named data_db.csv. The CSV format ensures compatibility with various data analysis tools.
- Scalability: Handles large datasets efficiently by building the full result set in memory and writing it in a single pass with pandas.
def write_data_db(data_filer):
    """Write the matched/unmatched rows collected by search_filer to data_db.csv."""
    filer = []
    itss_path = []
    prometheus_path = []
    q3_path = []
    nfspath_match = []
    itss_owner = []
    prometheus_owner = []
    owner_match = []
    loc = []
    r = 0
    try:
        # Split each result row into per-column lists.
        for i in data_filer:
            filer.append(i[0])
            itss_path.append(i[1])
            prometheus_path.append(i[2])
            q3_path.append(i[3])
            nfspath_match.append(i[4])
            itss_owner.append(i[5])
            prometheus_owner.append(i[6])
            owner_match.append(i[7])
            loc.append(i[8])
            r += 1
            dot = random.randint(1, 20)   # simple progress indicator
            print('adding row ' + str(r) + ' .' * dot)
        df = pd.DataFrame(list(zip(filer, itss_path, prometheus_path, q3_path, nfspath_match,
                                   itss_owner, prometheus_owner, owner_match, loc)),
                          columns=['Filer', 'Itss path', 'Prometheus path', 'Qtree', 'NFSpath match',
                                   'Itss owner', 'Prometheus owner', 'Owner match', 'Location'])
        df.to_csv('data_db.csv', index=False)
        print('File data_db.csv Created')
    except Exception as e:
        print(e)
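Because the report is plain CSV, it can be loaded straight back into pandas for follow-up analysis, for example:

report = pd.read_csv('data_db.csv')
print(report['Owner match'].value_counts())   # quick tally of matches vs. mismatches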
Who Is This For?
- Data Engineers: Automate repetitive data cleaning and matching tasks.
- Database Administrators: Simplify database queries and integrate data from multiple sources.
- DevOps Professionals: Analyze and manage Prometheus and QTree data efficiently.
- Researchers: Handle large datasets with ease, ensuring accurate and organized results.
Conclusion
Our Python program is a powerful tool for anyone dealing with large-scale data management tasks. By automating the tedious aspects of data processing, it frees you up to focus on analysis and decision-making. Try it today and experience seamless data management!
To see the program, visit our Git repository. (GITHUB)