Automate Your Data Management with Python: Introducing the Ultimate Data Processing Script

In today’s data-driven world, managing and processing large datasets efficiently is more crucial than ever. Our Python program is designed to streamline your data workflows, enabling you to clean, analyze, and manage datasets with ease. Whether you’re working with MySQL databases, Prometheus data, or complex hierarchical data structures, this tool has you covered.

What Does the Program Do?

This Python script offers a comprehensive solution for:

  1. Data Cleaning: Automatically cleans input files to ensure they’re ready for processing.
  2. Database Interaction: Connects seamlessly with MySQL databases to query and retrieve data.
  3. File Parsing: Extracts critical information from structured text files such as mount names, NFS paths, and site locations.
  4. Data Matching: Compares data between multiple sources (ITSS, Prometheus, QTree) to identify matches and discrepancies.
  5. Output Generation: Produces a detailed CSV report summarizing data matches, ownership verification, and location-based insights.

Implementation

Dependencies and Configuration

The script uses several Python libraries for different functionalities:

import pandas as pd              # Data manipulation and analysis
import subprocess                # Execute shell commands
import json, time                # Data serialization and timing operations
import os, mmap                  # File-path handling and memory-mapped file searches
import random                    # Progress indicator used while writing the report
import threading                 # Concurrent execution of the matching tasks
from datetime import datetime    # Date and time handling
import logging                   # Logging operations
from logging.handlers import SysLogHandler  # System logging
from docopt import docopt        # Command-line argument parsing
from prettytable import PrettyTable  # Formatted table output
import smtplib                   # Email functionality
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
The logging configuration is set up to maintain a permanent record of all operations:

logging.basicConfig(level=logging.INFO,
                    filename='logs.log',
                    filemode='a',                       # Append so earlier runs are preserved
                    format='%(message)s  %(asctime)s',
                    datefmt="%Y-%m-%d %H:%M:%S")
				
			

Command Line Interface

The script uses docopt to provide a user-friendly CLI:

def get_args():
    """Function to get command line arguments."""
    usage = """
    Usage:
        try.py --replace [--short]
        try.py --prepare [--short]
        try.py --dryrun [--short]
        try.py --email -e <EMAIL>
        try.py --version
        try.py -h | --help
    """
    args = docopt(usage)
    return args
Because docopt builds the parser directly from the usage string, the help printed by -h or --help always matches the options the script actually accepts, keeping the documentation and the behaviour in sync.
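A dispatch along the following lines is one way the parsed flags could drive the rest of the script; the worker functions named here (run_replace, run_prepare, run_dryrun, send_report) are placeholders for illustration, since the actual entry point is not shown in the excerpt:

def main():
    args = get_args()                       # dict keyed by the option names in the usage string

    if args['--replace']:
        run_replace(short=args['--short'])  # hypothetical worker functions
    elif args['--prepare']:
        run_prepare(short=args['--short'])
    elif args['--dryrun']:
        run_dryrun(short=args['--short'])
    elif args['--email']:
        send_report(args['<EMAIL>'])        # address captured by the -e <EMAIL> argument

if __name__ == '__main__':
    main()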

Core Functionality

Disk Status Checking
The script executes storage management commands using subprocess.Popen to gather information about disk status:

all_not_ok = ['mmvdisk', 'pdisk', 'list', '--rg', 'all', '--not-ok']
replace = ['mmvdisk', 'pdisk', 'list', '--rg', 'all', '--replace']

These commands identify disks that are not functioning correctly and those that are marked for replacement.
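The excerpt above only defines the argument lists; a wrapper along these lines shows how subprocess.Popen could run them and capture their output (a sketch, not the script's exact code):

def run_command(cmd):
    # Run an mmvdisk command and return its stdout as text.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
    out, err = proc.communicate()
    if proc.returncode != 0:
        logging.error('Command %s failed: %s', ' '.join(cmd), err.strip())
    return out

not_ok_output = run_command(all_not_ok)
replace_output = run_command(replace)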
				
					
Data Cleaning
Before any matching takes place, clean_data prepares the raw export for parsing: it memory-maps the input file, locates every '{"metric":' record, and writes the extracted segments to a new <name>_data.txt file:

def clean_data(file):
    """Extract every '{"metric":' record from a raw export into <name>_data.txt."""
    try:
        result = []

        file_name = os.path.splitext(file)[0]
        substring = '{"metric":'
        filename = file_name + "_data.txt"

        with open(file, "r") as f, open(filename, "w") as newdata:
            mmapped_file = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            start = 0
            index = mmapped_file.find(substring.encode(), start)
            while index != -1:
                result.append(index)
                start = index + 1
                # The next occurrence marks the end of the current record;
                # if there is none, read to the end of the file.
                index = mmapped_file.find(substring.encode(), start)
                end = index if index != -1 else len(mmapped_file)
                newdata.write("\n" + mmapped_file[start:end].decode("utf-8"))

        if len(result) == 0:
            print('The file that was created named ' + filename + ' is EMPTY! check your data')
        return filename

    except Exception as e:
        print(e)
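
A call like the following shows how the helper might be used on a raw Prometheus export; the file name is only an illustration:

cleaned = clean_data("prometheus_export.json")   # hypothetical input file
print("Cleaned records written to", cleaned)     # e.g. prometheus_export_data.txt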


				
			

4. Multi-threaded Processing

By using Python's threading module, the script can run several tasks concurrently. Because the heavy steps are mostly I/O-bound (database queries, file parsing, shell commands), letting one task wait on I/O while another makes progress significantly reduces overall processing time on large datasets. The excerpt below starts a single worker thread for the matching step and waits for it to finish; the same pattern extends to one worker per site or table.
				
    prometheus_data = clean_data(dir)
    qtree_data = clean_data(q3)

    file = get.get_data(prometheus_data)
    q3_data = get.get_data_q3(qtree_data)

    # Run the matching for this site in a worker thread and wait for it to finish.
    t1 = threading.Thread(target=search_filer, args=(table, file, site, q3_data))
    t1.start()
    t1.join()
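
Starting one thread and immediately joining it is effectively sequential. A variation along these lines is how several sites could be processed genuinely in parallel; the site names are placeholders, and it assumes the search_filer signature shown in the next section:

threads = []
for site in ('site_a', 'site_b', 'site_c'):    # hypothetical site list
    t = threading.Thread(target=search_filer, args=(table, file, site, q3_data))
    t.start()
    threads.append(t)

# Wait for every worker before moving on to the report stage.
for t in threads:
    t.join()

One caveat with this layout: each worker ends by writing data_db.csv, so truly concurrent runs would need per-site output names or a single, synchronized writer.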
				
			

5. Data-matching

This feature serves as the core data-matching engine. It combines database queries with input file data to generate meaningful insights. Here’s how it works:
				
					
def search_filer(table, file, site, q3_data):
    data_filer = []
    connection = con()
    try:
        # The site value is passed as a query parameter; the table name has to be
        # interpolated, so it must come from a trusted, validated source.
        query = "SELECT filer, path, nfspath, owner, location FROM " + table + " WHERE location = %s"
        resp = pd.read_sql(query, connection, params=[site])

        for _, row in resp.iterrows():
            owner = row['owner']
            nfspath = row['nfspath']
            filer = row['filer']
            path = row['path']
            loc = row['location']

            print('Query filer Owner {} and Nfspath {} '.format(owner, nfspath))

            # Rows without an owner cannot be verified, so skip them.
            if owner in ('none', ''):
                continue

            data = get.check_val_db(owner, nfspath, file, site, q3_data)

            if data == 1:
                # check_val_db signals "nothing matched" with 1.
                data_filer.append([filer, nfspath, '', '', 'No Match', owner, '', 'No Match', loc])
            else:
                for match in data:
                    data_filer.append([filer, match[0], match[1], match[2], match[3],
                                       match[4], match[5], match[6], loc])

        get.write_data_db(data_filer)

    except Exception as e:
        print(e)
    finally:
        connection.close()
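
The con() helper that opens the database connection is not shown in the excerpt. A minimal version, assuming the mysql-connector-python driver and placeholder credentials, could look like this:

import mysql.connector

def con():
    # Placeholder connection details for illustration; the real script presumably
    # reads these from its configuration or environment.
    return mysql.connector.connect(
        host='db.example.com',
        user='report_user',
        password='********',
        database='storage_inventory',
    )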
            
				
			

6. Customizable Output

The final report is saved as a CSV file: write_data_db takes the rows produced by the matching step and turns them into a structured table, which makes the output easy to interpret and analyze.

Here’s how it works:
				
					
def write_data_db(data_filer):

    filer = []
    itss_path = []
    Prometheus_path = []
    q3_path = []
    nfspath_match = []
    itss_owner = []
    Prometheus_owner = []
    ownermatch = []
    loc = []
    r = 0

    try:
        # Unpack each result row into per-column lists for the DataFrame.
        for i in data_filer:
            filer.append(i[0])
            itss_path.append(i[1])
            Prometheus_path.append(i[2])
            q3_path.append(i[3])
            nfspath_match.append(i[4])
            itss_owner.append(i[5])
            Prometheus_owner.append(i[6])
            ownermatch.append(i[7])
            loc.append(i[8])
            r += 1
            # Cosmetic progress indicator: a random number of dots per row.
            dot = random.randint(1, 20)
            print('adding row ' + str(r) + ' .' * dot)

        df = pd.DataFrame(list(zip(filer, itss_path, Prometheus_path, q3_path, nfspath_match,
                                   itss_owner, Prometheus_owner, ownermatch, loc)),
                          columns=['Filer', 'Itss path', 'Prometheus path', 'Qtree', 'NFSpath match',
                                   'Itss owner', 'Prometheus owner', 'Owner match', 'location'])
        df.to_csv('data_db.csv', index=False)
        print('File data_db.csv Created')

    except Exception as e:
        print(e)
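
Once data_db.csv exists it can be explored with the same pandas tooling. For example, to count how many rows failed the ownership check (a usage illustration, not part of the script):

import pandas as pd

report = pd.read_csv('data_db.csv')
mismatches = report[report['Owner match'] == 'No Match']
print('{} of {} rows have an owner mismatch'.format(len(mismatches), len(report)))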
				
			

Who Is This For?

This program is ideal for:

  1. Storage and system administrators who need to keep filer, qtree, and NFS path records consistent across sites.
  2. Engineers who reconcile ITSS inventory data with Prometheus exports and QTree listings.
  3. Anyone who regularly cross-checks MySQL records against exported text data and wants a repeatable, scripted report instead of manual comparisons.

Conclusion

Our Python program is a powerful tool for anyone dealing with large-scale data management tasks. By automating the tedious aspects of data processing, it frees you up to focus on analysis and decision-making. Try it today and experience seamless data management!

To see the full program, visit our Git repository. (GITHUB)