Automate Your Data Management with Python: Introducing the Ultimate Data Processing Script

In today’s data-driven world, managing and processing large datasets efficiently is more crucial than ever. Our Python program is designed to streamline your data workflows, enabling you to clean, analyze, and manage datasets with ease. Whether you’re working with MySQL databases, Prometheus data, or complex hierarchical data structures, this tool has you covered.

What Does the Program Do?

This Python script offers a comprehensive solution for:

  1. Data Cleaning: Automatically cleans input files to ensure they’re ready for processing.
  2. Database Interaction: Connects seamlessly with MySQL databases to query and retrieve data.
  3. File Parsing: Extracts critical information from structured text files such as mount names, NFS paths, and site locations.
  4. Data Matching: Compares data between multiple sources (ITSS, Prometheus, QTree) to identify matches and discrepancies.
  5. Output Generation: Produces a detailed CSV report summarizing data matches, ownership verification, and location-based insights.

Key Features

1. Database Connectivity

Our program integrates with MySQL databases using the mysql.connector module. It efficiently retrieves data from specified tables and allows you to perform location-based queries with ease.
Establishing a connection to the MySQL database:
The function below establishes a connection to the MySQL database and returns a connection object, which is then used to execute queries. Key parameters include the host address, username, password, port, and the name of the database to use.

import mysql.connector


def con():
    # Update these values to match your own MySQL server.
    connection = mysql.connector.connect(
        host="localhost",   # your host address
        user="root",        # your username
        passwd="password",  # your password
        port=3307,
        db="itss"           # name of the database you want to use
    )

    return connection
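For example, the connection can be handed straight to Pandas to run a location-based query. This is only a minimal sketch: the itss table name and the "ams" site code are placeholders, and the column names follow the schema used by search_filer later in this post.

import pandas as pd

connection = con()
# Fetch every filer recorded for one site; %s is filled in by the driver.
df = pd.read_sql(
    "SELECT filer, path, nfspath, owner, location FROM itss WHERE location = %s",
    connection,
    params=("ams",),
)
print(df.head())
connection.close()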

2. Advanced File Parsing

With support for large text files, the script extracts key-value pairs from JSON-like structures and organizes them into a clean, tabular format using Pandas. This ensures that your data is structured and ready for analysis.
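To give a feel for the approach, here is a rough sketch, not the project’s actual get.get_data helper, which assumes each cleaned line is a self-contained JSON object whose "metric" block holds the key-value labels of interest:

import json

import pandas as pd


def parse_records(filename):
    # Build one row per JSON-like record found in the cleaned file.
    rows = []
    with open(filename, "r") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed or partial records
            if isinstance(record, dict):
                # Keep the "metric" labels if present, otherwise the whole record.
                rows.append(record.get("metric", record))
    return pd.DataFrame(rows)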

3. Data Cleaning and Validation

How Data Cleaning Works
The program ensures your input files are free from unnecessary noise and errors. The function below scans the raw export for records that begin with '{"metric":', writes each one on its own line to a new *_data.txt file, and returns the name of that cleaned file so it is ready for processing.

import mmap
import os


def clean_data(file):
    try:
        result = []

        file_name = os.path.splitext(file)[0]
        substring = '{"metric":'
        filename = file_name + "_data.txt"

        with open(filename, "w") as newdata, open(file, "r") as f:
            mmapped_file = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            # Find every record that starts with '{"metric":' and write each
            # one to the new file on its own line.
            start = mmapped_file.find(substring.encode(), 0)
            while start != -1:
                result.append(start)
                nxt = mmapped_file.find(substring.encode(), start + 1)
                end = nxt if nxt != -1 else len(mmapped_file)
                newdata.write("\n" + mmapped_file[start:end].decode("utf-8"))
                start = nxt

        if len(result) == 0:
            print('The file that was created named ' + filename + ' is EMPTY! Check your data.')

        return filename

    except Exception as e:
        print(e)
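A typical call simply points clean_data at the raw export (the file name below is only an example) and captures the name of the cleaned file it returns:

cleaned_file = clean_data("prometheus_export.txt")
print("Cleaned data written to", cleaned_file)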

4. Multi-threaded Processing

Using Python’s threading module, the script can run several tasks concurrently. This significantly reduces processing time for large datasets and I/O-bound work by making better use of available system resources.

    prometheus_data = clean_data(dir)
    qtree_data = clean_data(q3)

    file = get.get_data(prometheus_data)
    q3_data = get.get_data_q3(qtree_data)

    # Hand the matching step off to a worker thread.
    t1 = threading.Thread(target=search_filer, args=(table, file, site, q3_data))
    t1.start()
    t1.join()
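Because join() is called right after start(), the snippet above runs one worker at a time. To actually overlap work, several threads can be started before any of them is joined. The sketch below assumes a hypothetical list of site codes and reuses the table, file, and q3_data values from the snippet above:

import threading

sites = ["site_a", "site_b"]  # placeholder site codes

threads = []
for s in sites:
    # One worker per site, all running concurrently.
    t = threading.Thread(target=search_filer, args=(table, file, s, q3_data))
    t.start()
    threads.append(t)

# Wait for every site to finish.
for t in threads:
    t.join()

Note that each worker writes its own report, so in practice the output file name produced by write_data_db would need to vary per site.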

5. Data Matching

This feature serves as the core data-matching engine. It combines database queries with input file data to generate meaningful insights. Here’s how it works:

import pandas as pd


def search_filer(table, file, site, q3_data):
    data_filer = []
    connection = con()
    try:
        # The table name cannot be sent as a query parameter, but the site
        # value is, which avoids building the WHERE clause by hand.
        query = ("SELECT filer, path, nfspath, owner, location FROM " + table +
                 " WHERE location = %s")
        resp = pd.read_sql(query, connection, params=(site,))

        for _, row in resp.iterrows():
            owner = row['owner']
            nfspath = row['nfspath']
            filer = row['filer']
            path = row['path']
            loc = row['location']

            print('Query filer Owner {} and Nfspath {} '.format(owner, nfspath))

            # Rows without an owner cannot be verified, so skip them.
            if owner in ('none', ''):
                continue

            data = get.check_val_db(owner, nfspath, file, site, q3_data)

            if data == 1:
                # check_val_db found nothing for this filer.
                data_filer.append([filer, nfspath, '', '', 'No Match',
                                   owner, '', 'No Match', loc])
            else:
                for match in data:
                    data_filer.append([filer, match[0], match[1], match[2],
                                       match[3], match[4], match[5], match[6], loc])

        get.write_data_db(data_filer)

    except Exception as e:
        print(e)
    finally:
        connection.close()

6. Customizable Output

The final report is saved as a CSV file. This step stores the results of the data-matching process in a structured format, ensuring the final output is easy to interpret and analyze.

Here’s how it works:

import random

import pandas as pd


def write_data_db(data_filer):
    # One list per output column; each row of data_filer is unpacked below.
    filer = []
    itss_path = []
    Prometheus_path = []
    q3_path = []
    nfspath_match = []
    itss_owner = []
    Prometheus_owner = []
    ownermatch = []
    loc = []
    r = 0

    try:
        for i in data_filer:
            filer.append(i[0])
            itss_path.append(i[1])
            Prometheus_path.append(i[2])
            q3_path.append(i[3])
            nfspath_match.append(i[4])
            itss_owner.append(i[5])
            Prometheus_owner.append(i[6])
            ownermatch.append(i[7])
            loc.append(i[8])
            r += 1
            # Print a simple progress indicator with a random number of dots.
            dot = random.randint(1, 20)
            print('adding row ' + str(r) + ' .' * dot)

        df = pd.DataFrame(
            list(zip(filer, itss_path, Prometheus_path, q3_path, nfspath_match,
                     itss_owner, Prometheus_owner, ownermatch, loc)),
            columns=['Filer', 'Itss path', 'Prometheus path', 'Qtree',
                     'NFSpath match', 'Itss owner', 'Prometheus owner',
                     'Owner match', 'location'])
        df.to_csv('data_db.csv', index=False)
        print('File data_db.csv Created')

    except Exception as e:
        print(e)
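Once data_db.csv exists, it can be read back with Pandas for quick analysis. For example, listing only the rows where no NFS path match was found (the column names come from write_data_db above):

import pandas as pd

report = pd.read_csv('data_db.csv')
# Keep only the rows where the ITSS path had no matching Prometheus path.
unmatched = report[report['NFSpath match'] == 'No Match']
print(unmatched[['Filer', 'Itss path', 'Itss owner', 'location']])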

Who Is This For?

This program is ideal for:

  1. Storage and infrastructure teams who need to reconcile filers, NFS paths, and QTree data across sites.
  2. Engineers who maintain ITSS inventories in MySQL and want to verify ownership and paths against Prometheus exports.
  3. Anyone who regularly cleans large text exports and needs a single CSV report of matches and discrepancies.

Conclusion

Our Python program is a powerful tool for anyone dealing with large-scale data management tasks. By automating the tedious aspects of data processing, it frees you up to focus on analysis and decision-making. Try it today and experience seamless data management!

To see the full program, visit our Git repository. (GITHUB)