In today’s data-driven world, managing and processing large datasets efficiently is more crucial than ever. Our Python program is designed to streamline your data workflows, enabling you to clean, analyze, and manage datasets with ease. Whether you’re working with MySQL databases, Prometheus data, or complex hierarchical data structures, this tool has you covered.
What Does the Program Do?
This Python script offers a comprehensive solution for:
- Data Cleaning: Automatically cleans input files to ensure they’re ready for processing.
- Database Interaction: Connects seamlessly with MySQL databases to query and retrieve data.
- File Parsing: Extracts critical information from structured text files such as mount names, NFS paths, and site locations.
- Data Matching: Compares data between multiple sources (ITSS, Prometheus, QTree) to identify matches and discrepancies.
- Output Generation: Produces a detailed CSV report summarizing data matches, ownership verification, and location-based insights.
Key Features
1. Database Connectivity
The script connects to MySQL databases through the mysql.connector module. It efficiently retrieves data from specified tables and allows you to perform location-based queries with ease. The connection is configured with:
- Host: Specifies the database server.
- User and Password: Credentials for authentication.
- Database: The target database name.
import mysql.connector

def con():
    # Open a connection to the ITSS MySQL database.
    connection = mysql.connector.connect(
        host="localhost",    # your host address
        user="root",         # your username
        passwd="password",   # your password
        port=3307,
        db="itss"            # name of the database you want to use
    )
    return connection
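As a quick connectivity check, the returned connection works with a standard cursor. A small example:

connection = con()
cursor = connection.cursor()
cursor.execute("SHOW TABLES")   # list the tables available in the itss database
for (table_name,) in cursor:
    print(table_name)
connection.close()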
2. Advanced File Parsing
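As noted above, the parser extracts details such as mount names, NFS paths, and site locations from structured text files. In the full program this is handled by the get module (get.get_data and get.get_data_q3, used in the snippets below), whose internals aren't shown here. The following is only a minimal sketch of the idea, assuming each cleaned line carries a JSON object whose metric section holds illustrative labels like mountpoint and site:

import json

# Illustrative only: field names such as "mountpoint" and "site" are
# assumptions about the label set, not the program's actual schema.
def parse_metrics(filename):
    records = []
    with open(filename) as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line:
                continue
            try:
                obj = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed or partial fragments
            metric = obj.get("metric", {})
            records.append((metric.get("mountpoint", ""), metric.get("site", "")))
    return records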
3. Data Cleaning and Validation
- File Verification: The script checks if the input file exists and whether it contains data. If the file is empty, it halts further execution.
- Content Extraction: Using efficient file handling and memory mapping, the script extracts relevant sections of the file based on predefined markers (e.g., {"metric":).
- Data Transformation: The extracted content is cleaned of unwanted characters and formatted into a consistent structure for further processing.
The clean_data function below:
- Uses memory mapping to extract relevant data sections.
- Returns a cleaned and formatted version of the input file.
This approach ensures that only high-quality data is fed into subsequent steps, reducing errors and improving the accuracy of your analysis.
import mmap
import os

def clean_data(file):
    try:
        result = []
        file_name = os.path.splitext(file)[0]
        substring = '{"metric":'
        filename = file_name + "_data.txt"
        with open(file, "r") as f, open(filename, "w") as newdata:
            # Memory-map the input so large files can be scanned without
            # reading them fully into memory.
            mmapped_file = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            start = 0
            index = mmapped_file.find(substring.encode(), start)
            while index != -1:
                result.append(index)
                start = index + 1
                index = mmapped_file.find(substring.encode(), start)
                # Write the chunk between this marker and the next one
                # (or to the end of the file on the final pass).
                end = index if index != -1 else len(mmapped_file)
                newdata.write("\n" + mmapped_file[start:end].decode("utf-8"))
        if len(result) == 0:
            print('The file that was created named ' + filename + ' is EMPTY! Check your data.')
        return filename
    except Exception as e:
        print(e)
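A typical call, assuming a raw Prometheus export named prometheus.json sits in the working directory:

cleaned = clean_data("prometheus.json")   # writes prometheus_data.txt
print("Cleaned file:", cleaned)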
4. Multi-threaded Processing
Leveraging Python's threading module, the script performs concurrent operations, enabling multiple tasks to execute simultaneously. This significantly reduces processing time for I/O-bound work such as database queries and file reads by making better use of available system resources.
import threading

# Clean both raw files, then parse them with the helper module `get`.
prometheus_data = clean_data(dir)   # `dir` holds the Prometheus input path
qtree_data = clean_data(q3)         # `q3` holds the QTree input path
file = get.get_data(prometheus_data)
q3_data = get.get_data_q3(qtree_data)

# Run the matching step in a worker thread and wait for it to finish.
t1 = threading.Thread(target=search_filer, args=(table, file, site, q3_data))
t1.start()
t1.join()
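If several sites need processing, the same pattern extends to a pool of workers. A minimal sketch, assuming sites is a list of location codes; note that search_filer writes the report itself, so in practice you would collect results per worker and write once at the end:

threads = []
for s in sites:  # `sites` is an assumed list of location codes
    t = threading.Thread(target=search_filer, args=(table, file, s, q3_data))
    t.start()
    threads.append(t)
for t in threads:
    t.join()  # wait for every worker before reading the report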
5. Data Matching
- Database Query: Executes a MySQL query on the specified table to fetch data relevant to the provided site.
- Cross-Referencing: Matches database entries against Prometheus and QTree data from the input files.
- Discrepancy Detection: Flags mismatches or missing entries, ensuring a comprehensive understanding of the data.
- Reporting: Prepares detailed logs for matched and unmatched items, streamlining the output generation process.
import pandas as pd

def search_filer(table, file, site, q3_data):
    data_filer = []
    connection = con()
    try:
        # NOTE: table and site are interpolated directly into the SQL string,
        # so validate them upstream (see the parameterized variant below).
        db = ("SELECT filer, path, nfspath, owner, location FROM " + table +
              " WHERE location = '" + site + "'")
        resp = pd.read_sql(db, connection)
        for _, row in resp.iterrows():
            owner = row['owner']
            nfspath = row['nfspath']
            filer = row['filer']
            path = row['path']
            loc = row['location']
            print('Query filer Owner {} and Nfspath {} '.format(owner, nfspath))
            if owner == 'none' or owner == '':
                if path == 'none' or path == '':
                    # Nothing to cross-reference for this row.
                    # data_filer.append([path, '', '', 'No Match', owner, '', 'No Match'])
                    pass
            else:
                data = get.check_val_db(owner, nfspath, file, site, q3_data)
                if data == 1:
                    # No match found in the Prometheus/QTree data.
                    data_filer.append([filer, nfspath, '', '', 'No Match',
                                       owner, '', 'No Match', loc])
                else:
                    for match in data:
                        data_filer.append([filer, match[0], match[1], match[2], match[3],
                                           match[4], match[5], match[6], loc])
        get.write_data_db(data_filer)
    except Exception as e:
        print(e)
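One design note: the query above is built by string concatenation, which breaks if site ever contains a quote. A safer sketch passes the site value as a parameter (the table name still has to be validated separately, since SQL placeholders cannot substitute identifiers):

db = ("SELECT filer, path, nfspath, owner, location FROM " + table +
      " WHERE location = %s")
resp = pd.read_sql(db, connection, params=[site])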
6. Customizable Output
The final report is saved as a CSV file, capturing the results of the data-matching process in a structure that is easy to interpret and analyze.
- Data Structuring: Organizes matched and unmatched records into distinct columns for clarity.
- File Export: Writes the processed data into a CSV file named data_db.csv. The CSV format ensures compatibility with various data analysis tools.
- Scalability: Handles large datasets efficiently by assembling the report with pandas and writing it in a single pass.
import random
import pandas as pd

def write_data_db(data_filer):
    filer = []
    itss_path = []
    Prometheus_path = []
    q3_path = []
    nfspath_match = []
    itss_owner = []
    Prometheus_owner = []
    ownermatch = []
    loc = []
    r = 0
    try:
        # Split each matched record into per-column lists.
        for i in data_filer:
            filer.append(i[0])
            itss_path.append(i[1])
            Prometheus_path.append(i[2])
            q3_path.append(i[3])
            nfspath_match.append(i[4])
            itss_owner.append(i[5])
            Prometheus_owner.append(i[6])
            ownermatch.append(i[7])
            loc.append(i[8])
            r += 1
            # Lightweight progress indicator with a random trail of dots.
            dot = random.randint(1, 20)
            print('adding row ' + str(r) + ' .' * dot)
        df = pd.DataFrame(
            list(zip(filer, itss_path, Prometheus_path, q3_path, nfspath_match,
                     itss_owner, Prometheus_owner, ownermatch, loc)),
            columns=['Filer', 'Itss path', 'Prometheus path', 'Qtree', 'NFSpath match',
                     'Itss owner', 'Prometheus owner', 'Owner match', 'location'])
        df.to_csv('data_db.csv', index=False)
        print('File data_db.csv Created')
    except Exception as e:
        print(e)
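Once the report exists, a quick way to sanity-check it is to load it straight back with pandas:

import pandas as pd

report = pd.read_csv('data_db.csv')
print(report.head())                          # preview the first rows
print(report['Owner match'].value_counts())   # summary of owner matches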
Who Is This For?
- Data Engineers: Automate repetitive data cleaning and matching tasks.
- Database Administrators: Simplify database queries and integrate data from multiple sources.
- DevOps Professionals: Analyze and manage Prometheus and QTree data efficiently.
- Researchers: Handle large datasets with ease, ensuring accurate and organized results.
Conclusion
Our Python program is a powerful tool for anyone dealing with large-scale data management tasks. By automating the tedious aspects of data processing, it frees you up to focus on analysis and decision-making. Try it today and experience seamless data management!
To see the full program, visit our Git repository. (GITHUB)