Advanced Disk Monitoring and Replacement System: A Python Solution

For systems engineers and DevOps professionals, staying ahead of disk failures is critical. Manual intervention doesn’t scale, and that’s where automation comes in.
This Python script automates disk replacement in a storage environment. It wraps the IBM Spectrum Scale RAID mmvdisk CLI to monitor disk health, prepare disks for replacement, and execute replacement procedures, all while maintaining comprehensive logging and reporting.

What Does the Program Do?

The script works with IBM Spectrum Scale RAID through the mmvdisk command and provides the following:

  1. Disk Health Monitoring: Identifies disks in a “not OK” state, detects disks marked for replacement, and provides detailed disk information including recovery group, state, and location.
  2. Replacement Preparation: Safely prepares disks for replacement to minimize data loss.
  3. Automated Operations: Automates disk replacement (with dry-run and preparation options).
  4. Detailed Logging: Comprehensive logging to track all operations.
  5. Email Notifications: Alerts system administrators about problematic disks.
  6. Formatted Output: Clean table presentation of disk status information.
  7. Performance Monitoring: Tracks execution time for operational efficiency.

Technical Implementation

Dependencies and Configuration

The script uses several Python libraries for different functionalities:
				
import pandas as pd              # Data manipulation and analysis
import subprocess                # Execute shell commands
import json, time                # Data serialization and timing operations
from datetime import datetime    # Date and time handling
import logging                   # Logging operations
from logging.handlers import SysLogHandler  # System logging
from docopt import docopt        # Command-line argument parsing
from prettytable import PrettyTable  # Formatted table output
import smtplib                   # Email functionality
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
				
			
The logging configuration is set up to maintain a permanent record of all operations:
				
logging.basicConfig(level=logging.INFO,
                    filename='logs.log',
                    filemode='a',
                    format='%(message)s  %(asctime)s',
                    datefmt="%Y-%m-%d %T")
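
To see what this format string produces, here is a minimal sketch that applies the same format to an in-memory stream instead of logs.log. The logger name pdisk_sketch and the helper make_logger are hypothetical, and %H:%M:%S is used as a portable spelling of %T, whose support depends on the platform's strftime.

```python
import io
import logging


def make_logger(stream):
    """Return a logger that writes 'message  timestamp' lines to stream."""
    logger = logging.getLogger("pdisk_sketch")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()      # avoid duplicate handlers on repeated calls
    logger.propagate = False     # keep output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter(fmt="%(message)s  %(asctime)s",
                          datefmt="%Y-%m-%d %H:%M:%S")
    )
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = make_logger(buffer)
log.info("Command: mmvdisk pdisk list --rg all --not-ok")
print(buffer.getvalue(), end="")
```

Because the message comes first in the format string, each record ends with its timestamp rather than starting with it.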
				
			

CLI Interface

The script utilizes the docopt library to provide a clean, well-documented command-line interface:
				
Usage:
    pdisk.py --replace [--short]
    pdisk.py --prepare [--short]
    pdisk.py --dryrun [--short]
    pdisk.py --email -e <EMAIL>
    pdisk.py --version
    pdisk.py -h | --help
				
			
This makes the script highly user-friendly, with clear documentation built right into the command-line help.

Disk Status Collection

The script executes storage management commands using subprocess.Popen to gather information about disk status:
				
mmvdisk pdisk list --rg all --not-ok
mmvdisk pdisk list --rg all --replace

				
			
These commands identify disks that are not functioning correctly and those that are marked for replacement.
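
As a rough illustration of this step, the sketch below wraps a command in a small helper that captures its output and exit status. It uses subprocess.run rather than the script's subprocess.Popen, and the helper name run_command and the timeout value are assumptions, not part of the original script. A stand-in command is executed so the example runs on machines without mmvdisk.

```python
import subprocess
import sys


def run_command(argv, timeout=60):
    """Run a command and return (return_code, stdout_text, stderr_text)."""
    completed = subprocess.run(
        argv,
        capture_output=True,  # collect stdout and stderr instead of printing
        text=True,            # decode bytes to str
        timeout=timeout,
        check=False,          # let the caller inspect the return code
    )
    return completed.returncode, completed.stdout, completed.stderr


# The two listing commands from the article, as argument lists (no shell).
NOT_OK_CMD = ["mmvdisk", "pdisk", "list", "--rg", "all", "--not-ok"]
REPLACE_CMD = ["mmvdisk", "pdisk", "list", "--rg", "all", "--replace"]

# Stand-in command so the sketch runs anywhere Python is installed.
code, out, err = run_command([sys.executable, "-c", "print('ok')"])
print(code, out.strip())
```

On a storage node, passing NOT_OK_CMD or REPLACE_CMD to run_command would return the listing text for further parsing.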

Data Processing

The output from system commands is processed using Pandas, a powerful data analysis library:

				
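def get_failed_pdisk(filename, command):
    # Parse a saved command-output file; columns are separated by two or more whitespace characters
    df = pd.read_csv(filename, sep=r'\s{2,}', engine='python')
    # Keep only the columns needed to address a disk
    return df[["recovery group", "pdisk"]]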
				
			
This approach allows for easy manipulation and filtering of disk information.
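
To show why the separator is "two or more whitespace characters", here is a small standard-library sketch that applies the same regular expression with re.split instead of Pandas. The sample text is made up for illustration and is not real mmvdisk output; only the column names recovery group and pdisk come from the function above. The helper name parse_columns is hypothetical.

```python
import re

# Hypothetical sample text, NOT real mmvdisk output.
SAMPLE = """\
recovery group  pdisk   state
rg0             pdisk1  failed
rg1             pdisk2  missing
"""


def parse_columns(text):
    """Split fixed-width text into dicts, treating 2+ whitespace as a separator."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    header = re.split(r"\s{2,}", lines[0])
    return [dict(zip(header, re.split(r"\s{2,}", line))) for line in lines[1:]]


rows = parse_columns(SAMPLE)
failed = [(row["recovery group"], row["pdisk"]) for row in rows]
print(failed)  # [('rg0', 'pdisk1'), ('rg1', 'pdisk2')]
```

Splitting on a single space would break the header "recovery group" into two columns, which is why the pattern requires at least two whitespace characters.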

Disk Replacement Operations

The script can prepare disks for replacement and execute the replacement procedure:
				
def replace_pdisk(args, pdisk, group, need_replace):
    # Command to prepare or replace a disk
    if args['--prepare']:
        output_ = subprocess.Popen(['mmvdisk', 'pdisk', 'replace', '--prepare', '--rg', group, '--pdisk', pdisk], stdout=subprocess.PIPE)
        # ...process output
    else:
        output_ = subprocess.Popen(['mmvdisk', 'pdisk', 'replace', '--recovery-group', group, '--pdisk', pdisk], stdout=subprocess.PIPE)
        # ...process output
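
One way to keep the dry-run, prepare, and replace modes consistent is to build each command as an argument list first and decide separately whether to execute it. The sketch below is an assumption about how that could be structured, not the script's actual implementation; build_replace_command and run_plan are hypothetical names, and the flags are the ones shown in the snippet above.

```python
import shlex
import subprocess


def build_replace_command(group, pdisk, prepare=False):
    """Return the mmvdisk argument list for one pdisk."""
    if prepare:
        return ["mmvdisk", "pdisk", "replace", "--prepare",
                "--rg", group, "--pdisk", pdisk]
    return ["mmvdisk", "pdisk", "replace",
            "--recovery-group", group, "--pdisk", pdisk]


def run_plan(disks, prepare=False, dry_run=True):
    """Build one command per (group, pdisk) pair; execute only if dry_run is False."""
    commands = [build_replace_command(g, p, prepare) for g, p in disks]
    if dry_run:
        print("[DRY-RUN] Would run:")
        for cmd in commands:
            print(shlex.join(cmd))
        return commands
    for cmd in commands:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(cmd, check=True)
    return commands


planned = run_plan([("rg0", "pdisk1"), ("rg1", "pdisk2")], prepare=True)
```

Defaulting dry_run to True means a destructive command only runs when the caller asks for it explicitly.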
				
			

Notification System

For critical operations, the script includes an email notification system:
				
def send_email(sender_email, sender_password, receiver_email, subject, message):
    # Create and send email notifications
    msg = MIMEMultipart()
    # ... email configuration
    with smtplib.SMTP("smtp.gmail.com", 587) as smtp:
        smtp.starttls()
        smtp.login(sender_email, sender_password)
        smtp.send_message(msg)
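
The message construction is elided in the snippet above. As a hedged illustration of what such a step typically looks like with the standard email package, the sketch below builds a plain-text message without sending it. The helper name build_message and the sample addresses are hypothetical, and the original script may set different headers.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_message(sender_email, receiver_email, subject, message):
    """Return a multipart message with a single plain-text body part."""
    msg = MIMEMultipart()
    msg["From"] = sender_email
    msg["To"] = receiver_email
    msg["Subject"] = subject
    msg.attach(MIMEText(message, "plain"))
    return msg


# Hypothetical addresses and text, for illustration only.
msg = build_message(
    "alerts@example.com",
    "example@example.com",
    "Disks need replacement",
    "pdisk1 in rg0 needs replacement",
)
print(msg["Subject"])
```

Building the message in its own function makes it possible to check the headers and body in a test without contacting an SMTP server.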
				
			

Visualization and Output

The script uses PrettyTable to format the output in an easily readable table:
				
def display_state(dataframe, t_info):
    # Create and display tabular data
    table = PrettyTable()
    table.field_names = ["Name", "RecoveryGroup", "state", "location", "hardware", "User location", "Server"]
    # ... populate and display table
				
			

Usage Examples

Checking Disks That Need Replacement

To identify disks that need replacement without making any changes:

				
python pdisk.py --dryrun
				
			
Simulated output:
				
$ python pdisk.py --dryrun

+------------------+------------------------------------------------+
| Command:         | mmvdisk pdisk list --rg all --not-ok          |
+------------------+------------------------------------------------+
Disk not ok

Name       RecoveryGroup  state     location   hardware   User location   Server
pdisk1     rg0            failed    bay3       SAS        Rack-A01        node01
pdisk2     rg1            missing   bay7       NVMe       Rack-B04        node02

+------------------+------------------------------------------------+
| Command:         | mmvdisk pdisk list --rg all --replace         |
+------------------+------------------------------------------------+
List of replace disks

Name       RecoveryGroup  state     location   hardware   User location   Server
pdisk1     rg0            failed    bay3       SAS        Rack-A01        node01
pdisk2     rg1            missing   bay7       NVMe       Rack-B04        node02


DISKS NEEDS REPLACEMENT!
[{'name': 'pdisk1', 'recoveryGroup': 'rg0', 'state': 'failed', 'location': 'bay3', ...},
 {'name': 'pdisk2', 'recoveryGroup': 'rg1', 'state': 'missing', 'location': 'bay7', ...}]

List of pdisk needs to be replaced:
Command: ['mmvdisk pdisk replace --prepare --rg rg0 --pdisk pdisk1', 'mmvdisk pdisk replace --prepare --rg rg1 --pdisk pdisk2']

[DRY-RUN] Would run:
mmvdisk pdisk replace --prepare --rg rg0 --pdisk pdisk1
mmvdisk pdisk replace --prepare --rg rg1 --pdisk pdisk2

The program took 0:00:45 to run.
Date and time program was initiated 2025-04-11,09:27 UTC
				
			

Preparing Disks for Replacement

To prepare disks for physical replacement:
				
python pdisk.py --prepare
				
			
Simulated output:
				
$ python pdisk.py --prepare

+------------------+------------------------------------------------+
| Command:         | mmvdisk pdisk list --rg all --not-ok          |
+------------------+------------------------------------------------+
Disk not ok

Name       RecoveryGroup  state     location   hardware   User location   Server
pdisk1     rg0            failed    bay3       SAS        Rack-A01        node01
pdisk2     rg1            missing   bay7       NVMe       Rack-B04        node02

+------------------+------------------------------------------------+
| Command:         | mmvdisk pdisk list --rg all --replace         |
+------------------+------------------------------------------------+
List of replace disks

Name       RecoveryGroup  state     location   hardware   User location   Server
pdisk1     rg0            failed    bay3       SAS        Rack-A01        node01
pdisk2     rg1            missing   bay7       NVMe       Rack-B04        node02

DISKS NEEDS PREPARATION BEFORE REPLACEMENT!
[{'name': 'pdisk1', 'recoveryGroup': 'rg0', 'state': 'failed', 'location': 'bay3', ...},
 {'name': 'pdisk2', 'recoveryGroup': 'rg1', 'state': 'missing', 'location': 'bay7', ...}]


Preparing disks for replacement:
Command: mmvdisk pdisk replace --prepare --rg rg0 --pdisk pdisk1
Command: mmvdisk pdisk replace --prepare --rg rg1 --pdisk pdisk2

Successfully prepared pdisk for replace!
Command: mmvdisk pdisk replace --prepare --rg rg0 --pdisk pdisk1 --> OUTPUT: Reinsert carrier.

Successfully prepared pdisk for replace!
Command: mmvdisk pdisk replace --prepare --rg rg1 --pdisk pdisk2 --> OUTPUT: Reinsert carrier.

Name       RecoveryGroup  state     location   hardware   User location   Server
pdisk1     rg0            failed    bay3       SAS        Rack-A01        node01
pdisk2     rg1            missing   bay7       NVMe       Rack-B04        node02

The program took 0:01:05 to run.
Date and time program was initiated 2025-04-11,09:34 UTC

				
			

Executing Disk Replacement

To perform the actual disk replacement:
				
python pdisk.py --replace
				
			

Simulated output:

				
$ python pdisk.py --replace

+------------------+------------------------------------------------+
| Command:         | mmvdisk pdisk list --rg all --not-ok          |
+------------------+------------------------------------------------+
Disk not ok

Name       RecoveryGroup  state     location   hardware   User location   Server
pdisk1     rg0            failed    bay3       SAS        Rack-A01        node01
pdisk2     rg1            missing   bay7       NVMe       Rack-B04        node02

+------------------+------------------------------------------------+
| Command:         | mmvdisk pdisk list --rg all --replace         |
+------------------+------------------------------------------------+
List of replace disks

Name       RecoveryGroup  state     location   hardware   User location   Server
pdisk1     rg0            failed    bay3       SAS        Rack-A01        node01
pdisk2     rg1            missing   bay7       NVMe       Rack-B04        node02


DISKS NEEDS REPLACEMENT!
[{'name': 'pdisk1', 'recoveryGroup': 'rg0', 'state': 'failed', 'location': 'bay3', ...},
 {'name': 'pdisk2', 'recoveryGroup': 'rg1', 'state': 'missing', 'location': 'bay7', ...}]


List of pdisk needs to be replaced:
Command: ['mmvdisk pdisk replace --prepare --rg rg0 --pdisk pdisk1', 'mmvdisk pdisk replace --prepare --rg rg1 --pdisk pdisk2']

Successfully prepared pdisk for replace!
Command: mmvdisk pdisk replace --prepare --rg rg0 --pdisk pdisk1 --> OUTPUT: Reinsert carrier.

Successfully prepared pdisk for replace!
Command: mmvdisk pdisk replace --prepare --rg rg1 --pdisk pdisk2 --> OUTPUT: Reinsert carrier.


Name       RecoveryGroup  state     location   hardware   User location   Server
pdisk1     rg0            failed    bay3       SAS        Rack-A01        node01
pdisk2     rg1            missing   bay7       NVMe       Rack-B04        node02

The program took 0:01:02 to run.
Date and time program was initiated 2025-04-11,09:24 UTC

				
			

Sending Notifications

To send an email notification about disks that need replacement:
				
python pdisk.py --email -e example@example.com
				
			

Simulated output:

				
$ python pdisk.py --email -e example@example.com

... (disk check tables as above)

Sending email to: example@example.com

Email sent to Trial1 (example@example.com)

The program took 0:00:48 to run.
Date and time program was initiated 2025-04-11,09:29 UTC

				
			

Sample Log Output

Below is a sample of the log output generated during a simulated run. The tool logs each command together with its output and a timestamp, covering both successful operations and error conditions.
				
Command: mmvdisk pdisk list --rg all --not-ok ---> Output: Disk not ok.  
2025-04-11 09:32:15

Command: mmvdisk pdisk list --rg all --replace ---> Output: Disk list with replacement suggestions.  
2025-04-11 09:32:17

List of pdisk needs to be replaced:
 Command: mmvdisk pdisk list --rg all --replace
  recovery group   pdisk
0           rg0    pdisk1
1           rg1    pdisk2  
2025-04-11 09:32:18

Successfully prepared pdisk for replace!
 Command: mmvdisk pdisk replace --prepare --rg rg0 --pdisk pdisk1 --> OUTPUT: Reinsert carrier.   
2025-04-11 09:32:23

Successfully prepared pdisk for replace!
 Command: mmvdisk pdisk replace --prepare --rg rg1 --pdisk pdisk2 --> OUTPUT: Reinsert carrier.   
2025-04-11 09:32:29

The program took 0:01:05 to run.  
Date and time program was initiated 2025-04-11,09:31 UTC  
2025-04-11 09:32:35

				
			

Security Considerations

The script includes email authentication but stores credentials in plain text. In a production environment, this should be replaced with a more secure approach such as environment variables or a secure credentials store.
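
A minimal sketch of the environment-variable approach is shown below. The variable names PDISK_SMTP_USER and PDISK_SMTP_PASSWORD and the helper load_smtp_credentials are hypothetical; any naming scheme works as long as the values are set outside the source code.

```python
import os


def load_smtp_credentials(environ=None):
    """Return (user, password) from the environment, or raise if either is missing."""
    env = os.environ if environ is None else environ
    try:
        # Hypothetical variable names; adapt them to your deployment.
        return env["PDISK_SMTP_USER"], env["PDISK_SMTP_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(
            f"Missing environment variable: {missing.args[0]}"
        ) from None


# Demonstration with an explicit mapping instead of the real environment.
user, password = load_smtp_credentials(
    {"PDISK_SMTP_USER": "alerts@example.com", "PDISK_SMTP_PASSWORD": "not-a-real-secret"}
)
print(user)
```

Failing fast with a clear error when a variable is missing avoids a confusing authentication failure later in the run.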

Conclusion

This disk management tool demonstrates how Python can be used to automate complex system administration tasks. By combining system commands with data analysis and reporting capabilities, we’ve created a powerful utility that simplifies maintenance operations and improves reliability.

Whether you’re managing a small cluster or a large-scale storage infrastructure, automated tools like this can significantly reduce the operational burden and minimize the risk of human error during critical maintenance procedures.

This script saves hours of manual work and reduces the risk of overlooking critical disk failures in production. Feel free to adapt and expand it to fit your infrastructure needs.

You can find the full code on my GitHub or reach out if you’d like help adapting it for your environment.