Validation
Validation module for Macrodata Refinement (MDR).
This module provides functions and classes for validating macrodata and ensuring data quality.
- class mdr.core.validation.ValidationResult(is_valid, error_messages, invalid_indices=None, statistics=None)[source]
Bases: object
Result of a data validation operation.
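- Parameters:
is_valid – Whether the data passed validation
error_messages – Error messages describing any validation failures
invalid_indices – Indices of invalid values, if any
statistics – Relevant statistics, if any
- invalid_indices: array | None = None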
- mdr.core.validation.check_data_range(data, min_value, max_value)[source]
Check if all values in the data are within the specified range.
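- Parameters:
data – Input data array
min_value – Lower bound of the allowed range
max_value – Upper bound of the allowed range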
- Returns:
ValidationResult object containing validation results
- Return type:
ValidationResult
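As an illustration of what a range check involves, here is a minimal plain-Python sketch. It is not the mdr implementation: the helper name sketch_range_check, the inclusive bounds, and the decision to skip NaN values are all assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple


def sketch_range_check(
    data: Sequence[float], min_value: float, max_value: float
) -> Tuple[bool, List[int]]:
    """Hypothetical helper: return (is_valid, invalid_indices).

    Assumes inclusive bounds; the real check_data_range may differ.
    """
    invalid = [
        i
        for i, x in enumerate(data)
        # NaN is skipped here (assumption); a missing-value check
        # would normally report it separately.
        if not math.isnan(x) and not (min_value <= x <= max_value)
    ]
    return (len(invalid) == 0, invalid)


is_valid, bad = sketch_range_check(
    [20.5, 21.3, float("nan"), 21.7, 145.0], 0.0, 100.0
)
print(is_valid, bad)
```

Returning the offending indices alongside the pass/fail flag mirrors the invalid_indices field that ValidationResult exposes.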
- mdr.core.validation.check_missing_values(data, threshold=0.1)[source]
Check for missing values in the data.
- Parameters:
data (array) – Input data array
threshold (float) – Maximum allowed fraction of missing values
- Returns:
ValidationResult object containing validation results
- Return type:
ValidationResult
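The threshold is described as the maximum allowed fraction of missing values. A minimal sketch of that comparison follows, assuming missing values are encoded as NaN and that a fraction exactly equal to the threshold still passes. Both helper names are hypothetical, and the library's actual behavior may differ.

```python
import math
from typing import Sequence


def sketch_missing_fraction(data: Sequence[float]) -> float:
    """Hypothetical helper: fraction of entries that are NaN."""
    if len(data) == 0:
        return 0.0  # assumption: empty input counts as having no missing values
    missing = sum(1 for x in data if math.isnan(x))
    return missing / len(data)


def sketch_missing_ok(data: Sequence[float], threshold: float = 0.1) -> bool:
    """Pass when the missing fraction does not exceed the threshold (assumed <=)."""
    return sketch_missing_fraction(data) <= threshold


pressure = [101.3, 101.4, 80.0, float("nan"), float("nan")]
print(sketch_missing_fraction(pressure), sketch_missing_ok(pressure))
```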
- mdr.core.validation.check_outliers(data, threshold=3.0, method='zscore')[source]
Check for outliers in the data.
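- Parameters:
data – Input data array
threshold (float) – Threshold for flagging a value as an outlier; defaults to 3.0
method (str) – Outlier detection method; defaults to 'zscore'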
- Returns:
ValidationResult object containing validation results
- Return type:
ValidationResult
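For the default 'zscore' method, a value is typically flagged when its distance from the mean, measured in standard deviations, exceeds the threshold. The sketch below shows that idea in plain Python. The helper name sketch_zscore_outliers, the use of the population standard deviation, and the handling of NaN and constant data are assumptions, not details confirmed by the library.

```python
import math
import statistics
from typing import List, Sequence


def sketch_zscore_outliers(
    data: Sequence[float], threshold: float = 3.0
) -> List[int]:
    """Hypothetical helper: indices whose |z-score| exceeds the threshold."""
    finite = [x for x in data if not math.isnan(x)]
    if len(finite) < 2:
        return []  # too few points to estimate spread
    mean = statistics.mean(finite)
    std = statistics.pstdev(finite)  # population standard deviation (assumption)
    if std == 0.0:
        return []  # constant data has no outliers by this definition
    return [
        i
        for i, x in enumerate(data)
        if not math.isnan(x) and abs(x - mean) / std > threshold
    ]


readings = [10.0] * 9 + [100.0]
print(sketch_zscore_outliers(readings, threshold=2.5))
```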
- mdr.core.validation.check_data_integrity(data, checks=['range', 'missing', 'outliers'], params=None)[source]
Perform a comprehensive data integrity check.
- mdr.core.validation.validate_data(data_dict, checks=['range', 'missing', 'outliers'], params=None)[source]
Validate multiple data arrays.
- Parameters:
data_dict (Dict[str, array]) – Dictionary mapping variable names to data arrays
checks – Names of the checks to run; defaults to ['range', 'missing', 'outliers']
params (Dict[str, Dict[str, Any]] | None) – Parameters for each check. Can be structured in two ways:
1. Global parameters: {check_name: {parameters}}
2. Variable-specific parameters: {variable_name: {check_name: {parameters}}}
Variable-specific parameters take precedence over global ones.
- Returns:
Dictionary mapping variable names to ValidationResult objects
- Return type:
Dict[str, ValidationResult]
Overview
The validation module provides functions and classes for validating data quality through
various checks and tests. It helps identify issues such as:
- Missing values
- Statistical outliers
- Values outside expected ranges
- Inconsistencies between related variables
Core Components
ValidationResult
- class mdr.core.validation.ValidationResult(is_valid, error_messages, invalid_indices=None, statistics=None)[source]
Result of a data validation operation.
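- Parameters:
is_valid – Whether the data passed validation
error_messages – Error messages describing any validation failures
invalid_indices – Indices of invalid values, if any
statistics – Relevant statistics, if any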
- __post_init__()[source]
Validate the ValidationResult instance.
- Return type:
None
The ValidationResult class encapsulates the results of validation checks, including
whether the data passed validation, any error messages, and relevant statistics.
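The presence of __post_init__ suggests that ValidationResult is a dataclass. A minimal sketch of such a result type is shown below. The class name SketchValidationResult, the field types, and the specific checks performed in __post_init__ are assumptions for illustration only; the documentation does not state what the real method validates.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Sequence


@dataclass
class SketchValidationResult:
    """Hypothetical stand-in for mdr.core.validation.ValidationResult."""

    is_valid: bool
    error_messages: List[str]
    invalid_indices: Optional[Sequence[int]] = None  # type is an assumption
    statistics: Optional[Dict[str, Any]] = None      # type is an assumption

    def __post_init__(self) -> None:
        # Assumed consistency checks; the real __post_init__ may differ.
        if not isinstance(self.is_valid, bool):
            raise TypeError("is_valid must be a bool")
        if not all(isinstance(m, str) for m in self.error_messages):
            raise TypeError("error_messages must contain only strings")


result = SketchValidationResult(
    is_valid=False,
    error_messages=["1 value outside the allowed range"],
    invalid_indices=[4],
)
print(result.is_valid, result.error_messages)
```

Running the checks in __post_init__ means a malformed result fails at construction time rather than later, when a caller reads its fields.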
Data Validation Functions
- mdr.core.validation.validate_data(data_dict, checks=['range', 'missing', 'outliers'], params=None)[source]
Validate multiple data arrays.
- Parameters:
data_dict (Dict[str, array]) – Dictionary mapping variable names to data arrays
checks – Names of the checks to run; defaults to ['range', 'missing', 'outliers']
params (Dict[str, Dict[str, Any]] | None) – Parameters for each check. Can be structured in two ways:
1. Global parameters: {check_name: {parameters}}
2. Variable-specific parameters: {variable_name: {check_name: {parameters}}}
Variable-specific parameters take precedence over global ones.
- Returns:
Dictionary mapping variable names to ValidationResult objects
- Return type:
Dict[str, ValidationResult]
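To make the precedence rule for params concrete, the sketch below resolves the effective parameters for one variable and one check. The helper name sketch_resolve_params is hypothetical, and the key-by-key merge is an assumption: the library may instead replace the global parameters for a check wholesale when variable-specific ones are present.

```python
from typing import Any, Dict, Optional


def sketch_resolve_params(
    var_name: str,
    check_name: str,
    params: Optional[Dict[str, Dict[str, Any]]],
) -> Dict[str, Any]:
    """Hypothetical helper: merge global and variable-specific parameters."""
    if not params:
        return {}
    resolved: Dict[str, Any] = {}
    # 1. Global form: {check_name: {parameters}}
    global_params = params.get(check_name)
    if isinstance(global_params, dict):
        resolved.update(global_params)
    # 2. Variable-specific form: {variable_name: {check_name: {parameters}}}
    #    Applied last so that it takes precedence over the global values.
    var_params = params.get(var_name)
    if isinstance(var_params, dict):
        resolved.update(var_params.get(check_name, {}))
    return resolved


params = {
    "range": {"min_value": 0.0, "max_value": 100.0},
    "pressure": {"range": {"min_value": 50.0}},
}
print(sketch_resolve_params("temperature", "range", params))
print(sketch_resolve_params("pressure", "range", params))
```

Because both forms share one dictionary, a variable named after a check (for example, a variable called "range") would be ambiguous in this sketch; how the library handles that case is not documented here.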
- mdr.core.validation.check_outliers(data, threshold=3.0, method='zscore')[source]
Check for outliers in the data.
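- Parameters:
data – Input data array
threshold (float) – Threshold for flagging a value as an outlier; defaults to 3.0
method (str) – Outlier detection method; defaults to 'zscore'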
- Returns:
ValidationResult object containing validation results
- Return type:
ValidationResult
Usage Examples
Basic validation of data:
import numpy as np
from mdr.core.validation import validate_data

# Create a dictionary of data variables
data_dict = {
    "temperature": np.array([20.5, 21.3, np.nan, 21.7, 45.0]),
    "pressure": np.array([101.3, 101.4, 80.0, np.nan, np.nan])
}

# Validate the data
validation_results = validate_data(
    data_dict,
    checks=["range", "missing", "outliers"],
    params={
        "range": {
            "min_value": 0.0,
            "max_value": 100.0
        },
        "missing": {
            "threshold": 0.1  # Allow up to 10% missing values
        },
        "outliers": {
            "threshold": 2.5,  # Z-score threshold for outliers
            "method": "zscore"
        }
    }
)

# Print validation results
for var_name, result in validation_results.items():
    print(f"{var_name} validation: {'Passed' if result.is_valid else 'Failed'}")
    if not result.is_valid:
        for msg in result.error_messages:
            print(f"  - {msg}")