Validation

Validation module for Macrodata Refinement (MDR).

This module provides functions and classes for validating macrodata and ensuring data quality.

class mdr.core.validation.ValidationResult(is_valid, error_messages, invalid_indices=None, statistics=None)

Bases: object

Result of a data validation operation.

Parameters:
  • is_valid (bool)

  • error_messages (List[str])

  • invalid_indices (numpy.ndarray | None)

  • statistics (Dict[str, float] | None)

is_valid: bool
error_messages: List[str]
invalid_indices: numpy.ndarray | None = None
statistics: Dict[str, float] | None = None
__post_init__()

Validate the ValidationResult instance.

Return type:

None
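
The presence of __post_init__ suggests that ValidationResult is a dataclass. The sketch below is a hypothetical stand-in, not the library's source: the name ValidationResultSketch, the use of a plain list for invalid_indices, and the rules enforced in __post_init__ are all assumptions chosen to illustrate how such a hook can guard the documented fields.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ValidationResultSketch:
    """Hypothetical stand-in mirroring the four documented fields."""

    is_valid: bool
    error_messages: List[str]
    invalid_indices: Optional[List[int]] = None  # assumption: the real field holds an array
    statistics: Optional[Dict[str, float]] = None

    def __post_init__(self) -> None:
        # Assumed consistency rules, for illustration only.
        if not isinstance(self.is_valid, bool):
            raise TypeError("is_valid must be a bool")
        if not self.is_valid and not self.error_messages:
            raise ValueError("a failed result needs at least one error message")


ok = ValidationResultSketch(is_valid=True, error_messages=[])
failed = ValidationResultSketch(
    is_valid=False,
    error_messages=["2 values above max_value"],
    invalid_indices=[3, 7],
)
```

Because __post_init__ runs right after the generated __init__, an inconsistent result is rejected at construction time rather than surfacing later in reporting code.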

mdr.core.validation.check_data_range(data, min_value, max_value)

Check if all values in the data are within the specified range.

Parameters:
  • data (numpy.ndarray) – Input data array

  • min_value (float) – Minimum allowed value

  • max_value (float) – Maximum allowed value

Returns:

ValidationResult object containing validation results

Return type:

ValidationResult
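
As a rough illustration of what a range check involves, the sketch below works on plain Python lists so that it has no dependencies. The helper name range_check_sketch, the inclusive bounds, and the decision to skip NaN entries are assumptions; the library's actual behaviour may differ.

```python
import math
from typing import List, Sequence, Tuple


def range_check_sketch(
    data: Sequence[float], min_value: float, max_value: float
) -> Tuple[bool, List[int]]:
    """Return (is_valid, indices of values outside [min_value, max_value])."""
    if min_value > max_value:
        raise ValueError("min_value must not exceed max_value")
    # Assumption: NaN is ignored here because missing values are a separate check.
    invalid = [
        i
        for i, x in enumerate(data)
        if not math.isnan(x) and not (min_value <= x <= max_value)
    ]
    return (not invalid, invalid)


is_valid, bad = range_check_sketch([20.5, 21.3, math.nan, 21.7, 145.0], 0.0, 100.0)
print(is_valid, bad)  # False [4]
```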

mdr.core.validation.check_missing_values(data, threshold=0.1)

Check for missing values in the data.

Parameters:
  • data (numpy.ndarray) – Input data array

  • threshold (float) – Maximum allowed fraction of missing values

Returns:

ValidationResult object containing validation results

Return type:

ValidationResult
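
The threshold is a fraction, so the check reduces to counting missing entries and dividing by the array length. A minimal dependency-free sketch follows; missing_check_sketch is a hypothetical helper, and treating NaN as the only missing marker, passing when the fraction equals the threshold, and passing on empty input are all assumptions.

```python
import math
from typing import Sequence, Tuple


def missing_check_sketch(
    data: Sequence[float], threshold: float = 0.1
) -> Tuple[bool, float]:
    """Return (is_valid, fraction of NaN entries)."""
    if not data:
        return True, 0.0  # assumption: empty input passes
    n_missing = sum(1 for x in data if math.isnan(x))
    fraction = n_missing / len(data)
    return fraction <= threshold, fraction


# One NaN in five values is 20% missing, above the 10% default.
print(missing_check_sketch([20.5, 21.3, math.nan, 21.7, 45.0]))  # (False, 0.2)
```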

mdr.core.validation.check_outliers(data, threshold=3.0, method='zscore')

Check for outliers in the data.

Parameters:
  • data (numpy.ndarray) – Input data array

  • threshold (float) – Threshold for outlier detection

  • method (str) – Method for outlier detection ('zscore', 'iqr', 'mad')

Returns:

ValidationResult object containing validation results

Return type:

ValidationResult
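
The three method names correspond to common textbook outlier rules. The sketch below implements those textbook definitions with the standard library only; it is not taken from the MDR source. The helper name outlier_indices_sketch, the use of the population standard deviation, the reading of threshold as the IQR multiplier, and the 0.6745 scaling for MAD are assumptions, and the input is assumed to contain no NaN values.

```python
import statistics
from typing import List, Sequence


def outlier_indices_sketch(
    data: Sequence[float], threshold: float = 3.0, method: str = "zscore"
) -> List[int]:
    """Indices flagged as outliers under common textbook definitions."""
    if method == "zscore":
        mean = statistics.fmean(data)
        std = statistics.pstdev(data)  # population standard deviation
        if std == 0:
            return []
        return [i for i, x in enumerate(data) if abs(x - mean) / std > threshold]
    if method == "iqr":
        q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
        iqr = q3 - q1
        low, high = q1 - threshold * iqr, q3 + threshold * iqr
        return [i for i, x in enumerate(data) if x < low or x > high]
    if method == "mad":
        med = statistics.median(data)
        mad = statistics.median(abs(x - med) for x in data)
        if mad == 0:
            return []
        # 0.6745 rescales the MAD to match a standard deviation for normal data.
        return [i for i, x in enumerate(data) if 0.6745 * abs(x - med) / mad > threshold]
    raise ValueError(f"unknown method: {method!r}")


sample = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.3, 50.0]
print(outlier_indices_sketch(sample, threshold=2.5, method="zscore"))  # [7]
print(outlier_indices_sketch(sample, threshold=1.5, method="iqr"))     # [7]
print(outlier_indices_sketch(sample, threshold=3.0, method="mad"))     # [7]
```

With a population standard deviation, no point in a sample of n values can have a z-score above sqrt(n - 1), which is about 2.65 for these eight values. Under this sketch's definition a threshold of 3.0 therefore cannot flag anything in such a short series, which is why the example uses 2.5. The IQR and MAD rules are based on medians and quartiles and do not have this limitation.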

mdr.core.validation.check_data_integrity(data, checks=['range', 'missing', 'outliers'], params=None)

Perform a comprehensive data integrity check.

Parameters:
  • data (numpy.ndarray) – Input data array

  • checks (List[str]) – List of checks to perform

  • params (Dict[str, Any] | None) – Parameters for each check

Returns:

ValidationResult object containing validation results

Return type:

ValidationResult
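
One plausible way to combine several named checks into a single outcome is to look each name up in a registry, run it with its own parameters, and merge the results. The sketch below illustrates that pattern only; combine_checks_sketch, the registry argument, and the two toy checks are hypothetical and do not describe how check_data_integrity is implemented.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence, Tuple

CheckFn = Callable[..., Tuple[bool, List[str]]]


def combine_checks_sketch(
    data: Sequence[float],
    checks: List[str],
    registry: Dict[str, CheckFn],
    params: Optional[Dict[str, Dict[str, Any]]] = None,
) -> Tuple[bool, List[str]]:
    """Run each named check and merge the outcomes into (is_valid, messages)."""
    params = params or {}
    all_valid = True
    messages: List[str] = []
    for name in checks:
        if name not in registry:
            raise ValueError(f"unknown check: {name!r}")
        ok, msgs = registry[name](data, **params.get(name, {}))
        all_valid = all_valid and ok
        messages.extend(f"{name}: {m}" for m in msgs)
    return all_valid, messages


# Two toy checks used only to exercise the combiner.
def in_range(data, min_value, max_value):
    bad = [x for x in data if not (min_value <= x <= max_value)]
    return (not bad, [f"{len(bad)} value(s) out of range"] if bad else [])


def non_empty(data):
    return (len(data) > 0, [] if data else ["no data"])


registry = {"range": in_range, "non_empty": non_empty}
result = combine_checks_sketch(
    [1.0, 5.0, 12.0],
    ["non_empty", "range"],
    registry,
    {"range": {"min_value": 0.0, "max_value": 10.0}},
)
print(result)  # (False, ['range: 1 value(s) out of range'])
```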

mdr.core.validation.validate_data(data_dict, checks=['range', 'missing', 'outliers'], params=None)

Validate multiple data arrays.

Parameters:
  • data_dict (Dict[str, numpy.ndarray]) – Dictionary mapping variable names to data arrays

  • checks (List[str]) – List of checks to perform

  • params (Dict[str, Dict[str, Any]] | None) – Parameters for each check. Can be structured in two ways:

      1. Global parameters: {check_name: {parameters}}

      2. Variable-specific parameters: {variable_name: {check_name: {parameters}}}

    Variable-specific parameters take precedence over global ones.

Returns:

Dictionary mapping variable names to ValidationResult objects

Return type:

Dict[str, ValidationResult]
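
The precedence rule for params can be pictured as a per-variable merge: start from the global entry for a check, then overlay the variable-specific entry. The sketch below shows one way to do this. resolve_params_sketch is a hypothetical helper, and merging key by key (rather than replacing the whole check entry) is an assumption about the library's behaviour.

```python
from typing import Any, Dict, List, Optional

Params = Dict[str, Dict[str, Any]]


def resolve_params_sketch(
    var_name: str, check_names: List[str], params: Optional[Params]
) -> Params:
    """Build the per-check keyword arguments that apply to one variable."""
    params = params or {}
    specific = params.get(var_name, {})  # {check_name: {parameters}} for this variable
    resolved: Params = {}
    for check in check_names:
        merged = dict(params.get(check, {}))    # global parameters first
        merged.update(specific.get(check, {}))  # variable-specific values win
        resolved[check] = merged
    return resolved


params = {
    "range": {"min_value": 0.0, "max_value": 100.0},             # global
    "pressure": {"range": {"min_value": 50.0, "max_value": 150.0}},  # variable-specific
}
print(resolve_params_sketch("temperature", ["range", "missing"], params))
# {'range': {'min_value': 0.0, 'max_value': 100.0}, 'missing': {}}
print(resolve_params_sketch("pressure", ["range", "missing"], params))
# {'range': {'min_value': 50.0, 'max_value': 150.0}, 'missing': {}}
```

Because both structures share one dictionary, a variable whose name matches a check name (for example a variable called "range") would be ambiguous, so such names are best avoided.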

Overview

The validation module provides functions and classes for validating data quality through various checks and tests. It helps identify issues such as:

  • Missing values

  • Statistical outliers

  • Values outside expected ranges

  • Inconsistencies between related variables

Core Components

ValidationResult

The ValidationResult class encapsulates the results of validation checks, including whether the data passed validation, any error messages, and relevant statistics.

Data Validation Functions

validate_data validates several named arrays in one call and returns one ValidationResult per variable. The single-array checks check_data_range, check_missing_values, check_outliers, and check_data_integrity each return a single ValidationResult. Full signatures and parameters for all of these appear in the reference entries earlier on this page.

Usage Examples

Basic validation of data:

import numpy as np
from mdr.core.validation import validate_data

# Create a dictionary of data variables
data_dict = {
    "temperature": np.array([20.5, 21.3, np.nan, 21.7, 45.0]),
    "pressure": np.array([101.3, 101.4, 80.0, np.nan, np.nan])
}

# Validate the data
validation_results = validate_data(
    data_dict,
    checks=["range", "missing", "outliers"],
    params={
        "range": {
            "min_value": 0.0,
            "max_value": 100.0
        },
        "missing": {
            "threshold": 0.1  # Allow up to 10% missing values
        },
        "outliers": {
            "threshold": 2.5,  # Z-score threshold for outliers
            "method": "zscore"
        }
    }
)

# Print validation results
for var_name, result in validation_results.items():
    print(f"{var_name} validation: {'Passed' if result.is_valid else 'Failed'}")
    if not result.is_valid:
        for msg in result.error_messages:
            print(f"  - {msg}")