Data Readers

Data readers for Macrodata Refinement (MDR).

This module provides functions and classes for reading macrodata from various file formats.

class mdr.io.readers.DataSource(value)

Bases: Enum

Types of data sources.

FILE = 1

DATABASE = 2

API = 3

MEMORY = 4

class mdr.io.readers.DataReader(source_type=DataSource.FILE)

Bases: ABC

Abstract base class for data readers.

Parameters: source_type (DataSource)

__init__(source_type=DataSource.FILE)

Initialize the data reader.

Parameters: source_type (DataSource) – Type of data source

abstract read(source, **options)

Read data from the source.

Parameters:

source (str) – Source identifier (file path, table name, etc.)
**options – Additional reading options

Returns:

Dictionary mapping variable names to data arrays

Return type:

Dict[str, numpy.ndarray]

abstract validate_source(source)

Validate if the source exists and is readable.

Parameters: source (str) – Source identifier
Returns: True if the source is valid, False otherwise
Return type: bool
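
A minimal sketch of how a concrete reader might implement this interface. So that the block runs without MDR installed, DataSource and DataReader are re-declared as stand-ins that mirror the signatures documented above. MemoryReader, its store argument, and the source_type attribute are hypothetical and not part of mdr.io.readers, and plain lists stand in for the data arrays.

```python
from abc import ABC, abstractmethod
from enum import Enum
from typing import Any, Dict, List


class DataSource(Enum):
    """Stand-in mirroring the documented enum."""
    FILE = 1
    DATABASE = 2
    API = 3
    MEMORY = 4


class DataReader(ABC):
    """Stand-in mirroring the documented abstract interface."""

    def __init__(self, source_type: DataSource = DataSource.FILE) -> None:
        # Assumption: the source type is kept as an attribute.
        self.source_type = source_type

    @abstractmethod
    def read(self, source: str, **options: Any) -> Dict[str, List[float]]:
        ...

    @abstractmethod
    def validate_source(self, source: str) -> bool:
        ...


class MemoryReader(DataReader):
    """Hypothetical reader that serves named datasets held in memory."""

    def __init__(self, store: Dict[str, Dict[str, List[float]]]) -> None:
        super().__init__(source_type=DataSource.MEMORY)
        self._store = store

    def validate_source(self, source: str) -> bool:
        # A source is valid if a dataset is registered under that name.
        return source in self._store

    def read(self, source: str, **options: Any) -> Dict[str, List[float]]:
        if not self.validate_source(source):
            raise KeyError(f"unknown in-memory source: {source!r}")
        # Return a shallow copy so callers cannot mutate the stored mapping.
        return dict(self._store[source])


reader = MemoryReader({"run1": {"temperature": [20.5, 21.0]}})
data = reader.read("run1")
```

Because both methods are abstract, a subclass that omits either one cannot be instantiated, which is what lets callers rely on validate_source and read being present on every reader.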

class mdr.io.readers.FileReader(encoding='utf-8')

Bases: DataReader

Base class for file-based data readers.

Parameters: encoding (str)

__init__(encoding='utf-8')

Initialize the file reader.

Parameters: encoding (str) – File encoding

validate_source(source)

Validate if the file exists and is readable.

Parameters: source (str) – File path
Returns: True if the file is valid, False otherwise
Return type: bool
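
One plausible way to perform such a check, shown as a sketch using only the standard library. The helper name is_readable_file is hypothetical, and the actual FileReader.validate_source may apply different or additional checks.

```python
import os
import tempfile


def is_readable_file(path: str) -> bool:
    """Return True if path names an existing regular file the process can read."""
    return os.path.isfile(path) and os.access(path, os.R_OK)


# Demonstrate with a temporary file and with a path that does not exist.
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as handle:
    existing_path = handle.name

exists_result = is_readable_file(existing_path)
missing_result = is_readable_file(existing_path + ".missing")
os.remove(existing_path)
```

A check like this only reports the state of the file at the moment it runs, so a subsequent read can still fail if the file is removed in between.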

class mdr.io.readers.CSVReader(delimiter=',', quotechar='"', encoding='utf-8')

Bases: FileReader

Reader for CSV files.

Parameters:

delimiter (str)
quotechar (str)
encoding (str)

__init__(delimiter=',', quotechar='"', encoding='utf-8')

Initialize the CSV reader.

Parameters:

delimiter (str) – Field delimiter
quotechar (str) – Character for quoting fields
encoding (str) – File encoding

read(source, header=True, index_col=None, na_values=None, parse_dates=False, **options)

Read data from a CSV file.

Parameters:

source (str) – File path
header (bool) – Whether to use the first row as column names
index_col (str | int | None) – Column to use as the index
na_values (List[str] | None) – List of strings to interpret as NA/NaN
parse_dates (bool) – Whether to parse date columns
**options – Additional pandas.read_csv options

Returns:

Dictionary mapping column names to data arrays

Return type:

Dict[str, numpy.ndarray]
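
To illustrate the shape of the returned mapping and the effect of na_values, here is a sketch that uses the standard-library csv module in place of pandas and plain lists in place of arrays. The function csv_to_columns is hypothetical, and it assumes a header row and purely numeric cells.

```python
import csv
import io
import math
from typing import Dict, List, Optional


def csv_to_columns(
    text: str,
    delimiter: str = ",",
    na_values: Optional[List[str]] = None,
) -> Dict[str, List[float]]:
    """Parse CSV text with a header row into {column name: list of floats}."""
    na = set(na_values or [])
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    columns: Dict[str, List[float]] = {name: [] for name in header}
    for row in reader:
        for name, cell in zip(header, row):
            # Cells listed in na_values become NaN; all others must be numeric.
            columns[name].append(math.nan if cell in na else float(cell))
    return columns


sample = "temperature,pressure\n20.5,1013\nNA,1009\n"
columns = csv_to_columns(sample, na_values=["NA"])
```

Each key is a column name from the header row and each value holds that column's entries in file order, with the listed NA strings mapped to NaN.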

class mdr.io.readers.JSONReader(encoding='utf-8')

Bases: FileReader

Reader for JSON files.

Parameters: encoding (str)

read(source, orient='columns', convert_dates=True, **options)

Read data from a JSON file.

Parameters:

source (str) – File path
orient (str) – Expected JSON dict format, one of ['columns', 'records', 'index', 'split', 'values']
convert_dates (bool) – Whether to convert date strings to datetime objects
**options – Additional pandas.read_json options

Returns:

Dictionary mapping column names to data arrays

Return type:

Dict[str, numpy.ndarray]
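
The orient values follow the pandas.read_json convention. As a sketch of how two of these layouts map to the same column dictionary, the hypothetical helper below handles 'columns' ({column: {row label: value}}) and 'records' ([{column: value}, ...]) with the standard-library json module. It assumes every record carries the same keys and does not attempt date conversion.

```python
import json
from typing import Any, Dict, List


def json_to_columns(text: str, orient: str = "columns") -> Dict[str, List[Any]]:
    """Convert JSON text in 'columns' or 'records' layout to {column: values}."""
    payload = json.loads(text)
    if orient == "columns":
        # Layout: {column: {row label: value}}
        return {col: list(cells.values()) for col, cells in payload.items()}
    if orient == "records":
        # Layout: [{column: value}, ...], one object per row
        columns: Dict[str, List[Any]] = {}
        for record in payload:
            for col, value in record.items():
                columns.setdefault(col, []).append(value)
        return columns
    raise ValueError(f"orient {orient!r} is not handled in this sketch")


by_columns = json_to_columns(
    '{"t": {"0": 20.5, "1": 21.0}, "p": {"0": 1013, "1": 1009}}'
)
by_records = json_to_columns(
    '[{"t": 20.5, "p": 1013}, {"t": 21.0, "p": 1009}]', orient="records"
)
```

Passing the wrong orient for a given file usually produces a differently shaped result or an error, so it is worth checking the file's layout before choosing a value.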

class mdr.io.readers.ExcelReader(encoding='utf-8')

Bases: FileReader

Reader for Excel files.

Parameters: encoding (str)

read(source, sheet_name=0, header=0, na_values=None, **options)

Read data from an Excel file.

Parameters:

source (str) – File path
sheet_name (str | int | List | None) – Name, index, or list of sheets to read
header (int) – Row to use for column names (0-indexed)
na_values (List[str] | None) – List of strings to interpret as NA/NaN
**options – Additional pandas.read_excel options

Returns:

Dictionary mapping column names to data arrays

Return type:

Dict[str, numpy.ndarray]

class mdr.io.readers.ParquetReader(encoding='utf-8')

Bases: FileReader

Reader for Parquet files.

Parameters: encoding (str)

read(source, columns=None, **options)

Read data from a Parquet file.

Parameters:

source (str) – File path
columns (List[str] | None) – List of columns to read (None for all)
**options – Additional pandas.read_parquet options

Returns:

Dictionary mapping column names to data arrays

Return type:

Dict[str, numpy.ndarray]

class mdr.io.readers.HDF5Reader(encoding='utf-8')

Bases: FileReader

Reader for HDF5 files.

Parameters: encoding (str)

read(source, key, **options)

Read data from an HDF5 file.

Parameters:

source (str) – File path
key (str) – Group identifier in the HDF5 file
**options – Additional pandas.read_hdf options

Returns:

Dictionary mapping column names to data arrays

Return type:

Dict[str, numpy.ndarray]

mdr.io.readers.get_reader(file_type, **options)

Get a reader for the specified file type.

Parameters:

file_type (str) – Type of file ('csv', 'json', 'excel', 'parquet', 'hdf5')
**options – Additional options for the reader

Returns:

Appropriate DataReader instance

Return type:

DataReader
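
A factory of this kind is commonly built as a lookup from type string to reader class. The sketch below shows that pattern with stand-in classes so it runs without MDR installed. The registry _READERS, the name get_reader_sketch, the case-insensitive lookup, and the ValueError for unknown types are all assumptions rather than documented behaviour.

```python
from typing import Any, Dict, Type


class CSVReader:
    """Stand-in for the documented class; it only stores its options."""

    def __init__(self, delimiter: str = ",", quotechar: str = '"',
                 encoding: str = "utf-8") -> None:
        self.delimiter = delimiter
        self.quotechar = quotechar
        self.encoding = encoding


class JSONReader:
    """Stand-in for the documented class; it only stores its options."""

    def __init__(self, encoding: str = "utf-8") -> None:
        self.encoding = encoding


# Hypothetical registry mapping file-type strings to reader classes.
_READERS: Dict[str, Type[Any]] = {"csv": CSVReader, "json": JSONReader}


def get_reader_sketch(file_type: str, **options: Any) -> Any:
    """Instantiate the reader registered for file_type, forwarding options."""
    try:
        reader_cls = _READERS[file_type.lower()]
    except KeyError:
        raise ValueError(f"unsupported file type: {file_type!r}") from None
    return reader_cls(**options)


csv_reader = get_reader_sketch("csv", delimiter=";")
```

Keeping the mapping in one dictionary means a new format only needs a new entry, and any keyword arguments are passed straight to the chosen reader's constructor.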

mdr.io.readers.read_csv(filepath, delimiter=',', header=True, **options)

Read data from a CSV file.

Parameters:

filepath (str) – Path to the CSV file
delimiter (str) – Field delimiter
header (bool) – Whether to use the first row as column names
**options – Additional reading options

Returns:

Dictionary mapping column names to data arrays

Return type:

Dict[str, numpy.ndarray]

mdr.io.readers.read_json(filepath, orient='columns', **options)

Read data from a JSON file.

Parameters:

filepath (str) – Path to the JSON file
orient (str) – Expected JSON dict format
**options – Additional reading options

Returns:

Dictionary mapping column names to data arrays

Return type:

Dict[str, numpy.ndarray]

mdr.io.readers.read_excel(filepath, sheet_name=0, **options)

Read data from an Excel file.

Parameters:

filepath (str) – Path to the Excel file
sheet_name (str | int | List | None) – Name, index, or list of sheets to read
**options – Additional reading options

Returns:

Dictionary mapping column names to data arrays

Return type:

Dict[str, numpy.ndarray]

mdr.io.readers.read_parquet(filepath, columns=None, **options)

Read data from a Parquet file.

Parameters:

filepath (str) – Path to the Parquet file
columns (List[str] | None) – List of columns to read (None for all)
**options – Additional reading options

Returns:

Dictionary mapping column names to data arrays

Return type:

Dict[str, numpy.ndarray]

mdr.io.readers.read_hdf5(filepath, key, **options)

Read data from an HDF5 file.

Parameters:

filepath (str) – Path to the HDF5 file
key (str) – Group identifier in the HDF5 file
**options – Additional reading options

Returns:

Dictionary mapping column names to data arrays

Return type:

Dict[str, numpy.ndarray]

Overview

The readers module provides functions and classes for reading data from various file formats into dictionaries of NumPy arrays. They handle loading, parsing, and initial preprocessing so that the data is ready for the MDR refinement pipeline.

Supported File Formats

The module supports reading data from the following formats:

CSV: Comma-separated values files
JSON: JavaScript Object Notation files
Excel: Microsoft Excel workbooks (.xlsx, .xls)
Parquet: Apache Parquet columnar storage files
HDF5: Hierarchical Data Format version 5 files

Core Functions

The convenience functions read_csv, read_json, read_excel, read_parquet, and read_hdf5 are documented in full in the reference above.

Usage Examples

Reading from a CSV file:

from mdr.io.readers import read_csv

# Read data from a CSV file
data_dict = read_csv("path/to/data.csv")

# Print the variable names and shapes
for var_name, values in data_dict.items():
    print(f"{var_name}: {values.shape}")

Reading from an Excel file with multiple sheets:

from mdr.io.readers import read_excel

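# Read data from specific sheets; the documented parameter is sheet_name.
# column_mapping is not a documented parameter and is forwarded via **options.
data_dict = read_excel(
    "path/to/data.xlsx",
    sheet_name=["Temperature", "Pressure"],
    column_mapping={
        "Temperature": {"Temp (C)": "temperature"},
        "Pressure": {"Press (hPa)": "pressure"},
    },
)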