Data Readers

Data readers for Macrodata Refinement (MDR).

This module provides functions and classes for reading macrodata from various file formats.

class mdr.io.readers.DataSource(value)[source]

Bases: Enum

Types of data sources.

FILE = 1
DATABASE = 2
API = 3
MEMORY = 4
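
As a quick illustration of how an Enum with these members behaves, the sketch below re-declares DataSource locally (a stand-in, so the snippet runs without MDR installed) and shows lookup by value and by name. Only the member names and values come from the listing above.

```python
from enum import Enum


class DataSource(Enum):
    """Local stand-in mirroring the documented members (not the MDR class)."""

    FILE = 1
    DATABASE = 2
    API = 3
    MEMORY = 4


# Lookup by value, matching the documented DataSource(value) signature
by_value = DataSource(2)

# Lookup by member name
by_name = DataSource["MEMORY"]

print(by_value.name, by_name.value)
```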

class mdr.io.readers.DataReader(source_type=DataSource.FILE)[source]

Bases: ABC

Abstract base class for data readers.

Parameters:

source_type (DataSource)

__init__(source_type=DataSource.FILE)[source]

Initialize the data reader.

Parameters:

source_type (DataSource) – Type of data source

abstract read(source, **options)[source]

Read data from the source.

Parameters:
  • source (str) – Source identifier (file path, table name, etc.)

  • **options – Additional reading options

Returns:

Dictionary mapping variable names to data arrays

Return type:

Dict[str, numpy.ndarray]

abstract validate_source(source)[source]

Validate if the source exists and is readable.

Parameters:

source (str) – Source identifier

Returns:

True if the source is valid, False otherwise

Return type:

bool
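
The two abstract methods above define the contract a concrete reader has to satisfy. The following is a minimal sketch of that contract, assuming a locally defined stand-in for DataReader and a hypothetical InMemoryReader that is not part of MDR. It returns plain lists rather than arrays to stay dependency-free.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class DataReader(ABC):
    """Local stand-in for the documented abstract interface."""

    @abstractmethod
    def read(self, source: str, **options: Any) -> Dict[str, List[Any]]:
        """Return a mapping of variable names to values."""

    @abstractmethod
    def validate_source(self, source: str) -> bool:
        """Return True if the source can be read."""


class InMemoryReader(DataReader):
    """Hypothetical reader that serves named tables from a dict."""

    def __init__(self, tables: Dict[str, Dict[str, List[Any]]]) -> None:
        self._tables = tables

    def validate_source(self, source: str) -> bool:
        return source in self._tables

    def read(self, source: str, **options: Any) -> Dict[str, List[Any]]:
        if not self.validate_source(source):
            raise KeyError(f"unknown source: {source!r}")
        # Return a shallow copy so callers cannot replace stored columns
        return dict(self._tables[source])


reader = InMemoryReader({"weather": {"temperature": [20.5, 21.0]}})
result = reader.read("weather")
```

Because both methods are marked abstract, instantiating the base class directly raises TypeError; only subclasses that implement both can be created.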

class mdr.io.readers.FileReader(encoding='utf-8')[source]

Bases: DataReader

Base class for file-based data readers.

Parameters:

encoding (str)

__init__(encoding='utf-8')[source]

Initialize the file reader.

Parameters:

encoding (str) – File encoding

validate_source(source)[source]

Validate if the file exists and is readable.

Parameters:

source (str) – File path

Returns:

True if the file is valid, False otherwise

Return type:

bool
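
One plausible reading of "exists and is readable" is sketched below. The helper name validate_file_source is hypothetical, and the actual FileReader check may differ (for example, it may also inspect the file extension).

```python
import os
import tempfile


def validate_file_source(source: str) -> bool:
    """Hypothetical check: the path is a regular file and is readable."""
    return os.path.isfile(source) and os.access(source, os.R_OK)


# Create a real temporary file so the check has something to find
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as handle:
    existing_path = handle.name

exists_ok = validate_file_source(existing_path)
missing_ok = validate_file_source(existing_path + ".missing")

# Clean up the temporary file
os.remove(existing_path)
```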

class mdr.io.readers.CSVReader(delimiter=',', quotechar='"', encoding='utf-8')[source]

Bases: FileReader

Reader for CSV files.

Parameters:
  • delimiter (str)

  • quotechar (str)

  • encoding (str)

__init__(delimiter=',', quotechar='"', encoding='utf-8')[source]

Initialize the CSV reader.

Parameters:
  • delimiter (str) – Field delimiter

  • quotechar (str) – Character for quoting fields

  • encoding (str) – File encoding

read(source, header=True, index_col=None, na_values=None, parse_dates=False, **options)[source]

Read data from a CSV file.

Parameters:
  • source (str) – File path

  • header (bool) – Whether to use the first row as column names

  • index_col (str | int | None) – Column to use as the index

  • na_values (List[str] | None) – List of strings to interpret as NA/NaN

  • parse_dates (bool) – Whether to parse date columns

  • **options – Additional pandas.read_csv options

Returns:

Dictionary mapping column names to data arrays

Return type:

Dict[str, numpy.ndarray]
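
To make the column-dictionary return shape concrete, here is a rough standard-library sketch of how delimiter, quotechar, header, and na_values could interact. The helper csv_to_columns is hypothetical; the documented reader accepts pandas.read_csv options and presumably converts types, whereas this sketch leaves every value as a string.

```python
import csv
import io
from typing import Dict, List, Optional


def csv_to_columns(
    text: str,
    delimiter: str = ",",
    quotechar: str = '"',
    header: bool = True,
    na_values: Optional[List[str]] = None,
) -> Dict[str, List[Optional[str]]]:
    """Hypothetical helper: parse CSV text into a column-name -> values dict."""
    missing = set(na_values or [])
    rows = list(
        csv.reader(io.StringIO(text), delimiter=delimiter, quotechar=quotechar)
    )
    if not rows:
        return {}
    if header:
        names, body = rows[0], rows[1:]
    else:
        # Without a header row, fall back to positional names
        names, body = [str(i) for i in range(len(rows[0]))], rows
    columns: Dict[str, List[Optional[str]]] = {name: [] for name in names}
    for row in body:
        for name, cell in zip(names, row):
            # Cells listed in na_values are stored as None
            columns[name].append(None if cell in missing else cell)
    return columns


sample = 'station;temp\n"A;1";20.5\nB;NA\n'
columns = csv_to_columns(sample, delimiter=";", na_values=["NA"])
```

The quoted field keeps its embedded delimiter, and the "NA" cell becomes None.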

class mdr.io.readers.JSONReader(encoding='utf-8')[source]

Bases: FileReader

Reader for JSON files.

Parameters:

encoding (str)

read(source, orient='columns', convert_dates=True, **options)[source]

Read data from a JSON file.

Parameters:
  • source (str) – File path

  • orient (str) – Expected JSON dict format, one of ['columns', 'records', 'index', 'split', 'values']

  • convert_dates (bool) – Whether to convert date strings to datetime objects

  • **options – Additional pandas.read_json options

Returns:

Dictionary mapping column names to data arrays

Return type:

Dict[str, numpy.ndarray]
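
The orient values follow the pandas.read_json naming. As an illustration only, the sketch below shows how two of those layouts ('columns' and 'records') describe the same table. The helper json_to_columns is hypothetical, uses only the standard library, and assumes every record carries the same keys.

```python
import json
from typing import Any, Dict, List


def json_to_columns(text: str, orient: str = "columns") -> Dict[str, List[Any]]:
    """Hypothetical helper covering two of the listed orientations."""
    payload = json.loads(text)
    if orient == "columns":
        # Layout: {"column": {"row_label": value, ...}, ...}
        return {name: list(cells.values()) for name, cells in payload.items()}
    if orient == "records":
        # Layout: [{"column": value, ...}, ...]
        columns: Dict[str, List[Any]] = {}
        for record in payload:
            for name, value in record.items():
                columns.setdefault(name, []).append(value)
        return columns
    raise ValueError(f"orient {orient!r} is not covered by this sketch")


as_columns = json_to_columns('{"temp": {"0": 20.5, "1": 21.0}}', orient="columns")
as_records = json_to_columns('[{"temp": 20.5}, {"temp": 21.0}]', orient="records")
```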

class mdr.io.readers.ExcelReader(encoding='utf-8')[source]

Bases: FileReader

Reader for Excel files.

Parameters:

encoding (str)

read(source, sheet_name=0, header=0, na_values=None, **options)[source]

Read data from an Excel file.

Parameters:
  • source (str) – File path

  • sheet_name (str | int | List | None) – Name, index, or list of sheets to read

  • header (int) – Row to use for column names (0-indexed)

  • na_values (List[str] | None) – List of strings to interpret as NA/NaN

  • **options – Additional pandas.read_excel options

Returns:

Dictionary mapping column names to data arrays

Return type:

Dict[str, numpy.ndarray]

class mdr.io.readers.ParquetReader(encoding='utf-8')[source]

Bases: FileReader

Reader for Parquet files.

Parameters:

encoding (str)

read(source, columns=None, **options)[source]

Read data from a Parquet file.

Parameters:
  • source (str) – File path

  • columns (List[str] | None) – List of columns to read (None for all)

  • **options – Additional pandas.read_parquet options

Returns:

Dictionary mapping column names to data arrays

Return type:

Dict[str, numpy.ndarray]

class mdr.io.readers.HDF5Reader(encoding='utf-8')[source]

Bases: FileReader

Reader for HDF5 files.

Parameters:

encoding (str)

read(source, key, **options)[source]

Read data from an HDF5 file.

Parameters:
  • source (str) – File path

  • key (str) – Group identifier in the HDF5 file

  • **options – Additional pandas.read_hdf options

Returns:

Dictionary mapping column names to data arrays

Return type:

Dict[str, numpy.ndarray]

mdr.io.readers.get_reader(file_type, **options)[source]

Get a reader for the specified file type.

Parameters:
  • file_type (str) – Type of file ('csv', 'json', 'excel', 'parquet', 'hdf5')

  • **options – Additional options for the reader

Returns:

Appropriate DataReader instance

Return type:

DataReader
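
A factory like this is commonly implemented as a lookup table from type strings to classes. The sketch below shows that pattern with stub classes standing in for the real readers. The name get_reader_sketch, the case-insensitive lookup, and the ValueError for unknown types are all assumptions, not documented behaviour.

```python
from typing import Any, Dict, Type


class _StubReader:
    """Stand-in base class; it only records the constructor options."""

    def __init__(self, **options: Any) -> None:
        self.options = options


class CSVReader(_StubReader):
    """Stub for the documented CSVReader."""


class JSONReader(_StubReader):
    """Stub for the documented JSONReader."""


# Registry mapping file-type strings to reader classes
_READERS: Dict[str, Type[_StubReader]] = {
    "csv": CSVReader,
    "json": JSONReader,
}


def get_reader_sketch(file_type: str, **options: Any) -> _StubReader:
    """Return a reader instance for file_type, forwarding options to it."""
    try:
        reader_cls = _READERS[file_type.lower()]
    except KeyError:
        raise ValueError(f"unsupported file type: {file_type!r}") from None
    return reader_cls(**options)


csv_reader = get_reader_sketch("csv", delimiter=";")
```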

Overview

The readers module provides functions and classes for reading data from various file formats into dictionaries that map variable or column names to NumPy arrays. They handle data loading, parsing, and initial preprocessing to prepare data for the MDR refinement pipeline.

Supported File Formats

The module supports reading data from the following formats:

  • CSV: Comma-separated values files

  • JSON: JavaScript Object Notation files

  • Excel: Microsoft Excel workbooks (.xlsx, .xls)

  • Parquet: Apache Parquet columnar storage files

  • HDF5: Hierarchical Data Format version 5 files
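
When the format is not known in advance, the file-type string expected by get_reader can be derived from the file extension. The helper below is a hypothetical sketch, not part of MDR. The .xlsx and .xls extensions come from the list above; the remaining extensions are conventional choices and are assumptions.

```python
from pathlib import Path

# Extension -> file-type string; entries other than .xlsx/.xls are assumed
_EXTENSION_TO_TYPE = {
    ".csv": "csv",
    ".json": "json",
    ".xlsx": "excel",
    ".xls": "excel",
    ".parquet": "parquet",
    ".h5": "hdf5",
    ".hdf5": "hdf5",
}


def infer_file_type(filepath: str) -> str:
    """Hypothetical helper: map a path's extension to a file-type string."""
    suffix = Path(filepath).suffix.lower()
    if suffix not in _EXTENSION_TO_TYPE:
        raise ValueError(f"cannot infer file type from {filepath!r}")
    return _EXTENSION_TO_TYPE[suffix]


detected = infer_file_type("path/to/data.XLSX")
```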

Core Functions

mdr.io.readers.read_csv(filepath, delimiter=',', header=True, **options)[source]

Read data from a CSV file.

Parameters:
  • filepath (str) – Path to the CSV file

  • delimiter (str) – Field delimiter

  • header (bool) – Whether to use the first row as column names

  • **options – Additional reading options

Returns:

Dictionary mapping column names to data arrays

Return type:

Dict[str, numpy.ndarray]

mdr.io.readers.read_json(filepath, orient='columns', **options)[source]

Read data from a JSON file.

Parameters:
  • filepath (str) – Path to the JSON file

  • orient (str) – Expected JSON dict format

  • **options – Additional reading options

Returns:

Dictionary mapping column names to data arrays

Return type:

Dict[str, numpy.ndarray]

mdr.io.readers.read_excel(filepath, sheet_name=0, **options)[source]

Read data from an Excel file.

Parameters:
  • filepath (str) – Path to the Excel file

  • sheet_name (str | int | List | None) – Name, index, or list of sheets to read

  • **options – Additional reading options

Returns:

Dictionary mapping column names to data arrays

Return type:

Dict[str, numpy.ndarray]

mdr.io.readers.read_parquet(filepath, columns=None, **options)[source]

Read data from a Parquet file.

Parameters:
  • filepath (str) – Path to the Parquet file

  • columns (List[str] | None) – List of columns to read (None for all)

  • **options – Additional reading options

Returns:

Dictionary mapping column names to data arrays

Return type:

Dict[str, numpy.ndarray]

mdr.io.readers.read_hdf5(filepath, key, **options)[source]

Read data from an HDF5 file.

Parameters:
  • filepath (str) – Path to the HDF5 file

  • key (str) – Group identifier in the HDF5 file

  • **options – Additional reading options

Returns:

Dictionary mapping column names to data arrays

Return type:

Dict[str, numpy.ndarray]

Usage Examples

Reading from a CSV file:

from mdr.io.readers import read_csv

# Read data from a CSV file
data_dict = read_csv("path/to/data.csv")

# Print the variable names and shapes
for var_name, values in data_dict.items():
    print(f"{var_name}: {values.shape}")

Reading from an Excel file with multiple sheets:

from mdr.io.readers import read_excel

# Read data from specific sheets; sheet_name accepts a sheet name,
# a sheet index, or a list of sheets
data_dict = read_excel(
    "path/to/data.xlsx",
    sheet_name=["Temperature", "Pressure"],
)