Data Formats

Format utilities for Macrodata Refinement (MDR).

This module provides functions for detecting, validating, and converting between different data formats.

class mdr.io.formats.FormatType(value)[source]

Bases: Enum

Supported data format types.

CSV = 1
JSON = 2
EXCEL = 3
PARQUET = 4
HDF5 = 5
UNKNOWN = 6

mdr.io.formats.detect_format(filepath)[source]

Detect the format of a file based on its extension or content.

Parameters:

filepath (str) – Path to the file

Returns:

Detected format type

Return type:

FormatType
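Extension-based detection can be illustrated with a minimal sketch. This is not the module's actual implementation: the local FormatType only mirrors the enum members documented above, and both the extension table and the helper name detect_by_extension are assumptions made for this example.

```python
from enum import Enum
from pathlib import Path


class FormatType(Enum):
    """Local stand-in mirroring the documented enum members."""
    CSV = 1
    JSON = 2
    EXCEL = 3
    PARQUET = 4
    HDF5 = 5
    UNKNOWN = 6


# Assumed extension table; the real module may recognize other suffixes.
_EXTENSIONS = {
    ".csv": FormatType.CSV,
    ".json": FormatType.JSON,
    ".xlsx": FormatType.EXCEL,
    ".xls": FormatType.EXCEL,
    ".parquet": FormatType.PARQUET,
    ".h5": FormatType.HDF5,
    ".hdf5": FormatType.HDF5,
}


def detect_by_extension(filepath: str) -> FormatType:
    """Hypothetical helper: map a file suffix to a FormatType."""
    suffix = Path(filepath).suffix.lower()
    # Unrecognized or missing suffixes fall back to UNKNOWN.
    return _EXTENSIONS.get(suffix, FormatType.UNKNOWN)


print(detect_by_extension("path/to/data.csv").name)
```

Lower-casing the suffix makes the lookup case-insensitive, and the UNKNOWN fallback avoids raising on files the table does not cover.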

mdr.io.formats.validate_format(filepath, expected_format)[source]

Validate if a file has the expected format.

Parameters:
  • filepath (str) – Path to the file

  • expected_format (FormatType) – Expected format type

Returns:

True if the file has the expected format, False otherwise

Return type:

bool
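One way to check a file against an expected format by content is to compare its leading bytes with a known signature. The sketch below assumes this approach; whether validate_format works this way is not stated here, and the names _SIGNATURES and has_signature are hypothetical. It covers only binary formats, since CSV and JSON have no fixed signature.

```python
import tempfile
from pathlib import Path

# Well-known leading byte signatures of the binary formats.
_SIGNATURES = {
    "PARQUET": b"PAR1",
    "HDF5": b"\x89HDF\r\n\x1a\n",
    "XLSX": b"PK\x03\x04",  # .xlsx files are ZIP containers
}


def has_signature(filepath: str, format_name: str) -> bool:
    """Hypothetical helper: True if the file starts with the signature."""
    signature = _SIGNATURES[format_name]
    with open(filepath, "rb") as handle:
        return handle.read(len(signature)) == signature


# Demonstrate on a throwaway file that begins with the Parquet signature.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"PAR1" + b"\x00" * 8)
    print(has_signature(str(sample), "PARQUET"))
    print(has_signature(str(sample), "HDF5"))
```

A signature match is a necessary condition rather than proof of validity: any ZIP archive begins with the same bytes as an .xlsx workbook, so a stricter check would go on to parse the file.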

mdr.io.formats.convert_format(data, source_format, target_format, **options)[source]

Convert data from one format to another.

Parameters:
  • data (Dict[str, numpy.ndarray]) – Dictionary mapping column names to data arrays

  • source_format (FormatType) – Source format type

  • target_format (FormatType) – Target format type

  • **options – Additional options for the conversion

Returns:

Converted data as bytes

Return type:

bytes
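To make the "column dictionary in, bytes out" contract concrete, here is a minimal sketch that serializes column-oriented data to CSV and JSON using only the standard library. The helper names are hypothetical, plain lists stand in for the data arrays, and the real convert_format may handle encodings, options, and the binary formats differently.

```python
import csv
import io
import json
from typing import Dict, Sequence


def columns_to_csv_bytes(data: Dict[str, Sequence]) -> bytes:
    """Hypothetical helper: serialize column-oriented data as CSV bytes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(data.keys())
    # zip(*columns) transposes the columns into rows.
    writer.writerows(zip(*data.values()))
    return buffer.getvalue().encode("utf-8")


def columns_to_json_bytes(data: Dict[str, Sequence]) -> bytes:
    """Hypothetical helper: serialize the same data as a JSON object."""
    payload = {name: list(column) for name, column in data.items()}
    return json.dumps(payload).encode("utf-8")


columns = {"id": [1, 2], "name": ["a", "b"]}
print(columns_to_csv_bytes(columns))
print(columns_to_json_bytes(columns))
```

With NumPy arrays as input, the JSON path would need column.tolist() instead of list(column), because NumPy scalar types are not JSON serializable.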

mdr.io.formats.convert_file_format(source_filepath, target_filepath, **options)[source]

Convert a file from one format to another.

Parameters:
  • source_filepath (str) – Path to the source file

  • target_filepath (str) – Path to the target file

  • **options – Additional options for the conversion

Return type:

None

mdr.io.formats.infer_column_types(data)[source]

Infer the data types of columns in a DataFrame.

Parameters:

data (pandas.DataFrame) – The DataFrame to analyze

Returns:

Dictionary mapping column names to inferred types

Return type:

Dict[str, str]
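The shape of the returned mapping can be sketched with pandas alone. The helper below is hypothetical and simply reports each column's dtype name; the type labels that infer_column_types actually returns are not specified here and may differ.

```python
import pandas as pd


def infer_types_sketch(frame: pd.DataFrame) -> dict:
    """Hypothetical helper: map each column name to its dtype name."""
    return {name: str(dtype) for name, dtype in frame.dtypes.items()}


frame = pd.DataFrame({"score": [0.5, 1.5], "label": ["a", "b"]})
print(infer_types_sketch(frame))
```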

mdr.io.formats.cast_column_types(data, type_map)[source]

Cast columns in a DataFrame to specified types.

Parameters:
  • data (pandas.DataFrame) – The DataFrame to modify

  • type_map (Dict[str, str]) – Dictionary mapping column names to target types

Returns:

DataFrame with columns cast to specified types

Return type:

pandas.DataFrame
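A type_map of this shape matches what pandas' DataFrame.astype accepts, so the behavior can be sketched as follows. The helper name is hypothetical, and the assumption that type names are pandas dtype strings is not confirmed by this reference.

```python
import pandas as pd


def cast_types_sketch(frame: pd.DataFrame, type_map: dict) -> pd.DataFrame:
    """Hypothetical helper: return a new frame with the listed columns cast."""
    # astype leaves columns that are absent from type_map unchanged.
    return frame.astype(type_map)


frame = pd.DataFrame({"id": ["1", "2"], "score": [1, 2]})
casted = cast_types_sketch(frame, {"id": "int64", "score": "float64"})
print(casted.dtypes)
```

astype raises if a value cannot be converted (for example, casting "abc" to int64), so callers that expect dirty data may want to catch ValueError.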

mdr.io.formats.is_numeric_column(data)[source]

Check if a numpy array contains numeric data.

Parameters:

data (numpy.ndarray) – The array to check

Returns:

True if the array contains numeric data, False otherwise

Return type:

bool
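A numeric check on a NumPy array usually comes down to inspecting its dtype. The following sketch assumes that approach; the helper name is hypothetical and the module's own rules (for instance, how it treats booleans or numeric strings) may differ.

```python
import numpy as np


def is_numeric_sketch(values: np.ndarray) -> bool:
    """Hypothetical helper: True for integer, float, and complex dtypes."""
    return bool(np.issubdtype(values.dtype, np.number))


print(is_numeric_sketch(np.array([1, 2, 3])))
print(is_numeric_sketch(np.array(["a", "b"])))
```

Note that the boolean dtype is not a subtype of np.number, so this sketch reports boolean arrays as non-numeric.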

mdr.io.formats.try_common_datetime_formats(col_data)[source]

Try to parse a column with common datetime formats.

Parameters:

col_data (pandas.Series) – The pandas Series to check

Returns:

True if the column contains datetime data, False otherwise

Return type:

bool
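The idea of trying a fixed list of formats can be sketched with the standard library. The candidate formats and the helper name below are assumptions; the formats the module actually tries are not listed in this reference. The helper accepts any iterable of strings, which includes a pandas Series of strings.

```python
from datetime import datetime
from typing import Iterable

# Assumed candidate formats, tried in order.
COMMON_FORMATS = ("%Y-%m-%d", "%Y-%m-%d %H:%M:%S", "%d/%m/%Y", "%m/%d/%Y")


def parses_with_any_format(values: Iterable[str]) -> bool:
    """Hypothetical helper: True if a single format parses every value."""
    values = list(values)
    if not values:
        return False
    for fmt in COMMON_FORMATS:
        try:
            for value in values:
                datetime.strptime(value, fmt)
        except ValueError:
            continue  # this format failed on some value; try the next one
        return True
    return False


print(parses_with_any_format(["2024-01-31", "2023-12-01"]))
print(parses_with_any_format(["not a date"]))
```

Requiring one format to match every value keeps a column with mixed conventions (for example, both 2024-01-31 and 31/01/2024) from being classified as datetime.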

mdr.io.formats.is_datetime_column(data)[source]

Check if a numpy array contains datetime data.

Parameters:

data (numpy.ndarray) – The array to check

Returns:

True if the array contains datetime data, False otherwise

Return type:

bool
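For arrays that already carry a datetime dtype, the check can be a dtype test, as in this sketch. The helper name is hypothetical, and the module may additionally try to parse string arrays, which this example does not do.

```python
import numpy as np


def is_datetime_sketch(values: np.ndarray) -> bool:
    """Hypothetical helper: True if the array has a datetime64 dtype."""
    return bool(np.issubdtype(values.dtype, np.datetime64))


dates = np.array(["2024-01-31", "2024-02-01"], dtype="datetime64[D]")
print(is_datetime_sketch(dates))
print(is_datetime_sketch(np.array([1, 2, 3])))
```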

Overview

The formats module provides utilities for working with different data formats, including format detection, conversion, and validation. It supports the core I/O functionality of the MDR package.

Core Functions

The core functions of this module are detect_format, convert_format, and validate_format. Their parameters and return types are documented in the reference above.

Supported Formats

The module supports the following data formats:

  • CSV: Comma-separated values files

  • JSON: JavaScript Object Notation files

  • Excel: Microsoft Excel workbooks (.xlsx, .xls)

  • Parquet: Apache Parquet columnar storage files

  • HDF5: Hierarchical Data Format version 5 files

Usage Examples

Format detection:

from mdr.io.formats import FormatType, detect_format

# Detect the format of a file; the result is a FormatType member
detected = detect_format("path/to/data.csv")
print(f"Format: {detected.name}")
if detected is FormatType.UNKNOWN:
    print("Format could not be determined")

Format conversion:

from mdr.io.formats import convert_file_format

# Convert a file from CSV to Parquet. No format arguments are passed
# because convert_file_format documents only the two paths and **options.
convert_file_format(
    "path/to/input.csv",
    "path/to/output.parquet",
)