Data Formats
Format utilities for Macrodata Refinement (MDR).
This module provides functions for detecting, validating, and converting between different data formats.
- class mdr.io.formats.FormatType(value)[source]
Bases: Enum
Supported data format types.
- CSV = 1
- JSON = 2
- EXCEL = 3
- PARQUET = 4
- HDF5 = 5
- UNKNOWN = 6
- mdr.io.formats.detect_format(filepath)[source]
Detect the format of a file based on its extension or content.
- Parameters:
filepath (str) – Path to the file
- Returns:
Detected format type
- Return type:
FormatType
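As a rough illustration of the extension-based half of detection, the sketch below maps file suffixes to format types. It does not import `mdr`: the `FormatType` enum is re-declared locally to mirror the members listed above, and `guess_format_from_extension` and `_EXTENSION_MAP` are hypothetical names, not part of the documented API. The suffix table is an assumption too.

```python
from enum import Enum
from pathlib import Path


class FormatType(Enum):
    """Local stand-in mirroring the documented enum members."""
    CSV = 1
    JSON = 2
    EXCEL = 3
    PARQUET = 4
    HDF5 = 5
    UNKNOWN = 6


# Assumed suffix table; the real module may recognise a different set.
_EXTENSION_MAP = {
    ".csv": FormatType.CSV,
    ".json": FormatType.JSON,
    ".xlsx": FormatType.EXCEL,
    ".xls": FormatType.EXCEL,
    ".parquet": FormatType.PARQUET,
    ".h5": FormatType.HDF5,
    ".hdf5": FormatType.HDF5,
}


def guess_format_from_extension(filepath: str) -> FormatType:
    """Return the format implied by the file suffix, or UNKNOWN."""
    suffix = Path(filepath).suffix.lower()
    return _EXTENSION_MAP.get(suffix, FormatType.UNKNOWN)
```

Lower-casing the suffix makes the lookup case-insensitive, and any unrecognised or missing suffix falls back to `FormatType.UNKNOWN` rather than raising.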
- mdr.io.formats.validate_format(filepath, expected_format)[source]
Validate if a file has the expected format.
- Parameters:
filepath (str) – Path to the file
expected_format (FormatType) – Expected format type
- Returns:
True if the file has the expected format, False otherwise
- Return type:
bool
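One way a content-based check could work for the binary formats is to compare the leading bytes of a file against well-known signatures. The sketch below is only an illustration under that assumption: `sniff_binary_format`, `has_expected_signature`, and the string labels are hypothetical and are not taken from the `mdr` source.

```python
from typing import Optional

# Well-known leading byte signatures for the binary formats.
_SIGNATURES = {
    "parquet": b"PAR1",
    "hdf5": b"\x89HDF\r\n\x1a\n",  # usually at offset 0
    "xlsx": b"PK\x03\x04",  # .xlsx is a ZIP container, so any ZIP matches
    "xls": b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1",  # legacy OLE2 compound file
}


def sniff_binary_format(header: bytes) -> Optional[str]:
    """Return the label of the first signature that prefixes `header`."""
    for name, magic in _SIGNATURES.items():
        if header.startswith(magic):
            return name
    return None


def has_expected_signature(filepath: str, expected: str) -> bool:
    """Read the first 8 bytes of a file and compare with the expected label."""
    with open(filepath, "rb") as fh:
        header = fh.read(8)
    return sniff_binary_format(header) == expected
```

Text formats such as CSV and JSON have no fixed signature, so a check like this can only confirm the binary formats; text files would need a parsing attempt instead.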
- mdr.io.formats.convert_format(data, source_format, target_format, **options)[source]
Convert data from one format to another.
- Parameters:
data (Dict[str, numpy.ndarray]) – Dictionary mapping column names to data arrays
source_format (FormatType) – Source format type
target_format (FormatType) – Target format type
**options – Additional options for the conversion
- Returns:
Converted data as bytes
- Return type:
bytes
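To make the "column mapping in, bytes out" contract concrete, here is a minimal standard-library sketch that serialises a column-oriented dictionary to CSV or JSON bytes. The helper names `columns_to_csv_bytes` and `columns_to_json_bytes` are hypothetical, plain Python sequences stand in for the data arrays, and nothing here reflects how `convert_format` is actually implemented.

```python
import csv
import io
import json
from typing import Dict, Sequence


def columns_to_csv_bytes(data: Dict[str, Sequence], encoding: str = "utf-8") -> bytes:
    """Serialise a column-oriented mapping as CSV with a header row."""
    lengths = {len(col) for col in data.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(data.keys())
    # zip(*columns) turns the column-wise data into rows.
    writer.writerows(zip(*data.values()))
    return buffer.getvalue().encode(encoding)


def columns_to_json_bytes(data: Dict[str, Sequence], encoding: str = "utf-8") -> bytes:
    """Serialise the same mapping as a JSON object of lists."""
    payload = {name: list(col) for name, col in data.items()}
    return json.dumps(payload).encode(encoding)
```

Checking the column lengths up front matters because `zip` would otherwise silently truncate every column to the shortest one.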
- mdr.io.formats.convert_file_format(source_filepath, target_filepath, **options)[source]
Convert a file from one format to another.
- mdr.io.formats.cast_column_types(data, type_map)[source]
Cast columns in a DataFrame to specified types.
- mdr.io.formats.is_numeric_column(data)[source]
Check if a numpy array contains numeric data.
- Parameters:
data (numpy.ndarray) – The array to check
- Returns:
True if the array contains numeric data, False otherwise
- Return type:
bool
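A common way to test a NumPy array for numeric content is to inspect its dtype. The sketch below assumes that approach; `looks_numeric` is a hypothetical name, and the real `is_numeric_column` may treat edge cases such as boolean or object arrays differently.

```python
import numpy as np


def looks_numeric(data: np.ndarray) -> bool:
    """True when the array dtype is an integer, float, or complex type."""
    # np.number covers ints, floats, and complex; bool, str, and object
    # dtypes are not subtypes of it.
    return bool(np.issubdtype(data.dtype, np.number))
```

With this check, an object array that happens to hold numbers still counts as non-numeric, because only the dtype is examined and not the individual elements.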
- mdr.io.formats.try_common_datetime_formats(col_data)[source]
Try to parse a column with common datetime formats.
- Parameters:
col_data (pandas.Series) – The pandas Series to check
- Returns:
True if the column contains datetime data, False otherwise
- Return type:
bool
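The idea of trying a list of common datetime formats can be sketched with the standard library alone. The format list, `first_matching_format`, and `parses_as_datetime` below are assumptions made for illustration; the real function takes a pandas Series and may try different formats.

```python
from datetime import datetime
from typing import Iterable, Optional

# Assumed candidate formats; the real function may try a different list.
COMMON_FORMATS = (
    "%Y-%m-%d",
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%dT%H:%M:%S",
    "%m/%d/%Y",
    "%d.%m.%Y",
)


def first_matching_format(values: Iterable[str]) -> Optional[str]:
    """Return the first format that parses every value, or None."""
    values = list(values)
    if not values:
        return None
    for fmt in COMMON_FORMATS:
        try:
            for value in values:
                datetime.strptime(value, fmt)
        except ValueError:
            continue  # at least one value did not match; try the next format
        return fmt
    return None


def parses_as_datetime(values: Iterable[str]) -> bool:
    """True when some common format parses every value in the column."""
    return first_matching_format(values) is not None
```

Requiring one format to match every value keeps mixed columns (for example, dates alongside free text) from being classified as datetime data.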
- mdr.io.formats.is_datetime_column(data)[source]
Check if a numpy array contains datetime data.
- Parameters:
data (numpy.ndarray) – The array to check
- Returns:
True if the array contains datetime data, False otherwise
- Return type:
bool
Overview
The formats module provides utilities for working with different data formats,
including format detection, conversion, and validation. It supports the core
I/O functionality of the MDR package.
Supported Formats
The module supports the following data formats:
CSV: Comma-separated values files
JSON: JavaScript Object Notation files
Excel: Microsoft Excel workbooks (.xlsx, .xls)
Parquet: Apache Parquet columnar storage files
HDF5: Hierarchical Data Format version 5 files
Usage Examples
Format detection:
from mdr.io.formats import FormatType, detect_format
# Detect the format of a file; the result is a FormatType member
file_format = detect_format("path/to/data.csv")
print(f"Format: {file_format.name}")
if file_format is FormatType.UNKNOWN:
    print("Could not determine the file format")
Format conversion:
from mdr.io.formats import convert_file_format
# Convert a file from CSV to Parquet
convert_file_format(
    "path/to/input.csv",
    "path/to/output.parquet",
)