dsipts.data_structure.time_series_d1 module

Time Series D1 Layer Module

This module provides the D1 layer for time series data handling:

  • MultiSourceTSDataSet: Handles raw data from multiple CSV files

Key Features:

  • Supports multiple CSV files with different groups

  • Handles regular time intervals

  • Processes data in chunks when memory-efficient mode is enabled

  • Handles categorical encoding and normalization

  • Preserves NaN values so that the D2 layer can handle them

dsipts.data_structure.time_series_d1.extend_time_df(df, time_col, freq, group_cols=None, max_length=None)[source]

Extend a dataframe to ensure regular time intervals.

Parameters:
  • df – Input dataframe containing time series data

  • time_col – Column name containing time information

  • freq – Frequency to use for extending the dataframe

  • group_cols – Optional list of columns identifying groups

  • max_length – Optional maximum length for the extended dataframe

Returns:

DataFrame extended with regular time intervals with all original columns preserved
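
To illustrate what "regular time intervals" means here, the sketch below reindexes each group onto a complete pandas.date_range grid and leaves NaN where rows were missing. It is not the library's implementation: the helper name regularize is hypothetical, max_length handling is omitted, timestamps are assumed unique within a group, and freq is assumed to be a pandas offset alias such as "D".

```python
import pandas as pd


def regularize(df, time_col, freq, group_cols=None):
    """Hypothetical stand-in: put each group on a gap-free time grid."""

    def _fill(part):
        # Full grid from the first to the last timestamp of this group.
        grid = pd.date_range(part[time_col].min(), part[time_col].max(), freq=freq)
        out = part.set_index(time_col).reindex(grid)
        out.index.name = time_col
        return out.reset_index()

    if not group_cols:
        return _fill(df)

    pieces = []
    for key, part in df.groupby(group_cols, sort=False):
        filled = _fill(part)
        # Rows created by reindexing have NaN group labels; restore them.
        key = key if isinstance(key, tuple) else (key,)
        for col, value in zip(group_cols, key):
            filled[col] = value
        pieces.append(filled)
    return pd.concat(pieces, ignore_index=True)


raw = pd.DataFrame({
    "store": ["a", "a", "b", "b"],
    "date": pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-01", "2024-01-02"]),
    "sales": [10.0, 12.0, 7.0, 8.0],
})
# Store "a" is missing 2024-01-02; the result gains one row with NaN sales.
regular = regularize(raw, "date", "D", group_cols=["store"])
```

The inserted row keeps its group label but has NaN in the value columns, which matches the module's stated policy of leaving NaN handling to the D2 layer.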

class dsipts.data_structure.time_series_d1.MultiSourceTSDataSet(file_paths, group_cols, time_col, feature_cols, target_cols, static_cols=None, cat_cols=None, num_cols=None, known_cols=None, unknown_cols=None, weights=None, memory_efficient=False, chunk_size=10000)[source]

Bases: Dataset

Layer 1 (D1) dataset for multi-source time series data.

This dataset:

  1. Loads time series data from multiple CSV files

  2. Handles categorical encoding and normalization

  3. Processes data in chunks when memory-efficient mode is enabled

  4. Preserves NaN values so that the D2 layer can handle them

It does NOT compute validity of windows or create sliding windows - that is the responsibility of the D2 layer (TSDataProcessor).
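
Normalizing while preserving NaN values can be sketched with NumPy's NaN-aware reductions. The source does not say which scaler or encoder the class uses, so the z-scoring and sorted integer codes below are assumptions, and both function names are hypothetical.

```python
import numpy as np


def zscore_keep_nan(values):
    """Standardize a 1-D array; NaN is ignored for the statistics and kept in the output."""
    values = np.asarray(values, dtype=float)
    mean = np.nanmean(values)
    std = np.nanstd(values)
    # Guard against constant columns to avoid division by zero.
    return (values - mean) / (std if std > 0 else 1.0)


def encode_categories(labels):
    """Map each distinct label to an integer code, assigned in sorted order."""
    mapping = {label: code for code, label in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping


scaled = zscore_keep_nan([1.0, np.nan, 3.0])
codes, mapping = encode_categories(["north", "south", "north"])
```

The missing entry stays NaN after scaling, so a later layer can still tell which observations were absent.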

Initialize the MultiSourceTSDataSet.

Parameters:
  • file_paths (List[str]) – List of paths to CSV files containing time series data

  • group_cols (str | List[str]) – Column(s) that identify unique time series groups

  • time_col (str) – Column containing time/date information

  • feature_cols (List[str]) – Columns to use as features (X)

  • target_cols (List[str]) – Columns to use as targets (y)

  • static_cols (List[str] | None) – Columns with static (non-time-varying) features

  • cat_cols (List[str] | None) – Categorical columns that need encoding

  • num_cols (List[str] | None) – Numerical columns (if None, all non-categorical columns are treated as numerical)

  • known_cols (List[str] | None) – Columns that are known at prediction time (if None, all feature_cols are considered known)

  • unknown_cols (List[str] | None) – Columns that are unknown at prediction time (if None, all target_cols are considered unknown)

  • weights (str | None) – Name of weights column

  • memory_efficient (bool) – Whether to use memory-efficient mode

  • chunk_size (int) – Chunk size for processing data (used in memory-efficient mode)
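
The None fallbacks described for num_cols, known_cols, and unknown_cols can be summarized in a few lines. This is a sketch of the documented defaults only, not the class's code: resolve_column_roles is a hypothetical helper, and reading "all non-categorical columns" as the non-categorical members of feature_cols and target_cols is an assumption.

```python
def resolve_column_roles(feature_cols, target_cols, cat_cols=None,
                         num_cols=None, known_cols=None, unknown_cols=None):
    """Hypothetical helper applying the documented None fallbacks."""
    cat_cols = list(cat_cols or [])
    if num_cols is None:
        # Assumption: the candidate columns are features plus targets.
        # dict.fromkeys removes duplicates while keeping order.
        candidates = dict.fromkeys([*feature_cols, *target_cols])
        num_cols = [col for col in candidates if col not in cat_cols]
    if known_cols is None:
        known_cols = list(feature_cols)
    if unknown_cols is None:
        unknown_cols = list(target_cols)
    return {
        "cat_cols": cat_cols,
        "num_cols": list(num_cols),
        "known_cols": list(known_cols),
        "unknown_cols": list(unknown_cols),
    }


roles = resolve_column_roles(
    feature_cols=["temperature", "holiday"],
    target_cols=["load"],
    cat_cols=["holiday"],
)
```

With only cat_cols given, "temperature" and "load" end up numerical, both features are treated as known at prediction time, and the target is treated as unknown.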

__init__(file_paths, group_cols, time_col, feature_cols, target_cols, static_cols=None, cat_cols=None, num_cols=None, known_cols=None, unknown_cols=None, weights=None, memory_efficient=False, chunk_size=10000)[source]

Initialize the MultiSourceTSDataSet.

__len__()[source]

Return the number of file-group combinations in the dataset.
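
One way to picture "file-group combinations" is a flat list of (file, group) pairs: its length is what __len__ reports, and its positions are the idx values passed to __getitem__. The sketch below builds such a list with the standard csv module. How the class actually builds or stores its index is not stated in the source; build_index and the in-memory files are hypothetical.

```python
import csv
import io


def build_index(sources, group_cols):
    """Return one (source_name, group_key) pair per distinct group per file."""
    index = []
    for name, handle in sources.items():
        seen = {}
        for row in csv.DictReader(handle):
            key = tuple(row[col] for col in group_cols)
            seen.setdefault(key, None)  # dict keeps first-seen order
        index.extend((name, key) for key in seen)
    return index


# In-memory stand-ins for two CSV files.
sources = {
    "north.csv": io.StringIO(
        "store,date,sales\na,2024-01-01,10\nb,2024-01-01,7\na,2024-01-02,12\n"
    ),
    "south.csv": io.StringIO("store,date,sales\na,2024-01-01,3\n"),
}
index = build_index(sources, group_cols=["store"])
```

Here group "a" appears in both files and is counted once per file, so the index holds three entries rather than two.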

__getitem__(idx)[source]

Get data for a specific file-group combination by index.

This method:

  1. Maps the index to a specific file-group combination

  2. Loads all data for that combination

  3. Converts data to appropriate formats for model consumption

Parameters:

idx – Index of the file-group combination to retrieve

Returns:

Dictionary containing group data with keys:

  • 'x': Feature tensor

  • 'y': Target tensor

  • 't': Time values (as numpy array)

  • 'w': Weight tensor

  • 'group_id': Group identifier

  • 'st': Static features
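
Because NaN values are preserved at this layer, a consumer of one item usually needs to locate them. The sketch below uses a hand-built dictionary with the documented keys. NumPy arrays stand in for the tensors, and the shapes (time steps by columns) and values are assumptions for illustration, not output from the library.

```python
import numpy as np

# Hand-built stand-in for one item; shapes and values are assumed.
sample = {
    "x": np.array([[1.0, 0.0], [np.nan, 1.0], [3.0, 0.0]]),
    "y": np.array([[10.0], [np.nan], [12.0]]),
    "t": np.array(["2024-01-01", "2024-01-02", "2024-01-03"], dtype="datetime64[D]"),
    "w": np.ones(3),
    "group_id": "a",
    "st": np.array([0.5]),
}

# A time step is complete when neither features nor targets contain NaN.
has_nan = np.isnan(sample["x"]).any(axis=1) | np.isnan(sample["y"]).any(axis=1)
complete = ~has_nan
complete_times = sample["t"][complete]
```

A mask of this kind is the sort of information the D2 layer would need when deciding which windows are valid.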

get_metadata()[source]

Return metadata about the dataset.

Returns:

Dictionary containing metadata about columns and their properties

Return type:

Dict