dsipts.data_structure package

Submodules

dsipts.data_structure.data_structure module

class dsipts.data_structure.data_structure.Categorical(name: str, frequency: int, duration: List[int], classes: int, action: ActionEnum, level: List[float])

Bases: object

Class for generating toy categorical data

Parameters:
  • name (str) – name of the categorical signal

  • frequency (int) – frequency of the signal

  • duration (List[int]) – duration of each class

  • classes (int) – number of classes

  • action (str) – either additive or multiplicative

  • level (List[float]) – intensity of each class

generate_signal(length: int) None

Generate the response signal

Parameters:

length (int) – length of the signal

plot() None

Plot the series
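
Usage (a minimal sketch; the signal name and values are illustrative only):

>>> weekly = Categorical('weekly', 1, [1,1,1,1,1,1,1], 7, 'multiplicative', [0.9,0.8,0.7,0.6,0.5,0.99,0.99])
>>> weekly.generate_signal(length=200)
>>> weekly.plot()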

class dsipts.data_structure.data_structure.TimeSeries(name: str, stacked: bool = False)

Bases: object

Class for generating a time series object. If you do not have any time series, you can build a synthetic one using the helper classes (Categorical, for instance).

Parameters:
  • name (str) – name of the series

  • stacked (bool) – if true it is a stacked model

Usage:

For example we can generate a toy timeseries:

  • add a multiplicative categorical feature (weekly)

>>> settimana = Categorical('settimanale',1,[1,1,1,1,1,1,1],7,'multiplicative',[0.9,0.8,0.7,0.6,0.5,0.99,0.99])
  • an additive monthly feature (here a year is composed of 5 months)

>>> mese = Categorical('mensile',1,[31,28,20,10,33],5,'additive',[10,20,-10,20,0])
  • a spot categorical variable that occurs every 100 days and lasts 7 days

>>> spot = Categorical('spot',100,[7],1,'additive',[10])
>>> ts = TimeSeries('prova')
>>> ts.generate_signal(length = 5000,categorical_variables = [settimana,mese,spot],noise_mean=1,type=0) ## we can also add noise
>>> ts.plot()
create_data_loader(data: DataFrame, past_steps: int, future_steps: int, shift: int = 0, keep_entire_seq_while_shifting: bool = False, starting_point: None | dict = None, skip_step: int = 1, is_inference: bool = False) MyDataset

Create the dataset for the training/inference step

Parameters:
  • data (pd.DataFrame) – input dataset, usually a subset of self.data

  • past_steps (int) – past context length

  • future_steps (int) – future lags to predict

  • shift (int, optional) – if >0 the future input variables (categorical and numerical) will be shifted. For example, for attention-based models it is better to start with a known value of y and use it during the process. Defaults to 0.

  • keep_entire_seq_while_shifting (bool, optional) – if the dataset is shifted, you may want the future data to be of length future_steps+shift (as in Informer). Defaults to False.

  • starting_point (Union[None,dict], optional) – a dictionary indicating whether a sample must be considered. It is checked against the first lag in the future (useful when your model has to predict only starting from, say, hour 12). Defaults to None.

  • skip_step (int, optional) – step between two consecutive samples. Usually there is a skip of one between two samples, but for debugging or training-time purposes you can skip some of them. Defaults to 1.

Returns:

class that extends torch.utils.data.Dataset (see utils)

keys of a batch:

  • y – the target variable(s)

  • x_num_past – the numerical past variables

  • x_num_future – the numerical future variables

  • x_cat_past – the categorical past variables

  • x_cat_future – the categorical future variables

  • idx_target – index of target features in the past array

Return type:

MyDataset
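
Usage (a hedged sketch; it assumes ts is a TimeSeries whose data has already been loaded, and the shapes in the comments depend on the loaded columns):

>>> dataset = ts.create_data_loader(ts.data, past_steps=100, future_steps=20)
>>> sample = dataset[0]
>>> sorted(sample.keys())   # contains y, x_num_past, x_cat_past, ... as listed above
>>> sample['y'].shape       # roughly (future_steps, number of target variables)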

enrich(dataset, columns)
generate_signal(length: int = 5000, categorical_variables: List[Categorical] = [], noise_mean: int = 1, type: int = 0) None

This will generate a synthetic signal with the selected length, noise level and categorical variables. The additive series are added at the end, while the multiplicative series act on the original signal. The TimeSeries structure will be populated.

Parameters:
  • length (int, optional) – length of the signal. Defaults to 5000.

  • categorical_variables (List[Categorical], optional) – list of Categorical variables. Defaults to [].

  • noise_mean (int, optional) – variance of the noise to add at the end. Defaults to 1.

  • type (int, optional) – type of the timeseries (only type=0 available right now). Defaults to 0.

inference(batch_size: int = 100, num_workers: int = 4, split_params: None | dict = None, rescaling: bool = True, data: DataFrame = None, steps_in_future: int = 0, check_holes_and_duplicates: bool = True, is_inference: bool = False) DataFrame

Similar to inference_on_set; the only difference is split_params, which must contain the keys ‘past_steps’, ‘future_steps’, ‘shift’, ‘keep_entire_seq_while_shifting’ and ‘starting_point’ (the defaults are usually sufficient).

skip_step is set to 1 for convenience (generally you want all the predictions). You can also set split_params to None and use the standard parameters (at your own risk).

Parameters:
  • batch_size (int, optional) – see inference_on_set. Defaults to 100.

  • num_workers (int, optional) – see inference_on_set. Defaults to 4.

  • split_params (Union[None,dict], optional) – see inference_on_set. Defaults to None.

  • rescaling (bool, optional) – see inference_on_set. Defaults to True.

  • data (pd.DataFrame, optional) – starting dataset. Defaults to None.

  • steps_in_future (int, optional) – if >0 the dataset is extended in order to make predictions in the future. Defaults to 0.

  • check_holes_and_duplicates (bool, optional) – if False the routine does not check for holes or duplicates; set to False for stacked models. Defaults to True.

Returns:

predicted values

Return type:

pd.DataFrame
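
Usage (a hedged sketch of predicting a few steps beyond the stored data; the split_params values are illustrative):

>>> res = ts.inference(batch_size=200,
...                    split_params={'past_steps': 100, 'future_steps': 20, 'shift': 0,
...                                  'keep_entire_seq_while_shifting': False, 'starting_point': None},
...                    steps_in_future=20)
>>> res.head()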

inference_on_set(batch_size: int = 100, num_workers: int = 4, split_params: None | dict = None, set: str = 'test', rescaling: bool = True, data: None | Dataset = None) DataFrame

This function returns the predictions on a particular set (train, validation or test).

Parameters:
  • batch_size (int, optional) – batch size. Defaults to 100.

  • num_workers (int, optional) – num workers. Defaults to 4.

  • split_params (Union[None,dict], optional) – if not None the splitting procedure will use the given parameters, otherwise it will use the same configuration used during training. Defaults to None.

  • set (str, optional) – train, validation or test. Defaults to ‘test’.

  • rescaling (bool, optional) – if True the output will be rescaled to the initial values. Defaults to True.

  • data (None or pd.DataFrame, optional)

Returns:

the predicted values in a pandas format

Return type:

pd.DataFrame
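
Usage (a minimal sketch, run after training):

>>> res = ts.inference_on_set(batch_size=100, set='validation', rescaling=True)
>>> res.head()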

load(model: Base, filename: str, load_last: bool = True, dirpath: str | None = None, weight_path: str | None = None) None

Load a saved model

Parameters:
  • model (Base) – class of the model to load (it will be instantiated by pytorch-lightning)

  • filename (str) – filename of the saved model

  • load_last (bool, optional) – if True the last checkpoint will be loaded, otherwise the best one (on the validation set). Defaults to True.

  • dirpath (Union[str,None], optional) – if None we assume that the model is loaded on the same machine where it was trained, otherwise we can pass the dirpath where everything has been saved. Defaults to None.

  • weight_path (Union[str, None], optional) – if None the standard path will be used. Defaults to None.

load_signal(data: DataFrame, enrich_cat: List[str] = [], past_variables: List[str] = [], future_variables: List[str] = [], target_variables: List[str] = [], cat_past_var: List[str] = [], cat_fut_var: List[str] = [], check_past: bool = True, group: None | str = None, check_holes_and_duplicates: bool = True, silly_model: bool = False) None
This is a crucial point of the data structure. The input dataset is expected to have a timestamp column called time.
There are some checks:

1- duplicates will be removed, keeping the first instance

2- the frequency will be inferred taking the minimum time distance between samples

3- the dataset will be filled, completing the missing timestamps

Parameters:
  • data (pd.DataFrame) – input dataset; the column indicating the time must be called time

  • enrich_cat (List[str], optional) – it is possible to let this function enrich the dataset, for example adding the standard columns hour, dow, month and minute. Defaults to [].

  • past_variables (List[str], optional) – list of column names of past variables not available at future times. Defaults to [].

  • future_variables (List[str], optional) – list of future variables available at future times. Defaults to [].

  • target_variables (List[str], optional) – list of the target variables. They will be added to past_variables by default unless check_past is False. Defaults to [].

  • cat_past_var (List[str], optional) – list of the past categorical variables. Defaults to [].

  • cat_fut_var (List[str], optional) – list of the future categorical variables. Defaults to [].

  • check_past (bool, optional) – see target_variables. Defaults to True.

  • group (str or None, optional) – if not None the dataset is considered to be composed of homogeneous time series coming from different realizations (for example points of sale, cities, locations), and the related series are not split during sample generation. Defaults to None.

  • check_holes_and_duplicates (bool, optional) – if False, duplicates and holes will not be checked and the dataloader may not work correctly; disable at your own risk. Defaults to True.

  • silly_model (bool, optional) – if True, target variables will be added to the pool of the future variables. This can be useful to see if information passes through the decoder part of your model (if any)
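
Usage (a hedged sketch; the file name and the column names other than time are illustrative):

>>> import pandas as pd
>>> df = pd.read_csv('sales.csv', parse_dates=['time'])
>>> ts = TimeSeries('sales')
>>> ts.load_signal(df,
...                enrich_cat=['hour', 'dow'],
...                past_variables=['temperature'],
...                future_variables=['price'],
...                target_variables=['units_sold'],
...                group='store_id')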

plot()

Easy way to check the loaded data.

Returns:

figure of the target variables

Return type:

plotly.graph_objects._figure.Figure

save(filename: str) None

Save the timeseries object

Parameters:

filename (str) – name of the file
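
Usage (a minimal save/load sketch; MyModel stands for whatever Base subclass was trained and is purely illustrative):

>>> ts.save('prova_ts')
>>> ts2 = TimeSeries('prova')
>>> ts2.load(MyModel, 'prova_ts', load_last=True)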

set_model(model: Base, config: dict = None, custom_init: bool = False)

Set the model to train

Parameters:
  • model (Base) – see models

  • config (dict, optional) – usually the configuration used by the model. Defaults to None.

  • custom_init (bool, optional) – if True a custom initialization paradigm will be used (see weight_init in models/utils.py).

set_verbose(verbose: bool)
split_for_train(perc_train: float | None = 0.6, perc_valid: float | None = 0.2, range_train: List[datetime | str] | None = None, range_validation: List[datetime | str] | None = None, range_test: List[datetime | str] | None = None, past_steps: int = 100, future_steps: int = 20, shift: int = 0, keep_entire_seq_while_shifting: bool = False, starting_point: None | dict = None, skip_step: int = 1, normalize_per_group: bool = False, check_consecutive: bool = True, scaler: str = 'StandardScaler()') List[DataLoader]

Split the data and create the datasets.

Parameters:
  • perc_train (Union[float,None], optional) – fraction of the training set. Defaults to 0.6.

  • perc_valid (Union[float,None], optional) – fraction of the validation set. Defaults to 0.2.

  • range_train (Union[List[Union[datetime, str]],None], optional) – a list of two elements indicating the starting point and end point of the training set (string date style or datetime). Defaults to None.

  • range_validation (Union[List[Union[datetime, str]],None], optional) – a list of two elements indicating the starting point and end point of the validation set (string date style or datetime). Defaults to None.

  • range_test (Union[List[Union[datetime, str]],None], optional) – a list of two elements indicating the starting point and end point of the test set (string date style or datetime). Defaults to None.

  • past_steps (int, optional) – past steps to consider for making the prediction. Defaults to 100.

  • future_steps (int, optional) – future step to predict. Defaults to 20.

  • shift (int, optional) – see create_data_loader. Defaults to 0.

  • keep_entire_seq_while_shifting (bool, optional) – if the dataset is shifted, you may want the future data to be of length future_steps+shift (as in Informer). Defaults to False.

  • starting_point (Union[None, dict], optional) – see create_data_loader. Defaults to None.

  • skip_step (int, optional) – see create_data_loader. Defaults to 1.

  • normalize_per_group (boolean, optional) – if True and self.group is not None, the variables are scaled with respect to the groups. Defaults to False.

  • check_consecutive (boolean, optional) – if False it skips the check on consecutive ranges. Defaults to True.

  • scaler – instance of a sklearn.preprocessing scaler. Default ‘StandardScaler()’

Returns:

three dataloader used for training or inference

Return type:

List[DataLoader, DataLoader, DataLoader]
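
Usage (a hedged sketch of a 60/20/20 percentage split; parameters mirror the defaults above):

>>> train_dl, valid_dl, test_dl = ts.split_for_train(perc_train=0.6, perc_valid=0.2,
...                                                  past_steps=100, future_steps=20)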

train_model(dirpath: str, split_params: dict, batch_size: int = 100, num_workers: int = 4, max_epochs: int = 500, auto_lr_find: bool = True, gradient_clip_val: float | None = None, gradient_clip_algorithm: str = 'value', devices: str | List[int] = 'auto', precision: str | int = 32, modifier: None | str = None, modifier_params: None | dict = None, seed: int = 42) float

Train the model

Parameters:
  • dirpath (str) – path where all the training outputs will be saved

  • split_params (dict) – see split_for_train

  • batch_size (int, optional) – batch size. Defaults to 100.

  • num_workers (int, optional) – num_workers for the dataloader. Defaults to 4.

  • max_epochs (int, optional) – maximum epochs to perform. Defaults to 500.

  • auto_lr_find (bool, optional) – find the initial learning rate, see pytorch-lightning. Defaults to True.

  • gradient_clip_val (Union[float,None], optional) – gradient_clip_val. Defaults to None. See https://lightning.ai/docs/pytorch/stable/advanced/training_tricks.html

  • gradient_clip_algorithm (str, optional) – gradient_clip_algorithm. Defaults to ‘value’. See https://lightning.ai/docs/pytorch/stable/advanced/training_tricks.html

  • devices (Union[str,List[int]], optional) – devices to use. Use auto if cpu or the list of gpu to use otherwise. Defaults to ‘auto’.

  • precision (Union[str,int], optional) – precision to use. Usually 32 bit is fine, but for larger models you should try ‘bf16’. If ‘auto’ it will use bf16 on GPU and 32 on CPU.

  • modifier (Union[str,None], optional) – if not None a modifier is applied to the dataloader. Sometimes lightning has very restrictive rules on the dataloader, or we want to use an ML model before or after the DL model (see the readme for more information)

  • modifier_params (Union[dict,None], optional) – parameters of the modifier

  • seed (int, optional) – seed for reproducibility
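
Usage (a hedged end-to-end sketch; MyModel and my_config are placeholders for a real Base subclass and its configuration dictionary):

>>> model = MyModel(...)                   # hypothetical Base subclass
>>> ts.set_model(model, config=my_config)  # my_config: configuration used by the model
>>> ts.train_model(dirpath='checkpoints/prova',
...                split_params={'perc_train': 0.6, 'perc_valid': 0.2,
...                              'past_steps': 100, 'future_steps': 20},
...                batch_size=128, max_epochs=50)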

dsipts.data_structure.modifiers module

class dsipts.data_structure.modifiers.Modifier(**kwargs)

Bases: ABC

In the constructor you can store some parameters of the modifier. It will be saved when the timeseries is saved.

abstractmethod fit_transform(train: MyDataset, val: MyDataset) [Dataset, Dataset]

This function is called before the training procedure and it should transform the standard Dataset into the new Dataset

Parameters:
  • train (MyDataset) – initial train Dataset

  • val (MyDataset) – initial validation Dataset

Returns:

transformed train and validation Datasets

Return type:

Dataset, Dataset

abstractmethod inverse_transform(res: array, real: array) [array, array]

The results must be transformed back with respect to the prediction task

Parameters:
  • res (np.array) – raw prediction

  • real (np.array) – raw real data

Returns:

inverse transformation of the predictions and the real data

Return type:

[np.array, np.array]

abstractmethod transform(test: MyDataset) Dataset

Similar to fit_transform, but only the transformation is performed. It is used in the inference function before calling the inference method.

Parameters:

test (MyDataset) – initial test Dataset

Returns:

transformed test Dataset

Return type:

Dataset
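
A minimal sketch of the smallest possible subclass: an identity modifier that implements the three abstract methods and leaves datasets and predictions untouched (useful only to illustrate the required interface):

>>> class IdentityModifier(Modifier):
...     def fit_transform(self, train, val):
...         # nothing to fit: return the datasets unchanged
...         return train, val
...     def transform(self, test):
...         # no transformation at inference time either
...         return test
...     def inverse_transform(self, res, real):
...         # predictions are already in the original space
...         return res, real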

class dsipts.data_structure.modifiers.ModifierVVA(**kwargs)

Bases: Modifier

In the constructor you can store some parameters of the modifier. It will be saved when the timeseries is saved.

fit_transform(train: MyDataset, val: MyDataset) [Dataset, Dataset]

BisectingKMeans is used on segments of length token_split

Parameters:
  • train (MyDataset) – initial train Dataset

  • val (MyDataset) – initial validation Dataset

Returns:

transformed train and validation Datasets

Return type:

Dataset, Dataset

inverse_transform(res: array, real: array) [array, array]

The results must be transformed back with respect to the prediction task

Parameters:

res (np.array) – raw prediction

Returns:

inverse transformation of the predictions

Return type:

np.array

transform(test: MyDataset) Dataset

Similar to fit_transform, but only the transformation is performed.

Parameters:

test (MyDataset) – initial test Dataset

Returns:

transformed test Dataset

Return type:

Dataset

class dsipts.data_structure.modifiers.VVADataset(x, y, y_orig, t, length_in, length_out, num_digits)

Bases: Dataset

dsipts.data_structure.utils module

class dsipts.data_structure.utils.MyDataset(data: dict, t: array, groups: array, idx_target: array | None, idx_target_future: array | None)

Bases: Dataset

Extension of the Dataset class. During training the returned item is a batch containing the standard keys

Parameters:
  • data (dict) – a dictionary where each value is a np.array containing the data. The keys are: y (the target variable(s)), x_num_past (the numerical past variables), x_num_future (the numerical future variables), x_cat_past (the categorical past variables), x_cat_future (the categorical future variables) and idx_target (index of target features in the past array)

  • t (np.array) – the time array related to the target variables

  • idx_target (Union[np.array,None]) – you can specify the indexes in the past data that represent the target features (for differential analysis or detrending strategies)

  • idx_target_future (Union[np.array,None]) – you can specify the indexes in the future data that represent the target features (for differential analysis or detrending strategies)

Returns:

a torch Dataset to be used in a Dataloader

Return type:

torch.utils.data.Dataset
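
Usage (a hedged sketch; dataset is the object returned by create_data_loader and the keys are those listed above):

>>> from torch.utils.data import DataLoader
>>> loader = DataLoader(dataset, batch_size=32, shuffle=True)
>>> for batch in loader:
...     y = batch['y']                 # target variable(s)
...     x_past = batch['x_num_past']   # numerical past variables
...     break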

dsipts.data_structure.utils.beauty_string(message: str, type: str, verbose: bool)
dsipts.data_structure.utils.extend_time_df(x: DataFrame, freq: str | int, group: str | None = None, global_minmax: bool = False) DataFrame

Utility for generating a full time index into which the real data can then be merged

Parameters:
  • x (pd.DataFrame) – dataframe containing the column time

  • freq (str) – frequency (in pandas notation) of the resulting dataframe

  • group (string or None) – if not None the min and max are computed per group using the group column. Defaults to None.

  • global_minmax (bool) – if True the min/max is computed globally for each group. Usually used for stacked models

Returns:

a dataframe with the column time ranging from the minimum of x to the maximum, with frequency freq

Return type:

pd.DataFrame
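
Usage (a hedged sketch, assuming an hourly dataframe df with holes; only the column time is required):

>>> from dsipts.data_structure.utils import extend_time_df
>>> full = extend_time_df(df, freq='1h')
>>> full = full.merge(df, how='left', on='time')   # re-attach the real data, leaving NaN in the holes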

Module contents