DataHandler#

This module contains the DataHandler class, which is responsible for the data preparation pipeline ahead of model training.

class udao.data.handler.data_handler.DataHandler(data: DataFrame, params: Params)#

Bases: object

DataHandler class to handle data loading, splitting, feature extraction and dataset iterator creation.

Parameters:
  • data (pd.DataFrame) – Dataframe containing the data.

  • params (DataHandler.Params) – DataHandler.Params object containing the parameters of the DataHandler.
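
A minimal construction sketch, assuming a pandas DataFrame df is already loaded and a DataHandler.Params object params has been configured (see the Params sketch further below):

    from udao.data.handler.data_handler import DataHandler

    # df: a pd.DataFrame containing the data (assumed to be loaded already)
    # params: a DataHandler.Params object (see the Params sketch below)
    handler = DataHandler(data=df, params=params)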

class Params(index_column: str, data_processor: udao.data.handler.data_processor.DataProcessor, stratify_on: str | None = None, val_frac: float = 0.2, test_frac: float = 0.1, dryrun: bool = False, random_state: int | None = None, tensors_dtype: torch.dtype | None = None)#

Bases: object

data_processor: DataProcessor#

DataProcessor object to extract features from the data and create the iterator.

dryrun: bool = False#

Dry-run mode that samples a small portion of the data for fast computation on a large dataset, by default False

index_column: str#

Column that should be used as index (unique identifier)

random_state: int | None = None#

Random state for reproducibility, by default None

stratify_on: str | None = None#

Column on which to stratify the split, by default None. If None, no stratification is performed.

tensors_dtype: dtype | None = None#

Data type of the tensors, by default None

test_frac: float = 0.1#

Fraction of the data allotted to the test set, by default 0.1

val_frac: float = 0.2#

Fraction of the data allotted to the validation set, by default 0.2
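
A sketch of configuring Params with the fields documented above. The construction of the DataProcessor is not covered on this page, so data_processor stands in for an already-built instance, and the "id" column name is hypothetical:

    import torch

    from udao.data.handler.data_handler import DataHandler
    from udao.data.handler.data_processor import DataProcessor

    # assumed to be configured elsewhere; construction is not documented here
    data_processor: DataProcessor = ...

    params = DataHandler.Params(
        index_column="id",              # hypothetical unique-identifier column
        data_processor=data_processor,
        stratify_on=None,               # no stratification of the splits
        val_frac=0.2,                   # 20% of the data for validation
        test_frac=0.1,                  # 10% of the data for testing
        dryrun=False,
        random_state=42,                # fix the seed for reproducible splits
        tensors_dtype=torch.float32,
    )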

classmethod from_csv(csv_path: str, params: Params) DataHandler#

Initialize a DataHandler from a CSV file.

Parameters:
  • csv_path (str) – Path to the data file.

  • params (Params) – DataHandler.Params object containing the parameters of the DataHandler.

Returns:

Initialized DataHandler object.

Return type:

DataHandler
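
A sketch of the CSV entry point, with a hypothetical file path and params configured as in the Params sketch above:

    from udao.data.handler.data_handler import DataHandler

    # hypothetical path; params is a DataHandler.Params object
    handler = DataHandler.from_csv("data/queries.csv", params=params)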

get_iterators() Dict[Literal['train', 'val', 'test'], BaseIterator]#

Return a dictionary of iterators for the different splits of the data.

Returns:

Dictionary of iterators for the different splits of the data.

Return type:

Dict[DatasetType, BaseDatasetIterator]
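
A sketch of retrieving the per-split iterators; the dictionary keys are the 'train', 'val' and 'test' literals from the signature:

    iterators = handler.get_iterators()
    train_iterator = iterators["train"]
    val_iterator = iterators["val"]
    test_iterator = iterators["test"]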

split_data() DataHandler#

Split the data into train, validation and test sets; the split indices are stored in self.index_splits.

Returns:

The DataHandler object, with split indices stored in self.index_splits.

Return type:

DataHandler
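
Since split_data() returns the DataHandler (presumably the handler itself, given that the split indices are stored in self.index_splits), it can be chained with get_iterators(); a sketch:

    # split the data (indices stored in handler.index_splits), then build the iterators
    iterators = handler.split_data().get_iterators()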