DataHandler#

This module contains the DataHandler class, which is responsible for the data preparation pipeline ahead of model training.

class udao.data.handler.data_handler.DataHandler(data: DataFrame, params: Params)#

Bases: object

DataHandler class to handle data loading, splitting, feature extraction and dataset iterator creation.

Parameters:
  • data (pd.DataFrame) – Dataframe containing the data.

  • params (DataHandler.Params) – DataHandler.Params object containing the parameters of the DataHandler.
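
A minimal construction sketch, assuming a pandas DataFrame df is already loaded and a DataHandler.Params object params has been configured (see the Params sketch further below):

    from udao.data.handler.data_handler import DataHandler

    # df: a pd.DataFrame containing the data (assumed to be loaded already)
    # params: a DataHandler.Params object (see the Params sketch below)
    handler = DataHandler(data=df, params=params)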

class Params(index_column: str, data_processor: udao.data.handler.data_processor.DataProcessor, stratify_on: str | None = None, val_frac: float = 0.2, test_frac: float = 0.1, dryrun: bool = False, random_state: int | None = None, tensors_dtype: torch.dtype | None = None)#

Bases: object

data_processor: DataProcessor#

DataProcessor object to extract features from the data and create the iterator.

dryrun: bool = False#

Dry-run mode that samples a small portion of the data for fast computation on a large dataset, by default False

index_column: str#

Column that should be used as index (unique identifier)

random_state: int | None = None#

Random state for reproducibility, by default None

stratify_on: str | None = None#

Column on which to stratify the split, by default None. If None, no stratification is performed.

tensors_dtype: dtype | None = None#

Data type of the tensors, by default None

test_frac: float = 0.1#

Fraction of the data allotted to the test set, by default 0.1

val_frac: float = 0.2#

Fraction of the data allotted to the validation set, by default 0.2
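
A sketch of configuring Params with the fields documented above. The construction of the DataProcessor is not covered on this page, so data_processor stands in for an already-built instance, and the "id" column name is hypothetical:

    import torch

    from udao.data.handler.data_handler import DataHandler
    from udao.data.handler.data_processor import DataProcessor

    # assumed to be configured elsewhere; construction is not documented here
    data_processor: DataProcessor = ...

    params = DataHandler.Params(
        index_column="id",              # hypothetical unique-identifier column
        data_processor=data_processor,
        stratify_on=None,               # no stratification of the splits
        val_frac=0.2,                   # 20% of the data for validation
        test_frac=0.1,                  # 10% of the data for testing
        dryrun=False,
        random_state=42,                # fix the seed for reproducible splits
        tensors_dtype=torch.float32,
    )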

classmethod from_csv(csv_path: str, params: Params) DataHandler#

Initialize a DataHandler from a CSV file.

Parameters:
  • csv_path (str) – Path to the data file.

  • params (Params) – DataHandler.Params object containing the parameters of the DataHandler.

Returns:

Initialized DataHandler object.

Return type:

DataHandler
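
A sketch of the CSV entry point, with a hypothetical file path and params configured as in the Params sketch above:

    from udao.data.handler.data_handler import DataHandler

    # hypothetical path; params is a DataHandler.Params object
    handler = DataHandler.from_csv("data/queries.csv", params=params)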

get_iterators() Dict[Literal['train', 'val', 'test'], BaseIterator]#

Return a dictionary of iterators for the different splits of the data.

Returns:

Dictionary of iterators for the different splits of the data.

Return type:

Dict[DatasetType, BaseDatasetIterator]
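
A sketch of retrieving the per-split iterators; the dictionary keys are the 'train', 'val' and 'test' literals from the signature:

    iterators = handler.get_iterators()
    train_iterator = iterators["train"]
    val_iterator = iterators["val"]
    test_iterator = iterators["test"]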

split_data() DataHandler#

Split the data into train, validation and test sets; the split indices are stored in self.index_splits.

Returns:

The DataHandler object, with split indices stored in self.index_splits.

Return type:

DataHandler
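
Since split_data() returns the DataHandler (presumably the handler itself, given that the split indices are stored in self.index_splits), it can be chained with get_iterators(); a sketch:

    # split the data (indices stored in handler.index_splits), then build the iterators
    iterators = handler.split_data().get_iterators()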