DataHandler#
This module contains the DataHandler class, which is responsible for the data preparation pipeline ahead of model training.
- class udao.data.handler.data_handler.DataHandler(data: DataFrame, params: Params)#
Bases:
object
DataHandler class to handle data loading, splitting, feature extraction and dataset iterator creation.
- Parameters:
data (pd.DataFrame) – Dataframe containing the data.
params (DataHandler.Params) – DataHandler.Params object containing the parameters of the DataHandler.
- class Params(index_column: str, data_processor: udao.data.handler.data_processor.DataProcessor, stratify_on: str | None = None, val_frac: float = 0.2, test_frac: float = 0.1, dryrun: bool = False, random_state: int | None = None, tensors_dtype: torch.dtype | None = None)#
Bases:
object
- data_processor: DataProcessor#
DataProcessor object to extract features from the data and create the iterator.
- dryrun: bool = False#
If True, run in dry-run mode: sample a small portion of the data for fast computation on a large dataset, by default False
- index_column: str#
Column that should be used as index (unique identifier)
- random_state: int | None = None#
Random state for reproducibility, by default None
- stratify_on: str | None = None#
Column on which to stratify the split, by default None. If None, no stratification is performed.
- tensors_dtype: dtype | None = None#
Data type of the tensors, by default None
- test_frac: float = 0.1#
Fraction allotted to the test set, by default 0.1
- val_frac: float = 0.2#
Fraction allotted to the validation set, by default 0.2
- classmethod from_csv(csv_path: str, params: Params) DataHandler #
Initialize DataHandler from csv.
- Parameters:
csv_path (str) – Path to the data file.
params (Params) – DataHandler.Params object containing the parameters of the DataHandler.
- Returns:
Initialized DataHandler object.
- Return type:
DataHandler
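As a rough illustration of what from_csv does (a hypothetical sketch, not udao's actual implementation; the Params construction is elided), the method presumably reads the file into a DataFrame before delegating to the regular DataHandler(data, params) constructor, with params.index_column naming the unique-identifier column:

```python
import io

import pandas as pd

# Inline CSV standing in for a file on disk (hypothetical data).
csv_text = "id,feature,target\n1,0.5,a\n2,0.7,b\n3,0.2,a\n"

# Read the CSV into a DataFrame, as from_csv presumably does internally.
df = pd.read_csv(io.StringIO(csv_text))

# params.index_column identifies the unique-identifier column, e.g. "id".
df = df.set_index("id")
print(df.shape)  # (3, 2)
```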
- get_iterators() Dict[Literal['train', 'val', 'test'], BaseIterator] #
Return a dictionary of iterators for the different splits of the data.
- Returns:
Dictionary of iterators for the different splits of the data.
- Return type:
Dict[DatasetType, BaseDatasetIterator]
- split_data() DataHandler #
Split the data into train, test and validation sets; the split indices are stored in self.index_splits.
- Returns:
The DataHandler object, with the split indices stored in self.index_splits.
- Return type:
DataHandler
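The split performed by split_data can be sketched with a self-contained helper (a hypothetical illustration in pandas/NumPy, not udao's actual implementation; it assumes val_frac and test_frac are both fractions of the full dataset and that stratification splits each stratify_on group separately to preserve proportions):

```python
from typing import Dict, List, Optional

import numpy as np
import pandas as pd


def split_indices(
    df: pd.DataFrame,
    val_frac: float = 0.2,
    test_frac: float = 0.1,
    stratify_on: Optional[str] = None,
    random_state: Optional[int] = None,
) -> Dict[str, List[int]]:
    """Hypothetical train/val/test index split mirroring DataHandler.Params."""
    rng = np.random.default_rng(random_state)

    def sample(indices: List[int], frac: float) -> List[int]:
        n = int(round(len(indices) * frac))
        return list(rng.choice(indices, size=n, replace=False))

    # With stratification, split each group separately to keep proportions.
    if stratify_on is None:
        groups = [df.index.to_list()]
    else:
        groups = [g.index.to_list() for _, g in df.groupby(stratify_on)]

    splits: Dict[str, List[int]] = {"train": [], "val": [], "test": []}
    for indices in groups:
        test = sample(indices, test_frac)
        rest = [i for i in indices if i not in test]
        # Rescale val_frac so it stays a fraction of the *full* data.
        val = sample(rest, val_frac / (1.0 - test_frac))
        splits["test"] += test
        splits["val"] += val
        splits["train"] += [i for i in rest if i not in val]
    return splits
```

With the default val_frac=0.2 and test_frac=0.1 on 100 stratified rows, this yields 70/20/10 train/val/test indices.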