DataProcessor#

This module contains the DataProcessor class, which is responsible for storing and applying the data processing pipeline.

class udao.data.handler.data_processor.DataProcessor(iterator_cls: Type[IT], feature_extractors: Dict[str, TrainedExtractor | StaticExtractor], feature_preprocessors: Mapping[str, Sequence[TrainedPreprocessor | StaticPreprocessor]] | None = None, tensors_dtype: dtype | None = None)#

Bases: Generic[IT]

Parameters:
  • iterator_cls (Type[BaseDatasetIterator]) – Dataset iterator class type.

  • feature_extractors (Dict[str, TrainedExtractor | StaticExtractor]) –

    Dict that links a feature name to the extractor instance used to compute that feature. N.B.: Feature names must match the iterator’s parameters.

    If the extractor is a StaticExtractor, the features are extracted independently of the split.

    If the extractor is a TrainedExtractor, the extractor is first fitted on the train split and then applied to the other splits.

  • feature_preprocessors (Optional[Mapping[str, Sequence[TrainedPreprocessor | StaticPreprocessor]]]) –

    Dict that links a feature name to a list of preprocessor instances to apply, in order, to the extracted feature. This makes it possible to apply a series of preprocessors to different features, e.g. to normalize them. N.B.: Feature names must match the iterator’s parameters.

    If a preprocessor is a StaticPreprocessor, the features are processed independently of the split.

    If a preprocessor is a TrainedPreprocessor, it is first fitted on the train split and then applied to the other splits (typically for normalization).

  • tensors_dtype (Optional[th.dtype]) – Data type of the tensors returned by the iterator, by default None
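A minimal construction sketch follows. TabularIterator, MyTabularExtractor and MyNormalizer are hypothetical placeholder classes standing in for concrete BaseDatasetIterator, extractor and preprocessor implementations; only the DataProcessor import reflects the module documented here.

>>> from udao.data.handler.data_processor import DataProcessor
>>> # TabularIterator, MyTabularExtractor and MyNormalizer are assumed,
>>> # hypothetical implementations of the iterator / extractor / preprocessor
>>> # interfaces; substitute concrete classes from your project.
>>> processor = DataProcessor(
...     iterator_cls=TabularIterator,
...     feature_extractors={
...         "features": MyTabularExtractor(columns=["col1", "col2"]),
...         "objectives": MyTabularExtractor(columns=["latency"]),
...     },
...     feature_preprocessors={"features": [MyNormalizer()]},
... )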

extract_features(data: DataFrame, split: Literal['train', 'val', 'test']) Dict[str, BaseContainer]#

Extract features for the different splits of the data.

Returns:

Dict linking each feature name to a container holding its extracted features.

Return type:

Dict[str, BaseContainer]

Raises:

ValueError – Raised if the data has not been split before features are extracted.
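An illustrative call, reusing the hypothetical processor sketched above and assuming df_train is a pandas DataFrame restricted to the training split:

>>> # df_train: pandas DataFrame holding the training split (assumed).
>>> containers = processor.extract_features(df_train, split="train")
>>> sorted(containers)  # one container per declared feature name
['features', 'objectives']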

inverse_transform(container: TabularContainer, pipeline_name: str) DataFrame#

Inverse transform the data to the original format.

Parameters:
  • container (TabularContainer) – Data to be inverse transformed.

  • pipeline_name (str) – Name of the feature pipeline to be inverse transformed.

Returns:

Inverse transformed data.

Return type:

DataFrame
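A usage sketch, assuming the "features" pipeline of the hypothetical processor above uses preprocessors that support inverse transformation (e.g. a normalizer):

>>> # container: TabularContainer previously produced by the "features" pipeline.
>>> original_df = processor.inverse_transform(container, pipeline_name="features")
>>> type(original_df).__name__
'DataFrame'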

class udao.data.handler.data_processor.FeaturePipeline(extractor: udao.data.extractors.base_extractors.TrainedExtractor | udao.data.extractors.base_extractors.StaticExtractor, preprocessors: List[udao.data.preprocessors.base_preprocessor.TrainedPreprocessor | udao.data.preprocessors.base_preprocessor.StaticPreprocessor] | None = None)#

Bases: object

extractor: TrainedExtractor | StaticExtractor#

The feature extractor applied to the data, either static or trained on the train split.

preprocessors: List[TrainedPreprocessor | StaticPreprocessor] | None = None#

Optional list of preprocessors applied, in order, to the extracted feature.
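A short grouping sketch; MyTabularExtractor and MyNormalizer are the same hypothetical placeholders used above:

>>> from udao.data.handler.data_processor import FeaturePipeline
>>> features_pipeline = FeaturePipeline(
...     extractor=MyTabularExtractor(columns=["col1", "col2"]),
...     preprocessors=[MyNormalizer()],
... )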

udao.data.handler.data_processor.create_data_processor(iterator_cls: Type[IT], *args: str) Callable[[...], DataProcessor[IT]]#

Dynamically creates a function to instantiate a DataProcessor based on the provided iterator class and additional arguments.

Parameters:
  • iterator_cls (Type[BaseDatasetIterator]) – Dataset iterator class type.

  • args (str) – Additional feature names to be included.

Returns:

create_data_processor – A dynamically generated function with arguments derived from the provided iterator class, in addition to other specified arguments.

Return type:

Callable[…, DataProcessor]

Notes

The returned function has the following signature:
>>> def get_processor(
...     **kwargs: FeaturePipeline,
... ) -> DataProcessor:

where kwargs are the feature names and their corresponding feature pipelines (FeaturePipeline instances).
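Putting it together, a hedged usage sketch: the "features" and "objectives" names are illustrative and would have to match the parameters of the (hypothetical) TabularIterator plus the additional names passed to create_data_processor; MyTabularExtractor and MyNormalizer remain hypothetical placeholders.

>>> from udao.data.handler.data_processor import (
...     FeaturePipeline,
...     create_data_processor,
... )
>>> create_processor = create_data_processor(TabularIterator, "objectives")
>>> processor = create_processor(
...     features=FeaturePipeline(
...         extractor=MyTabularExtractor(columns=["col1", "col2"]),
...         preprocessors=[MyNormalizer()],
...     ),
...     objectives=FeaturePipeline(
...         extractor=MyTabularExtractor(columns=["latency"]),
...     ),
... )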