DataProcessor#
This module contains the DataProcessor class, which is responsible for storing and applying the data processing pipeline.
- class udao.data.handler.data_processor.DataProcessor(iterator_cls: Type[IT], feature_extractors: Dict[str, TrainedExtractor | StaticExtractor], feature_preprocessors: Mapping[str, Sequence[TrainedPreprocessor | StaticPreprocessor]] | None = None, tensors_dtype: dtype | None = None)#
Bases:
Generic[IT]
- Parameters:
iterator_cls (Type[BaseDatasetIterator]) – Dataset iterator class type.
feature_extractors (Dict[str, TrainedExtractor | StaticExtractor]) –
Dict that links a feature name to the extractor instance to use for that feature. N.B.: Feature names must match the iterator’s parameters.
If the extractor is a StaticExtractor, the features are extracted independently of the split.
If the extractor is a TrainedExtractor, the extractor is first fitted on the train split and then applied to the other splits.
feature_preprocessors (Optional[Mapping[str, Sequence[TrainedPreprocessor | StaticPreprocessor]]]) –
Dict that links a feature name to a sequence of preprocessors to apply to that feature, in order. This makes it possible to apply a series of preprocessors to different features, e.g. to normalize them. N.B.: Feature names must match the iterator’s parameters. If a preprocessor is a StaticPreprocessor, the features are processed independently of the split.
If a preprocessor is a TrainedPreprocessor, it is first fitted on the train split and then applied to the other splits (typically for normalization).
tensors_dtype (Optional[th.dtype]) – Data type of the tensors returned by the iterator, by default None
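To illustrate the split-dependent behavior described above, here is a hypothetical, minimal sketch of the Static vs. Trained extractor distinction. The class and method bodies are illustrative stand-ins, not udao's implementation:

```python
# Minimal sketch (assumed names, not udao's actual classes):
# a StaticExtractor behaves the same for every split, while a
# TrainedExtractor is fitted on "train" and reused for "val"/"test".

class StaticExtractor:
    """Extracts features the same way regardless of split."""
    def extract_features(self, values, split):
        return [float(v) for v in values]

class TrainedExtractor:
    """Fitted on the train split, then applied unchanged to other splits."""
    def __init__(self):
        self.mean = None
    def extract_features(self, values, split):
        if split == "train":
            self.mean = sum(values) / len(values)  # fit on train only
        elif self.mean is None:
            raise ValueError("fit on the train split first")
        return [v - self.mean for v in values]

extractor = TrainedExtractor()
train_out = extractor.extract_features([1.0, 2.0, 3.0], "train")  # fits mean = 2.0
val_out = extractor.extract_features([4.0], "val")  # reuses the train mean
```

Calling the trained extractor on a non-train split before fitting raises an error, mirroring the fit-then-apply contract stated above.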
- extract_features(data: DataFrame, split: Literal['train', 'val', 'test']) Dict[str, BaseContainer] #
Extract features for the given split of the data.
- Returns:
Dict that links each feature name to its extracted feature container.
- Return type:
Dict[str, BaseContainer]
- Raises:
ValueError – Expects the extractors to be fitted on the train split before features are extracted for the other splits.
- inverse_transform(container: TabularContainer, pipeline_name: str) DataFrame #
Inverse transform the data to the original format.
- Parameters:
container (TabularContainer) – Data to be inverse transformed.
pipeline_name (str) – Name of the feature pipeline to be inverse transformed.
- Returns:
Inverse transformed data.
- Return type:
DataFrame
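The round trip that inverse_transform enables can be sketched with a single trained preprocessor. The names below are illustrative, not udao's actual API:

```python
# Hypothetical sketch: a min-max normalizer fitted on raw values, whose
# inverse_transform maps processed values back to the original scale.

class MinMaxPreprocessor:
    def fit(self, values):
        self.lo, self.hi = min(values), max(values)
        return self

    def preprocess(self, values):
        span = self.hi - self.lo
        return [(v - self.lo) / span for v in values]

    def inverse_transform(self, values):
        span = self.hi - self.lo
        return [v * span + self.lo for v in values]

raw = [10.0, 20.0, 40.0]
prep = MinMaxPreprocessor().fit(raw)
scaled = prep.preprocess(raw)              # values normalized to [0, 1]
restored = prep.inverse_transform(scaled)  # back to the original scale
```

In the real pipeline, DataProcessor applies each pipeline's inverse transforms in reverse order to recover a DataFrame in the original format.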
- class udao.data.handler.data_processor.FeaturePipeline(extractor: udao.data.extractors.base_extractors.TrainedExtractor | udao.data.extractors.base_extractors.StaticExtractor, preprocessors: List[udao.data.preprocessors.base_preprocessor.TrainedPreprocessor | udao.data.preprocessors.base_preprocessor.StaticPreprocessor] | None = None)#
Bases:
object
- extractor: TrainedExtractor | StaticExtractor#
The feature extractor to apply to this feature.
- preprocessors: List[TrainedPreprocessor | StaticPreprocessor] | None = None#
Optional list of preprocessors to apply to this feature, in order; by default None.
- udao.data.handler.data_processor.create_data_processor(iterator_cls: Type[IT], *args: str) Callable[[...], DataProcessor[IT]] #
Creates a function dynamically to instantiate a DataProcessor based on the provided iterator class and additional arguments.
- Parameters:
iterator_cls (Type[BaseDatasetIterator]) – Dataset iterator class type.
args (str) – Additional feature names to be included.
- Returns:
create_data_processor – A dynamically generated function with arguments derived from the provided iterator class, in addition to other specified arguments.
- Return type:
Callable[…, DataProcessor]
Notes
- The returned function has the following signature:
>>> def get_processor(
...     **kwargs: FeaturePipeline,
... ) -> DataProcessor:
where kwargs link the feature names to their corresponding feature pipelines.
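The factory pattern above can be sketched as follows. These are illustrative stand-ins: the real create_data_processor derives its keyword arguments from the iterator class, whereas this sketch takes an explicit list of feature names:

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed, simplified stand-ins for udao's FeaturePipeline and the
# dynamically generated get_processor function.

@dataclass
class FeaturePipeline:
    extractor: object
    preprocessors: Optional[List[object]] = None

def create_data_processor(feature_names):
    def get_processor(**kwargs):
        unknown = set(kwargs) - set(feature_names)
        if unknown:
            raise TypeError(f"unexpected feature(s): {sorted(unknown)}")
        # Split each pipeline into the two dicts DataProcessor expects.
        extractors = {name: fp.extractor for name, fp in kwargs.items()}
        preprocessors = {
            name: fp.preprocessors
            for name, fp in kwargs.items()
            if fp.preprocessors is not None
        }
        return {"feature_extractors": extractors,
                "feature_preprocessors": preprocessors}
    return get_processor

get_processor = create_data_processor(["tabular_features", "objectives"])
processor = get_processor(
    tabular_features=FeaturePipeline(extractor="my_extractor"),
)
```

Keyword names that are not valid feature names are rejected, which is the practical benefit of generating the function per iterator class.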