Extractors#

Base Extractors#

class udao.data.extractors.base_extractors.StaticExtractor#

Bases: ABC, Generic[T]

class udao.data.extractors.base_extractors.TrainedExtractor#

Bases: ABC, Generic[T]
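The two base classes carry no description on this page. Judging only from the subclasses documented below, a static extractor appears to expose `extract_features(df)`, while a trained extractor also receives the dataset split. The sketch below illustrates that difference under those assumptions. The classes `StaticExtractorSketch`, `TrainedExtractorSketch`, `RowCounter` and `MeanCenterer` are hypothetical and are not part of udao, and a list of dictionaries stands in for the pandas DataFrame so that the example runs with the standard library alone.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Generic, List, Optional, TypeVar

T = TypeVar("T")
Rows = List[Dict[str, Any]]  # stand-in for a pandas DataFrame (assumption)


class StaticExtractorSketch(ABC, Generic[T]):
    """Hypothetical contract: the output depends only on the rows passed in."""

    @abstractmethod
    def extract_features(self, df: Rows) -> T: ...


class TrainedExtractorSketch(ABC, Generic[T]):
    """Hypothetical contract: state is learned on the training split."""

    @abstractmethod
    def extract_features(self, df: Rows, split: str) -> T: ...


class RowCounter(StaticExtractorSketch[int]):
    """Toy static extractor: needs no fitting."""

    def extract_features(self, df: Rows) -> int:
        return len(df)


class MeanCenterer(TrainedExtractorSketch[List[float]]):
    """Toy trained extractor: learns a column mean on 'train', reuses it elsewhere."""

    def __init__(self, column: str) -> None:
        self.column = column
        self.mean: Optional[float] = None

    def extract_features(self, df: Rows, split: str) -> List[float]:
        values = [float(row[self.column]) for row in df]
        if split == "train":
            # Only the training split is allowed to update the learned state.
            self.mean = sum(values) / len(values)
        if self.mean is None:
            raise ValueError("call extract_features with split='train' first")
        return [value - self.mean for value in values]


train = [{"latency": 1.0}, {"latency": 3.0}]
test = [{"latency": 5.0}]
centerer = MeanCenterer("latency")
train_features = centerer.extract_features(train, "train")
test_features = centerer.extract_features(test, "test")
```

Keeping the split as an argument of `extract_features` means the same object can be reused for every split without leaking validation or test statistics into the learned state.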

Query Plan Extractors#

class udao.data.extractors.query_structure_extractor.QueryStructureExtractor(positional_encoding_size: int | None = None)#

Bases: TrainedExtractor[QueryStructureContainer]

Extracts the features of the operations in the logical plan, and the tree structure of the logical plan. Keeps track of the different query plans seen so far, and their template id.

Parameters:

positional_encoding_size (int, optional) – Size of the positional encoding to add to the query plan graph. If None, no positional encoding is added.

extract_features(df: DataFrame, split: Literal['train', 'val', 'test']) QueryStructureContainer#

Extract the features of the operations in the logical plan, and the tree structure of the logical plan for each query plan in the dataframe.

Parameters:

  • df (pd.DataFrame) – DataFrame with a column “plan” containing the query plans.

  • split (Literal['train', 'val', 'test']) – Split of the dataset that df belongs to.

Returns:

Container holding the extracted features, with one row per operation in the query plans and one column per operation feature, together with the tree structure of the plans.

Return type:

QueryStructureContainer
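To make the idea of "tree structure" and "template id" concrete, here is a minimal sketch of how plans with the same shape could be mapped to one template. It is not udao's implementation. The indented plan format, `parse_plan` and `TemplateRegistry` are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

# A structure is (operation names, parent -> child edges between node indices).
Structure = Tuple[Tuple[str, ...], Tuple[Tuple[int, int], ...]]


def parse_plan(plan: str) -> Structure:
    """Parse a toy plan: one operation per line, two spaces of indent per tree level."""
    names: List[str] = []
    edges: List[Tuple[int, int]] = []
    stack: List[Tuple[int, int]] = []  # (depth, node index) of the open ancestors
    for line in plan.splitlines():
        if not line.strip():
            continue
        depth = (len(line) - len(line.lstrip(" "))) // 2
        node = len(names)
        # Keep only the operation name; predicates and arguments are dropped.
        names.append(line.split()[0])
        # Close every ancestor that is not strictly above the current depth.
        while stack and stack[-1][0] >= depth:
            stack.pop()
        if stack:
            edges.append((stack[-1][1], node))
        stack.append((depth, node))
    return tuple(names), tuple(edges)


class TemplateRegistry:
    """Assign one id per distinct plan structure seen so far."""

    def __init__(self) -> None:
        self.template_ids: Dict[Structure, int] = {}

    def template_id(self, plan: str) -> int:
        structure = parse_plan(plan)
        # A new structure receives the next free id; a known one keeps its id.
        return self.template_ids.setdefault(structure, len(self.template_ids))


plan_a = "Project a\n  Filter x > 1\n    Scan t"
plan_b = "Project b\n  Filter y < 3\n    Scan t"
plan_c = "Join\n  Scan t\n  Scan u"

registry = TemplateRegistry()
ids = [registry.template_id(p) for p in (plan_a, plan_b, plan_c)]
```

In this sketch `plan_a` and `plan_b` differ only in their predicates, so they share a template, while `plan_c` has a different shape and gets a new id.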

class udao.data.extractors.predicate_embedding_extractor.PredicateEmbeddingExtractor(embedder: BasePredicateEmbedder, op_preprocessing: Callable[[str], str] = <function prepare_operation>, extract_operations: Callable[[DataFrame, Callable], Tuple[Dict[int, List[int]], List[str]]] = <function extract_operations>)#

Bases: TrainedExtractor[TabularContainer]

Class to extract embeddings from a DataFrame of query plans.

Parameters:

embedder (BasePredicateEmbedder) – Embedder used to extract the embeddings, e.g. an instance of Word2VecEmbedder.

extract_features(df: DataFrame, split: str) TabularContainer#

Extract embeddings from a DataFrame of query plans.

Parameters:
  • df (pd.DataFrame) – DataFrame containing the query plans and their ids.

  • split (str) – Split of the dataset, either “train”, “test” or “validation”. The embedder is fitted when split is “train”; for any other split it only transforms.

Returns:

Container wrapping a DataFrame with the embeddings of each operation of the query plans.

Return type:

TabularContainer
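The fit-on-train, transform-otherwise behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not udao's code. `ToyBagOfWordsEmbedder`, `toy_extract_operations` and `extract_embeddings` are hypothetical names, a dictionary of plan id to plan text stands in for the DataFrame, and the reading of the `extract_operations` return value (plan id to operation indices, plus the list of distinct operations) is inferred from its type signature only.

```python
from typing import Callable, Dict, List, Tuple


def toy_extract_operations(
    plans: Dict[int, str], op_preprocessing: Callable[[str], str]
) -> Tuple[Dict[int, List[int]], List[str]]:
    """Return (plan id -> indices of its operations, distinct preprocessed operations)."""
    index_of: Dict[str, int] = {}
    plan_to_ops: Dict[int, List[int]] = {}
    for plan_id, plan in plans.items():
        plan_to_ops[plan_id] = []
        for line in plan.splitlines():
            op = op_preprocessing(line)
            if op:
                # Each distinct operation is stored once and referenced by index.
                plan_to_ops[plan_id].append(index_of.setdefault(op, len(index_of)))
    return plan_to_ops, list(index_of)


class ToyBagOfWordsEmbedder:
    """Hypothetical embedder: word counts over a vocabulary learned at fit time."""

    def __init__(self) -> None:
        self.vocabulary: List[str] = []

    def fit_transform(self, texts: List[str]) -> List[List[float]]:
        self.vocabulary = sorted({word for text in texts for word in text.split()})
        return self.transform(texts)

    def transform(self, texts: List[str]) -> List[List[float]]:
        return [[float(t.split().count(w)) for w in self.vocabulary] for t in texts]


def extract_embeddings(
    plans: Dict[int, str], split: str, embedder: ToyBagOfWordsEmbedder
) -> List[Tuple[int, int, List[float]]]:
    """Return one (plan id, operation position, embedding) row per operation."""
    plan_to_ops, operations = toy_extract_operations(plans, lambda s: s.strip().lower())
    if split == "train":
        embeddings = embedder.fit_transform(operations)  # fit only on training data
    else:
        embeddings = embedder.transform(operations)  # reuse the fitted vocabulary
    return [
        (plan_id, position, embeddings[op_index])
        for plan_id, op_indices in plan_to_ops.items()
        for position, op_index in enumerate(op_indices)
    ]


embedder = ToyBagOfWordsEmbedder()
train_rows = extract_embeddings({0: "Filter a\nScan t", 1: "Scan t"}, "train", embedder)
test_rows = extract_embeddings({5: "Scan u"}, "test", embedder)
```

Embedding each distinct operation once and then expanding by index avoids recomputing the embedding of operations that recur across plans. Words not seen during training, such as `u` above, simply have no slot in the fitted vocabulary.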

Tabular Extractor#

class udao.data.extractors.tabular_extractor.TabularFeatureExtractor(columns: List[str] | Dict[str, VarTypes | None] | None = None)#

Bases: StaticExtractor[TabularContainer]

Extract columns from a DataFrame as a TabularContainer.

Parameters:

columns (Union[List[str], Dict[str, Optional[VarTypes]]], optional) – One of:

  • a list of column names to extract from the DataFrame;

  • a dictionary that maps column names to variable types; if the variable type is None, the column is extracted without casting;

  • None, in which case all columns are extracted.

extract_features(df: DataFrame) TabularContainer#

Extract and cast features from the DataFrame.
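The three accepted forms of `columns` can be illustrated with a small stand-in. This is a sketch rather than the library's code: `select_columns` is a hypothetical helper, rows are plain dictionaries instead of a DataFrame, and ordinary Python callables such as `int` take the place of `VarTypes`, whose members are not listed on this page.

```python
from typing import Any, Callable, Dict, List, Optional, Union

Rows = List[Dict[str, Any]]  # stand-in for a pandas DataFrame (assumption)
Cast = Optional[Callable[[Any], Any]]  # stand-in for Optional[VarTypes] (assumption)


def select_columns(
    rows: Rows, columns: Union[List[str], Dict[str, Cast], None] = None
) -> Rows:
    """Select columns from rows and optionally cast them."""
    if columns is None:
        # None: every column is kept as-is.
        return [dict(row) for row in rows]
    if isinstance(columns, dict):
        casts: Dict[str, Cast] = columns
    else:
        # A plain list of names means "extract without casting".
        casts = {name: None for name in columns}
    return [
        {
            name: row[name] if cast is None else cast(row[name])
            for name, cast in casts.items()
        }
        for row in rows
    ]


rows = [{"k1": "1", "k2": "2.5", "other": "x"}]
by_list = select_columns(rows, ["k1"])
by_dict = select_columns(rows, {"k1": int, "k2": None})
everything = select_columns(rows)
```

Here `by_dict` casts `k1` to an integer and leaves `k2` as the original string, mirroring the rule that a variable type of None means no casting.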