Embedders#

Base Predicate Embedder#

class udao.data.predicate_embedders.base_predicate_embedder.BasePredicateEmbedder#

Bases: ABC

Word2Vec Embedder#

class udao.data.predicate_embedders.word2vec_embedder.Word2VecEmbedder(w2v_params: Word2VecParams | None = None)#

Bases: BasePredicateEmbedder

A class to embed query plans using Word2Vec. The embedding is computed as the average of the word embeddings of the words in the query plan.

To use it:
  • first call fit_transform on a list of training query plans,

  • then call transform on a list of query plans.

N.B. To ensure reproducibility, several things need to be done:
  • set the seed in the Word2VecParams,

  • set the PYTHONHASHSEED environment variable,

  • set the number of workers to 1.

Parameters:

w2v_params (Word2VecParams) – Parameters to pass to the gensim.Word2Vec model (ref: https://radimrehurek.com/gensim/models/word2vec.html)
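
A minimal usage sketch (the plan strings are invented placeholders; only the classes and methods documented on this page are assumed, and the embedding width follows from vec_size):

    from udao.data.predicate_embedders.word2vec_embedder import (
        Word2VecEmbedder,
        Word2VecParams,
    )

    # Placeholder query-plan texts; real inputs come from your plan extraction step.
    train_plans = ["filter (id > 10)", "project (name, age)"]
    new_plans = ["filter (age < 30)"]

    embedder = Word2VecEmbedder(Word2VecParams(vec_size=32, seed=42, workers=1))
    train_emb = embedder.fit_transform(train_plans)  # one 32-d vector per training plan
    new_emb = embedder.transform(new_plans)          # reuses the trained model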

fit_transform(training_texts: Sequence[str], epochs: int | None = None) → ndarray#

Train the Word2Vec model on the training texts and return the embeddings.

Parameters:
  • training_texts (Sequence[str]) – list of training texts

  • epochs (int, optional) – number of epochs for training the model; by default, the value in the Word2VecParams is used.

Returns:

Embeddings of the training plans

Return type:

np.ndarray

transform(texts: Sequence[str]) → ndarray#

Transform a list of query plans into embeddings.

Parameters:

texts (Sequence[str]) – list of texts to transform

Returns:

Embeddings of the query plans

Return type:

np.ndarray

Raises:

ValueError – If the model has not been trained

class udao.data.predicate_embedders.word2vec_embedder.Word2VecParams(min_count: int = 1, window: int = 3, vec_size: int = 32, alpha: float = 0.025, sample: float = 0.1, min_alpha: float = 0.0007, workers: int = 1, seed: int = 42, epochs: int = 10)#

Bases: object

Parameters to pass to the gensim.Word2Vec model (ref: https://radimrehurek.com/gensim/models/word2vec.html)
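
A sketch of a reproducible configuration, following the N.B. above. Note that PYTHONHASHSEED must be set in the environment before the Python process starts; setting it from inside a running interpreter does not retroactively change string hashing:

    import os

    from udao.data.predicate_embedders.word2vec_embedder import (
        Word2VecEmbedder,
        Word2VecParams,
    )

    # Ideally exported in the shell before launching Python (e.g. PYTHONHASHSEED=0);
    # shown here only to make the requirement explicit.
    os.environ.setdefault("PYTHONHASHSEED", "0")

    params = Word2VecParams(seed=42, workers=1)  # fixed seed, single worker
    embedder = Word2VecEmbedder(params)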

Doc2Vec Embedder#

class udao.data.predicate_embedders.doc2vec_embedder.Doc2VecEmbedder(d2v_params: Doc2VecParams | None = None)#

Bases: BasePredicateEmbedder

A class to embed query plans using Doc2Vec.

To use it:
  • first call fit on a list of training query plans,

  • then call transform on a list of query plans.

N.B. To ensure reproducibility, several things need to be done:
  • set the seed in the Doc2VecParams,

  • set the PYTHONHASHSEED environment variable,

  • set the number of workers to 1.

Parameters:

d2v_params (Doc2VecParams) – Parameters to pass to the gensim.Doc2Vec model (ref: https://radimrehurek.com/gensim/models/doc2vec.html)
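
A minimal usage sketch (the plan strings are invented placeholders):

    from udao.data.predicate_embedders.doc2vec_embedder import (
        Doc2VecEmbedder,
        Doc2VecParams,
    )

    train_plans = ["filter (id > 10)", "project (name, age)"]  # placeholder plans

    embedder = Doc2VecEmbedder(Doc2VecParams(vec_size=32, seed=42, workers=1))
    embedder.fit(train_plans)                     # trains the underlying gensim Doc2Vec model
    embeddings = embedder.transform(train_plans)  # L2-normalized vectors, one per plan
    # Calling transform before fit raises ValueError.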

fit(training_texts: Sequence[str], /, epochs: int | None = None) → None#

Train the Doc2Vec model on the training texts.

Parameters:
  • training_texts (Sequence[str]) – list of training texts

  • epochs (int, optional) – number of epochs for training the model; by default, the value in the Doc2VecParams is used.

Return type:

None

transform(texts: Sequence[str]) → ndarray#

Transform a list of query plans into normalized embeddings.

Parameters:

texts (Sequence[str]) – list of query plans to transform

Returns:

Normalized (L2) embeddings of the query plans

Return type:

np.ndarray

Raises:

ValueError – If the model has not been trained

class udao.data.predicate_embedders.doc2vec_embedder.Doc2VecParams(min_count: int = 1, window: int = 3, vec_size: int = 32, alpha: float = 0.025, sample: float = 0.1, min_alpha: float = 0.0007, workers: int = 1, seed: int = 42, epochs: int = 10)#

Bases: Word2VecParams

Utilities#

udao.data.predicate_embedders.utils.brief_clean(s: str) → str#

Remove special characters from a string and convert to lower case

udao.data.predicate_embedders.utils.build_unique_operations(df: DataFrame) → Tuple[Dict[int, List[int]], List[str]]#

Build a dictionary of unique operations and their IDs

udao.data.predicate_embedders.utils.extract_operations(plan_df: DataFrame, operation_processing: Callable[[str], str] = <lambda>) → Tuple[Dict[int, List[int]], List[str]]#

Extract unique operations from a DataFrame of query plans and link them to the query plans. Operations are transformed by operation_processing (for instance prepare_operation, which removes statistical information and hashes).

Parameters:
  • plan_df (pd.DataFrame) – DataFrame containing the query plans and their ids.

  • operation_processing (Callable[[str], str]) – Function to process the operations; by default no processing is applied and the raw operations are used.

Returns:

plan_to_ops: Dict[int, List[int]]

Links a query plan ID to a list of operation IDs in the operations list

operations_list: List[str]

List of unique operations in the dataset

Return type:

Tuple[Dict[int, List[int]], List[str]]
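
A sketch of how extract_operations might be called; the DataFrame column names ("id", "plan") are assumptions, since this page does not spell out the expected schema:

    import pandas as pd

    from udao.data.predicate_embedders.utils import extract_operations, prepare_operation

    # Hypothetical plan DataFrame; the actual column names expected by
    # extract_operations are an assumption here.
    plan_df = pd.DataFrame(
        {
            "id": [0, 1],
            "plan": [
                "Filter (a#12L > 10)\nScan parquet t1",
                "Project [name#3]\nScan parquet t2",
            ],
        }
    )

    plan_to_ops, operations_list = extract_operations(plan_df, prepare_operation)
    # plan_to_ops maps each plan id to indices into operations_list;
    # operations_list holds every unique (cleaned) operation string once.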

udao.data.predicate_embedders.utils.prepare_operation(operation: str) → str#

Prepare an operation for embedding by keeping only relevant semantic information

udao.data.predicate_embedders.utils.remove_hashes(s: str) → str#

Remove hashes from a query plan, e.g. #1234L

udao.data.predicate_embedders.utils.remove_statistics(s: str) → str#

Remove statistical information from a query plan (in the form of Statistics(...))

udao.data.predicate_embedders.utils.remove_unknown(s: str) → str#

Remove the unknown symbol from a query plan (in the form of (unknown))

udao.data.predicate_embedders.utils.replace_symbols(s: str) → str#

Replace symbols with tokens
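
A sketch of the individual cleaning helpers applied to a made-up Spark-style operation line; the exact outputs depend on the library's internal regexes, so none are asserted here:

    from udao.data.predicate_embedders.utils import (
        brief_clean,
        prepare_operation,
        remove_hashes,
        remove_statistics,
    )

    # Invented example operation string.
    op = "Filter (isnotnull(id#1234L) AND (id#1234L > 10)), Statistics(sizeInBytes=1.0 KiB)"

    cleaned = prepare_operation(op)   # keeps only the relevant semantic information
    no_hash = remove_hashes(op)       # drops column hashes such as #1234L
    no_stats = remove_statistics(op)  # drops the Statistics(...) annotation
    tokens = brief_clean(op)          # lower-cased, special characters removed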