Embedders#
Base Predicate Embedder#
- class udao.data.predicate_embedders.base_predicate_embedder.BasePredicateEmbedder#
Bases:
ABC
Word2Vec Embedder#
- class udao.data.predicate_embedders.word2vec_embedder.Word2VecEmbedder(w2v_params: Word2VecParams | None = None)#
Bases:
BasePredicateEmbedder
A class to embed query plans using Word2Vec. The embedding is computed as the average of the word embeddings of the words in the query plan.
To use it:
- first call fit_transform on a list of training query plans,
- then call transform on a list of query plans.
N.B. To ensure reproducibility, several things need to be done:
- set the seed in the Word2VecParams
- set the PYTHONHASHSEED environment variable
- set the number of workers to 1
- Parameters:
w2v_params (Word2VecParams) – Parameters to pass to the gensim.Word2Vec model (ref: https://radimrehurek.com/gensim/models/word2vec.html)
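A minimal usage sketch of the two calls documented below; the predicate/plan strings are made-up placeholders:

```python
from udao.data.predicate_embedders.word2vec_embedder import (
    Word2VecEmbedder,
    Word2VecParams,
)

# Made-up predicate/plan strings, for illustration only.
training_plans = [
    "filter price gt 100 and category eq books",
    "filter price lt 50 or rating ge 4",
]

embedder = Word2VecEmbedder(Word2VecParams(vec_size=32, seed=42, workers=1))
train_embeddings = embedder.fit_transform(training_plans)  # one row per plan

# New plans are embedded by averaging the word vectors learned during training.
new_embeddings = embedder.transform(["filter rating ge 4"])
```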
- fit_transform(training_texts: Sequence[str], epochs: int | None = None) → ndarray#
Train the Word2Vec model on the training texts and return the embeddings.
- Parameters:
training_texts (Sequence[str]) – list of training texts
epochs (int, optional) – number of epochs for training the model; by default, the value in the Word2VecParams is used.
- Returns:
Embeddings of the training plans
- Return type:
np.ndarray
- transform(texts: Sequence[str]) → ndarray#
Transform a list of query plans into embeddings.
- Parameters:
texts (Sequence[str]) – list of texts to transform
- Returns:
Embeddings of the query plans
- Return type:
np.ndarray
- Raises:
ValueError – If the model has not been trained
- class udao.data.predicate_embedders.word2vec_embedder.Word2VecParams(min_count: int = 1, window: int = 3, vec_size: int = 32, alpha: float = 0.025, sample: float = 0.1, min_alpha: float = 0.0007, workers: int = 1, seed: int = 42, epochs: int = 10)#
Bases:
object
Parameters to pass to the gensim.Word2Vec model (ref: https://radimrehurek.com/gensim/models/word2vec.html)
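A sketch of a reproducible configuration under the constraints listed above (fixed seed, single worker, PYTHONHASHSEED pinned); only the parameters documented in the signature are used:

```python
import os

from udao.data.predicate_embedders.word2vec_embedder import Word2VecParams

# PYTHONHASHSEED only takes effect if set before the Python process starts
# (e.g. `PYTHONHASHSEED=0 python train.py`); shown here for completeness.
os.environ["PYTHONHASHSEED"] = "0"

params = Word2VecParams(
    min_count=1,
    window=3,
    vec_size=32,
    seed=42,     # fixed seed for gensim
    workers=1,   # single worker for deterministic training
    epochs=10,
)
```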
Doc2Vec Embedder#
- class udao.data.predicate_embedders.doc2vec_embedder.Doc2VecEmbedder(d2v_params: Doc2VecParams | None = None)#
Bases:
BasePredicateEmbedder
A class to embed query plans using Doc2Vec.
To use it:
- first call fit on a list of training query plans,
- then call transform on a list of query plans.
N.B. To ensure reproducibility, several things need to be done:
- set the seed in the Doc2VecParams
- set the PYTHONHASHSEED environment variable
- set the number of workers to 1
- Parameters:
d2v_params (Doc2VecParams) – Parameters to pass to the gensim.Doc2Vec model (ref: https://radimrehurek.com/gensim/models/doc2vec.html)
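A minimal sketch mirroring the fit/transform steps documented below; the plan strings are placeholders:

```python
from udao.data.predicate_embedders.doc2vec_embedder import (
    Doc2VecEmbedder,
    Doc2VecParams,
)

# Made-up plan strings, for illustration only.
training_plans = ["filter a gt 1", "join a b on a id eq b id"]

embedder = Doc2VecEmbedder(Doc2VecParams(vec_size=32, seed=42, workers=1))
embedder.fit(training_plans)                  # train the Doc2Vec model
vectors = embedder.transform(training_plans)  # L2-normalized embeddings, one row per plan
```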
- fit(training_texts: Sequence[str], /, epochs: int | None = None) → None#
Train the Doc2Vec model on the training query plans.
- Parameters:
training_texts (Sequence[str]) – list of training query plans
epochs (int, optional) – number of epochs for training the model; by default, the value in the Doc2VecParams is used.
- transform(texts: Sequence[str]) → ndarray#
Transform a list of query plans into normalized embeddings. The number of epochs used to infer each document's embedding is taken from the Doc2VecParams.
- Parameters:
texts (Sequence[str]) – list of query plans to transform
- Returns:
Normalized (L2) embeddings of the query plans
- Return type:
np.ndarray
- Raises:
ValueError – If the model has not been trained
- class udao.data.predicate_embedders.doc2vec_embedder.Doc2VecParams(min_count: int = 1, window: int = 3, vec_size: int = 32, alpha: float = 0.025, sample: float = 0.1, min_alpha: float = 0.0007, workers: int = 1, seed: int = 42, epochs: int = 10)#
Bases:
Word2VecParams
Utilities#
- udao.data.predicate_embedders.utils.brief_clean(s: str) → str#
Remove special characters from a string and convert to lower case
- udao.data.predicate_embedders.utils.build_unique_operations(df: DataFrame) → Tuple[Dict[int, List[int]], List[str]]#
Build a dictionary of unique operations and their IDs
- udao.data.predicate_embedders.utils.extract_operations(plan_df: DataFrame, operation_processing: Callable[[str], str] = <lambda>) → Tuple[Dict[int, List[int]], List[str]]#
Extract unique operations from a DataFrame of query plans and link them to their query plans. Operations can be transformed with the operation_processing function (e.g. prepare_operation) to remove statistical information and hashes.
- Parameters:
plan_df (pd.DataFrame) – DataFrame containing the query plans and their ids.
operation_processing (Callable[[str], str]) – Function to process the operations; by default no processing is applied and the raw operations are used.
- Returns:
- plan_to_ops: Dict[int, List[int]]
Links a query plan ID to a list of operation IDs in the operations list
- operations_list: List[str]
List of unique operations in the dataset
- Return type:
Tuple[Dict[int, List[int]], List[str]]
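A sketch of how extract_operations might be called; the column layout of the input DataFrame ("id" and "plan") is an assumption here, not taken from the documentation above:

```python
import pandas as pd

from udao.data.predicate_embedders.utils import brief_clean, extract_operations

# Hypothetical plan DataFrame; the exact expected columns are an assumption.
plan_df = pd.DataFrame(
    {"id": [1, 2], "plan": ["Filter (a > 1)", "Filter (a > 1)\nProject a"]}
)

plan_to_ops, operations = extract_operations(plan_df, operation_processing=brief_clean)
# plan_to_ops maps each plan id to indices into `operations`;
# `operations` lists each unique (processed) operation string once.
```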
- udao.data.predicate_embedders.utils.prepare_operation(operation: str) → str#
Prepare an operation for embedding by keeping only relevant semantic information
- udao.data.predicate_embedders.utils.remove_hashes(s: str) → str#
Remove hashes from a query plan, e.g. #1234L
- udao.data.predicate_embedders.utils.remove_statistics(s: str) → str#
Remove statistical information from a query plan (in the form of Statistics(…))
- udao.data.predicate_embedders.utils.remove_unknown(s: str) → str#
Remove the unknown symbol from a query plan (in the form of (unknown))
- udao.data.predicate_embedders.utils.replace_symbols(s: str) → str#
Replace symbols with tokens
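A sketch of chaining the string utilities on a made-up plan fragment; the ordering mirrors what prepare_operation presumably does, but both the input string and the exact outputs are illustrative assumptions:

```python
from udao.data.predicate_embedders.utils import (
    brief_clean,
    remove_hashes,
    remove_statistics,
    replace_symbols,
)

# Made-up Spark-style plan fragment, for illustration only.
raw = "Filter (price#1234L > 100), Statistics(sizeInBytes=8.0 B)"

step = remove_statistics(raw)  # drop the Statistics(...) block
step = remove_hashes(step)     # drop #1234L-style hashes
step = replace_symbols(step)   # map symbols (e.g. '>') to word tokens
step = brief_clean(step)       # lower-case and strip remaining special characters
```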