Modeling the objective function#
The objective function is what the optimization module minimizes. Here, we train a machine learning model on the training data prepared in Data processing, and use that model as the objective function.
The case of query plans#
To model a latency function from a query plan graph and features, we provide two components:
Embedders that embed the query plan graph and features into a vector space, based on the BaseGraphEmbedder class.
Regressors that take tabular features concatenated with the query plan embedding and output the predicted latency. The MLP class implements an MLP regressor.
Embedders#
Several embedders are available in the embedders module. They all inherit from the BaseGraphEmbedder class.
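To make the idea of embedding a query plan concrete, here is a minimal sketch in plain PyTorch of an averaging embedder. It does not use the library's classes: `MeanPoolEmbedder`, `node_features` and `plan_ids` are hypothetical names chosen for illustration, and the actual embedders may compute their output quite differently.

```python
import torch
from torch import nn


class MeanPoolEmbedder(nn.Module):
    """Hypothetical embedder: projects operator features, then averages them per plan."""

    def __init__(self, node_dim: int, output_size: int) -> None:
        super().__init__()
        self.proj = nn.Linear(node_dim, output_size)

    def forward(self, node_features: torch.Tensor, plan_ids: torch.Tensor) -> torch.Tensor:
        # node_features: (n_nodes, node_dim)
        # plan_ids: (n_nodes,) index of the plan each node belongs to
        hidden = self.proj(node_features)
        n_plans = int(plan_ids.max().item()) + 1
        # Sum the projected node vectors of each plan, then divide by the node count.
        sums = torch.zeros(n_plans, hidden.shape[1]).index_add_(0, plan_ids, hidden)
        counts = torch.bincount(plan_ids, minlength=n_plans).clamp(min=1).unsqueeze(1)
        return sums / counts  # (n_plans, output_size)


embedder = MeanPoolEmbedder(node_dim=4, output_size=8)
nodes = torch.randn(5, 4)                 # five operators in total
plan_ids = torch.tensor([0, 0, 0, 1, 1])  # three belong to plan 0, two to plan 1
embedding = embedder(nodes, plan_ids)     # one vector of size 8 per plan
```

Each plan ends up as one fixed-size vector regardless of how many operators it contains, which is what lets a downstream regressor consume it alongside tabular features.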
Regressors#
The MLP regressor then takes the concatenation of the query plan embedding and the tabular features as input, and outputs the predicted latency.
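The concatenation step can be sketched in plain PyTorch as follows. This is an illustration under assumptions, not the library's MLP: `ConcatMLPRegressor` is a hypothetical name, and the `n_layers`, `hidden_dim` and `dropout` arguments merely mirror the regressor parameters used in the training example on this page.

```python
import torch
from torch import nn


class ConcatMLPRegressor(nn.Module):
    """Hypothetical regressor: concatenates embedding and tabular features, then applies an MLP."""

    def __init__(self, embedding_dim: int, tabular_dim: int, hidden_dim: int = 32,
                 n_layers: int = 2, dropout: float = 0.1) -> None:
        super().__init__()
        layers = []
        in_dim = embedding_dim + tabular_dim
        for _ in range(n_layers):
            layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU(), nn.Dropout(dropout)]
            in_dim = hidden_dim
        layers.append(nn.Linear(in_dim, 1))  # single output: the predicted latency
        self.net = nn.Sequential(*layers)

    def forward(self, embedding: torch.Tensor, tabular: torch.Tensor) -> torch.Tensor:
        # Join the two feature sets along the feature dimension before the MLP.
        return self.net(torch.cat([embedding, tabular], dim=1))


regressor = ConcatMLPRegressor(embedding_dim=8, tabular_dim=3)
regressor.eval()  # disable dropout for a deterministic forward pass
with torch.no_grad():
    latency = regressor(torch.randn(4, 8), torch.randn(4, 3))  # shape (4, 1)
```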
The UdaoModel#
The UdaoModel class is a wrapper around the embedder and regressor. It is used to train the model and predict the latency of a query plan.
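The wrapper pattern can be sketched as a module that owns both parts and chains them in its forward pass. `EmbedThenRegress` and `ConcatLinear` are hypothetical stand-ins written for this illustration; the forward signature of the real UdaoModel is not shown on this page and may differ.

```python
import torch
from torch import nn


class ConcatLinear(nn.Module):
    """Stand-in regressor: a single linear layer over [embedding, tabular]."""

    def __init__(self, embedding_dim: int, tabular_dim: int) -> None:
        super().__init__()
        self.linear = nn.Linear(embedding_dim + tabular_dim, 1)

    def forward(self, embedding: torch.Tensor, tabular: torch.Tensor) -> torch.Tensor:
        return self.linear(torch.cat([embedding, tabular], dim=1))


class EmbedThenRegress(nn.Module):
    """Hypothetical wrapper: runs the embedder, then feeds its output to the regressor."""

    def __init__(self, embedder: nn.Module, regressor: nn.Module) -> None:
        super().__init__()
        self.embedder = embedder
        self.regressor = regressor

    def forward(self, plan_input: torch.Tensor, tabular: torch.Tensor) -> torch.Tensor:
        embedding = self.embedder(plan_input)
        return self.regressor(embedding, tabular)


# Stand-in embedder: a linear layer over already-pooled plan features.
model = EmbedThenRegress(embedder=nn.Linear(6, 8), regressor=ConcatLinear(8, 3))
model.eval()
with torch.no_grad():
    latency = model(torch.randn(4, 6), torch.randn(4, 3))  # shape (4, 1)
```

Keeping both parts inside one module means a single optimizer updates the embedder and the regressor together, so the embedding is learned end to end for the latency target.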
Putting it all together#
To train a UdaoModel, we use the UdaoModule class, which inherits from PyTorch Lightning’s LightningModule, to set up the training parameters.
We then use PyTorch Lightning’s Trainer to train the model.
Here is a minimal example of how to train a UdaoModel:
# First process the data
model = UdaoModel.from_config(
    embedder_cls=GraphAverager,
    regressor_cls=MLP,
    iterator_shape=split_iterators["train"].shape,
    embedder_params={
        "output_size": 1024,
        "op_groups": ["cbo", "op_enc", "type"],
        "type_embedding_dim": 8,
        "embedding_normalizer": None,
    },
    regressor_params={"n_layers": 2, "hidden_dim": 32, "dropout": 0.1},
)
module = UdaoModule(
    model,
    ["latency"],
    loss=WMAPELoss(),
    metrics=[WeightedMeanAbsolutePercentageError],
)
train_iterator = cast(QueryPlanIterator, split_iterators["train"])
scheduler = UdaoLRScheduler(setup_cosine_annealing_lr, warmup.UntunedLinearWarmup)
trainer = pl.Trainer(
    accelerator=device,
    max_epochs=10,
    callbacks=[scheduler],
)
trainer.fit(
    model=module,
    train_dataloaders=split_iterators["train"].get_dataloader(batch_size),
    val_dataloaders=split_iterators["val"].get_dataloader(batch_size),
)
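The example above trains with a WMAPE loss. As a rough guide to what that quantity measures, here is a minimal sketch of the usual weighted mean absolute percentage error, the sum of absolute errors divided by the sum of absolute targets. The `wmape` function is written for this illustration, and it is an assumption that WMAPELoss follows this standard definition; the library's implementation may differ in details such as reduction or numerical safeguards.

```python
import torch


def wmape(y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
    """Weighted mean absolute percentage error: sum|y - y_hat| / sum|y|."""
    return (y_true - y_pred).abs().sum() / y_true.abs().sum()


y_true = torch.tensor([1.0, 2.0, 5.0])
y_pred = torch.tensor([2.0, 2.0, 3.0])
error = wmape(y_pred, y_true)  # (1 + 0 + 2) / (1 + 2 + 5) = 0.375
```

Unlike a plain mean of per-sample percentage errors, this ratio weights each sample by its target magnitude, so a few very short queries cannot dominate the loss.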