Tutorial 2: Quantum Evolution Kernel-Based Machine Learning Tutorial
Introduction
Machine Learning (ML) is a field of artificial intelligence that focuses on building systems capable of learning from data to make predictions or decisions. A common ML task is classification, where we try to assign labels to data points. In this tutorial, we will use the Quantum Evolution Kernel, a custom kernel that computes the similarity between graph-structured data, to model a classification problem using a Support Vector Machine (SVM) on a graph dataset.
Tutorial Objectives
In the previous part of this tutorial, we loaded a dataset of molecules, labelled with their toxicity, and used a quantum device to extract machine-learning features. Now, our high-level objective is to use these features to predict the toxicity of molecules.
This tutorial will guide you through:
- Loading and preprocessing a dataset for kernel-based machine learning.
- Introducing the Quantum Evolution Kernel (QEK), including its fit and transform methods.
- Training and evaluating a Support Vector Machine (SVM) using the Quantum Evolution Kernel.
- Integrating the kernel and SVM into a scikit-learn Pipeline for streamlined workflows.
- Performing hyperparameter optimization using GridSearchCV to improve model performance.
A companion notebook will guide you through using a Quantum Device to extract machine-learning features from graphs.
In this tutorial, we use the results of the Quantum Device execution on a classical device (i.e. your computer) to create a Quantum Evolution Kernel. Since our algorithm combines steps that are executed on a Quantum Device and steps that are executed on a classical device, we call this a hybrid algorithm.
This tutorial uses scikit-learn for common machine learning tasks, but the concepts would work with any other machine learning framework as well.
1. Loading the Data
We begin by loading a dataset that contains graph-structured data. Each data point is represented as a processed object with features (in this example, the quantum distribution excitation for a molecule) and a target value (in this example, the toxicity of the molecule). We will split the data into training and testing sets for model evaluation.
import qek.data.processed_data as qek_dataset
# Load the dataset we processed in the quantum extraction tutorial
processed_dataset = qek_dataset.load_dataset(file_path="ptcfm_processed_dataset.json")
print(f"Size of the quantum compatible dataset = {len(processed_dataset)}")
Size of the quantum compatible dataset = 279
from sklearn.model_selection import train_test_split
# Prepare features (X) and targets (y)
X = [data for data in processed_dataset]  # Quantum distribution excitation.
y = [data.target for data in processed_dataset]  # Toxicity.

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

print(f'Size of the training quantum compatible dataset = {len(X_train)}')
print(f'Size of the testing quantum compatible dataset = {len(X_test)}')
Size of the training quantum compatible dataset = 223
Size of the testing quantum compatible dataset = 56
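To make the idea of a stratified split concrete, here is a minimal sketch in plain Python. It is not scikit-learn's implementation: the function `stratified_split` and the toy labels are hypothetical, and the only goal is to show that each class keeps roughly the same share in the training and testing sets. The last line checks that a 20% test share of 279 samples, rounded up, gives the 56 test samples reported above.

```python
import math
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction, seed):
    """Return (train_indices, test_indices) with similar class proportions.

    Hypothetical helper for illustration only, not scikit-learn's algorithm.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(test_fraction * len(indices))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy labels (an assumption): 60 samples of class 0 and 40 of class 1.
toy_labels = [0] * 60 + [1] * 40
train_idx, test_idx = stratified_split(toy_labels, test_fraction=0.2, seed=42)

print(len(train_idx), len(test_idx))
print(Counter(toy_labels[i] for i in test_idx))

# Size check for the dataset of this tutorial: 20% of 279, rounded up.
n_test_tutorial = math.ceil(0.2 * 279)
print(n_test_tutorial, 279 - n_test_tutorial)
```

With `stratify=y`, `train_test_split` plays the same role for the toxicity labels, so both classes appear in the test set in about the same ratio as in the full dataset.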
dataset_example = X[64]
# The features we have extracted for this sample
dataset_example.draw_excitation()
2. Quantum Evolution Kernel
The Quantum Evolution Kernel computes a similarity score between two graphs based on quantum-inspired measures. It is designed to work with graph-structured data.
- A kernel is a function that computes a similarity measure between two data points. In ML, kernels are often used to implicitly map data into a higher-dimensional space without computing the transformation explicitly. This enables algorithms like Support Vector Machines (SVM) to solve problems that are not linearly separable in the original feature space.
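The implicit mapping mentioned above can be checked on a small case. The sketch below uses a degree-2 polynomial kernel on 2D points, which is unrelated to QEK and chosen only because its feature map is easy to write down. The names `poly_kernel` and `feature_map` are illustrative. The kernel value computed directly in 2D matches the dot product computed after mapping both points explicitly into 3D.

```python
import math


def poly_kernel(x, y):
    """Degree-2 polynomial kernel on 2D points: k(x, y) = (x . y)^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def feature_map(x):
    """Explicit 3D feature map whose dot product reproduces poly_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    """Plain dot product of two equally sized tuples."""
    return sum(a * b for a, b in zip(u, v))


x, y = (1.0, 2.0), (3.0, -1.0)

implicit = poly_kernel(x, y)                     # computed in the original 2D space
explicit = dot(feature_map(x), feature_map(y))   # computed in the 3D feature space

print(implicit, explicit)
```

An SVM only ever needs the values `k(x, y)`, so it can work in the larger space without building `feature_map(x)` at all.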
Introducing the Quantum Evolution Kernel
For a graph $G$, let's call the excitation distribution $P_G$.
We may now construct the Quantum Evolution Kernel, or QEK. Mathematically, QEK is defined as: $$ K(G, G') = \exp \left( -\mu JS(P_G, P_{G'}) \right) $$
where $\mu$ is a hyperparameter of our kernel and $JS$ is the Jensen-Shannon divergence. For processing raw graph data, the IntegratedQEK is recommended; however, it requires an extractor to be provided as an argument.
import qek.data.extractors as qek_extractors
import qek.data.graphs as qek_graphs
from qek.kernel import IntegratedQEK as QEK
from pathlib import Path

# First, we need to instantiate the compiler for the type of data.
compiler = qek_graphs.PTCFMCompiler()

# Then, instantiate the Qutip Extractor.
extractor = qek_extractors.QutipExtractor(
    # Once computing is complete, data will be saved in this file.
    path=Path("saved_data.json"),
    compiler=compiler
)

# Finally, initialize the Integrated Quantum Evolution Kernel with both the extractor and the parameter mu.
kernel = QEK(mu=0.5, extractor=extractor)
In this case, since the data has already been pre-processed, we will instead use the FastQEK kernel, which speeds up the computation and allows for computing similarity matrices.
from qek.kernel import FastQEK as QEK
# Initialize the Quantum Evolution Kernel with a parameter mu
kernel = QEK(mu=0.5)
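The definition $K(G, G') = \exp(-\mu JS(P_G, P_{G'}))$ can be sketched in a few lines of plain Python. This is an illustration of the formula, not the code used inside the qek library. It assumes that each excitation distribution is given as a list of probabilities over the same outcomes and that the Jensen-Shannon divergence uses the natural logarithm; the function names and the toy distributions are made up for the example.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) with the natural logarithm."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


def qek_similarity(p, q, mu):
    """Kernel value exp(-mu * JS(p, q)) for two distributions."""
    return math.exp(-mu * js_divergence(p, q))


# Toy excitation distributions (assumptions, not taken from the dataset).
p_a = [0.1, 0.6, 0.3]
p_b = [0.3, 0.4, 0.3]

print(qek_similarity(p_a, p_a, mu=0.5))  # identical distributions
print(qek_similarity(p_a, p_b, mu=0.5))  # different distributions
```

Because the divergence is zero only for identical distributions and is symmetric, the kernel value is 1 for a graph compared with itself and decreases symmetrically as the two distributions move apart.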
Fit and Transform graph datasets using QEK
To use the kernel in machine learning algorithms, we can fit the kernel on a training dataset, and use it to transform training/testing datasets. The result of such a transformation is a kernel matrix, which represents the similarities between graphs.
# Fit
kernel.fit(X_train)

# Transform
K_train = kernel.transform(X_train)
K_test = kernel.transform(X_test)

print(f"Training Kernel Matrix Shape: {K_train.shape}")
print(f"Testing Kernel Matrix Shape: {K_test.shape}")
Training Kernel Matrix Shape: (223, 223)
Testing Kernel Matrix Shape: (56, 223)
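The shapes above follow from the fit/transform convention: fitting remembers the training samples, and transforming compares every incoming sample against each remembered one. The sketch below reproduces that convention with a hypothetical `ToyKernel` class on plain numbers. It is not the qek API, and its similarity function is an arbitrary stand-in.

```python
import math


class ToyKernel:
    """Hypothetical stand-in for a fit/transform kernel (not the qek API)."""

    def __init__(self, mu):
        self.mu = mu
        self.train_samples_ = None

    def similarity(self, a, b):
        # Arbitrary similarity for illustration: 1 when a == b, decaying with distance.
        return math.exp(-self.mu * abs(a - b))

    def fit(self, samples):
        # Fitting only stores the reference (training) samples.
        self.train_samples_ = list(samples)
        return self

    def transform(self, samples):
        # One row per incoming sample, one column per training sample.
        return [
            [self.similarity(sample, reference) for reference in self.train_samples_]
            for sample in samples
        ]


train = [0.0, 1.0, 2.5, 4.0]
test = [0.5, 3.0]

toy_kernel = ToyKernel(mu=0.5).fit(train)
k_train = toy_kernel.transform(train)
k_test = toy_kernel.transform(test)

print(len(k_train), len(k_train[0]))  # rows and columns of the training matrix
print(len(k_test), len(k_test[0]))    # rows and columns of the testing matrix
```

In the same way, the (56, 223) testing matrix has one row per test graph and one column per training graph, which is the layout an SVM with a precomputed kernel expects at prediction time.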
The kernel matrix for the training dataset looks like:
K_train
array([[1.        , 0.93051453, 0.84982929, ..., 0.87518824, 0.84227677, 0.78327268],
       [0.93051453, 1.        , 0.95171697, ..., 0.96514616, 0.93813683, 0.87256124],
       [0.84982929, 0.95171697, 1.        , ..., 0.94978791, 0.94922164, 0.95506836],
       ...,
       [0.87518824, 0.96514616, 0.94978791, ..., 1.        , 0.99034745, 0.87734572],
       [0.84227677, 0.93813683, 0.94922164, ..., 0.99034745, 1.        , 0.87424501],
       [0.78327268, 0.87256124, 0.95506836, ..., 0.87734572, 0.87424501, 1.        ]])
Parameter $\mu$ controls the rate of exponential decay. A large value of $\mu$ makes QEK very sensitive to small variations of the Jensen-Shannon divergence. Conversely, when $\mu$ is small, the kernel is less affected by small variations of $JS$.
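A short numerical table makes this sensitivity visible. The sketch below evaluates $\exp(-\mu \cdot JS)$ for a few illustrative divergence values and a few values of $\mu$; both lists are arbitrary choices for the example, not values measured on the dataset.

```python
import math


def kernel_value(js, mu):
    """Kernel value exp(-mu * js) for a given divergence js."""
    return math.exp(-mu * js)


# Illustrative divergence values and decay rates (assumptions).
js_values = [0.0, 0.05, 0.2, 0.5]
mu_values = [0.25, 0.5, 2.0, 10.0]

print("mu     " + "  ".join(f"JS={js:<4}" for js in js_values))
for mu in mu_values:
    row = "  ".join(f"{kernel_value(js, mu):.3f}  " for js in js_values)
    print(f"{mu:<6} {row}")
```

For small $\mu$ every row stays close to 1, so all graphs look alike to the kernel; for large $\mu$ the values drop quickly, so even a small divergence separates two graphs.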
QEK compares two processed graphs by their distribution of excitations. If a and b are two graphs, a value of kernel(a, b) close to 1 indicates a high similarity between graphs a and b, while a value close to 0 means a low similarity.
Let's try that:
graph_1 = processed_dataset[2]
graph_2 = processed_dataset[0]

display(f"Comparing a graph with itself: {kernel.similarity(graph_1, graph_1)}")
display(f"Comparing two dissimilar graphs: {kernel.similarity(graph_1, graph_2)}")
'Comparing a graph with itself: 1.0'
'Comparing two dissimilar graphs: 0.7835581001861575'
We can further validate this by checking that the two graphs are registered differently and have different excitations.
graph_1.draw_register()
graph_2.draw_register()

graph_1.draw_excitation()
graph_2.draw_excitation()
3. Training a simple model
We will use an SVM (Support Vector Machine) to learn how to predict the toxicity of a molecule based on the precomputed kernel.
from sklearn.svm import SVC
# Define an SVC model with FastQEK
qek_kernel = QEK(mu=0.5)
model = SVC(kernel=qek_kernel, random_state=42)
Training
This precomputed kernel will allow us to evaluate the QEK algorithm.
model.fit(X_train, y_train)
SVC(kernel=<qek.kernel.kernel.FastQEK object at 0x7efd87a9a5c0>, random_state=42)
Evaluation
We are using the following metrics:
The F1 score is a way to measure how well a model performs, especially when the data is uneven (e.g., more examples of one category than another). It combines two important aspects: how precise the model is (precision) and how well it captures all the actual positives (recall). It provides a single number that balances these two aspects, making it useful for evaluating performance in real-world scenarios where some categories are much more common than others.
Balanced accuracy is a method to evaluate a model's performance fairly, even when the data is imbalanced (e.g., one category is much more frequent than others). Instead of just looking at overall accuracy, which can be misleading in such cases, balanced accuracy considers how well the model performs for each category separately and then averages these performances. This ensures that the evaluation is not skewed by the more common categories, giving a more honest picture of the model's effectiveness across all categories.
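Both metrics can be computed by hand on a small example. The sketch below does so in plain Python for a toy set of ten labels and predictions (made up for the illustration) and follows the usual definitions: per-class F1 from true positives, false positives and false negatives, a support-weighted average of the per-class F1 scores, and the mean of the per-class recalls for balanced accuracy. The helper names are illustrative.

```python
def per_class_counts(y_true, y_pred, label):
    """Return (true positives, false positives, false negatives) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    return tp, fp, fn


def f1_for_label(y_true, y_pred, label):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    tp, fp, fn = per_class_counts(y_true, y_pred, label)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def weighted_f1(y_true, y_pred):
    """Average of per-class F1 scores, weighted by the number of true samples."""
    total = len(y_true)
    return sum(
        y_true.count(label) / total * f1_for_label(y_true, y_pred, label)
        for label in sorted(set(y_true))
    )


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls."""
    recalls = []
    for label in sorted(set(y_true)):
        tp, _, fn = per_class_counts(y_true, y_pred, label)
        recalls.append(tp / (tp + fn))
    return sum(recalls) / len(recalls)


# Toy, imbalanced example (an assumption): six samples of class 0, four of class 1.
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1]

plain_accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
print(f"Accuracy:          {plain_accuracy:.3f}")
print(f"Weighted F1:       {weighted_f1(y_true_toy, y_pred_toy):.3f}")
print(f"Balanced accuracy: {balanced_accuracy(y_true_toy, y_pred_toy):.3f}")
```

On this toy data the plain accuracy is higher than the balanced accuracy, because the classifier does well on the more frequent class and poorly on the rarer one.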
import numpy as np
from sklearn.metrics import classification_report, f1_score, balanced_accuracy_score

def evaluate_predictions(y_test, y_pred):
    """
    Evaluates the model predictions using multiple metrics: F1 score, and balanced accuracy.
    Prints the classification report and other evaluation results.

    Args:
        y_test (array-like): The true labels.
        y_pred (array-like): The predicted labels.
    """
    # Calculate F1 score and balanced accuracy
    max_f1_score = f1_score(y_test, y_pred, average='weighted')
    # Spread of the per-class F1 scores
    final_f1_std = np.std(f1_score(y_test, y_pred, average=None))
    max_bal_acc = balanced_accuracy_score(y_test, y_pred)
    # balanced_accuracy_score returns a single number, so this standard deviation is always 0
    std_bal_acc = np.std(balanced_accuracy_score(y_test, y_pred))

    # Print the evaluation results
    print("\nEvaluation Results:")
    print(f"F1 Score: {max_f1_score:.2f}")
    print(f"Standard Deviation of F1 Score: {final_f1_std:.2f}")
    print(f"Balanced Accuracy: {max_bal_acc:.2f}")
    print(f"Standard Deviation of Balanced Accuracy: {std_bal_acc:.2f}")

    # Print classification report
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
We can use the trained model to make predictions on the test dataset:
# Making predictions using the trained model
y_pred = model.predict(X_test)
evaluate_predictions(y_test, y_pred)
Evaluation Results:
F1 Score: 0.58
Standard Deviation of F1 Score: 0.16
Balanced Accuracy: 0.56
Standard Deviation of Balanced Accuracy: 0.00
Classification Report:
              precision    recall  f1-score   support

           0       0.63      0.82      0.71        33
           1       0.54      0.30      0.39        23

    accuracy                           0.61        56
   macro avg       0.58      0.56      0.55        56
weighted avg       0.59      0.61      0.58        56
4. Creating a Pipeline with SVM
Pipelines in scikit-learn allow us to streamline the workflow by chaining preprocessing steps and models. In this step, we integrate the Quantum Evolution Kernel with an SVM classifier in a pipeline for end-to-end model training and prediction.
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline

# Define the pipeline
pipeline = Pipeline([
    ('QEK', QEK(mu=0.5)),
    ('svm', SVC(kernel='precomputed', random_state=42))
])

# Train the pipeline
pipeline.fit(X_train, y_train)
Pipeline(steps=[('QEK', <qek.kernel.kernel.FastQEK object at 0x7efd879cc280>),
                ('svm', SVC(kernel='precomputed', random_state=42))])
# Making predictions using the trained model
y_pred = pipeline.predict(X_test)
evaluate_predictions(y_test, y_pred)
Evaluation Results:
F1 Score: 0.58
Standard Deviation of F1 Score: 0.16
Balanced Accuracy: 0.56
Standard Deviation of Balanced Accuracy: 0.00
Classification Report:
              precision    recall  f1-score   support

           0       0.63      0.82      0.71        33
           1       0.54      0.30      0.39        23

    accuracy                           0.61        56
   macro avg       0.58      0.56      0.55        56
weighted avg       0.59      0.61      0.58        56
5. GridSearchCV for Hyperparameter Optimization
Hyperparameter tuning is a critical step in improving machine learning model performance. GridSearchCV systematically searches through a predefined set of hyperparameters to find the combination that yields the best results. Here, we optimize:
- mu: A parameter of the Quantum Evolution Kernel.
- C: The regularization parameter of the SVM.
We will use the pipeline defined in the previous section inside the grid search. Additionally, we employ multiple scoring metrics such as F1 Score and Balanced Accuracy to evaluate the performance of the models comprehensively.
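Before running such a search, it helps to estimate how much work it involves. The sketch below enumerates a grid in plain Python, assuming the setting used in this section: five values of C spaced logarithmically between the exponents 0.001 and 1, three values of mu, and 5-fold cross-validation repeated 10 times. The `logspace` helper is a hand-written stand-in that mirrors what `np.logspace` computes; the variable names are illustrative.

```python
from itertools import product


def logspace(start, stop, num):
    """Return num values 10**e, with exponents e evenly spaced from start to stop."""
    step = (stop - start) / (num - 1)
    return [10 ** (start + i * step) for i in range(num)]


# The arguments are exponents, so C ranges from about 1 to 10 (not from 0.001 to 1).
c_values = logspace(0.001, 1, 5)
mu_values = [0.25, 0.5, 0.75]

# Every combination of C and mu is one candidate model.
candidates = [
    {"svm__C": c, "QEK__mu": mu} for c, mu in product(c_values, mu_values)
]

# Each candidate is trained once per fold and per repetition.
n_splits, n_repeats = 5, 10
n_fits = len(candidates) * n_splits * n_repeats

print([round(c, 4) for c in c_values])
print(f"{len(candidates)} candidates, {n_fits} cross-validation fits")
```

When refit is enabled, GridSearchCV adds one final fit of the best candidate on the whole training set, so the cost grows quickly with every value added to the grid.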
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.metrics import make_scorer
import matplotlib.pyplot as plt

# Define scoring metrics
scoring = {
    "balanced_accuracy": make_scorer(balanced_accuracy_score),
    "f1_score": make_scorer(f1_score, average="weighted")
}

# Define cross-validation strategy
skf = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)

# Define parameter grid
param_grid = {
    'svm__C': np.logspace(0.001, 1, 5),
    'QEK__mu': [0.25, 0.5, 0.75],
}

grid_search = GridSearchCV(pipeline, param_grid, scoring=scoring, cv=skf, refit="f1_score", n_jobs=8, return_train_score=True)

# Perform grid search
grid_search.fit(X_train, y_train)
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=10, n_splits=5, random_state=42),
             estimator=Pipeline(steps=[('QEK', <qek.kernel.kernel.FastQEK object at 0x7efd879cc280>),
                                       ('svm', SVC(kernel='precomputed', random_state=42))]),
             n_jobs=8,
             param_grid={'QEK__mu': [0.25, 0.5, 0.75],
                         'svm__C': array([ 1.00230524,  1.78135304,  3.16592046,  5.62665128, 10.        ])},
             refit='f1_score', return_train_score=True,
             scoring={'balanced_accuracy': make_scorer(balanced_accuracy_score, response_method='predict'),
                      'f1_score': make_scorer(f1_score, response_method='predict', average=weighted)})
Pipeline(steps=[('QEK', <qek.kernel.kernel.FastQEK object at 0x7efd8b857fa0>), ('svm', SVC(C=10.0, kernel='precomputed', random_state=42))])
Evaluation
We can access the best trained model from the grid search using .best_estimator_.
# Evaluate the best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
evaluate_predictions(y_test, y_pred)
Evaluation Results:
F1 Score: 0.59
Standard Deviation of F1 Score: 0.12
Balanced Accuracy: 0.57
Standard Deviation of Balanced Accuracy: 0.00
Classification Report:
              precision    recall  f1-score   support

           0       0.64      0.76      0.69        33
           1       0.53      0.39      0.45        23

    accuracy                           0.61        56
   macro avg       0.59      0.57      0.57        56
weighted avg       0.60      0.61      0.59        56
We can also access the results of the grid search using .cv_results_. This allows us to plot learning curves, as well as see the impact of different hyperparameters on model performance.
# Extract results
cv_results = grid_search.cv_results_

# Simple plot function to plot results
def plot_grid_search_results(grid_search, param_name="svm__C"):
    cv_results = grid_search.cv_results_
    param_values = [p[param_name] for p in cv_results['params']]
    unique_param_values = sorted(set(param_values))

    mean_f1_scores = []
    std_f1_scores = []

    # Average the scores over all grid points that share the same parameter value
    for param_value in unique_param_values:
        indices = [i for i, p in enumerate(cv_results['params']) if p[param_name] == param_value]
        mean_f1_scores.append(np.mean([cv_results['mean_test_f1_score'][i] for i in indices]))
        std_f1_scores.append(np.mean([cv_results['std_test_f1_score'][i] for i in indices]))

    plt.figure(figsize=(6, 5))
    plt.plot(unique_param_values, mean_f1_scores, label="Mean F1 Score")
    plt.fill_between(unique_param_values,
                     np.array(mean_f1_scores) - np.array(std_f1_scores),
                     np.array(mean_f1_scores) + np.array(std_f1_scores),
                     alpha=0.2)

    plt.title(f"Grid Search Results for {param_name}")
    plt.xlabel(f"{param_name}")
    plt.ylabel("Mean F1 Score")
    plt.legend()
    plt.grid(True)
    plt.show()
plot_grid_search_results(grid_search, 'svm__C')
plot_grid_search_results(grid_search, 'QEK__mu')