Pasqal Documentation

Tutorial 2: Quantum Evolution Kernel-Based Machine Learning Tutorial

Machine Learning (ML) is a field of artificial intelligence that focuses on building systems capable of learning from data to make predictions or decisions. A common ML task is classification, where we try to assign labels to data points. In this tutorial, we will use the Quantum Evolution Kernel, a custom kernel that computes the similarity between graph-structured data points, to solve a classification problem with a Support Vector Machine (SVM) on a graph dataset.

In the previous part of this tutorial, we have loaded a dataset of molecules, labelled with their toxicity, and used a quantum device to extract machine-learning features. Now, our high-level objective is to use these features to predict the toxicity of molecules.

This tutorial will guide you through:

  1. Loading and preprocessing a dataset for kernel-based machine learning.
  2. Introducing the Quantum Evolution Kernel (QEK), including its fit and transform methods.
  3. Training and evaluating a Support Vector Machine (SVM) using the Quantum Evolution Kernel.
  4. Integrating the kernel and SVM into a scikit-learn Pipeline for streamlined workflows.
  5. Performing hyperparameter optimization using GridSearchCV to improve model performance.

A companion notebook will guide you through using a Quantum Device to extract machine-learning features from graphs.

In this tutorial, we use the results of the Quantum Device execution on a classical device (i.e. your computer) to create a Quantum Evolution Kernel. Since our algorithm combines steps that are executed on a Quantum Device and steps that are executed on a classical device, we call this a hybrid algorithm.

This tutorial uses scikit-learn for common machine learning tasks, but the concepts would work with any other machine learning framework as well.

import qek.data.processed_data as qek_dataset

# Load the dataset we processed in the quantum extraction tutorial
processed_dataset = qek_dataset.load_dataset(file_path="ptcfm_processed_dataset.json")
print(f"Size of the quantum compatible dataset = {len(processed_dataset)}")

from sklearn.model_selection import train_test_split

# Prepare features (X) and targets (y)
X = [data for data in processed_dataset]  # Processed samples carrying the excitation distributions.
y = [data.target for data in processed_dataset]  # Toxicity labels.

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
print(f'Size of the training quantum compatible dataset = {len(X_train)}')
print(f'Size of the testing quantum compatible dataset = {len(X_test)}')

dataset_example = X[64]
# The features we have extracted for this sample
dataset_example.draw_excitation()

The Quantum Evolution Kernel computes a similarity score between two graphs based on quantum-inspired measures. It is designed to work with graph-structured data.

  • A kernel is a function that computes a similarity measure between two data points. In ML, kernels are often used to implicitly map data into a higher-dimensional space without computing the transformation explicitly. This enables algorithms like Support Vector Machines (SVM) to solve problems that are not linearly separable in the original feature space.
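To make the idea of an implicit mapping concrete, here is a minimal sketch using a degree-2 polynomial kernel on 2D vectors. This kernel is unrelated to QEK and is chosen only because its feature map is easy to write down; the names `poly_kernel` and `feature_map` are illustrative and not part of any library.

```python
import math


def poly_kernel(x, y):
    """Degree-2 polynomial kernel k(x, y) = (x . y)^2, computed in the original 2D space."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def feature_map(x):
    """Explicit 3D feature map phi such that phi(x) . phi(y) equals (x . y)^2."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    """Plain dot product of two equally sized tuples."""
    return sum(a * b for a, b in zip(u, v))


x = (1.0, 2.0)
y = (3.0, -1.0)

# The kernel never builds the 3D vectors...
implicit = poly_kernel(x, y)
# ...yet it returns the same value as a dot product in the 3D feature space.
explicit = dot(feature_map(x), feature_map(y))

print(implicit, explicit)
```

Both values agree up to floating-point rounding, which is the point of the kernel trick: an SVM only ever needs the kernel values, never the mapped vectors themselves.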

For a graph $G$, let's call the excitation distribution $P_G$.

We may now construct the Quantum Evolution Kernel, or QEK. Mathematically, QEK is defined as: $$ K(G, G') = \exp \left( -\mu JS(P_G, P_{G'}) \right) $$

where $\mu$ is a hyperparameter of our kernel and $JS$ is the Jensen-Shannon divergence. For processing raw graph data, the IntegratedQEK is recommended; however, it requires an extractor to be provided as an argument.

import qek.data.extractors as qek_extractors
import qek.data.graphs as qek_graphs
from qek.kernel import IntegratedQEK as QEK
from pathlib import Path

# First, we need to instantiate the compiler for the type of data.
compiler = qek_graphs.PTCFMCompiler()

# Then, instantiate the Qutip Extractor.
extractor = qek_extractors.QutipExtractor(
    # Once computing is complete, data will be saved in this file.
    path=Path("saved_data.json"),
    compiler=compiler
)

# Finally, initialize the Integrated Quantum Evolution Kernel with both the extractor and the parameter mu.
kernel = QEK(mu=0.5, extractor=extractor)

# The dataset used in this tutorial has already been processed, so from here on
# we use FastQEK, which takes no extractor. Note that this rebinds the name QEK.
from qek.kernel import FastQEK as QEK

# Initialize the Quantum Evolution Kernel with a parameter mu
kernel = QEK(mu=0.5)
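To make the kernel formula above concrete, the following sketch evaluates $K(G, G') = \exp(-\mu JS(P_G, P_{G'}))$ on two made-up excitation distributions. It is a plain-Python illustration, not the library's implementation: the function names, the example distributions, and the use of natural logarithms in the Jensen-Shannon divergence are all assumptions.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) with natural logarithms.

    Terms with p_i == 0 contribute 0 by convention.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL divergence to the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def qek_similarity(p, q, mu):
    """Kernel value exp(-mu * JS(p, q)) for two probability distributions."""
    return math.exp(-mu * js_divergence(p, q))


# Hypothetical excitation distributions for two graphs (each sums to 1).
p_g = [0.1, 0.6, 0.3]
p_h = [0.5, 0.4, 0.1]

same = qek_similarity(p_g, p_g, mu=0.5)
different = qek_similarity(p_g, p_h, mu=0.5)
print(f"K(G, G)  = {same:.4f}")
print(f"K(G, G') = {different:.4f}")
```

A graph compared with itself has zero divergence and therefore similarity 1, while any two distinct distributions give a value strictly between 0 and 1. Larger values of $\mu$ make the similarity decay faster as the distributions move apart.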

Fit and Transform graph datasets using QEK


To use the kernel in machine learning algorithms, we can fit the kernel on a training dataset, and use it to transform training/testing datasets. The result of such a transformation is a kernel matrix, which represents the similarities between graphs.
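As a rough sketch of what such kernel matrices look like, the toy class below mimics a fit/transform kernel on plain numbers. It assumes the usual convention for precomputed kernels, where rows index the samples being transformed and columns index the training samples; `ToyKernel` is hypothetical and is not the QEK API.

```python
import math


class ToyKernel:
    """Hypothetical stand-in for a kernel with fit/transform; not the QEK implementation."""

    def __init__(self, mu):
        self.mu = mu
        self.train_ = None

    def similarity(self, a, b):
        # Toy similarity between two numbers: 1 for identical inputs, decaying with distance.
        return math.exp(-self.mu * abs(a - b))

    def fit(self, X):
        # Fitting only memorizes the training samples.
        self.train_ = list(X)
        return self

    def transform(self, X):
        # One row per sample in X, one column per training sample.
        return [[self.similarity(x, t) for t in self.train_] for x in X]


train = [0.0, 1.0, 3.0]
test = [0.5, 2.0]

toy = ToyKernel(mu=0.5).fit(train)
toy_K_train = toy.transform(train)
toy_K_test = toy.transform(test)

print(len(toy_K_train), len(toy_K_train[0]))  # 3 3
print(len(toy_K_test), len(toy_K_test[0]))    # 2 3
```

Under this convention the training matrix is square and symmetric with ones on the diagonal, while the testing matrix has one row per test sample and one column per training sample.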

# Fit
kernel.fit(X_train)

# Transform
K_train = kernel.transform(X_train)
K_test = kernel.transform(X_test)
print(f"Training Kernel Matrix Shape: {K_train.shape}")
print(f"Testing Kernel Matrix Shape: {K_test.shape}")

# Display the training kernel matrix
K_train

graph_1 = processed_dataset[2]
graph_2 = processed_dataset[0]
display(f"Comparing a graph with itself: {kernel.similarity(graph_1, graph_1)}")
display(f"Comparing two very dissimilar graphs: {kernel.similarity(graph_1, graph_2)}")

graph_1.draw_register()
graph_2.draw_register()

graph_1.draw_excitation()
graph_2.draw_excitation()

from sklearn.svm import SVC

# Define an SVC model with FastQEK
qek_kernel = QEK(mu=0.5)
model = SVC(kernel=qek_kernel, random_state=42)

Plugging this kernel into an SVM lets us evaluate the QEK algorithm.

model.fit(X_train, y_train)

We are using the following metrics:

  • The F1 score is a way to measure how well a model performs, especially when the data is uneven (e.g., more examples of one category than another). It combines two important aspects: how precise the model is (precision) and how well it captures all the actual positives (recall). It provides a single number that balances these two aspects, making it useful for evaluating performance in real-world scenarios where some categories are much more common than others.

  • Balanced accuracy is a method to evaluate a model's performance fairly, even when the data is imbalanced (e.g., one category is much more frequent than others). Instead of just looking at overall accuracy, which can be misleading in such cases, balanced accuracy considers how well the model performs for each category separately and then averages these performances. This ensures that the evaluation is not skewed by the more common categories, giving a more honest picture of the model's effectiveness across all categories.
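The following worked example computes both metrics by hand on a small, made-up imbalanced set of labels, to show how they differ from plain accuracy. The labels and helper names are hypothetical; in practice, scikit-learn's `f1_score` and `balanced_accuracy_score` do this work.

```python
# Hypothetical labels: 8 negatives (0) and 2 positives (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]


def counts(y_true, y_pred, positive):
    """Return (true positives, false positives, false negatives) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, fp, fn


def recall_for_class(y_true, y_pred, positive):
    tp, _, fn = counts(y_true, y_pred, positive)
    return tp / (tp + fn) if tp + fn else 0.0


def f1_for_class(y_true, y_pred, positive):
    tp, fp, fn = counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


classes = sorted(set(y_true))
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
# Balanced accuracy is the unweighted mean of the per-class recalls.
balanced_accuracy = sum(recall_for_class(y_true, y_pred, c) for c in classes) / len(classes)
f1_positive = f1_for_class(y_true, y_pred, positive=1)

print(f"Accuracy:          {accuracy:.4f}")
print(f"Balanced accuracy: {balanced_accuracy:.4f}")
print(f"F1 (class 1):      {f1_positive:.4f}")
```

Here plain accuracy is 0.8 because the frequent class dominates, whereas balanced accuracy is 0.6875 and the F1 score of the rare class is 0.5, both of which expose the weaker performance on the minority class.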

import numpy as np
from sklearn.metrics import classification_report, f1_score, balanced_accuracy_score, recall_score

def evaluate_predictions(y_test, y_pred):
    """
    Evaluates the model predictions using multiple metrics: F1 score and
    balanced accuracy. Prints the classification report
    and other evaluation results.

    Args:
        y_test (array-like): The true labels.
        y_pred (array-like): The predicted labels.
    """
    # Calculate the weighted F1 score and the spread of the per-class F1 scores
    weighted_f1 = f1_score(y_test, y_pred, average='weighted')
    f1_std = np.std(f1_score(y_test, y_pred, average=None))
    # Balanced accuracy is the mean of the per-class recalls, so we report the
    # spread of those recalls (the std of the scalar score itself is always 0)
    bal_acc = balanced_accuracy_score(y_test, y_pred)
    bal_acc_std = np.std(recall_score(y_test, y_pred, average=None))

    # Print the evaluation results
    print("\nEvaluation Results:")
    print(f"F1 Score: {weighted_f1:.2f}")
    print(f"Standard Deviation of Per-Class F1 Scores: {f1_std:.2f}")
    print(f"Balanced Accuracy: {bal_acc:.2f}")
    print(f"Standard Deviation of Per-Class Recall: {bal_acc_std:.2f}")

    # Print classification report
    print("Classification Report:")
    print(classification_report(y_test, y_pred))

# Making predictions using the trained model
y_pred = model.predict(X_test)
evaluate_predictions(y_test, y_pred)
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline

# Define the pipeline
pipeline = Pipeline([
    ('QEK', QEK(mu=0.5)),
    ('svm', SVC(kernel='precomputed', random_state=42))
])

# Train the pipeline
pipeline.fit(X_train, y_train)

# Making predictions using the trained model
y_pred = pipeline.predict(X_test)
evaluate_predictions(y_test, y_pred)

5. GridSearchCV for Hyperparameter Optimization

from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.metrics import make_scorer
import matplotlib.pyplot as plt

# Define scoring metrics
scoring = {
    "balanced_accuracy": make_scorer(balanced_accuracy_score),
    "f1_score": make_scorer(f1_score, average="weighted")
}

# Define cross-validation strategy
skf = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)

# Define parameter grid
param_grid = {
    # logspace takes exponents: 5 values of C from 10**0.001 (about 1.0) to 10**1
    'svm__C': np.logspace(0.001, 1, 5),
    'QEK__mu': [0.25, 0.5, 0.75],
}
grid_search = GridSearchCV(pipeline, param_grid, scoring=scoring, cv=skf, refit="f1_score", n_jobs=8, return_train_score=True)

# Perform grid search
grid_search.fit(X_train, y_train)

# Evaluate the best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
evaluate_predictions(y_test, y_pred)

# Extract results
cv_results = grid_search.cv_results_
# Simple plot function to plot results
def plot_grid_search_results(grid_search, param_name="svm__C"):
    cv_results = grid_search.cv_results_
    param_values = [p[param_name] for p in cv_results['params']]
    unique_param_values = sorted(set(param_values))
    mean_f1_scores = []
    std_f1_scores = []
    for param_value in unique_param_values:
        indices = [i for i, p in enumerate(cv_results['params']) if p[param_name] == param_value]
        mean_f1_scores.append(np.mean([cv_results['mean_test_f1_score'][i] for i in indices]))
        std_f1_scores.append(np.mean([cv_results['std_test_f1_score'][i] for i in indices]))
    plt.figure(figsize=(6, 5))
    plt.plot(unique_param_values, mean_f1_scores, label="Mean F1 Score")
    plt.fill_between(unique_param_values,
                     np.array(mean_f1_scores) - np.array(std_f1_scores),
                     np.array(mean_f1_scores) + np.array(std_f1_scores),
                     alpha=0.2)
    plt.title(f"Grid Search Results for {param_name}")
    plt.xlabel(f"{param_name}")
    plt.ylabel("Mean F1 Score")
    plt.legend()
    plt.grid(True)
    plt.show()

plot_grid_search_results(grid_search, 'svm__C')
plot_grid_search_results(grid_search, 'QEK__mu')
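To get a sense of the size of the search above, the sketch below rebuilds the parameter grid in plain Python and counts the model fits it implies. The C values are recomputed from the definition of a log-spaced grid (10 raised to evenly spaced exponents) rather than by calling NumPy, and the variable names are illustrative only.

```python
import itertools

# Same five C values as np.logspace(0.001, 1, 5): 10 raised to evenly spaced exponents.
num = 5
start, stop = 0.001, 1.0
exponents = [start + i * (stop - start) / (num - 1) for i in range(num)]
c_values = [10 ** e for e in exponents]
mu_values = [0.25, 0.5, 0.75]

# Every combination of C and mu is one candidate model.
grid = list(itertools.product(c_values, mu_values))

# RepeatedStratifiedKFold(n_splits=5, n_repeats=10) yields 5 * 10 train/validation splits.
n_splits, n_repeats = 5, 10
folds_per_candidate = n_splits * n_repeats
total_fits = len(grid) * folds_per_candidate

print([round(c, 3) for c in c_values])
print(f"{len(grid)} candidates x {folds_per_candidate} folds = {total_fits} fits")
```

The grid therefore covers 15 candidates and 750 cross-validation fits, plus one final refit of the best candidate on the full training set. Note that the C values only span roughly 1 to 10; if a wider range is wanted, the exponents passed to `np.logspace` are the place to change it.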