Benchmarks Tutorial
Welcome to the benchmark tutorial of quanda. It walks you through using the benchmarking tools in quanda to evaluate a data attribution method. The tutorial covers three example benchmarks and all initialization schemes: training a benchmark from scratch using train(), loading a benchmark from a YAML configuration using from_config(), and downloading a precomputed benchmark using load_pretrained().
See also
The Linear Datamodeling Score (LDS) page covers caveats specific to the most expensive benchmark in quanda, including how to precompute and reuse counterfactual subset logits across explainers.
To install the library with tutorial dependencies, run:
pip install -e '.[tutorials]'
Note
This tutorial is also available as a notebook.
Throughout this tutorial, we will be using a LeNet model trained on the MNIST dataset. Let’s start the tutorial by importing the necessary libraries and components:
import os
import pytest
import torch
import torchvision
import yaml
from quanda.benchmarks.downstream_eval import (
    ShortcutDetection,
    SubclassDetection,
)
from quanda.benchmarks.ground_truth import LinearDatamodeling
Next, we set the float32 matrix multiplication precision and define a helper transform that converts model-input tensors back into displayable images.
torch.set_float32_matmul_precision("medium")
to_img = torchvision.transforms.Compose(
    [
        torchvision.transforms.Normalize(mean=0.0, std=2.0),
        torchvision.transforms.Normalize(mean=-0.5, std=1.0),
        torchvision.transforms.ToPILImage(),
        torchvision.transforms.Resize((224, 224)),
    ]
)
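The two Normalize steps in to_img compose into x / 2 + 0.5, which maps values from [-1, 1] back to [0, 1]. This would undo a Normalize(mean=0.5, std=0.5) preprocessing step; that preprocessing is an assumption here, not something stated above. A minimal plain-Python check of the arithmetic, with hypothetical helper names:

```python
def normalize(x, mean, std):
    """Per-value formula applied by a Normalize transform: (x - mean) / std."""
    return (x - mean) / std


def to_unit_range(x):
    """Apply the two Normalize steps used in to_img, in order."""
    x = normalize(x, mean=0.0, std=2.0)      # x / 2
    return normalize(x, mean=-0.5, std=1.0)  # x / 2 + 0.5


# Assumed preprocessing (hypothetical): pixels in [0, 1], mean 0.5, std 0.5.
pixel = 0.25
model_input = normalize(pixel, mean=0.5, std=0.5)
recovered = to_unit_range(model_input)
```

If the dataset was preprocessed differently, the two constants in to_img would need to change accordingly.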
Downloading Precomputed Benchmarks
In this part of the tutorial, we will use the ShortcutDetection metric.
We will use the benchmark corresponding to this metric to evaluate all data attributors currently included in quanda in terms of their ability to detect when the model is using a shortcut.
We will download the precomputed MNIST benchmark. It includes an MNIST dataset in which a subset of the class 0 samples carries a shortcut feature (an 8-by-8 white box at a specific location), and a model trained on this dataset. The model has learned to classify images with this feature as class 0, and we will measure the extent to which the attributions of different methods reflect this.
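device = "cuda" if torch.cuda.is_available() else "cpu"
# cache_dir is not defined elsewhere in this tutorial; the path below is an
# assumed placeholder, and any writable directory should work.
cache_dir = os.path.join(os.getcwd(), "quanda_tutorial_cache")
benchmark = ShortcutDetection.load_pretrained(
    bench_id="mnist_shortcut_detection",
    cache_dir=cache_dir,
    device=device,
)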
The benchmark object contains all information about the controlled evaluation setup. Run the following to get some samples with the shortcut features, using benchmark.train_dataset and benchmark.train_dataset.transform_indices.
shortcut_img = benchmark.train_dataset[
    benchmark.train_dataset.transform_indices[0]
][0]
tensor_img = shortcut_img.repeat(3, 1, 1)
img = to_img(tensor_img)
img.show(title="Shortcut Image")
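To make the shortcut feature concrete, here is a rough sketch of stamping an 8-by-8 box onto a 28-by-28 image. It does not reproduce how the benchmark dataset was built: the helper name, the box position, and the pixel value are all illustrative assumptions.

```python
def add_shortcut(image, top=0, left=0, size=8, value=1.0):
    """Return a copy of a 2D image (list of rows) with a size-by-size box set to value."""
    stamped = [row[:] for row in image]  # copy rows so the input stays untouched
    for r in range(top, top + size):
        for c in range(left, left + size):
            stamped[r][c] = value
    return stamped


blank = [[0.0] * 28 for _ in range(28)]         # MNIST-sized placeholder image
stamped = add_shortcut(blank, top=10, left=10)  # position is hypothetical
white_pixels = sum(v == 1.0 for row in stamped for v in row)
```

A model trained on such data can learn to rely on the box instead of the digit, which is the behaviour the benchmark probes.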
Prepare initialization parameters for TDA methods
We now prepare the initialization parameters for the attributors: their hyperparameters and, where needed, components from the benchmark. Note that we do not pass the model or the dataset to use for attribution, because the benchmark object supplies them when it initializes each attributor during evaluation.
Similarity Influence:
captum_similarity_args = {
    "model_id": "mnist_shortcut_detection_tutorial",
    "layers": "fc_2",
    "cache_dir": os.path.join(cache_dir, "captum_similarity"),
}
Arnoldi Influence Functions: Notice that the trained checkpoints have been saved to the cache_dir while downloading the benchmark. The checkpoint paths are available via benchmark.checkpoints.
captum_influence_args = {
    "layers": ["fc_3"],
    "batch_size": 8,
    "precompute_data_ratio": 0.1,
    "projection_dim": 5,
}
TracInCP:
captum_tracin_args = {
    "final_fc_layer": "fc_3",
    "loss_fn": torch.nn.CrossEntropyLoss(reduction="mean"),
    "batch_size": 8,
}
TRAK:
trak_args = {
    "model_id": "mnist_shortcut_detection",
    "cache_dir": os.path.join(cache_dir, "trak"),
    "batch_size": 8,
    "proj_dim": 2048,
}
Representer Point Selection:
representer_points_args = {
    "model_id": "mnist_shortcut_detection",
    "cache_dir": os.path.join(cache_dir, "representer_points"),
    "batch_size": 8,
    "epoch": 100,
    "features_layer": "relu_4",
    "classifier_layer": "fc_3",
}
Run the benchmark evaluation on the attributors
Note that some attributors take a long time to initialize or to compute attributions. For a proof of concept, we recommend using CaptumSimilarity or RepresenterPoints, or lowering the parameter values given above (e.g. using a low proj_dim for TRAK or a small Hessian dataset size for ArnoldiInfluence).
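# Explainer wrappers used below, imported here so that this block runs on its own.
from quanda.explainers.wrappers import (
    TRAK,
    CaptumArnoldi,
    CaptumSimilarity,
    CaptumTracInCPFast,
    RepresenterPoints,
)

attributors = {
    "captum_similarity": (CaptumSimilarity, captum_similarity_args),
    "captum_arnoldi": (CaptumArnoldi, captum_influence_args),
    "captum_tracin": (CaptumTracInCPFast, captum_tracin_args),
    "trak": (TRAK, trak_args),
    "representer": (RepresenterPoints, representer_points_args),
}
results = dict()
# Evaluate each attributor on the benchmark and keep only its score.
for name, (cls, kwargs) in attributors.items():
    results[name] = benchmark.evaluate(
        explainer_cls=cls, expl_kwargs=kwargs, batch_size=8
    )["score"]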
At this point, the dictionary results contains the scores of the attributors on the benchmark.
Training a Benchmark from Scratch
We will now showcase how a benchmark can be created from a YAML configuration and trained from scratch. Quanda parses the configuration, sets up dataset manipulations, and trains the model. Then the benchmark can be used to evaluate different attributors. This is done through the Benchmark.train method.
We will go through this use-case with the SubclassDetection benchmark which groups classes of the base dataset into superclasses. A model is trained to predict these superclasses, and the original label of the highest attributed datapoint for each test sample is observed. The benchmark expects this to be the same as the true class of the test sample.
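One way to read this criterion, as a sketch rather than quanda's implementation: for each test sample, take the training point with the largest attribution and compare its original label with the original class of the test sample. The function name and the toy inputs below are purely illustrative.

```python
def subclass_detection_score(attributions, train_labels, test_labels):
    """Fraction of test samples whose top-attributed training point
    has the same original (pre-grouping) label as the test sample."""
    hits = 0
    for scores, true_label in zip(attributions, test_labels):
        # Index of the training point with the highest attribution.
        top_idx = max(range(len(scores)), key=scores.__getitem__)
        hits += train_labels[top_idx] == true_label
    return hits / len(test_labels)


# Toy inputs: rows are test samples, columns are training points.
attributions = [
    [0.1, 0.9, 0.0],
    [0.7, 0.2, 0.1],
]
train_labels = [3, 5, 8]  # original labels of the training points
test_labels = [5, 8]      # original labels of the test samples
score = subclass_detection_score(attributions, train_labels, test_labels)
```

Here the first test sample counts as a hit and the second does not, so the toy score is 0.5.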
The YAML configuration file specifies all required components:
the model architecture and its training parameters (optimizer, scheduler, number of epochs, etc.)
the training, validation, and evaluation datasets with their transforms
a dataset wrapper (LabelGroupingDataset) that handles class grouping into superclasses
the number of superclasses and the grouping strategy (random or a specific mapping)
The class grouping can be set to random in the configuration to randomly assign classes into superclasses, which is the approach we will take in this tutorial.
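As an illustration of what a random grouping could look like, the sketch below shuffles the class ids and deals them into superclasses. This is an assumption for illustration only: the helper is hypothetical, and quanda's random strategy may assign classes differently (for example, without balancing group sizes).

```python
import random


def random_class_grouping(n_classes, n_groups, seed=0):
    """Map each class id to a superclass id, with every superclass non-empty."""
    rng = random.Random(seed)  # local generator, so the global state is untouched
    classes = list(range(n_classes))
    rng.shuffle(classes)
    # Deal shuffled classes round-robin so group sizes differ by at most one.
    return {cls: i % n_groups for i, cls in enumerate(classes)}


class_to_group = random_class_grouping(n_classes=10, n_groups=2, seed=42)
```

Fixing the seed makes the grouping reproducible, which matters when the same benchmark is rebuilt later.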
Important
The configuration must specify bench_save_dir: the directory under which the trained benchmark (model checkpoints and metadata) is saved. There should be enough disk space to save the main model and M subset models for LDS (if applicable) under this directory. If training multiple benchmarks from scratch, make sure to set different bench_save_dir for each to avoid overwriting.
If multiple training jobs must share a bench_save_dir (e.g. concurrent runs of the same benchmark), pass use_pid=True to train (or train_and_push_to_hub) to suffix checkpoint and metadata directories with the current process id and avoid clobbering each other’s outputs. By default use_pid=False.
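The idea behind use_pid can be sketched as follows. The helper and the directory naming are hypothetical; the exact layout quanda uses under bench_save_dir is not specified here.

```python
import os


def checkpoint_dir(bench_save_dir, name, use_pid=False):
    """Build an output directory path, optionally suffixed with the process id."""
    if use_pid:
        # Concurrent processes have distinct ids, so their outputs do not collide.
        name = f"{name}_{os.getpid()}"
    return os.path.join(bench_save_dir, name)


shared = checkpoint_dir("bench_out", "checkpoints")
private = checkpoint_dir("bench_out", "checkpoints", use_pid=True)
```

With use_pid=False, two concurrent runs would resolve to the same directory and could overwrite each other.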
Note
Please note that calling SubclassDetection.train starts model training and may therefore take a long time.
from quanda.explainers.wrappers import (
    TRAK,
    CaptumArnoldi,
    CaptumSimilarity,
    CaptumTracInCPFast,
    RepresenterPoints,
)

with open(
    "tests/assets/mnist_local_bench/83edb41-default_SubclassDetection.yaml",
    "r",
) as f:
    subclass_config = yaml.safe_load(f)
subclass_config["bench_save_dir"] = os.path.join(
    cache_dir, "subclass_detection_bench"
)
benchmark = SubclassDetection.train(
    subclass_config,
    device=device,
)
Now that we have trained the model on the MNIST dataset with grouped classes as defined in the configuration, we conclude this example by evaluating the CaptumSimilarity attributor. After running the following, results will contain the score of the attributor on the benchmark:
results = benchmark.evaluate(
    explainer_cls=CaptumSimilarity,
    expl_kwargs={
        "model_id": "mnist_subclass_detection_tutorial",
        "layers": "fc_2",
        "cache_dir": os.path.join(cache_dir, "captum_similarity"),
    },
)
Caching and Sharing Explanations
Computing attributions is typically the most expensive step in TDA evaluation. To avoid recomputing them every time, every Benchmark exposes an explain classmethod that precomputes attributions over the evaluation dataset and writes them to disk together with an explanations_config.yaml describing how they were generated. A subsequent call to benchmark.evaluate(..., cache_dir=<path>, use_cached_expl=True) reads from that cache instead of recomputing.
The cache directory is keyed on:
the benchmark id (or its explanations_group, see below),
the explainer class name,
a stable hash of expl_kwargs,
the eval-subsample parameters max_eval_n and eval_seed.
Changing any of these produces a different cache key, so cached explanations stay coupled to the exact setup they were computed on.
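A cache key with these properties could be built roughly as follows. This is a sketch under assumptions, not quanda's actual scheme: the function name, the key layout, and the choice of SHA-256 over a JSON dump are all illustrative.

```python
import hashlib
import json


def explanation_cache_key(bench_id, explainer_name, expl_kwargs,
                          max_eval_n=None, eval_seed=None,
                          explanations_group=None):
    """Build a deterministic cache key from the components listed above."""
    # Sorting keys makes the hash independent of dict insertion order;
    # default=str is a simplification for values that are not JSON-serializable.
    payload = json.dumps(expl_kwargs, sort_keys=True, default=str)
    kwargs_hash = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    # A shared group, if given, takes the place of the per-benchmark id.
    prefix = explanations_group or bench_id
    return f"{prefix}/{explainer_name}/{kwargs_hash}/n{max_eval_n}_s{eval_seed}"


key_a = explanation_cache_key("mnist_shortcut_detection", "CaptumSimilarity",
                              {"layers": "fc_2", "model_id": "m"}, 100, 0)
key_b = explanation_cache_key("mnist_shortcut_detection", "CaptumSimilarity",
                              {"model_id": "m", "layers": "fc_2"}, 100, 0)
key_c = explanation_cache_key("mnist_shortcut_detection", "CaptumSimilarity",
                              {"layers": "fc_3", "model_id": "m"}, 100, 0)
```

In this sketch key_a and key_b coincide because only the order of the keyword arguments differs, while key_c differs because a hyperparameter changed.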
Sharing across benchmarks. Several benchmarks (e.g. ClassDetection and LinearDatamodelingMetric) can be defined on top of the same model + train/eval datasets. Setting a common explanations_group in their YAML configs replaces the per-benchmark id segment of the cache key with a shared one, so a single attribution pass can drive multiple evaluations. Only opt in when the grouped benchmarks really share those inputs — mismatched inputs under a shared group will silently corrupt results.
See Linear Datamodeling Score (LDS) for the analogous mechanism that caches counterfactual subset logits across explainers in the LDS benchmark.