Benchmarks Tutorial

Welcome to the quanda benchmarks tutorial. It walks you through using the benchmarking tools in quanda to evaluate a data attribution method. The tutorial covers three example benchmarks and the three initialization schemes: training a benchmark from scratch with train(), loading a benchmark from a YAML configuration with from_config(), and downloading a precomputed benchmark with load_pretrained().

Downloading Precomputed Benchmarks

In this part of the tutorial, we will use the ShortcutDetection metric.

We will use the benchmark corresponding to this metric to evaluate all data attributors currently included in quanda in terms of their ability to detect when the model is using a shortcut.

We will download the precomputed MNIST benchmark. It includes an MNIST dataset in which a subset of the class 0 samples carries a shortcut feature (an 8-by-8 white box at a specific location), and a model trained on this dataset. The model has learned to classify images with this feature as class 0, and we will measure the extent to which this is reflected in the attributions of different methods.
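To make the shortcut feature concrete, here is a minimal sketch of stamping an 8-by-8 white box onto a 28-by-28 image. The function name add_shortcut, the box position (top=2, left=2) and the pixel value 1.0 are assumptions chosen for illustration; the position and encoding used by the actual benchmark may differ.

```python
def add_shortcut(image, top=2, left=2, size=8, value=1.0):
    """Return a copy of a 2D image (list of rows) with a size-by-size box set to `value`.

    `top`, `left` and `value` are hypothetical defaults, not the benchmark's settings.
    """
    stamped = [row[:] for row in image]  # copy rows so the input stays unchanged
    for r in range(top, top + size):
        for c in range(left, left + size):
            stamped[r][c] = value
    return stamped


# A blank 28-by-28 image, the MNIST resolution.
blank = [[0.0] * 28 for _ in range(28)]
stamped = add_shortcut(blank)

# Count the pixels that were turned white by the stamp.
white_pixels = sum(v == 1.0 for row in stamped for v in row)
```

In the benchmark, only a subset of class 0 training samples receives such a feature; benchmark.train_dataset.transform_indices lists which ones.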

import os

import torch

# cache_dir is assumed to be a writable directory path defined beforehand.
device = "cuda" if torch.cuda.is_available() else "cpu"

benchmark = ShortcutDetection.load_pretrained(
    bench_id="mnist_shortcut_detection",
    cache_dir=cache_dir,
    device=device,
)

The benchmark object contains all information about the controlled evaluation setup. Run the following to retrieve a sample with the shortcut feature, using benchmark.train_dataset and benchmark.train_dataset.transform_indices.

shortcut_img = benchmark.train_dataset[
    benchmark.train_dataset.transform_indices[0]
][0]
tensor_img = shortcut_img.repeat(3, 1, 1)
img = to_img(tensor_img)
img.show(title="Shortcut Image")

Prepare initialization parameters for TDA methods

We now prepare the initialization parameters of the attributors: their hyperparameters and, where needed, components from the benchmark. Note that we do not pass the model or the dataset used for attribution. The benchmark object supplies those components when it initializes the attributor during evaluation.
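The division of responsibility described above can be sketched as follows. ToyExplainer and init_explainer are hypothetical names used only for illustration, and the constructor argument names are assumptions; this is not quanda's internal code, just the general pattern of merging benchmark-owned components with user-supplied hyperparameters.

```python
class ToyExplainer:
    """Hypothetical stand-in for an attributor class; not part of quanda."""

    def __init__(self, model, train_dataset, **hyperparams):
        self.model = model
        self.train_dataset = train_dataset
        self.hyperparams = hyperparams


def init_explainer(explainer_cls, expl_kwargs, model, train_dataset):
    """Build an attributor from benchmark-owned components plus user hyperparameters."""
    return explainer_cls(
        model=model, train_dataset=train_dataset, **expl_kwargs
    )


# The user only supplies the class and its hyperparameters.
explainer = init_explainer(
    ToyExplainer,
    {"batch_size": 8},
    model="toy-model",  # placeholder for the benchmark's model
    train_dataset=[0, 1, 2],  # placeholder for the benchmark's dataset
)
```

This is why the dictionaries in this section contain only hyperparameters and cache locations.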

Similarity Influence:

captum_similarity_args = {
    "model_id": "mnist_shortcut_detection_tutorial",
    "layers": "fc_2",
    "cache_dir": os.path.join(cache_dir, "captum_similarity"),
}

Arnoldi Influence Functions: Notice that the trained checkpoints have been saved to the cache_dir while downloading the benchmark. The checkpoint paths are available via benchmark.checkpoints.

captum_influence_args = {
    "layers": ["fc_3"],
    "batch_size": 8,
    "precompute_data_ratio": 0.1,
    "projection_dim": 5,
}

TracInCP:

captum_tracin_args = {
    "final_fc_layer": "fc_3",
    "loss_fn": torch.nn.CrossEntropyLoss(reduction="mean"),
    "batch_size": 8,
}

TRAK:

trak_args = {
    "model_id": "mnist_shortcut_detection",
    "cache_dir": os.path.join(cache_dir, "trak"),
    "batch_size": 8,
    "proj_dim": 2048,
}

Representer Point Selection:

representer_points_args = {
    "model_id": "mnist_shortcut_detection",
    "cache_dir": os.path.join(cache_dir, "representer_points"),
    "batch_size": 8,
    "epoch": 100,
    "features_layer": "relu_4",
    "classifier_layer": "fc_3",
}

Run the benchmark evaluation on the attributors

Note that some attributors take a long time to initialize or to compute attributions. For a proof of concept, we recommend using CaptumSimilarity or RepresenterPoints, or lowering the parameter values given above (e.g. using a low proj_dim for TRAK or a small Hessian dataset size for ArnoldiInfluence).

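from quanda.explainers.wrappers import (
    TRAK,
    CaptumArnoldi,
    CaptumSimilarity,
    CaptumTracInCPFast,
    RepresenterPoints,
)

attributors = {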
    "captum_similarity": (CaptumSimilarity, captum_similarity_args),
    "captum_arnoldi": (CaptumArnoldi, captum_influence_args),
    "captum_tracin": (CaptumTracInCPFast, captum_tracin_args),
    "trak": (TRAK, trak_args),
    "representer": (RepresenterPoints, representer_points_args),
}
results = dict()
for name, (cls, kwargs) in attributors.items():
    results[name] = benchmark.evaluate(
        explainer_cls=cls, expl_kwargs=kwargs, batch_size=8
    )["score"]

At this point, the dictionary results contains the scores of the attributors on the benchmark.
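A small helper for comparing the collected scores might look like the sketch below. The helper name rank_attributors is hypothetical, the example values are placeholders rather than measured scores, and the code assumes each score can be converted with float() and that a higher score is better.

```python
def rank_attributors(results):
    """Return (name, score) pairs sorted from highest to lowest score."""
    return sorted(
        results.items(), key=lambda item: float(item[1]), reverse=True
    )


# Placeholder values for illustration only; these are not measured scores.
example_results = {"method_a": 0.25, "method_b": 0.75, "method_c": 0.5}

ranking = rank_attributors(example_results)
for name, score in ranking:
    print(f"{name:<12} {float(score):.3f}")
```

Passing the results dictionary built in the loop above instead of example_results would rank the real attributors.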

Training a Benchmark from Scratch

We will now showcase how a benchmark can be created from a YAML configuration and trained from scratch. Quanda parses the configuration, sets up dataset manipulations, and trains the model. Then the benchmark can be used to evaluate different attributors. This is done through the Benchmark.train method.

We will go through this use case with the SubclassDetection benchmark, which groups classes of the base dataset into superclasses. A model is trained to predict these superclasses, and for each test sample the original label of the highest-attributed training datapoint is observed. The benchmark expects this label to match the true (original) class of the test sample.
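The evaluation idea can be illustrated with a toy computation. The function below is a simplified sketch under the assumption that attributions are given as one list of training-sample scores per test sample; it is not quanda's implementation, and all values are made up for illustration.

```python
def subclass_detection_score(attributions, train_labels, test_labels):
    """Fraction of test samples whose top-attributed training sample
    has the same original class as the test sample."""
    hits = 0
    for scores, true_label in zip(attributions, test_labels):
        # Index of the training sample with the highest attribution.
        top_idx = max(range(len(scores)), key=scores.__getitem__)
        hits += train_labels[top_idx] == true_label
    return hits / len(test_labels)


# Toy attributions: one row per test sample, one column per training sample.
attributions = [
    [0.9, 0.1, 0.0],  # top-attributed training sample: index 0
    [0.2, 0.3, 0.5],  # top-attributed training sample: index 2
]
train_labels = [3, 7, 8]  # original (ungrouped) classes of training samples
test_labels = [3, 7]  # original classes of test samples

score = subclass_detection_score(attributions, train_labels, test_labels)
```

Here the first test sample counts as a hit (class 3 matches) and the second as a miss (class 8 versus 7).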

The YAML configuration file specifies all required components:

the model architecture and its training parameters (optimizer, scheduler, number of epochs, etc.)
the training, validation, and evaluation datasets with their transforms
a dataset wrapper (LabelGroupingDataset) that handles class grouping into superclasses
the number of superclasses and the grouping strategy (random or a specific mapping)

The class grouping can be set to random in the configuration to randomly assign classes into superclasses, which is the approach we will take in this tutorial.
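As a rough illustration of random grouping, the sketch below assigns each original class to a superclass using a seeded shuffle. The function name random_class_grouping and the balanced round-robin assignment are assumptions; the random strategy in quanda may distribute classes differently.

```python
import random


def random_class_grouping(n_classes, n_groups, seed=0):
    """Map each class index to a superclass index using a seeded shuffle."""
    rng = random.Random(seed)  # local generator keeps the mapping reproducible
    classes = list(range(n_classes))
    rng.shuffle(classes)
    # Deal the shuffled classes round-robin so that group sizes differ by at most one.
    return {cls: i % n_groups for i, cls in enumerate(classes)}


# Group the ten MNIST digit classes into two superclasses.
mapping = random_class_grouping(n_classes=10, n_groups=2)
```

Fixing the seed makes the grouping repeatable across runs, which matters when results need to be compared.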

Important

The configuration must specify bench_save_dir: the directory under which the trained benchmark (model checkpoints and metadata) is saved. There should be enough disk space to save the main model and M subset models for LDS (if applicable) under this directory. If training multiple benchmarks from scratch, make sure to set different bench_save_dir for each to avoid overwriting.

If multiple training jobs must share a bench_save_dir (e.g. concurrent runs of the same benchmark), pass use_pid=True to train (or train_and_push_to_hub) to suffix checkpoint and metadata directories with the current process id and avoid clobbering each other’s outputs. By default use_pid=False.
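The effect of a process-id suffix can be sketched as follows. The helper pid_suffixed_dir and the directory naming scheme are hypothetical; the exact directory names that quanda produces with use_pid=True are not specified here.

```python
import os


def pid_suffixed_dir(base_dir, name, use_pid=False):
    """Return base_dir/name, optionally suffixed with the current process id."""
    if use_pid:
        # Concurrent processes have distinct ids, so their paths never collide.
        name = f"{name}_{os.getpid()}"
    return os.path.join(base_dir, name)


shared = pid_suffixed_dir("bench", "checkpoints")
isolated = pid_suffixed_dir("bench", "checkpoints", use_pid=True)
```

Two jobs started at the same time would both write to the shared path, but each would get its own isolated path.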

Note

Calling SubclassDetection.train starts model training, so it may take a long time.

import yaml

from quanda.explainers.wrappers import (
    TRAK,
    CaptumArnoldi,
    CaptumSimilarity,
    CaptumTracInCPFast,
    RepresenterPoints,
)

with open(
    "tests/assets/mnist_local_bench/83edb41-default_SubclassDetection.yaml",
    "r",
) as f:
    subclass_config = yaml.safe_load(f)

subclass_config["bench_save_dir"] = os.path.join(
    cache_dir, "subclass_detection_bench"
)
benchmark = SubclassDetection.train(
    subclass_config,
    device=device,
)

Now that we have trained the model on the MNIST dataset with grouped classes as defined in the configuration, we finalize this tutorial by evaluating the CaptumSimilarity attributor. The results dictionary will contain the score of the attributor on the benchmark after running the following:

results = benchmark.evaluate(
    explainer_cls=CaptumSimilarity,
    expl_kwargs={
        "model_id": "mnist_subclass_detection_tutorial",
        "layers": "fc_2",
        "cache_dir": os.path.join(cache_dir, "captum_similarity"),
    },
)
