How to Assess the Quality of Attributions?

Evaluating interpretability approaches is challenging because a ground truth for interpretability is often difficult to define. Although TDA has been shown to be useful for interpretability and in practical applications, the critical question of how TDA methods should be evaluated remains open. Methods that estimate counterfactual retraining effects do have a well-defined ground truth, but computing it is too expensive for large-scale experiments. To address these shortcomings, the community has proposed several evaluation approaches, which can be categorized into three groups:

Ground truth

Because some methods are designed to approximate leave-one-out (LOO) effects, a ground truth can in principle be computed for TDA evaluation. As explained above, this counterfactual ground truth requires retraining the model multiple times on different subsets of the training data, which is computationally expensive. Additionally, this ground truth has been shown to be dominated by noise in practical deep learning settings, due to the inherent stochasticity of a typical training process (Basu et al., 2021; Nguyen et al., 2023). The most straightforward ground truth metric is the Leave-one-out (LOO) metric (Koh and Liang, 2017). The Linear Datamodeling Score (LDS) (Park et al., 2023) is another ground truth metric: it measures the correlation between the (grouped) attribution scores and the actual outputs of models trained on a limited number of subsets of the training set. This reduces the computational demand of the metric but does not eliminate it.
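The LDS idea can be illustrated with a minimal sketch, written in plain Python rather than with quanda's own API. It assumes that the grouped attribution score of a subset is the sum of the per-sample attributions and that the correlation is a Spearman rank correlation. The function names and the toy numbers are hypothetical; in practice `retrained_outputs` would come from models actually retrained on each subset.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def linear_datamodeling_score(attributions, subsets, retrained_outputs):
    """Rank correlation between summed attributions and retrained outputs.

    attributions: attribution of each training sample for ONE test sample.
    subsets: lists of training indices, one list per retrained model.
    retrained_outputs: output on the test sample of the model trained on
        the corresponding subset (assumed to be given).
    """
    predicted = [sum(attributions[i] for i in subset) for subset in subsets]
    return pearson(average_ranks(predicted), average_ranks(retrained_outputs))


# Toy values for illustration only, not real measurements.
attributions = [0.5, -0.2, 0.1, 0.3]
subsets = [[0, 1], [1, 2], [0, 3], [2, 3]]
retrained_outputs = [1.0, 0.2, 2.5, 1.7]
score = linear_datamodeling_score(attributions, subsets, retrained_outputs)
```

In this sketch only the retraining on the sampled subsets is expensive; the score itself is cheap once those outputs exist, which is where the reduced (but not eliminated) cost comes from.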

Downstream Task Evaluators

To sidestep the challenges associated with ground truth evaluation, the literature proposes assessing the utility of a TDA method in the context of an end task. The most commonly used evaluation criterion is Mislabeling Detection (Koh and Liang, 2017; Yeh et al., 2018; Pruthi et al., 2020), which compares TDA methods by how useful they are for detecting mislabeled samples after the network has been trained on a dataset whose labels were deliberately corrupted. Other examples include detecting backdoor attacks (Karthikeyan et al., 2021; Yolcu et al., 2025) and predicting the model's decision from its attributions (Hanawa et al., 2021).
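One common way to score mislabeling detection can be sketched as follows. This is an illustration under stated assumptions, not quanda's implementation: it assumes each training sample has a scalar score (for example, a self-influence value) where higher means more suspicious, and it summarizes detection quality as the normalized area under the cumulative detection curve. The function names and toy values are hypothetical.

```python
def mislabeling_detection_curve(suspicion_scores, is_mislabeled):
    """Fraction of mislabeled samples found after inspecting the top-n samples.

    suspicion_scores: one score per training sample (higher = inspect first).
    is_mislabeled: ground-truth flags, known because the labels were
        corrupted deliberately.
    """
    total = sum(is_mislabeled)
    if total == 0:
        raise ValueError("at least one sample must be mislabeled")
    # Inspect samples from most to least suspicious.
    order = sorted(
        range(len(suspicion_scores)),
        key=lambda i: suspicion_scores[i],
        reverse=True,
    )
    found = 0
    curve = []
    for i in order:
        found += is_mislabeled[i]
        curve.append(found / total)
    return curve


def detection_score(curve):
    """Mean of the curve, i.e. its normalized area (higher is better)."""
    return sum(curve) / len(curve)


# Toy values for illustration only.
scores = [0.1, 0.9, 0.3, 0.8]
flags = [False, True, False, True]
curve = mislabeling_detection_curve(scores, flags)
summary = detection_score(curve)
```

A method that ranks all corrupted samples first produces a curve that rises to 1.0 as early as possible, so comparing `summary` values across TDA methods gives a simple ordering of their usefulness on this task.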

Heuristics

Finally, the community has also used heuristics (desirable properties or sanity checks) to evaluate the quality of TDA techniques. These include comparing the attributions of a trained model with those of a randomized model (Hanawa et al., 2021) and measuring the overlap between the attributions for different test samples (Barshan et al., 2020).
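The overlap heuristic can be made concrete with a small sketch. The formulation below is one possible choice and an assumption of this example, not necessarily the definition used by Barshan et al. or by quanda: it takes the top-k attributed training samples for each test sample and reports the fraction of those slots filled by distinct training samples. The function names and toy values are hypothetical.

```python
def top_k_indices(scores, k):
    """Indices of the k highest-scoring training samples."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def top_k_distinct_fraction(attributions, k):
    """Share of top-k slots occupied by distinct training samples.

    attributions: one list of training-sample scores per test sample.
    A low value means the same training samples are returned for many
    different test samples, i.e. high overlap.
    """
    top_lists = [top_k_indices(row, k) for row in attributions]
    distinct = {index for top in top_lists for index in top}
    return len(distinct) / (k * len(top_lists))


# Toy values for illustration only: three test samples, four training samples.
attributions = [
    [0.9, 0.1, 0.8, 0.0],
    [0.7, 0.2, 0.6, 0.1],
    [0.0, 0.9, 0.1, 0.8],
]
fraction = top_k_distinct_fraction(attributions, k=2)
```

Note that the value cannot reach 1.0 when there are fewer training samples than top-k slots, so it is most informative when compared across methods on the same data.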

quanda is designed to meet the need for a comprehensive and systematic evaluation framework, allowing practitioners and researchers to obtain a detailed view of the performance of TDA methods in various contexts.