What is Training Data Attribution?

The interpretability of neural network decisions is an active area of research that has seen a variety of approaches over time. Early work focused mostly on feature attribution methods, which highlight the features in the input space that are responsible for a specific prediction (Simonyan et al., 2014; Bach et al., 2015; Lundberg and Lee, 2017). These methods were often criticized for being unreliable and difficult to understand (Adebayo et al., 2018; Ghorbani et al., 2019). In response, researchers explored new directions, such as concept-based (Poeta et al., 2023) and mechanistic interpretability (Bereska and Gavves) methods. Recently, Training Data Attribution (TDA) has gained attention as a promising approach for enhancing the interpretability of neural networks.

TDA methods attribute a model’s output on a specific test sample to the data the model was trained on. As such, they reveal the training datapoints responsible for the model’s decisions. By tracing model decisions back to the training data, TDA methods help practitioners understand the model’s behavior and identify potential issues in the training setup, such as biases in the dataset.

Several approaches have been proposed for this problem. Some methods estimate the counterfactual effect of removing datapoints from the training set and retraining the model (Koh and Liang, 2017; Park et al., 2023; Bae et al., 2024). Others track the contributions of training points to the loss reduction throughout training (Pruthi et al., 2020), use interpretable surrogate models (Yeh et al., 2018), or find training samples that the model deems similar to the test sample (Caruana et al., 1999; Hanawa et al., 2021).

In addition to model understanding, TDA has been used in a variety of applications, such as debugging model behavior (Koh and Liang, 2017; Yeh et al., 2018; K and Søgaard, 2021; Guo et al., 2021), data summarization (Khanna et al., 2019; Marion et al., 2023; Yang et al., 2023), dataset selection (Engstrom et al., 2024; Chhabra et al., 2024), fact tracing (Akyurek et al., 2022) and machine unlearning (Warnecke et al., 2023).
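To make the counterfactual, retraining-based notion of attribution concrete, the following is a minimal sketch of exact leave-one-out attribution on a toy problem. It is not the method of any of the cited papers, which approximate this quantity for neural networks where retraining is too expensive. The ridge regression model, the synthetic data, and the helper names `fit_ridge`, `squared_loss` and `leave_one_out_attribution` are all assumptions made for illustration.

```python
import numpy as np


def fit_ridge(X, y, lam=1e-2):
    """Closed-form ridge regression: w = (X^T X + lam * I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


def squared_loss(w, x, y):
    """Squared-error loss of the linear model w on a single sample (x, y)."""
    return 0.5 * (x @ w - y) ** 2


def leave_one_out_attribution(X, y, x_test, y_test, lam=1e-2):
    """Attribute the test loss to each training point by removing it and retraining.

    scores[i] = (test loss when point i is left out) - (test loss with all points).
    A positive score means that removing point i increases the test loss,
    i.e. the point was helpful for this particular test sample.
    """
    base = squared_loss(fit_ridge(X, y, lam), x_test, y_test)
    n = len(X)
    scores = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i          # boolean mask that drops point i
        w_without_i = fit_ridge(X[keep], y[keep], lam)
        scores[i] = squared_loss(w_without_i, x_test, y_test) - base
    return scores


# Hypothetical toy data: 20 training points in 3 dimensions with noisy linear labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=20)

# A test sample placed close to the first training point.
x_test, y_test = X[0] + 0.01, y[0]

scores = leave_one_out_attribution(X, y, x_test, y_test)
most_influential = np.argsort(-np.abs(scores))[:3]
print("Most influential training indices:", most_influential)
print("Their attribution scores:", scores[most_influential])
```

Each score requires one full retraining run, which is only feasible here because the model has a closed-form solution. That cost is the reason the counterfactual methods mentioned above rely on approximations rather than literal retraining.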