The authors propose an ante-hoc explainability framework for training neural networks that are inherently interpretable. Evaluating existing explainability methods against three desiderata (explicitness, faithfulness, and stability), the authors show that they fall short. They then propose Self-Explaining Neural Networks (SENNs) and show that these are superior with respect to the aforementioned criteria.

From Linear Models to SENNs

The main intuition is that linear models are highly interpretable as we can see exactly what features the model is placing emphasis on, and they make consistent predictions based on a small number of parameters.

For input features $x_1, \dots, x_n$ and associated parameters $\theta_0, \theta_1, \dots, \theta_n$, the linear regression model is given by $f(x) = \sum_{i=1}^{n} \theta_i x_i + \theta_0$. This model is arguably interpretable for three specific reasons: i) input features (the $x_i$'s) are clearly anchored with the available observations, e.g., arising from empirical measurements; ii) each parameter $\theta_i$ provides a quantitative positive/negative contribution of the corresponding feature $x_i$ to the predicted value; and iii) the aggregation of feature-specific terms is additive without conflating feature-by-feature interpretation of impact.
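To make reason (ii) concrete, here is a minimal NumPy sketch (not from the paper; the coefficient and feature values are made up) showing how a linear prediction decomposes into additive, signed per-feature contributions:

```python
import numpy as np

# Hypothetical learned coefficients and one observation.
theta = np.array([1.5, -2.0, 0.3])   # theta_1 ... theta_n
theta_0 = 0.1                         # offset
x = np.array([2.0, 1.0, -4.0])        # input features x_1 ... x_n

contributions = theta * x             # per-feature terms theta_i * x_i
prediction = contributions.sum() + theta_0

print(contributions)   # [ 3.  -2.  -1.2] -> a feature-level "explanation"
print(prediction)      # -0.1
```

Each entry of `contributions` tells us exactly how much, and in which direction, a feature pushed the prediction; this is the property SENNs try to preserve.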

The authors aim to generalize this kind of linear model to simultaneously enrich its learning capabilities while maintaining its interpretability. The generalization is achieved in 3 steps:

  1. Generalizing coefficients
  2. Utilizing rich feature bases
  3. Generalizing the aggregation function.

Generalizing Coefficients

The idea is to have the coefficients depend on the input $x$, i.e. have the coefficients be functions $\theta(x)$ of $x$.

Specifically, we define $f(x) = \theta(x)^\top x$ (offset function omitted), and choose $\theta(\cdot)$ from a complex model class $\Theta$, realized for example via deep neural networks.

Crucially, with just this generalization of $\theta$, the model becomes “as powerful as any deep neural network,” since $\theta(\cdot)$ could simply be a neural network. However, by doing so, we reintroduce black-box neural nets and lose the interpretability that we had with scalar parameters. Thus, the authors assert that we must require “close” inputs $x$ and $x_0$ to have similar parameter values $\theta(x)$ and $\theta(x_0)$.

More precisely, we can, for example, regularize the model in such a manner that $\theta(x) \approx \theta(x_0)$ for all $x$ in a neighborhood of $x_0$. In other words, ==the model acts locally, around each $x_0$, as a linear model with a vector of stable coefficients $\theta(x_0)$==. The individual values $\theta_i(x)$ act as, and are interpretable as, coefficients of a linear model with respect to the final prediction, but they adapt dynamically to the input, albeit varying more slowly than $x$.
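A minimal PyTorch sketch of this first generalization might look as follows (an illustration, not the authors' code; the architecture and sizes are arbitrary assumptions):

```python
import torch
import torch.nn as nn

class InputDependentLinear(nn.Module):
    """f(x) = theta(x)^T x, where the coefficients theta(x) are produced
    by a neural network (the "parametrizer")."""

    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.parametrizer = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_features),   # one coefficient per input feature
        )

    def forward(self, x: torch.Tensor):
        theta = self.parametrizer(x)         # (batch, n_features)
        f = (theta * x).sum(dim=-1)          # theta(x)^T x, offset omitted
        return f, theta                      # theta doubles as the explanation

model = InputDependentLinear(n_features=10)
f, theta = model(torch.randn(4, 10))
```

Without any further constraint this is just an ordinary black-box network; the stability regularization discussed later is what keeps `theta` interpretable.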

Feature Basis

Typical interpretable models tend to consider each variable (one feature or one pixel) as the fundamental unit that explanations consist of. However, pixels are rarely the basic units used in human image understanding; instead, we would rely on strokes and other higher-order features. We refer to these more general features as interpretable basis concepts and use them in place of raw inputs in our models.

We define a function $h$ that takes raw input features and projects them into a more interpretable feature space. The authors offer several suggestions for $h$:

we consider functions $h(x) : \mathcal{X} \to \mathcal{Z} \subset \mathbb{R}^k$, where $\mathcal{Z}$ is some space of interpretable atoms. Naturally, $k$ should be small so as to keep the explanations easily digestible. Alternatives for $h(x)$ include: (i) subset aggregates of the input (e.g., with $h(x) = Ax$ for $A$ a boolean mask matrix), (ii) predefined, pre-grounded feature extractors designed with expert knowledge (e.g., filters for image processing), (iii) prototype based concepts, e.g. $h_i(x) = \lVert x - z_i \rVert$ for some $z_i \in \mathcal{X}$ [12], or learnt representations with specific constraints to ensure grounding [19]. Naturally, we can let $h$ be the identity to recover raw-input explanations if desired.
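Two of these alternatives are easy to sketch in code. Below, the mask matrix, the number of concepts $k$, and the encoder architecture are all made-up placeholders, not choices from the paper:

```python
import torch
import torch.nn as nn

# (i) Subset aggregates h(x) = A x, with A a boolean mask matrix.
# Here: 6 raw features grouped into k = 2 concepts.
A = torch.tensor([[1., 1., 1., 0., 0., 0.],
                  [0., 0., 0., 1., 1., 1.]])

def h_mask(x: torch.Tensor) -> torch.Tensor:
    return x @ A.T            # (batch, 6) -> (batch, 2)

# A learned concept encoder with a small number k of concepts
# (in the paper this is later trained as part of an autoencoder;
# only the encoder half is shown here).
class ConceptEncoder(nn.Module):
    def __init__(self, n_features: int, k: int = 5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, k)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)    # h(x) in R^k, with k kept small
```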

With the generalized coefficients and the feature basis, we have the following model:

$$f(x) = \theta(x)^\top h(x) = \sum_{i=1}^{k} \theta_i(x)\, h_i(x)$$

Since each $h_i(x)$ remains a scalar, it can still be interpreted as the degree to which a particular feature is present. In turn, $\theta_i(x)$, with constraints similar to those discussed above, remains interpretable as a local coefficient. Note that the notion of locality must now take into account how the concepts $h(x)$, rather than the inputs $x$, vary, since the model is interpreted as being linear in the concepts $h(x)$ rather than in $x$.

Generalized Aggregation

The last step is to generalize the summation present in $f(x) = \theta(x)^\top h(x)$. More generally, we can consider any aggregation function $g$ applied to the terms $z_i := \theta_i(x)\, h_i(x)$, i.e., $f(x) = g\big(\theta_1(x) h_1(x), \dots, \theta_k(x) h_k(x)\big)$.

Naturally, in order for this function to preserve the desired interpretation of $\theta(x)$ in relation to $h(x)$, it should: (i) be permutation invariant, so as to eliminate higher-order uninterpretable effects caused by the relative position of the arguments, (ii) isolate the effect of individual $h_i(x)$'s in the output (e.g., avoiding multiplicative interactions between them), and (iii) preserve the sign and relative magnitude of the impact of the relevance values $\theta_i(x)$.
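Putting the three generalizations together (with the simplest admissible aggregation, $g = \text{sum}$), a sketch of the full forward pass could look like this; the layer sizes and the per-class shaping of the relevances are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class SENNSketch(nn.Module):
    """f(x) = g(theta_1(x) h_1(x), ..., theta_k(x) h_k(x)) with g = sum."""

    def __init__(self, n_features: int, k: int = 5, n_classes: int = 3):
        super().__init__()
        # Concept encoder h: X -> R^k (k small).
        self.h = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(),
                               nn.Linear(32, k))
        # Relevance parametrizer theta: X -> R^{k x n_classes}.
        self.theta = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(),
                                   nn.Linear(32, k * n_classes))
        self.k, self.n_classes = k, n_classes

    def forward(self, x: torch.Tensor):
        concepts = self.h(x)                                          # (B, k)
        relevances = self.theta(x).view(-1, self.k, self.n_classes)   # (B, k, C)
        # Sum aggregation: permutation invariant, additive, and
        # sign/magnitude preserving, so it satisfies (i)-(iii) above.
        logits = torch.einsum('bk,bkc->bc', concepts, relevances)
        return logits, concepts, relevances
```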

Self Explaining Models

The authors use the following definition to enforce the notion of “local stability.”

Definition 3.2. We say that $f : \mathcal{X} \to \mathbb{R}^m$ is locally difference bounded by $h$ if for every $x_0$ there exist $\delta > 0$ and $L \in \mathbb{R}$ such that $\lVert x - x_0 \rVert < \delta$ implies $\lVert f(x) - f(x_0) \rVert \le L \lVert h(x) - h(x_0) \rVert$.

The following definition specifies the class of functions that the authors consider to be self-explaining prediction models:

Definition 3.3. Let $\mathcal{X} \subseteq \mathbb{R}^n$ and $\mathcal{Y} \subseteq \mathbb{R}^m$ be the input and output spaces. We say that $f : \mathcal{X} \to \mathcal{Y}$ is a self-explaining prediction model if it has the form

$$f(x) = g\big(\theta_1(x) h_1(x), \dots, \theta_k(x) h_k(x)\big)$$

where:

  • P1) $g$ is monotone and completely additively separable
  • P2) For every $z_i := \theta_i(x) h_i(x)$, $g$ satisfies $\frac{\partial g}{\partial z_i} \ge 0$
  • P3) $\theta$ is locally difference bounded by $h$
  • P4) $h_i(x)$ is an interpretable representation of $x$
  • P5) $k$ is small. In that case, for a given input $x$, we define the explanation of $f(x)$ to be the set $\mathcal{E}_f(x) \equiv \{(h_i(x), \theta_i(x))\}_{i=1}^{k}$ of basis concepts and their influence scores.

Since our aim is maintaining model richness even in the case where the $h_i(x)$ are chosen to be trivial input feature indicators, we rely predominantly on $\theta(\cdot)$ for modeling capacity, realizing it with larger, higher-capacity architectures.

The tricky part about enforcing P3 is that we want the model to be linear in terms of the concepts, not $x$, i.e. we want $\theta(x)$ to be stable with respect to $h(x)$.

For this, let us consider what $\theta(x)$ would look like if the $h_i(x)$'s were indeed (constant) parameters. Looking at $f$ as a function of $x$, i.e. $f(x) = g(\theta_1(x) h_1(x), \dots, \theta_k(x) h_k(x))$, let $z = h(x)$. Using the chain rule we get $\nabla_x f = \nabla_z f \cdot J_x^h$, where $J_x^h$ denotes the Jacobian of $h$ (with respect to $x$). At a given point $x_0$, we want $\theta(x_0)$ to behave as the derivative of $f$ with respect to the concept vector $h(x)$ around $x_0$, i.e., we seek $\theta(x_0) \approx \nabla_z f$. Since this is hard to enforce directly, we can instead plug this ansatz in to obtain a proxy condition:

$$\mathcal{L}_\theta(f(x)) := \big\lVert \nabla_x f(x) - \theta(x)^\top J_x^h(x) \big\rVert \approx 0 \tag{3}$$

All three terms in (3) can be computed, and when using differentiable architectures $h(\cdot)$ and $\theta(\cdot)$, we obtain gradients with respect to (3) through automatic differentiation and thus use it as a regularization term in the optimization objective. With this, we obtain a gradient-regularized objective of the form $\mathcal{L}_y(f(x), y) + \lambda \mathcal{L}_\theta(f(x))$, where the first term is a classification loss and $\lambda > 0$ a parameter that trades off performance against stability, and therefore interpretability, of $\theta(x)$.

This is so incredibly clever I’m at a loss for words. Essentially, we know we want $\theta(x_0) \approx \nabla_z f$ to be true. We also know that $\nabla_x f = \nabla_z f \cdot J_x^h$. Since it’s difficult to directly encode the objective $\theta(x_0) \approx \nabla_z f$ into the loss function, we instead encode $\nabla_x f(x) \approx \theta(x)^\top J_x^h(x)$ into the loss function. A very nice problem-solving technique seen used in the wild!
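Here is a sketch of how that proxy condition can be turned into a differentiable penalty with autograd. It assumes a model that returns, per example, a scalar prediction together with its concepts and relevances (e.g., a single-output variant of the sketches above); it is an illustration of the idea, not the authors' implementation:

```python
import torch

def robustness_loss(x: torch.Tensor, model) -> torch.Tensor:
    """L_theta = || grad_x f(x) - theta(x)^T J_x^h(x) ||, batched over x."""
    x = x.clone().requires_grad_(True)
    f, concepts, theta = model(x)        # f: (B,), concepts, theta: (B, k)

    # grad_x f(x): gradient of each scalar output w.r.t. its own input.
    grad_f = torch.autograd.grad(f.sum(), x, create_graph=True)[0]     # (B, n)

    # theta(x)^T J_x^h(x): differentiate theta . h(x) with theta held
    # constant (detached), so only the path through h contributes.
    inner = (theta.detach() * concepts).sum()
    theta_jac = torch.autograd.grad(inner, x, create_graph=True)[0]    # (B, n)

    return (grad_f - theta_jac).norm(dim=-1).mean()

# Used as a regularizer in the overall objective, e.g.:
# loss = task_loss + lam * robustness_loss(x, model)
```

The `create_graph=True` flags are what allow this penalty itself to be backpropagated through during training.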

Learning Interpretable Basis Concepts

Basis concepts which serve as “units” of explanation are ideally “expert-informed.” However, in cases where expert knowledge is scarce or expensive, such concepts can be learned. The authors propose the following desiderata for interpretable concepts:

(i) Fidelity: the representation of $x$ in terms of concepts should preserve relevant information, (ii) Diversity: inputs should be representable with few non-overlapping concepts, and (iii) Grounding: concepts should have an immediate human-understandable interpretation.

The authors then describe how they achieved their proposed desiderata:

Here, we enforce these conditions upon the concepts learnt by SENN by: (i) training $h$ as an autoencoder, (ii) enforcing diversity through sparsity, and (iii) providing interpretation on the concepts by prototyping (e.g., by providing a small set of training examples that maximally activate each concept). Learning of $h$ is done end-to-end in conjunction with the rest of the model. If we denote by $h_{\text{dec}}(\cdot)$ the decoder associated with $h$, and by $\hat{x} := h_{\text{dec}}(h(x))$ the reconstruction of $x$, we use an additional penalty $\mathcal{L}_h(x, \hat{x})$ on the objective, yielding the loss:

$$\mathcal{L}_y(f(x), y) + \lambda \mathcal{L}_\theta(f(x)) + \xi \mathcal{L}_h(x, \hat{x})$$

Achieving (iii), i.e., the grounding of $h(x)$, is more subjective. A simple approach consists of representing each concept by the elements in a sample of data that maximize its value; that is, we can represent concept $i$ through the set

$$\hat{\mathcal{X}}_i = \operatorname*{argmax}_{\hat{\mathcal{X}} \subseteq \mathcal{X},\, |\hat{\mathcal{X}}| = l} \; \sum_{x \in \hat{\mathcal{X}}} h_i(x)$$

where $l$ is small. Similarly, one could construct (by optimizing over $x$) synthetic inputs that maximally activate each concept (and do not activate others), i.e., $\operatorname*{argmax}_{x \in \mathcal{X}} \; h_i(x) - \sum_{j \neq i} h_j(x)$. Alternatively, when available, one might want to represent concepts via their learnt weights, e.g., by looking at the filters associated with each concept in a CNN-based $h(x)$. In our experiments, we use the first of these approaches (i.e., using maximally activated prototypes), leaving exploration of the other two for future work.
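The prototyping step (iii) is straightforward to sketch: for each concept, collect the handful of examples in a data sample that activate it most strongly. The function below is an illustration under those assumptions (the encoder `h`, the tensor `dataset_x`, and the value of `l` are placeholders), not the authors' code:

```python
import torch

@torch.no_grad()
def concept_prototypes(h, dataset_x: torch.Tensor, l: int = 5):
    """For each concept i, return the l examples with the largest h_i(x)."""
    concepts = h(dataset_x)                    # (N, k) concept activations
    top_idx = concepts.topk(l, dim=0).indices  # (l, k): top examples per concept
    # prototypes[i] holds the l maximally-activating examples for concept i
    return [dataset_x[top_idx[:, i]] for i in range(concepts.shape[1])]
```

Displaying these prototype sets next to the relevance scores $\theta_i(x)$ is what turns each learned concept into something a human can inspect.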

Reference: David Alvarez-Melis and Tommi S. Jaakkola. “Towards Robust Interpretability with Self-Explaining Neural Networks.” NeurIPS 2018.