I love the field of interpretability, but one issue faced by everyone who dips their toes into it is: “What is interpretability?” There never seems to be a universally agreed-upon definition. Much like philosophy, this often leads to disagreements over definitions, fights over contexts, and arguments over objectives. Interpretability often ends up treated as a “you know it when you see it” phenomenon. In many ways there is no single definition, since interpretability depends on the context, the goals, the target audience, the application, and more. Nonetheless, my feeling is that the field continues to make progress by figuring out what interpretability is not, steadily refining the collective definition and improving the available tools. Again, much like the moral philosophy of Plato’s “Republic”, we must keep the conversation going to keep getting closer to the truth.
Although many researchers in interpretability are already aware of the distinction between “Interpretability” and “Explainability” since its seminal treatment in [1], even these definitions are not universally agreed upon. As a reminder, we will call something interpretable if it is ‘intrinsically interpretable’ or ‘interpretable by design’, meaning we can directly understand the decisions of the model. The classical examples are linear regression and decision trees. This contrasts with something explainable, meaning that the model itself is a black box, but we provide a post-hoc explanation, a subsequent justification, for the model’s decision.
Like many of the definitions in interpretability, this might raise more questions than it answers. Although the distinction between “explanations beforehand” and “explanations afterward” is a fairly straightforward characterization of the difference, the question of “How can I tell if my model is intrinsically interpretable?” is much less straightforward. In this blog post, I will attempt to answer this last question, focusing on the task of supervised machine learning. We will look at several quintessential examples of interpretable models and how they generalize and extend to cover a wide class of machine learning algorithms.
The most commonly cited examples of “interpretable” machine learning methods are linear regression and decision trees. Although these models are not free from interpretation issues (e.g. correlated features for linear models and depth requirements for decision trees), they are widely accepted as simple models whose reasoning process is easily understood. For the linear model, one can add up each feature’s individual contribution to get the final prediction. For the decision tree, one can follow the flow chart leading to the final decision.
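The linear model’s additive reasoning can be sketched in a few lines. This is a minimal illustration, with made-up feature names and coefficients, of how the prediction decomposes into per-feature contributions:

```python
# Minimal sketch of the linear model's additive reasoning; the feature
# names and coefficients here are made up for illustration.
coefficients = {"age": 0.4, "income": 1.2, "tenure": -0.3}
intercept = 2.0

def predict_with_contributions(x):
    # Each feature contributes coefficient * value; the prediction is
    # simply the sum of these contributions plus the intercept.
    contributions = {name: w * x[name] for name, w in coefficients.items()}
    return intercept + sum(contributions.values()), contributions

pred, contribs = predict_with_contributions({"age": 2.0, "income": 1.0, "tenure": 3.0})
```

The `contribs` dictionary spells out exactly how much each feature pushed the prediction up or down, which is the whole interpretive story for a linear model.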
A third method which often gets left out here is the nearest neighbor algorithm. I will argue that the nearest neighbor also makes decisions in a way that is easy to interpret, and that it represents a third type of understandable logic. A test sample is compared, under a fixed distance metric, to all of the training samples (or even better, to learned prototypes), and the test sample is assigned the label of the representative to which it is closest. Although the distance itself is not necessarily well understood, the classification into archetypes provides a sufficient and contrastive rationale (X is closer to A than to B).
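This prototype-style reasoning is easy to make concrete. Below is a minimal sketch, with two invented prototypes, of the contrastive rationale described above:

```python
import math

# Minimal sketch of prototype-based nearest-neighbor reasoning; the two
# prototypes here are invented for illustration.
prototypes = {"A": (0.0, 0.0), "B": (4.0, 4.0)}

def classify(x):
    # Compute the distance to every prototype under a fixed (Euclidean)
    # metric; the rationale is contrastive: x is labeled A because it
    # is closer to prototype A than to prototype B.
    dists = {label: math.dist(x, p) for label, p in prototypes.items()}
    return min(dists, key=dists.get), dists

label, dists = classify((1.0, 1.0))
```

The returned distances are the explanation itself: the label follows directly from which prototype is nearest.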
In this blog post, I will argue that each of these methods has a quintessential reasoning approach which defines one of the “pillars” of interpretability. The linear model will develop into the additive approach, which accumulates evidence over multiple input factors. The decision tree will develop into the logical approach, which uses deduction over a given set of input literals. Finally, the nearest neighbor model will develop into the categorical approach, which codifies each set of inputs into its correct type. Note that additivity most naturally maps continuous inputs to continuous outputs, logic most naturally maps discrete inputs to discrete outputs, and classification most naturally maps continuous inputs to discrete outputs.
The Pillar of Evidence Accumulation or the Additive Pillar is about a reasoning process which additively incorporates different pieces of evidence to come to a final conclusion. In the case of the linear model with independent input variables, this corresponds to the additive influence of each linear coefficient. It is straightforward to generalize these independent influences from linear functions to nonlinear functions of the input variables, leading to additive models [2, 3, 4].
These models generally remain interpretable by restricting interactions between input variables to size two or less. This allows the additive contributions to be plotted as 1D dependence functions or 2D dependence heatmaps. Across many tabular datasets, these models even achieve state-of-the-art performance, matching black-box methods like XGBoost and MLPs. This is especially true when allowing for three-way and higher-order interaction terms; however, this begins to push the boundary of what can be considered fully interpretable.
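A minimal sketch of this additive structure, with invented features and hand-written shape functions (a real additive model would learn these from data), shows why the 1D terms stay interpretable even when they are nonlinear:

```python
import math

# Minimal sketch of an additive (GAM-style) model: the prediction is a
# sum of per-feature shape functions. The feature names and shape
# functions here are invented for illustration, not learned.
shape_functions = {
    "age": lambda v: 0.5 * v,        # a plain linear term
    "dose": lambda v: math.sin(v),   # a nonlinear 1D term
    "bmi": lambda v: -0.1 * v * v,   # another nonlinear 1D term
}

def gam_predict(x):
    # Each term depends on one feature only, so it can be plotted as a
    # 1D curve; that is what keeps the nonlinear model interpretable.
    terms = {name: f(x[name]) for name, f in shape_functions.items()}
    return sum(terms.values()), terms

pred, terms = gam_predict({"age": 2.0, "dose": 0.0, "bmi": 1.0})
```

Each entry of `terms` is one curve’s contribution; adding a pairwise term would correspond to a 2D heatmap rather than a 1D curve.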
For additive models, we would like the individual terms to be as sparse as possible (considering as few factors as possible) and as simple as possible (considering only factors which are themselves easily understood). When these criteria are met, the additive model transparently expresses its predictions in terms of the input variables. Another key requirement for ease of interpretation, however, is the independence of these terms. When the factors are independent, we can easily interpret the evidence provided by each of them as a separate additive contribution to the prediction. This interpretation becomes more difficult in the presence of heavily correlated features, raising the question of which feature of a correlated pair is actually driving the output. How to train and interpret additive models in these heavily correlated settings is a key direction of current exploration and an active area of research.
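The correlated-feature ambiguity can be seen in a toy example. Here, with invented data in which one feature exactly duplicates another, two very different attributions produce identical predictions:

```python
# Toy illustration of the correlated-feature ambiguity: when x2 always
# equals x1, very different coefficient pairs yield identical
# predictions, so the per-feature "evidence" is not identified.
def predict(w1, w2, x1, x2):
    return w1 * x1 + w2 * x2

data = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]  # x2 duplicates x1

preds_a = [predict(3.0, 0.0, x1, x2) for x1, x2 in data]  # credit x1 entirely
preds_b = [predict(0.0, 3.0, x1, x2) for x1, x2 in data]  # credit x2 entirely
assert preds_a == preds_b  # the data cannot distinguish the two stories
```

Both models fit the data equally well, yet they tell opposite stories about which feature matters, which is exactly the interpretive difficulty described above.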
The Pillar of Logical Deduction or the Reasoning Pillar is about a logical process which carefully follows a deductive argument to come to a final conclusion. In the case of the decision tree, this is a simple flowchart logic which follows a sequence of logical steps to arrive at a final decision. The necessary and sufficient variables are easy to see by tracing the chain of logic. Extensions within interpretable machine learning include optimal rule lists and optimal sparse decision trees [5, 6], which provide simple decision-making processes that achieve good performance on tabular datasets. These have been further extended into ‘Rashomon sets’ which, instead of providing a single optimal tree, provide a large set of near-optimal trees [7].
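A rule list makes this deductive style concrete. The sketch below uses invented medical-triage conditions and labels; the point is the ordered, first-match-fires structure:

```python
# Minimal sketch of a rule list: an ordered chain of if-then rules read
# top to bottom, where the first matching rule fires. The conditions
# and labels are invented for illustration.
rules = [
    (lambda x: x["temperature"] > 39.0, "urgent"),
    (lambda x: x["age"] > 65 and x["cough"], "refer"),
    (lambda x: True, "routine"),  # default fall-through rule
]

def rule_list_predict(x):
    # The explanation is the rule that fired plus the fact that every
    # earlier rule did not: a short, checkable deductive argument.
    for index, (condition, label) in enumerate(rules):
        if condition(x):
            return label, index

label, fired = rule_list_predict({"temperature": 37.5, "age": 72, "cough": True})
```

The index of the fired rule, together with the failed conditions above it, is a complete and human-checkable justification for the decision.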
The key advantage of a Rashomon set is the ability to reason across many candidate solutions and choose one which obeys a secondary criterion like minimizing unfairness, aligning with domain expertise, or maximizing robustness. Recent approaches also investigate how to make inferences over the entire set of models. Interestingly, this is not the same as an ensemble of decision trees, although there are many similarities. Most notably, we never directly add the outputs of the individual decision trees. This slight nuance in interpretation makes the difference between an interpretable model and an uninterpretable one. In general, ensembles of models are considered uninterpretable (unless the base model is itself an additive model), because the naive combination of the additive and logical approaches does not respect what makes each simple in the first place. A random forest is a sum of factors which are no longer simple or independent of one another, and it combines the decision tree’s cold, calculating logic with heuristic voting in the final step. The simplicity and sparsity of the decision logic is another key aspect of keeping a deduction interpretable.
A perhaps more natural generalization of a decision process would be a general logical function or a boolean circuit; however, these approaches have received little attention in interpretable machine learning. This is because the fields which study such circuits are typically not concerned with keeping the learned functions simple enough to serve as interpretable ML models. Doing so would involve learning circuits for which ‘necessity’ and ‘sufficiency’ queries are easily computable and which minimize the variable context required to complete a computation. This is likely an important direction for future research. Notably, circuits have recently attracted some attention in generative applications, such as the interpretability of “Probabilistic Circuit” models and the explainable ‘circuit finding’ of “Mechanistic Interpretability”.