AI Explainability: Making Black-Box Models Interpretable
Ethics & Society


Key takeaways

  • Explainable AI (XAI) covers techniques for understanding why machine-learning models produce the predictions they do.
  • The problem is acute for deep neural networks, which have hundreds of millions to hundreds of billions of parameters and no obvious human-readable logic.
  • Popular tools include SHAP (feature attribution based on Shapley values) and LIME (local linear approximations).
  • Regulation — EU AI Act, GDPR, fair-lending laws — increasingly requires explanations for high-stakes automated decisions.
  • A newer research area, mechanistic interpretability, aims to reverse-engineer neural network internals to understand how concepts are represented.

Why we need to explain AI

Regulatory bodies, courts, users, and internal stakeholders all want answers to “why did the model decide that?”. Without explanations, AI decisions create legal exposure (adverse-action notices for credit), erode trust (users who don’t know why they were denied), and hide bugs (engineers who can’t tell when the model is going wrong).


The explainability problem is most acute for deep learning. A decision tree with 50 nodes is roughly comprehensible at a glance. A neural network with 300 billion parameters is not — there is no subset of its weights you can read and understand. For the underlying network architecture, see our neural networks primer.

Two kinds of explanation

Global explanations

Describe the model’s overall behaviour. Which features does it rely on most? What kinds of inputs does it handle well or poorly? Global explanations help stakeholders understand what kind of system they’re deploying.

Local explanations

Explain a single prediction. Why was this applicant’s loan denied? Why was this X-ray flagged? Local explanations matter for adverse-action notices, customer support, and debugging specific failures.

Interpretable by design vs. post-hoc explanation

Interpretable models

Some models are inherently understandable — decision trees, linear and logistic regression, small rule-based systems, generalized additive models. If the application is high-stakes and the accuracy trade-off is tolerable, starting with an interpretable model is often the cleanest path.

Post-hoc explanation

For deep neural networks, ensembles, and gradient boosting, you train the model first and then try to explain it. Post-hoc methods do not change the model; they add a layer of interpretation on top.

Feature attribution methods

SHAP (SHapley Additive exPlanations)

SHAP, based on Shapley values from cooperative game theory, attributes each prediction to its contributing features. Its mathematical foundation (Shapley values are the unique attribution satisfying local accuracy, missingness, and consistency) makes it the most rigorous widely used method. TreeSHAP is fast for tree ensembles; KernelSHAP works for any model but is slow. The output is a set of per-feature contributions that sum to the difference between the prediction and a baseline expected value.
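To make the Shapley machinery concrete (this is a from-scratch illustration, not the shap library), here is a brute-force computation of exact Shapley values for a hypothetical three-feature linear scorer, with "missing" features set to a baseline, one common convention for defining coalition values:

```python
from itertools import combinations
from math import factorial

def model(x):
    # Hypothetical scoring model: income + 0.5 * credit_history - debt
    return x[0] + 0.5 * x[1] - x[2]

def shapley_values(model, x, baseline):
    """Exact Shapley values by enumerating every feature coalition.
    Features outside the coalition are replaced by baseline values."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for S in combinations(others, size):
                # Coalition S present, everything else at the baseline
                without_i = [x[j] if j in S else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi[i] += weight * (model(with_i) - model(without_i))
    return phi

x = [3.0, 2.0, 1.0]
base = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, base)
# Local accuracy: attributions sum to f(x) - f(baseline)
assert abs(sum(phi) - (model(x) - model(base))) < 1e-9
```

The enumeration is exponential in the number of features, which is exactly why practical tools need approximations like KernelSHAP or the tree-specific TreeSHAP algorithm.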

LIME (Local Interpretable Model-agnostic Explanations)

LIME explains individual predictions by fitting a simple linear model to the region around the point of interest. It perturbs the input, sees how the model’s prediction changes, and approximates the local behaviour with a linear model whose coefficients are the explanation. Fast and intuitive, though sometimes unstable — small changes in perturbation sampling produce different explanations.
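The perturb-and-fit loop can be sketched in a few lines of NumPy. This is a simplification of what the lime library does; the kernel width, sampling scheme, and toy model here are illustrative choices:

```python
import numpy as np

def black_box(X):
    # Hypothetical nonlinear model we want to explain locally
    return np.sin(X[:, 0]) + X[:, 1] ** 2

def lime_explain(predict, x, n_samples=5000, width=0.1, seed=0):
    """LIME-style local explanation: sample perturbations near x,
    weight them by proximity, and fit a linear surrogate whose
    coefficients approximate the model's local behaviour."""
    rng = np.random.default_rng(seed)
    X = x + rng.normal(scale=width, size=(n_samples, len(x)))
    y = predict(X)
    # Exponential proximity kernel: nearby samples count more
    w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * width ** 2))
    # Weighted least squares with an intercept column
    A = np.hstack([X - x, np.ones((n_samples, 1))])
    sw = np.sqrt(w)[:, None]
    coef, *_ = np.linalg.lstsq(A * sw, y * sw.ravel(), rcond=None)
    return coef[:-1]  # local slope per feature = the explanation

x = np.array([0.0, 1.0])
slopes = lime_explain(black_box, x)
# Near x, d/dx0 sin(x0) = cos(0) = 1 and d/dx1 x1^2 = 2*x1 = 2
```

The instability mentioned above is visible here: change the seed, sample count, or kernel width and the recovered slopes shift slightly, because the explanation depends on how the neighbourhood was sampled.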

Integrated gradients

For differentiable models (neural networks), integrated gradients attribute predictions by integrating the gradient of the output with respect to the input along a path from a baseline to the actual input. Widely used for vision and text models.
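A minimal NumPy sketch of the method, assuming a toy differentiable function with a hand-written gradient (real implementations compute gradients via an autodiff framework):

```python
import numpy as np

def f(x):
    # Hypothetical differentiable model
    return x[0] ** 2 + 3.0 * x[1]

def grad_f(x):
    # Analytic gradient of f
    return np.array([2.0 * x[0], 3.0])

def integrated_gradients(grad, x, baseline, steps=200):
    """Riemann-sum (midpoint) approximation of
    (x - baseline) * integral_0^1 grad(baseline + a*(x - baseline)) da."""
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x)
    for a in alphas:
        total += grad(baseline + a * (x - baseline))
    avg_grad = total / steps
    return (x - baseline) * avg_grad

x = np.array([2.0, 1.0])
base = np.zeros(2)
attr = integrated_gradients(grad_f, x, base)
# Completeness axiom: attributions sum to f(x) - f(baseline)
```

The completeness property (attributions summing to the difference between the output at the input and at the baseline) is what makes the resulting numbers read as a budget of the prediction.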

Permutation importance

Shuffle each feature’s values and measure how much accuracy drops. Simple, model-agnostic, good for global feature importance. Can be misleading when features are correlated.
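A sketch of the procedure, using a hypothetical classifier that only looks at one feature so the expected result is obvious:

```python
import numpy as np

def accuracy(predict, X, y):
    return np.mean(predict(X) == y)

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Per-feature importance = mean accuracy drop after permuting that
    feature's column, which breaks its relationship with the target."""
    rng = np.random.default_rng(seed)
    base = accuracy(predict, X, y)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])
            drops.append(base - accuracy(predict, Xp, y))
        importances[j] = np.mean(drops)
    return importances

# Hypothetical classifier that uses only feature 0
predict = lambda X: (X[:, 0] > 0).astype(int)
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)
imp = permutation_importance(predict, X, y)
# Feature 0 scores high; features 1 and 2 score exactly zero
```

The correlated-features caveat is easy to see from the mechanics: permuting one of two correlated columns creates input combinations the model never saw, so the measured drop can over- or understate that feature's real contribution.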

Attention visualization

For transformer models, attention weights show which input tokens each output position attends to. Visualizing attention gives a qualitative sense of what the model “looks at”. There is a caveat: attention is not a faithful explanation. The 2019 paper “Attention is not Explanation” by Sarthak Jain and Byron Wallace argued that attention weights do not always reflect the true information flow in a model. Treat attention visualizations as heuristics, not ground truth.
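For intuition, here is the scaled dot-product computation that produces the weights such visualizations plot, run on hypothetical toy embeddings (real transformers apply this per head, with learned query/key projections):

```python
import numpy as np

def attention_weights(query, keys):
    """Scaled dot-product attention weights for one query position over
    a sequence of key vectors; these are the numbers a heatmap plots."""
    d = query.shape[-1]
    scores = keys @ query / np.sqrt(d)
    e = np.exp(scores - scores.max())  # numerically stable softmax
    return e / e.sum()

# Hypothetical 4-dim embeddings for three input tokens
keys = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0, 0.0]])
q = np.array([1.0, 0.0, 0.0, 0.0])
w = attention_weights(q, keys)
# w sums to 1; tokens 0 and 2 (similar to the query) get the most weight
```

The weights always form a distribution over tokens, which is what makes them so visually appealing, and also why they can mislead: a distribution over positions says nothing about what the model does with the attended values downstream.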

Counterfactual explanations

Instead of saying “the model used these features”, counterfactuals say “if you had this instead of that, the decision would flip”. “Your loan was denied; with an income of $55,000 instead of $48,000, it would have been approved.” Counterfactuals are useful for users (actionable) and increasingly required in regulated contexts (GDPR’s “right to explanation” debate).
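A toy sketch of counterfactual search over a single feature, assuming a hypothetical threshold-based lending rule (real counterfactual methods search over many features and add constraints like plausibility and actionability):

```python
def approve(income, debt):
    # Hypothetical lending rule: approve when income - 0.8*debt clears a bar
    return income - 0.8 * debt >= 50_000

def income_counterfactual(income, debt, step=500, max_steps=200):
    """Smallest income increase (in fixed steps) that flips a denial to
    an approval, holding all other features fixed."""
    if approve(income, debt):
        return income  # already approved, nothing to change
    for k in range(1, max_steps + 1):
        candidate = income + k * step
        if approve(candidate, debt):
            return candidate
    return None  # no flip found within the search budget

cf = income_counterfactual(income=48_000, debt=5_000)
# cf is the income at which the decision would flip, e.g. for an
# adverse-action-style message: "with an income of cf, approved"
```

Even this toy version shows why counterfactuals are attractive for users: the output is an actionable statement about the input, not a vector of attribution scores.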

Mechanistic interpretability

A younger, more ambitious research agenda. Instead of explaining model outputs, mechanistic interpretability tries to understand the internal computations — which neurons encode which concepts, how features combine, what circuits implement specific capabilities. Work at Anthropic, OpenAI, and academic labs has identified features corresponding to interpretable concepts (“the concept of a specific city”, “writing in formal English”), and demonstrated intervention experiments that steer model behaviour by editing these features.

Mechanistic interpretability is still early research but holds potential long-term promise for AI safety — understanding what a model “believes” before deploying it. It is also compute-intensive and has not yet scaled to frontier model sizes with full coverage.

Regulatory drivers

EU AI Act

High-risk AI systems must provide clear information to users and deployers about capabilities, limitations, and intended purpose. Conformity assessments include transparency and traceability requirements.

GDPR

Article 22 gives EU residents the right not to be subject to decisions based solely on automated processing, with a contested “right to explanation” that courts are still interpreting. Meaningful information about the logic involved in automated decisions is required in several contexts.

Fair-lending laws

US ECOA and FCRA require adverse-action notices giving specific reasons for credit denials. Model explanations feed directly into generating valid notices, whether using SHAP-based feature attribution or other methods.

Sector-specific rules

Healthcare, insurance, and employment all have increasing requirements for documented decision logic. FDA’s AI/ML-enabled device guidance calls for descriptions of the model’s decision processes. See our AI bias coverage for the fairness and discrimination angle.

Pitfalls and limitations

Explanation fidelity

An explanation may misrepresent the underlying model. Post-hoc methods produce something that looks reasonable but does not necessarily correspond to the model’s actual internal logic. This gap is particularly large for deep networks.

User misinterpretation

Even accurate explanations can be misunderstood. Users treating SHAP values as causal explanations, or believing that a low-attribution feature has no effect, misread what the math actually says.

Explanation manipulation

Research has shown that adversarial perturbations can change SHAP or LIME explanations while leaving the model’s prediction unchanged. This is a concern where explanations are regulatory artifacts or build user trust.

Cost and integration

Computing explanations for every prediction is expensive. Systems at scale often explain only a sampled subset or cache explanations for typical cases. Integrating explanations into user-facing UIs and workflows adds engineering cost.

Practical guidance

For most teams deploying ML:

  • Use inherently interpretable models where the accuracy trade-off is acceptable.
  • For complex models, implement SHAP or a comparable method.
  • For user-facing decisions, generate natural-language summaries grounded in the numeric explanations.
  • Audit explanations periodically for stability and fidelity.
  • Document explainability choices in model cards so stakeholders know what kind of explanation is produced and its limitations.

For the broader ML context, see our machine learning primer.
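As an illustration of the natural-language-summary step, here is a minimal sketch that turns numeric attributions (such as SHAP values) into a user-facing sentence; the feature names and phrasing are invented for the example:

```python
def summarize_attributions(attrs, top_k=2):
    """Turn a dict of per-feature attributions into a short sentence
    naming the strongest drivers, signed by direction of effect."""
    ranked = sorted(attrs.items(), key=lambda kv: abs(kv[1]), reverse=True)
    parts = []
    for name, value in ranked[:top_k]:
        direction = "raised" if value > 0 else "lowered"
        parts.append(f"{name} {direction} the score by {abs(value):.2f}")
    return "Main factors: " + "; ".join(parts) + "."

# Hypothetical attributions for one denied application
attrs = {"income": -1.4, "credit_history": 0.6, "debt_ratio": -0.2}
summary = summarize_attributions(attrs)
```

Keeping the summary mechanically derived from the numeric explanation, rather than free-form generated text, makes it auditable: every sentence traces back to a specific attribution value.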

Frequently asked questions

Is an “explainable” model always the right choice?
Not always. Accuracy matters too, and sometimes the most accurate model is the hardest to explain. The right choice depends on the use case: for high-stakes or regulated decisions, interpretability may be a hard requirement; for low-stakes recommendations, a more opaque but accurate model may be fine. There is no universal answer; the question is whether the benefit of an explanation outweighs the accuracy you give up to get it.

Are SHAP and LIME reliable?
They are useful but have known limitations. SHAP is more mathematically grounded but slower and sensitive to assumptions about feature independence. LIME is faster but less stable. Both can be manipulated adversarially and both can mislead when features are correlated. Use them to form hypotheses and investigate further, not to give users definitive answers.

Can I explain a large language model’s output?
Partially. Attention visualization, activation analysis, and influence-function methods give hints about which training data or which parts of the prompt drove an output. Full causal explanations — “this token was produced because of these specific weights in these specific layers” — are beyond current interpretability tools for frontier-scale models. Expect the field to make progress; expect full LLM explainability to remain out of reach for some time.

Digital Mind News

Digital Mind News is an AI-operated newsroom. Every article here is synthesized from multiple trusted external sources by our automated pipeline, then checked before publication. We disclose our AI authorship openly because transparency is part of the product.