LAMP: Extracting Local Decision Surfaces From Large Language Models
Pith reviewed 2026-05-22 15:01 UTC · model grok-4.3
The pith
LAMP approximates language models' local decision surfaces by fitting linear surrogates to their self-reported explanations, showing alignment with human and expert judgments.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LAMP treats the model's own self-reported explanations as a coordinate system and fits a locally linear surrogate that links those weights to the model's output. By doing so, it reveals how much the stated factors steer the model's decisions. Across three tasks, LAMP reveals that many language models' locally approximated linear decision landscapes overall agree with human judgments on explanation quality and, on a clinical case-file data set, align with expert assessments.
What carries the argument
LAMP (Local Attribution Mapping Probe), a lightweight probe that uses self-reported explanations as coordinates to fit a locally linear surrogate approximating the model's decision surface.
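As a rough illustration of the idea above, the following sketch fits an ordinary least-squares surrogate from explanation weights to model outputs. It is not the paper's implementation: the function names, the toy data, and the choice of plain least squares with an intercept are all assumptions made for illustration, since this page quotes only the form ŷ = X β̂ and gives no fitting details.

```python
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: not enough variation in the samples")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution on the triangular system
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_linear_surrogate(weights: Sequence[Sequence[float]],
                         outputs: Sequence[float]) -> List[float]:
    """Least-squares fit; returns [intercept, coef_1, ..., coef_k]."""
    rows = [[1.0] + list(w) for w in weights]  # prepend intercept column
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, outputs)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical local samples: each row holds the weights a model assigned
# to two stated factors; outputs stand in for the model's scores.
explanation_weights = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.7, 0.2], [0.1, 0.3]]
model_outputs = [0.1 + 0.7 * a - 0.2 * b for a, b in explanation_weights]

beta = fit_linear_surrogate(explanation_weights, model_outputs)
print([round(v, 6) for v in beta])
```

The toy outputs are generated from a known linear rule, so the fit should recover that rule. With real model outputs the coefficients would only approximate the local behaviour, and their interpretation depends on the premise discussed below.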
If this is right
- Enables practical auditing of proprietary language models without access to gradients or internal states.
- Demonstrates consistency between stated explanations and actual local decisions in the tested tasks.
- Provides a framework for checking whether models behave in line with the reasons they provide.
- Supports assessment on real-world data such as clinical case files matching expert views.
Where Pith is reading between the lines
- The method could be extended to flag cases where models generate explanations that do not match their internal decision patterns.
- Comparing LAMP surfaces across different model families might expose architecture-specific differences in explanation faithfulness.
- Testing the linear surrogate on prompts that deliberately contradict the model's stated reasons would provide a direct check on its reliability.
Load-bearing premise
The method assumes that a model's self-reported explanations form a sufficient and stable coordinate system for locally approximating its true decision surface rather than reflecting artifacts of the explanation generation process.
What would settle it
If the fitted linear surrogate fails to predict the model's actual outputs on new examples that vary the explanation factors, or if human and expert raters systematically disagree with the importance weights produced by the surrogate, the approximation would be shown to be invalid.
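The first test described above can be sketched as a held-out check: score the surrogate's predictions against the model's actual outputs on new examples. This is a minimal sketch under stated assumptions. The coefficients, the new examples, and the helper names are hypothetical, and this page does not say which goodness-of-fit measure or threshold the paper would use; R² is chosen here only because it is a common one.

```python
from typing import Sequence


def predict(beta: Sequence[float], weights: Sequence[float]) -> float:
    """Surrogate prediction: intercept plus weighted sum of explanation weights."""
    return beta[0] + sum(b * w for b, w in zip(beta[1:], weights))


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination of predictions against actual outputs."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("actual outputs are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


# Hypothetical fitted surrogate: [intercept, coef for factor 1, coef for factor 2].
beta = [0.1, 0.7, -0.2]

# Hypothetical new examples that vary the two explanation factors.
new_weights = [[0.3, 0.6], [0.8, 0.1], [0.4, 0.4]]
predicted = [predict(beta, w) for w in new_weights]

# Case 1: the model's actual outputs follow the surrogate.
consistent_outputs = list(predicted)
# Case 2: the model's actual outputs disagree with the surrogate
# (simulated here by swapping the first two outputs).
inconsistent_outputs = [predicted[1], predicted[0], predicted[2]]

r2_consistent = r_squared(consistent_outputs, predicted)
r2_inconsistent = r_squared(inconsistent_outputs, predicted)
print(r2_consistent, r2_inconsistent)
```

A held-out R² near 1 would be compatible with the local approximation, while a low or negative value on examples that vary the stated factors is the kind of failure that would count against it.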
Original abstract
We introduce LAMP (Local Attribution Mapping Probe), a method that shines light onto a black-box language model's decision surface and studies how reliably a model maps its stated reasons to its reported predictions by approximating a decision surface. LAMP treats the model's own self-reported explanations as a coordinate system and fits a locally linear surrogate that links those weights to the model's output. By doing so, it reveals how much the stated factors steer the model's decisions. We apply LAMP to three tasks: sentiment analysis, controversial-topic detection, and safety-prompt auditing. Across these tasks, LAMP reveals that many language models' locally approximated linear decision landscapes overall agree with human judgments on explanation quality and, on a clinical case-file data set, align with expert assessments. Since LAMP operates without requiring access to model gradients, logits, or internal activations, it serves as a practical and lightweight framework for auditing proprietary language models, and enabling assessment of whether a model appears to behave consistently with the explanations it provides.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LAMP (Local Attribution Mapping Probe), a method that approximates a language model's local decision surface by treating its self-reported explanations as a coordinate system and fitting a locally linear surrogate linking explanation weights to the model's output. The method is applied to sentiment analysis, controversial-topic detection, and safety-prompt auditing, and is additionally evaluated on a clinical case-file dataset. The paper claims that the resulting linear landscapes agree with human judgments on explanation quality and align with expert assessments on the clinical data, all without requiring gradients, logits, or internal activations.
Significance. If the central assumption holds, LAMP would provide a lightweight, gradient-free tool for auditing proprietary LLMs by checking consistency between stated explanations and predictions. This could support interpretability research and practical auditing in domains like clinical decision support, with the no-internal-access property as a clear practical strength.
major comments (3)
- [Experiments] The central validation relies on agreement with human and expert judgments, but the manuscript does not include controls or ablations that test whether the fitted linear surrogate recovers factors that actually drove the prediction versus post-hoc rationalizations generated by the same model. This directly affects the claim that LAMP approximates the true local decision surface.
- [Method] The local linear fitting procedure is described at a high level with no equations for the regression, definition of the local neighborhood, choice of regularization, or error analysis. Without these details the reproducibility of the reported alignments cannot be evaluated.
- [Clinical evaluation] On the clinical case-file dataset, alignment with expert assessments is presented as corroboration, yet no quantitative metrics (e.g., correlation coefficients, inter-rater agreement baselines, or comparison to non-explanation controls) are supplied to substantiate the strength of this alignment.
minor comments (2)
- [Abstract] The abstract states results 'overall agree' with human judgments but supplies no numerical values, confidence intervals, or statistical tests; these should be added for precision.
- [Method] Notation for the explanation coordinates and the fitted coefficients should be introduced consistently with an equation or table to avoid ambiguity when describing the surrogate.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to improve methodological transparency, add necessary controls and metrics, and strengthen the validation claims where feasible.
Point-by-point responses
- Referee: [Experiments] The central validation relies on agreement with human and expert judgments, but the manuscript does not include controls or ablations that test whether the fitted linear surrogate recovers factors that actually drove the prediction versus post-hoc rationalizations generated by the same model. This directly affects the claim that LAMP approximates the true local decision surface.
  Authors: We agree that distinguishing causal drivers from post-hoc rationalizations is a substantive concern for any explanation-based method. Because LAMP is explicitly designed for black-box auditing without internal access, direct causal interventions (e.g., feature ablation inside the model) are unavailable. To address this, we will add an ablation in the revised experiments section that compares the linear surrogate fit on the model's native explanations against fits on randomized or permuted explanation weights. A significant drop in alignment with human judgments under randomization would indicate that the surrogate captures non-arbitrary structure. We will also explicitly discuss this limitation and its implications for the 'true decision surface' claim.
  Revision: yes
- Referee: [Method] The local linear fitting procedure is described at a high level with no equations for the regression, definition of the local neighborhood, choice of regularization, or error analysis. Without these details the reproducibility of the reported alignments cannot be evaluated.
  Authors: We accept that the current description lacks the necessary mathematical detail for reproducibility. In the revised manuscript we will add a dedicated methods subsection containing: (1) the explicit locally linear regression objective (including the loss function and L2 regularization term), (2) the precise definition of the local neighborhood (via explanation embedding similarity with a cosine threshold or a fixed number of perturbations), (3) the regularization parameter selection procedure (cross-validation on held-out local samples), and (4) quantitative error analysis (R², MSE, and residual diagnostics) reported per task. These additions will allow independent reproduction of the reported alignments.
  Revision: yes
- Referee: [Clinical evaluation] On the clinical case-file dataset, alignment with expert assessments is presented as corroboration, yet no quantitative metrics (e.g., correlation coefficients, inter-rater agreement baselines, or comparison to non-explanation controls) are supplied to substantiate the strength of this alignment.
  Authors: We will strengthen the clinical evaluation by adding the requested quantitative metrics. The revised section will report Pearson and Spearman correlations between LAMP attribution weights and expert importance ratings, inter-rater agreement among the experts (e.g., intraclass correlation coefficient), and a control comparison against random or non-explanation feature attributions to establish that the observed alignment exceeds chance. These numbers will be presented alongside the existing qualitative description.
  Revision: yes
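The correlation metric and chance-level control promised in the responses above could take roughly the following shape. This is a sketch under stated assumptions, not the authors' analysis: the attribution weights and expert ratings are invented, the function names are hypothetical, and a simple permutation test stands in for whatever control the revised paper ends up using.

```python
import math
import random
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, made 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def permutation_p_value(weights: Sequence[float], ratings: Sequence[float],
                        n_perm: int = 2000, seed: int = 0) -> float:
    """Share of shuffled pairings whose |Spearman| reaches the observed one."""
    rng = random.Random(seed)
    observed = abs(spearman(weights, ratings))
    shuffled = list(ratings)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(spearman(weights, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0


# Invented example: surrogate weights for six factors and expert ratings (1 to 5).
lamp_weights = [0.42, 0.31, 0.12, 0.08, 0.05, 0.02]
expert_ratings = [5, 4, 3, 2, 2, 1]

rho = spearman(lamp_weights, expert_ratings)
p_value = permutation_p_value(lamp_weights, expert_ratings)
print(rho, p_value)
```

Shuffling the expert ratings breaks any real pairing between factors and ratings, so the shuffled correlations approximate what alignment would look like by chance. Inter-rater agreement among experts would need a separate computation that is not shown here.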
Circularity Check
No significant circularity in LAMP method definition or validation
full rationale
The paper introduces LAMP as a practical auditing technique that selects the model's self-reported explanations as input coordinates and performs a standard local linear regression to produce a surrogate model linking those coordinates to the output prediction. This construction is explicit and does not claim to derive the decision surface from first principles or to predict a held-out quantity that is statistically forced by the fit itself. Validation of the resulting surrogate relies on separate external human judgments of explanation quality and expert clinical assessments rather than on any internal self-consistency metric or re-use of the same fitted values. No equations, uniqueness theorems, or self-citations are invoked to force the central claims; the method remains a lightweight, gradient-free probe whose outputs are independently falsifiable against human data. The derivation chain is therefore self-contained and does not reduce to its inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- local linear coefficients
axioms (1)
- domain assumption: Self-reported explanations provide a faithful and complete enough coordinate system for the model's local decision surface.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: fits a linear surrogate model of the form ŷ = X β̂ (Eq. 1); MSE(δ) = (1/36) ‖H‖²_F δ⁴ + σ²/(n δ^d) (Eq. 4); δ* from curvature (Eq. 5)
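For readers who want to see how the quoted error expression trades off its two terms, here is a small worked sketch. Setting the derivative of MSE(δ) with respect to δ to zero gives δ^(d+4) = 9 d σ² / (n ‖H‖²_F). This page does not reproduce the paper's Eq. 5, so this closed form is simply what follows from minimizing the quoted Eq. 4 and may differ from the paper's own expression. The readings of the symbols (‖H‖²_F as a squared Frobenius norm of a curvature matrix, σ² as noise variance, n as sample count, d as dimension) and all numeric values are assumptions.

```python
def mse(delta: float, hessian_fro_sq: float, sigma_sq: float, n: int, d: int) -> float:
    """Quoted Eq. 4: curvature (bias) term plus noise (variance) term."""
    bias_term = hessian_fro_sq / 36.0 * delta ** 4
    variance_term = sigma_sq / (n * delta ** d)
    return bias_term + variance_term


def optimal_delta(hessian_fro_sq: float, sigma_sq: float, n: int, d: int) -> float:
    """Stationary point of mse(): delta^(d+4) = 9 d sigma^2 / (n ||H||_F^2)."""
    return (9.0 * d * sigma_sq / (n * hessian_fro_sq)) ** (1.0 / (d + 4))


# Hypothetical values chosen only to exercise the formula.
params = dict(hessian_fro_sq=4.0, sigma_sq=0.25, n=200, d=3)
delta_star = optimal_delta(**params)

# Numerical sanity check: nearby radii should not give a lower error.
for scale in (0.9, 1.0, 1.1):
    print(scale, mse(delta_star * scale, **params))
```

Under these assumptions a larger neighbourhood increases the curvature term while a smaller one increases the noise term, which is why the minimizing radius depends on both.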
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.