arxiv: 2505.11772 · v3 · submitted 2025-05-17 · 💻 cs.LG

LAMP: Extracting Local Decision Surfaces From Large Language Models

Pith reviewed 2026-05-22 15:01 UTC · model grok-4.3

classification 💻 cs.LG
keywords LAMP · local decision surfaces · large language models · self-reported explanations · linear surrogate · black-box auditing · explanation quality · model consistency

The pith

LAMP approximates language models' local decision surfaces by fitting linear surrogates to their self-reported explanations, showing alignment with human and expert judgments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LAMP as a method to shine light on black-box language models by treating their self-reported explanations as a coordinate system and fitting a locally linear surrogate that connects those explanations to the model's outputs. This reveals the extent to which stated reasons actually steer predictions, all without needing gradients, logits, or internal activations. The approach is tested on sentiment analysis, controversial-topic detection, and safety-prompt auditing, where the resulting decision landscapes agree with human judgments on explanation quality. On a clinical case-file dataset the approximations also align with expert assessments, supporting its use for auditing proprietary models.

Core claim

LAMP treats the model's own self-reported explanations as a coordinate system and fits a locally linear surrogate that links those weights to the model's output. By doing so, it reveals how much the stated factors steer the model's decisions. Across three tasks, LAMP reveals that many language models' locally approximated linear decision landscapes overall agree with human judgments on explanation quality and, on a clinical case-file data set, align with expert assessments.

What carries the argument

LAMP (Local Attribution Mapping Probe), a lightweight probe that uses self-reported explanations as coordinates to fit a locally linear surrogate approximating the model's decision surface.
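
As a minimal sketch of what fitting a locally linear surrogate could look like: the summary above does not specify the regression, the neighborhood, or the regularization, so the ridge penalty, the data layout, and every name below (`fit_local_surrogate`, `weights`, `outputs`, `l2`) are illustrative assumptions, and the data is synthetic.

```python
import numpy as np

def fit_local_surrogate(weights, outputs, l2=1e-3):
    """Fit outputs ~ weights @ coef + intercept by ridge regression.

    weights: (n_samples, n_factors) self-reported explanation weights
             gathered around one input (hypothetical layout).
    outputs: (n_samples,) scalar model outputs for the same samples.
    l2:      ridge strength; an assumed value, not one from the paper.
    """
    X = np.asarray(weights, dtype=float)
    y = np.asarray(outputs, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    # Centering both sides keeps the intercept out of the penalty.
    Xc, yc = X - x_mean, y - y_mean
    gram = Xc.T @ Xc + l2 * np.eye(X.shape[1])
    coef = np.linalg.solve(gram, Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept

# Synthetic stand-in for (explanation weights, model output) pairs.
rng = np.random.default_rng(0)
true_coef = np.array([0.8, -0.5, 0.1])
W = rng.uniform(0.0, 1.0, size=(40, 3))
y = W @ true_coef + 0.2 + rng.normal(0.0, 0.01, size=40)

coef, intercept = fit_local_surrogate(W, y)
```

Under these assumptions, each entry of `coef` estimates how strongly one stated factor moves the output within the sampled neighborhood.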

If this is right

  • Enables practical auditing of proprietary language models without access to gradients or internal states.
  • Demonstrates consistency between stated explanations and actual local decisions in the tested tasks.
  • Provides a framework for checking whether models behave in line with the reasons they provide.
  • Supports assessment on real-world data such as clinical case files matching expert views.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be extended to flag cases where models generate explanations that do not match their internal decision patterns.
  • Comparing LAMP surfaces across different model families might expose architecture-specific differences in explanation faithfulness.
  • Testing the linear surrogate on prompts that deliberately contradict the model's stated reasons would provide a direct check on its reliability.

Load-bearing premise

The method assumes that a model's self-reported explanations form a sufficient and stable coordinate system for locally approximating its true decision surface rather than reflecting artifacts of the explanation generation process.

What would settle it

If the fitted linear surrogate fails to predict the model's actual outputs on new examples that vary the explanation factors, or if human and expert raters systematically disagree with the importance weights produced by the surrogate, the approximation would be shown to be invalid.
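
The first of these checks can be sketched as a held-out fit test. This is an illustration on synthetic data, not the paper's protocol: the split size, the plain least-squares fit, and the names `r_squared` and `held_out_r2` are assumptions.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination of predictions against observed outputs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def held_out_r2(weights, outputs, n_train):
    """Fit least squares on the first n_train samples, score on the rest."""
    y = np.asarray(outputs, dtype=float)
    # Append a column of ones so the fit includes an intercept.
    X = np.column_stack([np.asarray(weights, dtype=float), np.ones(len(y))])
    beta, *_ = np.linalg.lstsq(X[:n_train], y[:n_train], rcond=None)
    return r_squared(y[n_train:], X[n_train:] @ beta)

rng = np.random.default_rng(1)
W = rng.uniform(0.0, 1.0, size=(60, 3))
# Case A: outputs really do follow the stated factors.
linear_y = W @ np.array([0.8, -0.5, 0.1]) + rng.normal(0.0, 0.02, size=60)
# Case B: outputs ignore the stated factors entirely.
unrelated_y = rng.normal(0.0, 1.0, size=60)

r2_linear = held_out_r2(W, linear_y, n_train=40)
r2_unrelated = held_out_r2(W, unrelated_y, n_train=40)
```

A held-out score near 1 is what a valid local approximation would produce; a score near or below 0, as in case B, is the failure signature described above.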

Original abstract

We introduce LAMP (Local Attribution Mapping Probe), a method that shines light onto a black-box language model's decision surface and studies how reliably a model maps its stated reasons to its reported predictions by approximating a decision surface. LAMP treats the model's own self-reported explanations as a coordinate system and fits a locally linear surrogate that links those weights to the model's output. By doing so, it reveals how much the stated factors steer the model's decisions. We apply LAMP to three tasks: sentiment analysis, controversial-topic detection, and safety-prompt auditing. Across these tasks, LAMP reveals that many language models' locally approximated linear decision landscapes overall agree with human judgments on explanation quality and, on a clinical case-file data set, align with expert assessments. Since LAMP operates without requiring access to model gradients, logits, or internal activations, it serves as a practical and lightweight framework for auditing proprietary language models, and enabling assessment of whether a model appears to behave consistently with the explanations it provides.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces LAMP (Local Attribution Mapping Probe), a method that approximates a language model's local decision surface by treating its self-reported explanations as a coordinate system and fitting a locally linear surrogate linking explanation weights to the model's output. The method is applied to sentiment analysis, controversial-topic detection, and safety-prompt auditing, and separately to a clinical case-file dataset. The paper claims that the resulting linear landscapes agree with human judgments on explanation quality and, on the clinical data, align with expert assessments, all without requiring gradients, logits, or internal activations.

Significance. If the central assumption holds, LAMP would provide a lightweight, gradient-free tool for auditing proprietary LLMs by checking consistency between stated explanations and predictions. This could support interpretability research and practical auditing in domains like clinical decision support, with the no-internal-access property as a clear practical strength.

major comments (3)
  1. [Experiments] The central validation relies on agreement with human and expert judgments, but the manuscript does not include controls or ablations that test whether the fitted linear surrogate recovers factors that actually drove the prediction versus post-hoc rationalizations generated by the same model. This directly affects the claim that LAMP approximates the true local decision surface.
  2. [Method] The local linear fitting procedure is described at a high level with no equations for the regression, definition of the local neighborhood, choice of regularization, or error analysis. Without these details the reproducibility of the reported alignments cannot be evaluated.
  3. [Clinical evaluation] On the clinical case-file dataset, alignment with expert assessments is presented as corroboration, yet no quantitative metrics (e.g., correlation coefficients, inter-rater agreement baselines, or comparison to non-explanation controls) are supplied to substantiate the strength of this alignment.
minor comments (2)
  1. [Abstract] The abstract states results 'overall agree' with human judgments but supplies no numerical values, confidence intervals, or statistical tests; these should be added for precision.
  2. [Method] Notation for the explanation coordinates and the fitted coefficients should be introduced consistently with an equation or table to avoid ambiguity when describing the surrogate.
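
One possible notation for the surrogate, offered as an assumption rather than as the paper's own formulation: let $w_i \in \mathbb{R}^d$ be the self-reported explanation weights for the $i$-th sample in a neighborhood $\mathcal{N}(x)$ of an input $x$, let $y_i$ be the model's output for that sample, and let $\beta \in \mathbb{R}^d$ and $b$ be the fitted coefficients and intercept. The ridge penalty with strength $\lambda \ge 0$ is one common choice of regularization, not one the manuscript is reported to state.

```latex
\hat{y}_i = \beta^\top w_i + b,
\qquad
(\hat{\beta}, \hat{b}) = \arg\min_{\beta,\, b}
  \sum_{i \in \mathcal{N}(x)} \bigl( y_i - \beta^\top w_i - b \bigr)^2
  + \lambda \lVert \beta \rVert_2^2 .
```

Fixing symbols of this kind once, in an equation or table, would address both the reproducibility comment and the notation comment.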

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to improve methodological transparency, add necessary controls and metrics, and strengthen the validation claims where feasible.

Point-by-point responses
  1. Referee: [Experiments] The central validation relies on agreement with human and expert judgments, but the manuscript does not include controls or ablations that test whether the fitted linear surrogate recovers factors that actually drove the prediction versus post-hoc rationalizations generated by the same model. This directly affects the claim that LAMP approximates the true local decision surface.

    Authors: We agree that distinguishing causal drivers from post-hoc rationalizations is a substantive concern for any explanation-based method. Because LAMP is explicitly designed for black-box auditing without internal access, direct causal interventions (e.g., feature ablation inside the model) are unavailable. To address this, we will add an ablation in the revised experiments section that compares the linear surrogate fit on the model's native explanations against fits on randomized or permuted explanation weights. A significant drop in alignment with human judgments under randomization would indicate that the surrogate captures non-arbitrary structure. We will also explicitly discuss this limitation and its implications for the 'true decision surface' claim. revision: yes

  2. Referee: [Method] The local linear fitting procedure is described at a high level with no equations for the regression, definition of the local neighborhood, choice of regularization, or error analysis. Without these details the reproducibility of the reported alignments cannot be evaluated.

    Authors: We accept that the current description lacks the necessary mathematical detail for reproducibility. In the revised manuscript we will add a dedicated methods subsection containing: (1) the explicit locally linear regression objective (including the loss function and L2 regularization term), (2) the precise definition of the local neighborhood (via explanation embedding similarity with a cosine threshold or a fixed number of perturbations), (3) the regularization parameter selection procedure (cross-validation on held-out local samples), and (4) quantitative error analysis (R^2, MSE, and residual diagnostics) reported per task. These additions will allow independent reproduction of the reported alignments. revision: yes

  3. Referee: [Clinical evaluation] On the clinical case-file dataset, alignment with expert assessments is presented as corroboration, yet no quantitative metrics (e.g., correlation coefficients, inter-rater agreement baselines, or comparison to non-explanation controls) are supplied to substantiate the strength of this alignment.

    Authors: We will strengthen the clinical evaluation by adding the requested quantitative metrics. The revised section will report Pearson and Spearman correlations between LAMP attribution weights and expert importance ratings, inter-rater agreement among the experts (e.g., intraclass correlation coefficient), and a control comparison against random or non-explanation feature attributions to establish that the observed alignment exceeds chance. These numbers will be presented alongside the existing qualitative description. revision: yes
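
The rank correlation and chance-level control named in the responses above could be computed along these lines. This is a hedged sketch: the numbers are made up for illustration and are not results from the paper, and the function names and the permutation scheme are assumptions.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation between two equal-length vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.corrcoef(a, b)[0, 1])

def ranks(values):
    """Zero-based ranks; assumes no ties (ties would need average ranks)."""
    return np.argsort(np.argsort(np.asarray(values)))

def spearman(a, b):
    """Spearman rank correlation, computed as Pearson on the ranks."""
    return pearson(ranks(a), ranks(b))

def permutation_p_value(attributions, expert_ratings, n_perm=2000, seed=0):
    """Share of shuffled attributions that align at least as well as the real ones."""
    rng = np.random.default_rng(seed)
    observed = spearman(attributions, expert_ratings)
    hits = sum(
        spearman(rng.permutation(attributions), expert_ratings) >= observed
        for _ in range(n_perm)
    )
    # Add-one correction keeps the estimate strictly positive.
    return (1 + hits) / (1 + n_perm)

# Made-up values: surrogate attributions and expert importance ratings
# for eight hypothetical factors.
attributions = np.array([0.62, 0.10, 0.35, 0.05, 0.48, 0.27, 0.81, 0.19])
expert_ratings = np.array([4.5, 1.0, 3.0, 1.5, 4.0, 2.0, 5.0, 2.5])

rho = spearman(attributions, expert_ratings)
p_value = permutation_p_value(attributions, expert_ratings)
```

Shuffling the attributions breaks any link to the expert ratings while preserving their distribution, so a small `p_value` indicates alignment beyond what arbitrary weights would achieve. It does not by itself separate causal drivers from post-hoc rationalizations.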

Circularity Check

0 steps flagged

No significant circularity in LAMP method definition or validation

Full rationale

The paper introduces LAMP as a practical auditing technique that selects the model's self-reported explanations as input coordinates and performs a standard local linear regression to produce a surrogate model linking those coordinates to the output prediction. This construction is explicit and does not claim to derive the decision surface from first principles or to predict a held-out quantity that is statistically forced by the fit itself. Validation of the resulting surrogate relies on separate external human judgments of explanation quality and expert clinical assessments rather than on any internal self-consistency metric or re-use of the same fitted values. No equations, uniqueness theorems, or self-citations are invoked to force the central claims; the method remains a lightweight, gradient-free probe whose outputs are independently falsifiable against human data. The derivation chain is therefore self-contained and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the unstated premise that explanations generated by the model can be treated as an independent basis for reconstructing its decision function locally; the linear surrogate introduces fitted coefficients whose values are determined by the model's outputs on the chosen neighborhood.

free parameters (1)
  • local linear coefficients
    Weights in the surrogate model that link each explanation factor to the output prediction; these are fitted per local neighborhood.
axioms (1)
  • domain assumption: Self-reported explanations provide a sufficiently faithful and complete coordinate system for the model's local decision surface.
    Invoked when the method defines the input space for the surrogate fit.

pith-pipeline@v0.9.0 · 5723 in / 1262 out tokens · 47625 ms · 2026-05-22T15:01:59.852548+00:00 · methodology
