arXiv: 2604.08039 · v2 · submitted 2026-04-09 · cs.CV · cs.AI · cs.LG

LINE: LLM-based Iterative Neuron Explanations for Vision Models

Vladimir Zaigrajew, Michał Piechota, Gaspar Sekula, Przemysław Biecek

Pith reviewed 2026-05-14 22:08 UTC · model grok-4.3

keywords: neuron interpretation, vision models, LLM-based explanations, open-vocabulary labeling, model interpretability, black-box methods, concept discovery, activation maximization

The pith

LINE uses an LLM and text-to-image generator in a closed loop to label neurons in vision models with open-vocabulary concepts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LINE as a training-free iterative method that proposes concepts for individual neurons using a large language model, generates test images with a text-to-image model, and refines the proposals based on measured neuron activations. It operates without internal model access beyond activations and without restricting concepts to a fixed list. The approach improves labeling performance over prior methods and finds many concepts that predefined vocabularies miss. A sympathetic reader would care because accurate neuron labels help reveal how vision models reach decisions and support efforts to make them safer and more understandable.

Core claim

LINE is a novel iterative approach for open-vocabulary concept labeling in vision models that works in a strictly black-box setting by leveraging a large language model to propose and refine concepts and a text-to-image generator to create test images, guided by activation history, achieving state-of-the-art performance across model architectures with AUC improvements of up to 0.11 on ImageNet and 0.05 on Places365 while discovering on average 27% new concepts missed by predefined vocabularies.

What carries the argument

The closed-loop iterative refinement process in which the LLM proposes concepts, the text-to-image generator creates corresponding images, neuron activation is measured, and the history of activations guides the next proposal round.
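
The loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `run_loop`, `Trial`, `propose`, `generate`, and `activate` are hypothetical, the LLM, text-to-image model, and vision model are replaced by toy stand-ins, and scoring a concept by the mean activation over its batch is an assumption made here for simplicity. The default of 10 rounds follows the maximum iteration count mentioned in the Figure 5 caption.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Trial:
    concept: str   # concept text proposed in this round
    score: float   # mean neuron activation over the generated batch (assumed score)


def run_loop(
    propose: Callable[[List[Trial]], str],        # stand-in for the LLM
    generate: Callable[[str], Sequence[object]],  # stand-in for the text-to-image model
    activate: Callable[[object], float],          # black-box readout of one neuron
    n_iters: int = 10,
) -> List[Trial]:
    """Propose -> generate -> measure -> record, repeated n_iters times."""
    history: List[Trial] = []
    for _ in range(n_iters):
        concept = propose(history)                # the proposer sees the full history
        images = generate(concept)                # batch of synthetic images
        acts = [activate(img) for img in images]  # activations only, no gradients
        history.append(Trial(concept, sum(acts) / len(acts)))
    return history


# Toy stand-ins so the sketch runs without any model (all values are made up).
CANDIDATES = ["strawberry", "pomegranate", "red fruit", "blue car"]
TOY_RESPONSE = {"strawberry": 0.6, "pomegranate": 0.5, "red fruit": 0.9, "blue car": 0.1}


def toy_propose(history: List[Trial]) -> str:
    tried = {t.concept for t in history}
    remaining = [c for c in CANDIDATES if c not in tried]
    # Fall back to the best concept so far once every candidate has been tried.
    return remaining[0] if remaining else max(history, key=lambda t: t.score).concept


def toy_generate(concept: str) -> Sequence[str]:
    return [concept] * 4  # an "image" is just its concept string in this toy


def toy_activate(image: str) -> float:
    return TOY_RESPONSE[image]


history = run_loop(toy_propose, toy_generate, toy_activate, n_iters=4)
best = max(history, key=lambda t: t.score)
print(best.concept)
```

Passing the three components in as callables keeps the sketch black-box in the same sense as the method: the loop only ever sees concept strings, images, and scalar activations.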

If this is right

The method supplies a complete generation history that supports evaluation of whether a neuron is polysemantic.
It produces visual explanations that can be compared directly to those from gradient-dependent activation maximization techniques.
It enables concept discovery outside any fixed vocabulary while remaining applicable to multiple vision model architectures.
The black-box requirement means the same procedure can be applied without retraining or gradient access.
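
As a rough illustration of how a generation history might feed a polysemanticity check, the sketch below lists every concept whose score comes close to the best one. The function name `strong_concepts`, the `(concept, score)` history format, and the 0.8 ratio are assumptions made for this example, not the paper's procedure; judging whether the surviving concepts are semantically unrelated is left to the reader.

```python
from typing import Dict, List, Tuple


def strong_concepts(history: List[Tuple[str, float]], ratio: float = 0.8) -> List[str]:
    """Return concepts scoring at least `ratio` times the best score, best first."""
    if not history:
        return []
    # Keep the best score seen for each concept, in case it was tried more than once.
    per_concept: Dict[str, float] = {}
    for concept, score in history:
        per_concept[concept] = max(score, per_concept.get(concept, float("-inf")))
    best = max(per_concept.values())
    if best <= 0:
        return []  # the ratio threshold is only meaningful for positive scores
    kept = [(c, s) for c, s in per_concept.items() if s >= ratio * best]
    kept.sort(key=lambda cs: cs[1], reverse=True)
    return [c for c, _ in kept]


# Made-up history: two unrelated concepts both activate the neuron strongly.
toy_history = [("red fruit", 0.90), ("strawberry", 0.60), ("fire truck", 0.85), ("blue car", 0.10)]
candidates = strong_concepts(toy_history)
print(candidates)
```

If more than one unrelated concept survives the threshold, the neuron is a candidate for closer inspection; a single survivor suggests a narrower response.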

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The generation history could be used to measure how narrowly or broadly a neuron responds across successive concept refinements.
If the generated images reliably stand in for real data, the approach could be extended to audit neurons on datasets where human-labeled concepts are unavailable.
The iterative loop might be adapted to other modalities such as audio or text models by swapping the image generator for an appropriate modality-specific generator.

Load-bearing premise

Images generated by the text-to-image model activate the target neuron in a manner representative of how the neuron would respond to real-world instances of the proposed concept.

What would settle it

Measure neuron activations on a large collection of real photographs that have been independently labeled with the exact concepts proposed by LINE and compare those activation values to the activations on the synthetic images produced for the same concepts.
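
One way such a comparison could be summarized is a correlation between per-concept mean activations on synthetic and on real images. The sketch below assumes the activations have already been collected into two dictionaries keyed by concept; the function names, the choice of Pearson correlation, and the toy numbers are all illustrative assumptions rather than anything specified by the paper.

```python
import math
from typing import Dict, List, Sequence


def mean(xs: Sequence[float]) -> float:
    return sum(xs) / len(xs)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def synthetic_real_agreement(
    synthetic: Dict[str, List[float]],  # concept -> activations on generated images
    real: Dict[str, List[float]],       # concept -> activations on labeled real photos
) -> float:
    """Correlate per-concept mean activations across the concepts present in both sets."""
    concepts = sorted(set(synthetic) & set(real))
    if len(concepts) < 2:
        raise ValueError("need at least two shared concepts")
    syn_means = [mean(synthetic[c]) for c in concepts]
    real_means = [mean(real[c]) for c in concepts]
    return pearson(syn_means, real_means)


# Made-up activations for one neuron.
toy_synthetic = {"red fruit": [0.9, 0.8], "strawberry": [0.6, 0.7], "blue car": [0.1, 0.2]}
toy_real = {"red fruit": [0.7, 0.8], "strawberry": [0.5, 0.6], "blue car": [0.1, 0.1]}
r = synthetic_real_agreement(toy_synthetic, toy_real)
print(round(r, 3))
```

A value near 1 would indicate that concepts ranked highly on synthetic images are also ranked highly on real ones; a low or negative value would point to the distribution-shift problem named in the load-bearing premise.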

Figures

Figures reproduced from arXiv: 2604.08039 by Gaspar Sekula, Michał Piechota, Przemysław Biecek, Vladimir Zaigrajew.

**Figure 1.** Qualitative comparison of neuron descriptions. We show the top four activating images from ImageNet-1K for three randomly selected neurons from the ResNet50 avgpool layer (left) and the ViT-B/16 heads layer (right). The descriptions are provided from our method (LINE), CLIP-Dissect (Oikarinen and Weng, 2023), and INVERT (Bykov et al., 2024) alongside their CoSy benchmark (Kopf et al., 2024) AUC scores. …

**Figure 2.** Overview of the LINE iterative framework. Step 1: An LLM proposes a new concept t based on the descriptions from the scoreboard H (e.g., proposing strawberry and pomegranate into red fruit). Step 2: A Text-to-Image (T2I) model generates a batch of diverse synthetic images illustrating the concept t. Step 3: These images are processed by the target vision model, extracting concept activations A_t. Step 4: A …

**Figure 3.** Causal impact of concept ablation on neuron activation. Using image-to-image generative models, we remove visual concepts associated with the neuron description from the ResNet50 avgpool layer identified by LINE. Original (left) and ablated (right) images are shown side-by-side, with normalized neuron activations displayed below each pair. Original images are outlined in blue, ablated versions in orange, and the …

**Figure 4.** Visual explanation comparison on Salient ImageNet (Singla and Feizi, 2022). We present saliency maps alongside visual explanations generated by LINE, DiffExplainer, and DEXTER for the top-5 core features of the “Jeep” class in RobustResNet50. Neuron activation values are displayed above each visual explanation, with the corresponding LINE neuron description provided at the bottom. Compared to AM methods, L…

**Figure 5.** Performance analysis across iteration steps. We extend the maximum iterations from 10 to 20 for 100 randomly selected neurons in the avgpool of ResNet50 (Places365) and ResNet18. Over successive iterations, we report the relative average best activation score (left panel) and the discovery rate of the optimal description (right panel) at each step. The “S” label denotes the final summary iteration. On aver…

**Figure 6.** Impact of text-to-image (T2I) models on LINE performance. We evaluate 30 random neurons from the avgpool layer of ResNet50 trained on Places365. The box plots (left) comparing the highest concept activation scores indicate that the FLUX model produces slightly higher activations. Corresponding descriptions and visual explanations for neuron 166 (right) illustrate the distinct generative priors of each T2I …

**Figure 7.** CoSy evaluation framework. A schematic illustration of the CoSy framework for Neuron 80 in ResNet18’s avgpool layer. The figure is sourced from the original paper (Kopf et al., 2024). The CoSy benchmark (Kopf et al., 2024) evaluates the quality of open-vocabulary textual explanations for vision model neurons. Given the difficulty of finding natural datasets that perfectly isolate arbitrary concepts, CoSy …

**Figure 8.** Ablation of visual concepts derived from LINE. We show side-by-side comparisons demonstrating the effect of removing the LINE-defined concept t from highly activating images for the ResNet50 avgpool layer using an image-editing generative model. Concept removal via generative models is not always perfect and still requires manual inspection; for instance, removal failures can be observed in the 5th row for…

**Figure 9.** Extended visual explanations on Salient ImageNet. Extending the …

**Figure 9.** Continued extended visual explanations on Salient ImageNet. Extending the …

**Figure 10.** Qualitative comparison of neuron descriptions in ResNet50. We extend the qualitative analysis from …

**Figure 11.** Qualitative comparison of neuron descriptions in ResNet18 and ViT-B/16. We extend the qualitative analysis from …

Original abstract

Interpreting individual neurons in deep neural networks is a crucial step towards understanding their complex decision-making processes and ensuring AI safety. Despite recent progress in neuron labeling, existing methods often limit the search space to predefined concept vocabularies or produce overly specific descriptions that fail to capture higher-order, global concepts. We introduce LINE, a novel, training-free iterative approach tailored for open-vocabulary concept labeling in vision models. Operating in a strictly black-box setting, LINE leverages a large language model and a text-to-image generator to iteratively propose and refine concepts in a closed loop, guided by activation history. LINE achieves state-of-the-art performance across multiple model architectures, yielding AUC improvements of up to 0.11 on ImageNet and 0.05 on Places365, while discovering, on average, 27% of new concepts missed by predefined vocabularies. Beyond identifying the top concept, LINE provides a complete generation history, enabling polysemanticity evaluation and producing visual explanations that rival gradient-dependent activation maximization methods. The source code will be made available soon.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LINE's iterative LLM plus text-to-image loop for open-vocab neuron labeling is a new combination, but the reported AUC gains rest on synthetic images whose match to real data is unproven.

The main point is that LINE runs a closed loop: an LLM proposes candidate concepts for a neuron, a text-to-image model generates test images, and measured activations decide whether to keep or refine the label. This iterative coupling of proposal and empirical test is not in the earlier fixed-vocabulary or single-pass labeling papers the abstract cites. It stays strictly black-box and training-free, which is a practical plus for large vision models. The output also includes the full generation history, which lets users check polysemanticity and supplies visual explanations without gradient-based activation maximization. Those are concrete additions.

The claimed AUC lifts (0.11 on ImageNet, 0.05 on Places365) and 27% new-concept rate come from activations on the generated images. The stress-test concern is real: if the synthetic images carry style biases or missing context, the loop can lock onto patterns that do not appear in natural data. The abstract gives no protocol details, baseline descriptions, or ablation numbers, so the performance numbers cannot be checked yet.

The paper targets interpretability researchers who need flexible concept labels for debugging or safety work on vision models. Readers who already use neuron-level tools will see the most immediate value. It is worth sending to peer review so the experimental setup and the synthetic-to-real gap can be examined directly.

Referee Report

2 major / 1 minor

Summary. The paper introduces LINE, a training-free black-box iterative algorithm that couples an LLM with a text-to-image generator to propose, refine, and select open-vocabulary concepts for individual neurons in vision models. Guided by activation history, the method claims state-of-the-art performance with AUC gains of up to 0.11 on ImageNet and 0.05 on Places365, plus discovery of 27% new concepts missed by fixed vocabularies; it also supplies a generation history for polysemanticity analysis and visual explanations.

Significance. If the reported gains are reproducible under standard real-image evaluation, LINE would meaningfully advance neuron-level interpretability by removing the closed vocabulary constraint and enabling post-hoc polysemanticity checks. The approach is conceptually simple and leverages existing foundation models, which could accelerate adoption in safety and debugging workflows.

major comments (2)

[Abstract] The central performance claims (AUC improvements of 0.11 on ImageNet and 0.05 on Places365, 27% new-concept discovery) are stated without any description of the evaluation protocol, baseline methods, number of neurons or models tested, AUC definition, or statistical tests. These omissions render the SOTA assertion unverifiable from the manuscript.
[Method] Iterative loop: Neuron activations used to score and refine concepts are measured exclusively on images synthesized by the text-to-image model. No experiment validates that these synthetic activations correlate with activations on real-world images of the same concept; systematic distribution shift would invalidate the closed-loop optimization and the reported AUC numbers.

minor comments (1)

[Abstract] The statement that source code 'will be made available soon' should include a concrete timeline or repository link to allow reviewers to inspect the implementation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments that help improve the clarity and rigor of the manuscript. We address each major point below and will make the necessary revisions.

Referee: [Abstract] The central performance claims (AUC improvements of 0.11 on ImageNet and 0.05 on Places365, 27% new-concept discovery) are stated without any description of the evaluation protocol, baseline methods, number of neurons or models tested, AUC definition, or statistical tests. These omissions render the SOTA assertion unverifiable from the manuscript.

Authors: We agree that the abstract would benefit from additional context to make the claims verifiable at a glance. In the revised version we will expand the abstract to briefly note the evaluation protocol (real-image AUC on ImageNet and Places365 using ROC curves for activation prediction), the baselines (fixed-vocabulary methods), the scope (multiple architectures and hundreds of neurons), and that gains are reported as averages with standard deviation across runs. Full experimental details remain in Sections 4 and 5.
revision: yes

Referee: [Method] Iterative loop: Neuron activations used to score and refine concepts are measured exclusively on images synthesized by the text-to-image model. No experiment validates that these synthetic activations correlate with activations on real-world images of the same concept; systematic distribution shift would invalidate the closed-loop optimization and the reported AUC numbers.

Authors: We acknowledge the concern about possible distribution shift. The final reported AUC numbers are computed exclusively on real images from ImageNet and Places365, providing an end-to-end validation of the discovered concepts. However, we did not include an explicit correlation study between synthetic and real activations during the iterative loop. We will add a new subsection with a quantitative analysis (e.g., activation correlation coefficients on matched concept pairs) to confirm that the closed-loop optimization remains reliable.
revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

LINE is a procedural black-box algorithm that iteratively calls external LLM and text-to-image models to propose concepts, then measures neuron activations on the generated images to refine them. No equations, fitted parameters, or self-referential definitions appear in the method; the reported AUC gains and new-concept discovery rates are computed directly from those external activation measurements rather than being forced by any internal construction. No load-bearing self-citations or uniqueness theorems imported from prior author work are invoked to justify the core loop. The approach therefore remains self-contained against external benchmarks and does not reduce any claimed result to its own inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on the assumption that synthetic images can serve as faithful probes for real-world concept activation; no free parameters or new invented entities are introduced in the abstract.

axioms (1)

Domain assumption: Neuron activations measured on text-to-image generated pictures correspond to the neuron's response to the underlying concept in natural images.
This premise enables the closed-loop refinement process described in the abstract.

Reference graph

Works this paper leans on

[1]

The Llama 3 Herd of Models

Transformer Circuits Thread. Kirill Bykov, Laura Kopf, Shinichi Nakajima, Marius Kloft, and Marina Höhne. Labeling neural representations with inverse recognition. Advances in Neural Information Processing Systems, 36, 2024. Josep Lopez Camuñas, Christy Li, Tamar Rott Shaham, Antonio Torralba, and Agata Lapedriza. OpenMAIA: a multimodal automated interpret...

doi:10.23915/distill.00007 2024
[2]

Generate Synthetic Data. Given a proposed concept label t (e.g., polka dots), a generative T2I model is used to synthesize a collection of images, denoted in our work as P
[3]

Collect Neuron Activations. Both the synthetic images P and a control dataset of natural images X_control ⊂ X are passed through the target vision network f, and the activations are extracted from the specific neuron n. This yields two sets of scalar activations: • A_t: Activations from the synthetic concept images in P. • A_control: Activations from the natural...
[4]

Score Explanations. A scoring function ψ(A_control, A_t) is used to quantify the difference between the two activation distributions. A higher score indicates that the concept t is a better match for the neuron n. The benchmark evaluates these explanations using two complementary scoring functions that capture different aspects of neuron behavior: Area Under ...
[5]

and extending the results from Figure 4, we selected 3 classes containing heavily spurious features and 3 classes relying mainly on core features from Salient ImageNet (Singla and Feizi, 2022). For each evaluated class, we provided visual explanations of the top-5 features, categorized as either “core” (for the core classes) or “spurious” (for the spurio...
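
The scoring step quoted in the extracted passages above, a function ψ(A_control, A_t) comparing control and concept activations, can be illustrated with an AUC-style score. The excerpt is truncated after "Area Under", so the exact definition used by the benchmark is not confirmed here. The sketch assumes the standard rank-based reading of AUC: the probability that a randomly drawn concept activation exceeds a randomly drawn control activation, with ties counted as one half. The function name `auc_score` and the toy numbers are hypothetical.

```python
from typing import Sequence


def auc_score(a_control: Sequence[float], a_t: Sequence[float]) -> float:
    """Fraction of (concept, control) pairs where the concept activation is larger.

    Ties contribute 0.5. A value of 1.0 means every concept image activates the
    neuron more than every control image; 0.5 means no separation.
    """
    if not a_control or not a_t:
        raise ValueError("both activation sets must be non-empty")
    wins = 0.0
    for t in a_t:
        for c in a_control:
            if t > c:
                wins += 1.0
            elif t == c:
                wins += 0.5
    return wins / (len(a_t) * len(a_control))


# Made-up activations: control images (natural) vs. synthetic concept images.
toy_control = [0.1, 0.2, 0.3, 0.4]
toy_concept = [0.35, 0.5, 0.6]
score = auc_score(toy_control, toy_concept)
print(round(score, 4))
```

The pairwise loop is quadratic in the number of activations, which is fine for a few hundred images; a sort-based rank computation would be the usual choice for larger sets.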