ToMMeR -- Efficient Entity Mention Detection from Large Language Models
Pith reviewed 2026-05-18 05:22 UTC · model grok-4.3
The pith
Structured entity representations exist in early transformer layers and can be recovered with under 300K parameters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ToMMeR is a lightweight model with under 300K parameters that probes mention detection capabilities from early LLM layers. Across 13 NER benchmarks it achieves 93 percent zero-shot recall with an estimated 90 percent precision under a human-calibrated LLM-judge protocol. Cross-model analysis on architectures from 14M to 15B parameters reveals convergence on similar mention boundaries with DICE scores above 75 percent. When extended with span classification heads, ToMMeR reaches 80 to 87 percent F1 on standard benchmarks, showing that structured entity representations exist in early transformer layers and can be recovered efficiently with minimal parameters.
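The two headline metrics in this claim, recall against gold mentions and Dice agreement between models, can be sketched as follows. This is a minimal illustration that assumes both metrics are computed over sets of exact-match `(start, end)` spans; the paper may define matching differently, and the function names and toy spans here are invented for illustration.

```python
from typing import Set, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive (assumed convention)


def recall(predicted: Set[Span], gold: Set[Span]) -> float:
    """Fraction of gold spans recovered with exactly matching boundaries."""
    if not gold:
        return 1.0
    return len(predicted & gold) / len(gold)


def dice(a: Set[Span], b: Set[Span]) -> float:
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) between two span sets."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Toy spans, invented for illustration only (not taken from the paper).
gold = {(0, 2), (5, 6), (9, 12)}
model_a = {(0, 2), (5, 6), (9, 12), (14, 15)}
model_b = {(0, 2), (5, 6), (9, 11), (14, 15)}

print(recall(model_a, gold))   # 1.0: every gold span is predicted
print(dice(model_a, model_b))  # 0.75: 3 shared spans out of 4 + 4
```

Note that `model_a` reaches full recall while still predicting a span outside the gold set, which is why recall alone cannot support a claim about spurious predictions and a separate precision estimate is needed.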
What carries the argument
ToMMeR, a lightweight probing model attached to early transformer layers to identify entity mention boundaries.
Load-bearing premise
The human-calibrated LLM judge gives an unbiased precision estimate and cross-model boundary agreement reflects genuine emergence from language modeling rather than shared data artifacts.
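One common way to turn a noisy judge's verdicts into a less biased precision estimate is the Rogan-Gladen correction, sketched below. This is not necessarily the paper's protocol, which the review says is under-specified; it assumes a human-labeled calibration set from which the judge's sensitivity and specificity have been measured, and all numbers are placeholders.

```python
def corrected_precision(judge_accept_rate: float,
                        sensitivity: float,
                        specificity: float) -> float:
    """Rogan-Gladen estimate of true precision from a noisy judge.

    judge_accept_rate: share of predicted mentions the judge accepts.
    sensitivity: P(judge accepts | humans accept), from the calibration set.
    specificity: P(judge rejects | humans reject), from the calibration set.
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("judge is no better than chance on the calibration set")
    estimate = (judge_accept_rate + specificity - 1.0) / denom
    # Sampling noise can push the raw estimate outside [0, 1]; clip it.
    return min(1.0, max(0.0, estimate))


# Placeholder values for illustration only (not from the paper).
print(round(corrected_precision(0.88, sensitivity=0.95, specificity=0.80), 3))  # about 0.907
```

The correction only removes bias that shows up on the calibration set. If the judge and the probed models share errors that the human annotators also fail to catch, the estimate remains inflated, which is the concern this premise raises.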
What would settle it
If a new language model trained on text without named entities still allows ToMMeR to reach high recall on standard benchmarks, this would challenge the claim that the representations emerge from language modeling itself.
Original abstract
Identifying which text spans refer to entities - mention detection - is both foundational for information extraction and a known performance bottleneck. We introduce ToMMeR, a lightweight model (<300K parameters) probing mention detection capabilities from early LLM layers. Across 13 NER benchmarks, ToMMeR achieves 93% recall zero-shot, with an estimated 90% precision under a human-calibrated LLM-judge protocol, showing that ToMMeR rarely produces spurious predictions despite high recall. Cross-model analysis reveals that diverse architectures (14M-15B parameters) converge on similar mention boundaries (DICE >75%), confirming that mention detection emerges naturally from language modeling. When extended with span classification heads, ToMMeR achieves competitive NER performance (80-87% F1 on standard benchmarks). Our work provides evidence that structured entity representations exist in early transformer layers and can be efficiently recovered with minimal parameters.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ToMMeR, a lightweight probe (<300K parameters) that extracts entity mention detection from early layers of LLMs. It reports 93% zero-shot recall across 13 NER benchmarks with an estimated 90% precision via a human-calibrated LLM-judge protocol, DICE >75% boundary convergence across 14M–15B parameter models from diverse architectures, and competitive 80–87% F1 on full NER when extended with span classification heads. The central claim is that structured entity representations emerge naturally in early transformer layers and can be recovered efficiently.
Significance. If the empirical claims hold under rigorous validation, the work provides concrete evidence for the emergence of mention detection in early LLM layers and demonstrates a parameter-efficient recovery method. The cross-model convergence analysis is a notable strength, as it offers falsifiable predictions about architecture-independent structure arising from language modeling objectives. This could inform both mechanistic interpretability and practical information extraction systems.
major comments (3)
- [§4] §4 (Evaluation Protocol): The 90% precision estimate depends entirely on the human-calibrated LLM-judge protocol, yet the manuscript supplies no details on calibration set size, inter-annotator agreement, judge prompt templates, or controls for correlation between the judge LLM and the probed models (e.g., shared pretraining data). This directly undermines confidence in the claim that high recall comes with few spurious predictions, as any judge-model alignment would inflate the reported precision.
- [§5] §5 (Cross-Model Convergence): The DICE >75% result is presented as evidence of architecture-independent emergence, but no analysis addresses potential confounding from overlapping pretraining corpora across the tested models. Without such controls or ablation on data overlap, the convergence cannot be confidently attributed to language modeling rather than shared artifacts.
- [§3] §3 (ToMMeR Architecture and Layer Probing): The description of how early layers are selected and probed lacks ablation studies on layer choice, the exact form of the probe (linear vs. non-linear), and verification that performance does not degrade when using only the first few layers. These omissions make the efficiency claim (<300K parameters) difficult to evaluate as load-bearing for the emergence thesis.
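To see why the probe's exact form matters for the parameter-budget question raised in the §3 comment, the following sketch counts parameters for two hypothetical probe shapes. These are generic token-level probes chosen for illustration, not ToMMeR's actual architecture (which the review says is not fully specified); the hidden sizes and widths are assumptions.

```python
def linear_probe_params(d_model: int, n_out: int = 1) -> int:
    """Weights plus biases of a single linear layer d_model -> n_out."""
    return d_model * n_out + n_out


def mlp_probe_params(d_model: int, hidden: int, n_out: int = 1) -> int:
    """Weights plus biases of a one-hidden-layer MLP d_model -> hidden -> n_out."""
    return (d_model * hidden + hidden) + (hidden * n_out + n_out)


BUDGET = 300_000  # the "<300K parameters" figure quoted in the review

# Hypothetical hidden sizes and probe width, for illustration only.
for d_model in (2048, 4096):
    linear = linear_probe_params(d_model)
    mlp = mlp_probe_params(d_model, hidden=128)
    print(d_model, linear, linear < BUDGET, mlp, mlp < BUDGET)
```

Under these assumed shapes, a width-128 MLP fits the budget at a hidden size of 2048 (262,401 parameters) but exceeds it at 4096 (524,545 parameters), so the budget claim cannot be checked without knowing the probe form and the dimensionality of the probed layer.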
minor comments (2)
- [Tables 1–3] Tables reporting benchmark results should include error bars or standard deviations across runs or seeds to support the 93% recall figure.
- [§2] The manuscript would benefit from explicit comparison to prior probing work on entity representations (e.g., citations to linear probes for syntactic structure) to clarify novelty.
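The error bars requested in the first minor comment amount to reporting a mean and sample standard deviation over repeated runs. A minimal sketch using Python's standard `statistics` module follows; the per-seed recall values are placeholders, not results from the paper.

```python
import statistics
from typing import Sequence, Tuple


def summarize(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation (n - 1 denominator) across runs."""
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return statistics.mean(values), statistics.stdev(values)


# Placeholder per-seed recall values, for illustration only.
recall_by_seed = [0.92, 0.93, 0.94]
mean, std = summarize(recall_by_seed)
report = f"{mean:.3f} ± {std:.3f}"
print(report)  # 0.930 ± 0.010
```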
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which help clarify key aspects of our evaluation and analysis. We respond to each major comment below and specify the revisions planned for the next version of the manuscript.
Point-by-point responses
- Referee: [§4] §4 (Evaluation Protocol): The 90% precision estimate depends entirely on the human-calibrated LLM-judge protocol, yet the manuscript supplies no details on calibration set size, inter-annotator agreement, judge prompt templates, or controls for correlation between the judge LLM and the probed models (e.g., shared pretraining data). This directly undermines confidence in the claim that high recall comes with few spurious predictions, as any judge-model alignment would inflate the reported precision.
  Authors: We agree that the current manuscript lacks sufficient detail on the LLM-judge protocol, which is necessary to support the reported precision. In the revised version we will expand Section 4 with a new subsection that reports the calibration set size, inter-annotator agreement statistics, the full judge prompt templates, and the steps taken to reduce judge-model correlation (including use of a judge from a different model family and checks for benchmark overlap). These additions will be included without changing the underlying experimental results.
  Revision: yes
- Referee: [§5] §5 (Cross-Model Convergence): The DICE >75% result is presented as evidence of architecture-independent emergence, but no analysis addresses potential confounding from overlapping pretraining corpora across the tested models. Without such controls or ablation on data overlap, the convergence cannot be confidently attributed to language modeling rather than shared artifacts.
  Authors: The referee correctly notes the absence of explicit controls for pretraining data overlap. We will add a limitations paragraph in Section 5 that acknowledges this potential confound and presents supplementary stratified DICE results across model families with differing known training data. Full ablation on data overlap is not feasible for all closed models, so we will temper the language around architecture-independent emergence while retaining the empirical observation of boundary convergence.
  Revision: partial
- Referee: [§3] §3 (ToMMeR Architecture and Layer Probing): The description of how early layers are selected and probed lacks ablation studies on layer choice, the exact form of the probe (linear vs. non-linear), and verification that performance does not degrade when using only the first few layers. These omissions make the efficiency claim (<300K parameters) difficult to evaluate as load-bearing for the emergence thesis.
  Authors: We accept that additional ablations are required to substantiate the layer-selection and efficiency claims. The revised manuscript will include new results in Section 3 showing recall as a function of layer index, a direct comparison of linear versus small MLP probes, and performance when restricting the probe to the first two or three layers only. These experiments confirm that early-layer performance remains high with the reported parameter budget and will be presented as additional tables or figures.
  Revision: yes
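The inter-annotator agreement statistics promised in the response to the §4 comment are often reported as Cohen's kappa, which corrects raw agreement for chance. The sketch below is a generic implementation for two annotators; the rebuttal does not say which statistic the authors will use, and the labels are invented for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equally long, non-empty label sequences")
    n = len(a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    labels = set(count_a) | set(count_b)
    expected = sum(count_a[l] * count_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


# Invented judgments (1 = valid mention, 0 = spurious), for illustration only.
annotator_1 = [1, 1, 1, 0, 0, 1, 0, 1]
annotator_2 = [1, 1, 0, 0, 0, 1, 1, 1]
print(round(cohens_kappa(annotator_1, annotator_2), 3))  # 0.467
```

Here raw agreement is 75%, but because both annotators accept most items, chance agreement is already about 53%, leaving a kappa of roughly 0.47. Reporting only raw agreement would overstate how reliable the calibration labels are.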
Circularity Check
No circularity: claims rest on external benchmark measurements
full rationale
The paper presents ToMMeR as an empirical probe whose performance is measured directly on 13 NER benchmarks (93% zero-shot recall) and via cross-model boundary agreement (DICE >75%). No derivation chain, equations, or first-principles steps are described that reduce to fitted inputs or self-citations; the central evidence consists of independent evaluations against standard datasets and diverse architectures rather than any self-referential construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Early transformer layers encode structured entity mention information that can be linearly or lightly probed.
Forward citations
Cited by 1 Pith paper
- Tracing Relational Knowledge Recall in Large Language Models: Per-head attention contributions to the residual stream serve as strong linear features for classifying relational knowledge in LLMs, with probe accuracy correlating to relation specificity and signal distribution.