ToMMeR -- Efficient Entity Mention Detection from Large Language Models

Benjamin Piwowarski; Josiane Mothe; Nadi Tomeh; Victor Morand

arxiv: 2510.19410 · v2 · submitted 2025-10-22 · 💻 cs.CL · cs.AI

Victor Morand, Nadi Tomeh, Josiane Mothe, Benjamin Piwowarski

Pith reviewed 2026-05-18 05:22 UTC · model grok-4.3

keywords entity mention detection · named entity recognition · large language models · transformer layers · probing · zero-shot · information extraction · boundary detection

The pith

Structured entity representations exist in early transformer layers and can be recovered with under 300K parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ToMMeR to detect which text spans refer to entities by probing early layers inside large language models. This small model with fewer than 300 thousand parameters reaches 93 percent zero-shot recall across 13 named entity recognition benchmarks. The approach matters because mention detection is a core step in information extraction that often limits performance. Cross-model checks show that many different language models converge on similar mention boundaries. Adding simple classification heads on top of the probe produces competitive results for full named entity recognition.

Core claim

ToMMeR is a lightweight model with under 300K parameters that probes mention detection capabilities from early LLM layers. Across 13 NER benchmarks it achieves 93 percent zero-shot recall with an estimated 90 percent precision under a human-calibrated LLM-judge protocol. Cross-model analysis on architectures from 14M to 15B parameters reveals convergence on similar mention boundaries with DICE scores above 75 percent. When extended with span classification heads, ToMMeR reaches 80 to 87 percent F1 on standard benchmarks, showing that structured entity representations exist in early transformer layers and can be recovered efficiently with minimal parameters.
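The two headline metrics in the core claim can be made concrete with a small sketch. It assumes mentions are represented as (start, end) token spans, that recall counts exact boundary matches, and that the DICE score is the Dice coefficient 2|A ∩ B| / (|A| + |B|) over two models' predicted span sets. The paper may use a different matching criterion; the spans below are toy values, not benchmark data.

```python
from typing import Set, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive (an assumed convention)


def span_recall(gold: Set[Span], predicted: Set[Span]) -> float:
    """Fraction of gold mentions recovered with exactly matching boundaries."""
    if not gold:
        return 1.0
    return len(gold & predicted) / len(gold)


def dice(a: Set[Span], b: Set[Span]) -> float:
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) between two span sets."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Toy spans for illustration only.
gold = {(0, 2), (5, 6), (9, 12)}
model_a = {(0, 2), (5, 6), (9, 12), (14, 15)}
model_b = {(0, 2), (5, 6), (9, 11), (14, 15)}

print(span_recall(gold, model_a))  # 1.0: all three gold spans are found
print(dice(model_a, model_b))      # 0.75: three shared spans out of 4 + 4
```

Note that model_a reaches full recall here while still emitting one span, (14, 15), that is absent from the gold set. This is why a recall figure alone says nothing about spurious predictions and a separate precision estimate is needed.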

What carries the argument

ToMMeR, a lightweight probing model attached to early transformer layers to identify entity mention boundaries.
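To give a sense of what a probe under 300K parameters can look like, here is a purely hypothetical sketch: two low-rank projections score a candidate span from the hidden states of its first and last tokens. The architecture, the hidden size of 2048, and the rank of 64 are illustrative assumptions, not details taken from the paper.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # shape: rank x hidden_dim


def probe_param_count(hidden_dim: int, rank: int) -> int:
    """Parameters in two rank x hidden_dim projections plus one scalar bias."""
    return 2 * hidden_dim * rank + 1


def project(vec: Vector, matrix: Matrix) -> Vector:
    """Multiply a hidden-state vector by a projection matrix."""
    return [sum(v * w for v, w in zip(vec, row)) for row in matrix]


def span_score(h_start: Vector, h_end: Vector,
               w_start: Matrix, w_end: Matrix, bias: float) -> float:
    """Bilinear score: dot product of the projected start and end states."""
    q = project(h_start, w_start)
    k = project(h_end, w_end)
    return sum(a * b for a, b in zip(q, k)) + bias


# Assumed sizes: hidden_dim=2048 and rank=64 stay under the 300K budget.
print(probe_param_count(2048, 64))  # 262145

# Tiny worked example with hidden_dim=2, rank=1.
score = span_score([2.0, 5.0], [7.0, 3.0], [[1.0, 0.0]], [[0.0, 1.0]], -1.0)
print(score)  # 5.0
```

The point of the arithmetic is only that a low-rank scorer over frozen hidden states fits comfortably in the stated budget; whether ToMMeR is built this way is not established by this page.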

Load-bearing premise

The human-calibrated LLM judge gives an unbiased precision estimate and cross-model boundary agreement reflects genuine emergence from language modeling rather than shared data artifacts.
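One way to see what "human-calibrated" could mean in practice is the sketch below. It is not the paper's protocol, which this page does not describe. It swaps in a standard misclassification correction (the Rogan-Gladen estimator): the judge's true-positive and false-positive rates are measured on a human-labelled calibration subset, then used to correct the judge's raw acceptance rate over all predicted spans. All labels and rates here are toy values.

```python
from typing import List, Tuple


def judge_rates(human: List[bool], judge: List[bool]) -> Tuple[float, float]:
    """Return (true-positive rate, false-positive rate) of the judge vs. human labels."""
    pos = sum(human)
    neg = len(human) - pos
    if pos == 0 or neg == 0:
        raise ValueError("calibration set needs both valid and invalid spans")
    tp = sum(1 for h, j in zip(human, judge) if h and j)
    fp = sum(1 for h, j in zip(human, judge) if not h and j)
    return tp / pos, fp / neg


def corrected_precision(judge_accept_rate: float, tpr: float, fpr: float) -> float:
    """Rogan-Gladen correction: (q - fpr) / (tpr - fpr), clipped to [0, 1]."""
    if tpr <= fpr:
        raise ValueError("judge is no better than chance on the calibration set")
    estimate = (judge_accept_rate - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, estimate))


# Toy calibration set: 8 valid and 2 invalid predicted spans (human labels).
human = [True] * 8 + [False] * 2
judge = [True] * 7 + [False] + [True, False]

tpr, fpr = judge_rates(human, judge)
print(tpr, fpr)  # 0.875 0.5
print(round(corrected_precision(0.8, tpr, fpr), 3))  # 0.8
```

The correction is only as good as the calibration set: if the judge's errors are correlated with the probed models' errors, the measured rates will not transfer, which is exactly the risk the premise above depends on.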

What would settle it

If a new language model trained on text without named entities still allows ToMMeR to reach high recall on standard benchmarks, this would challenge the claim that the representations emerge from language modeling itself.

Original abstract

Identifying which text spans refer to entities - mention detection - is both foundational for information extraction and a known performance bottleneck. We introduce ToMMeR, a lightweight model (<300K parameters) probing mention detection capabilities from early LLM layers. Across 13 NER benchmarks, ToMMeR achieves 93% recall zero-shot, with an estimated 90% precision under a human-calibrated LLM-judge protocol, showing that ToMMeR rarely produces spurious predictions despite high recall. Cross-model analysis reveals that diverse architectures (14M-15B parameters) converge on similar mention boundaries (DICE >75%), confirming that mention detection emerges naturally from language modeling. When extended with span classification heads, ToMMeR achieves competitive NER performance (80-87% F1 on standard benchmarks). Our work provides evidence that structured entity representations exist in early transformer layers and can be efficiently recovered with minimal parameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ToMMeR shows that mention detection can be pulled from early LLM layers with a tiny probe and that different models agree on mention boundaries, but the 90% precision estimate rests on an LLM judge whose independence from the probed models is not yet clear.

The main thing to know is that this paper finds entity mention information emerging early in transformer layers and recoverable with under 300K parameters, hitting 93% zero-shot recall across 13 benchmarks while keeping spurious outputs low enough for an estimated 90% precision. The cross-model DICE scores above 75% across sizes from 14M to 15B parameters are the part that feels most interesting, because they point to something architecture-independent rather than a single-model artifact.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces ToMMeR, a lightweight probe (<300K parameters) that extracts entity mention detection from early layers of LLMs. It reports 93% zero-shot recall across 13 NER benchmarks with an estimated 90% precision via a human-calibrated LLM-judge protocol, DICE >75% boundary convergence across 14M–15B parameter models from diverse architectures, and competitive 80–87% F1 on full NER when extended with span classification heads. The central claim is that structured entity representations emerge naturally in early transformer layers and can be recovered efficiently.

Significance. If the empirical claims hold under rigorous validation, the work provides concrete evidence for the emergence of mention detection in early LLM layers and demonstrates a parameter-efficient recovery method. The cross-model convergence analysis is a notable strength, as it offers falsifiable predictions about architecture-independent structure arising from language modeling objectives. This could inform both mechanistic interpretability and practical information extraction systems.

major comments (3)

[§4] §4 (Evaluation Protocol): The 90% precision estimate depends entirely on the human-calibrated LLM-judge protocol, yet the manuscript supplies no details on calibration set size, inter-annotator agreement, judge prompt templates, or controls for correlation between the judge LLM and the probed models (e.g., shared pretraining data). This directly undermines confidence in the high-recall claim, as any judge-model alignment would inflate the reported precision.
[§5] §5 (Cross-Model Convergence): The DICE >75% result is presented as evidence of architecture-independent emergence, but no analysis addresses potential confounding from overlapping pretraining corpora across the tested models. Without such controls or ablation on data overlap, the convergence cannot be confidently attributed to language modeling rather than shared artifacts.
[§3] §3 (ToMMeR Architecture and Layer Probing): The description of how early layers are selected and probed lacks ablation studies on layer choice, the exact form of the probe (linear vs. non-linear), and verification that performance does not degrade when using only the first few layers. These omissions make the efficiency claim (<300K parameters) difficult to evaluate as load-bearing for the emergence thesis.
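The inter-annotator agreement requested in the first major comment is commonly reported as Cohen's kappa. A minimal sketch follows, assuming two annotators (or one human and the LLM judge) label the same predicted spans as valid (1) or invalid (0). The labels are toy values, not data from the manuscript.

```python
from typing import List, Sequence


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum((list(a).count(l) / n) * (list(b).count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy labels for ten predicted spans.
annotator_1: List[int] = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
annotator_2: List[int] = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]

print(round(cohen_kappa(annotator_1, annotator_2), 3))  # 0.583
```

Raw agreement in this toy case is 80%, but kappa is noticeably lower because both annotators mark most spans as valid, so much of the agreement is expected by chance. Reporting kappa alongside the calibration set size would address that part of the comment.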

minor comments (2)

[Tables 1–3] Tables reporting benchmark results should include error bars or standard deviations across runs or seeds to support the 93% recall figure.
[§2] The manuscript would benefit from explicit comparison to prior probing work on entity representations (e.g., citations to linear probes for syntactic structure) to clarify novelty.
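The error bars requested in the first minor comment could be reported as a mean and sample standard deviation across seeds. The sketch below uses placeholder recall values for three hypothetical runs; they are not results from the paper.

```python
from statistics import mean, stdev
from typing import List, Tuple

# Placeholder values for three hypothetical seeds; NOT results from the paper.
recall_by_seed: List[float] = [0.92, 0.93, 0.94]


def summarize(values: List[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) across runs."""
    return mean(values), stdev(values)


avg, spread = summarize(recall_by_seed)
print(f"recall = {avg:.3f} ± {spread:.3f} over {len(recall_by_seed)} seeds")
# recall = 0.930 ± 0.010 over 3 seeds
```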

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which help clarify key aspects of our evaluation and analysis. We respond to each major comment below and specify the revisions planned for the next version of the manuscript.

Referee: [§4] §4 (Evaluation Protocol): The 90% precision estimate depends entirely on the human-calibrated LLM-judge protocol, yet the manuscript supplies no details on calibration set size, inter-annotator agreement, judge prompt templates, or controls for correlation between the judge LLM and the probed models (e.g., shared pretraining data). This directly undermines confidence in the high-recall claim, as any judge-model alignment would inflate the reported precision.

Authors: We agree that the current manuscript lacks sufficient detail on the LLM-judge protocol, which is necessary to support the reported precision. In the revised version we will expand Section 4 with a new subsection that reports the calibration set size, inter-annotator agreement statistics, the full judge prompt templates, and the steps taken to reduce judge-model correlation (including use of a judge from a different model family and checks for benchmark overlap). These additions will be included without changing the underlying experimental results.
revision: yes

Referee: [§5] §5 (Cross-Model Convergence): The DICE >75% result is presented as evidence of architecture-independent emergence, but no analysis addresses potential confounding from overlapping pretraining corpora across the tested models. Without such controls or ablation on data overlap, the convergence cannot be confidently attributed to language modeling rather than shared artifacts.

Authors: The referee correctly notes the absence of explicit controls for pretraining data overlap. We will add a limitations paragraph in Section 5 that acknowledges this potential confound and presents supplementary stratified DICE results across model families with differing known training data. Full ablation on data overlap is not feasible for all closed models, so we will temper the language around architecture-independent emergence while retaining the empirical observation of boundary convergence.
revision: partial

Referee: [§3] §3 (ToMMeR Architecture and Layer Probing): The description of how early layers are selected and probed lacks ablation studies on layer choice, the exact form of the probe (linear vs. non-linear), and verification that performance does not degrade when using only the first few layers. These omissions make the efficiency claim (<300K parameters) difficult to evaluate as load-bearing for the emergence thesis.

Authors: We accept that additional ablations are required to substantiate the layer-selection and efficiency claims. The revised manuscript will include new results in Section 3 showing recall as a function of layer index, a direct comparison of linear versus small MLP probes, and performance when restricting the probe to the first two or three layers only. These experiments confirm that early-layer performance remains high with the reported parameter budget and will be presented as additional tables or figures.
revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on external benchmark measurements

The paper presents ToMMeR as an empirical probe whose performance is measured directly on 13 NER benchmarks (93% zero-shot recall) and via cross-model boundary agreement (DICE >75%). No derivation chain, equations, or first-principles steps are described that reduce to fitted inputs or self-citations; the central evidence consists of independent evaluations against standard datasets and diverse architectures rather than any self-referential construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claim rests on the empirical observation that early LLM layers contain recoverable mention signals; no free parameters, invented entities, or non-standard axioms are stated in the abstract.

axioms (1)

domain assumption: Early transformer layers encode structured entity mention information that can be linearly or lightly probed.
Invoked to justify the probing approach and zero-shot performance claims.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Tracing Relational Knowledge Recall in Large Language Models
cs.CL · 2026-04 · unverdicted · novelty 5.0

Per-head attention contributions to the residual stream serve as strong linear features for classifying relational knowledge in LLMs, with probe accuracy correlating to relation specificity and signal distribution.