pith. machine review for the scientific record. sign in

arxiv: 2605.07186 · v1 · submitted 2026-05-08 · 💻 cs.CL · cs.AI

Recognition: no theorem link

The Text Uncanny Valley: Non-Monotonic Performance Degradation in LLM Information Retrieval

Ari Holtzman, Aryan Shrivastava, Chenhao Tan, Ruiyao Xu, Zekai Tong

Pith reviewed 2026-05-11 02:49 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords large language modelsinformation retrievaltext corruptionperformance degradationmode transitiontokenizationuncanny valley
0
0 comments X

The pith

LLMs show a U-shaped drop in accuracy when detecting information in text that has been partially fragmented by inserted spaces inside words.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models lose accuracy at spotting targeted information when words are broken into fragments by adding extra spaces. Performance declines steadily as fragmentation increases, then rises again once the text is almost completely shattered into single characters. This creates a dip the authors call the Text Uncanny Valley. The proposed explanation is that models switch from reading whole words to reading individual characters, and they handle neither approach well during the switch. The finding matters because most benchmarks use clean text, yet real inputs are often noisy or uncurated.

Core claim

By inserting whitespace characters inside words at increasing rates, LLMs' detection accuracy follows a U-shaped curve. The authors call this the Text Uncanny Valley and attribute it to a mode transition: models rely on word-level processing for near-normal text and shift to character-level processing for heavily fragmented text, leaving an ineffective intermediate regime where neither mode works well. Four experiments support the account: in-context learning does not lift valley-bottom performance, regularizing the inserted spaces flattens the U-shape, a math reasoning task shows the same pattern for one model but not stronger ones, and tokenization entropy peaks before the accuracy minimum

What carries the argument

The mode transition hypothesis: LLMs toggle between word-level processing for intact text and character-level processing for broken text, with the valley marking the disordered middle ground.

If this is right

  • Clean-text benchmarks miss a failure mode that appears as soon as real-world inputs contain moderate word-boundary noise.
  • In-context examples do not close the performance gap at the bottom of the valley.
  • Tasks that depend less on exact lexical matching show a shallower or absent valley, especially in stronger models.
  • Tokenization statistics already signal the conflicting regime before accuracy reaches its lowest point.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same non-monotonic pattern may appear in other retrieval or classification tasks when inputs contain partial word corruption.
  • Training or fine-tuning on a continuous range of fragmentation levels could reduce or remove the valley.
  • Different corruption methods, such as character deletions or substitutions, could be tested to check whether the valley is tied to whitespace insertion specifically.

Load-bearing premise

That the performance dip is caused specifically by a disordered switch between word-level and character-level reading modes rather than by tokenization side effects or other task features.

What would settle it

Measuring whether the U-shaped curve disappears when the same fragmentation is applied but token boundaries are held constant, or when models are explicitly prompted to stay in one processing mode throughout.

Figures

Figures reproduced from arXiv: 2605.07186 by Ari Holtzman, Aryan Shrivastava, Chenhao Tan, Ruiyao Xu, Zekai Tong.

Figure 1
Figure 1. Figure 1: Bread slicing as an analogy for our text fragmentation method and the resulting Text [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The Text Uncanny Valley: Micro-F1 vs. word_fragmentation_rate for absence detection on Legal Documents (left), GitHub PRs (centre), and Mathematical Reasoning (right). Shaded bands: ±1 std over three fragmentation seeds. 0.0 0.2 0.4 0.6 0.8 1.0 Word Fragmentation Rate 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Micro-F1 Legal Documents Gemini 3.0 Flash GPT-5.2 Claude 4.5 Haiku 0.0 0.2 0.4 0.6 0.8 1.0 Word Frag… view at source ↗
Figure 3
Figure 3. Figure 3: NIAH-style insertion detection under fragmentation. Results for the same domains and [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of standard and thinking mode on legal documents for Gemini 3.0 Flash (left) [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Effect of ICL on Micro-F1 across fragmentation rates for Gemini 3.0 Flash on legal [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Fixed-slot fragmentation results across three domains. In the fixed-slot condition, a space [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Math reasoning accuracy on individual GSM8K problems vs. [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Tokenization entropy (dashed) overlaid on Micro-F1 (solid) from the absence detection [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Micro-F1 as a function of word_fragmentation_rate for DeepSeek-V3, Grok-4.1- Fast, and Qwen3-Plus on legal documents, with all three models overlaid. DeepSeek-V3 results at word_fragmentation_rate = 1.0 are omitted due to consistent API errors at that fragmentation level. 15 [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗
read the original abstract

Existing Large Language Model (LLM) benchmarks primarily focus on syntactically correct inputs, leaving a significant gap in evaluation on imperfect text. In this work, we study how word-boundary corruption affects how LLMs detect targeted information. By inserting whitespace characters within words to break them into fragments, LLMs' detection accuracy follows a U-shaped curve with the increase in insertion rate. We refer to this curve as the Text Uncanny Valley. To explain such observation, we propose a mode transition hypothesis: LLMs operate in a word-level mode for near-normal text and a character-level mode for heavily fragmented text, with the valley marking the disordered transition where neither mode is effective. Four experiments and one analysis are consistent with this account: in-context learning fails to rescue valley-bottom performance; regularizing the perturbation substantially reduces the U-shape; a math reasoning task replicates the U-shape for Gemini 3.0 Flash but not for stronger models, suggesting the effect is attenuated when tasks rely less on exact lexical alignment; and tokenization entropy peaks before the F1 minimum, consistent with a regime-conflict interpretation. These findings reveal a failure mode invisible to clean-text benchmarks yet directly relevant to any deployment scenario involving noisy or uncurated text inputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports that inserting whitespace characters within words to corrupt word boundaries causes LLMs' detection accuracy on targeted information retrieval tasks to follow a U-shaped curve as insertion rate increases, termed the 'Text Uncanny Valley'. The authors propose a mode transition hypothesis in which LLMs operate in a word-level mode for near-normal text and shift to a character-level mode for heavily fragmented text, with the valley bottom marking a disordered transitional regime where neither mode is effective. This is supported by four experiments (in-context learning fails to rescue valley performance; regularization of the perturbation reduces the U-shape; a math reasoning task replicates the U-shape for Gemini 3.0 Flash but not stronger models; tokenization entropy peaks before the F1 minimum) and one analysis presented as consistent with the hypothesis.

Significance. If the reported U-shaped degradation and its interpretation hold after addressing alternative explanations, the work identifies a robustness failure mode in LLMs for noisy or uncurated text that is invisible to standard clean-text benchmarks. This has direct implications for real-world deployments involving imperfect inputs and could motivate new evaluation protocols and training strategies focused on handling fragmentation.

major comments (2)
  1. [tokenization entropy analysis] The tokenization entropy analysis (described in the abstract and supporting the mode transition hypothesis) shows entropy peaking before the F1 minimum and is presented as consistent with regime conflict. However, this pattern is equally predicted by subword tokenizer fragmentation artifacts alone, which increase tokenization entropy and induce input distribution shift without any internal mode transition. The manuscript provides no controls (e.g., comparison across tokenizers with differing subword granularity, measurement of token-level vs. character-level accuracy, or ablation holding token count fixed) to isolate the hypothesized internal processing modes from these surface effects. This is load-bearing for the central explanatory claim.
  2. [regularization experiment] The regularization experiment (abstract) is reported to substantially reduce the U-shape and is taken to support the mode transition account. Yet regularizing the perturbation also directly reduces the degree of BPE boundary disruption; without a matched control that preserves fragmentation while altering other factors, the result does not discriminate between the mode-transition hypothesis and a simpler distribution-shift account.
minor comments (2)
  1. [Abstract] The abstract states that four experiments and one analysis are 'consistent with this account' but does not report effect sizes, confidence intervals, or exact insertion-rate values at which the valley minimum occurs; including these quantitative details would strengthen verifiability.
  2. The manuscript would benefit from a table summarizing F1 scores or accuracy across insertion rates for each experiment, along with the precise definition of 'insertion rate' and the method of whitespace placement (e.g., uniform random vs. fixed positions).

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive review. We address the two major comments below on the tokenization entropy analysis and regularization experiment. We propose revisions to clarify the evidential weight of these results while preserving the reported findings and overall hypothesis.

read point-by-point responses
  1. Referee: The tokenization entropy analysis (described in the abstract and supporting the mode transition hypothesis) shows entropy peaking before the F1 minimum and is presented as consistent with regime conflict. However, this pattern is equally predicted by subword tokenizer fragmentation artifacts alone, which increase tokenization entropy and induce input distribution shift without any internal mode transition. The manuscript provides no controls (e.g., comparison across tokenizers with differing subword granularity, measurement of token-level vs. character-level accuracy, or ablation holding token count fixed) to isolate the hypothesized internal processing modes from these surface effects. This is load-bearing for the central explanatory claim.

    Authors: We agree that the entropy analysis alone cannot isolate hypothesized internal mode transitions from surface-level tokenizer fragmentation and distribution shift effects. The manuscript presents the entropy peak as consistent with regime conflict rather than as definitive evidence. The mode transition hypothesis rests on the joint pattern across all four experiments and the analysis. In revision we will add explicit discussion noting that the observed entropy pattern is also predicted by tokenizer artifacts, weaken the interpretive language around this specific result, and include a limitations paragraph on the absence of the suggested controls (tokenizer comparisons, token vs. character accuracy, fixed-token ablations). This addresses the concern directly. revision: yes

  2. Referee: The regularization experiment (abstract) is reported to substantially reduce the U-shape and is taken to support the mode transition account. Yet regularizing the perturbation also directly reduces the degree of BPE boundary disruption; without a matched control that preserves fragmentation while altering other factors, the result does not discriminate between the mode-transition hypothesis and a simpler distribution-shift account.

    Authors: We acknowledge that uniform regularization of insertions necessarily reduces both disorder and the degree of BPE boundary disruption, so the experiment does not furnish a fully matched control separating fragmentation level from transition disorder. The design aimed to show that lowering the disordered character of the perturbation attenuates the valley. We will revise the manuscript to describe the regularization procedure in more detail, present the result as consistent with but not conclusive for the mode-transition account, and note the alternative distribution-shift interpretation explicitly. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical observation and hypothesis supported by independent tests.

full rationale

The paper reports an observed U-shaped performance curve under increasing word-fragmentation insertions and proposes a mode-transition hypothesis (word-level to character-level processing) as an interpretive account. The hypothesis is not derived from first principles or equations that reduce to the inputs; instead, it is evaluated through four distinct experiments (in-context learning rescue failure, regularization of perturbations, math-task replication across model strengths, and tokenization-entropy timing) plus one analysis. These tests are described as providing consistent external evidence rather than tautological fits or self-referential definitions. No load-bearing self-citations, ansatzes smuggled via prior work, or renamed known results appear in the derivation chain. The central claim remains self-contained against the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on empirical observations from described experiments and the proposed mode transition hypothesis; no explicit free parameters are stated in the abstract, but the hypothesis itself functions as an explanatory construct without independent falsifiable evidence beyond the reported tests.

axioms (1)
  • domain assumption LLMs process text via distinct word-level and character-level modes depending on input fragmentation level
    Invoked in the abstract to explain the U-shaped performance curve and valley as a transition regime.
invented entities (1)
  • mode transition hypothesis no independent evidence
    purpose: To account for the observed U-shaped performance degradation as a disordered shift between processing modes
    Proposed explanatory construct for the Text Uncanny Valley; no independent evidence outside the paper's experiments is provided.

pith-pipeline@v0.9.0 · 5531 in / 1318 out tokens · 37780 ms · 2026-05-11T02:49:42.574355+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 4 internal anchors

  1. [2]

    org/abs/2504.02733

    URL https://arxiv. org/abs/2504.02733. S. Alqahtani, M. T. Nayeem, M. T. R. Laskar, T. Mohiuddin, and M. S. Bari. Stop taking tokenizers for granted: They are core design decisions in large language models. InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8410–8432,

  2. [3]

    URL https://aclanthology.org/2026.eacl-long. 394/. Anthropic. System card: Claude Haiku 4.5. https://anthropic.com/ claude-haiku-4-5-system-card,

  3. [4]

    Y . Chai, Y . Fang, Q. Peng, and X. Li. Tokenization falling short: On subword robustness in large language models. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 1582–1599,

  4. [6]

    Training Verifiers to Solve Math Word Problems

    URLhttps://arxiv.org/abs/2110.14168. DeepSeek-AI. DeepSeek-V3 technical report.arXiv preprint arXiv:2412.19437,

  5. [7]

    URL https://arxiv.org/abs/2412.19437. H. Y . Fu, A. Shrivastava, J. Moore, P. West, C. Tan, and A. Holtzman. Absence Bench: Language models can’t see what’s missing. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track,

  6. [8]

    URLhttps://arxiv.org/abs/2405.17264. Google. Gemini 3 flash. https://cloud.google.com/blog/products/ ai-machine-learning/gemini-3-flash-for-enterprises,

  7. [10]

    URLhttps://arxiv.org/abs/2404.06654. A. Jain and A. Garimella. Knowing what’s missing: Assessing information sufficiency in question answering. InFindings of the Association for Computational Linguistics: EACL 2026, pages 4163–4174,

  8. [11]

    10 D. Jiao, D. Wang, and L. Hu. Understanding the dynamics of demonstration conflict in in-context learning.arXiv preprint arXiv:2603.04464,

  9. [13]

    URLhttps://arxiv.org/abs/2502.16781. M. Sclar, Y . Choi, Y . Tsvetkov, and A. Suhr. Quantifying language models’ sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting. In Proceedings of the International Conference on Learning Representations (ICLR),

  10. [14]

    URL https://arxiv.org/abs/2310.11324. R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units. InProceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Vol- ume 1: Long Papers), pages 1715–1725,

  11. [16]

    URLhttps://arxiv.org/abs/2407.08989. xAI. Grok 4.1 Fast and Agent Tools API.https://x.ai/news/grok-4-1-fast,

  12. [18]

    URL https://arxiv.org/abs/2505.09388. Z. Yang, S. Yang, L. Hu, and D. Wang. Word recovery in large language models enables character- level tokenization robustness.arXiv preprint arXiv:2603.10771,

  13. [19]

    org/abs/2603.10771

    URL https://arxiv. org/abs/2603.10771. A. Zhuo, X. Ning, N. Li, Y . Wang, and P. Lu. On the ability of LLMs to handle character-level perturbations: How well and how?arXiv preprint arXiv:2510.14365,

  14. [20]

    11 A Limitations Indirect mechanism evidence.Our two-modes hypothesis rests on behavioral observations and tokenization statistics rather than internal model states

    URL https: //arxiv.org/abs/2510.14365. 11 A Limitations Indirect mechanism evidence.Our two-modes hypothesis rests on behavioral observations and tokenization statistics rather than internal model states. A direct mechanistic account would require access to attention patterns or layer-wise activations, none of which are available through black-box APIs. T...

  15. [21]

    Agreement

    and evalu- ate each model on each problem independently at eleven word_fragmentation_rate levels (0.0,0.1, . . . ,1.0 ). Each API call consists of a single fragmented problem; thinking mode is disabled for all models to isolate tokenization level effects. Accuracy is measured by exact numerical match, with tolerance for minor formatting differences (e.g.,...