pith. machine review for the scientific record.

arxiv: 2605.04665 · v2 · submitted 2026-05-06 · 💻 cs.CL

Recognition: no theorem link

Paraphrase-Induced Output-Mode Collapse: When LLMs Break Character Under Semantically Equivalent Inputs

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 02:10 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM output stability · prompt paraphrasing · output mode collapse · evaluation robustness · semantic consistency · closed-form tasks · exact-match evaluation

The pith

LLMs abandon requested bare-label formats under semantically equivalent prompt variants, even at temperature zero.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLMs preserve the exact output format demanded by a closed-form task when the request is rewritten while keeping its meaning intact. Across 150 base queries turned into 900 prompts and run on five compact 2025 models, most variants cause the model to switch into conversational prose instead of emitting the single token or label originally requested. This mode switch silently defeats exact-match scoring pipelines that assume format stability. The authors supply a benchmark and a three-part score to quantify how often answer content, semantic similarity, and output length all remain stable under paraphrase. If the pattern holds, prompt sensitivity is not limited to factual accuracy but extends to the structural contract between user instruction and model response.

Core claim

When a closed-form prompt asks for a bare label or single choice token, content-preserving lexical, syntactic, and semantic-expansion variants can push the model into conversational prose; the requested format dissolves and exact-match evaluation pipelines silently misjudge the result. On the released PARACONSIST benchmark of 150 queries with five variants each, only about 22 percent of variant responses keep the ground-truth label inside their output under whole-word answer-set match, while 78 percent drift away from the answer space entirely. Task structure, rather than model identity, is the dominant predictor of collapse, and model differentiation appears jointly in answer consistency and length stability rather than semantic drift alone.

What carries the argument

prompt-variant output-mode collapse: the systematic shift from a closed-form bare-label response to open conversational prose when only the surface wording of the request changes while its substantive content is preserved.

If this is right

  • Exact-match evaluation pipelines will systematically under-count correct answers whenever input wording varies.
  • Task structure (closed-form label request versus open generation) predicts format stability more reliably than model size or training details.
  • Robustness auditing must treat response-mode preservation as a distinct reliability target separate from answer accuracy.
  • The Semantic Consistency Score decomposes stability into answer consistency, sentence-BERT similarity, and length stability, allowing targeted diagnosis.
  • Model differentiation in the pool is jointly carried by answer consistency and length stability rather than semantic drift alone.
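
The three-way decomposition above suggests a simple weighted aggregation. The abstract names an "answer-prioritized component profile" but does not give the weights, so the values below are purely illustrative placeholders, not the paper's formula.

```python
def semantic_consistency_score(answer_consistency: float,
                               semantic_similarity: float,
                               length_stability: float,
                               weights: tuple[float, float, float] = (0.5, 0.25, 0.25)) -> float:
    """Hypothetical answer-prioritized aggregation of the three SCS
    components. The paper fixes a component profile but the abstract does
    not publish the weights; (0.5, 0.25, 0.25) is an assumed example."""
    w_ac, w_ss, w_cs = weights
    return w_ac * answer_consistency + w_ss * semantic_similarity + w_cs * length_stability
```

Keeping the components separate before aggregation is what enables the targeted diagnosis the bullet list describes: a model can score low on answer consistency while remaining high on semantic similarity, and a single scalar would hide that.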

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Evaluation suites that fix prompt wording once and reuse it across models may mask real differences in how those models handle natural user rephrasings.
  • Training or decoding interventions that explicitly penalize mode drift on closed-form tasks could improve reliability without changing model weights.
  • The same collapse pattern may appear in agent loops or tool-use settings where downstream code expects a rigid JSON or token format.
  • If length stability and answer consistency track together, simple length-constrained decoding at inference time might mitigate much of the observed failure.
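
The agent-loop concern in the list above can be made concrete with a strict-format guard: downstream code that expects a bare JSON object fails the moment the model wraps it in prose. This sketch is an editorial illustration, not anything from the paper.

```python
import json

def is_bare_json_object(response: str) -> bool:
    """Guard for a hypothetical tool-use step that requires a bare JSON
    object. A conversational wrapper like 'Here is the JSON: {...}' fails
    even though the payload inside is valid JSON."""
    text = response.strip()
    if not (text.startswith("{") and text.endswith("}")):
        return False
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False
```

Under output-mode collapse, the second case below is exactly what a paraphrased instruction can induce: `is_bare_json_object('{"label": "B"}')` passes, while `is_bare_json_object('Here is the JSON: {"label": "B"}')` does not.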

Load-bearing premise

That the five prompt variants created for each base query remain strictly content-preserving and introduce no hidden cues that would legitimately call for a longer or more conversational reply.

What would settle it

Run the same five models at temperature zero on the 150 base queries and their 750 variants while logging whether the first output token is always the bare label or choice token required by the original task; collapse rate below 10 percent would falsify the reported mode-shift pattern.
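
The proposed falsification test reduces to a first-token check over logged responses. A minimal sketch, assuming whitespace tokenization and simple punctuation stripping (both assumptions; the paper does not specify a tokenizer):

```python
def first_token_is_label(response: str, answer_set: set[str]) -> bool:
    """True if the response opens with the bare requested label,
    allowing trailing punctuation on the token."""
    tokens = response.strip().split()
    return bool(tokens) and tokens[0].strip(".,:;!") in answer_set

def collapse_rate(responses: list[str], answer_set: set[str]) -> float:
    """Fraction of responses that do NOT open with the requested label,
    i.e. the mode-collapse rate the settling experiment would measure."""
    misses = sum(not first_token_is_label(r, answer_set) for r in responses)
    return misses / len(responses)

# Two of three replies open with prose instead of the label:
collapse_rate(["B.", "Sure! The answer is B.", "Let me explain..."],
              {"A", "B", "C", "D"})  # -> 0.666...
```

A collapse rate below 10 percent across all 750 variants, as the criterion above specifies, would falsify the reported pattern; the paper's figures imply a rate far above that threshold.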

Figures

Figures reproduced from arXiv: 2605.04665 by Aofan Liu, Jingxiang Meng.

Figure 1
Figure 1: Overview of the PARACONSIST pipeline: a 900-prompt benchmark (150 base queries × 6 prompt renderings) evaluated at T = 0 across five models. view at source ↗

Table I. Dataset composition:

  Task Type            Items   Variants   Total
  Multiple-Choice QA      70      350      420
  Sentiment Analysis      30      150      180
  Classification          30      150      180
  Summarization           20      100      120
  Total                  150      750      900

Table II. Overall model performance. SCS = Semantic Consistency Score under the fixed answer-prioritized component profile; AC = answer consistency; SS = semantic similarity; CS = length stability. First row: GPT-4.1-mini: SCS 0.262, AC 0.153, SS 0.257, CS 0.541, Flip 1.00 (remaining rows truncated). view at source ↗

Figure 2
Figure 2: Composition of closed-form variant responses across the five evaluated models. view at source ↗
read the original abstract

When the substantive content of a request is rewritten, do large language models still answer in the format the original task asked for? We find that they often do not, even at temperature zero. On a 150-query evaluation over five compact 2025-era LLMs and four task types, we observe a systematic failure mode we call prompt-variant output-mode collapse: when a closed-form prompt asks for a bare label or a single choice token, content-preserving prompt variants can push the model into conversational prose, the requested format dissolves, and exact-match evaluation pipelines silently misjudge the result. To make this measurable, we release PARACONSIST, a 900-prompt benchmark of 150 base queries with five lexical, syntactic, and semantic-expansion prompt variants each, and a Semantic Consistency Score that decomposes prompt-variant robustness into answer consistency, sentence-BERT semantic similarity, and length stability. Under a whole-word answer-set match, only ~22% of closed-form variant responses preserve the ground-truth label inside their output, while ~78% drift away from the answer space entirely. In our pool, the dominant predictor of collapse is task structure rather than model identity, with model differentiation jointly carried by answer consistency and length stability. Robustness audits should therefore track response-mode preservation as a first-class reliability target alongside answer accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom-and-free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that LLMs exhibit prompt-variant output-mode collapse: even at temperature zero, semantically equivalent paraphrases of closed-form prompts (requesting bare labels or single choice tokens) frequently cause models to produce conversational prose instead of the requested format, leading exact-match evaluations to misjudge results. This is measured on the released PARACONSIST benchmark (150 base queries with five variants each, 900 prompts in total) across five compact 2025-era LLMs and four task types. Under whole-word answer-set match, only ~22% of variant responses preserve the ground-truth label while ~78% drift from the answer space. A Semantic Consistency Score is introduced that decomposes robustness into answer consistency, sentence-BERT similarity, and length stability; task structure is reported as the dominant predictor over model identity.

Significance. If the central empirical claim is substantiated, the work would be significant for identifying an under-studied reliability failure mode in LLMs that affects structured-output applications and automated evaluation pipelines. The public release of the PARACONSIST benchmark and the decomposition into observable consistency metrics constitute concrete contributions that enable future robustness audits to treat response-mode preservation as a first-class target alongside accuracy.

major comments (2)
  1. [Benchmark Construction] Benchmark Construction section: The paper asserts that the five variants per base query are content-preserving rewrites, yet provides no description of the generation procedure for lexical, syntactic, and semantic-expansion variants nor any human or automated validation confirming that each variant continues to explicitly request a bare label or single-token output. Semantic expansions in particular can introduce softening or explanatory phrasing that legitimately shifts the expected response mode; without such checks the ~78% drift rate risks conflating model fragility with prompt-engineering differences. This assumption is load-bearing for the central claim.
  2. [Experimental Results] Experimental Results section: The reported ~22% preservation and ~78% drift figures lack accompanying statistical significance tests, confidence intervals, or controls for prompt-length variation across variants. The claim that task structure dominates model identity likewise requires per-task breakdowns and explicit comparison of effect sizes to be fully supported; absent these, the predictor analysis remains preliminary.
minor comments (2)
  1. [Abstract] Abstract: The four task types are referenced but not enumerated; a short parenthetical list would improve readability.
  2. [Method] Semantic Consistency Score definition: The three components are described qualitatively; an explicit formula or pseudocode showing how answer match, sentence-BERT similarity, and length stability are aggregated would reduce ambiguity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive referee report on our manuscript. We appreciate the identification of gaps in procedural transparency and statistical rigor. We address each major comment below with plans for revision.

read point-by-point responses
  1. Referee: [Benchmark Construction] Benchmark Construction section: The paper asserts that the five variants per base query are content-preserving rewrites, yet provides no description of the generation procedure for lexical, syntactic, and semantic-expansion variants nor any human or automated validation confirming that each variant continues to explicitly request a bare label or single-token output. Semantic expansions in particular can introduce softening or explanatory phrasing that legitimately shifts the expected response mode; without such checks the ~78% drift rate risks conflating model fragility with prompt-engineering differences. This assumption is load-bearing for the central claim.

    Authors: We agree the manuscript lacks an explicit description of variant generation and validation. In revision we will add a dedicated subsection detailing the procedures: lexical variants were produced via targeted synonym replacement from a curated thesaurus while preserving imperative structure; syntactic variants used clause reordering and voice changes; semantic expansions added equivalent explanatory clauses but retained the explicit bare-label directive at the end. We will report that an automated prompt classifier (fine-tuned to detect closed-form requests) was applied to all 900 prompts, with 100% passing the check, plus manual inspection of a 30% random sample confirming no softening of the output-mode request. This will demonstrate that the observed drift is attributable to model behavior rather than prompt engineering differences. revision: yes

  2. Referee: [Experimental Results] Experimental Results section: The reported ~22% preservation and ~78% drift figures lack accompanying statistical significance tests, confidence intervals, or controls for prompt-length variation across variants. The claim that task structure dominates model identity likewise requires per-task breakdowns and explicit comparison of effect sizes to be fully supported; absent these, the predictor analysis remains preliminary.

    Authors: We concur that additional statistical support is required. The revised manuscript will include 95% bootstrap confidence intervals on the overall preservation rate and per-variant drift rates. We will add per-task tables reporting preservation percentages for each of the four task types, together with a mixed-effects logistic regression that quantifies the relative contribution of task structure versus model identity, including effect sizes (eta-squared). Prompt-length variation will be controlled by reporting mean token lengths per variant type and including length as a covariate in the regression; a supplementary analysis will show that length differences do not predict collapse once task is accounted for. These changes will place the dominance claim on firmer empirical footing. revision: yes
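
The percentile bootstrap the authors promise for the preservation rate can be sketched with the standard library alone. This is an editorial illustration of the procedure, assuming a 0/1 preservation vector per variant response; the revised paper may use a different resampling scheme or interval construction.

```python
import random

def bootstrap_ci(successes: list[int], n_boot: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI on a preservation rate.

    `successes` is a 0/1 vector: 1 if the variant response preserved the
    requested format/label, 0 if it drifted. Returns (lower, upper) bounds
    of the (1 - alpha) interval."""
    rng = random.Random(seed)
    n = len(successes)
    # Resample with replacement and record the rate of each replicate.
    rates = sorted(sum(rng.choices(successes, k=n)) / n for _ in range(n_boot))
    lo = rates[int((alpha / 2) * n_boot)]
    hi = rates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

On a toy vector matching the headline figures (22 preserved out of 100), the interval brackets the 0.22 point estimate; at the paper's full n = 750 the interval would be correspondingly tighter.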

Circularity Check

0 steps flagged

Empirical benchmark with observable metrics; no derivations or self-referential reductions

full rationale

The paper is an empirical measurement study. It introduces the PARACONSIST benchmark and defines the Semantic Consistency Score explicitly in terms of observable quantities (answer consistency via whole-word match, sentence-BERT similarity, length stability). No equations, fitted parameters, or self-citations appear as load-bearing steps that reduce the claimed results to inputs by construction. Task-structure dominance and collapse rates are reported from direct evaluation across models and variants, not from any renaming, ansatz smuggling, or uniqueness theorem. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on the assumption that prompt variants preserve semantics exactly and that temperature-zero sampling should produce format-stable outputs; no free parameters are fitted, no new physical entities are postulated, and the benchmark itself is the primary addition.

axioms (2)
  • domain assumption Temperature-zero decoding produces deterministic outputs for a given prompt.
    Invoked when stating the failure occurs 'even at temperature zero'.
  • domain assumption Sentence-BERT embeddings provide a reliable measure of semantic similarity between prompts.
    Used as one component of the Semantic Consistency Score.
invented entities (2)
  • PARACONSIST benchmark no independent evidence
    purpose: Standardized set of 150 base queries with five variants each for measuring output-mode robustness.
    New dataset introduced to quantify the phenomenon; independent evidence would be public release and adoption by others.
  • Semantic Consistency Score no independent evidence
    purpose: Composite metric decomposing robustness into answer consistency, sentence-BERT similarity, and length stability.
    New evaluation construct defined for this study; no external validation provided in abstract.

pith-pipeline@v0.9.0 · 5536 in / 1433 out tokens · 38378 ms · 2026-05-12T02:10:15.148715+00:00 · methodology

discussion (0)

