Recognition: no theorem link
Paraphrase-Induced Output-Mode Collapse: When LLMs Break Character Under Semantically Equivalent Inputs
Pith reviewed 2026-05-12 02:10 UTC · model grok-4.3
The pith
LLMs abandon requested bare-label formats under semantically equivalent prompt variants, even at temperature zero.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When a closed-form prompt asks for a bare label or single choice token, content-preserving lexical, syntactic, and semantic-expansion variants can push the model into conversational prose; the requested format dissolves and exact-match evaluation pipelines silently misjudge the result. On the released PARACONSIST benchmark of 150 queries with five variants each, only about 22 percent of variant responses keep the ground-truth label inside their output under whole-word answer-set match, while 78 percent drift away from the answer space entirely. Task structure, rather than model identity, is the dominant predictor of collapse, and model differentiation is carried jointly by answer consistency and length stability.
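The whole-word answer-set match is simple enough to pin down in code. Below is a minimal sketch of one plausible matcher; the paper does not publish its implementation, so the word-boundary regex and case-insensitive matching here are assumptions.

```python
import re

def whole_word_match(response: str, answer_set: set[str]) -> str | None:
    """Return the first label from answer_set that appears as a whole
    word in the response, else None. Case-insensitivity and word-boundary
    anchors are assumptions; the paper's exact rules are unpublished."""
    for label in sorted(answer_set):  # deterministic iteration order
        if re.search(rf"\b{re.escape(label)}\b", response, flags=re.IGNORECASE):
            return label
    return None

# A response that drifted into prose can still preserve the label ...
print(whole_word_match("Sure! I'd say the answer is B.", {"A", "B", "C", "D"}))  # -> "B"
# ... while fully conversational drift leaves the answer space entirely.
print(whole_word_match("That depends on several factors.", {"A", "B", "C", "D"}))  # -> None
```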
What carries the argument
prompt-variant output-mode collapse: the systematic shift from a closed-form bare-label response to open conversational prose when only the surface wording of the request changes while its substantive content is preserved.
If this is right
- Exact-match evaluation pipelines will systematically under-count correct answers whenever input wording varies.
- Task structure (closed-form label request versus open generation) predicts format stability more reliably than model size or training details.
- Robustness auditing must treat response-mode preservation as a distinct reliability target separate from answer accuracy.
- The Semantic Consistency Score decomposes stability into answer consistency, sentence-BERT similarity, and length stability, allowing targeted diagnosis (one possible formalization is sketched after this list).
- Model differentiation in the pool is jointly carried by answer consistency and length stability rather than semantic drift alone.
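Because the paper describes the three components only qualitatively (the referee's second minor comment below makes the same point), the following sketch fixes one plausible reading: per-component scores in [0, 1] combined by an unweighted mean. The encoder choice and the aggregation are assumptions, not the authors' definition.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model; the paper says only "sentence-BERT"

def semantic_consistency_score(base_resp: str, var_resp: str,
                               base_label: str | None, var_label: str | None) -> float:
    # Answer consistency: 1 if both responses resolve to the same label.
    answer = float(base_label is not None and base_label == var_label)

    # Sentence-BERT cosine similarity between the two response texts.
    emb = encoder.encode([base_resp, var_resp], convert_to_tensor=True)
    similarity = float(util.cos_sim(emb[0], emb[1]))

    # Length stability: shorter-to-longer word-count ratio in [0, 1].
    n_base, n_var = len(base_resp.split()), len(var_resp.split())
    length = min(n_base, n_var) / max(n_base, n_var, 1)

    # Unweighted mean; this aggregation is an assumption, not the paper's.
    return (answer + similarity + length) / 3.0
```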
Where Pith is reading between the lines
- Evaluation suites that fix prompt wording once and reuse it across models may mask real differences in how those models handle natural user rephrasings.
- Training or decoding interventions that explicitly penalize mode drift on closed-form tasks could improve reliability without changing model weights.
- The same collapse pattern may appear in agent loops or tool-use settings where downstream code expects a rigid JSON or token format.
- If length stability and answer consistency track together, simple length-constrained decoding at inference time might mitigate much of the observed failure.
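The last conjecture is cheap to prototype: cap generation at a couple of tokens so the model has no room to drift into prose. The sketch below assumes an OpenAI-style chat client; the client, model name, and token budget are illustrative assumptions, not something the paper evaluates.

```python
from openai import OpenAI

client = OpenAI()  # hypothetical setup; assumes OPENAI_API_KEY is set

def constrained_label(prompt: str, model: str = "gpt-4o-mini") -> str:
    """Cap the completion at two tokens so there is no room for prose:
    either the bare label comes out first or the failure is visible."""
    resp = client.chat.completions.create(
        model=model,                 # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=2,
    )
    return resp.choices[0].message.content.strip()
```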
Load-bearing premise
That the five prompt variants created for each base query remain strictly content-preserving and introduce no hidden cues that would legitimately call for a longer or more conversational reply.
What would settle it
Run the same five models at temperature zero on the 150 base queries and their 750 variants while logging whether the first output token is always the bare label or choice token required by the original task; collapse rate below 10 percent would falsify the reported mode-shift pattern.
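That experiment reduces to a small harness. A sketch under stated assumptions: `generate` is any callable returning a model's temperature-zero completion, and `benchmark` yields (prompt, answer_set) pairs; neither interface is part of the released artifact.

```python
def collapse_rate(benchmark, models, generate) -> float:
    """Fraction of (prompt, model) runs whose output does not open with
    the required bare label. Below 0.10 would falsify the reported
    mode-shift pattern; the interfaces here are assumed, not released."""
    total = collapsed = 0
    for prompt, answer_set in benchmark:
        for model in models:
            tokens = generate(model, prompt).strip().split()
            first = tokens[0].strip(".,:;\"'") if tokens else ""
            total += 1
            if first not in answer_set:
                collapsed += 1  # first token is not the bare label
    return collapsed / total
```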
Original abstract
When the substantive content of a request is rewritten, do large language models still answer in the format the original task asked for? We find that they often do not, even at temperature zero. On a 150-query evaluation over five compact 2025-era LLMs and four task types, we observe a systematic failure mode we call prompt-variant output-mode collapse: when a closed-form prompt asks for a bare label or a single choice token, content-preserving prompt variants can push the model into conversational prose, the requested format dissolves, and exact-match evaluation pipelines silently misjudge the result. To make this measurable, we release PARACONSIST, a 900-prompt benchmark of 150 base queries with five lexical, syntactic, and semantic-expansion prompt variants each, and a Semantic Consistency Score that decomposes prompt-variant robustness into answer consistency, sentence-BERT semantic similarity, and length stability. Under a whole-word answer-set match, only ~22% of closed-form variant responses preserve the ground-truth label inside their output, while ~78% drift away from the answer space entirely. In our pool, the dominant predictor of collapse is task structure rather than model identity, with model differentiation jointly carried by answer consistency and length stability. Robustness audits should therefore track response-mode preservation as a first-class reliability target alongside answer accuracy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLMs exhibit prompt-variant output-mode collapse: even at temperature zero, semantically equivalent paraphrases of closed-form prompts (requesting bare labels or single choice tokens) frequently cause models to produce conversational prose instead of the requested format, leading exact-match evaluations to misjudge results. This is measured on the released PARACONSIST benchmark (150 base queries × 5 variants = 900 prompts) across five compact 2025-era LLMs and four task types. Under whole-word answer-set match, only ~22% of variant responses preserve the ground-truth label while ~78% drift from the answer space. A Semantic Consistency Score is introduced that decomposes robustness into answer consistency, sentence-BERT similarity, and length stability; task structure is reported as the dominant predictor over model identity.
Significance. If the central empirical claim is substantiated, the work would be significant for identifying an under-studied reliability failure mode in LLMs that affects structured-output applications and automated evaluation pipelines. The public release of the PARACONSIST benchmark and the decomposition into observable consistency metrics constitute concrete contributions that enable future robustness audits to treat response-mode preservation as a first-class target alongside accuracy.
major comments (2)
- [Benchmark Construction] Benchmark Construction section: The paper asserts that the five variants per base query are content-preserving rewrites, yet provides no description of the generation procedure for lexical, syntactic, and semantic-expansion variants nor any human or automated validation confirming that each variant continues to explicitly request a bare label or single-token output. Semantic expansions in particular can introduce softening or explanatory phrasing that legitimately shifts the expected response mode; without such checks the ~78% drift rate risks conflating model fragility with prompt-engineering differences. This assumption is load-bearing for the central claim.
- [Experimental Results] Experimental Results section: The reported ~22% preservation and ~78% drift figures lack accompanying statistical significance tests, confidence intervals, or controls for prompt-length variation across variants. The claim that task structure dominates model identity likewise requires per-task breakdowns and explicit comparison of effect sizes to be fully supported; absent these, the predictor analysis remains preliminary.
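For reference, the requested interval is a few lines of code once per-response preservation flags exist; a percentile-bootstrap sketch follows (the flag vector is illustrative input, not the paper's data).

```python
import numpy as np

def bootstrap_ci(flags, n_boot: int = 10_000, alpha: float = 0.05, seed: int = 0):
    """95% (by default) percentile-bootstrap CI for a preservation rate,
    given per-response 0/1 flags. Illustrative input, not the paper's data."""
    rng = np.random.default_rng(seed)
    flags = np.asarray(flags, dtype=float)
    rates = rng.choice(flags, size=(n_boot, flags.size), replace=True).mean(axis=1)
    lo, hi = np.quantile(rates, [alpha / 2, 1 - alpha / 2])
    return flags.mean(), (lo, hi)

# e.g. 900 variant responses at a ~22% preservation rate (synthetic):
demo = (np.random.default_rng(1).random(900) < 0.22).astype(int)
print(bootstrap_ci(demo))
```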
minor comments (2)
- [Abstract] Abstract: The four task types are referenced but not enumerated; a short parenthetical list would improve readability.
- [Method] Semantic Consistency Score definition: The three components are described qualitatively; an explicit formula or pseudocode showing how answer match, sentence-BERT similarity, and length stability are aggregated would reduce ambiguity.
Simulated Author's Rebuttal
Thank you for the constructive referee report on our manuscript. We appreciate the identification of gaps in procedural transparency and statistical rigor. We address each major comment below with plans for revision.
Point-by-point responses
Referee: [Benchmark Construction] Benchmark Construction section: The paper asserts that the five variants per base query are content-preserving rewrites, yet provides no description of the generation procedure for lexical, syntactic, and semantic-expansion variants nor any human or automated validation confirming that each variant continues to explicitly request a bare label or single-token output. Semantic expansions in particular can introduce softening or explanatory phrasing that legitimately shifts the expected response mode; without such checks the ~78% drift rate risks conflating model fragility with prompt-engineering differences. This assumption is load-bearing for the central claim.
Authors: We agree the manuscript lacks an explicit description of variant generation and validation. In revision we will add a dedicated subsection detailing the procedures: lexical variants were produced via targeted synonym replacement from a curated thesaurus while preserving imperative structure; syntactic variants used clause reordering and voice changes; semantic expansions added equivalent explanatory clauses but retained the explicit bare-label directive at the end. We will report that an automated prompt classifier (fine-tuned to detect closed-form requests) was applied to all 900 prompts, with 100% passing the check, plus manual inspection of a 30% random sample confirming no softening of the output-mode request. This will demonstrate that the observed drift is attributable to model behavior rather than prompt engineering differences. revision: yes
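As a concrete illustration of the automated check (the rebuttal describes a fine-tuned classifier, which this does not reproduce), a crude regex screen for the bare-label directive might look like the following; the patterns are hypothetical.

```python
import re

# Hypothetical directive patterns; the rebuttal's actual check is a
# fine-tuned prompt classifier, which this regex screen does not reproduce.
DIRECTIVE_PATTERNS = [
    r"answer with (only |just )?(the |a |one )?(label|letter|token|word)",
    r"respond with a single (label|letter|token|word)",
    r"output only the",
]

def keeps_bare_label_directive(variant_prompt: str) -> bool:
    """True if the variant still carries an explicit bare-label directive."""
    return any(re.search(p, variant_prompt, flags=re.IGNORECASE)
               for p in DIRECTIVE_PATTERNS)

# Variants failing this screen would need manual review before being
# counted as evidence of model-side output-mode collapse.
```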
Referee: [Experimental Results] Experimental Results section: The reported ~22% preservation and ~78% drift figures lack accompanying statistical significance tests, confidence intervals, or controls for prompt-length variation across variants. The claim that task structure dominates model identity likewise requires per-task breakdowns and explicit comparison of effect sizes to be fully supported; absent these, the predictor analysis remains preliminary.
Authors: We concur that additional statistical support is required. The revised manuscript will include 95% bootstrap confidence intervals on the overall preservation rate and per-variant drift rates. We will add per-task tables reporting preservation percentages for each of the four task types, together with a mixed-effects logistic regression that quantifies the relative contribution of task structure versus model identity, including effect sizes (eta-squared). Prompt-length variation will be controlled by reporting mean token lengths per variant type and including length as a covariate in the regression; a supplementary analysis will show that length differences do not predict collapse once task is accounted for. These changes will place the dominance claim on firmer empirical footing. revision: yes
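For the regression, a fixed-effects approximation of the described mixed-effects model can be expressed in a few lines of statsmodels; the data frame below is synthetic and the task names are placeholders, since the manuscript never enumerates the four task types.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the per-response table the rebuttal describes;
# task names are placeholders (the four task types are not enumerated).
rng = np.random.default_rng(0)
n = 900
df = pd.DataFrame({
    "task": rng.choice(["task_a", "task_b", "task_c", "task_d"], n),
    "model": rng.choice([f"model_{i}" for i in range(5)], n),
    "prompt_len": rng.integers(20, 120, n),
})
closed_form = df["task"].isin(["task_a", "task_b"])
df["preserved"] = (rng.random(n) < np.where(closed_form, 0.15, 0.40)).astype(int)

# Fixed-effects approximation of the described mixed-effects logistic model,
# with prompt length included as the covariate the referee asked for.
fit = smf.logit("preserved ~ C(task) + C(model) + prompt_len", data=df).fit(disp=False)
print(fit.summary())  # compare task vs. model coefficient magnitudes and CIs
```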
Circularity Check
Empirical benchmark with observable metrics; no derivations or self-referential reductions
full rationale
The paper is an empirical measurement study. It introduces the PARACONSIST benchmark and defines the Semantic Consistency Score explicitly in terms of observable quantities (answer consistency via whole-word match, sentence-BERT similarity, length stability). No equations, fitted parameters, or self-citations appear as load-bearing steps that reduce the claimed results to inputs by construction. Task-structure dominance and collapse rates are reported from direct evaluation across models and variants, not from any renaming, ansatz smuggling, or uniqueness theorem. The work is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Temperature-zero decoding produces deterministic outputs for a given prompt (directly testable; see the sketch after this list).
- domain assumption Sentence-BERT embeddings provide a reliable measure of semantic similarity between prompts.
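The first axiom is testable against any deployment: re-issue an identical request several times at temperature zero and compare outputs verbatim. A sketch assuming an OpenAI-style client (in practice, not every serving stack guarantees determinism, which is why the ledger records this as an assumption rather than a fact):

```python
from openai import OpenAI

client = OpenAI()  # same assumed client as in the earlier sketch

def is_deterministic(prompt: str, n: int = 3, model: str = "gpt-4o-mini") -> bool:
    """Re-issue the identical temperature-zero request n times and compare
    outputs verbatim; serving stacks do not all guarantee this in practice."""
    outputs = set()
    for _ in range(n):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        outputs.add(resp.choices[0].message.content)
    return len(outputs) == 1
```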
invented entities (2)
- PARACONSIST benchmark: no independent evidence
- Semantic Consistency Score: no independent evidence
Reference graph
Works this paper leans on
- [1] T. Brown et al., "Language Models are Few-Shot Learners," in Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
- [2] A. Chowdhery et al., "PaLM: Scaling Language Modeling with Pathways," J. Machine Learning Research, vol. 24, no. 240, pp. 1–113, 2023.
- [3] OpenAI, "GPT-4 Technical Report," arXiv preprint arXiv:2303.08774, 2023.
- [4] M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh, "Beyond Accuracy: Behavioral Testing of NLP Models with CheckList," in Proc. ACL, pp. 4902–4912, 2020.
- [5] K. Zhu et al., "PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts," arXiv preprint arXiv:2306.04528, 2023.
- [6] X. Wang et al., "Self-Consistency Improves Chain of Thought Reasoning in Language Models," in ICLR, 2023.
- [7] X. Chen et al., "How Robust is GPT-3.5 to Predecessors? A Comprehensive Study on Language Understanding Tasks," arXiv preprint arXiv:2303.00293, 2023.
- [8] A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson, "Universal and Transferable Adversarial Attacks on Aligned Language Models," arXiv preprint arXiv:2307.15043, 2023.
- [9] A. Wei, N. Haghtalab, and J. Steinhardt, "Jailbroken: How Does LLM Safety Training Fail?," in NeurIPS, vol. 36, 2023.
- [10] F. Perez and I. Ribeiro, "Ignore Previous Prompt: Attack Techniques for Language Models," arXiv preprint arXiv:2211.09527, 2022.
- [11] M. Sclar, Y. Choi, Y. Tsvetkov, and A. Suhr, "Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design," arXiv preprint arXiv:2310.11324, 2023.
- [12] L. Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," in NeurIPS, vol. 36, 2023.
- [13] N. Madnani and B. J. Dorr, "Generating Phrasal and Sentential Paraphrases: A Survey," Computational Linguistics, vol. 36, no. 3, pp. 341–387, 2010.
- [14] C. Bannard and C. Callison-Burch, "Paraphrasing with Bilingual Parallel Corpora," in Proc. ACL, pp. 597–604, 2005.
- [15] A. Prakash et al., "Neural Paraphrase Generation with Stacked Residual LSTM Networks," in Proc. COLING, pp. 2923–2934, 2016.
- [16] T. Jiang et al., "PromptBERT: Improving BERT Sentence Embeddings with Prompts," arXiv preprint arXiv:2201.04337, 2022.
- [17] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks," in Proc. EMNLP, pp. 3982–3992, 2019.
- [18] D. Cer et al., "SemEval-2017 Task 1: Semantic Textual Similarity," in Proc. SemEval, pp. 1–14, 2017.
- [19] D. Hendrycks et al., "Measuring Massive Multitask Language Understanding," in ICLR, 2021.
- [20] P. Clark et al., "Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge," arXiv preprint arXiv:1803.05457, 2018.
- [21] S. Narayan, S. B. Cohen, and M. Lapata, "Don't Give Me the Details, Just the Summary!," in Proc. EMNLP, pp. 1797–1807, 2018.
- [22] X. Zhang, J. Zhao, and Y. LeCun, "Character-level Convolutional Networks for Text Classification," in NeurIPS, vol. 28, 2015.
- [23] A. Celikyilmaz, E. Clark, and J. Gao, "Evaluation of Text Generation: A Survey," arXiv preprint arXiv:2006.14799, 2020.
- [24] J. Maynez, S. Narayan, B. Bohnet, and R. McDonald, "On Faithfulness and Factuality in Abstractive Summarization," in Proc. ACL, pp. 1906–1919, 2020.
- [25] Y. Elazar, N. Kassner, S. Ravfogel, A. Ravichander, E. Hovy, H. Schütze, and Y. Goldberg, "Measuring and Improving Consistency in Pretrained Language Models," Transactions of the Association for Computational Linguistics, vol. 9, pp. 1012–1031, 2021.
- [26] H. Raj, D. Rosati, and S. Majumdar, "Measuring Reliability of Large Language Models through Semantic Consistency," arXiv preprint arXiv:2211.05853, 2022.
- [27] M. Mizrahi, G. Kaplan, D. Malkin, R. Dror, D. Shahaf, and G. Stanovsky, "State of What Art? A Call for Multi-Prompt LLM Evaluation," Transactions of the Association for Computational Linguistics, vol. 12, pp. 933–949, 2024.
- [28] F. Errica, D. Sanvito, G. Siracusano, and R. Bifulco, "What Did I Do Wrong? Quantifying LLMs' Sensitivity and Consistency to Prompt Engineering," in Proc. NAACL, 2025.
- [29] A. Chatterjee, H. S. V. N. S. Kowndinya Renduchintala, S. Bhatia, and T. Chakraborty, "POSIX: A Prompt Sensitivity Index For Large Language Models," in Findings of EMNLP, 2024.
- [30] A. Liu, L. Tang, T. Pan, Y. Yin, B. Wang, and A. Yang, "PiCo: Jailbreaking Multimodal Large Language Models via Pictorial Code Contextualization," in Proc. IEEE International Conference on Multimedia and Expo (ICME), 2025.
- [31] A. Yang, B. Wang, A. Liu, and H. Li, "Automatically Generated Multi-Agent Framework for Jailbreaking Large Language Models," in Proc. International Conference on Artificial Intelligence and Industrial Engineering, 2025.