Discourse-Role Labels as Presentation-Time Variables for Context Use in Language Models

Jianguo Zhu; Wenjie Liu; Xiangmei Li

arxiv: 2606.04109 · v2 · pith:LK3YRYPDnew · submitted 2026-06-02 · 💻 cs.CL

Pith reviewed 2026-06-28 10:12 UTC · model grok-4.3

classification 💻 cs.CL

keywords discourse labels · context utilization · language models · misleading adoption · RAG benchmarks · presentation variables · probe design · label effects

The pith

Discourse-role labels shift language model adoption of misleading context by 56-84 percentage points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether common wrapper labels around supplied context change how models incorporate that context. It attaches the same misleading assertion to each of 500 fixed MMLU-Pro questions under different labels and records how often the model selects the injected wrong answer. Results show large, consistent differences: labels like Instruction: and Reference: increase adoption while Example: reduces it. The finding matters for any system that adds context to prompts because the choice of label alone can alter measured reliance on that context.

Core claim

When the identical misleading content is presented under different discourse-role labels, adoption rates vary by 56-84 percentage points across four models. Binding labels such as Instruction: and Reference: produce high uptake of the wrong answer while Example: suppresses it. Supporting evidence includes paired statistical tests, bootstrap intervals, final-instruction ablations, and log-probability probes that together indicate a label-conditioned preference rather than simple copying or token artifacts.
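The headline quantity is a gap in Misleading Adoption Rate between two label conditions measured on the same items. The following is a minimal sketch of how such a gap and a paired percentile bootstrap interval could be computed. The function names, the resample count, and the toy outcome vectors are assumptions for illustration; they are not the authors' code or data.

```python
import random

def adoption_rate(adopted):
    """Fraction of items on which the model output the injected wrong option."""
    return sum(adopted) / len(adopted)

def paired_bootstrap_gap(adopted_a, adopted_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the gap between two label conditions,
    in percentage points. Items are resampled so each pair stays intact."""
    assert len(adopted_a) == len(adopted_b)
    rng = random.Random(seed)
    n = len(adopted_a)
    gaps = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        gaps.append(100.0 * sum(adopted_a[i] - adopted_b[i] for i in idx) / n)
    gaps.sort()
    lo = gaps[int((alpha / 2) * n_resamples)]
    hi = gaps[int((1 - alpha / 2) * n_resamples) - 1]
    point = 100.0 * (adoption_rate(adopted_a) - adoption_rate(adopted_b))
    return point, lo, hi

# Toy outcomes (NOT results from the paper): 1 = model chose the injected wrong option.
instruction = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
example = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
point, lo, hi = paired_bootstrap_gap(instruction, example)
print(f"gap = {point:.1f} pp, 95% interval [{lo:.1f}, {hi:.1f}]")
```

Resampling item indices rather than the two outcome vectors separately is what makes the interval respect the paired design.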

What carries the argument

The paired fixed-content probe that holds the misleading assertion constant and varies only the preceding discourse-role label to isolate its effect on model choice.
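A rough sketch of what such a probe could look like in code is given here. The five labels are the ones named in the abstract; the assertion wording, the prompt template, and the scoring rule are hypothetical stand-ins, since the page does not reproduce the paper's exact prompts.

```python
LABELS = ["Reference:", "Evidence:", "Instruction:", "Note:", "Example:"]

def build_paired_prompts(question, options, wrong_letter, labels=LABELS):
    """Return one prompt per label; only the leading label string differs."""
    # Hypothetical wording of the misleading, answer-bearing assertion.
    assertion = f"The correct answer is ({wrong_letter})."
    option_block = "\n".join(f"({k}) {v}" for k, v in options.items())
    prompts = {}
    for label in labels:
        prompts[label] = (
            f"{label} {assertion}\n\n"
            f"Question: {question}\n{option_block}\n"
            "Answer with a single option letter."
        )
    return prompts

def adopted(model_output, wrong_letter):
    """True when the model's output is the injected wrong option."""
    return model_output.strip().strip("().").upper() == wrong_letter.upper()

prompts = build_paired_prompts("What is 2 + 2?", {"A": "3", "B": "4"}, "A")
print(prompts["Example:"])
```

Because every prompt shares the same suffix after the label, any difference in adoption between conditions is attributable to the label string itself, whatever property of that string is responsible.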

If this is right

Context-utilization and reader-side RAG benchmarks should report and control wrapper labels because presentation choices alter measured reliance.
Arithmetic tasks reduce overall adoption while passage-shaped external context preserves smaller label gaps.
Short-answer formats rule out option-letter copying as the driver of the observed differences.
Nested-label conflicts show that illustrative framing can delimit the scope of adoption.
The effect is stable under conservative manual adjudication on a 200-case audit subset.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The label effect could interact with other prompt elements such as position or length in ways not tested here.
RAG pipeline designers might reduce unwanted context influence by choosing suppressing labels like Example: for certain content types.
The pattern may generalize to non-multiple-choice tasks if similar paired probes are run on open-ended generation.
Model-specific tokenization could still contribute at the boundary even if the main effect is semantic.

Load-bearing premise

The probe isolates the semantic discourse role conveyed by each label rather than superficial properties of the label string or tokenization differences.

What would settle it

Repeating the exact same items with the labels swapped would settle it: the claim fails if wrong-answer selection rates show no reliable difference across the 500-item set.
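One way to decide whether a difference between two label conditions is reliable on paired binary outcomes is an exact McNemar test. The sketch below is an illustration under that assumption; the page does not say which paired test the paper uses, and the toy data are not from the paper.

```python
from math import comb

def mcnemar_exact(adopted_a, adopted_b):
    """Two-sided exact McNemar test on paired binary outcomes.
    Returns (b, c, p) where b and c count the discordant pairs."""
    b = sum(1 for x, y in zip(adopted_a, adopted_b) if x and not y)
    c = sum(1 for x, y in zip(adopted_a, adopted_b) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Toy outcomes (NOT results from the paper): 1 = injected wrong option chosen.
label_a = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
label_b = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
b, c, p = mcnemar_exact(label_a, label_b)
print(b, c, p)
```

Only the discordant pairs carry information here: items where both labels led to the same outcome drop out of the test.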

Figures

Figures reproduced from arXiv: 2606.04109 by Jianguo Zhu, Wenjie Liu, Xiangmei Li.

**Figure 2.** GPT-5.5 fixed-content label probe over 500 paired MMLU-Pro items per condi…

Original abstract

Context-augmented language model systems often wrap supplied content with labels such as Reference:, Evidence:, Instruction:, Note:, or Example:, but the effect of these labels on reader-model behavior remains underexplored. We introduce a paired fixed-content probe over 500 MMLU-Pro items: each item receives the same misleading answer-bearing assertion under different discourse-role labels, and adoption is measured by whether the model outputs the injected wrong option. Across GPT-5.5, DeepSeek V4 Pro, Llama-3-8B-Instruct, and Qwen2.5-7B-Instruct, Misleading Adoption Rate shifts by 56-84 percentage points. Binding or source-like labels such as Instruction: and Reference: produce high adoption, whereas Example: consistently suppresses it. Paired tests, bootstrap intervals, final-instruction ablations, and Qwen final-step log-probability probes support a label-conditioned candidate preference. Boundary probes show where the effect weakens or persists: arithmetic tasks reduce adoption, passage-shaped external context preserves smaller label gaps, short-answer evaluation rules out option-letter copying, and nested-label conflicts suggest that illustrative framing can delimit adoption scope. A 200-case single-author manual audit confirms that the short-answer contrasts are stable under conservative adjudication. The resulting claim is bounded but practical: context-utilization and reader-side RAG benchmarks should report and control wrapper labels, because presentation choices can change measured reliance on supplied context.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Label effects on misleading adoption are large and the paired probe is worth checking in detail.

The two things to know: the paper measures large shifts in how models adopt misleading assertions depending on the discourse-role label wrapped around the same content, and the paired design across multiple models supports the pattern.

What the paper does well is lay out a clear empirical test: 500 MMLU-Pro items, each with the same wrong answer under different labels, and adoption tracked by output choice. They test four models, report paired statistical tests with bootstrap intervals, run ablations including final-instruction variants and log-probability probes, and back it with a single-author manual audit on 200 cases. The boundary probes on task type and context shape add some nuance about when the effect holds or weakens. This is solid for an empirical claim about presentation effects.

The soft spots are proportionate to the evidence. The main one is whether the effect is truly about the discourse role or about properties of the label strings themselves, such as length, tokenization, or frequency in training data. The stress-test note raises this, and the abstract does not explicitly describe controls that keep the string fixed while changing only the role semantics. The other ablations help but may not fully close that gap. Data exclusions and model-specific behaviors are mentioned as checked via the audit, so that part looks handled.

This work is aimed at people who design or evaluate context-augmented language models and RAG pipelines. Anyone running benchmarks on context use should take note that wrapper labels can dominate the measured reliance. It is the kind of targeted finding that merits a serious referee to examine the full methods and confirm the numbers.

I would recommend sending it out for peer review rather than desk rejecting it.

Referee Report

2 major / 1 minor

Summary. The paper claims that discourse-role labels (e.g., Instruction:, Reference:, Example:) function as presentation-time variables that strongly modulate language models' adoption of misleading assertions in context-augmented settings. Using a paired fixed-content probe across 500 MMLU-Pro items—where the same wrong answer is wrapped under different labels—adoption rates shift by 56-84 percentage points across four models (GPT-5.5, DeepSeek V4 Pro, Llama-3-8B-Instruct, Qwen2.5-7B-Instruct). Binding labels increase adoption while Example: suppresses it; the result is supported by paired statistical tests, bootstrap intervals, final-instruction ablations, log-probability probes, boundary conditions (arithmetic tasks, passage context, short-answer format), nested-label conflicts, and a 200-case manual audit. The practical conclusion is that context-utilization and RAG benchmarks must report and control wrapper labels.

Significance. If the probe isolates discourse-role semantics rather than surface features, the result is significant for context-augmented LM research: it demonstrates that label choice can dominate content in measured reliance, with direct implications for benchmark design and prompt engineering. The work is strengthened by its use of paired tests, bootstrap intervals, multiple ablations, and a manual audit, providing falsifiable, reproducible empirical measurements rather than fitted parameters.

major comments (2)

[Abstract / paired fixed-content probe] The paired fixed-content probe (Abstract and described methods) applies labels that differ systematically in length, token count, pretraining frequency, and tokenization boundaries across the tested models, yet no ablation or control holds these string properties constant while varying only the intended discourse role. The final-instruction and log-probability ablations do not address this, so the 56-84 pp shifts and label-conditioned preference claim do not yet follow from the design.
[Boundary probes] Boundary probes (arithmetic tasks, passage-shaped context, short-answer evaluation, nested-label conflicts) are reported to show where effects weaken or persist, but without string-matched controls it remains unclear whether these modulations reflect role semantics or interactions with the same surface properties.
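A string-matched control of the kind these comments ask for could start by profiling each label's surface properties and keeping only label pairs that agree on them. The sketch below is hypothetical: `toy_tokenize` is a stand-in that must be replaced by each model's real tokenizer, `Command:` is an invented candidate label that does not appear in the paper, and pretraining frequency is not covered at all.

```python
from itertools import combinations

def surface_profile(label, tokenize):
    """Character and token counts for one label string."""
    return {"chars": len(label), "tokens": len(tokenize(label))}

def matched_pairs(labels, tokenize):
    """Label pairs whose character and token counts are both equal."""
    profiles = {lab: surface_profile(lab, tokenize) for lab in labels}
    return [(a, b) for a, b in combinations(labels, 2) if profiles[a] == profiles[b]]

def toy_tokenize(text):
    # Stand-in only: fixed 4-character chunks, NOT a real model tokenizer.
    return [text[i:i + 4] for i in range(0, len(text), 4)]

paper_labels = ["Reference:", "Evidence:", "Instruction:", "Note:", "Example:"]
candidates = paper_labels + ["Command:"]  # hypothetical binding-style label
pairs = matched_pairs(candidates, toy_tokenize)
print(pairs)
```

A pair that matches on these counts but differs in intended role (illustrative versus binding) gives a comparison in which length and token count cannot explain the gap.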

minor comments (1)

[manual audit] The 200-case manual audit is described as single-author; reporting inter-annotator agreement or a second annotator on a subset would strengthen the short-answer contrast claims.
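If a second annotator were added, agreement on the audited subset could be summarized with Cohen's kappa. The following is a minimal sketch; the category names and the toy ratings are assumptions for illustration, not audit data from the paper.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two annotators rating the same cases."""
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical category
    return (observed - expected) / (1 - expected)

# Toy ratings (NOT audit data from the paper).
annotator_1 = ["adopt", "adopt", "reject", "reject", "adopt", "reject"]
annotator_2 = ["adopt", "adopt", "reject", "adopt", "adopt", "reject"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```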

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive critique of our paired fixed-content probe design. We address each major comment below and agree that surface properties of the labels were not controlled, which limits isolation of discourse-role semantics. We will revise the manuscript to qualify our claims accordingly.

Referee: [Abstract / paired fixed-content probe] The paired fixed-content probe (Abstract and described methods) applies labels that differ systematically in length, token count, pretraining frequency, and tokenization boundaries across the tested models, yet no ablation or control holds these string properties constant while varying only the intended discourse role. The final-instruction and log-probability ablations do not address this, so the 56-84 pp shifts and label-conditioned preference claim do not yet follow from the design.

Authors: We agree that the labels vary in length, token count, pretraining frequency, and tokenization, and that the final-instruction and log-probability ablations do not hold these surface properties constant. Our probe tests the effects of standard discourse-role labels as they appear in real context-augmented use rather than isolating pure role semantics from all surface features. The large, consistent shifts across four models with different tokenizers still demonstrate that label choice can dominate measured context reliance in practice. We will revise the abstract, methods, and discussion to explicitly state this scope limitation and note that stronger isolation would require additional string-matched controls. revision: partial

Referee: [Boundary probes] Boundary probes (arithmetic tasks, passage-shaped context, short-answer evaluation, nested-label conflicts) are reported to show where effects weaken or persist, but without string-matched controls it remains unclear whether these modulations reflect role semantics or interactions with the same surface properties.

Authors: We acknowledge the same limitation applies to the boundary probes: without string-matched controls, observed modulations (e.g., weaker effects on arithmetic tasks or with passage context) could reflect surface-property interactions rather than role semantics alone. We will update the boundary-conditions section to discuss this caveat explicitly while retaining the practical observation that label effects vary by task format. This qualifies rather than overturns the main finding that wrapper labels should be reported in benchmarks. revision: partial

Circularity Check

0 steps flagged

No circularity: claims rest on direct empirical measurements

The paper reports an experimental probe measuring adoption rates of misleading assertions under varying discourse-role labels across multiple models, supported by paired tests, ablations, bootstrap intervals, and manual audits. No derivation chain, equations, fitted parameters presented as predictions, or self-citations load-bearing on uniqueness theorems appear in the described methods or claims. The central result (label-conditioned shifts of 56-84 pp) is obtained by direct output measurement rather than reduction to inputs by construction. This is a standard empirical study whose validity can be assessed against external benchmarks without internal circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical assumption that the paired probe isolates label effects; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: Language models interpret discourse-role labels as conveying distinct pragmatic roles that modulate context utilization.
This interpretation is presupposed by the experimental design rather than derived from first principles.
