Recognition: unknown
On Stable Long-Form Generation: Benchmarking and Mitigating Length Volatility
Pith reviewed 2026-05-09 15:04 UTC · model grok-4.3
The pith
Large language models exhibit unstable output lengths when generating long-form text, but a lightweight decoding adjustment raises average length by 148 percent and cuts volatility by 69 percent without retraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that severe length volatility is widespread in mainstream LLMs for long-form tasks, driven by specific attention patterns, and that GLoBo mitigates it effectively. Experiments confirm an average 148% increase in output length and 69% drop in volatility on VOLTBench, with no loss in generation quality.
What carries the argument
GLoBo, a logits-boosting strategy applied at the decoding stage to promote stable continuation in long-form text generation.
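The review does not spell out GLoBo's update rule. A minimal sketch of what a decoding-stage logit boost for length stability could look like, assuming it suppresses the end-of-sequence token and favors continuation until a target length is reached; the function name, the `penalty`/`boost` magnitudes, and the linear schedule are illustrative, not the paper's actual formulation:

```python
def boost_logits(logits, eos_id, tokens_generated, target_len,
                 penalty=8.0, boost=1.5):
    """Decoding-stage length stabilizer: suppress the EOS logit and boost
    continuation logits until the target length is reached.

    The linear schedule and the `penalty`/`boost` magnitudes are
    illustrative guesses, not the paper's GLoBo formulation.
    """
    out = list(logits)
    if tokens_generated < target_len:
        # Suppression weakens as generation approaches the target length.
        remaining = 1.0 - tokens_generated / target_len
        out[eos_id] -= penalty * remaining
        for i in range(len(out)):
            if i != eos_id:
                out[i] += boost * remaining  # favor continuation tokens
    return out
```

Applied at each decoding step before sampling; once the target length is reached the logits pass through unchanged, so the model can still terminate naturally.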
Load-bearing premise
The attention patterns identified through probing are the actual drivers of length volatility rather than just correlated symptoms, and the VOLTBench tasks represent the length consistency requirements of real-world long-form use cases.
What would settle it
A replication study on new long-form generation tasks showing that GLoBo produces no statistically significant change in length mean or variance would falsify the mitigation results.
Original abstract
Large Language Models (LLMs) excel at long-context understanding but exhibit significant limitations in long-form generation. Existing studies primarily focus on single-generation quality, generally overlooking the volatility of the output. This volatility not only leads to significant computational costs but also severely impacts the models' reliable application. To address this gap, our work unfolds in three stages: benchmarking, probing, and mitigation. We first propose the VOlatility in Long-form Text Benchmark (VOLTBench), a novel heterogeneous-task benchmark designed to systematically quantify the length volatility of long-form generation. Subsequently, by analyzing attention traces, we conduct an in-depth probe to identify several common internal patterns that cause this volatility. Finally, to mitigate long-form output volatility, we propose Stable Generation via Logits Boosting (GLoBo), a lightweight decoding-stage optimization strategy, designed to significantly enhance both the length accuracy and stability of long-form generation without additional training. Extensive experiments on VOLTBench provide the first systematic confirmation of severe long-form output instability in mainstream models and validate that our proposed method successfully improves the mean output length of the base model by 148% and reduces the length volatility by 69%, while maintaining high generation quality.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that mainstream LLMs exhibit severe length volatility in long-form generation. It introduces VOLTBench, a heterogeneous-task benchmark to quantify this instability, probes attention traces to identify common internal patterns causing volatility, and proposes GLoBo, a lightweight logits-boosting decoding strategy that improves length accuracy and stability without training. Experiments on VOLTBench are said to confirm the instability and show that GLoBo increases mean output length by 148% while reducing volatility by 69%, with maintained generation quality.
Significance. If the empirical results hold under rigorous controls, the work addresses a practically important but understudied limitation in LLM deployment: output length instability that raises compute costs and harms reliability. VOLTBench provides a new standardized evaluation resource, and GLoBo offers a simple, training-free intervention with potentially broad applicability. The focus on decoding-stage mitigation and the reported scale of improvements are notable strengths.
Major comments (2)
- [§4 (Probing)] The analysis identifies attention patterns correlated with length deviations via trace inspection on generated outputs, but presents no interventional tests (e.g., targeted head masking, attention logit intervention, or controlled ablations) that directly modify the flagged patterns and measure resulting changes in length statistics. This makes the causal claim that the patterns 'cause this volatility' unsupported and weakens the justification that GLoBo targets the root mechanism rather than a downstream correlate.
- [§6 (Experiments) and associated result tables] The abstract reports precise gains (148% mean length increase, 69% volatility reduction). These require (a) an explicit mathematical definition of the volatility metric, (b) full per-model/per-task tables with standard deviations across seeds, (c) statistical significance tests, and (d) comparisons against multiple length-control baselines. Without these, the central performance claims cannot be verified as robust.
Minor comments (2)
- [Abstract] Briefly enumerate the 'several common internal patterns' identified in probing to improve reader understanding before the detailed section.
- [§3 (VOLTBench)] Clarify task selection criteria and how the benchmark ensures coverage of real-world long-form scenarios where length consistency is critical.
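The interventional test the major comments call for could be sketched in a toy setting: ablate the flagged attention heads and compare the resulting length statistics against the intact model. The function and mask convention below are hypothetical, not from the paper:

```python
import math

def masked_attention(scores_per_head, head_mask):
    """Toy multi-head attention: softmax each head's raw scores, zeroing
    out ablated heads entirely.

    scores_per_head: list of heads, each a list of raw attention scores.
    head_mask: list of 0/1 flags; 0 ablates the head.

    An intervention study would generate with flagged heads ablated
    (mask=0) vs. intact (mask=1) and compare output-length statistics.
    """
    out = []
    for scores, keep in zip(scores_per_head, head_mask):
        if not keep:
            out.append([0.0] * len(scores))  # ablated head contributes nothing
            continue
        m = max(scores)  # subtract max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out
```

If ablating a flagged head shifts mean length or volatility significantly while ablating random control heads does not, the causal reading of the probing analysis would be on much firmer ground.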
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback, which helps us improve the clarity and rigor of the manuscript. We respond to each major comment below and indicate the planned revisions.
Point-by-point responses
-
Referee: [§4 (Probing)] The analysis identifies attention patterns correlated with length deviations via trace inspection on generated outputs, but presents no interventional tests (e.g., targeted head masking, attention logit intervention, or controlled ablations) that directly modify the flagged patterns and measure resulting changes in length statistics. This makes the causal claim that the patterns 'cause this volatility' unsupported and weakens the justification that GLoBo targets the root mechanism rather than a downstream correlate.
Authors: We agree that the probing analysis in §4 relies on observational trace inspection and identifies consistent correlations between specific attention patterns and length deviations, without direct interventional validation of causality. The patterns were selected because they appeared reliably across models and tasks in the VOLTBench evaluations. GLoBo was motivated by these observations as a practical, training-free way to counteract the downstream effects on token selection. In the revision we will (i) explicitly qualify the language to describe the patterns as strongly correlated rather than causal, (ii) add a dedicated limitations paragraph discussing the correlational nature of the analysis and the value of future interventional work, and (iii) include a brief additional ablation that perturbs the relevant logit regions to show measurable impact on length statistics. These changes will be made without claiming stronger causal evidence than the data support.
Revision: partial
-
Referee: [§6 (Experiments)] The abstract reports precise gains (148% mean length increase, 69% volatility reduction). These require (a) an explicit mathematical definition of the volatility metric, (b) full per-model/per-task tables with standard deviations across seeds, (c) statistical significance tests, and (d) comparisons against multiple length-control baselines. Without these, the central performance claims cannot be verified as robust.
Authors: We accept that the current presentation of the experimental results is insufficient for full verification. In the revised manuscript we will: (a) insert the precise mathematical definition of the volatility metric (standard deviation of normalized output lengths across repeated generations), (b) expand all result tables to report per-model and per-task means together with standard deviations computed over multiple random seeds, (c) add statistical significance tests (e.g., paired Wilcoxon tests) for the reported improvements, and (d) include direct comparisons against additional length-control baselines such as length-penalized decoding and controlled-temperature sampling. These additions will be placed in §6 and the appendix so that the 148% and 69% figures can be independently assessed.
Revision: yes
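The rebuttal's metric, transcribed directly: volatility as the standard deviation of output lengths across repeated generations, here normalized by the requested target length (the normalization choice is our assumption, since the rebuttal does not specify it):

```python
import statistics

def length_volatility(lengths, target_len):
    """Population std. dev. of output lengths normalized by the requested
    target length, over repeated generations with different seeds.
    Matches the rebuttal's stated definition; normalizing by the target
    length is our assumption."""
    normalized = [n / target_len for n in lengths]
    return statistics.pstdev(normalized)

def volatility_reduction(before, after):
    """Relative reduction; 0.69 would correspond to the reported 69%."""
    return 1.0 - after / before
```

Under this definition, perfectly length-stable generations score zero volatility, and the paper's headline number is a before/after ratio of this statistic.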
Circularity Check
No circularity; claims rest on new benchmark and empirical validation
Full rationale
The paper introduces VOLTBench as a new heterogeneous-task benchmark to quantify length volatility, performs attention trace analysis to identify patterns, and proposes the GLoBo decoding strategy as mitigation. Central results (148% mean length improvement, 69% volatility reduction) are obtained via direct experiments on the newly defined benchmark and models, without any fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations that reduce the derivation to its inputs. The chain is self-contained with independent empirical measurements.
Failure patterns
Case studies excerpted from the paper's appendix, with the attention-trace causes the authors assign to each.
- Premature Termination: The model failed to complete the generation, stopping abruptly after only three short paragraphs, far short of the required 40 entries (Figure 24, Keyword Presence Constraint task).
- Dialogue Hallucination: The model's output terminates by hallucinating a user's follow-up question, suggesting it incorrectly inferred a conversational context, switched from a content-generation role to a chatbot role, and then stopped, awaiting human intervention.
- Underlying cause, attention degradation: In the attention trace analysis (Figure 26), attention scores become progressively lower towards the end of the generated sequence, indicating the model was losing its ability to focus on the context.
- Task Mismatch: Prompted for a 40-day EN-simple-diary, the model generated a fantasy story instead.
- Section Skipping: The model generated the first 10 chapters correctly, then jumped directly to the final chapter (Chapter 40), omitting the 29 chapters in between. This fulfills the superficial requirement of ending at Chapter 40 without performing the actual work of sequential generation.
- Underlying cause, attention spike: In the attention trace (Figure 4), a sharp spike in the attention peak occurs immediately before the model generates the skipped section ("Chapter 40"), suggesting the model recognized the start ("Chapter 1") and end ("Chapter 40") points.
- Content Degradation: The model initially generates coherent content (Chapters 1-18), but output quality then degrades significantly (Chapters 19-20), losing grammatical structure and becoming a stream of loosely related words.
- Repetitive Loop: The degradation culminates in Chapter 21, where the model enters a terminal repetitive loop, endlessly outputting a fixed sequence of high-probability words (e.g., "greatly esteemed", "highly revered"), indicating a complete collapse of its ability to generate.
- Underlying cause, attention collapse: After generating a substantial amount of text, the attention mechanism is no longer able to produce meaningful peaks or focus on relevant parts of the context; without sufficient attention to guide next-token selection, the model falls back into simple repetition.
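The attention-collapse signal described above could be monitored with a simple entropy probe: a distribution with no meaningful peaks has entropy close to the uniform maximum. The threshold value here is an illustrative choice, not one from the paper:

```python
import math

def attention_entropy(weights):
    """Shannon entropy (natural log) of one step's attention distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0)

def is_collapsed(weights, threshold=0.95):
    """Flag collapse when entropy is within `threshold` of the uniform
    maximum log(n), i.e. attention no longer forms meaningful peaks.
    The 0.95 threshold is illustrative, not a value from the paper."""
    max_entropy = math.log(len(weights))
    return attention_entropy(weights) >= threshold * max_entropy
```

Tracking this statistic per decoding step would turn the qualitative "progressively lower attention scores" observation into a measurable precursor signal for premature termination or repetitive looping.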