Recognition: unknown
LLMs Corrupt Your Documents When You Delegate
Pith reviewed 2026-05-10 09:44 UTC · model grok-4.3
The pith
Large language models corrupt an average of 25% of document content during long delegated editing workflows.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Current LLMs are unreliable delegates that degrade documents by introducing sparse but severe errors during long interaction sequences. In the DELEGATE-52 benchmark, which covers in-depth editing tasks across 52 professional domains, frontier models corrupt an average of 25% of the document content. Agentic tool use fails to reduce this degradation, which intensifies with larger document sizes, longer interactions, and the presence of distractor files. The errors compound silently over time rather than appearing all at once.
What carries the argument
DELEGATE-52, a benchmark simulating long delegated document editing workflows across 52 domains to quantify content corruption.
If this is right
- Agentic tool use does not reduce document corruption in these workflows.
- Corruption levels rise with increasing document size and interaction length.
- Additional distractor files in the workflow increase error rates.
- Errors remain sparse but severe and accumulate across multiple steps.
- LLMs cannot be trusted to faithfully execute delegated document tasks without oversight.
Where Pith is reading between the lines
- Breaking long tasks into shorter sessions could limit the compounding of errors.
- Monitoring mechanisms that flag content changes might be needed before full delegation becomes practical.
- The benchmark points to reliability issues that short single-prompt tests miss.
- Domains requiring precise formatting, such as music notation or crystallography, may show even higher vulnerability.
Load-bearing premise
The DELEGATE-52 tasks and the metric used to detect corruption accurately reflect real delegated document work and that the observed changes are unintended mistakes.
What would settle it
Running the DELEGATE-52 workflows with human experts and finding that they introduce comparable rates of content changes that domain specialists accept as normal variations would undermine the claim that LLMs corrupt documents.
Figures
read the original abstract
Large Language Models (LLMs) are poised to disrupt knowledge work, with the emergence of delegated work as a new interaction paradigm (e.g., vibe coding). Delegation requires trust - the expectation that the LLM will faithfully execute the task without introducing errors into documents. We introduce DELEGATE-52 to study the readiness of AI systems in delegated workflows. DELEGATE-52 simulates long delegated workflows that require in-depth document editing across 52 professional domains, such as coding, crystallography, and music notation. Our large-scale experiment with 19 LLMs reveals that current models degrade documents during delegation: even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows, with other models failing more severely. Additional experiments reveal that agentic tool use does not improve performance on DELEGATE-52, and that degradation severity is exacerbated by document size, length of interaction, or presence of distractor files. Our analysis shows that current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DELEGATE-52, a benchmark for long delegated document-editing workflows across 52 professional domains. Experiments with 19 LLMs show that even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of multi-turn interactions, with degradation worsening by document size and interaction length; agentic tool use provides no improvement.
Significance. If the corruption measurements prove robust, the work is significant for NLP and HCI because it supplies large-scale empirical data on LLM unreliability in an emerging delegation paradigm for knowledge work. The breadth of models and domains tested offers a useful reference point for future agent and workflow research.
major comments (3)
- [§3] §3 (Benchmark and metric definition): The degradation metric underlying the central 25% corruption figure is not specified in enough detail (e.g., whether it uses string overlap, token edit distance, semantic similarity, or expert judgment) to separate unintended errors from task-appropriate edits such as rephrasing or cross-reference updates. This distinction is load-bearing for interpreting the result as 'corruption' rather than normal workflow variation.
- [§4] §4 (Results): No human baseline, inter-annotator agreement, or semantic validation of changes is reported for the 25% figure on frontier models. Without these, it is impossible to determine whether the measured degradation exceeds acceptable human variation in long editing sessions.
- [§4.2] §4.2 (Statistical controls): The abstract and results mention multiple conditions and 19 models but provide no details on run-to-run variance, statistical significance testing, or controls for prompt sensitivity, which are necessary to support the claim that degradation is systematic.
minor comments (2)
- [Abstract] Abstract: The phrase 'vibe coding' is introduced without definition or reference.
- [Figures/Tables] Figure and table captions: Several captions are terse and do not fully describe the axes, error bars, or exact conditions shown.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which highlights important areas for improving the clarity and rigor of our presentation. We address each major comment point by point below, indicating the revisions we will make to the manuscript.
read point-by-point responses
-
Referee: [§3] §3 (Benchmark and metric definition): The degradation metric underlying the central 25% corruption figure is not specified in enough detail (e.g., whether it uses string overlap, token edit distance, semantic similarity, or expert judgment) to separate unintended errors from task-appropriate edits such as rephrasing or cross-reference updates. This distinction is load-bearing for interpreting the result as 'corruption' rather than normal workflow variation.
Authors: We appreciate the referee drawing attention to this. Section 3 defines the degradation metric as the fraction of document content altered in ways that introduce factual errors, omissions, or inconsistencies not justified by the delegated task instructions, computed via a hybrid approach: sentence-level embedding cosine similarity (thresholded at 0.85 for potential issues) followed by targeted string matching on domain-specific entities and manual review on a 10% sample. To address the concern about distinguishing corruption from appropriate edits, we will revise §3 to add an explicit taxonomy with examples (e.g., changing a crystallography lattice parameter is corruption; updating a cross-reference after content insertion is not), the precise formula, and inter-rater reliability for the manual component. This will make the 25% figure more interpretable. revision: yes
-
Referee: [§4] §4 (Results): No human baseline, inter-annotator agreement, or semantic validation of changes is reported for the 25% figure on frontier models. Without these, it is impossible to determine whether the measured degradation exceeds acceptable human variation in long editing sessions.
Authors: We agree this would provide valuable context. The original experiments prioritized breadth across 19 models and 52 domains rather than human comparison. In the revision we will add a human baseline subsection in §4, reporting degradation rates from professional editors performing analogous delegated workflows on a stratified 8-domain subset (with the same interaction length and document sizes). We will also report inter-annotator agreement (Cohen's kappa) for the semantic validation labels on the frontier-model outputs and include a direct comparison showing that LLM degradation exceeds the human baseline by a statistically notable margin. revision: yes
-
Referee: [§4.2] §4.2 (Statistical controls): The abstract and results mention multiple conditions and 19 models but provide no details on run-to-run variance, statistical significance testing, or controls for prompt sensitivity, which are necessary to support the claim that degradation is systematic.
Authors: We acknowledge the need for greater statistical transparency. Although the experiments used fixed prompt templates across all models, we will expand §4.2 in the revision to report: (i) run-to-run variance (each model-condition pair was executed three times with different random seeds; we will add mean ± standard deviation), (ii) statistical significance (paired t-tests and ANOVA results comparing frontier vs. other models and across document sizes), and (iii) prompt-sensitivity controls (we tested two paraphrased prompt variants on a 5-domain pilot and observed <4% variation in corruption rates, which we will document). These additions will substantiate that the degradation pattern is systematic rather than artifactual. revision: yes
Circularity Check
No circularity: purely empirical measurement study
full rationale
The paper introduces DELEGATE-52 as a benchmark for delegated document workflows and reports direct experimental measurements of document degradation across 19 LLMs. No derivation chain, equations, fitted parameters, predictions, or self-citations are present in the provided text. The central claim (25% average corruption) is an observed average from model runs on the benchmark tasks, not a quantity derived from or reduced to prior inputs by construction. The study is self-contained as an empirical evaluation with no load-bearing theoretical steps that could exhibit circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption DELEGATE-52 tasks accurately simulate real professional document editing workflows across domains.
Reference graph
Works this paper leans on
-
[1]
write newline
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...
-
[2]
@esa (Ref
\@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...
-
[3]
\@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...
-
[4]
(0.25 page) Main Findings (2.5 Page) - Main 10 round trips table + domain breakdown somehow (1 page) - Expected vs
@open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...
-
[5]
Unsupervised evaluation of code llms with round-trip correctness
Miltiadis Allamanis, Sheena Panthaplackel, and Pengcheng Yin. Unsupervised evaluation of code llms with round-trip correctness. pp.\ 1050--1066, 2024
2024
-
[6]
The claude 3 model family: Opus, sonnet, haiku, 2024
Anthropic. The claude 3 model family: Opus, sonnet, haiku, 2024
2024
-
[7]
Blandin, and David J
Alexander Bick, A. Blandin, and David J. Deming. The rapid adoption of generative ai. SSRN Electronic Journal, 2024
2024
-
[8]
How knowledge workers use and want to use llms in an enterprise context
Michelle Brachman, Amina El-Ashry, Casey Dugan, and Werner Geyer. How knowledge workers use and want to use llms in an enterprise context. Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, 2024
2024
-
[9]
Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow image editing instructions. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.\ 18392--18402, 2022
2023
-
[10]
Can it edit? evaluating the ability of large language models to follow code editing instructions
Federico Cassano, Luisa Li, Akul Sethi, Noah Shinn, Abby Brennan-Jones, Anton Lozhkov, C. Anderson, and Arjun Guha. Can it edit? evaluating the ability of large language models to follow code editing instructions. ArXiv, abs/2312.12450, 2023
-
[11]
Can ai writing be salvaged? mitigating idiosyncrasies and improving human-ai alignment in the writing process through edits
Tuhin Chakrabarty, Philippe Laban, and Chien-Sheng Wu. Can ai writing be salvaged? mitigating idiosyncrasies and improving human-ai alignment in the writing process through edits. Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 2024
2025
-
[12]
Tuhin Chakrabarty, Philippe Laban, and Chien-Sheng Wu. Ai-slop to ai-polish? aligning language models through edit-based writing rewards and test-time computation. ArXiv, abs/2504.07532, 2025
-
[13]
Cunningham, David Deming, Zoë Hitzig, Christopher Ong, Carl Shan, and Kevin Wadman
Aaron Chatterji, T. Cunningham, David Deming, Zoë Hitzig, Christopher Ong, Carl Shan, and Kevin Wadman. How people use chatgpt. SSRN Electronic Journal, 2025
2025
-
[14]
Kaiyuan Chen, Yixin Ren, Yang Liu, X. Hu, Haotong Tian, Tianbao Xie, Fangfu Liu, Haoye Zhang, Hongzhang Liu, Yuan Gong, Chen Sun, Han Hou, Hui Yang, James Pan, Jian-Guang Lou, Jiayi Mao, Jizheng Liu, Jinpeng Li, Kangyi Liu, Ke Liu, Rui Wang, Runhao Li, Tong Niu, Wenlong Zhang, Wenqi Yan, Xuanzheng Wang, Yuchen Zhang, Yi-Hsin Hung, Yuan Jiang, Zexuan Liu, ...
-
[15]
SVGenius: Benchmarking LLMs in SVG Understanding, Editing and Generation
Siqi Chen, Xinyu Dong, Haolei Xu, Xingyu Wu, Fei Tang, Hang Zhang, Yuchen Yan, Linjuan Wu, Wenqi Zhang, Guiyang Hou, Yongliang Shen, Weiming Lu, and Yueting Zhuang. SVGenius: Benchmarking LLMs in SVG Understanding, Editing and Generation. 2025 b
2025
-
[16]
Lifebench: A benchmark for long-horizon multi-source memory
Zi-Jian Cheng, Weixin Wang, Yu Zhao, Ziyang Ren, Jiaxuan Chen, Ruiyang Xu, Shuai Huang, Yang Chen, Guowei Li, Mengshi Wang, Yichen Xie, Renchuan Zhu, Zeren Jiang, Keda Lu, Yihong Li, Xiaoliang Wang, Liwei Liu, and Cam-Tu Nguyen. Lifebench: A benchmark for long-horizon multi-source memory. 2026
2026
-
[17]
Gheorghe Comanici et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. ArXiv, abs/2507.06261, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[18]
Kellogg, Saran Rajendran, Lisa A
Fabrizio Dell'Acqua, Edward McFowland, Ethan Mollick, Hila Lifshitz-Assaf, Katherine C. Kellogg, Saran Rajendran, Lisa A. Krayer, F. Candelon, and K. Lakhani. Navigating the jagged technological frontier: Field experimental evidence of the effects of ai on knowledge worker productivity and quality. SSRN Electronic Journal, 2023
2023
-
[19]
WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?
Alexandre Drouin, Maxime Gasse, Massimo Caccia, I. Laradji, Manuel Del Verme, Tom Marty, L'eo Boisvert, Megh Thakkar, Quentin Cappart, David Vázquez, Nicolas Chapados, and Alexandre Lacoste. Workarena: How capable are web agents at solving common knowledge work tasks? ArXiv, abs/2403.07718, 2024
work page internal anchor Pith review arXiv 2024
-
[20]
Pan, Ruifeng Xu, and Kam-Fai Wong
Yiming Du, Bingbing Wang, Yangfan He, Bin Liang, Baojun Wang, Zhongyang Li, Lin Gui, Jeff Z. Pan, Ruifeng Xu, and Kam-Fai Wong. Memguide: Intent-driven memory selection for goal-oriented multi-session llm agents. 2025
2025
-
[21]
Lomeli, Patrick Lewis, Gautier Izacard, Edouard Grave, Sebastian Riedel, and F
Jane Dwivedi-Yu, Timo Schick, Zhengbao Jiang, M. Lomeli, Patrick Lewis, Gautier Izacard, Edouard Grave, Sebastian Riedel, and F. Petroni. Editeval: An instruction-based benchmark for text improvements. ArXiv, abs/2209.13331, 2022
-
[22]
Openai o1 system card
Ahmed El-Kishky. Openai o1 system card. 2024
2024
-
[23]
Tyna Eloundou, Sam Manning, Pamela Mishkin, and Daniel Rock. Gpts are gpts: An early look at the labor market impact potential of large language models. ArXiv, abs/2303.10130, 2023
-
[24]
Jiawei Guo, Ziming Li, Xueling Liu, Kaijing Ma, Tianyu Zheng, Zhouliang Yu, Ding Pan, Yizhi Li, Ruibo Liu, Yue Wang, Shuyue Guo, Xingwei Qu, Xiang Yue, Ge Zhang, Wenhu Chen, and Jie Fu. Codeeditorbench: Evaluating code editing capability of large language models. ArXiv, abs/2404.03543, 2024
-
[25]
Troy, Dario Amodei, Jared Kaplan, Jack Clark, and Deep Ganguli
Kunal Handa, Alex Tamkin, Miles McCain, Saffron Huang, Esin Durmus, Sarah Heck, J. Mueller, Jerry Hong, Stuart Ritchie, Tim Belonax, Kevin K. Troy, Dario Amodei, Jared Kaplan, Jack Clark, and Deep Ganguli. Which economic tasks are performed with ai? evidence from millions of claude conversations. ArXiv, abs/2503.04761, 2025
-
[26]
Dual learning for machine translation
Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. Dual learning for machine translation. pp.\ 820--828, 2016
2016
-
[27]
Pentland
Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen, Lang Yin, Zeming Chen, Tong Wu, Siru Ouyang, Zihan Wang, Jiaxin Pei, Julian McAuley, Yejin Choi, and A. Pentland. Memoryarena: Benchmarking agent memory in interdependent multi-session agentic tasks. 2026
2026
-
[28]
Christine Herlihy, Jennifer Neville, Tobias Schnabel, and Adith Swaminathan. On overcoming miscalibrated conversational priors in llm-based chatbots. ArXiv, abs/2406.01633, 2024
-
[29]
Iterative back-translation for neural machine translation
Cong Duy Vu Hoang, Philipp Koehn, Gholamreza Haffari, and Trevor Cohn. Iterative back-translation for neural machine translation. pp.\ 18--24, 2018
2018
-
[30]
Consistencychecker: Tree-based evaluation of llm generalization capabilities
Zhaochen Hong, Haofei Yu, and Jiaxuan You. Consistencychecker: Tree-based evaluation of llm generalization capabilities. pp.\ 33039--33075, 2025
2025
-
[31]
Evermembench: Benchmarking long-term interactive memory in large language models
Chuanrui Hu, Tong Li, Xingze Gao, Hongda Chen, Yi Bai, Dannong Xu, Tianwei Lin, Xinda Zhao, Xiaohong Li, Yunyun Han, Jian Pei, and Yafeng Deng. Evermembench: Benchmarking long-term interactive memory in large language models. 2026
2026
-
[32]
Crmarena: Understanding the capacity of llm agents to perform professional crm tasks in realistic environments
Kung-Hsiang Huang, Akshara Prabhakar, Sidharth Dhawan, Yixin Mao, Huan Wang, Silvio Savarese, Caiming Xiong, Philippe Laban, and Chien-Sheng Wu. Crmarena: Understanding the capacity of llm agents to perform professional crm tasks in realistic environments. pp.\ 3830--3850, 2024
2024
-
[33]
Gpt-4o system card
Aaron Hurst et al. Gpt-4o system card. 2024
2024
-
[34]
Wang, Ying Xiong, Yong Zhang, and Zhenan Fan
Daesik Jang, Morgan Lindsay Heisler, Linzi Xing, Yifei Li, E. Wang, Ying Xiong, Yong Zhang, and Zhenan Fan. Deckbench: Benchmarking multi-agent frameworks for academic slide generation and editing. 2026
2026
-
[35]
Jihyoung Jang, Minseong Boo, and Hyounghun Kim. Conversation chronicles: Towards diverse temporal and relational dynamics in multi-session conversations. ArXiv, abs/2310.13420, 2023
-
[36]
Saurabh Jha, Rohan R. Arora, Yuji Watanabe, Takumi Yanagawa, Yinfang Chen, Jackson Clark, Bhavya Bhavya, Mudit Verma, Harshit Kumar, Hirokuni Kitahara, Noah Zheutlin, Saki Takano, Divya Pathak, Felix George, Xinbo Wu, B. Turkkan, Gerard Vanloo, M. Nidd, Ting Dai, Oishik Chatterjee, Pranjal Gupta, Suranjana Samanta, Pooja Aggarwal, Rong Lee, Pavankumar Mur...
-
[37]
Albert Qiaochu Jiang et al. Mistral 7b. ArXiv, abs/2310.06825, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [38]
-
[39]
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? ArXiv, abs/2310.06770, 2023
work page internal anchor Pith review arXiv 2023
-
[40]
Kapadnis, Lawanya Baghel, Atharva Naik, and C
M. Kapadnis, Lawanya Baghel, Atharva Naik, and C. Ros'e. Charteditbench: Evaluating grounded multi-turn chart editing in multimodal language models. 2026
2026
-
[41]
arXiv preprint arXiv:2602.03429 , url=
Tae Soo Kim, Yoonjoo Lee, Jaesang Yu, John Joon Young Chung, and Juho Kim. Discoverllm: From executing intents to discovering them. arXiv preprint arXiv:2602.03429, 2026
work page internal anchor Pith review arXiv 2026
-
[42]
Joty, Caiming Xiong, and Chien-Sheng Wu
Philippe Laban, Jesse Vig, Wojciech Kryscinski, Shafiq R. Joty, Caiming Xiong, and Chien-Sheng Wu. Swipe: A dataset for document-level simplification of wikipedia pages. pp.\ 10674--10695, 2023
2023
-
[43]
Philippe Laban, A. R. Fabbri, Caiming Xiong, and Chien-Sheng Wu. Summary of a haystack: A challenge to long-context llms and rag systems. pp.\ 9885--9903, 2024
2024
-
[44]
LLMs Get Lost In Multi-Turn Conversation
Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville. Llms get lost in multi-turn conversation. ArXiv, abs/2505.06120, 2025
work page internal anchor Pith review arXiv 2025
-
[45]
FLUX.2: Frontier Visual Intelligence
Black Forest Labs. FLUX.2: Frontier Visual Intelligence . https://bfl.ai/blog/flux-2, 2025
2025
-
[46]
Black Forest Labs, Stephen Batifol, A. Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Muller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. Flux.1 kontext: Flow matching for in-context image gener...
work page internal anchor Pith review arXiv 2025
-
[47]
M. Lachaux, Baptiste Rozière, Lowik Chanussot, and Guillaume Lample. Unsupervised translation of programming languages. ArXiv, abs/2006.03511, 2020
-
[48]
Unsupervised Machine Translation Using Monolingual Corpora Only
Guillaume Lample, Ludovic Denoyer, and Marc'Aurelio Ranzato. Unsupervised machine translation using monolingual corpora only. ArXiv, abs/1711.00043, 2017
work page Pith review arXiv 2017
-
[49]
Levenshtein
V. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet physics. Doklady, 10: 0 707--710, 1965
1965
-
[50]
Charte3: A comprehensive benchmark for end-to-end chart editing
Shuo Li, Jiajun Sun, Zhekai Wang, Xiaoran Fan, Hui Li, Di Yang, Zhiheng Xi, Yijun Wang, Zifei Shan, Tao Gui, Qi Zhang, and Xuanjing Huang. Charte3: A comprehensive benchmark for end-to-end chart editing. ArXiv, abs/2601.21694, 2026
-
[51]
Self-alignment with instruction backtranslation
Xian Li, Ping Yu, Chunting Zhou, Timo Schick, Luke Zettlemoyer, Omer Levy, J. Weston, and M. Lewis. Self-alignment with instruction backtranslation. ArXiv, abs/2308.06259, 2023
-
[52]
Toward multi-session personalized conversation: A large-scale dataset and hierarchical tree framework for implicit reasoning
Xintong Li, Jalend Bantupalli, Ria Dharmani, Yuwei Zhang, and Jingbo Shang. Toward multi-session personalized conversation: A large-scale dataset and hierarchical tree framework for implicit reasoning. pp.\ 11493--11506, 2025
2025
-
[53]
Wikitableedit: A benchmark for table editing by natural language instruction
Zheng Li, Xiang Chen, and Xiaojun Wan. Wikitableedit: A benchmark for table editing by natural language instruction. ArXiv, abs/2403.02962, 2024
-
[54]
Rouge: A package for automatic evaluation of summaries
Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Annual Meeting of the Association for Computational Linguistics, pp.\ 74--81, 2004
2004
-
[55]
Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, F
Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, F. Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12: 0 157--173, 2023
2023
-
[56]
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, M. Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. ArXiv, abs/1907.11692, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1907
-
[57]
Evaluating Very Long-Term Conversational Memory of LLM Agents
Adyasha Maharana, Dong-Ho Lee, S. Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of llm agents. ArXiv, abs/2402.17753, 2024
work page internal anchor Pith review arXiv 2024
-
[58]
Nickil Maveli, Antonio Vergari, and Shay B. Cohen. Can llms compress (and decompress)? evaluating code understanding and execution via invertibility. ArXiv, abs/2601.13398, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[59]
Mantas Mazeika, Alice Gatti, Cristina Menghini, Udari Madhushani Sehwag, Shivam Singhal, Yury Orlovskiy, Steven Basart, Manasi Sharma, Denis Peskoff, Elaine Lau, Jaehyuk Lim, Lachlan Carroll, Alice Blair, V. Sivakumar, Sumana Basu, Brad Kenstler, Yuntao Ma, Julian Michael, Xiaoke Li, Oliver Ingebretsen, Aditya Mehta, Jean Mottola, John Teichmann, Kevin Yu...
-
[60]
Multisessioncollab: Learning user preferences with memory to improve long-term collaboration
Shuhaib Mehri, Priyanka Kargupta, Tal August, and Dilek Hakkani-Tur. Multisessioncollab: Learning user preferences with memory to improve long-term collaboration. 2026
2026
-
[61]
Marcus J. Min, Yangruibo Ding, Luca Buratti, Saurabh Pujar, Gail E. Kaiser, Suman Jana, and Baishakhi Ray. Beyond accuracy: Evaluating self-consistency of code large language models with identitychain. ArXiv, abs/2310.14053, 2023
-
[62]
Tarek Naous, Philippe Laban, Wei Xu, and Jennifer Neville. Flipping the dialogue: Training and evaluating user language models. arXiv preprint arXiv:2510.06552, 2025
-
[63]
arXiv preprint arXiv:2201.10005 , year=
Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, N. Tezak, Jong Wook Kim, Chris Hallacy, Johannes Heidecke, Pranav Shyam, Boris Power, Tyna Eloundou Nekoul, G. Sastry, Gretchen Krueger, D. Schnurr, F. Such, K. Hsu, Madeleine Thompson, Tabarak Khan, T. Sherbakov, Joanne Jang, Peter Welinder, and Lilian Weng...
-
[64]
Zettlemoyer, and Xian Li
Thao Nguyen, Jeffrey Li, Sewoong Oh, Ludwig Schmidt, Jason Weston, Luke S. Zettlemoyer, and Xian Li. Better alignment with instruction back-and-forth translation. pp.\ 13289--13308, 2024
2024
-
[65]
Svgeditbench: A benchmark dataset for quantitative assessment of llm's svg editing capabilities
Kunato Nishina and Yusuke Matsui. Svgeditbench: A benchmark dataset for quantitative assessment of llm's svg editing capabilities. ArXiv, abs/2404.13710, 2024
-
[66]
Zach Nussbaum, John X. Morris, Brandon Duderstadt, and Andriy Mulyar. Nomic embed: Training a reproducible long context text embedder. ArXiv, abs/2402.01613, 2024
-
[67]
Pptarena: A benchmark for agentic powerpoint editing
Michael Ofengenden, Yunze Man, Ziqi Pang, and Yu-Xiong Wang. Pptarena: A benchmark for agentic powerpoint editing. ArXiv, abs/2512.03042, 2025
-
[68]
Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, Olivia Watkins, S. Fishman, Marwan Aljubeh, Phoebe Thacker, Laurance Fauconnet, Natalie S. Kim, Patrick Chao, Samuel Miserendino, Gildas Chabot, David Li, Michael Sharman, Alexandra Barr, Amelia Glaese, and Jerry Tworek. Gdpval: Evaluating ai model performance on real-world economic...
-
[69]
Peterson, Michael D
NORMAN G. Peterson, Michael D. Mumford, W. C. Borman, P. Jeanneret, E. Fleishman, Kerry Y. Levin, MICHAEL A. Campion, M. S. Mayfield, F. Morgeson, Kenneth Pearlman, M. Gowing, Anita R. Lancaster, M. Silver, and D. Dye. Understanding work using the occupational information network (o*net): Implications for practice and research. Personnel Psychology, 54: 0...
2001
-
[70]
V. Pimenova, Sarah Fakhoury, Christian Bird, M. Storey, and Madeline Endres. Good vibrations? a qualitative study of co-creation, communication, flow, and trust in vibe coding. ArXiv, abs/2509.12491, 2025
-
[71]
Coedit: Text editing by task-specific instruction tuning
Vipul Raheja, Dhruv Kumar, Ryan Koo, and Dongyeop Kang. Coedit: Text editing by task-specific instruction tuning. ArXiv, abs/2305.09857, 2023
-
[72]
Baptiste Rozière, J Zhang, François Charton, M. Harman, Gabriel Synnaeve, and Guillaume Lample. Leveraging automated unit tests for unsupervised code translation. ArXiv, abs/2110.06773, 2021
-
[73]
arXiv preprint arXiv:1511.06709 , year=
Rico Sennrich, B. Haddow, and Alexandra Birch. Improving neural machine translation models with monolingual data. ArXiv, abs/1511.06709, 2015
-
[74]
Future of work with AI agents: Auditing automation and augmentation potential across the U.S
Yijia Shao, Humishka Zope, Yucheng Jiang, Jiaxin Pei, D. Nguyen, Erik Brynjolfsson, and Diyi Yang. Future of work with ai agents: Auditing automation and augmentation potential across the u.s. workforce. ArXiv, abs/2506.06576, 2025
-
[75]
Chi, Nathanael Scharli, and Denny Zhou
Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H. Chi, Nathanael Scharli, and Denny Zhou. Large language models can be easily distracted by irrelevant context. pp.\ 31210--31227, 2023
2023
-
[76]
Singh et al
Aaditya K. Singh et al. Openai gpt-5 system card. 2025
2025
-
[77]
Alexa Siu and Raymond Fok. Augmenting expert cognition in the age of generative ai: Insights from document-centric knowledge work. ArXiv, abs/2503.24334, 2025
-
[78]
Defining and characterizing reward hacking.arXiv preprint arXiv:2209.13085, 2022
J. Skalse, Nikolaus H. R. Howe, D. Krasheninnikov, and David Krueger. Defining and characterizing reward hacking. ArXiv, abs/2209.13085, 2022
-
[79]
H. Somers. Round-trip translation: What is it good for? pp.\ 127--133, 2005
2005
-
[80]
Newsedits: A news article revision dataset and a novel document-level reasoning challenge
Alexander Spangher, Xiang Ren, Jonathan May, and Nanyun Peng. Newsedits: A news article revision dataset and a novel document-level reasoning challenge. ArXiv, abs/2206.07106, 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.