LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
Pith reviewed 2026-05-12 20:17 UTC · model grok-4.3
The pith
LongBench introduces the first bilingual multi-task benchmark with 21 datasets to evaluate how well LLMs handle texts thousands of words long.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LongBench comprises 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words in English and 13,386 characters in Chinese. Evaluation of eight LLMs shows that GPT-3.5-Turbo-16k outperforms open-source models yet still struggles on longer contexts; that scaled position embeddings and fine-tuning on longer sequences bring substantial gains; and that retrieval-style compression helps weaker models but does not close the gap to models with strong native long-context ability.
What carries the argument
The LongBench benchmark itself, a collection of 21 standardized datasets spanning six categories (single-doc QA, multi-doc QA, summarization, few-shot learning, synthetic tasks, code completion) in English and Chinese.
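To make the standardized format and automatic scoring concrete, here is a minimal sketch of what evaluating one LongBench-style task file could look like. The JSON field names (context, input, answers) and the token-level F1 metric are illustrative assumptions, not the repository's exact schema or metric mapping.

```python
import json
from collections import Counter

def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1, a common automatic metric for extractive QA-style tasks."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def evaluate_task(jsonl_path: str, generate) -> dict:
    """Score a model (a callable `generate(prompt) -> str`) on one task file,
    assuming each line is a JSON example in the unified format."""
    scores = []
    with open(jsonl_path) as f:
        for line in f:
            ex = json.loads(line)  # assumed keys: context, input, answers
            prompt = f"{ex['context']}\n\nQuestion: {ex['input']}\nAnswer:"
            pred = generate(prompt)
            # take the best score over all acceptable gold answers
            scores.append(max(f1_score(pred, ans) for ans in ex["answers"]))
    return {"examples": len(scores), "mean_score": sum(scores) / max(len(scores), 1)}
```

A real harness would dispatch a different metric per task category (e.g., ROUGE for summarization, accuracy for classification, a similarity measure for code), but the control flow is the same.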
If this is right
- Commercial LLMs like GPT-3.5-Turbo-16k still underperform on the longest inputs even though they lead overall.
- Techniques that scale position embeddings or fine-tune on longer sequences produce clear gains in long-context tasks.
- Retrieval-based compression improves models that lack strong long-context handling but leaves them behind models that already manage extended inputs natively.
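A minimal sketch of the retrieval-style compression referenced in the last bullet: split the long context into chunks, rank them against the query, and keep only the top-ranked chunks within a budget. The lexical-overlap ranker and the chunking parameters here are stand-ins; practical setups typically use a dense or BM25 retriever.

```python
def compress_context(context: str, query: str,
                     chunk_size: int = 200, budget_chunks: int = 8) -> str:
    """Keep only the chunks most lexically similar to the query, so the result
    fits a model whose window is shorter than the original document."""
    words = context.split()
    chunks = [" ".join(words[i:i + chunk_size])
              for i in range(0, len(words), chunk_size)]
    query_terms = set(query.lower().split())

    def overlap(chunk: str) -> float:
        return len(query_terms & set(chunk.lower().split())) / (len(query_terms) or 1)

    ranked = sorted(range(len(chunks)), key=lambda i: overlap(chunks[i]), reverse=True)
    kept = sorted(ranked[:budget_chunks])  # restore original document order
    return "\n...\n".join(chunks[i] for i in kept)
```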
Where Pith is reading between the lines
- Future work could use LongBench scores as a quick filter before testing on real-world long documents.
- The bilingual design may expose language-specific differences in how models handle long contexts that English-only tests miss.
- If new models are trained with LongBench in mind, they might overfit to its particular dataset mix rather than general long-text skills.
Load-bearing premise
The 21 selected datasets and six task categories capture the essential difficulties of long-context understanding without letting models succeed through dataset artifacts or shortcuts instead of genuine long-range reasoning.
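One way to pressure-test this premise, not something the paper reports, is a question-only ablation: score each example with and without its long context and flag cases where the gap is small. A minimal sketch, assuming the `generate` and `score` callables an evaluation harness would provide:

```python
def shortcut_probe(examples, generate, score) -> dict:
    """Question-only ablation: if a model scores nearly as well without the
    context, the example may be solvable via shortcuts or prior knowledge
    rather than genuine long-range reading."""
    gaps = []
    for ex in examples:  # assumed keys: context, input, answers
        full = generate(f"{ex['context']}\n\n{ex['input']}")
        blind = generate(ex["input"])  # no context at all
        s_full = max(score(full, ans) for ans in ex["answers"])
        s_blind = max(score(blind, ans) for ans in ex["answers"])
        gaps.append(s_full - s_blind)
    n = max(len(gaps), 1)
    return {
        "mean_gap": sum(gaps) / n,                       # large gap = context matters
        "shortcut_rate": sum(g <= 0 for g in gaps) / n,  # fraction solvable blind
    }
```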
What would settle it
A model that scores high on every LongBench task but fails when given full untruncated books, reports, or code repositories in downstream applications would show the benchmark does not measure the intended capability.
Original abstract
Although large language models (LLMs) demonstrate impressive performance for many language tasks, most of them can only handle texts a few thousand tokens long, limiting their applications on longer sequence inputs, such as books, reports, and codebases. Recent works have proposed methods to improve LLMs' long context capabilities by extending context windows and more sophisticated memory mechanisms. However, comprehensive benchmarks tailored for evaluating long context understanding are lacking. In this paper, we introduce LongBench, the first bilingual, multi-task benchmark for long context understanding, enabling a more rigorous evaluation of long context understanding. LongBench comprises 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese). These tasks cover key long-text application areas including single-doc QA, multi-doc QA, summarization, few-shot learning, synthetic tasks, and code completion. All datasets in LongBench are standardized into a unified format, allowing for effortless automatic evaluation of LLMs. Upon comprehensive evaluation of 8 LLMs on LongBench, we find that: (1) Commercial model (GPT-3.5-Turbo-16k) outperforms other open-sourced models, but still struggles on longer contexts. (2) Scaled position embedding and fine-tuning on longer sequences lead to substantial improvement on long context understanding. (3) Context compression technique such as retrieval brings improvement for model with weak ability on long contexts, but the performance still lags behind models that have strong long context understanding capability. The code and datasets are available at https://github.com/THUDM/LongBench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LongBench, the first bilingual (English/Chinese), multi-task benchmark for long-context LLM understanding. It comprises 21 datasets across 6 categories (single-doc QA, multi-doc QA, summarization, few-shot learning, synthetic tasks, code completion) with average lengths of 6,711 words (English) and 13,386 characters (Chinese). All datasets are standardized to a unified format for automatic evaluation. The authors evaluate 8 LLMs and report that GPT-3.5-Turbo-16k leads but still struggles on long contexts, that scaled position embeddings and long-sequence fine-tuning yield substantial gains, and that retrieval-based compression helps weaker models but does not close the gap to strong long-context models.
Significance. LongBench addresses a genuine gap by supplying a standardized, automatically scorable, bilingual resource for tracking progress on long-context capabilities. If the datasets prove representative and free of exploitable artifacts, the benchmark will support reproducible comparisons and accelerate work on context extension and memory mechanisms. The public release of code and data is a clear strength.
Major comments (2)
- §3 (LongBench Construction): The manuscript provides high-level descriptions of the 21 datasets but does not detail selection criteria, inter-annotator agreement, or explicit checks for annotation artifacts and length-based shortcuts. Without these, it is difficult to confirm that the benchmark isolates long-context understanding rather than other capabilities.
- §4 (Experiments and Analysis): The three headline findings are stated without per-task breakdowns, confidence intervals, or ablation studies (e.g., retrieval quality vs. context length). This weakens the claim that retrieval “brings improvement” while still lagging behind strong models, as the contribution of long-context modeling versus other factors cannot be isolated.
Minor comments (3)
- The abstract asserts LongBench is “the first bilingual” benchmark; a concise related-work paragraph comparing it to prior long-context suites (e.g., those focused on English only) would strengthen this claim.
- Table 1 (dataset statistics) should include token counts under a common tokenizer and explicit context-length histograms to make the “long context” characterization quantitative; a minimal sketch of such a check follows this list.
- The GitHub link is given, but the paper should specify the exact commit or release tag used for the reported results to ensure reproducibility.
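For the token-count suggestion above, the computation itself is routine; below is a minimal sketch using the Hugging Face transformers tokenizer API. The choice of GPT-2 as the shared tokenizer and the 2,048-token bin width are arbitrary illustrations, not what the authors use.

```python
from collections import Counter
from transformers import AutoTokenizer

def length_profile(texts, tokenizer_name: str = "gpt2", bin_size: int = 2048) -> dict:
    """Token counts under one shared tokenizer plus a coarse length histogram,
    the kind of statistics a dataset table could report alongside word counts."""
    tok = AutoTokenizer.from_pretrained(tokenizer_name)
    lengths = [len(tok.encode(t)) for t in texts]
    hist = Counter((n // bin_size) * bin_size for n in lengths)
    return {
        "mean_tokens": sum(lengths) / len(lengths),
        "max_tokens": max(lengths),
        "histogram": dict(sorted(hist.items())),  # bin start -> example count
    }
```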
Simulated Authors' Rebuttal
We thank the referee for the positive assessment and recommendation of minor revision. The comments are constructive and we address each major point below, indicating planned changes to the manuscript.
Point-by-point responses
- Referee: §3 (LongBench Construction): The manuscript provides high-level descriptions of the 21 datasets but does not detail selection criteria, inter-annotator agreement, or explicit checks for annotation artifacts and length-based shortcuts. Without these, it is difficult to confirm that the benchmark isolates long-context understanding rather than other capabilities.
  Authors: We appreciate this observation. Most datasets in LongBench are adapted from existing public resources (with full citations in Section 3 and the appendix), selected specifically because their average lengths exceed standard context windows and require cross-sentence or cross-document reasoning. For the few newly constructed datasets, we followed standard annotation protocols from the source tasks. Inter-annotator agreement is reported for the human-annotated subsets where available. In the revision we will expand §3 with an explicit subsection on selection criteria, a summary of any artifact checks performed (e.g., length-distribution analysis and shortcut probes), and a clearer statement of limitations regarding potential non-contextual cues. [Revision: partial]
- Referee: §4 (Experiments and Analysis): The three headline findings are stated without per-task breakdowns, confidence intervals, or ablation studies (e.g., retrieval quality vs. context length). This weakens the claim that retrieval “brings improvement” while still lagging behind strong models, as the contribution of long-context modeling versus other factors cannot be isolated.
  Authors: Thank you for the suggestion. Table 2 already reports per-task scores for all eight models across the 21 datasets. We will add 95% confidence intervals (via bootstrap resampling) to the main results table in the revision. Section 4.3 contains an initial retrieval ablation; we will expand it with a controlled comparison of retrieval quality (top-k accuracy) versus raw context length, plus an additional ablation that varies the compression ratio while holding model architecture fixed. These additions should better isolate the contribution of long-context modeling. [Revision: partial]
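The bootstrap the authors commit to is a standard recipe. Below is a minimal sketch of a percentile bootstrap over per-example scores; the resample count and seed are arbitrary choices, not the paper's.

```python
import random

def bootstrap_ci(scores, n_resamples: int = 1000, alpha: float = 0.05, seed: int = 0):
    """Percentile-bootstrap confidence interval for the mean per-example score."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=n)) / n  # resample examples with replacement
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(scores) / n, (lo, hi)  # point estimate, (lower, upper) bounds
```

Reported per task, such intervals would let readers judge whether the gaps behind the three headline findings exceed sampling noise.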
Circularity Check
No significant circularity
Full rationale
The paper introduces LongBench, a new bilingual benchmark with 21 datasets across 6 task categories for evaluating long-context LLMs. There is no derivation chain, no set of equations or fitted parameters, and no prediction that could reduce to its inputs by construction. The central contribution is dataset curation, standardization into a unified format, and empirical evaluation of 8 models; validity rests on transparent curation rather than on self-referential logic, load-bearing self-citation, or an ansatz smuggled in via prior work. This is a standard resource paper with no circular steps.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Long-context understanding is a key current limitation of LLMs that requires dedicated benchmarks for rigorous evaluation.
Forward citations
Cited by 28 Pith papers
- RULER: What's the Real Context Size of Your Long-Context Language Models?
  RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.
- KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
  KVServe delivers up to 9.13x job completion time speedup and 32.8x time-to-first-token reduction by making KV cache compression service-aware and adaptive in disaggregated LLM serving.
- OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory
  OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.
- PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training
  Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.
- FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving
  FASER delivers up to 53% higher throughput and 1.92x lower latency in dynamic LLM serving by adjusting speculative lengths per request, early pruning of rejects, and overlapping draft/verification phases via frontiers.
- Weak-Link Optimization for Multi-Agent Reasoning and Collaboration
  WORC improves multi-agent LLM reasoning to 82.2% average accuracy by predicting and compensating for the weakest agent via targeted extra sampling rather than uniform reinforcement.
- Transactional Attention: Semantic Sponsorship for KV-Cache Retention
  Transactional Attention uses semantic sponsorship from anchor patterns to retain dormant critical tokens in KV caches, achieving 100% credential retrieval at 16 tokens where all prior methods fail.
- ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing
  ReST-KV formulates KV eviction as layer-wise output reconstruction optimization with spatial-temporal smoothing, outperforming baselines by 2.58% on LongBench and 15.2% on RULER while cutting decoding latency by 10.61...
- Federation of Experts: Communication Efficient Distributed Inference for Large Language Models
  FoE restructures MoE blocks into per-KV-head clusters with sum-based synchronization, removing all-to-all communication in single-node settings and limiting it to intra-node in multi-node settings for up to 5.2x faste...
- When Stress Becomes Signal: Detecting Antifragility-Compatible Regimes in Multi-Agent LLM Systems
  CAFE detects positive distributional Jensen Gaps across five multi-agent LLM architectures on a banking-risk benchmark, showing that quality drops under semantic stress can coexist with statistically detectable antifr...
- When Stress Becomes Signal: Detecting Antifragility-Compatible Regimes in Multi-Agent LLM Systems
  CAFE finds positive distributional Jensen Gaps across five multi-agent LLM architectures under semantic stress, showing that quality drops can coexist with detectable stress geometry compatible with antifragile learning.
- FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption
  FlashRT delivers 2x-7x speedup and 2x-4x GPU memory reduction for prompt injection and knowledge corruption attacks on long-context LLMs versus nanoGCG.
- SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference
  SparKV reduces time-to-first-token by 1.3x-5.1x and energy use by 1.5x-3.3x for on-device LLM inference by adaptively choosing between cloud KV streaming and local computation while overlapping execution and adjusting...
- MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search
  MemSearch-o1 uses reasoning-aligned memory growth from seed tokens, retracing via contribution functions, and path reorganization to mitigate memory dilution in LLM agentic search.
- MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search
  MemSearch-o1 mitigates memory dilution in agentic LLM search through reasoning-aligned token-level memory growth, retracing with a contribution function, and path reorganization, improving reasoning activation on benchmarks.
- When Agents Go Quiet: Output Generation Capacity and Format-Cost Separation for LLM Document Synthesis
  LLM agents avoid output stalling and reduce generation tokens by 48-72% via deferred template rendering guided by Output Generation Capacity and a Format-Cost Separation Theorem.
- Accuracy Is Speed: Towards Long-Context-Aware Routing for Distributed LLM Serving
  In long-context LLM serving, accuracy becomes speed via retry dynamics, and accuracy-aware routing reduces time-to-correct-answer.
- Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs
  Tri-RAG turns external knowledge into Condition-Proof-Conclusion triplets and retrieves via the Condition anchor to improve efficiency and quality in LLM RAG.
- IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs
  IceCache combines semantic token clustering with PagedAttention to keep only 25% of the KV cache tokens while retaining 99% accuracy on LongBench and matching or beating prior offloading methods in latency.
- PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling
  PyramidKV dynamically compresses KV cache across layers following pyramidal information funneling, matching full performance at 12% retention and outperforming alternatives at 0.7% retention with up to 20.5 accuracy gains.
- SnapKV: LLM Knows What You are Looking for Before Generation
  SnapKV selects clustered important KV positions per attention head from an observation window at the prompt end, yielding 3.6x faster generation and 8.2x better memory efficiency on 16K-token inputs with comparable pe...
- KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
  KIVI applies asymmetric 2-bit quantization to KV cache with per-channel keys and per-token values, reducing memory 2.6x and boosting throughput up to 3.47x with near-identical quality on Llama, Falcon, and Mistral.
- Efficient Streaming Language Models with Attention Sinks
  StreamingLLM lets finite-window LLMs generalize to infinite-length sequences by retaining initial-token KV states as attention sinks, enabling stable streaming inference up to 4M tokens.
- StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
  Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.
- Budget-Aware Routing for Long Clinical Text
  RCD balances relevance, coverage, and diversity in a knapsack-constrained selection framework, with experiments showing that selector choice and budget level determine optimal unitization strategies on clinical datasets.
- HieraSparse: Hierarchical Semi-Structured Sparse KV Attention
  HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, pl...
- Gated Delta Networks: Improving Mamba2 with Delta Rule
  Gated DeltaNet integrates gating and delta rules into linear transformers, outperforming Mamba2 and DeltaNet on language modeling, reasoning, retrieval, and long-context tasks.
- Hierarchical vs. Flat Iteration in Shared-Weight Transformers
  Hierarchical two-speed shared-weight recurrence in Transformers shows a sharp performance gap compared to independent layer stacking in empirical language modeling tests.