LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
Pith reviewed 2026-05-12 20:17 UTC · model grok-4.3
The pith
LongBench introduces the first bilingual multi-task benchmark with 21 datasets to evaluate how well LLMs handle texts thousands of words long.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LongBench comprises 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words in English and 13,386 characters in Chinese. Evaluation of eight LLMs shows that GPT-3.5-Turbo-16k outperforms open-source models yet still struggles on longer contexts; that scaled position embeddings and fine-tuning on longer sequences bring substantial gains; and that retrieval-style compression helps weaker models but does not close the gap to models with strong native long-context ability.
What carries the argument
The LongBench benchmark itself, a collection of 21 standardized datasets spanning six categories (single-doc QA, multi-doc QA, summarization, few-shot learning, synthetic tasks, code completion) in English and Chinese.
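To make the standardized format and automatic scoring concrete, here is a minimal sketch of what evaluating one LongBench-style task file could look like. The JSON field names (context, input, answers) and the token-level F1 metric are illustrative assumptions, not the repository's exact schema or metric mapping.

```python
import json
from collections import Counter

def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1, a common automatic metric for extractive QA-style tasks."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def evaluate_task(jsonl_path: str, generate) -> dict:
    """Score a model (a callable `generate(prompt) -> str`) on one task file,
    assuming each line is a JSON example in the unified format."""
    scores = []
    with open(jsonl_path) as f:
        for line in f:
            ex = json.loads(line)  # assumed keys: context, input, answers
            prompt = f"{ex['context']}\n\nQuestion: {ex['input']}\nAnswer:"
            pred = generate(prompt)
            # take the best score over all acceptable gold answers
            scores.append(max(f1_score(pred, ans) for ans in ex["answers"]))
    return {"examples": len(scores), "mean_score": sum(scores) / max(len(scores), 1)}
```

A real harness would dispatch a different metric per task category (e.g., ROUGE for summarization, accuracy for classification, a similarity measure for code), but the control flow is the same.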
If this is right
- Commercial LLMs like GPT-3.5-Turbo-16k still underperform on the longest inputs even though they lead overall.
- Techniques that scale position embeddings or fine-tune on longer sequences produce clear gains in long-context tasks.
- Retrieval-based compression improves models that lack strong long-context handling but leaves them behind models that already manage extended inputs natively.
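A minimal sketch of the retrieval-style compression referenced in the last bullet: split the long context into chunks, rank them against the query, and keep only the top-ranked chunks within a budget. The lexical-overlap ranker and the chunking parameters here are stand-ins; practical setups typically use a dense or BM25 retriever.

```python
def compress_context(context: str, query: str,
                     chunk_size: int = 200, budget_chunks: int = 8) -> str:
    """Keep only the chunks most lexically similar to the query, so the result
    fits a model whose window is shorter than the original document."""
    words = context.split()
    chunks = [" ".join(words[i:i + chunk_size])
              for i in range(0, len(words), chunk_size)]
    query_terms = set(query.lower().split())

    def overlap(chunk: str) -> float:
        return len(query_terms & set(chunk.lower().split())) / (len(query_terms) or 1)

    ranked = sorted(range(len(chunks)), key=lambda i: overlap(chunks[i]), reverse=True)
    kept = sorted(ranked[:budget_chunks])  # restore original document order
    return "\n...\n".join(chunks[i] for i in kept)
```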
Where Pith is reading between the lines
- Future work could use LongBench scores as a quick filter before testing on real-world long documents.
- The bilingual design may expose language-specific differences in how models handle long contexts that English-only tests miss.
- If new models are trained with LongBench in mind, they might overfit to its particular dataset mix rather than general long-text skills.
Load-bearing premise
The 21 selected datasets and six task categories capture the essential difficulties of long-context understanding without letting models succeed through dataset artifacts or shortcuts instead of genuine long-range reasoning.
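One way to pressure-test this premise, not something the paper reports, is a question-only ablation: score each example with and without its long context and flag cases where the gap is small. A minimal sketch, assuming the `generate` and `score` callables an evaluation harness would provide:

```python
def shortcut_probe(examples, generate, score) -> dict:
    """Question-only ablation: if a model scores nearly as well without the
    context, the example may be solvable via shortcuts or prior knowledge
    rather than genuine long-range reading."""
    gaps = []
    for ex in examples:  # assumed keys: context, input, answers
        full = generate(f"{ex['context']}\n\n{ex['input']}")
        blind = generate(ex["input"])  # no context at all
        s_full = max(score(full, ans) for ans in ex["answers"])
        s_blind = max(score(blind, ans) for ans in ex["answers"])
        gaps.append(s_full - s_blind)
    n = max(len(gaps), 1)
    return {
        "mean_gap": sum(gaps) / n,                       # large gap = context matters
        "shortcut_rate": sum(g <= 0 for g in gaps) / n,  # fraction solvable blind
    }
```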
What would settle it
A model that scores high on every LongBench task but fails when given full untruncated books, reports, or code repositories in downstream applications would show the benchmark does not measure the intended capability.
Original abstract
Although large language models (LLMs) demonstrate impressive performance for many language tasks, most of them can only handle texts a few thousand tokens long, limiting their applications on longer sequence inputs, such as books, reports, and codebases. Recent works have proposed methods to improve LLMs' long context capabilities by extending context windows and more sophisticated memory mechanisms. However, comprehensive benchmarks tailored for evaluating long context understanding are lacking. In this paper, we introduce LongBench, the first bilingual, multi-task benchmark for long context understanding, enabling a more rigorous evaluation of long context understanding. LongBench comprises 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese). These tasks cover key long-text application areas including single-doc QA, multi-doc QA, summarization, few-shot learning, synthetic tasks, and code completion. All datasets in LongBench are standardized into a unified format, allowing for effortless automatic evaluation of LLMs. Upon comprehensive evaluation of 8 LLMs on LongBench, we find that: (1) Commercial model (GPT-3.5-Turbo-16k) outperforms other open-sourced models, but still struggles on longer contexts. (2) Scaled position embedding and fine-tuning on longer sequences lead to substantial improvement on long context understanding. (3) Context compression technique such as retrieval brings improvement for model with weak ability on long contexts, but the performance still lags behind models that have strong long context understanding capability. The code and datasets are available at https://github.com/THUDM/LongBench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LongBench, the first bilingual (English/Chinese), multi-task benchmark for long-context LLM understanding. It comprises 21 datasets across 6 categories (single-doc QA, multi-doc QA, summarization, few-shot learning, synthetic tasks, code completion) with average lengths of 6,711 words (English) and 13,386 characters (Chinese). All datasets are standardized to a unified format for automatic evaluation. The authors evaluate 8 LLMs and report that GPT-3.5-Turbo-16k leads but still struggles on long contexts, that scaled position embeddings and long-sequence fine-tuning yield substantial gains, and that retrieval-based compression helps weaker models but does not close the gap to strong long-context models.
Significance. LongBench addresses a genuine gap by supplying a standardized, automatically scorable, bilingual resource for tracking progress on long-context capabilities. If the datasets prove representative and free of exploitable artifacts, the benchmark will support reproducible comparisons and accelerate work on context extension and memory mechanisms. The public release of code and data is a clear strength.
Major comments (2)
- §3 (LongBench Construction): The manuscript provides high-level descriptions of the 21 datasets but does not detail selection criteria, inter-annotator agreement, or explicit checks for annotation artifacts and length-based shortcuts. Without these, it is difficult to confirm that the benchmark isolates long-context understanding rather than other capabilities.
- §4 (Experiments and Analysis): The three headline findings are stated without per-task breakdowns, confidence intervals, or ablation studies (e.g., retrieval quality vs. context length). This weakens the claim that retrieval “brings improvement” while still lagging behind strong models, as the contribution of long-context modeling versus other factors cannot be isolated.
Minor comments (3)
- The abstract asserts LongBench is “the first bilingual” benchmark; a concise related-work paragraph comparing it to prior long-context suites (e.g., those focused on English only) would strengthen this claim.
- Table 1 (dataset statistics) should include token counts under a common tokenizer and explicit context-length histograms to make the “long context” characterization quantitative; a minimal sketch of such a check follows this list.
- The GitHub link is given, but the paper should specify the exact commit or release tag used for the reported results to ensure reproducibility.
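For the token-count suggestion above, the computation itself is routine; below is a minimal sketch using the Hugging Face transformers tokenizer API. The choice of GPT-2 as the shared tokenizer and the 2,048-token bin width are arbitrary illustrations, not what the authors use.

```python
from collections import Counter
from transformers import AutoTokenizer

def length_profile(texts, tokenizer_name: str = "gpt2", bin_size: int = 2048) -> dict:
    """Token counts under one shared tokenizer plus a coarse length histogram,
    the kind of statistics a dataset table could report alongside word counts."""
    tok = AutoTokenizer.from_pretrained(tokenizer_name)
    lengths = [len(tok.encode(t)) for t in texts]
    hist = Counter((n // bin_size) * bin_size for n in lengths)
    return {
        "mean_tokens": sum(lengths) / len(lengths),
        "max_tokens": max(lengths),
        "histogram": dict(sorted(hist.items())),  # bin start -> example count
    }
```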
Simulated Authors' Rebuttal
We thank the referee for the positive assessment and recommendation of minor revision. The comments are constructive and we address each major point below, indicating planned changes to the manuscript.
Point-by-point responses
- Referee: §3 (LongBench Construction): The manuscript provides high-level descriptions of the 21 datasets but does not detail selection criteria, inter-annotator agreement, or explicit checks for annotation artifacts and length-based shortcuts. Without these, it is difficult to confirm that the benchmark isolates long-context understanding rather than other capabilities.
  Authors: We appreciate this observation. Most datasets in LongBench are adapted from existing public resources (with full citations in Section 3 and the appendix), selected specifically because their average lengths exceed standard context windows and require cross-sentence or cross-document reasoning. For the few newly constructed datasets, we followed standard annotation protocols from the source tasks. Inter-annotator agreement is reported for the human-annotated subsets where available. In the revision we will expand §3 with an explicit subsection on selection criteria, a summary of any artifact checks performed (e.g., length-distribution analysis and shortcut probes), and a clearer statement of limitations regarding potential non-contextual cues. [Revision: partial]
- Referee: §4 (Experiments and Analysis): The three headline findings are stated without per-task breakdowns, confidence intervals, or ablation studies (e.g., retrieval quality vs. context length). This weakens the claim that retrieval “brings improvement” while still lagging behind strong models, as the contribution of long-context modeling versus other factors cannot be isolated.
  Authors: Thank you for the suggestion. Table 2 already reports per-task scores for all eight models across the 21 datasets. We will add 95% confidence intervals (via bootstrap resampling) to the main results table in the revision. Section 4.3 contains an initial retrieval ablation; we will expand it with a controlled comparison of retrieval quality (top-k accuracy) versus raw context length, plus an additional ablation that varies the compression ratio while holding model architecture fixed. These additions should better isolate the contribution of long-context modeling. [Revision: partial]
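The bootstrap the authors commit to is a standard recipe. Below is a minimal sketch of a percentile bootstrap over per-example scores; the resample count and seed are arbitrary choices, not the paper's.

```python
import random

def bootstrap_ci(scores, n_resamples: int = 1000, alpha: float = 0.05, seed: int = 0):
    """Percentile-bootstrap confidence interval for the mean per-example score."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=n)) / n  # resample examples with replacement
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(scores) / n, (lo, hi)  # point estimate, (lower, upper) bounds
```

Reported per task, such intervals would let readers judge whether the gaps behind the three headline findings exceed sampling noise.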
Circularity Check
No significant circularity
Full rationale
The paper introduces LongBench, a new bilingual benchmark with 21 datasets across 6 task categories for evaluating long-context LLMs. There is no derivation chain, no set of equations or fitted parameters, and no prediction that could reduce to its inputs by construction. The central contribution is dataset curation, standardization into a unified format, and empirical evaluation of 8 models; validity rests on transparent curation rather than on self-referential logic, load-bearing self-citation, or an ansatz smuggled in via prior work. This is a standard resource paper with no circular steps.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Long-context understanding is a key current limitation of LLMs that requires dedicated benchmarks for rigorous evaluation.
Forward citations
Cited by 28 Pith papers
- RULER: What's the Real Context Size of Your Long-Context Language Models?
  RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.
- KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
  KVServe delivers up to 9.13x job completion time speedup and 32.8x time-to-first-token reduction by making KV cache compression service-aware and adaptive in disaggregated LLM serving.
- OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory
  OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.
- PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training
  Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.
- FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving
  FASER delivers up to 53% higher throughput and 1.92x lower latency in dynamic LLM serving by adjusting speculative lengths per request, early pruning of rejects, and overlapping draft/verification phases via frontiers.
- Weak-Link Optimization for Multi-Agent Reasoning and Collaboration
  WORC improves multi-agent LLM reasoning to 82.2% average accuracy by predicting and compensating for the weakest agent via targeted extra sampling rather than uniform reinforcement.
- Transactional Attention: Semantic Sponsorship for KV-Cache Retention
  Transactional Attention uses semantic sponsorship from anchor patterns to retain dormant critical tokens in KV caches, achieving 100% credential retrieval at 16 tokens where all prior methods fail.
- ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing
  ReST-KV formulates KV eviction as layer-wise output reconstruction optimization with spatial-temporal smoothing, outperforming baselines by 2.58% on LongBench and 15.2% on RULER while cutting decoding latency by 10.61...
- Federation of Experts: Communication Efficient Distributed Inference for Large Language Models
  FoE restructures MoE blocks into per-KV-head clusters with sum-based synchronization, removing all-to-all communication in single-node settings and limiting it to intra-node in multi-node settings for up to 5.2x faste...
- When Stress Becomes Signal: Detecting Antifragility-Compatible Regimes in Multi-Agent LLM Systems
  CAFE detects positive distributional Jensen Gaps across five multi-agent LLM architectures on a banking-risk benchmark, showing that quality drops under semantic stress can coexist with statistically detectable antifr...
- When Stress Becomes Signal: Detecting Antifragility-Compatible Regimes in Multi-Agent LLM Systems
  CAFE finds positive distributional Jensen Gaps across five multi-agent LLM architectures under semantic stress, showing that quality drops can coexist with detectable stress geometry compatible with antifragile learning.
- FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption
  FlashRT delivers 2x-7x speedup and 2x-4x GPU memory reduction for prompt injection and knowledge corruption attacks on long-context LLMs versus nanoGCG.
- SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference
  SparKV reduces time-to-first-token by 1.3x-5.1x and energy use by 1.5x-3.3x for on-device LLM inference by adaptively choosing between cloud KV streaming and local computation while overlapping execution and adjusting...
- MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search
  MemSearch-o1 uses reasoning-aligned memory growth from seed tokens, retracing via contribution functions, and path reorganization to mitigate memory dilution in LLM agentic search.
- MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search
  MemSearch-o1 mitigates memory dilution in agentic LLM search through reasoning-aligned token-level memory growth, retracing with a contribution function, and path reorganization, improving reasoning activation on benchmarks.
- When Agents Go Quiet: Output Generation Capacity and Format-Cost Separation for LLM Document Synthesis
  LLM agents avoid output stalling and reduce generation tokens by 48-72% via deferred template rendering guided by Output Generation Capacity and a Format-Cost Separation Theorem.
- Accuracy Is Speed: Towards Long-Context-Aware Routing for Distributed LLM Serving
  In long-context LLM serving, accuracy becomes speed via retry dynamics, and accuracy-aware routing reduces time-to-correct-answer.
- Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs
  Tri-RAG turns external knowledge into Condition-Proof-Conclusion triplets and retrieves via the Condition anchor to improve efficiency and quality in LLM RAG.
- IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs
  IceCache combines semantic token clustering with PagedAttention to keep only 25% of the KV cache tokens while retaining 99% accuracy on LongBench and matching or beating prior offloading methods in latency.
- PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling
  PyramidKV dynamically compresses KV cache across layers following pyramidal information funneling, matching full performance at 12% retention and outperforming alternatives at 0.7% retention with up to 20.5 accuracy gains.
- SnapKV: LLM Knows What You are Looking for Before Generation
  SnapKV selects clustered important KV positions per attention head from an observation window at the prompt end, yielding 3.6x faster generation and 8.2x better memory efficiency on 16K-token inputs with comparable pe...
- KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
  KIVI applies asymmetric 2-bit quantization to KV cache with per-channel keys and per-token values, reducing memory 2.6x and boosting throughput up to 3.47x with near-identical quality on Llama, Falcon, and Mistral.
- Efficient Streaming Language Models with Attention Sinks
  StreamingLLM lets finite-window LLMs generalize to infinite-length sequences by retaining initial-token KV states as attention sinks, enabling stable streaming inference up to 4M tokens.
- StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
  Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.
- Budget-Aware Routing for Long Clinical Text
  RCD balances relevance, coverage, and diversity in a knapsack-constrained selection framework, with experiments showing that selector choice and budget level determine optimal unitization strategies on clinical datasets.
- HieraSparse: Hierarchical Semi-Structured Sparse KV Attention
  HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, pl...
- Gated Delta Networks: Improving Mamba2 with Delta Rule
  Gated DeltaNet integrates gating and delta rules into linear transformers, outperforming Mamba2 and DeltaNet on language modeling, reasoning, retrieval, and long-context tasks.
- Hierarchical vs. Flat Iteration in Shared-Weight Transformers
  Hierarchical two-speed shared-weight recurrence in Transformers shows a sharp performance gap compared to independent layer stacking in empirical language modeling tests.