How to train data-efficient LLMs

Noveen Sachdeva, Benjamin Coleman, Wang-Cheng Kang, Jianmo Ni, Lichan Hong, Ed H Chi, James Caverlee, Julian McAuley, Derek Zhiyuan Cheng · 2024 · arXiv 2402.09668

13 Pith papers cite this work. Polarity classification is still indexing.

13 Pith papers citing it

read on arXiv browse 13 citing papers

citation-role summary

background 2 dataset 1 method 1

citation-polarity summary

background 2 use dataset 1 use method 1

representative citing papers

Repetition Mismatch: Why Data Mixture Experiments Don't Scale and How to Fix Them

cs.LG · 2026-05-29 · conditional · novelty 7.0

Repetition rate mismatch between small-scale proxies and target budgets is the main reason data mixture experiments do not scale; a subsampling procedure that equalizes repetition rates recovers optimal mixtures from 1/16-scale experiments.

Efficient Test-Time Finetuning of LLMs via Convex Reconstruction and Gradient Caching

cs.LG · 2026-05-28 · unverdicted · novelty 6.0

HullFT performs test-time finetuning by sparse convex reconstruction of query embeddings followed by gradient caching on repeated examples, yielding better quality-efficiency tradeoffs than prior TTFT methods.

Accelerated Relax-and-Round for Concave Coverage Problems

cs.DS · 2026-05-07 · unverdicted · novelty 6.0

An accelerated relax-and-round algorithm for concave coverage problems achieves Õ(mn ε^{-1}) runtime and a 0.827-approximation ratio for the logarithmic reward function.

KoCo: Conditioning Language Model Pre-training on Knowledge Coordinates

cs.CL · 2026-04-14 · unverdicted · novelty 6.0

KoCo conditions LLM pre-training by prepending three-dimensional semantic coordinates to documents, improving performance on 10 downstream tasks, accelerating convergence by 30%, and helping distinguish facts from noise to reduce hallucinations.

Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

cs.CL · 2026-04-09 · conditional · novelty 6.0

Loss-based pruning of training data to limit facts and flatten their frequency distribution enables a 110M-parameter GPT-2 model to memorize 1.3 times more entity facts than standard training, matching a 1.3B-parameter model on the full dataset.

DataComp-LM: In search of the next generation of training sets for language models

cs.LG · 2024-06-17 · unverdicted · novelty 6.0

DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.

DOSE: Data Selection for Multi-Modal LLMs via Off-the-Shelf Models

cs.CV · 2026-04-18 · unverdicted · novelty 5.0

Off-the-shelf models assess quality and alignment to select diverse multimodal training data, letting models trained on the filtered subset match or exceed full-dataset results on standard benchmarks.

InternLM2 Technical Report

cs.CL · 2024-03-26 · unverdicted · novelty 5.0

InternLM2 is a new open-source LLM that outperforms prior versions on 30 benchmarks and long-context tasks through scaled pre-training to 32k tokens and a conditional online RLHF alignment strategy.

Reflections and New Directions for Human-Centered Large Language Models

cs.CL · 2026-05-07 · unverdicted · novelty 4.0

Model developers must address human concerns, preferences, values, and goals with rigor at every stage of the LLM pipeline rather than only in post-training.

Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence

cs.AI · 2026-05-07 · unverdicted · novelty 4.0 · 2 refs

Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.

An Empirical Study on Influence-Based Pretraining Data Selection for Code Large Language Models

cs.SE · 2026-04-09 · unverdicted · novelty 4.0

Data-influence-score filtering using validation-set loss on downstream coding tasks improves Code-LLM performance, with the most beneficial training data varying significantly across different programming tasks.

Gemma 3 Technical Report

cs.CL · 2025-03-25 · accept · novelty 4.0

Gemma 3 introduces multimodal open models with architectural changes for efficient long context, trained via distillation and a new post-training recipe that makes the 4B version competitive with prior 27B models and the 27B version comparable to Gemini-1.5-Pro.

A Survey of Large Language Models

cs.CL · 2023-03-31 · accept · novelty 3.0

This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

citing papers explorer

Showing 1 of 1 citing paper after filters.

Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence cs.AI · 2026-05-07 · unverdicted · none · ref 74 · 2 links
Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.

How to train data-efficient LLMs

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer