How to Train Data-Efficient LLMs

Noveen Sachdeva, Benjamin Coleman, Wang-Cheng Kang, Jianmo Ni, Lichan Hong, Ed H Chi, James Caverlee, Julian McAuley, Derek Zhiyuan Cheng · 2024 · arXiv 2402.09668

8 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: background 1

citation-polarity summary: background 1

representative citing papers

Accelerated Relax-and-Round for Concave Coverage Problems

cs.DS · 2026-05-07 · unverdicted · novelty 6.0

An accelerated relax-and-round algorithm for concave coverage problems achieves Õ(mn ε^{-1}) runtime and a 0.827-approximation ratio for the logarithmic reward function.

KoCo: Conditioning Language Model Pre-training on Knowledge Coordinates

cs.CL · 2026-04-14 · unverdicted · novelty 6.0

KoCo conditions LLM pre-training by prepending three-dimensional semantic coordinates to documents, improving performance on 10 downstream tasks, accelerating convergence by 30%, and helping distinguish facts from noise to reduce hallucinations.

Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

cs.CL · 2026-04-09 · conditional · novelty 6.0

Loss-based pruning of training data to limit facts and flatten their frequency distribution enables a 110M-parameter GPT-2 model to memorize 1.3 times more entity facts than standard training, matching a 1.3B-parameter model on the full dataset.

DOSE: Data Selection for Multi-Modal LLMs via Off-the-Shelf Models

cs.CV · 2026-04-18 · unverdicted · novelty 5.0

Off-the-shelf models assess quality and alignment to select diverse multimodal training data, letting models trained on the filtered subset match or exceed full-dataset results on standard benchmarks.

Reflections and New Directions for Human-Centered Large Language Models

cs.CL · 2026-05-07 · unverdicted · novelty 4.0

Model developers must address human concerns, preferences, values, and goals with rigor at every stage of the LLM pipeline rather than only in post-training.

Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence

cs.AI · 2026-05-07 · unverdicted · novelty 4.0 · 2 refs

Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.

An Empirical Study on Influence-Based Pretraining Data Selection for Code Large Language Models

cs.SE · 2026-04-09 · unverdicted · novelty 4.0

Data-influence-score filtering using validation-set loss on downstream coding tasks improves Code-LLM performance, with the most beneficial training data varying significantly across different programming tasks.

A Survey of Large Language Models

cs.CL · 2023-03-31 · accept · novelty 3.0

This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

citing papers explorer

Showing 8 of 8 citing papers.

Accelerated Relax-and-Round for Concave Coverage Problems
cs.DS · 2026-05-07 · unverdicted · none · ref 39

KoCo: Conditioning Language Model Pre-training on Knowledge Coordinates
cs.CL · 2026-04-14 · unverdicted · none · ref 2

Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts
cs.CL · 2026-04-09 · conditional · none · ref 74

DOSE: Data Selection for Multi-Modal LLMs via Off-the-Shelf Models
cs.CV · 2026-04-18 · unverdicted · none · ref 6

Reflections and New Directions for Human-Centered Large Language Models
cs.CL · 2026-05-07 · unverdicted · none · ref 27

Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence
cs.AI · 2026-05-07 · unverdicted · none · ref 74 · 2 links

An Empirical Study on Influence-Based Pretraining Data Selection for Code Large Language Models
cs.SE · 2026-04-09 · unverdicted · none · ref 35

A Survey of Large Language Models
cs.CL · 2023-03-31 · accept · none · ref 237
