hub Canonical reference

The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli · 2023 · cs.CL · arXiv 2306.01116

Canonical reference. 79% of citing Pith papers cite this work as background.

49 Pith papers citing it

Background 79% of classified citations

open full Pith review browse 49 citing papers arXiv PDF

abstract

Large language models are commonly trained on a mixture of filtered web data and curated high-quality corpora, such as social media conversations, books, or technical papers. This curation process is believed to be necessary to produce performant models with broad zero-shot generalization abilities. However, as larger models requiring pretraining on trillions of tokens are considered, it is unclear how scalable is curation and whether we will run out of unique high-quality data soon. At variance with previous beliefs, we show that properly filtered and deduplicated web data alone can lead to powerful models; even significantly outperforming models from the state-of-the-art trained on The Pile. Despite extensive filtering, the high-quality data we extract from the web is still plentiful, and we are able to obtain five trillion tokens from CommonCrawl. We publicly release an extract of 600 billion tokens from our RefinedWeb dataset, and 1.3/7.5B parameters language models trained on it.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 12 baseline 1 dataset 1

citation-polarity summary

background 11 baseline 1 unclear 1 use dataset 1

representative citing papers

AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages

cs.CL · 2026-04-09 · unverdicted · novelty 8.0

AfriVoices-KE is a 3,000-hour multilingual speech dataset for Dholuo, Kikuyu, Kalenjin, Maasai, and Somali with 750 hours scripted and 2,250 hours spontaneous speech from 4,777 speakers.

Learning the Signature of Memorization in Autoregressive Language Models

cs.CL · 2026-04-03 · accept · novelty 8.0

A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

Universal and Transferable Adversarial Attacks on Aligned Language Models

cs.CL · 2023-07-27 · accept · novelty 8.0

Gradient and greedy search over token suffixes produces universal, transferable adversarial prompts that elicit objectionable outputs from aligned models including black-box commercial systems.

MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

MIST is a new synthetic speech-based tool-calling dataset for IoT devices that exposes performance gaps between open- and closed-weight multimodal LLMs.

ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs

cs.AR · 2026-03-28 · unverdicted · novelty 7.0

ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

cs.CL · 2024-05-07 · unverdicted · novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

Learning Interactive Real-World Simulators

cs.AI · 2023-10-09 · conditional · novelty 7.0

UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.

A Bitter Lesson for Data Filtering

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

With enough compute, large models benefit from training on unfiltered data that includes low-quality and distractor examples instead of requiring high-quality filtered data.

From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

cs.CL · 2026-05-14 · unverdicted · novelty 6.0

A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance with small text-to-voice gaps.

Know When To Fold 'Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection

cs.AI · 2026-05-13 · unverdicted · novelty 6.0

MSIFR stops faulty LLM generations early via staged rule-based checks, reducing token consumption 11-78% with no accuracy loss.

ZAYA1-8B Technical Report

cs.AI · 2026-05-06 · unverdicted · novelty 6.0

ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation

cs.CV · 2026-05-04 · unverdicted · novelty 6.0

A video transfer pipeline augments simulated VLA data into realistic videos while preserving actions, yielding consistent performance gains on robot benchmarks such as 8% on Robotwin 2.0.

MFMDQwen: Multilingual Financial Misinformation Detection Based on Large Language Model

cs.CE · 2026-04-20 · unverdicted · novelty 6.0

MFMDQwen is the first open-source LLM for multilingual financial misinformation detection, backed by a new instruction dataset and benchmark on which it outperforms other open-source models.

Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

Adversarial competition between attacker and defender teams generates diverse multi-turn conversational data that improves LLM performance on secure code generation benchmarks by 18-29%.

Spike-driven Large Language Model

cs.NE · 2026-04-11 · unverdicted · novelty 6.0

SDLLM is a spike-driven LLM that uses gamma-SQP two-step encoding, bidirectional symmetric quantization, and membrane potential clipping to achieve 7x lower energy consumption and 4.2% higher accuracy than prior spike-based language models.

CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning

cs.CL · 2025-09-26 · conditional · novelty 6.0

CoSpaDi introduces a training-free sparse dictionary learning framework for post-training LLM compression that optimizes functional reconstruction error via activation-derived orthonormalization and achieves improved accuracy-compression trade-offs over SVD and pruning baselines.

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

cs.AI · 2025-07-01 · conditional · novelty 6.0

Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.

LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws

cs.LG · 2025-02-17 · unverdicted · novelty 6.0

Pretraining data determines loss-to-loss scaling laws in LLMs, while model size, optimization, tokenizer, and architecture have limited impact.

DataComp-LM: In search of the next generation of training sets for language models

cs.LG · 2024-06-17 · unverdicted · novelty 6.0

DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

cs.CL · 2024-02-05 · conditional · novelty 6.0

KIVI applies asymmetric 2-bit quantization to KV cache with per-channel keys and per-token values, reducing memory 2.6x and boosting throughput up to 3.47x with near-identical quality on Llama, Falcon, and Mistral.

MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

cs.CV · 2024-01-29 · conditional · novelty 6.0

MoE-LLaVA applies mixture-of-experts sparsity to LVLMs via MoE-Tuning, delivering LLaVA-1.5-7B level visual understanding and better hallucination resistance with only ~3B active parameters.

The Falcon Series of Open Language Models

cs.CL · 2023-11-28 · conditional · novelty 6.0

Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

cs.CV · 2023-11-25 · conditional · novelty 6.0

Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results while enabling motion LoRA and multi-view 3D applications.

citing papers explorer

Showing 49 of 49 citing papers.

AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages cs.CL · 2026-04-09 · unverdicted · none · ref 16 · internal anchor
AfriVoices-KE is a 3,000-hour multilingual speech dataset for Dholuo, Kikuyu, Kalenjin, Maasai, and Somali with 750 hours scripted and 2,250 hours spontaneous speech from 4,777 speakers.
Learning the Signature of Memorization in Autoregressive Language Models cs.CL · 2026-04-03 · accept · none · ref 15 · internal anchor
A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.
ORPO: Monolithic Preference Optimization without Reference Model cs.CL · 2024-03-12 · conditional · none · ref 39 · internal anchor
ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
Universal and Transferable Adversarial Attacks on Aligned Language Models cs.CL · 2023-07-27 · accept · none · ref 19 · internal anchor
Gradient and greedy search over token suffixes produces universal, transferable adversarial prompts that elicit objectionable outputs from aligned models including black-box commercial systems.
MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes cs.CL · 2026-05-07 · unverdicted · none · ref 30 · internal anchor
MIST is a new synthetic speech-based tool-calling dataset for IoT devices that exposes performance gaps between open- and closed-weight multimodal LLMs.
ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs cs.AR · 2026-03-28 · unverdicted · none · ref 47 · internal anchor
ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cs.CL · 2024-05-07 · unverdicted · none · ref 150 · internal anchor
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
Learning Interactive Real-World Simulators cs.AI · 2023-10-09 · conditional · none · ref 140 · internal anchor
UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
A Bitter Lesson for Data Filtering cs.LG · 2026-05-19 · unverdicted · none · ref 15 · internal anchor
With enough compute, large models benefit from training on unfiltered data that includes low-quality and distractor examples instead of requiring high-quality filtered data.
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents cs.CL · 2026-05-14 · unverdicted · none · ref 113 · internal anchor
A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance with small text-to-voice gaps.
Know When To Fold 'Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection cs.AI · 2026-05-13 · unverdicted · none · ref 21 · internal anchor
MSIFR stops faulty LLM generations early via staged rule-based checks, reducing token consumption 11-78% with no accuracy loss.
ZAYA1-8B Technical Report cs.AI · 2026-05-06 · unverdicted · none · ref 70 · internal anchor
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation cs.CV · 2026-05-04 · unverdicted · none · ref 20 · internal anchor
A video transfer pipeline augments simulated VLA data into realistic videos while preserving actions, yielding consistent performance gains on robot benchmarks such as 8% on Robotwin 2.0.
MFMDQwen: Multilingual Financial Misinformation Detection Based on Large Language Model cs.CE · 2026-04-20 · unverdicted · none · ref 4 · internal anchor
MFMDQwen is the first open-source LLM for multilingual financial misinformation detection, backed by a new instruction dataset and benchmark on which it outperforms other open-source models.
Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition cs.AI · 2026-04-20 · unverdicted · none · ref 54 · internal anchor
Adversarial competition between attacker and defender teams generates diverse multi-turn conversational data that improves LLM performance on secure code generation benchmarks by 18-29%.
Spike-driven Large Language Model cs.NE · 2026-04-11 · unverdicted · none · ref 3 · internal anchor
SDLLM is a spike-driven LLM that uses gamma-SQP two-step encoding, bidirectional symmetric quantization, and membrane potential clipping to achieve 7x lower energy consumption and 4.2% higher accuracy than prior spike-based language models.
CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning cs.CL · 2025-09-26 · conditional · none · ref 13 · internal anchor
CoSpaDi introduces a training-free sparse dictionary learning framework for post-training LLM compression that optimizes functional reconstruction error via activation-derived orthonormalization and achieves improved accuracy-compression trade-offs over SVD and pruning baselines.
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning cs.AI · 2025-07-01 · conditional · none · ref 235 · internal anchor
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws cs.LG · 2025-02-17 · unverdicted · none · ref 33 · internal anchor
Pretraining data determines loss-to-loss scaling laws in LLMs, while model size, optimization, tokenizer, and architecture have limited impact.
DataComp-LM: In search of the next generation of training sets for language models cs.LG · 2024-06-17 · unverdicted · none · ref 139 · internal anchor
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache cs.CL · 2024-02-05 · conditional · none · ref 13 · internal anchor
KIVI applies asymmetric 2-bit quantization to KV cache with per-channel keys and per-token values, reducing memory 2.6x and boosting throughput up to 3.47x with near-identical quality on Llama, Falcon, and Mistral.
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models cs.CV · 2024-01-29 · conditional · none · ref 25 · internal anchor
MoE-LLaVA applies mixture-of-experts sparsity to LVLMs via MoE-Tuning, delivering LLaVA-1.5-7B level visual understanding and better hallucination resistance with only ~3B active parameters.
The Falcon Series of Open Language Models cs.CL · 2023-11-28 · conditional · none · ref 116 · internal anchor
Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets cs.CV · 2023-11-25 · conditional · none · ref 63 · internal anchor
Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results while enabling motion LoRA and multi-view 3D applications.
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving cs.CL · 2023-09-29 · conditional · none · ref 33 · internal anchor
ToRA trains language models on interactive tool-use trajectories with imitation learning and output shaping to integrate reasoning and external tools, yielding 13-19% gains on math datasets and new highs like 44.6% on MATH for a 7B model.
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models cs.CL · 2023-09-21 · conditional · none · ref 57 · internal anchor
Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.
MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning cs.CL · 2023-09-11 · conditional · none · ref 35 · internal anchor
MAmmoTH models trained via hybrid CoT-PoT instruction tuning on MathInstruct outperform prior open-source LLMs by 16-32% average accuracy on nine math datasets, reaching 33% and 44% on MATH for 7B and 34B scales.
Baseline Defenses for Adversarial Attacks Against Aligned Language Models cs.LG · 2023-09-01 · conditional · none · ref 43 · internal anchor
Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration cs.CL · 2023-06-01 · conditional · none · ref 26 · internal anchor
AWQ quantizes LLM weights to low bits by scaling salient channels based on activation statistics, outperforming prior methods on language, coding, math, and multi-modal benchmarks.
Scaling Data-Constrained Language Models cs.CL · 2023-05-25 · conditional · none · ref 93 · internal anchor
Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.
Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech eess.AS · 2026-05-20 · unverdicted · none · ref 22 · internal anchor
Raon-OpenTTS provides an open 510K-hour curated speech dataset and DiT-based TTS models up to 1B parameters that achieve competitive WER and speaker similarity on benchmarks versus closed models trained on millions of hours.
A Comparative Study of Controlled Text Generation Systems Using Level-Playing-Field Evaluation Principles cs.CL · 2026-05-12 · unverdicted · none · ref 43 · internal anchor
Re-evaluating controlled text generation systems under standardized conditions reveals that many published performance claims do not hold, highlighting the need for consistent evaluation practices.
Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods cs.LG · 2026-04-19 · unverdicted · none · ref 33 · internal anchor
ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.
Alignment Imprint: Zero-Shot AI-Generated Text Detection via Provable Preference Discrepancy cs.AI · 2026-04-18 · unverdicted · none · ref 7 · internal anchor
LAPD, derived from the provable preference discrepancy in aligned LLMs, improves zero-shot AI text detection by 45.82% over baselines with claimed statistical dominance over Fast-DetectGPT.
Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining cs.CL · 2025-11-26 · unverdicted · none · ref 16 · internal anchor
Fine-grained metadata such as document quality indicators accelerate LLM pretraining when prepended, and metadata appending plus learnable meta-tokens recover additional speedup via auxiliary tasks and latent structure.
Quantifying Geospatial in the Common Crawl Corpus cs.CL · 2024-06-07 · unverdicted · none · ref 26 · internal anchor
Analysis estimates 18.7% of Common Crawl documents contain geospatial information like coordinates and addresses, with little difference by language.
Hallucination of Multimodal Large Language Models: A Survey cs.CV · 2024-04-29 · accept · none · ref 133 · internal anchor
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
InternLM2 Technical Report cs.CL · 2024-03-26 · unverdicted · none · ref 209 · internal anchor
InternLM2 is a new open-source LLM that outperforms prior versions on 30 benchmarks and long-context tasks through scaled pre-training to 32k tokens and a conditional online RLHF alignment strategy.
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks cs.CV · 2023-12-21 · unverdicted · none · ref 115 · internal anchor
InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models cs.CV · 2023-11-13 · unverdicted · none · ref 25 · internal anchor
SPHINX improves multi-modal LLMs through joint mixing of weights, tasks, and visual embeddings from varied sources to achieve stronger alignment and multi-purpose capabilities.
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions cs.CL · 2023-11-09 · unverdicted · none · ref 254 · internal anchor
The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning cs.CV · 2023-10-14 · unverdicted · none · ref 35 · internal anchor
MiniGPT-v2 adds unique task identifiers to a large language model so one system can perform image description, visual question answering, and visual grounding after three-stage training.
Reflections and New Directions for Human-Centered Large Language Models cs.CL · 2026-05-07 · unverdicted · none · ref 23 · internal anchor
Model developers must address human concerns, preferences, values, and goals with rigor at every stage of the LLM pipeline rather than only in post-training.
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism cs.CL · 2024-01-05 · unverdicted · none · ref 152 · internal anchor
DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
Baichuan 2: Open Large-scale Language Models cs.CL · 2023-09-19 · unverdicted · none · ref 50 · internal anchor
Baichuan 2 presents 7B and 13B LLMs trained on 2.6T tokens that match or exceed similar open models on MMLU, CMMLU, GSM8K, HumanEval and excel in medicine and law.
Towards EnergyGPT: A Large Language Model Specialized for the Energy Sector cs.CL · 2025-09-08 · unverdicted · none · ref 43 · internal anchor
Fine-tuned LLaMA 3.1-8B variants for the energy sector outperform the base model on domain QA benchmarks, with LoRA delivering similar gains at lower training cost.
Large Language Models: A Survey cs.CL · 2024-02-09 · accept · none · ref 124 · internal anchor
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
A Survey of Large Language Models cs.CL · 2023-03-31 · accept · none · ref 173 · internal anchor
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
Will LLMs Scaling Hit the Wall? Breaking Barriers via Distributed Resources on Massive Edge Devices cs.DC · 2025-03-11 · unverdicted · none · ref 156 · internal anchor
Position paper claiming that distributed training across massive edge devices can overcome data depletion and centralized compute monopolies in LLM scaling.

The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer