hub Mixed citations

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

· 2024 · cs.CL · arXiv 2406.17557

Mixed citation behavior. Most common role is background (67%).

69 Pith papers citing it

Background 67% of classified citations

open full Pith review browse 69 citing papers arXiv PDF

abstract

The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3 and Mixtral are not publicly available and very little is known about how they were created. In this work, we introduce FineWeb, a 15-trillion token dataset derived from 96 Common Crawl snapshots that produces better-performing LLMs than other open pretraining datasets. To advance the understanding of how best to curate high-quality pretraining datasets, we carefully document and ablate all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies. In addition, we introduce FineWeb-Edu, a 1.3-trillion token collection of educational text filtered from FineWeb. LLMs pretrained on FineWeb-Edu exhibit dramatically better performance on knowledge- and reasoning-intensive benchmarks like MMLU and ARC. Along with our datasets, we publicly release our data curation codebase and all of the models trained during our ablation experiments.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 8 dataset 3 method 1

citation-polarity summary

background 8 use dataset 3 use method 1

representative citing papers

CARVE: Content-Aware Recurrent with Value Efficiency for Chunk-Parallel Linear Attention

cs.CL · 2026-06-25 · conditional · novelty 7.0

CARVE introduces key-axis content-aware gating and value-efficient scalar writes in recurrent linear attention, outperforming GDN-2 on perplexity and retrieval tasks while cutting parameters and memory.

LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models

cs.LG · 2026-05-17 · unverdicted · novelty 7.0

LLMForge is a NAS framework with Infinite-Head Attention, a Forge-Former surrogate, and Forge-DSE engine that discovers hardware-specific architectures for edge language models, yielding variants with improved accuracy, energy, or latency on different substrates.

Delta Attention Residuals

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

Delta Attention Residuals attend over per-sublayer deltas instead of cumulative hidden states, producing higher-contrast attention weights and 1.7-8.2% validation perplexity gains over standard and attention residuals across 220M-7.6B models.

A Causal Language Modeling Detour Improves Encoder Continued Pretraining

cs.CL · 2026-05-12 · conditional · novelty 7.0

A temporary CLM phase followed by MLM decay during encoder continued pretraining outperforms standard MLM on biomedical tasks by 0.3-2.8pp across languages and model sizes.

Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).

HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.

BCJR-QAT: A Differentiable Relaxation of Trellis-Coded Weight Quantization

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

BCJR-QAT makes trellis quantization differentiable via BCJR soft decoding at finite temperature, allowing QAT to improve 2-bit LLM perplexity over PTQ with a fused GPU kernel and a drift-budget escape condition.

Layer Collapse in Diffusion Language Models

cs.LG · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

Diffusion language models develop early-layer collapse around an indispensable super-outlier due to overtraining, resulting in higher compressibility and reversed optimal sparsity patterns versus autoregressive models.

Projection-Free Transformers via Gaussian Kernel Attention

cs.LG · 2026-05-04 · unverdicted · novelty 7.0

Gaussian Kernel Attention replaces learned QKV projections with a Gaussian RBF kernel on per-head token features, using 0.42x parameters and 0.49x FLOPs while showing competitive language modeling performance at depth 20.

Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings

q-bio.QM · 2026-04-09 · unverdicted · novelty 7.0

Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and showing strong masked language modeling results with or without positional embeddings.

Drift and selection in LLM text ecosystems

cs.CL · 2026-03-15 · unverdicted · novelty 7.0

Recursive LLM text generation drives public corpora toward shallow equilibria via drift unless normative selection for quality sustains deeper structure with a bounded divergence.

Agentic Search in the Wild: Intents and Trajectory Dynamics from 14M+ Real Search Requests

cs.IR · 2026-01-24 · unverdicted · novelty 7.0

Large-scale log study of 14M+ agentic searches finds short sessions, intent-specific repetition patterns, and that 54% of new query terms trace to prior retrieved evidence.

Hierarchical Global Attention (HGA)

cs.LG · 2026-06-29 · unverdicted · novelty 6.0

HGA uses RoPE-aware chunk summaries for two-level hierarchical routing to approximate dense causal attention at 3% sparsity with 0.01-0.02 nats quality gap, as a drop-in replacement requiring no retraining.

Data and Evaluation Closed-Loop for Model Capability Enhancement

cs.AI · 2026-06-26 · unverdicted · novelty 6.0

Proposes capability slices with dual taxonomies and mapping rules to form a closed loop converting benchmark failures into targeted data interventions, validated via two opposing case studies on BBH and math reasoning.

Aurora: A Leverage-Aware Spectral Optimizer

cs.LG · 2026-06-26 · unverdicted · novelty 6.0

Aurora is a leverage-aware spectral optimizer that enforces uniform row norms in matrix updates while preserving Muon's polar geometry, outperforming Muon and achieving SOTA among spectral methods on modded-nanoGPT.

Quantifying the Agreement Between Data-Influence and Data-Similarity to Understand LLM Behavior

cs.LG · 2026-06-22 · unverdicted · novelty 6.0

Data-similarity and data-influence produce significantly overlapping rankings of training documents for LLM outputs, with asymmetry allowing a favorable cost-accuracy trade-off.

LC-QAT: Data-Efficient 2-Bit QAT for LLMs via Linear-Constrained Vector Quantization

cs.CL · 2026-06-09 · unverdicted · novelty 6.0 · 2 refs

LC-QAT achieves data-efficient 2-bit weight-only QAT for LLMs by representing quantized weights as a learned affine transform over discrete vectors, supporting end-to-end optimization from a high-quality PTQ start.

Unlocking Latent Value: Taxonomy-Guided Recovery of High-Performing Data from Low-Tier Web Corpora

cs.CL · 2026-06-05 · unverdicted · novelty 6.0

A multi-dimensional taxonomy filtering approach recovers high-performing data from deprioritized web corpora, with filtered low-tier subsets outperforming unfiltered top-tier data on reasoning and coding benchmarks.

dMX: Differentiable Mixed-Precision Assignment for Low-Precision Floating-Point Formats

cs.LG · 2026-06-02 · unverdicted · novelty 6.0

dMX is a differentiable mixed-precision framework that learns per-layer MXFP bit-width assignments for LLMs and outperforms KL-based heuristics on perplexity and zero-shot accuracy under bit-width budgets.

Harmonic: Hierarchical State Space Models for Efficient Long-Context Language Modeling

cs.CL · 2026-05-30 · unverdicted · novelty 6.0

Hierarchical SSM architecture Harmonic outperforms Transformers and Mamba on long-context language modeling up to 64K tokens and removes RoPE limits at 1B scale while maintaining O(L) compute.

Contribution Weights: A Geometrical Analysis of Self-Attention Transformers

cs.LG · 2026-05-29 · unverdicted · novelty 6.0 · 2 refs

Contribution Weights combine attention, value magnitude, and directional alignment to measure token influence more faithfully than attention alone, and show attention sinks actively suppress information via a convex sink-rate to output-norm relationship.

Lngram: N-gram Conditional Memory in Latent Space

cs.CL · 2026-05-24 · unverdicted · novelty 6.0

Lngram is a latent N-gram conditional memory module that learns discrete symbols from hidden states for N-gram lookup, outperforming baselines in language modeling and multimodal tasks.

Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

cs.AI · 2026-05-21 · unverdicted · novelty 6.0

Gated DeltaNet-2 decouples channel-wise erase and write gates in linear attention, generalizing prior DeltaNet and KDA models while showing stronger results on language modeling and long-context retrieval at 1.3B scale.

Systematic Evaluation of the Quality of Synthetic Clinical Notes Rephrased by LLMs at Million-Note Scale

cs.CL · 2026-05-18 · unverdicted · novelty 6.0

LLM-rephrased synthetic clinical notes preserve core information and utility for coarse prediction tasks but lose fine-grained details such as ICD codes, with chunk-wise rephrasing as a partial mitigation that trades off factual accuracy.

citing papers explorer

Showing 1 of 1 citing paper after filters.

Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings q-bio.QM · 2026-04-09 · unverdicted · none · ref 47 · internal anchor
Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and showing strong masked language modeling results with or without positional embeddings.

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer