hub Canonical reference

Language models are unsupervised multitask learners.OpenAI blog, 1(8):9

Tian Chen · 2019 · DOI 10.2139/ssrn.5001098

Canonical reference. 80% of citing Pith papers cite this work as background.

40 Pith papers citing it

Background 80% of classified citations

open at publisher browse 40 citing papers

hub tools

JSON dossier citing papers JSON publisher DOI

citation-role summary

background 8 baseline 1 method 1

citation-polarity summary

background 8 baseline 2

representative citing papers

Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

Contrastive Distribution Matching for Amortized Sequential Monte Carlo in Discrete Diffusion

cs.LG · 2026-05-22 · unverdicted · novelty 7.0

CDM amortizes SMC inference for reward-tilted discrete diffusion by training a parameterized twist function on contrastive samples with closed-form kernels.

Drifting Objectives for Refining Discrete Diffusion Language Models

cs.CL · 2026-05-19 · unverdicted · novelty 7.0

TokenDrift refines discrete diffusion language models by applying anti-symmetric drifting to soft-token features during training, yielding large reductions in generation perplexity at low NFEs.

CASPIAN: Online Detection and Attribution of Cascade Attacks in LLM Multi-Agent Systems via Cross-Channel Causal Monitoring

cs.MA · 2026-05-19 · unverdicted · novelty 7.0

CASPIAN introduces unified cross-channel causal monitoring via late-interaction conditional transfer entropy to detect cascade onset and attribute origin, bridge, and amplifier agents in LLM multi-agent systems.

Token by Token, Compromised: Backdoor Vulnerabilities in Unified Autoregressive Models

cs.CR · 2026-05-19 · conditional · novelty 7.0

ToBAC is the first backdoor attack on unified autoregressive models, using data or model poisoning to make triggers elicit cross-modal malicious behavior in text and image generation.

Dynamic Chunking for Diffusion Language Models

cs.CL · 2026-05-15 · unverdicted · novelty 7.0

DCDM replaces positional blocks with learnable semantic chunks via differentiable Chunking Attention, yielding consistent gains over block and unstructured diffusion baselines up to 1.5B parameters.

Sampling from Flow Language Models via Marginal-Conditioned Bridges

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

Marginal-conditioned bridges enable training-free sampling from Flow Language Models by drawing clean one-hot endpoints from factorized posteriors and using Ornstein-Uhlenbeck bridges, preserving token marginals and reducing denoising error versus conditional-mean bridges.

The Expressivity Boundary of Probabilistic Circuits: A Comparison with Large Language Models

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

Probabilistic circuits have an output bottleneck with convex probability combinations and a context bottleneck limited to fixed vtree-aligned partitions, making them less expressive than transformers for language data with heterogeneous dependencies, though decomposable PCs are strictly more capable

The Alien Space of Science: Sampling Coherent but Cognitively Unavailable Research Directions

cs.AI · 2026-03-01 · conditional · novelty 7.0

A framework decomposes LLM papers into idea atoms, trains coherence and availability models over the resulting vocabulary, and samples atom combinations that are coherent yet unlikely under existing author communities.

Teaching Language Models Mechanistic Explainability Through MechSMILES

cs.LG · 2025-12-05 · unverdicted · novelty 7.0

MechSMILES lets language models predict complete reaction mechanisms with 93% pathway retrieval on key benchmarks and adapt to new reaction classes from as few as 40 examples.

ChunkFT: Byte-Streamed Optimization for Memory-Efficient Full Fine-Tuning

cs.LG · 2026-05-20 · conditional · novelty 6.0

ChunkFT enables full-parameter fine-tuning of Llama 3-8B on one 24 GB GPU and Llama 3-70B on two 80 GB GPUs by streaming gradients over dynamically activated sub-tensors.

DEL: Digit Entropy Loss for Numerical Learning of Large Language Models

cs.CL · 2026-05-19 · conditional · novelty 6.0

DEL is a new loss for LLM numerical learning that applies supervised digit entropy optimization and extends to floating-point numbers, showing improved accuracy and distance metrics over prior methods on math benchmarks.

Continuous Diffusion Scales Competitively with Discrete Diffusion for Language

cs.CL · 2026-05-18 · conditional · novelty 6.0

RePlaid achieves a 20x compute gap to autoregressive models, new SOTA PPL of 22.1 among continuous DLMs on OpenWebText, and competitive scaling laws by aligning architecture with modern discrete DLMs.

AOT-POT: Adaptive Operator Transformation for Large-Scale PDE Pre-training

cs.LG · 2026-05-15 · unverdicted · novelty 6.0

AOT-POT adaptively reshapes complex PDE solution operators via input-dependent transformations and parallel stream mixing to enable effective large-scale pre-training, yielding SOTA results on 12 benchmarks with minimal added parameters.

SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning

cs.SD · 2026-05-14 · unverdicted · novelty 6.0

SpeakerLLM unifies speaker profiling, recording-condition understanding, and structured verification reasoning in an audio-LLM via a hierarchical tokenizer and decision traces.

MILM: Large Language Models for Multimodal Irregular Time Series with Informative Sampling

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

MILM fine-tunes LLMs on XML-encoded multimodal irregular time series via a two-stage process that exploits informative sampling patterns to achieve top performance on EHR classification datasets.

Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

LRD framework with Frenet, NRS, and GFMI metrics shows layer-wise structure in 31 models provides usable signal for model selection and pruning on MTEB tasks.

SOMA: Efficient Multi-turn LLM Serving via Small Language Model

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

SOMA estimates a local response manifold from early turns and adapts a small surrogate model via divergence-maximizing prompts and localized LoRA fine-tuning for efficient multi-turn serving.

AdaPreLoRA: Adafactor Preconditioned Low-Rank Adaptation

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

AdaPreLoRA pairs the Adafactor diagonal Kronecker preconditioner on the full weight matrix with a closed-form factor-space solve that selects the update minimizing an H_t-weighted imbalance, yielding competitive results on GPT-2, Mistral-7B, Qwen2-7B and diffusion personalization tasks.

Three-in-One World Model: Energy-Based Consistency, Prediction, and Counterfactual Inference for Marketing Intervention

cs.AI · 2026-05-08 · unverdicted · novelty 6.0

A DBM-based architecture learns consumer beliefs to enable consistent prediction and counterfactual inference for marketing interventions, outperforming baselines on heterogeneous treatment effects in simulation.

UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.

Continuous Latent Diffusion Language Model

cs.CL · 2026-05-07 · unverdicted · novelty 6.0

Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing latent prior modeling as an alternative to token-level autoregressive language model

Beyond Autoregressive RTG: Conditioning via Injection Outside Sequential Modeling in Decision Transformer

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Injecting RTG into states outside the autoregressive sequence yields shorter, more efficient Decision Transformers that outperform the original on offline RL tasks.

citing papers explorer

Showing 40 of 40 citing papers.

Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models cs.LG · 2026-05-08 · unverdicted · none · ref 27
Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.
Large Language Diffusion Models cs.CL · 2025-02-14 · unverdicted · none · ref 3
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
Contrastive Distribution Matching for Amortized Sequential Monte Carlo in Discrete Diffusion cs.LG · 2026-05-22 · unverdicted · none · ref 62
CDM amortizes SMC inference for reward-tilted discrete diffusion by training a parameterized twist function on contrastive samples with closed-form kernels.
Drifting Objectives for Refining Discrete Diffusion Language Models cs.CL · 2026-05-19 · unverdicted · none · ref 15
TokenDrift refines discrete diffusion language models by applying anti-symmetric drifting to soft-token features during training, yielding large reductions in generation perplexity at low NFEs.
CASPIAN: Online Detection and Attribution of Cascade Attacks in LLM Multi-Agent Systems via Cross-Channel Causal Monitoring cs.MA · 2026-05-19 · unverdicted · none · ref 24
CASPIAN introduces unified cross-channel causal monitoring via late-interaction conditional transfer entropy to detect cascade onset and attribute origin, bridge, and amplifier agents in LLM multi-agent systems.
Token by Token, Compromised: Backdoor Vulnerabilities in Unified Autoregressive Models cs.CR · 2026-05-19 · conditional · none · ref 50
ToBAC is the first backdoor attack on unified autoregressive models, using data or model poisoning to make triggers elicit cross-modal malicious behavior in text and image generation.
Dynamic Chunking for Diffusion Language Models cs.CL · 2026-05-15 · unverdicted · none · ref 34
DCDM replaces positional blocks with learnable semantic chunks via differentiable Chunking Attention, yielding consistent gains over block and unstructured diffusion baselines up to 1.5B parameters.
Sampling from Flow Language Models via Marginal-Conditioned Bridges cs.LG · 2026-05-13 · unverdicted · none · ref 20
Marginal-conditioned bridges enable training-free sampling from Flow Language Models by drawing clean one-hot endpoints from factorized posteriors and using Ornstein-Uhlenbeck bridges, preserving token marginals and reducing denoising error versus conditional-mean bridges.
The Expressivity Boundary of Probabilistic Circuits: A Comparison with Large Language Models cs.LG · 2026-05-13 · unverdicted · none · ref 34
Probabilistic circuits have an output bottleneck with convex probability combinations and a context bottleneck limited to fixed vtree-aligned partitions, making them less expressive than transformers for language data with heterogeneous dependencies, though decomposable PCs are strictly more capable
The Alien Space of Science: Sampling Coherent but Cognitively Unavailable Research Directions cs.AI · 2026-03-01 · conditional · none · ref 8
A framework decomposes LLM papers into idea atoms, trains coherence and availability models over the resulting vocabulary, and samples atom combinations that are coherent yet unlikely under existing author communities.
Teaching Language Models Mechanistic Explainability Through MechSMILES cs.LG · 2025-12-05 · unverdicted · none · ref 31
MechSMILES lets language models predict complete reaction mechanisms with 93% pathway retrieval on key benchmarks and adapt to new reaction classes from as few as 40 examples.
ChunkFT: Byte-Streamed Optimization for Memory-Efficient Full Fine-Tuning cs.LG · 2026-05-20 · conditional · none · ref 3
ChunkFT enables full-parameter fine-tuning of Llama 3-8B on one 24 GB GPU and Llama 3-70B on two 80 GB GPUs by streaming gradients over dynamically activated sub-tensors.
DEL: Digit Entropy Loss for Numerical Learning of Large Language Models cs.CL · 2026-05-19 · conditional · none · ref 39
DEL is a new loss for LLM numerical learning that applies supervised digit entropy optimization and extends to floating-point numbers, showing improved accuracy and distance metrics over prior methods on math benchmarks.
Continuous Diffusion Scales Competitively with Discrete Diffusion for Language cs.CL · 2026-05-18 · conditional · none · ref 51
RePlaid achieves a 20x compute gap to autoregressive models, new SOTA PPL of 22.1 among continuous DLMs on OpenWebText, and competitive scaling laws by aligning architecture with modern discrete DLMs.
AOT-POT: Adaptive Operator Transformation for Large-Scale PDE Pre-training cs.LG · 2026-05-15 · unverdicted · none · ref 30
AOT-POT adaptively reshapes complex PDE solution operators via input-dependent transformations and parallel stream mixing to enable effective large-scale pre-training, yielding SOTA results on 12 benchmarks with minimal added parameters.
SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning cs.SD · 2026-05-14 · unverdicted · none · ref 40
SpeakerLLM unifies speaker profiling, recording-condition understanding, and structured verification reasoning in an audio-LLM via a hierarchical tokenizer and decision traces.
MILM: Large Language Models for Multimodal Irregular Time Series with Informative Sampling cs.LG · 2026-05-13 · unverdicted · none · ref 76
MILM fine-tunes LLMs on XML-encoded multimodal irregular time series via a two-stage process that exploits informative sampling patterns to achieve top performance on EHR classification datasets.
Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs cs.LG · 2026-05-12 · unverdicted · none · ref 53
LRD framework with Frenet, NRS, and GFMI metrics shows layer-wise structure in 31 models provides usable signal for model selection and pruning on MTEB tasks.
SOMA: Efficient Multi-turn LLM Serving via Small Language Model cs.CL · 2026-05-11 · unverdicted · none · ref 39
SOMA estimates a local response manifold from early turns and adapts a small surrogate model via divergence-maximizing prompts and localized LoRA fine-tuning for efficient multi-turn serving.
AdaPreLoRA: Adafactor Preconditioned Low-Rank Adaptation cs.LG · 2026-05-09 · unverdicted · none · ref 26
AdaPreLoRA pairs the Adafactor diagonal Kronecker preconditioner on the full weight matrix with a closed-form factor-space solve that selects the update minimizing an H_t-weighted imbalance, yielding competitive results on GPT-2, Mistral-7B, Qwen2-7B and diffusion personalization tasks.
Three-in-One World Model: Energy-Based Consistency, Prediction, and Counterfactual Inference for Marketing Intervention cs.AI · 2026-05-08 · unverdicted · none · ref 3
A DBM-based architecture learns consumer beliefs to enable consistent prediction and counterfactual inference for marketing interventions, outperforming baselines on heterogeneous treatment effects in simulation.
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts cs.LG · 2026-05-07 · unverdicted · none · ref 37
A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
Continuous Latent Diffusion Language Model cs.CL · 2026-05-07 · unverdicted · none · ref 77
Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing latent prior modeling as an alternative to token-level autoregressive language model
Beyond Autoregressive RTG: Conditioning via Injection Outside Sequential Modeling in Decision Transformer cs.LG · 2026-05-07 · unverdicted · none · ref 35
Injecting RTG into states outside the autoregressive sequence yields shorter, more efficient Decision Transformers that outperform the original on offline RL tasks.
Understanding In-Context Learning for Nonlinear Regression with Transformers: Attention as Featurizer cs.LG · 2026-05-06 · unverdicted · none · ref 26
Transformers can be built to act as nonlinear featurizers via attention, supporting in-context regression with proven generalization bounds on synthetic tasks.
Intern-Atlas: A Methodological Evolution Graph as Research Infrastructure for AI Scientists cs.AI · 2026-04-30 · unverdicted · none · ref 13
Intern-Atlas constructs a methodological evolution graph with 9.4 million edges from 1.03 million AI papers to capture how methods emerge, adapt, and transition, enabling better idea evaluation and generation for AI-driven research.
UniBCI: Towards a Unified Pretrained Model for Invasive Brain-Computer Interfaces cs.NE · 2026-04-30 · unverdicted · none · ref 41
UniBCI is a unified pretrained model for invasive neural spike data that uses CST tokenization, IAA attention, and self-supervised masked reconstruction to achieve SOTA downstream performance with better generalization and efficiency.
Spectral Condition for $\mu$P under Width-Depth Scaling cs.LG · 2026-02-28 · unverdicted · none · ref 32
A unified spectral condition for μP under width-depth scaling reveals a transition at k=1 vs k≥2 transformations per residual block and enables stable feature learning for practical architectures like Transformers.
Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM cs.CV · 2025-05-23 · unverdicted · none · ref 55
Slot-MLLM introduces a slot-attention-based object-centric visual tokenizer with Q-Former encoder, diffusion decoder, and residual vector quantization for improved local visual comprehension and generation in multimodal LLMs.
Superposition Yields Robust Neural Scaling cs.LG · 2025-05-15 · conditional · none · ref 40
Strong superposition causes neural loss to scale as the inverse of model dimension due to geometric feature overlaps, explaining scaling laws for broad frequency distributions.
Fast Tensorization of Neural Networks via Slice-wise Feature Distillation cs.LG · 2026-05-19 · unverdicted · none · ref 39
A slice-wise feature distillation framework for independent tensorization of neural network slices to achieve scalable compression with reduced fine-tuning costs.
Prune, Update and Trim: Robust Structured Pruning for Large Language Models cs.LG · 2026-05-18 · unverdicted · none · ref 2
Putri is a structured pruning technique for LLMs that compensates for pruning errors via weight updates and sequential processing while pruning at the attention-head level to reach state-of-the-art results at extreme sparsity.
Colinearity Decay: Training Quantization-Friendly ViTs with Outlier Decay cs.CV · 2026-05-02 · unverdicted · none · ref 32
Colinearity-Decay regularizer trains ViTs that maintain or improve full-precision accuracy while delivering higher accuracy after low-bit quantization on ImageNet and COCO tasks.
Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks cs.CV · 2026-04-02 · unverdicted · none · ref 44
Random label bridge training aligns LLM parameters with vision tasks, and partial training of certain layers often suffices due to their foundational properties.
Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models cs.CL · 2026-01-20 · unverdicted · none · ref 249
The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective cs.RO · 2025-07-02 · unverdicted · none · ref 49
The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.
InfiGFusion: Graph-on-Logits Distillation via Efficient Gromov-Wasserstein for Model Fusion cs.CL · 2025-05-20 · unverdicted · none · ref 66
InfiGFusion introduces graph-on-logits distillation with an O(n log n) Gromov-Wasserstein approximation to fuse LLMs by modeling token co-activations, reporting gains over baselines on 11 benchmarks.
Operator Learning for Reconstructing Flow Fields from Sparse Measurements: a Language Model Approach cs.CE · 2026-05-22 · unverdicted · none · ref 45
A language model-based operator learning method reconstructs flow fields from under 10% sparse measurements on vortex street, US temperature, blood flow, and turbulent jet benchmarks with competitive accuracy.
Prompt-Driven Code Summarization: A Systematic Literature Review cs.SE · 2026-04-16 · unverdicted · none · ref 21
A systematic review that categorizes prompting strategies for LLM-based code summarization, assesses their effectiveness, and identifies gaps in research and evaluation practices.
Shaping Schema via Language Representation as the Next Frontier for LLM Intelligence Expanding cs.AI · 2026-05-10 · unverdicted · none · ref 2
Advanced language representations shape LLMs' schemas to improve knowledge activation and problem-solving.

Language models are unsupervised multitask learners.OpenAI blog, 1(8):9

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer