citation dossier
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing
why this work matters in Pith
Pith has found this work cited in 16 reviewed papers. Its strongest current cluster is cs.CL (5 papers). The largest review-status bucket among citing papers is UNVERDICTED (14 papers). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.
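As a rough illustration of that rendering policy, here is a minimal sketch of a bounded explorer in Python. Everything in it (the CitingPaper shape, the page bound of 16, the render_dossier helper) is a hypothetical assumption for illustration; Pith's actual implementation is not shown on this page.

```python
# Hypothetical sketch only: illustrates "dossier first, bounded explorer
# second". Names, fields, and the page bound are assumptions, not Pith's API.
from dataclasses import dataclass

@dataclass
class CitingPaper:
    title: str
    summary: str
    review_status: str  # e.g. "UNVERDICTED"

PAGE_BOUND = 16  # never render more citing papers than this at once

def render_dossier(papers: list[CitingPaper], n_representative: int = 14) -> str:
    """Representative summaries first, then one bounded explorer page."""
    lines = ["representative citing papers"]
    lines += [f"- {p.summary}" for p in papers[:n_representative]]
    lines.append("citing papers explorer")
    for p in papers[:PAGE_BOUND]:  # one bounded page, not the full citation graph
        lines += [f"- {p.title}", p.summary]
    return "\n".join(lines)
```

The only load-bearing idea is the slice: however many papers cite a work, the explorer renders at most one bounded page of them.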
representative citing papers
- A recurrent-depth architecture lets language models improve reasoning by iterating computation in latent space, matching the benchmark performance of much larger models.
- PARD-2 uses Confidence-Adaptive Token optimization to align draft model training with acceptance length in speculative decoding, enabling dual-mode operation and up to 6.94x lossless speedup on Llama3.1-8B (a generic speculative-decoding sketch follows this list).
- LLMs exhibit Bayesian-like hypothesis updating with a strong-sampling bias and an evaluation-generation gap, but generalize poorly outside observed data.
- ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
- TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.
- Domain-adapted clinical LLMs provide only marginal and unstable gains over general models on English clinical MCQA benchmarks, while the new Spanish Marmoka models perform better.
- SmolVLM-256M outperforms a model 300 times its size while using under 1 GB of GPU memory, and the 2.2B version matches state-of-the-art VLMs at half the memory cost.
- InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
- LoPT achieves competitive task performance in LLM post-training by limiting task gradients to the upper model half and training the lower half with local feature reconstruction.
- Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.
- SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
- DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding, and grounding with 1.0B to 4.5B activated parameters.
- LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
- VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
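For context on the mechanism the PARD-2 entry names, the following is a minimal sketch of generic speculative decoding (draft-then-verify) in Python. It assumes Hugging-Face-style causal LMs that expose .logits, uses plain greedy acceptance, and does not reproduce PARD-2's Confidence-Adaptive Token optimization; the function name and the draft length k are illustrative.

```python
# Generic speculative decoding (draft-then-verify), the mechanism PARD-2
# builds on. Toy sketch with greedy acceptance; not PARD-2's actual method.
import torch

@torch.no_grad()
def speculative_step(draft, target, ids, k=4):
    """Propose k tokens with the cheap draft model, then keep the longest
    prefix the large target model agrees with."""
    proposal = ids
    for _ in range(k):  # cheap sequential drafting
        nxt = draft(proposal).logits[:, -1].argmax(-1, keepdim=True)
        proposal = torch.cat([proposal, nxt], dim=-1)
    # one parallel target pass scores all k drafted positions at once
    tgt_pred = target(proposal).logits[:, ids.shape[1] - 1:-1].argmax(-1)
    drafted = proposal[:, ids.shape[1]:]
    agree = (tgt_pred == drafted).long().cumprod(-1)  # 1s up to first mismatch
    n_accept = int(agree.sum())  # the "acceptance length"
    # (a full implementation also emits the target's own token at the first
    # mismatch, so every step makes progress)
    return torch.cat([ids, drafted[:, :n_accept]], dim=-1), n_accept
```

The returned acceptance length is the quantity the PARD-2 summary refers to: the longer the accepted prefix, the fewer full target-model passes are needed per generated token.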
citing papers explorer
- Large Language Diffusion Models
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs such as LLaMA3 8B across standard benchmarks and surpasses GPT-4o on a reversal poem-completion task (a generic sketch of masked-diffusion sampling follows this list).
- Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
A recurrent-depth architecture lets language models improve reasoning by iterating computation in latent space, matching the benchmark performance of much larger models.
- PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding
PARD-2 uses Confidence-Adaptive Token optimization to align draft model training with acceptance length in speculative decoding, enabling dual-mode operation and up to 6.94x lossless speedup on Llama3.1-8B.
- Hypothesis generation and updating in large language models
LLMs exhibit Bayesian-like hypothesis updating with a strong-sampling bias and an evaluation-generation gap, but generalize poorly outside observed data.
- ZAYA1-8B Technical Report
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
- TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning
TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.
- To Adapt or Not to Adapt, Rethinking the Value of Medical Knowledge-Aware Large Language Models
Domain-adapted clinical LLMs provide only marginal and unstable gains over general models on English clinical MCQA benchmarks, while new Spanish Marmoka models perform better.
- SmolVLM: Redefining small and efficient multimodal models
SmolVLM-256M outperforms a model 300 times its size while using under 1 GB of GPU memory, and the 2.2B version matches state-of-the-art VLMs at half the memory cost.
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
- Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training
LoPT achieves competitive task performance in LLM post-training by limiting task gradients to the upper model half and training the lower half with local feature reconstruction.
- Kimi-Audio Technical Report
Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.
- SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
- DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding, and grounding with 1.0B to 4.5B activated parameters.
- LLaVA-OneVision: Easy Visual Task Transfer
LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
- VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
- Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
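As promised in the LLaDA entry above, here is a minimal, generic sketch of masked-diffusion sampling in Python. It assumes a bidirectional masked predictor with a Hugging-Face-style .logits output and a known mask_id; the linear unmasking schedule and confidence-based remasking are illustrative assumptions, not LLaDA's released implementation.

```python
# Generic masked-diffusion text generation in the spirit of LLaDA: start from
# an all-mask sequence, predict every position in parallel, and progressively
# commit the most confident predictions. Illustrative only, not LLaDA's code.
import torch

@torch.no_grad()
def masked_diffusion_sample(model, length, mask_id, steps=8):
    ids = torch.full((1, length), mask_id)              # fully masked start
    for step in range(steps):
        probs = model(ids).logits.softmax(-1)           # predict all positions at once
        conf, pred = probs.max(-1)                      # per-position confidence
        conf = conf.masked_fill(ids.ne(mask_id), -1.0)  # only fill masked slots
        # commit a growing share of positions; the rest stay masked and are
        # re-predicted on the next iteration
        quota = length * (step + 1) // steps - length * step // steps
        top = conf.topk(quota, dim=-1).indices
        ids = ids.scatter(1, top, pred.gather(1, top))
    return ids  # no masked positions remain after the final step
```

Because every position is predicted in parallel rather than left to right, a sampler like this has no inherent generation order, which is the property exercised by the reversal poem-completion comparison against GPT-4o.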