super hub Mixed citations

s1: Simple test-time scaling

Hannaneh Hajishirzi, Li Fei-Fei, Niklas Muennighoff, Weijia Shi, Xiang Lisa Li, Zitong Yang · 2025 · cs.CL · arXiv 2501.19393

Mixed citation behavior. Most common role is background (67%).

104 Pith papers citing it

Background 67% of classified citations

open full Pith review browse 104 citing papers more from Hannaneh Hajishirzi arXiv PDF

abstract

Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 12 baseline 2 method 1

citation-polarity summary

background 10 baseline 2 unclear 2 use method 1

claims ledger

abstract Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcef

authors

Hannaneh Hajishirzi Li Fei-Fei Niklas Muennighoff Weijia Shi Xiang Lisa Li Zitong Yang

co-cited works

representative citing papers

RLCracker: Evaluating the Worst-Case Vulnerability of LLM Watermarks with Adaptive RL Attacks

cs.CR · 2025-09-25 · conditional · novelty 8.0

RLCracker is a reinforcement learning attack that erases LLM watermarks at 98.5% success rate with minimal data and generalizes across ten schemes and multiple model sizes.

The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

Shared token budgets between visible chain-of-thought and answers create a coupling tax that makes non-thinking competitive on math benchmarks, with a truncation decomposition predicting the crossover and split budgets improving results.

GIANTS: Generative Insight Anticipation from Scientific Literature

cs.CL · 2026-04-10 · unverdicted · novelty 8.0

GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.

KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty

cs.CL · 2026-06-09 · unverdicted · novelty 7.0

KCSAT-ML benchmark supplies human error rates for math problems and DRG metric exposes that model accuracy collapses on high-human-error items while test-time scaling shows non-monotonic gains and alignment failures.

PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media

cs.CL · 2026-05-16 · unverdicted · novelty 7.0

PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.

Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation

cs.LG · 2026-05-16 · unverdicted · novelty 7.0

Decoupling prefix source from token-level KL direction in autoregressive sequence KL yields four objectives unifying SFT, DAgger, offline RL and OPD, with KL mixing and entropy-gated curriculum improving math reasoning accuracy and shortening responses.

DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.

POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference

cs.SE · 2026-05-05 · unverdicted · novelty 7.0

POSTCONDBENCH is a new multilingual benchmark that evaluates LLM postcondition generation on real code using defect discrimination to assess completeness beyond surface matching.

How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

cs.LG · 2026-04-28 · unverdicted · novelty 7.0 · 2 refs

A single-parameter Tsallis loss continuum unifies SFT and RLVR, derives time-to-escape bounds for cold start, and yields GARL and PAFT estimators that improve performance on QA reasoning tasks.

Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning

cs.LG · 2026-04-24 · unverdicted · novelty 7.0

A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and commonsense QA benchmarks.

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

cs.LG · 2026-01-26 · unverdicted · novelty 7.0

A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better token efficiency.

Reliable LLM-Based Edge-Cloud-Expert Cascades for Telecom Knowledge Systems

eess.SP · 2025-12-23 · accept · novelty 7.0

An edge-cloud-expert LLM cascade for telecom knowledge systems minimizes processing cost subject to misalignment-risk bounds via multiple hypothesis testing on knowledge and confidence scores.

PIAST: Rapid Prompting with In-context Augmentation for Scarce Training data

cs.CL · 2025-12-11 · conditional · novelty 7.0

PIAST iteratively optimizes few-shot examples in prompts via Monte Carlo Shapley value estimation, outperforming prior automatic prompting methods and setting new SOTA on classification, simplification, and GSM8K with modest compute.

User-Assistant Bias in LLMs

cs.CL · 2025-08-16 · unverdicted · novelty 7.0

LLMs show strong user bias in role-tagged contexts that is amplified by preference alignment and can be reduced or controlled through targeted fine-tuning and DPO.

Bayesian Social Deduction with Graph-Informed Language Models

cs.AI · 2025-06-21 · unverdicted · novelty 7.0

Hybrid Bayesian-graph LLM agent reaches competitive performance against large models and achieves 67% win rate against humans in controlled Avalon play, outperforming baselines and human teammates.

Learning a Continue-Thinking Token for Enhanced Test-Time Scaling

cs.CL · 2025-06-12 · unverdicted · novelty 7.0

A learned continue-thinking token, trained via RL on its embedding alone, improves math benchmark accuracy more than fixed-token budget forcing in a frozen language model.

L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

cs.CL · 2025-03-06 · unverdicted · novelty 7.0

LCPO trains L1 reasoning models to adhere to prompt-specified CoT lengths, supporting accuracy-compute trade-offs and yielding short reasoning models that outperform larger baselines at matched lengths.

Towards Unconstrained Human-Object Interaction

cs.CV · 2026-04-15 · unverdicted · novelty 7.0

Introduces the U-HOI task and shows MLLMs plus a language-to-graph pipeline can handle human-object interactions without any predefined vocabulary at training or inference time.

AI Achieves a Perfect LSAT Score

cs.AI · 2026-04-11 · unverdicted · novelty 7.0

Language models achieve a perfect LSAT score, with experiments showing that internal thinking phases and a fine-tuned process reward model are key to high performance on logical reasoning questions.

DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

cs.CV · 2025-05-20 · unverdicted · novelty 7.0

DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.

Dynamic Rollout Editing for Reducing Overthinking in RL-Trained Reasoning Models

cs.CL · 2026-06-16 · unverdicted · novelty 6.0

Dynamic Rollout Editing reduces overthinking in RL-trained LLMs by editing post-answer continuations in successful rollouts and preferring the edited versions within GRPO groups.

PathCal: State-Aware Reflection-Marker Calibration for Efficient Reasoning

cs.AI · 2026-05-21 · unverdicted · novelty 6.0

PathCal calibrates reasoning paths by type-aware soft rebalancing of reflection-marker logits at uncertain states, yielding better efficiency-performance trade-offs on six benchmarks.

Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

CES applies conditional bidirectional entropy control on top of DAPO to improve accuracy and shorten responses on mathematical benchmarks for 7B and 1.5B LLMs.

Are Rationales Necessary and Sufficient? Tuning LLMs for Explainable Misinformation Detection

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

LONSREX introduces a metric-based pipeline to identify necessary and sufficient rationales when creating training data for fine-tuning LLMs on explainable misinformation detection, addressing limitations of naive label-based filtering.

citing papers explorer

Showing 50 of 104 citing papers.

RLCracker: Evaluating the Worst-Case Vulnerability of LLM Watermarks with Adaptive RL Attacks cs.CR · 2025-09-25 · conditional · none · ref 25 · internal anchor
RLCracker is a reinforcement learning attack that erases LLM watermarks at 98.5% success rate with minimal data and generalizes across ten schemes and multiple model sizes.
The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits cs.LG · 2026-05-08 · unverdicted · none · ref 9
Shared token budgets between visible chain-of-thought and answers create a coupling tax that makes non-thinking competitive on math benchmarks, with a truncation decomposition predicting the crossover and split budgets improving results.
GIANTS: Generative Insight Anticipation from Scientific Literature cs.CL · 2026-04-10 · unverdicted · none · ref 12
GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.
KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty cs.CL · 2026-06-09 · unverdicted · none · ref 11 · internal anchor
KCSAT-ML benchmark supplies human error rates for math problems and DRG metric exposes that model accuracy collapses on high-human-error items while test-time scaling shows non-monotonic gains and alignment failures.
PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media cs.CL · 2026-05-16 · unverdicted · none · ref 184 · internal anchor
PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.
Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation cs.LG · 2026-05-16 · unverdicted · none · ref 30 · internal anchor
Decoupling prefix source from token-level KL direction in autoregressive sequence KL yields four objectives unifying SFT, DAgger, offline RL and OPD, with KL mixing and entropy-gated curriculum improving math reasoning accuracy and shortening responses.
DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards cs.LG · 2026-05-08 · unverdicted · none · ref 21 · internal anchor
DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.
POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference cs.SE · 2026-05-05 · unverdicted · none · ref 10 · internal anchor
POSTCONDBENCH is a new multilingual benchmark that evaluates LLM postcondition generation on real code using defect discrimination to assess completeness beyond surface matching.
How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum cs.LG · 2026-04-28 · unverdicted · none · ref 2 · 2 links · internal anchor
A single-parameter Tsallis loss continuum unifies SFT and RLVR, derives time-to-escape bounds for cold start, and yields GARL and PAFT estimators that improve performance on QA reasoning tasks.
Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning cs.LG · 2026-04-24 · unverdicted · none · ref 21 · internal anchor
A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and commonsense QA benchmarks.
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models cs.LG · 2026-01-26 · unverdicted · none · ref 12 · internal anchor
A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better token efficiency.
Reliable LLM-Based Edge-Cloud-Expert Cascades for Telecom Knowledge Systems eess.SP · 2025-12-23 · accept · none · ref 37 · internal anchor
An edge-cloud-expert LLM cascade for telecom knowledge systems minimizes processing cost subject to misalignment-risk bounds via multiple hypothesis testing on knowledge and confidence scores.
PIAST: Rapid Prompting with In-context Augmentation for Scarce Training data cs.CL · 2025-12-11 · conditional · none · ref 8 · internal anchor
PIAST iteratively optimizes few-shot examples in prompts via Monte Carlo Shapley value estimation, outperforming prior automatic prompting methods and setting new SOTA on classification, simplification, and GSM8K with modest compute.
User-Assistant Bias in LLMs cs.CL · 2025-08-16 · unverdicted · none · ref 12 · internal anchor
LLMs show strong user bias in role-tagged contexts that is amplified by preference alignment and can be reduced or controlled through targeted fine-tuning and DPO.
Bayesian Social Deduction with Graph-Informed Language Models cs.AI · 2025-06-21 · unverdicted · none · ref 33 · internal anchor
Hybrid Bayesian-graph LLM agent reaches competitive performance against large models and achieves 67% win rate against humans in controlled Avalon play, outperforming baselines and human teammates.
Learning a Continue-Thinking Token for Enhanced Test-Time Scaling cs.CL · 2025-06-12 · unverdicted · none · ref 2 · internal anchor
A learned continue-thinking token, trained via RL on its embedding alone, improves math benchmark accuracy more than fixed-token budget forcing in a frozen language model.
L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning cs.CL · 2025-03-06 · unverdicted · none · ref 12 · internal anchor
LCPO trains L1 reasoning models to adhere to prompt-specified CoT lengths, supporting accuracy-compute trade-offs and yielding short reasoning models that outperform larger baselines at matched lengths.
Towards Unconstrained Human-Object Interaction cs.CV · 2026-04-15 · unverdicted · none · ref 49
Introduces the U-HOI task and shows MLLMs plus a language-to-graph pipeline can handle human-object interactions without any predefined vocabulary at training or inference time.
AI Achieves a Perfect LSAT Score cs.AI · 2026-04-11 · unverdicted · none · ref 20
Language models achieve a perfect LSAT score, with experiments showing that internal thinking phases and a fine-tuned process reward model are key to high performance on logical reasoning questions.
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning cs.CV · 2025-05-20 · unverdicted · none · ref 15
DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.
Dynamic Rollout Editing for Reducing Overthinking in RL-Trained Reasoning Models cs.CL · 2026-06-16 · unverdicted · none · ref 21 · internal anchor
Dynamic Rollout Editing reduces overthinking in RL-trained LLMs by editing post-answer continuations in successful rollouts and preferring the edited versions within GRPO groups.
PathCal: State-Aware Reflection-Marker Calibration for Efficient Reasoning cs.AI · 2026-05-21 · unverdicted · none · ref 42 · internal anchor
PathCal calibrates reasoning paths by type-aware soft rebalancing of reflection-marker logits at uncertain states, yielding better efficiency-performance trade-offs on six benchmarks.
Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning cs.CL · 2026-05-19 · unverdicted · none · ref 13 · internal anchor
CES applies conditional bidirectional entropy control on top of DAPO to improve accuracy and shorten responses on mathematical benchmarks for 7B and 1.5B LLMs.
Are Rationales Necessary and Sufficient? Tuning LLMs for Explainable Misinformation Detection cs.CL · 2026-05-19 · unverdicted · none · ref 24 · internal anchor
LONSREX introduces a metric-based pipeline to identify necessary and sufficient rationales when creating training data for fine-tuning LLMs on explainable misinformation detection, addressing limitations of naive label-based filtering.
Self-Supervised On-Policy Distillation for Reasoning Language Models cs.LG · 2026-05-17 · unverdicted · none · ref 46 · internal anchor
SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIME and HMMT benchmarks.
LLMs Know When They Know, but Do Not Act on It: A Metacognitive Harness for Test-time Scaling cs.LG · 2026-05-13 · conditional · none · ref 32 · internal anchor
A metacognitive harness uses LLMs' pre- and post-solution self-monitoring signals to control test-time reasoning, raising pooled accuracy from 48.3% to 56.9% on text, code, and multimodal benchmarks.
Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting cs.CL · 2026-05-12 · unverdicted · none · ref 6 · internal anchor
Covariance-weighted GRPO with Gaussian-kernel reweighting tames extreme tokens to stabilize training and boost reasoning performance over standard GRPO.
AIPO: Learning to Reason from Active Interaction cs.CL · 2026-05-08 · unverdicted · none · ref 47 · 2 links · internal anchor
AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key cs.AI · 2026-05-07 · unverdicted · none · ref 85 · 3 links · internal anchor
RL training compute for logical reasoning follows a power law with horizon depth whose exponent rises with logical expressiveness, yielding better downstream transfer when models train on richer logics.
Procedural Knowledge at Scale Improves Reasoning cs.CL · 2026-04-01 · unverdicted · none · ref 23 · internal anchor
Reasoning Memory decomposes reasoning trajectories into 32 million subquestion-subroutine pairs and retrieves them via in-thought prompts to improve language model performance on math, science, and coding benchmarks by up to 19.2%.
Ensemble-Based Uncertainty Estimation for Code Correctness Estimation cs.SE · 2026-03-28 · unverdicted · none · ref 30 · internal anchor
Ensemble Semantic Entropy improves correlation with code correctness over single-model methods and powers a cascading scaling system that cuts FLOPs by 64.9% while preserving performance on LiveCodeBench.
TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning cs.LG · 2026-03-13 · unverdicted · none · ref 8 · internal anchor
Terminator learns to predict optimal early-exit points in chain-of-thought reasoning by training on the first positions where the model emits its final answer, yielding 14-55% shorter outputs with no accuracy loss.
rePIRL: Learn PRM with Inverse RL for LLM Reasoning cs.LG · 2026-02-08 · unverdicted · none · ref 19 · internal anchor
rePIRL learns effective process reward models for LLM reasoning via a dual policy-PRM update process inspired by inverse RL, unifying online and offline methods with reported gains over prior approaches on math and coding datasets.
Diffusion-State Policy Optimization for Masked Diffusion Language Models cs.CL · 2026-02-06 · unverdicted · none · ref 5 · 2 links · internal anchor
DiSPO optimizes intermediate decisions in masked diffusion LMs by branching at selected masked states, resampling tokens, scoring completions, and updating only new tokens using a derived policy-gradient estimator that reuses terminal rollouts.
Training-Trajectory-Aware Token Selection cs.CL · 2026-01-15 · unverdicted · none · ref 15 · internal anchor
Training-Trajectory-Aware Token Selection (T3S) reconstructs the token-level training objective to overcome a performance bottleneck in continual distillation of reasoning capabilities from large to small language models.
Asynchronous Reasoning: Training-Free Interactive Thinking LLMs cs.LG · 2025-12-11 · unverdicted · none · ref 10 · internal anchor
Using properties of positional embeddings, reasoning LLMs can be made to think, listen, and generate outputs asynchronously without any additional training, cutting time to first token to under 5 seconds.
RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling cs.CV · 2025-10-23 · unverdicted · none · ref 5 · internal anchor
RAPO++ is a three-stage prompt optimization framework combining retrieval-augmented refinement, closed-loop test-time scaling, and LLM fine-tuning to enhance text-to-video generation quality.
Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation cs.AI · 2025-10-05 · unverdicted · none · ref 55 · internal anchor
A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.
The Signal is in the Steps: Local Scoring for Reasoning Data Selection cs.LG · 2025-10-05 · unverdicted · none · ref 15 · internal anchor
LALP scores local reasoning steps rather than full trajectories to improve selection of training data from diverse teacher models for distilling long-form reasoning.
MASH: Modeling Abstention via Selective Help-Seeking cs.CL · 2025-10-01 · unverdicted · none · ref 13 · internal anchor
MASH uses RL with a pay-per-search reward to make LLMs seek external help only when needed, improving multi-hop QA accuracy by 7.6% and enabling competitive abstention without pre-defined knowledge boundaries.
Entropy After </Think> for reasoning model early exiting cs.LG · 2025-09-30 · unverdicted · none · ref 13 · internal anchor
Entropy After </Think> (EAT) enables early exiting in reasoning LLMs by tracking entropy stabilization after a </think> token, cutting token use 12-22% on MATH500 and AIME2025 with no accuracy loss.
HyperAdapt: Simple High-Rank Adaptation cs.LG · 2025-09-23 · unverdicted · none · ref 29 · internal anchor
HyperAdapt performs parameter-efficient fine-tuning by row- and column-wise diagonal scaling to induce high-rank updates with only n+m trainable parameters.
ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution cs.CL · 2025-09-17 · unverdicted · none · ref 133 · internal anchor
ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and gains on math reasoning and competitive programming tasks.
GrACE: A Generative Approach to Better Confidence Elicitation and Efficient Test-Time Scaling in Large Language Models cs.CL · 2025-09-11 · unverdicted · none · ref 34 · internal anchor
GrACE is a fine-tuned generative method that uses similarity to a special token embedding for real-time calibrated confidence in LLMs and enables efficient confidence-based test-time scaling.
Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models cs.CL · 2025-08-21 · unverdicted · none · ref 15 · internal anchor
Fin-PRM is a domain-specialized process reward model that supplies binary step-level and trajectory-level supervision signals for financial reasoning in LLMs and outperforms general PRMs on CFLUE and FinQA benchmarks.
ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments cs.CL · 2025-08-06 · unverdicted · none · ref 27 · internal anchor
ReasoningGuard is an inference-time method that uses attention mechanisms to inject safety aha moments and scaling sampling to defend large reasoning models against jailbreak attacks.
Vidar: Embodied Video Diffusion Model for Generalist Manipulation cs.LG · 2025-07-17 · unverdicted · none · ref 19 · internal anchor
Vidar shows that a video diffusion prior continuously pre-trained on 750K multi-view robot trajectories plus a label-free masked inverse dynamics adapter can generalize manipulation to new robot embodiments with 1% of typical demonstration data.
MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents cs.CL · 2025-06-18 · unverdicted · none · ref 33 · internal anchor
MEM1 uses end-to-end RL to learn constant-memory agents that update a shared state for memory and reasoning, delivering 3.5x better performance and 3.7x lower memory use than larger baselines on long-horizon QA and shopping tasks.
Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource cs.CL · 2025-06-13 · conditional · none · ref 27 · internal anchor
MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.
Grounded Reinforcement Learning for Visual Reasoning cs.CV · 2025-05-29 · unverdicted · none · ref 42 · internal anchor
ViGoRL introduces visually grounded RL that anchors reasoning steps to image coordinates and uses multi-turn zooming to outperform standard RL and supervised baselines on spatial and GUI reasoning benchmarks.

s1: Simple test-time scaling

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer