mega hub Mixed citations

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao · 2024 · cs.CL · arXiv 2402.03300

Mixed citation behavior. Most common role is background (52%).

1348 Pith papers citing it

Background 52% of classified citations

abstract

Mathematical reasoning poses a significant challenge for language models due to its complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which continues pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits and voting techniques, approaching the performance level of Gemini-Ultra and GPT-4. Self-consistency over 64 samples from DeepSeekMath 7B achieves 60.9% on MATH. The mathematical reasoning capability of DeepSeekMath is attributed to two key factors: First, we harness the significant potential of publicly available web data through a meticulously engineered data selection pipeline. Second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO), that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO.
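The abstract names GRPO but does not spell out its mechanics. As a rough illustration only: GRPO is generally described as replacing PPO's learned value baseline with statistics computed over a group of sampled answers to the same question, which is where the memory saving comes from. The sketch below shows that group-relative normalization for outcome rewards. The function name, the `eps` constant, the use of the population standard deviation, and the example rewards are all assumptions made for illustration, not details taken from this page.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled answer relative to its own group.

    rewards: one scalar reward per answer sampled for the SAME question.
    eps: small constant (an assumption) that avoids division by zero
         when every answer in the group receives the same reward.
    """
    mu = mean(rewards)        # group mean acts as the baseline
    sigma = pstdev(rewards)   # population std; the exact estimator is an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for 4 sampled answers to one question (1.0 = correct).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Answers that beat the group average get a positive advantage and the rest get a negative one, so no separate critic network is needed to supply a baseline. When all answers in a group earn the same reward, every advantage is zero and that group contributes no learning signal.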

citation-role summary

background 187 · method 124 · baseline 14 · other 5 · dataset 4

citation-polarity summary

background 175 · use method 118 · unclear 22 · baseline 14 · use dataset 4 · extend 1

authors

Shao

counterfactual ablation

If this work disappeared, these are the nearest dependency candidates in Pith, weighted toward method, dataset, baseline, and extension contexts where available. This is a structural signal, not a retraction verdict.

representative citing papers

AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot

cs.AI · 2026-04-15 · conditional · novelty 9.0

AI reviews for all 22,977 AAAI-26 papers were preferred by authors and PC members over human reviews on accuracy and suggestions and outperformed baselines at spotting weaknesses.

Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

cs.CV · 2026-05-31 · accept · novelty 8.0

Introduces the TVR active viewpoint-matching task and TVRBench indoor simulation benchmark, where foundation models start at low single-digit success rates but reach 51.4% after visual-action SFT and multi-turn GRPO post-training.

Continual Harness: Online Adaptation for Self-Improving Foundation Agents

cs.LG · 2026-05-11 · conditional · novelty 8.0

Continual Harness automates online self-improvement for foundation-model embodied agents by refining prompts, sub-agents, skills, and memory within one run, cutting button-press costs on Pokemon Red and Emerald and closing much of the gap to expert harnesses.

ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

cs.LG · 2026-05-09 · conditional · novelty 8.0

ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equipped EPLB while staying within 6-10% of an ideal balanced baseline.

STARE: Step-wise Temporal Alignment and Red-teaming Engine for Multi-modal Toxicity Attack

cs.CR · 2026-05-01 · unverdicted · novelty 8.0

STARE uses step-wise RL to attack multimodal models, achieving 68% higher attack success rate while revealing that adversarial optimization concentrates conceptual toxicity early and detail toxicity late in the generation trajectory.

From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation

cs.SE · 2026-04-30 · unverdicted · novelty 8.0

MLLMs exhibit a Mirage effect by bypassing circuit diagrams in favor of header semantics for Verilog generation; VeriGround with identifier anonymization and D-ORPO training reaches 46% Functional Pass@1 while refusing blank images at >92%.

S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

cs.CV · 2026-04-23 · unverdicted · novelty 8.0

S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6.7 points.

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

GIANTS: Generative Insight Anticipation from Scientific Literature

cs.CL · 2026-04-10 · unverdicted · novelty 8.0

GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.

SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology

cs.AI · 2026-03-30 · conditional · novelty 8.0

SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.

SEVerA: Verified Synthesis of Self-Evolving Agents

cs.LG · 2026-03-26 · unverdicted · novelty 8.0

SEVerA uses Formally Guarded Generative Models and a three-stage Search-Verification-Learning process to synthesize self-evolving agents that satisfy hard formal constraints while improving task performance.

Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

cs.LG · 2026-03-13 · unverdicted · novelty 8.0

Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.

RLCracker: Evaluating the Worst-Case Vulnerability of LLM Watermarks with Adaptive RL Attacks

cs.CR · 2025-09-25 · conditional · novelty 8.0

RLCracker is a reinforcement learning attack that erases LLM watermarks at 98.5% success rate with minimal data and generalizes across ten schemes and multiple model sizes.

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

cs.CL · 2025-04-15 · conditional · novelty 8.0

DeepMath-103K is a new 103K-problem mathematical dataset with high difficulty, rigorous decontamination, and verifiable answers to support RL training of language-model reasoning.

The Alignment Problem in Constrained Code Generation

cs.SE · 2026-06-19 · unverdicted · novelty 7.0

Incomplete constrainers in constrained decoding push LLMs into low-probability program regions, making unconstrained decoding outperform constrained decoding on functional correctness across seven models and three benchmarks.

Expected Free Energy-based Planning as Variational Inference

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

EFE-based planning is formulated as variational free energy minimization with epistemic priors, decomposing into expected plan costs plus a complexity term.

A Model of Multi-turn Human Persuadability Using Probabilistic Belief Tracing

cs.CL · 2026-06-03 · unverdicted · novelty 7.0

PERSUASIONTRACE introduces a Bayesian-network simulated target for multi-turn persuasion that matches human belief dynamics (81 vs 80) better than LLM baselines (64) and enables process-level evaluation.

What Type of Inference is Active Inference?

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

EFE-based active inference planning is characterized as VFE on an augmented model plus entropy and planning corrections, with a derived message-passing implementation and grid-world validation.

Cost-Aware Optimization for Agentic Query Execution

cs.DB · 2026-06-02 · unverdicted · novelty 7.0

EnumGRPO is a self-improving optimizer for agentic query execution that reduces LLM-operator costs by ~317x while improving accuracy by 18% over a hybrid baseline across four databases.

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

cs.LG · 2026-05-31 · unverdicted · novelty 7.0

OmniOPD replaces token-level logit matching in on-policy distillation with Monte Carlo chunk-level semantic verification and a peak-entropy scheduler.

DeepLatent: Think with Images via Parallel Latent Visual Reasoning

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.

citing papers explorer

Showing 50 of 1348 citing papers.

AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot cs.AI · 2026-04-15 · conditional · none · ref 4 · internal anchor
AI reviews for all 22,977 AAAI-26 papers were preferred by authors and PC members over human reviews on accuracy and suggestions and outperformed baselines at spotting weaknesses.
Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration? cs.CV · 2026-05-31 · accept · none · ref 66 · internal anchor
Introduces the TVR active viewpoint-matching task and TVRBench indoor simulation benchmark, where foundation models start at low single-digit success rates but reach 51.4% after visual-action SFT and multi-turn GRPO post-training.
Continual Harness: Online Adaptation for Self-Improving Foundation Agents cs.LG · 2026-05-11 · conditional · none · ref 16 · internal anchor
Continual Harness automates online self-improvement for foundation-model embodied agents by refining prompts, sub-agents, skills, and memory within one run, cutting button-press costs on Pokemon Red and Emerald and closing much of the gap to expert harnesses.
ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning cs.LG · 2026-05-09 · conditional · none · ref 24 · internal anchor
ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equipped EPLB while staying within 6-10% of an ideal balanced baseline.
STARE: Step-wise Temporal Alignment and Red-teaming Engine for Multi-modal Toxicity Attack cs.CR · 2026-05-01 · unverdicted · none · ref 8 · internal anchor
STARE uses step-wise RL to attack multimodal models, achieving 68% higher attack success rate while revealing that adversarial optimization concentrates conceptual toxicity early and detail toxicity late in the generation trajectory.
From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation cs.SE · 2026-04-30 · unverdicted · none · ref 50 · internal anchor
MLLMs exhibit a Mirage effect by bypassing circuit diagrams in favor of header semantics for Verilog generation; VeriGround with identifier anonymization and D-ORPO training reaches 46% Functional Pass@1 while refusing blank images at >92%.
S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images cs.CV · 2026-04-23 · unverdicted · none · ref 15 · internal anchor
S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.
EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations cs.CV · 2026-04-20 · unverdicted · none · ref 33 · internal anchor
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning cs.CL · 2026-04-19 · unverdicted · none · ref 27 · internal anchor
MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6.7 points.
RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees cs.CV · 2026-04-17 · unverdicted · none · ref 43 · internal anchor
RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.
GIANTS: Generative Insight Anticipation from Scientific Literature cs.CL · 2026-04-10 · unverdicted · none · ref 16 · internal anchor
GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.
SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology cs.AI · 2026-03-30 · conditional · none · ref 18 · internal anchor
SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.
SEVerA: Verified Synthesis of Self-Evolving Agents cs.LG · 2026-03-26 · unverdicted · none · ref 37 · internal anchor
SEVerA uses Formally Guarded Generative Models and a three-stage Search-Verification-Learning process to synthesize self-evolving agents that satisfy hard formal constraints while improving task performance.
Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages cs.LG · 2026-03-13 · unverdicted · none · ref 10 · internal anchor
Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.
RLCracker: Evaluating the Worst-Case Vulnerability of LLM Watermarks with Adaptive RL Attacks cs.CR · 2025-09-25 · conditional · none · ref 31 · internal anchor
RLCracker is a reinforcement learning attack that erases LLM watermarks at 98.5% success rate with minimal data and generalizes across ten schemes and multiple model sizes.
Flow-GRPO: Training Flow Matching Models via Online RL cs.CV · 2025-05-08 · unverdicted · none · ref 16 · internal anchor
Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning cs.CL · 2025-04-15 · conditional · none · ref 19 · internal anchor
DeepMath-103K is a new 103K-problem mathematical dataset with high difficulty, rigorous decontamination, and verifiable answers to support RL training of language-model reasoning.
The Alignment Problem in Constrained Code Generation cs.SE · 2026-06-19 · unverdicted · none · ref 41 · internal anchor
Incomplete constrainers in constrained decoding push LLMs into low-probability program regions, making unconstrained decoding outperform constrained decoding on functional correctness across seven models and three benchmarks.
Expected Free Energy-based Planning as Variational Inference cs.AI · 2026-06-09 · unverdicted · none · ref 275 · internal anchor
EFE-based planning is formulated as variational free energy minimization with epistemic priors, decomposing into expected plan costs plus a complexity term.
A Model of Multi-turn Human Persuadability Using Probabilistic Belief Tracing cs.CL · 2026-06-03 · unverdicted · none · ref 113 · internal anchor
PERSUASIONTRACE introduces a Bayesian-network simulated target for multi-turn persuasion that matches human belief dynamics (81 vs 80) better than LLM baselines (64) and enables process-level evaluation.
What Type of Inference is Active Inference? cs.AI · 2026-06-03 · unverdicted · none · ref 299 · internal anchor
EFE-based active inference planning is characterized as VFE on an augmented model plus entropy and planning corrections, with a derived message-passing implementation and grid-world validation.
Cost-Aware Optimization for Agentic Query Execution cs.DB · 2026-06-02 · unverdicted · none · ref 35 · internal anchor
EnumGRPO is a self-improving optimizer for agentic query execution that reduces LLM-operator costs by ~317x while improving accuracy by 18% over a hybrid baseline across four databases.
OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification cs.LG · 2026-05-31 · unverdicted · none · ref 26 · internal anchor
OmniOPD replaces token-level logit matching in on-policy distillation with Monte Carlo chunk-level semantic verification and a peak-entropy scheduler.
DeepLatent: Think with Images via Parallel Latent Visual Reasoning cs.CV · 2026-05-30 · unverdicted · none · ref 96 · internal anchor
DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.
EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing cs.LG · 2026-05-30 · unverdicted · none · ref 107 · internal anchor
EST-PRM stress-tests five PRM models on 4,687 reasoning chains from MATH-500, GSM8K, and PRMBench using three label-preserving transformations and reports model-specific vulnerability patterns.
Detector-Evasive LLM Paraphrasing via Constrained Policy Optimization cs.LG · 2026-05-29 · unverdicted · none · ref 72 · internal anchor
DEPO formulates detector-evasive paraphrasing as a constrained MDP and solves it via Lagrangian primal-dual RL with GRPO-style updates to achieve evasion while satisfying a semantic-preservation constraint.
LLMs Need Encoders for Semantic IDs Too cs.IR · 2026-05-29 · unverdicted · none · ref 31 · internal anchor
PrefixMem encoder for Semantic IDs improves deepest-level accuracy by up to 46% relative and full-SID retrieval recall by up to 22% relative on Pinterest data across LLM families.
The Regularizing Power of Language-Training Deepfake Detectors cs.CV · 2026-05-29 · unverdicted · none · ref 49 · internal anchor
A dual-encoder deepfake detector pairs a frozen specialist with a LoRA-tuned MLLM, trained first via binary alignment then via RL to reward explain-then-classify behavior, yielding improved cross-dataset performance and interpretability.
ElasticMem: Latent Memory as a Learnable Resource for LLM Agents cs.CL · 2026-05-29 · unverdicted · none · ref 35 · internal anchor
ElasticMem enables LLM agents to learn adaptive latent memory retrieval and elastic budget allocation, improving QA accuracy by 24-26% and ALFWorld success by 27-66% over baselines with lower token cost.
PInVerify: An Offline Embodied Benchmark for Active Instance Verification cs.CV · 2026-05-28 · unverdicted · none · ref 36 · internal anchor
PInVerify is a new offline embodied benchmark for active instance verification that supplies multi-view captures and 6-sector navigation topology, with MLLM baselines reaching 85.6% after fine-tuning but showing no reliable benefit from tested next-best-view strategies.
RL2ML: Finite-Rollout Surrogate Objectives from Reinforcement Learning to Maximum Likelihood cs.LG · 2026-05-28 · unverdicted · none · ref 9 · internal anchor
RL2ML introduces a parameterized family of surrogate objectives bridging RL and ML with unbiased gradient estimators, group-level update-scale analysis, and metric-dependent optimization for finite-rollout LLM training.
ETCHR: Editing To Clarify and Harness Reasoning cs.CV · 2026-05-22 · unverdicted · none · ref 26 · internal anchor
A decoupled question-conditioned image editor trained via supervised imitation then VLM-reward enhancement improves MLLM visual reasoning Pass@1 by 4.6-5.5 points across models and tasks.
Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval cs.CV · 2026-05-22 · unverdicted · none · ref 14 · internal anchor
ToolMerge decomposes queries into LLM-planned tool calls merged by boolean operators for long-video keyframe retrieval and introduces the M2M benchmark, showing competitive results with 5% gains on caption retrieval.
Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents cs.AI · 2026-05-22 · unverdicted · none · ref 23 · internal anchor
Co-ReAct adds step-level rubric guidance to ReAct agents via a GRPO-trained generator using list-wise ranking rewards, yielding consistent gains on DeepResearchBench and SQA-CS-V2.
EDGE-OPD: Internalizing Privileged Context with Evidence Guided On-Policy Distillation cs.AI · 2026-05-22 · unverdicted · none · ref 10 · internal anchor
EDGE-OPD adds guided rollouts and evidence masking to on-policy self-distillation, enabling successful learning of target identities where standard OPSD and RLSD fail.
Contrastive Distribution Matching for Amortized Sequential Monte Carlo in Discrete Diffusion cs.LG · 2026-05-22 · unverdicted · none · ref 71 · internal anchor
CDM amortizes SMC inference for reward-tilted discrete diffusion by training a parameterized twist function on contrastive samples with closed-form kernels.
DepthAgent: Towards Better Universal Depth Estimation via Sample-wise Expert Selection cs.CV · 2026-05-22 · unverdicted · none · ref 56 · internal anchor
A reinforcement-learned vision-language agent adaptively selects and fuses monocular depth experts per sample for better performance across camera geometries.
Visual-Advantage On-Policy Distillation for Vision-Language Models cs.CV · 2026-05-21 · unverdicted · none · ref 9 · internal anchor
VA-OPD improves VLM performance over standard on-policy distillation by reweighting rollouts and separating KL terms according to token-level visual advantage on math and visual benchmarks.
Seizure-Semiology-Suite (S3): A Clinically Multimodal Dataset, Benchmark, and Models for Seizure Semiology Understanding cs.CV · 2026-05-21 · unverdicted · none · ref 121 · internal anchor
Seizure-Semiology-Suite provides a new clinically annotated video dataset and hierarchical benchmark that exposes weaknesses in current MLLMs for seizure semiology and demonstrates gains from fine-tuning and a neuro-symbolic classifier reaching 0.96 F1.
RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution cs.CV · 2026-05-20 · conditional · none · ref 54 · internal anchor
RankE co-evolves AR policy and decoder via alternating ranking optimization, improving both FID and CLIP scores on LlamaGen-XL and Janus-Pro where policy-only RL degrades FID.
Learning First Integrals via Backward-Generated Data and Guided Reinforcement Learning cs.LG · 2026-05-20 · unverdicted · none · ref 12 · internal anchor
FISolver trains a compact LLM on backward-generated (differential equation, first integral) pairs and uses guided reinforcement learning to outperform larger models and Mathematica on first-integral benchmarks at lower cost.
Linear-DPO: Linear Direct Preference Optimization for Diffusion and Flow-Matching Generative Models cs.CV · 2026-05-20 · unverdicted · none · ref 6 · internal anchor
Linear-DPO replaces sigmoid utility with linear utility and adds EMA reference to improve preference alignment in diffusion and flow-matching text-to-image models.
Grounding Driving VLA via Inverse Kinematics cs.CV · 2026-05-20 · conditional · none · ref 37 · internal anchor
By adding future visual state prediction and a dedicated inverse kinematics diffusion network that uses only visual boundary conditions, a 0.5B driving VLA recovers visual grounding and matches 7-8B models on NAVSIM-v2 and nuScenes.
Draw2Think: Harnessing Geometry Reasoning through Constraint Engine Interaction cs.CV · 2026-05-20 · unverdicted · none · ref 44 · internal anchor
Draw2Think recasts geometric reasoning as agentic interaction with a constraint engine, achieving 95.9% predicate-level construction fidelity and up to 16.4% accuracy gains on solid geometry tasks.
Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression cs.LG · 2026-05-20 · unverdicted · none · ref 31 · internal anchor
Distribution-Aware Reward optimizes LLM regression by treating rollouts as empirical predictive distributions and rewarding marginal improvements in CRPS quality rather than point accuracy alone.
ConceptSeg-R1: Segment Any Concept via Meta-Reinforcement Learning cs.CV · 2026-05-19 · unverdicted · none · ref 30 · internal anchor
ConceptSeg-R1 uses Meta-GRPO meta-RL to learn transferable rules from visual demonstrations and apply them via concept translation for generalized concept segmentation across CI, CD, and CR levels.
CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models cs.CV · 2026-05-19 · unverdicted · none · ref 66 · internal anchor
Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.
Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding cs.LG · 2026-05-19 · unverdicted · none · ref 27 · internal anchor
Graft combines pruning and retrieval in a sequential mechanism to build hybrid draft trees for speculative decoding, delivering up to 5.41× speedup and 21.8% better average speedup than EAGLE-3 on large models.
CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning cs.CL · 2026-05-19 · unverdicted · none · ref 24 · internal anchor
CopT reverses CoT by eliciting a draft answer first then using continuous-embedding contrastive verification and on-policy thinking to reflect and correct, yielding up to 23% higher accuracy and 57% fewer tokens without training.
RECIPE: Procedural Planning via Grounding in Instructional Video cs.CV · 2026-05-19 · unverdicted · none · ref 34 · internal anchor
RECIPE improves visual procedural planners by rewarding plans according to their grounding quality in ASR transcripts via GRPO, yielding +7–8 in-domain and up to +16 zero-shot macro-accuracy gains over base models and outperforming supervised fine-tuning on seven benchmarks.