{"total":35,"items":[{"citing_arxiv_id":"2606.29745","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ECHO: Learning Epistemically Adaptive Language Agents with Turn-Level Credit","primary_cat":"cs.MA","submitted_at":"2026-06-29T03:42:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"UNKNOWN","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ECHO is a clipped policy-gradient method that uses posterior-sensitive rewards to give turn-level epistemic credit in multi-turn information-seeking tasks, outperforming trajectory-level GRPO on a new Clue Selector Game benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.22600","ref_index":9,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"On the Position Bias of On-Policy Distillation","primary_cat":"cs.LG","submitted_at":"2026-06-21T17:20:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Position bias in on-policy distillation degrades later-token supervision; IW-OPD weights tokens by accumulated discrepancy, yielding faster convergence and up to 6.9 point gains on AIME-2025.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.19818","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Uncertainty-Aware Reward Modeling for Stable RLHF","primary_cat":"cs.LG","submitted_at":"2026-06-18T05:46:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UARM equips reward models with quantile-based conformal prediction uncertainty and reweights GRPO advantages via heteroscedastic variance decomposition to improve calibration and reduce reward hacking in 
RLHF.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10385","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Absolute Imitation: Anchored Residual Guidance for Privileged On-Policy Distillation","primary_cat":"cs.LG","submitted_at":"2026-06-09T03:51:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AR-OPD disentangles privileged supervision via anchored residual guidance to reduce hindsight leakage in on-policy distillation, reporting gains of 2.3 points over full privileged OPD and 7.9 over SFT on reasoning tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09711","ref_index":268,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization","primary_cat":"cs.AI","submitted_at":"2026-06-08T16:32:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Proxy RL produces a staged proxy-internalization capability that emerges before and predicts reward hacking in coding environments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09388","ref_index":64,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Distilling Safe LLM Systems via Soft Prompts for On Device Settings","primary_cat":"cs.LG","submitted_at":"2026-06-08T12:03:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Soft prompt distillation with total variation and KL divergence transfers safety behaviors from guard models to on-device LLMs and outperforms LoRA adapters, steering 
vectors, and direct optimization in safety-usefulness trade-offs with minimal inference cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09043","ref_index":39,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DynaCF: Mitigating Shortcut Learning in Reward Models via Dynamic Counterfactual Sensitivity","primary_cat":"cs.LG","submitted_at":"2026-06-08T05:24:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DynaCF dynamically downweights shortcut-sensitive samples in reward model training by tracking margin shifts under online counterfactual perturbations within the Bradley-Terry loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03892","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Synthesize and Reward -- Reinforcement Learning for Multi-Step Tool Use in Live Environments","primary_cat":"cs.CL","submitted_at":"2026-06-02T16:52:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PROVE trains LLMs on multi-step tool calls using 20 live MCP servers with 343 tools, state-grounded synthesis, and adaptive efficiency rewards, delivering gains of up to 10.2 points on BFCL Multi-Turn and similar on other benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03131","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HARVE: Hacking-Aware Reward-Head Vector Editing for Robust Reward Models","primary_cat":"cs.LG","submitted_at":"2026-06-02T04:18:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HARVE removes the component of the 
reward-head vector aligned with a multi-directional hacking subspace from residual streams using a small set of contrastive examples, improving robustness on RewardHackBench across eight models without fine-tuning while preserving general capability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01281","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RLVR without Ineffective Samples: Group Prioritized Off-Policy Optimization for LLM Reasoning","primary_cat":"cs.LG","submitted_at":"2026-05-31T15:06:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"POPO uses recency-based prioritized group replay and decoupled off-policy optimization to avoid zero-variance ineffective samples in RLVR, accelerating LLM reasoning finetuning with fewer rollouts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21266","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"How Much Online RL is Enough? 
Informative Rollouts for Offline Preference Optimization in RLVR","primary_cat":"cs.LG","submitted_at":"2026-05-20T14:53:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Short GRPO warm-up followed by offline DPO on informative rollouts matches or beats full GRPO on math reasoning benchmarks at substantially lower compute cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17458","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ClaHF: A Human Feedback-inspired Reinforcement Learning Framework for Improving Classification Tasks","primary_cat":"cs.LG","submitted_at":"2026-05-17T14:00:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"ClaHF converts instance labels into preference signals via candidate predictions and a reward model, then applies RL optimization to improve text classification accuracy and calibration.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11679","ref_index":50,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion","primary_cat":"cs.AI","submitted_at":"2026-05-12T07:38:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall improvement in simultaneous alignment.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"OrthAlignICLR 
202678.00↑57.8175.51↑48.6865.28↑22.2572.93↑42.92 88.12↑67.9367.08↑40.2565.34↑22.3173.51↑43.50 MORA 99.23↑79.0481.98↑55.1569.36↑26.3383.52↑53.51 90.00↑69.8182.11↑55.2866.01↑22.9879.37↑49.36 4.1 Experimental Setup Baselines & Model Configurations.We instantiate our study on two widely-adopted foundation models: LLaMA-3-SFT [49] and Mistral-7B-SFT [50]. To rigorously assess MORA, we benchmark it against a diverse suite of competing approaches spanning three methodological families.(i) Optimization with auxiliary constraints-we consider MODPO [ 39], SPO [41], MO-ODPO [40], and OrthAlign [24].(ii) Data selection-we include RSDPO [ 42], which leverages rejection sampling to construct preference pairs."},{"citing_arxiv_id":"2605.09119","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Personalized Alignment Revisited: The Necessity and Sufficiency of User Diversity","primary_cat":"cs.LG","submitted_at":"2026-05-09T19:07:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A user-diversity condition is necessary and sufficient for personalized alignment to achieve O(1) online regret and log(1/epsilon) offline sample complexity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07331","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective","primary_cat":"cs.LG","submitted_at":"2026-05-08T06:35:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on mathematical reasoning 
tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06036","ref_index":259,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Optimal Transport for LLM Reward Modeling from Noisy Preference","primary_cat":"cs.LG","submitted_at":"2026-05-07T11:26:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SelectiveRM applies optimal transport with a joint consistency discrepancy and partial mass relaxation to produce reward models that optimize a tighter upper bound on clean risk while autonomously dropping noisy preference samples.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01123","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs","primary_cat":"cs.AI","submitted_at":"2026-05-01T21:49:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PERSA combines RLHF with selective parameter-efficient updates to top transformer layers, raising style alignment scores from 35% to 96% on code feedback benchmarks while holding correctness near 100%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16918","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning","primary_cat":"cs.CL","submitted_at":"2026-04-18T08:51:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Freshness-Aware PER augments prioritized experience replay with exponential age decay based on effective sample size to enable successful 
reuse of trajectories in LLM and VLM reinforcement learning, outperforming on-policy baselines on agentic tasks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"of assistant turns per episode. \"Seq Len\" is the maximum sequence length (prompt + all turns). \"Config\" refers to the hyperparameter configuration (section E). Environment Type Model GPUs Seq Len Max Actions Config Reward NQ Search LLM Qwen2.5-7B 8 12800 5 Default Binary (EM) AIME LLM Qwen2.5-7B 8 4096 3 A Binary Sokoban Simple LLM Qwen2.5-0.5B 2 2048 10 A[−1,+3] Sokoban Hard LLM Qwen2.5-0.5B 2 2048 10 A[−1,+3] FrozenLake (LLM) LLM Qwen2.5-0.5B 2 2048 10 A Binary CliffWalking LLM Qwen2.5-0.5B 2 2048 200 A[−∞,0] GSM8K LLM Qwen2.5-0.5B 2 4096 3 A Binary FrozenLake (VLM) VLM Qwen2.5-VL-3B 4 4096 10 Default Binary GeoQA VLM Qwen2.5-VL-3B 4 4096 3 Default Binary NQ Search.An agentic retrieval-augmented QA task on Natural Questions [Jin et al."},{"citing_arxiv_id":"2604.13602","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges","primary_cat":"cs.LG","submitted_at":"2026-04-15T08:11:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In standard RLHF, a reward model rϕ is trained to predict human preferences over response pairs (yw, yl) using a Bradley-Terry model [1, 33]. 
The policy πθ then maximizes this learned scalar, constrained by a Kullback-Leibler (KL) penalty against a reference model [1, 34]: max_{πθ} E_{x∼D, y∼πθ(·|x)}[rϕ(x, y)] − β D_KL(πθ(·|x) ∥ π_ref(·|x)). (2) Direct alignment algorithms like DPO optimize a similar preference geometry [35]. In RLHF, the proxy gap arises because diverse, context-dependent human values are aggregated into a single, uncalibrated scalar. Optimization exploits this by targeting easy-to-learn heuristic artifacts (e.g., authoritative tone or formatting) that consistently triggered human approval during training. RLAIF: Distilling Evaluator Priors. RLAIF uses the same mathematical structure but replaces human annotators"},{"citing_arxiv_id":"2604.13035","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SceneCritic: A Symbolic Evaluator for 3D Indoor Scene Synthesis","primary_cat":"cs.CV","submitted_at":"2026-04-14T17:59:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SceneCritic is a symbolic, ontology-grounded evaluator for floor-plan layouts that identifies specific semantic, orientation, and geometric violations and aligns better with human judgments than VLM-based scorers.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Table 3|Models categorized by post-training strategy and model parameters. 
Model Category Post-Training Details Params Qwen3-14B [39] General RL GRPO-style reinforcement learning 14B Qwen3-235B [39] General RL GRPO-style reinforcement learning 235B UI-Venus-Navi-72B [17] General RL GRPO-based reasoning optimization 72B Gemini-2.5-flash [9] RLAIF + RLHF SFT + Reward Model + RL N/A Qwen2.5-VL-7B-MM-RLHF [13] RLHF PPO-style human feedback alignment 7B Qwen2.5-72B-VL [2] RLHF SFT + DPO (preference optimization) 72B LLaMA4 Maverick [1] RLHF SFT + Online RL + DPO∼17B active (MoE) Qwen3-14B-Intuitor-MATH-1EPOCH [45] RLIF Iterative feedback RL (Intuitor) 14B Qwen3-14B-GRPO-MATH-1EPOCH [45] RLIF GRPO under RLIF objective 14B Qwen2.5-14B-GRPO [31] RLVR GRPO with verifiable reward 14B"},{"citing_arxiv_id":"2604.05341","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Curr-RLCER:Curriculum Reinforcement Learning For Coherence Explainable Recommendation","primary_cat":"cs.IR","submitted_at":"2026-04-07T02:25:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Curr-RLCER applies curriculum reinforcement learning with coherence-driven rewards to align generated explanations with predicted ratings in explainable recommendation systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.01473","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SelfGrader: LLM Jailbreak Detection via Anchored Token-Level Logits","primary_cat":"cs.CR","submitted_at":"2026-04-01T23:29:12+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.12125","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Learning beyond Teacher: Generalized On-Policy 
Distillation with Reward Extrapolation","primary_cat":"cs.LG","submitted_at":"2026-02-12T16:14:29+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Generalized on-policy distillation with reward scaling above one (ExOPD) lets student models surpass teacher performance when merging domain experts on math and code tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.01970","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models","primary_cat":"cs.AI","submitted_at":"2026-02-02T11:24:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GPS trains a small model on optimization history to predict prompt difficulty and select intermediate-difficulty diverse batches, yielding better training efficiency, final performance, and test-time allocation than baselines on reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.01003","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ESSAM: A Novel Competitive Evolution Strategies Approach to Reinforcement Learning for Memory Efficient LLMs Fine-Tuning","primary_cat":"cs.LG","submitted_at":"2026-02-01T03:56:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ESSAM matches PPO and GRPO accuracy (~78%) on GSM8K math tasks but uses 10-18x less GPU memory and shows stronger generalization across 
datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.23102","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Multiplayer Nash Preference Optimization","primary_cat":"cs.AI","submitted_at":"2025-09-27T04:18:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MNPO extends NLHF to multiplayer Nash games, inheriting equilibrium guarantees while showing empirical gains on instruction-following benchmarks under diverse preferences.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.03403","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training","primary_cat":"cs.LG","submitted_at":"2025-09-03T15:28:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PROF curates RL training data via PRM-ORM consistency to improve both final-answer accuracy and intermediate reasoning quality while reducing reliance on strong process reward models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.03526","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Enhancing Speech Large Language Models through Reinforced Behavior Alignment","primary_cat":"cs.CL","submitted_at":"2025-08-25T07:31:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Reinforced Behavior Alignment (RBA) uses self-synthesized data from a teacher LLM and reinforcement learning to close the instruction-following gap in SpeechLMs, outperforming distillation and reaching SOTA on spoken QA and speech-to-text 
translation benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.04149","ref_index":44,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap","primary_cat":"cs.CL","submitted_at":"2025-08-06T07:24:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Selecting preference pairs whose DPO implicit reward gap is small yields better LLM alignment than random or baseline selection while using only 10% of the data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.17352","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles","primary_cat":"cs.CV","submitted_at":"2025-03-21T17:52:43+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Iterative SFT-RL cycles enable a 7B LVLM to develop sophisticated visual chain-of-thought reasoning and improve performance on math and general reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.09567","ref_index":159,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models","primary_cat":"cs.AI","submitted_at":"2025-03-12T17:35:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future 
directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"To fill this gap, a series of work focus on optimizing the multimodal Long CoT capabilities [554, 1104, 839]. For example, Li et al. [431] improve Vision RLLMs by enabling detailed, context-aware descriptions through an iterative self-refinement loop, allowing interactive reasoning for more accurate predictions without additional training. Dong et al. [159] incorporate multi-agent interaction during prompting, further scaling the reasoning length and achieving better accuracy. Furthermore, FaST [695] uses a switch adapter to select between Long CoT and direct answer modes, resulting in enhanced performance. (2) Multimodal Long CoT Imitation: Recent models such as LLaV A-CoT [900] and Virgo [166] employ data distillation to enable the imitation of"},{"citing_arxiv_id":"2412.14164","ref_index":164,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MetaMorph: Multimodal Understanding and Generation via Instruction Tuning","primary_cat":"cs.CV","submitted_at":"2024-12-18T18:58:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2411.10442","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization","primary_cat":"cs.CL","submitted_at":"2024-11-15T18:59:27+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mixed Preference 
Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Instructblip: Towards general- purpose vision-language models with instruction tuning. NIPS, 36, 2024. 1 [25] Yihe Deng, Pan Lu, Fan Yin, Ziniu Hu, Sheng Shen, Quan- quan Gu, James Zou, Kai-Wei Chang, and Wei Wang. En- hancing large vision language models with self-training on image comprehension. arXiv preprint arXiv:2405.19716 , 2024. 3, 6, 7 9 [26] Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, and Tong Zhang. Rlhf workflow: From reward mod- eling to online rlhf.arXiv preprint arXiv:2405.07863, 2024. 3 [27] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Ab- hishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al."},{"citing_arxiv_id":"2410.18451","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs","primary_cat":"cs.AI","submitted_at":"2024-10-24T06:06:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Data-centric filtering yields an 80K preference dataset and reward models that lead RewardBench while boosting other top entries.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.16860","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs","primary_cat":"cs.CV","submitted_at":"2024-06-24T17:59:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Cambrian-1 is a vision-centric 
multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance on visual grounding tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"supervised models underscores the potential for training superior vision-only models with more data and improved techniques. Additionally, we observe that higher-resolution models particularly enhance performance on chart and vision-centric benchmarks while remaining neutral on general VQA and knowledge-based VQAs. While the majority of the backbones we examine are ViT-based [39], ConvNet-based architectures (such as OpenCLIP ConvNeXt [87]) are inherently well-suited for high-resolution image processing [131] and can produce superior results on OCR & Chart and Vision-Centric benchmarks. In vision-centric benchmarks, the gap between language-supervised and other types of vision models is smaller, with a well-trained self-supervised DINOv2 model even outperforming some language-supervised models."}],"limit":50,"offset":0}