{"total":13,"items":[{"citing_arxiv_id":"2605.30789","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO","primary_cat":"cs.LG","submitted_at":"2026-05-29T03:25:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Smaller models provide temporally correlated policy-level diversity that serves as structured exploration for training larger models in GRPO, yielding accuracy gains such as +8.8% on AIME 24 with reduced compute via the S2L-PO framework.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23271","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-22T06:22:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EvalVerse is a pipeline-aware benchmark that distills expert cinematic judgments into VLMs to assess 'goodness' metrics like aesthetics and multi-shot coherence alongside basic prompt adherence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09533","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Assessment of RAG and Fine-Tuning for Industrial Question-Answering-Applications","primary_cat":"cs.CL","submitted_at":"2026-05-10T13:35:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"RAG is more effective and cost-efficient than fine-tuning for industrial QA adaptation on automotive 
datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08905","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs","primary_cat":"cs.AI","submitted_at":"2026-05-09T11:57:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06895","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mitigating Cognitive Bias in RLHF by Altering Rationality","primary_cat":"cs.AI","submitted_at":"2026-05-07T19:54:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Dynamically adjusting beta via LLM-as-judge downweights biased comparisons to learn more rational reward models from flawed human preferences.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24536","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Generating Place-Based Compromises Between Two Points of View","primary_cat":"cs.CL","submitted_at":"2026-04-27T14:33:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Empathic similarity feedback in prompts generates more acceptable compromises than chain-of-thought, and margin-based training on the resulting data lets smaller models produce them without ongoing empathy 
estimation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06621","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Theorems of Dr. David Blackwell and Their Contributions to Artificial Intelligence","primary_cat":"cs.GL","submitted_at":"2026-04-08T03:01:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Blackwell's Rao-Blackwell, Approachability, and Informativeness theorems provide frameworks for variance reduction, sequential decisions under uncertainty, and comparing information sources that remain relevant to AI.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.14244","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reinforcement Learning for Unsupervised Domain Adaptation in Spatio-Temporal Echocardiography Segmentation","primary_cat":"eess.IV","submitted_at":"2025-10-16T02:55:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RL4Seg3D applies reinforcement learning with novel reward functions and fusion to adapt echocardiography segmentation models across domains, improving accuracy, anatomical validity, and temporal consistency on over 30,000 videos without target labels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.10630","ref_index":57,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Time Series Forecasting as Reasoning: A Slow-Thinking Approach with Reinforced 
LLMs","primary_cat":"cs.LG","submitted_at":"2025-06-12T12:15:50+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.19134","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Incentivizing High-Quality Human Annotations with Golden Questions","primary_cat":"cs.GT","submitted_at":"2025-05-25T13:11:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The paper derives a Θ(1/√(n log n)) hypothesis testing rate under strategic annotator behavior and shows that high-certainty, format-similar golden questions better reveal annotation quality than standard checks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.13958","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ToolRL: Reward is All Tool Learning Needs","primary_cat":"cs.LG","submitted_at":"2025-04-16T21:45:32+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.06387","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"How Humans Help LLMs: Assessing and Incentivizing Human Preference Annotators","primary_cat":"cs.LG","submitted_at":"2025-02-10T12:15:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Develops self-consistency monitoring for preference annotators and 
derives sample-complexity bounds showing linear contracts achieve near-ideal performance faster than binary ones under continuous actions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2409.19256","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HybridFlow: A Flexible and Efficient RLHF Framework","primary_cat":"cs.LG","submitted_at":"2024-09-28T06:20:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"in the set, we explore optimized parallelism strategies for each model in the auto_parallel module, that minimizes model execution latency. The workload 𝑊 includes input and output shapes and computation (training, inference or generation) of each model. In auto_parallel, we utilize a simulator module simu to estimate the latency of different parallel strategies, following previous research [42, 84, 90, 92] (outline in Appendix. C). The d_cost module estimates the end-to-end latency of the RLHF dataflow under given model placement and parallelism strategies, by iterating through all stages in the dataflow graph and summing up latencies of all stages (Lines 17, 25). For models in the same colocated set and involving computation in the same stage (such as actor and critic both"}],"limit":50,"offset":0}