{"total":39,"items":[{"citing_arxiv_id":"2607.01813","ref_index":7,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"MMBench-Live: A Continuously Evolving Benchmark for Multimodal Models","primary_cat":"cs.CV","submitted_at":"2026-07-02T07:27:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MMBench-Live introduces an automated multi-agent pipeline and distribution-consistent update strategy to create a continuously evolving multimodal benchmark with 5.9K new instances at low cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30175","ref_index":104,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"CORTEX: High-Quality Cross-Domain Organization of Web-Scale Corpora through Ontological Corpus Graph","primary_cat":"cs.CL","submitted_at":"2026-06-29T11:51:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Cortex uses an Ontological Corpus Graph to structure web-scale corpora, creating a refined 24.14B-token corpus and a new benchmark validated on eight LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.22578","ref_index":25,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Context-Aware Distillation and Ablation for Text2DSL","primary_cat":"cs.CL","submitted_at":"2026-06-21T16:27:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Context-aware distillation with BNF+API+vocabulary scales PolkitBench to 10,073 pairs at 99.7% runtime pass rate; ablation on GigaChat-10B shows vocabulary adds +0.198 combined score while API/BNF add 22-25pp structural validity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.21851","ref_index":68,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"TALAS: Teacher-Anchored Layer Alignment with Adaptive Sharpness-Aware Minimization for Embedding Distillation","primary_cat":"cs.CL","submitted_at":"2026-06-20T03:17:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"TALAS is a knowledge distillation method that selectively aligns upper student layers to teacher sentence embeddings, propagates knowledge top-down via relational constraints in lower layers, and uses ASAM to seek flatter minima.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.19988","ref_index":67,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Repository-Level Solidity Code Generation with Large Language Models: From Prompting to Fine-Tuning","primary_cat":"cs.SE","submitted_at":"2026-06-18T09:28:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces SolidityBench benchmark and SolidityScore metric for repository-level Solidity code generation, finding supervised fine-tuning outperforms prompting, CoT, ICL, and RAG methods on evaluated LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.12984","ref_index":51,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"SkillChain: Closing the Loop on Skill Evolution for Image-Based E-Commerce AI Assistants","primary_cat":"cs.CL","submitted_at":"2026-06-11T07:21:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"SkillChain automates skill lifecycle for e-commerce image AI assistants via creator, optimizer, and refiner stages, leading to improved response quality and user engagement in production A/B tests.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07690","ref_index":9,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"HARP: Efficient Data Selection for Finetuning Large Language Models","primary_cat":"cs.LG","submitted_at":"2026-06-05T06:31:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HARP is a train-based data selector for LLM finetuning that uses hierarchical active region pruning and empirical Bayes posteriors to achieve up to 8.9 point gains with roughly 7 times fewer training examples.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06556","ref_index":133,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Robots Need More than VLA and World Models","primary_cat":"cs.RO","submitted_at":"2026-06-04T10:43:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper identifies four missing interfaces (data autolabelling, embodiment retargeting, physics-grounded world models, and video-based reward inference) as the central bottleneck beyond VLA scaling for robot intelligence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02404","ref_index":32,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts","primary_cat":"cs.CL","submitted_at":"2026-06-01T15:50:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"K-BrowseComp is a new Korean web-browsing agent benchmark where frontier LLMs score 30-46% and Korean LLMs score 0-10% on the verified subset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01617","ref_index":20,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"EvoPool: Evolutionary Programmatic Annotation for Label-Efficient Specialized Supervision","primary_cat":"cs.CL","submitted_at":"2026-06-01T03:10:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EvoPool evolves pools of programmatic annotators that outperform LLM annotation by 0.141 average macro-F1 on 7 of 8 specialized tasks while running thousands of times faster.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00825","ref_index":59,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory","primary_cat":"cs.CV","submitted_at":"2026-05-30T17:53:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SuperMemory-VQA provides 4,853 human-verified QA pairs from 52.9 hours of egocentric AI glasses recordings to benchmark AI systems on realistic long-horizon memory tasks including an unanswerable option.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00686","ref_index":70,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Dialectics of Alignment: Harnessing Unsafe Knowledge for Dynamic Safety Routing","primary_cat":"cs.LG","submitted_at":"2026-05-30T11:49:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SafeMoE isolates unsafe knowledge in domain-specific LoRA experts and routes them via a lightweight gate trained on safe responses to produce safer and more informative LLM outputs with zero-shot generalization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00544","ref_index":41,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Escaping the Mode Lottery: Multi-Response Training Improves Language Model Generalization","primary_cat":"cs.LG","submitted_at":"2026-05-30T05:30:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Multi-response training retains multiple responses per prompt to reduce uncertainty about the conditional output distribution, yielding improved distributional generalization especially in high response-diversity and low prompt-redundancy regimes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29940","ref_index":29,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Make LLM Learn to Synthesize from Streaming Experiences through Feedback","primary_cat":"cs.AI","submitted_at":"2026-05-28T13:51:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SynLearner lets LLMs improve synthetic data generation on later tasks in a stream by learning reusable patterns and balancing quality with diversity from feedback on earlier tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28577","ref_index":38,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Continual Model Routing in Evolving Model Hubs","primary_cat":"cs.AI","submitted_at":"2026-05-27T15:00:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Formalizes continual model routing (CMR), releases CMRBench with over 2000 models, and presents CARvE which outperforms retrieval, fine-tuning and adapter-merging baselines on model/family/domain accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28556","ref_index":28,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks","primary_cat":"cs.AI","submitted_at":"2026-05-27T14:45:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TASTE automates generation of high-coverage difficult agent benchmarks via adaptive contrastive n-gram sampling of tool sequences, yielding τ^c-Bench where models saturating τ²-Bench drop sharply and unique tool combinations more than double.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21442","ref_index":38,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"torchtune: PyTorch native post-training library","primary_cat":"cs.LG","submitted_at":"2026-05-20T17:32:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"torchtune is a modular PyTorch library for LLM post-training that delivers competitive performance and memory efficiency while supporting rapid research iteration through hackable components.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17497","ref_index":11,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Self-Supervised On-Policy Distillation for Reasoning Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-17T15:14:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIME and HMMT benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15104","ref_index":178,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents","primary_cat":"cs.CL","submitted_at":"2026-05-14T17:22:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance with small text-to-voice gaps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15011","ref_index":43,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"The Scientific Contribution Graph: Automated Literature-based Technological Roadmapping at Scale","primary_cat":"cs.CL","submitted_at":"2026-05-14T16:12:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Builds a 2M-contribution graph from 230k papers with 12.5M prerequisite links and reports 0.48 MAP on temporal backtesting for predicting enabling technologies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14071","ref_index":42,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Distribution Corrected Offline Data Distillation for Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-13T19:47:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A distribution-correction framework for offline LLM reasoning distillation improves accuracy on math benchmarks by adaptively aligning teacher supervision with the student's inference-time distribution.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"org/2025.coling-main.383/. [41] Ling Yang, Zhaochen Yu, Tianjun Zhang, Minkai Xu, Joseph E. Gonzalez, Bin CUI, and Shuicheng YAN. Supercorrect: Advancing small LLM reasoning with thought template distillation and self-correction. In The Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview.net/forum?id=PyjZO7oSw2. [42] Ning Yang, Hai Lin, Yibo Liu, Baoliang Tian, Guoqing Liu, and Haijun Zhang. Token- importance guided direct preference optimization. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=cMEnMV vMw9. [43] Eunseop Yoon, Hee Suk Yoon, SooHwan Eom, Gunsoo Han, Daniel Nam, Daejin Jo, Kyoung-"},{"citing_arxiv_id":"2605.13149","ref_index":55,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"AcquisitionSynthesis: Targeted Data Generation using Acquisition Functions","primary_cat":"cs.CL","submitted_at":"2026-05-13T08:15:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AcquisitionSynthesis uses acquisition functions as rewards to train generators that produce higher-quality synthetic data, delivering 2-7% gains on math, medical QA, and coding tasks with improved robustness to forgetting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06856","ref_index":83,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility","primary_cat":"cs.LG","submitted_at":"2026-05-07T18:56:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Kochenderfer, Sanmi Koyejo, Mrinmaya Sachan, Stella Biderman, Zeerak Talat, Avijit Ghosh, and Irene Solaiman. When ai benchmarks plateau: A systematic study of benchmark saturation, 2026. URLhttps://arxiv.org/abs/2602.16763. Lisanne Bainbridge. Ironies of automation.Automatica, 19(6):775-779, 1983. ISSN 0005-1098. doi: https://doi.org/10.1016/0005-1098(83)90046-8. URL https://www.sciencedirect.com/ science/article/pii/0005109883900468. Gagan Bansal, Tongshuang Wu, Joyce Zhou, Raymond Fok, Besmira Nushi, Ece Kamar, Marco Ribeiro, and Dan Weld. Does the whole exceed its parts? the effect of ai explanations on complementary team performance. InProc. ACM Human Factors in Computing Sys- tems (CHI), 2021."},{"citing_arxiv_id":"2605.04221","ref_index":40,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Self-Prompting Small Language Models for Privacy-Sensitive Clinical Information Extraction","primary_cat":"cs.CL","submitted_at":"2026-05-05T19:03:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Self-prompting combined with QLoRA and DPO on small open-weight models yields micro F1 scores up to 0.864 on clinical named entity recognition from 1,200 dental notes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01374","ref_index":67,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"MTA: Multi-Granular Trajectory Alignment for Large Language Model Distillation","primary_cat":"cs.CL","submitted_at":"2026-05-02T10:52:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MTA is a distillation method that aligns teacher-student LLM representations along their transformation trajectories using layer-adaptive granularities and dynamic structural plus hidden representation alignment losses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01205","ref_index":67,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"SRA: Span Representation Alignment for Large Language Model Distillation","primary_cat":"cs.CL","submitted_at":"2026-05-02T02:44:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SRA reframes CTKD by aligning attention-weighted span centers of mass in a multi-particle system model with geometric regularization and span logit distillation, claiming consistent outperformance over baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17366","ref_index":74,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"ArgBench: Benchmarking LLMs on Computational Argumentation Tasks","primary_cat":"cs.CL","submitted_at":"2026-04-19T10:23:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12770","ref_index":46,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Teaching LLMs Human-Like Editing of Inappropriate Argumentation via Reinforcement Learning","primary_cat":"cs.CL","submitted_at":"2026-04-14T14:10:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Reinforcement learning with a multi-part reward teaches LLMs to output independent, meaning-preserving sentence edits that raise argument appropriateness close to full rewriting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.23206","ref_index":31,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"High-quality generation of dynamic game content via small language models: A proof of concept","primary_cat":"cs.AI","submitted_at":"2026-01-30T17:30:59+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Proof-of-concept shows fine-tuned small language models achieve adequate quality for real-time game content generation in a scoped RPG loop via retry-until-success and LLM-as-judge evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.08592","ref_index":2,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Less Diverse, Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models","primary_cat":"cs.CL","submitted_at":"2025-10-04T20:01:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Curtailing diversity in candidate pools for test-time scaling increases unsafe LLM outputs, as demonstrated by a reference-guided reduction protocol that evades standard safety classifiers across open and closed models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.02657","ref_index":33,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Less LLM, More Documents: Searching for Improved RAG","primary_cat":"cs.IR","submitted_at":"2025-10-03T01:26:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Corpus scaling in RAG frequently matches the accuracy gains from larger LLMs on open-domain QA tasks, with mid-sized models benefiting most due to better passage coverage.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.14234","ref_index":25,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision","primary_cat":"cs.LG","submitted_at":"2025-09-17T17:59:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Parallel inference rollouts aggregated into pseudo-references enable reference-free RL supervision that matches expert-annotated performance on health tasks while using 9x less test-time compute.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.01943","ref_index":21,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"OpenCodeReasoning: Advancing Data Distillation for Competitive Coding","primary_cat":"cs.CL","submitted_at":"2025-04-02T17:50:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A new open SFT dataset for reasoning distillation lets coding models hit state-of-the-art scores on LiveCodeBench and CodeContests with supervised fine-tuning alone, outperforming RL-trained baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.21074","ref_index":128,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation","primary_cat":"cs.CL","submitted_at":"2025-02-28T14:07:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CODI compresses explicit CoT into continuous space via self-distillation and is the first implicit method to match explicit CoT performance on GSM8k at GPT-2 scale with 3.1x compression and 28.2% higher accuracy than prior implicit approaches.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.18449","ref_index":103,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution","primary_cat":"cs.SE","submitted_at":"2025-02-25T18:45:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.01456","ref_index":106,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Process Reinforcement through Implicit Rewards","primary_cat":"cs.LG","submitted_at":"2025-02-03T15:43:48+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2411.15594","ref_index":165,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"A Survey on LLM-as-a-Judge","primary_cat":"cs.CL","submitted_at":"2024-11-23T16:03:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"2Normalizing the output logits.LLM-as-a-Judge in the intermediate steps with Yes/No setting often normalizes the output logits to obtain the evaluation in the form of a continuous decimal between 0 and 1. This is also very common in agent methods and prompt-based optimization methods [43, 165, 225]. For example, the self-consistency and self-reflection scores [165] within one forward pass of MEvaluator, are effectively obtained by constructing a prompt [(𝑥⊕ C ) ,\"Yes\"] and acquire the probability of each token conditioned on the previous tokens 𝑃(𝑡 𝑖 |𝑡<𝑖). The auto- regressive feature is leveraged, thus aggregate the probability of the relevant tokens to compute the self-consistent score 𝜌Self-consistency and self-reflection score 𝜌Self-reflection."},{"citing_arxiv_id":"2408.15549","ref_index":38,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"WildFeedback: Aligning LLMs With In-situ User Interactions And Feedback","primary_cat":"cs.CL","submitted_at":"2024-08-28T05:53:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"WildFeedback extracts preference pairs from in-situ user feedback in LLM conversations to fine-tune models for better alignment with real user preferences.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.10020","ref_index":119,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Self-Rewarding Language Models","primary_cat":"cs.CL","submitted_at":"2024-01-18T14:43:47+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}