{"total":18,"items":[{"citing_arxiv_id":"2605.11629","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OmniThoughtVis: A Scalable Distillation Pipeline for Deployable Multimodal Reasoning Models","primary_cat":"cs.CL","submitted_at":"2026-05-12T06:54:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OmniThoughtVis curates 1.8M multimodal CoT samples via teacher distillation, difficulty annotation, and tag-based sampling, yielding consistent gains on nine reasoning benchmarks and allowing 4B models to match or beat undistilled 8B baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10999","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SkillGen: Verified Inference-Time Agent Skill Synthesis","primary_cat":"cs.LG","submitted_at":"2026-05-09T19:24:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SkillGen synthesizes auditable skills from agent trajectories via contrastive induction on successes and failures, then verifies net performance impact by comparing outcomes with and without the skill on identical tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08838","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Generating Leakage-Free Benchmarks for Robust RAG Evaluation","primary_cat":"cs.CL","submitted_at":"2026-05-09T09:48:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SeedRG generates novel, leakage-free RAG benchmark examples from seed data by mapping reasoning structures and swapping entities while applying consistency and leakage checks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08472","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models","primary_cat":"cs.AI","submitted_at":"2026-05-08T20:46:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06987","ref_index":153,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Response Time Enhances Alignment with Heterogeneous Preferences","primary_cat":"cs.LG","submitted_at":"2026-05-07T22:05:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06638","ref_index":3,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key","primary_cat":"cs.AI","submitted_at":"2026-05-07T17:48:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24819","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora","primary_cat":"cs.SE","submitted_at":"2026-04-27T14:05:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Structured knowledge extracted from corpora enables test-driven data engineering for LLMs by mapping training data to source code, model training to compilation, benchmarking to unit testing, and failures to targeted data repairs, demonstrated across 16 disciplines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19488","ref_index":74,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CoDA: Towards Effective Cross-domain Knowledge Transfer via CoT-guided Domain Adaptation","primary_cat":"cs.AI","submitted_at":"2026-04-21T14:10:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoDA aligns cross-domain latent reasoning representations in LLMs via CoT distillation and MMD to enable effective knowledge transfer without in-domain demonstrations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18936","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Fine-Tuning Small Reasoning Models for Quantum Field Theory","primary_cat":"cs.LG","submitted_at":"2026-04-21T00:21:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"The foundational idea of knowledge distillation [36] has been adapted to the LLM setting: Orca [37] showed that training a 13B model on explanation traces from GPT-4 enables it to surpass larger instruction-tuned baselines, and Distilling Step-by-Step [38] demonstrated that a 770M model trained on LLM-generated rationales can outperform the 540B PaLM teacher, while MiniLLM [39] introduced on-policy distillation to reduce the train-test distribution mismatch inherent in standard SFT. DeepSeek-R1 [40] demonstrated that by distilling long reasoning traces from its RL-trained model into smaller architectures via SFT, the distilled Qwen2.5-14B model outperformed QwQ-32B-Preview across reasoning benchmarks. Subsequent analyses have investigated the key factors governing successful CoT distillation, finding that"},{"citing_arxiv_id":"2605.05227","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods","primary_cat":"cs.LG","submitted_at":"2026-04-19T14:23:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04078","ref_index":21,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Validity-Calibrated Reasoning Distillation","primary_cat":"cs.LG","submitted_at":"2026-04-14T12:32:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Validity-calibrated reasoning distillation improves transfer of reasoning skills by modulating updates based on relative local validity of next steps instead of enforcing full trajectory imitation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10720","ref_index":33,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Teaching Language Models How to Code Like Learners: Conversational Serialization for Student Simulation","primary_cat":"cs.AI","submitted_at":"2026-04-12T16:38:42+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Training open-weight LLMs on conversational serializations of authentic student programming submissions produces artificial learners that better replicate real debugging behavior than code-only baselines or prompted large models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10079","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-11T07:55:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Supervised fine-tuning of LLMs often fails to fully internalize all training instances due to five recurring causes including missing prerequisites and data conflicts, as diagnosed via a new framework across multiple models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07941","ref_index":82,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning","primary_cat":"cs.CL","submitted_at":"2026-04-09T08:00:37+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06833","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FedDetox: Robust Federated SLM Alignment via On-Device Data Sanitization","primary_cat":"cs.CR","submitted_at":"2026-04-08T08:51:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FedDetox uses on-device knowledge-distilled classifiers to sanitize toxic data in federated SLM training, preserving safety alignment comparable to centralized baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.15202","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models","primary_cat":"cs.CL","submitted_at":"2025-08-21T03:31:11+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.06196","ref_index":96,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Large Language Models: A Survey","primary_cat":"cs.CL","submitted_at":"2024-02-09T05:37:09+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Pythia: In [95], Biderman et al. introduced Pythia, a suite of 16 LLMs all trained on public data seen in the exact same order and ranging in size from 70M to 12B parameters. We provide public access to 154 checkpoints for each one of the 16 models, alongside tools to download and reconstruct their exact training dataloaders for further study. Orca: In [96], Mukherjee et al. develop Orca, a 13-billion parameter model that learns to imitate the reasoning process of large foundation models. Orca learns from rich signals from GPT-4 including explanation traces; step-by-step thought processes; and other complex instructions, guided by teacher assistance from ChatGPT. StarCoder: In [97], Li et al. introduced StarCoder and"},{"citing_arxiv_id":"2306.11644","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Textbooks Are All You Need","primary_cat":"cs.CL","submitted_at":"2023-06-20T16:14:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}