{"work":{"id":"d8432992-4980-4a81-85c7-9fa2c2b87f85","openalex_id":null,"doi":null,"arxiv_id":"2412.15115","raw_key":null,"title":"Qwen2.5 Technical Report","authors":null,"authors_text":"arXiv preprint arXiv:2412","year":2024,"venue":"cs.CL","abstract":"In this report, we introduce Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen 2.5 has been significantly improved during both the pre-training and post-training stages. In terms of pre-training, we have scaled the high-quality pre-training datasets from the previous 7 trillion tokens to 18 trillion tokens. This provides a strong foundation for common sense, expert knowledge, and reasoning capabilities. In terms of post-training, we implement intricate supervised finetuning with over 1 million samples, as well as multistage reinforcement learning. Post-training techniques enhance human preference, and notably improve long text generation, structural data analysis, and instruction following. To handle diverse and varied use cases effectively, we present Qwen2.5 LLM series in rich sizes. Open-weight offerings include base and instruction-tuned models, with quantized versions available. In addition, for hosted solutions, the proprietary models currently include two mixture-of-experts (MoE) variants: Qwen2.5-Turbo and Qwen2.5-Plus, both available from Alibaba Cloud Model Studio. Qwen2.5 has demonstrated top-tier performance on a wide range of benchmarks evaluating language understanding, reasoning, mathematics, coding, human preference alignment, etc. Specifically, the open-weight flagship Qwen2.5-72B-Instruct outperforms a number of open and proprietary models and demonstrates competitive performance to the state-of-the-art open-weight model, Llama-3-405B-Instruct, which is around 5 times larger. Qwen2.5-Turbo and Qwen2.5-Plus offer superior cost-effectiveness while performing competitively against GPT-4o-mini and GPT-4o respectively. 
Additionally, as the foundation, Qwen2.5 models have been instrumental in training specialized models such as Qwen2.5-Math, Qwen2.5-Coder, QwQ, and multimodal models.","external_url":"https://arxiv.org/abs/2412.15115","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-07-03T08:17:45.739200+00:00","pith_arxiv_id":"2412.15115","created_at":"2026-05-09T01:59:34.613693+00:00","updated_at":"2026-07-03T08:17:45.739200+00:00","title_quality_ok":false,"display_title":"Qwen2.5 Technical Report","render_title":"Qwen2.5 Technical Report"},"hub":{"state":{"work_id":"d8432992-4980-4a81-85c7-9fa2c2b87f85","tier":"mega_hub","tier_reason":"1,000+ Pith inbound or 100,000+ external citations","pith_inbound_count":1020,"external_cited_by_count":null,"distinct_field_count":35,"first_pith_cited_at":"2024-11-29T05:57:37+00:00","last_pith_cited_at":"2026-07-01T15:40:25+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"needed","recognition_status":"needed","updated_at":"2026-07-03T08:33:37.884257+00:00","tier_text":"mega_hub"},"tier":"mega_hub","role_counts":[{"context_role":"background","n":89},{"context_role":"method","n":21},{"context_role":"baseline","n":13},{"context_role":"other","n":8},{"context_role":"dataset","n":7}],"polarity_counts":[{"context_polarity":"background","n":89},{"context_polarity":"use_method","n":20},{"context_polarity":"baseline","n":13},{"context_polarity":"unclear","n":9},{"context_polarity":"use_dataset","n":7}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Qwen2.5 Technical Report","claims":[{"claim_text":"In this report, we introduce Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen 2.5 has been significantly improved during both the pre-training and post-training stages. 
In terms of pre-training, we have scaled the high-quality pre-training datasets from the previous 7 trillion tokens to 18 trillion tokens. This provides a strong foundation for common sense, expert knowledge, and reasoning capabilities. In terms of post-training, we implement intricate supervised finetuning with over 1 million samples, as well ","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Qwen2.5 Technical Report because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T18:33:50.723899+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"a4c48503-6cf8-4708-8934-1671fb24d7ad","orcid":null,"display_name":"arXiv preprint arXiv:2412"}]},"error":null,"updated_at":"2026-05-13T18:33:50.721178+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-13T18:33:50.602831+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":121},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":87},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":77},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":72},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":64},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":57},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":48},{"title":"Proximal 
Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":48},{"title":"Mistral 7B","work_id":"eb5e1305-ad11-4875-ad8d-ad8b8f697599","shared_citers":38},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":33},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":30},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":29},{"title":"GPT-4o System Card","work_id":"f37bf1c7-4964-4e56-9762-d20da8d9009f","shared_citers":27},{"title":"Gemma 2: Improving Open Language Models at a Practical Size","work_id":"4dd94e2f-2b27-4cbf-88a0-4910f0772a57","shared_citers":26},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":25},{"title":"OpenAI o1 System Card","work_id":"68d3c334-0fc9-49e3-b7b0-a69afae933e2","shared_citers":25},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":24},{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":23},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":20},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":20},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":20},{"title":"Gemma 3 Technical Report","work_id":"f93e08bf-9e96-409b-8ac6-b8385fd17fd7","shared_citers":19},{"title":"Measuring Mathematical Problem Solving With the MATH 
Dataset","work_id":"50652ac6-fb7c-4675-a2c2-159c241feb17","shared_citers":19},{"title":"Instruction-Following Evaluation for Large Language Models","work_id":"3aa06177-125a-4f5a-8f4a-8070c5986c26","shared_citers":18}],"time_series":[{"n":19,"year":2025},{"n":322,"year":2026}]},"error":null,"updated_at":"2026-05-13T17:25:56.326379+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"fixed":1,"items":[{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-13T18:33:49.949937+00:00"},"reader_index":{"job_type":"reader_index","status":"succeeded","result":{"note":"annotated reader requires full-text/OA fetch; shell is wired for mega hubs","status":"reader queued"},"error":null,"updated_at":"2026-07-03T00:53:38.091970+00:00"},"recognition_alignment":{"job_type":"recognition_alignment","status":"succeeded","result":{"modules":["IndisputableMonolith.Sport.PeakPerformanceFromJCost","IndisputableMonolith.Sports.PeakPerformanceFromPhiLadder","IndisputableMonolith.Education.PedagogyModelsFromConfigDim","IndisputableMonolith.Chemistry.VanDerWaals","IndisputableMonolith.Physics.GrandUnificationFromRS","IndisputableMonolith.Sociology.DunbarFromBandwidth","IndisputableMonolith.Foundation.AlexanderDualityProof","IndisputableMonolith.Mathematics.LanglandsFromRecognitionCost"],"query_chars":1935},"error":null,"updated_at":"2026-07-03T00:53:37.675010+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Qwen2.5 Technical Report","claims":[{"claim_text":"In this report, we introduce Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen 2.5 has been significantly improved during both the pre-training and post-training stages. 
In terms of pre-training, we have scaled the high-quality pre-training datasets from the previous 7 trillion tokens to 18 trillion tokens. This provides a strong foundation for common sense, expert knowledge, and reasoning capabilities. In terms of post-training, we implement intricate supervised finetuning with over 1 million samples, as well ","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Qwen2.5 Technical Report because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T18:33:50.605546+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Qwen2.5 Technical Report","claims":[{"claim_text":"In this report, we introduce Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen 2.5 has been significantly improved during both the pre-training and post-training stages. In terms of pre-training, we have scaled the high-quality pre-training datasets from the previous 7 trillion tokens to 18 trillion tokens. This provides a strong foundation for common sense, expert knowledge, and reasoning capabilities. In terms of post-training, we implement intricate supervised finetuning with over 1 million samples, as well ","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Qwen2.5 Technical Report because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T17:25:52.725761+00:00"}},"summary":{"title":"Qwen2.5 Technical Report","claims":[{"claim_text":"In this report, we introduce Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen 2.5 has been significantly improved during both the pre-training and post-training stages. 
In terms of pre-training, we have scaled the high-quality pre-training datasets from the previous 7 trillion tokens to 18 trillion tokens. This provides a strong foundation for common sense, expert knowledge, and reasoning capabilities. In terms of post-training, we implement intricate supervised finetuning with over 1 million samples, as well ","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Qwen2.5 Technical Report because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":121},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":87},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":77},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":72},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":64},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":57},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":48},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":48},{"title":"Mistral 7B","work_id":"eb5e1305-ad11-4875-ad8d-ad8b8f697599","shared_citers":38},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":33},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":30},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at 
Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":29},{"title":"GPT-4o System Card","work_id":"f37bf1c7-4964-4e56-9762-d20da8d9009f","shared_citers":27},{"title":"Gemma 2: Improving Open Language Models at a Practical Size","work_id":"4dd94e2f-2b27-4cbf-88a0-4910f0772a57","shared_citers":26},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":25},{"title":"OpenAI o1 System Card","work_id":"68d3c334-0fc9-49e3-b7b0-a69afae933e2","shared_citers":25},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":24},{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":23},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":20},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":20},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":20},{"title":"Gemma 3 Technical Report","work_id":"f93e08bf-9e96-409b-8ac6-b8385fd17fd7","shared_citers":19},{"title":"Measuring Mathematical Problem Solving With the MATH Dataset","work_id":"50652ac6-fb7c-4675-a2c2-159c241feb17","shared_citers":19},{"title":"Instruction-Following Evaluation for Large Language Models","work_id":"3aa06177-125a-4f5a-8f4a-8070c5986c26","shared_citers":18}],"time_series":[{"n":19,"year":2025},{"n":322,"year":2026}]},"authors":[{"id":"a4c48503-6cf8-4708-8934-1671fb24d7ad","orcid":null,"display_name":"arXiv preprint arXiv:2412","source":"manual","import_confidence":0.72}]},"citers":{"total":1020,"items":[{"citing_arxiv_id":"2607.01084","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Can 
Agents Generalize to the Open World? Unveiling the Fragility of Static Training in Tool Use","primary_cat":"cs.AI","submitted_at":"2026-07-01T15:40:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Static SFT and RL training for tool-use agents leads to performance drops under open-world distributional shifts across perception, interaction, reasoning and internalization; perturbation-augmented fine-tuning is proposed as mitigation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.01061","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Agentic generation of verifiable rules for deterministic, self-expanding reaction classification","primary_cat":"cs.AI","submitted_at":"2026-07-01T15:24:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Multi-agent LLMs generate and verify 14,073 deterministic reaction rules from 665,901 patents, enabling 97.7% classification of unseen reactions with finer resolution than fixed proprietary systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00946","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Geometric Perspective on Composable Emotion Steering in Text-to-Speech Models","primary_cat":"cs.SD","submitted_at":"2026-07-01T13:46:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SLM modules provide a clean low-dimensional emotion subspace with strong speaker-emotion disentanglement while CFM modules show entanglement and poor generalization for activation steering in hybrid 
TTS.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00908","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Activation Alignment: The Alignment-Diversity Tradeoff in Task-Aware LLM Quantization","primary_cat":"cs.LG","submitted_at":"2026-07-01T13:12:21+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TASA improves task-aware mixed-precision LLM quantization by searching calibration data mixtures via gradient-trace alignment and aggregating perplexity plus reasoning sensitivity signals, enabling 3.5-bit models to match or beat 4-bit baselines with over 20-point gains on GSM8K.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00725","ref_index":42,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"What Survives Into Context: A Diagnostic for Budget-Constrained Multi-Hop RAG and When Submodular Evidence Packing Improves It","primary_cat":"cs.CL","submitted_at":"2026-07-01T10:12:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Answer-in-context diagnostic outperforms recall for predicting RAG F1 under budget constraints and a submodular packer yields up to +5.1 F1 gains on HotpotQA for 3B readers when multi-hop structure, retrieval coverage, and weak-reader conditions align.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00604","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Vehicle Routing Problem Meets Large Language Models: An Overview and 
Perspectives","primary_cat":"math.OC","submitted_at":"2026-07-01T08:30:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Survey organizing LLM uses for VRP into modeler, designer, and coordinator roles, covering variants, solvers, benchmarks, and two experiments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00572","ref_index":39,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HARC: Coupling Harmfulness and Refusal Directions for Robust Safety Alignment","primary_cat":"cs.AI","submitted_at":"2026-07-01T07:58:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HARC couples harmfulness and refusal directions across prompt and response positions via subspace fine-tuning, achieving better robustness-capability-usability trade-off than six baselines while transferring across model families.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00465","ref_index":44,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"StochasT: Learning with Stochastic Turn Depth for Visual Instruction Tuning","primary_cat":"cs.CV","submitted_at":"2026-07-01T05:34:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"StochasT uses stochastic clustering of language tasks into varying turn depths for the same image to improve LVLMs on both single-turn and multi-turn scenarios without discarding data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00422","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"KidnapRAG: A Black-Box Attack for Hijacking Reasoning in Agentic 
Retrieval-Augmented Generation Systems","primary_cat":"cs.CR","submitted_at":"2026-07-01T04:32:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"KidnapRAG is a sequential black-box poisoning attack on Agentic RAG systems using Bait, Chain-Link, and Mal-Ins documents to redirect retrieval and reasoning, outperforming prior baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00274","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SEFORA: Student Essays with Feedback Corpus and LLM Feedback Evaluation Framework","primary_cat":"cs.CL","submitted_at":"2026-06-30T23:48:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Releases SEFORA corpus of instructor feedback on college writing and UniMatch evaluation showing no LLM configuration exceeds 0.4 F1 in matching instructor priorities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00260","ref_index":46,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Do Multimodal Large Language Models Need Reasoning to Classify Dementia from Speech?","primary_cat":"eess.AS","submitted_at":"2026-06-30T23:12:47+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00208","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SLIM-RL: Risk-Budgeted Random-Masking RL for Diffusion LLMs Without Trajectory 
Slicing","primary_cat":"cs.CL","submitted_at":"2026-06-30T21:38:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SLIM-RL matches or exceeds TraceRL performance on MATH500, GSM8K, MBPP and HumanEval for diffusion LLMs by risk-budgeted random-masking RL without trajectory slicing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.32017","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TRIAGE: Role-Typed Credit Assignment for Agentic Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-06-30T17:48:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TRIAGE augments GRPO with role-typed segment rewards derived from a judge that detects regression and exploration, yielding higher success rates and fewer turns on ALFWorld, Search-QA, and WebShop.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31608","ref_index":174,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CLExEval: A Human-in-the-Loop Framework for Qualitative Evaluation of LLM Clinical Reasoning","primary_cat":"cs.CL","submitted_at":"2026-06-30T12:56:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CLExEval introduces a human-annotated evaluation framework on 40 rare cases that identifies verbosity bias, hidden knowledge paradox, and 68.6% reasoning-to-output mismatch in LLMs while showing LLM-as-a-Judge overestimates reliability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31599","ref_index":42,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Token-Sparse Medical 
Multimodal Reasoning via Dual-Stream Reinforcement Learning","primary_cat":"cs.CV","submitted_at":"2026-06-30T12:47:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ViToS uses dual-stream RL with cross-feedback optimization to prune medical image tokens to 77% length while reporting 108.27% and 104.16% relative performance on two 7B VLMs across seven benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31413","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning to Select, Not Relearn: Hard-Routed Mixtures of Reasoning LoRAs","primary_cat":"cs.AI","submitted_at":"2026-06-30T09:40:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Hard-Routed MoR-LoRA composes frozen reasoning LoRA experts via hard top-1 routing and a small shared router, preserving expert behavior with fewer trainable parameters than soft-routing mixtures.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31307","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When the Database Fails: Prompting LLM Dialogue Agents for Safe Recovery in Task-Oriented Dialogue","primary_cat":"cs.CL","submitted_at":"2026-06-30T08:18:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Guided-Retry prompting cuts hallucination from 30.5% to 15.3% on MultiWOZ and 20.9% to 12.2% on SGD in LLM dialogue agents facing database failures.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31247","ref_index":236,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FlexiSLM: A Dynamic and 
Controllable Frame Rate Spoken Language Model","primary_cat":"cs.SD","submitted_at":"2026-06-30T07:24:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31168","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Probe Choice Changes Canary-Memorization Verdicts: Three Post-Hoc Disagreement Case Studies in a Text-Dominant LoRA-Tuned Autoregressive Testbed","primary_cat":"cs.CR","submitted_at":"2026-06-30T05:58:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A prefix-window mean-NLL memorization probe disagrees with full-span NLL and exact-recall in three cases on a controlled autoregressive testbed, leading to recommendations for multi-probe reporting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30783","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Security-Fidelity Tradeoffs: The Hidden Cost of Prompt Injection Defense","primary_cat":"cs.CR","submitted_at":"2026-06-29T18:11:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Prompt injection defenses create a security-fidelity tradeoff with no model or defense achieving both high security and high fidelity on the SecFid benchmark across 1,168 
examples.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30642","ref_index":54,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LeVo 2: Stable and Melodious Song Generation via Hierarchical Representation Modeling and Progressive Post-Training","primary_cat":"cs.SD","submitted_at":"2026-06-29T17:59:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LeVo 2 presents a hierarchical LLM-Diffusion model with progressive post-training stages to generate full-length songs that balance semantic planning, track-specific acoustics, and musicality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30556","ref_index":60,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Poller: Are LLMs Suitable for Evaluating the Poetry Understanding Task?","primary_cat":"cs.CL","submitted_at":"2026-06-29T16:51:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Poller reduces LLM-human disagreement in evaluating Chinese poetry understanding by having LLMs role-play as authors, with reported error reductions of 94.55% and 89.53% on rhetorical techniques and defamiliarization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30429","ref_index":45,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Arko-T: A Foundation Model for Text-to-Structured 3D Generation","primary_cat":"cs.LG","submitted_at":"2026-06-29T15:09:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Arko-T is a 4B text-to-CAD model that outperforms seven frontier LLMs on 8 of 12 metrics by aligning training to design-state 
preservation at one-tenth the cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30420","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Experience Augmented Policy Optimization for LLM Reasoning","primary_cat":"cs.LG","submitted_at":"2026-06-29T15:05:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"EAPO reuses prior RL policy experience adaptively at decision points in LLM rollouts with adapted importance sampling and reports gains over prior RLVR methods on math benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30705","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Why Do Few-Step Text Latents Fail When Image Latents Work? Non-Commitment at Sharp Categorical Readouts","primary_cat":"cs.LG","submitted_at":"2026-06-29T14:20:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Few-step deterministic maps on continuous text latents fail because they cannot resolve discrete branch choices before sharp categorical readouts, with failure governed by decoder sharpness rather than transport accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30339","ref_index":153,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"REAR: Test-time Preference Realignment through Reward Decomposition","primary_cat":"cs.CL","submitted_at":"2026-06-29T14:17:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"REAR decomposes the reward into question and preference components, rescales their balance, and expresses the result as a linear 
combination of token log-probabilities for efficient integration with best-of-N and tree search.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30175","ref_index":108,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CORTEX: High-Quality Cross-Domain Organization of Web-Scale Corpora through Ontological Corpus Graph","primary_cat":"cs.CL","submitted_at":"2026-06-29T11:51:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Cortex uses an Ontological Corpus Graph to structure web-scale corpora, creating a refined 24.14B-token corpus and a new benchmark validated on eight LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29863","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"KbSD: Knowledge Boundary aware Self-Distillation for Behavioral Calibration in Agentic Search","primary_cat":"cs.CL","submitted_at":"2026-06-29T06:56:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"KbSD uses a same-size hint-augmented teacher and quadrant-adaptive KL objectives to deliver dense supervision for calibrated behavior across knowledge states in agentic search.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29844","ref_index":51,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MATCH: Modulating Attention via In-Context Retrieval for Long-Context Transformers","primary_cat":"cs.CL","submitted_at":"2026-06-29T06:33:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"MATCH augments sparsified attention with an efficient in-context retrieval system to boost 
performance on long-range recall tasks in transformers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29815","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SrDetection: A Self-Referential Framework for Data Leakage Detection in Code Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-06-29T05:48:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SrDetection detects data leakage in Code LLMs via contrast between original benchmark samples and their semantic variants, reporting F1 gains of 21.52 (gray-box) and 14.46 (black-box) over baselines in a controlled testbed.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29713","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SEVA: Self-Evolving Verification Agent with Process Reward for Fact Attribution","primary_cat":"cs.CL","submitted_at":"2026-06-29T02:37:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SEVA trains a verification agent with a decomposed process reward to produce structured fact attributions, enabling a self-evolution loop that matches GPT-4o-mini F1 on ClearFacts while generating richer output.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29709","ref_index":54,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Bash-Commenter: Leveraging Syntax-Aware Preference Optimization to Reinforce Large Language Model for Bash Code Comment 
Generation","primary_cat":"cs.SE","submitted_at":"2026-06-29T02:24:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Bash-Commenter applies CPT, SFT, and Syntax-Aware Preference Optimization (SAPO) via AST atomic operations to LLaMA-3.1-8B, reporting higher BLEU-4/METEOR/ROUGE-L scores than baselines on single-line and multi-line Bash comment generation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29646","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Fuzzing Large Language Models to Elicit Hidden Behaviours","primary_cat":"cs.LG","submitted_at":"2026-06-28T23:35:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Fuzzing via Gaussian noise on weights or residual activations elicits hidden backdoor behaviors more often than temperature sampling on four of six models, with proxy-task hyperparameter selection via Thompson sampling improving results over uniform sweeps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29571","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Anisotropy Decides Cosine vs. 
Rank Metrics for Text Embeddings","primary_cat":"cs.CL","submitted_at":"2026-06-28T19:24:59+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Anisotropy, quantified by dominant-dimension variance fraction, determines the best parameter-free similarity metric for text embeddings, with rank-based metrics gaining ~20% relative where cosine is weakest.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29502","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"UCOB: Learning to Utilize and Evolve Agentic Skills via Credit-Aware On-Policy Bidirectional Self-Distillation","primary_cat":"cs.AI","submitted_at":"2026-06-28T17:02:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"UCOB improves agentic RL by using return-to-go comparisons between skill-conditioned and no-skill prompts as local teachers for bidirectional self-distillation and skill memory updates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29476","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CRAFT: Counterfactual Credit Assignment from Free Sibling Rollouts for Self-Distilled Agentic Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-06-28T16:11:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CRAFT is a three-pillar credit assignment scheme that uses counterfactual token importance from GRPO sibling rollouts to provide signed per-token distillation signals in self-distilled agentic 
RL.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29308","ref_index":46,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MirrorPPR: Exemplar-Based Portrait Photo Retouching","primary_cat":"cs.CV","submitted_at":"2026-06-28T10:07:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MirrorPPR extracts retouching operations from exemplar pairs via a dedicated extractor and transfers them to query images through a LoRA-adapted Diffusion Transformer, enabled by a new 47-million-pair dataset and self-augmentation for alignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29296","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Process Advantage Signal Shaping: A Paradigm-Agnostic Middleware for Process-Supervised RL in LLM Reasoners","primary_cat":"cs.AI","submitted_at":"2026-06-28T09:36:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PASS middleware independently standardizes process/outcome/format streams, derives value-homogeneous chunks, and converts cumulative returns to average value density, yielding consistent pass@1 gains over GRPO baselines in two domains and two signal paradigms.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29279","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Manufactured Confidence: How Memory Consolidation Turns Hearsay into Confident Facts","primary_cat":"cs.CR","submitted_at":"2026-06-28T08:56:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLM memory consolidation turns casual hedged 
statements into confident facts that agents obey regardless of source or verification.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29139","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"How Token Influence Decays with Distance: A Green-Function View of Trained Language Models","primary_cat":"cs.LG","submitted_at":"2026-06-28T01:00:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Empirical Jacobian analysis reveals that token influence in trained language models decays as a power law with distance (exponent ~0.8), a learned property not present in random models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29090","ref_index":53,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AB-RAG: Adaptive Budgeted Retrieval-Augmented Generation for Reliable Question Answering","primary_cat":"cs.CL","submitted_at":"2026-06-27T21:08:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AB-RAG adaptively budgets retrieval in RAG by combining three confidence signals to decide when to stop or fetch more evidence, separating correct from incorrect answers at 57.6% vs 0% exact match on a factoid dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29014","ref_index":118,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Customized Generative AI Agent for Transportation Engineering Practice: A Development and Continued Pre-training Guideline","primary_cat":"cs.AI","submitted_at":"2026-06-27T17:22:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A 
framework is described for adapting six LLMs to transportation engineering via LoRA-based continued pretraining on domain documents, with two models showing strongest results on BLEU-4 and ROUGE metrics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29013","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mural: Transferring LLM knowledge to image generation via Mixture-of-Transformers","primary_cat":"cs.CV","submitted_at":"2026-06-27T17:18:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mural transfers knowledge from a frozen LLM to text-to-image synthesis via MoT shared attention, achieving 0.85 GenEval, 86.75 DPG-Bench, and 0.66 WISE while exhibiting emergent behaviors without multimodal or reasoning supervision.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28876","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Memory-Managed Long-Context Attention: A Preliminary Study of Editable Request-Local Memory","primary_cat":"cs.CL","submitted_at":"2026-06-27T11:38:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A hybrid attention mechanism with editable request-local memory slots and sparse fallback achieves high accuracy on synthetic overwrite, version, and anti-pollution tasks where pure fixed-state or sparse methods fail, while identifying open-domain selection as the remaining bottleneck.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28707","ref_index":123,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"BV-Blend: Uncertainty-Weighted Historical Baselines for Stable Critic-Free RL with 
Verifiable Rewards","primary_cat":"cs.AI","submitted_at":"2026-06-27T03:25:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"BV-Blend blends prompt-local and semantic-cluster historical reward statistics via SEM-derived weights to stabilize critic-free RL advantage estimation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28615","ref_index":91,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"What LLMs explain is not what they believe: Evaluating explanation sufficiency under models' own input beliefs","primary_cat":"cs.LG","submitted_at":"2026-06-26T21:14:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Proposes SCSuff metric for evaluating LLM explanation sufficiency via model-generated alternative inputs, showing explanations are typically insufficient and predictable from hidden states.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28565","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"KernelSight-LM: A Kernel-Level LLM Inference Simulator","primary_cat":"cs.PF","submitted_at":"2026-06-26T19:43:38+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28551","ref_index":241,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DataComp-VLM: Improved Open Datasets for Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-06-26T19:11:29+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"DataComp-VLM benchmark shows 
instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28548","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Turn-Averaged SAEs for Feature Discovery and Long-Context Attribution","primary_cat":"cs.CL","submitted_at":"2026-06-26T19:07:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Turn-averaged SAEs reconstruct average activations over conversation turns to represent high-level turn characteristics with a fixed number of features, simplifying long-context interpretability compared to per-token SAEs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28249","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HPRO: Hierarchical Progressive Reward Optimization via Preference Extraction for Emotional Text-to-Speech","primary_cat":"eess.AS","submitted_at":"2026-06-26T16:35:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HPRO uses a differentiable HD-Emo codec to extract separate content and style tokens and progressively aligns frame-, word-, and sentence-level rewards to improve emotional expressiveness in TTS while preserving intelligibility.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28186","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Cognitive Episodes in LLM Reasoning Traces Enable Interpretable Human Item Difficulty 
Prediction","primary_cat":"cs.CL","submitted_at":"2026-06-26T15:32:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Epi2Diff extracts cognitive episode sequences from LRM reasoning traces and combines them with semantic features to predict human item difficulty, outperforming baselines on four educational datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27708","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ZooClaw-FashionSigLIP2: Distilled Fine-tuning for Robust Fashion Retrieval","primary_cat":"cs.CV","submitted_at":"2026-06-26T04:13:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"ZooClaw-FashionSigLIP2 applies distilled full fine-tuning plus WiseFT interpolation to SigLIP2-base and reports outperforming LoRA, larger backbones, and external data on fashion retrieval benchmarks while releasing a new benchmark and bias analysis.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27705","ref_index":59,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mitigating Position Bias in Transformers via Layer-Specific Positional Embedding Scaling","primary_cat":"cs.CL","submitted_at":"2026-06-26T04:07:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LPES uses per-layer scaling factors optimized by a genetic algorithm with Bézier curves to balance attention and improve long-context LLM performance by up to 11.2% on key-value 
retrieval.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27684","ref_index":43,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Intuition-Guided Latent Reasoning for LLM-Based Recommendation","primary_cat":"cs.IR","submitted_at":"2026-06-26T03:29:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"IntuRec anchors LLM latent reasoning for recommendation by deriving an intuition embedding from top-K candidates via self- and cross-attention to initialize more accurate trajectories.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27595","ref_index":42,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Ko-WideSearch: A Korean Breadth-Search Benchmark for Exhaustive Set Enumeration by Web Agents","primary_cat":"cs.CL","submitted_at":"2026-06-25T22:51:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Ko-WideSearch is a new Korean breadth-search benchmark spanning 16 categories and three difficulty tiers that evaluates web agents on full set membership plus per-item attributes, showing consistent gaps between set recovery and row completion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27578","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PEBS: Per-rater Empirical-Bayes Shrinkage for RLHF Reward-Model Calibration","primary_cat":"cs.LG","submitted_at":"2026-06-25T22:09:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"UNKNOWN","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PEBS applies Morris-James-Stein empirical-Bayes shrinkage to per-rater affine calibrators in RLHF, cutting within-user held-out 
RMSE by 8.58% on PRISM and 9.66% on PluriHarms versus pooled baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27527","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Large Language Model Teaches Visual Students: Cross-Modality Transfer of Fine-Grained Conceptual Knowledge","primary_cat":"cs.CV","submitted_at":"2026-06-25T20:19:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LaViD distills LLM conceptual knowledge to vision models via LLM-generated MCQ soft labels, outperforming vision-language distillation baselines on fine-grained benchmarks while improving robustness on spurious correlation datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27483","ref_index":60,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Internalizing the Future: A Unified Agentic Training Paradigm for World Model Planning","primary_cat":"cs.AI","submitted_at":"2026-06-25T19:05:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A three-stage training pipeline internalizes world-model simulation and success estimation in LLM agents for improved planning on search and math tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27472","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Supersede: Diagnosing and Training the Memory-Update Gap in LLM Agents","primary_cat":"cs.CL","submitted_at":"2026-06-25T18:50:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The supersession gap in LLM agents—failing to use current facts and discard 
superseded ones—is a distinct failure not fixed by scale or memory size, but improvable via RL training on a new environment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.26530","ref_index":45,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DiARC: Distinguishing Positive and Negative Samples Helps Improving ARC-like Reasoning Ability of Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-06-25T02:10:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DiARC improves LLM performance on ARC-like benchmarks by constructing and training on preference pairs from three types of negative samples while keeping demonstrations fixed.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.25041","ref_index":43,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models","primary_cat":"cs.CV","submitted_at":"2026-06-23T18:01:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Wan-Streamer is a unified end-to-end Transformer for low-latency streaming audio-visual interaction using block-causal attention on interleaved multimodal tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.24790","ref_index":62,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Grad Detect: Gradient-Based Hallucination Detection in LLMs","primary_cat":"cs.LG","submitted_at":"2026-06-23T16:46:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Grad Detect uses internal gradient patterns from one inference pass to predict LLM 
hallucinations and abstention, outperforming confidence and sampling baselines on Q&A benchmarks with most signal in the final five layers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.24253","ref_index":80,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TuringViT: Making SOTA Vision Transformers Accessible to All","primary_cat":"cs.CV","submitted_at":"2026-06-23T07:42:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TuringViT claims a new ViT design with linear attention and curated data that matches SOTA performance using 10% of typical pretraining data while supporting dynamic resolutions and improving VLM integration.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.22873","ref_index":214,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SingGuard: A Policy-Adaptive Multimodal LLM Guardrail with Dynamic Reasoning","primary_cat":"cs.CV","submitted_at":"2026-06-22T05:37:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SingGuard introduces a policy-adaptive multimodal LLM guardrail with dynamic reasoning regimes and SingGuard-Bench, reporting SOTA F1 scores across 35 datasets and improved policy-following accuracy under runtime shifts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.21255","ref_index":43,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SCOPE: Sequential Conformal Probing for Reliable OOD Rejection in LLM Services","primary_cat":"cs.CL","submitted_at":"2026-06-19T09:31:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SCOPE selects 
readable hidden layers, constructs conformal gates with IND calibration, and uses supermartingale e-processes to certify persistent service-boundary evidence, improving rejection over final-layer detectors across multiple LLMs and boundary conditions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.19735","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GLARE: A Natural Language Interface for Querying Global Explanations","primary_cat":"cs.AI","submitted_at":"2026-06-18T02:58:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"GLARE is an LLM-mediated natural language interface that converts user questions into SQL queries over local explanation data to enable flexible access to aggregated global explanations for black-box vision models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.19667","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CacheWeaver: Cache-Aware Evidence Ordering for Efficient Grounded RAG Inference","primary_cat":"cs.CL","submitted_at":"2026-06-18T00:38:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"CacheWeaver is a lightweight scheduling layer that orders evidence to exploit prefix caching, reducing median TTFT by 20-33% across vLLM setups while preserving answer quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.17660","ref_index":39,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TuneAhead: Predicting Fine-tuning Performance Before Full Training 
Begins","primary_cat":"cs.LG","submitted_at":"2026-06-16T08:21:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TUNEAHEAD predicts fine-tuning performance from meta-features and short probes, reporting RMSE 1.47 and 95.1% of predictions within 3 points on 370 held-out runs of Qwen2.5-7B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.17649","ref_index":48,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Risk Decomposition Framework for Pre-Hoc Fine-Tuning Prediction","primary_cat":"cs.LG","submitted_at":"2026-06-16T08:07:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Formulates pre-hoc fine-tuning prediction as stochastic estimation, proves lower bound on optimization variance decay rate, and introduces a three-regime predictability phase diagram.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28370","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Conversational Query Engine for Mixed-Modality Heterogeneous Enterprise Data Sources","primary_cat":"cs.IR","submitted_at":"2026-06-15T17:02:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"COGNI is a production conversational BI system with indexing, routing, retrieval, and caching layers that reports 88-94% accuracy metrics on internal enterprise benchmarks for mixed structured and unstructured data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.16620","ref_index":9,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Entropy-Gated Latent 
Recursion","primary_cat":"cs.LG","submitted_at":"2026-06-15T12:14:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"EGLR adds a deterministic layer-recursion axis gated by entropy that is complementary to temperature sampling, raising joint oracle accuracy on MATH-500 from 83.4% to 91.6% for a 3B model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.16364","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Looking Is Not Picking: An Attention-Segment Account of Tool-Selection Failures in LLM Agents","primary_cat":"cs.AI","submitted_at":"2026-06-15T07:58:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Attention analysis shows that LLM tool selection failures occur at the readout/decision stage, not because the model fails to attend to the correct tool definition.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.15079","ref_index":151,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale","primary_cat":"cs.CL","submitted_at":"2026-06-13T03:21:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Technical report announcing Ling-2.6 and Ring-2.6 models with hybrid linear attention, evolutionary CoT, and KPop RL for efficient agentic intelligence at scale.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.19364","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Closing the Social-Semantic Gap: SPSD for Edge-Based Prompt Compression in 
Cloud LLM Inference","primary_cat":"cs.LG","submitted_at":"2026-06-10T09:13:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"SPSD uses a 4-bit SLM on edge to distill prompts, saving mean 99.9 tokens per call with non-inferior response quality per LLM judge on 248-prompt corpus.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11675","ref_index":46,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Lung-R1: A Knowledge Graph-Guided LLM for Pulmonary Diagnostic Reasoning","primary_cat":"cs.AI","submitted_at":"2026-06-10T05:39:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Introduces the first structured pulmonary knowledge graph LungKG and uses it to train Lung-R1, which reaches SOTA on EMR-based pulmonary diagnosis tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11033","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AuRA: Internalizing Audio Understanding into LLMs as LoRA","primary_cat":"cs.LG","submitted_at":"2026-06-09T16:05:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AuRA uses LoRA and layer-wise distillation from an ASR teacher to internalize audio encoding into LLMs for improved speech-language performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11023","ref_index":55,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Generative Archetype-Grounded Item Representations for Sequential 
Recommendation","primary_cat":"cs.IR","submitted_at":"2026-06-09T15:59:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GenAIR generates LLM-derived archetype embeddings for items and applies behavioral calibration to close the semantic-behavioral gap, yielding performance gains on three real-world datasets when integrated with existing sequential models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11015","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Structure from Reasoning, Numbers from Search: On-Premise Open LLMs as Structural Priors for Coupled MIMO Controller Tuning","primary_cat":"cs.AI","submitted_at":"2026-06-09T15:53:40+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Open LLMs function as structural priors for MIMO controller tuning by proposing asymmetric structures on coupled plants, reaching better penalized cost with fewer evaluations than pure optimization or classical methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10935","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CLP: Collocation-Length Prediction for Zero-Loss Adaptive Multi-Token Inference","primary_cat":"cs.LG","submitted_at":"2026-06-09T14:45:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CLP is a lightweight linear predictor for safe multi-token spans in LLM decoding that delivers 1.14x-1.29x speedup on Qwen2.5 models with zero measured quality 
degradation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10931","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO","primary_cat":"cs.CL","submitted_at":"2026-06-09T14:44:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"One-shot GRPO on a single biased example induces generalizing stereotype bias in post-trained LLMs, with susceptibility varying by initial bias likelihood.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10747","ref_index":39,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment","primary_cat":"cs.AI","submitted_at":"2026-06-09T11:57:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces the Arbiter agent for budget-constrained real-time detection of emergent misalignment in multi-agent conversations, with evaluations showing reliable early detection aided by active inspection tools.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10722","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLMs","primary_cat":"cs.CL","submitted_at":"2026-06-09T11:32:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Continual training recipe upcycles dense Qwen2.5-8B LLM to 4x channel-sparse model via predictor-gated bank-wise sparsity in SwiGLU FFN with a single-layer 
repair for long-context failure on RULER-CWE.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10684","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Divide and Cooperate: Role-Decomposed Multi-Agent LLM Training with Cross-Agent Learning Signals","primary_cat":"cs.LG","submitted_at":"2026-06-09T10:40:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DAC decomposes agentic search into cooperative searcher and generator agents with cross-agent signals (abstention reward and hard-positive augmentation), achieving strong QA benchmark performance via LoRA on a shared backbone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10610","ref_index":47,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Small Data, Big Noise: Adversarial Training for Robust Parameter-Efficient Fine-Tuning","primary_cat":"cs.CL","submitted_at":"2026-06-09T09:11:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SDBN introduces adversarial training to PEFT via two variants using character-level edits and LLM-generated perturbations, claiming improved robustness and generalization on NLP benchmarks in low-resource noisy settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10507","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HIPIF: Hierarchical Planning and Information Folding for Long-Horizon LLM Agent Learning","primary_cat":"cs.AI","submitted_at":"2026-06-09T07:35:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"HIPIF trains LLM agents end-to-end 
using subgoal-based hierarchical planning and information folding of completed histories, plus hierarchical reflection and process rewards, to handle long-horizon tasks without auxiliary models or expert trajectories.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11270","ref_index":12,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Quantifying Subliminal Behavioral Transfer Ratios in Language Model Distillation","primary_cat":"cs.LG","submitted_at":"2026-06-09T06:52:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Steering Llama-2-7B-Chat and Qwen2.5-7B-Instruct teachers and distilling students on benign data transfers measurable jailbreak susceptibility, with Llama showing threshold behavior at α = -0.15 and Qwen reaching transfer ratios up to 0.61.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10415","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RATrain: A Resource-Aware Training Runtime for Large Language Models on Bandwidth-Constrained Heterogeneous Supercomputing Platforms","primary_cat":"cs.DC","submitted_at":"2026-06-09T04:42:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RATrain introduces a resource-aware scheduler and MT-3000-specific backend for 1F1B LLM training that achieves 1.35x speedup and 97% scaling efficiency while preserving training correctness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10385","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Absolute Imitation: Anchored Residual Guidance for Privileged On-Policy 
Distillation","primary_cat":"cs.LG","submitted_at":"2026-06-09T03:51:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AR-OPD disentangles privileged supervision via anchored residual guidance to reduce hindsight leakage in on-policy distillation, reporting gains of 2.3 points over full privileged OPD and 7.9 over SFT on reasoning tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10285","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OpenRTLSet: A Fully Open-Source Dataset for Large Language Model-based Verilog Module Design","primary_cat":"cs.CL","submitted_at":"2026-06-09T01:17:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OpenRTLSet supplies 131k+ Verilog samples with AI-generated descriptions to enable fine-tuning of LLMs for hardware module design.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10184","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Dropout-GRPO: Variational Stochasticity for Continuous Latent Reasoning","primary_cat":"cs.LG","submitted_at":"2026-06-08T21:21:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Dropout-GRPO uses structured dropout to generate trajectory variance for GRPO in latent-reasoning models like Coconut, raising GSM8K pass@1 from 27.29% to 29.01%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10078","ref_index":45,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mult-DPO: Multinomial Direct Preference Optimization for Recommender 
Systems","primary_cat":"cs.IR","submitted_at":"2026-06-08T18:53:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Mult-DPO provides a tractable multinomial surrogate for set-wise preference optimization in DPO that upper bounds the Plackett-Luce based loss for LLM recommender systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09587","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Seeing the Hivemind: A Consensus-Aware Interaction Technique for Mitigating AI Homogenization","primary_cat":"cs.HC","submitted_at":"2026-06-08T14:59:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Introduces Semantic Repulsion Technique (SRT) that boosts semantic diversity in AI creative outputs by 85-167% and receives higher usefulness and coherence ratings than baselines in a 16-person user study.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09508","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Rigid to Dynamic: Entropy-Guided Adaptive Inference for Long-Context LLMs","primary_cat":"cs.AI","submitted_at":"2026-06-08T14:02:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EntropyInfer adaptively allocates inference compute using per-head attention entropy for rigid/dynamic classification during prefilling and compresses KV cache with generated tokens, achieving up to 2.39x speedup on long contexts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09366","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Is Text All You 
Need? Text as a Universal Information Bottleneck for Speech LLMs","primary_cat":"cs.CL","submitted_at":"2026-06-08T11:38:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"C-Gate represents speech frames as convex combinations of LLM token embeddings to enforce manifold compatibility, delivering up to 48.7% relative WER reduction on LibriSpeech while preserving emotion recognition accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09312","ref_index":35,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Toward Compiler World Models: Learning Latent Dynamics for Efficient Tensor Program Search","primary_cat":"cs.LG","submitted_at":"2026-06-08T10:17:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A latent dynamics model for schedule trajectories in TVM AutoScheduler finds programs with 1.37x better GPU latency than Ansor using the same 64 trials and matches 10K-trial Ansor with 10x fewer measurements.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09287","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Trajectory Geometry of Transformer Representations Across Layers","primary_cat":"cs.LG","submitted_at":"2026-06-08T09:54:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Transformer representations form trajectories showing semantic convergence in middle-to-late layers, higher curvature on reasoning tasks, bifurcation on ambiguous tokens, and a consistent three-phase cosine similarity pattern across GPT-2, TinyLlama, and 
Qwen2.5.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09178","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Culturally-Adapted Red-Teaming Across East and Southeast Asian Contexts: A Methodological and Comparative Analysis","primary_cat":"cs.CL","submitted_at":"2026-06-08T08:17:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Culturally-adapted red-teaming prompts raise ASR by a mean of 9.3 pp over direct translations across 16 language-model pairs in four Asian languages, with DT scoring mean cultural depth of 0.17 versus up to 2.51 for CA.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09165","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reliable to Expressive: A Curriculum for Rubric-Following Safety Judges","primary_cat":"cs.AI","submitted_at":"2026-06-08T08:02:57+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A reliable-to-expressive curriculum with dynamic rubrics trains a 12B safety judge to achieve 94%+ accuracy with only 0.76 cross-rubric variance on three different rubric prompts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09156","ref_index":98,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OmniGen-AR: AutoRegressive Any-to-Image Generation","primary_cat":"cs.CV","submitted_at":"2026-06-08T07:47:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OmniGen-AR is a unified autoregressive framework for any-to-image generation that tokenizes text and visual conditions together and uses disentangled causal 
attention to support tasks like text-to-image, depth-to-image, image editing, and text-to-video while reporting 0.63 on GenEval and 80.02 on VB","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09092","ref_index":99,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Shortcuts to Reasoning: Robust Post-Training of Theory of Mind with Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-06-08T06:42:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Thinking-RFT improves Theory of Mind accuracy by 6% over SFT on shortcut-free datasets, with 10% gains on higher-order reasoning and better generalization to new domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":100,"offset":0}}