{"total":18,"items":[{"citing_arxiv_id":"2607.00862","ref_index":13,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"CAT: Confidence-Adaptive Thinking for Efficient Reasoning of Large Reasoning Models","primary_cat":"cs.CL","submitted_at":"2026-07-01T12:27:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CAT uses intrinsic confidence signals in preference optimization to adapt reasoning length in LRMs, outperforming uniform compression baselines on accuracy across benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00482","ref_index":70,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Know When to Stop: Segment-Level Credit Assignment for Reducing Overthinking","primary_cat":"cs.CL","submitted_at":"2026-07-01T06:09:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DASH assigns segment-level credit in reasoning traces using drift toward ground-truth answers, yielding 50.8% accuracy on AIME25 versus 45.4% for GRPO while reducing overthinking behaviors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30852","ref_index":11,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"When Does Learning to Stop Help? A Cost-Aware Study of Early Exits in Reasoning Models","primary_cat":"cs.AI","submitted_at":"2026-06-29T19:33:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Learned multi-feature stopping improves accuracy-cost tradeoffs on free-form math but scalar rules match or exceed it on multiple-choice and hard problems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10445","ref_index":18,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"SpenseGPT: Practical One-shot Pruning Enabling Sparse and Dense GEMMs for LLM Inference","primary_cat":"cs.LG","submitted_at":"2026-06-09T05:48:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SpenseGPT introduces a hybrid sparse-dense weight format and one-shot pruning that delivers 1.2x end-to-end LLM decoding speedup on B200 GPUs with FP8 while preserving accuracy on Qwen3-32B and Seed-OSS-36B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06915","ref_index":29,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"ThinkBooster: A Unified Framework for Seamless Test-Time Scaling of LLM Reasoning","primary_cat":"cs.CL","submitted_at":"2026-06-05T05:28:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ThinkBooster supplies a modular library, joint performance-efficiency benchmark, and deployable proxy for test-time compute scaling of LLM reasoning on math and coding tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05613","ref_index":19,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Multilingual Fine-Tuning via Localized Gradient Conflict Resolution","primary_cat":"cs.AI","submitted_at":"2026-06-04T02:36:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Bucket-Level MOO reformulates multilingual fine-tuning as localized multi-objective optimization and proves it enforces a tighter Pareto stationarity condition while improving cross-lingual performance on four LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05122","ref_index":12,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data","primary_cat":"cs.CL","submitted_at":"2026-06-03T17:27:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Base LLMs show latent judge calibration that Self-Evaluation Elicitation (SEE) surfaces with 160 examples via RL calibration followed by masked distillation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05054","ref_index":51,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Boosting Self-Consistency with Ranking","primary_cat":"cs.CL","submitted_at":"2026-06-03T16:12:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RISC reformulates self-consistency answer selection as a ranking task solved by a lightweight LambdaRank model with five hand-designed features, yielding better accuracy-efficiency trade-offs than majority voting on QA benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01667","ref_index":39,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"ATLAS: Agentic Test-time Learning-to-Allocate Scaling","primary_cat":"cs.LG","submitted_at":"2026-06-01T04:19:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ATLAS introduces an LLM-orchestrated agentic framework for dynamic test-time scaling via extensible 'explore' actions, achieving higher accuracy with fewer API calls than fixed-workflow baselines on four benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27570","ref_index":21,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"LaneRoPE: Positional Encoding for Collaborative Parallel Reasoning and Generation","primary_cat":"cs.AI","submitted_at":"2026-05-26T18:43:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LaneRoPE adds an inter-sequence attention mask and extended RoPE to enable collaborative parallel sequence generation in LLMs, yielding accuracy gains on math reasoning under length limits.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17626","ref_index":2,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Verifier-Guided Code Translation via Meta-Step Decoding","primary_cat":"cs.LG","submitted_at":"2026-05-17T19:47:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Decoding Time Verification (DTV) interleaves verifier calls at structural boundaries during autoregressive code generation for C-to-Rust and JavaScript-to-TypeScript translation, raising pass rates while using fewer tokens than post-hoc baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13368","ref_index":20,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"What Does LLM Refinement Actually Improve? A Systematic Study on Document-Level Literary Translation","primary_cat":"cs.CL","submitted_at":"2026-05-13T11:27:32+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Document-level machine translation followed by segment-level LLM refinement provides the strongest and most stable improvements in literary translation quality, mainly enhancing fluency and style rather than adequacy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07654","ref_index":3,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Reliable Chain-of-Thought via Prefix Consistency","primary_cat":"stat.ML","submitted_at":"2026-05-08T12:28:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Prefix consistency weights CoT answers by their regeneration frequency from truncated prefixes and reaches standard self-consistency accuracy at a median 4.6x fewer tokens across five models and four benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23333","ref_index":54,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Process Supervision of Confidence Margin for Calibrated LLM Reasoning","primary_cat":"cs.LG","submitted_at":"2026-04-25T14:40:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22709","ref_index":28,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought","primary_cat":"cs.CL","submitted_at":"2026-04-24T16:45:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Abstract-CoT lets models reason with short discrete latent token sequences from a reserved vocabulary, using warm-up training and RL to match verbal CoT performance with up to 11.6x fewer tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19341","ref_index":84,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Evaluation-driven Scaling for Scientific Discovery","primary_cat":"cs.LG","submitted_at":"2026-04-21T11:24:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster LASSO and new Erdos constructions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"indispensable component of scientific discovery [ 75, 89, 105, 114], forming an evaluation-driven discovery loop that uses external feedback to guide subsequent refinement. Regardless of the diverse designs in these systems, the scaling effect of the evaluation-driven discovery loop itself remains underexplored. Existing methods either focus primarily on scaling generation-side computation, such as reasoning tokens [84, 123, 157] or agent turns [61, 73, 175], or aim to improve results with only limited rounds of discovery loops [ 72, 100]. Yet this loop is precisely the mechanism through which science advances: one round of attempts produces the feedback that shapes the next. This leads to the central question of the paper: how far can scientific discovery be pushed by effectively scaling evaluation-"},{"citing_arxiv_id":"2601.23045","ref_index":8,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"The Hot Mess of AI: How Does Misalignment Scale With Model Intelligence and Task Complexity?","primary_cat":"cs.AI","submitted_at":"2026-01-30T14:52:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AI model failures on complex tasks become increasingly incoherent with longer reasoning chains, making consistent misalignment less likely than chaotic errors as capabilities scale.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.25140","ref_index":52,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory","primary_cat":"cs.AI","submitted_at":"2025-09-29T17:51:03+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ReasoningBank distills generalizable reasoning strategies from agent successes and failures to enable self-evolution, with memory-aware test-time scaling amplifying gains over raw-trajectory or success-only memory on web and software benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}