hub
S1: Simple test-time scaling
14 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background 1
citation-polarity summary: background 1
citing papers explorer
- ATLAS: Agentic Test-time Learning-to-Allocate Scaling
  ATLAS introduces an LLM-orchestrated agentic framework for dynamic test-time scaling via extensible 'explore' actions, achieving higher accuracy with fewer API calls than fixed-workflow baselines on four benchmarks.
- What Does LLM Refinement Actually Improve? A Systematic Study on Document-Level Literary Translation
  Document-level machine translation followed by segment-level LLM refinement provides the strongest and most stable improvements in literary translation quality, mainly enhancing fluency and style rather than adequacy.
- Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought
  Abstract-CoT lets models reason with short discrete latent token sequences from a reserved vocabulary, using warm-up training and RL to match verbal CoT performance with up to 11.6x fewer tokens.
- ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory
  ReasoningBank distills generalizable reasoning strategies from agent successes and failures to enable self-evolution, with memory-aware test-time scaling amplifying gains over raw-trajectory or success-only memory on web and software benchmarks.
- SpenseGPT: Practical One-shot Pruning Enabling Sparse and Dense GEMMs for LLM Inference
  SpenseGPT introduces a hybrid sparse-dense weight format and one-shot pruning that delivers 1.2x end-to-end LLM decoding speedup on B200 GPUs with FP8 while preserving accuracy on Qwen3-32B and Seed-OSS-36B.
- Multilingual Fine-Tuning via Localized Gradient Conflict Resolution
  Bucket-Level MOO reformulates multilingual fine-tuning as localized multi-objective optimization and proves it enforces a tighter Pareto stationarity condition while improving cross-lingual performance on four LLMs.
- Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data
  Base LLMs show latent judge calibration that Self-Evaluation Elicitation (SEE) surfaces with 160 examples via RL calibration followed by masked distillation.
- Boosting Self-Consistency with Ranking
  RISC reformulates self-consistency answer selection as a ranking task solved by a lightweight LambdaRank model with five hand-designed features, yielding better accuracy-efficiency trade-offs than majority voting on QA benchmarks.
- Verifier-Guided Code Translation via Meta-Step Decoding
  Decoding Time Verification (DTV) interleaves verifier calls at structural boundaries during autoregressive code generation for C-to-Rust and JavaScript-to-TypeScript translation, raising pass rates while using fewer tokens than post-hoc baselines.
- Reliable Chain-of-Thought via Prefix Consistency
  Prefix consistency weights CoT answers by their regeneration frequency from truncated prefixes and reaches standard self-consistency accuracy at a median 4.6x fewer tokens across five models and four benchmarks.
- Process Supervision of Confidence Margin for Calibrated LLM Reasoning
  RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.
- Evaluation-driven Scaling for Scientific Discovery
  SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster LASSO and new Erdos constructions.
- The Hot Mess of AI: How Does Misalignment Scale With Model Intelligence and Task Complexity?
  AI model failures on complex tasks become increasingly incoherent with longer reasoning chains, making consistent misalignment less likely than chaotic errors as capabilities scale.
- ThinkBooster: A Unified Framework for Seamless Test-Time Scaling of LLM Reasoning
  ThinkBooster supplies a modular library, joint performance-efficiency benchmark, and deployable proxy for test-time compute scaling of LLM reasoning on math and coding tasks.