Sketch-and-Verify: Structured Inference-Time Scaling via Program Sketching

Sketch-and-Verify improves small-LLM code generation on HumanEval+ by factorizing search into K algorithmic sketches and M fillings each, outperforming flat sampling by up to 32 percentage points at a matched budget while remaining cheaper than upgrading the model tier.

International Conference on Learning Representations (ICLR), 2026. 3 Pith papers cite this work.
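To make the K-sketches-times-M-fillings factorization concrete, here is a minimal, hypothetical sketch of the search loop. The function names (`sample_sketch`, `fill_sketch`, `passes_tests`) and the first-pass stopping rule are illustrative stand-ins, not the paper's implementation:

```python
def sketch_and_verify(sample_sketch, fill_sketch, passes_tests, K=4, M=8):
    # Factorized search: K high-level algorithmic sketches, each
    # completed M times, for a total budget of K * M samples.
    # Verification (e.g. running unit tests) prunes bad fillings.
    for _ in range(K):
        sketch = sample_sketch()
        for _ in range(M):
            candidate = fill_sketch(sketch)
            if passes_tests(candidate):
                return candidate
    return None  # budget exhausted without a verified program
```

At a matched budget, flat sampling would instead draw K * M independent programs with no shared sketch structure.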
Representative citing papers:
- Hidden Error Awareness in Chain-of-Thought Reasoning: The Signal Is Diagnostic, Not Causal
  LLMs detect CoT reasoning errors in their hidden states with 0.95 AUROC but cannot use this awareness to correct them via steering, patching, or self-correction, indicating the signal is diagnostic, not causal.
- Semantic Voting: Execution-Grounded Consensus for LLM Code Generation
  Execution-based selectors for LLM code candidates outperform textual voting by large margins across configurations, with the quality of the generated test inputs mattering more than the specific aggregation rule.
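The hidden-state probing result summarized for "Hidden Error Awareness in Chain-of-Thought Reasoning" above can be illustrated with a toy experiment. Everything below is fabricated for illustration, not the paper's setup: synthetic "hidden states" for erroneous steps are shifted along one direction, a difference-of-means linear probe scores them, and a rank-based AUROC measures separability:

```python
import numpy as np

def auroc(scores, labels):
    # Rank-based (Mann-Whitney) AUROC: probability that a random
    # positive (error) example scores higher than a random negative.
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(0)
dim, n = 32, 300
# Synthetic "hidden states": error states are shifted along one coordinate.
clean = rng.normal(size=(n, dim))
error = rng.normal(size=(n, dim))
error[:, 0] += 2.5
states = np.vstack([clean, error])
labels = np.concatenate([np.zeros(n), np.ones(n)])

# Difference-of-means linear probe, scored by projection.
probe = states[labels == 1].mean(0) - states[labels == 0].mean(0)
probe_auroc = auroc(states @ probe, labels)
print(f"probe AUROC: {probe_auroc:.2f}")  # high separability on this toy data
```

The paper's point is that even a highly accurate probe of this kind is diagnostic only: the detectable signal does not mean the model can be steered or patched into fixing the error.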
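An execution-based selector of the kind "Semantic Voting" studies can be sketched as follows. This is a minimal illustration of one common aggregation rule (cluster candidates by identical outputs on generated inputs and keep the largest cluster); the names `semantic_vote` and `signature` are hypothetical, and the paper's exact rule may differ:

```python
from collections import Counter

def semantic_vote(candidates, test_inputs):
    # Run every candidate program on the same generated inputs, group
    # candidates whose outputs agree, and return a member of the
    # largest agreement cluster.
    def signature(fn):
        outs = []
        for x in test_inputs:
            try:
                outs.append(repr(fn(x)))
            except Exception:
                outs.append("<error>")  # crashing runs still get a comparable signature
        return tuple(outs)

    sigs = [signature(fn) for fn in candidates]
    winner, _ = Counter(sigs).most_common(1)[0]
    return candidates[sigs.index(winner)]

# Toy usage: two semantically equivalent abs() implementations outvote a buggy one.
cands = [lambda x: abs(x), lambda x: -x if x < 0 else x, lambda x: x]
chosen = semantic_vote(cands, [-2, -1, 0, 3])
```

Note that the selector's power here comes entirely from the inputs it executes on, which matches the summary's observation that input generation quality matters more than the aggregation rule itself.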