pith. machine review for the scientific record

arxiv: 2509.14234 · v3 · submitted 2025-09-17 · 💻 cs.LG


Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision

keywords: compute, framework, domains, pseudo-reference, rollouts, supervision, call, inference

Where do learning signals come from when there is no ground truth in post-training? We show that inference compute itself can serve as supervision. By generating parallel rollouts and converting them into reference estimates, models can learn without human labels; critically, this holds even in non-verifiable domains like healthcare guidance, where no programmatic checker exists. We call this framework Compute as Teacher (CaT); it turns inference-time compute from parallel rollouts into supervision for RL training. The framework has two components: (1) reference estimation, which aggregates rollouts into a pseudo-reference answer, and (2) reward derivation, which converts that pseudo-reference into RL rewards. For (1), we explore a simple method we call synthesis, but the framework admits any aggregator. For (2), we introduce self-proposed rubrics for non-verifiable domains: binary, auditable criteria generated from the pseudo-reference and scored by an LLM judge. On HealthBench, models trained with CaT match or exceed inference-time aggregation quality while using 9x less test-time compute. Here, CaT also competes with learning from expert physician annotations, yielding up to +30% relative improvement over the initial policy. The framework extends naturally to verifiable rewards, matching the best existing baselines on MATH-500 in test-time RL and demonstrating 'drop-in' versatility across both types of domains.
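The two components the abstract describes can be illustrated with a minimal sketch. The function names, the majority-vote aggregator, and the keyword "judge" below are illustrative stand-ins, not the paper's implementation: CaT's synthesis step and rubric judging are LLM calls, whereas this toy replaces them with simple Python logic to show the data flow from rollouts to a scalar reward.

```python
from collections import Counter

def synthesize(rollouts):
    """Toy stand-in for CaT's reference-estimation step: aggregate
    parallel rollouts into one pseudo-reference answer. Here we take
    the most common rollout; the paper's aggregator is an LLM."""
    return Counter(rollouts).most_common(1)[0][0]

def score_rubrics(answer, rubrics, judge):
    """Toy stand-in for reward derivation: each self-proposed rubric
    is a binary, auditable criterion; `judge` returns True/False per
    criterion. The reward is the fraction of criteria satisfied."""
    passed = sum(bool(judge(answer, r)) for r in rubrics)
    return passed / len(rubrics)

# Hypothetical usage with a trivial keyword-matching "judge":
rollouts = ["rest and fluids", "rest and fluids", "antibiotics"]
pseudo_reference = synthesize(rollouts)          # "rest and fluids"
rubrics = ["mentions rest", "mentions fluids"]   # derived from it
judge = lambda ans, rubric: rubric.split()[-1] in ans
reward = score_rubrics("advise rest and fluids", rubrics, judge)  # 1.0
```

In the actual framework the reward would then feed a standard RL update of the policy that produced the rollouts; the key point the sketch preserves is that no ground-truth label enters the pipeline.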

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time

    cs.LG · 2026-03 · unverdicted · novelty 7.0

    SCRL adds selective positive pseudo-labeling and entropy-gated negative pseudo-labeling to test-time RL, reducing noise from weak consensus and improving LLM reasoning on benchmarks.
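The one-line summary above mentions entropy-gated negative pseudo-labeling. A minimal sketch of that gating idea, under the assumption (ours, not the cited paper's) that the gate measures consensus strength as the Shannon entropy of the empirical answer distribution over rollouts and only emits a negative pseudo-label when consensus is strong:

```python
import math
from collections import Counter

def rollout_entropy(rollouts):
    """Shannon entropy (nats) of the empirical answer distribution
    over parallel rollouts; low entropy means strong consensus."""
    counts = Counter(rollouts)
    n = len(rollouts)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def gated_negative_label(rollouts, candidate, threshold=0.5):
    """Emit a negative pseudo-label for `candidate` only when the
    consensus is confident (entropy below `threshold`) and the
    candidate disagrees with the majority answer; return None when
    the gate rejects the label as too noisy. Hypothetical sketch."""
    if rollout_entropy(rollouts) >= threshold:
        return None  # weak consensus: "what if consensus lies?"
    majority = Counter(rollouts).most_common(1)[0][0]
    return candidate if candidate != majority else None
```

The design choice the summary points at is visible here: a split consensus (high entropy) produces no negative label at all, which is how the method reduces noise from weak consensus.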

  2. AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems

    cs.LG · 2026-04 · unverdicted · novelty 6.0

    AutoOR uses synthetic data generation and RL post-training with solver feedback to enable 8B LLMs to autoformalize linear, mixed-integer, and non-linear OR problems, matching larger models on benchmarks.