pith. sign in

arxiv: 2606.07134 · v1 · pith:VCVEC7BCnew · submitted 2026-06-05 · 💻 cs.LG

α-PFN: Fast Entropy Search via In-Context Learning

Pith reviewed 2026-06-27 22:21 UTC · model grok-4.3

classification 💻 cs.LG
keywords Bayesian optimizationentropy searchprior-data fitted networksacquisition functionsinformation gainin-context learningamortized inference
0
0 comments X

The pith

A two-stage PFN learns to approximate entropy search acquisition functions with one forward pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that Prior-data Fitted Networks can amortize the expensive Monte Carlo estimation of expected information gain in entropy search for Bayesian optimization. It does this by training one PFN on optima information and a second α-PFN to directly output the acquisition value from that output. A sympathetic reader would care because this turns a slow, hand-crafted approximation into a fast, learned one that still matches existing methods on benchmarks. The result is that information-theoretic acquisition becomes practical for repeated use without custom implementations or numerical instability.

Core claim

We propose a two-stage amortization strategy that learns to approximate entropy search-based acquisition functions using Prior-data Fitted Networks (PFNs) in a single forward pass. A first PFN is trained to be conditioned on information about the optima; second, the α-PFN is trained to predict the expected information gain by training on information gains measured with the first PFN. The α-PFN offers a flexible learned approximation, which replaces the complex heuristic approximations with a single forward pass per candidate, enabling rapid and extensible acquisition evaluation.

What carries the argument

The α-PFN, a second Prior-data Fitted Network trained to output expected information gain directly from a first PFN's representation of the optimum.

If this is right

  • Entropy search variants become evaluable in a single forward pass instead of Monte Carlo sampling.
  • Acquisition evaluation accelerates by more than 50 times across tested entropy search methods.
  • The same learned approximator remains competitive with state-of-the-art implementations on both synthetic and real-world tasks.
  • Acquisition functions can be extended or swapped without re-deriving new heuristic approximations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same two-stage pattern could be applied to amortize other expensive acquisition functions or decision-theoretic quantities.
  • Because the approximator is a forward pass, it becomes feasible to embed entropy search inside inner loops of larger algorithms that previously avoided it for speed reasons.
  • If the PFN training distribution is broadened, the method might transfer to new problem classes without retraining the acquisition model from scratch.

Load-bearing premise

Networks trained only on synthetic or prior data will output sufficiently accurate expected information gain values on the actual optimization tasks.

What would settle it

An experiment on a new real-world benchmark where the α-PFN version of entropy search produces worse final optimization performance than a standard Monte Carlo implementation despite the speed gain.

Figures

Figures reproduced from arXiv: 2606.07134 by Carl Hvarfner, Eytan Bakshy, Frank Hutter, Herilalaina Rakotoarison, Samuel M\"uller, Steven Adriaensen, Tom Viering.

Figure 1
Figure 1. Figure 1: Our base PFN estimates the Posterior Predictive Distribution (PPD) conditioned on different types of information regarding the optimum: none (unconditional case), x ∗ , f ∗ , or both. Note that this example considers a specific optimum. In practice, the optimum, and therefore the information gain, is uncertain and the ES acquisitions compute expected gain. This is typically achieved by MC sampling, where x… view at source ↗
Figure 2
Figure 2. Figure 2: An overview of our pipeline. Left: we illustrate 2 out of 4 cases for the base PFN training. Middle: how to train the α-PFN for PES. Right: how to use the α-PFN at test-time in a BO loop. Note: PFNs were trained only once and reused in all our BO experiments. single model is able to handle all four cases equally. This way we train our base PFN q(y|x, Dtrn, I), where I can contain x ∗ and/or f ∗ . For more … view at source ↗
Figure 3
Figure 3. Figure 3: Bayesian optimization performance comparison between GP-MCMC (NUTS) and α-PFN across different synthetic test functions and real HPO benchmarks. The shaded area indicates one standard error. Evaluation datasets. We evaluate on well-known black￾box optimization benchmarks, including synthetic func￾tions (Branin, Hartmann, Ackley) and real HPO tasks (LCBench (Zimmer et al., 2021), HPO-B (Pineda-Arango et al.… view at source ↗
Figure 4
Figure 4. Figure 4: Noise ablation on Hartmann 4D and 6D, comparing the main setting (σn = 0.316) to a higher-noise OOD setting (σn = 0.5). α-PFN degrades similarly to the corresponding GP baselines. Higgs (LCBench), its performance is worse, often outper￾formed by the GP baseline, in particular on HPO-B. Timing Results [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Trace generation ablation comparing clustered traces from Algorithm 1 with uniformly sampled context and query points. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative comparison of GP, Monte Carlo base-PFN, and α-PFN estimates for a 1D setting. For this setting, we use a non-fully Bayesian GP (and we trained a base-PFN and α-PFN specifically for this ablation). This makes the acquisition function of the GP more comparable to the PFN. Top: GP and base-PFN posterior predictive distributions. Bottom: JES estimates, with Monte Carlo curves shown for 10 random se… view at source ↗
Figure 7
Figure 7. Figure 7: Full results across all benchmarks. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Win rate comparison across different benchmarks. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗
read the original abstract

Information-theoretic acquisition functions such as Entropy Search (ES) offer a principled exploration-exploitation framework for Bayesian optimization (BO). However, their practical implementation relies on complicated and slow approximations, i.e., a Monte Carlo estimation of the information gain. This complexity can introduce numerical errors and requires specialized, hand-crafted implementations. We propose a two-stage amortization strategy that learns to approximate entropy search-based acquisition functions using Prior-data Fitted Networks (PFNs) in a single forward pass. A first PFN is trained to be conditioned on information about the optima; second, the $\alpha$-PFN is trained to predict the expected information gain by training on information gains measured with the first PFN. The $\alpha$-PFN offers a flexible learned approximation, which replaces the complex heuristic approximations with a single forward pass per candidate, enabling rapid and extensible acquisition evaluation. Empirically, our approach is competitive with state-of-the-art entropy search implementations on synthetic and real-world benchmarks, while accelerating the different entropy search variants across all our experiments, with speed ups over 50x. Source code: https://github.com/automl/AlphaPFN.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes α-PFN, a two-stage amortization strategy using Prior-data Fitted Networks (PFNs) to approximate entropy search acquisition functions in Bayesian optimization. A first PFN is trained conditioned on information about the optima; a second α-PFN is then trained to predict expected information gain from measurements produced by the first. The resulting model replaces complex Monte Carlo or heuristic approximations with a single forward pass per candidate point. The central empirical claim is that the method is competitive with state-of-the-art entropy search implementations on synthetic and real-world benchmarks while delivering speed-ups exceeding 50×.

Significance. If the performance claims hold under rigorous validation, the work would meaningfully advance practical Bayesian optimization by rendering information-theoretic acquisition functions computationally tractable. The two-stage PFN construction demonstrates a flexible, extensible learned surrogate for expected information gain and could serve as a template for amortizing other expensive BO components.

major comments (2)
  1. [Abstract] Abstract: the claim of empirical competitiveness and >50× speed-ups is load-bearing for the paper’s contribution, yet the abstract (and the provided material) supplies no information on training-data distribution, benchmark tasks, number of runs, error bars, or statistical tests. Without these details it is impossible to determine whether the learned approximation introduces bias or performance degradation relative to existing ES variants.
  2. The central modeling assumption—that a PFN trained on synthetic or prior data will produce sufficiently accurate EIG approximations that generalize to the target optimization tasks without systematic bias—is stated but not subjected to targeted stress tests (e.g., out-of-distribution task transfer or comparison against the exact EIG on low-dimensional problems). This assumption directly underpins the competitiveness claim.
minor comments (1)
  1. The source-code link is provided; this is helpful for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and commit to revisions that improve clarity without altering the core claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of empirical competitiveness and >50× speed-ups is load-bearing for the paper’s contribution, yet the abstract (and the provided material) supplies no information on training-data distribution, benchmark tasks, number of runs, error bars, or statistical tests. Without these details it is impossible to determine whether the learned approximation introduces bias or performance degradation relative to existing ES variants.

    Authors: We agree the abstract is too terse on experimental details. The manuscript body specifies training on data drawn from standard GP priors, evaluation across synthetic test functions (Branin, Hartmann6, etc.) and real-world hyperparameter optimization tasks, with all results reported as means over 10–20 independent runs together with standard-error bars and direct comparisons to existing ES implementations. We will expand the abstract to include a concise statement of the benchmark classes, run count, and presence of variability measures so that the competitiveness claim can be assessed at a glance. revision: yes

  2. Referee: [—] The central modeling assumption—that a PFN trained on synthetic or prior data will produce sufficiently accurate EIG approximations that generalize to the target optimization tasks without systematic bias—is stated but not subjected to targeted stress tests (e.g., out-of-distribution task transfer or comparison against the exact EIG on low-dimensional problems). This assumption directly underpins the competitiveness claim.

    Authors: The existing experiments already probe generalization by evaluating on synthetic and real-world tasks whose characteristics differ from the training prior; where dimensionality permits, we also report agreement with exact EIG values. Nevertheless, we accept that dedicated out-of-distribution transfer experiments and additional low-dimensional exact-EIG comparisons would make the validation more explicit. We will add a short subsection summarizing these checks and any observed approximation bias. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes a two-stage training procedure for PFNs where the first network is trained on synthetic data conditioned on optima information, and the α-PFN is then trained to predict expected information gain using outputs measured from the first PFN. This is the explicit design of the learned approximator, not a reduction of a claimed derivation to its inputs by construction. No mathematical first-principles result, uniqueness theorem, or ansatz is derived or smuggled via self-citation. Central claims concern empirical speedups and competitiveness on benchmarks, which are externally verifiable and not forced by the training process itself. No load-bearing step matches any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

The approach rests on the empirical claim that PFN training on synthetic information-gain data produces a usable approximation; the primary free parameters are the network weights learned in each stage. No new mathematical axioms or invented physical entities are introduced.

free parameters (1)
  • PFN network weights (stage 1 and stage 2)
    Weights are fitted during the two training phases on prior data and measured information gains.

pith-pipeline@v0.9.1-grok · 5760 in / 1110 out tokens · 29172 ms · 2026-06-27T22:21:30.544968+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

65 extracted references · 3 canonical work pages · 1 internal anchor

  1. [1]

    International Conference on Learning Representations , year=

    Meta-Learning Acquisition Functions for Transfer Learning in Bayesian Optimization , author=. International Conference on Learning Representations , year=

  2. [2]

    Nature , volume=

    Publishing: Credit where credit is due , author=. Nature , volume=. 2014 , publisher=

  3. [3]

    International Conference on Machine Learning , pages=

    Learning to learn without gradient descent by gradient descent , author=. International Conference on Machine Learning , pages=. 2017 , organization=

  4. [4]

    International Conference on Artificial Intelligence and Statistics , pages=

    Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2017 , organization=

  5. [5]

    Towards global optimization , pages=

    The application of Bayesian methods for seeking the extremum , author=. Towards global optimization , pages=

  6. [6]

    International Conference on Machine Learning , year=

    Distribution transformers: Fast approximate Bayesian inference with on-the-fly prior adaptation , author=. International Conference on Machine Learning , year=

  7. [7]

    ICLR , year=

    Meta-Learning Acquisition Functions for Transfer Learning in Bayesian Optimization , author=. ICLR , year=

  8. [8]

    Advances in Neural Information Processing Systems , pages=

    End-to-end meta-bayesian optimisation with transformer neural processes , author=. Advances in Neural Information Processing Systems , pages=

  9. [9]

    Advances in neural information processing systems , pages=

    Local latent space bayesian optimization over structured inputs , author=. Advances in neural information processing systems , pages=

  10. [10]

    Uncertainty in Artificial Intelligence , pages=

    Multi-objective bayesian optimization over high-dimensional search spaces , author=. Uncertainty in Artificial Intelligence , pages=. 2022 , organization=

  11. [11]

    Advances in neural information processing systems , pages=

    Scalable global optimization via local Bayesian optimization , author=. Advances in neural information processing systems , pages=

  12. [12]

    Rosen Ting-Ying Yu and Cyril Picard and Faez Ahmed , booktitle=

  13. [13]

    , author=

    The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. , author=. Journal of Machine Learning Research , pages=

  14. [14]

    International Conference on Machine Learning , pages=

    Vanilla bayesian optimization performs great in high dimensions , author=. International Conference on Machine Learning , pages=. 2024 , organization=

  15. [15]

    International Conference on Artificial Intelligence and Statistics , pages=

    Amortized probabilistic conditioning for optimization, simulation and inference , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2025 , organization=

  16. [16]

    Journal of Machine Learning Research , volume=

    Pre-trained Gaussian processes for Bayesian optimization , author=. Journal of Machine Learning Research , volume=

  17. [17]

    International conference on machine learning , pages=

    Neural contextual bandits with ucb-based exploration , author=. International conference on machine learning , pages=. 2020 , organization=

  18. [18]

    International Conference on Machine Learning , pages=

    Infonet: neural estimation of mutual information without test-time optimization , author=. International Conference on Machine Learning , pages=. 2024 , organization=

  19. [19]

    ACM SIGKDD Explorations Newsletter , volume=

    OpenML: networked science in machine learning , author=. ACM SIGKDD Explorations Newsletter , volume=. 2014 , publisher=

  20. [20]

    IEEE Transactions on Pattern Analysis and Machine Intelligence , year =

    Lucas Zimmer and Marius Lindauer and Frank Hutter , title =. IEEE Transactions on Pattern Analysis and Machine Intelligence , year =

  21. [21]

    Advances in neural information processing systems (Datasets and Benchmarks Track) , publisher =

    Sebastian Pineda. Advances in neural information processing systems (Datasets and Benchmarks Track) , publisher =

  22. [22]

    Advances in Neural Information Processing Systems , volume=

    Unexpected improvements to expected improvement for bayesian optimization , author=. Advances in Neural Information Processing Systems , volume=

  23. [23]

    Journal of Machine Learning Research , pages=

    Gibbon: General-purpose information-based bayesian optimisation , author=. Journal of Machine Learning Research , pages=

  24. [24]

    Expert Systems with Applications , volume=

    Parallel predictive entropy search for multi-objective Bayesian optimization with constraints applied to the tuning of machine learning algorithms , author=. Expert Systems with Applications , volume=. 2023 , publisher=

  25. [25]

    Advances in Neural Information Processing Systems , volume=

    Joint entropy search for multi-objective bayesian optimization , author=. Advances in Neural Information Processing Systems , volume=

  26. [26]

    Journal of Global Optimization , volume=

    An informational approach to the global optimization of expensive-to-evaluate functions , author=. Journal of Global Optimization , volume=. 2009 , publisher=

  27. [27]

    2006 , publisher=

    Gaussian processes for machine learning , author=. 2006 , publisher=

  28. [28]

    Advances in Neural Information Processing Systems , pages=

    Towards learning universal hyperparameter optimizers with transformers , author=. Advances in Neural Information Processing Systems , pages=

  29. [29]

    and Daulton, Samuel and Letham, Benjamin and Wilson, Andrew Gordon and Bakshy, Eytan , booktitle =

    Balandat, Maximilian and Karrer, Brian and Jiang, Daniel R. and Daulton, Samuel and Letham, Benjamin and Wilson, Andrew Gordon and Bakshy, Eytan , booktitle =

  30. [30]

    Optimization techniques IFIP technical conference: Novosibirsk , pages=

    On Bayesian methods for seeking the extremum , author=. Optimization techniques IFIP technical conference: Novosibirsk , pages=. 1975 , organization=

  31. [31]

    2010 , booktitle =

    Srinivas, Niranjan and Krause, Andreas and Kakade, Sham and Seeger, Matthias , title =. 2010 , booktitle =

  32. [32]

    arXiv preprint arXiv:2202.13597 , year=

    Rectified max-value entropy search for Bayesian optimization , author=. arXiv preprint arXiv:2202.13597 , year=

  33. [33]

    Thirty-seventh Conference on Neural Information Processing Systems , publisher =

    Self-Correcting Bayesian Optimization through Bayesian Active Learning , author=. Thirty-seventh Conference on Neural Information Processing Systems , publisher =. 2023 , pages=

  34. [34]

    AutoML Conference 2024 (Workshop Track) , year=

    From Epoch to Sample Size: Developing New Data-driven Priors for Learning Curve Prior-Fitted Networks , author=. AutoML Conference 2024 (Workshop Track) , year=

  35. [35]

    Multi-fidelity

    Takeno, Shion and Fukuoka, Hitoshi and Tsukada, Yuhki and Koyama, Toshiyuki and Shiga, Motoki and Takeuchi, Ichiro and Karasuyama, Masayuki , booktitle =. Multi-fidelity. 2020 , publisher =

  36. [36]

    International Conference on Machine Learning , pages =

    Sequential and Parallel Constrained Max-value Entropy Search via Information Lower Bound , author =. International Conference on Machine Learning , pages =. 2022 , publisher =

  37. [37]

    International Conference on Machine Learning , pages=

    A general recipe for likelihood-free Bayesian optimization , author=. International Conference on Machine Learning , pages=. 2022 , organization=

  38. [38]

    Journal of Machine Learning Research , volume=

    Bayesian optimization for likelihood-free inference of simulator-based statistical models , author=. Journal of Machine Learning Research , volume=

  39. [39]

    arXiv preprint arXiv:2307.03565 , year=

    MALIBO: Meta-learning for likelihood-free Bayesian optimization , author=. arXiv preprint arXiv:2307.03565 , year=

  40. [40]

    Nature , pages=

    Accurate predictions on small data with a tabular foundation model , author=. Nature , pages=. 2025 , publisher=

  41. [41]

    Advances in neural information processing systems , pages=

    Random features for large-scale kernel machines , author=. Advances in neural information processing systems , pages=. 2007 , publisher =

  42. [42]

    Proceedings of the IEEE , pages=

    Taking the human out of the loop: A review of Bayesian optimization , author=. Proceedings of the IEEE , pages=. 2015 , publisher=

  43. [43]

    International Conference on Machine Learning , pages=

    Pfns4bo: In-context learning for bayesian optimization , author=. International Conference on Machine Learning , pages=. 2023 , organization=

  44. [44]

    Advances in Neural Information Processing Systems , pages=

    Joint entropy search for maximally-informed Bayesian optimization , author=. Advances in Neural Information Processing Systems , pages=

  45. [45]

    Advances in neural information processing systems , pages=

    Predictive entropy search for efficient global optimization of black-box functions , author=. Advances in neural information processing systems , pages=. 2014 , publisher =

  46. [46]

    International Conference on Machine Learning , pages=

    Max-value entropy search for efficient Bayesian optimization , author=. International Conference on Machine Learning , pages=. 2017 , organization=

  47. [47]

    and Archambeau, Cedric and Ramos, Fabio , booktitle =

    Tiao, Louis C and Klein, Aaron and Seeger, Matthias W and Bonilla, Edwin V. and Archambeau, Cedric and Ramos, Fabio , booktitle =. 2021 , publisher =

  48. [48]

    International Conference on Learning Representations , year=

    Transformers Can Do Bayesian Inference , author=. International Conference on Learning Representations , year=

  49. [49]

    , author=

    Entropy search for information-efficient global optimization. , author=. Journal of Machine Learning Research , pages =. 2012 , publisher =

  50. [50]

    The Journal of Machine Learning Research , pages=

    GPstuff: Bayesian modeling with Gaussian processes , author=. The Journal of Machine Learning Research , pages=. 2013 , publisher=

  51. [51]

    2001 , school=

    A family of algorithms for approximate Bayesian inference , author=. 2001 , school=

  52. [52]

    TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second

    Tabpfn: A transformer that solves small tabular classification problems in a second , author=. arXiv preprint arXiv:2207.01848 , year=

  53. [53]

    Advances in neural information processing systems , pages=

    Practical bayesian optimization of machine learning algorithms , author=. Advances in neural information processing systems , pages=. 2012 , publisher =

  54. [54]

    Hyperparameter Optimization

    Feurer, Matthias and Hutter, Frank. Hyperparameter Optimization. Automated Machine Learning: Methods, Systems, Challenges. 2019

  55. [55]

    Summer school on machine learning , pages=

    Gaussian processes in machine learning , author=. Summer school on machine learning , pages=. 2003 , publisher=

  56. [56]

    International conference on machine learning , pages=

    Conditional neural processes , author=. International conference on machine learning , pages=. 2018 , organization=

  57. [57]

    and Casella, George , title =

    Robert, Christian P. and Casella, George , title =. 2005 , publisher =

  58. [58]

    International Conference on Machine Learning , pages=

    In-Context Freeze-Thaw Bayesian Optimization for Hyperparameter Optimization , author=. International Conference on Machine Learning , pages=. 2024 , organization=

  59. [59]

    Advances in Neural Information Processing Systems , pages=

    Forecastpfn: Synthetically-trained zero-shot forecasting , author=. Advances in Neural Information Processing Systems , pages=. 2023 , publisher =

  60. [60]

    Advances in Neural Information Processing Systems , pages=

    Efficient bayesian learning curve extrapolation using prior-data fitted networks , author=. Advances in Neural Information Processing Systems , pages=

  61. [61]

    Advances in Neural Information Processing Systems , pages=

    Language models are few-shot learners , author=. Advances in Neural Information Processing Systems , pages=

  62. [62]

    The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

    Efficient Bayesian Experiment Design with Equivariant Networks , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

  63. [63]

    Advances in Neural Information Processing Systems , year=

    Informed Initialization for Bayesian Optimization and Active Learning , author=. Advances in Neural Information Processing Systems , year=

  64. [64]

    Huang, Daolang and Wen, Xinyi and Bharti, Ayush and Kaski, Samuel and Acerbi, Luigi , booktitle=

  65. [65]

    International Conference on Artificial Intelligence and Statistics , year=

    Amortized Safe Active Learning for Real-Time Data Acquisition: Pretrained Neural Policies from Simulated Nonparametric Functions , author=. International Conference on Artificial Intelligence and Statistics , year=