pith. machine review for the scientific record

arxiv: 2605.12944 · v1 · submitted 2026-05-13 · 💻 cs.LG · cs.CL

Recognition: 2 theorem links · Lean Theorem

From Instance Selection to Fixed-Pool Data Recipe Search for Supervised Fine-Tuning

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:35 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords supervised fine-tuning · data selection · recipe search · instruction tuning · operator sequences · model training efficiency

The pith

Recipe search over fixed instruction pools finds better supervised fine-tuning data than instance ranking or full-data training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reframes supervised fine-tuning data selection as a search over executable sequences of operators such as filtering, mixing, and deduplication applied to a fixed raw pool, rather than scoring and keeping top-k individual examples. This matters because high-performing training subsets are typically produced by ordered curation steps that jointly shape the data distribution, yet evaluating every possible recipe with full training runs is prohibitively expensive. AutoSelection solves the problem with a two-layer solver that first materializes candidate subsets from cached task, data, and model signals, then uses warmup probes, local edits, and Gaussian-process ranking to refine recipes under a tight budget of full evaluations. On a 90K instruction pool the method delivers the highest in-distribution reasoning average across three base models while also showing gains on out-of-distribution graph reasoning and stable transfer from 1.5B to 7B scales.
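To make the contrast concrete, here is a minimal sketch (not the authors' code; the operators, field names, and thresholds are illustrative assumptions) of instance-level top-k selection versus an executable recipe over a fixed pool:

from typing import Callable

Example = dict                       # one instruction-tuning record
Pool = list                          # the fixed raw pool
Operator = Callable[[Pool], Pool]    # one grounded curation step

def top_k_selection(pool: Pool, score: Callable[[Example], float], k: int) -> Pool:
    """Instance ranking: score each example independently, keep the best k."""
    return sorted(pool, key=score, reverse=True)[:k]

def run_recipe(pool: Pool, steps: list[Operator]) -> Pool:
    """Recipe execution: ordered operators jointly reshape the distribution;
    each step sees the subset produced by the previous one."""
    subset = pool
    for op in steps:
        subset = op(subset)
    return subset

# A hypothetical three-step recipe: quality filter, exact dedup, keep-ratio mix.
recipe: list[Operator] = [
    lambda s: [x for x in s if x.get("quality", 0.0) > 0.5],
    lambda s: list({x["text"]: x for x in s}.values()),
    lambda s: s[: max(1, int(0.8 * len(s)))],
]

The reordering matters: a deduplication step after a filter sees a different distribution than one before it, which is exactly what per-example scoring cannot express.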

Core claim

AutoSelection, a two-layer solver, decouples cheap fixed-pool materialization based on cached signals from expensive full evaluation, using warmup probes, realized subset states, local recipe edits, Gaussian-process-assisted ranking, and stagnation-triggered reseeding. It thereby discovers recipes whose resulting subsets yield stronger in-distribution reasoning performance than full-data training, random recipe search, random top-k, or single-operator selectors.

What carries the argument

Two-layer solver that materializes candidate subsets from cached task-data-model signals for rapid ranking and then refines executable operator sequences with local edits and Gaussian-process assistance under a limited budget of full SFT runs.
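A hedged pseudocode rendering of that loop follows; it is a sketch, not the released AutoSelection implementation, and every helper (run_recipe, full_eval, state_of, propose_edits, gp_rank, sample_recipes) is passed in as an assumed hook.

import math

def autoselection(pool, ops, *, run_recipe, full_eval, state_of,
                  propose_edits, gp_rank, sample_recipes,
                  budget=15, n_warmup=3, patience=3):
    """Search executable recipes under a fixed budget of full SFT runs."""
    history = []                                    # (recipe, state, full score)
    best, best_score, stall = None, -math.inf, 0
    frontier = sample_recipes(ops, n_warmup)

    # Warmup probes: a few full evaluations seed the GP surrogate.
    for recipe in frontier:
        score = full_eval(run_recipe(pool, recipe))
        history.append((recipe, state_of(pool, recipe), score))
        budget -= 1
        if score > best_score:
            best, best_score = recipe, score

    while budget > 0:
        # Layer 1 (cheap): local edits are materialized into realized
        # subset states from cached task/data/model signals, then ranked
        # by the Gaussian-process surrogate fit on history.
        proposals = [e for r in frontier for e in propose_edits(r, ops)]
        states = [state_of(pool, p) for p in proposals]
        chosen = gp_rank(history, proposals, states)[0]

        # Layer 2 (expensive): one full SFT run on the top-ranked proposal.
        score = full_eval(run_recipe(pool, chosen))
        history.append((chosen, state_of(pool, chosen), score))
        budget -= 1

        if score > best_score:
            best, best_score, stall, frontier = chosen, score, 0, [chosen]
        else:
            stall += 1
        if stall >= patience:                       # stagnation-triggered reseed
            frontier, stall = sample_recipes(ops, 1), 0

    return best, best_score

The point of the structure is that Layer 1 can score many candidate edits per full run, so the budget of expensive evaluations concentrates on proposals the surrogate already favors.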

If this is right

  • Recipe structure itself matters for final performance beyond the choice of any single operator, as shown by structural ablations.
  • The discovered recipes transfer across model scales from 1.5B to 7B parameters while preserving the performance ordering.
  • The same fixed-pool recipe approach produces measurable gains on out-of-distribution graph-reasoning tasks in addition to the in-distribution average.
  • High-quality subsets can be obtained without generating, rewriting, or augmenting any new training examples.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The cached-signal layer could be reused across multiple downstream tasks or base models without retraining the ranking model from scratch.
  • Extending the operator library with domain-specific filters would allow the same search machinery to target specialized data distributions such as code or math instruction sets.

Load-bearing premise

Cached task, data, and model signals plus warmup probes can reliably predict which recipes will perform well when the full supervised fine-tuning run is actually executed.
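A minimal sketch of the machinery this premise demands, assuming scikit-learn; the state-vector features (retain ratio, mean task-relevance, distribution drift) and all numbers are illustrative placeholders, not values from the paper:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# State vectors summarizing realized subsets -- hypothetical values.
X_seen = np.array([[0.80, 0.61, 0.10],
                   [0.35, 0.72, 0.25],
                   [0.55, 0.66, 0.15]])
y_seen = np.array([0.412, 0.388, 0.431])      # full-eval reasoning averages

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X_seen, y_seen)

X_new = np.array([[0.60, 0.70, 0.12],         # candidate subset states
                  [0.20, 0.75, 0.40]])
mu, sigma = gp.predict(X_new, return_std=True)

# Upper-confidence-bound ranking: trade off predicted mean and uncertainty.
order = np.argsort(-(mu + 0.5 * sigma))
print("evaluate candidates in order:", order)

If the premise fails, this surrogate ranks noise, and the search degenerates to random recipe sampling at higher bookkeeping cost.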

What would settle it

Running full SFT evaluations on the top recipes returned by AutoSelection and finding that their actual performance is no better than the performance of recipes found by random search under the same evaluation budget.

Figures

Figures reproduced from arXiv: 2605.12944 by Haodong Wu, Jiahao Zhang, Lijie Hu, Yongqi Zhang.

Figure 1: Conceptual contrast between instance-level selection and fixed-pool data recipe search.
Figure 2: AutoSelection as a solver for fixed-pool data recipe search. All candidate recipes operate …
Figure 3: Raw-score and best-so-far curves for three 1.5B AutoSelection runs under the same 15 full …
Figure 4: Trajectory diagnostics for the search-side ablations on the 1.5B setting. Curves show post…
Figure 5: Extended-budget Random Select curve on the 1.5B setting. Shaded steps denote warmup …
Figure 6: Rolling GP fit over the three analyzed runs. Each panel compares saved surrogate predictions …
Figure 7: Run-level search diagnostics. Left: retained-example scale versus validation score. Right: …
Figure 8: Qualitative metric-distribution comparison for nine anonymous evaluated recipes. Each …
Original abstract

Supervised fine-tuning (SFT) data selection is commonly formulated as instance ranking: score each example and retain a top-$k$ subset. However, effective SFT training subsets are often produced through ordered curation recipes, where filtering, mixing, and deduplication operators jointly shape the final data distribution. We formulate this problem as fixed-pool data recipe search: given a raw instruction pool and a library of grounded operators, the goal is to discover an executable recipe that constructs a high-quality selected subset under a limited budget of full SFT evaluations, without generating, rewriting, or augmenting training samples. We introduce AutoSelection, a two-layer solver that decouples fixed-pool materialization based on cached task-, data-, and model-side signals from expensive full evaluation, using warmup probes, realized subset states, local recipe edits, Gaussian-process-assisted ranking, and stagnation-triggered reseeding. Experiments on a 90K instruction pool show that AutoSelection achieves the strongest in-distribution reasoning average across three base models, outperforming full-data training, random recipe search, random top-$k$, and single-operator selectors. Additional out-of-distribution graph-reasoning results, search-stability analyses, structural ablations, and 1.5B-to-7B transfer checks further show that recipe structure matters beyond individual selection operators. Code is available at https://github.com/w253/AutoSelection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper reframes SFT data selection as fixed-pool recipe search over a library of grounded operators (filtering, mixing, deduplication) applied to a 90K instruction pool. AutoSelection is a two-layer solver that materializes subsets via cached task/data/model signals and limited warmup probes, then uses Gaussian-process ranking, local recipe edits, and stagnation reseeding to locate high-quality recipes under a tight budget of full SFT runs. Experiments report that AutoSelection yields the highest in-distribution reasoning average across three base models, outperforming full-data training, random recipe search, random top-k, and single-operator baselines; additional OOD graph-reasoning results, search-stability analyses, structural ablations, and 1.5B-to-7B transfer checks are presented, with code released.

Significance. If the proxy signals and warmup probes are shown to preserve ranking order relative to full SFT performance, the work meaningfully shifts the paradigm from instance-level ranking to structured recipe search, offering a practical route to better data curation with far fewer full evaluations. The provision of code and the empirical breadth across models and tasks are clear strengths that would support adoption if the proxy-to-full correlation is established.

major comments (2)
  1. [§5, search-stability analyses] No direct quantification is given of the correlation between warmup-probe rankings and full SFT metrics across the three base models or reasoning tasks. Because the two-layer solver substitutes these proxies for exhaustive evaluation, this correlation is load-bearing for the claim that AutoSelection locates genuinely superior recipes rather than artifacts of the proxy objective.
  2. [Table 1, in-distribution results] The reported reasoning averages lack statistical significance tests, run-to-run variance, or explicit controls for post-hoc selection of the final recipe; without these, the claim of consistent outperformance over random recipe search and single-operator baselines cannot be fully assessed.
minor comments (1)
  1. [Method] The distinction between cached signals, realized subset states, and the exact form of the Gaussian-process surrogate could be illustrated with a short pseudocode block or diagram to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and indicate the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [§5, search-stability analyses] No direct quantification is given of the correlation between warmup-probe rankings and full SFT metrics across the three base models or reasoning tasks. Because the two-layer solver substitutes these proxies for exhaustive evaluation, this correlation is load-bearing for the claim that AutoSelection locates genuinely superior recipes rather than artifacts of the proxy objective.

    Authors: We acknowledge that a direct quantification of the correlation between warmup-probe rankings and full SFT performance is not provided in the current manuscript. Our search-stability analyses show that the discovered recipes lead to strong performance, but to directly address this concern, we will add in the revised version explicit correlation metrics (such as Spearman rank correlation) computed across the three base models and tasks, using the available probe and full evaluation data from our experiments. This will help confirm that the proxy objective aligns with the true performance. revision: yes

  2. Referee: [Table 1, in-distribution results] The reported reasoning averages lack statistical significance tests, run-to-run variance, or explicit controls for post-hoc selection of the final recipe; without these, the claim of consistent outperformance over random recipe search and single-operator baselines cannot be fully assessed.

    Authors: We agree that including statistical significance and variance would strengthen the results. Due to the high computational cost of full SFT runs, our experiments used single evaluations per recipe within the budget. In the revision, we will report run-to-run variance from additional repeated runs on the top recipes where possible, include p-values from statistical tests comparing AutoSelection to baselines, and add a description of the recipe selection procedure to clarify that it follows the fixed budget and automated process without post-hoc cherry-picking. revision: yes
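Both promised analyses reduce to a few lines once paired scores exist. A minimal sketch with placeholder numbers rather than the paper's data, assuming SciPy:

from scipy.stats import spearmanr, ttest_rel

# (1) Do warmup-probe rankings track full SFT metrics? (response 1)
probe_scores = [0.31, 0.28, 0.35, 0.22, 0.30]    # proxy metric per recipe
full_scores  = [0.41, 0.37, 0.44, 0.33, 0.40]    # full-run reasoning average
rho, p_rho = spearmanr(probe_scores, full_scores)
print(f"Spearman rho={rho:.2f} (p={p_rho:.3f})")

# (2) Is the outperformance statistically significant? (response 2)
# Paired per-benchmark scores for AutoSelection vs. a baseline recipe,
# e.g. over GPQA, GSM8K, BBH, MMLU.
autoselection = [0.44, 0.58, 0.39, 0.51]
baseline      = [0.41, 0.55, 0.38, 0.47]
t, p_t = ttest_rel(autoselection, baseline)
print(f"paired t={t:.2f} (p={p_t:.3f})")

A high rank correlation would support the probes as a stand-in for full evaluation; a small p-value on the paired test would back the outperformance claim.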

Circularity Check

0 steps flagged

No circularity: empirical search method with independent experimental validation

Full rationale

The paper presents AutoSelection as a practical two-layer search algorithm that uses cached signals and limited warmup probes to rank recipes, followed by full SFT evaluations on discovered subsets. All central claims rest on direct empirical comparisons (in-distribution reasoning averages, OOD graph-reasoning, stability analyses, and transfer checks) against explicit baselines such as full-data training, random recipe search, and single-operator selectors. No equations, predictions, or uniqueness claims reduce by construction to fitted parameters or self-citations; the method is algorithmic rather than derivational, and code is released for external reproduction. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that a library of grounded operators can construct high-quality subsets and that limited full evaluations suffice when guided by cached signals.

axioms (1)
  • domain assumption A library of grounded operators (filtering, mixing, deduplication) is sufficient to shape high-quality SFT data distributions
    Invoked in the problem formulation as the basis for recipe search.
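Fragments of the paper's appendix visible in the extraction show how this axiom is operationalized: a recipe is serialized as a raw JSON list of catalog operators with bounded parameters. Rendered as a Python literal (the two operators and fractions are the ones shown in the fragments, not an exhaustive catalog):

# Recipe serialization recovered from the appendix prompt templates.
recipe = [
    {"operator": "mona_filter",   "params": {"fraction": 0.5}},
    {"operator": "ngram_entropy", "params": {"fraction": 0.4}},
]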

pith-pipeline@v0.9.0 · 5557 in / 1172 out tokens · 38910 ms · 2026-05-14T20:35:22.139565+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

74 extracted references · 18 canonical work pages · 6 internal anchors

  1. [1]

    Lima: Less is more for alignment. Advances in Neural Information Processing Systems, 36:55006–55021, 2023

    Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment. Advances in Neural Information Processing Systems, 36:55006–55021, 2023

  2. [2]

    What makes good data for alignment? a comprehensive study of automatic data selection in instruction tuning

    Wei Liu, Weihao Zeng, Keqing He, Yong Jiang, and Junxian He. What makes good data for alignment? a comprehensive study of automatic data selection in instruction tuning. InThe Twelfth International Conference on Learning Representations

  3. [3]

    Data-juicer: A one-stop data processing system for large language models

    Daoyuan Chen, Yilun Huang, Zhijian Ma, Hesen Chen, Xuchen Pan, Ce Ge, Dawei Gao, Yuexiang Xie, Zhaoyang Liu, Jinyang Gao, et al. Data-juicer: A one-stop data processing system for large language models. In Companion of the 2024 International Conference on Management of Data, pages 120–134, 2024

  4. [4]

    LESS: Selecting influential data for targeted instruction tuning

    Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen. LESS: Selecting influential data for targeted instruction tuning. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=PG5fV50maR

  5. [5]

    Task-specific data selection for instruction tuning via monosemantic neuronal activations

    Da Ma, Gonghu Shang, Zhi Chen, Libo Qin, Yijie Luo, Hongshen Xu, Lei Pan, Shuai Fan, Kai Yu, and Lu Chen. Task-specific data selection for instruction tuning via monosemantic neuronal activations. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

  6. [6]

    Lead: iterative data selection for efficient llm instruction tuning. Proceedings of the VLDB Endowment, 19(3):426–439, 2025

    Xiaotian Lin, Yanlin Qi, Yizhang Zhu, Themis Palpanas, Chengliang Chai, Nan Tang, and Yuyu Luo. Lead: iterative data selection for efficient llm instruction tuning. Proceedings of the VLDB Endowment, 19(3):426–439, 2025

  7. [7]

    Datachef: Cooking up optimal data recipes for llm adaptation via reinforcement learning. arXiv preprint arXiv:2602.11089, 2026

    Yicheng Chen, Zerun Ma, Xinchen Xie, Yining Li, and Kai Chen. Datachef: Cooking up optimal data recipes for llm adaptation via reinforcement learning. arXiv preprint arXiv:2602.11089, 2026

  8. [8]

    LLM-AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning

    Wei Huang, Anda Cheng, Yinggui Wang, Lei Wang, and Tao Wei. LLM-AutoDP: Automatic data processing via LLM agents for model fine-tuning. arXiv preprint arXiv:2601.20375, 2026

  9. [9]

    Evaluating data influence in meta learning. arXiv preprint arXiv:2501.15963, 2025

    Chenyang Ren, Huanyi Xie, Shu Yang, Meng Ding, Lijie Hu, and Di Wang. Evaluating data influence in meta learning. arXiv preprint arXiv:2501.15963, 2025

  10. [10]

    From quantity to quality: Boosting LLM performance with self-guided data selection for instruction tuning

    Ming Li, Yong Zhang, Zhitao Li, Jiuhai Chen, Lichang Chen, Ning Cheng, Jianzong Wang, Tianyi Zhou, and Jing Xiao. From quantity to quality: Boosting LLM performance with self-guided data selection for instruction tuning. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech...

  11. [11]

    Deduplicating training data makes language models better

    Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. Deduplicating training data makes language models better. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8424–8445, 2022

  12. [12]

    Amro Abbas, Kushal Tirumala, Dániel Simig, Surya Ganguli, and Ari S. Morcos. SemDeDup: Data-efficient learning at web-scale through semantic deduplication. arXiv preprint arXiv:2303.09540, 2023

  13. [13]

    Data diversity matters for robust instruction tuning

    Alexander Bukharin, Shiyang Li, Zhengyang Wang, Jingfeng Yang, Bing Yin, Xian Li, Chao Zhang, Tuo Zhao, and Haoming Jiang. Data diversity matters for robust instruction tuning. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 3411–3425, 2024

  14. [14]

    Automatic configuration of llm post-training pipelines

    Channe Chwa, Xinle Wu, and Yao Lu. Automatic configuration of llm post-training pipelines. arXiv preprint arXiv:2603.18773, 2026

  15. [15]

    Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(2), 2012

    James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(2), 2012

  16. [16]

    Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems, 25, 2012

    Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems, 25, 2012

  17. [17]

    Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18(185):1–52, 2018

    Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18(185):1–52, 2018. URL http://jmlr.org/papers/v18/16-558.html

  18. [18]

    BOHB: Robust and efficient hyperparameter optimization at scale

    Stefan Falkner, Aaron Klein, and Frank Hutter. BOHB: Robust and efficient hyperparameter optimization at scale. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1437–1446. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/...

  19. [19]

    Efficient and robust automated machine learning

    Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. Efficient and robust automated machine learning. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/p...

  20. [20]

    Limit: Less is more for instruction tuning across evaluation paradigms

    Aditi Jha, Sam Havens, Jeremy Dohmann, Alexander Trott, and Jacob Portes. Limit: Less is more for instruction tuning across evaluation paradigms. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following

  21. [22]

    Openhermes 2.5: An open dataset of synthetic data for generalist llm assistants, 2023

    Teknium. Openhermes 2.5: An open dataset of synthetic data for generalist llm assistants, 2023. URL https://huggingface.co/datasets/teknium/OpenHermes-2.5

  22. [23]

    Self-instruct: Aligning language models with self-generated instructions

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pages 13484–13508, 2023

  23. [24]

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof Q&A benchmark. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=Ti67584b98

  24. [25]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  25. [26]

    Beyond the imitation game: Quantifying and extrapolating the capabilities of language models

    Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=uyTL5Bvosj. Featured Certification

  26. [27]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ

  27. [28]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, et al. Qwen2.5 technical report, 2025. URL https://arxiv.org/abs/2412.15115

  28. [29]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, et al. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783

  29. [30]

    Graphwiz: An instruction-following language model for graph computational problems

    Nuo Chen, Yuhan Li, Jianheng Tang, and Jia Li. Graphwiz: An instruction-following language model for graph computational problems. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 353–364, 2024

  30. [31]

    Can language models solve graph problems in natural language? Advances in Neural Information Processing Systems, 36:30840–30861, 2023

    Heng Wang, Shangbin Feng, Tianxing He, Zhaoxuan Tan, Xiaochuang Han, and Yulia Tsvetkov. Can language models solve graph problems in natural language? Advances in Neural Information Processing Systems, 36:30840–30861, 2023

  31. [32]

    Optimal lossless data compression: Non-asymptotics and asymptotics. IEEE Transactions on Information Theory, 60(2):777–795, 2014

    Ioannis Kontoyiannis and Sergio Verdú. Optimal lossless data compression: Non-asymptotics and asymptotics. IEEE Transactions on Information Theory, 60(2):777–795, 2014. doi: 10.1109/TIT.2013.2291007

  32. [33]

    A new generalized varentropy and its properties

    Saeid Maadani, Gholam Reza Mohtashami Borzadaran, and Abdolhamid Rezaei Roknabadi. A new generalized varentropy and its properties. 2020. URL https://api.semanticscholar.org/CorpusID:225604868

  33. [34]

    Entropy-gated branching for efficient test-time reasoning

    Xianzhi Li, Ethan Callanan, Abdellah Ghassel, and Xiaodan Zhu. Entropy-gated branching for efficient test-time reasoning. In Vera Demberg, Kentaro Inui, and Lluís Marquez, editors, Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5054–5069, Rabat, Morocco, March 20...

  34. [35]

    Rethinking Token-Level Credit Assignment in RLVR: A Polarity-Entropy Analysis

    Yuhang He, Haodong Wu, Siyi Liu, Hongyu Ge, Hange Zhou, Keyi Wu, Zhuo Zheng, Qihong Lin, Zixin Zhong, and Yongqi Zhang. Rethinking token-level credit assignment in RLVR: A polarity-entropy analysis. arXiv preprint arXiv:2604.11056, 2026

  35. [36]

    Claude E. Shannon. A mathematical theory of communication. Bell Syst. Tech. J., 27:623–656, 1948

  36. [37]

    URL https://api.semanticscholar.org/CorpusID:55379485

  37. [38]

    The best of both worlds: Bridging quality and diversity in data selection with bipartite graph

    Minghao Wu, Thuy-Trang Vu, Lizhen Qu, and Gholamreza Haffari. The best of both worlds: Bridging quality and diversity in data selection with bipartite graph. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=nCoaJYNCcg

  38. [39]

    A preliminary study of the intrinsic relationship between complexity and alignment

    Yingxiu Zhao, Bowen Yu, Binyuan Hui, Haiyang Yu, Fei Huang, Yongbin Li, and Nevin L Zhang. A preliminary study of the intrinsic relationship between complexity and alignment. arXiv preprint arXiv:2308.05696, 2023

  39. [40]

    Chasing random: Instruction selection strategies fail to generalize

    Harshita Diddee and Daphne Ippolito. Chasing random: Instruction selection strategies fail to generalize. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 1943–1957, 2025

  40. [41]

    Large-scale data selection for instruction tuning. arXiv preprint arXiv:2503.01807, 2025

    Hamish Ivison, Muru Zhang, Faeze Brahman, Pang Wei Koh, and Pradeep Dasigi. Large-scale data selection for instruction tuning. arXiv preprint arXiv:2503.01807, 2025

  41. [42]

    A critical look at targeted instruction selection: Disentangling what matters (and what doesn’t)

    Nihal V Nayak, Paula Rodriguez-Diaz, Neha Hulkund, Sara Beery, and David Alvarez-Melis. A critical look at targeted instruction selection: Disentangling what matters (and what doesn’t). arXiv preprint arXiv:2602.14696, 2026

  42. [43]

    Smaller language models are capable of selecting instruction-tuning training data for larger language models

    Dheeraj Mekala, Alex Nguyen, and Jingbo Shang. Smaller language models are capable of selecting instruction-tuning training data for larger language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024, pages 10456–10470, Bangkok, Thailand, August 2024. Association for Comput...

  43. [44]

    Smalltolarge (s2l): Scalable data selection for fine-tuning large language models by summarizing training trajectories of small models

    Yu Yang, Siddhartha Mishra, Jeffrey Chiang, and Baharan Mirzasoleiman. Smalltolarge (s2l): Scalable data selection for fine-tuning large language models by summarizing training trajectories of small models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, v...

  44. [45]

    Doremi: Optimizing data mixtures speeds up language model pretraining

    Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy S Liang, Quoc V Le, Tengyu Ma, and Adams Wei Yu. Doremi: Optimizing data mixtures speeds up language model pretraining. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 69798–6...

  45. [46]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  46. [47]

    Sparse autoencoders find highly interpretable features in language models

    Robert Huben, Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=F76bwRSLeK

    Respond with your reasoning, then provide the letter of the correct answer after "Answer:" on the last line. Example format: [Your reasoning] Answer: B Few-shot turns: User: Question: What is the capital of France? A. London B. Berlin C. Paris D. Madrid Assistant: Paris is the capital and largest city of France. Answer: C User: Question: Which planet is k...