BLADE: Scalable Bi-level Adaptive Data Selection for LLM Training

Chao Liu; Deping Xiang; Guoqiang Gong; Jiaxing Wang; Jin Xu; Jun Fang; Ke Zhang; Pengzhang Liu; Qixia Jiang; Tongxuan Liu

arxiv: 2606.18650 · v1 · pith:YGVXBHAGnew · submitted 2026-06-17 · 💻 cs.LG

Jiaxing Wang, Deping Xiang, Jin Xu, Zirui Liu, Zicheng Zhang, Guoqiang Gong, Jun Fang, Chao Liu, Pengzhang Liu, Tongxuan Liu, Ke Zhang, Qixia Jiang

Pith reviewed 2026-06-26 21:50 UTC · model grok-4.3

classification 💻 cs.LG

keywords data selection · LLM training · bi-level optimization · Lagrange multipliers · excess loss · Hessian-free methods · Frank-Wolfe algorithm · adaptive selection

The pith

BLADE reformulates bi-level data selection for LLMs as a penalized single-level objective that uses a dynamic reference model instead of a static one.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the core tension in scaling data selection for trillion-token LLM training: influence-based methods rest on sound bi-level objectives but demand intractable inverse-Hessian steps, while excess-loss methods run cheaply yet rely on a fixed reference model that drifts out of alignment with the model being trained. BLADE applies Lagrange multipliers to convert the bi-level problem into an equivalent penalized single-level objective. This move removes the Hessian requirement, recovers an excess-loss expression, and substitutes a dynamic reference that evolves in lockstep with training. The authors prove the penalized form still guarantees first-order convergence and realize the selection step as a memoryless randomized block-coordinate Frank-Wolfe procedure suitable for online batches. If the reformulation holds, training pipelines can perform principled, adaptive data filtering at scale without the usual computational roadblocks.

Core claim

BLADE reformulates the bi-level optimization problem underlying influence-based methods as a penalized single-level objective via Lagrange multipliers, avoiding inverse-Hessian computation while revealing a principled connection to excess-loss based data selection. The resulting objective recovers an excess-loss form but replaces the static reference model with a dynamic one that stays synchronized with training. Theoretically, the penalized formulation guarantees first-order convergence. For efficient online batch selection the method is instantiated as a memoryless randomized block-coordinate Frank-Wolfe algorithm.
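One way to picture the reformulation, written as a hedged sketch rather than the paper's exact construction: the bi-level problem is turned into a single-level one by a value-function penalty with multiplier $\lambda$, a form assumed here from standard penalty methods for bilevel optimization. The symbols are illustrative: $\alpha$ for the selection weights, $u$ for the proxy model, $w$ for the reference model, $\ell_i$ for a per-sample loss, and the weighted-sum form of $L_{\mathrm{train}}$ is an assumption.

```latex
\begin{aligned}
\text{bi-level:}\quad
  & \min_{\alpha \in \mathcal{A}} \; L_{\mathrm{val}}\bigl(w^*(\alpha)\bigr)
    \quad \text{s.t.} \quad w^*(\alpha) \in \arg\min_{w} L_{\mathrm{train}}(\alpha, w), \\
\text{penalized:}\quad
  & \min_{\alpha \in \mathcal{A},\, u} \; L_{\mathrm{val}}(u)
    + \lambda \Bigl( L_{\mathrm{train}}(\alpha, u) - \min_{w} L_{\mathrm{train}}(\alpha, w) \Bigr), \\
\text{if } L_{\mathrm{train}}(\alpha, w) = \textstyle\sum_i \alpha_i \ell_i(w):\quad
  & L_{\mathrm{train}}(\alpha, u) - L_{\mathrm{train}}(\alpha, w)
    = \sum_i \alpha_i \bigl( \ell_i(u) - \ell_i(w) \bigr).
\end{aligned}
```

Under this reading, the last line is the per-sample excess loss between proxy and reference, and the inner minimization over $w$ is what keeps the reference moving with training instead of staying fixed.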

What carries the argument

The Lagrange-multiplier penalized single-level reformulation of the bi-level influence objective, which produces a dynamic reference model synchronized to the training process.

If this is right

The approach eliminates the need for inverse-Hessian computations while retaining a connection to excess-loss selection.
The dynamic reference model remains aligned with the evolving proxy model throughout training.
First-order convergence is guaranteed for the reformulated objective.
The selection procedure can be executed online via a memoryless randomized block-coordinate Frank-Wolfe algorithm.
Extensive experiments show consistent gains over prior data-selection baselines.
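The Frank-Wolfe item above can be made concrete with a minimal sketch. It assumes the selection weights for one batch live in the set {a : 0 <= a_i <= 1, sum(a) = k}, so the linear minimization oracle simply picks the k entries with the smallest gradient. The function names, the feasible set, and the toy gradient are hypothetical, and this is one plausible reading of a block-coordinate step, not the paper's algorithm.

```python
from typing import List


def lmo_top_k(grad: List[float], k: int) -> List[float]:
    """Linear minimization oracle over {a : 0 <= a_i <= 1, sum(a) = k}.

    The minimizer of <grad, a> puts weight 1 on the k smallest gradient entries.
    """
    order = sorted(range(len(grad)), key=lambda i: grad[i])
    vertex = [0.0] * len(grad)
    for i in order[:k]:
        vertex[i] = 1.0
    return vertex


def frank_wolfe_block_step(alpha: List[float], grad: List[float],
                           k: int, step: float) -> List[float]:
    """One Frank-Wolfe update on a single block (batch).

    The new iterate is a convex combination of the current weights and the
    LMO vertex, so it stays feasible without any projection.
    """
    vertex = lmo_top_k(grad, k)
    return [(1.0 - step) * a + step * v for a, v in zip(alpha, vertex)]


# Hypothetical batch of 5 samples; grad[i] stands for d(objective)/d(alpha_i).
grad = [0.3, -1.2, 0.8, -0.5, 0.1]
alpha0 = [0.4] * 5  # feasible start: entries in [0, 1], sum = 2
alpha1 = frank_wolfe_block_step(alpha0, grad, k=2, step=1.0)
print(alpha1)
```

With a step size of 1 the update discards the previous weights entirely and reduces to a top-k pick on the current batch, which is one way a selection rule can avoid carrying per-sample state between batches.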

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dynamic reference may allow selection decisions to track shifts in what the model currently finds hard, opening the possibility of more responsive training trajectories than static filters permit.
The same penalized reformulation could be tried on other bi-level problems in machine learning that currently stall on Hessian terms.
Because the algorithm is memoryless, the method may integrate into existing large-scale training loops with little extra state management.
If the dynamic reference improves selection quality, one testable extension is to measure whether downstream model performance on held-out benchmarks rises more steeply than with static excess-loss baselines.

Load-bearing premise

The penalized single-level formulation preserves enough of the original bi-level semantics to guarantee first-order convergence and keeps the dynamic reference model properly synchronized without introducing instabilities during online training.

What would settle it

A controlled small-scale run in which the data batches chosen by the penalized objective measurably diverge from those chosen by the exact bi-level solution (computed directly on a toy model where both are tractable) would falsify the claim that the reformulation preserves the intended semantics.
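A toy version of such a check can be sketched in a few lines. The setup is entirely hypothetical: scalar parameters, a quadratic training loss 0.5*(w - a_i)^2 for the single selected sample a_i, a quadratic validation loss 0.5*(w - b)^2, and a value-function penalty with weight `lam`. It only illustrates what "compare the two selections where both are tractable" could look like, not an experiment from the paper.

```python
def bilevel_value(a_i: float, b: float) -> float:
    """Exact bi-level objective when only sample a_i is selected.

    The lower level argmin_w 0.5*(w - a_i)^2 equals a_i, so the validation
    loss at the lower-level solution is 0.5*(a_i - b)^2.
    """
    return 0.5 * (a_i - b) ** 2


def penalized_value(a_i: float, b: float, lam: float) -> float:
    """min over u of 0.5*(u - b)^2 + lam * 0.5*(u - a_i)^2.

    The lower-level minimum is 0 in this toy, so the penalty is just the
    training loss at u. The minimizer over u is available in closed form.
    """
    u_star = (b + lam * a_i) / (1.0 + lam)
    return 0.5 * (u_star - b) ** 2 + lam * 0.5 * (u_star - a_i) ** 2


samples = [2.0, -1.0, 0.5, 3.0]  # hypothetical 1-D training points
b, lam = 0.0, 10.0               # hypothetical validation target and penalty weight

exact = [bilevel_value(a, b) for a in samples]
penal = [penalized_value(a, b, lam) for a in samples]
rank_exact = sorted(range(len(samples)), key=lambda i: exact[i])
rank_penal = sorted(range(len(samples)), key=lambda i: penal[i])
print(rank_exact == rank_penal)
```

In this toy the penalized value works out to lam / (1 + lam) times the exact one, so the two rankings coincide. A falsifying result would be a setting where the rankings measurably differ.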

Figures

Figures reproduced from arXiv: 2606.18650 by Chao Liu, Deping Xiang, Guoqiang Gong, Jiaxing Wang, Jin Xu, Jun Fang, Ke Zhang, Pengzhang Liu, Qixia Jiang, Tongxuan Liu, Zicheng Zhang, Zirui Liu.

**Figure 2.** (a) The mean Llama2-70B oracle loss on the selected tokens. (b) The averaged accuracy

**Figure 3.** Token-level score (∆) distributions at steps 1K, 2K, and 3K. While RHO-1’s scores collapse toward non-positive values due to its static reference, BLADE’s dynamic update maintains a pronounced right tail (∆ > 0). This preserves BLADE’s discriminative power to consistently select tokens with positive marginal utility.

**Figure 4.** Trajectory of Dist(u, w). BLADE maintains the structural constraint of the bi-level objective (Equation (2)) through periodic synchronization, whereas RHO-1 exhibits inevitable divergence.

The Effect of Synchronizing u and w. One obvious characteristic of BLADE is the periodic synchronization of the proxy model u and the reference model w ($w^{(e)}_0 = u^{(e\tau)}$). We investigate this by tracking the parameter di…
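The synchronization described in the Figure 4 caption can be illustrated with a small sketch. Everything in it is an assumption made for illustration: `dist` is taken to be the Euclidean distance between parameter vectors, the proxy update is a fixed drift standing in for a training step, and the reference is held frozen between synchronizations, which simplifies whatever update the paper applies to w.

```python
import math
from typing import List


def dist(u: List[float], w: List[float]) -> float:
    """Euclidean distance between two parameter vectors (assumed definition of Dist)."""
    return math.sqrt(sum((ui - wi) ** 2 for ui, wi in zip(u, w)))


def run(steps: int, tau: int, sync: bool) -> List[float]:
    """Track Dist(u, w) over a toy run, with or without periodic synchronization."""
    u = [0.0, 0.0]  # proxy model parameters (toy)
    w = list(u)     # reference model parameters (toy)
    trace = []
    for t in range(1, steps + 1):
        u = [ui + 0.1 for ui in u]  # stand-in for one proxy training update
        if sync and t % tau == 0:
            w = list(u)             # periodic synchronization: w <- u
        trace.append(dist(u, w))
    return trace


with_sync = run(steps=12, tau=4, sync=True)
without_sync = run(steps=12, tau=4, sync=False)
print(max(with_sync), max(without_sync))
```

In this toy the synchronized trace resets to zero every `tau` steps and stays bounded, while the unsynchronized one grows with the number of steps, which is the qualitative contrast the caption attributes to BLADE versus a static reference.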

Original abstract

As Large Language Model (LLM) datasets scale to trillions of tokens, data selection has emerged as a critical frontier to filter out uninformative noise and construct adaptive learning trajectories. Beyond static heuristic filtering, advanced data selection methods for LLM training largely follow two paradigms, each with fundamental limitations. Influence-based methods provide principled bi-level objectives but require intractable inverse-Hessian computations, while excess-loss methods are computationally efficient but rely on a static reference model that becomes misaligned with the evolving proxy model during training. We propose BLADE (Bi-Level Adaptive Data sElection), a Hessian-free framework for data selection. BLADE reformulates the bi-level optimization problem underlying influence-based methods as a penalized single-level objective via Lagrange multipliers, avoiding inverse-Hessian computation while revealing a principled connection to excess-loss based data selection. The resulting objective recovers an excess-loss form but replaces the static reference model with a dynamic one that stays synchronized with training. Theoretically, we prove that this penalized formulation guarantees first-order convergence. For efficient online batch selection, we instantiate BLADE as a memoryless randomized block-coordinate Frank-Wolfe algorithm. Extensive experiments show that BLADE consistently outperforms state-of-the-art data selection baselines, providing a practical recipe for LLM training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BLADE's Lagrange reformulation of bi-level influence selection into a dynamic excess-loss objective is the main new move, but the claimed first-order convergence needs the full derivation to confirm it holds without extra assumptions on the reference model.

The core claim is that BLADE converts the bi-level influence problem into a single-level penalized objective with Lagrange multipliers. This produces an excess-loss style selector but with a reference model that updates alongside training, avoiding the Hessian while keeping the reference aligned.
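As a rough illustration of an excess-loss style selector, the sketch below scores each token by the difference between its loss under the proxy and its loss under the reference, then keeps the top fraction. The sign convention (proxy minus reference, higher is better), the function names, and the loss values are assumptions for illustration. In a dynamic-reference scheme the `reference` losses would be recomputed as the reference model is refreshed, rather than cached once.

```python
from typing import List


def excess_loss_scores(proxy_losses: List[float],
                       reference_losses: List[float]) -> List[float]:
    """Per-token score: loss under the proxy minus loss under the reference."""
    return [lu - lw for lu, lw in zip(proxy_losses, reference_losses)]


def select_top_fraction(scores: List[float], fraction: float) -> List[int]:
    """Indices of the highest-scoring tokens, keeping int(fraction * n), at least one."""
    k = max(1, int(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Hypothetical per-token losses for one batch of 6 tokens.
proxy = [2.1, 0.4, 3.0, 1.2, 0.9, 2.5]
reference = [1.0, 0.5, 2.9, 0.2, 1.1, 1.0]

delta = excess_loss_scores(proxy, reference)
chosen = select_top_fraction(delta, fraction=0.5)
print(chosen)
```

Tokens with a non-positive score are ones the reference does not handle better than the proxy. If the reference never changes, more and more tokens can end up in that group as the proxy improves, which is the misalignment the letter refers to.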

The reformulation itself is the clearest new element. It directly addresses the static-reference misalignment in excess-loss methods and the compute cost in influence methods. The paper also supplies a convergence proof for the penalized form and turns the method into a memoryless randomized block-coordinate Frank-Wolfe routine for online batch selection. Experiments reportedly beat existing baselines on LLM training tasks.

The approach is practical for anyone scaling data selection to trillions of tokens. The dynamic reference is a reasonable fix for the alignment problem, and the Frank-Wolfe instantiation looks lightweight enough for real runs.

The open question is whether the penalty term fully recovers the original bi-level optimality condition and first-order convergence once the reference model is updated online. The abstract states the proof exists, but the precise multiplier construction and any synchronization assumptions matter. If those steps introduce lag or require extra conditions not stated up front, the guarantee weakens. Minor implementation details around the memoryless property would also be worth checking.

This is worth sending to a serious referee. The target problem is real, the proposed bridge between paradigms is explicit, and the work includes both theory and experiments. Readers working on efficient LLM training or bi-level optimization would get direct value from the reformulation even if later revisions are needed.

Referee Report

2 major / 2 minor

Summary. The paper claims that BLADE reformulates the bi-level optimization underlying influence-based data selection as a penalized single-level objective via Lagrange multipliers. This avoids inverse-Hessian computation, recovers an excess-loss form with a dynamic (training-synchronized) reference model instead of a static one, guarantees first-order convergence, and is instantiated as a memoryless randomized block-coordinate Frank-Wolfe algorithm for online batch selection. Experiments are reported to show consistent outperformance over state-of-the-art baselines on LLM training.

Significance. If the penalized reformulation exactly recovers bi-level semantics and the convergence analysis holds without additional instabilities from the dynamic reference, the work would usefully bridge influence-based and excess-loss paradigms while providing a scalable, Hessian-free method for data selection at trillion-token scale. The explicit connection to excess-loss forms and the memoryless Frank-Wolfe instantiation are notable strengths if substantiated.

major comments (2)

[§3] The abstract and §3 assert that the Lagrange-penalized single-level objective recovers the original bi-level solution and excess-loss form; the precise multiplier construction, the conditions under which equivalence holds (especially with the dynamic reference), and any hidden assumptions on synchronization must be stated explicitly to support the first-order convergence claim.
[§4] §4 states a first-order convergence guarantee for the memoryless randomized block-coordinate Frank-Wolfe instantiation; the proof must address whether online updates to the dynamic reference model can introduce lag or violate the assumptions needed for the guarantee to hold.

minor comments (2)

[§5] The experimental section should include more detail on baseline implementations, hyperparameter choices, and statistical significance of the reported gains to allow direct replication.
Notation for the dynamic reference model and its update rule should be introduced earlier and kept consistent across the method and theory sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive suggestions. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of the equivalence conditions and convergence analysis.

Referee: [§3] The abstract and §3 assert that the Lagrange-penalized single-level objective recovers the original bi-level solution and excess-loss form; the precise multiplier construction, the conditions under which equivalence holds (especially with the dynamic reference), and any hidden assumptions on synchronization must be stated explicitly to support the first-order convergence claim.

Authors: We agree that greater explicitness is warranted. The Lagrange multiplier is the dual variable associated with the bi-level constraint; equivalence to the excess-loss form holds precisely when the reference model is updated synchronously with the proxy model at each iteration. We will insert a dedicated paragraph in §3 stating the multiplier construction, the exact equivalence conditions, and the synchronization assumption required for the first-order convergence result. Revision: yes.

Referee: [§4] §4 states a first-order convergence guarantee for the memoryless randomized block-coordinate Frank-Wolfe instantiation; the proof must address whether online updates to the dynamic reference model can introduce lag or violate the assumptions needed for the guarantee to hold.

Authors: The memoryless block-coordinate updates are constructed so that the reference model remains synchronized by design, preventing lag from accumulating. Nevertheless, to directly address the concern, we will augment the proof in §4 with a short lemma bounding any residual lag and confirming that the first-order rate is preserved under the online schedule. Revision: yes.

Circularity Check

0 steps flagged

No circularity: standard Lagrange reformulation with independent convergence claim

full rationale

The provided abstract and context describe a reformulation of bi-level optimization into a penalized single-level objective using Lagrange multipliers, followed by a claimed proof of first-order convergence and an online Frank-Wolfe instantiation. This is a standard mathematical technique for constrained optimization and does not reduce to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. The dynamic reference model is presented as synchronized by construction of the online algorithm, but no equations or steps in the given text exhibit a reduction where the output is forced by the inputs. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are identifiable from the provided text.

A. Convergence of Bi-level Data Selection

Here we present a theoretical analysis of the proposed BLADE. The original bi-level optimization problem $P$:

$\min_{\alpha \in \mathcal{A}} L_{\mathrm{val}}(w^*(\alpha))$ s.t. $w^*(\alpha) \in S^*(\alpha) :=$ arg m...

Both $\nabla_w L_{\mathrm{train}}(\alpha, w)$ and $\nabla_w L_{\mathrm{val}}(w)$ are Lipschitz continuous in $w$ with coefficient $L$.

Both $L_{\mathrm{train}}(\alpha, w)$ and $L_{\mathrm{val}}(w)$ are Lipschitz continuous in $w$ with coefficient $B$.

Assumption 3 (Bounded Hessian). For any $\alpha \in \mathcal{A}$, $w \in S^*(\alpha)$, there exist positive constants $\kappa$, $\rho$ such that the Hessian matrices satisfy $\nabla^2_{ww} L_{\mathrm{train}}(\alpha, w) \succeq \kappa$ and $\nabla^2_{\alpha w} L_{\mathrm{train}}(\alpha, w) \preceq \rho$.

Assumption 4 (Lipschitz Hessian). For any $\alpha \in \mathcal{A}$, $L_{\mathrm{train}}(\alpha, w)$ is twice continuously differentiable, and the Hessian matric...

