
arxiv: 2605.30337 · v1 · pith:E24UWKTDnew · submitted 2026-05-28 · 💻 cs.LG

Efficient Test-Time Finetuning of LLMs via Convex Reconstruction and Gradient Caching

Pith reviewed 2026-06-29 08:56 UTC · model grok-4.3

classification 💻 cs.LG
keywords: test-time finetuning, convex optimization, Frank-Wolfe, gradient caching, LLM adaptation, efficiency, integerization, retrieval

The pith

HullFT represents each query embedding as a sparse convex combination of training sequences, converts the weights to integer multiplicities, and reuses gradients on repeats to speed up test-time finetuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HullFT to solve the speed and quality bottlenecks in test-time finetuning of language models. It expresses a query embedding as a sparse convex combination of a few training sequences via Frank-Wolfe optimization, yielding a relevant and diverse support set. Fractional weights are then turned into an exact integer multiset through geometric integerization, which creates repeated examples. These repeats are exploited with gradient reuse to amortize forward-backward passes during finetuning. Experiments show this produces lower bits-per-byte at substantially lower total runtime than prior methods.

Core claim

HullFT addresses both the selection cost and the update cost of test-time finetuning in three steps. First, it represents the query embedding as a sparse convex combination of training sequences, using efficient projection-free Frank-Wolfe optimization. Second, it converts the fractional convex weights into an exact integer multiset for finetuning through a geometric integerization procedure. Third, it exploits the resulting repeated examples with gradient reuse, amortizing forward-backward computation across repeated finetuning steps.

What carries the argument

Frank-Wolfe optimization to obtain a sparse convex support set for the query embedding, followed by geometric integerization that turns fractional weights into exact integer multiplicities for gradient caching.
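
The convex-reconstruction step can be sketched as follows. This is a minimal illustration, assuming a squared-error objective over the probability simplex, a start at the nearest candidate, and exact line search. The function name `frank_wolfe_support`, the tolerance `eps`, and the iteration cap are illustrative choices, not details taken from the paper.

```python
import numpy as np

def frank_wolfe_support(X, q, max_iters=50, eps=1e-8):
    """Approximate q by a convex combination of the rows of X.

    Minimizes 0.5 * ||X.T @ w - q||^2 over the probability simplex and
    returns the weight vector w (nonnegative, summing to 1).
    Hypothetical helper: the objective and stopping rule are assumptions.
    """
    n = X.shape[0]
    w = np.zeros(n)
    # Start at the single candidate closest to the query.
    w[np.argmin(np.linalg.norm(X - q, axis=1))] = 1.0
    for _ in range(max_iters):
        residual = X.T @ w - q        # reconstruction error, shape (d,)
        grad = X @ residual           # gradient with respect to w, shape (n,)
        i = int(np.argmin(grad))      # linear minimization oracle: best simplex vertex
        direction = -w.copy()
        direction[i] += 1.0           # e_i - w
        gap = -grad @ direction       # Frank-Wolfe duality gap
        if gap <= eps:
            break
        step_vec = X.T @ direction
        denom = step_vec @ step_vec
        if denom == 0.0:
            break
        gamma = min(1.0, gap / denom) # exact line search for the quadratic objective
        w += gamma * direction
    return w
```

Each iteration adds at most one new candidate to the support, so stopping after t iterations leaves at most t + 1 nonzero weights. That is the source of sparsity in this kind of solver.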

If this is right

  • The method achieves lower bits-per-byte than current state-of-the-art TTFT approaches.
  • Total runtime per query drops substantially while maintaining or improving adaptation quality.
  • The support set obtained from convex reconstruction is inherently relevant and diverse without extra diversity heuristics.
  • Gradient reuse on repeated examples amortizes forward-backward cost across multiple finetuning steps.
  • The overall pipeline removes the need to trade speed for quality in per-query adaptation.
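
The gradient-reuse idea in the list above can be illustrated with a toy loop. This is a sketch under stated assumptions: the page does not specify how the paper schedules reuse, so the policy here (compute a gradient once, then apply it for up to `reuse_depth` consecutive steps on the same repeated example) is hypothetical, as are the names `finetune_with_gradient_reuse`, `grad_fn`, and `reuse_depth`.

```python
import numpy as np

def finetune_with_gradient_reuse(theta, examples, counts, grad_fn, lr=0.1, reuse_depth=2):
    """Take one SGD step per multiset element while recomputing gradients sparingly.

    examples: list of training examples; counts[i] is the multiplicity of examples[i].
    grad_fn(theta, example): returns the loss gradient (stands in for the
        expensive forward-backward pass).
    reuse_depth: how many consecutive steps may share one computed gradient
        (assumed knob, not taken from the paper).
    Returns the updated parameters and the number of grad_fn calls.
    """
    theta = np.array(theta, dtype=float)    # copy so the caller's array is untouched
    grad_calls = 0
    for example, count in zip(examples, counts):
        remaining = count
        while remaining > 0:
            grad = grad_fn(theta, example)  # one forward-backward pass
            grad_calls += 1
            steps = min(reuse_depth, remaining)
            for _ in range(steps):
                theta -= lr * grad          # reuse the cached gradient
            remaining -= steps
    return theta, grad_calls
```

With `reuse_depth=1` the loop reduces to plain SGD over the multiset. Larger values trade gradient staleness for fewer forward-backward passes, which is the quality-versus-runtime tension the claims above depend on.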

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If integerization reliably preserves the convex set properties, similar reconstruction steps could replace heuristic retrieval in other per-example adaptation tasks.
  • The approach opens the possibility of running test-time updates on resource-constrained devices where full per-query retrieval and training were previously infeasible.
  • Because the support set size is controlled by the Frank-Wolfe sparsity, the method may scale to much larger training corpora without quadratic retrieval costs.

Load-bearing premise

The geometric integerization procedure converts fractional convex weights into an exact integer multiset without materially degrading the relevance or diversity properties that the Frank-Wolfe support set was chosen to provide.
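
To make the premise concrete, here is a minimal sketch of turning fractional weights into integer multiplicities. It uses largest-remainder rounding as a stand-in: the paper's geometric integerization procedure is not described on this page and may differ. The name `integerize_weights` is hypothetical, and the input is assumed to be nonnegative and to sum to 1.

```python
import numpy as np

def integerize_weights(w, N):
    """Round convex weights w to integer counts that sum to N.

    Stand-in technique (largest-remainder rounding), not the paper's method.
    Assumes w is nonnegative and sums to 1.
    """
    w = np.asarray(w, dtype=float)
    scaled = w * N
    counts = np.floor(scaled).astype(int)
    shortfall = N - int(counts.sum())
    # Hand the leftover units to the entries with the largest fractional parts.
    order = np.argsort(-(scaled - counts), kind="stable")
    counts[order[:shortfall]] += 1
    return counts
```

Any count above 1 is a repeated example in the finetuning multiset. Whether the rounded multiset keeps the relevance and diversity of the fractional solution is exactly what the premise above takes for granted.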

What would settle it

A head-to-head benchmark run showing that HullFT produces higher bits-per-byte or higher total runtime than the strongest prior TTFT baseline on the same model and dataset would falsify the claimed quality-efficiency improvement.
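
For reference, the quality metric in such a benchmark can be computed as below. This is a sketch using the common definition of bits-per-byte (total negative log-likelihood in bits divided by UTF-8 byte count). The exact normalization used in the paper is not given on this page, and reading BPB% as a percentage of the non-finetuned model's BPB is an assumption based on the Figure 6 caption. Both function names are hypothetical.

```python
import math

def bits_per_byte(token_nlls_nats, text):
    """Total negative log-likelihood in bits divided by the UTF-8 byte length of text."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must be non-empty")
    return sum(token_nlls_nats) / (math.log(2) * n_bytes)

def bpb_percent(bpb_finetuned, bpb_base):
    """Finetuned BPB as a percentage of the non-finetuned BPB (assumed convention)."""
    return 100.0 * bpb_finetuned / bpb_base
```

A fair comparison would pair this metric with total runtime (selection plus finetuning) measured on the same model, dataset, and hardware for every method.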

Figures

Figures reproduced from arXiv: 2605.30337 by Alaa Khamis, Alaa Maalouf.

Figure 1: Our test-time finetuning pipeline. 1. Frank-Wolfe Support: Given a prompt q, we retrieve a candidate pool from the corpus, then approximate q as a sparse convex combination to select a support set. 2. Integerization: Fractional weights are converted to integer counts forming an exact N-point multiset. 3. Finetuning & Inference: The base LLM is finetuned on this multiset before evaluating q via gradient reu…
Figure 2: Left: BPB% vs total runtime (selection + finetuning) as we sweep N ∈ [1, 50] for each method. Our method (green) is Pareto-dominant for every budget T ≲ 4s. Vertical lines at T=1.75s and T=2.0s mark the BPB% gap to the best baseline: our method is 3.83% and 3.44% lower than the best baseline at those budgets, respectively. Right: Quality gap (best-baseline BPB% − ours) in % as a function of the total-runtim…
Figure 3: Per-subset breakdown for all 12 Pile subsets (2 subsets per row). For each subset we show, …
Figure 4: BPB% vs. total runtime (selection + finetuning) for HullFT at gradient-reuse depths r ∈ {1, 2, 3}. Points correspond to N ∈ {10, 20, 30, 40, 50}. At any given runtime budget, r=2 matches or beats r ∈ {1, 3}, saves wall time at a modest BPB% cost.
Figure 5: Comparison of SIFT gradient-reuse variants
Figure 6: CPU-only comparison at N=20 over 6 Pile subsets. Left: mean selection time on CPU. Middle-left: mean finetuning time. Middle-right: total runtime (selection + finetuning). Right: BPB% relative to the non-finetuned baseline. HullFT is 25.8× faster than SIFT at selection and saves about 89s end-to-end, at a modest BPB% cost relative to SIFT.
Figure 7: 3D t-SNE projection of the candidate pool (gray) and the …
Figure 8: BPB% averaged over 12 Pile subsets as a function of selection budget
Figure 9: BPB% for the three FW integerization strategies.
Figure 10: BPB% for Carathéodory selection (Alg. 5) with three integerization strategies, with HullFT …
Figure 11: BPB% vs. FW tolerance ε at N=20 (left) and N=50 (right). The x-axis is inverted so that the plot reads left-to-right as loose-to-exact; ε=0 (exact solution) is placed one decade beyond 10⁻⁸ for visibility.
C.4 FW variant: forced-unique selection without ε early-stop. We ask whether the ε early-stop and the allowance for point revisits in standard HullFT are important. The fw_no_epsilon variant disables the…
Figure 12: Effect of the kNN pre-selection pool size
Figure 13: Per-subset breakdown against SIFT for all 12 Pile subsets (2 subsets per row). For each subset we show, left: BPB% vs. wall-clock time, and right: the quality gap (SIFT − HullFT) as a function of the total-runtime budget T.
Figure 14: Per-subset breakdown against kNN for all 12 Pile subsets (2 subsets per row). For each subset we show, left: BPB% vs. wall-clock time, and right: the quality gap (kNN − HullFT) as a function of the total-runtime budget T.
C.7 Fidelity of geometric integerization to Frank–Wolfe weights. Frank–Wolfe produces fractional weights on a sparse support; geometric integerization converts these into an exact N-tuple…
Figure 15: Fidelity of geometric integerization to Frank–Wolfe weights. Mean …
Figure 16: 3D t-SNE projections of the candidate pool (gray) and the …
Original abstract

Test-time finetuning (TTFT) is a rapidly evolving paradigm that adapts a language model to each prompt by retrieving related sequences, updating the model on them, and then evaluating the prompt. However, TTFT is only practical if it is fast: selection and finetuning both happen per query, making each a direct bottleneck. Existing methods trade speed for quality: fast retrieval is often redundant, while stronger diversity-aware selection adds prohibitive per-query cost. We introduce HullFT, a geometric approach to TTFT that addresses both bottlenecks. Given a query, HullFT first represents the query embedding as a sparse convex combination of few training sequences, using efficient projection-free Frank-Wolfe optimization. This yields a support set that is inherently relevant and diverse. We then convert the fractional convex weights into an exact integer multiset for finetuning through a geometric integerization procedure. The resulting multiplicities naturally create repeated examples, which we exploit with Gradient Reuse to amortize forward-backward computation across repeated finetuning steps. Our experiments show that HullFT improves the quality-efficiency tradeoff over current state-of-the-art TTFT methods, achieving lower bits-per-byte at substantially lower total runtime.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces HullFT for test-time finetuning (TTFT) of LLMs. It represents each query embedding as a sparse convex combination of training sequences via projection-free Frank-Wolfe optimization to obtain a relevant and diverse support set, converts the resulting fractional weights to an exact integer multiset via a geometric integerization procedure, and exploits the resulting repeated examples with Gradient Reuse to amortize forward-backward passes. The central claim is that this pipeline improves the quality-efficiency tradeoff over prior TTFT methods, yielding lower bits-per-byte at substantially lower total runtime.

Significance. If the empirical claims hold and the integerization step preserves the intended properties of the support set, the geometric framing could supply a more principled and efficient alternative to heuristic retrieval-plus-finetuning pipelines, with potential impact on practical per-query adaptation of large models.

major comments (2)
  1. [Abstract] Abstract: the claim that experiments demonstrate improved quality-efficiency tradeoffs is presented without any quantitative tables, ablation details, error bars, or baseline numbers, rendering the headline result impossible to evaluate from the provided text.
  2. [Method] Method (geometric integerization step): the pipeline relies on converting Frank-Wolfe fractional weights to integer multiplicities without materially altering the effective training distribution, yet no L1 or total-variation bound, convergence guarantee, or ablation isolating the integerization effect on bits-per-byte (while holding total FLOPs fixed) is supplied; this is load-bearing for the quality claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. We address each of the major comments below.

  1. Referee: [Abstract] Abstract: the claim that experiments demonstrate improved quality-efficiency tradeoffs is presented without any quantitative tables, ablation details, error bars, or baseline numbers, rendering the headline result impossible to evaluate from the provided text.

    Authors: We agree with the observation that the abstract presents the experimental claims at a high level without specific numbers. To address this, we will revise the abstract to incorporate key quantitative results from our experiments, including comparisons to baselines in terms of bits-per-byte and runtime, while maintaining its concise nature. This will make the headline result more directly evaluable. revision: yes

  2. Referee: [Method] Method (geometric integerization step): the pipeline relies on converting Frank-Wolfe fractional weights to integer multiplicities without materially altering the effective training distribution, yet no L1 or total-variation bound, convergence guarantee, or ablation isolating the integerization effect on bits-per-byte (while holding total FLOPs fixed) is supplied; this is load-bearing for the quality claim.

    Authors: The referee correctly identifies that the manuscript does not provide an explicit analysis or ablation for the integerization step. We will add to the revised manuscript a bound on the total variation distance between the fractional and integer distributions, along with an ablation that measures the impact of integerization on bits-per-byte under fixed computational budget. This will substantiate the claim that the procedure does not materially alter the effective training distribution. revision: yes
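
The quantity at stake in this exchange can be measured directly. The sketch below computes the total variation distance between the fractional weights and the empirical distribution of the integer multiset. The helper name `total_variation` is hypothetical, and this is only an illustration of the measurement, not the bound the authors promise.

```python
import numpy as np

def total_variation(w, counts):
    """TV distance between convex weights w and the empirical distribution counts / N."""
    w = np.asarray(w, dtype=float)
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()                      # normalize counts to a distribution
    return 0.5 * float(np.abs(w - p).sum())
```

As a rough guide, for any rounding that moves each count by less than one unit from N·w_i and touches only the k support points, each probability shifts by less than 1/N, so the distance stays below k/(2N). Whether the paper's procedure satisfies that condition is not stated on this page.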

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper describes an algorithmic pipeline: Frank-Wolfe optimization produces a sparse convex combination yielding a support set, a separate geometric integerization converts fractional weights to integer multiplicities, and repeated examples enable gradient reuse. Performance claims (lower bits-per-byte at lower runtime) are presented as empirical outcomes of this procedure rather than tautological re-expressions of input parameters or fitted quantities. No equations reduce by construction to their own inputs, no load-bearing self-citation chains are invoked for uniqueness or ansatz, and no known result is merely renamed. The integerization step's preservation of relevance/diversity is an unproven assumption (a correctness concern), not a definitional or fitted-input circularity. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented entities; the method implicitly assumes that the convex hull of training embeddings contains useful query reconstructions and that repeated examples can be processed without quality loss.



Reference graph

Works this paper leans on

71 extracted references · 10 canonical work pages · 5 internal anchors

  1. [1]

    Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of Machine Learning Research, 21(140):1–67, 2020

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of Machine Learning Research, 21(140):1–67, 2020

  2. [2]

    The Refined- Web dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only

    Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Hamza Alobeidli, Alessandro Cappelli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The Refined- Web dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only. InProceedings of the 37th International Conference on Neural Information Processing Sy...

  3. [3]

    Test-time training with self-supervision for generalization under distribution shifts

    Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. InProceedings of the 37th International Conference on Machine Learning, pages 9229–9248, 2020

  4. [4]

    Test-time training with masked autoencoders

    Yossi Gandelsman, Yu Sun, Xinlei Chen, and Alexei A Efros. Test-time training with masked autoencoders. InAdvances in Neural Information Processing Systems, 2022

  5. [5]

    Test-time training on nearest neighbors for large language models

    Moritz Hardt and Yu Sun. Test-time training on nearest neighbors for large language models. InThe Twelfth International Conference on Learning Representations, 2024

  6. [6]

    Learning to (learn at test time): RNNs with expressive hidden states

    Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, Tatsunori Hashimoto, and Carlos Guestrin. Learning to (learn at test time): RNNs with expressive hidden states. InForty-second International Conference on Machine Learning, 2025

  7. [7]

    The surprising effectiveness of test-time training for abstract reasoning

    Ekin Akyürek, Mehul Damani, Adam Zweiger, Linlu Qiu, Han Guo, Jyothish Pari, Yoon Kim, and Jacob Andreas. The surprising effectiveness of test-time training for abstract reasoning. arXiv preprint arXiv:2411.07279, 2024

  8. [8]

    Efficiently learning at test- time: Active fine-tuning of LLMs

    Jonas Hübotter, Sascha Bongni, Ido Hakimi, and Andreas Krause. Efficiently learning at test- time: Active fine-tuning of LLMs. InInternational Conference on Learning Representations, 2025

  9. [9]

    Billion-scale similarity search with GPUs

    Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547, 2021

  10. [10]

    CCNet: Extracting high quality monolingual datasets from web crawl data

    Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. CCNet: Extracting high quality monolingual datasets from web crawl data. InProceedings of the 12th Language Resources and Evaluation Conference, pages 4003–4012, 2020

  11. [11]

    SemDeDup: Data-efficient learning at web-scale through semantic deduplication

    Amro Abbas, Kushal Tirumala, Daniel Simig, Surya Ganguli, and Ari S Morcos. SemD- eDup: Data-efficient learning at web-scale through semantic deduplication.arXiv preprint arXiv:2303.09540, 2023

  12. [12]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800gb dataset of diverse text for language modeling.arXiv preprint arXiv:2101.00027, 2020

  13. [13]

    The use of mmr, diversity-based reranking for reordering documents and producing summaries

    Jaime Carbonell and Jade Goldstein. The use of mmr, diversity-based reranking for reordering documents and producing summaries. InProceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 335–336, 1998

  14. [14]

    Determinantal point processes for machine learning.Foundations and Trends® in Machine Learning, 5(2–3):123–286, 2012

    Alex Kulesza and Ben Taskar. Determinantal point processes for machine learning.Foundations and Trends® in Machine Learning, 5(2–3):123–286, 2012

  15. [15]

    Active learning for convolutional neural networks: A core-set approach

    Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. InInternational Conference on Learning Representations, 2018. 10

  16. [16]

    Coresets for data-efficient training of machine learning models

    Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. Coresets for data-efficient training of machine learning models. InProceedings of the 37th International Conference on Machine Learning, pages 6950–6960, 2020

  17. [17]

    Approximating Nash equilibria and dense bipartite subgraphs via an approximate version of Carathéodory’s theorem

    Siddharth Barman. Approximating Nash equilibria and dense bipartite subgraphs via an approximate version of Carathéodory’s theorem. InProceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 361–369, 2015

  18. [18]

    Über den Variabilitätsbereich der Koeffizienten von Potenzreihen, die gegebene Werte nicht annehmen.Mathematische Annalen, 64(1):95–115, 1907

    Constantin Carathéodory. Über den Variabilitätsbereich der Koeffizienten von Potenzreihen, die gegebene Werte nicht annehmen.Mathematische Annalen, 64(1):95–115, 1907

  19. [19]

    Combettes and Sebastian Pokutta

    Cyrille W. Combettes and Sebastian Pokutta. Revisiting the approximate carathéodory problem via the frank–wolfe algorithm.Mathematical Programming, 197(1):191–214, 2023

  20. [20]

    Transductive inference for text classification using support vector machines

    Thorsten Joachims. Transductive inference for text classification using support vector machines. InProceedings of the Sixteenth International Conference on Machine Learning, pages 200–209, 1999

  21. [21]

    Retrieval-augmented generation for knowledge-intensive NLP tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. InPro- ceedings of the 34th International Conference on Neural Information Processing Systems, 2020

  22. [22]

    Improving language models by retrieving from trillions of tokens

    Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego De Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, ...

  23. [23]

    The influence curve and its role in robust estimation.Journal of the American Statistical Association, 69:383–393, 1974

    Frank R Hampel. The influence curve and its role in robust estimation.Journal of the American Statistical Association, 69:383–393, 1974

  24. [24]

    Understanding black-box predictions via influence functions

    Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. InProceedings of the 34th International Conference on Machine Learning, pages 1885–1894, 2017

  25. [25]

    Estimating training data influence by tracing gradient descent

    Garima Pruthi, Frederick Liu, Satyen Kale, and Mukund Sundararajan. Estimating training data influence by tracing gradient descent. InProceedings of the 34th International Conference on Neural Information Processing Systems, volume 33, pages 19920–19930, 2020

  26. [26]

    DataInf: Efficiently estimating data influence in LoRA-tuned LLMs and diffusion models

    Yongchan Kwon, Eric Wu, Kevin Wu, and James Zou. DataInf: Efficiently estimating data influence in LoRA-tuned LLMs and diffusion models. InThe Twelfth International Conference on Learning Representations, 2024

  27. [27]

    LESS: Selecting influential data for targeted instruction tuning

    Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen. LESS: Selecting influential data for targeted instruction tuning. InInternational Conference on Machine Learning, 2024

  28. [28]

    Universal language model fine-tuning for text classi- fication

    Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classi- fication. InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, 2018

  29. [29]

    Suchin Gururangan, Ana Marasovi´c, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. Don’t stop pretraining: Adapt language models to domains and tasks. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, 2020

  30. [30]

    BioBERT: a pre-trained biomedical language representation model for biomedical text mining.Bioinformatics, 36(4):1234–1240, 2020

    Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. BioBERT: a pre-trained biomedical language representation model for biomedical text mining.Bioinformatics, 36(4):1234–1240, 2020. 11

  31. [31]

    Publicly available clinical BERT embeddings

    Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Naumann, and Matthew McDermott. Publicly available clinical BERT embeddings. InProceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72–78, 2019

  32. [32]

    ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission

    Kexin Huang, Jaan Altosaar, and Rajesh Ranganath. ClinicalBERT: Modeling clinical notes and predicting hospital readmission.arXiv preprint arXiv:1904.05342, 2019

  33. [33]

    SciBERT: A pretrained language model for sci- entific text

    Iz Beltagy, Kyle Lo, and Arman Cohan. SciBERT: A pretrained language model for sci- entific text. InProceedings of the 2019 Conference on Empirical Methods in Natural Lan- guage Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615–3620, 2019

  34. [34]

    LEGAL-BERT: The muppets straight out of law school

    Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, and Ion An- droutsopoulos. LEGAL-BERT: The muppets straight out of law school. InFindings of the Association for Computational Linguistics: EMNLP 2020, pages 2898–2904, 2020

  35. [35]

    Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Senevi- ratne, Paul Gamble, Chris Kelly, Abubakr Babiker, Nathanael Schärli, Aakanksha Chowdhery, Philip Mansfield, Dina Demner-Fushman, Blaise Agüera y Arcas, Dale Webster, Greg S. Corrad...

  36. [36]

    LIMA: Less is more for alignment

    Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, and Omer Levy. LIMA: Less is more for alignment. InProceedings of the 37th International Conference on Neural Information Processing Systems, 2023

  37. [37]

    Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022

  38. [38]

    DataS3: Dataset subset selection for specialization.arXiv preprint arXiv:2504.16277, 2025

    Neha Hulkund, Alaa Maalouf, Levi Cai, Daniel Yang, Tsun-Hsuan Wang, Abigail O’Neil, Timm Haucke, Sandeep Mukherjee, Vikram Ramaswamy, Judy Hansen Shen, Gabriel Tseng, Mike Walmsley, Daniela Rus, Ken Goldberg, Hannah Kerner, Irene Chen, Yogesh Girdhar, and Sara Beery. DataS3: Dataset subset selection for specialization.arXiv preprint arXiv:2504.16277, 2025

  39. [39]

    Compress to impress: Efficient LLM adaptation using a single gradient step on 100 samples

    Shiva Sreeram, Alaa Maalouf, Pratyusha Sharma, and Daniela Rus. Compress to impress: Efficient LLM adaptation using a single gradient step on 100 samples. InThe Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025

  40. [40]

    Robustness is a function, not a number: A factorized compre- hensive study of OOD robustness in vision-based driving.arXiv preprint arXiv:2602.09018, 2026

    Amir Mallak and Alaa Maalouf. Robustness is a function, not a number: A factorized compre- hensive study of OOD robustness in vision-based driving.arXiv preprint arXiv:2602.09018, 2026

  41. [41]

    Drive anywhere: Generalizable end-to-end autonomous driving with multi-modal foundation models

    Tsun-Hsuan Wang, Alaa Maalouf, Wei Xiao, Yutong Ban, Alexander Amini, Guy Rosman, Sertac Karaman, and Daniela Rus. Drive anywhere: Generalizable end-to-end autonomous driving with multi-modal foundation models. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6687–6694. IEEE, 2024

  42. [42]

    How to train data-efficient LLMs

    Noveen Sachdeva, Benjamin Coleman, Wang-Cheng Kang, Jianmo Ni, Lichan Hong, Ed H Chi, James Caverlee, Julian McAuley, and Derek Zhiyuan Cheng. How to train data-efficient LLMs. arXiv preprint arXiv:2402.09668, 2024

  43. [43]

    QuRating: Selecting high- quality data for training language models

    Alexander Wettig, Aatmik Gupta, Saumya Malik, and Danqi Chen. QuRating: Selecting high- quality data for training language models. InInternational Conference on Machine Learning, 2024

  44. [44]

    Fast and accurate least-mean-squares solvers

    Alaa Maalouf, Ibrahim Jubran, and Dan Feldman. Fast and accurate least-mean-squares solvers. InProceedings of the 33rd International Conference on Neural Information Processing Systems, volume 32, 2019. 12

  45. [45]

    B. Maurey. Théorèmes de factorisation pour les opérateurs linéaires à valeurs dans un espace lp(ω, µ),0< p≤+∞.Séminaire Maurey-Schwartz, pages 1–8, 1972-1973

  46. [46]

    Tight bounds for approximate Carathéodory and beyond

    Vahab Mirrokni, Renato Paes Leme, Adrian Vladu, and Sam Chiu-wai Wong. Tight bounds for approximate Carathéodory and beyond. InProceedings of the 34th International Conference on Machine Learning, pages 2440–2448, 2017

  47. [47]

    An algorithm for quadratic programming.Naval Research Logistics Quarterly, 3(1–2):95–110, 1956

    Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming.Naval Research Logistics Quarterly, 3(1–2):95–110, 1956

  48. [48]

    Revisiting Frank–Wolfe: Projection-free sparse convex optimization

    Martin Jaggi. Revisiting Frank–Wolfe: Projection-free sparse convex optimization. InProceed- ings of the 30th International Conference on Machine Learning, pages 427–435, 2013

  49. [49]

    Some comments on Wolfe’s ‘away step’.Mathematical Programming, 35:110–119, 1986

    Jacques Guélat and Patrice Marcotte. Some comments on Wolfe’s ‘away step’.Mathematical Programming, 35:110–119, 1986

  50. [50]

    Convergence Rate of Frank-Wolfe for Non-Convex Objectives

    Simon Lacoste-Julien. Convergence rate of Frank–Wolfe for non-convex objectives.arXiv preprint arXiv:1607.00345, 2016

  51. [51]

    Coresets, sparse greedy approximation, and the Frank–Wolfe algorithm

    Kenneth L Clarkson. Coresets, sparse greedy approximation, and the Frank–Wolfe algorithm. ACM Transactions on Algorithms, 6(4):63:1–63:30, 2010

  52. [52]

    Geometric approximation via coresets

    Pankaj K Agarwal, Sariel Har-Peled, and Kasturi R Varadarajan. Geometric approximation via coresets. In Jacob E Goodman, János Pach, and Emo Welzl, editors,Combinatorial and Computational Geometry. Cambridge University Press, 2005

  53. [53]

    American Mathematical Society, 2011

    Sariel Har-Peled.Geometric Approximation Algorithms, volume 173 ofMathematical Surveys and Monographs. American Mathematical Society, 2011

  54. [54]

    Coresets and sketches

    Jeff M Phillips. Coresets and sketches. InHandbook of Discrete and Computational Geometry. CRC Press, 2017

  55. [55]

    Coresets for the average case error for finite query sets

    Alaa Maalouf, Ibrahim Jubran, Murad Tukan, and Dan Feldman. Coresets for the average case error for finite query sets. Sensors, 21(19), 2021

  56. [56]

    Provable data subset selection for efficient neural networks training

    Murad Tukan, Samson Zhou, Alaa Maalouf, Daniela Rus, Vladimir Braverman, and Dan Feldman. Provable data subset selection for efficient neural networks training. In Proceedings of the 40th International Conference on Machine Learning, pages 34533–34555, 2023

  57. [57]

    GLISTER: Generalization based data subset selection for efficient and robust learning

    Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, and Rishabh Iyer. GLISTER: Generalization based data subset selection for efficient and robust learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 8110–8118, 2021

  58. [58]

    GRAD-MATCH: Gradient matching based data subset selection for efficient deep model training

    Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, Abir De, and Rishabh Iyer. GRAD-MATCH: Gradient matching based data subset selection for efficient deep model training. In Proceedings of the 38th International Conference on Machine Learning, pages 5464–5474, 2021

  59. [59]

    Selection via proxy: Efficient data selection for deep learning

    Cody Coleman, Christopher Yeh, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, and Matei Zaharia. Selection via proxy: Efficient data selection for deep learning. In International Conference on Learning Representations, 2020

  60. [60]

    Pruning neural networks via coresets and convex geometry: Towards no assumptions

    Murad Tukan, Loay Mualem, and Alaa Maalouf. Pruning neural networks via coresets and convex geometry: Towards no assumptions. In Advances in Neural Information Processing Systems, 2022

  61. [61]

    Provable filter pruning for efficient neural networks

    Lucas Liebenwein, Cenk Baykal, Harry Lang, Dan Feldman, and Daniela Rus. Provable filter pruning for efficient neural networks. In International Conference on Learning Representations, 2020

  62. [62]

    Data-independent structured pruning of neural networks via coresets

    Ben Mussay, Dan Feldman, Samson Zhou, Vladimir Braverman, and Margarita Osadchy. Data-independent structured pruning of neural networks via coresets. IEEE Transactions on Neural Networks and Learning Systems, 33(12):7829–7841, 2022

  63. [63]

    Sensitivity-informed provable pruning of neural networks

    Cenk Baykal, Lucas Liebenwein, Igor Gilitschenski, Dan Feldman, and Daniela Rus. Sensitivity-informed provable pruning of neural networks. SIAM Journal on Mathematics of Data Science, 4(1):26–45, 2022

  64. [64]

    AutoCoreset: An automatic practical coreset construction framework

    Alaa Maalouf, Murad Tukan, Vladimir Braverman, and Daniela Rus. AutoCoreset: An automatic practical coreset construction framework. In Proceedings of the 40th International Conference on Machine Learning, pages 23451–23466, 2023

  65. [65]

    A unified approach to coreset learning

    Alaa Maalouf, Gilad Eini, Ben Mussay, Dan Feldman, and Margarita Osadchy. A unified approach to coreset learning. IEEE Transactions on Neural Networks and Learning Systems, 35(5):6893–6905, 2024

  66. [66]

    A unified framework for approximating and clustering data

    Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, pages 569–578, 2011

  67. [67]

    New frameworks for offline and streaming coreset constructions

    Vladimir Braverman, Dan Feldman, Harry Lang, Adiel Statman, and Samson Zhou. New frameworks for offline and streaming coreset constructions. arXiv preprint arXiv:1612.00889, 2016

  68. [68]

    Practical Coreset Constructions for Machine Learning

    Olivier Bachem, Mario Lucic, and Andreas Krause. Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476, 2017

  69. [69]

    Coresets-methods and history: A theoreticians design pattern for approximation and streaming algorithms

    Alexander Munteanu and Chris Schwiegelshohn. Coresets-methods and history: A theoreticians design pattern for approximation and streaming algorithms. KI - Künstliche Intelligenz, 32(1):37–53, 2018

  70. [70]

    What is the effect of importance weighting in deep learning?

    Jonathon Byrd and Zachary Lipton. What is the effect of importance weighting in deep learning? In Proceedings of the 36th International Conference on Machine Learning, pages 872–881, 2019

  71. [71]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Technical Report, 2019.

A Experimental protocol details

The main experiments use 12 Pile subsets and 150 test queries per subset. This protocol is a compute-conscious version of the evaluation style used by TT...