Pith · machine review for the scientific record

arxiv: 2605.07063 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links · Lean Theorem

Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:49 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords LLM post-training · data regularization · data selection · SFT · RLHF · RLVR · bias-variance tradeoff · feasible set

The pith

General training data serves as a regularizer by constraining updates from scarce target data in LLM post-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reframes the role of abundant but imperfect general training data in LLM post-training. Rather than selecting subsets from it to mix with scarce high-fidelity target data, the approach treats the general data as inducing a regularizer at each training step. A feasible set of allowable update directions is built from the general data, and the update direction dictated by the target data is projected onto that set. This unifies prior training and selection techniques as points on a bias-variance spectrum and supplies a broader family of methods with tunable regularization strength. Experiments across supervised fine-tuning, RLHF, and RLVR show consistent gains over data-selection baselines together with low-overhead implementation.

Core claim

At each training step a feasible set of model-update directions is constructed from the general training data, and the direction specified by the scarce target data is projected onto that set. The resulting data-induced regularizer prevents overfitting to the target objective. Standard training and existing selection methods emerge as special cases that differ only in the strength of this regularizer and therefore occupy different positions on a bias-variance spectrum. A richer family of methods is obtained by varying the regularizer, and system optimizations make the approach practical at LLM scale.

What carries the argument

Projection of the target-data update direction onto the feasible set of directions induced by general training data at each step.
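
The simulated rebuttal further down describes the feasible set as convex combinations of general-data gradients and the projection as a quadratic program. On that reading (a reconstruction under stated assumptions, not the paper's released code; all names here are ours), the projection step can be sketched with a Frank-Wolfe solver:

```python
import numpy as np

def project_onto_hull(g_target, general_grads, iters=1000):
    """Project g_target onto the convex hull of general-data gradients:
    min_w ||sum_i w_i g_i - g_target||^2  s.t.  w_i >= 0, sum_i w_i = 1,
    solved with the classic Frank-Wolfe method."""
    G = np.asarray(general_grads, dtype=float)   # rows are the "vertex" gradients
    x = G[0].copy()                              # start at an arbitrary vertex
    for k in range(iters):
        grad = 2.0 * (x - g_target)              # gradient of the squared distance
        i = int(np.argmin(G @ grad))             # vertex minimizing the linearization
        gamma = 2.0 / (k + 2.0)                  # standard diminishing step size
        x = (1.0 - gamma) * x + gamma * G[i]     # convex step toward that vertex
    return x

# Toy check: the hull of (1, 1) and (1, -1) is a vertical segment,
# so the projection of the target direction (1, 0) is (1, 0) itself.
g_t = np.array([1.0, 0.0])
u = project_onto_hull(g_t, [[1.0, 1.0], [1.0, -1.0]])
retained = float(u @ g_t) / float(g_t @ g_t)     # fraction of target signal kept
```

A retained value near 1 means the feasible set barely constrains the target direction; the load-bearing question below is precisely how much signal survives when general and target gradients disagree.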

If this is right

  • Existing data-selection and standard-training procedures become special cases obtained by particular choices of the data-induced regularizer.
  • A continuous spectrum of bias-variance trade-offs becomes available by adjusting regularization strength.
  • System-level optimizations allow the projection step to run with minimal added cost at LLM scale.
  • Performance improvements hold across supervised fine-tuning, RLHF, and RLVR relative to selection baselines.
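
The bullets above span a one-parameter family. In the simplest possible caricature, a hypothetical scalar `lam` interpolates between the raw target gradient (standard training, no regularization) and its projection onto a general-data direction (maximal regularization); this illustrates the spectrum but is not the paper's actual parameterization:

```python
import numpy as np

def regularized_update(g_target, g_general, lam):
    """Blend the raw target gradient with its projection onto the
    general-data direction; lam in [0, 1] sets regularization strength."""
    d = g_general / np.linalg.norm(g_general)      # unit general-data direction
    proj = (g_target @ d) * d                      # component allowed by general data
    return (1.0 - lam) * g_target + lam * proj

g_t = np.array([3.0, 4.0])                         # scarce-target gradient (toy)
g_g = np.array([1.0, 0.0])                         # abundant general-data gradient (toy)
u_standard = regularized_update(g_t, g_g, 0.0)     # lam = 0: raw target update
u_constrained = regularized_update(g_t, g_g, 1.0)  # lam = 1: fully constrained
```

Intermediate `lam` values trade bias (pulling the update toward the general-data direction) against variance (trusting the scarce target estimate).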

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same regularization perspective could be applied in other domains where high-quality labeled data is scarce and general data is plentiful.
  • The strength of the projection could be adapted automatically during training based on observed loss or gradient statistics.
  • Combining the feasible-set projection with other regularizers such as weight decay might produce additive gains.

Load-bearing premise

The feasible set built from general data meaningfully limits target-driven updates so that overfitting is reduced without discarding useful target signal.

What would settle it

An experiment on an SFT or RLHF task in which the projected updates produce lower held-out performance than either unregularized target training or standard data-selection baselines; that outcome would falsify the claimed bias-variance advantage.

Figures

Figures reproduced from arXiv: 2605.07063 by Jiaqi W. Ma, Pingbang Hu, Xueshen Liu, Z. Morley Mao.

Figure 1: The dual view of data selection and data regularization.
Figure 2: Data regularization spectrum. Different methods correspond to feasible sets of increasing …
Figure 3: Bias–variance decomposition for a chosen Ut.
Figure 4: Top. Forward pass and backward pass (activation and weight gradient) computation graph: both solid and dashed arrows denote computational dependency; a dashed arrow indicates that the dependency runs across the boundary between forward and backward. Bottom. Memory footprint: a tensor can be released once no remaining operation depends on it. Forward pass. Each layer l computes the output (pre-activation) of the layer e^(l)_{i,τ} = …
Figure 5: Memory footprint comparison. Top: Standard Training Update (Algorithm 4.1). Middle: Global Subset Update (Algorithm 4.2). Bottom: Layer-Wise Subset Update (Algorithm 4.3). All three methods achieve comparable peak memory per layer, but Global Subset Update must retain both a^(l) and ∂ℓ/∂e^(l) for all l throughout the backward pass until the global subset is determined, while standard training and Layer-Wise …
Figure 6: Training dynamics of alpaca/samsum with three fine-tuning methods, and …
Figure 7: SFT training dynamics on the three QA settings: evaluation perplexity throughout training.
Figure 8: Per-layer scores on alpaca/samsum, averaged across training steps (lines) with 25th–75th interquartile bands. Score magnitudes vary by orders of magnitude across layer types, so Global Subset Update's global ranking is dominated by down_proj while layers like Q and K (ρ ≲ 0.2) receive effectively uncurated data. The magnitudes …
Figure 9: RLHF (self-reference target). Top. Training reward. Bottom. Evaluation toxicity. We follow a standard RLHF pipeline using TRL [von Werra et al., 2020] with PPO [Schulman et al., 2017] for detoxification [Hugging Face, 2023], using GPT-NEO-2.7B [Black et al., 2021] with LoRA. The policy generates continuations scored by a toxicity-based reward model (LFTW R4 Target [Vidgen et al., 2021]), and is evaluated …
Figure 10: RLHF (held-out target). Top. Training reward. Bottom. Evaluation toxicity. All curation methods use negative score filtering, retaining only samples with non-negative gradient alignment. We further ablate two design choices: (i) target signal source: self-referencing (the same rollout batch) versus a new rollout on a held-out target set, and (ii) target loss: reward-weighted log-probability versus the …
Figure 11: RLVR on MATH with QWEN3-1.7B. Evaluation accuracy over training …
Figure 12: Illustration of MSE for each method as a function of target sample size …
Figure 13: Per-layer scores on the three QA-only settings (full-parameter SFT trace), analogous …
Original abstract

Data selection methods address a critical challenge in LLM post-training: effectively leveraging scarce, high-fidelity target data alongside abundant but imperfectly aligned general training data. In this work, we move beyond the data-selection framing and introduce Dr. Post-Training (Data-Regularized Post-Training), a novel framework that reconceptualizes general training data as a data-induced regularizer that prevents overfitting to the scarce target objective, rather than serving as a pool for selection. Specifically, our framework proposes that at each training step, construct a feasible set of model update directions using the general training data, and project the model update direction specified by the scarce target data onto that feasible set. Standard training and existing data selection methods arise as special cases with different choices of the data-induced regularizer, and these methods correspond to different points on a bias--variance spectrum with different regularization strength. Building on this view, we propose a family of methods offering a richer design space and more flexible bias--variance tradeoffs. For practical LLM-scale use, we introduce careful system optimizations that realize these methods with minimal overhead. Extensive experiments across SFT, RLHF, and RLVR show that our methods consistently outperform state-of-the-art data selection baselines, and system benchmarks confirm their efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes Dr. Post-Training, a framework that reconceptualizes abundant general training data as a data-induced regularizer for LLM post-training. At each step, a feasible set of update directions is constructed from general data gradients, and the update direction from scarce target data is projected onto this set. Standard training and data selection methods are presented as special cases corresponding to different regularization strengths on a bias-variance spectrum. A family of methods is introduced with system optimizations for LLM-scale efficiency, and experiments across SFT, RLHF, and RLVR are claimed to show consistent outperformance over state-of-the-art data selection baselines.

Significance. If the projection mechanism can be rigorously shown to constrain overfitting directions while preserving task-relevant target signal (via gradient alignment or subspace overlap), the framework could provide a principled unification of data selection and regularization, enabling more flexible bias-variance control in post-training. The practical system optimizations and empirical claims, if substantiated with ablations and error analysis, would strengthen its utility for leveraging general data in LLM pipelines.

major comments (3)
  1. [Abstract] Abstract: No equations or formal definitions are given for the feasible set construction from general data or the projection operator applied to the target update. Without these, it is impossible to verify whether the framework yields non-tautological benefits or simply reparameterizes existing regularization, directly undermining assessment of the central regularization claim.
  2. [Method] Method (implied in abstract description): The paper provides no analysis, bounds, or conditions (e.g., cosine similarity thresholds or explained variance in gradient subspaces) under which the general-data feasible set overlaps sufficiently with target directions to avoid nullifying useful signal. This is load-bearing for the claim that the approach yields superior bias-variance tradeoffs, especially when general and target distributions differ substantially.
  3. [Experiments] Experiments: The claim of consistent outperformance across SFT, RLHF, and RLVR lacks any reported details on baselines, metrics, error bars, ablation studies, or statistical significance, making it impossible to evaluate whether the results support the superiority over data selection methods or are robust to implementation choices.
minor comments (1)
  1. [Abstract] Abstract: The phrasing is dense; separating the conceptual unification from the proposed family of methods and system optimizations would improve readability.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our paper. We address each of the major comments point by point below, providing clarifications and indicating revisions to the manuscript where appropriate.

Point-by-point responses
  1. Referee: [Abstract] Abstract: No equations or formal definitions are given for the feasible set construction from general data or the projection operator applied to the target update. Without these, it is impossible to verify whether the framework yields non-tautological benefits or simply reparameterizes existing regularization, directly undermining assessment of the central regularization claim.

    Authors: We agree that the abstract, due to space constraints, does not include equations. However, the main text in Section 3 formally defines the feasible set as the set of convex combinations of gradients computed on general data batches, and the projection operator as the solution to a quadratic program minimizing the distance to the target gradient subject to the feasible set constraint. We have revised the abstract to include a high-level mathematical description of these components to make the central claim verifiable from the abstract alone. revision: yes
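
Read literally, this description corresponds to the quadratic program below (a reconstruction of the rebuttal's prose, not the paper's own notation; here the g_i^gen are general-data batch gradients and g^tgt is the target gradient):

```latex
\min_{w \in \mathbb{R}^n}\ \Bigl\| \sum_{i=1}^{n} w_i\, g_i^{\mathrm{gen}} - g^{\mathrm{tgt}} \Bigr\|_2^2
\qquad \text{s.t.} \qquad w_i \ge 0, \quad \sum_{i=1}^{n} w_i = 1,
```

with the projected update taken as u = Σ_i w_i* g_i^gen at the optimum w*.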

  2. Referee: [Method] Method (implied in abstract description): The paper provides no analysis, bounds, or conditions (e.g., cosine similarity thresholds or explained variance in gradient subspaces) under which the general-data feasible set overlaps sufficiently with target directions to avoid nullifying useful signal. This is load-bearing for the claim that the approach yields superior bias-variance tradeoffs, especially when general and target distributions differ substantially.

    Authors: The manuscript does include empirical analysis of gradient alignment in Section 4, with reported cosine similarities between general and target gradients. We acknowledge the lack of theoretical bounds and have added a new paragraph in the method section providing a sufficient condition based on the principal angle between the gradient subspaces, along with a simple bound on the signal preservation using the minimum overlap. For cases where distributions differ substantially, we discuss how increasing the regularization strength (smaller feasible set) can still be beneficial if some overlap exists, supported by additional experiments. revision: partial
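
The principal-angle condition the response invokes can be computed directly. The sketch below (synthetic gradients, hypothetical names) builds orthonormal bases with QR and reads the angles off an SVD:

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles (radians, ascending) between the column spans of A and B."""
    Qa, _ = np.linalg.qr(A)                      # orthonormal basis for span(A)
    Qb, _ = np.linalg.qr(B)                      # orthonormal basis for span(B)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(s, 0.0, 1.0))      # cosines of angles -> angles

rng = np.random.default_rng(0)
G_general = rng.normal(size=(50, 5))             # columns: general-data gradients
# Target gradients that partly share the general span, plus one fresh direction.
G_target = np.column_stack([G_general[:, 0] + 0.1 * rng.normal(size=50),
                            rng.normal(size=50)])
angles = principal_angles(G_general, G_target)
overlap = float(np.cos(angles[0]))               # near 1 => strong shared direction
```

A small smallest principal angle (overlap near 1) is exactly the kind of sufficient condition the rebuttal gestures at: some target signal lies inside the general-data subspace and survives the projection.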

  3. Referee: [Experiments] Experiments: The claim of consistent outperformance across SFT, RLHF, and RLVR lacks any reported details on baselines, metrics, error bars, ablation studies, or statistical significance, making it impossible to evaluate whether the results support the superiority over data selection methods or are robust to implementation choices.

    Authors: We regret that these details were not sufficiently highlighted in the main text. The full paper reports: baselines including random selection, perplexity-based, and gradient-based methods; metrics such as downstream task performance and human preference scores; error bars as standard deviations over multiple runs; ablations on regularization strength and feasible set size in the appendix; and statistical significance via t-tests with p-values reported. We have added a dedicated paragraph in the experiments section summarizing these and referencing the relevant tables and figures for clarity. revision: yes
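
The described protocol amounts to a paired comparison across runs; a minimal sketch with invented illustrative numbers (not the paper's results, implemented here as a hand-rolled paired t-test):

```python
import numpy as np

# Hypothetical held-out accuracies over 5 seeds (illustrative numbers only).
ours = np.array([71.2, 70.8, 71.5, 70.9, 71.3])
baseline = np.array([69.9, 70.1, 70.4, 69.7, 70.2])

d = ours - baseline                              # paired differences across seeds
t_stat = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
# With 4 degrees of freedom, |t| > 2.776 corresponds to p < 0.05 (two-sided).
significant = t_stat > 2.776
```

Pairing by seed removes run-to-run variance shared by both methods, which is why the rebuttal's "standard deviations over multiple runs" and "t-tests with p-values" belong together.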

Circularity Check

1 step flagged

Framework defines prior methods as special cases by construction of the regularizer

specific steps
  1. self-definitional [Abstract]
    "Standard training and existing data selection methods arise as special cases with different choices of the data-induced regularizer, and these methods correspond to different points on a bias--variance spectrum with different regularization strength."

    The feasible-set construction is defined so that varying the regularizer (i.e., the choice of feasible set or projection) directly recovers prior methods by construction; the claim that they 'arise as special cases' therefore reduces to a restatement of the framework's own parameterization rather than an independent insight or prediction.

full rationale

The paper introduces a projection-based regularization view and explicitly states that standard training and data selection emerge as special cases under different choices of the data-induced regularizer. This inclusion is definitional rather than derived from independent principles or equations that could falsify the equivalence. Experimental outperformance claims remain independent of this framing, so the circularity is partial and limited to the organizational claim rather than the core results.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Review performed on abstract only; full mathematical definitions, parameter choices, and experimental protocols are unavailable, so the ledger is necessarily incomplete.

free parameters (1)
  • regularization strength
    The abstract states that different choices of the data-induced regularizer correspond to different points on a bias-variance spectrum with different regularization strength.
axioms (1)
  • domain assumption General training data can be used to construct a feasible set of model update directions that regularizes updates from scarce target data
    This is the core premise of the Dr. Post-Training framework as stated in the abstract.

pith-pipeline@v0.9.0 · 5533 in / 1295 out tokens · 44036 ms · 2026-05-11T01:49:59.334544+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

187 extracted references · 187 canonical work pages · 14 internal anchors

  1. [1]

    Advances in Neural Information Processing Systems , pages =

    Albalak, Alon and Raffel, Colin A and Wang, William Yang , title =. Advances in Neural Information Processing Systems , pages =

  2. [2]

    Transactions on Machine Learning Research , note =

    Alon Albalak and Yanai Elazar and Sang Michael Xie and Shayne Longpre and Nathan Lambert and Xinyi Wang and Niklas Muennighoff and Bairu Hou and Liangming Pan and Haewon Jeong and Colin Raffel and Shiyu Chang and Tatsunori Hashimoto and William Yang Wang , title =. Transactions on Machine Learning Research , note =

  3. [3]

    SmolLM2: When Smol Goes Big--Data-Centric Training of a Small Language Model , journal =

    Allal, Loubna Ben and Lozhkov, Anton and Bakouch, Elie and Bl. SmolLM2: When Smol Goes Big--Data-Centric Training of a Small Language Model , journal =

  4. [4]

    Hashimoto , title =

    Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. GitHub repository , publisher =

  5. [7]

    Journal of machine learning research , number =

    Baydin, Atilim Gunes and Pearlmutter, Barak A and Radul, Alexey Andreyevich and Siskind, Jeffrey Mark , title =. Journal of machine learning research , number =

  6. [8]

    Machine learning , number =

    Ben-David, Shai and Blitzer, John and Crammer, Koby and Kulesza, Alex and Pereira, Fernando and Vaughan, Jennifer Wortman , title =. Machine learning , number =

  7. [9]

    OPT 2024: Optimization for Machine Learning , year =

    Bernstein, Jeremy and Newhouse, Laker , title =. OPT 2024: Optimization for Machine Learning , year =

  8. [10]

    Optimization methods for large-scale machine learning , journal =

    Bottou, L. Optimization methods for large-scale machine learning , journal =

  9. [12]

    Open problems and fundamental limitations of reinforcement learning from human feedback , journal =

    Casper, Stephen and Davies, Xander and Shi, Claudia and Gilbert, Thomas Krendl and Scheurer, J. Open problems and fundamental limitations of reinforcement learning from human feedback , journal =

  10. [14]

    The Twelfth International Conference on Learning Representations , year =

    Chen, Lichang and Li, Shiyang and Yan, Jun and Wang, Hai and Gunaratna, Kalpa and Yadav, Vikas and Tang, Zheng and Srinivasan, Vijay and Zhou, Tianyi and Huang, Heng and others , title =. The Twelfth International Conference on Learning Representations , year =

  11. [15]

    Advances in Neural Information Processing Systems , year =

    Choe, Sang Keun and Ahn, Hwijeen and Bae, Juhan and Zhao, Kewen and Kang, Minsoo and Chung, Youngseog and Pratapa, Adithya and Neiswanger, Willie and Strubell, Emma and Mitamura, Teruko and others , title =. Advances in Neural Information Processing Systems , year =

  12. [16]

    Advances in neural information processing systems , volume =

    Christiano, Paul F and Leike, Jan and Brown, Tom and Martic, Miljan and Legg, Shane and Amodei, Dario , title =. Advances in neural information processing systems , volume =

  13. [17]

    Transactions of the Association for Computational Linguistics , pages =

    Clark, Jonathan H and Choi, Eunsol and Collins, Michael and Garrette, Dan and Kwiatkowski, Tom and Nikolaev, Vitaly and Palomaki, Jennimaria , title =. Transactions of the Association for Computational Linguistics , pages =

  14. [18]

    Frustratingly easy domain adaptation , booktitle =

    Daum. Frustratingly easy domain adaptation , booktitle =

  15. [20]

    Advances in neural information processing systems , pages =

    Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke , title =. Advances in neural information processing systems , pages =

  16. [21]

    Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina , title =. Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) , pages =

  17. [22]

    International Conference on Machine Learning , organization =

    Ethayarajh, Kawin and Choi, Yejin and Swayamdipta, Swabha , title =. International Conference on Machine Learning , organization =

  18. [23]

    Naval research logistics quarterly , number =

    Frank, Marguerite and Wolfe, Philip , title =. Naval research logistics quarterly , number =

  19. [24]

    Domain-adversarial training of neural networks , journal =

    Ganin, Yaroslav and Ustinova, Evgeniya and Ajakan, Hana and Germain, Pascal and Larochelle, Hugo and Laviolette, Fran. Domain-adversarial training of neural networks , journal =

  20. [26]

    International conference on machine learning , organization =

    Ghorbani, Amirata and Zou, James , title =. International conference on machine learning , organization =

  21. [27]

    EMNLP-IJCNLP 2019 , pages =

    Gliwa, Bogdan and Mochol, Iwona and Biesek, Maciej and Wawer, Aleksander , title =. EMNLP-IJCNLP 2019 , pages =

  22. [28]

    Information and Software Technology , pages =

    Gong, Youdi and Liu, Guangzhen and Xue, Yunzhi and Li, Rui and Meng, Lingzhong , title =. Information and Software Technology , pages =

  23. [31]

    Textbooks are all you need , journal =

    Gunasekar, Suriya and Zhang, Yi and Aneja, Jyoti and Mendes, Caio C. Textbooks are all you need , journal =

  24. [32]

    Proceedings of the 2018 conference on empirical methods in natural language processing , pages =

    Guo, Jiang and Shah, Darsh and Barzilay, Regina , title =. Proceedings of the 2018 conference on empirical methods in natural language processing , pages =

  25. [34]

    Gradient Descent Happens in a Tiny Subspace

    Gur-Ari, Guy and Roberts, Daniel A and Dyer, Ethan , title =. arXiv preprint arXiv:1812.04754 , year =

  26. [35]

    Don’t stop pretraining: Adapt language models to domains and tasks , booktitle =

    Gururangan, Suchin and Marasovi. Don’t stop pretraining: Adapt language models to domains and tasks , booktitle =

  27. [36]

    Findings of the Association for Computational Linguistics: EMNLP 2021 , doi =

    Han, Xiaochuang and Tsvetkov, Yulia , title =. Findings of the Association for Computational Linguistics: EMNLP 2021 , doi =

  28. [37]

    Transactions on Machine Learning Research , url =

    Zeyu Han and Chao Gao and Jinyang Liu and Jeff Zhang and Sai Qian Zhang , title =. Transactions on Machine Learning Research , url =

  29. [38]

    Hastie, Trevor and Tibshirani, Robert and Friedman, Jerome H and Friedman, Jerome H , title =

  30. [39]

    Foundations and Trends in Optimization , number =

    Hazan, Elad , title =. Foundations and Trends in Optimization , number =

  31. [40]

    First Conference on Language Modeling , url =

    Luxi He and Mengzhou Xia and Peter Henderson , title =. First Conference on Language Modeling , url =

  32. [41]

    Forty-second International Conference on Machine Learning , year =

    He, Yutong and Li, Pengrui and Hu, Yipeng and Chen, Chuyan and Yuan, Kun , title =. Forty-second International Conference on Machine Learning , year =

  33. [42]

    International Conference on Learning Representations , year =

    Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt , title =. International Conference on Learning Representations , year =

  34. [43]

    NeurIPS , year =

    Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt , title =. NeurIPS , year =

  35. [44]

    ICLR , number =

    Hu, Edward J and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu and others , title =. ICLR , number =

  36. [45]

    Advances in Neural Information Processing Systems , editor =

    Hu, Yuzheng and Hu, Pingbang and Zhao, Han and Ma, Jiaqi , title =. Advances in Neural Information Processing Systems , editor =

  37. [46]

    Ma and Han Zhao , title =

    Yuzheng Hu and Fan Wu and Haotian Ye and David Forsyth and James Zou and Nan Jiang and Jiaqi W. Ma and Han Zhao , title =. The Thirty-ninth Annual Conference on Neural Information Processing Systems , url =

  38. [47]

    Ma , title =

    Pingbang Hu and Joseph Melkonian and Weijing Tang and Han Zhao and Jiaqi W. Ma , title =. The Thirty-ninth Annual Conference on Neural Information Processing Systems , url =

  39. [48]

    Ma and Han Zhao , title =

    Pingbang Hu and Yuzheng Hu and Jiaqi W. Ma and Han Zhao , title =. ICLR 2026 Workshop on Navigating and Addressing Data Problems for Foundation Models , year =

  40. [49]

    Detoxifying a Language Model using PPO , howpublished =

  41. [50]

    Communications in Statistics-Simulation and Computation , number =

    Hutchinson, Michael F , title =. Communications in Statistics-Simulation and Computation , number =

  42. [51]

    Findings of the Association for Computational Linguistics: ACL 2023 , pages =

    Ivison, Hamish and Smith, Noah A and Hajishirzi, Hannaneh and Dasigi, Pradeep , title =. Findings of the Association for Computational Linguistics: ACL 2023 , pages =

  43. [53]

    Findings of the Association for Computational Linguistics: NAACL 2025 , doi =

    Jiao, Cathy and Gao, Weizhen and Raghunathan, Aditi and Xiong, Chenyan , title =. Findings of the Association for Computational Linguistics: NAACL 2025 , doi =

  44. [54]

    Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages =

    Joshi, Mandar and Choi, Eunsol and Weld, Daniel S and Zettlemoyer, Luke , title =. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages =

  45. [55]

    arXiv preprint arXiv:2209.13569 , year =

    Kamalakara, Siddhartha Rao and Locatelli, Acyr and Venkitesh, Bharat and Ba, Jimmy and Gal, Yarin and Gomez, Aidan N , title =. arXiv preprint arXiv:2209.13569 , year =

  46. [56]

    Not all samples are created equal: Deep learning with importance sampling , booktitle =

    Katharopoulos, Angelos and Fleuret, Fran. Not all samples are created equal: Deep learning with importance sampling , booktitle =

  47. [57]

    International conference on machine learning , organization =

    Koh, Pang Wei and Liang, Percy , title =. International conference on machine learning , organization =

  48. [58]

    Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages =

    Kung, Po-Nien and Yin, Fan and Wu, Di and Chang, Kai-Wei and Peng, Nanyun , title =. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages =

  49. [59]

    Transactions of the Association for Computational Linguistics , pages =

    Kwiatkowski, Tom and Palomaki, Jennimaria and Redfield, Olivia and Collins, Michael and Parikh, Ankur and Alberti, Chris and Epstein, Danielle and Polosukhin, Illia and Devlin, Jacob and Lee, Kenton and others , title =. Transactions of the Association for Computational Linguistics , pages =

  50. [61]

    Lan, Guanghui , title =

  51. [62]

    Journal of computational and graphical statistics , number =

    Lange, Kenneth and Hunter, David R and Yang, Ilsoon , title =. Journal of computational and graphical statistics , number =

  52. [63]

    Lange, Kenneth , title =

  53. [64]
