arxiv: 2605.07063 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training

Pingbang Hu , Xueshen Liu , Z. Morley Mao , Jiaqi W. Ma

Authors on Pith no claims yet

Pith reviewed 2026-05-11 01:49 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords LLM post-trainingdata regularizationdata selectionSFTRLHFRLVRbias-variance tradeofffeasible set

0 comments

The pith

General training data serves as a regularizer by constraining updates from scarce target data in LLM post-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reframes the role of abundant but imperfect general training data in LLM post-training. Rather than selecting subsets from it to mix with scarce high-fidelity target data, the approach treats the general data as inducing a regularizer at each training step. A feasible set of allowable update directions is built from the general data, and the update direction dictated by the target data is projected onto that set. This unifies prior training and selection techniques as points on a bias-variance spectrum and supplies a broader family of methods with tunable regularization strength. Experiments across supervised fine-tuning, RLHF, and RLVR show consistent gains over data-selection baselines together with low-overhead implementation.

Core claim

At each training step a feasible set of model-update directions is constructed from the general training data, and the direction specified by the scarce target data is projected onto that set. The resulting data-induced regularizer prevents overfitting to the target objective. Standard training and existing selection methods emerge as special cases that differ only in the strength of this regularizer and therefore occupy different positions on a bias-variance spectrum. A richer family of methods is obtained by varying the regularizer, and system optimizations make the approach practical at LLM scale.

What carries the argument

Projection of the target-data update direction onto the feasible set of directions induced by general training data at each step.

If this is right

Existing data-selection and standard-training procedures become special cases obtained by particular choices of the data-induced regularizer.
A continuous spectrum of bias-variance trade-offs becomes available by adjusting regularization strength.
System-level optimizations allow the projection step to run with minimal added cost at LLM scale.
Performance improvements hold across supervised fine-tuning, RLHF, and RLVR relative to selection baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same regularization perspective could be applied in other domains where high-quality labeled data is scarce and general data is plentiful.
The strength of the projection could be adapted automatically during training based on observed loss or gradient statistics.
Combining the feasible-set projection with other regularizers such as weight decay might produce additive gains.

Load-bearing premise

The feasible set built from general data meaningfully limits target-driven updates so that overfitting is reduced without discarding useful target signal.

What would settle it

An experiment on an SFT or RLHF task in which the projected updates produce lower held-out performance than either unregularized target training or standard data-selection baselines.

Figures

Figures reproduced from arXiv: 2605.07063 by Jiaqi W. Ma, Pingbang Hu, Xueshen Liu, Z. Morley Mao.

**Figure 2.** Figure 2: Data regularization spectrum. Different methods correspond to feasible sets of increasing [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗

**Figure 3.** Figure 3: Bias–variance decomposition for a chosen Ut [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗

**Figure 4.** Figure 4: Top. Forward pass and backward pass (activation and weight gradient) computation graph: both −→ and 99K denote computational dependency; 99K indicates that the dependency runs across the boundary between forward and backward. Bottom. Memory footprint: a tensor can be released once no remaining operation depends on it. Forward pass. Each layer l computes the output (pre-activation) of the layer e (l) i,τ = … view at source ↗

**Figure 5.** Figure 5: Memory footprint comparison. Top: Standard Training Update (Algorithm 4.1). Middle: Global Subset Update (Algorithm 4.2). Bottom: Layer-Wise Subset Update (Algorithm 4.3). All three methods achieve comparable peak memory per layer, but Global Subset Update must retain both a (l) and ∂ℓ/∂e(l) for all l throughout the backward pass until the global subset is determined, while standard training and Layer-Wise… view at source ↗

**Figure 6.** Figure 6: shows the training dynamics of alpaca/samsum with three fine-tuning methods, and [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗

**Figure 7.** Figure 7: SFT training dynamics on the three QA settings: evaluation perplexity throughout training. [PITH_FULL_IMAGE:figures/full_fig_p022_7.png] view at source ↗

**Figure 8.** Figure 8: Per-layer scores on alpaca/samsum, averaged across training steps (lines) with 25th–75th interquartile bands. Score magnitudes vary by orders of magnitude across layer types, so Global Subset Update’s global ranking is dominated by down_proj while layers like Q and K (ρ ≲ 0.2) receive effectively uncurated data. The magnitudes ( [PITH_FULL_IMAGE:figures/full_fig_p022_8.png] view at source ↗

**Figure 9.** Figure 9: RLHF (self-reference target). Top. Training reward. Bottom. Evaluation toxicity. We follow a standard RLHF pipeline using TRL [von Werra et al., 2020] with PPO [Schulman et al., 2017] for detoxification [Hugging Face, 2023], using GPT-NEO-2.7B [Black et al., 2021] with LoRA. The policy generates continuations scored by a toxicity-based reward model (LFTW R4 Target [Vidgen et al., 2021]), and is evaluated… view at source ↗

**Figure 10.** Figure 10: RLHF (held-out target). Top. Training reward. Bottom. Evaluation toxicity. All curation methods use negative score filtering, retaining only samples with non-negative gradient alignment. We further ablate two design choices: (i) target signal source: self-referencing (the same rollout batch) versus a new rollout on a held-out target set, and (ii) target loss: reward-weighted log-probability versus the … view at source ↗

**Figure 11.** Figure 11: RLVR on MATH with QWEN3- 1.7B. Evaluation accuracy over training [PITH_FULL_IMAGE:figures/full_fig_p024_11.png] view at source ↗

**Figure 12.** Figure 12: Illustration of MSE for each method as a function of target sample size [PITH_FULL_IMAGE:figures/full_fig_p038_12.png] view at source ↗

**Figure 13.** Figure 13: Per-layer scores on the three QA-only settings (full-parameter SFT trace), analo [PITH_FULL_IMAGE:figures/full_fig_p048_13.png] view at source ↗

read the original abstract

Data selection methods address a critical challenge in LLM post-training: effectively leveraging scarce, high-fidelity target data alongside abundant but imperfectly aligned general training data. In this work, we move beyond the data-selection framing and introduce Dr. Post-Training (Data-Regularized Post-Training), a novel framework that reconceptualizes general training data as a data-induced regularizer that prevents overfitting to the scarce target objective, rather than serving as a pool for selection. Specifically, our framework proposes that at each training step, construct a feasible set of model update directions using the general training data, and project the model update direction specified by the scarce target data onto that feasible set. Standard training and existing data selection methods arise as special cases with different choices of the data-induced regularizer, and these methods correspond to different points on a bias--variance spectrum with different regularization strength. Building on this view, we propose a family of methods offering a richer design space and more flexible bias--variance tradeoffs. For practical LLM-scale use, we introduce careful system optimizations that realize these methods with minimal overhead. Extensive experiments across SFT, RLHF, and RLVR show that our methods consistently outperform state-of-the-art data selection baselines, and system benchmarks confirm their efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The projection reframing is a clean conceptual move that unifies some prior tricks, but the paper still needs to show when the feasible set actually keeps useful target signal instead of just damping it.

read the letter

The core idea here is to treat general data not as a selection pool but as the source of a feasible set of update directions, then project the target-driven update onto that set at each step. That turns standard training and various selection heuristics into different choices of regularizer strength on a bias-variance line. The paper spells out a family of such regularizers and adds system-level tricks to keep the overhead low at LLM scale. Experiments on SFT, RLHF, and RLVR report consistent gains over current data-selection baselines, which is the main empirical claim. Those results are the part worth looking at if you care about data efficiency in post-training. The soft spot is exactly the one the stress-test flagged: the projection only helps if the general-data span overlaps the directions that matter for the target task while cutting the overfitting ones. The abstract and the reported experiments do not include subspace overlap numbers, cosine-similarity checks, or failure cases where the distributions diverge. Without that, it is hard to tell whether the gains come from the principled regularizer or from something closer to early stopping or implicit noise. The math is presented as general, but the practical methods still have a free regularization-strength parameter that needs tuning. This is the kind of paper that belongs in a reading group focused on data-centric alignment or efficient fine-tuning. Readers who already run large-scale post-training experiments will get the most out of the system optimizations and the empirical comparisons. It is worth sending to referees because the perspective is new enough and the experiments are broad enough to merit a careful look, even if the authors will probably have to add the missing alignment analysis before acceptance.

Referee Report

3 major / 1 minor

Summary. The paper proposes Dr. Post-Training, a framework that reconceptualizes abundant general training data as a data-induced regularizer for LLM post-training. At each step, a feasible set of update directions is constructed from general data gradients, and the update direction from scarce target data is projected onto this set. Standard training and data selection methods are presented as special cases corresponding to different regularization strengths on a bias-variance spectrum. A family of methods is introduced with system optimizations for LLM-scale efficiency, and experiments across SFT, RLHF, and RLVR are claimed to show consistent outperformance over state-of-the-art data selection baselines.

Significance. If the projection mechanism can be rigorously shown to constrain overfitting directions while preserving task-relevant target signal (via gradient alignment or subspace overlap), the framework could provide a principled unification of data selection and regularization, enabling more flexible bias-variance control in post-training. The practical system optimizations and empirical claims, if substantiated with ablations and error analysis, would strengthen its utility for leveraging general data in LLM pipelines.

major comments (3)

[Abstract] Abstract: No equations or formal definitions are given for the feasible set construction from general data or the projection operator applied to the target update. Without these, it is impossible to verify whether the framework yields non-tautological benefits or simply reparameterizes existing regularization, directly undermining assessment of the central regularization claim.
[Method] Method (implied in abstract description): The paper provides no analysis, bounds, or conditions (e.g., cosine similarity thresholds or explained variance in gradient subspaces) under which the general-data feasible set overlaps sufficiently with target directions to avoid nullifying useful signal. This is load-bearing for the claim that the approach yields superior bias-variance tradeoffs, especially when general and target distributions differ substantially.
[Experiments] Experiments: The claim of consistent outperformance across SFT, RLHF, and RLVR lacks any reported details on baselines, metrics, error bars, ablation studies, or statistical significance, making it impossible to evaluate whether the results support the superiority over data selection methods or are robust to implementation choices.

minor comments (1)

[Abstract] Abstract: The phrasing is dense; separating the conceptual unification from the proposed family of methods and system optimizations would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our paper. We address each of the major comments point by point below, providing clarifications and indicating revisions to the manuscript where appropriate.

read point-by-point responses

Referee: [Abstract] Abstract: No equations or formal definitions are given for the feasible set construction from general data or the projection operator applied to the target update. Without these, it is impossible to verify whether the framework yields non-tautological benefits or simply reparameterizes existing regularization, directly undermining assessment of the central regularization claim.

Authors: We agree that the abstract, due to space constraints, does not include equations. However, the main text in Section 3 formally defines the feasible set as the set of convex combinations of gradients computed on general data batches, and the projection operator as the solution to a quadratic program minimizing the distance to the target gradient subject to the feasible set constraint. We have revised the abstract to include a high-level mathematical description of these components to make the central claim verifiable from the abstract alone. revision: yes
Referee: [Method] Method (implied in abstract description): The paper provides no analysis, bounds, or conditions (e.g., cosine similarity thresholds or explained variance in gradient subspaces) under which the general-data feasible set overlaps sufficiently with target directions to avoid nullifying useful signal. This is load-bearing for the claim that the approach yields superior bias-variance tradeoffs, especially when general and target distributions differ substantially.

Authors: The manuscript does include empirical analysis of gradient alignment in Section 4, with reported cosine similarities between general and target gradients. We acknowledge the lack of theoretical bounds and have added a new paragraph in the method section providing a sufficient condition based on the principal angle between the gradient subspaces, along with a simple bound on the signal preservation using the minimum overlap. For cases where distributions differ substantially, we discuss how increasing the regularization strength (smaller feasible set) can still be beneficial if some overlap exists, supported by additional experiments. revision: partial
Referee: [Experiments] Experiments: The claim of consistent outperformance across SFT, RLHF, and RLVR lacks any reported details on baselines, metrics, error bars, ablation studies, or statistical significance, making it impossible to evaluate whether the results support the superiority over data selection methods or are robust to implementation choices.

Authors: We regret that these details were not sufficiently highlighted in the main text. The full paper reports: baselines including random selection, perplexity-based, and gradient-based methods; metrics such as downstream task performance and human preference scores; error bars as standard deviations over multiple runs; ablations on regularization strength and feasible set size in the appendix; and statistical significance via t-tests with p-values reported. We have added a dedicated paragraph in the experiments section summarizing these and referencing the relevant tables and figures for clarity. revision: yes

Circularity Check

1 steps flagged

Framework defines prior methods as special cases by construction of the regularizer

specific steps

self definitional [Abstract]
"Standard training and existing data selection methods arise as special cases with different choices of the data-induced regularizer, and these methods correspond to different points on a bias--variance spectrum with different regularization strength."

The feasible-set construction is defined so that varying the regularizer (i.e., the choice of feasible set or projection) directly recovers prior methods by construction; the claim that they 'arise as special cases' therefore reduces to a restatement of the framework's own parameterization rather than an independent insight or prediction.

full rationale

The paper introduces a projection-based regularization view and explicitly states that standard training and data selection emerge as special cases under different choices of the data-induced regularizer. This inclusion is definitional rather than derived from independent principles or equations that could falsify the equivalence. Experimental outperformance claims remain independent of this framing, so the circularity is partial and limited to the organizational claim rather than the core results.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; full mathematical definitions, parameter choices, and experimental protocols are unavailable, so the ledger is necessarily incomplete.

free parameters (1)

regularization strength
The abstract states that different choices of the data-induced regularizer correspond to different points on a bias-variance spectrum with different regularization strength.

axioms (1)

domain assumption General training data can be used to construct a feasible set of model update directions that regularizes updates from scarce target data
This is the core premise of the Dr. Post-Training framework as stated in the abstract.

pith-pipeline@v0.9.0 · 5533 in / 1295 out tokens · 44036 ms · 2026-05-11T01:49:59.334544+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

at each training step, construct a feasible set of model update directions using the general training data, and project the model update direction specified by the scarce target data onto that feasible set
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean J_uniquely_calibrated_via_higher_derivative unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

these methods correspond to different points on a bias–variance spectrum with different regularization strength

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

187 extracted references · 187 canonical work pages · 14 internal anchors

[1]

Advances in Neural Information Processing Systems , pages =

Albalak, Alon and Raffel, Colin A and Wang, William Yang , title =. Advances in Neural Information Processing Systems , pages =

work page
[2]

Transactions on Machine Learning Research , note =

Alon Albalak and Yanai Elazar and Sang Michael Xie and Shayne Longpre and Nathan Lambert and Xinyi Wang and Niklas Muennighoff and Bairu Hou and Liangming Pan and Haewon Jeong and Colin Raffel and Shiyu Chang and Tatsunori Hashimoto and William Yang Wang , title =. Transactions on Machine Learning Research , note =

work page
[3]

SmolLM2: When Smol Goes Big--Data-Centric Training of a Small Language Model , journal =

Allal, Loubna Ben and Lozhkov, Anton and Bakouch, Elie and Bl. SmolLM2: When Smol Goes Big--Data-Centric Training of a Small Language Model , journal =

work page
[4]

Hashimoto , title =

Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. GitHub repository , publisher =

work page
[7]

Journal of machine learning research , number =

Baydin, Atilim Gunes and Pearlmutter, Barak A and Radul, Alexey Andreyevich and Siskind, Jeffrey Mark , title =. Journal of machine learning research , number =

work page
[8]

Machine learning , number =

Ben-David, Shai and Blitzer, John and Crammer, Koby and Kulesza, Alex and Pereira, Fernando and Vaughan, Jennifer Wortman , title =. Machine learning , number =

work page
[9]

OPT 2024: Optimization for Machine Learning , year =

Bernstein, Jeremy and Newhouse, Laker , title =. OPT 2024: Optimization for Machine Learning , year =

work page 2024
[10]

Optimization methods for large-scale machine learning , journal =

Bottou, L. Optimization methods for large-scale machine learning , journal =

work page
[12]

Open problems and fundamental limitations of reinforcement learning from human feedback , journal =

Casper, Stephen and Davies, Xander and Shi, Claudia and Gilbert, Thomas Krendl and Scheurer, J. Open problems and fundamental limitations of reinforcement learning from human feedback , journal =

work page
[14]

The Twelfth International Conference on Learning Representations , year =

Chen, Lichang and Li, Shiyang and Yan, Jun and Wang, Hai and Gunaratna, Kalpa and Yadav, Vikas and Tang, Zheng and Srinivasan, Vijay and Zhou, Tianyi and Huang, Heng and others , title =. The Twelfth International Conference on Learning Representations , year =

work page
[15]

Advances in Neural Information Processing Systems , year =

Choe, Sang Keun and Ahn, Hwijeen and Bae, Juhan and Zhao, Kewen and Kang, Minsoo and Chung, Youngseog and Pratapa, Adithya and Neiswanger, Willie and Strubell, Emma and Mitamura, Teruko and others , title =. Advances in Neural Information Processing Systems , year =

work page
[16]

Advances in neural information processing systems , volume =

Christiano, Paul F and Leike, Jan and Brown, Tom and Martic, Miljan and Legg, Shane and Amodei, Dario , title =. Advances in neural information processing systems , volume =

work page
[17]

Transactions of the Association for Computational Linguistics , pages =

Clark, Jonathan H and Choi, Eunsol and Collins, Michael and Garrette, Dan and Kwiatkowski, Tom and Nikolaev, Vitaly and Palomaki, Jennimaria , title =. Transactions of the Association for Computational Linguistics , pages =

work page
[18]

Frustratingly easy domain adaptation , booktitle =

Daum. Frustratingly easy domain adaptation , booktitle =

work page
[20]

Advances in neural information processing systems , pages =

Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke , title =. Advances in neural information processing systems , pages =

work page
[21]

Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina , title =. Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) , pages =

work page 2019
[22]

International Conference on Machine Learning , organization =

Ethayarajh, Kawin and Choi, Yejin and Swayamdipta, Swabha , title =. International Conference on Machine Learning , organization =

work page
[23]

Naval research logistics quarterly , number =

Frank, Marguerite and Wolfe, Philip , title =. Naval research logistics quarterly , number =

work page
[24]

Domain-adversarial training of neural networks , journal =

Ganin, Yaroslav and Ustinova, Evgeniya and Ajakan, Hana and Germain, Pascal and Larochelle, Hugo and Laviolette, Fran. Domain-adversarial training of neural networks , journal =

work page
[26]

International conference on machine learning , organization =

Ghorbani, Amirata and Zou, James , title =. International conference on machine learning , organization =

work page
[27]

EMNLP-IJCNLP 2019 , pages =

Gliwa, Bogdan and Mochol, Iwona and Biesek, Maciej and Wawer, Aleksander , title =. EMNLP-IJCNLP 2019 , pages =

work page 2019
[28]

Information and Software Technology , pages =

Gong, Youdi and Liu, Guangzhen and Xue, Yunzhi and Li, Rui and Meng, Lingzhong , title =. Information and Software Technology , pages =

work page
[31]

Textbooks are all you need , journal =

Gunasekar, Suriya and Zhang, Yi and Aneja, Jyoti and Mendes, Caio C. Textbooks are all you need , journal =

work page
[32]

Proceedings of the 2018 conference on empirical methods in natural language processing , pages =

Guo, Jiang and Shah, Darsh and Barzilay, Regina , title =. Proceedings of the 2018 conference on empirical methods in natural language processing , pages =

work page 2018
[34]

Gradient Descent Happens in a Tiny Subspace

Gur-Ari, Guy and Roberts, Daniel A and Dyer, Ethan , title =. arXiv preprint arXiv:1812.04754 , year =

work page Pith review arXiv
[35]

Don’t stop pretraining: Adapt language models to domains and tasks , booktitle =

Gururangan, Suchin and Marasovi. Don’t stop pretraining: Adapt language models to domains and tasks , booktitle =

work page
[36]

Findings of the Association for Computational Linguistics: EMNLP 2021 , doi =

Han, Xiaochuang and Tsvetkov, Yulia , title =. Findings of the Association for Computational Linguistics: EMNLP 2021 , doi =

work page 2021
[37]

Transactions on Machine Learning Research , url =

Zeyu Han and Chao Gao and Jinyang Liu and Jeff Zhang and Sai Qian Zhang , title =. Transactions on Machine Learning Research , url =

work page
[38]

Hastie, Trevor and Tibshirani, Robert and Friedman, Jerome H and Friedman, Jerome H , title =

work page
[39]

Foundations and Trends in Optimization , number =

Hazan, Elad , title =. Foundations and Trends in Optimization , number =

work page
[40]

First Conference on Language Modeling , url =

Luxi He and Mengzhou Xia and Peter Henderson , title =. First Conference on Language Modeling , url =

work page
[41]

Forty-second International Conference on Machine Learning , year =

He, Yutong and Li, Pengrui and Hu, Yipeng and Chen, Chuyan and Yuan, Kun , title =. Forty-second International Conference on Machine Learning , year =

work page
[42]

International Conference on Learning Representations , year =

Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt , title =. International Conference on Learning Representations , year =

work page
[43]

NeurIPS , year =

Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt , title =. NeurIPS , year =

work page
[44]

ICLR , number =

Hu, Edward J and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu and others , title =. ICLR , number =

work page
[45]

Advances in Neural Information Processing Systems , editor =

Hu, Yuzheng and Hu, Pingbang and Zhao, Han and Ma, Jiaqi , title =. Advances in Neural Information Processing Systems , editor =

work page
[46]

Ma and Han Zhao , title =

Yuzheng Hu and Fan Wu and Haotian Ye and David Forsyth and James Zou and Nan Jiang and Jiaqi W. Ma and Han Zhao , title =. The Thirty-ninth Annual Conference on Neural Information Processing Systems , url =

work page
[47]

Ma , title =

Pingbang Hu and Joseph Melkonian and Weijing Tang and Han Zhao and Jiaqi W. Ma , title =. The Thirty-ninth Annual Conference on Neural Information Processing Systems , url =

work page
[48]

Ma and Han Zhao , title =

Pingbang Hu and Yuzheng Hu and Jiaqi W. Ma and Han Zhao , title =. ICLR 2026 Workshop on Navigating and Addressing Data Problems for Foundation Models , year =

work page 2026
[49]

Detoxifying a Language Model using PPO , howpublished =

work page
[50]

Communications in Statistics-Simulation and Computation , number =

Hutchinson, Michael F , title =. Communications in Statistics-Simulation and Computation , number =

work page
[51]

Findings of the Association for Computational Linguistics: ACL 2023 , pages =

Ivison, Hamish and Smith, Noah A and Hajishirzi, Hannaneh and Dasigi, Pradeep , title =. Findings of the Association for Computational Linguistics: ACL 2023 , pages =

work page 2023
[53]

Findings of the Association for Computational Linguistics: NAACL 2025 , doi =

Jiao, Cathy and Gao, Weizhen and Raghunathan, Aditi and Xiong, Chenyan , title =. Findings of the Association for Computational Linguistics: NAACL 2025 , doi =

work page 2025
[54]

Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages =

Joshi, Mandar and Choi, Eunsol and Weld, Daniel S and Zettlemoyer, Luke , title =. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages =

work page
[55]

arXiv preprint arXiv:2209.13569 , year =

Kamalakara, Siddhartha Rao and Locatelli, Acyr and Venkitesh, Bharat and Ba, Jimmy and Gal, Yarin and Gomez, Aidan N , title =. arXiv preprint arXiv:2209.13569 , year =

work page arXiv
[56]

Not all samples are created equal: Deep learning with importance sampling , booktitle =

Katharopoulos, Angelos and Fleuret, Fran. Not all samples are created equal: Deep learning with importance sampling , booktitle =

work page
[57]

International conference on machine learning , organization =

Koh, Pang Wei and Liang, Percy , title =. International conference on machine learning , organization =

work page
[58]

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages =

Kung, Po-Nien and Yin, Fan and Wu, Di and Chang, Kai-Wei and Peng, Nanyun , title =. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages =

work page 2023
[59]

Transactions of the Association for Computational Linguistics , pages =

Kwiatkowski, Tom and Palomaki, Jennimaria and Redfield, Olivia and Collins, Michael and Parikh, Ankur and Alberti, Chris and Epstein, Danielle and Polosukhin, Illia and Devlin, Jacob and Lee, Kenton and others , title =. Transactions of the Association for Computational Linguistics , pages =

work page
[61]

Lan, Guanghui , title =

work page
[62]

Journal of computational and graphical statistics , number =

Lange, Kenneth and Hunter, David R and Yang, Ilsoon , title =. Journal of computational and graphical statistics , number =

work page
[63]

Lange, Kenneth , title =

work page
[64]

The Twelfth International Conference on Learning Representations , url =

Vladislav Lialin and Sherin Muckatira and Namrata Shivagunde and Anna Rumshisky , title =. The Twelfth International Conference on Learning Representations , url =

work page
[65]

Text summarization branches out , pages =

Lin, Chin-Yew , title =. Text summarization branches out , pages =

work page
[66]

ACM Computing Surveys , number =

Ling, Chen and Zhao, Xujiang and Lu, Jiaying and Deng, Chengyuan and Zheng, Can and Wang, Junxiang and Chowdhury, Tanmoy and Li, Yun and Cui, Hejie and Zhang, Xuchao and others , title =. ACM Computing Surveys , number =

work page
[67]

International conference on machine learning , organization =

Liu, Zijian and Nguyen, Ta Duy and Nguyen, Thien Hang and Ene, Alina and Nguyen, Huy , title =. International conference on machine learning , organization =

work page
[68]

arXiv preprint arXiv:1511.06343 , year =

Loshchilov, Ilya and Hutter, Frank , title =. arXiv preprint arXiv:1511.06343 , year =

work page arXiv
[69]

International Conference on Learning Representations , url =

Ilya Loshchilov and Frank Hutter , title =. International Conference on Learning Representations , url =

work page
[70]

The Twelfth International Conference on Learning Representations , url =

Keming Lu and Hongyi Yuan and Zheng Yuan and Runji Lin and Junyang Lin and Chuanqi Tan and Chang Zhou and Jingren Zhou , title =. The Twelfth International Conference on Learning Representations , url =

work page
[71]

International conference on machine learning , organization =

Mairal, Julien , title =. International conference on machine learning , organization =

work page
[72]

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages =

Mishra, Swaroop and Khashabi, Daniel and Baral, Chitta and Hajishirzi, Hannaneh , title =. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages =

work page
[73]

The Thirteenth International Conference on Learning Representations , year =

Muennighoff, Niklas and Hongjin, SU and Wang, Liang and Yang, Nan and Wei, Furu and Yu, Tao and Singh, Amanpreet and Kiela, Douwe , title =. The Thirteenth International Conference on Learning Representations , year =

work page
[74]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages =

Muhamed, Aashiq and Li, Oscar and Woodruff, David and Diab, Mona and Smith, Virginia , title =. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages =

work page 2024
[75]

Nesterov, Yurii , title =

work page
[76]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , url =

Mahdi Nikdan and Vincent Cohen-Addad and Dan Alistarh and Vahab Mirrokni , title =. The Thirty-ninth Annual Conference on Neural Information Processing Systems , url =

work page
[77]

Advances in neural information processing systems , pages =

Ouyang, Long and Wu, Jeffrey and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and others , title =. Advances in neural information processing systems , pages =

work page
[78]

IEEE Transactions on knowledge and data engineering , number =

Pan, Sinno Jialin and Yang, Qiang , title =. IEEE Transactions on knowledge and data engineering , number =

work page
[79]

SIDDA: SInkhorn Dynamic Domain Adaptation for image classification with equivariant neural networks , journal =

Pandya, Sneh and Patel, Purvik and Nord, Brian D and Walmsley, Mike and. SIDDA: SInkhorn Dynamic Domain Adaptation for image classification with equivariant neural networks , journal =

work page
[80]

Advances in neural information processing systems , pages =

Perez, Ethan and Kiela, Douwe and Cho, Kyunghyun , title =. Advances in neural information processing systems , pages =

work page
[81]

Advances in Neural Information Processing Systems , pages =

Pruthi, Garima and Liu, Frederick and Kale, Satyen and Sundararajan, Mukund , title =. Advances in Neural Information Processing Systems , pages =

work page
[83]

Proceedings of the 2016 conference on empirical methods in natural language processing , pages =

Rajpurkar, Pranav and Zhang, Jian and Lopyrev, Konstantin and Liang, Percy , title =. Proceedings of the 2016 conference on empirical methods in natural language processing , pages =

work page 2016
[84]

Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining , pages =

Rasley, Jeff and Rajbhandari, Samyam and Ruwase, Olatunji and He, Yuxiong , title =. Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining , pages =

work page
[85]

Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages =

Renduchintala, Adithya and Konuk, Tugrul and Kuchaiev, Oleksii , title =. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages =

work page 2024
[86]

Psychiatry research , pages =

Sarzynska-Wawer, Justyna and Wawer, Aleksander and Pawlak, Aleksandra and Szymanowska, Julia and Stefaniak, Izabela and Jarkiewicz, Michal and Okruszek, Lukasz , title =. Psychiatry research , pages =

work page
[89]

Shalev-Shwartz, Shai and Ben-David, Shai , title =

work page
[93]

Covariate shift adaptation by importance weighted cross validation

Sugiyama, Masashi and Krauledat, Matthias and M. Covariate shift adaptation by importance weighted cross validation. , journal =

work page
[94]

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , doi =

Tenney, Ian and Das, Dipanjan and Pavlick, Ellie , title =. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , doi =

work page
[95]

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , month = dec, pages =

Thakkar, Megh and Bolukbasi, Tolga and Ganapathy, Sriram and Vashishth, Shikhar and Chandar, Sarath and Talukdar, Partha , title =. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , month = dec, pages =

work page 2023
[96]

Vershynin, Roman , title =

work page
[97]

ACL , year =

Bertie Vidgen and Tristan Thrush and Zeerak Waseem and Douwe Kiela , title =. ACL , year =

work page

Showing first 80 references.