SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe

Marzyeh Ghassemi; Sanqiang Zhao; Shujian Zhang; Wenxuan Zhou; Yuxin Xiao

arXiv: 2410.05248 · v4 · submitted 2024-10-07 · cs.CL · cs.AI · cs.LG

Yuxin Xiao, Shujian Zhang, Wenxuan Zhou, Marzyeh Ghassemi, Sanqiang Zhao

Pith reviewed 2026-05-23 19:33 UTC · model grok-4.3

keywords: SFTMix · mixup · instruction tuning · large language models · training dynamics · confidence levels · supervised fine-tuning · regularization

The pith

Mixup on examples of varying confidence improves LLM instruction tuning without curated datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes SFTMix, a mixup recipe that improves instruction tuning by identifying examples with different confidence levels through training dynamics and interpolating them. It claims that confident examples tend to overfit while unconfident ones generalize poorly, so bridging this gap via interpolation plus regularization yields better results. A sympathetic reader would care because the approach avoids the cost of proprietary LLM filtering or human annotation for high-quality data. The method shows gains on both general instruction-following and healthcare tasks across model families and dataset sizes.

Core claim

SFTMix leverages training dynamics to identify examples with varying confidence levels across the semantic representation space. Confident data is prone to overfitting while unconfident data is harder to generalize, so the method interpolates them to bridge the gap and applies mixup-based regularization on the resulting examples to support learning without relying on well-curated SFT datasets.
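
As a rough illustration of the identification step, the sketch below scores each example by how well several training checkpoints predict its response. The Figure 1 excerpt on this page mentions the LLM's perplexity on each instruction-response pair, but the aggregation rule used here (negative mean perplexity over checkpoints), the median split, and every function name are assumptions made for illustration, not the paper's confirmed definitions.

```python
import math
from statistics import mean

def perplexity(token_logprobs):
    """Perplexity of one response from its per-token log-probabilities."""
    return math.exp(-mean(token_logprobs))

def confidence_scores(logprobs_per_checkpoint):
    """Assumed rule: confidence = negative mean perplexity over checkpoints.

    logprobs_per_checkpoint[c][i] holds the token log-probs that
    checkpoint c assigns to the response of example i.
    """
    n_examples = len(logprobs_per_checkpoint[0])
    scores = []
    for i in range(n_examples):
        ppl = [perplexity(ckpt[i]) for ckpt in logprobs_per_checkpoint]
        scores.append(-mean(ppl))
    return scores

def split_by_confidence(scores):
    """Assumed median split into (confident, unconfident) example indices."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    half = len(order) // 2
    return order[:half], order[half:]

# Toy data: 2 checkpoints, 4 examples, hypothetical token log-probs.
toy = [
    [[-0.1, -0.2], [-2.0, -1.5], [-0.3, -0.1], [-1.0, -3.0]],
    [[-0.1, -0.1], [-1.8, -1.2], [-0.2, -0.2], [-0.9, -2.5]],
]
scores = confidence_scores(toy)
confident, unconfident = split_by_confidence(scores)
```

In this toy run, the two examples with low perplexity at both checkpoints land in the confident half. A real pipeline would obtain the log-probabilities from saved checkpoints of a reference fine-tuning run.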

What carries the argument

SFTMix, a mixup recipe that classifies examples by confidence from training dynamics then interpolates them for regularization.
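
The interpolation and regularization steps can be sketched in a few lines. This is a minimal sketch under stated assumptions: the mixing coefficient is drawn from a Beta distribution as in standard mixup, the interpolation acts on vector representations, and the regularizer is added to the usual NTP loss with a weight `mu`. The values of `alpha` and `mu`, the toy vectors, and the stand-in for the LM head are all hypothetical and not taken from the paper.

```python
import math
import random

def interpolate(vec_a, vec_b, lam):
    """Convex combination lam * a + (1 - lam) * b of two vectors."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(vec_a, vec_b)]

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-likelihood of the target token id under softmax(logits)."""
    return -math.log(softmax(logits)[target])

def mixup_loss(logits, target_conf, target_unconf, lam):
    """Mixup regularizer: the same lam weights the two targets' losses."""
    return (lam * cross_entropy(logits, target_conf)
            + (1.0 - lam) * cross_entropy(logits, target_unconf))

rng = random.Random(0)
alpha = 0.5                          # hypothetical Beta concentration
lam = rng.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]

h_conf = [1.0, 0.0, 2.0]     # toy representation of a confident example
h_unconf = [0.0, 1.0, -2.0]  # toy representation of an unconfident example
h_mix = interpolate(h_conf, h_unconf, lam)

# Stand-in for the LM head: reuse the mixed vector as logits over 3 tokens.
reg = mixup_loss(h_mix, target_conf=0, target_unconf=1, lam=lam)
mu = 0.2        # hypothetical regularization weight
ntp_loss = 1.0  # placeholder for the usual NTP loss on the original data
total_loss = ntp_loss + mu * reg
```

Using one `lam` for both the inputs and the targets keeps the mixed example and its supervision consistent, which is the defining property of mixup-style training.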

If this is right

Performance gains appear on both instruction-following and healthcare-specific tasks.
Improvements hold across LLM families and across SFT datasets of varying sizes and qualities.
The recipe remains compatible with existing data selection techniques.
It adapts to compute-constrained training scenarios.
The approach scales to broader applications beyond the tested tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same interpolation step could be tested on other fine-tuning objectives such as preference tuning or continued pretraining.
If the confidence metric generalizes, it might reduce reliance on external models for data quality assessment in new domains.
Applying SFTMix only to the lowest-confidence subset could serve as a lightweight variant for resource-limited settings.
The semantic space unevenness observation suggests potential for confidence-aware sampling in active learning loops.

Load-bearing premise

Examples with different confidence levels should play distinct roles in instruction tuning because confident data overfits and unconfident data generalizes poorly.

What would settle it

If standard SFT on the same datasets and models yields equal or better results than SFTMix across multiple benchmarks, or if confidence scores from training dynamics show no correlation with overfitting or generalization gaps, the central claim would be falsified.
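
One way to probe the second condition is a rank correlation between per-example confidence and a per-example generalization gap. The sketch below implements Spearman's correlation with the standard library only. The gap definition and all numbers are hypothetical placeholders, not measurements from the paper.

```python
def ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            result[order[k]] = avg
        pos = end + 1
    return result

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical per-example values, for illustration only.
confidence = [-1.1, -1.3, -4.9, -6.2, -2.0]
overfit_gap = [0.40, 0.35, 0.10, 0.05, 0.20]
rho = spearman(confidence, overfit_gap)
```

A correlation near zero on real data would count against the premise, while a clearly positive value would be consistent with confident examples overfitting more.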

Figures

Figures reproduced from arXiv: 2410.05248 by Marzyeh Ghassemi, Sanqiang Zhao, Shujian Zhang, Wenxuan Zhou, Yuxin Xiao.

**Figure 1.** Embeddings of 2,500 most and 2,500 least confident examples in Alpaca-52K by Llama-3.1-8B trained using NTP. The clear separation between these embeddings suggests that the LLM exhibits varying confidence levels across different semantic regions. … et al., 2020) of this LLM by computing its confidence in generating each pair (X_i, Y_i) ∈ D. Specifically, let Perp_c(Y_i | X_i) denote the LLM’s perplexity for …

**Figure 2.** The overall pipeline of the three-stage SFTMix recipe for LLM instruction tuning.

**Figure 3.** Confidence distributions from instruction …

Original abstract

To acquire instruction-following capabilities, large language models (LLMs) undergo instruction tuning, where they are trained on instruction-response pairs using next-token prediction (NTP). Efforts to improve instruction tuning often focus on higher-quality supervised fine-tuning (SFT) datasets, typically requiring data filtering with proprietary LLMs or human annotation. In this paper, we take a different approach by proposing SFTMix, a novel Mixup-based recipe that elevates LLM instruction tuning without relying on well-curated datasets. We observe that LLMs exhibit uneven confidence across the semantic representation space. We argue that examples with different confidence levels should play distinct roles in instruction tuning: Confident data is prone to overfitting, while unconfident data is harder to generalize. Based on this insight, SFTMix leverages training dynamics to identify examples with varying confidence levels. We then interpolate them to bridge the confidence gap and apply a Mixup-based regularization to support learning on these additional, interpolated examples. We demonstrate the effectiveness of SFTMix in both instruction-following and healthcare-specific SFT tasks, with consistent improvements across LLM families and SFT datasets of varying sizes and qualities. Extensive analyses across six directions highlight SFTMix's compatibility with data selection, adaptability to compute-constrained scenarios, and scalability to broader applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SFTMix applies training-dynamics confidence to pick and interpolate SFT examples then runs mixup regularization, with claimed gains on instruction and healthcare tasks.

The main takeaway is that SFTMix identifies high- and low-confidence examples via training dynamics, creates interpolated versions between them, and adds mixup regularization on top during instruction tuning. This is meant to improve results without first curating higher-quality data. The paper tests the recipe on multiple LLM families and on both general and healthcare-specific SFT datasets of different sizes and qualities, plus some follow-up analyses on compatibility with data selection and low-compute settings. That spread of experiments is the part that actually adds value here. The specific combination of confidence stratification plus interpolation for this use case looks like the new piece, even if mixup itself is established elsewhere.

The work is presented as a practical, data-agnostic recipe rather than a theoretical advance. The motivating claim that confident examples overfit while unconfident ones generalize poorly is stated up front and used to justify the interpolation step. The paper treats this mostly as an empirical observation and focuses on end-task gains rather than running targeted tests that isolate why interpolating across that distinction helps. Without the actual numbers, error bars, or full ablation tables it is hard to judge how large the improvements are or how they compare to stronger recent baselines. The healthcare results are a reasonable addition for applied interest, but they do not change the fact that the core evidence is still performance deltas on standard setups.

Readers working on efficient or domain-adapted fine-tuning would be the ones who might try this out or cite the recipe. The method is simple enough to reproduce and the experiments cover enough ground that the paper clears the bar for a serious referee. I would send it to review.

Referee Report

2 major / 3 minor

Summary. The paper proposes SFTMix, a Mixup-based recipe for elevating LLM instruction tuning without requiring curated high-quality SFT datasets. It identifies examples with varying confidence levels via training dynamics, interpolates between confident (overfitting-prone) and unconfident (generalization-hard) examples to bridge gaps, and applies Mixup regularization on the interpolated data. The method is evaluated on instruction-following and healthcare-specific tasks, reporting consistent gains across LLM families and datasets of different sizes/qualities, with additional analyses on compatibility, compute constraints, and scalability.

Significance. If the empirical results hold, SFTMix offers a practical, data-agnostic enhancement to standard next-token prediction SFT that avoids reliance on proprietary LLMs or human annotation for data filtering. The six-direction analyses provide evidence of robustness and complementarity with existing data selection methods, which could make effective instruction tuning more accessible in resource-constrained or domain-specific settings.

major comments (2)

[Abstract, §3] Abstract and §3: The motivating claim that 'confident data is prone to overfitting, while unconfident data is harder to generalize' is presented as an observation from training dynamics, but the manuscript does not report a direct ablation isolating the distinct roles (e.g., training only on confident vs. only on unconfident subsets with and without interpolation) to confirm this drives the gains rather than the Mixup regularization alone.
[§4] §4 (Experiments): While consistent improvements are asserted across models and datasets, the results lack reported error bars, statistical significance tests, or multiple random seeds for the main tables; this weakens the cross-family and cross-dataset claims given the known variance in LLM fine-tuning.

minor comments (3)

[§3.1] Notation for confidence scoring via training dynamics (e.g., loss trajectories or logit margins) should be formalized with an equation in §3.1 for reproducibility.
[Figure 2] Figure 2 or equivalent visualization of interpolated examples would benefit from clearer labeling of the interpolation parameter λ and its sampling distribution.
[§4.2] The healthcare-specific task description in §4.2 should include the exact dataset size and domain adaptation details to allow direct comparison with general instruction-following results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and recommendation of minor revision. We address each major comment below.

Referee: [Abstract, §3] Abstract and §3: The motivating claim that 'confident data is prone to overfitting, while unconfident data is harder to generalize' is presented as an observation from training dynamics, but the manuscript does not report a direct ablation isolating the distinct roles (e.g., training only on confident vs. only on unconfident subsets with and without interpolation) to confirm this drives the gains rather than the Mixup regularization alone.

Authors: We acknowledge that the manuscript presents the motivation based on training dynamics observations without a dedicated ablation that isolates the contribution of confidence-stratified interpolation from Mixup regularization alone. The six-direction analyses demonstrate overall gains and complementarity, but a targeted ablation would provide stronger causal evidence for the claimed roles of confident and unconfident examples. We will add this ablation in the revised manuscript. revision: yes

Referee: [§4] §4 (Experiments): While consistent improvements are asserted across models and datasets, the results lack reported error bars, statistical significance tests, or multiple random seeds for the main tables; this weakens the cross-family and cross-dataset claims given the known variance in LLM fine-tuning.

Authors: We agree that the absence of error bars and multi-seed statistics limits the strength of the cross-model and cross-dataset claims, given the known variance in LLM fine-tuning. Our reported results used single runs owing to compute limits across the evaluated model families and dataset scales. We will rerun key experiments with multiple random seeds, add error bars, and include statistical significance tests in the revised version. revision: yes
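
The multi-seed reporting discussed in this exchange could look roughly like the sketch below, which summarizes per-seed scores as mean and standard deviation and estimates a one-sided paired bootstrap p-value. The scores are hypothetical placeholders, not results from the paper, and the choice of a paired bootstrap is an assumption; the authors do not say which significance test they would use.

```python
import random
from statistics import mean, stdev

def paired_bootstrap_pvalue(baseline, treated, n_resamples=2000, seed=0):
    """Fraction of resamples whose mean paired difference is not positive."""
    rng = random.Random(seed)
    diffs = [t - b for b, t in zip(baseline, treated)]
    not_positive = 0
    for _ in range(n_resamples):
        sample = [rng.choice(diffs) for _ in diffs]
        if mean(sample) <= 0:
            not_positive += 1
    return not_positive / n_resamples

# Hypothetical benchmark scores over five seeds (not results from the paper).
ntp_scores = [61.2, 60.8, 61.9, 60.5, 61.4]
sftmix_scores = [62.0, 61.1, 62.3, 61.0, 61.6]

summary = {
    "NTP": (mean(ntp_scores), stdev(ntp_scores)),
    "SFTMix": (mean(sftmix_scores), stdev(sftmix_scores)),
}
p_value = paired_bootstrap_pvalue(ntp_scores, sftmix_scores)
```

Pairing by seed removes seed-to-seed variance shared by both methods. With only a handful of seeds the estimate is coarse, so it complements rather than replaces the reported error bars.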

Circularity Check

0 steps flagged

No significant circularity; empirical recipe with independent experimental validation

The paper proposes SFTMix as a data-agnostic Mixup recipe motivated by an explicit observation (uneven LLM confidence across semantic space) and an argument about distinct roles for confident vs. unconfident examples. It then describes a procedure using training dynamics for identification, interpolation, and regularization, followed by empirical demonstrations across models, datasets, and tasks. No equations, fitted parameters called predictions, self-definitional steps, or load-bearing self-citations appear in the abstract or described method. The central claims rest on reported performance gains rather than any derivation that reduces to the method's own inputs by construction. This is a standard empirical contribution with self-contained experimental support.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the central claim rests on the untested premise that confidence-stratified mixup improves generalization.

pith-pipeline@v0.9.0 · 5778 in / 1029 out tokens · 29526 ms · 2026-05-23T19:33:40.621294+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Brief Overview: On-Policy Self-Distillation In Large Language Models
cs.HC 2026-05 unverdicted novelty 2.0

OPSD lets a single LLM distill its own reasoning by sampling trajectories from the student role while granting the teacher role privileged access to verified solutions, reducing memory needs versus separate-model dist...

A Brief Overview: On-Policy Self-Distillation In Large Language Models
cs.HC 2026-05 unverdicted novelty 2.0

This overview paper explains the conceptual foundations and design principles of On-Policy Self-Distillation for large language models from a beginner's perspective.


[1] [1]

online" 'onlinestring :=

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...

work page

[2] [2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page
