pith. machine review for the scientific record.

arxiv: 2512.02764 · v3 · submitted 2025-12-02 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 02:33 UTC · model grok-4.3

classification 💻 cs.CL
keywords: PEFT · parameter-efficient fine-tuning · large language models · benchmarking · replicability · unified framework · LLM fine-tuning · modular design

The pith

PEFT-Factory supplies one controlled environment that bundles 19 PEFT methods with 27 datasets for reproducible LLM fine-tuning comparisons.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

New parameter-efficient fine-tuning methods for large language models often prove difficult to replicate or compare fairly because each arrives with its own code and setup. The paper presents PEFT-Factory as a single downstream framework that incorporates both off-the-shelf and custom PEFT approaches, along with classification and text-generation datasets and both standard and method-specific metrics. Its modular structure is meant to let users add new methods while keeping experiments in a stable, ready-to-run state. A reader would care whether the claim holds because, if it does, the framework could reduce duplicated effort and make performance numbers reported across different techniques more trustworthy.

Core claim

PEFT-Factory is introduced as a unified framework originating from LLaMA-Factory that natively implements a representative set of 19 PEFT methods, supplies 27 datasets spanning 12 tasks, and includes both standard and PEFT-specific evaluation metrics, thereby creating a ready-to-use, controlled, and stable environment that improves replicability and benchmarking of PEFT methods.

What carries the argument

The modular design of PEFT-Factory, which supports extensibility while delivering native implementations of 19 PEFT methods together with fixed datasets and metrics.
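
As a concrete picture of what that modular design implies, a minimal sketch follows; the names here (`PEFT_REGISTRY`, `register`, `PEFTMethod`, `build`) are invented for illustration and are not PEFT-Factory's actual API:

```python
# Hypothetical sketch of a plugin-style method registry; all names are
# illustrative, not PEFT-Factory's real interface.
from abc import ABC, abstractmethod

PEFT_REGISTRY: dict[str, type] = {}

def register(name: str):
    """Class decorator that makes a PEFT method selectable by name."""
    def wrap(cls):
        PEFT_REGISTRY[name] = cls
        return cls
    return wrap

class PEFTMethod(ABC):
    @abstractmethod
    def inject(self, model):
        """Attach trainable adapter parameters; freeze everything else."""

@register("lora")
class LoRA(PEFTMethod):
    def __init__(self, rank: int = 8):
        self.rank = rank

    def inject(self, model):
        ...  # wrap target linear layers with low-rank A/B factors

def build(name: str, **kwargs) -> PEFTMethod:
    # The pipeline only ever calls build(); adding a twentieth method
    # means adding one decorated class, nothing else changes.
    return PEFT_REGISTRY[name](**kwargs)
```

Under a pattern like this, the data loaders, training loop, and evaluation harness stay untouched as methods are added, which is the property the replicability claim leans on.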

If this is right

  • Newly proposed PEFT methods can be added and tested against the existing set without rebuilding the surrounding pipeline.
  • Comparisons of method performance become possible under identical data splits, evaluation protocols, and hardware conditions (a sketch of such a sweep follows this list).
  • Researchers gain immediate access to both classification and text-generation benchmarks when evaluating a new technique.
  • Custom PEFT variants can be inserted into the same controlled environment used for the built-in methods.
  • Results reported from the framework carry consistent metrics that combine general and PEFT-specific measures.
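
A minimal sketch of what such a controlled comparison could look like; `run_experiment` and its arguments are hypothetical stand-ins for whatever entry point a unified framework exposes:

```python
# Hypothetical controlled sweep: only the method name varies, while the
# dataset split, seed, and evaluation stay fixed across all runs.
import random

def run_experiment(method: str, dataset: str, seed: int) -> float:
    """Stand-in for a framework entry point: train `method` on a fixed
    split of `dataset` with a fixed seed, return a single metric."""
    random.seed(seed)
    # ... shared loaders, shared training loop, shared evaluation ...
    return random.random()  # placeholder metric for the sketch

results = {m: run_experiment(m, dataset="boolq", seed=42)
           for m in ("lora", "prompt_tuning", "bitfit")}
print(results)
```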

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework could serve as a shared reference point that later papers adopt to report results, reducing the current scatter of incompatible experimental setups.
  • Extending the same modular structure to newer model families or additional task types would follow naturally from the design choices already made.
  • If adoption grows, the collection of 27 datasets might evolve into a de-facto standard testbed for efficient fine-tuning research.
  • Teams building production systems could use the same codebase to prototype and then deploy a chosen PEFT method with less translation effort.

Load-bearing premise

The native implementations of the 19 PEFT methods produce stable and comparable results across the 27 datasets without hidden code differences or extra tuning steps that would make fair comparison impossible.

What would settle it

Running the same PEFT method on the same dataset inside and outside PEFT-Factory and finding materially different performance numbers that trace to unaccounted implementation choices.
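
The outside-the-framework leg of such a check is straightforward to set up with the Hugging Face peft library (Mangrulkar et al., 2022); the model choice and hyperparameters below are arbitrary placeholders, and the sketch assumes nothing about PEFT-Factory's own interface:

```python
# One leg of the check: LoRA applied outside any unified framework via
# the Hugging Face `peft` library. Model and hyperparameters are
# arbitrary placeholders; only the comparison protocol matters.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")
lora = LoraConfig(
    r=8,                        # low-rank dimension
    lora_alpha=16,              # scaling factor
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # a fraction of a percent trainable
# ... train and evaluate on the exact split the framework uses, then
# compare this metric against the in-framework run ...
```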

Figures

Figures reproduced from arXiv: 2512.02764 by Ivan Srba, Maria Bielikova, Robert Belanec.

Figure 1: Diagram representing the components of PEFT-FACTORY. The four main overarching components of PEFT-FACTORY are PEFT Methods, Datasets, Models, and Metrics, which are further defined by their subcomponents. Components shown in green are implemented in PEFT-FACTORY; components in blue are native to LLaMA-Factory (Zheng et al., 2024a). Additionally, the Adapters library requires a different … view at source ↗
Figure 5: Example directory structure of custom PEFT. [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗
Figure 2: Selection of PEFT methods from the Finetuning method dropdown menu. All 19 PEFT methods included in PEFT-FACTORY are available to choose. [PITH_FULL_IMAGE:figures/full_fig_p014_2.png] view at source ↗
Figure 3: Configuration options for the Prompt Tuning method. [PITH_FULL_IMAGE:figures/full_fig_p015_3.png] view at source ↗
Figure 4: Classification and PSCP results for prediction after training with Prompt Tuning. [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗
read the original abstract

Parameter-Efficient Fine-Tuning (PEFT) methods address the increasing size of Large Language Models (LLMs). Currently, many newly introduced PEFT methods are challenging to replicate, deploy, or compare with one another. To address this, we introduce PEFT-Factory, a unified framework for efficient fine-tuning LLMs using both off-the-shelf and custom PEFT methods. While its modular design supports extensibility, it natively provides a representative set of 19 PEFT methods, 27 classification and text generation datasets addressing 12 tasks, and both standard and PEFT-specific evaluation metrics. As a result, PEFT-Factory provides a ready-to-use, controlled, and stable environment, improving replicability and benchmarking of PEFT methods. PEFT-Factory is a downstream framework that originates from the popular LLaMA-Factory, and is publicly available at https://github.com/kinit-sk/PEFT-Factory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces PEFT-Factory, a unified framework for parameter-efficient fine-tuning of autoregressive large language models. It natively implements 19 PEFT methods, supports 27 classification and text generation datasets across 12 tasks, and includes both standard and PEFT-specific evaluation metrics. The modular design enables extensibility for custom methods, and the framework is presented as a downstream extension of LLaMA-Factory that is publicly released to improve replicability and benchmarking.

Significance. If the native implementations faithfully reproduce published PEFT performance and the framework is actively maintained, it could provide a valuable standardized environment for the community, reducing the effort required for fair comparisons and replication studies in the fast-moving PEFT literature. The public GitHub release and stated support for both off-the-shelf and custom methods are concrete strengths that directly support the replicability goal.

major comments (2)
  1. [§4] The central claim that PEFT-Factory supplies a 'controlled and stable environment' for benchmarking rests on the correctness of the 19 native implementations, yet the manuscript contains no reproduction experiments, ablation studies, or direct comparisons against original published results for any of the bundled methods. This verification is load-bearing for the replicability assertion.
  2. [§3.2] The description of the modular architecture does not address how the framework ensures that custom or off-the-shelf PEFT methods produce results free of hidden implementation discrepancies when run across the 27 datasets; without such safeguards or tests, the benchmarking utility remains unproven.
minor comments (3)
  1. [Abstract] The abstract lists 'both standard and PEFT-specific evaluation metrics' but does not enumerate or define the PEFT-specific metrics; a short table or paragraph in §4 would improve clarity.
  2. A summary table listing the 19 PEFT methods, their core hyperparameters, and the tasks they support would help readers quickly assess coverage.
  3. [§2] The relationship between PEFT-Factory and the parent LLaMA-Factory codebase is mentioned but not detailed; explicit notes on which components were modified or extended would aid users who are already familiar with LLaMA-Factory.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below and describe the revisions we will make to strengthen the replicability aspects of the manuscript.

read point-by-point responses
  1. Referee: [§4] The central claim that PEFT-Factory supplies a 'controlled and stable environment' for benchmarking rests on the correctness of the 19 native implementations, yet the manuscript contains no reproduction experiments, ablation studies, or direct comparisons against original published results for any of the bundled methods. This verification is load-bearing for the replicability assertion.

    Authors: We agree that direct verification of the native implementations would better support the claim of a controlled environment. The manuscript emphasizes the unified framework and its extensibility rather than exhaustive benchmarking, which we viewed as outside the primary scope. In the revision we will add a dedicated subsection (or appendix) presenting reproduction results for a representative subset of the 19 methods on standard datasets, comparing against numbers reported in the original PEFT papers. This will provide concrete evidence of implementation fidelity. revision: yes

  2. Referee: [§3.2] The description of the modular architecture does not address how the framework ensures that custom or off-the-shelf PEFT methods produce results free of hidden implementation discrepancies when run across the 27 datasets; without such safeguards or tests, the benchmarking utility remains unproven.

    Authors: We acknowledge that §3.2 could more explicitly describe the consistency mechanisms. All methods share the same data loaders, training loop, and evaluation harness; custom methods are required to implement a narrow interface that returns only the adapted parameters or logits. The repository already contains unit tests and example configuration files that exercise this interface across multiple datasets. We will revise the architecture description to highlight these safeguards and note that the evaluation pipeline is deliberately method-agnostic. revision: yes
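
The narrow interface the (simulated) authors describe could look roughly like this; the protocol name and both method signatures are invented for illustration, not taken from the repository:

```python
# Hypothetical sketch of a narrow custom-method contract: the harness
# owns data loading, training, and evaluation; a custom method only
# declares what it attaches and which parameters may be trained.
from typing import Iterable, Protocol
from torch import nn

class CustomPEFT(Protocol):
    def attach(self, model: nn.Module) -> None:
        """Insert adapter modules; must not modify the base weights."""

    def trainable_parameters(self) -> Iterable[nn.Parameter]:
        """Expose only the adapted parameters to the optimizer."""

def prepare(model: nn.Module, method: CustomPEFT) -> None:
    for p in model.parameters():
        p.requires_grad = False   # the harness freezes the whole model...
    method.attach(model)
    for p in method.trainable_parameters():
        p.requires_grad = True    # ...and re-enables only the adapters
```

Keeping the contract this small is what would let the unit tests the rebuttal mentions exercise every method identically.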

Circularity Check

0 steps flagged

No significant circularity in engineering framework paper

full rationale

The manuscript introduces PEFT-Factory as a software framework providing native implementations of 19 PEFT methods and 27 datasets for improved replicability. No mathematical derivations, equations, predictions, or fitted parameters exist in the paper. The central claim rests on the public GitHub release and modular design, which are externally verifiable artifacts independent of any self-referential logic or self-citation chains. This is a standard engineering contribution whose correctness does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the engineering assumption that a single modular codebase can faithfully reproduce 19 distinct PEFT methods without introducing systematic bias; no free parameters, mathematical axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5466 in / 1248 out tokens · 37453 ms · 2026-05-17T02:33:14.545638+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · 15 internal anchors

  3. [3]

    Abubakar Abid, Ali Abdalla, Ali Abid, Dawood Khan, Abdulrahman Alfozan, and James Zou. 2019. Gradio: Hassle-free sharing and testing of ml models in the wild. arXiv preprint arXiv:1906.02569

  4. [4]

    Lightning AI. 2023. Litgpt. https://github.com/Lightning-AI/litgpt

  5. [5]

    Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. https://doi.org/10.18653/v1/N19-1245 MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics:...

  6. [6]

    Yuvanesh Anand, Zach Nussbaum, Brandon Duderstadt, Benjamin Schmidt, and Andriy Mulyar. 2023. Gpt4all: Training an assistant-style chatbot with large scale data distillation from gpt-3.5-turbo. https://github.com/nomic-ai/gpt4all

  7. [7]

    Akari Asai, Mohammadreza Salehi, Matthew Peters, and Hannaneh Hajishirzi. 2022. https://doi.org/10.18653/v1/2022.emnlp-main.446 ATTEMPT: Parameter-efficient multi-task tuning via attentional mixtures of soft prompts. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6655--6672, Abu Dhabi, United Arab Emirat...

  8. [8]

    Axolotl maintainers and contributors. 2023. https://github.com/axolotl-ai-cloud/axolotl Axolotl: Open source LLM post-training

  9. [9]

    Roy Bar Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. 2006. The second PASCAL recognising textual entailment challenge

  10. [10]

    Robert Belanec, Branislav Pecher, Ivan Srba, and Maria Bielikova. 2025. https://arxiv.org/abs/2511.21285 Peft-bench: A parameter-efficient fine-tuning methods benchmark . arXiv preprint

  11. [11]

    Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. 2022. https://doi.org/10.18653/v1/2022.acl-short.1 BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1--9, Dublin, Ireland. Association...

  12. [12]

    Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo Giampiccolo, and Bernardo Magnini. 2009. The fifth PASCAL recognizing textual entailment challenge

  13. [13]

    Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, and 1 others. 2020. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432--7439

  14. [14]

    Arthur Cayley. 1846. Sur quelques propriétés des déterminants gauches. Journal für die reine und angewandte Mathematik

  15. [15]

    Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. https://doi.org/10.18653/v1/S17-2001 SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1--14, Vancouver, Canada. ACL

  16. [16]

    Sahil Chaudhary. 2023. Code alpaca: An instruction-following llama model for code generation. https://github.com/sahil280114/codealpaca

  17. [17]

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. https://doi.org/10.18653/v1/N19-1300 BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language...

  18. [18]

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, and 1 others. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168

  19. [19]

    Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. https://doi.org/10.1007/11736790_9 The PASCAL recognising textual entailment challenge. In Proceedings of the First International Conference on Machine Learning Challenges: Evaluating Predictive Uncertainty Visual Object Classification, and Recognizing Textual Entailment, MLCW'05, page 177–190, Berlin...

  20. [20]

    Marie-Catherine De Marneffe, Mandy Simons, and Judith Tonhauser. 2019. The commitmentbank: Investigating projection in naturally occurring discourse. In proceedings of Sinn und Bedeutung, volume 23, pages 107--124

  21. [21]

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. Qlora: efficient finetuning of quantized llms. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Red Hook, NY, USA. Curran Associates Inc

  22. [22]

    Shizhe Diao, Rui Pan, Hanze Dong, Ka Shun Shum, Jipeng Zhang, Wei Xiong, and Tong Zhang. 2023. Lmflow: An extensible toolkit for finetuning and inference of large foundation models. arXiv preprint arXiv:2306.12420

  23. [23]

    Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min Chan, Weize Chen, and 1 others. 2023. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nature Machine Intelligence, 5(3):220--235

  24. [24]

    William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the International Workshop on Paraphrasing

  25. [25]

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, and 1 others. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783

  26. [26]

    Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, Hongsheng Li, and Yu Qiao. 2023. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010

  27. [27]

    Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, pages 1--9. Association for Computational Linguistics

  28. [28]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, and 1 others. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948

  29. [29]

    Zeyu Han, Chao Gao, Jinyang Liu, Jeff Zhang, and Sai Qian Zhang. 2024. https://openreview.net/forum?id=lIsCS8b6zj Parameter-efficient fine-tuning for large models: A comprehensive survey. Transactions on Machine Learning Research

  30. [30]

    Soufiane Hayou, Nikhil Ghosh, and Bin Yu. 2024. Lora+: efficient low rank adaptation of large models. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org

  31. [31]

    Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. 2022. https://openreview.net/forum?id=0RDcd5Axok Towards a unified view of parameter-efficient transfer learning. In International Conference on Learning Representations

  32. [32]

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. https://openreview.net/forum?id=d7KBjmI3GmQ Measuring massive multitask language understanding. In International Conference on Learning Representations

  33. [33]

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for nlp. In International conference on machine learning, pages 2790--2799. PMLR

  34. [34]

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, and 1 others. 2022. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3

  35. [35]

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361

  36. [36]

    Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), page...

  37. [37]

    Tushar Khot, Ashish Sabharwal, and Peter Clark. 2019. https://doi.org/10.18653/v1/D19-1281 What's missing: A knowledge gap guided approach for multi-hop question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), p...

  38. [38]

    Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. https://doi.org/10.18653/v1/2021.emnlp-main.243 The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045--3059, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics

  39. [39]

    Hector J Levesque, Ernest Davis, and Leora Morgenstern. 2011. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, volume 46, page 47

  40. [40]

    Shenggui Li, Hongxin Liu, Zhengda Bian, Jiarui Fang, Haichen Huang, Yuliang Liu, Boxiang Wang, and Yang You. 2023. https://doi.org/10.1145/3605573.3605613 Colossal-AI: A unified deep learning system for large-scale parallel training. In Proceedings of the 52nd International Conference on Parallel Processing, ICPP '23, page 766–775, New York, NY, USA. Ass...

  41. [41]

    Xiang Lisa Li and Percy Liang. 2021. https://doi.org/10.18653/v1/2021.acl-long.353 Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582--4597, Onl...

  42. [42]

    Vladislav Lialin, Vijeta Deshpande, and Anna Rumshisky. 2023. Scaling down to scale up: A guide to parameter-efficient fine-tuning. arXiv preprint arXiv:2303.15647

  43. [43]

    Chin-Yew Lin. 2004. https://aclanthology.org/W04-1013/ ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74--81, Barcelona, Spain. Association for Computational Linguistics

  44. [44]

    Vijay Lingam, Atula Tejaswi Neerkaje, Aditya Vavre, Aneesh Shetty, Gautham Krishna Gudur, Joydeep Ghosh, Eunsol Choi, Alex Dimakis, Aleksandar Bojchevski, and Sujay Sanghavi. 2024. https://openreview.net/forum?id=DOUskwCqg5 SVFT: Parameter-efficient fine-tuning with singular vectors. In 2nd Workshop on Advancing Neural Network Training: Computational Ef...

  45. [45]

    Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin A Raffel. 2022a. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35:1950--1965

  46. [46]

    Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. 2024. Dora: weight-decomposed low-rank adaptation. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org

  47. [47]

    Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang. 2022b. https://doi.org/10.18653/v1/2022.acl-short.8 P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 61--68, Dub...

  48. [48]

    Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2023. Gpt understands, too. AI Open

  49. [49]

    Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. 2022. Peft: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft

  50. [50]

    Fanxu Meng, Zhaohui Wang, and Muhan Zhang. 2024. Pissa: principal singular values and singular vectors adaptation of large language models. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS '24, Red Hook, NY, USA. Curran Associates Inc

  51. [51]

    Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. 2024. Large language models: A survey. arXiv preprint arXiv:2402.06196

  52. [52]

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. https://doi.org/10.3115/1073083.1073135 Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311--318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics

  53. [53]

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, and 2 others. 2019. Pytorch: an imperative style, high-performance deep l...

  54. [54]

    Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. https://doi.org/10.18653/v1/2021.naacl-main.168 Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080--2094, Online. Association for ...

  55. [55]

    Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.617 MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7654--7673, Online. Association for Comput...

  56. [56]

    Mohammad Taher Pilehvar and Jose Camacho-Collados. 2019. https://doi.org/10.18653/v1/N19-1128 WiC: the word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and S...

  57. [57]

    Clifton Poth, Hannah Sterz, Indraneil Paul, Sukannya Purkayastha, Leon Engländer, Timo Imhof, Ivan Vulić, Sebastian Ruder, Iryna Gurevych, and Jonas Pfeiffer. 2023. https://aclanthology.org/2023.emnlp-demo.13 Adapters: A unified library for parameter-efficient and modular transfer learning. In Proceedings of the 2023 Conference on Empirical Metho...

  58. [58]

    Zeju Qiu, Weiyang Liu, Haiwen Feng, Yuxuan Xue, Yao Feng, Zhen Liu, Dan Zhang, Adrian Weller, and Bernhard Schölkopf. 2023. Controlling text-to-image diffusion by orthogonal finetuning. Advances in Neural Information Processing Systems, 36:79320--79362

  59. [59]

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, and 1 others. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9

  60. [60]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1--67

  61. [61]

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP, pages 2383--2392. Association for Computational Linguistics

  62. [62]

    Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI spring symposium: logical formalizations of commonsense reasoning, pages 90--95

  63. [63]

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99--106

  64. [64]

    Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. 2019. https://doi.org/10.18653/v1/D19-1454 Social IQa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP),...

  65. [65]

    Zhengxiang Shi and Aldo Lipani. 2024. https://openreview.net/forum?id=KjegfPGRde DePT: Decomposed prompt tuning for parameter-efficient fine-tuning. In The Twelfth International Conference on Learning Representations

  66. [66]

    Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on EMNLP, pages 1631--1642

  67. [67]

    Pengwei Tang, Xiaolin Hu, and Yong Liu. 2025. https://openreview.net/forum?id=fswihJIYbd ADePT: Adaptive decomposed prompt tuning for parameter-efficient fine-tuning. In The Thirteenth International Conference on Learning Representations

  68. [68]

    A Vaswani. 2017. Attention is all you need. Advances in Neural Information Processing Systems

  69. [69]

    Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. 2020. Trl: Transformer reinforcement learning. https://github.com/huggingface/trl

  70. [70]

    Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32

  71. [71]

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461

  72. [72]

    Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A. Smith, Iz Beltagy, and Hannaneh Hajishirzi. 2023a. https://arxiv.org/abs/2306.04751 How far can camels go? exploring the state of instruction tuning on open resources. Preprint, arXiv:2306.04751

  73. [73]

    Zhen Wang, Rameswar Panda, Leonid Karlinsky, Rogerio Feris, Huan Sun, and Yoon Kim. 2023b. https://openreview.net/forum?id=Nk2pDtuhTq Multitask prompt tuning enables parameter-efficient transfer learning. In The Eleventh International Conference on Learning Representations

  74. [74]

    Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2019. https://doi.org/10.1162/tacl_a_00290 Neural network acceptability judgments. Transactions of the ACL, 7:625--641

  75. [75]

    Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. https://doi.org/10.18653/v1/N18-1101 A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the ACL: Human Language Technologies, Volume 1 (Long Papers), pages 1112--1122, New Orleans, Louisiana. ACL

  76. [76]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, and 3 others. 2020. https://www.aclweb.org/anthology/2020.emnlp-demos.6 Transformers...

  77. [77]

    Lingling Xu, Haoran Xie, Si-Zhao Joe Qin, Xiaohui Tao, and Fu Lee Wang. 2023. Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment. arXiv preprint arXiv:2312.12148

  78. [78]

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxin Yang, Jingren Zhou, Junyang Lin, and 25 others. 2024. https://api.semanticscholar.org/CorpusID:274859421 Qwen2.5 technical report. ArXiv, abs/2412.15115

  79. [79]

    Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, and Graham Neubig. 2018. Learning to mine aligned code and natural language pairs from stack overflow. In Proceedings of the 15th international conference on mining software repositories, pages 476--486

  80. [80]

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. https://doi.org/10.18653/v1/P19-1472 HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791--4800, Florence, Italy. Association for Computational Linguistics

Showing first 80 references.