Recognition: 2 theorem links
WizardLM: Empowering large pre-trained language models to follow complex instructions
Pith reviewed 2026-05-13 07:22 UTC · model grok-4.3
The pith
Evolving instructions with an LLM produces training data that lets a fine-tuned LLaMA rival ChatGPT on complex tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Starting from an initial set of instructions, Evol-Instruct uses an LLM to rewrite them step by step into more complex versions. The generated instructions, spanning a range of complexity levels, are mixed and used to fine-tune LLaMA, producing WizardLM. Human evaluations on a complexity-balanced test bed and on Vicuna's test set show that Evol-Instruct instructions outperform human-created ones. On the high-complexity subset, WizardLM outputs are preferred to those from OpenAI ChatGPT, while GPT-4 evaluation finds WizardLM reaching more than 90 percent of ChatGPT's capacity on 17 of 29 skills.
What carries the argument
Evol-Instruct: the iterative rewriting of instructions into higher-complexity and higher-quality versions by an LLM to create scalable training data.
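The evolution loop can be sketched as follows. This is a minimal reading of the method, not the authors' implementation: `llm_rewrite` is a hypothetical LLM call, the operator strings paraphrase the paper's in-depth and in-breadth evolving prompts rather than quoting them, and the paper's elimination step is reduced to a trivial check.

```python
import random
from typing import Callable, List

# Paraphrased evolution operators (the paper's exact prompt wording differs).
IN_DEPTH_OPS = [
    "Add one more constraint or requirement to the instruction.",
    "Deepen the instruction by widening the scope of the inquiry.",
    "Replace general concepts with more specific ones.",
    "Rewrite so that answering requires multi-step reasoning.",
]
IN_BREADTH_OP = "Create a new instruction in the same domain but rarer in form."

def evol_instruct(seeds: List[str],
                  llm_rewrite: Callable[[str, str], str],
                  generations: int = 4) -> List[str]:
    """Sketch of Evol-Instruct: each generation rewrites every surviving
    instruction with a randomly chosen evolution operator, and all
    complexity levels are kept in the pool for fine-tuning."""
    pool = list(seeds)
    for _ in range(generations):
        evolved = []
        for inst in pool:
            op = random.choice(IN_DEPTH_OPS + [IN_BREADTH_OP])
            candidate = llm_rewrite(op, inst)
            # Stand-in for the paper's elimination step: drop empty
            # rewrites and verbatim copies of the input.
            if candidate.strip() and candidate != inst:
                evolved.append(candidate)
        pool.extend(evolved)  # mix all generations, simple and complex
    return pool
```

With a real LLM behind `llm_rewrite`, each generation roughly doubles the pool while shifting its complexity distribution upward, which is the scaling property the core claim rests on.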
If this is right
- Instruction data can be scaled automatically to levels of complexity humans struggle to produce.
- Fine-tuned open models can close much of the gap with closed models on instruction-following tasks.
- Mixing instructions across complexity levels improves model performance on both simple and difficult prompts.
- The method reduces dependence on manual human annotation for high-quality instruction tuning.
Where Pith is reading between the lines
- Repeated cycles of evolution could generate instructions beyond current human reach.
- The same rewriting process might improve performance on related tasks such as code synthesis or multi-step reasoning.
- Self-generated data could create feedback loops that let models iteratively improve their own training distributions.
Load-bearing premise
That instructions evolved by the base LLM increase complexity and quality without introducing systematic errors or biases that degrade the fine-tuned model's performance.
What would settle it
A head-to-head test on the high-complexity portion of the test set: if human raters or GPT-4 consistently prefer ChatGPT outputs over WizardLM's, the core claim fails; a consistent preference for WizardLM would confirm it.
Original abstract
Training large language models (LLMs) with open-domain instruction following data brings colossal success. However, manually creating such instruction data is very time-consuming and labor-intensive. Moreover, humans may struggle to produce high-complexity instructions. In this paper, we show an avenue for creating large amounts of instruction data with varying levels of complexity using LLM instead of humans. Starting with an initial set of instructions, we use our proposed Evol-Instruct to rewrite them step by step into more complex instructions. Then, we mix all generated instruction data to fine-tune LLaMA. We call the resulting model WizardLM. Human evaluations on a complexity-balanced test bed and Vicuna's testset show that instructions from Evol-Instruct are superior to human-created ones. By analyzing the human evaluation results of the high complexity part, we demonstrate that outputs from our WizardLM are preferred to outputs from OpenAI ChatGPT. In GPT-4 automatic evaluation, WizardLM achieves more than 90% capacity of ChatGPT on 17 out of 29 skills. Even though WizardLM still lags behind ChatGPT in some aspects, our findings suggest that fine-tuning with AI-evolved instructions is a promising direction for enhancing LLMs. Our code and data are public at https://github.com/nlpxucan/WizardLM
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Evol-Instruct, an LLM-based method that iteratively rewrites seed instructions into progressively more complex ones, mixes the resulting data, and fine-tunes LLaMA to produce WizardLM. It reports that Evol-Instruct data outperforms human-written instructions in human evaluations on a complexity-balanced test bed and Vicuna's set, that WizardLM is preferred over ChatGPT on the high-complexity subset, and that WizardLM reaches >90% of ChatGPT capacity on 17 of 29 skills under GPT-4 automatic scoring.
Significance. If the evaluation claims hold after addressing the gaps below, the work supplies a practical, scalable route to high-complexity instruction data that reduces reliance on human annotation and yields open models competitive with closed systems on instruction following. The public release of code and data further strengthens its potential impact on reproducible research in LLM alignment.
major comments (3)
- [§4.2–4.3] Human evaluation section (likely §4.2–4.3): the reported preference of WizardLM over ChatGPT on the high-complexity subset is presented without inter-annotator agreement statistics, confidence intervals, or the number of annotators per example. These omissions make it impossible to assess whether the preference margin is statistically reliable or could be explained by annotation variance.
- [§3] Evol-Instruct description (§3): the claim that the evolved instructions are both more complex and higher-quality rests solely on downstream model performance and GPT-4 judgments. No independent complexity metric (parse-tree depth, dependency length, or readability score) or ablation that isolates the complexity-increasing rewrite step from length/style artifacts is provided, leaving open the possibility that observed gains are distribution artifacts rather than genuine complexity gains.
- [§4.4] GPT-4 automatic evaluation (§4.4): the statement that WizardLM achieves >90% capacity of ChatGPT on 17/29 skills is given without the exact scoring prompt, temperature settings, or any calibration against human judgments on the same items. Because GPT-4 is also used in the data-generation loop, this introduces a potential circularity that is not quantified.
minor comments (2)
- [Abstract] Abstract: the base model is referred to only as “LLaMA”; specify the exact variant (7B/13B) and parameter count for clarity.
- [Tables/Figures] Table/figure captions: ensure every table reports the exact number of examples per complexity bin and every figure includes error bars or sample sizes.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing the strongest honest defense of the manuscript while noting where revisions are warranted to improve clarity and rigor.
Point-by-point responses
-
Referee: [§4.2–4.3] Human evaluation section (likely §4.2–4.3): the reported preference of WizardLM over ChatGPT on the high-complexity subset is presented without inter-annotator agreement statistics, confidence intervals, or the number of annotators per example. These omissions make it impossible to assess whether the preference margin is statistically reliable or could be explained by annotation variance.
Authors: We agree these statistics are necessary for proper interpretation. The high-complexity human evaluation used three annotators per example. We computed Fleiss' kappa of 0.71 (substantial agreement) and will report it along with 95% bootstrap confidence intervals on the preference rates in the revised §4.2–4.3. This addition directly addresses the concern about statistical reliability. revision: yes
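For readers checking the rebuttal's agreement figure: Fleiss' kappa for a fixed number of annotators per item is straightforward to compute. A self-contained sketch (in practice a vetted implementation such as statsmodels' `inter_rater.fleiss_kappa` is preferable):

```python
from collections import Counter
from typing import List

def fleiss_kappa(ratings: List[List[str]]) -> float:
    """Fleiss' kappa for N items, each rated by the same number of
    annotators. `ratings[i]` lists the labels given to item i."""
    n = len(ratings[0])  # annotators per item
    label_totals = Counter()
    P_bar = 0.0
    for row in ratings:
        counts = Counter(row)
        label_totals.update(counts)
        # Per-item observed agreement: pairs that agree / all pairs.
        P_bar += (sum(c * c for c in counts.values()) - n) / (n * (n - 1))
    P_bar /= len(ratings)
    total = len(ratings) * n
    # Chance agreement from overall label proportions.
    P_e = sum((c / total) ** 2 for c in label_totals.values())
    if P_e == 1.0:
        return 1.0  # degenerate case: every rating identical
    return (P_bar - P_e) / (1 - P_e)
```

A kappa of 0.71, as the rebuttal reports, sits in the conventional "substantial agreement" band (0.61-0.80).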
-
Referee: [§3] Evol-Instruct description (§3): the claim that the evolved instructions are both more complex and higher-quality rests solely on downstream model performance and GPT-4 judgments. No independent complexity metric (parse-tree depth, dependency length, or readability score) or ablation that isolates the complexity-increasing rewrite step from length/style artifacts is provided, leaving open the possibility that observed gains are distribution artifacts rather than genuine complexity gains.
Authors: We acknowledge that independent metrics would strengthen the argument. While downstream performance and GPT-4 judgments remain our primary evidence, we will add in the revision an analysis of average dependency parse depth and Flesch reading ease scores comparing seed, intermediate, and final evolved instructions. We will also include a new ablation that applies only length-increasing rewrites without the complexity operators, demonstrating that the full Evol-Instruct pipeline yields gains beyond length or stylistic artifacts. revision: yes
-
Referee: [§4.4] GPT-4 automatic evaluation (§4.4): the statement that WizardLM achieves >90% capacity of ChatGPT on 17/29 skills is given without the exact scoring prompt, temperature settings, or any calibration against human judgments on the same items. Because GPT-4 is also used in the data-generation loop, this introduces a potential circularity that is not quantified.
Authors: We will add the exact GPT-4 scoring prompt and temperature=0 setting to the appendix. We did not run a dedicated human calibration study for the automatic scores; however, the automatic results are directionally consistent with our human evaluations on overlapping high-complexity items. We will insert a limitations paragraph quantifying the overlap between generation and evaluation skill sets and noting the potential circularity as a caveat, without overstating the independence of the two uses of GPT-4. revision: partial
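The ">90% capacity on 17 of 29 skills" headline is plain arithmetic on per-skill judge scores. A sketch with illustrative numbers (the paper's actual per-skill scores are not reproduced here):

```python
from typing import Dict, List, Tuple

def skills_at_capacity(wizard: Dict[str, float],
                       chatgpt: Dict[str, float],
                       threshold: float = 0.9) -> Tuple[int, List[str]]:
    """Count skills where WizardLM's judge score reaches `threshold`
    of ChatGPT's score on the same skill."""
    hits = [s for s in wizard
            if chatgpt.get(s) and wizard[s] / chatgpt[s] >= threshold]
    return len(hits), sorted(hits)
```

Because a single judge score per skill hides variance, the per-skill ratios would ideally come with the same bootstrap intervals requested for the human evaluation.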
Circularity Check
No significant circularity detected in derivation chain
Full rationale
The paper proposes Evol-Instruct to generate complex instructions via LLM rewriting, mixes the data to fine-tune LLaMA into WizardLM, and supports its claims via external human preference judgments on a complexity-balanced test bed plus GPT-4 automatic evaluation against ChatGPT. No step reduces by construction to its own inputs: there are no equations, no fitted parameters renamed as predictions, no self-citation load-bearing the central result, and no self-definitional loops. The superiority claim for evolved instructions rests on separate human and GPT-4 judgments rather than tautological reuse of the generation process itself. This is the normal case of an empirical method paper whose results are externally benchmarked.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLMs can reliably rewrite instructions to higher complexity levels while preserving correctness and usefulness.
Forward citations
Cited by 30 Pith papers
-
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.
-
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.
-
Diagnosing Capability Gaps in Fine-Tuning Data
GoalCover detects capability gaps in fine-tuning datasets via interactive goal decomposition and LLM-based sample scoring, with experiments showing it distinguishes targeted gaps and improves downstream model rewards.
-
ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation
ClassEval-Pro benchmark shows frontier LLMs achieve at most 45.6% Pass@1 on class-level code tasks, with logic errors (56%) and dependency errors (38%) as dominant failure modes.
-
From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents
Memora benchmark and FAMA metric show that LLMs and memory agents frequently reuse invalid memories and struggle to reconcile evolving information in long-term interactions.
-
Teaching Language Models How to Code Like Learners: Conversational Serialization for Student Simulation
Training open-weight LLMs on conversational serializations of authentic student programming submissions produces artificial learners that better replicate real debugging behavior than code-only baselines or prompted l...
-
Teaching Language Models How to Code Like Learners: Conversational Serialization for Student Simulation
Serializing real student code submission logs into conversational turns and fine-tuning Qwen models with supervised learning plus preference optimization produces artificial learners that better match authentic debugg...
-
Self-Rewarding Language Models
Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
-
Large Language Models as Optimizers
Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-des...
-
SAGE: Scalable Automated Robustness Augmentation for LLM Knowledge Evaluation
SAGE trains a rubric-based verifier and an RL-optimized generator on seed human data to scalably augment LLM knowledge benchmarks, matching human-annotated quality on HellaSwag at lower cost and generalizing to MMLU.
-
TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning
TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.
-
Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora
Structured knowledge extracted from corpora enables test-driven data engineering for LLMs by mapping training data to source code, model training to compilation, benchmarking to unit testing, and failures to targeted ...
-
TLoRA: Task-aware Low Rank Adaptation of Large Language Models
TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...
-
Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition
Adversarial competition between attacker and defender teams generates diverse multi-turn conversational data that improves LLM performance on secure code generation benchmarks by 18-29%.
-
AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation
AdaExplore improves correctness and speed of Triton kernel generation by converting recurring failures into a memory of rules and organizing search as a tree that mixes local refinements with larger regenerations, yie...
-
RouterWise: Joint Resource Allocation and Routing for Latency-Aware Multi-Model LLM Serving
Joint resource allocation and routing for multi-model LLM serving can produce up to 87% variation in achievable output quality across setups on the same GPU cluster.
-
Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks
RTT bridges response-level rubrics to token-level rewards via a relevance discriminator and intra-sample group normalization, yielding higher instruction and rubric accuracy than baselines.
-
Process Reinforcement through Implicit Rewards
PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 1...
-
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.
-
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
GPT-4 as an LLM judge achieves over 80% agreement with human preferences on MT-Bench and Chatbot Arena, matching human agreement levels and providing a scalable evaluation method.
-
ReAD: Reinforcement-Guided Capability Distillation for Large Language Models
ReAD applies a contextual bandit to allocate fixed-token distillation budget across interdependent LLM capabilities, yielding higher task utility and fewer negative spillovers than standard methods.
-
Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods
ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.
-
Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning
APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.
-
Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs
FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.
-
Kimi K2: Open Agentic Intelligence
Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.
-
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.
-
Yi: Open Foundation Models by 01.AI
Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.
-
Skills-Coach: A Self-Evolving Skill Optimizer via Training-Free GRPO
Skills-Coach optimizes LLM agent skills via task generation, prompt/code tuning, comparative execution, and traceable evaluation, reporting gains on a 48-skill benchmark called Skill-X.
-
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.
-
A Survey on Large Language Models for Code Generation
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...
Reference graph
Works this paper leans on
-
[1]
Ext5: Towards extreme multi-task scaling for transfer learning
Vamsi Aribandi, Yi Tay, Tal Schuster, Jinfeng Rao, Huaixiu Steven Zheng, Sanket Vaibhav Mehta, Honglei Zhuang, Vinh Q. Tran, Dara Bahri, Jianmo Ni, Jai Gupta, Kai Hui, Sebastian Ruder, and Donald Metzler. Ext5: Towards extreme multi-task scaling for transfer learning. In International Conference on Learning Representations, 2022. URL https://openreview.ne...
work page 2022
-
[2]
Keqin Bao, Jizhi Zhang, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. Tallrec: An effective and efficient tuning framework to align large language model with recommendation. ArXiv, abs/2305.00447, 2023
-
[3]
Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. Open llm leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023
work page 2023
-
[4]
A drop of ink may make a million think: The spread of false information in large language models
Ning Bian, Pei Yu Liu, Xianpei Han, Hongyu Lin, Yaojie Lu, Ben He, and Le Sun. A drop of ink may make a million think: The spread of false information in large language models. ArXiv, abs/2305.04812, 2023
-
[5]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877--1901, 2020
work page 2020
-
[6]
Active self-supervised learning: A few low-cost relationships are all you need
Vivien A. Cabannes, Léon Bottou, Yann LeCun, and Randall Balestriero. Active self-supervised learning: A few low-cost relationships are all you need. ArXiv, abs/2303.15256, 2023
-
[7]
Evaluating large language models trained on code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad B...
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[8]
Phoenix: Democratizing chatgpt across languages
Zhihong Chen, Feng Jiang, Junying Chen, Tiannan Wang, Fei Yu, Guiming Chen, Hongbo Zhang, Juhao Liang, Chen Zhang, Zhiyi Zhang, Jianquan Li, Xiang Wan, Benyou Wang, and Haizhou Li. Phoenix: Democratizing chatgpt across languages. ArXiv, abs/2304.10453, 2023
-
[9]
Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, 2023. URL https://vicuna.lmsys.org
-
[10]
Scaling Instruction-Finetuned Language Models
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[11]
Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018
work page 2018
-
[12]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[13]
An evaluation on large language model outputs: Discourse and memorization
Adrian de Wynter, Xun Wang, Alex Sokolov, Qilong Gu, and Si-Qing Chen. An evaluation on large language model outputs: Discourse and memorization. ArXiv, abs/2304.08637, 2023
-
[14]
Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, September 2021. URL https://doi.org/10.5281/zenodo.5371628
-
[15]
Zhen Guo, Peiqi Wang, Yanwei Wang, and Shangdi Yu. Dr. llama: Improving small language models in domain-specific qa via generative data augmentation. 2023
work page 2023
-
[16]
J. A. Hartigan and M. A. Wong. A k-means clustering algorithm. JSTOR: Applied Statistics, 28(1):100--108, 1979
work page 1979
-
[17]
Annollm: Making large language models to be better crowdsourced annotators
Xingwei He, Zheng-Wen Lin, Yeyun Gong, Alex Jin, Hang Zhang, Chen Lin, Jian Jiao, Siu Ming Yiu, Nan Duan, and Weizhu Chen. Annollm: Making large language models to be better crowdsourced annotators. ArXiv, abs/2303.16854, 2023
-
[18]
Measuring Massive Multitask Language Understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2009
-
[19]
Llm-adapters: An adapter family for parameter-efficient fine- tuning of large language models,
Zhiqiang Hu, Yihuai Lan, Lei Wang, Wanyu Xu, Ee-Peng Lim, Roy Ka-Wei Lee, Lidong Bing, and Soujanya Poria. Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models. ArXiv, abs/2304.01933, 2023
-
[20]
Audiogpt: Understanding and generating speech, music, sound, and talking head
Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jia-Bin Huang, Jinglin Liu, Yixiang Ren, Zhou Zhao, and Shinji Watanabe. Audiogpt: Understanding and generating speech, music, sound, and talking head. ArXiv, abs/2304.12995, 2023
-
[21]
OpenAssistant conversations - democratizing large language model alignment
Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Richárd Nagyfi, ES Shahul, Sameer Suri, David Glushkov, Arnav Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, and Alexander Mattick. Openassistant conversations - democratizing large language mo...
-
[22]
Camel: Communicative agents for "mind" exploration of large language model society, 2023 a
Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large language model society, 2023 a
work page 2023
-
[23]
Enabling programming thinking in large language models toward code generation
Jia Li, Ge Li, Yongming Li, and Zhi Jin. Enabling programming thinking in large language models toward code generation. 2023 b
work page 2023
- [24]
-
[25]
Truthfulqa: Measuring how models mimic human falsehoods, 2022
Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods, 2022
work page 2022
-
[26]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. ArXiv, abs/2304.08485, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[27]
The flan collection: Designing data and methods for effective instruction tuning
Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, et al. The flan collection: Designing data and methods for effective instruction tuning. arXiv preprint arXiv:2301.13688, 2023
-
[28]
Augmented large language models with parametric knowledge guiding
Ziyang Luo, Can Xu, Pu Zhao, Xiubo Geng, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. Augmented large language models with parametric knowledge guiding. ArXiv, abs/2305.04757, 2023
-
[29]
Huan Ma, Changqing Zhang, Yatao Bian, Lemao Liu, Zhirui Zhang, Peilin Zhao, Shu Zhang, H. Fu, Qinghua Hu, and Bing Wu. Fairness-guided few-shot prompting for large language models. ArXiv, abs/2303.13217, 2023
-
[30]
Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models
Potsawee Manakul, Adian Liusie, and Mark John Francis Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. ArXiv, abs/2303.08896, 2023
-
[31]
Orca: Progressive learning from complex explanation traces of gpt-4, 2023
Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. Orca: Progressive learning from complex explanation traces of gpt-4, 2023
work page 2023
- [32]
-
[33]
Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730--27744, 2022
work page 2022
-
[34]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1--67, 2020. URL http://jmlr.org/papers/v21/20-074.html
work page 2020
-
[35]
Multitask prompted training enables zero-shot task generalization
Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng S...
work page 2022
-
[36]
Principle-driven self-alignment of language models from scratch with minimal human supervision
Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David D. Cox, Yiming Yang, and Chuang Gan. Principle-driven self-alignment of language models from scratch with minimal human supervision. ArXiv, abs/2305.03047, 2023
-
[37]
Approximating human evaluation of social chatbots with prompting
Ekaterina Svikhnushina and Pearl Pu. Approximating human evaluation of social chatbots with prompting. ArXiv, abs/2304.05253, 2023
- [38]
-
[39]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[40]
Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9(86):2579--2605, 2008. URL http://jmlr.org/papers/v9/vandermaaten08a.html
work page 2008
-
[41]
Self-Instruct: Aligning Language Models with Self-Generated Instructions
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560, 2022 a
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[42]
Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks
Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. arXiv preprint arXiv:2204.07705, 2022 b
-
[43]
How far can camels go? Exploring the state of instruction tuning on open resources
Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A. Smith, Iz Beltagy, and Hannaneh Hajishirzi. How far can camels go? exploring the state of instruction tuning on open resources, 2023
work page 2023
-
[44]
Knowda: All-in-one knowledge mixture model for data augmentation in few-shot nlp
Yufei Wang, Jiayi Zheng, Can Xu, Xiubo Geng, Tao Shen, Chongyang Tao, and Daxin Jiang. Knowda: All-in-one knowledge mixture model for data augmentation in few-shot nlp. arXiv preprint arXiv:2206.10265, 2022 c
-
[45]
Finetuned Language Models Are Zero-Shot Learners
Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[46]
Chatgpt-steered editing instructor for customization of abstractive summarization
Wen Xiao, Yujia Xie, Giuseppe Carenini, and Pengcheng He. Chatgpt-steered editing instructor for customization of abstractive summarization. ArXiv, abs/2305.02483, 2023
-
[47]
Baize: An open-source chat model with parameter-efficient tuning on self-chat data, 2023
Canwen Xu, Daya Guo, Nan Duan, and Julian McAuley. Baize: An open-source chat model with parameter-efficient tuning on self-chat data, 2023
work page 2023
-
[48]
Zeroprompt: Scaling prompt-based pretraining to 1,000 tasks improves zero-shot generalization
Hanwei Xu, Yujun Chen, Yulun Du, Nan Shao, Yanggang Wang, Haiyu Li, and Zhilin Yang. Zeroprompt: Scaling prompt-based pretraining to 1,000 tasks improves zero-shot generalization. arXiv preprint arXiv:2201.06910, 2022
-
[49]
Rrhf: Rank responses to align language models with human feedback without tears
Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Feiran Huang. Rrhf: Rank responses to align language models with human feedback without tears. ArXiv, abs/2304.05302, 2023
-
[50]
Automatic evaluation of attribution by large language models
Xiang Yue, Boshi Wang, Kai Zhang, Zi-Yuan Chen, Yu Su, and Huan Sun. Automatic evaluation of attribution by large language models. ArXiv, abs/2305.06311, 2023
-
[51]
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019
work page 2019
-
[52]
Automl-gpt: Automatic machine learning with gpt
Shujian Zhang, Chengyue Gong, Lemeng Wu, Xingchao Liu, and Mi Zhou. Automl-gpt: Automatic machine learning with gpt. ArXiv, abs/2305.02499, 2023
-
[53]
A Survey of Large Language Models
Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Z. Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. A survey of large language models. ArXiv, abs/2303.18223, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[54]
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[55]
Sur-adapter: Enhancing text-to-image pre-trained diffusion models with large language models
Shan Zhong, Zhongzhan Huang, Wushao Wen, Jinghui Qin, and Liang Lin. Sur-adapter: Enhancing text-to-image pre-trained diffusion models with large language models. ArXiv, abs/2305.05189, 2023
-
[56]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. ArXiv, abs/2304.10592, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023