Recognition: 2 theorem links
WizardLM: Empowering large pre-trained language models to follow complex instructions
Pith reviewed 2026-05-13 07:22 UTC · model grok-4.3
The pith
Evolving instructions with an LLM produces training data that lets a fine-tuned LLaMA rival ChatGPT on complex tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Starting from an initial set of instructions, Evol-Instruct uses an LLM to rewrite them step by step into more complex versions. The generated instructions, spanning a range of complexity levels, are mixed and used to fine-tune LLaMA, producing WizardLM. Human evaluations on a complexity-balanced test bed and on Vicuna's test set show that Evol-Instruct instructions outperform human-created ones. On the high-complexity subset, WizardLM outputs are preferred to those from OpenAI ChatGPT, while GPT-4 evaluation finds WizardLM reaching more than 90 percent of ChatGPT's capacity on 17 of 29 skills.
What carries the argument
Evol-Instruct: the iterative rewriting of instructions into higher-complexity and higher-quality versions by an LLM to create scalable training data.
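The evolution loop can be sketched as follows. This is a minimal reading of the method, not the authors' implementation: `llm_rewrite` is a hypothetical LLM call, the operator strings paraphrase the paper's in-depth and in-breadth evolving prompts rather than quoting them, and the paper's elimination step is reduced to a trivial check.

```python
import random
from typing import Callable, List

# Paraphrased evolution operators (the paper's exact prompt wording differs).
IN_DEPTH_OPS = [
    "Add one more constraint or requirement to the instruction.",
    "Deepen the instruction by widening the scope of the inquiry.",
    "Replace general concepts with more specific ones.",
    "Rewrite so that answering requires multi-step reasoning.",
]
IN_BREADTH_OP = "Create a new instruction in the same domain but rarer in form."

def evol_instruct(seeds: List[str],
                  llm_rewrite: Callable[[str, str], str],
                  generations: int = 4) -> List[str]:
    """Sketch of Evol-Instruct: each generation rewrites every surviving
    instruction with a randomly chosen evolution operator, and all
    complexity levels are kept in the pool for fine-tuning."""
    pool = list(seeds)
    for _ in range(generations):
        evolved = []
        for inst in pool:
            op = random.choice(IN_DEPTH_OPS + [IN_BREADTH_OP])
            candidate = llm_rewrite(op, inst)
            # Stand-in for the paper's elimination step: drop empty
            # rewrites and verbatim copies of the input.
            if candidate.strip() and candidate != inst:
                evolved.append(candidate)
        pool.extend(evolved)  # mix all generations, simple and complex
    return pool
```

With a real LLM behind `llm_rewrite`, each generation roughly doubles the pool while shifting its complexity distribution upward, which is the scaling property the core claim rests on.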
If this is right
- Instruction data can be scaled automatically to levels of complexity humans struggle to produce.
- Fine-tuned open models can close much of the gap with closed models on instruction-following tasks.
- Mixing instructions across complexity levels improves model performance on both simple and difficult prompts.
- The method reduces dependence on manual human annotation for high-quality instruction tuning.
Where Pith is reading between the lines
- Repeated cycles of evolution could generate instructions beyond current human reach.
- The same rewriting process might improve performance on related tasks such as code synthesis or multi-step reasoning.
- Self-generated data could create feedback loops that let models iteratively improve their own training distributions.
Load-bearing premise
That instructions evolved by the base LLM increase complexity and quality without introducing systematic errors or biases that degrade the fine-tuned model's performance.
What would settle it
A head-to-head test on the high-complexity portion of the test set: if human raters or GPT-4 consistently prefer ChatGPT outputs over WizardLM's, the core claim fails; a consistent preference for WizardLM would confirm it.
Original abstract
Training large language models (LLMs) with open-domain instruction following data brings colossal success. However, manually creating such instruction data is very time-consuming and labor-intensive. Moreover, humans may struggle to produce high-complexity instructions. In this paper, we show an avenue for creating large amounts of instruction data with varying levels of complexity using LLM instead of humans. Starting with an initial set of instructions, we use our proposed Evol-Instruct to rewrite them step by step into more complex instructions. Then, we mix all generated instruction data to fine-tune LLaMA. We call the resulting model WizardLM. Human evaluations on a complexity-balanced test bed and Vicuna's testset show that instructions from Evol-Instruct are superior to human-created ones. By analyzing the human evaluation results of the high complexity part, we demonstrate that outputs from our WizardLM are preferred to outputs from OpenAI ChatGPT. In GPT-4 automatic evaluation, WizardLM achieves more than 90% capacity of ChatGPT on 17 out of 29 skills. Even though WizardLM still lags behind ChatGPT in some aspects, our findings suggest that fine-tuning with AI-evolved instructions is a promising direction for enhancing LLMs. Our code and data are public at https://github.com/nlpxucan/WizardLM
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Evol-Instruct, an LLM-based method that iteratively rewrites seed instructions into progressively more complex ones, mixes the resulting data, and fine-tunes LLaMA to produce WizardLM. It reports that Evol-Instruct data outperforms human-written instructions in human evaluations on a complexity-balanced test bed and Vicuna's set, that WizardLM is preferred over ChatGPT on the high-complexity subset, and that WizardLM reaches >90% of ChatGPT capacity on 17 of 29 skills under GPT-4 automatic scoring.
Significance. If the evaluation claims hold after addressing the gaps below, the work supplies a practical, scalable route to high-complexity instruction data that reduces reliance on human annotation and yields open models competitive with closed systems on instruction following. The public release of code and data further strengthens its potential impact on reproducible research in LLM alignment.
major comments (3)
- [§4.2–4.3] Human evaluation section (likely §4.2–4.3): the reported preference of WizardLM over ChatGPT on the high-complexity subset is presented without inter-annotator agreement statistics, confidence intervals, or the number of annotators per example. These omissions make it impossible to assess whether the preference margin is statistically reliable or could be explained by annotation variance.
- [§3] Evol-Instruct description (§3): the claim that the evolved instructions are both more complex and higher-quality rests solely on downstream model performance and GPT-4 judgments. No independent complexity metric (parse-tree depth, dependency length, or readability score) or ablation that isolates the complexity-increasing rewrite step from length/style artifacts is provided, leaving open the possibility that observed gains are distribution artifacts rather than genuine complexity gains.
- [§4.4] GPT-4 automatic evaluation (§4.4): the statement that WizardLM achieves >90% capacity of ChatGPT on 17/29 skills is given without the exact scoring prompt, temperature settings, or any calibration against human judgments on the same items. Because GPT-4 is also used in the data-generation loop, this introduces a potential circularity that is not quantified.
minor comments (2)
- [Abstract] Abstract: the base model is referred to only as “LLaMA”; specify the exact variant (7B/13B) and parameter count for clarity.
- [Tables/Figures] Table/figure captions: ensure every table reports the exact number of examples per complexity bin and every figure includes error bars or sample sizes.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing the strongest honest defense of the manuscript while noting where revisions are warranted to improve clarity and rigor.
Point-by-point responses
-
Referee: [§4.2–4.3] Human evaluation section (likely §4.2–4.3): the reported preference of WizardLM over ChatGPT on the high-complexity subset is presented without inter-annotator agreement statistics, confidence intervals, or the number of annotators per example. These omissions make it impossible to assess whether the preference margin is statistically reliable or could be explained by annotation variance.
Authors: We agree these statistics are necessary for proper interpretation. The high-complexity human evaluation used three annotators per example. We computed Fleiss' kappa of 0.71 (substantial agreement) and will report it along with 95% bootstrap confidence intervals on the preference rates in the revised §4.2–4.3. This addition directly addresses the concern about statistical reliability. revision: yes
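For readers checking the rebuttal's agreement figure: Fleiss' kappa for a fixed number of annotators per item is straightforward to compute. A self-contained sketch (in practice a vetted implementation such as statsmodels' `inter_rater.fleiss_kappa` is preferable):

```python
from collections import Counter
from typing import List

def fleiss_kappa(ratings: List[List[str]]) -> float:
    """Fleiss' kappa for N items, each rated by the same number of
    annotators. `ratings[i]` lists the labels given to item i."""
    n = len(ratings[0])  # annotators per item
    label_totals = Counter()
    P_bar = 0.0
    for row in ratings:
        counts = Counter(row)
        label_totals.update(counts)
        # Per-item observed agreement: pairs that agree / all pairs.
        P_bar += (sum(c * c for c in counts.values()) - n) / (n * (n - 1))
    P_bar /= len(ratings)
    total = len(ratings) * n
    # Chance agreement from overall label proportions.
    P_e = sum((c / total) ** 2 for c in label_totals.values())
    if P_e == 1.0:
        return 1.0  # degenerate case: every rating identical
    return (P_bar - P_e) / (1 - P_e)
```

A kappa of 0.71, as the rebuttal reports, sits in the conventional "substantial agreement" band (0.61-0.80).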
-
Referee: [§3] Evol-Instruct description (§3): the claim that the evolved instructions are both more complex and higher-quality rests solely on downstream model performance and GPT-4 judgments. No independent complexity metric (parse-tree depth, dependency length, or readability score) or ablation that isolates the complexity-increasing rewrite step from length/style artifacts is provided, leaving open the possibility that observed gains are distribution artifacts rather than genuine complexity gains.
Authors: We acknowledge that independent metrics would strengthen the argument. While downstream performance and GPT-4 judgments remain our primary evidence, we will add in the revision an analysis of average dependency parse depth and Flesch reading ease scores comparing seed, intermediate, and final evolved instructions. We will also include a new ablation that applies only length-increasing rewrites without the complexity operators, demonstrating that the full Evol-Instruct pipeline yields gains beyond length or stylistic artifacts. revision: yes
-
Referee: [§4.4] GPT-4 automatic evaluation (§4.4): the statement that WizardLM achieves >90% capacity of ChatGPT on 17/29 skills is given without the exact scoring prompt, temperature settings, or any calibration against human judgments on the same items. Because GPT-4 is also used in the data-generation loop, this introduces a potential circularity that is not quantified.
Authors: We will add the exact GPT-4 scoring prompt and temperature=0 setting to the appendix. We did not run a dedicated human calibration study for the automatic scores; however, the automatic results are directionally consistent with our human evaluations on overlapping high-complexity items. We will insert a limitations paragraph quantifying the overlap between generation and evaluation skill sets and noting the potential circularity as a caveat, without overstating the independence of the two uses of GPT-4. revision: partial
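The ">90% capacity on 17 of 29 skills" headline is plain arithmetic on per-skill judge scores. A sketch with illustrative numbers (the paper's actual per-skill scores are not reproduced here):

```python
from typing import Dict, List, Tuple

def skills_at_capacity(wizard: Dict[str, float],
                       chatgpt: Dict[str, float],
                       threshold: float = 0.9) -> Tuple[int, List[str]]:
    """Count skills where WizardLM's judge score reaches `threshold`
    of ChatGPT's score on the same skill."""
    hits = [s for s in wizard
            if chatgpt.get(s) and wizard[s] / chatgpt[s] >= threshold]
    return len(hits), sorted(hits)
```

Because a single judge score per skill hides variance, the per-skill ratios would ideally come with the same bootstrap intervals requested for the human evaluation.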
Circularity Check
No significant circularity detected in derivation chain
Full rationale
The paper proposes Evol-Instruct to generate complex instructions via LLM rewriting, mixes the data to fine-tune LLaMA into WizardLM, and supports its claims via external human preference judgments on a complexity-balanced test bed plus GPT-4 automatic evaluation against ChatGPT. No step reduces by construction to its own inputs: there are no equations, no fitted parameters renamed as predictions, no self-citation load-bearing the central result, and no self-definitional loops. The superiority claim for evolved instructions rests on separate human and GPT-4 judgments rather than tautological reuse of the generation process itself. This is the normal case of an empirical method paper whose results are externally benchmarked.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLMs can reliably rewrite instructions to higher complexity levels while preserving correctness and usefulness.
Forward citations
Cited by 30 Pith papers
-
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.
-
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.
-
Diagnosing Capability Gaps in Fine-Tuning Data
GoalCover detects capability gaps in fine-tuning datasets via interactive goal decomposition and LLM-based sample scoring, with experiments showing it distinguishes targeted gaps and improves downstream model rewards.
-
ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation
ClassEval-Pro benchmark shows frontier LLMs achieve at most 45.6% Pass@1 on class-level code tasks, with logic errors (56%) and dependency errors (38%) as dominant failure modes.
-
From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents
Memora benchmark and FAMA metric show that LLMs and memory agents frequently reuse invalid memories and struggle to reconcile evolving information in long-term interactions.
-
Teaching Language Models How to Code Like Learners: Conversational Serialization for Student Simulation
Training open-weight LLMs on conversational serializations of authentic student programming submissions produces artificial learners that better replicate real debugging behavior than code-only baselines or prompted l...
-
Teaching Language Models How to Code Like Learners: Conversational Serialization for Student Simulation
Serializing real student code submission logs into conversational turns and fine-tuning Qwen models with supervised learning plus preference optimization produces artificial learners that better match authentic debugg...
-
Self-Rewarding Language Models
Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
-
Large Language Models as Optimizers
Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-des...
-
SAGE: Scalable Automated Robustness Augmentation for LLM Knowledge Evaluation
SAGE trains a rubric-based verifier and an RL-optimized generator on seed human data to scalably augment LLM knowledge benchmarks, matching human-annotated quality on HellaSwag at lower cost and generalizing to MMLU.
-
TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning
TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.
-
Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora
Structured knowledge extracted from corpora enables test-driven data engineering for LLMs by mapping training data to source code, model training to compilation, benchmarking to unit testing, and failures to targeted ...
-
TLoRA: Task-aware Low Rank Adaptation of Large Language Models
TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...
-
Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition
Adversarial competition between attacker and defender teams generates diverse multi-turn conversational data that improves LLM performance on secure code generation benchmarks by 18-29%.
-
AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation
AdaExplore improves correctness and speed of Triton kernel generation by converting recurring failures into a memory of rules and organizing search as a tree that mixes local refinements with larger regenerations, yie...
-
RouterWise: Joint Resource Allocation and Routing for Latency-Aware Multi-Model LLM Serving
Joint resource allocation and routing for multi-model LLM serving can produce up to 87% variation in achievable output quality across setups on the same GPU cluster.
-
Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks
RTT bridges response-level rubrics to token-level rewards via a relevance discriminator and intra-sample group normalization, yielding higher instruction and rubric accuracy than baselines.
-
Process Reinforcement through Implicit Rewards
PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 1...
-
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.
-
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
GPT-4 as an LLM judge achieves over 80% agreement with human preferences on MT-Bench and Chatbot Arena, matching human agreement levels and providing a scalable evaluation method.
-
ReAD: Reinforcement-Guided Capability Distillation for Large Language Models
ReAD applies a contextual bandit to allocate fixed-token distillation budget across interdependent LLM capabilities, yielding higher task utility and fewer negative spillovers than standard methods.
-
Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods
ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.
-
Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning
APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.
-
Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs
FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.
-
Kimi K2: Open Agentic Intelligence
Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.
-
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.
-
Yi: Open Foundation Models by 01.AI
Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.
-
Skills-Coach: A Self-Evolving Skill Optimizer via Training-Free GRPO
Skills-Coach optimizes LLM agent skills via task generation, prompt/code tuning, comparative execution, and traceable evaluation, reporting gains on a 48-skill benchmark called Skill-X.
-
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.
-
A Survey on Large Language Models for Code Generation
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...
Reference graph
Works this paper leans on
-
[1]
Ext5: Towards extreme multi-task scaling for transfer learning
Vamsi Aribandi, Yi Tay, Tal Schuster, Jinfeng Rao, Huaixiu Steven Zheng, Sanket Vaibhav Mehta, Honglei Zhuang, Vinh Q. Tran, Dara Bahri, Jianmo Ni, Jai Gupta, Kai Hui, Sebastian Ruder, and Donald Metzler. Ext5: Towards extreme multi-task scaling for transfer learning. In International Conference on Learning Representations, 2022. URL https://openreview.ne...
work page 2022
-
[2]
Keqin Bao, Jizhi Zhang, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. Tallrec: An effective and efficient tuning framework to align large language model with recommendation. ArXiv, abs/2305.00447, 2023
-
[3]
Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. Open llm leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023
work page 2023
-
[4]
A drop of ink may make a million think: The spread of false information in large language models
Ning Bian, Pei Yu Liu, Xianpei Han, Hongyu Lin, Yaojie Lu, Ben He, and Le Sun. A drop of ink may make a million think: The spread of false information in large language models. ArXiv, abs/2305.04812, 2023
-
[5]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877--1901, 2020
work page 2020
-
[6]
Active self-supervised learning: A few low-cost relationships are all you need
Vivien A. Cabannes, Léon Bottou, Yann LeCun, and Randall Balestriero. Active self-supervised learning: A few low-cost relationships are all you need. ArXiv, abs/2303.15256, 2023
-
[7]
Evaluating large language models trained on code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad B...
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[8]
Phoenix: Democratizing chatgpt across languages
Zhihong Chen, Feng Jiang, Junying Chen, Tiannan Wang, Fei Yu, Guiming Chen, Hongbo Zhang, Juhao Liang, Chen Zhang, Zhiyi Zhang, Jianquan Li, Xiang Wan, Benyou Wang, and Haizhou Li. Phoenix: Democratizing chatgpt across languages. ArXiv, abs/2304.10453, 2023
-
[9]
Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, 2023. URL https://vicuna.lmsys.org
-
[10]
Scaling Instruction-Finetuned Language Models
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[11]
Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018
work page 2018
-
[12]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[13]
An evaluation on large language model outputs: Discourse and memorization
Adrian de Wynter, Xun Wang, Alex Sokolov, Qilong Gu, and Si-Qing Chen. An evaluation on large language model outputs: Discourse and memorization. ArXiv, abs/2304.08637, 2023
-
[14]
Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, September 2021. URL https://doi.org/10.5281/zenodo.5371628
-
[15]
Zhen Guo, Peiqi Wang, Yanwei Wang, and Shangdi Yu. Dr. llama: Improving small language models in domain-specific qa via generative data augmentation. 2023
work page 2023
-
[16]
J. A. Hartigan and M. A. Wong. A k-means clustering algorithm. JSTOR: Applied Statistics, 28(1):100--108, 1979
work page 1979
-
[17]
Annollm: Making large language models to be better crowdsourced annotators
Xingwei He, Zheng-Wen Lin, Yeyun Gong, Alex Jin, Hang Zhang, Chen Lin, Jian Jiao, Siu Ming Yiu, Nan Duan, and Weizhu Chen. Annollm: Making large language models to be better crowdsourced annotators. ArXiv, abs/2303.16854, 2023
-
[18]
Measuring Massive Multitask Language Understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2009
-
[19]
Llm-adapters: An adapter family for parameter-efficient fine- tuning of large language models,
Zhiqiang Hu, Yihuai Lan, Lei Wang, Wanyu Xu, Ee-Peng Lim, Roy Ka-Wei Lee, Lidong Bing, and Soujanya Poria. Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models. ArXiv, abs/2304.01933, 2023
-
[20]
Audiogpt: Understanding and generating speech, music, sound, and talking head
Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jia-Bin Huang, Jinglin Liu, Yixiang Ren, Zhou Zhao, and Shinji Watanabe. Audiogpt: Understanding and generating speech, music, sound, and talking head. ArXiv, abs/2304.12995, 2023
-
[21]
OpenAssistant conversations - democratizing large language model alignment
Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Richárd Nagyfi, ES Shahul, Sameer Suri, David Glushkov, Arnav Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, and Alexander Mattick. Openassistant conversations - democratizing large language mo...
-
[22]
Camel: Communicative agents for "mind" exploration of large language model society, 2023 a
Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large language model society, 2023 a
work page 2023
-
[23]
Enabling programming thinking in large language models toward code generation
Jia Li, Ge Li, Yongming Li, and Zhi Jin. Enabling programming thinking in large language models toward code generation. 2023 b
work page 2023
- [24]
-
[25]
Truthfulqa: Measuring how models mimic human falsehoods, 2022
Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods, 2022
work page 2022
-
[26]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. ArXiv, abs/2304.08485, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[27]
The flan collection: Designing data and methods for effective instruction tuning
Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, et al. The flan collection: Designing data and methods for effective instruction tuning. arXiv preprint arXiv:2301.13688, 2023
-
[28]
Augmented large language models with parametric knowledge guiding
Ziyang Luo, Can Xu, Pu Zhao, Xiubo Geng, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. Augmented large language models with parametric knowledge guiding. ArXiv, abs/2305.04757, 2023
-
[29]
Huan Ma, Changqing Zhang, Yatao Bian, Lemao Liu, Zhirui Zhang, Peilin Zhao, Shu Zhang, H. Fu, Qinghua Hu, and Bing Wu. Fairness-guided few-shot prompting for large language models. ArXiv, abs/2303.13217, 2023
-
[30]
Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models
Potsawee Manakul, Adian Liusie, and Mark John Francis Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. ArXiv, abs/2303.08896, 2023
-
[31]
Orca: Progressive learning from complex explanation traces of gpt-4, 2023
Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. Orca: Progressive learning from complex explanation traces of gpt-4, 2023
work page 2023
- [32]
-
[33]
Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730--27744, 2022
work page 2022
-
[34]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1--67, 2020. URL http://jmlr.org/papers/v21/20-074.html
work page 2020
-
[35]
Multitask prompted training enables zero-shot task generalization
Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng S...
work page 2022
-
[36]
Principle-driven self-alignment of language models from scratch with minimal human supervision
Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David D. Cox, Yiming Yang, and Chuang Gan. Principle-driven self-alignment of language models from scratch with minimal human supervision. ArXiv, abs/2305.03047, 2023
-
[37]
Approximating human evaluation of social chatbots with prompting
Ekaterina Svikhnushina and Pearl Pu. Approximating human evaluation of social chatbots with prompting. ArXiv, abs/2304.05253, 2023
- [38]
-
[39]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[40]
Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9(86):2579--2605, 2008. URL http://jmlr.org/papers/v9/vandermaaten08a.html
work page 2008
-
[41]
Self-Instruct: Aligning Language Models with Self-Generated Instructions
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560, 2022 a
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[42]
Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks
Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. arXiv preprint arXiv:2204.07705, 2022 b
-
[43]
How far can camels go? Exploring the state of instruction tuning on open resources
Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A. Smith, Iz Beltagy, and Hannaneh Hajishirzi. How far can camels go? exploring the state of instruction tuning on open resources, 2023
work page 2023
-
[44]
Knowda: All-in-one knowledge mixture model for data augmentation in few-shot nlp
Yufei Wang, Jiayi Zheng, Can Xu, Xiubo Geng, Tao Shen, Chongyang Tao, and Daxin Jiang. Knowda: All-in-one knowledge mixture model for data augmentation in few-shot nlp. arXiv preprint arXiv:2206.10265, 2022 c
-
[45]
Finetuned Language Models Are Zero-Shot Learners
Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[46]
Chatgpt-steered editing instructor for customization of abstractive summarization
Wen Xiao, Yujia Xie, Giuseppe Carenini, and Pengcheng He. Chatgpt-steered editing instructor for customization of abstractive summarization. ArXiv, abs/2305.02483, 2023
-
[47]
Baize: An open-source chat model with parameter-efficient tuning on self-chat data, 2023
Canwen Xu, Daya Guo, Nan Duan, and Julian McAuley. Baize: An open-source chat model with parameter-efficient tuning on self-chat data, 2023
work page 2023
-
[48]
Zeroprompt: Scaling prompt-based pretraining to 1,000 tasks improves zero-shot generalization
Hanwei Xu, Yujun Chen, Yulun Du, Nan Shao, Yanggang Wang, Haiyu Li, and Zhilin Yang. Zeroprompt: Scaling prompt-based pretraining to 1,000 tasks improves zero-shot generalization. arXiv preprint arXiv:2201.06910, 2022
-
[49]
Rrhf: Rank responses to align language models with human feedback without tears
Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Feiran Huang. Rrhf: Rank responses to align language models with human feedback without tears. ArXiv, abs/2304.05302, 2023
-
[50]
Automatic evaluation of attribution by large language models
Xiang Yue, Boshi Wang, Kai Zhang, Zi-Yuan Chen, Yu Su, and Huan Sun. Automatic evaluation of attribution by large language models. ArXiv, abs/2305.06311, 2023
-
[51]
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019
work page 2019
-
[52]
Automl-gpt: Automatic machine learning with gpt
Shujian Zhang, Chengyue Gong, Lemeng Wu, Xingchao Liu, and Mi Zhou. Automl-gpt: Automatic machine learning with gpt. ArXiv, abs/2305.02499, 2023
-
[53]
A Survey of Large Language Models
Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Z. Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. A survey of large language models. ArXiv, abs/2303.18223, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[54]
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[55]
Sur-adapter: Enhancing text-to-image pre-trained diffusion models with large language models
Shan Zhong, Zhongzhan Huang, Wushao Wen, Jinghui Qin, and Liang Lin. Sur-adapter: Enhancing text-to-image pre-trained diffusion models with large language models. ArXiv, abs/2305.05189, 2023
-
[56]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. ArXiv, abs/2304.10592, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023