pith. machine review for the scientific record.

arxiv: 2306.02707 · v1 · submitted 2023-06-05 · 💻 cs.CL · cs.LG

Recognition: no theorem link

Orca: Progressive Learning from Complex Explanation Traces of GPT-4

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 09:39 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords Orca · imitation learning · explanation traces · reasoning · small language models · Big-Bench Hard · zero-shot performance

The pith

A 13B model trained on GPT-4's step-by-step explanations reaches ChatGPT parity on complex reasoning benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Orca, a 13-billion parameter model trained to imitate the reasoning processes of larger foundation models rather than their surface output style. It draws on rich signals from GPT-4, including detailed explanation traces and step-by-step thought processes, with teacher assistance from ChatGPT and judicious selection from large-scale, diverse data. This training lets Orca outperform prior 13B instruction-tuned models such as Vicuna-13B by more than 100% (relative) on Big-Bench Hard and by 42% on AGIEval. Orca reaches parity with ChatGPT on BBH and stays within a 4-point gap (with an optimized system message) on professional exams such as the SAT, LSAT, GRE, and GMAT, all in zero-shot settings without chain-of-thought prompting, while still trailing GPT-4 itself.

Core claim

Orca is a 13-billion parameter model that learns to imitate the reasoning process of large foundation models by training on rich explanation traces, step-by-step thought processes, and complex instructions generated by GPT-4 with assistance from ChatGPT. Through progressive learning on large-scale and diverse imitation data with judicious sampling, Orca surpasses conventional state-of-the-art instruction-tuned models on complex zero-shot reasoning benchmarks and achieves performance parity with ChatGPT on BBH while showing competitive results on professional and academic examinations.

What carries the argument

Progressive learning from complex explanation traces and step-by-step thought processes of GPT-4, using large-scale diverse imitation data with judicious sampling to transfer reasoning capabilities to a smaller model.
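
To make this concrete, here is a minimal sketch in Python of what one explanation-trace training record might look like. The schema, system message, and field names are illustrative assumptions for this page, not the paper's actual data format.

    import json

    # Illustrative system message in the spirit of the paper's "complex
    # instructions"; the actual prompts Orca trains with are not shown here.
    SYSTEM_MESSAGE = (
        "You are a helpful assistant. Think step by step and justify your answer."
    )

    def make_imitation_record(instruction: str, teacher_trace: str) -> str:
        """Pack one (instruction, GPT-4 explanation trace) pair into a JSON line.

        The student is trained to reproduce the teacher's full step-by-step
        explanation, not just the final answer: the signal the paper argues
        separates reasoning transfer from style imitation.
        """
        record = {
            "system": SYSTEM_MESSAGE,
            "user": instruction,
            "response": teacher_trace,  # the teacher's step-by-step explanation
        }
        return json.dumps(record)

    # Toy usage with a synthetic reasoning item.
    print(make_imitation_record(
        "If a train travels 60 miles in 1.5 hours, what is its average speed?",
        "Step 1: Speed is distance over time. Step 2: 60 / 1.5 = 40. "
        "The average speed is 40 mph.",
    ))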

If this is right

  • Smaller models can close much of the reasoning gap to larger models when trained on detailed process traces instead of shallow outputs.
  • Explanation traces provide stronger imitation signals than standard instruction data for zero-shot complex reasoning.
  • Competitive performance on professional exams is achievable without chain-of-thought prompting at inference time.
  • Judicious sampling from large-scale data helps avoid the style-imitation pitfalls seen in earlier imitation learning efforts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The main bottleneck for smaller models may be the quality and depth of reasoning data rather than parameter count alone.
  • This training approach could be combined with other data sources to further reduce reliance on very large models at inference.
  • Similar progressive imitation on explanation traces might extend to domains such as code generation or scientific problem solving.

Load-bearing premise

The assumption that benchmark gains come from genuine transfer of reasoning processes rather than the model learning to match output style or patterns in the evaluation data.

What would settle it

Testing Orca on newly constructed reasoning problems that match the structure and difficulty of BBH items but are guaranteed to be absent from any training data, and checking whether the performance gap to ChatGPT widens substantially.
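
A hedged sketch of the overlap audit such a test presupposes, using plain n-gram matching in Python. The 13-gram window and whitespace tokenization are conventional decontamination choices, not details taken from the paper.

    def ngrams(text: str, n: int = 13) -> set[str]:
        """All whitespace-tokenized n-grams of a string, lowercased."""
        tokens = text.lower().split()
        return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def is_contaminated(eval_item: str, training_docs: list[str], n: int = 13) -> bool:
        """Flag an evaluation item whose n-grams also appear in training data.

        Any shared n-gram is treated as overlap; embedding-based checks
        would catch paraphrased leakage that this lexical test misses.
        """
        item_grams = ngrams(eval_item, n)
        return any(item_grams & ngrams(doc, n) for doc in training_docs)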

Original abstract

Recent research has focused on enhancing the capability of smaller models through imitation learning, drawing on the outputs generated by large foundation models (LFMs). A number of issues impact the quality of these models, ranging from limited imitation signals from shallow LFM outputs; small scale homogeneous training data; and most notably a lack of rigorous evaluation resulting in overestimating the small model's capability as they tend to learn to imitate the style, but not the reasoning process of LFMs. To address these challenges, we develop Orca (We are working with our legal team to publicly release a diff of the model weights in accordance with LLaMA's release policy to be published at https://aka.ms/orca-lm), a 13-billion parameter model that learns to imitate the reasoning process of LFMs. Orca learns from rich signals from GPT-4 including explanation traces; step-by-step thought processes; and other complex instructions, guided by teacher assistance from ChatGPT. To promote this progressive learning, we tap into large-scale and diverse imitation data with judicious sampling and selection. Orca surpasses conventional state-of-the-art instruction-tuned models such as Vicuna-13B by more than 100% in complex zero-shot reasoning benchmarks like Big-Bench Hard (BBH) and 42% on AGIEval. Moreover, Orca reaches parity with ChatGPT on the BBH benchmark and shows competitive performance (4 pts gap with optimized system message) in professional and academic examinations like the SAT, LSAT, GRE, and GMAT, both in zero-shot settings without CoT; while trailing behind GPT-4. Our research indicates that learning from step-by-step explanations, whether these are generated by humans or more advanced AI models, is a promising direction to improve model capabilities and skills.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces Orca, a 13B-parameter model trained via imitation learning on large-scale data consisting of complex explanation traces, step-by-step reasoning processes, and instructions generated by GPT-4 (with ChatGPT as teacher). It claims that this progressive learning yields substantial gains over prior instruction-tuned models such as Vicuna-13B, reaching parity with ChatGPT on Big-Bench Hard (BBH) in zero-shot settings without chain-of-thought, showing competitive performance (a 4-point gap with an optimized system message) on SAT/LSAT/GRE/GMAT-style exams, and trailing GPT-4.

Significance. If the reported gains reflect genuine acquisition of reasoning processes rather than surface-level imitation or evaluation artifacts, the work provides concrete evidence that rich, multi-step explanation signals from larger models can be distilled into smaller models at scale, offering a practical route to improved zero-shot reasoning without requiring full model scaling.

major comments (3)
  1. [§3] Data Construction: The manuscript provides no quantitative details on filtering, sampling ratios, or decontamination of the >5M imitation samples against BBH, AGIEval, or the professional-exam items. Without an explicit overlap audit or description of how explanation traces were elicited, the link between the training signal and the claimed reasoning gains cannot be verified.
  2. [§4.1, Table 2] BBH results: The headline parity with ChatGPT is presented without any ablation that isolates the effect of step-by-step explanation traces versus simpler GPT-4 outputs or direct answers. This omission leaves open the possibility that gains arise from style or pattern matching rather than transferable reasoning.
  3. [§4.2] Evaluation protocol: No results are reported on paraphrased, adversarially altered, or out-of-distribution variants of the BBH and exam tasks. Such controls are necessary to distinguish genuine capability improvement from benchmark-specific artifacts or partial leakage.
minor comments (3)
  1. [Abstract] The parenthetical legal-release note is out of place in the abstract and should be moved to a footnote or acknowledgments.
  2. [Figure 1] The progressive-learning diagram would be clearer with explicit labels on the data-flow arrows and the role of ChatGPT teacher assistance.
  3. [§2] Related Work: Additional citations to recent work on explanation-based distillation and contamination audits would strengthen context.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify key aspects of our work. We address each major comment below and indicate where revisions will be made to the manuscript.

Point-by-point responses
  1. Referee: [§3] Data Construction: The manuscript provides no quantitative details on filtering, sampling ratios, or decontamination of the >5M imitation samples against BBH, AGIEval, or the professional-exam items. Without an explicit overlap audit or description of how explanation traces were elicited, the link between the training signal and the claimed reasoning gains cannot be verified.

    Authors: We agree that more quantitative details on data construction would improve transparency. In the revised manuscript, we will expand Section 3 with explicit sampling ratios (e.g., 60% from FLAN, 30% from GPT-4 traces, 10% from other sources), filtering criteria (length, quality heuristics, and diversity sampling), and a decontamination audit confirming zero overlap with BBH, AGIEval, and exam items via n-gram and embedding-based checks. Explanation traces were elicited via GPT-4 prompts instructing step-by-step reasoning on diverse tasks, as outlined in the data pipeline description. These additions will directly link the training signal to the reported gains. revision: yes

  2. Referee: [§4.1, Table 2] BBH results: The headline parity with ChatGPT is presented without any ablation that isolates the effect of step-by-step explanation traces versus simpler GPT-4 outputs or direct answers. This omission leaves open the possibility that gains arise from style or pattern matching rather than transferable reasoning.

    Authors: We acknowledge the value of a direct ablation. Our comparisons to Vicuna-13B (trained on simpler direct-answer data) already provide indirect evidence that the complex traces drive the >100% relative gain on BBH. However, a full ablation isolating trace complexity would require additional training runs. In the revision, we will add a discussion paragraph in §4.1 referencing this comparison and noting that future work could include controlled ablations; we maintain that the progressive learning setup, rather than style alone, explains the parity with ChatGPT. revision: partial

  3. Referee: [§4.2] Evaluation protocol: No results are reported on paraphrased, adversarially altered, or out-of-distribution variants of the BBH and exam tasks. Such controls are necessary to distinguish genuine capability improvement from benchmark-specific artifacts or partial leakage.

    Authors: We agree robustness checks are important. Due to compute limits in the original submission, we did not include them. In the revised version, we will add a new paragraph in §4.2 reporting results on paraphrased BBH subsets (maintaining ~95% of original performance) and explicitly discuss this as evidence against pure artifact reliance. For adversarial and broader OOD variants, we will note them as a limitation and direction for future work, while emphasizing that the zero-shot parity without CoT already suggests transferable reasoning beyond surface patterns. revision: partial
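
The retention check described in response 3 reduces to a small computation. A sketch under the assumption that predictions on original and paraphrased items are aligned index by index; the example values are toys, not results from the paper.

    def accuracy(preds: list[str], gold: list[str]) -> float:
        """Fraction of predictions that match the gold answers."""
        return sum(p == g for p, g in zip(preds, gold)) / len(gold)

    def retention_ratio(preds_original, preds_paraphrased, gold) -> float:
        """Accuracy on paraphrased items relative to matched originals.

        A ratio near 1.0 (the rebuttal claims ~0.95) argues against pure
        benchmark-specific pattern matching; a large drop would support
        the referee's leakage concern.
        """
        return accuracy(preds_paraphrased, gold) / accuracy(preds_original, gold)

    # Toy usage: three items, one paraphrase-induced error (ratio ≈ 0.67).
    gold = ["A", "C", "B"]
    print(retention_ratio(["A", "C", "B"], ["A", "C", "D"], gold))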

Circularity Check

0 steps flagged

No circularity in empirical claims or derivations

Full rationale

The paper reports an empirical training procedure in which Orca is fine-tuned on large-scale imitation data containing GPT-4 explanation traces, followed by evaluation on fixed external benchmarks (BBH, AGIEval, SAT, etc.). No equations, first-principles derivations, or predictions are presented that reduce by construction to quantities defined inside the paper itself. Benchmark scores are measured against independently published test sets; no fitted parameter is relabeled as a prediction, no self-citation supplies a load-bearing uniqueness theorem, and no ansatz is smuggled through prior work. The central claims therefore rest on observable performance numbers rather than self-referential definitions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard machine-learning assumptions about data quality and generalization plus the domain assumption that richer imitation signals transfer reasoning rather than surface patterns.

free parameters (1)
  • training hyperparameters and data sampling ratios
    Specific learning rates, batch sizes, and selection criteria for the imitation dataset are chosen during training but not detailed in the abstract (a hypothetical sketch follows this ledger).
axioms (1)
  • domain assumption: Imitation on explanation traces transfers genuine reasoning capability rather than style matching
    Invoked throughout the abstract as the motivation and claimed outcome of the training procedure.
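
To make the ledger's free parameter concrete, a hypothetical sketch of the kind of ratio-based mixture sampling it refers to. Every value below is an assumption: the abstract reports no hyperparameters, and the 60/30/10 split merely echoes the simulated rebuttal's illustrative figures.

    import random

    # Hypothetical settings; none are reported in the abstract.
    CONFIG = {
        "base_model": "llama-13b",  # Orca fine-tunes a 13B LLaMA-class model
        "learning_rate": 2e-5,      # assumed
        "batch_size": 128,          # assumed
    }

    # Source mixture echoing the rebuttal's illustrative 60/30/10 split.
    SOURCE_RATIOS = {"flan": 0.60, "gpt4_traces": 0.30, "other": 0.10}

    def sample_mixture(sources: dict, ratios: dict, k: int, seed: int = 0) -> list:
        """Draw roughly k training examples so each source meets its target share."""
        rng = random.Random(seed)
        mixed = []
        for name, share in ratios.items():
            mixed.extend(rng.choices(sources[name], k=int(round(k * share))))
        rng.shuffle(mixed)
        return mixed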

pith-pipeline@v0.9.0 · 5643 in / 1317 out tokens · 33576 ms · 2026-05-15T09:39:05.871763+00:00 · methodology


Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.

  2. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.

  3. Fine-Tuning Small Reasoning Models for Quantum Field Theory

    cs.LG 2026-04 unverdicted novelty 7.0

    Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.

  4. Validity-Calibrated Reasoning Distillation

    cs.LG 2026-04 unverdicted novelty 7.0

    Validity-calibrated reasoning distillation improves small LLMs by using relative local validity of next steps to dynamically adjust imitation strength instead of enforcing full trajectory matching.

  5. Validity-Calibrated Reasoning Distillation

    cs.LG 2026-04 unverdicted novelty 7.0

    Validity-calibrated reasoning distillation improves transfer of reasoning skills by modulating updates based on relative local validity of next steps instead of enforcing full trajectory imitation.

  6. Teaching Language Models How to Code Like Learners: Conversational Serialization for Student Simulation

    cs.AI 2026-04 unverdicted novelty 7.0

    Serializing real student code submission logs into conversational turns and fine-tuning Qwen models with supervised learning plus preference optimization produces artificial learners that better match authentic debugg...

  7. Teaching Language Models How to Code Like Learners: Conversational Serialization for Student Simulation

    cs.AI 2026-04 conditional novelty 7.0

    Training open-weight LLMs on conversational serializations of authentic student programming submissions produces artificial learners that better replicate real debugging behavior than code-only baselines or prompted l...

  8. Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Supervised fine-tuning of LLMs often fails to fully internalize all training instances due to five recurring causes including missing prerequisites and data conflicts, as diagnosed via a new framework across multiple models.

  9. Distribution Corrected Offline Data Distillation for Large Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    A distribution-correction framework for offline LLM reasoning distillation improves accuracy on math benchmarks by adaptively aligning teacher supervision with the student's inference-time distribution.

  10. SkillGen: Verified Inference-Time Agent Skill Synthesis

    cs.LG 2026-05 unverdicted novelty 6.0

    SkillGen synthesizes auditable skills from agent trajectories via contrastive induction on successes and failures, then verifies net performance impact by comparing outcomes with and without the skill on identical tasks.

  11. Generating Leakage-Free Benchmarks for Robust RAG Evaluation

    cs.CL 2026-05 unverdicted novelty 6.0

    SeedRG generates novel, leakage-free RAG benchmark examples from seed data by mapping reasoning structures and swapping entities while applying consistency and leakage checks.

  12. Response Time Enhances Alignment with Heterogeneous Preferences

    cs.LG 2026-05 unverdicted novelty 6.0

    Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.

  13. Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora

    cs.SE 2026-04 unverdicted novelty 6.0

    Structured knowledge extracted from corpora enables test-driven data engineering for LLMs by mapping training data to source code, model training to compilation, benchmarking to unit testing, and failures to targeted ...

  14. CoDA: Towards Effective Cross-domain Knowledge Transfer via CoT-guided Domain Adaptation

    cs.AI 2026-04 unverdicted novelty 6.0

    CoDA aligns cross-domain latent reasoning representations in LLMs via CoT distillation and MMD to enable effective knowledge transfer without in-domain demonstrations.

  15. Textbooks Are All You Need

    cs.CL 2023-06 unverdicted novelty 6.0

    A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.

  16. OmniThoughtVis: A Scalable Distillation Pipeline for Deployable Multimodal Reasoning Models

    cs.CL 2026-05 unverdicted novelty 5.0

    OmniThoughtVis curates 1.8M multimodal CoT samples via teacher distillation, difficulty annotation, and tag-based sampling, yielding consistent gains on nine reasoning benchmarks and allowing 4B models to match or bea...

  17. Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models

    cs.AI 2026-05 unverdicted novelty 5.0

    Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.

  18. Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods

    cs.LG 2026-04 unverdicted novelty 5.0

    ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.

  19. Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning

    cs.CL 2026-04 accept novelty 5.0

    LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.

  20. FedDetox: Robust Federated SLM Alignment via On-Device Data Sanitization

    cs.CR 2026-04 unverdicted novelty 5.0

    FedDetox uses on-device knowledge-distilled classifiers to sanitize toxic data in federated SLM training, preserving safety alignment comparable to centralized baselines.

  21. Large Language Models: A Survey

    cs.CL 2024-02 accept novelty 3.0

    The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

  22. Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models

    cs.CL 2025-08

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · cited by 19 Pith papers · 5 internal anchors

  1. [1]

    Agieval: A human-centric benchmark for evaluating foundation models, 2023

    Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. Agieval: A human-centric benchmark for evaluating foundation models, 2023

  2. [3]

    Evaluating large language models trained on code, 2021

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  3. [4]

    Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2022

    Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, and Adria Garriga-Alonso et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2022

  4. [5]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Francis Christiano, Jan Leike, and Ryan J. Lowe. Training language models to follow instructions with h...

  5. [6]

    Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, John Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, E Perez, Jamie Kerr, Jared Mueller, Jeff Ladish, J Landau, Kamal Ndousse, Kamilė Lukošiūtė...

  6. [7]

    Stanford alpaca: An instruction-following llama model, 2023

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

  7. [8]

    Wizardlm: Empowering large language models to follow complex instructions, 2023

    Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. Wizardlm: Empowering large language models to follow complex instructions, 2023

  8. [9]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://vicuna.lmsys.org

  9. [10]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  10. [11]

    Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, and Jason Wei. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022

  11. [12]

    The false promise of imitating proprietary llms, 2023

    Arnav Gudibande, Eric Wallace, Charlie Snell, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, and Dawn Song. The false promise of imitating proprietary llms, 2023

  12. [13]

    Self-instruct: Aligning language model with self-generated instructions, 2022

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self-generated instructions, 2022

  13. [14]

    Koala: A dialogue model for academic research

    Xinyang Geng, Arnav Gudibande, Hao Liu, Eric Wallace, Pieter Abbeel, Sergey Levine, and Dawn Song. Koala: A dialogue model for academic research. Blog post, April 2023. URL https://bair.berkeley.edu/blog/2023/04/03/koala/

  14. [15]

    Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers, 2020

    Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers, 2020

  15. [16]

    Xtremedistil: Multi-stage distillation for massive multilingual models, 2020

    Subhabrata Mukherjee and Ahmed Awadallah. Xtremedistil: Multi-stage distillation for massive multilingual models, 2020

  16. [17]

    Distilling step-by-step! Outperforming larger language models with less training data and smaller model sizes, 2023

    Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! Outperforming larger language models with less training data and smaller model sizes, 2023

  17. [18]

    Large language models are not fair evaluators, 2023

    Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not fair evaluators, 2023

  18. [19]

    The flan collection: Designing data and methods for effective instruction tuning, 2023

    Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V. Le, Barret Zoph, Jason Wei, and Adam Roberts. The flan collection: Designing data and methods for effective instruction tuning, 2023

  19. [20]

    Truthfulqa: Measuring how models mimic human falsehoods, 2022

    Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods, 2022

  20. [21]

    ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection

    Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3309–3326. Association for Computationa...

  21. [22]

    Finetuned language models are zero-shot learners, 2022

    Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners, 2022

  22. [23]

    Minigpt-4: Enhancing vision-language understanding with advanced large language models, 2023

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models, 2023

  23. [24]

    Visual instruction tuning, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023

  24. [25]

    Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks

    Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, et al. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing...

  25. [26]

    Efficient sequence packing without cross-contamination: Accelerating large language models without impacting performance, 2022

    Mario Michael Krell, Matej Kosec, Sergio P. Perez, and Andrew Fitzgibbon. Efficient sequence packing without cross-contamination: Accelerating large language models without impacting performance, 2022

  26. [27]

    Awesome chatgpt prompts, 2023

    Awesome chatgpt prompts, 2023. URL https://github.com/f/awesome-chatgpt-prompts

  27. [28]

    Reprompting: Automated chain-of-thought prompt inference through gibbs sampling, 2023

    Weijia Xu, Andrzej Banburski-Fahey, and Nebojsa Jojic. Reprompting: Automated chain-of-thought prompt inference through gibbs sampling, 2023

  28. [29]

    Scaling instruction-finetuned language models, 2022

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping H...

  29. [30]

    A general language assistant as a laboratory for alignment, 2021

    Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, and Jared Kaplan. A general language assistant as a labora...

  30. [31]

    TruthfulQA: Measuring how models mimic human falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252. Association for Computational Linguistics, 2022

  31. [32]

    Gpt-4 technical report, 2023

    OpenAI. Gpt-4 technical report, 2023

  32. [33]

    Hatebert: Retraining bert for abusive language detection in english, 2021

    Tommaso Caselli, Valerio Basile, Jelena Mitrovic, and M. Granitzer. Hatebert: Retraining bert for abusive language detection in english. ArXiv, abs/2010.12472, 2021

  33. [34]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023

  34. [35]

    WebGPT: Browser-assisted question-answering with human feedback

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021

  35. [36]

    Auto-gpt: An autonomous gpt-4 experiment, 2023

    Auto-gpt: An autonomous gpt-4 experiment. https://github.com/Significant-Gravitas/Auto-GPT, 2023. [Online; accessed 13-May-2023]

  36. [37]

    Prometheus: Building the new bing, 2023

    Prometheus: Building the new bing. https://blogs.bing.com/search-quality-insights/february-2023/Building-the-New-Bing, 2023. [Online; accessed 4-June-2023]

  37. [38]

    Rewoo: Decoupling reasoning from observations for efficient augmented language models, 2023

    Binfeng Xu, Zhiyuan Peng, Bowen Lei, Subhabrata Mukherjee, Yuchen Liu, and Dongkuan Xu. Rewoo: Decoupling reasoning from observations for efficient augmented language models, 2023