Self-Refine: Iterative Refinement with Self-Feedback
Pith reviewed 2026-05-10 20:43 UTC · model grok-4.3
The pith
Large language models can improve their own outputs by iteratively generating feedback and refinements without any training or extra models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Self-Refine demonstrates that the same LLM can generate an initial response, produce specific feedback on its shortcomings, and then produce an improved response based on that feedback, repeating the cycle as needed. When applied across dialog response generation, mathematical reasoning, and other tasks, this iterative self-correction yields outputs that both humans and metrics rate higher than the model's direct one-shot answers, with average task performance rising by about 20 percent absolute.
What carries the argument
Self-Refine, the three-step loop in which one LLM generates an output, writes feedback on it, and then rewrites the output to address the feedback, all without external supervision.
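To make the loop concrete, here is a minimal sketch in Python, assuming only a generic `llm` completion function and an illustrative stop phrase; the paper's actual prompts and stopping criteria are task-specific, so none of the literals below should be read as the authors' templates.

```python
from typing import Callable

def self_refine(
    task_input: str,
    llm: Callable[[str], str],
    max_iters: int = 3,
    stop_marker: str = "NO FURTHER ISSUES",
) -> str:
    """One model plays generator, critic, and refiner in turn."""
    # Step 1: initial generation.
    output = llm(f"Task: {task_input}\nAnswer:")
    for _ in range(max_iters):
        # Step 2: the same model critiques its own draft.
        feedback = llm(
            f"Task: {task_input}\nDraft: {output}\n"
            f"Give specific, actionable feedback, or reply '{stop_marker}'."
        )
        if stop_marker in feedback:
            break  # the model judges its own draft good enough
        # Step 3: the same model rewrites the draft to address the feedback.
        output = llm(
            f"Task: {task_input}\nDraft: {output}\nFeedback: {feedback}\n"
            "Rewrite the draft so every feedback point is addressed:"
        )
    return output
```

The essential design choice is that no second model, reward model, or training step appears anywhere: the whole mechanism is three prompts routed through one frozen LLM.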
If this is right
- Task performance rises by roughly 20 percent absolute on average over direct generation across dialog, reasoning, and similar problems.
- Human evaluators consistently prefer the outputs after self-refinement to the initial one-step versions.
- The gains hold for current top models such as GPT-4 without requiring any new training data or reinforcement learning.
- The method applies uniformly to the seven tested tasks without task-specific engineering.
Where Pith is reading between the lines
- Test-time iteration of this kind could serve as a lightweight substitute for additional pretraining or fine-tuning on some tasks.
- The approach may reduce certain error types such as factual inconsistencies if the feedback step reliably catches them.
- Combining the loop with existing prompting styles like chain-of-thought could produce further additive gains (see the sketch after this list).
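A minimal sketch of that combination, reusing the `self_refine` function from the sketch above: only the initial prompt changes, so any chain-of-thought gain and any refinement gain could in principle stack.

```python
from typing import Callable

def cot_then_refine(task_input: str, llm: Callable[[str], str]) -> str:
    # Zero-shot chain-of-thought trigger on the initial generation only;
    # the feedback and refine steps are unchanged from the plain loop.
    cot_input = f"{task_input}\nLet's think step by step."
    return self_refine(cot_input, llm)
```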
Load-bearing premise
The LLM must be able to produce accurate and actionable feedback on its own outputs that genuinely leads to better results rather than neutral changes or new mistakes.
What would settle it
A controlled test on any of the evaluated tasks in which multiple rounds of Self-Refine produce outputs that score no better than, or worse than, the model's standard single-pass generation on the same human or automatic metrics, as sketched below.
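A sketch of that settling experiment, assuming a shared `metric(item, output) -> float` scorer and the `llm` and `self_refine` interfaces from the earlier sketch; a `mean_gain` at or below zero on a task where the paper reports gains would settle the claim against it there.

```python
from statistics import mean
from typing import Callable, Sequence

def settling_test(
    items: Sequence[str],
    llm: Callable[[str], str],
    metric: Callable[[str, str], float],
) -> dict:
    base = [metric(x, llm(f"Task: {x}\nAnswer:")) for x in items]
    refined = [metric(x, self_refine(x, llm)) for x in items]
    return {
        "baseline_mean": mean(base),
        "refined_mean": mean(refined),
        # <= 0 here would settle the claim against Self-Refine on this task
        "mean_gain": mean(r - b for r, b in zip(refined, base)),
    }
```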
Original abstract
Like humans, large language models (LLMs) do not always generate the best output on their first try. Motivated by how humans refine their written text, we introduce Self-Refine, an approach for improving initial outputs from LLMs through iterative feedback and refinement. The main idea is to generate an initial output using an LLM; then, the same LLM provides feedback for its output and uses it to refine itself, iteratively. Self-Refine does not require any supervised training data, additional training, or reinforcement learning, and instead uses a single LLM as the generator, refiner, and feedback provider. We evaluate Self-Refine across 7 diverse tasks, ranging from dialog response generation to mathematical reasoning, using state-of-the-art (GPT-3.5, ChatGPT, and GPT-4) LLMs. Across all evaluated tasks, outputs generated with Self-Refine are preferred by humans and automatic metrics over those generated with the same LLM using conventional one-step generation, improving by ~20% absolute on average in task performance. Our work demonstrates that even state-of-the-art LLMs like GPT-4 can be further improved at test time using our simple, standalone approach.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Self-Refine, a training-free iterative method in which a single LLM first generates an initial output, then uses the same model to produce self-feedback on that output, and finally refines the output based on the feedback; the process can be repeated. The approach is evaluated on seven diverse tasks (dialogue, reasoning, code generation, etc.) with GPT-3.5, ChatGPT, and GPT-4, claiming that Self-Refine outputs are preferred by both human judges and automatic metrics over standard one-step generation, with an average absolute improvement of approximately 20%.
Significance. If the reported gains are shown to arise from genuine self-refinement rather than confounds, the result would be significant: it would demonstrate that current frontier LLMs can be improved at inference time through simple, standalone self-interaction without any additional training data, RL, or external models, providing a broadly applicable technique across NLP tasks.
major comments (4)
- [Evaluation / Results] The central empirical claim rests on the unverified assumption that the LLM produces accurate and actionable self-feedback. The manuscript provides no quantitative breakdown (e.g., human or automatic annotation of feedback correctness, error-identification rate, or adherence rate in the subsequent refinement step) in the evaluation or results sections; without this, it remains possible that the ~20% average lift arises from repeated sampling, longer context, or extra inference steps rather than iterative self-correction.
- [Experiments] No controls are reported for output length or total token usage. Iterative refinement typically produces longer responses; the paper does not compare against length-matched baselines or report token counts, leaving open the possibility that metric improvements (especially on tasks where verbosity correlates with quality) are partly driven by this confound rather than the refinement mechanism itself.
- [Results] The ~20% average improvement is presented without per-task variances, statistical significance tests, confidence intervals, or the exact number of iterations used per task and model. These details are necessary to establish that the gains are robust and not driven by a subset of tasks or unstable runs.
- [Method] Prompt templates for the initial generation, feedback, and refinement stages are not provided in sufficient detail (or in an appendix), which prevents exact reproduction and makes it impossible to determine whether the self-feedback prompts were carefully engineered or whether the method generalizes beyond the specific prompts used.
minor comments (2)
- [Abstract] The abstract states an average ~20% absolute improvement but does not specify which automatic metrics were used for each task; adding this information would improve clarity.
- [Related Work] Related work on self-consistency, chain-of-thought, and other test-time scaling methods is mentioned but could be expanded with more precise comparisons of computational cost and performance deltas.
Simulated Author's Rebuttal
Thank you for your thorough and constructive review. We appreciate the feedback and will address each major comment below, proposing specific revisions to strengthen the manuscript where appropriate.
Point-by-point responses
-
Referee: [Evaluation / Results] The central empirical claim rests on the unverified assumption that the LLM produces accurate and actionable self-feedback. The manuscript provides no quantitative breakdown (e.g., human or automatic annotation of feedback correctness, error-identification rate, or adherence rate in the subsequent refinement step) in the evaluation or results sections; without this, it remains possible that the ~20% average lift arises from repeated sampling, longer context, or extra inference steps rather than iterative self-correction.
Authors: We agree that a direct quantitative analysis of self-feedback quality would provide stronger support for the mechanism. Although human preference judgments and automatic metric gains indicate effective refinements, we will add a new analysis subsection reporting human-annotated feedback correctness, error identification rates, and adherence in the refinement step on sampled instances from multiple tasks. To address confounds such as repeated sampling or extra steps, we will also include comparisons against best-of-n sampling baselines with matched inference budgets. revision: yes
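A budget-matched best-of-n control of the kind the response proposes might look like the following sketch; `score` is an assumed output scorer, and the call-count bookkeeping, not the sampling itself, is the point.

```python
from typing import Callable

def best_of_n(
    task_input: str,
    llm: Callable[[str], str],
    score: Callable[[str], float],
    n_calls: int,
) -> str:
    """Spend the same LLM-call budget on independent samples, not refinement."""
    samples = [llm(f"Task: {task_input}\nAnswer:") for _ in range(n_calls)]
    return max(samples, key=score)
```

Under this accounting, a Self-Refine run with three full iterations costs roughly seven calls (one initial generation plus three feedback and three refine steps), so the matched control would be `best_of_n(..., n_calls=7)`.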
-
Referee: [Experiments] No controls are reported for output length or total token usage. Iterative refinement typically produces longer responses; the paper does not compare against length-matched baselines or report token counts, leaving open the possibility that metric improvements (especially on tasks where verbosity correlates with quality) are partly driven by this confound rather than the refinement mechanism itself.
Authors: We acknowledge the importance of controlling for length and token usage. In the revision, we will report average token counts and output lengths for baseline and Self-Refine outputs across all tasks and models. We will further add length-matched baseline comparisons, for example by constraining generation length in the one-step baseline or by length-normalized evaluation. revision: yes
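A minimal sketch of the proposed length reporting, using whitespace token counts as a stand-in for the model tokenizer (an assumption, since the relevant tokenization is model-specific):

```python
from statistics import mean
from typing import Sequence

def length_report(outputs: Sequence[str], scores: Sequence[float]) -> dict:
    """Report token counts alongside a crude length-normalized score."""
    toks = [len(o.split()) for o in outputs]
    return {
        "mean_tokens": mean(toks),
        "mean_score": mean(scores),
        # flags cases where score gains merely track verbosity
        "score_per_100_tokens": mean(
            100 * s / max(t, 1) for s, t in zip(scores, toks)
        ),
    }
```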
-
Referee: [Results] The ~20% average improvement is presented without per-task variances, statistical significance tests, confidence intervals, or the exact number of iterations used per task and model. These details are necessary to establish that the gains are robust and not driven by a subset of tasks or unstable runs.
Authors: We will revise the results section to include per-task scores with standard deviations, 95% confidence intervals, and statistical significance tests (paired t-tests or equivalent) between Self-Refine and baselines. We will also explicitly state the iteration counts used per task and model (typically 2–3 iterations or until convergence). revision: yes
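The promised statistics could come from something as simple as the following sketch: a paired t-test via scipy plus a bootstrap 95% confidence interval on per-item score differences. The 10,000-resample default is an illustrative choice, not the paper's.

```python
import random
from statistics import mean
from scipy.stats import ttest_rel

def paired_stats(
    base: list[float], refined: list[float], n_boot: int = 10_000
) -> dict:
    t, p = ttest_rel(refined, base)  # paired test: same items, two conditions
    diffs = [r - b for r, b in zip(refined, base)]
    boots = sorted(
        mean(random.choices(diffs, k=len(diffs))) for _ in range(n_boot)
    )
    return {
        "t": t,
        "p": p,
        "mean_diff": mean(diffs),
        "ci95": (boots[int(0.025 * n_boot)], boots[int(0.975 * n_boot)]),
    }
```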
-
Referee: [Method] Prompt templates for the initial generation, feedback, and refinement stages are not provided in sufficient detail (or in an appendix), which prevents exact reproduction and makes it impossible to determine whether the self-feedback prompts were carefully engineered or whether the method generalizes beyond the specific prompts used.
Authors: We apologize for the omission. The revised manuscript will include all prompt templates in full detail in a dedicated appendix, covering the exact wording for initial generation, feedback, and refinement stages for each task and model. revision: yes
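For concreteness, illustrative templates for the three stages might look like the following; these are hypothetical stand-ins written for this review, not the authors' templates, which is exactly the gap the appendix would close.

```python
# Hypothetical three-stage templates; placeholders are filled with
# str.format(task=..., draft=..., feedback=...).
INIT_PROMPT = "Task: {task}\nWrite your best answer.\nAnswer:"

FEEDBACK_PROMPT = (
    "Task: {task}\nDraft answer: {draft}\n"
    "List concrete problems with the draft (factual errors, missing steps, "
    "unclear wording). If there are none, reply 'NO FURTHER ISSUES'.\nFeedback:"
)

REFINE_PROMPT = (
    "Task: {task}\nDraft answer: {draft}\nFeedback: {feedback}\n"
    "Rewrite the draft so that every feedback point is addressed.\nRevised answer:"
)
```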
Circularity Check
No significant circularity; purely empirical evaluation
Full rationale
The paper introduces Self-Refine as an empirical prompting technique that uses the same LLM for generation, feedback, and refinement, then evaluates it on seven tasks against one-step baselines. No equations, derivations, fitted parameters, or first-principles predictions appear in the provided text. Claims rest on human and automatic metric comparisons showing ~20% average gains, not on any reduction of outputs to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes that would collapse the result. The core assumption about feedback quality is an unverified empirical hypothesis tested only via downstream task metrics, which is a validity concern rather than circularity under the defined patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: A single LLM can generate useful, actionable feedback on its own outputs that leads to measurable improvement when used for refinement.
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.RealityFromDistinction.reality_from_one_distinction (link strength: unclear), anchored to the abstract phrase "the same LLM provides feedback for its output and uses it to refine itself, iteratively"
Forward citations
Cited by 60 Pith papers
-
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
-
Moltbook Moderation: Uncovering Hidden Intent Through Multi-Turn Dialogue
Bot-Mod uses multi-turn dialogue guided by Gibbs sampling over intent hypotheses to identify malicious agent behavior in communities, showing reliable detection with low false positives on a Moltbook-derived dataset.
-
Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents
PROBE structures runtime telemetry into diagnoses and evidence-grounded guidance, raising recovery rates by 12.45 points over baselines on 257 unresolved software repair and AIOps cases.
-
CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging
CUDABeaver shows LLM CUDA debuggers often degenerate code for test-passing at the cost of speed, with protocol-aware metrics shifting success rates by up to 40 percentage points.
-
Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates
In 30-step recursive LLM loops, append-mode persistent escape from source basins reaches 50% near 400 tokens under full history but plateaus below 50% under tail-clip memory policy, while replace-mode switching largel...
-
From Static Analysis to Audience Dissemination: A Training-Free Multimodal Controversy Detection Multi-Agent Framework
AuDisAgent reformulates multimodal controversy detection as a dynamic audience dissemination process using screening, panel discussion, and arbitration agents, plus comment bootstrapping, and reports outperforming pri...
-
GSAR: Typed Grounding for Hallucination Detection and Recovery in Multi-Agent LLMs
GSAR is a grounding-evaluation framework for multi-agent LLMs that uses a four-way claim typology, evidence-weighted asymmetric scoring, and tiered recovery decisions to detect and mitigate hallucinations.
-
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
-
Synthesizing Multi-Agent Harnesses for Vulnerability Discovery
AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new z...
-
HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs
HiPO improves LLM reasoning performance by optimizing preferences separately on response segments rather than entire outputs.
-
Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery
Refute-or-Promote applies adversarial multi-agent review with kill gates and empirical verification to filter LLM defect candidates, killing 79-83% before disclosure and yielding 4 CVEs plus multiple accepted fixes ac...
-
Credo: Declarative Control of LLM Pipelines via Beliefs and Policies
Credo proposes representing LLM agent state as beliefs and regulating pipeline behavior with declarative policies stored in a database for adaptive, auditable control.
-
Constraint-Aware Corrective Memory for Language-Based Drug Discovery Agents
CACM improves language-based drug discovery agents by 36.4% via protocol auditing, a grounded diagnostician, and compressed static/dynamic/corrective memory channels that localize failures and bias corrections.
-
An End-to-End Approach for Fixing Concurrency Bugs via SHB-Based Context Extractor
ConFixAgent repairs diverse concurrency bugs end-to-end by using Static Happens-Before graphs to extract relevant code context for LLMs, outperforming prior tools in benchmarks.
-
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
-
Self-Rewarding Language Models
Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
-
Large Language Models as Optimizers
Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-des...
-
Reflexion: Language Agents with Verbal Reinforcement Learning
Reflexion lets LLM agents improve via stored verbal reflections on task feedback, reaching 91% pass@1 on HumanEval and outperforming prior GPT-4 results.
-
RTLC -- Research, Teach-to-Learn, Critique: A three-stage prompting paradigm inspired by the Feynman Learning Technique that lifts LLM-as-judge accuracy on JudgeBench with no fine-tuning
RTLC prompting lifts Claude 3.7 Sonnet pairwise accuracy on 350 hard JudgeBench items from 64.6% to 78.6% via a Research-Teach-Critique scaffold that beats self-consistency.
-
Learning, Fast and Slow: Towards LLMs That Adapt Continually
Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.
-
PIVOT: Bridging Planning and Execution in LLM Agents via Trajectory Refinement
PIVOT refines LLM agent trajectories through plan-inspect-evolve-verify stages using environment feedback, yielding up to 94% relative gains in constraint satisfaction and 3-5x token efficiency over prior refinement methods.
-
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
-
SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents
SkillLens organizes skills into policies-strategies-procedures-primitives layers, retrieves via degree-corrected random walk, and uses a verifier for local adaptation, yielding up to 6.31 pp gains on MuLocbench and ra...
-
Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts
Reasoning language models extract answers from sparse, order-shuffled chain-of-thought traces with little accuracy loss.
-
LoopTrap: Termination Poisoning Attacks on LLM Agents
LoopTrap is an automated red-teaming framework that crafts termination-poisoning prompts to amplify LLM agent steps by 3.57x on average (up to 25x) across 8 agents.
-
Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA
Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks whe...
-
When Stress Becomes Signal: Detecting Antifragility-Compatible Regimes in Multi-Agent LLM Systems
CAFE detects positive distributional Jensen Gaps across five multi-agent LLM architectures on a banking-risk benchmark, showing that quality drops under semantic stress can coexist with statistically detectable antifr...
-
FitText: Evolving Agent Tool Ecologies via Memetic Retrieval
FitText embeds memetic evolutionary retrieval inside the agent's reasoning loop to iteratively refine pseudo-tool descriptions, raising retrieval rank from 8.81 to 2.78 on ToolRet and pass rate to 0.73 on StableToolBench.
-
Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents: A Safety-Gated MCP Architecture
RL Developer Memory is a feedback-normalized, safety-gated memory architecture for RL coding agents that logs contextual decisions and applies conservative off-policy gates to maintain 80% decision accuracy and full h...
-
Affordance Agent Harness: Verification-Gated Skill Orchestration
Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tra...
-
From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction
Schema-aware iterative extraction turns AI memory into a verified system of record, reaching 90-97% accuracy on extraction and end-to-end memory benchmarks where retrieval baselines score 80-87%.
-
Efficient Rationale-based Retrieval: On-policy Distillation from Generative Rerankers based on JEPA
Rabtriever distills a generative reranker into an efficient independent encoder using JEPA and auxiliary reverse KL loss to achieve linear complexity and strong performance on rationale-based retrieval tasks.
-
QRAFTI: An Agentic Framework for Empirical Research in Quantitative Finance
QRAFTI is a multi-agent framework using tool-calling and reflection-based planning to emulate quant research tasks like factor replication and signal testing on financial data.
-
Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning
A new RL paradigm for reasoning where models generate their own internal process supervision from outcome feedback by recycling failed trajectories.
-
Preregistered Belief Revision Contracts
PBRC is a contract protocol that enforces evidential belief updates in deliberative multi-agent systems and proves it prevents conformity-driven false cascades under conservative fallbacks.
-
ReflectCAP: Detailed Image Captioning with Reflective Memory
ReflectCAP distills model-specific hallucination and oversight patterns into Structured Reflection Notes that steer LVLMs toward more factual and complete image captions, reaching the Pareto frontier on factuality-cov...
-
Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization
Frontier-Eng is a new benchmark for generative optimization in engineering where agents iteratively improve designs under fixed interaction budgets using executable verifiers, with top models like GPT 5.4 showing limi...
-
Structured Safety Auditing for Balancing Code Correctness and Content Safety in LLM-Generated Code
Dual Reasoning with explicit safety audits improves the new SUDS metric by 1.32x to 3.42x over baselines on code generation benchmarks containing injected harmful keywords.
-
QoS-QoE Translation with Large Language Model
A new QoS-QoE Translation dataset is constructed from multimedia literature and fine-tuned LLMs demonstrate strong performance on bidirectional continuous and discrete QoS-QoE predictions.
-
SeLaR: Selective Latent Reasoning in Large Language Models
SeLaR selectively applies latent soft reasoning in LLMs via entropy gating and contrastive regularization, outperforming standard CoT on five benchmarks without training.
-
Ensemble-Based Uncertainty Estimation for Code Correctness Estimation
Ensemble Semantic Entropy improves correlation with code correctness over single-model methods and powers a cascading scaling system that cuts FLOPs by 64.9% while preserving performance on LiveCodeBench.
-
Towards an AI co-scientist
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
-
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.
-
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
-
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.
-
Improving Factuality and Reasoning in Language Models through Multiagent Debate
Multiagent debate among LLMs improves mathematical reasoning, strategic reasoning, and factual accuracy while reducing hallucinations.
-
Teaching Large Language Models to Self-Debug
Self-Debugging teaches LLMs to identify and fix their own code errors through rubber-duck-style natural language explanations and execution feedback, delivering 2-12% gains over baselines on Spider, TransCoder, and MBPP.
-
ALGOGEN: Tool-Generated Verifiable Traces for Reliable Algorithm Visualization
ALGOGEN improves LLM-generated algorithm visualizations by splitting simulation into traceable JSON outputs via Visualization Trace Algebra and using Rendering Style Language for reliable rendering, raising success ra...
-
AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks
Single-agent LLM frameworks outperform naive multi-agent systems in multimodal clinical risk prediction tasks and are better calibrated.
-
PYTHALAB-MERA: Validation-Grounded Memory, Retrieval, and Acceptance Control for Frozen-LLM Coding Agents
An external controller for frozen LLMs raises strict validation success on three RL coding tasks from 0/9 to 8/9 by selecting memory records and skills, running fail-fast checks, and propagating credit via eligibility traces.
-
Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes
Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.
-
Multi-Agent Reasoning Improves Compute Efficiency: Pareto-Optimal Test-Time Scaling
Multi-agent debate and mixture-of-agents outperform self-consistency by 1.3 and 2.7 percentage points respectively at equal compute budgets on MMLU-Pro and BBH, with advantages that continue at higher scales while sel...
-
State Representation and Termination for Recursive Reasoning Systems
Recursive reasoning systems can represent their state via an epistemic state graph and terminate when the linearized order-gap is non-degenerate near the fixed point, providing a local condition for when the stopping ...
-
Bolzano: Case Studies in LLM-Assisted Mathematical Research
A multi-agent LLM system autonomously produced publishable results on five out of eight mathematical and theoretical computer science problems.
-
HYPERHEURIST: A Simulated Annealing-Based Control Framework for LLM-Driven Code Generation in Optimized Hardware Design
HYPERHEURIST uses simulated annealing to refine functionally validated LLM-generated RTL designs, producing more stable PPA optimization than single-pass LLM generation across eight benchmarks.
-
Your Model Diversity, Not Method, Determines Reasoning Strategy
The optimal reasoning strategy for LLMs depends on the model's diversity profile rather than the exploration method itself.
-
Agent Mentor: Framing Agent Knowledge through Semantic Trajectory Analysis
Agent Mentor analyzes semantic trajectories in agent logs to identify undesired behaviors and derives corrective prompt instructions, yielding measurable accuracy gains on benchmark tasks across three agent setups.
-
STaR-DRO: Stateful Tsallis Reweighting for Group-Robust Structured Prediction
STaR-DRO applies momentum-smoothed Tsallis reweighting to focus learning on hard groups in structured prediction, yielding F1 gains on clinical label extraction.