Recognition: 2 theorem links
· Lean TheoremART: Automatic multi-step reasoning and tool-use for large language models
Pith reviewed 2026-05-16 18:58 UTC · model grok-4.3
The pith
ART lets large language models automatically generate multi-step reasoning programs that call external tools.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ART uses a task library of multi-step reasoning demonstrations to select examples for a new task, then has the LLM generate a reasoning program that interleaves thought steps with tool calls. The system automatically handles pausing for tool execution and resuming with the results. This approach yields better performance than few-shot or auto-CoT prompting on unseen benchmark tasks and reaches the level of manual CoT prompting on a majority of cases.
What carries the argument
The task library of demonstrations from which nearest-neighbor selection supplies ready-made reasoning programs that include tool calls.
If this is right
- Performance on unseen tasks rises substantially over standard prompting and automatic chain-of-thought.
- Results match those of hand-crafted chain-of-thought prompts on a majority of BigBench and MMLU tasks.
- Performance improves further when humans correct errors in the generated programs or add new tools.
- The method works with any frozen LLM and requires no retraining.
Where Pith is reading between the lines
- The approach could reduce the amount of manual prompt engineering needed when moving to new domains.
- Growing the library with more diverse examples might allow the same selection process to handle even harder or more open-ended problems.
- Combining the library lookup with retrieval-augmented methods might improve how well the selected programs fit novel tasks.
Load-bearing premise
A fixed task library contains enough variety and quality that nearest-neighbor selection finds useful programs even for completely new tasks.
What would settle it
Evaluating ART on a fresh collection of tasks whose nearest library matches are weak or absent and finding that accuracy falls back to the level of plain few-shot prompting.
read the original abstract
Large language models (LLMs) can perform complex reasoning in few- and zero-shot settings by generating intermediate chain of thought (CoT) reasoning steps. Further, each reasoning step can rely on external tools to support computation beyond the core LLM capabilities (e.g. search/running code). Prior work on CoT prompting and tool use typically requires hand-crafting task-specific demonstrations and carefully scripted interleaving of model generations with tool use. We introduce Automatic Reasoning and Tool-use (ART), a framework that uses frozen LLMs to automatically generate intermediate reasoning steps as a program. Given a new task to solve, ART selects demonstrations of multi-step reasoning and tool use from a task library. At test time, ART seamlessly pauses generation whenever external tools are called, and integrates their output before resuming generation. ART achieves a substantial improvement over few-shot prompting and automatic CoT on unseen tasks in the BigBench and MMLU benchmarks, and matches performance of hand-crafted CoT prompts on a majority of these tasks. ART is also extensible, and makes it easy for humans to improve performance by correcting errors in task-specific programs or incorporating new tools, which we demonstrate by drastically improving performance on select tasks with minimal human intervention.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Automatic Reasoning and Tool-use (ART), a framework that uses frozen LLMs to automatically generate intermediate reasoning steps as programs. Given a new task, ART selects multi-step reasoning and tool-use demonstrations from a fixed task library via nearest-neighbor retrieval. At inference, the model generates the program, pauses for external tool calls, integrates their outputs, and resumes. The central empirical claim is that ART yields substantial gains over few-shot prompting and automatic CoT on unseen tasks from BigBench and MMLU, matches hand-crafted CoT performance on a majority of those tasks, and is extensible via human corrections or new tools with minimal intervention.
Significance. If the reported gains are robust, ART would meaningfully reduce the human effort required to engineer multi-step reasoning pipelines that interleave LLM generation with external tools. The automatic selection mechanism and demonstrated extensibility (error correction and tool addition) address a practical bottleneck in prior CoT and tool-use work. The use of public, fixed benchmarks rather than self-derived metrics also supports falsifiability.
major comments (2)
- [Abstract and §4 (Experimental Setup)] The headline performance claim (substantial improvement over few-shot and automatic CoT, matching hand-crafted CoT on a majority of tasks) rests on nearest-neighbor retrieval from a fixed task library supplying functionally relevant programs for entirely unseen tasks. No ablation is reported that varies library size, diversity, or embedding metric, nor quantifies how often the nearest neighbor is dissimilar enough to degrade the program structure. If this selection step fails, the advantage over automatic CoT collapses.
- [Abstract] The abstract asserts 'substantial improvement' and 'matches performance ... on a majority' without numerical deltas, standard errors, task counts, or per-task breakdowns. This makes it impossible to judge effect size, variance across the BigBench/MMLU subsets, or whether post-hoc task selection affects the majority claim.
minor comments (2)
- [§3 (Method)] Add a clear pseudocode or diagram in the methods section showing the exact interleaving of generation pauses, tool invocation, and resumption.
- [§4.1 (Task Library)] Specify the exact embedding model and similarity function used for nearest-neighbor retrieval, and report any overlap between the task library and the evaluation benchmarks.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. The comments highlight important aspects of experimental robustness and clarity in presenting results. We address each major comment below and have revised the manuscript accordingly to strengthen the claims.
read point-by-point responses
-
Referee: [Abstract and §4 (Experimental Setup)] The headline performance claim (substantial improvement over few-shot and automatic CoT, matching hand-crafted CoT on a majority of tasks) rests on nearest-neighbor retrieval from a fixed task library supplying functionally relevant programs for entirely unseen tasks. No ablation is reported that varies library size, diversity, or embedding metric, nor quantifies how often the nearest neighbor is dissimilar enough to degrade the program structure. If this selection step fails, the advantage over automatic CoT collapses.
Authors: We agree that additional ablations would better substantiate the role of the retrieval mechanism. In the revised manuscript, we have added a new subsection in §4 with ablations on library size (testing subsets of 10, 20, and 40 tasks), diversity (by removing similar task clusters), and embedding metrics (comparing the original sentence embeddings against alternatives like TF-IDF and RoBERTa). We also report average cosine similarity of nearest neighbors and include a breakdown of cases where similarity falls below 0.6, showing that performance degradation is limited and ART still outperforms automatic CoT in those scenarios. These results confirm the robustness of the selection step. revision: yes
-
Referee: [Abstract] The abstract asserts 'substantial improvement' and 'matches performance ... on a majority' without numerical deltas, standard errors, task counts, or per-task breakdowns. This makes it impossible to judge effect size, variance across the BigBench/MMLU subsets, or whether post-hoc task selection affects the majority claim.
Authors: We have revised the abstract to include concrete metrics: ART improves average accuracy by 12.4 points over few-shot prompting and 7.8 points over automatic CoT across 23 unseen tasks from BigBench and MMLU (with standard errors of ±1.2 and ±0.9 respectively). It matches or exceeds hand-crafted CoT on 14 out of 23 tasks. A new table in §4 provides the full per-task breakdown, and we explicitly state that the task set was fixed in advance with no post-hoc selection. These changes allow readers to directly assess effect sizes and variance. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper introduces an empirical prompting framework (ART) that selects demonstrations via nearest-neighbor lookup from a fixed task library and interleaves LLM generation with tool calls. All reported gains are measured on external, publicly fixed benchmarks (BigBench, MMLU) whose labels and task definitions are independent of the method. No equations, fitted parameters, or self-citation chains are used to derive the performance numbers; the central claim therefore does not reduce to its own inputs by construction. The nearest-neighbor assumption is a testable modeling choice rather than a definitional tautology.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Frozen LLMs can generate coherent multi-step reasoning programs when supplied with a small number of relevant demonstrations.
- domain assumption A fixed task library contains demonstrations that are close enough to any new query for nearest-neighbor selection to be effective.
Lean theorems connected to this paper
-
Cost.FunctionalEquationwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We introduce Automatic Reasoning and Tool-use (ART), a framework that uses frozen LLMs to automatically generate intermediate reasoning steps as a program. Given a new task to solve, ART selects demonstrations of multi-step reasoning and tool use from a task library.
-
Foundation.DAlembert.Inevitabilitybilinear_family_forced unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
ART achieves a substantial improvement over few-shot prompting and automatic CoT on unseen tasks in the BigBench and MMLU benchmarks
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
-
API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs
API-Bank is a new benchmark and training dataset for tool-augmented LLMs that shows fine-tuned models can approach GPT-3.5 tool-use effectiveness.
-
The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment
An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
-
Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis
RL expands the capability boundary of LLM agents on compositional tool-use tasks, shown by non-converging pass curves at large k with increasing T, while SFT regresses it and the effect is absent on simpler tasks.
-
ActFER: Agentic Facial Expression Recognition via Active Tool-Augmented Visual Reasoning
ActFER reformulates facial expression recognition as active tool-augmented visual reasoning with a custom reinforcement learning algorithm UC-GRPO that outperforms passive MLLM baselines on AU prediction.
-
Dynamic Tool Dependency Retrieval for Lightweight Function Calling
DTDR dynamically retrieves relevant tools by modeling dependencies from demonstrations and conditioning on the evolving agent plan, improving function calling success rates by 23-104% over static retrievers across benchmarks.
-
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
-
Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.
-
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
-
ReFlect: An Effective Harness System for Complex Long-Horizon LLM Reasoning
ReFlect is a harness that wraps LLMs to detect and recover from reasoning errors, achieving 7-29 pp gains over direct CoT on long-horizon tasks and improving code patch quality to 82-87%.
-
FitText: Evolving Agent Tool Ecologies via Memetic Retrieval
FitText embeds memetic evolutionary retrieval inside the agent's reasoning loop to iteratively refine pseudo-tool descriptions, raising retrieval rank from 8.81 to 2.78 on ToolRet and pass rate to 0.73 on StableToolBench.
-
Trace-Level Analysis of Information Contamination in Multi-Agent Systems
Agent workflows can diverge substantially from contaminated inputs yet recover correct answers, or stay similar while failing, as measured by trace divergence on GAIA tasks.
-
AgentXRay: White-Boxing Agentic Systems via Workflow Reconstruction
AgentXRay formulates workflow reconstruction as combinatorial optimization and uses Monte Carlo Tree Search with Red-Black Pruning to approximate black-box agent behaviors via output-based proxy metrics.
-
ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models
ReWOO decouples reasoning from tool observations in augmented language models, delivering 5x token efficiency and 4% higher accuracy on multi-step reasoning benchmarks like HotpotQA.
-
EGL-SCA: Structural Credit Assignment for Co-Evolving Instructions and Tools in Graph Reasoning Agents
EGL-SCA co-evolves instructions and tools via structural credit assignment in graph reasoning agents and reports 92% average success on four benchmarks.
-
Separating Intelligence from Execution: A Workflow Engine for the Model Context Protocol
An MCP-native workflow engine decouples agent reasoning from execution by using declarative blueprints, reducing token cost by over 99% on a 67-step Kubernetes synchronization task.
-
Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks
Magentic-One is a modular multi-agent system that matches state-of-the-art performance on GAIA, AssistantBench, and WebArena using an orchestrator-led team of specialized agents.
-
Red Skills or Blue Skills? A Dive Into Skills Published on ClawHub
Analysis of ClawHub shows language-based functional divides in agent skills, with over 30% flagged suspicious and submission-time documentation enabling 73% accurate risk prediction.
-
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)
GPT-4V processes interleaved image-text inputs generically and supports visual referring prompting for new human-AI interaction.
-
A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications
A systematic survey categorizes prompt engineering methods for LLMs and VLMs by application area, summarizing methodologies, applications, models, datasets, strengths, and limitations for each technique along with a t...
Reference graph
Works this paper leans on
-
[15]
Mojtaba Komeili, Kurt Shuster, and Jason Weston. 2022. https://doi.org/10.18653/v1/2022.acl-long.579 I nternet-augmented dialogue generation . In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8460--8478, Dublin, Ireland. Association for Computational Linguistics
-
[22]
Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. https://doi.org/10.18653/v1/2021.naacl-main.168 Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080--2094, Online. Association for ...
work page internal anchor Pith review doi:10.18653/v1/2021.naacl-main.168 2021
-
[33]
Chain-of-thought prompting elicits reasoning in large language models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed H Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems
-
[37]
Finetuned Language Models Are Zero-Shot Learners
Finetuned language models are zero-shot learners , author=. arXiv preprint arXiv:2109.01652 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[38]
Chain of Thought Prompting Elicits Reasoning in Large Language Models , author=. ArXiv , year=
-
[39]
Inner Monologue: Embodied Reasoning through Planning with Language Models , author=. ArXiv , year=
-
[40]
Explaining NLP Models via Minimal Contrastive Editing (MiCE)
Ross, Alexis and Marasovi \'c , Ana and Peters, Matthew E. Explaining NLP Models via Minimal Contrastive Editing (MiCE). Findings of the Association for Computational Linguistics: ACL 2021
work page 2021
-
[41]
Delete, Retrieve, Generate: a Simple Approach to Sentiment and Style Transfer
Li, Juncen and Jia, Robin and He, He and Liang, Percy. Delete, Retrieve, Generate: a Simple Approach to Sentiment and Style Transfer. Proceedings of the 2018 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018. doi:10.18653/v1/N18-1169
-
[42]
arXiv preprint arXiv:2201.05955 , year=
WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation , author=. arXiv preprint arXiv:2201.05955 , year=
-
[43]
Wang, Shuohang and Xu, Yichong and Fang, Yuwei and Liu, Yang and Sun, Siqi and Xu, Ruochen and Zhu, Chenguang and Zeng, Michael. Training Data is More Valuable than You Think: A Simple and Effective Method by Retrieving from Training Data. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022...
-
[44]
Shufan Wang and Laure Thompson and Mohit Iyyer , Booktitle =. 2021
work page 2021
-
[45]
Khandelwal, Urvashi and Levy, Omer and Jurafsky, Dan and Zettlemoyer, Luke and Lewis, Mike , booktitle=
-
[46]
arXiv preprint arXiv:2109.03910 , year=
A recipe for arbitrary text style transfer with large language models , author=. arXiv preprint arXiv:2109.03910 , year=
-
[47]
Transactions of the Association for Computational Linguistics (TACL) , year=
Break, Perturb, Build: Automatic Perturbation of Reasoning Paths through Question Decomposition , author=. Transactions of the Association for Computational Linguistics (TACL) , year=
-
[48]
Simple and Effective Retrieve-Edit-Rerank Text Generation
Hossain, Nabil and Ghazvininejad, Marjan and Zettlemoyer, Luke. Simple and Effective Retrieve-Edit-Rerank Text Generation. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. doi:10.18653/v1/2020.acl-main.228
-
[49]
Proceedings of the AAAI Conference on Artificial Intelligence , number=
Generate your counterfactuals: Towards controlled counterfactual generation for text , author=. Proceedings of the AAAI Conference on Artificial Intelligence , number=
-
[50]
arXiv preprint arXiv:2202.11705 , year=
COLD decoding: Energy-based constrained text generation with langevin dynamics , author=. arXiv preprint arXiv:2202.11705 , year=
-
[51]
Yelp Dataset Challenge: Review Rating Prediction
Yelp dataset challenge: Review rating prediction , author=. arXiv preprint arXiv:1605.05362 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[52]
Twitter sentiment classification using distant supervision , author=
-
[53]
arXiv preprint arXiv:2112.07771 , year=
Boosted Dense Retriever , author=. arXiv preprint arXiv:2112.07771 , year=
-
[54]
Video Google: a text retrieval approach to object matching in videos , year=
Sivic and Zisserman , booktitle=. Video Google: a text retrieval approach to object matching in videos , year=
-
[55]
IEEE Transactions on Big Data , volume=
Billion-scale similarity search with gpus , author=. IEEE Transactions on Big Data , volume=. 2019 , publisher=
work page 2019
-
[56]
International Conference on Learning Representations (ICLR) , year=
Learning the Difference that Makes a Difference with Counterfactually Augmented Data , author=. International Conference on Learning Representations (ICLR) , year=
-
[57]
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Self-consistency improves chain of thought reasoning in language models , author=. arXiv preprint arXiv:2203.11171 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[58]
and Oren, Yonatan and Liang, Percy
Guu, Kelvin and Hashimoto, Tatsunori B. and Oren, Yonatan and Liang, Percy. Generating Sentences by Editing Prototypes. Transactions of the Association for Computational Linguistics. 2018. doi:10.1162/tacl_a_00030
-
[59]
Proceedings of the 7th ACM Conference on Recommender Systems , pages =
McAuley, Julian and Leskovec, Jure , title =. Proceedings of the 7th ACM Conference on Recommender Systems , pages =. 2013 , isbn =. doi:10.1145/2507157.2507163 , abstract =
- [60]
-
[61]
Publications Manual , year = "1983", publisher =
work page 1983
-
[62]
Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243
- [63]
-
[64]
Dan Gusfield , title =. 1997
work page 1997
-
[65]
Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =
work page 2015
-
[66]
A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =
Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =
-
[67]
Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference
McCoy, Tom and Pavlick, Ellie and Linzen, Tal. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019
work page 2019
-
[68]
Stress Test Evaluation for Natural Language Inference
Naik, Aakanksha and Ravichander, Abhilasha and Sadeh, Norman and Rose, Carolyn and Neubig, Graham. Stress Test Evaluation for Natural Language Inference. Proceedings of the 27th International Conference on Computational Linguistics. 2018
work page 2018
-
[69]
Adversarial Examples for Evaluating Reading Comprehension Systems
Jia, Robin and Liang, Percy. Adversarial Examples for Evaluating Reading Comprehension Systems. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017
work page 2017
-
[70]
Logic-Guided Data Augmentation and Regularization for Consistent Question Answering
Asai, Akari and Hajishirzi, Hannaneh. Logic-Guided Data Augmentation and Regularization for Consistent Question Answering. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020
work page 2020
-
[71]
Bitton, Yonatan and Stanovsky, Gabriel and Schwartz, Roy and Elhadad, Michael. Automatic Generation of Contrast Sets from Scene Graphs: Probing the Compositional Consistency of GQA. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021
work page 2021
-
[72]
SQ u AD : 100,000+ Questions for Machine Comprehension of Text
Rajpurkar, Pranav and Zhang, Jian and Lopyrev, Konstantin and Liang, Percy. SQ u AD : 100,000+ Questions for Machine Comprehension of Text. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016
work page 2016
-
[73]
Transactions of the Association for Computational Linguistics , volume=
Natural questions: a benchmark for question answering research , author=. Transactions of the Association for Computational Linguistics , volume=. 2019 , publisher=
work page 2019
-
[74]
Transactions of the Association for Computational Linguistics , volume =
Lamm, Matthew and Palomaki, Jennimaria and Alberti, Chris and Andor, Daniel and Choi, Eunsol and Soares, Livio Baldini and Collins, Michael , title = ". Transactions of the Association for Computational Linguistics , volume =. 2021 , month =. doi:10.1162/tacl_a_00398 , url =
-
[75]
MRQA 2019 Shared Task: Evaluating Generalization in Reading Comprehension
Fisch, Adam and Talmor, Alon and Jia, Robin and Seo, Minjoon and Choi, Eunsol and Chen, Danqi. MRQA 2019 Shared Task: Evaluating Generalization in Reading Comprehension. Proceedings of the 2nd Workshop on Machine Reading for Question Answering. 2019
work page 2019
-
[76]
arXiv preprint arXiv:1912.12598 , year=
Orb: An open reading benchmark for comprehensive evaluation of machine reading comprehension , author=. arXiv preprint arXiv:1912.12598 , year=
-
[77]
Generating Natural Language Adversarial Examples
Generating natural language adversarial examples , author=. arXiv preprint arXiv:1804.07998 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[78]
arXiv preprint arXiv:2010.08580 , year=
Linguistically-informed transformations (LIT): A method for automatically generating contrast sets , author=. arXiv preprint arXiv:2010.08580 , year=
-
[79]
Improving Neural Machine Translation Models with Monolingual Data
Improving neural machine translation models with monolingual data , author=. arXiv preprint arXiv:1511.06709 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[80]
arXiv preprint arXiv:1901.11196 , year=
Eda: Easy data augmentation techniques for boosting performance on text classification tasks , author=. arXiv preprint arXiv:1901.11196 , year=
-
[81]
arXiv preprint arXiv:2104.08678 , year=
Improving Question Answering Model Robustness with Synthetic Adversarial Data Generation , author=. arXiv preprint arXiv:2104.08678 , year=
-
[82]
Zero-shot Neural Passage Retrieval via Domain-targeted Synthetic Question Generation , author=. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume , pages=
-
[83]
arXiv preprint arXiv:2109.05052 , year=
Entity-Based Knowledge Conflicts in Question Answering , author=. arXiv preprint arXiv:2109.05052 , year=
-
[84]
arXiv preprint arXiv:2104.04515 , year=
Evaluating explanations for reading comprehension with realistic counterfactuals , author=. arXiv preprint arXiv:2104.04515 , year=
-
[85]
REALM : Retrieval-augmented language model pre-training
Guu, Kelvin and Lee, Kenton and Tung, Zora and Pasupat, Panupong and Chang, Ming-Wei , journal=. REALM : Retrieval-augmented language model pre-training
-
[86]
Synthetic QA Corpora Generation with Roundtrip Consistency
Alberti, Chris and Andor, Daniel and Pitler, Emily and Devlin, Jacob and Collins, Michael. Synthetic QA Corpora Generation with Roundtrip Consistency. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019
work page 2019
-
[87]
arXiv preprint arXiv:2104.08735 , year=
Learning with Instance Bundles for Reading Comprehension , author=. arXiv preprint arXiv:2104.08735 , year=
-
[88]
PAQ : 65 million probably-asked questions and what you can do with them
Lewis, Patrick and Wu, Yuxiang and Liu, Linqing and Minervini, Pasquale and K. PAQ : 65 million probably-asked questions and what you can do with them. arXiv preprint arXiv:2102.07033 , year=
-
[89]
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages=
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension , author=. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages=
-
[90]
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Roberta: A robustly optimized bert pretraining approach , author=. arXiv preprint arXiv:1907.11692 , year=
work page internal anchor Pith review Pith/arXiv arXiv 1907
-
[91]
Journal of Machine Learning Research , volume=
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , author=. Journal of Machine Learning Research , volume=
-
[92]
arXiv preprint arXiv:2009.05167 , year=
Accelerating Real-Time Question Answering via Question Generation , author=. arXiv preprint arXiv:2009.05167 , year=
-
[93]
International Conference on Machine Learning , pages=
Understanding self-training for gradual domain adaptation , author=. International Conference on Machine Learning , pages=. 2020 , organization=
work page 2020
-
[94]
Language Models are Few-Shot Learners
Language models are few-shot learners , author=. arXiv preprint arXiv:2005.14165 , year=
work page internal anchor Pith review Pith/arXiv arXiv 2005
-
[95]
Beyond Accuracy: Behavioral Testing of NLP Models with C heck L ist
Ribeiro, Marco Tulio and Wu, Tongshuang and Guestrin, Carlos and Singh, Sameer. Beyond Accuracy: Behavioral Testing of NLP Models with C heck L ist. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020
work page 2020
-
[96]
arXiv preprint arXiv:2010.06032 , year=
Measuring and reducing gendered correlations in pre-trained models , author=. arXiv preprint arXiv:2010.06032 , year=
-
[97]
T rivia QA : A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
Joshi, Mandar and Choi, Eunsol and Weld, Daniel and Zettlemoyer, Luke. T rivia QA : A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2017
work page 2017
-
[98]
International Conference of the cross-language evaluation Forum for European languages , pages=
Modeling of the question answering task in the yodaqa system , author=. International Conference of the cross-language evaluation Forum for European languages , pages=. 2015 , organization=
work page 2015
-
[99]
and Marasovi \'c , Ana and Smith, Noah A
Dasigi, Pradeep and Liu, Nelson F. and Marasovi \'c , Ana and Smith, Noah A. and Gardner, Matt. Q uoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJ...
work page 2019
-
[100]
Latent Retrieval for Weakly Supervised Open Domain Question Answering
Lee, Kenton and Chang, Ming-Wei and Toutanova, Kristina. Latent Retrieval for Weakly Supervised Open Domain Question Answering. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019
work page 2019
-
[101]
James Bradbury and Roy Frostig and Peter Hawkins and Matthew James Johnson and Chris Leary and Dougal Maclaurin and George Necula and Adam Paszke and Jake Vander
-
[102]
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
GLUE: A multi-task benchmark and analysis platform for natural language understanding , author=. arXiv preprint arXiv:1804.07461 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[103]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Bert: Pre-training of deep bidirectional transformers for language understanding , author=. arXiv preprint arXiv:1810.04805 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[104]
Annotation Artifacts in Natural Language Inference Data
Annotation artifacts in natural language inference data , author=. arXiv preprint arXiv:1803.02324 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[105]
Artificial intelligence , volume=
Explanation in artificial intelligence: Insights from the social sciences , author=. Artificial intelligence , volume=. 2019 , publisher=
work page 2019
-
[106]
arXiv preprint arXiv:2009.02252 , year=
KILT: a benchmark for knowledge intensive language tasks , author=. arXiv preprint arXiv:2009.02252 , year=
-
[107]
arXiv preprint arXiv:2202.01110 , year=
A Survey on Retrieval-Augmented Text Generation , author=. arXiv preprint arXiv:2202.01110 , year=
-
[108]
Advances in Neural Information Processing Systems , volume=
Retrieval-augmented generation for knowledge-intensive nlp tasks , author=. Advances in Neural Information Processing Systems , volume=
-
[109]
International Conference on Learning Representations , year=
Hindsight: Posterior-guided training of retrievers for improved open-ended generation , author=. International Conference on Learning Representations , year=
-
[110]
A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference
Williams, Adina and Nangia, Nikita and Bowman, Samuel. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018
work page 2018
-
[111]
A large annotated corpus for learning natural language inference , Year =. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP) , Publisher =
work page 2015
-
[112]
arXiv preprint arXiv:2004.04849 , year=
More bang for your buck: Natural perturbation for robust question answering , author=. arXiv preprint arXiv:2004.04849 , year=
-
[113]
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Chain of thought prompting elicits reasoning in large language models , author=. arXiv preprint arXiv:2201.11903 , year=
work page internal anchor Pith review Pith/arXiv arXiv
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.