{"paper":{"title":"A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity","license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","headline":"ChatGPT averages 63.41% accuracy across ten reasoning categories and improves only modestly with human interaction.","cross_cats":["cs.AI"],"primary_cat":"cs.CL","authors_text":"Bryan Wilie, Dan Su, Holy Lovenia, Nayeon Lee, Pascale Fung, Quyet V. Do, Samuel Cahyawijaya, Tiezheng Yu, Wenliang Dai, Willy Chung, Yan Xu, Yejin Bang, Ziwei Ji","submitted_at":"2023-02-08T12:35:34Z","abstract_excerpt":"This paper proposes a framework for quantitatively evaluating interactive LLMs such as ChatGPT using publicly available data sets. We carry out an extensive technical evaluation of ChatGPT using 23 data sets covering 8 different common NLP application tasks. We evaluate the multitask, multilingual and multi-modal aspects of ChatGPT based on these data sets and a newly designed multimodal dataset. We find that ChatGPT outperforms LLMs with zero-shot learning on most tasks and even outperforms fine-tuned models on some tasks. We find that it is better at understanding non-Latin script languages "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"ChatGPT is 63.41% accurate on average in 10 different reasoning categories under logical reasoning, non-textual reasoning, and commonsense reasoning, hence making it an unreliable reasoner. 
It is, for example, better at deductive than inductive reasoning.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the 23 chosen datasets, the newly designed multimodal dataset, and the 10 reasoning categories provide a representative and low-bias measure of ChatGPT capabilities without major sensitivity to prompt wording or subjective hallucination labeling.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"ChatGPT outperforms zero-shot LLMs on most tasks and improves with interaction but scores only 63.41 percent on reasoning categories and generates extrinsic hallucinations from its training data.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"ChatGPT averages 63.41% accuracy across ten reasoning categories and improves only modestly with human interaction.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"8a814fab06f443108a773eff91b830adb110c6c355702f7556083a8b920e9519"},"source":{"id":"2302.04023","kind":"arxiv","version":4},"verdict":{"id":"57f14e71-d1ef-47f9-a0e5-6d9394914ccb","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T19:53:52.885799Z","strongest_claim":"ChatGPT is 63.41% accurate on average in 10 different reasoning categories under logical reasoning, non-textual reasoning, and commonsense reasoning, hence making it an unreliable reasoner. 
It is, for example, better at deductive than inductive reasoning.","one_line_summary":"ChatGPT outperforms zero-shot LLMs on most tasks and improves with interaction but scores only 63.41 percent on reasoning categories and generates extrinsic hallucinations from its training data.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the 23 chosen datasets, the newly designed multimodal dataset, and the 10 reasoning categories provide a representative and low-bias measure of ChatGPT capabilities without major sensitivity to prompt wording or subjective hallucination labeling.","pith_extraction_headline":"ChatGPT averages 63.41% accuracy across ten reasoning categories and improves only modestly with human interaction."},"references":{"count":23,"sample":[{"doi":"","year":2023,"title":"News summarization and evaluation in the era of gpt-3","work_id":"ec749274-c88c-461b-8bea-cc527ad5c047","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7890–7900","work_id":"c87a4363-a07c-404a-ac90-4bc9bab7e033","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Qa dataset explosion: A taxonomy of nlp resources for question answering and reading comprehension. ACM Comput. Surv. Just Accepted. 
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, ","work_id":"34ef0ae1-f5d9-4272-93e7-9deb55e86050","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models","work_id":"bb63abb3-0d50-4362-b97c-b5e725b03b39","ref_index":4,"cited_arxiv_id":"2206.04615","is_internal_anchor":true},{"doi":"","year":2018,"title":"Richmond Thomason","work_id":"0d93f796-a6ff-4975-9a9c-56e6eb348e66","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":23,"snapshot_sha256":"ffbaf2a528ad96be45f30d139f4641b9f5a1f8d4f1fdc12b4bc353ff7f971730","internal_anchors":2},"formal_canon":{"evidence_count":2,"snapshot_sha256":"883f3c61477ebc15cbf623842d0c9eb96fe1dfb7165d54a3f068e40de2bb238b"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}