arxiv: 2302.04023 · v4 · pith:FMJQSJSFnew · submitted 2023-02-08 · 💻 cs.CL · cs.AI

A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, Pascale Fung

Pith reviewed 2026-05-17 19:53 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords ChatGPT evaluation · reasoning accuracy · hallucination · multilingual performance · multimodal generation · interactive prompting · NLP benchmarks

The pith

ChatGPT averages 63.41% accuracy across ten reasoning categories and improves only modestly with human interaction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds an evaluation framework that tests ChatGPT on 23 public datasets spanning eight NLP tasks plus a new multimodal dataset. It reports that the model beats zero-shot baselines on most tasks and some fine-tuned models, yet reaches only 63.41% average accuracy on reasoning problems that mix logical deduction, non-textual inference, and commonsense. ChatGPT understands non-Latin scripts better than it generates them, produces multimodal outputs by first writing code, and relies on its parametric memory for answers that often include extrinsic hallucinations. Multi-turn human prompting raises summarization quality by 8% ROUGE-1 and translation quality by 2% ChrF++.

Core claim

ChatGPT outperforms zero-shot LLMs on most of eight standard NLP tasks but averages only 63.41% accuracy across ten reasoning categories that cover logical, non-textual, and commonsense reasoning, rendering it an unreliable reasoner that performs better at deduction than induction. It generates multimodal content through an intermediate code-generation step, produces more extrinsic hallucinations from internal memory than other LLMs, and gains measurable quality from interactive multi-turn prompting on summarization and machine translation.

What carries the argument

A multitask, multilingual, multimodal evaluation framework that applies 23 public datasets and one newly designed multimodal dataset to measure reasoning accuracy, hallucination types, and interactivity gains in ChatGPT.

If this is right

ChatGPT can be deployed directly for many classification and generation tasks without task-specific fine-tuning.
Any application needing reliable step-by-step reasoning must add external verification or knowledge retrieval.
Multimodal output requires an extra code-generation stage rather than direct image or audio production.
Human users can raise output quality on summarization and translation by iterating prompts in conversation.
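
The interactivity item above can be checked turn by turn with a ROUGE-1 comparison. What follows is a minimal sketch in Python, assuming a simplified ROUGE-1 (lowercased whitespace tokens, no stemming) rather than the official scorer; the names `rouge1_f1` and `turn_gain` and the example strings are hypothetical and do not come from the paper's released codebase.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1. Simplified ROUGE-1: lowercase, whitespace tokens, no stemming."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def turn_gain(first: str, second: str, reference: str) -> float:
    """ROUGE-1 F1 of the second-turn summary minus that of the first-turn summary."""
    return rouge1_f1(second, reference) - rouge1_f1(first, reference)


# Hypothetical example strings, for illustration only.
reference = "alice will send the report on friday"
first = "alice and bob talk at length about the report and alice will send it"
second = "alice will send the report friday"

gain = turn_gain(first, second, reference)
print(f"ROUGE-1 F1 gain from the second turn: {gain:.3f}")
```

Averaging `turn_gain` over a test set would give a figure comparable in kind to the reported multi-turn improvement, although the absolute numbers depend on the tokenizer and scorer used.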

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same evaluation approach could be applied to later models to check whether reasoning gaps persist or narrow.
Systems that combine ChatGPT with an external search tool would likely reduce the extrinsic hallucinations observed here.
The multi-turn prompting gains suggest that interactive interfaces may become a standard way to compensate for single-pass weaknesses.

Load-bearing premise

The 23 chosen datasets and ten reasoning categories give a representative, low-bias picture of ChatGPT performance that is not highly sensitive to prompt wording or subjective hallucination labels.

What would settle it

A fresh collection of reasoning problems drawn from the same categories where ChatGPT scores either above 75% or below 50% on average, or where small prompt rewordings shift scores by more than ten points, would test whether the reported unreliability holds.
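
The test described above can be written down as a small check. This is a sketch under stated assumptions: the function name `settle_check`, the input layout (accuracy in percent per category, per prompt wording), and the example numbers are all hypothetical, and the ten-point shift is applied per category, which is one possible reading of the criterion.

```python
from statistics import mean


def settle_check(scores_by_prompt, low=50.0, high=75.0, max_shift=10.0):
    """Check the two falsification criteria.

    scores_by_prompt maps a prompt-wording id to {category: accuracy in percent}.
    """
    if not scores_by_prompt:
        raise ValueError("need at least one prompt wording")
    # Macro-average over categories, computed separately for each wording.
    averages = {p: mean(cats.values()) for p, cats in scores_by_prompt.items()}
    # Only categories scored under every wording can be compared.
    shared = set.intersection(*(set(cats) for cats in scores_by_prompt.values()))
    shifts = {
        c: max(cats[c] for cats in scores_by_prompt.values())
        - min(cats[c] for cats in scores_by_prompt.values())
        for c in shared
    }
    return {
        "averages": averages,
        "shifts": shifts,
        "outside_band": any(a > high or a < low for a in averages.values()),
        "prompt_sensitive": any(s > max_shift for s in shifts.values()),
    }


# Hypothetical accuracies, for illustration only (not results from the paper).
example = {
    "wording_a": {"deductive": 80.0, "inductive": 50.0, "spatial": 60.0},
    "wording_b": {"deductive": 78.0, "inductive": 64.0, "spatial": 58.0},
}
result = settle_check(example)
print(result["outside_band"], result["prompt_sensitive"])
```

In this toy input both averages stay inside the 50 to 75 band, but the inductive category moves by 14 points between wordings, so the sensitivity criterion fires.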

Original abstract

This paper proposes a framework for quantitatively evaluating interactive LLMs such as ChatGPT using publicly available data sets. We carry out an extensive technical evaluation of ChatGPT using 23 data sets covering 8 different common NLP application tasks. We evaluate the multitask, multilingual and multi-modal aspects of ChatGPT based on these data sets and a newly designed multimodal dataset. We find that ChatGPT outperforms LLMs with zero-shot learning on most tasks and even outperforms fine-tuned models on some tasks. We find that it is better at understanding non-Latin script languages than generating them. It is able to generate multimodal content from textual prompts, via an intermediate code generation step. Moreover, we find that ChatGPT is 63.41% accurate on average in 10 different reasoning categories under logical reasoning, non-textual reasoning, and commonsense reasoning, hence making it an unreliable reasoner. It is, for example, better at deductive than inductive reasoning. ChatGPT suffers from hallucination problems like other LLMs and it generates more extrinsic hallucinations from its parametric memory as it does not have access to an external knowledge base. Finally, the interactive feature of ChatGPT enables human collaboration with the underlying LLM to improve its performance, i.e, 8% ROUGE-1 on summarization and 2% ChrF++ on machine translation, in a multi-turn "prompt engineering" fashion. We also release codebase for evaluation set extraction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper supplies early concrete numbers on ChatGPT's reasoning limits and its small interactivity gains, but the 63% accuracy figure rests on single-prompt runs without robustness checks.

The main point is that ChatGPT scores 63.41% on average across ten reasoning categories and shows modest improvements from interactive prompting on summarization and machine translation. The paper does solid work pulling in 23 existing datasets plus a new multimodal one to test multitask, multilingual, and multimodal performance. Breaking reasoning into logical, non-textual, and commonsense categories gives a clearer picture than a single score. The finding that ChatGPT understands non-Latin-script languages better than it generates them, and the hallucination breakdown showing a predominance of extrinsic types, add useful detail. Releasing the codebase for evaluation set extraction helps others verify or extend the results.

The main limitation is that the reasoning and interactivity claims rest on single-prompt setups, with no reported checks for prompt variation or for inter-annotator agreement on labels. Since these models respond differently to small wording changes, the 63% figure and the deductive-inductive contrast could shift. The authors acknowledge some of this in the abstract, but a more explicit sensitivity analysis would strengthen the unreliability conclusion.

This paper is aimed at researchers and engineers who need early quantitative benchmarks of ChatGPT's capabilities in real-world settings such as multilingual applications or collaborative use. It supplies actionable numbers, even if they are preliminary. It should go to peer review: the empirical scope and the code release make it worth referee time to refine the evaluation details.

Referee Report

1 major / 3 minor

Summary. The manuscript presents a multitask, multilingual, and multimodal evaluation of ChatGPT using 23 public datasets spanning 8 NLP tasks plus a newly introduced multimodal dataset. It reports that ChatGPT outperforms zero-shot baselines on most tasks and some fine-tuned models, shows stronger understanding than generation for non-Latin scripts, achieves an average accuracy of 63.41% across 10 reasoning categories (logical, non-textual, commonsense), exhibits hallucination issues with a predominance of extrinsic hallucinations, and improves via interactive multi-turn prompting (e.g., +8% ROUGE-1 on summarization). The authors release the evaluation codebase.

Significance. If the empirical results hold, this work supplies a broad, publicly grounded benchmark of ChatGPT's capabilities and limitations in reasoning, hallucination, and interactivity that is useful for the NLP community. The release of the evaluation codebase is a clear strength that supports reproducibility and extension by others. The multilingual and multimodal components add concrete data points on current LLM behavior beyond English text-only settings.

major comments (1)

[Reasoning evaluation section] The headline claim of 63.41% average accuracy across the 10 reasoning categories, and the consequent conclusion that ChatGPT is an 'unreliable reasoner' (with the deductive-vs-inductive contrast), rests on single zero-shot prompt evaluations. No results are reported for prompt variants, few-shot settings, or inter-annotator agreement on correctness labels. Because LLM outputs are known to be sensitive to wording, this omission is load-bearing for the reliability and comparative claims.

minor comments (3)

[Abstract and Methods] The abstract and methods should explicitly state the exact ChatGPT model version and access date used, as performance can shift across releases.
[Hallucination analysis] The rules for labeling intrinsic vs. extrinsic hallucinations and the annotation protocol are not fully detailed; adding them (or an appendix) would improve replicability.
[Multimodal evaluation] The newly designed multimodal dataset is mentioned but its construction, size, and task definitions are not described; a short paragraph or table would clarify its contribution.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and the recommendation for minor revision. The feedback on the reasoning evaluation is well taken, and we address it directly below while preserving the integrity of our original experimental design.

Referee: [Reasoning evaluation section] The headline claim of 63.41% average accuracy across the 10 reasoning categories, and the consequent conclusion that ChatGPT is an 'unreliable reasoner' (with the deductive-vs-inductive contrast), rests on single zero-shot prompt evaluations. No results are reported for prompt variants, few-shot settings, or inter-annotator agreement on correctness labels. Because LLM outputs are known to be sensitive to wording, this omission is load-bearing for the reliability and comparative claims.

Authors: We appreciate the referee's emphasis on prompt sensitivity. Our reasoning evaluation deliberately used a single, fixed zero-shot prompt template for each of the 10 categories to ensure consistency, reproducibility, and a direct assessment of ChatGPT's default behavior without additional prompt engineering. This approach aligns with the paper's broader goal of evaluating the model in its publicly available interactive form. While we acknowledge that different wordings or few-shot examples could alter individual scores, the consistent pattern of sub-70% accuracy across logical, non-textual, and commonsense categories still supports the characterization of unreliable zero-shot reasoning and the deductive-inductive contrast. To address the concern, we will (1) reproduce the exact prompts in the appendix, (2) add an explicit statement in the reasoning section clarifying the single-prompt protocol, and (3) insert a short limitations paragraph noting that results may improve with few-shot or chain-of-thought prompting. Regarding correctness labels, they were obtained via author consensus following explicit guidelines; we will report the verification process and any agreement statistics in the revision.

revision: partial
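
One agreement statistic that could be reported for the correctness labels is Cohen's kappa between two annotators. The sketch below is an assumption about what such a report might contain: the rebuttal does not say which statistic will be used, and the function name `cohens_kappa` and the toy labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels (1 = answer judged correct, 0 = incorrect); illustration only.
annotator_1 = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
annotator_2 = [1, 1, 0, 0, 0, 1, 1, 1, 0, 1]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")  # 0.583 for the toy labels above
```

Kappa corrects raw agreement (0.8 in the toy labels) for the agreement expected by chance (0.52 here), which matters when most answers fall into one class.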

Circularity Check

0 steps flagged

Purely empirical benchmarking with no derivations or self-referential reductions

The paper reports direct measurements of ChatGPT on 23 public datasets plus one newly designed multimodal set, yielding accuracies such as the 63.41% average across 10 reasoning categories. No equations, fitted parameters, or derivation chains exist that could reduce to the paper's own inputs by construction. All results are obtained by straightforward zero-shot evaluation against external benchmarks; the interactive prompt-engineering gains (8% ROUGE-1, 2% ChrF++) are likewise measured outcomes rather than predictions derived from prior fits. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked to justify the central claims. This is a standard empirical evaluation study whose numbers are independently verifiable against the cited public datasets.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation framework rests on the assumption that standard NLP benchmarks and the new multimodal dataset validly proxy real capabilities; no free parameters or new invented entities are introduced.

axioms (1)

domain assumption: Publicly available NLP datasets and the newly designed multimodal dataset accurately reflect ChatGPT's performance on reasoning, hallucination, and interactivity tasks.
All reported accuracies and improvement percentages depend on these benchmarks being fair and representative proxies.

pith-pipeline@v0.9.0 · 5611 in / 1334 out tokens · 141398 ms · 2026-05-17T19:53:52.885799+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.LawOfExistence · defect_zero_iff_one · unclear
Paper passage: "Moreover, we find that ChatGPT is 63.41% accurate on average in 10 different reasoning categories under logical reasoning, non-textual reasoning, and commonsense reasoning, hence making it an unreliable reasoner. It is, for example, better at deductive than inductive reasoning."

IndisputableMonolith.Foundation.DimensionForcing · dimension_forced · unclear
Paper passage: "We evaluate the multitask, multilingual and multi-modal aspects of ChatGPT based on these data sets and a newly designed multimodal dataset."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Evaluating Object Hallucination in Large Vision-Language Models
cs.CV 2023-05 accept novelty 7.0

Large vision-language models exhibit severe object hallucination that varies with training instructions, and the proposed POPE polling method evaluates it more stably and flexibly than prior approaches.

A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT
cs.SE 2023-02 accept novelty 7.0

The authors present a catalog of prompt patterns that provide reusable solutions to common problems in generating and interacting with outputs from LLMs.

Low-Cost Black-Box Detection of LLM Hallucinations via Dynamical System Prediction
cs.LG 2026-05 unverdicted novelty 6.0

A single-pass black-box method models LLM outputs as dynamical systems via Koopman operators to detect hallucinations with claimed state-of-the-art accuracy and lower cost.

GEM: Graph-Enhanced Mixture-of-Experts with ReAct Agents for Dialogue State Tracking
cs.CL 2026-05 unverdicted novelty 6.0

GEM achieves 65.19% joint goal accuracy on MultiWOZ 2.2 by routing between a graph neural network expert for dialogue structure and a T5 expert for sequences, plus ReAct agents for value generation, outperforming prio...

ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards
cs.CL 2025-10 unverdicted novelty 6.0

ReSeek adds self-correction via a JUDGE action and a dense instructive reward (correctness plus utility) to RL training of search agents, yielding higher success and faithfulness on a new contamination-resistant benchmark.

An Embodied Generalist Agent in 3D World
cs.CV 2023-11 unverdicted novelty 6.0

LEO is an embodied generalist agent that performs 3D captioning, question answering, reasoning, navigation, and manipulation after 3D vision-language alignment followed by vision-language-action instruction tuning on ...

Low-Resource Languages Jailbreak GPT-4
cs.CL 2023-10 conditional novelty 6.0

Translating unsafe inputs to low-resource languages jailbreaks GPT-4 at rates on par with or exceeding state-of-the-art attacks.

"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
cs.CR 2023-08 unverdicted novelty 6.0

Real-world jailbreak prompts collected from the wild achieve up to 0.95 attack success rates against major LLMs including GPT-4, with some persisting for over 240 days.

Graph-Augmented LLMs for Swiss MP Ideology Prediction
cs.CL 2026-05 unverdicted novelty 5.0

Graph-augmented LLMs using a political knowledge graph improve ideology prediction accuracy for Swiss MPs by incorporating relational data beyond text alone.

Consistency Analysis of Sentiment Predictions using Syntactic & Semantic Context Assessment Summarization (SSAS)
cs.CL 2026-04 unverdicted novelty 5.0

SSAS improves LLM sentiment prediction consistency and data quality by up to 30% on three review datasets via syntactic and semantic context assessment summarization.

Heuristic Style Transfer for Real-Time, Efficient Weather Attribute Detection
cs.CV 2026-04 conditional novelty 5.0

Lightweight multi-task models using Gram matrices and PatchGAN-style architectures detect 53 weather classes from RGB images with F1 scores above 96% internally and 78% zero-shot externally, supported by a new 503k-im...

Injecting Structured Biomedical Knowledge into Language Models: Continual Pretraining vs. GraphRAG
cs.CL 2026-04 unverdicted novelty 5.0

Continual pretraining on UMLS-derived text improves BERT on BLURB biomedical tasks while GraphRAG boosts LLaMA 3-8B accuracy by over 3 points on PubMedQA and 5 on BioASQ without retraining.

A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
cs.CL 2023-11 unverdicted novelty 5.0

The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.

Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment
cs.AI 2023-08 accept novelty 5.0

Survey organizes LLM trustworthiness into seven categories and 29 sub-categories, measures eight sub-categories on popular models, and finds that more aligned models generally score higher but with varying effectiveness.

Recommendations for Efficient and Responsible LLM Adoption within Industrial Software Development
cs.SE 2026-04 conditional novelty 4.0

A multi-case study plus survey produces seven actionable recommendations for efficient and responsible LLM use in industrial software engineering.

A Systematic Study of Retrieval Pipeline Design for Retrieval-Augmented Medical Question Answering
cs.CL 2026-04 unverdicted novelty 4.0

Dense retrieval plus query reformulation and reranking reaches 60.49% accuracy on MedQA USMLE, outperforming other setups while domain-specialized models make better use of the retrieved evidence.

The Rise and Potential of Large Language Model Based Agents: A Survey
cs.AI 2023-09 accept novelty 4.0

The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.

Prompt Engineering Strategies for LLM-based Qualitative Coding of Psychological Safety in Software Engineering Communities: A Controlled Empirical Study
cs.SE 2026-05 unverdicted novelty 3.0

Multi-shot prompting raises agreement with humans for Claude Haiku but not DeepSeek-Chat or Gemini 2.5 Flash, with models showing different stability and a consistent bias toward over-labeling negative feedback.
