ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions

Bo Bai; Chaorui Zhang; Peixian Zhou; Wei Han; Xueyan Niu; Yuxu Chen

arxiv: 2606.17905 · v1 · pith:I7BYQM4Lnew · submitted 2026-06-16 · 💻 cs.CL

Peixian Zhou, Yuxu Chen, Chaorui Zhang, Wei Han, Bo Bai, Xueyan Niu

Pith reviewed 2026-06-27 01:07 UTC · model grok-4.3

classification 💻 cs.CL

keywords: aligned, logical, chinese, reasoning, english, chlogic, difficult, general

The pith

Models lose accuracy on logical reasoning tasks when identical structures are expressed in Chinese rather than English.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ChLogic, a benchmark built from logical templates that pairs each English item with five Chinese realizations to hold the underlying logic fixed while varying surface form. Experiments across Qwen3, Ministral, and GLM models show a consistent performance advantage for English on both general and difficult problem sets. Back-translation into English improves results on easier items for some models but produces mixed or negative effects on harder ones. The findings point to joint effects from Chinese phrasing, translation artifacts, and model-specific training imbalances. ChLogic is presented as a stress test for whether current multilingual reasoning remains stable beyond English.

Core claim

ChLogic supplies three data sets derived from formal logical templates: a General aligned set from 60 propositions across nine families, a Difficult aligned set from 40 problems, and a Chinese-only set covering 15 language-specific phenomena. Each aligned item keeps one English reference and five Chinese variants. Tests on Qwen3, Ministral, and GLM models document a persistent English-Chinese accuracy gap. Back-translation often raises scores on the General set, yet lowers them for Qwen3-32B and GLM-5.1 on the Difficult set. The results are interpreted as evidence that Chinese surface realization, translation effects, and model behavior together limit robustness in multilingual logical reasoning.

What carries the argument

ChLogic benchmark of English-Chinese aligned logical items, each pairing one English reference with five Chinese realizations drawn from the same formal template so that latent structure stays constant while surface form varies.
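To make the alignment concrete, one way to represent an aligned item is sketched below. This is a minimal illustration, not the authors' data format: the class name `AlignedItem`, its field names, and the `"En"` label are hypothetical, and the texts are placeholders rather than real benchmark items. Only the five variant labels come from the Figure 2 caption.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Tuple

# Variant labels as named in the Figure 2 caption; all other names are hypothetical.
CHINESE_VARIANTS = ("Ch-Std", "Ch-Nat", "Ch-Col", "Ch-Rhet", "Ch-Pert")


@dataclass(frozen=True)
class AlignedItem:
    template_id: str         # hypothetical ID of the formal logical template
    gold_answer: str         # single answer shared by all six surface forms
    english: str             # the English reference expression
    chinese: Dict[str, str]  # variant label -> Chinese realization

    def __post_init__(self) -> None:
        # Reject items that do not carry exactly the five expected variants.
        if set(self.chinese) != set(CHINESE_VARIANTS):
            raise ValueError(
                f"expected variants {CHINESE_VARIANTS}, got {sorted(self.chinese)}"
            )

    def surface_forms(self) -> Iterator[Tuple[str, str]]:
        """Yield (label, text) for the English form, then each Chinese variant."""
        yield "En", self.english
        for label in CHINESE_VARIANTS:
            yield label, self.chinese[label]


# Placeholder texts only: no real ChLogic item is reproduced here.
item = AlignedItem(
    template_id="example-template",
    gold_answer="yes",
    english="<English reference text>",
    chinese={label: f"<{label} text>" for label in CHINESE_VARIANTS},
)
labels = [label for label, _ in item.surface_forms()]
```

Storing one gold answer per item, rather than one per surface form, encodes the benchmark's premise that all six expressions share the same correct answer.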

If this is right

Logical reasoning performance in current models is sensitive to language-specific surface realizations even when underlying logic is unchanged.
Back-translation from Chinese to English can reduce but does not eliminate the observed gap, and sometimes increases it on difficult problems.
Chinese-only linguistic phenomena require separate evaluation because they are not captured by aligned English-Chinese pairs.
Model-specific training data distributions contribute to the English advantage observed across Qwen3, Ministral, and GLM.
ChLogic functions as a diagnostic tool that reveals limits in multilingual reasoning robustness beyond what monolingual English benchmarks detect.
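The gap these points refer to can be computed from per-response correctness records, as in the sketch below. It assumes the Chinese average is the unweighted mean of per-variant accuracies (the Figure 2 caption is truncated before it defines this), and the function names, the `"En"` key, and the toy records are hypothetical rather than taken from the paper.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def accuracy_by_variant(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each surface-form label to the fraction of correct responses."""
    outcomes = defaultdict(list)
    for variant, is_correct in records:
        outcomes[variant].append(1.0 if is_correct else 0.0)
    return {variant: mean(values) for variant, values in outcomes.items()}


def english_chinese_gap(acc: Dict[str, float], english_key: str = "En") -> float:
    """English accuracy minus the unweighted mean of the other variants (assumed definition)."""
    chinese = [a for variant, a in acc.items() if variant != english_key]
    return acc[english_key] - mean(chinese)


# Toy records for illustration only; these are not results from the paper.
toy_records = [
    ("En", True), ("En", True),
    ("Ch-Std", True), ("Ch-Std", False),
    ("Ch-Nat", True), ("Ch-Nat", True),
]
toy_acc = accuracy_by_variant(toy_records)
toy_gap = english_chinese_gap(toy_acc)  # 1.0 - mean(0.5, 1.0) = 0.25
```

The same function applied to back-translated inputs in place of the `"En"` records would give the back-translation effect as a difference of two gaps.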

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the gap holds under tighter controls on meaning preservation, applications that rely on logical inference in Chinese (legal analysis, scientific deduction) would inherit the same accuracy shortfall.
The benchmark design could be replicated for other language pairs to test whether the English advantage is unique or part of a broader pattern favoring high-resource languages.
Training regimes that explicitly align reasoning across surface forms might close the gap, though the paper does not test such interventions.
Persistent gaps would imply that scaling alone is unlikely to produce language-agnostic logical competence without targeted multilingual alignment.

Load-bearing premise

The five Chinese realizations for each English item keep exactly the same logical structure and correct answer without introducing new ambiguities, scope changes, or presuppositions.

What would settle it

A controlled experiment in which models achieve equal or higher accuracy on the Chinese realizations than on the matched English items, after surface-form differences are isolated, would falsify the claimed performance gap.
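One way such a matched comparison could be tested is an exact McNemar test on paired outcomes, sketched below. This is a swapped-in standard technique and not something the paper reports using; the function name and the toy outcome lists are hypothetical.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(en_correct: Sequence[bool], zh_correct: Sequence[bool]) -> Tuple[int, int, float]:
    """Exact two-sided McNemar test on matched English/Chinese outcomes.

    Returns (b, c, p) where b counts items solved only in English,
    c counts items solved only in Chinese, and p is the p-value under
    the null hypothesis that both kinds of discordance are equally likely.
    """
    if len(en_correct) != len(zh_correct):
        raise ValueError("outcome lists must be paired item by item")
    b = sum(1 for e, z in zip(en_correct, zh_correct) if e and not z)
    c = sum(1 for e, z in zip(en_correct, zh_correct) if z and not e)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, so no evidence either way
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy paired outcomes for illustration only.
b, c, p = mcnemar_exact(
    [True, True, True, True, False],
    [False, False, False, True, True],
)
```

Only discordant pairs enter the test, so items that a model solves (or fails) in both languages do not dilute the comparison.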

Figures

Figures reproduced from arXiv: 2606.17905 by Bo Bai, Chaorui Zhang, Peixian Zhou, Wei Han, Xueyan Niu, Yuxu Chen.

**Figure 1.** Workflow of CHLOGIC benchmark construction. The three stages are logical-template design, dataset composition, and LLM-assisted generation with quality control. Hanyu Pinyin transliterations and English renderings of the displayed Chinese example are provided in Appendix I.1.

**Figure 2.** Main accuracy results on CHLOGIC. (a) Accuracy on the General aligned set for the English expression and five Chinese surface realizations: standard (Ch-Std), natural written (Ch-Nat), colloquial (Ch-Col), rhetorical question (Ch-Rhet), and perturbed (Ch-Pert). (b) English accuracy with Chinese-average accuracy per logical-template family and for the Difficult aligned set; Chinese-average denotes the mean …

Original abstract

Large language models perform increasingly well on standardized logical reasoning benchmarks, but whether this ability remains robust beyond English is unclear. We introduce ChLogic, an English--Chinese aligned benchmark that tests whether models preserve logical reasoning performance when the same latent logical structure is expressed in English and diverse Chinese surface realizations. Built from formal logical templates, the benchmark contains three data sets: (i) the General aligned set, derived from 60 General Propositions across nine template families; (ii) the Difficult aligned set, derived from 40 Difficult Problems; and (iii) the Chinese-only set, covering 15 language-specific phenomenon types. Each aligned item pairs one English reference expression with five Chinese realizations. Experiments on Qwen3, Ministral, and GLM models reveal a persistent English--Chinese performance gap. Back-translation from standard Chinese into English often improves performance on the General aligned set, but produces mixed effects on the Difficult aligned set, where Qwen3-32B and GLM-5.1 perform worse after translation. These results indicate that Chinese surface realization, translation artifacts, and model-specific behavior jointly affect multilingual logical reasoning. Overall, ChLogic provides a useful stress test for the robustness of multilingual reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ChLogic gives a new aligned benchmark but the reported English-Chinese gap rests on an unverified assumption that the Chinese versions keep identical logical structure.

The paper creates ChLogic, an English-Chinese aligned logical reasoning benchmark built from formal templates. It includes a general set from 60 propositions across nine families, a difficult set from 40 problems, and a Chinese-only set for 15 language-specific phenomena. Each aligned item pairs one English reference with five Chinese surface forms. Experiments on Qwen3, Ministral, and GLM models show a persistent English advantage, with back-translation helping on the general set but hurting some models on the difficult set.

What stands out is the template-driven alignment and the three distinct subsets. Reporting model-specific translation effects adds a practical angle for multilingual evaluation. The construction method itself is a reasonable way to hold the underlying logic fixed while varying the surface language.

The soft spot is the missing check on whether the five Chinese realizations actually preserve the exact same entailment and answer. The abstract claims derivation from templates but gives no expert equivalence ratings, inter-annotator numbers, or formal semantic comparison. If quantifier scope, negation, or Chinese phrasing shifts the difficulty in even a modest share of items, the observed gap could reflect changed problem hardness rather than reasoning robustness. Back-translation results do not address this directly. The abstract also omits item counts per family, raw accuracies, and error bars, so the size of the effect stays unclear.

This work is mainly for groups building or stress-testing multilingual reasoning benchmarks. Readers focused on Chinese LLM deployment would find the benchmark construction useful if the equivalence holds. The paper shows clear thinking on the evaluation setup and engages the literature on logical robustness, so it deserves a serious referee who can examine the full data and any equivalence validation steps.

Referee Report

2 major / 2 minor

Summary. The paper introduces ChLogic, an English-Chinese aligned benchmark for logical reasoning robustness derived from formal logical templates. It comprises three datasets: a General aligned set (60 propositions across nine template families), a Difficult aligned set (40 problems), and a Chinese-only set (15 language-specific phenomena). Each aligned item pairs one English reference with five Chinese realizations. Experiments on Qwen3, Ministral, and GLM models report a persistent English-Chinese performance gap, with back-translation from Chinese to English yielding mixed effects (improvement on the General set but degradation for some models on the Difficult set). The work concludes that Chinese surface forms, translation artifacts, and model behavior jointly affect multilingual reasoning.

Significance. If the core assumption of logical equivalence holds, the benchmark offers a concrete, template-based stress test for multilingual reasoning that goes beyond standard English-centric evaluations. The inclusion of multiple Chinese surface realizations per logical structure and the back-translation experiments provide falsifiable, model-specific measurements that could inform targeted improvements in cross-lingual consistency.

major comments (2)

[Benchmark construction (abstract and methods)] The central claim of a persistent English-Chinese performance gap (abstract and §4) rests on the unverified assumption that each English item and its five Chinese realizations encode identical latent logical structure (same correct answer, no scope shifts or added ambiguities). The manuscript states the sets are “derived from formal logical templates” but provides no independent verification such as expert equivalence annotation, inter-annotator agreement, or formal semantic comparison. This is load-bearing: any systematic deviation in quantifier scope, negation, or presupposition in the Chinese variants would confound the gap with changed problem difficulty rather than reasoning robustness.
[Experiments and results] No quantitative results, item counts per template family, accuracy tables with error bars, or per-model/per-set breakdowns appear in the abstract or are referenced with specific numbers in the provided description. The reported “persistent gap” and “mixed effects” of back-translation cannot be assessed for magnitude or statistical reliability without these data.
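The error bars requested here could be obtained with a paired percentile bootstrap over items, sketched below. This is an illustrative assumption about how uncertainty might be reported, not the authors' procedure; the function name, defaults, and inputs (per-item 0/1 correctness for English and for one Chinese variant) are hypothetical.

```python
import random
from typing import Sequence, Tuple


def bootstrap_gap_ci(
    en: Sequence[int],
    zh: Sequence[int],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (point_gap, lower, upper) for mean(en) - mean(zh).

    Items are resampled with replacement, and each resampled item keeps
    its English and Chinese outcomes together so the pairing is preserved.
    """
    if len(en) != len(zh) or not en:
        raise ValueError("need two non-empty, equally long outcome lists")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(en)
    gaps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        gaps.append(sum(en[i] - zh[i] for i in idx) / n)
    gaps.sort()
    lower = gaps[int((alpha / 2) * n_boot)]
    upper = gaps[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    point = sum(e - z for e, z in zip(en, zh)) / n
    return point, lower, upper
```

Resampling whole items rather than individual responses matters because the English and Chinese outcomes of one item are not independent. With only 60 General propositions and 40 Difficult problems, intervals of this kind are likely to be wide.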

minor comments (2)

[Methods] Clarify the exact translation procedure and any post-editing steps used to generate the five Chinese realizations per template.
[Experimental setup] Specify the exact prompt templates and few-shot examples used for model evaluation to ensure reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions to strengthen the manuscript.

Referee: [Benchmark construction (abstract and methods)] The central claim of a persistent English-Chinese performance gap (abstract and §4) rests on the unverified assumption that each English item and its five Chinese realizations encode identical latent logical structure (same correct answer, no scope shifts or added ambiguities). The manuscript states the sets are “derived from formal logical templates” but provides no independent verification such as expert equivalence annotation, inter-annotator agreement, or formal semantic comparison. This is load-bearing: any systematic deviation in quantifier scope, negation, or presupposition in the Chinese variants would confound the gap with changed problem difficulty rather than reasoning robustness.

Authors: We agree that the logical equivalence assumption is central and that the current manuscript does not include independent verification steps such as expert annotation. The items were generated from formal logical templates that define the underlying structure (e.g., quantifier scope and negation placement) before surface realization in either language. To address this, we will revise the methods section to detail the template-to-expression mapping process and add an appendix with side-by-side English-Chinese examples. We will also perform a targeted expert equivalence review on a subset of items and report agreement statistics in the revised version.
Revision: yes
Referee: [Experiments and results] No quantitative results, item counts per template family, accuracy tables with error bars, or per-model/per-set breakdowns appear in the abstract or are referenced with specific numbers in the provided description. The reported “persistent gap” and “mixed effects” of back-translation cannot be assessed for magnitude or statistical reliability without these data.

Authors: The full manuscript (Section 3 and Section 4) specifies the item counts (60 General propositions across nine families, 40 Difficult problems) and presents per-model accuracy tables with breakdowns by set and, where relevant, by template family. The abstract summarizes the directional findings without numerical values, which is standard, but we will add explicit references to the magnitude of the English-Chinese gap and back-translation effects. We will also ensure error bars or confidence intervals are included or noted in the tables during revision.
Revision: partial
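The agreement statistics promised for the expert equivalence review could take the form of Cohen's kappa between two annotators, sketched below. The choice of kappa, the function name, and the label scheme (1 for "logically equivalent", 0 for "not equivalent") are assumptions for illustration; the rebuttal does not say which statistic would be used.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two annotators over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty, equally long label lists")
    n = len(rater_a)
    observed = sum(1 for x, y in zip(rater_a, rater_b) if x == y) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected if each rater labeled independently at their own rates.
    expected = sum(
        counts_a[label] * counts_b[label] for label in set(counts_a) | set(counts_b)
    ) / n ** 2
    if expected == 1.0:
        return 1.0  # convention used here: both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy annotations for illustration only: 1 = equivalent, 0 = not equivalent.
kappa = cohens_kappa([1, 1, 0, 0], [1, 1, 0, 1])
```

Raw percent agreement would overstate reliability if nearly all variants are judged equivalent, which is why a chance-corrected statistic is the more informative report.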

Circularity Check

0 steps flagged

No circularity; empirical benchmark with independent measurements

The paper constructs ChLogic from logical templates and reports model performance gaps as direct empirical observations. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claim (English-Chinese gap) rests on held-out model evaluations rather than reducing to the construction process by definition. The equivalence assumption is a methodological premise, not a load-bearing self-referential step.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark paper; no mathematical derivations, fitted parameters, or new postulated entities appear in the abstract.
