Physical Commonsense Reasoning for Lower-Resourced Languages and Dialects: a Study on Basque

Itziar Gonzalez-Dios; Jaione Bengoetxea; Rodrigo Agerri

arxiv: 2602.14812 · v3 · submitted 2026-02-16 · 💻 cs.CL

Pith reviewed 2026-05-15 21:47 UTC · model grok-4.3

classification 💻 cs.CL

keywords physical commonsense reasoning · Basque · low-resource languages · dialectal variants · large language models · BasPhyCo · verifiability task · non-QA reasoning

The pith

LLMs exhibit limited physical commonsense reasoning in Basque, especially its dialects

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates BasPhyCo, the first non-QA physical commonsense dataset for Basque in both standard and dialectal forms, by adapting narratives from an Italian source. It evaluates multilingual and language-specific LLMs on three levels of increasing difficulty: spotting implausible stories, locating the conflicting element, and verifying the exact physical state responsible. Models perform adequately on the first two levels but show clear weakness on verification, with further drops for dialectal text. This matters because physical commonsense underpins real-world prediction and interaction, and shortfalls in low-resource languages limit model reliability outside English-dominant settings.

Core claim

The paper presents BasPhyCo and demonstrates that large language models have limited physical commonsense capabilities when processing Basque, particularly dialectal variants, as evidenced by low performance on the verifiability task that requires identifying the precise physical state creating an implausible narrative.

What carries the argument

BasPhyCo dataset with three hierarchical tasks (accuracy on plausibility, consistency on conflict identification, verifiability on specific physical state) adapted from Italian GITA narratives to Basque standard and dialectal variants.
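
As a rough illustration of how the three tiers relate, the sketch below scores a handful of hypothetical predictions. It assumes the tiers are cumulative: a story counts toward consistency only if its plausibility judgment was also right, and toward verifiability only if both earlier tiers were right. The paper's exact scoring rules may differ. The `Prediction` record and the `tiered_scores` helper are invented for this example and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """One model output, reduced to three correctness flags (hypothetical schema)."""
    plausible_ok: bool  # implausible story correctly identified
    conflict_ok: bool   # conflicting element correctly located
    state_ok: bool      # responsible physical state correctly named


def tiered_scores(preds):
    """Return accuracy, consistency and verifiability under a cumulative assumption."""
    n = len(preds)
    if n == 0:
        raise ValueError("need at least one prediction")
    accuracy = sum(p.plausible_ok for p in preds) / n
    consistency = sum(p.plausible_ok and p.conflict_ok for p in preds) / n
    verifiability = sum(
        p.plausible_ok and p.conflict_ok and p.state_ok for p in preds
    ) / n
    return {
        "accuracy": accuracy,
        "consistency": consistency,
        "verifiability": verifiability,
    }


# Toy data: each later tier can only match or fall below the earlier one.
preds = [
    Prediction(True, True, True),
    Prediction(True, True, False),
    Prediction(True, False, False),
    Prediction(False, False, False),
]
scores = tiered_scores(preds)
print(scores)
```

Under this assumption verifiability can never exceed consistency, and consistency can never exceed accuracy, which is one way to read the report that models do adequately on the first two levels and poorly on the third.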

Load-bearing premise

The tasks and narratives adapted from the Italian GITA dataset accurately capture physical commonsense reasoning in Basque without introducing translation or cultural biases that alter the intended physical plausibility judgments.

What would settle it

If multilingual or Basque-pretrained LLMs achieve high accuracy, such as above 80 percent, on the verifiability task for both standard and dialectal Basque using the BasPhyCo dataset, the claim of limited capabilities would be contradicted.

Figures

Figures reproduced from arXiv: 2602.14812 by Itziar Gonzalez-Dios, Jaione Bengoetxea, Rodrigo Agerri.

**Figure 1.** The number of unique words in BasPhyCo (left) and BasPhyCowest (right), as well as the overlap of both datasets (middle). Additionally, some examples from each dataset.

Text captured alongside the figure: 4.1. Task description. Our setup is based on GITA4CALAMITA, a GITA version which was adapted to work with generative LLMs for the CALAMITA shared task (Pensa et al., 2024b). This approach eva…

**Figure 2.** Dialectal adaptation prompt
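
The kind of count reported in Figure 1 (unique words per variant plus their overlap) can be sketched with plain set operations. This is only an illustration: the whitespace tokenizer, the lowercasing, and the helper names `vocabulary` and `overlap_stats` are assumptions, and the toy word lists reuse a few forms that appear in the page's extracted text rather than the real datasets.

```python
def vocabulary(sentences):
    """Collect lowercased whitespace tokens; a deliberately naive tokenizer."""
    return {token.lower() for sentence in sentences for token in sentence.split()}


def overlap_stats(standard_sentences, dialect_sentences):
    """Count unique words in each variant and the words they share."""
    vocab_standard = vocabulary(standard_sentences)
    vocab_dialect = vocabulary(dialect_sentences)
    return {
        "standard_unique": len(vocab_standard),
        "dialect_unique": len(vocab_dialect),
        "shared": len(vocab_standard & vocab_dialect),
    }


# Toy inputs, not real dataset sentences.
standard = ["Lorategia Gauean Etxea Hartu"]
dialect = ["Lorategixe Gabean Etxea Hartu"]
stats = overlap_stats(standard, dialect)
print(stats)
```

A morphologically rich language such as Basque would likely need a proper tokenizer or lemmatizer before such counts are meaningful; the sketch ignores that on purpose.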

Original abstract

Physical commonsense reasoning represents a fundamental capability of human intelligence, enabling individuals to understand their environment, predict future events, and navigate physical spaces. Recent years have witnessed growing interest in reasoning tasks within Natural Language Processing (NLP). However, no prior research has examined the performance of Large Language Models (LLMs) on non-question-answering (non-QA) physical commonsense reasoning tasks in low-resource languages such as Basque. Taking the Italian GITA as a starting point, this paper addresses this gap by presenting BasPhyCo, the first non-QA physical commonsense reasoning dataset for Basque, available in both standard and dialectal variants. We evaluate model performance across three hierarchical levels of commonsense understanding: (1) distinguishing between plausible and implausible narratives (accuracy), (2) identifying the conflicting element that renders a narrative implausible (consistency), and (3) determining the specific physical state that creates the implausibility (verifiability). These tasks were assessed using multiple multilingual LLMs as well as models pretrained specifically for Italian and Basque. Results indicate that, in terms of verifiability, LLMs exhibit limited physical commonsense capabilities in low-resource languages such as Basque, especially when processing dialectal variants.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BasPhyCo is a useful first dataset for Basque physical commonsense with dialect coverage, but the adaptation from GITA leaves open whether low scores reflect model limits or translation artifacts.

This paper introduces BasPhyCo, the first non-QA physical commonsense dataset for Basque, adapted from the Italian GITA set and released in both standard and dialectal versions. The authors evaluate three tasks on it: accuracy at separating plausible from implausible stories, consistency at locating the conflicting element, and verifiability at identifying the exact physical state that breaks the narrative. Multilingual LLMs and several Basque- or Italian-specific models are tested, and the abstract states that verifiability is weak overall and weakest on the dialectal variants.

Referee Report

2 major / 1 minor

Summary. The paper introduces BasPhyCo, the first non-QA physical commonsense reasoning dataset for Basque (standard and dialectal variants) adapted from the Italian GITA dataset. It evaluates multilingual LLMs and Basque/Italian-specific models on three hierarchical tasks—accuracy (distinguishing plausible vs. implausible narratives), consistency (identifying the conflicting element), and verifiability (determining the specific physical state causing implausibility)—and concludes that LLMs exhibit limited physical commonsense capabilities in Basque, especially for dialectal variants.

Significance. If the adaptation from GITA preserves physical plausibility structure without translation artifacts and the empirical results are robust, the work is significant as the first such benchmark for Basque, highlighting gaps in LLM performance on low-resource languages and dialects and providing a new resource for evaluating non-QA commonsense reasoning beyond English-centric datasets.

major comments (2)

[Dataset adaptation (abstract and §3)] The headline claim of limited LLM verifiability rests on the assumption that translated narratives retain the original physical implausibility structure from GITA. No independent validation of label stability (e.g., native-speaker agreement rates or comparison of plausibility judgments pre- and post-translation) is described, so low verifiability scores could reflect dataset noise or Basque-specific lexical gaps rather than model limitations.
[Results presentation (abstract and §4)] The abstract states that 'results indicate limited capabilities' but provides no quantitative metrics, error analysis, dataset statistics (e.g., narrative counts, label distributions), or baseline comparisons, preventing assessment of effect sizes or whether the 'especially dialectal' degradation is statistically meaningful.
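
One common way to quantify the label stability raised in the first comment is Cohen's kappa between the source labels and labels re-assigned after adaptation. The sketch below is a minimal pure-Python version under that assumption; the report does not say which agreement statistic would be used, and the label sequences are invented toy data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both sources.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each source's own label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sources used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy data: P = plausible, I = implausible.
source_labels = ["P", "I", "I", "P", "I", "I"]   # labels carried over from the source
adapted_labels = ["P", "I", "I", "P", "P", "I"]  # hypothetical re-annotation
kappa = cohen_kappa(source_labels, adapted_labels)
print(round(kappa, 3))
```

A value near 1 would suggest the adaptation preserved the plausibility labels; a markedly lower value would support the referee's worry about dataset noise.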

minor comments (1)

[Abstract] The abstract could include one or two key quantitative results (e.g., verifiability accuracy ranges) to make the conclusions more concrete.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on BasPhyCo. We address each major comment below and will revise the manuscript to incorporate additional validation details and quantitative clarifications.

Referee: [Dataset adaptation (abstract and §3)] The headline claim of limited LLM verifiability rests on the assumption that translated narratives retain the original physical implausibility structure from GITA. No independent validation of label stability (e.g., native-speaker agreement rates or comparison of plausibility judgments pre- and post-translation) is described, so low verifiability scores could reflect dataset noise or Basque-specific lexical gaps rather than model limitations.

Authors: We acknowledge the value of explicit validation for label stability. The adaptation was performed by native Basque speakers with the explicit goal of preserving GITA's physical plausibility structure, using direct translation followed by dialectal localization where needed. To address the concern, we will add a dedicated subsection in §3 reporting native-speaker agreement rates on plausibility labels (pre- and post-adaptation) and a small-scale comparison of judgments, confirming that translation artifacts or lexical gaps do not explain the observed performance gaps.

revision: yes

Referee: [Results presentation (abstract and §4)] The abstract states that 'results indicate limited capabilities' but provides no quantitative metrics, error analysis, dataset statistics (e.g., narrative counts, label distributions), or baseline comparisons, preventing assessment of effect sizes or whether the 'especially dialectal' degradation is statistically meaningful.

Authors: The full §4 already contains per-model accuracy, consistency, and verifiability scores, dataset statistics (narrative counts and label distributions), and comparisons to multilingual and language-specific baselines. We agree the abstract is overly qualitative. In revision we will insert concise quantitative highlights into the abstract (e.g., verifiability scores for standard vs. dialectal Basque) and expand §4 with explicit error analysis and statistical significance tests for the dialectal degradation.

revision: yes
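
Because the standard and dialectal items are paired (each dialectal story is an adaptation of a standard one), a paired test is a natural candidate for the promised significance analysis. The sketch below uses an exact McNemar test as one plausible choice; the rebuttal does not name a test, and the per-item correctness flags are invented.

```python
from math import comb


def mcnemar_exact(correct_standard, correct_dialect):
    """Exact two-sided McNemar test on paired per-item correctness flags.

    Returns (b, c, p_value), where b counts items correct only on the standard
    variant and c counts items correct only on the dialectal variant.
    """
    if len(correct_standard) != len(correct_dialect):
        raise ValueError("paired inputs must have equal length")
    pairs = list(zip(correct_standard, correct_dialect))
    b = sum(1 for s, d in pairs if s and not d)
    c = sum(1 for s, d in pairs if d and not s)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Binomial tail under the null that a discordant pair goes either way equally.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data for ten paired items (True = model answered correctly).
standard_flags = [True] * 8 + [False] * 2
dialect_flags = [True, True] + [False] * 6 + [True, False]
b, c, p_value = mcnemar_exact(standard_flags, dialect_flags)
print(b, c, p_value)
```

Only the discordant pairs carry information here, so a large dataset with few discordant items can still yield an inconclusive result.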

Circularity Check

0 steps flagged

No circularity: purely empirical dataset creation and evaluation

The paper constructs BasPhyCo by adapting narratives from the external Italian GITA dataset and evaluates multilingual LLMs on three new tasks (accuracy, consistency, verifiability) using standard metrics. No equations, fitted parameters, predictions, or derivations appear in the described methodology. Central claims rest on fresh data collection and model testing rather than any self-referential reduction, self-citation chain, or renaming of prior results. This matches the default non-circular case for empirical NLP dataset papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on the domain assumption that narrative plausibility tasks can isolate physical commonsense and that adaptation from Italian preserves those properties in Basque.

axioms (1)

domain assumption: The three hierarchical levels (accuracy, consistency, verifiability) measure distinct and meaningful aspects of physical commonsense understanding.
Invoked to structure the evaluation tasks in the abstract.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 4 internal anchors

[1]

Physical Commonsense Reasoning for Lower-Resourced Languages and Dialects: a Study on Basque

Introduction Commonsense reasoning represents the human capacity to understand and manipulate real-world objects and their interactions. This domain has attracted considerable attention in Artificial Intelligence research in recent years (Davis, 2023; Sun et al., 2025). Physical commonsense reasoning, a specific subdomain, addresses events occurring in ...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Related Work Physical Commonsense. Recent research has tried to test physical commonsense knowledge of current LLMs. To this end, researchers have developed various datasets and benchmarks, including textual information (Rajani et al., 2019; Bisk et al., 2020; Rajani et al., 2020; Storks et al., 2021; Aroca-Ouellette et al., 2021; Wang et al., 2023; Pensa et al.,...

work page 2019
[3]

to think

Data This study examines physical commonsense reasoning in Italian and Basque. We employed GITA (Pensa et al., 2024a), an Italian dataset derived 2We use the terms the authors use in their papers. Type Sentence 1 Sentence 2 Sentence 3 Sentence 4 Sentence 5 Plausible George filled the glass with water. George put the glass in the microwave. George turn...

work page 2019
[4]

0 - C0 ": {

Experimental Setup This section presents the three evaluated tasks and their associated metrics, followed by a description 3Translation: the technician has not arrived yet. 4Translation: [someone] has taken the turkey. Lorategia Gauean Bizkarralde Piztu Ondoren Lorategixe Gabean Bizkarleku Isiotu Ostean Etxea Hartu Erabili Oinetako Standard Western 763 842524 Fi...

work page 2021
[5]

Results We present the results for the three tasks in Table 3, for Italian (GITA), Standard Basque (BasPhyCo) and Western Basque (BasPhyCowest). Italian. The multilingual Llama-3.1-70B-It model achieved the highest performance in accuracy and consistency metrics, while Latxa-3.1-70B-It outperformed other models in terms of verifiability. Conversely, Ita...

work page 2024
[6]

This analysis aims to identify any possible biases that the models could have towards implausible story types

Discussion In this section, we focus on more fine-grained results, as the three metrics have been specifically computed for the different types of implausible stories (order and cloze). This analysis aims to identify any possible biases that the models could have towards implausible story types. The results for all three metrics, as well as for the differe...

work page
[7]

(21) is fully known, we solve this equation forβ i

Conclusion This paper introduces a novel dataset for evaluating physical commonsense reasoning in Basque and its Western dialect. The dataset was derived from GITA, a manually curated Italian corpus, which underwent manual translation and localization into Standard Basque. Subsequently, the Standard Basque data was automatically adapted to the Western Basque d...

work page doi:10.13039/501100011033 2024
[8]

PHYRE: A New Benchmark for Physical Reasoning. Advances in Neural Information Processing Systems, 32. Irene Baucells, Javier Aula-Blasco, Iria de Dios-Flores, Silvia Paniagua Suárez, Naiara Perez, Anna Salles, Susana Sotelo Docio, Júlia Falcão, Jose Javier Saiz, Robiert Sepulveda Torres, Jeremy Barnes, Pablo Gamallo, Aitor Gonzalez-Agirre, German Rig...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Do generative video models understand physical principles?

PhyBench: A Physical Commonsense Benchmark for Evaluating Text-to-Image Models. CoRR. Luis Mitxelena. 1981. Lengua común y dialectos vascos. Anuario del Seminario de Filología Vasca "Julio de Urquijo", 15:289–313. Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, and Robert Geirhos. 2025. Do Generative Video Models Understand Physical Principles?a...

work page internal anchor Pith review arXiv 1981
[10]

Shane Storks, Qiaozi Gao, Yichi Zhang, and Joyce Chai

Commonsense Reasoning for Natural Language Understanding: A Survey of Benchmarks, Resources, and Approaches. arXiv preprint arXiv:1904.01172, pages 1–60. Shane Storks, Qiaozi Gao, Yichi Zhang, and Joyce Chai. 2021. Tiered Reasoning for Intuitive Physics: Toward Verifiable Commonsense Language Understanding. In Findings of the Association for Computational...

work page arXiv 1904
[11]

ACM Computing Surveys, 57(11):1–43

A Survey of Reasoning with Foundation Models: Concepts, Methodologies, and Outlook. ACM Computing Surveys, 57(11):1–43. Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al

work page
[12]

Gemma 2: Improving Open Language Models at a Practical Size

Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118. Larraitz Uria and Ricardo Etxepare. 2012. Hizkeren arteko aldakortasun sintaktikoa aztertzeko metodologiaren nondik norakoak: Basyque aplikazioa. Lapurdum. Euskal ikerketen aldizkaria| Revue d’études basques| Revista de estudios vascos| Basque studies review, (...

work page internal anchor Pith review Pith/arXiv arXiv 2012
[13]

In European Conference on Computer Vision, pages 292–309

PACS: A Dataset for Physical Audiovisual CommonSense Reasoning. In European Conference on Computer Vision, pages 292–309. Springer. Caleb Ziems, William Held, Jingfeng Yang, Jwala Dhamala, Rahul Gupta, and Diyi Yang. 2023. Multi-VALUE: A Framework for Cross-Dialectal English NLP. In Proceedings of the 61st Annual Meeting of the Association for Computational Lin...

work page 2023
[14]

First, list all unique sentences across all three stories

work page
[15]

Adapt each unique sentence exactly once into the Bizkaian dialect

work page
[16]

Then reconstruct the three stories with the translations, making sure that any identical source sentence always has the identical translation

work page
[17]

If there are more than three stories, repeat the same process for all of them. Format: This is an example of an standard (INPUT) instance and an example of the dialectal (OUTPUT) adaptation that you need to do: Standard: STORY1: [’Mikel lanera joan da’, ’Mikelek ordenagailua piztu du’, ’Mikelek mezuak irakurri ditu’, ’Mikelek mezuak erantzun ditu’, ’Mikel...

work page
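
The prompt steps captured in entries [14] to [17] (list the unique sentences, adapt each exactly once, rebuild the stories so identical source sentences share one adaptation) describe a constraint that can also be enforced in code. The sketch below is an illustration under stated assumptions: `adapt_stories` is a hypothetical helper, and the `adapter` argument stands in for whatever model call performs the dialectal adaptation. The stub used here only tags sentences and does not produce real Bizkaian text.

```python
def adapt_stories(stories, adapter):
    """Adapt every unique sentence once, then rebuild each story from the cache.

    stories: list of stories, each a list of sentence strings.
    adapter: callable mapping one standard sentence to its adaptation (hypothetical).
    """
    cache = {}
    for story in stories:
        for sentence in story:
            if sentence not in cache:
                cache[sentence] = adapter(sentence)  # called once per unique sentence
    return [[cache[sentence] for sentence in story] for story in stories]


# Stub adapter that records its calls; a real system would query a model here.
calls = []


def tag_adapter(sentence):
    calls.append(sentence)
    return "[adapted] " + sentence


stories = [
    ["Mikel lanera joan da", "Mikelek ordenagailua piztu du"],
    ["Mikel lanera joan da", "Mikelek mezuak irakurri ditu"],
]
adapted = adapt_stories(stories, tag_adapter)
print(len(calls), adapted[0][0] == adapted[1][0])
```

Caching outside the model removes the need to trust the model to keep repeated sentences identical, at the cost of adapting each sentence without its story context.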
