pith. machine review for the scientific record.

arxiv: 2604.25926 · v1 · submitted 2026-04-01 · 💻 cs.CL · cs.IR

Recognition: no theorem link

MATH-PT: A Math Reasoning Benchmark for European and Brazilian Portuguese

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 22:36 UTC · model grok-4.3

classification 💻 cs.CL cs.IR
keywords mathematical reasoning · Portuguese · LLM benchmark · math Olympiads · multilingual evaluation · reasoning dataset · European Portuguese · Brazilian Portuguese

The pith

A new benchmark of 1,729 Portuguese math problems shows frontier LLMs lose ground on figures and open-ended questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates Math-PT, a dataset of 1,729 math problems written directly in European and Brazilian Portuguese and drawn from native Olympiads, competitions, and exams. It evaluates current LLMs on this set and finds that the strongest reasoning models handle multiple-choice items better than open-weight models, yet accuracy falls when problems include figures or require free-form answers. The work directly tackles the English-centric bias in existing math reasoning tests by supplying a high-quality, language-specific alternative that researchers can use to measure real multilingual capability.

Core claim

Math-PT supplies 1,729 problems in Portuguese from authentic Brazilian and Portuguese sources; when LLMs are tested on it, frontier reasoning models outperform open-weight models on multiple-choice questions while all models show reduced performance on items that contain figures or demand open-ended responses.

What carries the argument

The Math-PT dataset itself, a curated collection of 1,729 native Portuguese mathematical problems used to benchmark LLM reasoning across question formats.

If this is right

  • Frontier models can serve as stronger baselines for Portuguese multiple-choice math tasks than open-weight alternatives.
  • Visual and generative math reasoning remain weaker points even for top models when the language is not English.
  • Releasing the dataset and model outputs allows direct comparison and future fine-tuning targeted at Portuguese math.
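
Because the dataset and per-model outputs are released, the format-level gaps can be recomputed directly by anyone. A minimal Python sketch of that aggregation follows; the file name and the record fields ("format", "correct") are hypothetical placeholders rather than the release's actual schema.

    # Sketch: per-format accuracy from released model outputs.
    # File name and field names are hypothetical; the actual release may differ.
    import json
    from collections import defaultdict

    def per_format_accuracy(path: str) -> dict:
        """Aggregate accuracy by question format (multiple-choice, open-ended, figure)."""
        totals, correct = defaultdict(int), defaultdict(int)
        with open(path, encoding="utf-8") as f:
            for line in f:                      # one JSON record per graded model answer
                rec = json.loads(line)
                fmt = rec["format"]             # e.g. "multiple_choice", "open_ended", "figure"
                totals[fmt] += 1
                correct[fmt] += int(rec["correct"])
        return {fmt: correct[fmt] / totals[fmt] for fmt in totals}

    if __name__ == "__main__":
        print(per_format_accuracy("math_pt_model_outputs.jsonl"))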

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Multilingual LLM training should incorporate native-language math sources rather than relying on translated English data.
  • Similar benchmarks in other languages could reveal whether the observed format gaps are language-specific or general.
  • Open-weight models might narrow the performance difference if trained on comparable volumes of Portuguese competition problems.

Load-bearing premise

The 1,729 problems drawn from Portuguese Olympiads and exams form a representative and high-quality sample of mathematical reasoning in European and Brazilian Portuguese.

What would settle it

Re-running the same models on an independent collection of recent Portuguese math problems drawn from different exams or textbooks: if the relative rankings and the format-specific drops fail to reappear, the claim is undermined; if they persist, it is reinforced.

Figures

Figures reproduced from arXiv: 2604.25926 by Ana Carolina Erthal, André F. T. Martins, Beatriz Canaverde, Diego Mesquita, Eliezer de Souza da Silva, Juan Belieni, Miguel Faria, Tiago Teixeira.

Figure 1
Figure 1: Example of a European Portuguese (pt-PT) prompt for a multiple-choice question with a figure. view at source ↗
Figure 2
Figure 2: Example of a Brazilian Portuguese (pt-BR) prompt used for an open-ended question. The prompt reads (translated from Portuguese): "Solve the following open-ended math question. Be sure to place the final answer inside \boxed{}. Think and answer in Brazilian Portuguese. Question: What is the smallest value of $n$ for which a polygon with $n$ sides has an interior-angle sum greater than $2012$ degrees?" view at source ↗
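
The paper describes a standardized prompting strategy adapted to each linguistic variant (European and Brazilian Portuguese) and to the question format, with the final answer requested inside \boxed{}. The Python sketch below shows how such a template could be assembled; the pt-BR open-ended instruction follows the wording shown in Figure 2, while the function name and the remaining entries are hypothetical, not the authors' actual template.

    # Illustrative variant-aware prompt builder. Only the pt-BR open-ended
    # instruction is taken from Figure 2; everything else is a placeholder.
    INSTRUCTIONS = {
        ("pt-BR", "open_ended"): (
            "Resolva a seguinte questão aberta de matemática. "                # "Solve the following open-ended math question."
            "Certifique-se de colocar a resposta final dentro de \\boxed{}. "  # "Be sure to place the final answer inside \boxed{}."
            "Use Português do Brasil para pensar e responder."                 # "Think and answer in Brazilian Portuguese."
        ),
        # ("pt-PT", "multiple_choice"): "...",  # analogous European Portuguese wording
    }

    def build_prompt(question: str, variant: str, question_type: str) -> str:
        """Prepend the variant- and format-specific instruction to the problem statement."""
        instruction = INSTRUCTIONS[(variant, question_type)]
        return f"{instruction}\n\nQuestão: {question}"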
read the original abstract

The use of large language models (LLMs) for complex mathematical reasoning is an emergent area of research, with fast progress in methods, models, and benchmark datasets. However, most mathematical reasoning evaluations exhibit a significant linguistic bias, with the vast majority of benchmark datasets being exclusively in English or (at best) translated from English. We address this limitation by introducing Math-PT, a novel dataset comprising 1,729 mathematical problems written in European and Brazilian Portuguese. Math-PT is curated from a variety of high-quality native sources, including mathematical Olympiads, competitions, and exams from Portugal and Brazil. We present a comprehensive benchmark of current state-of-the-art LLMs on Math-PT, revealing that frontier reasoning models achieve strong performance in multiple choice questions compared to open weight models, but that their performance decreases for questions with figures or open-ended questions. To facilitate future research, we release the benchmark dataset and model outputs.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MATH-PT, a dataset of 1,729 mathematical problems in European and Brazilian Portuguese curated from native Olympiads, competitions, and exams. It benchmarks current LLMs and reports that frontier reasoning models outperform open-weight models on multiple-choice questions but exhibit reduced performance on figure-containing and open-ended questions, with the dataset and outputs released to support further research.

Significance. If the performance differences are shown to be robust, the work supplies a valuable non-English math reasoning benchmark that directly addresses linguistic bias in the field. Releasing the full dataset and model outputs strengthens reproducibility and enables targeted follow-up studies on multilingual reasoning.

major comments (2)
  1. [Abstract and Results] The central claim of decreased performance on figure-containing questions (abstract) is load-bearing but unsupported by details on input formatting: it is unclear whether figures were omitted, replaced by textual descriptions, or provided only to multimodal models. Without this protocol, the observed drop cannot be attributed to reasoning limitations rather than missing visual input.
  2. [Evaluation] No information is given on the scoring protocol for open-ended questions (exact match, LLM-as-judge, or human verification), inter-annotator agreement, or statistical significance of gaps across question types. These omissions undermine confidence in the reported performance differences.
minor comments (2)
  1. [Dataset Construction] Provide the exact breakdown of problem sources (Portugal vs. Brazil) and any curation or difficulty calibration steps applied to the 1,729 problems.
  2. [Model Evaluation] Clarify whether any vision-language models were included in the benchmark and how their results were handled for figure questions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where additional methodological detail will improve the clarity and interpretability of our results. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract and Results] The central claim of decreased performance on figure-containing questions (abstract) is load-bearing but unsupported by details on input formatting: it is unclear whether figures were omitted, replaced by textual descriptions, or provided only to multimodal models. Without this protocol, the observed drop cannot be attributed to reasoning limitations rather than missing visual input.

    Authors: We agree that the input formatting protocol for figure-containing questions must be stated explicitly to support the reported performance differences. The current manuscript does not provide this level of detail. We will add a dedicated paragraph in the Evaluation section describing the protocol: text-only models received the problem text with figures omitted (no textual descriptions were substituted), while multimodal models received the original images. This clarification will allow readers to attribute the observed drop to the lack of visual input for non-multimodal models rather than to reasoning limitations alone. revision: yes

  2. Referee: [Evaluation] No information is given on the scoring protocol for open-ended questions (exact match, LLM-as-judge, or human verification), inter-annotator agreement, or statistical significance of gaps across question types. These omissions undermine confidence in the reported performance differences.

    Authors: We acknowledge that the Evaluation section currently omits these specifics. Open-ended answers were scored via exact match to the ground-truth solutions after normalizing for equivalent LaTeX and symbolic representations; no LLM-as-judge or additional human verification was applied. Inter-annotator agreement is not applicable because scoring was fully automated and deterministic. We will expand the section to document this protocol, report bootstrap confidence intervals or McNemar tests for the performance gaps across question types, and include any relevant statistical details to increase confidence in the differences. revision: yes
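
Both responses describe protocols concretely enough to render as code, which makes them easier to audit. First, a minimal sketch of the figure-handling branch from response 1: text-only models receive the problem text with the figure silently omitted, while multimodal models also receive the original image. The chat-message layout is an assumed OpenAI-style schema, not the paper's harness.

    # Illustrative harness branch for figure questions, following response 1:
    # text-only models see the text with the figure omitted (no description
    # substituted); multimodal models also get the original image.
    # The role/content message schema is an assumption, not the paper's code.
    import base64
    from typing import Optional

    def build_messages(prompt: str, image_path: Optional[str], multimodal: bool) -> list:
        if image_path is None or not multimodal:
            # Text-only path: the figure is simply dropped from the input.
            return [{"role": "user", "content": prompt}]
        with open(image_path, "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode("ascii")
        return [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }]

Second, a sketch of the scoring pipeline from response 2: extract the \boxed{} answer, apply a crude normalization, score by exact match, and bootstrap a confidence interval for the accuracy gap between two question subsets. The normalization rules here are illustrative, not the authors' grader.

    # Normalized exact-match scoring plus a bootstrap CI for an accuracy gap.
    # Normalization is deliberately crude: it does not handle nested braces
    # or full symbolic equivalence; it only illustrates the protocol.
    import random
    import re
    from typing import Optional

    def extract_boxed(text: str) -> Optional[str]:
        """Return the contents of the last \\boxed{...} in a model response."""
        matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
        return matches[-1].strip() if matches else None

    def normalize(ans: str) -> str:
        ans = ans.replace(" ", "").replace("\\left", "").replace("\\right", "")
        return ans.replace("\\dfrac", "\\frac").rstrip(".").lower()

    def exact_match(pred: str, gold: str) -> bool:
        return normalize(pred) == normalize(gold)

    def bootstrap_gap_ci(hits_a, hits_b, n_boot: int = 10_000, seed: int = 0):
        """95% CI for accuracy(a) - accuracy(b) via nonparametric bootstrap."""
        rng = random.Random(seed)
        gaps = []
        for _ in range(n_boot):
            a = [rng.choice(hits_a) for _ in hits_a]
            b = [rng.choice(hits_b) for _ in hits_b]
            gaps.append(sum(a) / len(a) - sum(b) / len(b))
        gaps.sort()
        return gaps[int(0.025 * n_boot)], gaps[int(0.975 * n_boot)]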

Circularity Check

0 steps flagged

No circularity: benchmark dataset release with direct empirical evaluation

full rationale

The paper introduces the Math-PT dataset of 1,729 native Portuguese math problems and reports LLM performance on it. All claims are empirical comparisons (e.g., frontier models vs. open-weight models on multiple-choice vs. figure/open-ended subsets). No equations, fitted parameters, predictions, or self-citations reduce any reported result to prior quantities by construction. The work is self-contained as a data release plus evaluation; the central claims rest on the new dataset itself rather than any imported uniqueness theorem or ansatz.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that the collected problems constitute a valid and representative math-reasoning benchmark; no free parameters, axioms, or invented entities are introduced beyond standard dataset curation practices.

pith-pipeline@v0.9.0 · 5493 in / 999 out tokens · 26257 ms · 2026-05-13T22:36:04.757276+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages · 2 internal anchors

  1. [1]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. arXiv preprint arXiv:2103.03874.

  2. [2]

    MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark

    Hongwei Liu, Zilong Zheng, Yuxuan Qiao, Haodong Duan, Zhiwei Fei, Fengzhe Zhou, Wenwei Zhang, Songyang Zhang, Dahua Lin, and Kai Chen. In Findings of the Association for Computational Linguistics: ACL 2024, pages 6884–6915, Bangkok, Thailand. Association for Computational Linguistics.

  3. [3]

    Evaluation of Question Answer Generation for Portuguese: Insights and Datasets

    In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 5315–5327, Miami, Florida, USA. Association for Computational Linguistics.

  4. [4]

    XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning

    Edoardo Maria Ponti, Goran Glavaš, Olga Majewska, Qianchu Liu, Ivan Vulić, and Anna Korhonen. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2362–2376, Online. Association for Computational Linguistics.

  5. [5]

    Kimi K2: Open Agentic Intelligence

    Kimi Team. Preprint, arXiv:2507.20534.

  6. [6]

    MFTCXplain: A Multilingual Benchmark Dataset for Evaluating the Moral Reasoning of LLMs through Multi-hop Hate Speech Explanation

    Jackson Trager, Francielle Vargas, Diego Alves, Matteo Guida, Mikel K. Ngueajio, Ameeta Agrawal, Yalda Daryani, Farzan Karimi Malekabadi, and Flor Miriam Plaza-del Arco. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 15709–15740, Suzhou, China. Association for Computational Linguistics.