AMALIA-VL: A Native European Portuguese Open-Source Vision and Language Model

Afonso Simplício; David Semedo; Diogo Glória-Silva; Diogo Tavares; Gonçalo Vinagre; Inês Calvo; Inês Vieira; João Cardeira; João Magalhães; Manuel Letras da Luz; Rafael Ferreira

arxiv: 2606.19100 · v3 · pith:AEFSHOALnew · submitted 2026-06-17 · 💻 cs.CV


Pith reviewed 2026-07-01 07:32 UTC · model grok-4.3

classification 💻 cs.CV

keywords AMALIA-VL · European Portuguese · pt-PT · vision-language model · LVLM · open-source · instruction tuning · multimodal

The pith

AMALIA-VL is the first open-source instruction-tuned LVLM built natively for European Portuguese.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AMALIA-VL to address the systematic underrepresentation of European Portuguese in existing open-source multimodal models, which either merge it with Brazilian Portuguese or provide minimal coverage. It pairs a high-resolution vision encoder that uses dynamic image tiling with a pt-PT-optimized language model through a learned connector. A three-stage training sequence and a data collection focused on pt-PT resources aim to produce a model that functions as a native system rather than an adaptation. The authors release the weights, data, pipelines, and translated benchmarks to support additional work on pt-PT vision-language tasks.
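To make "dynamic image tiling" concrete, here is a minimal sketch of one common way such a scheme works: pick a tile grid that matches the image's aspect ratio, resize the image to that grid, and cut it into fixed-size tiles for the vision encoder. The abstract does not describe the paper's tiling scheme, so the tile size of 448 pixels, the cap of 12 tiles, and the function names `choose_grid` and `tile_boxes` are all assumptions made for illustration.

```python
import math
from typing import List, Tuple


def choose_grid(width: int, height: int, tile: int = 448,
                max_tiles: int = 12) -> Tuple[int, int]:
    """Pick a (cols, rows) grid for an image of the given size.

    Assumed policy (not taken from the paper): never use more tiles than the
    image resolution justifies, match the aspect ratio as closely as possible,
    and break ties in favour of more tiles.
    """
    max_cols = max(1, math.ceil(width / tile))
    max_rows = max(1, math.ceil(height / tile))
    target = width / height
    candidates = [
        (cols, rows)
        for cols in range(1, max_cols + 1)
        for rows in range(1, max_rows + 1)
        if cols * rows <= max_tiles
    ]
    return min(candidates,
               key=lambda g: (abs(target - g[0] / g[1]), -g[0] * g[1]))


def tile_boxes(cols: int, rows: int,
               tile: int = 448) -> List[Tuple[int, int, int, int]]:
    """Crop boxes (left, top, right, bottom) for an image that has already
    been resized to (cols * tile, rows * tile), listed row by row."""
    return [
        (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
        for r in range(rows)
        for c in range(cols)
    ]


if __name__ == "__main__":
    grid = choose_grid(1344, 896)
    print(grid, len(tile_boxes(*grid)))
```

Under these assumptions a wide document scan gets more columns than rows, so small text survives the resize instead of being squashed into a single square input.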

Core claim

We introduce AMALIA-VL, the first open-source instruction-tuned LVLM built natively for pt-PT, pairing a high-resolution vision encoder with dynamic image tiling and a fully open pt-PT-optimized language model via a learned connector. We contribute with a purposefully designed three-stage training process - vision-language alignment, general visual instruction tuning, and preference optimization - together with a pt-PT-centric multimodal data mix combining curated and translated public datasets with novel datasets that address the near-total absence of European Portuguese multimodal resources. Our evaluation shows that AMALIA-VL establishes a strong baseline for open-source pt-PT LVLMs.

What carries the argument

The three-stage training process applied to a pt-PT-centric multimodal data mix that combines curated public datasets, translations, and novel datasets created to fill the gap in European Portuguese resources.

If this is right

AMALIA-VL establishes a strong baseline for open-source pt-PT LVLMs.
Release of model weights, training data, construction pipelines, and machine-translated pt-PT evaluation benchmarks will help democratize pt-PT LVLM development.
The approach supplies novel datasets that directly address the near-total absence of European Portuguese multimodal resources.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Native construction may yield advantages on tasks involving European-specific cultural references or linguistic distinctions that mixed Portuguese training data obscures.
The same data-mix and staging pattern could be replicated for other language variants that current multilingual models treat as interchangeable.
Performance gaps would be more convincingly shown by testing on naturally occurring, untranslated pt-PT image-text pairs rather than machine-translated benchmarks.

Load-bearing premise

The curated and translated data mix plus three-stage training process produces a model that is meaningfully native to pt-PT rather than a routine adaptation of existing multilingual LVLMs.

What would settle it

Direct head-to-head results in which a multilingual LVLM fine-tuned on equivalent pt-PT data matches or exceeds AMALIA-VL on pt-PT evaluation benchmarks would undermine the claim that the native construction is required.
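A head-to-head comparison of this kind could be scored with a paired bootstrap over per-item correctness, so that a small accuracy gap is not mistaken for a real difference. The sketch below is one possible way to do that and is not taken from the paper; the function name `paired_bootstrap`, the 0/1 correctness encoding, and the resample count are assumptions.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap(correct_a: Sequence[int], correct_b: Sequence[int],
                     n_resamples: int = 2000, seed: int = 0) -> Tuple[float, float]:
    """Compare two models scored on the same benchmark items.

    correct_a[i] and correct_b[i] are 1 if the model answered item i
    correctly and 0 otherwise. Returns the observed accuracy difference
    (A minus B) and the fraction of bootstrap resamples in which A beats B.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("both models must be scored on the same non-empty item set")
    n = len(correct_a)
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    observed = sum(diffs) / n
    wins = 0
    for _ in range(n_resamples):
        # Resample items with replacement, keeping each item's A/B pair together.
        resampled = sum(diffs[rng.randrange(n)] for _ in range(n))
        if resampled > 0:
            wins += 1
    return observed, wins / n_resamples


if __name__ == "__main__":
    # Hypothetical per-item results, for illustration only.
    native = [1, 1, 0, 1, 1, 0, 1, 1]
    adapted = [1, 0, 0, 1, 1, 0, 1, 0]
    print(paired_bootstrap(native, adapted))
```

A win fraction near 0.5 would mean the two systems are indistinguishable on that benchmark, which is the outcome that would undercut the native-construction claim.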

Figures

Figures reproduced from arXiv: 2606.19100 by Afonso Simplício, David Semedo, Diogo Glória-Silva, Diogo Tavares, Gonçalo Vinagre, Inês Calvo, Inês Vieira, João Cardeira, João Magalhães, Manuel Letras da Luz, Rafael Ferreira.

**Figure 1.** AMALIA-VL is natively European Portuguese, grounding its answers in Portuguese visual culture, whereas general LVLMs hallucinate or fall back to Brazilian Portuguese. This creates a two-pronged challenge: models lack the multimodal capabilities to process pt-PT accurately, and the community lacks the benchmarks to measure pt-PT multimodal capabilities, as, to the best of our knowledge, no multimodal evalua…

**Figure 2.** Samples from several of our pt-PT focused synthetic datasets. 4.3 Stage 3: Preference Optimization. This stage used Direct Preference Optimization (DPO) [39] and sought to increase the model’s likelihood of generating preferred responses while minimizing undesirable patterns. Due to the lack of publicly available multimodal preference optimization datasets, we relied on automated synthetic preference annot…

**Figure 2.** Samples from several of our pt-PT focused synthetic datasets. InvoiceQA. This is an invoice-style document processing task that leverages FATURA [23], a public corpus of synthetic invoices for field extraction (e.g. date, buyer name, seller name, invoice number) and rejection of incorrect field/region associations. Each invoice mixes two task formats: field extraction and bounding box prediction. In the fo…
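The Figure 2 excerpt above says Stage 3 uses Direct Preference Optimization (DPO). As a reference point, this is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. The excerpt gives no hyperparameters, so `beta = 0.1` is an assumed placeholder, and the inputs are assumed to be summed token log-probabilities of each response under the trained policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    loss = -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
                                - (log pi(y_l) - log pi_ref(y_l))])
    beta = 0.1 is an assumption, not a value reported in the source.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) equals softplus(-margin); branch for numerical stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: zero margin, loss is log(2).
    print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
    # Policy has moved toward the chosen response: the loss drops.
    print(dpo_loss(-5.0, -15.0, -10.0, -10.0))
```

The loss only depends on how far the policy has moved relative to the reference, which is why synthetic preference pairs can be used without training a separate reward model.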

Original abstract

Large Vision and Language Models (LVLMs) have advanced rapidly, yet European Portuguese (pt-PT) remains systematically underserved by existing open-source multimodal models, which either conflate it with Brazilian Portuguese or severely under-represent it in their training data mixes. We introduce AMALIA-VL, the first open-source instruction-tuned LVLM built natively for pt-PT, pairing a high-resolution vision encoder with dynamic image tiling and a fully open pt-PT-optimized language model via a learned connector. We contribute with a purposefully designed three-stage training process - vision-language alignment, general visual instruction tuning, and preference optimization - together with a pt-PT-centric multimodal data mix combining curated and translated public datasets with novel datasets that address the near-total absence of European Portuguese multimodal resources. Our evaluation shows that AMALIA-VL establishes a strong baseline for open-source pt-PT LVLMs. We will release model weights, training data, and construction pipelines along with machine-translated pt-PT evaluation benchmarks to help democratize pt-PT LVLM development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AMALIA-VL supplies the first open pt-PT LVLM plus planned releases of weights, data, and benchmarks, but the abstract shows no numbers to support the performance claims.

The paper's core offering is the first open-source instruction-tuned LVLM built for European Portuguese, using a high-res vision encoder, dynamic tiling, and a pt-PT language model connected in three training stages. It also plans to release the model, the data mix, pipelines, and machine-translated benchmarks. That directly tackles an underserved language variant that existing models either lump with Brazilian Portuguese or ignore.

The work follows established LVLM recipes—vision-language alignment, instruction tuning, preference optimization—applied to a pt-PT-centric data collection that mixes public sets with new ones. Releasing everything is the practical value here; other groups working on Portuguese multimodal tasks can start from these artifacts instead of starting from scratch.

The main weakness is the complete absence of results. The abstract asserts a "strong baseline" but gives no scores, no ablations on the novel datasets, no comparison against multilingual models on pt-PT-specific items, and no error analysis. Without those, the claim that the training produces something meaningfully native rather than a standard connector adaptation rests on description alone. Translated data often carries artifacts, and nothing in the text shows this mix avoids them.

This is a resource paper aimed at the Portuguese NLP and vision-language community. Readers who need pt-PT multimodal data or a starting checkpoint will get immediate use from the releases. It is coherent on its own terms and engages the right literature, so it clears the bar for peer review even though the evaluation section will need substantial expansion.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce AMALIA-VL, the first open-source instruction-tuned LVLM built natively for European Portuguese (pt-PT). It pairs a high-resolution vision encoder with dynamic image tiling and a pt-PT-optimized language model via a learned connector, using a three-stage training process (vision-language alignment, general visual instruction tuning, and preference optimization) along with a pt-PT-centric multimodal data mix of curated/translated public datasets and novel datasets. The abstract asserts that evaluations establish a strong baseline for open-source pt-PT LVLMs and announces plans to release model weights, training data, pipelines, and machine-translated benchmarks.

Significance. If supported by quantitative evidence, the work would address a clear gap in open multimodal resources for pt-PT, providing a dedicated training pipeline and data contributions that could serve as a template for other underrepresented language variants.

major comments (2)

[Abstract] The assertion that 'Our evaluation shows that AMALIA-VL establishes a strong baseline for open-source pt-PT LVLMs' is unsupported by any quantitative results, tables, figures, ablation studies, or error analysis. This directly undermines the central claim of successful native training and performance.
[Abstract, paragraph 2] The claim that the pt-PT-centric data mix and three-stage training produce a model 'meaningfully native to pt-PT' (rather than a routine multilingual adaptation) lacks any dialect-specific metrics, pt-PT vs. pt-BR task deltas, or ablations removing the novel datasets, making the 'native' distinction unverifiable from the manuscript.
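One inexpensive example of a dialect-specific metric is a lexical marker count over model outputs, using word pairs that differ between the two variants. The sketch below is an illustration of the idea and not something the paper reports; the marker list is a small hand-picked sample, and the function name `dialect_counts` is hypothetical. Some Brazilian-preferred forms (for example "tela" and "celular") also occur in pt-PT with other meanings, so a count like this is only a crude first signal.

```python
import re
import unicodedata
from typing import Tuple


def _nfc(text: str) -> str:
    """Normalise to NFC so accented letters compare equal regardless of encoding."""
    return unicodedata.normalize("NFC", text)


# Illustrative lexical pairs: pt-PT form -> pt-BR form. Hand-picked sample only.
MARKER_PAIRS = {
    "autocarro": "ônibus",
    "comboio": "trem",
    "telemóvel": "celular",
    "ecrã": "tela",
    "frigorífico": "geladeira",
}
PT_MARKERS = {_nfc(k) for k in MARKER_PAIRS}
BR_MARKERS = {_nfc(v) for v in MARKER_PAIRS.values()}


def dialect_counts(text: str) -> Tuple[int, int]:
    """Return (pt-PT marker hits, pt-BR marker hits) for one model output."""
    words = re.findall(r"\w+", _nfc(text).lower())
    pt_hits = sum(word in PT_MARKERS for word in words)
    br_hits = sum(word in BR_MARKERS for word in words)
    return pt_hits, br_hits


if __name__ == "__main__":
    print(dialect_counts("O autocarro chegou antes do comboio."))
    print(dialect_counts("Peguei o ônibus e o trem."))
```

Aggregating these counts over a benchmark's outputs for each model would give a rough pt-PT vs. pt-BR delta of the kind the comment asks for, to be backed up by human judgment.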

minor comments (1)

[Abstract] The abstract is lengthy and packs multiple technical claims into single sentences; breaking it into clearer paragraphs would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and constructive feedback on the abstract. We agree that the current manuscript text does not provide the quantitative support needed for the stated claims and will revise to address this.

Referee: [Abstract] The assertion that 'Our evaluation shows that AMALIA-VL establishes a strong baseline for open-source pt-PT LVLMs' is unsupported by any quantitative results, tables, figures, ablation studies, or error analysis. This directly undermines the central claim of successful native training and performance.

Authors: We acknowledge this point. The manuscript as submitted does not include the supporting evaluation results, tables, or analyses referenced in the abstract. In the revised version we will add a full evaluation section with quantitative benchmarks, baseline comparisons, ablations, and error analysis to substantiate the claim, and we will revise the abstract to align with the new content.
revision: yes

Referee: [Abstract, paragraph 2] The claim that the pt-PT-centric data mix and three-stage training produce a model 'meaningfully native to pt-PT' (rather than a routine multilingual adaptation) lacks any dialect-specific metrics, pt-PT vs. pt-BR task deltas, or ablations removing the novel datasets, making the 'native' distinction unverifiable from the manuscript.

Authors: We agree that dialect-specific evidence is required to support the 'native' framing. The current manuscript does not contain pt-PT vs. pt-BR deltas or ablations isolating the novel datasets. We will incorporate these analyses in the revised manuscript, including targeted metrics and ablation studies, to make the distinction verifiable.
revision: yes

Circularity Check

0 steps flagged

No circularity in empirical LVLM construction

The paper presents an empirical model-building effort: data curation/translation, a three-stage training pipeline (alignment, instruction tuning, preference optimization), and a connector between vision encoder and language model. No equations, fitted parameters presented as predictions, uniqueness theorems, or first-principles derivations exist that could reduce to inputs by construction. The central claim of 'native' pt-PT status rests on the described data mix and training choices rather than any self-referential loop or renamed known result. Self-citations, if present, are not load-bearing for any mathematical step. This is a standard self-contained empirical contribution with no detectable circularity under the specified patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The contribution rests on the domain assumption that existing LVLMs under-represent pt-PT and that standard alignment plus instruction tuning can produce a native model when supplied with appropriate data. No free parameters or invented entities are specified in the abstract.

axioms (1)

domain assumption: Existing open-source LVLMs either conflate pt-PT with Brazilian Portuguese or severely under-represent it.
Stated as motivation in the abstract.

pith-pipeline@v0.9.1-grok · 5772 in / 1191 out tokens · 32557 ms · 2026-07-01T07:32:25.346488+00:00 · methodology
