arxiv: 2605.20626 · v1 · pith:G6BA5YLMnew · submitted 2026-05-20 · 💻 cs.CL · cs.AI · cs.CV

Retrieval-Augmented Long-Context Translation for Cultural Image Captioning: Gators submission for AmericasNLP 2026 shared task

Pith reviewed 2026-05-21 05:28 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CV
keywords image captioning · Indigenous languages · retrieval-augmented generation · low-resource translation · cultural captioning · shared task · vision-language models

The pith

A two-stage pipeline using Spanish intermediate captions and retrieval-augmented prompting achieves relative gains of more than 120 percent over the shared-task baseline on dev-set Indigenous language image captioning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a system for the AmericasNLP 2026 shared task that first generates Spanish captions for cultural images using Qwen2.5-VL and then translates them into Bribri, Guaraní, or Orizaba Nahuatl via retrieval-augmented many-shot prompting with Gemini 2.5 Flash. On the dev sets this yields relative improvements of 164.1 percent, 131.7 percent, and 122.6 percent over the shared-task baseline for the three languages respectively, and gains above 150 percent persist on the test sets for Bribri and Orizaba Nahuatl. The approach won the shared task overall and placed second of five finalists in human evaluations of the target-language output. A sympathetic reader would care because accurate automatic captioning in these languages can support cultural documentation and accessibility where vision models that generate the target languages directly do not yet exist.

Core claim

The authors establish that an intermediate Spanish caption generated by a vision-language model, followed by retrieval-augmented many-shot translation into the target Indigenous language, produces captions that substantially outperform the shared-task baseline on automatic metrics and secure first place in the overall competition.
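The two-stage structure described above can be sketched as a pair of pluggable callables. This is a minimal illustration, not the authors' code: the names `caption_image`, `Captioner`, and `Translator` are hypothetical, and the two stub functions stand in for the real Qwen2.5-VL and Gemini 2.5 Flash calls, whose client code is not shown in this summary.

```python
from typing import Callable

# Hypothetical type aliases: each stage is a plain callable, so real model
# clients or test stubs can be swapped in without changing the pipeline.
Captioner = Callable[[bytes], str]       # image bytes -> Spanish caption
Translator = Callable[[str, str], str]   # (Spanish caption, language) -> target caption


def caption_image(image: bytes, language: str,
                  captioner: Captioner, translator: Translator) -> dict:
    """Run stage 1 (image -> Spanish), then stage 2 (Spanish -> target language)."""
    spanish = captioner(image).strip()
    if not spanish:
        raise ValueError("stage 1 produced an empty Spanish caption")
    target = translator(spanish, language).strip()
    # Keep the pivot next to the output so it can be inspected on its own.
    return {"language": language, "spanish": spanish, "target": target}


# Stubs standing in for the real models (assumptions for illustration only).
def fake_captioner(image: bytes) -> str:
    return "Una mujer teje una canasta."


def fake_translator(spanish: str, language: str) -> str:
    return f"[{language}] {spanish}"


result = caption_image(b"\x89PNG", "Bribri", fake_captioner, fake_translator)
print(result["target"])
```

Returning the Spanish pivot together with the final caption is a design choice made here so that the intermediate output can be evaluated separately from the translation step.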

What carries the argument

Retrieval-augmented many-shot prompting from the Spanish pivot caption, which draws relevant in-domain examples to guide culturally appropriate generation in the low-resource target language.
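A rough sketch of how retrieved examples could be turned into a many-shot prompt follows. This summary does not say which retriever or prompt template the authors use, so the word-overlap (Jaccard) scoring, the function names, and the prompt wording are all assumptions; the target-side strings are placeholders, not real Bribri text.

```python
def tokenize(text: str) -> set[str]:
    """Lowercase word set with surrounding punctuation stripped."""
    return {tok.strip(".,;:!?¿¡").lower() for tok in text.split()} - {""}


def retrieve(query: str, corpus: list[tuple[str, str]], k: int) -> list[tuple[str, str]]:
    """Return the k (Spanish, target) pairs whose Spanish side overlaps most with the query."""
    q = tokenize(query)

    def score(pair: tuple[str, str]) -> float:
        s = tokenize(pair[0])
        union = q | s
        return len(q & s) / len(union) if union else 0.0

    return sorted(corpus, key=score, reverse=True)[:k]


def build_prompt(query: str, language: str, shots: list[tuple[str, str]]) -> str:
    """Format retrieved pairs as in-context examples, ending with the open query."""
    blocks = [f"Translate the Spanish caption into {language}."]
    for spanish, target in shots:
        blocks.append(f"Spanish: {spanish}\n{language}: {target}")
    blocks.append(f"Spanish: {query}\n{language}:")
    return "\n\n".join(blocks)


# Placeholder parallel corpus: target sides are dummy tokens, not real translations.
corpus = [
    ("El río está crecido.", "TGT_RIO"),
    ("Una mujer teje una canasta de palma.", "TGT_CANASTA"),
    ("Los niños juegan en la plaza.", "TGT_PLAZA"),
]
query = "Una mujer teje una canasta."
shots = retrieve(query, corpus, k=1)
prompt = build_prompt(query, "Bribri", shots)
print(prompt)
```

In a many-shot setting `k` would be much larger than 1; the small value here only keeps the printed prompt short.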

If this is right

  • Retrieval augmentation improves results only when large, in-domain corpora exist for the target language.
  • Synthetic data augmentation contributes roughly 28 chrF++ points to the Guaraní dev-set gains.
  • The system maintains over 150 percent relative improvement on Bribri and Orizaba Nahuatl test sets.
  • Automatic-metric wins do not guarantee top human-evaluation rank among finalists.
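The percentages in this list are relative improvements, while the 28 chrF++ figure is an absolute point difference. The arithmetic can be sketched as follows, assuming a hypothetical baseline of 10.0 chrF++; absolute baseline scores are not given in this summary, so that value is for illustration only.

```python
def relative_improvement(system: float, baseline: float) -> float:
    """Percent gain of `system` over `baseline`: (system - baseline) / baseline * 100."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (system - baseline) / baseline * 100.0


def score_after_gain(baseline: float, percent_gain: float) -> float:
    """Score implied by a relative gain over a given baseline."""
    return baseline * (1.0 + percent_gain / 100.0)


# HYPOTHETICAL baseline; the summary does not report absolute baseline scores.
baseline = 10.0
for language, gain in [("Bribri", 164.1), ("Guaraní", 131.7), ("Orizaba Nahuatl", 122.6)]:
    implied = score_after_gain(baseline, gain)
    print(f"{language}: {gain}% over {baseline:.1f} implies {implied:.2f}")
```

Any relative gain above 100 percent means the system score is more than double the baseline, which is why large percentages are easier to reach when the baseline score is low.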

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The pivot-through-Spanish strategy may transfer to other low-resource language pairs that lack native vision models.
  • The language-dependent nature of retrieval suggests future work should prioritize corpus size and domain match before applying the technique.
  • If the Spanish pivot proves robust, the same two-stage structure could support image-based cultural knowledge bases in additional Indigenous languages.

Load-bearing premise

The Spanish intermediate captions produced by Qwen2.5-VL are sufficiently accurate and culturally neutral to serve as a reliable pivot for the subsequent retrieval-augmented translation step.

What would settle it

A direct comparison in which the Spanish captions are replaced by noisy or culturally biased alternatives, or in which the retrieval component is removed entirely, would show whether the reported metric gains disappear.
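The comparison described above amounts to a two-factor ablation: pivot quality crossed with retrieval on or off. A small sketch of how such conditions might be enumerated is given below. The word-dropout corruption is one simple, assumed way to make a pivot caption noisy; it is not a procedure taken from the paper, and the function names are hypothetical.

```python
import itertools
import random


def corrupt_caption(caption: str, drop_rate: float, rng: random.Random) -> str:
    """Drop each word with probability `drop_rate`, always keeping at least one word."""
    words = caption.split()
    kept = [w for w in words if rng.random() >= drop_rate]
    return " ".join(kept or words[:1])


def ablation_conditions() -> list[dict]:
    """Cross pivot quality (clean / noisy) with retrieval (on / off)."""
    return [
        {"pivot": pivot, "retrieval": retrieval}
        for pivot, retrieval in itertools.product(["clean", "noisy"], [True, False])
    ]


rng = random.Random(0)  # fixed seed so the corruption is repeatable
caption = "Una mujer teje una canasta de palma."
for cond in ablation_conditions():
    pivot = caption if cond["pivot"] == "clean" else corrupt_caption(caption, 0.3, rng)
    print(cond, "->", pivot)
```

Each condition would then be scored with the shared-task metric; if the gains survive the noisy-pivot or no-retrieval cells, the corresponding component is not what carries the result.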

Figures

Figures reproduced from arXiv: 2605.20626 by Aashish Dhawan, Christan Grant, Christopher Driggers-Ellis, Daisy Zhe Wang, Dzmitry Kasinets.

Figure 1: Overview of the proposed two-stage image …
Original abstract

We present the University of Florida Gators submission to the AmericasNLP 2026 shared task on cultural image captioning for Indigenous languages. Our two-stage pipeline generates a Spanish intermediate caption with Qwen2.5-VL, then produces the target-language caption using retrieval-augmented many-shot prompting with Gemini 2.5 Flash. We achieve 164.1%, 131.7%, and 122.6% improvements over the shared task baseline for Bribri, Guaraní, and Orizaba Nahuatl captioning, respectively, in our dev set evaluation and maintain >150% improvements for the Bribri and Orizaba Nahuatl languages in the test set evaluation. We find retrieval is highly language-dependent, beneficial only for large, in-domain corpora, and that synthetic data augmentation accounts for around 28 chrF++ of the dev set Guaraní performance gain. Our submission is the overall winner of the shared task, placing second out of five finalist submissions in human evaluations of target-language captions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents the University of Florida Gators submission to the AmericasNLP 2026 shared task on cultural image captioning for Indigenous languages. It describes a two-stage pipeline that first generates Spanish intermediate captions using Qwen2.5-VL and then produces target-language captions (Bribri, Guaraní, Orizaba Nahuatl) via retrieval-augmented many-shot prompting with Gemini 2.5 Flash. The authors report relative chrF++ gains of 164.1%, 131.7%, and 122.6% over the shared-task baseline on the dev set, sustained >150% gains on test for two languages, and note that synthetic data augmentation accounts for ~28 chrF++ on Guaraní; their system won the shared task overall and placed second in human evaluations.

Significance. If the results hold, the work provides concrete evidence that retrieval-augmented translation combined with synthetic data can yield large gains on low-resource Indigenous language image captioning. The language-dependent retrieval findings and the quantified synthetic-data contribution are useful for practitioners. The shared-task win and human-evaluation ranking add practical weight, though fuller isolation of the vision component would strengthen claims about cultural fidelity.

major comments (1)
  1. [§3] §3 (Pipeline Description): No automatic metrics, human ratings, or error analysis are reported for the Spanish intermediate captions generated by Qwen2.5-VL. Because these captions are treated as the culturally accurate pivot for the subsequent retrieval-augmented translation step, the absence of validation leaves the source of the headline relative improvements (e.g., 164.1% for Bribri on dev) unisolated and risks conflating vision-model fidelity with LLM priors or prompting effects.
minor comments (2)
  1. [Abstract] Abstract: The string 'Guaraní' appears with LaTeX escaping; ensure consistent Unicode rendering throughout the manuscript.
  2. [§4] §4 (Ablations): The retrieval-corpus construction details (size, domain filtering, and exact selection criteria) are only summarized; adding a short table or paragraph would improve reproducibility of the language-dependent findings.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed review and constructive feedback on our submission to the AmericasNLP 2026 shared task. We address the major comment point by point below and have incorporated revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (Pipeline Description): No automatic metrics, human ratings, or error analysis are reported for the Spanish intermediate captions generated by Qwen2.5-VL. Because these captions are treated as the culturally accurate pivot for the subsequent retrieval-augmented translation step, the absence of validation leaves the source of the headline relative improvements (e.g., 164.1% for Bribri on dev) unisolated and risks conflating vision-model fidelity with LLM priors or prompting effects.

    Authors: We agree that validating the quality of the Spanish intermediate captions is important for isolating the contributions of each stage in our pipeline. In the revised manuscript, we have added a new subsection in §3 that reports automatic metrics (chrF++ and BLEU) for the Qwen2.5-VL generated Spanish captions against available reference Spanish captions from the dataset. Additionally, we include a brief error analysis highlighting common issues such as cultural nuances missed in the vision-to-text step. This revision helps clarify that the large gains in target languages stem from both the accurate Spanish pivot and the retrieval-augmented translation. We note, however, that the primary focus of the shared task and our evaluation remains on the Indigenous target languages, where human evaluations further support the overall pipeline effectiveness.

    revision: yes
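To make the metric named in this response concrete, here is a simplified character n-gram F-score in the spirit of chrF. It is a sketch only: it omits the word n-grams that chrF++ adds, and it is not a substitute for a reference implementation such as sacreBLEU. The function names, the default order of 6, and β = 2 are assumptions chosen to mirror common chrF settings.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing all whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Average n-gram precision and recall over orders 1..max_n, then combine as F-beta (0-100)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100.0 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


print(simple_chrf("una mujer teje", "una mujer teje"))
print(simple_chrf("una mujer teje", "una mujer canta"))
```

With β = 2, recall is weighted more heavily than precision, so a hypothesis that omits reference content is penalized more than one that adds extra characters.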

Circularity Check

0 steps flagged

No circularity: empirical pipeline measured against external shared-task baseline

The paper presents a practical two-stage system for the AmericasNLP 2026 shared task: Qwen2.5-VL generates Spanish image captions, followed by retrieval-augmented many-shot translation into target Indigenous languages using Gemini. Reported gains (e.g., 164.1% relative chrF++ on Bribri dev) are direct comparisons to the external shared-task baseline on held-out dev and test sets. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the derivation chain. The method description and ablation notes (e.g., synthetic augmentation contributing ~28 chrF++ on Guaraní) remain independent of the final performance numbers, which are externally benchmarked. This is a standard empirical submission paper with no internal reductions of outputs to inputs.

Axiom & Free-Parameter Ledger

4 free parameters · 2 axioms · 0 invented entities

The approach depends on the capabilities of two off-the-shelf pretrained models (Qwen2.5-VL and Gemini 2.5 Flash) and on the availability of suitable in-domain retrieval data; no new mathematical parameters or entities are introduced.

free parameters (4)
  • Selection of Qwen2.5-VL for Spanish captioning
    Model choice for first-stage image description
  • Selection of Gemini 2.5 Flash for target-language generation
    Model choice for retrieval-augmented second stage
  • Retrieval corpus construction and size
    In-domain data used for many-shot examples
  • Number of retrieved shots and prompt formatting
    Hyperparameters for the many-shot prompting step
axioms (2)
  • domain assumption: Vision-language models produce usable Spanish descriptions of culturally relevant images
    Foundation of the two-stage pivot
  • domain assumption: Large language models can translate or generate target-language captions when given retrieved in-domain examples
    Core mechanism of the retrieval-augmented stage

pith-pipeline@v0.9.0 · 5743 in / 1514 out tokens · 46622 ms · 2026-05-21T05:28:52.043878+00:00 · methodology
