A Pilot Study on Curator-Guided Multilingual Art Description for Blind and Low-Vision Audiences with Small Vision-Language Models

Andreas Triantafyllopoulos; Björn W. Schuller; George Margetis; Ioana Crihana; Iosif Tsangko

arXiv: 2605.31080 · v1 · pith:VIOL5P4Xnew · submitted 2026-05-29 · 💻 cs.MM · cs.AI · cs.CL · cs.CV · cs.HC

Iosif Tsangko, Andreas Triantafyllopoulos, George Margetis, Ioana Crihana, Björn W. Schuller

Pith reviewed 2026-06-28 19:51 UTC · model grok-4.3

classification 💻 cs.MM · cs.AI · cs.CL · cs.CV · cs.HC

keywords art description · blind and low-vision accessibility · vision-language models · LoRA adapters · multilingual adaptation · curator-guided descriptions · small VLMs

The pith

Language-specific LoRA adapters on a small vision-language model yield more stable and visually grounded art descriptions for Romanian and Serbian than a multilingual adapter.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This pilot study tests whether language-specific fine-tuning of a small (3B-parameter) vision-language model produces better curator-guided descriptions of artworks for blind and low-vision audiences in German, Romanian, and Serbian than a single multilingual fine-tune. It builds a parallel caption corpus and compares separate LoRA adapters per language against one shared multilingual adapter, using automatic metrics and an LLM judge calibrated against a small Romanian BLV pilot study. A sympathetic reader would care because museums often need on-premise models that respect privacy and copyright while serving diverse language communities. The results indicate that language-specific adaptation offers advantages in controllability and grounding for two of the three languages tested, Romanian and Serbian.

Core claim

Under our pilot setup, language-specific adapters show more stable controllability and visually grounded description quality for Romanian and Serbian, while multilingual adaptation remains competitive in German. The study frames these findings as deployment-oriented evidence for small on-premise VLMs and calls for larger BLV user studies and broader language coverage.

What carries the argument

Comparison of language-specific LoRA adapters versus a single multilingual adapter on the fixed Qwen2.5-VL-3B-Instruct backbone, evaluated via lexical metrics, embedding metrics, and an LLM-as-Judge protocol.
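To make the adapter comparison concrete, here is a minimal sketch of the low-rank update that LoRA adds to a frozen weight matrix, following the general formulation of Hu et al. (effective weight = W0 + (alpha / r) · B · A). The matrix sizes, rank, and scaling value are illustrative assumptions, not the settings used in the study, and the helper names are hypothetical.

```python
# Minimal LoRA sketch in plain Python (no external libraries).
# All sizes and values are illustrative assumptions, not the paper's settings.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w0, b, a, alpha, r):
    """Return W0 + (alpha / r) * B @ A, the weight seen once the adapter is applied."""
    delta = matmul(b, a)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]

def lora_param_count(d_out, d_in, r):
    """Trainable parameters for one adapted matrix: B is d_out x r, A is r x d_in."""
    return r * (d_out + d_in)

# Frozen 2x3 backbone weight with a rank-1 adapter (hypothetical numbers).
W0 = [[1.0, 0.0, 0.0],
      [0.0, 1.0, 0.0]]
B = [[1.0], [2.0]]        # d_out x r
A = [[0.5, 0.0, 1.0]]     # r x d_in
W_adapted = lora_effective_weight(W0, B, A, alpha=2.0, r=1)

# One shared multilingual adapter vs. one adapter per language
# (German, Romanian, Serbian) for a single hypothetical 2048x2048 matrix.
per_matrix = lora_param_count(d_out=2048, d_in=2048, r=8)
shared_total = per_matrix
language_specific_total = 3 * per_matrix
```

Under these assumptions the language-specific setup stores three times the adapter parameters of the shared one, while the backbone stays fixed in both cases; only one adapter needs to be active for a given request.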

If this is right

Small on-premise vision-language models can support multilingual art description under privacy constraints.
Language-specific adaptation may be preferable for languages like Romanian and Serbian to achieve stable quality.
Multilingual adaptation can still work well for German under the same training budget.
These results provide initial evidence favoring deployment of adapted small VLMs in museum settings for BLV accessibility.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending the approach to additional languages could reveal whether language-specific adapters consistently outperform multilingual ones or if it depends on linguistic similarity to the base model.
A full-scale study with BLV participants across all three languages would test if the LLM-as-Judge aligns with actual user preferences.
Integrating curator guidance more deeply into the adaptation process might further improve description relevance.

Load-bearing premise

The LLM-as-Judge protocol calibrated against a small Romanian BLV pilot study provides a reliable proxy for human evaluation of description quality across languages.
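One way to probe this premise is to measure rank agreement between judge scores and human ratings on the same items, separately for each language. The following is a minimal sketch, assuming paired per-item scores are available. The numbers are placeholders, the function names are hypothetical, and the abstract does not say which agreement statistic, if any, the paper reports.

```python
# Sketch: Spearman rank agreement between an LLM judge and human raters.
# The scores below are made-up placeholders, not data from the study.

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

human_scores = [4, 2, 5, 3, 1]   # hypothetical human ratings per description
judge_scores = [5, 2, 4, 3, 1]   # hypothetical LLM-judge ratings, same items
rho = spearman(human_scores, judge_scores)
```

Running the same check on German and Serbian ratings, if such ratings were collected, would indicate whether the Romanian calibration transfers.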

What would settle it

A larger-scale human evaluation by BLV participants in German and Serbian showing that the multilingual adapter produces descriptions rated as equally or more controllable and grounded than the language-specific ones would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.31080 by Andreas Triantafyllopoulos, Björn W. Schuller, George Margetis, Ioana Crihana, Iosif Tsangko.

**Figure 1.** End-to-end pipeline. ARTEMIS artwork images and metadata (style, author/work IDs, emotion) feed curator-written

**Figure 2.** LLM-as-Judge calibration on the Romanian (RO) pilot. Left: Mean LS-ML trait-score difference (

Original abstract

Blind and low-vision (BLV) audiences remain underserved by visual art descriptions, particularly across languages and in museum settings where privacy and intellectual-property constraints may favour small on-premise vision-language models (VLMs). This pilot study investigates curator-guided multilingual art description with Qwen2.5-VL-3B-Instruct for German, Romanian, and Serbian. We construct a parallel BLV-oriented caption corpus from artwork images and metadata, and compare language-specific LoRA adapters with a single multilingual adapter under a fixed backbone and training budget. Evaluation combines automatic lexical and embedding-based metrics with an LLM-as-Judge protocol calibrated against a small Romanian BLV pilot study. Under our pilot setup, language-specific adapters show more stable controllability and visually grounded description quality for Romanian and Serbian, while multilingual adaptation remains competitive in German. We frame these findings as deployment-oriented evidence for small on-premise VLMs, and highlight the need for larger BLV user studies and broader language coverage before drawing general conclusions about multilingual accessibility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

This pilot study provides a practical comparison of language-specific and multilingual LoRA adapters for BLV art descriptions, but its evaluation relies on an LLM judge that was calibrated only on Romanian and is not validated for German or Serbian.

The one thing to know is that this is an early-stage pilot testing language-specific versus multilingual adapters on Qwen2.5-VL-3B for art descriptions in three languages, aimed at blind audiences. It points to some differences in performance but the supporting evidence is not yet strong.

What stands out as new is the direct comparison for German, Romanian, and Serbian using a fixed backbone and budget, plus the emphasis on curator-guided and on-premise setups to handle privacy and IP issues in museums. The paper does well in identifying the accessibility gap and in calling for larger BLV user studies rather than overclaiming.

The main concern is the evaluation setup. The LLM-as-Judge protocol was calibrated against a small Romanian study, and the abstract gives no sign of validation or agreement checks for German or Serbian. That leaves open the possibility that reported advantages for language-specific adapters in those languages stem from judge biases rather than actual output quality. Without visible data, error bars, or the full protocol, the comparative findings on controllability and visual grounding are hard to assess. The automatic metrics are mentioned but not detailed here.

Overall, the work shows clear thinking about deployment realities for small models. It is aimed at HCI and accessibility researchers who care about multilingual VLM applications. A reader looking for ideas on adapter training for specialized tasks could find it useful as a starting point.

I would send this to peer review. The topic matters and the approach is reasonable for a pilot, even if it needs more robust validation to stand on its own.

Referee Report

2 major / 1 minor

Summary. The manuscript reports a pilot study using Qwen2.5-VL-3B-Instruct to generate curator-guided art descriptions for blind and low-vision audiences in German, Romanian, and Serbian. It constructs a parallel BLV-oriented caption corpus and compares language-specific LoRA adapters against a single multilingual adapter under fixed backbone and training budget. Evaluation uses automatic lexical/embedding metrics plus an LLM-as-Judge protocol calibrated on a small Romanian BLV pilot; the abstract concludes that language-specific adapters yield more stable controllability and visually grounded quality for Romanian and Serbian while multilingual adaptation remains competitive for German.

Significance. If the reported language-dependent performance differences hold under rigorous validation, the work supplies deployment-oriented evidence favoring small on-premise VLMs with targeted adaptation for multilingual museum accessibility under privacy constraints. The pilot framing and call for larger BLV user studies are appropriately cautious, but the current evidence base is too thin to support strong claims about relative adapter effectiveness across languages.

major comments (2)

[Abstract] Abstract (LLM-as-Judge protocol): The protocol is calibrated exclusively against a small Romanian BLV pilot study yet is used to underwrite claims of differential controllability and visual grounding quality across German, Romanian, and Serbian. No cross-language human validation, inter-rater agreement statistics, or transfer checks are mentioned, so the reported language-specific advantages rest on an unverified assumption that the judge generalizes; this directly undermines the central comparative claim.
[Abstract] Abstract (evaluation and results): No quantitative values for the automatic metrics, sample sizes, error bars, or controllability measures are supplied, nor is the precise definition of 'stable controllability' or 'visually grounded description quality' given. Without these, the abstract's comparative findings cannot be assessed for statistical or practical significance.
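The error bars requested in the second comment could be obtained, for example, with a percentile bootstrap over per-artwork score differences between the language-specific (LS) and multilingual (ML) adapters. This sketch assumes paired scores per artwork. The data are invented placeholders, the function name is hypothetical, and nothing here reproduces the paper's actual analysis.

```python
import random

def bootstrap_mean_ci(diffs, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of paired differences."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample the paired differences with replacement.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = round((1 - level) / 2 * n_resamples)
    hi_idx = round((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]

# Hypothetical judge scores for the same six artworks under each adapter.
ls_scores = [4.0, 3.5, 4.5, 3.0, 4.0, 3.5]
ml_scores = [3.5, 3.5, 4.0, 3.0, 3.0, 3.5]
diffs = [a - b for a, b in zip(ls_scores, ml_scores)]

mean_diff = sum(diffs) / len(diffs)
ci_low, ci_high = bootstrap_mean_ci(diffs)
```

An interval that includes zero would suggest the LS-ML gap is not distinguishable from noise at this sample size, which is the kind of check the abstract currently does not allow a reader to make.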

minor comments (1)

[Abstract] Abstract: The phrase 'under our pilot setup' is repeated without clarifying what constraints define the setup (e.g., exact training budget, LoRA rank, or curator-guidance protocol), reducing clarity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and for highlighting important limitations in how the pilot results are presented. We address each major comment below with proposed revisions to the abstract that better reflect the scope and evidence of the study.

Referee: [Abstract] Abstract (LLM-as-Judge protocol): The protocol is calibrated exclusively against a small Romanian BLV pilot study yet is used to underwrite claims of differential controllability and visual grounding quality across German, Romanian, and Serbian. No cross-language human validation, inter-rater agreement statistics, or transfer checks are mentioned, so the reported language-specific advantages rest on an unverified assumption that the judge generalizes; this directly undermines the central comparative claim.

Authors: We agree that the LLM-as-Judge protocol was calibrated solely on the Romanian pilot and that no cross-language human validation or inter-rater statistics are reported. This is an inherent limitation of the current pilot. We will revise the abstract to explicitly state the calibration language, replace the comparative phrasing with language-specific observations, and strengthen the existing caveat that larger cross-lingual BLV studies are required before general conclusions can be drawn.

revision: partial

Referee: [Abstract] Abstract (evaluation and results): No quantitative values for the automatic metrics, sample sizes, error bars, or controllability measures are supplied, nor is the precise definition of 'stable controllability' or 'visually grounded description quality' given. Without these, the abstract's comparative findings cannot be assessed for statistical or practical significance.

Authors: The provided abstract is a concise summary and omits numerical results and explicit definitions for brevity. We will revise it to include representative quantitative values from the automatic metrics, the number of artworks and captions per language, and short operational definitions of 'stable controllability' and 'visually grounded description quality' so that the pilot findings can be evaluated on their own terms.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pilot with no derivations or self-referential steps

The provided abstract describes a pilot study comparing language-specific vs. multilingual LoRA adapters on a fixed VLM backbone, using automatic metrics and an LLM-as-Judge protocol. No equations, derivations, fitted parameters presented as predictions, or self-citations appear in the text. All claims rest on reported empirical comparisons rather than any reduction to inputs by construction. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model, free parameters, axioms, or invented entities were found; this is a purely empirical pilot study, and the assessment is based on the abstract alone.

pith-pipeline@v0.9.1-grok · 5720 in / 933 out tokens · 16679 ms · 2026-06-28T19:51:32.102443+00:00 · methodology

Reference graph

Works this paper leans on

38 extracted references · 24 canonical work pages · 8 internal anchors

[1] Panos Achlioptas, Maks Ovsjanikov, Kilichbek Haydarov, Mohamed Elhoseiny, and Leonidas J. Guibas. 2021. ArtEmis: Affective Language for Visual Art. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 11569–11579. doi:10.1109/CVPR46437.2021.01140

[2] Rahaf Alharbi and Pa Lor. 2024. Misfitting With AI: How Blind People Verify and Contest AI Errors. In Proc. Int. ACM SIGACCESS Conf. on Computers and Accessibility (ASSETS). doi:10.1145/3663548.3675659

[3] Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, others, and Percy Liang. 2021. On the Opportunities and Risks of Foundation Models. arXiv:2108.07258 [cs.LG] doi:10.48550/arXiv.2108.07258

[4] Chris Callison-Burch, Miles Osborne, and Philipp Koehn. 2006. Re-evaluating the Role of BLEU in Machine Translation Research. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL).

[5] Wei-Lin Chiang et al. 2024. Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv:2403.04132 [cs.CL]

[6] Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, and Chunhua Shen. 2023. MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices. CoRR abs/2312.16886 (2023). arXiv:2312.16886 [cs.CV] https://arxiv.org/abs/2312.16886

[7] Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, and Chunhua Shen. 2024. MobileVLM V2: Faster and Stronger Baseline for Vision Language Model. arXiv preprint arXiv:2402.03766 (2024). https://arxiv.org/abs/2402.03766

[8] Jonathan H. Clark, Dan Garrette, Iulia Turc, and John Wieting. 2022. Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation. Transactions of the Association for Computational Linguistics 10 (2022), 73–91. doi:10.1162/tacl_a_00448

[9] Yann Dubois et al. 2024. Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators. arXiv:2404.04475 [cs.CL]

[10] Desmond Elliott, Stella Frank, Khalil Simaan, and Lucia Specia. 2016. Multi30K: Multilingual English-German Image Descriptions. In Proc. 5th Workshop on Vision and Language. Association for Computational Linguistics, Berlin, Germany, 70–

[11] doi:10.18653/v1/W16-3210

[12] European Union. 2024. Regulation (EU) 2024/1689 (Artificial Intelligence Act). Official Journal of the European Union. https://eur-lex.europa.eu/eli/reg/2024/1689/oj

[13] Manuel Gil-Martín, Cristina Luna-Jiménez, Sergio Esteban-Romero, Marcos Estecha-Garitagoitia, Fernando Fernández-Martínez, and Luis Fernando D’Haro

[14] A dataset of synthetic art dialogues with ChatGPT. Scientific Data 11, 1 (2024), 825

[15] Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc’Aurelio Ranzato, Francisco Guzman, and Angela Fan. 2022. The Flores-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation. Transactions of the Association for Computational Linguistics 10 (2022), 522–538. doi:10.1162/tacl_a_00474

[16] Danna Gurari, Yinan Zhao, Meng Zhang, and Nilavra Bhattacharya. 2020. Captioning Images Taken by People Who Are Blind. In Proc. European Conf. on Computer Vision (ECCV). Springer, 417–434

[17] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-Rank Adaptation of Large Language Models. arXiv preprint arXiv:2106.09685 (2021). https://arxiv.org/abs/2106.09685

[18] Georgina Kleege. 2015. Audio Description Described: An Autistic/Blind Account. Disability Studies Quarterly (2015). https://dsq-sds.org/index.php/dsq/article/view/925/1109

[19] Tom Kocmi and Christian Federmann. 2023. Large Language Models Are State-of-the-Art Evaluators of Translation Quality. arXiv:2302.14520 [cs.CL] https://arxiv.org/abs/2302.14520

[20] Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop. 74–81. https://aclanthology.org/W04-1013/

[21] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2511–2522. doi:10.18653/v1/2023.emnlp-main.153

[22] National Institute of Standards and Technology. 2023. Artificial Intelligence Risk Management Framework (AI RMF 1.0). Technical Report NIST AI 100-1. NIST. https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf

[23] Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser

[24] Why We Need New Evaluation Metrics for NLG. arXiv:1707.06875 [cs.CL]

[25] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL). 311–318. doi:10.3115/1073083.1073135

[26] Elisa Perego. 2019. Into the Language of Museum Audio Descriptions: A Corpus-Based Study. Perspectives 27, 3 (2019), 333–349. doi:10.1080/0907676X.2018.1544648

[27] Ehud Reiter and Anja Belz. 2009. An Investigation into the Validity of Some Metrics for Automatically Evaluating Natural Language Generation Systems. Computational Linguistics 35, 4 (2009), 529–558. doi:10.1162/coli.2009.35.4.35405

[28] Phillip Rust, Jonas Pfeiffer, Ivan Vulic, Sebastian Ruder, and Iryna Gurevych. 2021. How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models. In Proc. 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP) (Volume 1: Lo... doi:10.18653/v1/2021.acl-long.243

[29] Elliot Salisbury, Ece Kamar, and Meredith Ringel Morris. 2017. Toward Scalable Social Alt Text: Conversational Crowdsourcing as a Tool for Refining Vision-to-Language Technology for the Blind. Proc. AAAI Conf. on Human Computation and Crowdsourcing (HCOMP) (2017). https://ojs.aaai.org/index.php/HCOMP/article/view/13301

[30] Björn W. Schuller, Adria Mallol-Ragolta, Alejandro Peña Almansa, Iosif Tsangko, Mostafa M. Amin, Anastasia Semertzidou, Lukas Christ, and Shahin Amiriparian

[31] Affective Computing Has Changed: The Foundation Model Disruption. arXiv (2024). arXiv:2409.08907 [cs.CL] https://arxiv.org/abs/2409.08907

[32] Joel Snyder. 2014. The Visual Made Verbal: A Comprehensive Training Manual and Guide to the History and Applications of Audio Description. Dog Ear Publishing

[33] Krishna Srinivasan, Karthik Raman, Jiecao Chen, Michael Bendersky, and Marc Najork. 2021. WIT: Wikipedia-based Image Text Dataset for Multilingual Multimodal Research. In Proc. 44th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR). Association for Computing Machinery, Virtual Event, Canada, 1095–1104. doi:10.1145/3404835.3463257

[34] Andreas Triantafyllopoulos, Yannik Terhorst, Iosif Tsangko, Franziska B. Pokorny, Katharina D. Bartl-Pokorny, and Björn W. Schuller. 2024. Large Language Models for Mental Health. arXiv (2024). arXiv:2411.11880 [cs.CL] https://arxiv.org/abs/2411.11880

[35] Andreas Triantafyllopoulos, Iosif Tsangko, Anton Gebhard, Annamaria Mesaros, Tuomas Virtanen, and Björn W. Schuller. 2025. Computer Audition: From Task-Specific Machine Learning to Foundation Models. Proc. IEEE 113, 8 (2025), 1793–1832. doi:10.1109/JPROC.2025.3608062

[36] Iosif Tsangko et al. 2025. Reading Smiles: Proxy Bias in Foundation Models for Facial Emotion Recognition. IEEE Access (2025). doi:10.1109/ACCESS.2025.3636968

[37] Jiaan Wang, Yunlong Liang, Fandong Meng, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, and Jie Zhou. 2023. Is ChatGPT a Good NLG Evaluator? A Preliminary Study. arXiv:2303.04048 [cs.CL] https://arxiv.org/abs/2303.04048

[38] Lianmin Zheng et al. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685 [cs.CL]
