arxiv: 2607.02381 · v1 · pith:CB7MT6TEnew · submitted 2026-07-02 · 💻 cs.CL

HULAT2 at MER-TRANS 2026: Governed Multi-Agent Simplification for Spanish Easy-to-Read Generation

Pith reviewed 2026-07-03 14:32 UTC · model grok-4.3

classification 💻 cs.CL
keywords Easy-to-Read generation · multi-agent workflow · text simplification · Spanish machine translation · shared task · quality signals · LangGraph

The pith

Signal-guided multi-agent routing outperforms linear regeneration for Spanish Easy-to-Read generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports three automatic runs submitted to the Spanish track of the MER-TRANS 2026 shared task on Easy-to-Read translation. Two runs employ a LangGraph multi-agent workflow that combines Gemini 2.5 Flash and RigoChat-7B-v2, uses parallel generation, and routes edits via Event-Condition-Action logic driven by five internal quality signals. The third run is a linear generate-evaluate-regenerate baseline using only RigoChat with prompt engineering and LoRA adaptation. On the official SARI metric, the base multi-agent run scores 44.0543, the version with added lexical support scores 43.1049, and the baseline scores 38.5136. The authors conclude that the signal-guided routing produced higher reference-based scores than the linear baseline and that the lexical layer did not yield further gains.
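
For orientation, the linear generate-evaluate-regenerate pattern behind the baseline run can be sketched as a plain retry loop. This is a minimal illustration, not the authors' implementation: the function names, the acceptance threshold, the retry budget, and the toy generator and evaluator are all assumptions standing in for RigoChat calls and the team's own evaluator.

```python
from typing import Callable, Tuple


def generate_evaluate_regenerate(
    source: str,
    generate: Callable[[str, int], str],
    evaluate: Callable[[str, str], float],
    threshold: float = 0.5,   # hypothetical acceptance threshold
    max_attempts: int = 3,    # hypothetical retry budget
) -> Tuple[str, float, int]:
    """Generate, score, and regenerate until a candidate is acceptable.

    Returns the best candidate seen, its score, and the number of attempts used.
    """
    best_text, best_score = "", float("-inf")
    for attempt in range(1, max_attempts + 1):
        candidate = generate(source, attempt)
        score = evaluate(source, candidate)
        if score > best_score:
            best_text, best_score = candidate, score
        if score >= threshold:
            return best_text, best_score, attempt
    return best_text, best_score, max_attempts


def toy_generate(source: str, attempt: int) -> str:
    # Stand-in for an LLM call: keep fewer words on each attempt.
    words = source.split()
    keep = max(1, len(words) - 2 * attempt)
    return " ".join(words[:keep])


def toy_evaluate(source: str, candidate: str) -> float:
    # Stand-in for a quality evaluator: reward shorter outputs.
    return 1.0 - len(candidate.split()) / len(source.split())


text, score, attempts = generate_evaluate_regenerate(
    "one two three four five six seven eight nine ten",
    toy_generate,
    toy_evaluate,
)
```

The loop has a single path: every rejected candidate leads to the same regenerate step, whatever the reason for rejection. That is the property the signal-guided routing in RUN1 and RUN2 replaces.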

Core claim

In the Spanish track of MER-TRANS 2026, the signal-guided multi-agent workflow achieved the highest SARI score among the team's submissions (44.0543) and outperformed the linear regeneration baseline (38.5136), while adding a lexical-support layer produced a lower score (43.1049).
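
The gaps behind this claim are simple arithmetic on the three reported scores. The short Python sketch below only restates them; the variable and function names are illustrative.

```python
# Official SARI scores as reported for the three HULAT2 runs.
sari = {"RUN1": 44.0543, "RUN2": 43.1049, "RUN3": 38.5136}


def gap(a: str, b: str) -> float:
    """SARI difference a - b, rounded to the four decimals used in the report."""
    return round(sari[a] - sari[b], 4)


gap_routing = gap("RUN1", "RUN3")  # multi-agent workflow vs. linear baseline
gap_lexical = gap("RUN2", "RUN1")  # effect of switching on the lexical layer
best_run = max(sari, key=sari.get)
```

The multi-agent workflow leads the baseline by about 5.54 SARI points, while the lexical-support layer lowers the score by about 0.95 points.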

What carries the argument

Event-Condition-Action routing driven by internal quality signals for semantic fidelity, readability, lexical simplicity, syntactic clarity and factual consistency.
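
To make the mechanism concrete, here is a minimal, library-free sketch of Event-Condition-Action routing over the five named signals. Everything beyond the signal names is an assumption made for illustration: the event name, the [0, 1] score scale, the thresholds, the action labels, and the first-match priority order are hypothetical and are not taken from the paper, which implements its workflow in LangGraph.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# The five internal signals named in the paper; the [0, 1] scale is an assumption.
SIGNALS = (
    "semantic_fidelity",
    "readability",
    "lexical_simplicity",
    "syntactic_clarity",
    "factual_consistency",
)


@dataclass(frozen=True)
class Rule:
    event: str                                      # when the rule is considered
    condition: Callable[[Dict[str, float]], bool]   # predicate over the signals
    action: str                                     # next step if the rule fires


# Hypothetical rules and thresholds, ordered by priority (first match wins).
RULES: List[Rule] = [
    Rule("candidate_scored", lambda s: s["factual_consistency"] < 0.7, "regenerate"),
    Rule("candidate_scored", lambda s: s["semantic_fidelity"] < 0.7, "regenerate"),
    Rule("candidate_scored", lambda s: s["lexical_simplicity"] < 0.6, "lexical_edit"),
    Rule("candidate_scored", lambda s: s["syntactic_clarity"] < 0.6, "syntactic_edit"),
    Rule("candidate_scored", lambda s: s["readability"] < 0.6, "syntactic_edit"),
]


def route(event: str, signals: Dict[str, float], trace: List[str]) -> str:
    """Return the next action for an event and record which rule fired."""
    missing = [name for name in SIGNALS if name not in signals]
    if missing:
        raise ValueError(f"missing signals: {missing}")
    for index, rule in enumerate(RULES):
        if rule.event == event and rule.condition(signals):
            trace.append(f"{event}: rule {index} -> {rule.action}")
            return rule.action
    trace.append(f"{event}: no rule matched -> accept")
    return "accept"


trace: List[str] = []
good = dict.fromkeys(SIGNALS, 0.9)
hard_words = {**good, "lexical_simplicity": 0.5}
unfaithful = {**hard_words, "factual_consistency": 0.5}

decisions = [route("candidate_scored", s, trace) for s in (good, hard_words, unfaithful)]
```

In this sketch a candidate with only a lexical problem is sent to a targeted edit, while one that also fails factual consistency is regenerated because that rule has higher priority. The `trace` list plays the role of the traceable decisions mentioned in the paper.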

If this is right

  • Signal-guided multi-agent routing yields higher SARI than linear regeneration in this Easy-to-Read setting.
  • An added lexical-support layer based on glossaries does not raise reference-based scores.
  • Traceable decisions from the routing allow controlled editing during generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Segment-level and document-level breakdowns could show whether the SARI gains correspond to improved readability for target readers.
  • The same routing logic could be applied to other languages in the shared task to test cross-lingual consistency.
  • Human adequacy ratings would be needed to check whether higher automatic scores reflect better factual consistency and user suitability.

Load-bearing premise

The five internal quality signals serve as reliable proxies for the official SARI and BERTScore metrics.

What would settle it

A measured correlation near zero between the internal signals and official SARI scores on the same test segments would show that the routing decisions do not track the leaderboard metric.
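
Such a check could look roughly like the sketch below, which computes Pearson and Spearman correlations between one internal signal and per-segment SARI. It assumes that segment-level values for both are available, which the paper does not report; the data are invented toy numbers and the function names are illustrative.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented toy values for four segments (not data from the paper).
signal_readability = [0.2, 0.4, 0.6, 0.8]
segment_sari = [30.0, 35.0, 41.0, 50.0]

rho = spearman(signal_readability, segment_sari)
r = pearson(signal_readability, segment_sari)
```

Running this per signal on the real test segments would show which signals, if any, move with the leaderboard metric. Values near zero for all five would support the objection above.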

Figures

Figures reproduced from arXiv: 2607.02381 by Lourdes Moreno, Marco Antonio Sanchez-Escudero, Miguel Domínguez-Gómez, Paloma Martínez.

Figure 1: LangGraph-based multi-agent workflow used for RUN1 and RUN2 in the MER-TRANS 2026 Spanish submissions.

3.3. Multi-Agent Workflow. RUN1 and RUN2 were generated with the same LangGraph-based multi-agent architecture. LangGraph was used to implement a stateful workflow with specialised nodes, conditional routing, retry loops and traceable intermediate decisions [34]. The system combines Gemini 2.5 Flash [35] a…
Original abstract

This paper describes the participation of HULAT2-UC3M in the Spanish track of MER-TRANS 2026, a shared task on multilingual Easy-to-Read translation. Three fully automatic Spanish runs were submitted. RUN1 and RUN2 used a LangGraph-based multi-agent workflow combining Gemini 2.5 Flash and RigoChat-7B-v2, parallel generation strategies, internal quality signals, Event-Condition-Action routing, controlled editing and traceable decisions. RUN1 used the base workflow, while RUN2 activated an additional lexical-support layer based on a glossary and lexical resources. RUN3 was a RigoChat-based generate-evaluate-regenerate baseline with prompt engineering and LoRA-based adaptation. The official leaderboard reports BLEU-Orig, BLEU-Gold, SARI and BERTScore. During development, additional internal signals were also inspected, including semantic fidelity, readability, lexical simplicity, syntactic clarity and factual consistency. According to official SARI, RUN1 was the best HULAT2 run, with 44.0543 points, followed by RUN2 with 43.1049 and RUN3 with 38.5136. These results indicate that, in this task setting, signal-guided multi-agent routing outperformed the linear regeneration baseline. They also show that adding lexical support did not automatically improve reference-based scores. Further segment-level and document-level analysis are required to assess readability, factual consistency and user-oriented adequacy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports HULAT2's participation in the Spanish track of the MER-TRANS 2026 shared task on Easy-to-Read translation. It describes three runs: RUN1 and RUN2 implement a LangGraph multi-agent workflow using Gemini 2.5 Flash and RigoChat-7B-v2 with parallel generation, internal quality signals (semantic fidelity, readability, lexical simplicity, syntactic clarity, factual consistency), Event-Condition-Action routing, and controlled editing; RUN2 adds a lexical-support glossary layer. RUN3 is a linear generate-evaluate-regenerate baseline using only RigoChat-7B-v2 with LoRA adaptation. Official leaderboard results are reported: RUN1 achieves the highest SARI (44.0543), followed by RUN2 (43.1049) and RUN3 (38.5136). The authors conclude that signal-guided multi-agent routing outperformed the linear baseline and that lexical support did not improve reference-based scores.

Significance. The work contributes a detailed system description and official shared-task scores for a governed multi-agent approach to Spanish text simplification. If the SARI advantage can be isolated to the routing mechanism, it would support the value of internal quality signals and traceable multi-agent control in simplification pipelines. The explicit comparison of multiple runs and the use of official metrics are strengths. However, the absence of controlled ablations limits the ability to draw firm conclusions about the routing component.

major comments (2)
  1. [Results] Results: The SARI difference between RUN1 (44.0543) and RUN3 (38.5136) is presented as evidence that signal-guided multi-agent routing outperformed the linear regeneration baseline, yet RUN1 combines Gemini 2.5 Flash with RigoChat-7B-v2 while RUN3 uses only RigoChat-7B-v2 with LoRA; no ablation holding the underlying LLMs fixed is reported, so the performance delta cannot be attributed to the Event-Condition-Action routing, parallel generation, or internal signals rather than model capability.
  2. [Methods] Methods: The Event-Condition-Action routing decisions depend on internal quality signals whose correlation with the official SARI and BERTScore metrics is not validated or quantified anywhere in the manuscript, leaving open whether these signals are reliable proxies for the leaderboard outcomes.
minor comments (2)
  1. [Abstract] Abstract and Results: No error bars, segment-level breakdowns, or statistical tests are provided for the reported score differences, which weakens the strength of the outperformance claim.
  2. [Conclusion] The manuscript states that 'further segment-level and document-level analysis are required' but does not include any such analysis.
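
One way to address the first minor comment would be a paired bootstrap over segments. The sketch below is an assumption-laden illustration: it presumes per-segment scores for two runs on the same segments and treats the system score as their mean, which may not match how the official evaluator aggregates SARI. The scores are invented toy numbers and the function name is hypothetical.

```python
import random
from typing import Dict, Sequence


def paired_bootstrap(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Dict[str, object]:
    """Bootstrap the mean per-segment difference (A minus B) between two runs."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    means = []
    for _ in range(n_resamples):
        # Resample segments with replacement, keeping the A/B pairing intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    low = means[int(0.025 * n_resamples)]
    high = means[int(0.975 * n_resamples) - 1]
    win_rate = sum(m > 0 for m in means) / n_resamples
    return {"mean_diff": sum(diffs) / n, "ci95": (low, high), "win_rate": win_rate}


# Invented per-segment scores for two runs (not data from the paper).
run_a = [45.0, 40.0, 50.0, 42.0, 47.0]
run_b = [38.0, 39.0, 41.0, 36.0, 40.0]

result = paired_bootstrap(run_a, run_b)
```

A 95% interval that excludes zero, together with a win rate near 1, would give the reported gap the statistical backing the referee asks for. An interval that straddles zero would not.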

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on result attribution and signal validation. We respond point by point below.

  1. Referee: [Results] Results: The SARI difference between RUN1 (44.0543) and RUN3 (38.5136) is presented as evidence that signal-guided multi-agent routing outperformed the linear regeneration baseline, yet RUN1 combines Gemini 2.5 Flash with RigoChat-7B-v2 while RUN3 uses only RigoChat-7B-v2 with LoRA; no ablation holding the underlying LLMs fixed is reported, so the performance delta cannot be attributed to the Event-Condition-Action routing, parallel generation, or internal signals rather than model capability.

    Authors: We agree that the SARI difference cannot be attributed solely to the routing, parallel generation, or internal signals, as RUN1 and RUN3 employ different underlying models. The manuscript will be revised to clarify this limitation and rephrase the conclusion to state that the multi-agent workflow with mixed models achieved the highest score among our submissions, without claiming isolation of the routing effect.

    Revision: yes

  2. Referee: [Methods] Methods: The Event-Condition-Action routing decisions depend on internal quality signals whose correlation with the official SARI and BERTScore metrics is not validated or quantified anywhere in the manuscript, leaving open whether these signals are reliable proxies for the leaderboard outcomes.

    Authors: The manuscript describes the internal signals and their use in routing but provides no quantitative correlation with official metrics. We will revise to explicitly note this as a limitation and add a brief discussion of how the signals were inspected during development, without claiming they are validated proxies.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical shared-task reporting

The paper reports participation in the MER-TRANS 2026 shared task by describing three runs (RUN1/RUN2 multi-agent workflows vs RUN3 linear baseline) and citing official leaderboard scores (SARI 44.0543 vs 38.5136). No equations, derivations, fitted parameters presented as predictions, or self-citation chains appear. The central claim rests on external competition metrics rather than any internal reduction to inputs or ansatzes. This is self-contained empirical reporting against an external benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivation or theoretical model; the work is an empirical system description for a competition. No free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.1-grok · 5818 in / 1080 out tokens · 22445 ms · 2026-07-03T14:32:33.210757+00:00 · methodology

Reference graph

Works this paper leans on

41 extracted references · 21 canonical work pages

  1. [1]

    International Organization for Standardization, ISO 21801-1:2020 Cognitive accessibility – Part 1: General guidelines, https://www.iso.org/standard/71711.html, 2020

  2. [2]

    International Organization for Standardization, ISO 24495-1:2023 Plain language – Part 1: Governing principles and guidelines, https://www.iso.org/standard/78907.html, 2023

  3. [3]

    Asociación Española de Normalización, UNE 153101:2018 EX. Lectura Fácil. Pautas y recomendaciones para la elaboración de documentos, https://www.une.org/, 2018

  4. [4]

    M. Nomura, G. S. Nielsen, B. Tronbacke, Guidelines for easy-to-read materials, https://www.ifla.org/wp-content/uploads/2019/05/assets/hq/publications/professional-report/120.pdf, 2010

  5. [5]

    Inclusion Europe, Information for all: European standards for making information easy to read and understand, https://www.inclusion-europe.eu/easy-to-read-standards-guidelines/, 2009

  6. [6]

    H. Saggion, Automatic Text Simplification, Synthesis Lectures on Human Language Technologies, Springer Cham, 2017. URL: https://link.springer.com/book/10.1007/978-3-031-02166-4. doi:10.1007/978-3-031-02166-4

  7. [7]

    S. Nisioi, S. Štajner, S. P. Ponzetto, L. P. Dinu, Exploring neural text simplification models, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 85–91. URL: https://aclanthology.org/P17-2014/. doi:10.18653/v1/P17-2014

  8. [8]

    P. Martínez, A. Ramos, L. Moreno, Exploring large language models to generate easy to read content, Frontiers in Computer Science 6 (2024) 1394705. URL: https://doi.org/10.3389/fcomp.2024.1394705. doi:10.3389/fcomp.2024.1394705

  9. [9]

    K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: A method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 2002, pp. 311–318. URL: https://aclanthology.org/P02-1040/. doi:10.3115/1073083.1073135

  10. [10]

    W. Xu, C. Napoles, E. Pavlick, Q. Chen, C. Callison-Burch, Optimizing statistical machine translation for text simplification, Transactions of the Association for Computational Linguistics 4 (2016) 401–415. URL: https://aclanthology.org/Q16-1029/. doi:10.1162/tacl_a_00107

  11. [11]

    T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation with BERT, in: International Conference on Learning Representations, 2020. URL: https://openreview.net/forum?id=SkeHuCVFDr

  12. [12]

    F. Alva-Manchego, C. Scarton, L. Specia, The (Un)Suitability of automatic evaluation metrics for text simplification, Computational Linguistics 47 (2021) 861–889. doi:10.1162/coli_a_00418

  13. [13]

    M. Maddela, Y. Dou, D. Heineman, W. Xu, LENS: A learnable evaluation metric for text simplification, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Toronto, Canada, 2023, pp. 16383–16408. URL: https://aclanthology.org/2023.acl-long.905/. doi:10.18653/v1/2023.acl-long.905

  14. [14]

    M. Maddela, F. Alva-Manchego, Adapting sentence-level automatic metrics for document-level simplification evaluation, in: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Albuquerque, New Mexico, 2025, pp. 64...

  15. [15]

    L. Cripwell, J. Legrand, C. Gardent, Evaluating document simplification: On the importance of separately assessing simplicity and meaning preservation, in: Proceedings of the 3rd Workshop on Tools and Resources for People with READing Difficulties (READI) @ LREC-COLING 2024, ELRA and ICCL, Torino, Italia, 2024, pp. 1–14. URL: https://aclanthology.org/2024...

  16. [16]

    S. Agrawal, M. Carpuat, Do text simplification systems preserve meaning? a human evaluation via reading comprehension, Transactions of the Association for Computational Linguistics 12 (2024) 432–448. URL: https://aclanthology.org/2024.tacl-1.24/. doi:10.1162/tacl_a_00653

  17. [17]

    L. Carrer, A. Säuberli, M. Kappus, S. Ebling, Towards holistic human evaluation of automatic text simplification, in: Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024, ELRA and ICCL, Torino, Italia, 2024, pp. 71–80. URL: https://aclanthology.org/2024.humeval-1.7/

  18. [18]

    H. Saggion, S. Štajner, D. Ferrés, K. C. Sheang, M. Shardlow, K. North, M. Zampieri, Findings of the TSAR-2022 shared task on multilingual lexical simplification, in: Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022), Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 271–283....

  19. [19]

    S. Štajner, H. Saggion, M. Shardlow, F. Alva-Manchego (Eds.), Proceedings of the Second Workshop on Text Simplification, Accessibility and Readability, INCOMA Ltd., Shoumen, Bulgaria, Varna, Bulgaria, 2023. URL: https://aclanthology.org/2023.tsar-1.0/

  20. [20]

    M. Shardlow, H. Saggion, F. Alva-Manchego, M. Zampieri, K. North, S. Štajner, R. Stodden (Eds.), Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024), Association for Computational Linguistics, Miami, Florida, USA, 2024. URL: https://aclanthology.org/2024.tsar-1.0/. doi:10.18653/v1/2024.tsar-1.0

  21. [21]

    M. Shardlow, F. Alva-Manchego, K. North, R. Stodden, H. Saggion, N. Khallaf, A. Hayakawa (Eds.), Proceedings of the Fourth Workshop on Text Simplification, Accessibility and Readability (TSAR 2025), Association for Computational Linguistics, Suzhou, China, 2025. URL: https://aclanthology.org/2025.tsar-1.0/. doi:10.18653/v1/2025.tsar-1.0

  22. [22]

    F. Alva-Manchego, R. Stodden, J. M. Imperial, A. Barayan, K. North, H. Tayyar Madabushi, Findings of the TSAR 2025 shared task on readability-controlled text simplification, in: Proceedings of the Fourth Workshop on Text Simplification, Accessibility and Readability (TSAR 2025), Association for Computational Linguistics, Suzhou, China, 2025, pp. 116–130. ...

  23. [23]

    R. Wilkens, R. Cardon, A. Todirascu, N. Gala (Eds.), Proceedings of the 3rd Workshop on Tools and Resources for People with READing Difficulties (READI) @ LREC-COLING 2024, ELRA and ICCL, Torino, Italia, 2024. URL: https://aclanthology.org/2024.readi-1.0/

  24. [24]

    B. Botella-Gil, I. Espinosa-Zaragoza, A. Bonet-Jover, M. Madina, L. Molino Piñar, P. Moreda, I. Gonzalez-Dios, M. T. Martín-Valdivia, L. A. Ureña-López, Overview of clears at iberlef 2025: Challenge for plain language and easy-to-read adaptation for spanish texts, Procesamiento del Lenguaje Natural (2025) 393–400. doi:10.26342/2025-75-28

  25. [25]

    H. Saggion, M. Tareh, N. Khallaf, D. Adanza, S. Bott, N. Pérez-Rojas, A. Rascón-Alcaina, S. Szasz, Overview of MER-TRANS at IberLEF 2026: First shared task on multilingual easy-to-read translation, Procesamiento del Lenguaje Natural 77 (2026)

  26. [26]

    A. Bonet-Jover, J. Á. González-Barba, L. Chiruzzo, Overview of IberLEF 2026: Natural language processing challenges for spanish and other iberian languages, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2026), co-located with the 42nd Conference of the Spanish Society for Natural Language Processing (SEPLN 2026), CEUR-WS.org, 2026

  27. [27]

    S. Bott, V. Riegler, H. Saggion, A. Rascón Alcaina, N. Khallaf, A multilingual human annotated corpus of original and easy-to-read texts to support access to democratic participatory processes, in: Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), European Language Resources Association (ELRA), Palma, Mallorca, Spain, ...

  28. [28]

    N. Khallaf, S. Sharoff, Automatic difficulty classification of Arabic sentences, in: N. Habash, H. Bouamor, H. Hajj, W. Magdy, W. Zaghouani, F. Bougares, N. Tomeh, I. Abu Farha, S. Touileb (Eds.), Proceedings of the Sixth Arabic Natural Language Processing Workshop, Association for Computational Linguistics, Kyiv, Ukraine (Virtual), 2021, pp. 105–114. URL...

  29. [29]

    MER-TRANS Organizers, MER-TRANS 2026: Multilingual easy-to-read translation shared task, https://lastus-taln-upf.github.io/mertrans-iberlef-2026/, 2026

  30. [30]

    LaSTUS-TALN-UPF, MER-TRANS Evaluator 2026: Evaluator for shared-task submissions, https://github.com/LaSTUS-TALN-UPF/mertrans-evaluator-2026, 2026

  31. [31]

    HULAT-UC3M, Bilingual Simplification Metrics: A unified evaluation framework for text simplification (ES/EN), https://github.com/hulat-group/bilingual_simplification_metrics, 2026

  32. [32]

    L. Campillos-Llanos, A. R. Terroba Reinares, S. Zakhir Puig, A. Valverde-Mateos, A. Capllonch-Carrión, Building a comparable corpus and a benchmark for spanish medical text simplification, Procesamiento del Lenguaje Natural (2022) 189–196. doi:10.26342/2022-69-16

  33. [33]

    L. Moreno, P. Martínez, A human-in/on-the-loop framework for accessible text generation, in: Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), European Language Resources Association (ELRA), Palma, Mallorca, Spain, 2026, pp. 7236–7247. URL: https://lrec.elra.info/lrec2026-main-574. doi:10.63317/2gtngp2nmx63

  34. [34]

    LangChain, LangGraph overview, 2026. URL: https://docs.langchain.com/oss/python/langgraph/overview

  35. [35]

    Google AI for Developers, Gemini 2.5 flash, 2026. URL: https://ai.google.dev/gemini-api/docs/models/gemini-2.5-flash

  36. [36]

    G. Santamaría Gómez, G. García Subies, P. Gutiérrez Ruiz, M. González Valero, N. Fuertes, H. Montoro Zamorano, C. Muñoz Sanz, L. Rosado Plaza, N. Aldama García, D. Betancur Sánchez, K. Sushkova, M. Guerrero Nieto, Á. Barbero Jiménez, RigoChat 2: An adapted language model to spanish using a bounded dataset and reduced hardware, 2025. URL: https://arxiv.org...

  37. [37]

    L. Moreno, P. Martínez, AIGov-Access: AI Governance for Accessibility-Oriented Text Adaptation — MER-TRANS 2026 Profile, 2026. URL: https://doi.org/10.5281/zenodo.20855013. doi: 10.5281/ zenodo.20855013

  38. [38]

    E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, in: International Conference on Learning Representations,

  39. [39]

    URL: https://openreview.net/forum?id=nZeVKeeFYf9

  40. [40]

    D. Beauchemin, H. Saggion, R. Khoury, MeaningBERT: Assessing meaning preservation between sentences, Frontiers in Artificial Intelligence 6 (2023). URL: https://www.frontiersin.org/articles/10.3389/frai.2023.1223924. doi:10.3389/frai.2023.1223924

  41. [41]

    J. Fernández Huerta, Medidas sencillas de lecturabilidad, Consigna 214 (1959) 29–32.

A. Public Resources Used for Calibration and Lexical Support

The following public resources were used during system calibration, prompt refinement or glossary construction:

  • Plena Inclusión España, The Spanish Constitution in Easy-to-Read Format, Easy-to-Read publication...