Qwen Goes Brrr: Off-the-Shelf RAG for Ukrainian Multi-Domain Document Understanding
Pith reviewed 2026-05-12 05:06 UTC · model grok-4.3
The pith
A retrieval-augmented pipeline with structure-preserving PDF chunking and answer-option-aware reranking reaches 96 percent accuracy on Ukrainian multi-domain document QA.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors built a RAG pipeline that chunks PDFs while preserving their structure, retrieves passages with a dense embedder, reranks them with a model fine-tuned to consider both the question and the answer options, and then generates the answer from the top passages with a large language model. On a held-out split this raised Recall@1 from 0.70 to 0.79 and answer accuracy from 0.93 to 0.97, with public and private leaderboard scores of 0.945 and 0.960. The work claims that these two design choices, structure preservation and answer-space-aware relevance scoring, outperform the addition of complex downstream heuristics under competition rules.
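A rough sketch of that three-stage design (retrieval, answer-aware reranking, constrained selection), with a toy word-overlap scorer standing in for the Qwen embedding, reranking, and generation models; every function name here is invented for illustration and is not the authors' code:

```python
# Toy sketch of the three-stage pipeline. overlap_score stands in for the
# dense embedder (Qwen3-Embedding-8B), the fine-tuned reranker
# (Qwen3-Reranker-8B), and the generator (Qwen3-32B).

def overlap_score(text_a: str, text_b: str) -> float:
    """Toy relevance score: Jaccard overlap of lowercase tokens."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / max(len(a | b), 1)

def retrieve(question: str, chunks: list[str], k: int = 10) -> list[str]:
    """Stage 1: dense retrieval over structure-preserving chunks."""
    return sorted(chunks, key=lambda c: overlap_score(question, c), reverse=True)[:k]

def rerank(question: str, options: list[str], candidates: list[str], k: int = 2) -> list[str]:
    """Stage 2: rerank conditioned on the question AND the answer options."""
    query = question + " " + " ".join(options)
    return sorted(candidates, key=lambda c: overlap_score(query, c), reverse=True)[:k]

def answer(question: str, options: list[str], passages: list[str]) -> str:
    """Stage 3: constrained selection; pick the option best supported by
    the concatenated top passages (the paper prompts an LLM instead)."""
    context = " ".join(passages)
    return max(options, key=lambda o: overlap_score(o, context))
```

The key design point sits in `rerank`: the query carries the answer options, so relevance is estimated against the answer space rather than against the question alone.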
What carries the argument
Contextual chunking of PDFs paired with reranking that conditions on both the question and the set of answer options.
If this is right
- Reranking that incorporates answer options improves the quality of retrieved passages for multiple-choice questions.
- Limiting generation to the top two reranked passages is enough to reach high answer accuracy.
- Preserving the original layout and order in PDF chunking aids retrieval in multi-domain document collections.
- Off-the-shelf large language models can serve as the backbone for both retrieval and answer selection in this setting.
Where Pith is reading between the lines
- The same pipeline might work for other low-resource languages that have similar PDF-based document collections.
- Answer-aware reranking could reduce the need for task-specific fine-tuning in other retrieval-augmented QA applications.
- If document structure varies greatly across domains, the chunking method may need adaptation for best results.
Load-bearing premise
The test questions and documents in the shared task represent the distribution of real-world Ukrainian multi-domain document understanding problems, and the benefits of the reranking step will appear on entirely new document collections without any additional tuning.
What would settle it
Measuring performance on a fresh collection of Ukrainian PDFs drawn from different domains or with altered question formats; a substantial drop below the reported accuracy would indicate the approach does not generalize as claimed.
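Such a check amounts to recomputing the retrieval metric on fresh data. A minimal Recall@k implementation, illustrative rather than the authors' evaluation code:

```python
def recall_at_k(rankings: list[list[str]], gold: list[str], k: int = 1) -> float:
    """Fraction of questions whose gold passage id appears in the top-k ranking.

    rankings: one ranked list of passage ids per question.
    gold: the correct passage id per question, in the same order.
    """
    hits = sum(g in ranked[:k] for ranked, g in zip(rankings, gold))
    return hits / len(gold)
```

On the paper's held-out split this metric moves from 0.6957 to 0.7935 at k=1 once the reranker is applied.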
Original abstract
We participated in the Fifth UNLP shared task on multi-domain document understanding, where systems must answer Ukrainian multiple-choice questions from PDF collections and localize the supporting document and page. We propose a retrieval-augmented pipeline built around three ideas: contextual chunking of PDFs, question-aware dense retrieval and reranking conditioned on both the question and answer options, and constrained answer generation from a small set of reranked passages. Our final system uses Qwen3-Embedding-8B for retrieval, a fine-tuned Qwen3-Reranker-8B for passage ranking, and Qwen3-32B for answer selection. On a held-out split, reranking improves Recall@1 from 0.6957 to 0.7935, while using the top-2 reranked passages raises answer accuracy from 0.9348 to 0.9674. Our best leaderboard run reached 0.9452 on the public leaderboard and 0.9598 on the private leaderboard. Our results suggest that, under strict code-competition constraints, preserving document structure and making relevance estimation aware of the answer space are more effective than adding complex downstream heuristics.
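The "contextual chunking" idea, as described, ties each chunk to its document, page, and nearest heading so that a retrieved passage can be localized. A minimal sketch under that reading; the heading heuristic and data shapes are invented here, and the real pipeline parses PDF structure with a dedicated parser:

```python
def contextual_chunks(pages, doc_id):
    """Illustrative structure-preserving chunker: every chunk keeps its
    document id, page number, and most recent heading, so a retrieved
    chunk can be localized to a document and page as the task requires.
    `pages` is a list of (page_number, lines) pairs; the all-caps test
    is a crude stand-in for real PDF structure parsing."""
    chunks, heading = [], ""
    for page_no, lines in pages:
        buf = []
        for line in lines:
            if line.isupper() and len(line.split()) <= 8:  # heading heuristic
                if buf:
                    chunks.append({"doc": doc_id, "page": page_no,
                                   "heading": heading, "text": " ".join(buf)})
                    buf = []
                heading = line
            else:
                buf.append(line)
        if buf:
            chunks.append({"doc": doc_id, "page": page_no,
                           "heading": heading, "text": " ".join(buf)})
    return chunks
```

Keeping the `(doc, page, heading)` context attached to each chunk is what lets the downstream stages both retrieve and localize in one pass.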
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports on a system for the Fifth UNLP shared task on Ukrainian multi-domain document understanding, where the task is to answer multiple-choice questions from PDF collections and localize supporting documents and pages. The proposed off-the-shelf RAG pipeline features contextual chunking to preserve document structure, question-aware dense retrieval using Qwen3-Embedding-8B, reranking with a fine-tuned Qwen3-Reranker-8B conditioned on the question and answer options, and constrained generation using Qwen3-32B from the top reranked passages. On a held-out split, the system shows Recall@1 improving from 0.6957 to 0.7935 with reranking and answer accuracy from 0.9348 to 0.9674 with top-2 passages. Leaderboard results are 0.9452 public and 0.9598 private. The authors conclude that under strict constraints, structure preservation and answer-space-aware relevance estimation outperform complex downstream heuristics.
Significance. Assuming the empirical results are robust, this work contributes a practical demonstration that targeted use of large language models for retrieval and reranking, with emphasis on document structure and answer option awareness, can deliver strong performance in a challenging multilingual, multi-domain setting. It provides concrete evidence favoring simpler RAG designs over heuristic-heavy approaches in competition-like environments, which may generalize to other low-resource language document understanding tasks. The specific model choices and metric improvements offer a useful reference point for the community.
major comments (2)
- Abstract: The central claim that 'preserving document structure and making relevance estimation aware of the answer space are more effective than adding complex downstream heuristics' lacks direct comparative evidence. The reported results only show gains from reranking (Recall@1 from 0.6957 to 0.7935) and from top-2 passage usage (accuracy from 0.9348 to 0.9674) within the proposed pipeline; no ablations or baselines that incorporate complex heuristics (such as multi-hop LLM reasoning or ensemble retrieval) are provided to support the superiority inference.
- Evaluation on held-out split: The numeric lifts are presented without error bars, confidence intervals, or statistical tests, and the manuscript provides no details on the construction of the held-out split or its representativeness relative to the leaderboard test distribution. This weakens support for the generalizability claim in the abstract.
minor comments (1)
- The abstract would benefit from a short overview sentence listing the three core pipeline components before the results, to improve immediate readability for readers unfamiliar with the shared task.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of our work's significance. We address the two major comments point by point below, with plans for targeted revisions where appropriate.
Point-by-point responses
- Referee: Abstract: The central claim that 'preserving document structure and making relevance estimation aware of the answer space are more effective than adding complex downstream heuristics' lacks direct comparative evidence. The reported results only show gains from reranking (Recall@1 from 0.6957 to 0.7935) and from top-2 passage usage (accuracy from 0.9348 to 0.9674) within the proposed pipeline; no ablations or baselines that incorporate complex heuristics (such as multi-hop LLM reasoning or ensemble retrieval) are provided to support the superiority inference.
Authors: We agree that the manuscript lacks direct ablations or baselines against complex heuristic approaches such as multi-hop LLM reasoning or ensemble retrieval. The claim in the abstract is an inference drawn from our pipeline's strong performance (0.9598 on the private leaderboard) in the shared task under strict constraints, where we avoided such methods. Since we lack access to other participants' internal designs, direct comparisons are not possible. We will revise the abstract to qualify the language, stating that our results suggest these design choices are effective in this constrained setting rather than claiming broad superiority, and we will add a clarifying sentence in the discussion section. Revision: partial
- Referee: Evaluation on held-out split: The numeric lifts are presented without error bars, confidence intervals, or statistical tests, and the manuscript provides no details on the construction of the held-out split or its representativeness relative to the leaderboard test distribution. This weakens support for the generalizability claim in the abstract.
Authors: We acknowledge that error bars, confidence intervals, and statistical tests are absent, as all experiments were single runs under shared-task time and compute limits. The held-out split was formed by randomly sampling 20% of the organizers' training data with domain stratification to preserve multi-domain coverage; we will add this explicit description to the evaluation section. We will also note the single-run limitation and its implications for generalizability claims. Recomputing with multiple seeds for error bars is not feasible in the current revision timeline, but the observed lifts align with the final leaderboard results. Revision: partial
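The split the authors describe (a domain-stratified random 20% of the training data) can be sketched as follows; the function and argument names are illustrative, since the paper describes the procedure only at this level of detail:

```python
import random
from collections import defaultdict

def stratified_split(examples, domain_of, held_out_frac=0.2, seed=0):
    """Hold out a fixed fraction of examples per domain, so the held-out
    split preserves the multi-domain coverage of the full training set."""
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for ex in examples:
        by_domain[domain_of(ex)].append(ex)
    train, held_out = [], []
    for items in by_domain.values():
        rng.shuffle(items)
        n_held = max(1, round(held_out_frac * len(items)))
        held_out.extend(items[:n_held])
        train.extend(items[n_held:])
    return train, held_out
```

Stratifying per domain (rather than sampling 20% globally) is what keeps rare domains represented in the held-out split.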
Circularity Check
No circularity: empirical results rest on external leaderboard evaluation
Full rationale
The manuscript describes an empirical RAG pipeline for a shared-task competition, reporting Recall@1, accuracy, and leaderboard scores obtained from held-out splits and public/private test sets. No equations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes smuggled via prior work appear in the provided text. The central suggestion that structure preservation and answer-aware reranking outperform complex heuristics is an interpretive claim drawn from the observed gains (e.g., reranking lifting Recall@1 from 0.6957 to 0.7935), not a derivation that reduces to its own inputs by construction. Evaluation relies on an external competition benchmark rather than internally generated quantities, satisfying the criteria for a self-contained empirical result.
Axiom & Free-Parameter Ledger
free parameters (1)
- number of reranked passages passed to generation (top-2 in the final system)
axioms (1)
- Domain assumption: question-and-answer-option-aware dense retrieval plus reranking will surface the correct supporting passage for multiple-choice QA.