arxiv: 2604.08644 · v1 · submitted 2026-04-09 · 💻 cs.CL

Recognition: no theorem link

EXAONE 4.5 Technical Report

Eunbi Choi , Kibong Choi , Sehyun Chun , Seokhee Hong , Junwon Hwang , Hyojin Jeon , Ahra Jo , Hyunjik Jo

show 50 more authors

Yeonsik Jo Joonkee Kim Seonghwan Kim Soyeon Kim Sunkyoung Kim Yireun Kim Yongil Kim Changhun Lee Haeju Lee Jinsik Lee Kyungmin Lee Sangha Park Kwangrok Ryoo Minju Seo Sejong Yang Heuiyeen Yeen Hwan Chang Stanley Jungkyu Choi Yejin Choi Kyubeen Han Joonwon Jang Kijeong Jeon Geunyeong Jeong Gerrard Jeongwon Jo Jiyeon Jung Daeseong Kim Dohoon Kim Dohyun Kim Hyunseo Kim Minu Kim Myoungshin Kim Youchul Kim Byungoh Ko Christopher Lee Edward Hwayoung Lee Honglak Lee Jiyoung Lee Sangeun Lee Seungwon Lim Woohyung Lim Jueun Mun Jaewoo Park Jimin Park Jinho Park Yongmin Park Wooseok Seo Yongwoo Song Sihyuk Yi Kyungjae Yoo Sangyeon Yoon

Authors on Pith no claims yet

Pith reviewed 2026-05-10 17:44 UTC · model grok-4.3

classification 💻 cs.CL

keywords vision language modeldocument understandingKorean contextual reasoningmultimodal pretraininglong contextopen-weight modelEXAONE 4.5

0 comments

The pith

EXAONE 4.5 adds a visual encoder to its language base and trains on document-heavy data to match general benchmarks while leading similar-scale models in document understanding and Korean reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EXAONE 4.5 as the first open-weight vision-language model from LG AI Research. It builds on the EXAONE 4.0 language model by integrating a dedicated visual encoder for joint pretraining on text and images. Training uses large-scale data with special emphasis on document-centric corpora aligned to industrial needs. This produces competitive scores on general benchmarks but clear outperformance on document tasks and Korean contextual reasoning. The model also supports context lengths of 256K tokens for longer enterprise reasoning.

Core claim

EXAONE 4.5 is built by integrating a dedicated visual encoder into the EXAONE 4.0 framework to enable native multimodal pretraining. It is trained on large-scale data with careful curation that emphasizes document-centric corpora. This design produces substantial gains in document understanding and related tasks while delivering broad improvements in general language capabilities and supporting context up to 256K tokens.

What carries the argument

Integration of a dedicated visual encoder into the existing EXAONE 4.0 language model framework, combined with targeted curation of document-centric training corpora for multimodal pretraining.

Load-bearing premise

The reported gains in document understanding and Korean reasoning come primarily from the visual encoder integration and document-centric data curation rather than differences in model scale, total training compute, or benchmark selection.

What would settle it

Train an otherwise identical model at the same scale and compute budget but using generic non-document data and no visual encoder, then compare its scores on document understanding and Korean reasoning benchmarks against those reported for EXAONE 4.5.

Figures

Figures reproduced from arXiv: 2604.08644 by Ahra Jo, Byungoh Ko, Changhun Lee, Christopher Lee, Daeseong Kim, Dohoon Kim, Dohyun Kim, Edward Hwayoung Lee, Eunbi Choi, Gerrard Jeongwon Jo, Geunyeong Jeong, Haeju Lee, Heuiyeen Yeen, Honglak Lee, Hwan Chang, Hyojin Jeon, Hyunjik Jo, Hyunseo Kim, Jaewoo Park, Jimin Park, Jinho Park, Jinsik Lee, Jiyeon Jung, Jiyoung Lee, Joonkee Kim, Joonwon Jang, Jueun Mun, Junwon Hwang, Kibong Choi, Kijeong Jeon, Kwangrok Ryoo, Kyubeen Han, Kyungjae Yoo, Kyungmin Lee, Minju Seo, Minu Kim, Myoungshin Kim, Sangeun Lee, Sangha Park, Sangyeon Yoon, Sehyun Chun, Sejong Yang, Seokhee Hong, Seonghwan Kim, Seungwon Lim, Sihyuk Yi, Soyeon Kim, Stanley Jungkyu Choi, Sunkyoung Kim, Woohyung Lim, Wooseok Seo, Yejin Choi, Yeonsik Jo, Yireun Kim, Yongil Kim, Yongmin Park, Yongwoo Song, Youchul Kim.

read the original abstract

This technical report introduces EXAONE 4.5, the first open-weight vision language model released by LG AI Research. EXAONE 4.5 is architected by integrating a dedicated visual encoder into the existing EXAONE 4.0 framework, enabling native multimodal pretraining over both visual and textual modalities. The model is trained on large-scale data with careful curation, particularly emphasizing document-centric corpora that align with LG's strategic application domains. This targeted data design enables substantial performance gains in document understanding and related tasks, while also delivering broad improvements across general language capabilities. EXAONE 4.5 extends context length up to 256K tokens, facilitating long-context reasoning and enterprise-scale use cases. Comparative evaluations demonstrate that EXAONE 4.5 achieves competitive performance in general benchmarks while outperforming state-of-the-art models of similar scale in document understanding and Korean contextual reasoning. As part of LG's ongoing effort toward practical industrial deployment, EXAONE 4.5 is designed to be continuously extended with additional domains and application scenarios to advance AI for a better life.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces EXAONE 4.5, the first open-weight vision-language model from LG AI Research. It integrates a dedicated visual encoder into the EXAONE 4.0 framework for native multimodal pretraining, trains on large-scale curated data with emphasis on document-centric corpora, extends context length to 256K tokens, and claims competitive performance on general benchmarks while outperforming state-of-the-art models of similar scale in document understanding and Korean contextual reasoning.

Significance. If the performance claims are substantiated with verifiable details, the work would represent a useful open-weight multimodal model with targeted strengths in document processing and Korean-language tasks, supporting industrial deployment. The absence of quantitative results, model specifications, and controlled experiments, however, prevents assessment of whether the reported gains are attributable to the stated architectural and data choices rather than scale or compute differences.

major comments (1)

[Abstract] Abstract: The headline claim that EXAONE 4.5 'outperforms state-of-the-art models of similar scale in document understanding and Korean contextual reasoning' is unsupported by any benchmark scores, baseline comparisons, parameter counts, training FLOPs, or ablation studies. Without these, the 'similar scale' assertion cannot be verified and the causal role of the visual encoder plus document-centric curation remains untestable.

minor comments (1)

[Abstract] Abstract: The phrase 'comparative evaluations demonstrate...' does not identify the specific benchmarks, metrics, or baseline models used, reducing clarity for readers seeking to reproduce or extend the results.

Simulated Author's Rebuttal

1 responses · 1 unresolved

We thank the referee for the constructive feedback on our technical report. We have revised the abstract to better substantiate the performance claims with specific references to benchmarks, model scales, and evaluation sections while maintaining its concise nature. We address the major comment in detail below.

read point-by-point responses

Referee: [Abstract] Abstract: The headline claim that EXAONE 4.5 'outperforms state-of-the-art models of similar scale in document understanding and Korean contextual reasoning' is unsupported by any benchmark scores, baseline comparisons, parameter counts, training FLOPs, or ablation studies. Without these, the 'similar scale' assertion cannot be verified and the causal role of the visual encoder plus document-centric curation remains untestable.

Authors: We agree that the original abstract presented a high-level summary without explicit numerical results or parameter details, which limits immediate verifiability. The full manuscript (Sections 4 and 5) contains the supporting evidence: benchmark tables comparing EXAONE 4.5 (8B-scale language model plus dedicated visual encoder) against models such as Qwen2-VL-7B and InternVL-2-8B on DocVQA, ChartQA, and Korean contextual reasoning tasks, with reported improvements of 3-7 points on document understanding metrics. Parameter counts, context length (256K), and high-level training data scale are specified in Sections 2 and 3. We have revised the abstract to include key quantitative highlights (e.g., specific benchmark names and relative gains) and an explicit definition of 'similar scale' (models with 7-9B parameters). Regarding causality, the report presents comparative results against the EXAONE 4.0 text-only baseline and other open models but does not include exhaustive component ablations, as these would require additional full-scale training runs beyond the scope of this technical report. revision: partial

standing simulated objections not resolved

Comprehensive ablation studies isolating the individual contributions of the dedicated visual encoder and document-centric data curation, which would require multiple additional large-scale pretraining runs not performed in this work.

Circularity Check

0 steps flagged

No circularity: empirical model report with external benchmark comparisons

full rationale

The document is a technical report describing architecture, data curation, and benchmark results for EXAONE 4.5. It contains no mathematical derivations, predictions derived from fitted parameters, or self-referential definitions. All performance claims are empirical comparisons against external benchmarks and prior models. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to justify core results; the report is self-contained against verifiable external evaluations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an engineering release report rather than a theoretical paper. No free parameters, axioms, or invented entities are introduced or required for the central claims.

pith-pipeline@v0.9.0 · 5748 in / 971 out tokens · 86320 ms · 2026-05-10T17:44:02.359588+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

53 extracted references · 14 canonical work pages · 9 internal anchors

[1]

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.https://arxi v.org/abs/2305.13245, 2023

work page internal anchor Pith review arXiv 2023
[2]

EXAONE 3.5: Series of Large Language Models for Real-world Use Cases, 2026

Soyoung An, Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Seokhee Hong, Junwon Hwang, Hyojin Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Yountae Jung, Hyosang Kim, Joonkee Kim, Seongh- wan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Wooh...

2026
[3]

EXAONE 3.0 7.8B Instruction Tuned Language Model, 2026

Soyoung An, Kyunghoon Bae, Eunbi Choi, Stanley Jungkyu Choi, Yemuk Choi, Seokhee Hong, Yeonjung Hong, Junwon Hwang, Hyojin Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Yountae Jung, Euisoon Kim, Hyosang Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Youchul Kim, Edward Hwayoung Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyung...

2026
[4]

KMMMU: A Korean Massive Multi-discipline Multimodal Understanding Benchmark

Anonymous. KMMMU: A Korean Massive Multi-discipline Multimodal Understanding Benchmark. InSubmit- ted to ACL Rolling Review - January 2026, 2026. under review

2026
[5]

Mint-1t: Scaling open-source multimodal data by 10x: A multimodal dataset with one trillion tokens.Advances in Neural Information Processing Systems, 37:36805–36828, 2024

Anas Awadalla, Le Xue, Oscar Lo, Manli Shu, Hannah Lee, Etash Guha, Matt Jordan, Sheng Shen, Mohamed Awadalla, Silvio Savarese, et al. Mint-1t: Scaling open-source multimodal data by 10x: A multimodal dataset with one trillion tokens.Advances in Neural Information Processing Systems, 37:36805–36828, 2024

2024
[6]

EXAONE 4.0: Unified Large Language Models Integrating Non-reasoning and Reasoning Modes, 2026

Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Yemuk Choi, Kyubeen Han, Seokhee Hong, Junwon Hwang, Taewan Hwang, Joonwon Jang, Hyojin Jeon, Kijeong Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Euisoon Kim, Hyosang Kim, Jihoon Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hway...

2026
[7]

EXAONE Deep: Reasoning Enhanced Language Models, 2026

Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Yemuk Choi, Seokhee Hong, Junwon Hwang, Hyojin Jeon, Kijeong Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Hyosang Kim, Joonkee Kim, Seongh- wan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Sang...

2026
[8]

Qwen3-vl technical report, 2025

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

2025
[9]

MathArena: Evaluating LLMs on Uncontaminated Math Competitions, February 2025

Mislav Balunovi ´c, Jasper Dekoninck, Ivo Petrov, Nikola Jovanovi´c, and Martin Vechev. MathArena: Evaluating LLMs on Uncontaminated Math Competitions, February 2025

2025
[10]

Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan.τ 2-bench: Evaluating conversa- tional agents in a dual-control environment, 2025

2025
[11]

Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard

Liang Chen, Weichu Xie, Yiyan Liang, Hongfeng He, Hans Zhao, Zhibo Yang, Zhiqi Huang, Haoning Wu, Haoyu Lu, Yiping Bao, et al. Babyvision: Visual reasoning beyond language.arXiv preprint arXiv:2601.06521, 2026

work page arXiv 2026
[12]

Are We on the Right Way for Evaluating Large Vision-Language Models?, 2024

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, and Feng Zhao. Are We on the Right Way for Evaluating Large Vision-Language Models?, 2024. 14

2024
[13]

K-exaone technical report, 2026

Eunbi Choi, Kibong Choi, Seokhee Hong, Junwon Hwang, Hyojin Jeon, Hyunjik Jo, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Haeju Lee, Jinsik Lee, Kyungmin Lee, Sangha Park, Heuiyeen Yeen, Hwan Chang, Stanley Jungkyu Choi, Yejin Choi, Jiwon Ham, Kijeong Jeon, Geunyeong Jeong, Gerrard Jeongwon Jo, Yonghwan Jo, Jiyeon Jung, ...

2026
[14]

DeepSeek-V3 Technical Report

DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Han- wei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Hu...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

DeepSeek-AI, Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenhao Xu, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Erhang Li, Fangqi Zhou, Fangyun Lin, Fucong Dai, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Ha...

2025
[16]

doi:10.48550/arXiv.2502.12404 , keywords =

Daniel Deutsch, Eleftheria Briakou, Isaac Caswell, Mara Finkelstein, Rebecca Galor, Juraj Juraska, Geza Kovacs, Alison Lui, Ricardo Rei, Jason Riesa, Shruti Rijhwani, Parker Riley, Elizabeth Salesky, Firas Trabelsi, Stephanie Winkler, Biao Zhang, and Markus Freitag. WMT24++: Expanding the Language Coverage of WMT24 to 55 Languages & Dialects.https://arxiv...

work page arXiv 2025
[17]

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning, 2025

Ling Fu, Zhebin Kuang, Jiajun Song, Mingxin Huang, Biao Yang, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, Zhang Li, Guozhi Tang, Bin Shan, Chunhui Lin, Qi Liu, Binghong Wu, Hao Feng, Hao Liu, Can Huang, Jingqun Tang, Wei Chen, Lianwen Jin, Yuliang Liu, and Xiang Bai. OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Vis...

2025
[18]

Smith, Wei-Chiu Ma, and Ranjay Krishna

Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, and Ranjay Krishna. BLINK: Multimodal Large Language Models Can See but Not Perceive, 2024

2024
[19]

Better & Faster Large Language Models via Multi-token Prediction

Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Roziere, David Lopez-Paz, and Gabriel Synnaeve. Better & Faster Large Language Models via Multi-token Prediction. InForty-first International Conference on Machine Learn- ing, 2024

2024
[20]

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models, 2024

Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models, 2024

2024
[21]

Ring attention with blockwise transformers for near-infinite context, 2023

Pieter Abbeel Hao Liu, Matei Zaharia. Ring attention with blockwise transformers for near-infinite context, 2023

2023
[22]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring Massive Multitask Language Understanding.https://arxiv.org/abs/2009.03300, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2009
[23]

The Curious Case of Neural Text Degeneration

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The Curious Case of Neural Text Degeneration. InInternational Conference on Learning Representations, 2020

2020
[24]

From KMMLU- Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation, 2025

Seokhee Hong, Sunkyoung Kim, Guijin Son, Soyeon Kim, Yeonjung Hong, and Jinsik Lee. From KMMLU- Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation, 2025

2025
[25]

KRETA: A Benchmark for Korean Reading and Reasoning in Text-Rich VQA Attuned to Diverse Visual Contexts, 2025

Taebaek Hwang, Minseo Kim, Gisang Lee, Seonuk Kim, and Hyunjun Eun. KRETA: A Benchmark for Korean Reading and Reasoning in Text-Rich VQA Attuned to Diverse Visual Contexts, 2025

2025
[26]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code.https://arxiv.org/abs/2403.07974, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

A Diagram Is Worth A Dozen Images, 2016

Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A Diagram Is Worth A Dozen Images, 2016

2016
[28]

Omnicorpus: A unified multimodal corpus of 10 billion-level images interleaved with text.ArXiv preprint, abs/2406.08418, 2024

Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang Jin, Guanzhou Chen, Yinan He, Zhangwei Gao, Erfei Cui, et al. Omnicorpus: A unified multimodal corpus of 10 billion-level images interleaved with text.arXiv preprint arXiv:2406.08418, 2024

work page arXiv 2024
[29]

Openvision 2: A family of generative pretrained visual encoders for multimodal learning.arXiv preprint arXiv:2509.01644, 2025

Yanqing Liu, Xianhang Li, Letian Zhang, Zirui Wang, Zeyu Zheng, Yuyin Zhou, and Cihang Xie. Openvision 2: A family of generative pretrained visual encoders for multimodal learning.arXiv preprint arXiv:2509.01644, 2025. 16

work page arXiv 2025
[30]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts, 2024

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts, 2024

2024
[31]

ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering, 2025

Ahmed Masry, Mohammed Saidul Islam, Mahir Ahmed, Aayush Bajaj, Firoz Kabir, Aaryaman Kartha, Md Tahmid Rahman Laskar, Mizanur Rahman, Shadikur Rahman, Mehrad Shahmohammadi, Megh Thakkar, Md Rizwan Parvez, Enamul Hoque, and Shafiq Joty. ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering, 2025

2025
[32]

OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations, 2025

Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, Jin Shi, Fan Wu, Pei Chu, Minghao Liu, Zhenxiang Li, Chao Xu, Bo Zhang, Botian Shi, Zhongying Tu, and Conghui He. OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations, 2025

2025
[33]

Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration, 2025

ChaeHun Park, Yujin Baek, Jaeseok Kim, Yu-Jung Heo, Du-Seong Chang, and Jaegul Choo. Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration, 2025

2025
[34]

The fineweb datasets: Decanting the web for the finest text data at scale.Advances in Neural Information Processing Systems, 37:30811–30849, 2024

Guilherme Penedo, Hynek Kydlí ˇcek, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro V on Werra, Thomas Wolf, et al. The fineweb datasets: Decanting the web for the finest text data at scale.Advances in Neural Information Processing Systems, 37:30811–30849, 2024

2024
[35]

Generalizing Verifiable Instruction Following

Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lam- bert, and Hannaneh Hajishirzi. Generalizing Verifiable Instruction Following. InThe Thirty-ninth Annual Con- ference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025

2025
[36]

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?, 2024

Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, Runfeng Qiao, Yifan Zhang, Xiao Zong, Yida Xu, Muxi Diao, Zhimin Bao, Chen Li, and Honggang Zhang. We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?, 2024

2024
[37]

Qwen3.5: Towards native multimodal agents, February 2026

Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026

2026
[38]

Manning, and Chelsea Finn

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: your language model is secretly a reward model. InProceedings of the 37th International Conference on Neural Information Processing Systems. Curran Associates Inc., 2023

2023
[39]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A Graduate-Level Google-Proof Q&A Benchmark.https://arxi v.org/abs/2311.12022, 2023

work page internal anchor Pith review arXiv 2023
[40]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.https://arxiv.org/abs/2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

KoBALT: Korean Benchmark For Ad- vanced Linguistic Tasks, 2025

Hyopil Shin, Sangah Lee, Dongjun Jang, Wooseok Song, Jaeyoon Kim, Chaeyoung Oh, Hyemi Jo, Youngchae Ahn, Sihyun Oh, Hyohyeong Chang, Sunkyoung Kim, and Jinsik Lee. KoBALT: Korean Benchmark For Ad- vanced Linguistic Tasks, 2025

2025
[42]

Openai gpt-5 system card, 2025

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, Akshay Nathan, Alan Luo, Alec Helyar, Aleksander Madry, Alek- sandr Efremov, Aleksandra Spyra, Alex Baker-Whitcomb, Alex Beutel, Alex Karpenko, Alex Makelov, Alex Neitz, Alex Wei, Alexandra Barr, Alexandre Kirchmeyer, A...

2025
[43]

RoFormer: Enhanced Transformer with Rotary Position Embedding

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced Trans- former with Rotary Position Embedding.https://arxiv.org/abs/2104.09864, 2023

work page internal anchor Pith review arXiv 2023
[44]

Artificial analysis long context reasoning benchmark(lcr), 2025

Artificial Analysis Team. Artificial analysis long context reasoning benchmark(lcr), 2025

2025
[45]

Every step evolves: Scaling reinforcement learning for trillion-scale thinking model, 2025

Ling Team, Anqi Shen, Baihui Li, Bin Hu, Bin Jing, Cai Chen, Chao Huang, Chao Zhang, Chaokun Yang, Cheng Lin, Chengyao Wen, Congqi Li, Deng Zhao, Dingbo Yuan, Donghai You, Fagui Mao, Fanzhuang Meng, Feng Xu, Guojie Li, Guowei Wang, Hao Dai, Haonan Zheng, Hong Liu, Jia Guo, Jiaming Liu, Jian Liu, Jianhao 18 Fu, Jiannan Shi, Jianwen Wang, Jianxin Lai, Jin Y...

2025
[46]

Measuring Multimodal Math- ematical Reasoning with MATH-Vision Dataset, 2024

Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, and Hongsheng Li. Measuring Multimodal Math- ematical Reasoning with MATH-Vision Dataset, 2024

2024
[47]

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark.https: //arxiv.org/abs/2406.01574, 2024

work page internal anchor Pith review arXiv 2024
[48]

CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs, 2024

Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, and Danqi Chen. CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs, 2024

2024
[49]

LogicVista: Multimodal LLM Logical Reasoning Bench- mark in Visual Contexts, 2024

Yijia Xiao, Edward Sun, Tianyu Liu, and Wei Wang. LogicVista: Multimodal LLM Logical Reasoning Bench- mark in Visual Contexts, 2024

2024
[50]

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI, 2024

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhen- zhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for E...

2024
[51]

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark, 2025

Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, and Graham Neubig. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark, 2025

2025
[52]

Instruction-Following Evaluation for Large Language Models

Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-Following Evaluation for Large Language Models.https://arxiv.org/abs/2311.07911, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[53]

Medxpertqa: Benchmarking expert-level medical reasoning and understanding.arXiv preprint arXiv:2501.18362, 2025

Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, and Bowen Zhou. Medxpertqa: Benchmarking expert-level medical reasoning and understanding.arXiv preprint arXiv:2501.18362, 2025. 19

work page arXiv 2025