OxyEcomBench: Benchmarking Multimodal Foundation Models across E-Commerce Ecosystems
Recognition: 1 Lean theorem link
Pith reviewed 2026-05-14 02:09 UTC · model grok-4.3
The pith
A new e-commerce benchmark shows that leading multimodal models achieve only modest performance, suggesting they lack domain-specific knowledge.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OxyEcomBench is a unified multimodal benchmark of approximately 6,300 high-quality instances for real-world bilingual Chinese-English e-commerce. It jointly covers platform operators, merchants, and customers across six capability aspects and 29 tasks, supporting text-only, single-image, multi-image, single-turn, and multi-turn inputs. All data comes from authentic platforms and is verified by domain experts. A four-level P0-P3 difficulty rubric is applied to every task, with priority given to visually salient cases. Tests on 20 LLMs and MLLMs show that even leading models attain modest performance and that performance gaps narrow on this benchmark, indicating that insufficient e-commerce-specific knowledge infusion mutes the advantages of advanced general-purpose models in this domain.
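For concreteness, a minimal sketch of how a single benchmark instance might be represented is below; every field name and value here is hypothetical, since the paper's released schema is not quoted in this review.

```python
from dataclasses import dataclass, field


# Hypothetical instance schema, for illustration only; the benchmark's actual
# released format may differ.
@dataclass
class EcomInstance:
    task: str                    # one of the 29 tasks, e.g. "product_damage_detection"
    stakeholder: str             # "platform_operator" | "merchant" | "customer"
    capability: str              # one of the six capability aspects
    difficulty: str              # "P0" | "P1" | "P2" | "P3"
    language: str                # "zh" | "en"
    images: list[str] = field(default_factory=list)   # zero, one, or several image paths
    turns: list[dict] = field(default_factory=list)   # [{"role": ..., "content": ...}] for multi-turn cases
    reference_answer: str = ""


example = EcomInstance(
    task="mgc_to_product_consistency",
    stakeholder="merchant",
    capability="visual_verification",
    difficulty="P2",
    language="zh",
    images=["listing_main.jpg", "ad_render.jpg"],
    turns=[{"role": "user", "content": "Does the ad image faithfully show the listed product?"}],
    reference_answer="No: a key accessory in the listing is missing from the ad image.",
)
print(example.task, example.difficulty)
```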
What carries the argument
OxyEcomBench, a multimodal benchmark that integrates stakeholder perspectives across 29 tasks with multimodal configurations and a P0-P3 difficulty rubric focused on visual evidence.
Load-bearing premise
The 29 tasks and four-level P0-P3 difficulty rubric, chosen with expert consensus and emphasis on visually salient cases, faithfully represent the full range of real-world e-commerce challenges without introducing selection bias.
What would settle it
A general-purpose model that scores above 80 percent on the hardest P3 tasks while preserving its usual lead over simpler models would falsify the claim that insufficient e-commerce-specific knowledge is the main reason performance gaps narrow.
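A minimal sketch of how that falsification check could be operationalized is below, assuming per-model mean accuracies are already computed; the function name, the 0.80 threshold taken from the sentence above, and the simple definition of "lead" as a strict overall margin are all illustrative assumptions.

```python
def would_falsify(p3_scores: dict[str, float],
                  overall_scores: dict[str, float],
                  leading_model: str,
                  threshold: float = 0.80) -> bool:
    """Return True if the stated falsification condition holds.

    p3_scores / overall_scores map model name -> mean accuracy on the hardest
    P3 tasks / on all tasks. "Lead" is simplified here to a strict overall
    margin over every other evaluated model.
    """
    others = [m for m in overall_scores if m != leading_model]
    keeps_lead = all(overall_scores[leading_model] > overall_scores[m] for m in others)
    return p3_scores[leading_model] >= threshold and keeps_lead


# Toy illustration with made-up numbers, not results from the paper.
print(would_falsify({"model_a": 0.83, "model_b": 0.55},
                    {"model_a": 0.71, "model_b": 0.62},
                    leading_model="model_a"))
```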
read the original abstract
LLMs and MLLMs have become indispensable tools across a wide range of applications. E-commerce, however, poses distinctive challenges -- including intricate domain knowledge, long-tail product evidence, heterogeneous visual data, and the interplay among multiple stakeholder roles -- that diverge substantially from the general world knowledge these models are primarily trained on, often causing a notable gap between their open-domain and e-commerce performance. To systematically quantify this gap, we introduce OxyEcomBench, a unified multimodal benchmark comprising approximately 6,300 high-quality instances for real-world bilingual Chinese--English e-commerce. Although several e-commerce benchmarks have been proposed, they typically adopt a single stakeholder perspective, target a narrow set of tasks, or address isolated challenges, making it difficult to holistically assess models' understanding of the full e-commerce pipeline. OxyEcomBench addresses these limitations by jointly covering platform operators, merchants, and customers across 6 capability aspects and 29 tasks, supporting text-only and mixed-modality inputs with single-image, multi-image, single-turn, and multi-turn configurations. All data is sourced from authentic e-commerce platforms and verified by domain experts. The benchmark further adopts a difficulty-aware design with a four-level P0--P3 rubric applied to all 29 tasks whose difficulty admits stable expert consensus, and rigorously prioritizes visually salient multimodal cases in which key evidence resides in images rather than text alone. Evaluations on 20 mainstream LLMs and MLLMs show that even the leading models attain modest performance and that performance gaps narrow on OxyEcomBench, suggesting that insufficient e-commerce-specific knowledge infusion mutes the advantages of advanced general-purpose models in this domain.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OxyEcomBench, a multimodal benchmark with approximately 6,300 expert-verified instances sourced from authentic Chinese-English e-commerce platforms. It covers platform operators, merchants, and customers across 6 capability aspects and 29 tasks, supporting text-only, single-image, multi-image, single-turn, and multi-turn inputs. A four-level P0-P3 difficulty rubric is applied with emphasis on visually salient cases, and evaluations of 20 LLMs and MLLMs show modest absolute performance together with narrowed gaps between leading and other models, which the authors attribute to insufficient e-commerce-specific knowledge infusion.
Significance. If the 29 tasks and P0-P3 rubric provide a representative sample of real-world e-commerce difficulties, the benchmark would usefully quantify the domain gap for general-purpose models and motivate targeted adaptation. The multi-stakeholder coverage, bilingual construction, and explicit difficulty stratification are constructive features that go beyond prior single-perspective e-commerce benchmarks. The reported modest scores and compressed performance gaps would then constitute actionable evidence for the community.
major comments (2)
- [§3.2] §3.2 (Data Collection and Verification): No quantitative inter-annotator agreement figures (Cohen’s kappa, Fleiss’ kappa, or raw agreement percentages) are reported for the expert verification of the 6,300 instances or for the assignment of P0-P3 difficulty labels. Without these statistics, especially for long-tail products, it is impossible to assess the stability of the difficulty rubric or to rule out systematic annotation drift.
- [§4.1 and §5] §4.1 (Task Selection) and §5 (Results): The central interpretation that narrowed gaps reflect missing e-commerce knowledge rests on the assumption that the 29 tasks accurately sample scenarios in which domain knowledge is decisive. The explicit prioritization of visually salient multimodal cases risks overweighting pattern-matching tasks (e.g., single-image attribute extraction) while under-sampling knowledge-heavy ones (multi-turn regulatory pricing, supply-chain reasoning). No usage-frequency statistics from the source platforms are provided to demonstrate that the task distribution matches real-world prevalence.
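On the first major comment, the requested agreement statistic is straightforward to compute once per-expert difficulty labels are available; below is a minimal sketch using statsmodels' Fleiss' kappa, with randomly generated placeholder labels standing in for the (unavailable) expert annotations.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Placeholder labels: rows are instances, columns are five experts, entries are
# difficulty levels 0..3 standing in for P0..P3. Real expert labels would go here.
rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=(200, 5))

counts, _ = aggregate_raters(labels)          # (n_instances, n_categories) count table
kappa = fleiss_kappa(counts, method="fleiss")
print(f"Fleiss' kappa over P0-P3 difficulty labels: {kappa:.3f}")
```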
minor comments (2)
- [Abstract and §3.1] The exact total number of instances (rather than “approximately 6,300”) should be stated in the abstract and §3.1 for reproducibility.
- [Table 2] Table 2 (task taxonomy) would benefit from an additional column indicating the number of instances per task and per difficulty level to allow readers to judge balance.
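The per-task and per-difficulty counts requested for Table 2 can be tabulated directly from instance metadata; the pandas sketch below uses hypothetical column names and a toy four-row frame in place of the real 6,300-instance metadata.

```python
import pandas as pd

# Toy metadata frame: one row per instance; column names are hypothetical.
meta = pd.DataFrame({
    "task": ["product_damage_detection", "product_damage_detection",
             "review_sentiment", "review_sentiment"],
    "difficulty": ["P1", "P3", "P0", "P2"],
})

# Instances per task and per difficulty level, with totals: the breakdown
# the referee asks to see alongside Table 2.
counts = pd.crosstab(meta["task"], meta["difficulty"], margins=True, margins_name="total")
print(counts)
```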
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly where feasible.
read point-by-point responses
-
Referee: [§3.2] §3.2 (Data Collection and Verification): No quantitative inter-annotator agreement figures (Cohen’s kappa, Fleiss’ kappa, or raw agreement percentages) are reported for the expert verification of the 6,300 instances or for the assignment of P0-P3 difficulty labels. Without these statistics, especially for long-tail products, it is impossible to assess the stability of the difficulty rubric or to rule out systematic annotation drift.
Authors: We agree that quantitative inter-annotator agreement metrics strengthen claims of annotation quality. The verification involved five domain experts following a structured multi-round protocol. In the revised manuscript we will report Fleiss’ kappa for both instance verification and P0-P3 difficulty assignment, including a breakdown for long-tail product categories, to demonstrate stability and address potential drift. revision: yes
-
Referee: [§4.1 and §5] §4.1 (Task Selection) and §5 (Results): The central interpretation that narrowed gaps reflect missing e-commerce knowledge rests on the assumption that the 29 tasks accurately sample scenarios in which domain knowledge is decisive. The explicit prioritization of visually salient multimodal cases risks overweighting pattern-matching tasks (e.g., single-image attribute extraction) while under-sampling knowledge-heavy ones (multi-turn regulatory pricing, supply-chain reasoning). No usage-frequency statistics from the source platforms are provided to demonstrate that the task distribution matches real-world prevalence.
Authors: Task selection was informed by consultations with e-commerce practitioners to cover representative stakeholder challenges, with visual salience prioritized because images carry decisive evidence in most platform interactions. We acknowledge that proprietary platform data prevent release of exact usage-frequency statistics. In revision we will expand §4.1 with explicit rationale and examples of included knowledge-heavy tasks (e.g., multi-turn regulatory pricing), add a limitations paragraph in §5 qualifying the interpretation of performance gaps, and note the emphasis on visually salient cases as a deliberate design choice rather than an unintended bias. revision: partial
- We cannot provide proprietary usage-frequency statistics from the source e-commerce platforms.
Circularity Check
No circularity: benchmark construction relies on external data collection and expert protocols
full rationale
The paper constructs OxyEcomBench by sourcing instances from authentic e-commerce platforms, applying expert-verified difficulty rubrics, and evaluating 20 external models. No equations, fitted parameters, or self-referential derivations exist. Central claims about performance gaps rest on the benchmark's task coverage and results, not on any reduction to prior self-citations or definitions. Self-citations, if present, are not load-bearing for any derivation chain. This is a standard benchmark paper with independent empirical content.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Domain experts can reliably assign stable P0-P3 difficulty labels and identify visually salient cases.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relevance: unclear · matched text: "OxyEcomBench ... 29 tasks ... four-level P0–P3 rubric ... visually salient multimodal cases"