Recognition: 2 theorem links
ESOM: Efficiently Understanding Streaming Video Anomalies with Open-world Dynamic Definitions
Pith reviewed 2026-05-10 17:51 UTC · model grok-4.3
The pith
ESOM processes streaming videos to detect and describe user-defined anomalies in real time without training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ESOM is a training-free model for open-world video anomaly detection in streaming settings. It structures user prompts with Definition Normalization to reduce hallucinations, compresses tokens using Inter-frame-matched Intra-frame Token Merging, uses Hybrid Streaming Memory for causal inference, and applies Probabilistic Scoring to generate frame-level anomaly scores from interval outputs. The model achieves state-of-the-art results in localization, classification, and description on the new OpenDef-Bench.
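The last step in that chain — turning interval-level textual outputs into frame-level anomaly scores — can be pictured with a minimal sketch. The interval tuple format and the max reduction over overlapping intervals are illustrative assumptions, not the paper's exact probabilistic formulation.

```python
# Hypothetical sketch of interval-to-frame score conversion, in the spirit of
# ESOM's Probabilistic Scoring module. The (start, end, prob) format and the
# max-over-intervals rule are assumptions for illustration.

def frames_from_intervals(intervals, num_frames):
    """Expand interval-level (start_frame, end_frame, prob) outputs,
    end exclusive, into per-frame anomaly scores in [0, 1]."""
    scores = [0.0] * num_frames
    for start, end, prob in intervals:
        for f in range(max(0, start), min(num_frames, end)):
            scores[f] = max(scores[f], prob)  # overlapping intervals: keep max
    return scores

# Example: two predicted anomalous intervals over a 10-frame clip.
print(frames_from_intervals([(2, 5, 0.8), (4, 8, 0.4)], 10))
```

Frames covered by both intervals keep the higher score, so a dense per-frame curve falls out of sparse interval-level text.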
What carries the argument
ESOM's four core modules—Definition Normalization for prompt structuring, Inter-frame-matched Intra-frame Token Merging for token compression, Hybrid Streaming Memory for causal processing, and Probabilistic Scoring for score conversion—combined with the OpenDef-Bench evaluation dataset.
If this is right
- Allows real-time processing of streaming video on a single GPU.
- Supports dynamic, user-specified anomaly definitions without retraining the model.
- Enables causal inference suitable for live applications.
- Reduces hallucinations in anomaly descriptions through normalized prompts.
- Provides a standardized benchmark for comparing open-world VAD methods under varying conditions.
Where Pith is reading between the lines
- The efficiency gains from token merging could make such systems viable for mobile or edge deployment in surveillance.
- OpenDef-Bench might encourage development of models that handle even more diverse or ambiguous anomaly definitions.
- The probabilistic scoring approach could be adapted to other video understanding tasks requiring frame-level outputs from language models.
- Combining ESOM with additional memory mechanisms might further improve long-term streaming performance.
Load-bearing premise
The assumption that the four modules together enable effective open-world detection, hallucination reduction, and causal streaming inference without requiring any training or fine-tuning.
What would settle it
Running ESOM on new streaming videos with novel anomaly definitions and finding that its localization accuracy or description quality falls below that of trained or non-streaming MLLM baselines.
Original abstract
Open-world video anomaly detection (OWVAD) aims to detect and explain abnormal events under different anomaly definitions, which is important for applications such as intelligent surveillance and live-streaming content moderation. Recent MLLM-based methods have shown promising open-world generalization, but still suffer from three major limitations: inefficiency for practical deployment, lack of streaming processing adaptation, and limited support for dynamic anomaly definitions in both modeling and evaluation. To address these issues, this paper proposes ESOM, an efficient streaming OWVAD model that operates in a training-free manner. ESOM includes a Definition Normalization module to structure user prompts for reducing hallucination, an Inter-frame-matched Intra-frame Token Merging module to compress redundant visual tokens, a Hybrid Streaming Memory module for efficient causal inference, and a Probabilistic Scoring module that converts interval-level textual outputs into frame-level anomaly scores. In addition, this paper introduces OpenDef-Bench, a new benchmark with clean surveillance videos and diverse natural anomaly definitions for evaluating performance under varying conditions. Extensive experiments show that ESOM achieves real-time efficiency on a single GPU and state-of-the-art performance in anomaly temporal localization, classification, and description generation. The code and benchmark will be released at https://github.com/Kamino666/ESOM_OpenDef-Bench.
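One way to picture a hybrid streaming memory for causal inference is a sink-plus-recency cache in the spirit of the attention-sink work the paper cites: a few earliest entries are kept permanently while the rest live in a fixed-size recent window. The class name, sizes, and eviction policy below are illustrative assumptions, not ESOM's actual module.

```python
from collections import deque

# Minimal sketch of a sink-plus-recency streaming cache. Everything here
# (num_sinks, window, the eviction rule) is a simplifying assumption used
# only to illustrate bounded-memory causal processing.

class StreamingMemory:
    def __init__(self, num_sinks=4, window=8):
        self.sinks = []                     # earliest entries, kept permanently
        self.num_sinks = num_sinks
        self.recent = deque(maxlen=window)  # fixed-size recent window

    def add(self, item):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(item)
        else:
            self.recent.append(item)        # oldest recent item is evicted

    def context(self):
        return self.sinks + list(self.recent)

mem = StreamingMemory(num_sinks=2, window=3)
for t in range(8):
    mem.add(t)
print(mem.context())  # the two sink entries plus the last three items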
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes ESOM, a training-free model for open-world streaming video anomaly detection (OWVAD). It introduces four modules—Definition Normalization to structure prompts and reduce hallucination, Inter-frame-matched Intra-frame Token Merging to compress visual tokens, Hybrid Streaming Memory for causal inference, and Probabilistic Scoring to derive frame-level anomaly scores from textual outputs—along with the new OpenDef-Bench benchmark containing clean surveillance videos and diverse natural anomaly definitions. The central claim is that ESOM achieves real-time single-GPU inference and state-of-the-art performance on anomaly temporal localization, classification, and description generation.
Significance. If the performance and efficiency claims are substantiated, this work would represent a meaningful step toward practical OWVAD systems for surveillance and live-streaming moderation. The training-free design, streaming adaptation, and support for dynamic definitions address documented limitations of prior MLLM-based approaches, while the new benchmark could enable more rigorous evaluation of open-world generalization.
Major comments (2)
- Abstract: The claim of achieving state-of-the-art performance in temporal localization, classification, and description generation is presented without reference to specific baselines, quantitative metrics, or error analysis, which prevents full assessment of whether the four modules deliver the asserted gains over existing methods.
- The weakest assumption—that Definition Normalization, Inter-frame-matched Intra-frame Token Merging, Hybrid Streaming Memory, and Probabilistic Scoring together enable effective open-world detection, hallucination reduction, and causal streaming inference without any training—requires explicit ablation or component-wise results to confirm it is load-bearing for the SOTA claim.
Minor comments (1)
- The commitment to release code and the OpenDef-Bench benchmark is noted positively; ensure the release includes all prompts, preprocessing details, and evaluation scripts to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight opportunities to better substantiate our claims. We address each major comment below and indicate the corresponding revisions to the manuscript.
Point-by-point responses
- Referee: Abstract: The claim of achieving state-of-the-art performance in temporal localization, classification, and description generation is presented without reference to specific baselines, quantitative metrics, or error analysis, which prevents full assessment of whether the four modules deliver the asserted gains over existing methods.
  Authors: We agree that the abstract would benefit from greater specificity to allow readers to evaluate the SOTA claim more readily. The experiments section already contains detailed quantitative comparisons against relevant baselines for temporal localization (F1-score), classification accuracy, and description generation quality, together with supporting error analysis. In the revised manuscript we have updated the abstract to include brief references to these key metrics and baselines while directing readers to the full tables and analysis in the main text. This change strengthens the abstract without altering the reported experimental outcomes. Revision: yes.
- Referee: The weakest assumption—that Definition Normalization, Inter-frame-matched Intra-frame Token Merging, Hybrid Streaming Memory, and Probabilistic Scoring together enable effective open-world detection, hallucination reduction, and causal streaming inference without any training—requires explicit ablation or component-wise results to confirm it is load-bearing for the SOTA claim.
  Authors: The referee correctly notes that component-wise validation would more rigorously demonstrate the contribution of each module to the overall performance. While the original manuscript reports end-to-end results for the complete ESOM system, we acknowledge the value of explicit ablations. We have added a dedicated ablation study in the revised version that isolates the effect of removing or altering each module individually. The new results quantify impacts on hallucination rates, inference latency, and detection metrics, confirming that the four modules are collectively load-bearing for the training-free open-world performance gains. Revision: yes.
Circularity Check
No significant circularity
Full rationale
The paper presents ESOM as a training-free composition of four explicitly described modules (Definition Normalization to structure prompts and reduce hallucination, Inter-frame-matched Intra-frame Token Merging for token compression, Hybrid Streaming Memory for causal streaming inference, and Probabilistic Scoring to convert interval outputs to frame-level scores) plus the newly introduced OpenDef-Bench benchmark. No equations, fitted parameters, or derivations appear in the abstract or module descriptions that reduce any claimed output to an input by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The performance claims rest on experimental results on the external benchmark rather than internal re-labeling or self-referential fitting, making the derivation chain self-contained.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/Breath1024.lean · period8 — echoes?
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "a GoF structure with size 8 is adopted, where the first frame is treated as an I-frame, the last frame as a P-frame, and the remaining frames as B-frames. Correspondingly, the token retention ratios for B-frames and P-frames are set to γ_B = 0.2 and γ_P = 0.6"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel — unclear?
  UNCLEAR: relation between the paper passage and the cited Recognition theorem.
  Passage: "The DN module converts user prompt into a structured anomaly definition table to reduce hallucination... Probabilistic Scoring (PS) module that converts interval-level textual outputs into frame-level anomaly scores"
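The GoF retention scheme quoted in the first link above can be sketched as a per-frame token budget: the first frame is an I-frame (all tokens kept), the last a P-frame (γ_P = 0.6), and the middle frames B-frames (γ_B = 0.2). The tokens-per-frame value is an illustrative assumption.

```python
# Per-frame token budget for one GoF under the quoted retention scheme.
# Ratios come from the paper passage; tokens_per_frame = 100 is illustrative.

def gof_token_budget(tokens_per_frame, gof_size=8, gamma_b=0.2, gamma_p=0.6):
    ratios = [1.0] + [gamma_b] * (gof_size - 2) + [gamma_p]  # I, B..., P
    return [int(tokens_per_frame * r) for r in ratios]

budget = gof_token_budget(100)
print(budget)       # one I-frame kept whole, six sparse B-frames, one P-frame
print(sum(budget))  # retained tokens out of the 800 in the raw GoF
```

With these ratios a GoF of 8 frames keeps 280 of 800 tokens, i.e. a 65% reduction, which is where the real-time efficiency claim gets its headroom.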
What do these tags mean?
- matches — The paper's claim is directly supported by a theorem in the formal canon.
- supports — The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends — The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses — The paper appears to rely on the theorem as machinery.
- contradicts — The paper's claim conflicts with a theorem or certificate in the canon.
- unclear — Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] P. Wu, C. Pan, Y. Yan, G. Pang, P. Wang, and Y. Zhang, "Deep learning for video anomaly detection: A review," arXiv preprint arXiv:2409.05383, 2024.
- [2] P. Wu, X. Zhou, G. Pang, Y. Sun, J. Liu, P. Wang, and Y. Zhang, "Open-vocabulary video anomaly detection," in CVPR, 2024, pp. 18297–18307.
- [3] F. Li, W. Liu, J. Chen, R. Zhang, Y. Wang, X. Zhong, and Z. Wang, "Anomize: Better open vocabulary video anomaly detection," in CVPR, 2025, pp. 29203–29212.
- [4] C. Huang, W. Huang, Q. Jiang, W. Wang, J. Wen, and B. Zhang, "Multimodal evidential learning for open-world weakly-supervised video anomaly detection," TMM, 2025.
- [5] L. Zanella, W. Menapace, M. Mancini, Y. Wang, and E. Ricci, "Harnessing large language models for training-free video anomaly detection," in CVPR, 2024, pp. 18527–18536.
- [6] H. Zhang, X. Xu, X. Wang, J. Zuo, X. Huang, C. Gao, S. Zhang, L. Yu, and N. Sang, "Holmes-VAU: Towards long-term video anomaly understanding at any granularity," in CVPR, 2025, pp. 13843–13853.
- [7] Z. Liu, X. Wu, J. Wu, X. Wang, and L. Yang, "Language-guided open-world video anomaly detection under weak supervision," arXiv preprint arXiv:2503.13160, 2025.
- [8] P. Wu, X. Zhou, G. Pang, L. Zhou, Q. Yan, P. Wang, and Y. Zhang, "VadCLIP: Adapting vision-language models for weakly supervised video anomaly detection," in AAAI, 2024, pp. 6074–6082.
- [9] H. Zhang, X. Wang, X. Xu, X. Huang, C. Gao, Y. Wang, S. Zhang, and N. Sang, "GlanceVAD: Exploring glance supervision for label-efficient video anomaly detection," in ICME, 2025, pp. 1–6.
- [10] P. Chen, S. Du, X. Zhao, J. Hu, J. Li, and T. Li, "DCTFormer: A dual-branch transformer with cloze tests for video anomaly detection," TMM, 2025.
- [11] M. Ye, W. Liu, and P. He, "VERA: Explainable video anomaly detection via verbalized learning of vision-language models," in CVPR, 2025, pp. 8679–8688.
- [12] S. Smeureanu, R. T. Ionescu, M. Popescu, and B. Alexe, "Deep appearance features for abnormal behavior detection in video," in Int. Conf. Image Anal. Process., LNCS vol. 10485, Springer, 2017, pp. 779–789.
- [13] P. Wu, J. Liu, and F. Shen, "A deep one-class neural network for anomalous event detection in complex scenes," TNNLS, vol. 31, no. 7, pp. 2609–2622, 2020.
- [14] J. Tang, H. Lu, R. Wu, X. Xu, K. Ma, C. Fang, B. Guo, J. Lu, Q. Chen, and Y. Chen, "Hawk: Learning to understand open-world video anomalies," in NeurIPS, vol. 37, 2024, pp. 139751–139785.
- [15] C. Huang, B. Wang, J. Wen, C. Liu, W. Wang, L. Shen, and X. Cao, "VAD-R1: Towards video anomaly reasoning via perception-to-cognition chain-of-thought," arXiv preprint arXiv:2505.19877, 2025.
- [16] L. Zhu, Q. Chen, X. Shen, and X. Cun, "VAU-R1: Advancing video anomaly understanding via reinforcement fine-tuning," arXiv preprint arXiv:2505.23504, 2025.
- [17] Z. Yang, C. Gao, and M. Z. Shou, "PANDA: Towards generalist video anomaly detection via agentic AI engineer," arXiv preprint arXiv:2509.26386, 2025.
- [18] J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, "RoFormer: Enhanced transformer with rotary position embedding," Neurocomputing, vol. 568, p. 127063, 2024.
- [19] Y. Zhu, W. Bao, and Q. Yu, "Towards open set video anomaly detection," in ECCV, LNCS vol. 13694, Springer, 2022, pp. 395–412.
- [20] H. Huang, Z. Hu, D. Feng, C. Chen, D. Li, H. Liu, and L. Duan, "Enabling real-world supervised video anomaly detection: New open-set benchmark and new framework," TMM, 2026.
- [21] A. Acsintoae, A. Florescu, M. Georgescu, T. Mare, P. Sumedrea, R. T. Ionescu, F. S. Khan, and M. Shah, "UBnormal: New benchmark for supervised open-set video anomaly detection," in CVPR, 2022, pp. 20111–20121.
- [22] Z. Wang, X. Gu, H. Yan, and X. Gu, "Domain generalization for video anomaly detection considering diverse anomaly types," Signal, Image and Video Processing, vol. 18, no. 4, pp. 3691–3704, 2024.
- [23] Y. Jain, A. Dabouei, and M. Xu, "Cross-domain learning for video anomaly detection with limited supervision," in ECCV, Springer, 2024, pp. 468–484.
- [24] A. Aich, K.-C. Peng, and A. K. Roy-Chowdhury, "Cross-domain video anomaly detection without target domain adaptation," in WACV, 2023, pp. 2579–2591.
- [25] L. Zhou, Y. Gao, M. Zhang, P. Wu, P. Wang, and Y. Zhang, "Human-centric behavior description in videos: New benchmark and model," TMM, vol. 26, pp. 10867–10878, 2024.
- [26] Q. Bao, F. Liu, L. Jiao, Y. Liu, S. Li, L. Li, X. Liu, X. Wang, and B. Chen, "Anomaly-led prompting learning caption generating model and benchmark," TMM, 2025.
- [27] H. Du, S. Zhang, B. Xie, G. Nan, J. Zhang, J. Xu, H. Liu, S. Leng, J. Liu, H. Fan, D. Huang, J. Feng, L. Chen, C. Zhang, X. Li, H. Zhang, J. Chen, Q. Cui, and X. Tao, "Uncovering what, why and how: A comprehensive benchmark for causation understanding of video anomaly," in CVPR, 2024, pp. 18793–18803.
- [28] T. Yuan, X. Zhang, K. Liu, B. Liu, C. Chen, J. Jin, and Z. Jiao, "Towards surveillance video-and-language understanding: New dataset, baselines, and challenges," in CVPR, 2024, pp. 22052–22061.
- [29] X. Wang, X. Wu, and Z. Liu, "Enhancing video anomaly understanding via multi-task instruction tuning," IEEE Signal Processing Letters, vol. 32, pp. 4359–4363, 2025.
- [30] M. Cho, T. Kim, M. Shim, D. Wee, and S. Lee, "Towards multi-domain learning for generalizable video anomaly detection," NeurIPS, vol. 37, pp. 50256–50284, 2024.
- [31] X. Tang, J. Qiu, L. Xie, Y. Tian, J. Jiao, and Q. Ye, "Adaptive keyframe sampling for long video understanding," in CVPR, 2025, pp. 29118–29128.
- [32] T. Fu, T. Liu, Q. Han, G. Dai, S. Yan, H. Yang, X. Ning, and Y. Wang, "FrameFusion: Combining similarity and importance for video token reduction on large vision language models," in ICCV, 2025, pp. 22654–22663.
- [33] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," TCSVT, vol. 13, no. 7, pp. 560–576, 2003.
- [34] G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis, "Efficient streaming language models with attention sinks," arXiv, 2023.
- [35] Z. Liu, X. Wu, W. Li, L. Yang, and S. Wang, "Rethinking metrics and benchmarks of video anomaly detection," arXiv preprint arXiv:2505.19022, 2025.
- [36] X. Wang and D. Zhou, "Chain-of-thought reasoning without prompting," NeurIPS, vol. 37, pp. 66383–66409, 2024.
- [37] J. Zhong, N. Li, W. Kong, S. Liu, T. H. Li, and G. Li, "Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection," in CVPR, 2019, pp. 1237–1246.
- [38] C. Cao, Y. Lu, P. Wang, and Y. Zhang, "A new comprehensive benchmark for semi-supervised video anomaly detection and anticipation," in CVPR, 2023, pp. 20392–20401.
- [39] S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C.-C. Chen, J. T. Lee, S. Mukherjee, J. K. Aggarwal, H. Lee, L. Davis et al., "A large-scale benchmark dataset for event recognition in surveillance video," in CVPR, 2011, pp. 3153–3160.
- [40] K. Corona, K. Osterdahl, R. Collins, and A. Hoogs, "MEVA: A large-scale multiview, multimodal video dataset for activity detection," in WACV, 2021, pp. 1060–1068.
- [41] W. Sultani, C. Chen, and M. Shah, "Real-world anomaly detection in surveillance videos," in CVPR, 2018, pp. 6479–6488.
- [42] P. Wu, J. Liu, Y. Shi, Y. Sun, F. Shao, Z. Wu, and Z. Yang, "Not only look, but also listen: Learning multimodal violence detection under weak supervision," in ECCV, LNCS vol. 12375, Springer, 2020, pp. 322–339.
- [43] L. Zhu, L. Wang, A. Raj, T. Gedeon, and C. Chen, "Advancing video anomaly detection: A concise review and a new dataset," in NeurIPS, vol. 37, 2024, pp. 89943–89977.
- [44] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., "Qwen3 technical report," arXiv preprint arXiv:2505.09388, 2025.
- [45] Y. Pu, X. Wu, L. Yang, and S. Wang, "Learning prompt-enhanced context features for weakly-supervised video anomaly detection," TIP, vol. 33, pp. 4923–4936, 2024.
- [46] Z. Sun, Z. Peng, Y. Ma, Y. Chen, Z. Zhou, Z. Zhou, G. Zhang, Y. Zhang, Y. Zhou, Q. Lu et al., "StreamAvatar: Streaming diffusion models for real-time interactive human avatars," arXiv preprint arXiv:2512.22065, 2025.
- [47] D. Bolya, C.-Y. Fu, X. Dai, P. Zhang, C. Feichtenhofer, and J. Hoffman, "Token merging: Your ViT but faster," arXiv preprint arXiv:2210.09461, 2022.
- [48] K. Tao, C. Qin, H. You, Y. Sui, and H. Wang, "DyCoke: Dynamic compression of tokens for fast video large language models," in CVPR, 2025, pp. 18992–19001.
- [49] B. Yang, B. Wen, B. Ding, C. Liu, C. Chu, C. Song, C. Rao, C. Yi, D. Li, D. Zang et al., "Kwai Keye-VL 1.5 technical report," arXiv preprint arXiv:2509.01563, 2025.
- [50] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin, "Qwen2.5-VL technical report," arXiv preprint arXiv:2502.13923, 2025.
- [51] S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge et al., "Qwen3-VL technical report," arXiv preprint arXiv:2511.21631, 2025.