{"total":22,"items":[{"citing_arxiv_id":"2605.16671","ref_index":39,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Sustainable Intelligence for the Wild: Democratizing Ecological Monitoring via Knowledge-Adaptive Edge Expert Agents","primary_cat":"cs.AI","submitted_at":"2026-05-15T22:12:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Proposes a knowledge-adaptive edge expert agent architecture for sustainable biodiversity monitoring that separates visual perception from reasoning with an explicit knowledge base.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12882","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence","primary_cat":"cs.CL","submitted_at":"2026-05-13T01:54:42+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10120","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MicroWorld: Empowering Multimodal Large Language Models to Bridge the Microscopic Domain Gap with Multimodal Attribute Graph","primary_cat":"cs.CV","submitted_at":"2026-05-11T07:35:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MicroWorld constructs a multimodal attributed property graph from scientific image-caption data and augments MLLM prompts 
via retrieval to raise Qwen3-VL-8B performance by 37.5% on MicroVQA and 6% on MicroBench.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"framework that substantially enhances MLLM reasoning performance on microscopy tasks. 2.2 Retrieval-Augmented Generation in Biomedical Field Retrieval-augmented generation (RAG) has been extensively explored for natural images, with representative approaches spanning text-centric methods such as FLMR [23], vision-centric methods including VisRAG [24] and ColPali [25], as well as text-vision-centric frameworks like REVEAL [26]. In the biomedical domain, MasonNLP [27] leverages a generalist, instruction-tuned large language model coupled with a RAG framework to integrate in-domain textual and visual exemplars, demonstrating the benefits of retrieval-based knowledge grounding. MMed-RAG [28] further introduces a"},{"citing_arxiv_id":"2605.09271","ref_index":114,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Shaping Schema via Language Representation as the Next Frontier for LLM Intelligence Expanding","primary_cat":"cs.AI","submitted_at":"2026-05-10T02:42:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Advanced language representations shape LLMs' schemas to improve knowledge activation and problem-solving.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"be formulated as a language-based task. In contrast to pure scaling, this view emphasizes system composition over raw capacity. Here, intelligence is seen not as an emergent property of a single monolithic model, but as the result of a collaborative system where the LLM acts as the cognitive core [112, 113], orchestrating specialized tools such as search engines [114, 115], code interpreters [116], or symbolic planners [117, 118]. 
This paradigm extends the LLM's effective reach without requiring further scaling, enabling problem-solving across modalities, data sources, and reasoning domains [119, 120]. Compared with Our View: Language Representation Design as the Next Frontier for Expanding LLM Intelligence While scaling and tool augmentation have propelled LLMs to unprecedented capability, we argue that"},{"citing_arxiv_id":"2605.08133","ref_index":19,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving","primary_cat":"cs.CV","submitted_at":"2026-05-01T05:50:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLADriver-RAG reaches a new state-of-the-art Driving Score of 89.12 on Bench2Drive by retrieving structure-aware historical knowledge through spatiotemporal semantic graphs and Graph-DTW alignment.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"frameworks, their direct adaptation to the high-stakes domain of autonomous driving is impeded by two fundamental challenges. First, regarding real-time efficiency, the high dimensionality of dense visual streams imposes excessive computational overhead, rendering raw retrieval incompatible with the strict millisecond-level latency constraints of closed-loop control [19, 20]. 
Second, regarding scenario distinguishability, direct visual matching often suffers from high visual similarity; scenarios with identical static backgrounds but distinct semantic logic (e.g., different traffic light phases) are difficult to differentiate based solely on pixel-level features, leading to the retrieval of confused guidance [21]."},{"citing_arxiv_id":"2604.22281","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DocPrune:Efficient Document Question Answering via Background, Question, and Comprehension-aware Token Pruning","primary_cat":"cs.CV","submitted_at":"2026-04-24T06:51:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DocPrune is a training-free token pruning method that removes background and irrelevant tokens from document images using question and comprehension signals, yielding 3x encoder and 3.3x decoder throughput gains plus +1 F1 on M3DocRAG.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22280","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings","primary_cat":"cs.CV","submitted_at":"2026-04-24T06:50:11+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"retrieval and improved flexibility. • We introduce the refine reinforcement learning that uses discriminative embeddings as anchors, guiding the rewriting process to better refine the generative embedding. • Extensive experiments demonstrate that RIME significantly outperforms existing generative embedding models on the MMEB-V2 [17], UVRB [13], and MRMR [49] benchmarks. 
2 Related Works 2.1 Multimodal Representation Learning Multimodal representation learning aims to embed heterogeneous modalities into a unified semantic space for multimodal retrieval. Early works mainly adopt dual-encoder architectures trained with contrastive learning [36, 37, 47]. For example, CLIP [37] learns shared image-text representations from large-scale web data and"},{"citing_arxiv_id":"2604.14029","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch","primary_cat":"cs.CV","submitted_at":"2026-04-15T16:09:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"POINTS-Seeker-8B is an 8B multimodal model trained from scratch for agentic search that uses seeding and visual-space history folding to outperform prior models on six visual reasoning benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ternal, retrieved information [20,21]. This paradigm has effectively extended to multimodal domains, particularly for knowledge-intensive visual question answering (VQA) tasks. By retrieving external knowledge, which encompasses text, images, and multimodal structured data, these methods can leverage the evidence from such sources to better address complex VQA problems [2,3,16,26,30,36,53,59]. Despite their empirical success, conventional RAG methods face critical limitations: they predominantly rely on static knowledge bases and typically decouple retrieval from generation, preventing effective end-to-end optimization. 
Autonomous web-search agents. Recent advancements have pivoted from static corpora to real-time web interaction, empowering models to access dy-"},{"citing_arxiv_id":"2604.13710","ref_index":41,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs","primary_cat":"cs.CV","submitted_at":"2026-04-15T10:39:42+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SLQ adapts frozen MLLMs for multimodal retrieval by appending shared latent queries to text and image tokens and introduces KARR-Bench to test knowledge-aware reasoning retrieval.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Extracting high-quality, compact embeddings from these generative backbones without compromising their reasoning abilities remains an open challenge. 2.3 Adapting MLLMs for Multimodal Retrieval Recent works repurpose MLLMs as dense retrievers. Methods like GME [45], MM-Embed [22], VLM2VEC [15], and MMRet [47] extract the last token's hidden state, while VisRAG [41] and ColPali [9] use multi-vector representations. These approaches typically employ full fine-tuning or LoRA, requiring massive computation and risking semantic distortion. In contrast, SLQ utilizes learnable queries to aggregate features, avoiding invasive tuning while minimizing optimization cost. 
2.4 Multimodal Prompt Tuning and Query-based Methods"},{"citing_arxiv_id":"2604.12812","ref_index":17,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding","primary_cat":"cs.AI","submitted_at":"2026-04-14T14:39:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DocSeeker improves long-document understanding in MLLMs via a two-stage training process that combines supervised fine-tuning from distilled data with evidence-aware group relative policy optimization and memory-efficient resolution allocation.","context_count":1,"top_context_role":"method","top_context_polarity":"baseline","context_text":"and applies universally to almost any document format. However, MLLMs powered pure-visual methods confront two fundamental hurdles when scaling to long documents. The first is severe low Signal-to-Noise Ratio (SNR), where crucial evidence are buried within vast irrelevant content. Although several visual-Retrieval-Augmented Generation (RAG) methods [17, 18] for document page retrieval have been proposed, which can pre-filter question relevant pages, it introduces the classic top-k dilemma: a large k ensures high recall but introduce more noise, whereas a small k risks missing the evidence. The second is the scarcity of fine-grained supervision. 
Most existing multi-page document VQA datasets provide only final short answers, lacking intermediate rea-"},{"citing_arxiv_id":"2604.11095","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Bottleneck Tokens for Unified Multimodal Retrieval","primary_cat":"cs.LG","submitted_at":"2026-04-13T07:12:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"VisDial 1.0 123,287 VisualNews (t2i) 1.0 99,903 VisualNews (i2t) 1.0 100,000 MSCOCO (t2i) 1.0 100,000 MSCOCO (i2t) 1.0 113,287 MSCOCO (grounding) 1.0 100,000 CIRR 0.5 26,116 NIGHTS 0.5 15,941 WebQA 0.5 17,166 Video Video Caption (t2v) LLaVA-Hound [26] 5.0 301,751 Video Caption (v2t) 5.0 301,751 Video QA 5.0 255,000 VisDoc VisRAG (in-domain) VisRAG [25] 12.0 122,752 ColPali Train ColPali [3] 10.0 118,195 Total (25 datasets) 52.0 2,167,921 BToks and the reproduced baseline are the bottlenecked retrieval interface and the training objectives. Table 6 summarizes the training data composition by modality and source. 
For the modality-specific variants (Image-only, Video-only, VisDoc-only) discussed in Sec."},{"citing_arxiv_id":"2604.10167","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval","primary_cat":"cs.CV","submitted_at":"2026-04-11T11:31:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ColChunk adaptively chunks visual document patches into contextual multi-vectors via clustering, cutting storage by over 90% while raising average nDCG@5 by 9 points.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ments where semantic meaning is derived from both textual content and spatial layout [40, 42]. Traditional pipelines rely on OCR to extract text, but they often struggle to preserve structural integrity and fail on non-textual elements like tables and charts [27, 45]. While the era of Large Vision-Language Models (LVLMs) has introduced end-to-end single-vector models to bypass OCR (e.g., DSE [19], GME [46], UniSE [18]), these models suffer from significant information loss by compressing complex, high-resolution pages into a single coarse-grained representation. 
Consequently, multi-vector architectures, pioneered by ColPali [5], have redefined the SOTA through late interaction [13], with subsequent research focusing on enhancing performance via model architecture (e."},{"citing_arxiv_id":"2604.09508","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning","primary_cat":"cs.CV","submitted_at":"2026-04-10T17:25:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VISOR is a unified agentic VRAG framework with Evidence Space structuring, visual action evaluation/correction, and dynamic sliding-window trajectories trained via GRPO-based RL that achieves SOTA performance on long-horizon visual reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07220","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HIVE: Query, Hypothesize, Verify An LLM Framework for Multimodal Reasoning-Intensive Retrieval","primary_cat":"cs.IR","submitted_at":"2026-04-08T15:41:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HIVE raises multimodal retrieval nDCG@10 to 41.7 on the MM-BRIGHT benchmark by inserting LLM-driven hypothesis generation and verification between retrieval passes, delivering +9.5 over the best text-only baseline and +14.1 over the best multimodal baseline.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"MM-BRIGHT confirmed this directly: the best multimodal model (Nomic-Vision: 27.6) underperforms the best text-only retriever (DiVeR: 32.2), showing visual reasoning cannot be reduced to embedding fusion. 2.3. 
Visual Document Retrieval ColPali [10] treats document pages as images and embeds them via a VLM using ColBERT-style late interaction, bypassing OCR entirely. DSE [22] and VisRAG [38] similarly embed full page images for dense retrieval. While effective for visual document retrieval, these methods assume visual queries and documents; HIVE addresses multimodal-to-text retrieval where documents are purely textual. 2.4. LLM-Augmented Query Reformulation Query expansion and reformulation have long been used to bridge the vocabulary gap between queries and docu-"},{"citing_arxiv_id":"2604.07201","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment","primary_cat":"cs.IR","submitted_at":"2026-04-08T15:28:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"BRIDGE reaches 29.7 nDCG@10 on MM-BRIGHT by RL-aligning multimodal queries to text and using a reasoning retriever, beating multimodal encoders and, when combined with Nomic-Vision, exceeding the best text-only retriever at 33.3.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"they cannot reason about what a query image implies for document relevance. MM-BRIGHT [6] confirmed this directly - the best multimodal model (27.6 nDCG@10) underperforms the best text-only retriever (32.2), revealing that the bottleneck is not visual encoding capacity but query representation quality. 2.3. Visual Document Retrieval ColPali [12], DSE [23], and VisRAG [43] embed full document pages as images for dense retrieval. This line of work assumes documents are visual, operating in a fundamentally different setting from ours where the corpus is text-only and only the query contains visual content. 2.4. 
Query Rewriting and Alignment Query rewriting has a long history in information retrieval, from classical pseudo-relevance feedback [35] to modern"},{"citing_arxiv_id":"2604.07079","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MARVEL: Multimodal Adaptive Reasoning-intensiVe Expand-rerank and retrievaL","primary_cat":"cs.IR","submitted_at":"2026-04-08T13:35:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MARVEL reaches 37.9 nDCG@10 on the MM-BRIGHT benchmark by combining LLM query expansion, a reasoning-enhanced dense retriever, and GPT-4o CoT reranking, beating prior multimodal encoders by 10.3 points.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Visual Document Retrieval A parallel line of work targets retrieval where both queries and documents are visual. ColPali [10] proposed treating document pages as images and embedding them directly via a VLM using ColBERT-style multi-vector late interaction, capturing fine-grained visual structure such as tables, charts, and layout. DSE [22] and VisRAG [35] similarly embed full page images for dense retrieval. 2.4. LLM-Based Query Expansion Query expansion has a long history in information retrieval, from classical pseudo-relevance feedback [28] to modern LLM-based reformulation. 
HyDE [11] generates hypothetical documents from the query for zero-shot dense retrieval, while Query2Doc [33] expands queries with pseudo-"},{"citing_arxiv_id":"2604.04901","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"FileGram: Grounding Agent Personalization in File-System Behavioral Traces","primary_cat":"cs.CV","submitted_at":"2026-04-06T17:49:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FileGram grounds AI agent personalization in file-system behavioral traces via a data simulation engine, a diagnostic benchmark, and a bottom-up memory architecture.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02073","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PLUME: Latent Reasoning Based Universal Multimodal Embedding","primary_cat":"cs.CV","submitted_at":"2026-04-02T14:04:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PLUME uses latent-state autoregressive rollouts and a progressive training curriculum to deliver efficient reasoning for universal multimodal embeddings without generating explicit rationales.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"itation: E5-V [19] and MM-Embed [28] prompt MLLMs for universal embeddings; VLM2Vec [20] introduces the MMEB benchmark; and VLM2Vec-V2 [31], GME [54], UniME [10], LamRA [30], LLaVE [23], MoCa [3], and DUME [55] further improve retrieval quality and modality coverage. More recent efforts explore multi-vector representations [6], large-scale data synthesis [56, 57], visual document retrieval [49], and reinforcement-learning-based alignment [48] to push the accuracy-efficiency frontier. 
However most methods derive embeddings from a single forward pass without modeling intermediate reasoning, limiting performance on complex retrieval query. 2.2. Reasoning-Enhanced Embedding Chain-of-thought (CoT) prompting [41, 43] elicits multi-step reasoning in language models, and subsequent ex-"},{"citing_arxiv_id":"2507.04590","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents","primary_cat":"cs.CV","submitted_at":"2025-07-07T00:51:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VLM2Vec-V2 is a multimodal embedding model trained on an extended MMEB-V2 benchmark that adds video and visual document tasks and reports gains on both new and prior image benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.20670","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MMSearch-R1: Incentivizing LMMs to Search","primary_cat":"cs.CV","submitted_at":"2025-06-25T17:59:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MMSearch-R1 uses reinforcement learning to train multimodal models for on-demand multi-turn internet search with image and text tools, outperforming same-size RAG baselines and matching larger ones while cutting search calls by over 30%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.22095","ref_index":42,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mixture-of-Retrieval Experts for Reasoning-Guided Multimodal Knowledge 
Exploitation","primary_cat":"cs.CL","submitted_at":"2025-05-28T08:17:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MoRE enables MLLMs to dynamically coordinate heterogeneous retrieval experts via Step-GRPO training, yielding over 7% average gains on open-domain QA benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.21169","ref_index":285,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction","primary_cat":"cs.MM","submitted_at":"2024-10-28T16:11:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Survey proposing a taxonomy for document parsing into pipeline-based systems and VLM-driven unified models, reviewing components, metrics, benchmarks, and challenges.","context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents. arXiv preprint arXiv:2410.10594 (2024). [284] Wenwen Yu, Yuliang Liu, Wei Hua, Deqiang Jiang, Bo Ren, and Xiang Bai. 2023. Turning a clip model into a scene text detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6978-6988. [285] Ye Yuan, Xiao Liu, Wondimu Dikubab, Hui Liu, Zhilong Ji, Zhongqin Wu, and Xiang Bai. 2022. Syntax-aware network for handwritten mathematical expression recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 4553-4562. [286] Fangneng Zhan and Shijian Lu. 2018. ESIR: End-to-end Scene Text Recognition via Iterative Rectification."}],"limit":50,"offset":0}