DiagramRAG: A Lightweight Framework to Retrieve Scientific Diagram for Figure Generation

Chengliang Chai; Chi Zhang; Jingyuan Tan; Junyi Han; Xiangyu Fu; Xinjiang Yu; Yixiang Jian; Yu-Ping Wang; Zhuofan Chen; Zirui You

arxiv: 2605.27931 · v1 · pith:CTYGPTNZnew · submitted 2026-05-27 · 💻 cs.AI

Pith reviewed 2026-06-29 12:43 UTC · model grok-4.3

keywords: diagram retrieval, sketch-based generation, knowledge graphs, retrieval-augmented generation, scientific diagrams, topological compatibility, embedding alignment, figure generation

The pith

DiagramRAG retrieves reference diagrams that match both the content and topological structure of a user sketch to guide scientific diagram generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DiagramRAG as a retrieval-augmented system that converts incomplete user sketches into higher-quality scientific diagrams by locating suitable reference diagrams. It encodes diagrams as knowledge graphs to preserve structure, generates multiple simplified sketch versions for training data, and learns embeddings that place compatible sketches and full diagrams close together. Retrieved references then supply missing content, connections, and visual details to the generation process. A reader would care because existing sketch methods either copy the rough input or ignore its layout, while pure text methods overlook spatial relations that diagrams require.

Core claim

Given a user sketch, DiagramRAG retrieves reference diagrams that are both semantically relevant to the sketch content and topologically compatible with its structure, and uses them to guide downstream diagram generation. To enable efficient structure-aware retrieval, diagrams are represented as knowledge graphs, sketch variants at different simplification levels are synthesized, and an embedding model is trained to align sketches with compatible diagrams in a shared space. The retrieved references further provide content, topology, and visual priors for completing and rendering the final diagram.
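As a rough illustration of the representation described above, the sketch below stores a diagram as a set of (source, relation, target) triples and derives simplified variants by keeping only a fraction of the edges. The `DiagramGraph` class, the `simplify` function, the `keep_ratio` values, and the use of random edge dropping are all assumptions made for illustration; the paper's actual synthesis procedure is not described on this page.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DiagramGraph:
    """Hypothetical knowledge-graph view of a diagram."""
    nodes: frozenset  # node labels, e.g. "Encoder"
    edges: frozenset  # (source, relation, target) triples


def simplify(graph, keep_ratio, seed=0):
    """Return a sketch-like variant keeping about keep_ratio of the edges.

    Random edge dropping is an illustrative stand-in, not the paper's method.
    """
    edges = sorted(graph.edges)
    if not edges:
        return graph
    k = max(1, round(len(edges) * keep_ratio))
    kept = frozenset(random.Random(seed).sample(edges, k))
    # Keep only the nodes that still take part in some edge.
    nodes = frozenset(n for s, _, t in kept for n in (s, t))
    return DiagramGraph(nodes, kept)


full = DiagramGraph(
    nodes=frozenset({"Sketch", "Encoder", "Retriever", "Generator"}),
    edges=frozenset({
        ("Sketch", "feeds", "Encoder"),
        ("Encoder", "queries", "Retriever"),
        ("Retriever", "guides", "Generator"),
        ("Sketch", "conditions", "Generator"),
    }),
)
# Three hypothetical simplification levels, from mild to aggressive.
variants = [simplify(full, r, seed=i) for i, r in enumerate((0.75, 0.5, 0.25))]
```

Each variant is a subgraph of the full diagram, so a (variant, full diagram) pair could plausibly serve as a positive example when training the embedding model.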

What carries the argument

Knowledge graph encoding of diagrams plus an embedding model trained on synthesized multi-level sketch variants, which aligns incomplete sketches to topologically compatible full references for retrieval.
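A minimal sketch of the retrieval step, assuming sketches and diagrams have already been embedded in the shared space: rank the candidate diagrams by cosine similarity to the sketch and keep the top k. The toy vectors, the `retrieve` helper, and the choice of cosine similarity are assumptions for illustration; the default k = 3 echoes the top-3 references mentioned in the Figure 4 caption.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(sketch_vec, diagram_vecs, k=3):
    """Return the ids of the k diagrams most similar to the sketch."""
    ranked = sorted(
        diagram_vecs,
        key=lambda name: cosine(sketch_vec, diagram_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical 2-d embeddings; a real model would produce far larger vectors.
index = {
    "pipeline_fig": [1.0, 0.1],
    "bar_chart": [0.0, 1.0],
    "arch_overview": [0.5, 0.5],
}
top = retrieve([1.0, 0.0], index, k=2)
```

A production system would likely replace the exhaustive sort with an approximate nearest-neighbor index, but the ranking logic stays the same.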

If this is right

The system reaches F1-scores of 0.848 on DiagramBank and 0.802 on FigureBench.
Generation quality improves to a peak VLM-as-a-Judge score of 7.170.
Inference runs at 35.48 seconds per sample.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same graph-plus-embedding retrieval could be tested on non-scientific sketches such as flowcharts or network diagrams to check whether topological matching transfers.
Feeding the retrieved reference graphs directly as additional conditioning tokens into the generator might raise the observed quality scores further.
The multi-level sketch synthesis used for training suggests a reusable way to make retrieval robust when input drawings omit many edges or labels.

Load-bearing premise

That encoding diagrams as knowledge graphs and training embeddings on synthesized sketch variants will produce alignments that reliably locate references whose content and structure actually improve the quality of the generated diagram.

What would settle it

Generating diagrams from the same sketches once with the retrieved references and once without them, then measuring no gain or a drop in VLM-as-a-Judge score or F1 on DiagramBank and FigureBench, would show the retrieval step adds no value.
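The comparison described above could be summarized along the following lines: score each sketch twice, once with and once without retrieved references, then look at the mean paired difference and the fraction of sketches that improved. The `paired_gain` helper and the example scores are hypothetical; they are not results from the paper.

```python
from statistics import mean


def paired_gain(with_refs, without_refs):
    """Mean score gain and win rate of the retrieval condition, per sketch."""
    if len(with_refs) != len(without_refs) or not with_refs:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [a - b for a, b in zip(with_refs, without_refs)]
    win_rate = sum(d > 0 for d in diffs) / len(diffs)
    return mean(diffs), win_rate


# Made-up judge scores for three sketches, purely for illustration.
gain, wins = paired_gain([7.0, 8.0, 6.0], [6.0, 8.0, 5.0])
```

A mean gain near zero or below, together with a win rate at or under one half, would be the outcome that suggests the retrieval step adds no value.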

Figures

Figures reproduced from arXiv: 2605.27931 by Chengliang Chai, Chi Zhang, Jingyuan Tan, Junyi Han, Xiangyu Fu, Xinjiang Yu, Yixiang Jian, Yu-Ping Wang, Zhuofan Chen, Zirui You.

**Figure 1.** Overview of DiagramRAG. Accompanying text (§3.1, Problem Definition): In this paper, we study the problem of sketch-based diagram completion over a high-quality diagram set D, where each diagram d ∈ D is a publication-quality diagram and the sketch s is a simplified diagram representation. The objective is to retrieve diagrams that are semantically relevant and topologically compatible with s and generate a refined diagram I bas…

**Figure 2.** Examples of generated sketch variants.

**Figure 3.** Effect of sketch variants on retrieval performance.

**Figure 4.** Dual-Agent ablation. VLM-based comparison of variants with and without the Structural Planning Agent and Visual Guidance Agent. Accompanying text (fragment): … arrow directions, and topological consistency. The Visual Guidance Agent uses the retrieved top-3 fine diagrams as visual references and guides layout organization, color logic, icon usage, local panels, feature/token/image thumbnails, and the level of detail expected in publicati…

**Figure 5.** Mean information loss in the DiagramBank dataset.

**Figure 6.** Mean information loss in the FigureBench dataset.

Original abstract

Scientific diagrams are essential for communicating complex methodologies in academic papers. A natural way for researchers to specify such diagrams is through rough sketches, where text labels, connectors, and spatial arrangements express early semantic and topological intentions. However, sketches are usually incomplete, making them insufficient for directly producing publication-quality diagrams. Existing sketch-based generation methods mainly reconstruct the sketch itself, while recent text-driven diagram generation frameworks rely on textual semantics and do not fully exploit the topological structure contained in sketches. In this paper, we introduce DiagramRAG, a lightweight retrieval-augmented framework for sketch-based scientific diagram completion. Given a user sketch, DiagramRAG retrieves reference diagrams that are both semantically relevant to the sketch content and topologically compatible with its structure, and uses them to guide downstream diagram generation. To enable efficient structure-aware retrieval, we represent diagrams as knowledge graphs, synthesize sketch variants at different simplification levels, and train an embedding model to align sketches with compatible diagrams in a shared space. The retrieved references further provide content, topology, and visual priors for completing and rendering the final diagram. Experiments show that DiagramRAG achieves F1-scores of 0.848 and 0.802 on DiagramBank and FigureBench, respectively, and improves generation quality with the best VLM-as-a-Judge score of 7.170, while reducing inference latency to 35.48 seconds per sample. Our code and data are available at https://anonymous.4open.science/r/DiagramRAG-A262 and https://huggingface.co/datasets/anonymous-review-a262/DiagramSketch.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DiagramRAG gives a concrete retrieval pipeline using knowledge graphs and multi-level synthetic sketches, but the lack of real-user-sketch testing leaves the main claim unproven.

DiagramRAG builds a retrieval system that encodes diagrams as knowledge graphs, generates sketch variants at several simplification levels, trains embeddings to align them, and pulls references to guide final diagram generation. It reports F1 scores of 0.848 and 0.802 on DiagramBank and FigureBench, a top VLM judge score of 7.17, and 35-second latency, with code and data released.

The combination of graph representation and contrastive training on synthesized variants is a straightforward way to handle both content and topology in one pass. That goes beyond pure sketch reconstruction or text-only methods, and the lightweight framing plus public resources make the pipeline easy to inspect.

The main weakness is the one flagged in the stress test. Training and evaluation stay inside the synthetic sketch variants. Real user sketches carry stroke noise, variable label placement, and partial connections that the synthesis process may not capture. Without a held-out test on actual hand-drawn inputs, there is no direct evidence that the embeddings surface topologically compatible references in practice. That gap sits at the center of the claimed benefit.

The reported metrics are specific, but the paper would be tighter with even a small real-sketch validation set or human preference data to show the retrieval step actually improves output quality.

This work is aimed at researchers building tools for scientific figure creation or structured retrieval. A reader who needs a working example of graph-based RAG for diagrams can extract the pipeline and data release.

It should go to peer review. The method is explicit, the numbers are given, and the resources are public, so referees can verify the implementation and ask for the missing real-sketch checks.

Referee Report

1 major / 2 minor

Summary. The paper introduces DiagramRAG, a lightweight RAG framework for sketch-based scientific diagram completion. Diagrams are represented as knowledge graphs; sketch variants at varying simplification levels are synthesized to train a contrastive embedding model that aligns user sketches with semantically relevant and topologically compatible reference diagrams in a shared space. Retrieved references supply content, topology, and visual priors to guide downstream generation. Experiments report F1 scores of 0.848 on DiagramBank and 0.802 on FigureBench, a peak VLM-as-a-Judge score of 7.170, and inference latency of 35.48 s per sample, with code and data released.

Significance. If the retrieval alignments generalize, the framework offers a practical, structure-aware alternative to purely text-driven or sketch-reconstruction methods for scientific figure generation, with measurable gains in quality and efficiency. The public release of code and datasets supports reproducibility and follow-on work.

major comments (1)

[§3.2] The contrastive embedding model is trained exclusively on synthesized sketch variants at different simplification levels. No held-out evaluation on authentic user-collected sketches (with real stroke noise, label variance, or partial connectivity) is reported, which directly undermines the central claim that the shared embedding space reliably surfaces topologically compatible references for downstream RAG benefit.
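For readers unfamiliar with the training setup under discussion, the following is a minimal sketch of a contrastive objective over sketch and diagram embeddings. It uses an InfoNCE-style loss with in-batch negatives, which is an assumption: this page says only that the model is contrastive, not which loss it uses. The `info_nce` function and the temperature value are illustrative.

```python
import math


def info_nce(sim, temperature=0.07):
    """InfoNCE-style loss over a square similarity matrix.

    sim[i][j] is the similarity of sketch i to diagram j; the matching
    diagram for sketch i is assumed to sit on the diagonal (j == i).
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denom - logits[i]  # negative log-softmax of the positive
    return total / len(sim)


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]])     # positives score highest
misaligned = info_nce([[0.0, 1.0], [1.0, 0.0]])  # negatives score highest
```

The loss is small when each sketch is most similar to its own diagram and large otherwise. The referee's point is that such a loss only guarantees this on the distribution it was trained on, here synthesized variants.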

minor comments (2)

The F1 metric is reported without an explicit definition of what constitutes a true positive (e.g., node/edge overlap thresholds or semantic matching criteria) in the retrieval or generation evaluation.
Anonymous repository and dataset links are appropriate for review but should be replaced with permanent identifiers in the camera-ready version.
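To make the first minor comment concrete, here is one possible definition of F1 with an explicit true-positive rule: a predicted (source, relation, target) triple counts as correct only if it matches a reference triple exactly. This rule and the `triple_f1` helper are assumptions for illustration; the paper may define matches differently, for example with semantic or partial matching.

```python
def triple_f1(predicted, reference):
    """F1 over graph triples, counting exact matches as true positives."""
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical triples: one of two predictions matches one of two references.
pred = {("Encoder", "feeds", "Decoder"), ("Decoder", "feeds", "Loss")}
gold = {("Encoder", "feeds", "Decoder"), ("Input", "feeds", "Encoder")}
score = triple_f1(pred, gold)
```

Under a looser matching rule the same predictions could score higher, which is why the reported 0.848 and 0.802 are hard to interpret without the definition.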

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and the specific comment on evaluation. Below we respond directly to the major point raised.

Referee: [§3.2] The contrastive embedding model is trained exclusively on synthesized sketch variants at different simplification levels. No held-out evaluation on authentic user-collected sketches (with real stroke noise, label variance, or partial connectivity) is reported, which directly undermines the central claim that the shared embedding space reliably surfaces topologically compatible references for downstream RAG benefit.

Authors: We agree that evaluation on authentic user sketches would strengthen the central claim. Our synthesis procedure was explicitly constructed to introduce controlled variations in stroke density, label placement, and connectivity that approximate the characteristics of hand-drawn scientific sketches; the contrastive objective is trained across multiple simplification levels precisely to encourage robustness to such noise. The held-out portions of DiagramBank and FigureBench follow the same synthesis distribution, and the downstream RAG gains (F1 0.848/0.802 and VLM score 7.170) provide indirect evidence that the learned space transfers to the generation task. Nevertheless, we recognize that this remains a proxy evaluation. We will revise §3.2 and add a dedicated limitations paragraph to state the reliance on synthesized data, describe the design choices intended to mimic real sketch variance, and note the absence of a real-user held-out set as an open direction for future work.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper's core pipeline—representing diagrams as knowledge graphs, synthesizing sketch variants for contrastive embedding training, retrieving from external collections (DiagramBank/FigureBench), and using references to guide generation—does not contain any self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations. The reported F1-scores (0.848/0.802) and VLM score are external benchmark metrics on held-out data, not quantities forced by construction from the synthesis or training inputs. The embedding alignment is presented as an empirical method whose reliability is tested separately rather than assumed by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the core modeling choice of knowledge-graph diagram encoding is treated as a domain assumption.

axioms (1)

Domain assumption: Diagrams can be faithfully represented as knowledge graphs that capture both semantic labels and topological structure.
Invoked to enable structure-aware retrieval from sketches.

[1] [1]

A diagram is worth a dozen images

Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In Computer Vision – ECCV 2016, volume 9908 of Lecture Notes in Computer Science, pages 235–251. Springer, 2016. doi: 10.1007/ 978-3-319-46493-0_15. URLhttps://doi.org/10.1007/978-3-319-46493-0_15

work page doi:10.1007/978-3-319-46493-0_15 2016

[2] [2]

Dy- namic graph generation network: Generating relational knowledge from diagrams

Daesik Kim, Youngjoon Yoo, Jeesoo Kim, Sangkuk Lee, and Nojun Kwak. Dy- namic graph generation network: Generating relational knowledge from diagrams. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3298– 3307, 2018. URL https://openaccess.thecvf.com/content_cvpr_2018/html/Kim_ Dynamic_Graph_Generation_CVPR_2018_paper.html

2018

[3] [3]

DiagramBank: A Large-scale Dataset of Diagram Design Exemplars with Paper Metadata for Retrieval-Augmented Generation

Tingwen Zhang, Ling Yue, Zhen Xu, and Shaowu Pan. Diagrambank: A large-scale dataset of diagram design exemplars with paper metadata for retrieval-augmented generation. arXiv preprint arXiv:2604.20857, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[4] [4]

Graphesis: Visual Knowledge Production and Representation

Johanna Drucker. Graphesis: Visual Knowledge Production and Representation. Harvard University Press, 2014

2014

[5] [5]

Visualizing thought

Barbara Tversky. Visualizing thought. Topics in Cognitive Science, 3(3):499–535, 2011

2011

[6] [6]

West, and Bill Howe

Po-Shen Lee, Jevin D. West, and Bill Howe. Viziometrics: Analyzing visual information in the scientific literature. IEEE Transactions on Big Data, 4(1):117–129, 2018. doi: 10.1109/ TBDATA.2017.2689038

work page arXiv 2018

[7] [8]

URLhttps://arxiv.org/abs/2601.23265

work page arXiv

[8] [9]

Autofigure: Generating and refining publication-ready scientific illustrations

Minjun Zhu, Zhen Lin, Yixuan Weng, Panzhong Lu, Qiujie Xie, Yifan Wei, Sifan Liu, Qiyao Sun, and Yue Zhang. Autofigure: Generating and refining publication-ready scientific illustrations. In International Conference on Learning Representations, 2026. URL https://arxiv.org/ abs/2602.03828. to appear

work page arXiv 2026

[9] [10]

SciFig: Towards Automating Editable Figure Generation for Scientific Papers

Siyuan Huang, Yutong Gao, Juyang Bai, Yifan Zhou, Zi Yin, Xinxin Liu, Rama Chellappa, Chun Pong Lau, Sayan Nag, Cheng Peng, and Shraman Pramanick. SciFig: Towards Automating Scientific Figure Generation. arXiv preprint arXiv:2601.04390, 2026. URL https://arxiv. org/abs/2601.04390

work page internal anchor Pith review Pith/arXiv arXiv 2026

[10] [11]

Itsumi Saito, Haruto Yoshida, and Keisuke Sakaguchi. Sketch2diagram: Generating vector diagrams from hand-drawn sketches. In International Conference on Learning Representations (ICLR), 2025

[11] [12]

Kevin Ellis, Daniel Ritchie, Armando Solar-Lezama, and Joshua B. Tenenbaum. Learning to infer graphics programs from hand-drawn images. In Advances in Neural Information Processing Systems, volume 31, pages 6062–6071, 2018. URL https://proceedings.neurips.cc/paper/2018/hash/6788076842014c83cedadbe6b0ba0314-Abstract.html

[12] [13]

Cheng Tan, Qi Chen, Jingxuan Wei, Gaowei Wu, Zhangyang Gao, Siyuan Li, Bihui Yu, Ruifeng Guo, and Stan Z. Li. Sketchagent: Generating structured diagrams from hand-drawn sketches. arXiv preprint arXiv:2508.01237, 2025


[13] [14]

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3836–3847, 2023. URL https://openaccess.thecvf.com/content/ICCV2023/html/Zhang_Adding_Conditional_Control_to_Text-to-Image_Diffusion_Models_ICCV_2023_paper.html

[14] [15]

Abhay Zala, Han Lin, Jaemin Cho, and Mohit Bansal. Diagrammergpt: Generating open-domain, open-platform diagrams via llm planning. arXiv preprint arXiv:2310.12128, 2023. URL https://arxiv.org/abs/2310.12128

[15] [16]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. NeurIPS, 2020

[16] [17]

Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. Image retrieval using scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015

[17] [18]

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision (IJCV), 2017

[18] [19]

Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. Visual relationship detection with language priors. In European Conference on Computer Vision, pages 852–869. Springer, 2016. doi: 10.1007/978-3-319-46448-0_51. URL https://arxiv.org/abs/1608.00187

[20] [21]

Danfei Xu, Yuke Zhu, Christopher B. Choy, and Li Fei-Fei. Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5410–5419, 2017. URL https://openaccess.thecvf.com/content_cvpr_2017/html/Xu_Scene_Graph_Generation_CVPR_2017_paper.html

[21] [22]

Rowan Zellers, Mark Yatskar, Sam Thomson, and Yejin Choi. Neural motifs: Scene graph parsing with global context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5831–5840, 2018. URL https://openaccess.thecvf.com/content_cvpr_2018/html/Zellers_Neural_Motifs_Scene_CVPR_2018_paper.html

[22] [23]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML), pages 8748–8763, ...

2021

[23] [24]

Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2818–2829, 2023

[24] [25]

Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, et al. Qwen3-vl-embedding and qwen3-vl-reranker: A unified framework for state-of-the-art multimodal retrieval and ranking. arXiv preprint arXiv:2601.04720, 2026

[25] [26]

Aneeshan Sain, Ayan Kumar Bhunia, Pinaki Nath Chowdhury, Subhadeep Koley, Tao Xiang, and Yi-Zhe Song. Clip for all things zero-shot sketch-based image retrieval, fine-grained or not. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2765–2775, 2023

[26] [27]

Xuefei Li, Yizhou Zhou, et al. Multimodal representation learning via contrastive prediction for image and text matching. Pattern Recognition, 123:108402, 2022. doi: 10.1016/j.patcog.2021.108402. URL https://www.sciencedirect.com/science/article/pii/S0031320321003198

[27] [28]

Yunsheng Bai, Hao Ding, Song Bian, Ting Chen, Yizhou Sun, and Wei Wang. SimGNN: A Neural Network Approach to Fast Graph Similarity Computation. In Proceedings of the 12th ACM International Conference on Web Search and Data Mining (WSDM), pages 384–392, 2019. doi: 10.1145/3289600.3290967

[29] [30]

Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D. Manning. Raptor: Recursive abstractive processing for tree-organized retrieval. 2024


[30] [31]

Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph rag approach to query-focused summarization. 2025

[31] [32]

Omar Khattab and Matei Zaharia. Colbert: Efficient and effective passage search via contextualized late interaction over bert. 2020

[32] [33]

Chumeng Liang and Jiaxuan You. Diagrameval: Evaluating llm-generated diagrams via graphs. 2025

[33] [34]

Tong Zhang, Honglin Lin, Zhou Liu, Chong Chen, and Wentao Zhang. Sciflow-bench: Evaluating structure-aware scientific diagram generation via inverse parsing. 2026

[34] [35]

Gary Bécigneul, Octavian-Eugen Ganea, Benson Chen, Regina Barzilay, and Tommi Jaakkola. Optimal transport graph neural networks. arXiv preprint arXiv:2006.04804, 2020. URL https://arxiv.org/abs/2006.04804

[35] [36]

Michel Neuhaus and Horst Bunke. Bridging the gap between graph edit distance and kernel machines. World Scientific Publishing Co., Inc., 15(10):2129–2144, 2005

[36] [37]

Ziheng Sun, Hongning Wang, Kangfei Zhao, and Qiong Luo. A comprehensive survey and experimental study of subgraph matching: Trends, unbiasedness, and interaction. ACM Transactions on Database Systems, 49(1):1–42, 2024. doi: 10.1145/3639315. URL https://dl.acm.org/doi/10.1145/3639315

[37] [38]

Myoungji Han, Hyunjoon Kim, Geonmo Gu, Kunsoo Park, and Wook-Shin Han. A comprehensive survey of subgraph matching: [experiments & analysis]. ACM Transactions on Database Systems, 2025. doi: 10.1145/3771791. URL https://dl.acm.org/doi/10.1145/3771791. Just Accepted

[38] [39]

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, volume 30, 2017. URL https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree

[39] [40]

Seongyun Lee, Seungone Kim, Sue Hyun Park, Geewook Kim, and Minjoon Seo. Prometheus-vision: Vision-language model as a judge for fine-grained evaluation. arXiv preprint arXiv:2401.06591, 2024. URL https://arxiv.org/abs/2401.06591

[40] [41]

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023

[41] [42]

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025

[42] [43]

Chaoyuan Shen, Chi Zhang, Chengliang Chai, Jiacheng Wang, Jia Yuan, Yuping Wang, Ye Yuan, Guoren Wang, and Lei Cao. Brief: Bi-level coreset selection for efficient instruction tuning in llms. Proceedings of the VLDB Endowment, 19(6):1264–1277, 2026

[43] [44]

Kuan Zhang, Chengliang Chai, Jingzhe Xu, Chi Zhang, Han Han, Ye Yuan, Guoren Wang, and Lei Cao. Handling label noise via instance-level difficulty modeling and dynamic optimization. Advances in Neural Information Processing Systems, 38:46667–46696, 2026

[44] [45]

Chi Zhang, Huaping Zhong, Hongtao Li, Chengliang Chai, Jiawei Hong, Yuhao Deng, Jiacheng Wang, Tian Tan, Yizhou Yan, Jiantao Qiu, et al. Not all documents are what you need for extracting instruction tuning data. arXiv preprint arXiv:2505.12250, 2025

[45] [46]

Chi Zhang, Huaping Zhong, Kuan Zhang, Chengliang Chai, Rui Wang, Xinlin Zhuang, Tianyi Bai, Qiu Jiantao, Lei Cao, Ju Fan, et al. Harnessing diversity for important data selection in pretraining large language models. In International Conference on Learning Representations, volume 2025, pages 72980–73003, 2025

A Additional Experimental Details Training...