arxiv: 2605.27931 · v1 · pith:CTYGPTNZnew · submitted 2026-05-27 · 💻 cs.AI

DiagramRAG: A Lightweight Framework to Retrieve Scientific Diagram for Figure Generation

Pith reviewed 2026-06-29 12:43 UTC · model grok-4.3

classification 💻 cs.AI
keywords diagram retrieval · sketch-based generation · knowledge graphs · retrieval-augmented generation · scientific diagrams · topological compatibility · embedding alignment · figure generation

The pith

DiagramRAG retrieves reference diagrams that match both the content and topological structure of a user sketch to guide scientific diagram generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DiagramRAG as a retrieval-augmented system that converts incomplete user sketches into higher-quality scientific diagrams by locating suitable reference diagrams. It encodes diagrams as knowledge graphs to preserve structure, generates multiple simplified sketch versions for training data, and learns embeddings that place compatible sketches and full diagrams close together. Retrieved references then supply missing content, connections, and visual details to the generation process. A reader would care because existing sketch methods either copy the rough input or ignore its layout, while pure text methods overlook spatial relations that diagrams require.

Core claim

Given a user sketch, DiagramRAG retrieves reference diagrams that are both semantically relevant to the sketch content and topologically compatible with its structure, and uses them to guide downstream diagram generation. To enable efficient structure-aware retrieval, diagrams are represented as knowledge graphs, sketch variants at different simplification levels are synthesized, and an embedding model is trained to align sketches with compatible diagrams in a shared space. The retrieved references further provide content, topology, and visual priors for completing and rendering the final diagram.
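
The retrieval step described above can be sketched as a nearest-neighbour lookup in the shared embedding space. This is a minimal illustration, not the paper's implementation: the vectors, the diagram names, the `retrieve_top_k` helper, and the choice of cosine similarity are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_top_k(sketch_vec, diagram_vecs, k=3):
    """Return the names of the k diagrams closest to the sketch embedding."""
    scored = [(cosine(sketch_vec, vec), name) for name, vec in diagram_vecs.items()]
    # Highest similarity first; ties are broken by name for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


# Hypothetical embeddings; a real system would get these from the trained model.
diagram_vecs = {
    "pipeline_a": [0.9, 0.1, 0.0],
    "pipeline_b": [0.7, 0.6, 0.1],
    "taxonomy_c": [0.0, 0.2, 0.9],
}
sketch_vec = [1.0, 0.2, 0.0]

top = retrieve_top_k(sketch_vec, diagram_vecs, k=2)
print(top)
```

In this toy setup the two pipeline-like diagrams rank above the taxonomy-like one because their vectors point in nearly the same direction as the sketch vector.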

What carries the argument

Knowledge graph encoding of diagrams plus an embedding model trained on synthesized multi-level sketch variants, which aligns incomplete sketches to topologically compatible full references for retrieval.
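
To make the two ingredients concrete, the sketch below stores a diagram as a set of (source, relation, target) triples and derives simplified variants by keeping only a fraction of the edges. The triple format, the `simplify` helper, the keep ratios, and the random edge-dropping rule are assumptions for illustration; the paper's actual synthesis procedure is not specified here.

```python
import random


def simplify(triples, keep_ratio, seed=0):
    """Return a sketch-like variant that keeps about keep_ratio of the edges."""
    rng = random.Random(seed)  # seeded so the variants are reproducible
    n_keep = max(1, round(len(triples) * keep_ratio))
    # Sort first so sampling does not depend on set iteration order.
    return set(rng.sample(sorted(triples), n_keep))


def nodes(triples):
    """Collect every node label that appears in the triples."""
    return {t[0] for t in triples} | {t[2] for t in triples}


# A toy diagram encoded as a knowledge graph (hypothetical content).
diagram = {
    ("Sketch", "encoded_by", "Encoder"),
    ("Encoder", "queries", "Index"),
    ("Index", "returns", "References"),
    ("References", "guide", "Generator"),
}

# Three simplification levels, from lightly to heavily simplified.
variants = {ratio: simplify(diagram, ratio) for ratio in (0.75, 0.5, 0.25)}
for ratio, variant in variants.items():
    print(ratio, len(variant), sorted(nodes(variant)))
```

Each variant is a subgraph of the full diagram, so a (variant, diagram) pair could serve as a positive training example for the embedding model.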

If this is right

  • The system reaches F1-scores of 0.848 on DiagramBank and 0.802 on FigureBench.
  • Generation quality improves to a peak VLM-as-a-Judge score of 7.170.
  • Inference runs at 35.48 seconds per sample.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same graph-plus-embedding retrieval could be tested on non-scientific sketches such as flowcharts or network diagrams to check whether topological matching transfers.
  • Feeding the retrieved reference graphs directly as additional conditioning tokens into the generator might raise the observed quality scores further.
  • The multi-level sketch synthesis used for training suggests a reusable way to make retrieval robust when input drawings omit many edges or labels.

Load-bearing premise

That encoding diagrams as knowledge graphs and training embeddings on synthesized sketch variants will produce alignments that reliably locate references whose content and structure actually improve the quality of the generated diagram.

What would settle it

Generate diagrams from the same sketches twice, once with the retrieved references and once without them. If the VLM-as-a-Judge score or the F1 on DiagramBank and FigureBench shows no gain, or drops, when references are used, the retrieval step adds no value.
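
A paired comparison of this kind could be summarized as below. The scores are made-up placeholders, not results from the paper, and the `paired_gain` helper is a hypothetical name introduced for this example.

```python
from statistics import mean


def paired_gain(with_retrieval, without_retrieval):
    """Return (mean per-sketch gain, fraction of sketches that improved)."""
    if len(with_retrieval) != len(without_retrieval):
        raise ValueError("both runs must score the same sketches")
    diffs = [w - wo for w, wo in zip(with_retrieval, without_retrieval)]
    win_rate = sum(d > 0 for d in diffs) / len(diffs)
    return mean(diffs), win_rate


# Placeholder judge scores for four sketches (illustrative only).
with_refs = [7.0, 6.5, 7.5, 6.0]
without_refs = [6.5, 6.5, 7.0, 6.5]

gain, win_rate = paired_gain(with_refs, without_refs)
print(gain, win_rate)
```

A mean gain near zero or below, together with a win rate at or under one half, would be the outcome that counts against the retrieval step. A real study would also need a significance test over many more sketches.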

Figures

Figures reproduced from arXiv: 2605.27931 by Chengliang Chai, Chi Zhang, Jingyuan Tan, Junyi Han, Xiangyu Fu, Xinjiang Yu, Yixiang Jian, Yu-Ping Wang, Zhuofan Chen, Zirui You.

Figure 1: Overview of DiagramRAG.
  Excerpt from §3.1 Problem Definition: In this paper, we study the problem of sketch-based diagram completion over a high-quality diagram set D, where each diagram d ∈ D is a publication-quality diagram and the sketch s is a simplified diagram representation. The objective is to retrieve diagrams that are semantically relevant and topologically compatible with s and generate a refined diagram I bas…
Figure 2: Examples of generated sketch variants.
Figure 3: Effect of sketch variants on retrieval performance.
Figure 4: Dual-Agent ablation. VLM-based comparison of variants with and without the Structural Planning Agent and Visual Guidance Agent.
  Excerpt from the surrounding text: … arrow directions, and topological consistency. The Visual Guidance Agent uses the retrieved top-3 fine diagrams as visual references and guides layout organization, color logic, icon usage, local panels, feature/token/image thumbnails, and the level of detail expected in publicati…
Figure 5: Mean information loss in the DiagramBank dataset.
Figure 6: Mean information loss in the FigureBench dataset.
Original abstract

Scientific diagrams are essential for communicating complex methodologies in academic papers. A natural way for researchers to specify such diagrams is through rough sketches, where text labels, connectors, and spatial arrangements express early semantic and topological intentions. However, sketches are usually incomplete, making them insufficient for directly producing publication-quality diagrams. Existing sketch-based generation methods mainly reconstruct the sketch itself, while recent text-driven diagram generation frameworks rely on textual semantics and do not fully exploit the topological structure contained in sketches. In this paper, we introduce DiagramRAG, a lightweight retrieval-augmented framework for sketch-based scientific diagram completion. Given a user sketch, DiagramRAG retrieves reference diagrams that are both semantically relevant to the sketch content and topologically compatible with its structure, and uses them to guide downstream diagram generation. To enable efficient structure-aware retrieval, we represent diagrams as knowledge graphs, synthesize sketch variants at different simplification levels, and train an embedding model to align sketches with compatible diagrams in a shared space. The retrieved references further provide content, topology, and visual priors for completing and rendering the final diagram. Experiments show that DiagramRAG achieves F1-scores of 0.848 and 0.802 on DiagramBank and FigureBench, respectively, and improves generation quality with the best VLM-as-a-Judge score of 7.170, while reducing inference latency to 35.48 seconds per sample. Our code and data are available at https://anonymous.4open.science/r/DiagramRAG-A262 and https://huggingface.co/datasets/anonymous-review-a262/DiagramSketch.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces DiagramRAG, a lightweight RAG framework for sketch-based scientific diagram completion. Diagrams are represented as knowledge graphs; sketch variants at varying simplification levels are synthesized to train a contrastive embedding model that aligns user sketches with semantically relevant and topologically compatible reference diagrams in a shared space. Retrieved references supply content, topology, and visual priors to guide downstream generation. Experiments report F1 scores of 0.848 on DiagramBank and 0.802 on FigureBench, a peak VLM-as-a-Judge score of 7.170, and inference latency of 35.48 s per sample, with code and data released.
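
For readers unfamiliar with contrastive alignment, one common objective is InfoNCE, sketched below on a small similarity matrix. The page does not state which contrastive loss the paper uses, so the loss choice, the temperature value, and the `info_nce` helper are assumptions for illustration only.

```python
import math


def info_nce(sim_matrix, temperature=0.1):
    """Mean InfoNCE loss where sim_matrix[i][i] is the positive pair.

    Row i holds the similarities between sketch i and every diagram
    in the batch; the matching diagram sits on the diagonal.
    """
    losses = []
    for i, row in enumerate(sim_matrix):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy similarity matrices (hypothetical values).
aligned = [[0.9, 0.1], [0.2, 0.8]]      # positives score highest
misaligned = [[0.1, 0.9], [0.8, 0.2]]   # negatives score highest

print(info_nce(aligned), info_nce(misaligned))
```

The loss is lower when each sketch is most similar to its own diagram, which is the behaviour a shared sketch-diagram space needs for retrieval.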

Significance. If the retrieval alignments generalize, the framework offers a practical, structure-aware alternative to purely text-driven or sketch-reconstruction methods for scientific figure generation, with measurable gains in quality and efficiency. The public release of code and datasets supports reproducibility and follow-on work.

major comments (1)
  1. [§3.2] The contrastive embedding model is trained exclusively on synthesized sketch variants at different simplification levels. No held-out evaluation on authentic user-collected sketches (with real stroke noise, label variance, or partial connectivity) is reported, which directly undermines the central claim that the shared embedding space reliably surfaces topologically compatible references for downstream RAG benefit.
minor comments (2)
  1. The F1 metric is reported without an explicit definition of what constitutes a true positive (e.g., node/edge overlap thresholds or semantic matching criteria) in the retrieval or generation evaluation.
  2. Anonymous repository and dataset links are appropriate for review but should be replaced with permanent identifiers in the camera-ready version.
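
The first minor comment turns on what counts as a true positive. One possible definition, assuming exact set membership over retrieved diagram identifiers, is sketched below; the paper may instead use node or edge overlap thresholds, and the `precision_recall_f1` helper and the identifiers are hypothetical.

```python
def precision_recall_f1(predicted, relevant):
    """Set-based precision, recall, and F1 with exact-match true positives."""
    predicted, relevant = set(predicted), set(relevant)
    tp = len(predicted & relevant)  # items that are both retrieved and relevant
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical retrieved and ground-truth diagram identifiers.
predicted = {"d1", "d2", "d3"}
relevant = {"d2", "d3", "d4", "d5"}

p, r, f1 = precision_recall_f1(predicted, relevant)
print(p, r, f1)
```

Changing the matching rule (for example, counting a retrieved diagram as correct when its edge overlap passes a threshold) would change `tp` and therefore the reported F1, which is why the definition matters.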

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and the specific comment on evaluation. Below we respond directly to the major point raised.

  1. Referee: [§3.2] The contrastive embedding model is trained exclusively on synthesized sketch variants at different simplification levels. No held-out evaluation on authentic user-collected sketches (with real stroke noise, label variance, or partial connectivity) is reported, which directly undermines the central claim that the shared embedding space reliably surfaces topologically compatible references for downstream RAG benefit.

    Authors: We agree that evaluation on authentic user sketches would strengthen the central claim. Our synthesis procedure was explicitly constructed to introduce controlled variations in stroke density, label placement, and connectivity that approximate the characteristics of hand-drawn scientific sketches; the contrastive objective is trained across multiple simplification levels precisely to encourage robustness to such noise. The held-out portions of DiagramBank and FigureBench follow the same synthesis distribution, and the downstream RAG gains (F1 0.848/0.802 and VLM score 7.170) provide indirect evidence that the learned space transfers to the generation task. Nevertheless, we recognize that this remains a proxy evaluation. We will revise §3.2 and add a dedicated limitations paragraph to state the reliance on synthesized data, describe the design choices intended to mimic real sketch variance, and note the absence of a real-user held-out set as an open direction for future work.

    Revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper's core pipeline—representing diagrams as knowledge graphs, synthesizing sketch variants for contrastive embedding training, retrieving from external collections (DiagramBank/FigureBench), and using references to guide generation—does not contain any self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations. The reported F1-scores (0.848/0.802) and VLM score are external benchmark metrics on held-out data, not quantities forced by construction from the synthesis or training inputs. The embedding alignment is presented as an empirical method whose reliability is tested separately rather than assumed by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the core modeling choice of knowledge-graph diagram encoding is treated as a domain assumption.

axioms (1)
  • Domain assumption: Diagrams can be faithfully represented as knowledge graphs that capture both semantic labels and topological structure.
    Invoked to enable structure-aware retrieval from sketches.

pith-pipeline@v0.9.1-grok · 5847 in / 1226 out tokens · 32916 ms · 2026-06-29T12:43:56.065248+00:00 · methodology
