pith. machine review for the scientific record.

arxiv: 2605.09997 · v1 · submitted 2026-05-11 · 💻 cs.SI · cs.SE

Recognition: no theorem link

GraphInstruct: A Progressive Benchmark for Diagnosing Capability Gaps in LLM Graph Generation

Changjun Jiang, Sheng Xiang, Ying Zhang, Zihe Wei

Pith reviewed 2026-05-12 03:48 UTC · model grok-4.3

classification 💻 cs.SI cs.SE
keywords LLM graph generation · benchmark · progressive complexity · prompting strategies · iterative verification · instruction following · graph synthesis · capability diagnosis

The pith

A progressive benchmark shows verification-guided iteration with adaptive prompting outperforms standard methods for LLM graph generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GraphInstruct as a way to test LLMs on generating graphs of increasing structural complexity. It uses six levels and five dimensions to pinpoint where models struggle with instructions for graph synthesis. Testing reveals that tasks combining multiple constraints best expose differences between models and prompting approaches. No one strategy works for all situations, and semantic constraints tied to the graph's domain stay hard to satisfy even with repeated attempts. The authors then introduce an iterative framework that uses verification to adapt prompts and achieves better results than fixed prompting on the evaluated models.

Core claim

GraphInstruct organizes LLM graph generation evaluation into six progressive complexity levels and five dimensions, supported by 800 hand-authored instructions and 1,582 algorithmically synthesized reference solutions. Across 12 models and 45 (model, strategy) configurations, it shows peak discriminative power at multi-constraint composition, absence of a dominant prompting strategy, and invariance of domain-semantic constraints to iteration. Building on these signals, a verification-guided iterative framework employing constraint-aware adaptive prompting exceeds the performance limits of conventional prompt engineering.
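The review does not reproduce the paper's verifier, but the dimension-level checks it describes imply per-constraint pass/fail reports on a generated graph. A minimal sketch, assuming an edge-list encoding and illustrative constraint names (none of which come from the paper):

```python
# Hedged sketch: per-constraint verification of a generated undirected graph.
# Constraint names and the (n, edges) encoding are illustrative assumptions,
# not the benchmark's actual schema.
def verify(n, edges, constraints):
    """Return a pass/fail report with one entry per requested constraint."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    report = {}
    if "num_nodes" in constraints:
        report["num_nodes"] = (n == constraints["num_nodes"])
    if "connected" in constraints:
        # Depth-first search from node 0; connected iff all nodes are reached.
        seen, stack = {0}, [0]
        while stack:
            u = stack.pop()
            for w in adj[u] - seen:
                seen.add(w)
                stack.append(w)
        report["connected"] = ((len(seen) == n) == constraints["connected"])
    if "max_degree" in constraints:
        degree = max((len(nbrs) for nbrs in adj.values()), default=0)
        report["max_degree"] = (degree <= constraints["max_degree"])
    return report

# A 6-cycle: connected, every node has degree 2, so all three checks pass.
cycle = [(i, (i + 1) % 6) for i in range(6)]
print(verify(6, cycle, {"num_nodes": 6, "connected": True, "max_degree": 2}))
```

A report keyed by constraint name, rather than a single boolean, is what makes the "constraint-aware" adaptation described later possible: the failed keys can be fed back into the next prompt.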

What carries the argument

The progressive stratification into six complexity levels and five evaluation dimensions, paired with the verification-guided iterative framework using constraint-aware adaptive prompting.
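As a reading aid, the verification-guided loop can be sketched as a retry budget with constraint-aware re-prompting. Here `generate` and `verify` are stand-ins for the LLM call and the benchmark's checker, the adaptation wording is invented, and the 5-round default merely echoes the refinement horizon the paper reports in Figure 15; none of this is the authors' implementation:

```python
# Hedged sketch of verification-guided iterative generation with
# constraint-aware adaptive prompting. `generate` and `verify` are
# caller-supplied stand-ins, not the paper's components.
def refine(instruction, generate, verify, max_rounds=5):
    prompt = instruction
    graph, report = None, {}
    for _ in range(max_rounds):
        graph = generate(prompt)
        report = verify(graph)
        failed = [name for name, ok in report.items() if not ok]
        if not failed:
            return graph, report  # all constraints satisfied
        # Constraint-aware adaptation: re-prompt only about what failed,
        # rather than resending a generic "try again" instruction.
        prompt = (instruction
                  + "\nYour previous graph violated: " + ", ".join(failed)
                  + ". Fix only those constraints.")
    return graph, report  # best effort after the round budget
```

The design choice worth noticing is that the loop terminates early on full verification, so the round budget is an upper bound, consistent with the saturation behavior the paper attributes to verifiable constraints.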

Load-bearing premise

The six hand-defined complexity levels and five evaluation dimensions, together with the hand-authored instructions and synthesized references, provide an unbiased and comprehensive map of LLM capability gaps in graph generation.

What would settle it

A single-pass prompting method that matches or exceeds the iterative framework's results across all six complexity levels on the same set of models would falsify the claim that iteration is needed to surpass the prompt-engineering ceiling.

Figures

Figures reproduced from arXiv: 2605.09997 by Changjun Jiang, Sheng Xiang, Ying Zhang, Zihe Wei.

Figure 1: The GraphInstruct benchmark framework. The Progressive Instruction Layer (L0–L5) …

Figure 2: GraphInstruct dataset overview. Left: per-level instruction count. Center: graph-size …

Figure 3: Per-level Quality by capability tier, averaged over the 45 (model, strategy) configurations …

Figure 4: Per-instruction D1 standard deviation by level, averaged over 10 zero-shot models. L2 …

Figure 5: Capability-gap case study at L2 (instruction L2-143). Reference (left) and Sonnet-4.6 …

Figure 6: Prompt sensitivity (σstrat, y-axis) vs. base capability (mean Q, x-axis) across the 11 fully-evaluated models (Sonnet-4 excluded, zero-shot-only). The 4× gap between the weakest T3 models (σstrat = 0.074) and the most prompt-stable T2 models (σstrat = 0.019) establishes an inverse-scaling relation; the solid line is an OLS fit (R² = 0.62). Implications: prompt-engineering budgets should scale inversely with model …

Figure 7: Signed strategy × level effect heatmap (averaged over the 11 fully-evaluated models). Few-shot is net-negative at L2 (−0.034) and net-positive at L4 (+0.069); few-CoT swings from net-negative at L3 (−0.048) to net-positive at L5 (+0.045). Aggregate benchmarks mask these opposite-signed effects. … savior at L4, where domain examples convey structural priors the instruction alone cannot. Few-CoT is savior at L5 …

Figure 8: Signed CoT effect by model family. Qwen3.5 gains uniformly across scales …

Figure 9: Qwen3.5 scale family (35B / 122B / 397B) per-level Quality. Scaling monotonically …

Figure 10: Pareto frontier over 45 baseline (model, strategy) configurations …

Figure 11: Frontier Distance across 45 baseline configurations, sorted ascending. Top: 6 Pareto …

Figure 12: Method × model Quality with per-model Oracle reference line. Combined surpasses Oracle by +0.035 to +0.050 on every target model; VGIG-only contributes the majority of the gain …

Figure 13: E6 feedback-granularity ablation on GPT-4o-mini at …

Figure 14: E5 rounds-saturation curve. Quality improves substantially from …

Figure 15: L4 quality across T ∈ {1, 2, 3, 5, 7, 10, 15, 20} for fine/coarse/none feedback (24 configurations). Flat at 0.750–0.754, indicating semantic-constraint failure is a structurally distinct mode iterative refinement cannot address. Mechanism: two conclusions follow. First, the effective refinement horizon on verifiable graph constraints is ∼5 rounds, markedly shorter than text-domain self-refine budgets of …

Figure 16: L4 per-dimension decomposition across 10 zero-shot models. D1 (structural), D3 …

Figure 17: Per-level capability profiles for six representative models (zero-shot). Each axis shows …
Original abstract

Graph-structured data underpins applications from citation analysis and social-network modeling to molecular design and knowledge-graph construction, and Large Language Models (LLMs) are increasingly used as prompt-driven graph synthesizers. Classical graph-generation reviews catalog deep generative models and their evaluation primitives, but predate the LLM era and provide no foundation for evaluating instruction-following graph synthesis. Recent LLM-era benchmarks evaluate models along graph-type or task-domain axes; such organizations, however, average over structural complexity and cannot localize where in the complexity spectrum an LLM breaks down. To close this diagnostic gap, we introduce GraphInstruct, a progressive-complexity benchmark that stratifies LLM graph generation into six complexity levels and five evaluation dimensions, paired with 800 hand-authored instructions, 1,582 algorithmically synthesized reference solutions, and a 12-LLM capability evaluation across 45 (model, strategy) configurations. We find that discriminative power peaks at multi-constraint composition rather than reasoning depth, that no single prompting strategy dominates across levels or model families, and that domain-semantic constraints remain iteration-invariant under all tested methods -- pointing to retrieval rather than additional compute as the next research frontier. Atop the benchmark, a verification-guided iterative framework with constraint-aware adaptive prompting consistently surpasses the prompt-engineering ceiling on tested target models, demonstrating that the benchmark's fine-grained signals drive method development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces GraphInstruct, a progressive benchmark for diagnosing LLM capability gaps in instruction-following graph generation. It stratifies 800 hand-authored instructions into six complexity levels and five evaluation dimensions, paired with 1,582 algorithmically synthesized reference solutions, and reports results from evaluating 12 LLMs across 45 (model, strategy) configurations. Key findings include peak discriminative power at multi-constraint composition, absence of a dominant prompting strategy, and iteration-invariant failures on domain-semantic constraints; the authors additionally present a verification-guided iterative framework with constraint-aware adaptive prompting that outperforms standard prompt-engineering baselines.

Significance. If the complexity stratification proves non-arbitrary and the reported improvements are robust, the work supplies a much-needed fine-grained diagnostic instrument for LLM graph synthesis that moves beyond coarse task-domain or graph-type axes. The scale of the evaluation (12 models, 45 configurations) and the demonstration that benchmark signals can drive a new prompting method constitute concrete strengths; the identification of retrieval as a frontier for domain-semantic constraints is a useful, falsifiable pointer for follow-on research.

major comments (2)
  1. [Benchmark Construction] Benchmark Construction section: the six hand-defined complexity levels and the claim that 'discriminative power peaks at multi-constraint composition' rest on an unvalidated partitioning. No monotonic degradation of success rates with level, inter-rater reliability statistics, or correlation with model-agnostic graph-complexity metrics (treewidth, constraint-satisfaction hardness) are reported, leaving open the possibility that observed patterns reflect surface features of the hand-authored instructions rather than intrinsic generation difficulty.
  2. [Evaluation and Results] Evaluation and Results section: the central claim that the verification-guided iterative framework 'consistently surpasses the prompt-engineering ceiling' because of the benchmark's fine-grained signals requires explicit implementation details of the constraint-aware adaptive prompting, per-configuration success rates with error bars, and statistical tests across the 45 setups. Without these, it is impossible to confirm that gains are attributable to the benchmark rather than to the particular instruction distribution or unstated hyper-parameters.
minor comments (2)
  1. [Abstract] Abstract and §4: the pairing between the 800 instructions and 1,582 references is not stated explicitly; clarify whether every instruction has a unique reference or whether some references serve multiple instructions.
  2. [Figures/Tables] Figure and table captions: ensure all evaluation dimensions and complexity levels are defined in the caption or a nearby table so that readers can interpret results without returning to the main text.
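The validation the first major comment asks for could start with something as simple as rank-correlating a model-agnostic difficulty proxy against per-level success. A sketch using Spearman's rho (no tie handling) on invented numbers; constraint count per level is one of the proxies the referee names:

```python
# Sketch of the partition-validity check from major comment 1: does a
# model-agnostic difficulty proxy rank the levels the same way observed
# success rates do? All numbers below are invented for illustration.
def rank(xs):
    """1-based ranks of xs (assumes no ties)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for pos, i in enumerate(order):
        r[i] = pos + 1
    return r

def spearman(x, y):
    """Spearman's rho via the rank-difference formula (tie-free case)."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

constraint_count = [1, 2, 3, 4, 5, 6]                 # proxy per level L0-L5
success_rate = [0.95, 0.90, 0.70, 0.65, 0.50, 0.40]   # hypothetical
print(spearman(constraint_count, success_rate))       # -1.0: perfectly inverse
```

A strongly negative rho would support the partitioning; a weak or non-monotone one would support the referee's worry that the levels track surface features of the instructions.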

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and specify the revisions planned for the manuscript.

Point-by-point responses
  1. Referee: [Benchmark Construction] Benchmark Construction section: the six hand-defined complexity levels and the claim that 'discriminative power peaks at multi-constraint composition' rest on an unvalidated partitioning. No monotonic degradation of success rates with level, inter-rater reliability statistics, or correlation with model-agnostic graph-complexity metrics (treewidth, constraint-satisfaction hardness) are reported, leaving open the possibility that observed patterns reflect surface features of the hand-authored instructions rather than intrinsic generation difficulty.

    Authors: The six levels were constructed by incrementally composing constraints (structural, numerical, domain-semantic) in a manner intended to reflect increasing instruction complexity for graph generation. We acknowledge the absence of formal validation metrics such as inter-rater reliability or correlations with treewidth/hardness measures. In revision we will add: per-level success rate tables across all models to document the observed patterns; a note that strict monotonic degradation is not theoretically required given heterogeneous LLM capabilities; and exploratory correlations using constraint count as a proxy metric. The peak discriminative power at multi-constraint composition remains an empirical observation from the 45 configurations, but we will qualify the claim to reflect the hand-authored nature of the partitioning. revision: partial

  2. Referee: [Evaluation and Results] Evaluation and Results section: the central claim that the verification-guided iterative framework 'consistently surpasses the prompt-engineering ceiling' because of the benchmark's fine-grained signals requires explicit implementation details of the constraint-aware adaptive prompting, per-configuration success rates with error bars, and statistical tests across the 45 setups. Without these, it is impossible to confirm that gains are attributable to the benchmark rather than to the particular instruction distribution or unstated hyper-parameters.

    Authors: We will expand the Evaluation and Results section with: explicit pseudocode and description of the constraint-aware adaptive prompting mechanism; a supplementary table reporting per-configuration success rates (with standard deviations from repeated runs where performed); and statistical tests (paired comparisons with bootstrap intervals) across the 45 (model, strategy) setups. These additions will make transparent that the framework leverages the benchmark's fine-grained failure signals for targeted adaptation rather than relying on generic prompting. We agree the original version omitted sufficient implementation and statistical detail. revision: yes
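The paired comparisons with bootstrap intervals promised in this response could take the following shape; the per-instruction scores are invented, and only the resampling logic is the point:

```python
# Sketch of a paired-bootstrap comparison between two methods: resample
# per-instruction score differences and report a percentile confidence
# interval for the mean gain. Scores below are invented for illustration.
import random

def bootstrap_ci(diffs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile (1 - alpha) CI for the mean of paired differences."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        sample = [rng.choice(diffs) for _ in diffs]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

iterative = [0.82, 0.75, 0.90, 0.68, 0.77, 0.85, 0.80, 0.72]
baseline  = [0.74, 0.70, 0.88, 0.60, 0.71, 0.79, 0.75, 0.69]
diffs = [a - b for a, b in zip(iterative, baseline)]
lo, hi = bootstrap_ci(diffs)
print(f"mean gain {sum(diffs) / len(diffs):.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

An interval excluding zero is the minimal evidence that the framework's gain over a baseline configuration is not an artifact of the instruction sample; pairing by instruction is what makes the test sensitive to per-item variation.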

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct measurements only

full rationale

This is a pure empirical benchmark paper introducing hand-authored instructions, algorithmically synthesized references, and LLM evaluations across configurations. No derivations, equations, fitted parameters, or predictions appear in the abstract or described content. Outcomes are reported as direct measurements against external references. No self-citations are invoked as load-bearing premises. The central claims rest on observed performance differences, not on any reduction to inputs by construction. This aligns with the default expectation for non-circular empirical studies.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark paper; the abstract introduces no mathematical derivations, fitted constants, background axioms, or new postulated entities.

pith-pipeline@v0.9.0 · 5546 in / 1130 out tokens · 46289 ms · 2026-05-12T03:48:52.236009+00:00 · methodology

discussion (0)

