pith. machine review for the scientific record. sign in

arxiv: 2605.07208 · v1 · submitted 2026-05-08 · 💻 cs.LG

Recognition: 3 theorem links

· Lean Theorem

FAME: Forecasting Academic Impact via Continuous-Time Manifold Evolution

Jianrong Ding, Jianyuan Zhong, Qiang Xu, Zhengyan Shi

Authors on Pith no claims yet

Pith reviewed 2026-05-11 02:43 UTC · model grok-4.3

classification 💻 cs.LG
keywords academic impact forecastingcontinuous-time manifoldtopic trajectory modelingLLM evaluation proxyknowledge flow graphspatiotemporal embeddingprospective prediction
0
0 comments X

The pith

A continuous-time manifold model projects papers into evolving topic spaces to forecast their future academic impact more reliably than large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the difficulty of judging new research ideas by treating the forecasting of already-published papers' impact as a measurable proxy task. It shows that frontier LLMs struggle to separate high-impact work from ordinary publications when relying on static text alone. In response, the authors introduce FAME, which embeds papers in a dynamic latent space shaped by both textual content and a knowledge-flow graph, then learns geometric rules that keep impactful papers aligned with their field's forward direction. Experiments across thousands of arXiv papers in fast-moving areas demonstrate that this trajectory-aware approach substantially beats LLM baselines at predicting multidimensional impact. The work also finds that feeding FAME's geometric signals back into LLMs measurably lifts their own forecasting accuracy.

Core claim

FAME projects papers into a dynamic latent space informed by textual features and a verified knowledge-flow graph, learning geometric constraints that align impactful manuscripts with the forward momentum of their fields. This continuous-time manifold evolution enables prospective forecasting of multidimensional impact that outperforms state-of-the-art LLM evaluators on 3,200 arXiv papers from three rapidly changing subfields, while integration of the same signals improves LLM performance.

What carries the argument

The dynamic latent space with learned geometric constraints that enforce alignment between a paper's trajectory and the field's forward momentum, built from text and a knowledge-flow graph.

If this is right

  • Manuscript impact forecasting becomes a verifiable benchmark for automated scientific evaluation.
  • Integrating trajectory signals from the manifold model improves LLM accuracy at the same forecasting task.
  • Static text-based judging by LLMs is shown to be insufficient for reliable prospective evaluation in evolving fields.
  • Models that respect continuous-time geometric evolution can distinguish high-impact papers from ordinary ones more consistently.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be extended to rank grant proposals or preprints by estimating how well they would align with emerging topic directions.
  • If the manifold constraints generalize, similar methods might apply to forecasting technological or policy outcomes beyond academia.
  • Testing the model on papers from slower-moving fields would clarify whether the gains depend on rapid topic evolution.

Load-bearing premise

That the measurable future impact of already-published papers serves as a reliable proxy for how well a model can judge entirely new and unpublished research ideas.

What would settle it

A direct comparison of FAME's predicted impact rankings against the actual long-term citation and recognition outcomes of the held-out papers, measured years after the forecast date.

Figures

Figures reproduced from arXiv: 2605.07208 by Jianrong Ding, Jianyuan Zhong, Qiang Xu, Zhengyan Shi.

Figure 1
Figure 1. Figure 1: Comparison of standard LLM evaluators with our proposed method, which evaluates [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overall architecture of FAME. First, we construct a high-quality inspiration graph. Second, [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Visualization of the manifold training dynamics and the three applied geometric constraints. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The heavy-tailed distributions of the four raw log-normalized ground-truth metrics used to [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative ablation of the spatiotemporal latent manifold. The panels demonstrate how [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: 2D visualizations comparing the latent spaces of standard text embeddings versus the [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Ternary heatmap illustrating the sensitivity of the FAME framework’s predictive perfor [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Additional examples comparing original text embedding distributions against FAME [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Additional examples comparing original text embedding distributions against FAME [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Additional examples comparing original text embedding distributions against FAME [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Correlation of normalized impact score components with human ratings. The composite [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗
read the original abstract

Large Language Models (LLMs) are increasingly used to brainstorm and evaluate research ideas, yet assessing such judgments is fundamentally difficult because the true impact of a new idea may take years to emerge. We address this challenge by using the impact forecasting of human-authored manuscripts as a verifiable proxy task. In a prospective forecasting study, we find that frontier LLMs fail to reliably distinguish high-impact papers from ordinary publications, suggesting that static text-based judging is insufficient for scientific evaluation. To address this limitation, we propose $\textbf{FAME}$ ($\underline{\text{F}}$orecasting $\underline{\text{A}}$cademic Impact via Continuous-Time $\underline{\text{M}}$anifold $\underline{\text{E}}$volution), a spatiotemporal framework for modeling the dynamic trajectories of scientific topics. FAME projects papers into a dynamic latent space informed by textual features and a verified knowledge-flow graph, learning geometric constraints that align impactful manuscripts with the forward momentum of their fields. Experiments on 3,200 arXiv papers across three fast-evolving subfields show that FAME consistently and substantially outperforms state-of-the-art LLM evaluators in prospective multidimensional impact forecasting. Furthermore, integrating FAME's dynamic geometric signals into LLMs significantly improves their forecasting performance. These results support manuscript impact forecasting as a useful, measurable proxy benchmark and position FAME as a strong, trajectory-aware foundation for automated scientific evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes FAME, a continuous-time manifold evolution model that embeds papers in a dynamic latent space derived from textual features and a verified knowledge-flow graph, learning geometric constraints to align high-impact papers with field trajectories. It reports that frontier LLMs fail to distinguish high- from low-impact papers in a prospective forecasting task and that FAME substantially outperforms LLM baselines on 3,200 arXiv papers across three subfields; integrating FAME signals also improves LLM performance. The work frames published-paper impact forecasting as a verifiable proxy benchmark for automated scientific evaluation of research ideas.

Significance. If the empirical results are robust, the paper supplies a falsifiable, trajectory-based benchmark that moves beyond static text evaluation and could guide development of AI systems for research assessment. The explicit use of future citation trajectories as ground truth and the demonstration that dynamic geometric signals help LLMs are concrete strengths that would be valuable to the community.

major comments (3)
  1. [Abstract and Experiments] Abstract/Experiments: The central claim of consistent outperformance on 3,200 papers is presented without any description of data splits, baseline implementations, exact multidimensional impact metrics, statistical significance tests, or error bars. This absence prevents verification that reported gains are not attributable to post-hoc choices or data leakage, directly affecting the soundness of the empirical contribution.
  2. [Introduction/Discussion] Introduction/Discussion: The paper relies on the assumption that forecasting impact trajectories of already-published, peer-reviewed manuscripts is a reliable proxy for judging entirely new, pre-publication research ideas. Published papers already embed field momentum and peer-review signals absent from novel ideas; without a concrete transfer experiment or discussion of how the learned manifold would generalize outside existing trajectories, the broader claim that FAME advances automated evaluation of research ideas rests on an untested extrapolation.
  3. [Method] Method: The description of 'learning geometric constraints that align impactful manuscripts with the forward momentum of their fields' leaves open whether the knowledge-flow graph and latent-space construction are fully independent of the target impact labels. If any supervision from impact metrics enters the geometric alignment step, the reported superiority over LLM baselines could be circular; explicit equations or pseudocode clarifying independence are required.
minor comments (1)
  1. [Abstract] The abstract states the number of papers and subfields but does not name the three subfields or the precise time horizon used for prospective forecasting; adding these details would improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which helps improve the clarity and rigor of our work. Below we respond to each major comment, indicating the revisions we will make to the manuscript.

read point-by-point responses
  1. Referee: [Abstract and Experiments] The central claim of consistent outperformance on 3,200 papers is presented without any description of data splits, baseline implementations, exact multidimensional impact metrics, statistical significance tests, or error bars. This absence prevents verification that reported gains are not attributable to post-hoc choices or data leakage, directly affecting the soundness of the empirical contribution.

    Authors: We agree with this assessment. The current presentation in the abstract and experiments lacks sufficient methodological details for independent verification. In the revised manuscript, we will provide a comprehensive description of the experimental setup, including: the temporal data splits used to ensure prospective forecasting without leakage (training on earlier papers, testing on later ones); full implementation details and hyperparameters for the LLM baselines; exact definitions and computation of the multidimensional impact metrics; results from statistical significance tests; and error bars or standard deviations for all performance figures. These additions will strengthen the empirical contribution and allow readers to assess the robustness of the reported gains. revision: yes

  2. Referee: [Introduction/Discussion] The paper relies on the assumption that forecasting impact trajectories of already-published, peer-reviewed manuscripts is a reliable proxy for judging entirely new, pre-publication research ideas. Published papers already embed field momentum and peer-review signals absent from novel ideas; without a concrete transfer experiment or discussion of how the learned manifold would generalize outside existing trajectories, the broader claim that FAME advances automated evaluation of research ideas rests on an untested extrapolation.

    Authors: This is a fair point regarding the scope of our claims. We will revise the Introduction and add a new subsection in the Discussion to explicitly address the proxy nature of the task. We will discuss the differences between published papers (which include peer-review signals) and novel ideas, and explain how the dynamic manifold learned by FAME can be used to evaluate new research by projecting their embeddings and assessing trajectory alignment. While we cannot perform a direct transfer experiment here due to the absence of immediate ground-truth impact for unpublished ideas, we will outline this as an important direction for future work and moderate our claims about automated evaluation of research ideas accordingly. revision: partial

  3. Referee: [Method] The description of 'learning geometric constraints that align impactful manuscripts with the forward momentum of their fields' leaves open whether the knowledge-flow graph and latent-space construction are fully independent of the target impact labels. If any supervision from impact metrics enters the geometric alignment step, the reported superiority over LLM baselines could be circular; explicit equations or pseudocode clarifying independence are required.

    Authors: We thank the referee for highlighting this potential ambiguity. The knowledge-flow graph is constructed from citation data and external verification sources without reference to impact metrics. The latent space is initialized from textual embeddings and graph structure in an unsupervised manner. The geometric alignment uses the observed temporal dynamics of papers in the space but does not incorporate impact labels directly into the manifold evolution; impact prediction is handled by a separate module. To eliminate any doubt, we will include precise mathematical formulations and pseudocode in the revised Method section that demonstrate this separation. This will confirm that the superiority over LLM baselines, which rely solely on static text, is not due to circular supervision. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation remains independent of target labels

full rationale

The abstract presents FAME as a spatiotemporal model that projects papers into a latent space using textual features plus an external verified knowledge-flow graph, then learns geometric constraints to align with field momentum. No equations appear, no parameters are described as fitted to the impact labels and then renamed as predictions, and no self-citations or uniqueness theorems are invoked to justify the core construction. The reported outperformance on 3,200 papers is framed as an empirical result rather than a definitional identity. Absent any quoted reduction of the forecasting output to the input labels by construction, the derivation chain does not exhibit circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Abstract-only review prevents exhaustive enumeration; the framework relies on an unstated assumption that a dynamic latent space can be learned to capture forward momentum and on the existence of a verified knowledge-flow graph whose construction details are not provided.

axioms (1)
  • domain assumption The true impact of a new idea may take years to emerge, making direct evaluation of LLM judgments difficult.
    Explicitly stated as motivation for using manuscript impact forecasting as proxy task.
invented entities (1)
  • dynamic latent space informed by textual features and knowledge-flow graph no independent evidence
    purpose: To project papers and learn geometric constraints aligning impactful manuscripts with field momentum.
    Core modeling construct introduced in the FAME description; no independent evidence supplied in abstract.

pith-pipeline@v0.9.0 · 5545 in / 1498 out tokens · 46889 ms · 2026-05-11T02:43:59.665343+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 10 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    On the surprising behavior of distance metrics in high dimensional space

    Charu C Aggarwal, Alexander Hinneburg, and Daniel A Keim. On the surprising behavior of distance metrics in high dimensional space. InInternational conference on database theory, pages 420–434. Springer, 2001

  3. [3]

    The skewness of science in 219 sub-fields and a number of aggregates.Scientometrics, 88(2):385–397, 2011

    Pedro Albarrán, Juan A Crespo, Ignacio Ortuño, and Javier Ruiz-Castillo. The skewness of science in 219 sub-fields and a number of aggregates.Scientometrics, 88(2):385–397, 2011

  4. [4]

    Researchagent: Iterative research idea generation over scientific literature with large language models

    Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, and Sung Ju Hwang. Researchagent: Iterative research idea generation over scientific literature with large language models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pa...

  5. [5]

    Autonomous chemical research with large language models.Nature, 624(7992):570–578, 2023

    Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models.Nature, 624(7992):570–578, 2023

  6. [6]

    arXiv preprint arXiv:2304.05376 , year=

    Andres M Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller. Chemcrow: Augmenting large-language models with chemistry tools.arXiv preprint arXiv:2304.05376, 2023

  7. [7]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4

    Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. Sparks of artificial general intelligence: Early experiments with gpt-4.arXiv preprint arXiv:2303.12712, 2023

  8. [8]

    Beltrami flow and neural diffusion on graphs.Advances in Neural Information Processing Systems, 34:1594–1609, 2021

    Benjamin Chamberlain, James Rowbottom, Davide Eynard, Francesco Di Giovanni, Xiaowen Dong, and Michael Bronstein. Beltrami flow and neural diffusion on graphs.Advances in Neural Information Processing Systems, 34:1594–1609, 2021

  9. [9]

    Xgboost: A scalable tree boosting system

    Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. InProceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794, 2016

  10. [10]

    Structural scaffolds for citation intent classification in scientific publications

    Arman Cohan, Waleed Ammar, Madeleine Van Zuylen, and Field Cady. Structural scaffolds for citation intent classification in scientific publications. InProceedings of the 2019 conference of the North American chapter of the Association for Computational Linguistics: human language technologies, volume 1 (long and short papers), pages 3586–3596, 2019

  11. [11]

    Support-vector networks.Machine learning, 20(3):273– 297, 1995

    Corinna Cortes and Vladimir Vapnik. Support-vector networks.Machine learning, 20(3):273– 297, 1995

  12. [12]

    Towards an AI co-scientist

    Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, et al. Towards an ai co-scientist. arXiv preprint arXiv:2502.18864, 2025

  13. [13]

    Temporal knowledge graph forecasting with neural ode.arXiv preprint arXiv:2101.05151, 2021

    Zhen Han, Zifeng Ding, Yunpu Ma, Yujia Gu, and V olker Tresp. Temporal knowledge graph forecasting with neural ode.arXiv preprint arXiv:2101.05151, 2021

  14. [14]

    Rodriques, and Andrew D

    Jakub Lála, Odhran O’Donoghue, Aleksandar Shtedritski, Sam Cox, Samuel G Rodriques, and Andrew D White. Paperqa: Retrieval-augmented generative agent for scientific research.arXiv preprint arXiv:2312.07559, 2023

  15. [15]

    Xiaona Li, Zhu Wang, Xindong Chen, Bin Guo, and Zhiwen Yu. A hybrid continuous- time dynamic graph representation learning model by exploring both temporal and repetitive information.ACM Transactions on Knowledge Discovery from Data, 17(9):1–22, 2023

  16. [16]

    Citation importance- aware document representation learning for large-scale science mapping.Information Process- ing & Management, 63(3):104557, 2026

    Zhentao Liang, Nees Jan van Eck, Xuehua Wu, Jin Mao, and Gang Li. Citation importance- aware document representation learning for large-scale science mapping.Information Process- ing & Management, 63(3):104557, 2026. 10

  17. [17]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. Deepseek-v3. 2: Pushing the frontier of open large language models.arXiv preprint arXiv:2512.02556, 2025

  18. [18]

    Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics, 12:157–173, 2024

    Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics, 12:157–173, 2024

  19. [19]

    The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

    Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scien- tist: Towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292, 2024

  20. [20]

    AInstein: Can LLMs Solve Research Problems From Parametric Memory Alone?

    Shambhavi Mishra, Gaurav Sahu, Marco Pedersoli, Laurent Charlin, Jose Dolz, and Christopher Pal. Ainstein: Assessing the feasibility of ai-generated approaches to research problems.arXiv preprint arXiv:2510.05432, 2025

  21. [21]

    Continuous-time link prediction via temporal dependent graph neural network

    Liang Qu, Huaisheng Zhu, Qiqi Duan, and Yuhui Shi. Continuous-time link prediction via temporal dependent graph neural network. InProceedings of the web conference 2020, pages 3026–3032, 2020

  22. [22]

    Universality of citation distributions: Toward an objective measure of scientific impact.Proceedings of the National Academy of Sciences, 105(45):17268–17272, 2008

    Filippo Radicchi, Santo Fortunato, and Claudio Castellano. Universality of citation distributions: Toward an objective measure of scientific impact.Proceedings of the National Academy of Sciences, 105(45):17268–17272, 2008

  23. [23]

    Temporal graph networks for deep learning on dynamic graphs,

    Emanuele Rossi, Ben Chamberlain, Fabrizio Frasca, Davide Eynard, Federico Monti, and Michael Bronstein. Temporal graph networks for deep learning on dynamic graphs.arXiv preprint arXiv:2006.10637, 2020

  24. [24]

    Kernel principal component analysis

    Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Kernel principal component analysis. InInternational conference on artificial neural networks, pages 583–588. Springer, 1997

  25. [25]

    Si, C., Hashimoto, T., and Yang, D

    Chenglei Si, Diyi Yang, and Tatsunori Hashimoto. Can llms generate novel research ideas? a large-scale human study with 100+ nlp researchers.arXiv preprint arXiv:2409.04109, 2024

  26. [26]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

  27. [27]

    A tutorial on support vector regression.Statistics and computing, 14(3):199–222, 2004

    Alex J Smola and Bernhard Schölkopf. A tutorial on support vector regression.Statistics and computing, 14(3):199–222, 2004

  28. [28]

    2025 , doi =

    Jiabin Tang, Lianghao Xia, Zhonghang Li, and Chao Huang. Ai-researcher: Autonomous scientific innovation.arXiv preprint arXiv:2505.18705, 2025

  29. [29]

    Galactica: A Large Language Model for Science

    Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. Galactica: A large language model for science.arXiv preprint arXiv:2211.09085, 2022

  30. [30]

    Ai can learn scientific taste.arXiv preprint arXiv:2603.14473, 2026

    Jingqi Tong, Mingzhe Li, Hangcheng Li, Yongzhuo Yang, Yurong Mou, Weijie Ma, Zhiheng Xi, Hongji Chen, Xiaoran Liu, Qinyuan Cheng, et al. Ai can learn scientific taste.arXiv preprint arXiv:2603.14473, 2026

  31. [31]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

  32. [32]

    Quantifying long-term scientific impact.Science, 342(6154):127–132, 2013

    Dashun Wang, Chaoming Song, and Albert-László Barabási. Quantifying long-term scientific impact.Science, 342(6154):127–132, 2013

  33. [33]

    Large language models are not fair evaluators

    Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Lingpeng Kong, Qi Liu, Tianyu Liu, et al. Large language models are not fair evaluators. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9440–9450, 2024. 11

  34. [34]

    Autogen: Enabling next-gen llm applications via multi-agent conversations

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. InFirst conference on language modeling, 2024

  35. [35]

    The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

    Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search.arXiv preprint arXiv:2504.08066, 2025

  36. [36]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  37. [37]

    Large language models as optimizers

    Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. InThe Twelfth International Conference on Learning Representations, 2023

  38. [38]

    Towards better dynamic graph learning: New architecture and unified library.Advances in Neural Information Processing Systems, 36:67686–67700, 2023

    Le Yu, Leilei Sun, Bowen Du, and Weifeng Lv. Towards better dynamic graph learning: New architecture and unified library.Advances in Neural Information Processing Systems, 36:67686–67700, 2023

  39. [39]

    T-gcn: A temporal graph convolutional network for traffic prediction.IEEE transactions on intelligent transportation systems, 21(9):3848–3858, 2019

    Ling Zhao, Yujiao Song, Chao Zhang, Yu Liu, Pu Wang, Tao Lin, Min Deng, and Haifeng Li. T-gcn: A temporal graph convolutional network for traffic prediction.IEEE transactions on intelligent transportation systems, 21(9):3848–3858, 2019

  40. [40]

    is_inspired

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023. 12 A Model Details A.1 Hyperparameters and Computation Resources The hyperparameters for con...