Recognition: 2 theorem links · Lean Theorem
SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models
Pith reviewed 2026-05-14 20:36 UTC · model grok-4.3
The pith
SceneGraphVLM generates complete scene graphs from videos in about one second using compact vision-language models and targeted reinforcement learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SceneGraphVLM serializes graphs in a token-efficient TOON format and trains the model in two stages: supervised fine-tuning followed by reinforcement learning with hallucination-aware rewards that balance relation coverage and precision while penalizing unsupported objects and relations. For videos, the model can optionally condition each frame on the previously generated graph, providing lightweight short-term context without tracking or post-processing. With compact VLMs and vLLM-accelerated decoding, SceneGraphVLM improves precision-oriented SGG metrics while preserving reasonable recall and generates complete scene graphs with approximately one-second latency on PSG, PVSG, and Action Genome.
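The TOON syntax itself is not specified here. As a rough sketch of the token-savings idea (the tabular layout below is only an illustration, not the official TOON specification or the paper's exact serialization):

```python
# Illustrative comparison of a JSON scene graph with a compact, TOON-like
# tabular serialization: field names appear once in a header, rows carry
# only values. This layout is a guess at the general idea, not the official
# TOON spec or the paper's format.
import json

graph = {
    "objects": ["person", "dog", "ball"],
    "relations": [
        {"subject": "person", "predicate": "holding", "object": "ball"},
        {"subject": "dog", "predicate": "looking at", "object": "ball"},
    ],
}

as_json = json.dumps(graph)

rows = [f'{r["subject"]},{r["predicate"]},{r["object"]}' for r in graph["relations"]]
as_tabular = (
    "objects: " + ",".join(graph["objects"]) + "\n"
    "relations[subject,predicate,object]:\n  " + "\n  ".join(rows)
)

print(len(as_json), len(as_tabular))  # the tabular form is noticeably shorter
```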
What carries the argument
Two-stage training that combines supervised fine-tuning with reinforcement learning using hallucination-aware rewards, plus TOON serialization and optional conditioning on the prior frame's graph for video input.
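A minimal sketch of the optional previous-graph conditioning for video, assuming a generic `generate(prompt, frame)` callable standing in for the fine-tuned VLM; the function name and prompt wording are placeholders, not the paper's interface:

```python
# Minimal sketch: condition each video frame on the previously generated
# graph. `generate` stands in for a call to the fine-tuned VLM (e.g. served
# via vLLM); its name and the prompt wording are placeholders.
def scene_graphs_for_video(frames, generate):
    prev_graph = None
    graphs = []
    for frame in frames:
        prompt = "Output the scene graph for this frame in TOON format."
        if prev_graph is not None:
            # Lightweight short-term context: prepend the prior frame's graph.
            prompt = f"Previous frame graph:\n{prev_graph}\n\n{prompt}"
        prev_graph = generate(prompt, frame)  # returns the serialized graph text
        graphs.append(prev_graph)
    return graphs
```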
If this is right
- Precision-oriented scene-graph metrics improve while recall remains reasonable across PSG, PVSG, and Action Genome.
- Complete graphs are produced at approximately one-second latency using compact VLMs and accelerated decoding (see the decoding sketch after this list).
- Video frames receive lightweight context by conditioning on the immediately preceding graph without added tracking modules.
- The same pipeline applies to both static images and dynamic video without separate architectures.
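A text-only sketch of the vLLM-accelerated decoding the latency claim relies on; the checkpoint path, prompt, and sampling values are illustrative placeholders, and the frame image a real run would pass as multi-modal input is omitted for brevity:

```python
# Text-only sketch of batched decoding with vLLM. The checkpoint path,
# prompt, and sampling settings are hypothetical; a real run would also
# supply the frame image as multi-modal input.
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/compact-scenegraph-vlm")          # hypothetical checkpoint
params = SamplingParams(temperature=0.0, max_tokens=512)   # greedy, bounded output

prompts = ["Output the scene graph for the given frame in TOON format."]
outputs = llm.generate(prompts, params)
print(outputs[0].outputs[0].text)
```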
Where Pith is reading between the lines
- The short-term previous-graph conditioning could be extended to multi-frame memory buffers for longer video clips at modest extra cost.
- The same reward formulation might transfer to other structured output tasks such as visual question answering or dense captioning.
- Direct comparison against larger VLMs on identical benchmarks would quantify how much of the speed gain is retained when model size increases.
Load-bearing premise
The hallucination-aware RL rewards successfully balance coverage and precision on the target benchmarks without introducing new failure modes or requiring dataset-specific tuning.
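The reward formulation is not given in the material above. As a hedged sketch of what a hallucination-aware reward of this general shape could look like (the weights, the exact-match rule, and the penalty term are assumptions, not the authors' formulation):

```python
# Sketch of a hallucination-aware reward: relation coverage (recall) and
# precision are rewarded, unsupported predictions are penalized. Weights
# and the exact-match rule are illustrative assumptions only.
def reward(pred_relations, gt_relations, alpha=1.0, beta=1.0, gamma=0.5):
    pred, gt = set(pred_relations), set(gt_relations)
    matched = pred & gt                                   # exact-match triplets for simplicity
    recall = len(matched) / max(len(gt), 1)               # relation coverage
    precision = len(matched) / max(len(pred), 1)          # relation precision
    hallucination_rate = len(pred - gt) / max(len(pred), 1)  # unsupported relations
    return alpha * recall + beta * precision - gamma * hallucination_rate

# Example: two of three ground-truth relations recovered, one hallucinated.
gt = [("person", "holding", "ball"), ("dog", "looking at", "ball"), ("dog", "near", "person")]
pred = [("person", "holding", "ball"), ("dog", "looking at", "ball"), ("dog", "riding", "ball")]
print(round(reward(pred, gt), 3))
```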
What would settle it
Evaluating the trained model on a new video dataset outside PSG, PVSG, and Action Genome and checking whether precision metrics fall below existing baselines or unsupported objects appear at higher rates than reported.
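A minimal sketch of that check, assuming per-frame lists of predicted and ground-truth triplets; exact string matching is a simplification, since standard SGG evaluation also requires localization (box/mask) matches:

```python
# Sketch of the proposed out-of-distribution check: per-frame triplet
# precision and the rate of predicted objects with no ground-truth support.
def frame_stats(pred_triplets, gt_triplets):
    pred, gt = set(pred_triplets), set(gt_triplets)
    precision = len(pred & gt) / max(len(pred), 1)
    pred_objs = {s for s, _, o in pred} | {o for s, _, o in pred}
    gt_objs = {s for s, _, o in gt} | {o for s, _, o in gt}
    unsupported_rate = len(pred_objs - gt_objs) / max(len(pred_objs), 1)
    return precision, unsupported_rate

def dataset_summary(frames):
    """frames: iterable of (pred_triplets, gt_triplets) pairs."""
    stats = [frame_stats(p, g) for p, g in frames]
    n = max(len(stats), 1)
    return (sum(p for p, _ in stats) / n, sum(u for _, u in stats) / n)
```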
Original abstract
Scene graph generation provides a compact structured representation for visual perception, but accurate and fast graph prediction from images and videos remains challenging. Recent VLM-based methods can generate scene graphs end-to-end as structured text, yet often produce long outputs with irrelevant objects and relations. We present SceneGraphVLM, a compact method for image and video scene graph generation with small visual language models. SceneGraphVLM serializes graphs in a token-efficient TOON format and trains the model in two stages: supervised fine-tuning followed by reinforcement learning with hallucination-aware rewards that balance relation coverage and precision while penalizing unsupported objects and relations. For videos, the model can optionally condition each frame on the previously generated graph, providing lightweight short-term context without tracking or post-processing. We evaluate SceneGraphVLM on PSG, PVSG, and Action Genome. With compact VLMs and vLLM-accelerated decoding, SceneGraphVLM achieves a strong quality-speed trade-off, improves precision-oriented SGG metrics while preserving reasonable recall, and generates complete scene graphs with approximately one-second latency. Code and implementation details are available at: https://github.com/markus0440/SceneGraphVLM.git.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SceneGraphVLM, a compact VLM-based method for dynamic scene graph generation from images and videos. Graphs are serialized in a token-efficient TOON format; training proceeds in two stages (supervised fine-tuning followed by RL with hallucination-aware rewards that balance coverage/precision and penalize unsupported objects/relations). For video, each frame can optionally condition on the prior graph for lightweight context without explicit tracking. Evaluations on PSG, PVSG, and Action Genome are claimed to yield improved precision-oriented SGG metrics with reasonable recall and ~1 s latency via compact VLMs and vLLM decoding.
Significance. If the empirical claims hold, the work offers a practical quality-speed trade-off for structured visual output using small VLMs, potentially enabling real-time applications without heavy post-processing. The RL reward design and optional temporal conditioning address hallucination and context issues in VLM-based SGG.
major comments (2)
- Abstract: the abstract states improvements on PSG, PVSG, and Action Genome benchmarks with precision gains and one-second latency, but provides no quantitative metrics, ablation details, or error analysis; central performance claims therefore rest on unverified assertions.
- Method (RL stage): the hallucination-aware rewards are described as balancing relation coverage and precision while penalizing unsupported objects/relations, yet the specific reward formulation, coefficient values, and any dataset-specific tuning are not provided, leaving the weakest assumption (successful balance without new failure modes) untestable.
minor comments (1)
- Abstract: the TOON serialization format is introduced as token-efficient but its exact syntax, token count savings relative to JSON, and implementation details are not specified, hindering immediate reproducibility.
Simulated Author's Rebuttal
We thank the referee for the thoughtful comments on our manuscript. We address each major comment below and will incorporate revisions to strengthen the paper.
Point-by-point responses
- Referee: Abstract: the abstract states improvements on PSG, PVSG, and Action Genome benchmarks with precision gains and one-second latency, but provides no quantitative metrics, ablation details, or error analysis; central performance claims therefore rest on unverified assertions.
  Authors: We agree with the referee that the abstract should provide concrete quantitative support for the claimed improvements. While the full manuscript includes detailed metrics, ablations, and error analysis in the Experiments section, the abstract was overly concise. In the revised manuscript, we will expand the abstract to include specific key results, such as the precision gains on PSG, PVSG, and Action Genome, along with the reported latency. This will make the central claims verifiable from the abstract alone. revision: yes
- Referee: Method (RL stage): the hallucination-aware rewards are described as balancing relation coverage and precision while penalizing unsupported objects/relations, yet the specific reward formulation, coefficient values, and any dataset-specific tuning are not provided, leaving the weakest assumption (successful balance without new failure modes) untestable.
  Authors: We acknowledge that the specific formulation of the hallucination-aware reward, including the exact mathematical expression, coefficient values, and tuning procedure, was not fully specified in the submitted manuscript. This limits the ability to reproduce and verify the RL stage. In the revision, we will add a detailed description of the reward function in the Method section, including the formulas for coverage, precision, and hallucination penalties, the specific coefficient values used (e.g., weights for each component), and the dataset-specific tuning process based on validation performance. This will address the concern and allow testing of the balance achieved. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper presents an empirical ML method: two-stage training (supervised fine-tuning then RL with hallucination-aware rewards) on public benchmarks (PSG, PVSG, Action Genome), followed by empirical measurement of precision/recall, latency, and quality-speed trade-off. No mathematical derivation chain, equations, or predictions are present that reduce by construction to fitted inputs, self-citations, or renamed ansatzes. All claims rest on standard training objectives and external benchmark evaluation rather than internal self-definition.
Axiom & Free-Parameter Ledger
free parameters (1)
- RL reward balance coefficients
axioms (1)
- Domain assumption: small vision-language models can learn to output accurate structured scene graphs after supervised fine-tuning and RL.
invented entities (1)
- TOON format (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear: "SceneGraphVLM serializes graphs in a token-efficient TOON format and trains the model in two stages: supervised fine-tuning followed by reinforcement learning with hallucination-aware rewards that balance relation coverage and precision while penalizing unsupported objects and relations."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear: "We introduce hallucination-aware GRPO training with graph-oriented rewards that balance relation recall and precision and penalize unsupported objects and relations."
Reference graph
Works this paper leans on
- [1] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025.
- [2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report, 2025. URL https://arxiv.org/abs/2502.13923.
- [3]
- [4] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024.
- [5] Zuyao Chen, Jinlin Wu, Zhen Lei, Marc Pollefeys, and Chang Wen Chen. Compile scene graphs with reinforcement learning, 2025. URL https://arxiv.org/abs/2504.13617.
- [6] Yuren Cong, Wentong Liao, Hanno Ackermann, Bodo Rosenhahn, and Michael Ying Yang. Spatial-temporal transformer for dynamic scene graph generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16372–16382, 2021.
- [7] Yuren Cong, Michael Ying Yang, and Bodo Rosenhahn. RelTR: Relation transformer for scene graph generation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9):11169–11183, 2023.
- [8] Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
- [9] Amartya Dutta, Kazi Sajeed Mehrab, Medha Sawhney, Abhilash Neog, Mridul Khurana, Sepideh Fatemi, Aanish Pradhan, M. Maruf, Ismini Lourentzou, Arka Daw, and Anuj Karpatne. Open world scene graph generation using vision language models, 2025. URL https://arxiv.org/abs/2506.08189.
- [10] Shengyu Feng, Hesham Mostafa, Marcel Nassar, Somdeb Majumdar, and Subarna Tripathi. Exploiting long-term dependencies for generating dynamic scene graphs. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5130–5139, 2023.
- [11] Jinlan Fu, See Kiong Ng, Zhengbao Jiang, and Pengfei Liu. GPTScore: Evaluate as you desire. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6556–6576, 2024.
- [12] Jingwei Ji, Ranjay Krishna, Li Fei-Fei, and Juan Carlos Niebles. Action Genome: Actions as compositions of spatio-temporal scene graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10236–10247, 2020.
- [13] Harold W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
- [14] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.
- [15] Rongjie Li, Songyang Zhang, Dahua Lin, Kai Chen, and Xuming He. From pixels to graphs: Open-vocabulary scene graph generation with vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28076–28086, 2024.
- [16] Yikang Li, Wanli Ouyang, Bolei Zhou, Kun Wang, and Xiaogang Wang. Scene graph generation from objects, phrases and region captions. In Proceedings of the IEEE International Conference on Computer Vision, pages 1261–1270, 2017.
- [17] Yiming Li, Xiaoshan Yang, and Changsheng Xu. Dynamic scene graph generation via anticipatory pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13874–13883, 2022.
- [18] Xin Lin, Changxing Ding, Jinquan Zeng, and Dacheng Tao. GPS-Net: Graph property sensing network for scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3746–3753, 2020.
- [19] Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. Visual relationship detection with language priors. In European Conference on Computer Vision, pages 852–869. Springer, 2016.
- [20]
- [21] Andrés Marafioti, Orr Zohar, Miquel Farré, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Vaibhav Srivastav, Joshua Lochner, Hugo Larcher, Mathieu Morlon, Lewis Tunstall, Leandro von Werra, and Thomas Wolf. SmolVLM: Redefining small and efficient multimodal models, 2025. URL https://arxiv.org/abs/2504.05299.
- [22] Sayak Nag, Kyle Min, Subarna Tripathi, and Amit K Roy-Chowdhury. Unbiased scene graph generation in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22803–22813, 2023.
- [23] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- [24] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020.
- [25] Johann Schopplich and contributors. TOON: Token-oriented object notation, 2025. URL https://github.com/toon-format/toon. Official reference implementation and specification. MIT License. Accessed: 2026-05-01.
- [26] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300.
- [28] Kaihua Tang, Hanwang Zhang, Baoyuan Wu, Wenhan Luo, and Wei Liu. Learning to compose dynamic tree structures for visual contexts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6619–6628, 2019.
- [29] Qwen Team. Qwen3.5-omni technical report, 2026. URL https://arxiv.org/abs/2604.15804.
- [30] Team Comet. Make tracking your machine learning experiments easy. https://www.comet.com/site/blog/make-tracking-your-machine-learning-experiments-easy/, April 2022. Accessed: 2026-05-06.
- [31] Yao Teng, Limin Wang, Zhifeng Li, and Gangshan Wu. Target adaptive context aggregation for video scene graph generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13688–13697, 2021.
- [32] Guan Wang, Zhimin Li, Qingchao Chen, and Yang Liu. OED: Towards one-stage end-to-end dynamic scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27938–27947, 2024.
- [33] Jingyi Wang, Jinfa Huang, Can Zhang, and Zhidong Deng. Cross-modality time-variant relation learning for generating dynamic scene graphs. arXiv preprint arXiv:2305.08522, 2023.
- [34] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
- [35] Shuang Wang, Lianli Gao, Xinyu Lyu, Yuyu Guo, Pengpeng Zeng, and Jingkuan Song. Dynamic scene graph generation via temporal prior inference. In Proceedings of the 30th ACM International Conference on Multimedia, pages 5793–5801, 2022.
- [36] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.
- [37] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. HuggingFace's transformers: St..., 2020.
- [38] Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302, 2024.
- [39] Danfei Xu, Yuke Zhu, Christopher B Choy, and Li Fei-Fei. Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5410–5419, 2017.
- [40] Mingjie Xu, Mengyang Wu, Yuzhi Zhao, Jason Chun Lok Li, and Weifeng Ou. LLaVA-SpaceSGG: Visual instruct tuning for open-vocabulary scene graph generation with enhanced spatial relations. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 6362–6372. IEEE, 2025.
- [41] Jingkang Yang, Yi Zhe Ang, Zujin Guo, Kaiyang Zhou, Wayne Zhang, and Ziwei Liu. Panoptic scene graph generation. In European Conference on Computer Vision, pages 178–196. Springer, 2022.
- [42] Jingkang Yang, Wenxuan Peng, Xiangtai Li, Zujin Guo, Liangyu Chen, Bo Li, Zheng Ma, Kaiyang Zhou, Wayne Zhang, Chen Change Loy, et al. Panoptic video scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18675–18685, 2023.
- [44] Rowan Zellers, Mark Yatskar, Sam Thomson, and Yejin Choi. Neural motifs: Scene graph parsing with global context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5831–5840, 2018.
- [45] Tatiana Zemskova, Solomon Andryushenko, Ilya Obrubov, Viktoriia Khoruzhaia, Ekaterina Eroshenko, Ekaterina Derevyanka, and Dmitry Yudin. FocusGraph: Graph-structured frame selection for embodied long video question answering. arXiv preprint arXiv:2603.04349, 2026.
- [46] Ji Zhang, Kevin J Shih, Ahmed Elgammal, Andrew Tao, and Bryan Catanzaro. Graphical contrastive losses for scene graph parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11535–11543, 2019.
- [47] Li Zhang, Youhe Jiang, Guoliang He, Xin Chen, Han Lv, Qian Yao, Fangcheng Fu, and Kai Chen. Efficient mixed-precision large language model inference with TurboMind. arXiv preprint arXiv:2508.15601, 2025.
- [48] Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, et al. SWIFT: A scalable lightweight infrastructure for fine-tuning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 29733–29735, 2025.
- [49] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody H Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. SGLang: Efficient execution of structured language model programs. Advances in Neural Information Processing Systems, 37:62557–62583, 2024.
- [50] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.