pith. machine review for the scientific record.

arxiv: 2604.10385 · v1 · submitted 2026-04-12 · 💻 cs.CV

Recognition: unknown

GTASA: Ground Truth Annotations for Spatiotemporal Analysis, Evaluation and Training of Video Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords video dataset · ground truth annotations · spatiotemporal reasoning · video generation · video encoders · spatial relation graphs · event mappings · GEST-Engine

The pith

GTASA supplies multi-actor videos with exact per-frame 3D spatial graphs and event mappings to evaluate and train video models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GTASA, a corpus of videos generated by the GEST-Engine, each carrying exact per-frame spatial relation graphs and event-level temporal mappings. It reports that human raters score these videos higher on physical validity and semantic alignment than outputs from both open- and closed-source neural generators, and that video captioning models trained on GTASA data outperform those trained on neural-generated videos. The same exact 3D ground truth enables direct testing of four frozen video encoders on 11 spatiotemporal reasoning tasks, revealing stronger spatial encoding in self-supervised encoders than in VLM visual encoders. This setup addresses the difficulty of measuring physical plausibility and semantic faithfulness in complex multi-actor video generation when reliable ground truth is otherwise unavailable.
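
To make the annotation claim concrete, here is a minimal sketch of what one GTASA-style sample could look like. The field names and types are illustrative assumptions, not the paper's released schema, which the summary above does not specify.

```python
# Hypothetical sketch of a GTASA-style sample. Field names are
# illustrative assumptions, not the paper's actual schema.
from dataclasses import dataclass, field


@dataclass
class SpatialRelation:
    subject: str          # actor or object id, e.g. "actor_1"
    relation: str         # e.g. "left_of", "behind", "closer_than"
    target: str           # e.g. "actor_2"


@dataclass
class Event:
    label: str            # e.g. "walks_to"
    actors: list[str]     # participating actor ids
    start_frame: int      # event-level temporal mapping onto frames
    end_frame: int


@dataclass
class GTASASample:
    video_path: str
    # one exact 3D-derived relation graph per frame
    frame_graphs: dict[int, list[SpatialRelation]] = field(default_factory=dict)
    events: list[Event] = field(default_factory=list)
    caption: str = ""     # textual description used in the captioning experiments
```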

Core claim

GTASA is a corpus of multi-actor videos with per-frame spatial relation graphs and event-level temporal mappings produced by the GEST-Engine. The method produces videos that human evaluators rate higher in physical validity and semantic alignment than those from neural generators. Training video captioning models on GTASA data leads to better results than on neural-generated videos. Probing four frozen video encoders on 11 tasks enabled by the ground truth shows self-supervised encoders encode spatial structure significantly better than VLM visual encoders.
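
The probing claim rests on a simple recipe: keep the encoder frozen and train only a small head on its features against the ground-truth labels. Below is a hedged PyTorch sketch of that setup; the pooled-feature encoder interface and the binary spatial task (e.g. "is actor A left of actor B") are assumptions for illustration, not the paper's exact protocol.

```python
# Sketch of a frozen-encoder probe: only the linear head is trained.
import torch
import torch.nn as nn


class LinearProbe(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.head(feats)


def probe_step(encoder: nn.Module, probe: LinearProbe,
               video: torch.Tensor, label: torch.Tensor,
               opt: torch.optim.Optimizer) -> float:
    encoder.eval()                      # encoder stays frozen
    with torch.no_grad():
        feats = encoder(video)          # assumed to return pooled clip features
    loss = nn.functional.cross_entropy(probe(feats), label)
    opt.zero_grad()
    loss.backward()                     # gradients flow only into the probe head
    opt.step()
    return loss.item()
```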

What carries the argument

GEST-Engine, a system that renders videos from graphs of events in space and time (GEST) and, by construction, emits exact per-frame 3D spatial relation graphs and event mappings as ground truth.
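
Because the engine knows the 3D scene it renders, spatial relations can be computed directly from coordinates rather than estimated from pixels, which is what "exact by construction" means here. A small sketch of that idea follows; the relation vocabulary and the camera-frame convention are assumptions, not the paper's definitions.

```python
# Relations derived from known 3D positions rather than inferred from video.
import numpy as np


def spatial_relations(pos_a: np.ndarray, pos_b: np.ndarray) -> list[str]:
    """Relations of actor A with respect to actor B, in an assumed camera
    frame (x: right, y: up, z: away from the camera)."""
    rels = []
    dx, _, dz = pos_a - pos_b
    rels.append("right_of" if dx > 0 else "left_of")
    rels.append("behind" if dz > 0 else "in_front_of")
    return rels


# Example: A at (1, 0, 3), B at (0, 0, 5) -> ['right_of', 'in_front_of']
print(spatial_relations(np.array([1.0, 0.0, 3.0]), np.array([0.0, 0.0, 5.0])))
```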

Load-bearing premise

The assumption that the GEST-Engine's per-frame spatial graphs and event mappings accurately capture the physical plausibility and semantic faithfulness of its videos, and that human raters can judge those properties reliably.

What would settle it

A blind human evaluation in which raters score physical validity and semantic alignment of GTASA videos no higher than those from neural generators, or video captioning models trained on GTASA show no accuracy gain over models trained on neural-generated videos.
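
A hedged sketch of how the first half of that test could be scored: blinded pairwise preferences between a GTASA clip and a neural-generator clip, summarized with a two-sided binomial test. The preference data below are placeholders, not results from the paper.

```python
# Pairwise blinded preference test for physical validity (placeholder data).
from scipy.stats import binomtest

# 1 = rater preferred the GTASA clip, 0 = preferred the neural-generator clip
preferences = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1]

result = binomtest(k=sum(preferences), n=len(preferences), p=0.5,
                   alternative="two-sided")
print(f"preference rate = {sum(preferences) / len(preferences):.2f}, "
      f"p-value = {result.pvalue:.3f}")
# The paper's claim would be undermined if this rate were at or below 0.5.
```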

Figures

Figures reproduced from arXiv: 2604.10385 by Marius Leordeanu, Mihai Masala, Nicolae Cudlenco.

Figure 1. GTASA pipeline. Generated GEST graphs are converted into videos and accompanying textual descriptions. From these texts, we generate synthetic videos using neural models. The resulting videos, produced both by our approach and by other methods, are evaluated (Q1) and subsequently used to train video captioning models (Q2) and to probe existing video encoders for their spatiotemporal understanding (Q3).
read the original abstract

Generating complex multi-actor scenario videos remains difficult even for state-of-the-art neural generators, while evaluating them is hard due to the lack of ground truth for physical plausibility and semantic faithfulness. We introduce GTASA, a corpus of multi-actor videos with per-frame spatial relation graphs and event-level temporal mappings, and the system that produced it based on Graphs of Events in Space and Time (GEST): GEST-Engine. We compare our method with both open and closed source neural generators and prove both qualitatively (human evaluation of physical validity and semantic alignment) and quantitatively (via training video captioning models) the clear advantages of our method. Probing four frozen video encoders across 11 spatiotemporal reasoning tasks enabled by GTASA's exact 3D ground truth reveals that self-supervised encoders encode spatial structure significantly better than VLM visual encoders.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents GTASA, a dataset of multi-actor videos accompanied by per-frame spatial relation graphs and event-level temporal mappings generated by the GEST-Engine. It claims to demonstrate the advantages of this approach over open- and closed-source neural video generators, both qualitatively, through human evaluations of physical validity and semantic alignment, and quantitatively, by training video captioning models. Furthermore, by using the exact 3D ground truth to create 11 spatiotemporal reasoning tasks, it probes four frozen video encoders and finds that self-supervised encoders encode spatial structure significantly better than VLM visual encoders.

Significance. If the ground truth annotations prove to be accurate and the evaluations robust, GTASA could serve as an important benchmark for assessing physical plausibility and semantic faithfulness in video generation models, as well as for probing the capabilities of video encoders on spatiotemporal tasks. The distinction between self-supervised and VLM encoders on spatial structure is a potentially useful insight for the field.

major comments (2)
  1. Abstract: The abstract asserts qualitative and quantitative advantages but supplies no details on the human evaluation protocol, number of raters, statistical tests, or how the 11 tasks were constructed, leaving the central claims without visible supporting evidence (see the rater-agreement sketch at the end of this report).
  2. GEST-Engine and probing experiments sections: The claim that GTASA supplies 'exact 3D ground truth' for the 11 spatiotemporal tasks requires the per-frame spatial graphs and event mappings to be verifiably accurate, yet no independent check (physics simulation match, real 3D capture comparison, or automated consistency test) is described; this is load-bearing for the reported superiority of self-supervised encoders over VLM encoders.
minor comments (1)
  1. The description of the spatial relation graphs would benefit from an explicit example or diagram to clarify the per-frame annotation format.
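
On the rater-statistics point in major comment 1, one standard reliability report would be Krippendorff's alpha, which also appears as reference [32] in the reference graph below. The sketch uses the third-party krippendorff package and made-up ratings on an assumed 5-point ordinal scale; it illustrates the kind of statistic the revised evaluation section could report, not what the paper actually computes.

```python
# Inter-rater reliability for physical-validity ratings (placeholder data).
import numpy as np
import krippendorff  # third-party package implementing Krippendorff's alpha

# rows = raters, columns = rated videos, np.nan = missing rating
ratings = np.array([
    [4, 5, 3, 4, np.nan],
    [4, 4, 3, 5, 2],
    [5, 4, 2, 4, 2],
], dtype=float)

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha = {alpha:.2f}")
```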

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the positive assessment of GTASA's potential as a benchmark. We respond to each major comment below, clarifying points of the manuscript and indicating where revisions will strengthen the presentation.

read point-by-point responses
  1. Referee: Abstract: The abstract asserts qualitative and quantitative advantages but supplies no details on the human evaluation protocol, number of raters, statistical tests, or how the 11 tasks were constructed, leaving the central claims without visible supporting evidence.

    Authors: We agree that the abstract, constrained by length, does not preview these details. The full manuscript describes the human evaluation protocol (including rater instructions, rating scales for physical validity and semantic alignment, and statistical analysis) in the relevant evaluation section, and details the construction of the 11 spatiotemporal tasks from the per-frame graphs and event mappings in the probing section. To improve visibility of the supporting evidence, we will revise the abstract to include a concise reference to the human evaluation and task construction while remaining within length limits. revision: yes

  2. Referee: GEST-Engine and probing experiments sections: The claim that GTASA supplies 'exact 3D ground truth' for the 11 spatiotemporal tasks requires the per-frame spatial graphs and event mappings to be verifiably accurate, yet no independent check (physics simulation match, real 3D capture comparison, or automated consistency test) is described; this is load-bearing for the reported superiority of self-supervised encoders over VLM encoders.

    Authors: The annotations are exact by construction: the GEST-Engine renders videos from explicit 3D scene parameters, spatial relation graphs, and timed event sequences, so the per-frame graphs and event mappings are the generative inputs rather than post-hoc inferences. This generative process is described in the GEST-Engine section. We did not include external validation against real captures or separate physics engines because the dataset is intentionally synthetic to provide perfect alignment between video and annotations. We acknowledge that an explicit consistency check would strengthen the claim and will add a short discussion of this point plus a simple automated verification (e.g., graph consistency) in the revised manuscript. The encoder probing results are presented with this synthetic ground truth in mind. revision: partial
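
A minimal version of the "simple automated verification (e.g., graph consistency)" the authors propose might look like the sketch below: check that every per-frame relation has its converse and that event frame spans lie inside the video. The relation names and sample structure mirror the hypothetical GTASASample sketch earlier on this page, so this is an assumption-laden illustration rather than the authors' actual check.

```python
# Automated consistency check over per-frame graphs and event spans.
CONVERSE = {"left_of": "right_of", "right_of": "left_of",
            "behind": "in_front_of", "in_front_of": "behind"}


def check_sample(frame_graphs: dict, events: list, num_frames: int) -> list[str]:
    """Return a list of human-readable consistency violations (empty if clean)."""
    problems = []
    for frame, relations in frame_graphs.items():
        triples = {(r.subject, r.relation, r.target) for r in relations}
        for subj, rel, tgt in triples:
            # every directed relation should be accompanied by its converse
            if rel in CONVERSE and (tgt, CONVERSE[rel], subj) not in triples:
                problems.append(f"frame {frame}: {subj} {rel} {tgt} lacks converse")
    for ev in events:
        # event-level temporal mappings must stay within the rendered video
        if not (0 <= ev.start_frame <= ev.end_frame < num_frames):
            problems.append(f"event {ev.label}: span outside [0, {num_frames})")
    return problems
```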

Circularity Check

0 steps flagged

No circularity detected in derivation or claims

full rationale

The paper introduces a new dataset GTASA produced by the GEST-Engine, performs external comparisons against other generators via human evaluation and downstream captioning model training, and probes frozen encoders on 11 tasks using the annotations as ground truth. No equations, parameter fitting, self-citations, or ansatzes are present in the provided text that reduce any central claim (advantages of the method or encoder performance differences) to an input by construction. The work is self-contained against external benchmarks and does not rely on self-referential definitions or renamed known results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Central claims rest on the unverified accuracy of annotations produced by the newly introduced GEST-Engine and on the reliability of human judgments of physical and semantic validity.

axioms (1)
  • domain assumption: GEST-Engine generates videos whose per-frame spatial relation graphs and event mappings constitute accurate ground truth for physical plausibility and semantic faithfulness.
    Invoked when claiming advantages over neural generators and when using the annotations for encoder probing.
invented entities (1)
  • GEST-Engine (no independent evidence)
    purpose: System that produces the GTASA videos together with their spatial and temporal annotations.
    Newly introduced component whose internal correctness is not independently demonstrated in the abstract.

pith-pipeline@v0.9.0 · 5448 in / 1339 out tokens · 52204 ms · 2026-05-10T16:34:41.588256+00:00 · methodology


Reference graph

Works this paper leans on

52 extracted references · 14 canonical work pages · 10 internal anchors

  1. [1] Allen, J.F.: Maintaining knowledge about temporal intervals. Communications of the ACM 26(11), 832–843 (1983)

  2. [2] Alonso, E., Jelley, A., Micheli, V., Kanervisto, A., Storkey, A.J., Pearce, T., Fleuret, F.: Diffusion for world modeling: Visual details matter in Atari. Advances in Neural Information Processing Systems 37, 58757–58791 (2024)

  3. [3] Anderson, P., Fernando, B., Johnson, M., Gould, S.: SPICE: Semantic propositional image caption evaluation. In: European Conference on Computer Vision. pp. 382–

  4. [4] Anonymous: [Tiny paper] GEST-Engine: Controllable multi-actor video synthesis with perfect spatiotemporal annotations. In: ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling (2026), supplied as supplemental material

  5. [5] Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Muckley, M., Rizvi, A., Roberts, C., Sinha, K., Zholus, A., et al.: V-JEPA 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985 (2025)

  6. [6] Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-VL technical report. arXiv preprint arXiv:2511.21631 (2025)

  7. [7] Ball, P.J., Bauer, J., Belletti, F., Brownfield, B., Ephrat, A., Fruchter, S., Gupta, A., Holsheimer, K., Holynski, A., Hron, J., Kaplanis, C., Limont, M., McGill, M., Oliveira, Y., Parker-Holder, J., Perbet, F., Scully, G., Shar, J., Spencer, S., Tov, O., Villegas, R., Wang, E., Yung, J., Baetu, C., Berbel, J., Bridson, D., Bruce, J., Buttimore, G., Chak...

  8. [8] Banerjee, S., Lavie, A.: METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. pp. 65–72 (2005)

  9. [9] Black, M.J., Patel, P., Tesch, J., Yang, J.: BEDLAM: A synthetic dataset of bodies exhibiting detailed lifelike animated motion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8726–8737 (2023)

  10. [10] Bogolin, S.V., Croitoru, I., Leordeanu, M.: A hierarchical approach to vision-based language generation: from simple sentences to complex natural language. In: Proceedings of the 28th International Conference on Computational Linguistics. pp. 2436–2447 (2020)

  11. [11] Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., Ng, C., Wang, R., Ramesh, A.: Video generation models as world simulators (2024), https://openai.com/research/video-generation-models-as-world-simulators

  12. [12] Bruce, J., Dennis, M.D., Edwards, A., Parker-Holder, J., Shi, Y., Hughes, E., Lai, M., Mavalankar, A., Steigerwald, R., Apps, C., et al.: Genie: Generative interactive environments. In: Forty-first International Conference on Machine Learning (2024)

  13. [13] Cai, M., Tan, R., Zhang, J., Zou, B., Zhang, K., Yao, F., Zhu, F., Gu, J., Zhong, Y., Shang, Y., et al.: TemporalBench: Benchmarking fine-grained temporal understanding for multimodal video models. arXiv preprint arXiv:2410.10818 (2024)

  14. [14] Chen, D., Kasarla, T., Bang, Y., Shukor, M., Chung, W., Yu, J., Bolourchi, A., Moutakanni, T., Fung, P.: Action100M: A large-scale video action dataset. arXiv preprint arXiv:2601.10592 (2026)

  15. [15] Damen, D., Doughty, H., Farinella, G.M., Furnari, A., Kazakos, E., Ma, J., Moltisanti, D., Munro, J., Perrett, T., Price, W., et al.: Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100. International Journal of Computer Vision 130(1), 33–55 (2022)

  16. [16] Dettmers, T., Pagnoni, A., Holtzman, A., Zettlemoyer, L.: QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems 36, 10088–10115 (2023)

  17. [17] Ding, J., Zhang, Y., Shang, Y., Zhang, Y., Zong, Z., Feng, J., Yuan, Y., Su, H., Li, N., Sukiennik, N., et al.: Understanding world or predicting future? A comprehensive survey of world models. ACM Computing Surveys 58(3), 1–38 (2025)

  18. [18] Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., Koltun, V.: CARLA: An open urban driving simulator. In: Conference on Robot Learning. pp. 1–16. PMLR (2017)

  19. [19] El Banani, M., Raj, A., Maninis, K.K., Kar, A., Li, Y., Rubinstein, M., Sun, D., Guibas, L., Johnson, J., Jampani, V.: Probing the 3D awareness of visual foundation models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21795–21806 (2024)

  20. [20] Google: Veo 3 announcement (2025), https://blog.google/innovation-and-ai/products/generative-media-models-io-2025/#veo-3

  21. [21] Google: Veo 3 launch (2025), https://cloud.google.com/blog/products/ai-machine-learning/veo-3-fast-available-for-everyone-on-vertex-ai

  22. [22] Google: Veo 3 model card (2025), https://storage.googleapis.com/deepmind-media/Model-Cards/Veo-3-Model-Card.pdf, accessed: March 04, 2026

  23. [23] Goyal, R., Ebrahimi Kahou, S., Michalski, V., Materzynska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., Mueller-Freitag, M., et al.: The "something something" video database for learning and evaluating visual common sense. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 5842–5850 (2017)

  24. [24] Ha, D., Schmidhuber, J.: World models. arXiv preprint arXiv:1803.10122 (2018)

  25. [25] HaCohen, Y., Brazowski, B., Chiprut, N., Bitterman, Y., Kvochko, A., Berkowitz, A., Shalem, D., Lifschitz, D., Moshe, D., Porat, E., et al.: LTX-2: Efficient joint audio-visual foundation model. arXiv preprint arXiv:2601.03233 (2026)

  26. [26] HaCohen, Y., Chiprut, N., Brazowski, B., Shalem, D., Moshe, D., Richardson, E., Levin, E., Shiran, G., Zabari, N., Gordon, O., et al.: LTX-Video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103 (2024)

  27. [27] Hafner, D., Pasukonis, J., Ba, J., Lillicrap, T.: Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104 (2023)

  28. [28] Huang, Z., Li, X., Lv, Z., Rehg, J.M.: How much 3D do video foundation models encode? arXiv preprint arXiv:2512.19949 (2025)

  29. [29] Ji, J., Krishna, R., Fei-Fei, L., Niebles, J.C.: Action Genome: Actions as compositions of spatio-temporal scene graphs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10236–10247 (2020)

  30. [30] Jia, X., Berry, A., Johnston, A.: The evolutionary disruption: A paradigm shift in film and animation industry driven by real-time rendering and virtual production. Convergence p. 13548565251356932 (2025)

  31. [31] Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al.: The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)

  32. [32] Krippendorff, K.: Computing Krippendorff's alpha-reliability (2011)

  33. [33] LeCun, Y.: A path towards autonomous machine intelligence version 0.9.2, 2022-06-27. Open Review 62(1), 1–62 (2022)

  34. [34] Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out. pp. 74–81 (2004)

  35. [35] Mangalam, K., Akshulakov, R., Malik, J.: EgoSchema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems 36, 46212–46244 (2023)

  36. [36] Masala, M., Cudlenco, N., Rebedea, T., Leordeanu, M.: Explaining vision and language through graphs of events in space and time. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2826–2831 (2023)

  37. [37] Miech, A., Zhukov, D., Alayrac, J.B., Tapaswi, M., Laptev, I., Sivic, J.: HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2630–2640 (2019)

  38. [38] Paiss, R., Ephrat, A., Tov, O., Zada, S., Mosseri, I., Irani, M., Dekel, T.: Teaching CLIP to count to ten. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3170–3180 (2023)

  39. [39] Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. pp. 311–318 (2002)

  40. [40] Puig, X., Ra, K., Boben, M., Li, J., Wang, T., Fidler, S., Torralba, A.: VirtualHome: Simulating household activities via programs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8494–8502 (2018)

  41. [41] Qiu, Y., Nagasaki, Y., Hara, K., Kataoka, H., Suzuki, R., Iwata, K., Satoh, Y.: VirtualHome Action Genome: A simulated spatio-temporal scene graph dataset with consistent relationship labels. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 3351–3360 (2023)

  42. [42] Richter, S.R., Vineet, V., Roth, S., Koltun, V.: Playing for data: Ground truth from computer games. In: European Conference on Computer Vision. pp. 102–118. Springer (2016)

  43. [43] Sellam, T., Das, D., Parikh, A.: BLEURT: Learning robust metrics for text generation. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. pp. 7881–7892 (2020)

  44. [44] Tong, P., Brown, E., Wu, P., Woo, S., Iyer, A.J.V., Akula, S.C., Yang, S., Yang, J., Middepogu, M., Wang, Z., et al.: Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs. Advances in Neural Information Processing Systems 37, 87310–87356 (2024)

  45. [45] Vedantam, R., Lawrence Zitnick, C., Parikh, D.: CIDEr: Consensus-based image description evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4566–4575 (2015)

  46. [46] Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

  47. [47] Wang, L., Huang, B., Zhao, Z., Tong, Z., He, Y., Wang, Y., Wang, Y., Qiao, Y.: VideoMAE V2: Scaling video masked autoencoders with dual masking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14549–14560 (2023)

  48. [48] Wiedemer, T., Li, Y., Vicol, P., Gu, S.S., Matarese, N., Swersky, K., Kim, B., Jaini, P., Geirhos, R.: Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328 (2025)

  49. [49] Yang, J., Peng, W., Li, X., Guo, Z., Chen, L., Li, B., Ma, Z., Zhou, K., Zhang, W., Loy, C.C., et al.: Panoptic video scene graph generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18675–18685 (2023)

  50. [50] Zhang, B., Li, K., Cheng, Z., Hu, Z., Yuan, Y., Chen, G., Leng, S., Jiang, Y., Zhang, H., Li, X., et al.: VideoLLaMA 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106 (2025)

  51. [51] Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y.: BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675 (2019)

  52. [52] Zheng, L., Chiang, W.L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al.: Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems 36, 46595–46623 (2023)