Reasoning text-to-video retrieval via digital twin video representations and large language models. arXiv preprint arXiv:2511.12371, 2025
2 Pith papers cite this work.
Fields: cs.CV (2)
Years: 2026 (2)
Representative citing papers
- Reasoning Text-to-Video Retrieval for Operating Room Clips via Action-Driven Digital Twins
OR3 converts OR clips to action-driven digital twins, uses LLM imagination for hypothetical ActDTs, and achieves 57.6 R@1 and 77.3 R@5 on 276 implicit queries from 386 robotic knee procedure clips, outperforming baselines.
- Training LLMs with Reinforcement Learning over Digital Twin Representations for Reasoning-Intensive Surgical VideoQA
An RL framework uses digital twin representations with hierarchical uncertainty estimates and a novel clinical plausibility reward to train LLMs for surgical VideoQA, achieving SOTA on a new 2000-pair benchmark and two existing datasets.