{"total":23,"items":[{"citing_arxiv_id":"2606.27313","ref_index":7,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ViQ: Text-Aligned Visual Quantized Representations at Any Resolution","primary_cat":"cs.CV","submitted_at":"2026-06-25T17:29:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ViQ is a new two-stage text-aligned quantization method for visual features supporting arbitrary resolutions that claims competitive multimodal performance with efficiency gains of 20-70%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.17798","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LiveStarPro: Proactive Streaming Video Understanding with Hierarchical Memory for Long-Horizon Streams","primary_cat":"cs.CV","submitted_at":"2026-06-16T11:18:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LiveStarPro uses SVeD for response timing via perplexity, SCAM for incremental alignment, and TSHM for event-chain memory to achieve 28.9% better semantic correctness and 1.58x speedup on long video streams.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06991","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Don't Pause: Streaming Video-Language Synchrony for Online Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-06-05T07:29:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LyraV uses FDTC and SToP for per-frame incremental decoding to reach 98.29% video synchrony at 3.89 FPS while preserving general 
understanding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25979","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence","primary_cat":"cs.CV","submitted_at":"2026-05-25T15:54:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"LLaVA-OV-2 uses codec-stream tokenization and a shared 3D RoPE to improve video, spatial, and tracking performance over Qwen3-VL-8B, while introducing the JumpScore benchmark for fine-grained motion localization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25343","ref_index":211,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Toward Native Multimodal Modeling: A Roadmap","primary_cat":"cs.CV","submitted_at":"2026-05-25T01:57:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal modeling.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"visual tokenization and multimodal rotary position encodings so that image and video tokens remain spatially and temporally grounded under arbitrary resolutions. Qwen3-VL extends this trajectory with longer contexts, improved interleaved position modeling, and stronger temporal grounding for video. 
Related systems such as LLaVA-UHD [209], LLaVA-OneVision [210], Oryx [211], and InternVL 2.5 [212] use AnyRes-style slicing, spatial schemas, or on-demand compression to preserve high-resolution details while preventing token counts from growing mechanically with pixel count. More recent query-aware approaches, including Q-Zoom [213], further make the resolution decision conditional on the user instruction: the model first reasons over a coarse view, then spends high-resolution tokens only on regions"},{"citing_arxiv_id":"2605.21625","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly","primary_cat":"cs.CV","submitted_at":"2026-05-20T18:36:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Flat-Pack Bench is a new evaluation suite that shows state-of-the-art LVLMs perform poorly on nuanced spatio-temporal reasoning required for furniture assembly videos.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18018","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-18T08:09:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in
MLLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06094","ref_index":27,"ref_count":4,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VISD: Enhancing Video Reasoning via Structured Self-Distillation","primary_cat":"cs.CV","submitted_at":"2026-05-07T12:13:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VISD proposes structured self-distillation with a multi-dimensional judge model and direction-magnitude decoupling to improve token-level credit assignment and convergence speed in VideoLLM reasoning training.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[25] Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, and Jiang Bian. Self-hinting language models enhance reinforcement learning. arXiv preprint arXiv:2602.03143, 2026. [26] Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, and Ge Li. St-llm: Large language models are effective temporal learners. In European Conference on Computer Vision, pages 1-18. Springer, 2024. [27] Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Jiwen Lu, and Yongming Rao. Oryx mllm: On-demand spatial-temporal understanding at arbitrary resolution. arXiv preprint arXiv:2409.12961, 2024. [28] Chiyu Ma, Shuo Yang, Kexin Huang, Jinda Lu, Haoming Meng, Shangshang Wang, Bolin Ding, Soroush Vosoughi, Guoyin Wang, and Jingren Zhou.
Fipo: Eliciting deep reasoning with"},{"citing_arxiv_id":"2605.05848","ref_index":38,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-07T08:23:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VideoRouter uses dual semantic and image routers for query-adaptive token compression in long-video models, delivering up to 67.9% reduction while outperforming the InternVL baseline on VideoMME, MLVU, and LongVideoBench.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Bold method names mark our VideoRouter adaptations; the controlled InternVL3 comparison is provided in Table 2. 6 Model Video-MME LongVideoBench MLVU LVBench EgoSchema Long Overall Average Duration 2386s 1010s 473s 651s 4101s 180s Proprietary MLLMs GPT-4V [36] 53.5 59.9 59.1 49.2 - 55.6 GPT-4o [37] 65.3 71.9 66.7 64.6 30.8 72.2 Open-source MLLMs Oryx-1.5 [38] - 58.8 56.3 67.5 - - MiniCPM-v2.6 [39] 51.8 60.9 54.9 37.3 - - mPLUG-Owl3 [40] 50.1 59.3 52.1 63.7 - - NVILA [41] 54.8 64.2 - 70.1 - - LLaVA-Video [6] - 63.3 58.2 70.8 41.5 57.3 Video-XL [42] 49.2 55.5 49.5 64.9 - - VideoLLaMA2 [43] - 47.9 - 48.5 - 51.7 Video-CCAM [44] 46.7 53.2 - 58.5 - - Kangaroo [45] 46.7 56.0 54.8 61.0 39.4 62.7 LongVA [46] 46."},{"citing_arxiv_id":"2604.03296","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"3D-IDE: 3D Implicit Depth Emergent","primary_cat":"cs.CV","submitted_at":"2026-03-28T00:54:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"3D awareness emerges implicitly in MLLMs via self-supervised geometric constraints that create an information bottleneck, removing depth and pose dependencies at inference and cutting latency by
55%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.20913","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-02-24T13:49:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LongVideo-R1 trains a reasoning agent on 33K trajectories to intelligently select informative video clips via iterative refinement and RL, achieving better accuracy-efficiency tradeoffs on long video QA benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.21334","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Streaming Video Instruction Tuning","primary_cat":"cs.CV","submitted_at":"2025-12-24T18:59:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Streamo is a streaming video LLM trained end-to-end on the new Streamo-Instruct-465K dataset that unifies multiple real-time video tasks with claimed strong temporal reasoning and generalization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.06673","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Detector-Empowered Video Large Language Model for Efficient Spatio-Temporal Grounding","primary_cat":"cs.CV","submitted_at":"2025-12-07T06:11:15+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DEViL offloads spatial grounding to a detector via a distilled reference-semantic token and temporal consistency regularization, reaching 43.1% m_vIoU at 14.33 FPS on 
HC-STVG.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.18265","ref_index":74,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","primary_cat":"cs.CV","submitted_at":"2025-08-25T17:58:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"5 66.6 InternVL3.5-8B 66.0 / 68.6 72.1 1.67 70.2 62.1 65.8 InternVL3-14B [187] 70.4 / 73.0 76.6 1.73 73.3 63.9 69.1 InternVL3.5-14B 67.9 / 71.0 72.8 1.73 72.1 62.7 67.4 Kimi-VL-A3B-2506 [125] 67.8 / 72.6 59.7 - 74.2 64.5 - InternVL3.5-20B-A4B 62.4 / 64.9 73.3 1.54 65.6 58.3 62.6 InternVL3.5-30B-A3B 68.7 / 71.8 72.1 1.69 73.0 63.8 67.6 Oryx-1.5-32B [74] 67.3 / 74.9 70.1 1.52 72.3 - - Qwen2.5-VL-32B [5] 70.5 / 77.9 - 1.93 - - - VILA-1.5-40B [66] 60.1 / 61.1 - 1.61 56.7 - - InternVL3-38B [187] 72.7 / 75.0 76.9 1.81 77.8 67.3 71.7 InternVL3.5-38B 70.9 / 74.2 75.0 1.90 77.0 65.7 71.0 GPT-4V/4T [1] 59.9 / 63.3 43.7 1.53 49.2 59.1 54.4 GPT-4o-20240513 [95] 71.9 / 77.2 - 1.63 64.6 66.7 - GPT-4o-20240806 [97] - - 1."},{"citing_arxiv_id":"2507.05920","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning","primary_cat":"cs.CV","submitted_at":"2025-07-08T12:05:05+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MGPO elicits grounding in LMMs via multi-turn RL with 
binary rewards, yielding 5.4% and 5.2% gains on MME-Realworld and V* Bench and surpassing GPT-4o on the latter after training on 21K samples.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.23747","ref_index":51,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence","primary_cat":"cs.CV","submitted_at":"2025-05-29T17:59:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Spatial-MLLM adds a 3D spatial encoder initialized from a visual geometry model and space-aware frame sampling to MLLMs to improve spatial understanding and reasoning from purely 2D visual inputs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.20279","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction","primary_cat":"cs.CV","submitted_at":"2025-05-26T17:56:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLM-3R augments VLMs with implicit 3D tokens from monocular video via geometry encoding and 200K+ 3D reconstructive QA pairs, plus a new 138K-pair temporal benchmark, to support spatial and embodied reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.10479","ref_index":78,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal 
Models","primary_cat":"cs.CV","submitted_at":"2025-04-14T17:59:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"5-8B [18] 64.2 / 66.9 72.0 1.68 68.9 60.0 - - InternVL3-8B 66.3 / 68.9 75.4 1.69 71.4 58.8 38.6 / 55.2 61.4 InternVL3-9B 66.7 / 68.9 74.3 1.69 70.8 62.5 41.1 / 58.0 62.3 InternVL3-14B 70.4 / 73.0 76.6 1.73 73.3 63.9 44.1 / 60.6 64.9 InternVL2-26B [19] 57.0 / 60.2 67.5 1.67 64.2 56.1 - - InternVL2.5-26B 66.9 / 69.2 75.2 1.86 72.3 59.9 - - Oryx-1.5-32B [78] 67.3 / 74.9 70.1 1.52 72.3 - - - Qwen2.5-VL-32B [7] 70.5 / 77.9 - 1.93 - - - - VILA-1.5-40B [71] 60.1 / 61.1 - 1.61 56.7 - - - InternVL2-40B [19] 66.1 / 68.6 72.0 1.78 71.0 60.6 - - InternVL2.5-38B [18] 70.7 / 73.1 74.4 1.82 75.3 63.3 - - InternVL3-38B 72.7 / 75.0 76.9 1.81 77.8 67.3 46.9 / 62.8 67.5 GPT-4V/4T [1] 59.9 / 63.3 43.7 1.53 49.2 59.1 - -"},{"citing_arxiv_id":"2501.13106","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding","primary_cat":"cs.CV","submitted_at":"2025-01-22T18:59:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Additionally, recent advances in streaming video understanding focus on real-time processing [167- 171], employing techniques like adaptive memory and incremental processing 
for tasks such as live event detection and real-time captioning. Previous works [15, 20, 23, 25, 143, 150] typically follow a training recipe that involves an alignment phase, followed by supervised fine-tuning, with instruction-tuning datasets [19, 25- 28] often being video dominant. However, we propose a vision-centric training paradigm to enhance video understanding capabilities by focusing on large-scale image understanding pre-training. This approach leverages high-quality image-text datasets to build robust vision encoders that are then adapted for video tasks. Multimodal LLMs for General Vision Understanding."},{"citing_arxiv_id":"2501.05067","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding","primary_cat":"cs.CV","submitted_at":"2025-01-09T08:43:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLaVA-Octopus introduces instruction-driven adaptive fusion of multiple visual projectors in a multimodal LLM to improve video understanding performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.02955","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models","primary_cat":"cs.CV","submitted_at":"2025-01-06T11:57:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MotionBench is a new benchmark showing poor fine-grained motion understanding in VLMs and proposes TE Fusion to improve performance with higher frame 
rates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.05271","ref_index":160,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","primary_cat":"cs.CV","submitted_at":"2024-12-06T18:57:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"6 [274] 60.9 / 63.6 - 1.70 - 54.9 - LLaVA-OneVision-7B [124] 58.2 / - 56.7 - - - - Qwen2-VL-7B [246] 63.3 / 69.0 67.0 1.44 - 55.6 - InternVL2-8B [35] 56.3 / 59.3 65.8 1.57 64.0 54.6 - InternVL2.5-8B 64.2 / 66.9 72.0 1.68 68.9 60.0 - InternVL2-26B [35] 57.0 / 60.2 67.5 1.67 64.2 56.1 - InternVL2.5-26B 66.9 / 69.2 75.2 1.86 72.3 59.9 - Oryx-1.5-32B [160] 67.3 / 74.9 70.1 1.52 72.3 - - VILA-1.5-40B [143] 60.1 / 61.1 - 1.61 56.7 - - InternVL2-40B [35] 66.1 / 68.6 72.0 1.78 71.0 60.6 - InternVL2.5-38B 70.7 / 73.1 74.4 1.82 75.3 63.3 - GPT-4V/4T [3] 59.9 / 63.3 43.7 1.53 49.2 59.1 - GPT-4o-20240513 [192] 71.9 / 77.2 - 1.63 64.6 66.7 - GPT-4o-20240806 [192] - - 1.87 - - 41.8 / 58.3 Gemini-1.5-Pro [200] 75."},{"citing_arxiv_id":"2412.04468","ref_index":63,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"NVILA: Efficient Frontier Visual Language Models","primary_cat":"cs.CV","submitted_at":"2024-12-05T18:59:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"NVILA improves on VILA with a scale-then-compress visual token strategy and full-lifecycle efficiency optimizations, matching or exceeding leading
VLMs on image and video benchmarks while reducing training cost 1.9-5.1x and latencies 1.2-2.8x.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"[62] Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, and Vikas Chandra. LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding. arXiv:2410.17434, 2024. [63] Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Jiwen Lu, and Yongming Rao. Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution. arXiv:2409.12961, 2024. [64] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient Memory Management for Large Language Model Serving"}],"limit":50,"offset":0}