Vision-aligned Latent Reasoning for Multi-modal Large Language Model
Pith reviewed 2026-05-16 07:44 UTC · model grok-4.3
The pith
VaLR dynamically inserts vision-aligned latent tokens before each reasoning step to prevent loss of visual details in multi-modal models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that by dynamically generating vision-aligned latent tokens before each Chain of Thought reasoning step and training them through embedding alignment with vision encoders, the model can preserve visual knowledge during extended reasoning. This leads to better performance on benchmarks requiring long-context understanding and precise visual perception, with a notable improvement from 33.0% to 52.9% on VSI-Bench, and enables test-time scaling behavior absent in previous models.
What carries the argument
Vision-aligned latent tokens generated dynamically before each CoT step, trained via alignment of intermediate MLLM embeddings with vision encoder outputs to guide perceptual reasoning in latent space.
If this is right
- VaLR models outperform standard approaches on benchmarks needing long visual reasoning or precise perception.
- The framework shows test-time scaling where additional reasoning steps improve results.
- Significant gains occur on specific tests like VSI-Bench with nearly 20 percentage points improvement.
- Visual knowledge is preserved without harming general language reasoning capabilities.
Where Pith is reading between the lines
- VaLR could be extended to other multi-modal tasks such as video understanding where temporal visual details must persist across steps.
- Combining this with external vision tools might further enhance precision in real-world applications like autonomous navigation.
- Similar alignment techniques could apply to audio or other modalities in future multi-modal systems.
Load-bearing premise
Dynamically inserting vision-aligned latent tokens before each reasoning step, trained via embedding alignment, will preserve visual knowledge without introducing noise or degrading language reasoning.
What would settle it
Run VaLR and baseline models on VSI-Bench while increasing the number of reasoning steps; if performance does not improve or falls below the baseline, the claim of preserved visual information and scaling fails.
Figures
read the original abstract
Despite recent advancements in Multi-modal Large Language Models (MLLMs) on diverse understanding tasks, these models struggle to solve problems which require extensive multi-step reasoning. This is primarily due to the progressive dilution of visual information during long-context generation, which hinders their ability to fully exploit test-time scaling. To address this issue, we introduce Vision-aligned Latent Reasoning (VaLR), a simple, yet effective reasoning framework that dynamically generates vision-aligned latent tokens before each Chain of Thought reasoning step, guiding the model to reason based on perceptual cues in the latent space. Specifically, VaLR is trained to preserve visual knowledge during reasoning by aligning intermediate embeddings of MLLM with those from vision encoders. Empirical results demonstrate that VaLR consistently outperforms existing approaches across a wide range of benchmarks requiring long-context understanding or precise visual perception, while exhibiting test-time scaling behavior not observed in prior MLLMs. In particular, VaLR improves the performance significantly from 33.0% to 52.9% on VSI-Bench, achieving a 19.9%p gain over Qwen2.5-VL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Vision-aligned Latent Reasoning (VaLR) for MLLMs to mitigate progressive dilution of visual information during long-context Chain-of-Thought generation. VaLR dynamically inserts vision-aligned latent tokens before each reasoning step; these tokens are produced by training the model to align its intermediate embeddings with those from a vision encoder. The approach is reported to yield consistent gains on long-context and fine-grained visual benchmarks, including a 19.9 percentage-point improvement on VSI-Bench (33.0% to 52.9%) over Qwen2.5-VL, and to exhibit previously unobserved test-time scaling behavior.
Significance. If the mechanism is shown to preserve visual signal without drift, VaLR would provide a practical route to reliable test-time scaling in multimodal reasoning, addressing a recognized bottleneck in current MLLMs. The scale of the reported VSI-Bench gain and the claim of emergent scaling behavior would constitute a notable empirical advance for the field.
major comments (3)
- [Method] The training objective that aligns MLLM intermediate embeddings with vision-encoder embeddings is never formulated (no loss equation, no specification of which layers or token positions receive the alignment loss, and no statement of how this auxiliary loss is balanced against the standard next-token prediction loss). Without this, it is impossible to verify that the reported gains arise from preserved visual knowledge rather than from extra tokens or additional compute.
- [Experiments / Ablation studies] The central assumption—that a static embedding-alignment objective applied during training will prevent visual-signal drift across multi-step autoregressive generation at inference time—is not tested. No ablation removes the alignment loss, no analysis tracks embedding similarity over long CoT chains, and no control isolates the effect of the inserted latent tokens from other changes in the generation process.
- [Experiments] Baseline implementations, data splits, and training hyper-parameters are not described, so the 19.9 pp VSI-Bench improvement cannot be confidently attributed to the proposed mechanism rather than differences in training regime or evaluation protocol.
minor comments (1)
- [Abstract / Results] The abstract states that VaLR “exhibits test-time scaling behavior not observed in prior MLLMs,” yet no figure or table quantifies scaling curves (performance vs. number of reasoning steps or tokens) for both VaLR and the baselines.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which identify key omissions in the presentation of our method and experiments. We agree that additional details and analyses are needed to strengthen the manuscript and will incorporate revisions to address each point. Our responses below explain the planned changes.
read point-by-point responses
-
Referee: [Method] The training objective that aligns MLLM intermediate embeddings with vision-encoder embeddings is never formulated (no loss equation, no specification of which layers or token positions receive the alignment loss, and no statement of how this auxiliary loss is balanced against the standard next-token prediction loss). Without this, it is impossible to verify that the reported gains arise from preserved visual knowledge rather than from extra tokens or additional compute.
Authors: We agree that the training objective was described only at a high level in the manuscript. In the revised version, we will add an explicit formulation in the Method section: the total loss is L = L_AR + λ * L_align, where L_AR is the standard next-token prediction loss and L_align is the mean squared error between the MLLM's intermediate embeddings (at the positions of the generated latent tokens, taken from the final transformer layer before each reasoning step) and the corresponding outputs from the frozen vision encoder. The hyperparameter λ will be specified (set to 0.1 in our experiments). This formulation will clarify that the alignment is applied specifically to the vision-aligned latent tokens and is balanced against the primary objective. revision: yes
-
Referee: [Experiments / Ablation studies] The central assumption—that a static embedding-alignment objective applied during training will prevent visual-signal drift across multi-step autoregressive generation at inference time—is not tested. No ablation removes the alignment loss, no analysis tracks embedding similarity over long CoT chains, and no control isolates the effect of the inserted latent tokens from other changes in the generation process.
Authors: We acknowledge that direct empirical tests of the drift-prevention assumption are absent from the current manuscript. We will add the following to the Experiments section: (1) an ablation training a variant without the alignment loss (λ=0) and reporting its performance on VSI-Bench and other long-context benchmarks; (2) an analysis of cosine similarity between MLLM intermediate embeddings and vision-encoder embeddings measured at each step of long CoT chains, comparing VaLR to the baseline to show reduced drift; (3) a control experiment inserting non-aligned random latent tokens instead of vision-aligned ones. These additions will isolate the contribution of the alignment mechanism. revision: yes
-
Referee: [Experiments] Baseline implementations, data splits, and training hyper-parameters are not described, so the 19.9 pp VSI-Bench improvement cannot be confidently attributed to the proposed mechanism rather than differences in training regime or evaluation protocol.
Authors: We agree that these details are necessary for reproducibility and attribution. In the revised Experiments section, we will fully specify: the exact baseline configurations (including whether Qwen2.5-VL was used off-the-shelf or further fine-tuned on the same data), the training and test data splits for all benchmarks (e.g., the VSI-Bench split used), and all training hyperparameters (learning rate, batch size, number of epochs, optimizer, and the precise number of latent tokens generated per step). This will enable readers to confirm that the reported gains are attributable to VaLR. revision: yes
Circularity Check
No significant circularity in VaLR framework
full rationale
The paper introduces VaLR as an empirical framework that inserts vision-aligned latent tokens and trains via embedding alignment with vision encoders. All central claims rest on reported performance gains measured on external benchmarks (e.g., VSI-Bench) against named prior models such as Qwen2.5-VL. No equations, derivations, or load-bearing self-citations appear in the provided text that reduce any result to a fitted parameter or self-defined input by construction. The method's assumptions are tested through independent evaluation rather than assumed tautologically, rendering the chain self-contained.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Visual information progressively dilutes during long-context generation in MLLMs
- domain assumption Aligning intermediate MLLM embeddings with vision-encoder embeddings preserves visual knowledge during reasoning
invented entities (1)
-
Vision-aligned latent tokens
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
LREPA :=−1/NP ∑ sim(ˆFMLLM[p,:],Fϕ[p,:]) … align intermediate embeddings of MLLM with those from vision encoders
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanLogicNat recovery theorem unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
dynamically generates vision-aligned latent tokens before each Chain of Thought reasoning step
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 4 Pith papers
-
Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
GAP introduces three-level alignment for visual latent reasoning in MLLMs, achieving top aggregate perception and reasoning performance on Qwen2.5-VL 7B by addressing decoder-input norm mismatch.
-
Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
GAP aligns visual latent reasoning in MLLMs at feature, context, and capacity levels, yielding best aggregate perception/reasoning scores on Qwen2.5-VL 7B among supervised variants while showing task-relevant signal i...
-
What's Holding Back Latent Visual Reasoning?
Latent visual reasoning fails in current models because standard datasets make oracle latents uninformative and inference-time latents collapse away from useful representations.
-
Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
GAP aligns visual latent reasoning in MLLMs at feature, context, and capacity levels, yielding the best aggregate perception and reasoning scores on Qwen2.5-VL 7B among supervised variants while providing task-relevan...
Reference graph
Works this paper leans on
-
[1]
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
Anthropic. The claude 3 model family: Opus, son- net, haiku. Technical report, Anthropic, 2024a. URL https://www-cdn.anthropic.com/ de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/ Model_Card_Claude_3.pdf. Anthropic. Claude 3.5 sonnet model card. Technical report, Anthropic, 2024b. URL https://www-cdn.anthropic.com/ fed9cc193a14b84131812372d8d5857f8f304c52/ Mode...
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Bai, J., Bai, S., Chu, Y ., Cui, Z., Dang, K., Deng, X., Fan, Y ., Ge, W., Han, Y ., Huang, F., Hui, B., Ji, L., Li, M., Lin, J., Lin, R., Liu, D., Liu, G., Lu, C., Lu, K., Ma, J., Men, R., Ren, X., Ren, X., Tan, C., Tan, S., Tu, J., Wang, P., Wang, S., Wang, W., Wu, S., Xu, B., Xu, J., Yang, A., Yang, H., Yang, J., Yang, S., Yao, Y ., Yu, B., Yuan, H., Y...
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
PaliGemma: A versatile 3B VLM for transfer
Beyer, L., Steiner, A., Pinto, A. S., Kolesnikov, A., Wang, X., Salz, D., Neumann, M., Alabdulmohsin, I., Tschan- nen, M., Bugliarello, E., et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726,
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
Bjorck, J., Casta˜neda, F., Cherniadev, N., Da, X., Ding, R., Fan, L., Fang, Y ., Fox, D., Hu, F., Huang, S., et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734,
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al. π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
Reasoning in space via grounding in the world.arXiv preprint arXiv:2510.13800, 2025
Chen, L., Li, J., Dong, X., Zhang, P., He, C., Wang, J., Zhao, F., and Lin, D. Sharegpt4v: Improving large multi-modal models with better captions. InECCV, pp. 370–387. Springer, 2024a. Chen, L., Li, J., Dong, X., Zhang, P., Zang, Y ., Chen, Z., Duan, H., Wang, J., Qiao, Y ., Lin, D., et al. Are we on the right way for evaluating large vision-language mod...
-
[7]
Cheng, K., Song, W., Fan, J., Ma, Z., Sun, Q., Xu, F., Yan, C., Chen, N., Zhang, J., and Chen, J. Caparena: Benchmarking and analyzing detailed image captioning in the llm era.arXiv preprint arXiv:2503.12329,
-
[8]
MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices
Chu, X., Qiao, L., Lin, X., Xu, S., Yang, Y ., Hu, Y ., Wei, F., Zhang, X., Zhang, B., Wei, X., et al. Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices.arXiv preprint arXiv:2312.16886, 2(6):7,
work page internal anchor Pith review arXiv
-
[9]
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
9 Vision-aligned Latent Reasoning for Multi-modal Large Language Model Chu, X., Qiao, L., Zhang, X., Xu, S., Wei, F., Yang, Y ., Sun, X., Hu, Y ., Lin, X., Zhang, B., et al. Mobilevlm v2: Faster and stronger baseline for vision language model. arXiv preprint arXiv:2402.03766,
work page internal anchor Pith review arXiv
-
[10]
OneThinker: All-in-one Reasoning Model for Image and Video
Feng, K., Zhang, M., Li, H., Fan, K., Chen, S., Jiang, Y ., Zheng, D., Sun, P., Zhang, Y ., Sun, H., et al. Onethinker: All-in-one reasoning model for image and video.arXiv preprint arXiv:2512.03043,
work page internal anchor Pith review Pith/arXiv arXiv
- [11]
-
[12]
Hao, S., Gu, Y ., Luo, H., Liu, T., Shao, X., Wang, X., Xie, S., Ma, H., Samavedhi, A., Gao, Q., et al. Llm reasoners: New evaluation, library, and analysis of step- by-step reasoning with large language models.arXiv preprint arXiv:2404.05221, 2024a. Hao, S., Sukhbaatar, S., Su, D., Li, X., Hu, Z., Weston, J., and Tian, Y . Training large language models ...
-
[13]
Hurst, A., Lerer, A., Goucher, A. P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276,
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
Huynh, N. D., Bouadjenek, M. R., Aryal, S., Razzak, I., and Hacid, H. Visual question answering: from early developments to recent advances–a survey.arXiv preprint arXiv:2501.03939,
-
[15]
Decomposed Prompting: A Modular Approach for Solving Complex Tasks
Khot, T., Trivedi, H., Finlayson, M., Fu, Y ., Richardson, K., Clark, P., and Sabharwal, A. Decomposed prompting: A modular approach for solving complex tasks.arXiv preprint arXiv:2210.02406,
work page internal anchor Pith review arXiv
-
[16]
OpenVLA: An Open-Source Vision-Language-Action Model
Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T., Balakr- ishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., San- keti, P., et al. Openvla: An open-source vision-language- action model.arXiv preprint arXiv:2406.09246,
work page internal anchor Pith review Pith/arXiv arXiv
-
[17]
MolmoAct: Action Reasoning Models that can Reason in Space
Lee, J., Duan, J., Fang, H., Deng, Y ., Liu, S., Li, B., Fang, B., Zhang, J., Wang, Y . R., Lee, S., et al. Molmoact: Action reasoning models that can reason in space.arXiv preprint arXiv:2508.07917,
work page internal anchor Pith review Pith/arXiv arXiv
-
[18]
10 Lucas Lehnert, Sainbayar Sukhbaatar, Paul Mcvay, Michael Rabbat, and Yuandong Tian
Lehnert, L., Sukhbaatar, S., Su, D., Zheng, Q., Mcvay, P., Rabbat, M., and Tian, Y . Beyond a*: Better planning with transformers via search dynamics bootstrapping.arXiv preprint arXiv:2402.14083,
-
[19]
Zebra-cot: A dataset for interleaved vision language reasoning.arXiv preprint arXiv:2507.16746, 2025
10 Vision-aligned Latent Reasoning for Multi-modal Large Language Model Li, A., Wang, C., Fu, D., Yue, K., Cai, Z., Zhu, W. B., Liu, O., Guo, P., Neiswanger, W., Huang, F., et al. Zebra- cot: A dataset for interleaved vision language reasoning. arXiv preprint arXiv:2507.16746, 2025a. Li, B., Sun, X., Liu, J., Wang, Z., Wu, J., Yu, X., Chen, H., Barsoum, E...
-
[20]
Video-llava: Learning united visual representation by alignment before projection
Lin, B., Ye, Y ., Zhu, B., Cui, J., Ning, M., Jin, P., and Yuan, L. Video-llava: Learning united visual representation by alignment before projection. InProceedings of the 2024 conference on empirical methods in natural language processing, pp. 5971–5984,
work page 2024
-
[21]
Liu, H., Li, C., Wu, Q., and Lee, Y
Accessed: 2025-04-03. Liu, H., Li, C., Wu, Q., and Lee, Y . J. Visual instruction tuning. InNeurIPS, volume 36, pp. 34892–34916,
work page 2025
-
[22]
Liu, H., Li, C., Li, Y ., and Lee, Y . J. Improved baselines with visual instruction tuning. InCVPR, pp. 26296–26306, 2024a. Liu, H., Li, C., Li, Y ., Li, B., Zhang, Y ., Shen, S., and Lee, Y . J. Llavanext: Improved reasoning, ocr, and world knowledge, 2024b. Lu, P., Bansal, H., Xia, T., Liu, J., Li, C., Hajishirzi, H., Cheng, H., Chang, K.-W., Galley, M...
work page internal anchor Pith review Pith/arXiv arXiv
-
[23]
DINOv2: Learning Robust Visual Features without Supervision
Oquab, M., Darcet, T., Moutakanni, T., V o, H., Szafraniec, M., Khalidov, V ., Fernandez, P., Haziza, D., Massa, F., El- Nouby, A., et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193,
work page internal anchor Pith review Pith/arXiv arXiv
-
[24]
Chain-of-visual-thought: Teaching vlms to see and think better with continuous visual tokens,
Qin, Y ., Wei, B., Ge, J., Kallidromitis, K., Fu, S., Darrell, T., and Wang, X. Chain-of-visual-thought: Teaching vlms to see and think better with continuous visual tokens.arXiv preprint arXiv:2511.19418, 2025a. Qin, Y ., Wei, B., Ge, J., Kallidromitis, K., Fu, S., Darrell, T., and Wang, X. Chain-of-visual-thought: Teaching vlms to see and think better w...
-
[25]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Shao, H., Qian, S., Xiao, H., Song, G., Zong, Z., Wang, L., Liu, Y ., and Li, H. Visual cot: Unleashing chain-of- thought reasoning in multi-modal language models. In NeurIPS, 2024a. Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y ., Wu, Y ., et al. Deepseekmath: Push- ing the limits of mathematical reasoning in open lan...
work page internal anchor Pith review Pith/arXiv arXiv
-
[26]
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
Snell, C., Lee, J., Xu, K., and Kumar, A. Scaling llm test- time compute optimally can be more effective than scal- ing model parameters.arXiv preprint arXiv:2408.03314,
work page internal anchor Pith review Pith/arXiv arXiv
-
[27]
Su, D., Sukhbaatar, S., Rabbat, M., Tian, Y ., and Zheng, Q. Dualformer: Controllable fast and slow thinking by learning with randomized reasoning traces.arXiv preprint arXiv:2410.09918,
-
[28]
Tong, P., Brown, E., Wu, P., Woo, S., IYER, A. J. V ., Akula, S. C., Yang, S., Yang, J., Middepogu, M., Wang, Z., et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. InNeurIPS, 2024a. Tong, S., Liu, Z., Zhai, Y ., Ma, Y ., LeCun, Y ., and Xie, S. Eyes wide shut? exploring the visual shortcomings of multimodal llms. InCVPR, pp. ...
work page internal anchor Pith review Pith/arXiv arXiv
-
[29]
Monet: Reasoning in latent visual space beyond images and language,
Wang, H., Zheng, A., Zhao, Y ., Wang, T., Zheng, G., Zhang, X., and Zhang, Z. Reconstructive visual instruction tun- ing. InICLR, 2025a. Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., and Novotny, D. Vggt: Visual geometry grounded transformer. InCVPR, pp. 5294–5306, 2025b. Wang, K., Pan, J., Shi, W., Lu, Z., Ren, H., Zhou, A., Zhan, M., and L...
-
[30]
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
Wu, D., Liu, F., Hung, Y .-H., and Duan, Y . Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelli- gence.arXiv preprint arXiv:2505.23747,
work page internal anchor Pith review arXiv
-
[31]
URL https://arxiv.org/abs/2407.10671. Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, M...
work page internal anchor Pith review Pith/arXiv arXiv
-
[32]
arXiv preprint arXiv:2406.05673 , year=
Yu, F., Jiang, L., Kang, H., Hao, S., and Qin, L. Flow of reasoning: Efficient training of llm policy with divergent thinking.arXiv preprint arXiv:2406.05673, 1(2):6,
-
[33]
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
12 Vision-aligned Latent Reasoning for Multi-modal Large Language Model Yu, L., Jiang, W., Shi, H., Yu, J., Liu, Z., Zhang, Y ., Kwok, J. T., Li, Z., Weller, A., and Liu, W. Metamath: Boot- strap your own mathematical questions for large language models.arXiv preprint arXiv:2309.12284,
work page internal anchor Pith review Pith/arXiv arXiv
-
[34]
MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning
Yue, X., Qu, X., Zhang, G., Fu, Y ., Huang, W., Sun, H., Su, Y ., and Chen, W. Mammoth: Building math generalist models through hybrid instruction tuning.arXiv preprint arXiv:2309.05653,
work page internal anchor Pith review arXiv
-
[35]
Zheng, D., Huang, S., Li, Y ., and Wang, L. Learning from videos for 3d world: Enhancing mllms with 3d vision ge- ometry priors.arXiv preprint arXiv:2505.24625, 2025a. Zheng, Z., Yang, M., Hong, J., Zhao, C., Xu, G., Yang, L., Shen, C., and Yu, X. Deepeyes: Incentivizing” thinking with images” via reinforcement learning.arXiv preprint arXiv:2505.14362, 20...
-
[36]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Zhu, D., Chen, J., Shen, X., Li, X., and Elhoseiny, M. Minigpt-4: Enhancing vision-language understanding with advanced large language models.arXiv preprint arXiv:2304.10592,
work page internal anchor Pith review Pith/arXiv arXiv
-
[37]
Hyperparameter Stage 1 Stage 2 optimizer AdamW deepspeed Zero-2 learning rate 1e-5 2e-6 MLPψlearning rate - 1e-5 per-GPU batch size 2 gradient accumulation steps 16 weight decay 0.01 epoch 1 warm-up ratio 0.03 latent tokens (K) - 16 alignment weight (λ) - 0.5 During training, we select CLIP (Radford et al., 2021), SigLIP (Tschannen et al., 2025), DINO (Oq...
work page 2021
-
[38]
To evaluate Monet (Wang et al., 2025c), we follow the system prompt proposed by the Monet authors
as the judge. To evaluate Monet (Wang et al., 2025c), we follow the system prompt proposed by the Monet authors. For CoVT (Qin et al., 2025b), we use CoVT-7B-depth-seg-dino. We evaluate various models on VSI-Bench (Yang et al., 2025b) for 3D spatial reasoning tasks, BLINK (Fu et al., 2024), MMVP (Tong et al., 2024b), MMStar (Chen et al., 2024b), V∗ (Wu & ...
work page 2024
-
[39]
In addition, we report the model versions used for API-based evaluation as follows: •openai/gpt-4o-2024-08-06 •Claude/claude-sonnet-4-20250514 Table 8.Number of frames used in VSI-Bench evaluation. Methods # of Frames GPT-4o 16 LLaV A-NeXT-Video-7B 32 R1-OneVision-7B 32 Ocean-R1-7B 32 Qwen2.5-VL-7B 32 LVR 32 CoVT 32 Monet 32 VaLR (Ours) 32 14 Vision-align...
work page 2024
-
[40]
with 170K samples from OneThinker-SFT (Feng et al., 2025). B.2. Non-interleaved CoT Data Let an input image set be I={I 1,· · ·, I Q} where Q is the number of input images, and the ground-truth language CoT reasoning bey= [r 1, r2,· · ·, r N , a]wherer i is thei-th reasoning step andais the final answer. Single-view VQA dataset.For single-view data where ...
work page 2025
-
[41]
to identify which image is most relevant for each reasoning step r(i) in the ground-truth CoT reasoning. Specifically, we process GPT-4o with the set of input images I and the CoT reasoning chain y, and ask it to match each reasoning step with its corresponding target image. After obtaining the target image Itarget for each reasoning step r(i), we apply R...
work page 2025
-
[42]
features as input tokens to the LLM backbone. As shown in Figure 4, REPA outperforms the input token method on VSI-Bench (Yang et al., 2025b), BLINK (Fu et al., 2024), and V∗ (Wu & Xie,
work page 2024
-
[43]
Large Language Model DINOv3 Qwen Enc
video understanding 27 Large Language Model DINOv3 Qwen Enc. Large Language Model DINOv3 Qwen Enc. (a) Visual Features for Input Tokens (b) Visual Features for REPA REPA Figure 4.Comparison between methods using vision encoder features.We compare two methods using DINOv3 features: (a) Using visual features as input visual tokens of MLLM (Green), (b) Align...
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.