Judging to Improve: A De-biased VLM-as-3D-Judge Protocol for Single-Image 3D Generation

Ali Asaria; Deep Gandhi; Tony Salomone

arxiv: 2606.20364 · v1 · pith:TOEKK7EEnew · submitted 2026-06-18 · 💻 cs.LG


Pith reviewed 2026-06-26 18:23 UTC · model grok-4.3


keywords VLM judge · 3D mesh generation · single-image to 3D · de-biasing · parameter-efficient adaptation · win-rate evaluation · conditioner repair · furniture assets


The pith

Under a hardened VLM-as-3D-judge protocol, the best lightweight adaptation reaches parity with the base generator, and none exceeds the 65 percent win-rate target.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether a de-biased VLM judge, developed for ranking single-image 3D mesh quality, can be hardened into an optimization signal that specializes the base generator (TRELLIS) on furniture assets using only public models and lightweight adaptation. It separates the training judge from the evaluation judge, applies position-bias correction, and fixes three failure modes, yielding an independent signal with clear-gap win-rates of 0.83-1.0. Across six methods, two input regimes, and a severity sweep, the strongest result is parity (0.50 win-rate) from conditioner repair under severe degradation, and no method clears the 65 percent target. Independent base samples show a 0.94 order-flip rate, so they carry essentially no learnable preference. The authors read this outcome mechanistically: clean inputs saturate the judge, flow-based fine-tuning washes out through the sampler, and conditioning repair is the one intervention that moves geometry.

Core claim

The central claim is that converting the de-biased VLM-as-3D-judge from ranking to optimization requires explicit hardening against circularity, position bias, and judge failure modes. With that hardening in place, lightweight parameter-efficient adaptations on public data match but do not surpass the strong base generator, and the mechanistic limit is that independent base samples carry essentially no learnable preference.

What carries the argument

The hardened VLM-as-3D-judge that separates training and evaluation models, corrects position bias, and repairs three failure modes (image overload, geometry-hiding splat renders, reference-free judging) to supply an independent optimization signal.
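
A minimal sketch of how position-bias correction and an order-flip rate could be computed, assuming the judge is queried twice per pair with the presentation order swapped. The `judge` callable, its `"first"`/`"second"` return values, and the rule that a flipped verdict counts as half a win are illustrative assumptions; the summary here does not specify the paper's exact procedure.

```python
from typing import Callable, Hashable, Iterable, Optional, Tuple

# Hypothetical judge interface: given two candidates in presentation order,
# it returns "first" or "second" for the one it prefers.
Judge = Callable[[Hashable, Hashable], str]


def debiased_verdict(judge: Judge, a: Hashable, b: Hashable) -> Optional[str]:
    """Return "a" or "b" if the judge prefers the same candidate in both
    presentation orders, else None (the verdict flipped with position)."""
    forward = "a" if judge(a, b) == "first" else "b"
    backward = "b" if judge(b, a) == "first" else "a"
    return forward if forward == backward else None


def summarize(judge: Judge, pairs: Iterable[Tuple[Hashable, Hashable]]):
    """Return (win_rate_of_a, order_flip_rate) over (a, b) pairs.
    Flipped verdicts are scored as half a win (an assumed tie rule)."""
    pairs = list(pairs)
    verdicts = [debiased_verdict(judge, a, b) for a, b in pairs]
    flips = sum(v is None for v in verdicts)
    wins = sum(v == "a" for v in verdicts) + 0.5 * flips
    return wins / len(pairs), flips / len(pairs)


# Toy judges: one driven purely by position, one driven purely by content.
always_first = lambda x, y: "first"
prefers_larger = lambda x, y: "first" if x > y else "second"

pairs = [(3, 1), (2, 5), (4, 0), (9, 7)]
print(summarize(always_first, pairs))    # (0.5, 1.0)
print(summarize(prefers_larger, pairs))  # (0.75, 0.0)
```

In this toy setup a purely position-driven judge flips on every pair and lands at a 0.5 win-rate, which is the pattern a high order-flip rate on base-vs-base comparisons would resemble.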

If this is right

Independent base samples carry essentially no learnable preference, requiring quality-contrastive construction for any signal.
Conditioning repair under severe degradation is the only locus that moves geometry; other adaptations wash out through the sampler.
Matching a strong public-data base with cheap adaptation shows that exceeding it requires more than lightweight PEFT on public data.
The hardened judge protocol functions as a reusable independent evaluator for 3D generation quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The saturation on clean inputs implies the judge may be most useful when inputs are deliberately degraded or paired with lower-quality contrasts.
A natural follow-up is to test whether heavier adaptation techniques or private data sources would be needed to surpass the base.
The protocol may apply to other single-image generation domains where cheap proxies fail but a hardened VLM can supply directional preference.

Load-bearing premise

The judge supplies an independent, non-saturated optimization signal that can be used to specialize the generator without the signal being washed out by the sampler or already maximized on clean base outputs.

What would settle it

An adaptation method that produces a win-rate of 65 percent or higher against the base generator on the n=8 test objects would falsify the result that no method clears the target.
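
To make this threshold concrete, the sketch below works through the arithmetic under two assumptions that are not stated in the paper summary: each of the 8 objects contributes exactly one verdict, and there are no ties. It lists which win-rates are achievable and how often a generator at exact parity would reach the target by chance.

```python
from math import comb

N_OBJECTS = 8   # from the paper
TARGET = 0.65   # from the paper


def min_wins_for_target(n: int, target: float) -> int:
    """Smallest number of wins k such that k / n >= target."""
    return next(k for k in range(n + 1) if k / n >= target)


def tail_prob_under_parity(n: int, k_min: int) -> float:
    """P(X >= k_min) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, k) for k in range(k_min, n + 1)) / 2 ** n


k_min = min_wins_for_target(N_OBJECTS, TARGET)
print([k / N_OBJECTS for k in range(N_OBJECTS + 1)])  # achievable win-rates
print(k_min, k_min / N_OBJECTS)                       # 6 0.75
print(tail_prob_under_parity(N_OBJECTS, k_min))       # 0.14453125
```

Under these assumptions 5 of 8 wins (0.625) falls short of the target and 6 of 8 (0.75) is the first outcome that clears it, and a method with no real advantage would still reach 6 or more wins about 14% of the time.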

Original abstract

A companion study established a de-biased, cross-model VLM-as-3D-judge that reliably ranks single-image-to-3D mesh quality where cheap geometry and CLIP proxies fall short. This paper asks: can that judge's preferences specialize a strong open generator, TRELLIS, on one asset class (furniture), cheaply and without human labels? Taking the judge from ranking to optimization is where the work lives. Pushing a VLM judge into the training and evaluation loop exposes failure modes ranking never triggered, so our contribution is an optimization-grade hardening of the judge: a training judge (Qwen2.5-VL-7B) held distinct from an evaluation judge (InternVL3-8B) to break circularity; position-bias correction; and fixes for three failure modes (image overload, geometry-hiding splat renders, and reference-free judging that rewards clean-but-wrong outputs), with calibration evidence (clear-gap win-rate 0.83-1.0; base-vs-base ~0.5). Using this protocol as an independent evaluator, and working only from public models and data with lightweight parameter-efficient adaptation, we find our methods match the strong base rather than exceed it. Independent base samples carry essentially no learnable preference (0.94 order-flip rate), so signal must be engineered by quality-contrastive construction. Across six adaptation methods, two input regimes, and a severity sweep, the most targeted - conditioner repair under severe degradation - reaches parity (0.50) with the base, while no method clears the >=65% win-rate target. The result is mechanistic: clean inputs saturate the judge, flow-DIT fine-tuning washes out through the sampler, and conditioning repair is the locus that moves geometry. Win-rates are directional at n=8 objects. Matching a strong public-data base with cheap adaptation is itself informative: exceeding it needs more than lightweight PEFT on public data, and the judge protocol is reusable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The hardened judge protocol is the real contribution here, but the adaptation results on n=8 objects are too underpowered to support the claim that no method beats the base.


The paper moves a VLM judge from ranking to optimization for single-image 3D and adds the practical fixes that ranking setups miss: a training judge (Qwen2.5-VL-7B) kept separate from the evaluation judge (InternVL3-8B), position-bias correction, and handling for image overload, geometry-hiding splat renders, and reference-free judging. The authors report calibration with base-vs-base win-rates near 0.5 and clear-gap win-rates of 0.83-1.0. That part is concrete and reusable.

The adaptation experiments on TRELLIS for furniture carry the main empirical claim: across six methods, two input regimes, and a severity sweep, the best reaches only parity at 0.50 and none hits the 65% target. The authors correctly note that independent base samples carry almost no learnable signal (0.94 order-flip rate), so any improvement has to be engineered.

The soft spot is sample size. With only eight objects, win-rate comparisons lack power; a 0.50 observation has a wide interval and overlaps heavily with 0.625. No standard errors, p-values, or corrections are reported, and the paper itself calls the rates directional. The stress-test concern is accurate on the text given.

This is for groups working on label-free 3D specialization and VLM judges. It deserves peer review because the judge hardening is new and the negative finding on lightweight public-data methods is worth recording, but the stats and n need to be fixed before the central claim lands.

Referee Report

2 major / 2 minor

Summary. The paper introduces a hardened de-biased VLM-as-3D-judge protocol (distinct training judge Qwen2.5-VL-7B and evaluation judge InternVL3-8B, with position-bias correction and fixes for image overload, splat rendering, and reference-free judging) to optimize the TRELLIS single-image-to-3D generator on furniture assets via six lightweight PEFT adaptation methods. It reports that independent base samples show no learnable preference (0.94 order-flip rate), that the most effective method (conditioner repair under severe degradation) reaches only parity (0.50 win-rate) with the base, and that no method exceeds the >=65% target; the result is attributed to judge saturation on clean inputs and signal washout through the sampler. Calibration evidence is provided (clear-gap win-rates 0.83-1.0; base-vs-base ~0.5), and win-rates are described as directional at n=8 objects.

Significance. If the empirical findings hold after addressing sample-size limitations, the work would usefully document the practical barriers to turning a ranking-grade VLM judge into an optimization signal for 3D generation: clean base outputs already saturate the judge, flow-DiT fine-tuning erases preference information, and only targeted conditioning repair moves geometry. The separation of judges and the calibration protocol constitute reusable methodological contributions that future work can adopt. The negative result on public-data lightweight adaptation also supplies a concrete baseline indicating that exceeding strong open generators will require either larger-scale data, architectural changes, or stronger preference signals.

major comments (2)

[Results / abstract] Results section (and abstract): All win-rate claims rest on n=8 objects. For a binomial proportion, the 95% CI around an observed 0.50 is approximately [0.24, 0.76]; an observed 0.625 still overlaps substantially with 0.50. The manuscript reports no standard errors, p-values, multiple-comparison corrections, or power analysis, yet concludes that 'no method clears the >=65% win-rate target' and that conditioner repair 'reaches parity.' This sample size is load-bearing for the central empirical claim.
[Methods / evaluation protocol] § on adaptation methods and evaluation protocol: The abstract does not detail exact adaptation implementations, dataset sizes, or statistical tests; without these, it is impossible to assess whether the six methods were implemented comparably or whether the reported directional win-rates could be reproduced. This directly affects verifiability of the claim that the judge supplies an independent optimization signal.
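
As a rough check on the interval arithmetic in the first major comment, the sketch below computes two standard 95% intervals for a binomial proportion using only the standard library. It assumes one independent verdict per object, so that a 0.50 win-rate means 4 wins out of 8 and 0.625 means 5 out of 8; that mapping is an assumption, not something the summary states.

```python
from math import comb, sqrt


def wilson_interval(k: int, n: int, z: float = 1.959964):
    """Wilson score interval for k successes in n trials (z for 95%)."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_decreasing(f, target, lo=0.0, hi=1.0, iters=60):
    """Bisection: find p in [lo, hi] with f(p) == target, f decreasing in p."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, alpha: float = 0.05):
    """Exact (Clopper-Pearson) interval via bisection on the binomial CDF."""
    lower = 0.0 if k == 0 else _solve_decreasing(
        lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    upper = 1.0 if k == n else _solve_decreasing(
        lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


for k in (4, 5):
    print(k, [round(x, 3) for x in wilson_interval(k, 8)],
          [round(x, 3) for x in clopper_pearson(k, 8)])
```

For 4 of 8 this gives roughly [0.22, 0.78] (Wilson) and [0.16, 0.84] (Clopper-Pearson). Both are somewhat wider than the approximate range quoted in the comment, which strengthens the point that the sample size is load-bearing.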

minor comments (2)

[Results] The manuscript should explicitly state the exact number of objects, prompts, and renderings used for each win-rate comparison and whether the same 8 objects were used across all conditions.
[Calibration] Clarify whether the 0.94 order-flip rate on base samples was measured on the same n=8 objects or a larger held-out set.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive review and for recognizing the methodological contributions of the de-biased judge protocol. We address the two major comments below.


Referee: [Results / abstract] All win-rate claims rest on n=8 objects. For a binomial proportion, the 95% CI around an observed 0.50 is approximately [0.24, 0.76]; an observed 0.625 still overlaps substantially with 0.50. The manuscript reports no standard errors, p-values, multiple-comparison corrections, or power analysis, yet concludes that 'no method clears the >=65% win-rate target' and that conditioner repair 'reaches parity.' This sample size is load-bearing for the central empirical claim.

Authors: We agree the sample size limits statistical power and will add 95% binomial confidence intervals, standard errors, and an explicit discussion of overlap with 0.5 to the results section and abstract. The manuscript already qualifies results as directional at n=8; we will further temper language around the >=65% target and parity claim to reflect uncertainty. A note on the absence of formal hypothesis testing will be included. We cannot expand to larger n within this study. revision: partial

Referee: [Methods / evaluation protocol] The abstract does not detail exact adaptation implementations, dataset sizes, or statistical tests; without these, it is impossible to assess whether the six methods were implemented comparably or whether the reported directional win-rates could be reproduced. This directly affects verifiability of the claim that the judge supplies an independent optimization signal.

Authors: The full manuscript already describes the six PEFT methods, input regimes, and evaluation protocol in the methods section. To improve verifiability we will expand the methods and supplementary material with exact hyperparameters, dataset sizes, and any statistical considerations used. The abstract will be revised to indicate that full implementation details are provided in the paper. revision: yes

standing simulated objections not resolved

Increasing the evaluation set beyond n=8 objects is not feasible due to the computational cost of 3D generation and VLM judging.
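
To gauge how far n=8 sits from an adequately powered comparison, the sketch below runs an exact binomial power calculation. Every setting is an illustrative assumption rather than something taken from the paper: one independent verdict per object, a one-sided test of parity (0.5) at alpha = 0.05, a true win-rate of exactly 0.65, and a power goal of 0.8.

```python
from math import comb


def upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def critical_wins(n: int, alpha: float = 0.05, p0: float = 0.5) -> int:
    """Smallest win count whose chance probability under p0 is <= alpha.
    Returns n + 1 when no achievable count is significant."""
    for k in range(n + 2):
        if upper_tail(k, n, p0) <= alpha:
            return k
    return n + 1


def power(n: int, p1: float = 0.65, alpha: float = 0.05, p0: float = 0.5) -> float:
    """Probability of rejecting parity when the true win-rate is p1."""
    return upper_tail(critical_wins(n, alpha, p0), n, p1)


def smallest_n(target_power: float = 0.8, p1: float = 0.65,
               alpha: float = 0.05, max_n: int = 500):
    """First n at which the exact test reaches the target power."""
    for n in range(1, max_n + 1):
        if power(n, p1, alpha) >= target_power:
            return n
    return None


print(critical_wins(8))     # wins needed for significance at n = 8
print(round(power(8), 3))   # power at n = 8
print(smallest_n())         # first n reaching 0.8 power
```

Under these assumptions the test at n=8 needs 7 of 8 wins and has power of roughly 0.17, while a normal approximation puts the required sample at about 67 objects. Exact binomial power is not monotone in n, so the value returned by `smallest_n` is the first crossing of the threshold rather than a guarantee for every larger n.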

Circularity Check

0 steps flagged

No significant circularity; empirical adaptation results are independent of judge construction


The paper's central finding—that no adaptation method exceeds the 65% win-rate target and the best reaches only parity—is an empirical outcome measured on n=8 objects using a hardened judge protocol. The protocol explicitly separates the training judge (Qwen2.5-VL-7B) from the evaluation judge (InternVL3-8B) and reports base-vs-base win-rates near 0.5 as calibration. These steps prevent the optimization signal from being self-referential. No derivation reduces a claimed prediction to a fitted parameter by construction, no uniqueness theorem is imported from overlapping-author prior work, and the negative result is not forced by the inputs. The work is self-contained against the reported public-model benchmarks and internal calibration checks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the companion study's judge being reliable after hardening and on the assumption that public data and models suffice to test whether preference signals can be learned via lightweight adaptation.



Reference graph

Works this paper leans on

22 extracted references · 12 linked inside Pith

[1]

A Cross-Model VLM-Judge Protocol for Single-Image 3D Mesh Quality (and Why Cheap Proxies Fall Short)

Ali Asaria, Tony Salomone, and Deep Gandhi. A Cross-Model VLM-Judge Protocol for Single-Image 3D Mesh Quality (and Why Cheap Proxies Fall Short). arXiv:2606.18451 [cs.LG], 2026. URL https://arxiv.org/abs/2606.18451. Companion work; introduces the cross-model VLM-as-3D-judge evaluation protocol adopted here

Pith/arXiv arXiv 2026
[2]

Structured 3D Latents for Scalable and Versatile 3D Generation

Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3D Latents for Scalable and Versatile 3D Generation. URL https://arxiv.org/abs/2412.01506

Pith/arXiv arXiv
[4]

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. In Advances in Neural Information Processing Systems (NeurIPS), 2023

2023
[5]

Diffusion Model Alignment Using Direct Preference Optimization

Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion Model Alignment Using Direct Preference Optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. URL https://arxiv.org/abs/2311.12908

arXiv 2024
[6]

DSO: Aligning 3D Generators with Simulation Feedback for Physical Soundness

Ruining Li, Chuanxia Zheng, Christian Rupprecht, and Andrea Vedaldi. DSO: Aligning 3D Generators with Simulation Feedback for Physical Soundness. 2025. URL https://arxiv.org/abs/2503.22677

arXiv 2025
[7]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. In International Conference on Learning Representations (ICLR), 2023. URL https://arxiv.org/abs/2209.03003

Pith/arXiv arXiv 2023
[8]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow Matching for Generative Modeling. In International Conference on Learning Representations (ICLR), 2023. URL https://arxiv.org/abs/2210.02747

Pith/arXiv arXiv 2023
[9]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Lab...

Pith/arXiv arXiv 2024
[10]

DreamDPO: Aligning Text-to-3D Generation with Human Preferences via Direct Preference Optimization

Zhenglin Zhou, Xiaobo Xia, Fan Ma, Hehe Fan, Yi Yang, and Tat-Seng Chua. DreamDPO: Aligning Text-to-3D Generation with Human Preferences via Direct Preference Optimization. URL https://arxiv.org/abs/2502.04370

arXiv
[12]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations (ICLR), 2022. URL https://arxiv.org/abs/2106.09685

Pith/arXiv arXiv 2022
[13]

ORPO: Monolithic Preference Optimization without Reference Model

Jiwoo Hong, Noah Lee, and James Thorne. ORPO: Monolithic Preference Optimization without Reference Model. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024. URL https://arxiv.org/abs/2403.07691

Pith/arXiv arXiv 2024
[14]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2023. URL https://arxiv.org...

Pith/arXiv arXiv 2023
[15]

Large Language Models are not Fair Evaluators

Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large Language Models are not Fair Evaluators. 2023. URL https://arxiv.org/abs/2305.17926

Pith/arXiv arXiv 2023
[16]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL Technical Report. a...

Pith/arXiv arXiv 2025
[17]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Xuehui Wang, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, Hao Li, Jiahao Wang, Nianchen Deng, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingcheng Zh...

Pith/arXiv arXiv 2025
[18]

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

2024
[19]

SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement

Mark Boss, Zixuan Huang, Aaryaman Vasishta, and Varun Jampani. SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement. arXiv:2408.00653 [cs.CV], 2024. URL https://arxiv.org/abs/2408.00653

arXiv 2024
[20]

TripoSR: Fast 3D Object Reconstruction from a Single Image

Dmitry Tochilkin, David Pankratz, Zexiang Liu, Zixuan Huang, Adam Letts, Yangguang Li, Ding Liang, Christian Laforte, Varun Jampani, and Yan-Pei Cao. TripoSR: Fast 3D Object Reconstruction from a Single Image. arXiv:2403.02151 [cs.CV], 2024. URL https://arxiv.org/abs/2403.02151

Pith/arXiv arXiv 2024
[21]

3D-FUTURE: 3D Furniture Shape with TextURE

Huan Fu, Rongfei Jia, Lin Gao, Mingming Gong, Binqiang Zhao, Steve Maybank, and Dacheng Tao. 3D-FUTURE: 3D Furniture Shape with TextURE. In International Journal of Computer Vision (IJCV), 2021. URL https://arxiv.org/abs/2009.09633

arXiv 2021
[22]

Flexible Isosurface Extraction for Gradient-Based Mesh Optimization

Tianchang Shen, Jacob Munkberg, Jon Hasselgren, Kangxue Yin, Zian Wang, Wenzheng Chen, Zan Gojcic, Sanja Fidler, Nicholas Sharp, and Jun Gao. Flexible Isosurface Extraction for Gradient-Based Mesh Optimization. ACM Transactions on Graphics (TOG), 42(4), 2023. URL https://arxiv.org/abs/2308.05371

arXiv 2023
