pith. machine review for the scientific record.

arxiv: 2604.13035 · v1 · submitted 2026-04-14 · 💻 cs.CV · cs.CL

Recognition: unknown

SceneCritic: A Symbolic Evaluator for 3D Indoor Scene Synthesis

Kai Ao, Kathakoli Sengupta, Paola Cascante-Bonilla

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:31 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords 3D scene synthesis · scene evaluation · spatial ontology · symbolic evaluator · indoor layouts · LLM critic · VLM refinement · human alignment

The pith

SceneCritic evaluates 3D indoor scene layouts using a symbolic spatial ontology, aligning more closely with human judgments than vision-language model evaluators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SceneCritic to assess AI-generated indoor scenes for spatial plausibility at the floor-plan level. It builds SceneOnto by aggregating object relationship priors from existing datasets and uses this structure to check semantic fit, orientations, and geometry across object pairs. This matters because evaluators based on LLMs or VLMs that score rendered images often fluctuate with viewpoint, prompt wording, or model hallucinations, making it hard to know if a scene is truly coherent. SceneCritic supplies targeted violation reports instead of opaque scores and is tested in an iterative refinement setup across rule-based, text-only LLM, and image-based VLM feedback modes.
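To make the contrast with opaque scalar scores concrete, here is a hypothetical sketch of what a relationship-level violation report could look like; the field names and example objects are illustrative assumptions, not the paper's actual output format.

    # Hypothetical relationship-level violation report, in contrast to a single
    # opaque score; field names and example values are assumptions for illustration.
    from dataclasses import dataclass

    @dataclass
    class RelationViolation:
        subject: str    # object being checked, e.g. "tv_1"
        anchor: str     # object it is checked against, e.g. "sofa_1"
        relation: str   # expected relation, e.g. "faces"
        check: str      # which check failed: "semantic", "orientation", or "geometric"
        detail: str     # human-readable explanation usable as refinement feedback

    report = [
        RelationViolation("tv_1", "sofa_1", "faces", "orientation",
                          "front vector of tv_1 points away from sofa_1"),
        RelationViolation("chair_3", "desk_1", "adjacent_to", "geometric",
                          "bounding boxes overlap instead of abutting"),
    ]

A report of this kind can be handed to a refinement loop as targeted feedback, whereas a scalar score only says that something, somewhere, is off.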

Core claim

SceneCritic is a symbolic evaluator for floor-plan-level layouts that traverses SceneOnto, a spatial ontology constructed by aggregating indoor scene priors from 3D-FRONT, ScanNet, and Visual Genome, to jointly verify semantic, orientation, and geometric coherence across object relationships. It supplies object-level and relationship-level assessments that identify specific violations and successful placements. Experiments with an accompanying refinement test bed show that SceneCritic aligns substantially better with human judgments than VLM-based evaluators, that text-only LLMs can outperform VLMs on semantic layout quality, and that image-based VLM refinement is the most effective critic modality for semantic and orientation correction.

What carries the argument

SceneOnto, the structured spatial ontology of typical indoor object relationships and constraints that SceneCritic traverses to produce coherence assessments.
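As a rough illustration of what such an ontology might encode, the sketch below pairs object-category tuples with semantic, orientation, and distance priors; the schema, relation names, and numeric ranges are assumptions, since the paper's actual SceneOnto format is not reproduced here.

    # Assumed shape of SceneOnto-style prior entries for object-category pairs;
    # keys, relation names, and numeric ranges are illustrative, not from the paper.
    scene_onto_priors = {
        ("tv", "sofa"): {
            "relations": ["faces", "across_from"],   # semantically plausible relations
            "must_face": True,                        # orientation expectation
            "distance_m": (1.5, 4.0),                 # plausible separation range in meters
        },
        ("bed", "nightstand"): {
            "relations": ["beside", "adjacent_to"],
            "must_face": False,
            "distance_m": (0.0, 0.6),
        },
    }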

If this is right

  • SceneCritic supplies concrete, object-specific and relationship-specific feedback that can guide targeted fixes during scene generation.
  • Text-only LLM critics can achieve higher semantic quality scores than VLMs when operating directly on layout descriptions.
  • Image-based VLM refinement produces the largest gains in correcting semantic and orientation errors during iterative improvement.
  • Symbolic evaluation reduces dependence on viewpoint or rendering choices that affect image-based judges.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same ontology-driven checking approach could be applied to outdoor or functional scene types by expanding the prior collection.
  • Embedding SceneCritic-style constraints inside generative model training loops might reduce the need for post-hoc refinement altogether.
  • Hybrid systems that combine symbolic checks with neural generators may offer a path to more consistent physical plausibility without full human oversight.

Load-bearing premise

The aggregated priors from 3D-FRONT, ScanNet, and Visual Genome form a sufficiently complete and unbiased representation of human spatial expectations across indoor environments.

What would settle it

A new human rating study on scenes drawn from cultural or architectural settings outside the source datasets, in which SceneCritic scores showed lower agreement with humans than current VLM evaluators do, would falsify the core alignment claim.

read the original abstract

Large Language Models (LLMs) and Vision-Language Models (VLMs) increasingly generate indoor scenes through intermediate structures such as layouts and scene graphs, yet evaluation still relies on LLM or VLM judges that score rendered views, making judgments sensitive to viewpoint, prompt phrasing, and hallucination. When the evaluator is unstable, it becomes difficult to determine whether a model has produced a spatially plausible scene or whether the output score reflects the choice of viewpoint, rendering, or prompt. We introduce SceneCritic, a symbolic evaluator for floor-plan-level layouts. SceneCritic's constraints are grounded in SceneOnto, a structured spatial ontology we construct by aggregating indoor scene priors from 3D-FRONT, ScanNet, and Visual Genome. SceneCritic traverses this ontology to jointly verify semantic, orientation, and geometric coherence across object relationships, providing object-level and relationship-level assessments that identify specific violations and successful placements. Furthermore, we pair SceneCritic with an iterative refinement test bed that probes how models build and revise spatial structure under different critic modalities: a rule-based critic using collision constraints as feedback, an LLM critic operating on the layout as text, and a VLM critic operating on rendered observations. Through extensive experiments, we show that (a) SceneCritic aligns substantially better with human judgments than VLM-based evaluators, (b) text-only LLMs can outperform VLMs on semantic layout quality, and (c) image-based VLM refinement is the most effective critic modality for semantic and orientation correction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SceneCritic, a symbolic evaluator for floor-plan-level 3D indoor scene layouts. SceneCritic is grounded in SceneOnto, a spatial ontology constructed by aggregating priors from 3D-FRONT, ScanNet, and Visual Genome. It jointly verifies semantic, orientation, and geometric coherence, providing object- and relationship-level assessments. The work also presents an iterative refinement testbed comparing rule-based, text-only LLM, and image-based VLM critics, with claims that (a) SceneCritic aligns substantially better with human judgments than VLM evaluators, (b) text-only LLMs can outperform VLMs on semantic layout quality, and (c) image-based VLM refinement is most effective for semantic and orientation corrections.

Significance. If the alignment and modality results hold under rigorous validation, this provides a stable, interpretable, and violation-specific alternative to unstable VLM judges for scene synthesis evaluation. The symbolic approach could improve reproducibility in 3D scene generation research and enable more targeted model refinement.

major comments (2)
  1. [Section 3] Section 3 (SceneOnto construction): The aggregation of priors from 3D-FRONT, ScanNet, and Visual Genome is described without any reported coverage statistics, completeness metrics, or analysis of potential dataset biases (e.g., under-representation of cultural variations or non-Western spatial semantics). This assumption is load-bearing for the central claim that SceneCritic serves as a faithful proxy for human spatial expectations and thus aligns better with human judgments.
  2. [Section 5] Section 5 (human alignment and modality experiments): The manuscript reports alignment with human judgments and comparative results but provides no quantitative details such as Pearson/Spearman correlations, inter-rater agreement scores (e.g., Fleiss' kappa), error bars, dataset splits, or statistical significance tests. Without these, the claims of 'substantially better' alignment and modality superiority cannot be fully assessed for robustness.
minor comments (2)
  1. The abstract would be strengthened by including one or two key quantitative results (e.g., correlation values) to support the alignment claims.
  2. Notation for SceneOnto traversal and constraint checking could be formalized with a small pseudocode listing or equation to improve clarity for readers unfamiliar with the ontology structure.
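In the spirit of the second minor comment, a minimal Python sketch of an ontology-driven traversal and constraint check might look as follows; the geometry helpers, thresholds, and toy layout are assumptions for illustration, not the authors' actual procedure.

    # Minimal sketch of traversing pairwise priors and flagging violations; the
    # geometry helpers, thresholds, and toy layout below are assumed, not the paper's.
    import itertools
    import math

    def _distance(a, b):
        # Euclidean distance between object centers on the floor plane.
        return math.hypot(a["pos"][0] - b["pos"][0], a["pos"][1] - b["pos"][1])

    def _faces(a, b, tol_deg=45.0):
        # True if a's front direction (yaw, radians) points roughly toward b's center.
        angle_to_b = math.atan2(b["pos"][1] - a["pos"][1], b["pos"][0] - a["pos"][0])
        diff = abs(angle_to_b - a["yaw"])
        diff = min(diff, 2 * math.pi - diff)
        return math.degrees(diff) <= tol_deg

    def evaluate_layout(objects, priors):
        """objects: dicts with 'name', 'category', 'pos' (x, y), 'yaw' (radians)."""
        violations = []
        for a, b in itertools.permutations(objects, 2):
            prior = priors.get((a["category"], b["category"]))
            if prior is None:
                continue  # no prior recorded for this ordered pair, nothing to check
            lo, hi = prior["distance_m"]
            if not lo <= _distance(a, b) <= hi:
                violations.append((a["name"], b["name"], "geometric"))
            if prior.get("must_face") and not _faces(a, b):
                violations.append((a["name"], b["name"], "orientation"))
        return violations

    # Toy usage: a TV at a sensible distance from the sofa but rotated away from it.
    layout = [
        {"name": "sofa_1", "category": "sofa", "pos": (0.0, 0.0), "yaw": 0.0},
        {"name": "tv_1", "category": "tv", "pos": (2.5, 0.0), "yaw": 0.0},
    ]
    priors = {("tv", "sofa"): {"distance_m": (1.5, 4.0), "must_face": True}}
    print(evaluate_layout(layout, priors))  # [('tv_1', 'sofa_1', 'orientation')]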

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and describe the revisions we will incorporate to improve transparency and statistical rigor.

read point-by-point responses
  1. Referee: [Section 3] Section 3 (SceneOnto construction): The aggregation of priors from 3D-FRONT, ScanNet, and Visual Genome is described without any reported coverage statistics, completeness metrics, or analysis of potential dataset biases (e.g., under-representation of cultural variations or non-Western spatial semantics). This assumption is load-bearing for the central claim that SceneCritic serves as a faithful proxy for human spatial expectations and thus aligns better with human judgments.

    Authors: We agree that Section 3 would benefit from explicit quantitative reporting on the ontology construction process. While the manuscript outlines the aggregation of priors from the three source datasets, it does not provide coverage statistics, completeness metrics, or bias analysis. In the revised manuscript we will add a dedicated paragraph and accompanying table that reports (i) the number of unique semantic relations and object categories contributed by each dataset, (ii) pairwise overlap statistics, and (iii) a brief discussion of dataset biases, including the predominantly Western-centric nature of the source collections and the consequent limitations for non-Western spatial semantics. These additions will make the grounding of SceneCritic more transparent while leaving the core experimental claims unchanged. revision: yes

  2. Referee: [Section 5] Section 5 (human alignment and modality experiments): The manuscript reports alignment with human judgments and comparative results but provides no quantitative details such as Pearson/Spearman correlations, inter-rater agreement scores (e.g., Fleiss' kappa), error bars, dataset splits, or statistical significance tests. Without these, the claims of 'substantially better' alignment and modality superiority cannot be fully assessed for robustness.

    Authors: We concur that the experimental reporting in Section 5 lacks the quantitative statistical details necessary for full assessment. The current manuscript presents comparative alignment results qualitatively. In the revision we will expand Section 5 to include (i) Pearson and Spearman correlation coefficients between SceneCritic scores and human ratings, (ii) Fleiss' kappa for inter-rater agreement among human annotators, (iii) error bars derived from multiple evaluation runs, (iv) explicit description of the dataset splits used for the human study and refinement experiments, and (v) results of statistical significance tests (paired t-tests or Wilcoxon signed-rank tests) comparing SceneCritic against VLM baselines. These metrics are derivable from the existing human-study data and will be added without altering the experimental protocol. revision: yes
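For concreteness, a hedged sketch of the promised alignment statistics (correlation with human ratings plus a paired significance test) is given below using SciPy; the arrays are placeholder values, not numbers from the paper, and Fleiss' kappa would additionally require the raw per-annotator ratings.

    # Placeholder sketch of alignment statistics; the scores below are invented
    # illustration data, not results reported by the paper.
    import numpy as np
    from scipy.stats import pearsonr, spearmanr, wilcoxon

    human       = np.array([4.0, 2.5, 3.0, 5.0, 1.5, 4.5])  # mean human rating per scene
    scenecritic = np.array([3.8, 2.0, 3.2, 4.9, 1.8, 4.4])  # symbolic evaluator scores
    vlm_judge   = np.array([3.0, 3.5, 2.0, 4.0, 2.5, 3.0])  # VLM judge scores

    print("Pearson  (SceneCritic vs human):", pearsonr(scenecritic, human))
    print("Spearman (SceneCritic vs human):", spearmanr(scenecritic, human))
    print("Spearman (VLM judge vs human):  ", spearmanr(vlm_judge, human))

    # Paired significance test on per-scene absolute errors against the human reference.
    print("Wilcoxon:", wilcoxon(np.abs(scenecritic - human), np.abs(vlm_judge - human)))
    # Inter-rater agreement (Fleiss' kappa) would be computed from the raw
    # per-annotator rating matrix, e.g. with statsmodels.stats.inter_rater.fleiss_kappa.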

Circularity Check

0 steps flagged

No circularity: SceneCritic and SceneOnto are defined from external dataset aggregation and evaluated experimentally against human judgments.

full rationale

The paper constructs SceneOnto by aggregating priors from 3D-FRONT, ScanNet, and Visual Genome, then uses it to define symbolic constraints for SceneCritic. Performance claims (better alignment with humans, modality comparisons) rest on experimental results rather than any equations, fitted parameters, or self-citations that reduce the outcomes to the inputs by construction. No self-definitional loops, renamed known results, or load-bearing self-citations appear in the provided text. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claims rest on the assumption that dataset-derived spatial priors are representative and that symbolic constraints can proxy human spatial judgment. No free parameters are described. One invented entity (SceneOnto) and one new system (SceneCritic) are introduced without external falsifiable predictions beyond the reported human alignment.

axioms (1)
  • domain assumption Indoor scene priors aggregated from 3D-FRONT, ScanNet, and Visual Genome constitute a sufficient and unbiased basis for semantic, orientation, and geometric constraints.
    Invoked when constructing SceneOnto and when claiming the evaluator identifies 'specific violations'.
invented entities (2)
  • SceneOnto no independent evidence
    purpose: Structured spatial ontology that encodes object relationships, orientations, and geometry for verification.
    Newly constructed by aggregating priors; no independent evidence outside the paper's own experiments is provided.
  • SceneCritic no independent evidence
    purpose: Symbolic evaluator that traverses the ontology to produce object-level and relationship-level assessments.
    Core contribution; its superiority is claimed via human alignment experiments.

pith-pipeline@v0.9.0 · 5579 in / 1457 out tokens · 29505 ms · 2026-05-10T15:31:50.823611+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

46 extracted references · 23 canonical work pages · 6 internal anchors

  1. [1] AI, M.: Introducing Llama 4: Advancing multimodal intelligence (2024), https://ai.meta.com/blog/llama-4-multimodal-intelligence/
  2. [2] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-VL technical report (2025), https://arxiv.org/abs/2502.13923
  3. [3] Bai, Z., Wang, P., Xiao, T., He, T., Han, Z., Zhang, Z., Shou, M.Z.: Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930 (2024)
  4. [4] Bian, Z., Ren, R., Yang, Y., Callison-Burch, C.: Holodeck 2.0: Vision-language-guided 3D world generation with editing. arXiv preprint arXiv:2508.05899 (2025)
  5. [5] Çelen, A., Han, G., Schindler, K., Van Gool, L., Armeni, I., Obukhov, A., Wang, X.: I-Design: Personalized LLM interior designer. In: European Conference on Computer Vision (2024)
  6. [6] Chen, B., Xu, Z., Kirmani, S., Ichter, B., Sadigh, D., Guibas, L., Xia, F.: SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)
  7. [7] Chen, C., Hsu, Y., Liu, Y., Sun, W., Ni, T., Lee, C., Sun, M., Yang, Y.: SceneFoundry: Generating interactive infinite 3D worlds. arXiv preprint arXiv:2601.05810 (2026)
  8. [8] Chen, W., Chi, D., Liu, Y., Yang, Y., Zhang, Y., Zhuang, Y., Quan, X., Hao, J., Li, G., Lin, L.: AutoLayout: Closed-loop layout synthesis via slow-fast collaborative reasoning. arXiv preprint arXiv:2507.04293 (2025)
  9. [9] Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)
  10. [10] Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
  11. [11] Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E., Schmidt, L., Ehsani, K., Kembhavi, A., Farhadi, A.: Objaverse: A universe of annotated 3D objects. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)
  12. [12] Deitke, M., VanderBilt, E., Herrasti, A., Weihs, L., Salvador, J., Ehsani, K., Han, W., Kolve, E., Farhadi, A., Kembhavi, A., Mottaghi, R.: ProcTHOR: Large-scale embodied AI using procedural generation. In: NeurIPS (2022), Outstanding Paper Award
  13. [13] Dong, H., Xiong, W., Pang, B., Wang, H., Zhao, H., Zhou, Y., Jiang, N., Sahoo, D., Xiong, C., Zhang, T.: RLHF Workflow: From reward modeling to online RLHF. arXiv preprint arXiv:2405.07863 (2024)
  14. [14] Feng, W., Zhu, W., Fu, T.J., Jampani, V., Akula, A., He, X., Basu, S., Wang, X.E., Wang, W.Y.: LayoutGPT: Compositional visual planning and generation with large language models. Advances in Neural Information Processing Systems (2023)
  15. [15] Fu, H., Cai, B., Gao, L., Zhang, L.X., Wang, J., Li, C., Zeng, Q., Sun, C., Jia, R., Zhao, B., et al.: 3D-FRONT: 3D furnished rooms with layouts and semantics. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2021)
  16. [16] Gu, Q., Kuwajerwala, A., Morin, S., Jatavallabhula, K.M., Sen, B., Agarwal, A., Rivera, C., Paul, W., Ellis, K., Chellappa, R., et al.: ConceptGraphs: Open-vocabulary 3D scene graphs for perception and planning. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). pp. 5021–5028. IEEE (2024)
  17. [17] Gu, Z., Zeng, Z., Xu, Z., Zhou, X., Shen, S., Liu, Y., Zhou, B., Meng, C., Xia, T., Chen, W., et al.: UI-Venus technical report: Building high-performance UI agents with RFT. arXiv preprint arXiv:2508.10833 (2025)
  18. [18] Hu, Z., Iscen, A., Jain, A., Kipf, T., Yue, Y., Ross, D.A., Schmid, C., Fathi, A.: SceneCraft: An LLM agent for synthesizing 3D scenes as Blender code. In: Forty-first International Conference on Machine Learning (2024)
  19. [19] Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., et al.: A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems 43(2), 1–55 (2025)
  20. [20] Kolve, E., Mottaghi, R., Han, W., VanderBilt, E., Weihs, L., Herrasti, A., Gordon, D., Zhu, Y., Gupta, A., Farhadi, A.: AI2-THOR: An interactive 3D environment for visual AI. arXiv (2017)
  21. [21] Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., et al.: Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision (2017)
  22. [22] Leng, S., Zhang, H., Chen, G., Li, X., Lu, S., Miao, C., Bing, L.: Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)
  23. [23] Li, C., Zhang, C., Zhou, H., Collier, N., Korhonen, A., Vulić, I.: TopViewRS: Vision-language models as top-view spatial reasoners. In: Al-Onaizan, Y., Bansal, M., Chen, Y.N. (eds.) Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 1786–1807. Association for Computational Linguistics, Miami, Florida, USA (Nov 2024)
  24. [24] Ling, L., Lin, C.H., Lin, T.Y., Ding, Y., Zeng, Y., Sheng, Y., Ge, Y., Liu, M.Y., Bera, A., Li, Z.: Scenethesis: A language and vision agentic framework for 3D scene generation. arXiv preprint arXiv:2505.02836 (2025)
  25. [25] Littlefair, G., Dutt, N.S., Mitra, N.J.: FlairGPT: Repurposing LLMs for interior designs. In: Computer Graphics Forum (2025)
  26. [26] Liu, H., Xue, W., Chen, Y., Chen, D., Zhao, X., Wang, K., Hou, L., Li, R., Peng, W.: A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253 (2024)
  27. [27] Liu, P., Li, C., Li, Z., Wu, Y., Li, W., Yang, Z., Zhang, Z., Lin, Y., Han, S., Feng, B.Y.: IR3D-Bench: Evaluating vision-language model scene understanding as agentic inverse rendering. arXiv preprint arXiv:2506.23329 (2025)
  28. [28] Liu, X., Tai, Y.W., Tang, C.K.: Agentic 3D scene generation with spatially contextualized VLMs. arXiv preprint arXiv:2505.20129 (2025)
  29. [29] Ran, X., Li, Y., Xu, L., Yu, M., Dai, B.: Direct numerical layout generation for 3D indoor scene synthesis via spatial reasoning. arXiv preprint arXiv:2506.05341 (2025)
  30. [30] Savva, M., Kadian, A., Maksymets, O., Zhao, Y., Wijmans, E., Jain, B., Straub, J., Liu, J., Koltun, V., Malik, J., et al.: Habitat: A platform for embodied AI research. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9339–9347 (2019)
  31. [31] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)
  32. [32] Sun, F.Y., Liu, W., Gu, S., Lim, D., Bhat, G., Tombari, F., Li, M., Haber, N., Wu, J.: LayoutVLM: Differentiable optimization of 3D layout via vision-language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference (2025)
  33. [33] Tam, H.I.I., Pun, H.I.D., Wang, A.T., Chang, A.X., Savva, M.: SceneEval: Evaluating semantic coherence in text-conditioned 3D indoor scene synthesis (2025)
  34. [34] Tong, S., Liu, Z., Zhai, Y., Ma, Y., LeCun, Y., Xie, S.: Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)
  35. [35] Wang, H., Qu, C., Huang, Z., Chu, W., Lin, F., Chen, W.: VL-Rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. arXiv preprint arXiv:2504.08837 (2025)
  36. [36] Wang, X., Liu, L., Cao, Y., Wu, R., Qin, W., Wang, D., Sui, W., Su, Z.: EmbodiedGen: Towards a generative 3D world engine for embodied intelligence. arXiv preprint arXiv:2506.10600 (2025)
  37. [37] Xia, H., Li, X., Li, Z., Ma, Q., Xu, J., Liu, M.Y., Cui, Y., Lin, T.Y., Ma, W.C., Wang, S., et al.: SAGE: Scalable agentic 3D scene generation for embodied AI. arXiv preprint arXiv:2602.10116 (2026)
  38. [38] Xie, T., Zong, Z., Qiu, Y., Li, X., Feng, Y., Yang, Y., Jiang, C.: PhysGaussian: Physics-integrated 3D Gaussians for generative dynamics. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)
  39. [39] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)
  40. [40] Yang, Y., Jia, B., Zhi, P., Huang, S.: PhyScene: Physically interactable 3D scene synthesis for embodied AI. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)
  41. [41] Yang, Y., Lu, J., Zhao, Z., Luo, Z., Yu, J.J., Sanchez, V., Zheng, F.: LLplace: The 3D indoor scene layout generation and editing via large language model. arXiv preprint arXiv:2406.03866 (2024)
  42. [43] Yang, Y., Luo, Z., Ding, T., Lu, J., Gao, M., Yang, J., Sanchez, V., Zheng, F.: OptiScene: LLM-driven indoor scene layout generation via scaled human-aligned data synthesis and multi-stage preference optimization. arXiv preprint arXiv:2506.07570 (2025)
  43. [44] Yang, Y., Sun, F.Y., Weihs, L., VanderBilt, E., Herrasti, A., Han, W., Wu, J., Haber, N., Krishna, R., Liu, L., et al.: Holodeck: Language guided generation of 3D embodied AI environments. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)
  44. [45] Zhao, X., Kang, Z., Feng, A., Levine, S., Song, D.: Learning to reason without external rewards. arXiv preprint arXiv:2505.19590 (2025)
  45. [46] Zheng, K., Zha, R., Xu, Z., Gu, J., Yang, J., Wang, X.E.: Constructing a 3D scene from a single image. arXiv preprint arXiv:2505.15765 (2025)
  46. [47] Zhou, Y., While, Z., Kalogerakis, E.: SceneGraphNet: Neural message passing for 3D indoor scene augmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7384–7392 (2019)