pith. machine review for the scientific record.

arxiv: 2604.13035 · v1 · submitted 2026-04-14 · 💻 cs.CV · cs.CL

Recognition: unknown

SceneCritic: A Symbolic Evaluator for 3D Indoor Scene Synthesis

Kai Ao, Kathakoli Sengupta, Paola Cascante-Bonilla

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:31 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords 3D scene synthesis · scene evaluation · spatial ontology · symbolic evaluator · indoor layouts · LLM critic · VLM refinement · human alignment

The pith

SceneCritic evaluates 3D indoor scene layouts using a symbolic spatial ontology, aligning more closely with human judgments than vision-language model evaluators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SceneCritic to assess AI-generated indoor scenes for spatial plausibility at the floor-plan level. It builds SceneOnto by aggregating object relationship priors from existing datasets and uses this structure to check semantic fit, orientations, and geometry across object pairs. This matters because evaluators based on LLMs or VLMs that score rendered images often fluctuate with viewpoint, prompt wording, or model hallucinations, making it hard to know if a scene is truly coherent. SceneCritic supplies targeted violation reports instead of opaque scores and is tested in an iterative refinement setup across rule-based, text-only LLM, and image-based VLM feedback modes.
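To make the contrast with opaque scalar scores concrete, here is a hypothetical sketch of what a relationship-level violation report could look like; the field names and example objects are illustrative assumptions, not the paper's actual output format.

    # Hypothetical relationship-level violation report, in contrast to a single
    # opaque score; field names and example values are assumptions for illustration.
    from dataclasses import dataclass

    @dataclass
    class RelationViolation:
        subject: str    # object being checked, e.g. "tv_1"
        anchor: str     # object it is checked against, e.g. "sofa_1"
        relation: str   # expected relation, e.g. "faces"
        check: str      # which check failed: "semantic", "orientation", or "geometric"
        detail: str     # human-readable explanation usable as refinement feedback

    report = [
        RelationViolation("tv_1", "sofa_1", "faces", "orientation",
                          "front vector of tv_1 points away from sofa_1"),
        RelationViolation("chair_3", "desk_1", "adjacent_to", "geometric",
                          "bounding boxes overlap instead of abutting"),
    ]

A report of this kind can be handed to a refinement loop as targeted feedback, whereas a scalar score only says that something, somewhere, is off.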

Core claim

SceneCritic is a symbolic evaluator for floor-plan-level layouts that traverses SceneOnto, a spatial ontology constructed by aggregating indoor scene priors from 3D-FRONT, ScanNet, and Visual Genome, to jointly verify semantic, orientation, and geometric coherence across object relationships. It supplies object-level and relationship-level assessments that identify specific violations and successful placements. Experiments with an accompanying refinement test bed show that SceneCritic aligns substantially better with human judgments than VLM-based evaluators, that text-only LLMs can outperform VLMs on semantic layout quality, and that image-based VLM refinement is the most effective critic modality for semantic and orientation correction.

What carries the argument

SceneOnto, the structured spatial ontology of typical indoor object relationships and constraints that SceneCritic traverses to produce coherence assessments.
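As a rough illustration of what such an ontology might encode, the sketch below pairs object-category tuples with semantic, orientation, and distance priors; the schema, relation names, and numeric ranges are assumptions, since the paper's actual SceneOnto format is not reproduced here.

    # Assumed shape of SceneOnto-style prior entries for object-category pairs;
    # keys, relation names, and numeric ranges are illustrative, not from the paper.
    scene_onto_priors = {
        ("tv", "sofa"): {
            "relations": ["faces", "across_from"],   # semantically plausible relations
            "must_face": True,                        # orientation expectation
            "distance_m": (1.5, 4.0),                 # plausible separation range in meters
        },
        ("bed", "nightstand"): {
            "relations": ["beside", "adjacent_to"],
            "must_face": False,
            "distance_m": (0.0, 0.6),
        },
    }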

If this is right

  • SceneCritic supplies concrete, object-specific and relationship-specific feedback that can guide targeted fixes during scene generation.
  • Text-only LLM critics can achieve higher semantic quality scores than VLMs when operating directly on layout descriptions.
  • Image-based VLM refinement produces the largest gains in correcting semantic and orientation errors during iterative improvement.
  • Symbolic evaluation reduces dependence on viewpoint or rendering choices that affect image-based judges.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same ontology-driven checking approach could be applied to outdoor or functional scene types by expanding the prior collection.
  • Embedding SceneCritic-style constraints inside generative model training loops might reduce the need for post-hoc refinement altogether.
  • Hybrid systems that combine symbolic checks with neural generators may offer a path to more consistent physical plausibility without full human oversight.

Load-bearing premise

The aggregated priors from 3D-FRONT, ScanNet, and Visual Genome form a sufficiently complete and unbiased representation of human spatial expectations across indoor environments.

What would settle it

A new human rating study on scenes drawn from cultural or architectural settings outside the source datasets, in which SceneCritic scores showed lower agreement with humans than current VLM evaluators do, would falsify the core alignment claim.

read the original abstract

Large Language Models (LLMs) and Vision-Language Models (VLMs) increasingly generate indoor scenes through intermediate structures such as layouts and scene graphs, yet evaluation still relies on LLM or VLM judges that score rendered views, making judgments sensitive to viewpoint, prompt phrasing, and hallucination. When the evaluator is unstable, it becomes difficult to determine whether a model has produced a spatially plausible scene or whether the output score reflects the choice of viewpoint, rendering, or prompt. We introduce SceneCritic, a symbolic evaluator for floor-plan-level layouts. SceneCritic's constraints are grounded in SceneOnto, a structured spatial ontology we construct by aggregating indoor scene priors from 3D-FRONT, ScanNet, and Visual Genome. SceneCritic traverses this ontology to jointly verify semantic, orientation, and geometric coherence across object relationships, providing object-level and relationship-level assessments that identify specific violations and successful placements. Furthermore, we pair SceneCritic with an iterative refinement test bed that probes how models build and revise spatial structure under different critic modalities: a rule-based critic using collision constraints as feedback, an LLM critic operating on the layout as text, and a VLM critic operating on rendered observations. Through extensive experiments, we show that (a) SceneCritic aligns substantially better with human judgments than VLM-based evaluators, (b) text-only LLMs can outperform VLMs on semantic layout quality, and (c) image-based VLM refinement is the most effective critic modality for semantic and orientation correction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SceneCritic, a symbolic evaluator for floor-plan-level 3D indoor scene layouts. SceneCritic is grounded in SceneOnto, a spatial ontology constructed by aggregating priors from 3D-FRONT, ScanNet, and Visual Genome. It jointly verifies semantic, orientation, and geometric coherence, providing object- and relationship-level assessments. The work also presents an iterative refinement testbed comparing rule-based, text-only LLM, and image-based VLM critics, with claims that (a) SceneCritic aligns substantially better with human judgments than VLM evaluators, (b) text-only LLMs can outperform VLMs on semantic layout quality, and (c) image-based VLM refinement is most effective for semantic and orientation corrections.

Significance. If the alignment and modality results hold under rigorous validation, this provides a stable, interpretable, and violation-specific alternative to unstable VLM judges for scene synthesis evaluation. The symbolic approach could improve reproducibility in 3D scene generation research and enable more targeted model refinement.

major comments (2)
  1. [Section 3] Section 3 (SceneOnto construction): The aggregation of priors from 3D-FRONT, ScanNet, and Visual Genome is described without any reported coverage statistics, completeness metrics, or analysis of potential dataset biases (e.g., under-representation of cultural variations or non-Western spatial semantics). This assumption is load-bearing for the central claim that SceneCritic serves as a faithful proxy for human spatial expectations and thus aligns better with human judgments.
  2. [Section 5] Section 5 (human alignment and modality experiments): The manuscript reports alignment with human judgments and comparative results but provides no quantitative details such as Pearson/Spearman correlations, inter-rater agreement scores (e.g., Fleiss' kappa), error bars, dataset splits, or statistical significance tests. Without these, the claims of 'substantially better' alignment and modality superiority cannot be fully assessed for robustness.
minor comments (2)
  1. The abstract would be strengthened by including one or two key quantitative results (e.g., correlation values) to support the alignment claims.
  2. Notation for SceneOnto traversal and constraint checking could be formalized with a small pseudocode listing or equation to improve clarity for readers unfamiliar with the ontology structure.
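In the spirit of the second minor comment, a minimal Python sketch of an ontology-driven traversal and constraint check might look as follows; the geometry helpers, thresholds, and toy layout are assumptions for illustration, not the authors' actual procedure.

    # Minimal sketch of traversing pairwise priors and flagging violations; the
    # geometry helpers, thresholds, and toy layout below are assumed, not the paper's.
    import itertools
    import math

    def _distance(a, b):
        # Euclidean distance between object centers on the floor plane.
        return math.hypot(a["pos"][0] - b["pos"][0], a["pos"][1] - b["pos"][1])

    def _faces(a, b, tol_deg=45.0):
        # True if a's front direction (yaw, radians) points roughly toward b's center.
        angle_to_b = math.atan2(b["pos"][1] - a["pos"][1], b["pos"][0] - a["pos"][0])
        diff = abs(angle_to_b - a["yaw"])
        diff = min(diff, 2 * math.pi - diff)
        return math.degrees(diff) <= tol_deg

    def evaluate_layout(objects, priors):
        """objects: dicts with 'name', 'category', 'pos' (x, y), 'yaw' (radians)."""
        violations = []
        for a, b in itertools.permutations(objects, 2):
            prior = priors.get((a["category"], b["category"]))
            if prior is None:
                continue  # no prior recorded for this ordered pair, nothing to check
            lo, hi = prior["distance_m"]
            if not lo <= _distance(a, b) <= hi:
                violations.append((a["name"], b["name"], "geometric"))
            if prior.get("must_face") and not _faces(a, b):
                violations.append((a["name"], b["name"], "orientation"))
        return violations

    # Toy usage: a TV at a sensible distance from the sofa but rotated away from it.
    layout = [
        {"name": "sofa_1", "category": "sofa", "pos": (0.0, 0.0), "yaw": 0.0},
        {"name": "tv_1", "category": "tv", "pos": (2.5, 0.0), "yaw": 0.0},
    ]
    priors = {("tv", "sofa"): {"distance_m": (1.5, 4.0), "must_face": True}}
    print(evaluate_layout(layout, priors))  # [('tv_1', 'sofa_1', 'orientation')]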

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and describe the revisions we will incorporate to improve transparency and statistical rigor.

read point-by-point responses
  1. Referee: [Section 3] Section 3 (SceneOnto construction): The aggregation of priors from 3D-FRONT, ScanNet, and Visual Genome is described without any reported coverage statistics, completeness metrics, or analysis of potential dataset biases (e.g., under-representation of cultural variations or non-Western spatial semantics). This assumption is load-bearing for the central claim that SceneCritic serves as a faithful proxy for human spatial expectations and thus aligns better with human judgments.

    Authors: We agree that Section 3 would benefit from explicit quantitative reporting on the ontology construction process. While the manuscript outlines the aggregation of priors from the three source datasets, it does not provide coverage statistics, completeness metrics, or bias analysis. In the revised manuscript we will add a dedicated paragraph and accompanying table that reports (i) the number of unique semantic relations and object categories contributed by each dataset, (ii) pairwise overlap statistics, and (iii) a brief discussion of dataset biases, including the predominantly Western-centric nature of the source collections and the consequent limitations for non-Western spatial semantics. These additions will make the grounding of SceneCritic more transparent while leaving the core experimental claims unchanged. revision: yes

  2. Referee: [Section 5] Section 5 (human alignment and modality experiments): The manuscript reports alignment with human judgments and comparative results but provides no quantitative details such as Pearson/Spearman correlations, inter-rater agreement scores (e.g., Fleiss' kappa), error bars, dataset splits, or statistical significance tests. Without these, the claims of 'substantially better' alignment and modality superiority cannot be fully assessed for robustness.

    Authors: We concur that the experimental reporting in Section 5 lacks the quantitative statistical details necessary for full assessment. The current manuscript presents comparative alignment results qualitatively. In the revision we will expand Section 5 to include (i) Pearson and Spearman correlation coefficients between SceneCritic scores and human ratings, (ii) Fleiss' kappa for inter-rater agreement among human annotators, (iii) error bars derived from multiple evaluation runs, (iv) explicit description of the dataset splits used for the human study and refinement experiments, and (v) results of statistical significance tests (paired t-tests or Wilcoxon signed-rank tests) comparing SceneCritic against VLM baselines. These metrics are derivable from the existing human-study data and will be added without altering the experimental protocol. revision: yes
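For concreteness, a hedged sketch of the promised alignment statistics (correlation with human ratings plus a paired significance test) is given below using SciPy; the arrays are placeholder values, not numbers from the paper, and Fleiss' kappa would additionally require the raw per-annotator ratings.

    # Placeholder sketch of alignment statistics; the scores below are invented
    # illustration data, not results reported by the paper.
    import numpy as np
    from scipy.stats import pearsonr, spearmanr, wilcoxon

    human       = np.array([4.0, 2.5, 3.0, 5.0, 1.5, 4.5])  # mean human rating per scene
    scenecritic = np.array([3.8, 2.0, 3.2, 4.9, 1.8, 4.4])  # symbolic evaluator scores
    vlm_judge   = np.array([3.0, 3.5, 2.0, 4.0, 2.5, 3.0])  # VLM judge scores

    print("Pearson  (SceneCritic vs human):", pearsonr(scenecritic, human))
    print("Spearman (SceneCritic vs human):", spearmanr(scenecritic, human))
    print("Spearman (VLM judge vs human):  ", spearmanr(vlm_judge, human))

    # Paired significance test on per-scene absolute errors against the human reference.
    print("Wilcoxon:", wilcoxon(np.abs(scenecritic - human), np.abs(vlm_judge - human)))
    # Inter-rater agreement (Fleiss' kappa) would be computed from the raw
    # per-annotator rating matrix, e.g. with statsmodels.stats.inter_rater.fleiss_kappa.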

Circularity Check

0 steps flagged

No circularity: SceneCritic and SceneOnto are defined from external dataset aggregation and evaluated experimentally against human judgments.

full rationale

The paper constructs SceneOnto by aggregating priors from 3D-FRONT, ScanNet, and Visual Genome, then uses it to define symbolic constraints for SceneCritic. Performance claims (better alignment with humans, modality comparisons) rest on experimental results rather than any equations, fitted parameters, or self-citations that reduce the outcomes to the inputs by construction. No self-definitional loops, renamed known results, or load-bearing self-citations appear in the provided text. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claims rest on the assumption that dataset-derived spatial priors are representative and that symbolic constraints can proxy human spatial judgment. No free parameters are described. One invented entity (SceneOnto) and one new system (SceneCritic) are introduced without external falsifiable predictions beyond the reported human alignment.

axioms (1)
  • domain assumption Indoor scene priors aggregated from 3D-FRONT, ScanNet, and Visual Genome constitute a sufficient and unbiased basis for semantic, orientation, and geometric constraints.
    Invoked when constructing SceneOnto and when claiming the evaluator identifies 'specific violations'.
invented entities (2)
  • SceneOnto no independent evidence
    purpose: Structured spatial ontology that encodes object relationships, orientations, and geometry for verification.
    Newly constructed by aggregating priors; no independent evidence outside the paper's own experiments is provided.
  • SceneCritic no independent evidence
    purpose: Symbolic evaluator that traverses the ontology to produce object-level and relationship-level assessments.
    Core contribution; its superiority is claimed via human alignment experiments.

pith-pipeline@v0.9.0 · 5579 in / 1457 out tokens · 29505 ms · 2026-05-10T15:31:50.823611+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

46 extracted references · 23 canonical work pages · 6 internal anchors

  1. [1] AI, M.: Introducing Llama 4: Advancing multimodal intelligence (2024), https://ai.meta.com/blog/llama-4-multimodal-intelligence/
  2. [2] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-VL technical report (2025), https://arxiv.org/abs/2502.13923
  3. [3] Bai, Z., Wang, P., Xiao, T., He, T., Han, Z., Zhang, Z., Shou, M.Z.: Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930 (2024)
  4. [4] Bian, Z., Ren, R., Yang, Y., Callison-Burch, C.: Holodeck 2.0: Vision-language-guided 3D world generation with editing. arXiv preprint arXiv:2508.05899 (2025)
  5. [5] Çelen, A., Han, G., Schindler, K., Van Gool, L., Armeni, I., Obukhov, A., Wang, X.: I-Design: Personalized LLM interior designer. In: European Conference on Computer Vision (2024)
  6. [6] Chen, B., Xu, Z., Kirmani, S., Ichter, B., Sadigh, D., Guibas, L., Xia, F.: SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)
  7. [7] Chen, C., Hsu, Y., Liu, Y., Sun, W., Ni, T., Lee, C., Sun, M., Yang, Y.: SceneFoundry: Generating interactive infinite 3D worlds. arXiv preprint arXiv:2601.05810 (2026)
  8. [8] Chen, W., Chi, D., Liu, Y., Yang, Y., Zhang, Y., Zhuang, Y., Quan, X., Hao, J., Li, G., Lin, L.: AutoLayout: Closed-loop layout synthesis via slow-fast collaborative reasoning. arXiv preprint arXiv:2507.04293 (2025)
  9. [9] Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)
  10. [10] Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
  11. [11] Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E., Schmidt, L., Ehsani, K., Kembhavi, A., Farhadi, A.: Objaverse: A universe of annotated 3D objects. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)
  12. [12] Deitke, M., VanderBilt, E., Herrasti, A., Weihs, L., Salvador, J., Ehsani, K., Han, W., Kolve, E., Farhadi, A., Kembhavi, A., Mottaghi, R.: ProcTHOR: Large-scale embodied AI using procedural generation. In: NeurIPS (2022), Outstanding Paper Award
  13. [13] Dong, H., Xiong, W., Pang, B., Wang, H., Zhao, H., Zhou, Y., Jiang, N., Sahoo, D., Xiong, C., Zhang, T.: RLHF Workflow: From reward modeling to online RLHF. arXiv preprint arXiv:2405.07863 (2024)
  14. [14] Feng, W., Zhu, W., Fu, T.J., Jampani, V., Akula, A., He, X., Basu, S., Wang, X.E., Wang, W.Y.: LayoutGPT: Compositional visual planning and generation with large language models. Advances in Neural Information Processing Systems (2023)
  15. [15] Fu, H., Cai, B., Gao, L., Zhang, L.X., Wang, J., Li, C., Zeng, Q., Sun, C., Jia, R., Zhao, B., et al.: 3D-FRONT: 3D furnished rooms with layouts and semantics. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2021)
  16. [16] Gu, Q., Kuwajerwala, A., Morin, S., Jatavallabhula, K.M., Sen, B., Agarwal, A., Rivera, C., Paul, W., Ellis, K., Chellappa, R., et al.: ConceptGraphs: Open-vocabulary 3D scene graphs for perception and planning. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). pp. 5021–5028. IEEE (2024)
  17. [17] Gu, Z., Zeng, Z., Xu, Z., Zhou, X., Shen, S., Liu, Y., Zhou, B., Meng, C., Xia, T., Chen, W., et al.: UI-Venus technical report: Building high-performance UI agents with RFT. arXiv preprint arXiv:2508.10833 (2025)
  18. [18] Hu, Z., Iscen, A., Jain, A., Kipf, T., Yue, Y., Ross, D.A., Schmid, C., Fathi, A.: SceneCraft: An LLM agent for synthesizing 3D scenes as Blender code. In: Forty-first International Conference on Machine Learning (2024)
  19. [19] Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., et al.: A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems 43(2), 1–55 (2025)
  20. [20] Kolve, E., Mottaghi, R., Han, W., VanderBilt, E., Weihs, L., Herrasti, A., Gordon, D., Zhu, Y., Gupta, A., Farhadi, A.: AI2-THOR: An interactive 3D environment for visual AI. arXiv (2017)
  21. [21] Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., et al.: Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision (2017)
  22. [22] Leng, S., Zhang, H., Chen, G., Li, X., Lu, S., Miao, C., Bing, L.: Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)
  23. [23] Li, C., Zhang, C., Zhou, H., Collier, N., Korhonen, A., Vulić, I.: TopViewRS: Vision-language models as top-view spatial reasoners. In: Al-Onaizan, Y., Bansal, M., Chen, Y.N. (eds.) Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 1786–1807. Association for Computational Linguistics, Miami, Florida, USA (Nov 2024)
  24. [24] Ling, L., Lin, C.H., Lin, T.Y., Ding, Y., Zeng, Y., Sheng, Y., Ge, Y., Liu, M.Y., Bera, A., Li, Z.: Scenethesis: A language and vision agentic framework for 3D scene generation. arXiv preprint arXiv:2505.02836 (2025)
  25. [25] Littlefair, G., Dutt, N.S., Mitra, N.J.: FlairGPT: Repurposing LLMs for interior designs. In: Computer Graphics Forum (2025)
  26. [26] Liu, H., Xue, W., Chen, Y., Chen, D., Zhao, X., Wang, K., Hou, L., Li, R., Peng, W.: A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253 (2024)
  27. [27] Liu, P., Li, C., Li, Z., Wu, Y., Li, W., Yang, Z., Zhang, Z., Lin, Y., Han, S., Feng, B.Y.: IR3D-Bench: Evaluating vision-language model scene understanding as agentic inverse rendering. arXiv preprint arXiv:2506.23329 (2025)
  28. [28] Liu, X., Tai, Y.W., Tang, C.K.: Agentic 3D scene generation with spatially contextualized VLMs. arXiv preprint arXiv:2505.20129 (2025)
  29. [29] Ran, X., Li, Y., Xu, L., Yu, M., Dai, B.: Direct numerical layout generation for 3D indoor scene synthesis via spatial reasoning. arXiv preprint arXiv:2506.05341 (2025)
  30. [30] Savva, M., Kadian, A., Maksymets, O., Zhao, Y., Wijmans, E., Jain, B., Straub, J., Liu, J., Koltun, V., Malik, J., et al.: Habitat: A platform for embodied AI research. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9339–9347 (2019)
  31. [31] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)
  32. [32] Sun, F.Y., Liu, W., Gu, S., Lim, D., Bhat, G., Tombari, F., Li, M., Haber, N., Wu, J.: LayoutVLM: Differentiable optimization of 3D layout via vision-language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference (2025)
  33. [33] Tam, H.I.I., Pun, H.I.D., Wang, A.T., Chang, A.X., Savva, M.: SceneEval: Evaluating semantic coherence in text-conditioned 3D indoor scene synthesis (2025)
  34. [34] Tong, S., Liu, Z., Zhai, Y., Ma, Y., LeCun, Y., Xie, S.: Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)
  35. [35] Wang, H., Qu, C., Huang, Z., Chu, W., Lin, F., Chen, W.: VL-Rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. arXiv preprint arXiv:2504.08837 (2025)
  36. [36] Wang, X., Liu, L., Cao, Y., Wu, R., Qin, W., Wang, D., Sui, W., Su, Z.: EmbodiedGen: Towards a generative 3D world engine for embodied intelligence. arXiv preprint arXiv:2506.10600 (2025)
  37. [37] Xia, H., Li, X., Li, Z., Ma, Q., Xu, J., Liu, M.Y., Cui, Y., Lin, T.Y., Ma, W.C., Wang, S., et al.: SAGE: Scalable agentic 3D scene generation for embodied AI. arXiv preprint arXiv:2602.10116 (2026)
  38. [38] Xie, T., Zong, Z., Qiu, Y., Li, X., Feng, Y., Yang, Y., Jiang, C.: PhysGaussian: Physics-integrated 3D Gaussians for generative dynamics. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)
  39. [39] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)
  40. [40] Yang, Y., Jia, B., Zhi, P., Huang, S.: PhyScene: Physically interactable 3D scene synthesis for embodied AI. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)
  41. [41] Yang, Y., Lu, J., Zhao, Z., Luo, Z., Yu, J.J., Sanchez, V., Zheng, F.: LLplace: The 3D indoor scene layout generation and editing via large language model. arXiv preprint arXiv:2406.03866 (2024)
  42. [43] Yang, Y., Luo, Z., Ding, T., Lu, J., Gao, M., Yang, J., Sanchez, V., Zheng, F.: OptiScene: LLM-driven indoor scene layout generation via scaled human-aligned data synthesis and multi-stage preference optimization. arXiv preprint arXiv:2506.07570 (2025)
  43. [44] Yang, Y., Sun, F.Y., Weihs, L., VanderBilt, E., Herrasti, A., Han, W., Wu, J., Haber, N., Krishna, R., Liu, L., et al.: Holodeck: Language guided generation of 3D embodied AI environments. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)
  44. [45] Zhao, X., Kang, Z., Feng, A., Levine, S., Song, D.: Learning to reason without external rewards. arXiv preprint arXiv:2505.19590 (2025)
  45. [46] Zheng, K., Zha, R., Xu, Z., Gu, J., Yang, J., Wang, X.E.: Constructing a 3D scene from a single image. arXiv preprint arXiv:2505.15765 (2025)
  46. [47] Zhou, Y., While, Z., Kalogerakis, E.: SceneGraphNet: Neural message passing for 3D indoor scene augmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7384–7392 (2019)