arxiv: 2603.00607 · v2 · pith:C26XQVQ6new · submitted 2026-02-28 · 💻 cs.CV · cs.AI

IdGlow: Dynamic Identity Modulation for Multi-Subject Generation

Pith reviewed 2026-05-21 12:50 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords multi-subject image generation · identity preservation · stability-plasticity dilemma · flow matching · direct preference optimization · diffusion models · age transformation · group composition

The pith

IdGlow uses adaptive timestep scheduling and group-level preference optimization to balance identity preservation with natural scene composition in mask-free multi-subject generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the stability-plasticity dilemma: a method must keep reference faces recognizable while still allowing flexible scene changes and structural deformation. Current approaches rely on rigid spatial masks or localized attention, which break down on tasks such as turning adults into children within a group photo. IdGlow builds a two-stage process on flow matching diffusion models. The first stage is supervised fine-tuning with a linearly decaying timestep schedule and a gating window that injects identity only within a critical semantic window, together with a badcase-driven vision-language model that synthesizes context-aware prompts. The second stage applies weighted-margin direct preference optimization at the group level to remove artifacts and improve harmony. If effective, this removes the need for spatial inputs and produces images that are both faithful to the references and aesthetically natural.

Core claim

IdGlow is a mask-free, progressive two-stage framework on Flow Matching diffusion models. The supervised fine-tuning stage introduces linear decay timestep scheduling to relax constraints for natural group composition and a temporal gating mechanism that limits identity injection to a critical semantic window, preserving adult facial semantics without overriding child-like structures. A badcase-driven vision-language model supplies precise prompts to avoid attribute leakage and ambiguity. The second stage uses fine-grained group-level direct preference optimization with weighted margins to remove multi-subject artifacts, improve texture harmony, and align identity fidelity with real-world distributions.

What carries the argument

Task-adaptive timestep scheduling with linear decay and temporal gating that concentrates identity signals in a critical semantic window, paired with badcase-driven VLM prompt synthesis and weighted-margin group-level direct preference optimization.
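
As a rough illustration of what a linear decay schedule and a temporal gate could look like, the Python sketch below defines both as functions of normalized sampling progress. It is not the paper's implementation: the function names, the direction of the decay, the progress convention (0 at the start of sampling, 1 at the end), and the window bounds 0.3 and 0.7 are all assumptions chosen for illustration, since this page does not give the paper's actual parameters.

```python
def linear_decay_weight(t: float, w_start: float = 1.0, w_end: float = 0.0) -> float:
    """Constraint weight that moves linearly from w_start to w_end.

    t is normalized sampling progress in [0, 1] (assumed convention).
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return w_start + (w_end - w_start) * t


def temporal_gate(t: float, t_open: float = 0.3, t_close: float = 0.7) -> float:
    """Hard gate: 1.0 inside the window [t_open, t_close], 0.0 outside.

    The window bounds are hypothetical placeholders.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return 1.0 if t_open <= t <= t_close else 0.0


def gated_identity_scale(t: float, base_scale: float = 1.0) -> float:
    """Identity-injection strength at progress t under the hard gate."""
    return base_scale * temporal_gate(t)
```

A smooth gate (for example a product of two sigmoids) would avoid the abrupt switch at the window edges; the hard gate is used here only because it is the simplest form that matches the description of injecting identity inside a window and nowhere else.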

If this is right

  • Complex deformations such as age transformation become feasible while keeping multiple reference identities intact.
  • Group images can be composed without relying on explicit spatial masks or localized attention mechanisms.
  • Texture harmony and overall aesthetic quality improve alongside identity fidelity on real-world distributions.
  • Performance gains appear on both direct multi-person fusion and age-transformed group generation benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The scheduling and gating ideas could transfer to other conditional generation settings where preservation must coexist with structural change.
  • The badcase-driven prompt method might reduce manual prompt engineering in broader image editing or composition tools.
  • Extending the group-level optimization to video or multi-view synthesis could test whether the same balance holds over time or across viewpoints.

Load-bearing premise

A vision-language model trained on bad cases can reliably generate precise context-aware prompts that fix attribute leakage and semantic ambiguity without any layout or spatial guidance.

What would settle it

Run the model on mixed-age group scenes with adult reference identities, once with and once without the VLM prompt step, and check whether child anatomical features remain intact or whether faces blend incorrectly once that step is removed.
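
The face-blending half of such a check could be scored along the following lines. The sketch assumes each subject has a reference face embedding and a generated face embedding from some face-recognition encoder (the encoder itself is outside the sketch), and it reports how much closer each generated face is to its own reference than to any other reference. The function names and the metric are illustrative assumptions, not a protocol from the paper, and the anatomical check would need a separate evaluator.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_a * norm_b)


def blend_margin(refs: Sequence[Vector], gens: Sequence[Vector]) -> float:
    """Worst-case identity separation across subjects.

    For subject i: sim(gen_i, ref_i) minus the highest sim(gen_i, ref_j), j != i.
    Values near zero or below suggest that identities have blended.
    """
    if len(refs) != len(gens) or len(refs) < 2:
        raise ValueError("need at least two paired subjects")
    margins = []
    for i, gen in enumerate(gens):
        own = cosine(gen, refs[i])
        other = max(cosine(gen, refs[j]) for j in range(len(refs)) if j != i)
        margins.append(own - other)
    return min(margins)
```

Comparing `blend_margin` on outputs produced with and without the prompt step, over the same references, gives one number per condition for the blending question.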

Figures

Figures reproduced from arXiv: 2603.00607 by Changhao Qiao, Chao Hui, Haohua Chen, Honghao Cai, Jing Li, Runqi Wang, Sijie Xu, Tianze Zhou, Wei Zhu, Xiangyuan Wang, Xu Tang, Yao Hu, Yibo Chen, Yunhao Bai, Yuyang Hao, Yuyuan Yang, Zezhou Cui, Zhen Li.

Figure 1: Qualitative results of IdGlow on two multi-subject generation tasks.
Figure 2: The architecture of IdGlow-DiT. The model processes variable numbers of reference identities through a unified encoding strategy, forming a concatenated multi-ID sequence. A key innovation is the Dynamics-Aware Gating Module (highlighted in orange), which modulates the intensity of the identity sequence based on the diffusion timestep t and the specific task (e.g., age transformation curves). These gated …
Figure 3: Task-specific prompt synthesis via the Image-Edit-Prompt model.
Figure 4: Dynamics-aware identity modulation tailored to specific generative tasks.
Original abstract

Multi-subject image generation requires seamlessly harmonizing multiple reference identities within a coherent scene. However, existing methods relying on rigid spatial masks or localized attention often struggle with the "stability-plasticity dilemma," particularly failing in tasks that require complex structural deformations, such as identity-preserving age transformation. To address this, we present IdGlow, a mask-free, progressive two-stage framework built upon Flow Matching diffusion models. In the supervised fine-tuning (SFT) stage, we introduce task-adaptive timestep scheduling aligned with diffusion generative dynamics: a linear decay schedule that progressively relaxes constraints for natural group composition, and a temporal gating mechanism that concentrates identity injection within a critical semantic window, successfully preserving adult facial semantics without overriding child-like anatomical structures. To resolve attribute leakage and semantic ambiguity without explicit layout inputs, we further integrate a badcase-driven Vision-Language Model (VLM) for precise, context-aware prompt synthesis. In the second stage, we design a Fine-Grained Group-Level Direct Preference Optimization (DPO) with a weighted margin formulation to simultaneously eliminate multi-subject artifacts, elevate texture harmony, and recalibrate identity fidelity towards real-world distributions. Extensive experiments on two challenging benchmarks -- direct multi-person fusion and age-transformed group generation -- demonstrate that IdGlow fundamentally mitigates the stability-plasticity conflict, achieving a superior Pareto balance between state-of-the-art facial fidelity and commercial-grade aesthetic quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents IdGlow, a mask-free progressive two-stage framework for multi-subject image generation based on Flow Matching diffusion models. The first stage involves supervised fine-tuning with task-adaptive timestep scheduling using a linear decay schedule and a temporal gating mechanism to preserve identity semantics. A badcase-driven Vision-Language Model is used for context-aware prompt synthesis to avoid attribute leakage without spatial inputs. The second stage employs Fine-Grained Group-Level Direct Preference Optimization (DPO) with weighted margin to improve harmony and fidelity. Experiments on multi-person fusion and age-transformed group generation benchmarks claim to achieve a superior balance between facial fidelity and aesthetic quality, mitigating the stability-plasticity conflict.

Significance. If the quantitative results and ablations substantiate the claims, this work could offer a significant advance in mask-free multi-subject generation by addressing the stability-plasticity dilemma through dynamic modulation techniques. The integration of VLM for prompt synthesis and group-level DPO represents an interesting approach to handling complex identity interactions without rigid spatial conditioning. However, the current presentation leaves the empirical support unclear.

major comments (3)
  1. [Abstract] The abstract asserts superior performance on two benchmarks yet supplies no quantitative numbers, error bars, ablation results, or baseline comparisons, leaving the central claim unsupported by visible evidence.
  2. [§3.2 (Prompt Synthesis)] The assumption that a badcase-driven Vision-Language Model can reliably produce precise, context-aware prompts that resolve attribute leakage and semantic ambiguity without any explicit layout or spatial inputs is load-bearing for the mask-free claim. No quantitative evaluation of VLM prompt fidelity for the age-transformation or group-fusion benchmarks is provided.
  3. [§3.3 (DPO Stage)] The Fine-Grained Group-Level DPO optimizes directly against human or model preferences on generated outputs, raising a circularity concern where the reported fidelity gains may be partly defined by the same optimization loop used to produce them.
minor comments (2)
  1. [§3.1 and §3.3] The definitions of the linear decay schedule parameters and the weighted margin in group-level DPO could be clarified with explicit equations to aid reproducibility.
  2. [Related Work] Additional references to recent works on mask-free multi-subject generation or VLM-guided diffusion would strengthen the positioning of the contributions.
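
To make concrete what minor comment 1 asks for, one plausible shape for a margin-augmented DPO objective is sketched below: the standard DPO logistic loss on the difference between the preferred and rejected log-ratios, shifted by a margin built as a weighted sum of per-criterion quality gaps. The criterion names, the weights, `beta`, and the way the margin enters the loss are assumptions for illustration; the paper's own formulation is not reproduced on this page.

```python
import math
from typing import Mapping


def weighted_margin(score_gaps: Mapping[str, float],
                    weights: Mapping[str, float]) -> float:
    """Weighted sum of per-criterion gaps (preferred minus rejected).

    The keys (e.g. "identity", "harmony") are hypothetical criteria.
    """
    return sum(weights[name] * score_gaps[name] for name in weights)


def dpo_margin_loss(logratio_win: float, logratio_lose: float,
                    margin: float = 0.0, beta: float = 0.1) -> float:
    """-log(sigmoid(beta * (logratio_win - logratio_lose) - margin)).

    Each log-ratio is log pi_theta(y | x) - log pi_ref(y | x) for one sample.
    With margin = 0 this reduces to the standard DPO loss.
    """
    z = beta * (logratio_win - logratio_lose) - margin
    # Numerically stable softplus(-z) = log(1 + exp(-z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

A larger margin demands a wider gap between the preferred and rejected samples before the loss flattens out, which is why the margin weights behave as tunable hyper-parameters.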

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be incorporated to strengthen the presentation and empirical support.

  1. Referee: [Abstract] The abstract asserts superior performance on two benchmarks yet supplies no quantitative numbers, error bars, ablation results, or baseline comparisons, leaving the central claim unsupported by visible evidence.

    Authors: We agree that the abstract would be strengthened by including concrete quantitative support. In the revised manuscript we will add key metrics from the experiments section, including facial fidelity and aesthetic quality scores with baseline comparisons, to directly evidence the claimed performance. revision: yes

  2. Referee: [§3.2 (Prompt Synthesis)] The assumption that a badcase-driven Vision-Language Model can reliably produce precise, context-aware prompts that resolve attribute leakage and semantic ambiguity without any explicit layout or spatial inputs is load-bearing for the mask-free claim. No quantitative evaluation of VLM prompt fidelity for the age-transformation or group-fusion benchmarks is provided.

    Authors: This is a valid observation regarding the centrality of the VLM component. While end-to-end benchmark results demonstrate the practical effectiveness of the synthesized prompts, we will add a quantitative evaluation of prompt fidelity (e.g., semantic similarity and leakage reduction metrics) specifically on the age-transformation and group-fusion benchmarks. revision: yes

  3. Referee: [§3.3 (DPO Stage)] The Fine-Grained Group-Level DPO optimizes directly against human or model preferences on generated outputs, raising a circularity concern where the reported fidelity gains may be partly defined by the same optimization loop used to produce them.

    Authors: We appreciate the methodological concern. The preference data is collected from human annotators on a held-out set of SFT-stage outputs, and final reporting relies on independent objective metrics and test splits. We will revise §3.3 to explicitly detail this separation between preference collection and evaluation protocols. revision: yes

Circularity Check

0 steps flagged

Derivation chain is self-contained with independent methodological contributions

The paper presents a progressive two-stage framework consisting of supervised fine-tuning with task-adaptive timestep scheduling and temporal gating, integrated with badcase-driven VLM prompt synthesis to address attribute leakage, followed by Fine-Grained Group-Level DPO with weighted margin formulation. These steps are described as targeting distinct facets of the stability-plasticity dilemma in mask-free multi-subject generation. Claims of superior Pareto balance are supported by experiments on the direct multi-person fusion and age-transformed group generation benchmarks, without any reduction of results to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. The central derivation remains externally falsifiable via the reported benchmark outcomes rather than tautological.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

The framework rests on several domain assumptions about diffusion dynamics and VLM capabilities plus newly introduced mechanisms whose effectiveness is asserted rather than independently verified.

free parameters (2)
  • linear decay schedule parameters
    The rate and breakpoints of the linear decay used for timestep scheduling are chosen to align with generative dynamics and are therefore fitted or hand-tuned.
  • weighted margin in group-level DPO
    The margin weighting in the preference optimization objective is a tunable hyper-parameter that directly influences the final identity and harmony scores.
axioms (2)
  • [domain assumption] Flow Matching diffusion models admit task-adaptive timestep scheduling that progressively relaxes identity constraints without destroying earlier semantic structure.
    Invoked to justify the supervised fine-tuning stage.
  • [ad hoc to paper] A vision-language model driven by bad-case examples can synthesize prompts that eliminate attribute leakage without spatial layout information.
    Central to the prompt-synthesis component.
invented entities (2)
  • temporal gating mechanism [no independent evidence]
    purpose: Concentrate identity injection inside a critical semantic window during diffusion
    New component introduced to preserve adult facial semantics while permitting child-like anatomy.
  • Fine-Grained Group-Level Direct Preference Optimization [no independent evidence]
    purpose: Simultaneously remove multi-subject artifacts, improve texture harmony, and recalibrate identity fidelity
    Custom DPO variant presented as the second training stage.


