CARINOX: Inference-time Scaling with Category-Aware Reward-based Initial Noise Optimization and Exploration

Ali Aghayari, Arash Marioriyad, Mahdieh Soleymani Baghshah, MohammadAmin Fazli, Mohammad Hossein Rohban, Niki Sepasian, Seyed Amir Kasaei, Shayan Baghayi Nejad

classification cs.CV cs.CL

keywords optimization, carinox, exploration, noise, alignment, initial, reward, category-aware

Text-to-image diffusion models, such as Stable Diffusion, can produce high-quality and diverse images but often fail to achieve compositional alignment, particularly when prompts describe complex object relationships, attributes, or spatial arrangements. Recent inference-time approaches address this by optimizing or exploring the initial noise under the guidance of reward functions that score text-image alignment without requiring model fine-tuning. While promising, each strategy has intrinsic limitations when used alone: optimization can stall due to poor initialization or unfavorable search trajectories, whereas exploration may require a prohibitively large number of samples to locate a satisfactory output. Our analysis further shows that neither single reward metrics nor ad-hoc combinations reliably capture all aspects of compositionality, leading to weak or inconsistent guidance. To overcome these challenges, we present Category-Aware Reward-based Initial Noise Optimization and Exploration (CARINOX), a unified framework that combines noise optimization and exploration with a principled reward selection procedure grounded in correlation with human judgments. Evaluations on two complementary benchmarks covering diverse compositional challenges show that CARINOX raises average alignment scores by +16% on T2I-CompBench++ and +11% on the HRS benchmark, consistently outperforming state-of-the-art optimization and exploration-based methods across all major categories, while preserving image quality and diversity. The project page is available at https://amirkasaei.com/carinox/.
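The idea of combining exploration with optimization of the initial noise can be illustrated with a toy sketch. This is not the CARINOX implementation: the quadratic `toy_reward`, its analytic gradient, the `target` vector, and all function and parameter names are assumptions made for illustration. In a real pipeline the reward would score a generated image against the prompt, and gradients would have to flow through the diffusion sampler.

```python
import numpy as np

def toy_reward(z, target):
    """Hypothetical stand-in for an alignment reward: higher when z is near target."""
    return -float(np.sum((z - target) ** 2))

def toy_reward_grad(z, target):
    """Analytic gradient of toy_reward with respect to the noise z."""
    return -2.0 * (z - target)

def optimize_noise(z0, target, steps=20, lr=0.1):
    """Optimization: gradient ascent on the reward, starting from noise z0."""
    z = z0.copy()
    for _ in range(steps):
        z = z + lr * toy_reward_grad(z, target)
    return z, toy_reward(z, target)

def explore_and_optimize(target, num_seeds=4, steps=20, lr=0.1, seed=0):
    """Exploration: draw several initial noises, optimize each, keep the best."""
    rng = np.random.default_rng(seed)
    best_z, best_r = None, -np.inf
    for _ in range(num_seeds):
        z0 = rng.standard_normal(target.shape)
        z, r = optimize_noise(z0, target, steps, lr)
        if r > best_r:
            best_z, best_r = z, r
    return best_z, best_r

if __name__ == "__main__":
    target = np.ones(8)
    best_z, best_r = explore_and_optimize(target)
    print("best reward:", best_r)
```

The toy objective is convex, so every seed converges here; the abstract's point is that with a realistic, non-convex reward a single optimization run can stall, which is why several starting noises are tried.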

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Erasure or Erosion? Evaluating Compositional Degradation in Unlearned Text-To-Image Diffusion Models
cs.CV 2026-04 unverdicted novelty 6.0

Unlearning methods that strongly erase concepts from text-to-image diffusion models consistently degrade performance on attribute binding, spatial reasoning, and counting tasks.