{"total":14,"items":[{"citing_arxiv_id":"2605.18267","ref_index":54,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SRC-Flow: Compact Semantic Representations Enable Normalizing Flows for Image Generation","primary_cat":"cs.CV","submitted_at":"2026-05-18T12:03:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SRC-Flow compresses RAE features via a Semantic Representation Compressor into a low-dimensional space, enabling normalizing flows to reach gFID 1.65 on ImageNet 256x256 and 2.07 on 512x512 while retaining exact likelihoods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26348","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ACPO: Anchor-Constrained Perceptual Optimization for Diffusion Models with No-Reference Quality Guidance","primary_cat":"cs.CV","submitted_at":"2026-04-29T07:00:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ACPO uses anchor-based regularization with NR-IQA guidance to enable stable perceptual quality improvements in diffusion model fine-tuning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"As highlighted in recent literature [30], these advancements in visual coherence and prompt alignment are now central to evaluating modern generative models. 2.2 Generative Models The development of deep generative models has progressed significantly over the past decade. Early paradigms, including Variational Autoencoders (VAEs) [7] and autoregressive models [31], pioneered probabilistic likelihood modeling but often struggled with overly smooth outputs or high computational costs. 
Generative Adversarial Networks (GANs) [21] subsequently achieved high visual realism, though they remained notoriously difficult to optimize. A major paradigm shift occurred with diffusion models, which formulate generation as a learned denoising process."},{"citing_arxiv_id":"2603.26357","ref_index":45,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MPDiT: Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model","primary_cat":"cs.CV","submitted_at":"2026-03-27T12:30:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MPDiT uses a hierarchical multi-patch design in transformers to lower computation in diffusion models by handling coarse global features first then fine local details, plus faster-converging embeddings.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Extensive experiments on the ImageNet dataset demonstrate the effectiveness of our architectural choices. Code is released at https://github.com/quandao10/MPDiT 1. Introduction Diffusion models [16, 27, 37, 58] have emerged as a leading class of generative models, surpassing generative adversarial networks [20], normalizing flows [17, 33, 80], and autoregressive models [45, 61, 64] in many vision tasks. Compared to GANs [20], diffusion models [27] are generally easier to train and avoid issues such as instability and mode collapse. 
In 2D image generation, diffusion-based approaches have demonstrated strong performance in text-to-image synthesis [52], enabling downstream applications such as personalization [53, 65, 76], image editing [13, 25,"},{"citing_arxiv_id":"2207.05221","ref_index":117,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Language Models (Mostly) Know What They Know","primary_cat":"cs.CL","submitted_at":"2022-07-11T22:59:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2112.10752","ref_index":95,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"High-Resolution Image Synthesis with Latent Diffusion Models","primary_cat":"cs.CV","submitted_at":"2021-12-20T18:55:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrained autoencoders with cross-attention conditioning, while cutting computational and","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[93] Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. CoRR, abs/2106.05931, 2021. 2, 3, 5, 6 [94] Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, koray kavukcuoglu, Oriol Vinyals, and Alex Graves. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, 2016. 
3 [95] Aäron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. CoRR, abs/1601.06759, 2016. 3 [96] Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In NIPS, pages 6306-6315, 2017. 2, 4, 29 [97] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N."},{"citing_arxiv_id":"2112.00861","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A General Language Assistant as a Laboratory for Alignment","primary_cat":"cs.CL","submitted_at":"2021-12-01T22:24:34+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2102.01293","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Scaling Laws for Transfer","primary_cat":"cs.LG","submitted_at":"2021-02-02T04:07:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2010.14701","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Scaling Laws for Autoregressive Generative Modeling","primary_cat":"cs.LG","submitted_at":"2020-10-28T02:17:24+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Autoregressive transformers follow power-law scaling laws for 
cross-entropy loss with nearly universal exponents relating optimal model size to compute budget across four domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1907.11559","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Bayesian Volumetric Autoregressive generative models for better semisupervised learning","primary_cat":"cs.LG","submitted_at":"2019-07-26T13:08:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Volumetric PixelCNN reformulated as Bayesian deep GP yields uncertainty that improves semi-supervised learning on brain MRI with low label proportions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1906.09925","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"To each route its own ETA: A generative modeling framework for ETA prediction","primary_cat":"cs.LG","submitted_at":"2019-06-24T13:13:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A route-specific deep generative model learns the probability distribution of bus trip ETAs from historical data alone and conditions updates on real-time trip progress.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1906.08237","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"XLNet: Generalized Autoregressive Pretraining for Language Understanding","primary_cat":"cs.CL","submitted_at":"2019-06-19T17:35:48+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"XLNet is a generalized autoregressive pretraining method that learns bidirectional contexts via permutation-based factorization 
and outperforms BERT on 20 NLP tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1904.10509","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Generating Long Sequences with Sparse Transformers","primary_cat":"cs.LG","submitted_at":"2019-04-23T19:29:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Sparse Transformers factorize attention to handle sequences tens of thousands long, achieving new SOTA density modeling on Enwik8, CIFAR-10, and ImageNet-64.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1609.03499","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WaveNet: A Generative Model for Raw Audio","primary_cat":"cs.SD","submitted_at":"2016-09-12T17:29:40+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":9.0,"formal_verification":"none","one_line_summary":"WaveNet generates realistic raw audio using an autoregressive neural network with dilated convolutions, achieving state-of-the-art naturalness in speech synthesis for English and Mandarin.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1605.08803","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Density estimation using Real NVP","primary_cat":"cs.LG","submitted_at":"2016-05-27T21:24:32+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Real NVP uses affine coupling layers to create invertible transformations that support exact density estimation, sampling, and latent inference without approximations.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"We also augment the 
CIFAR-10, CelebA and LSUN datasets during training to also include horizontal flips of the training examples. We train our model on four natural image datasets: CIFAR-10 [36], Imagenet [52], Large-scale Scene Understanding (LSUN) [70], CelebFaces Attributes (CelebA) [41]. More specifically, we train on the downsampled to 32×32 and 64×64 versions of Imagenet [46]. For the LSUN dataset, we train on the bedroom, tower and church outdoor categories. The procedure for LSUN is the same as in [47]: we downsample the image so that the smallest side is 96 pixels and take random crops of 64×64. For CelebA, we use the same procedure as in [38]: we take an approximately central crop of 148×148 then resize it to 64×64."}],"limit":50,"offset":0}