{"total":14,"items":[{"citing_arxiv_id":"2605.21981","ref_index":43,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"RiT: Vanilla Diffusion Transformers Suffice in Representation Space","primary_cat":"cs.CV","submitted_at":"2026-05-21T04:21:43+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A vanilla Diffusion Transformer trained via x-prediction on frozen DINOv2 features reaches FID 1.14 on ImageNet 256x256 with fewer parameters and faster sampling than prior DiT variants.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21573","ref_index":20,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models","primary_cat":"cs.CV","submitted_at":"2026-05-20T17:59:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Lens is a 3.8B-parameter text-to-image model that reaches competitive or superior performance to >6B-parameter systems using 19.3% of the training compute of Z-Image through a densely captioned 800M dataset, multi-resolution batching, semantic VAE, strong language encoder, RL fine-tuning, and 4-step","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20613","ref_index":26,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"HRM-Text: Efficient Pretraining Beyond Scaling","primary_cat":"cs.CL","submitted_at":"2026-05-20T01:59:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A 1B-parameter hierarchical recurrent model pretrained on 40B instruction-response tokens achieves 60.7% MMLU and strong results on ARC-C, DROP, GSM8K, and MATH while using 100-900x fewer 
tokens than standard baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18387","ref_index":38,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Graph Hierarchical Recurrence for Long-Range Generalization","primary_cat":"cs.LG","submitted_at":"2026-05-18T13:31:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GHR uses hierarchical recurrence on pooled graph abstractions to improve long-range dependency capture and out-of-range generalization while using far fewer parameters than existing models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15592","ref_index":41,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Efficient Image Synthesis with Sphere Latent Encoder","primary_cat":"cs.CV","submitted_at":"2026-05-15T04:03:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Decouples Sphere Encoder into fixed pretrained encoder and spherical latent denoiser, yielding higher quality and faster inference than the joint original on Animal-Faces, Oxford-Flowers and ImageNet-1K.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07915","ref_index":103,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"What Matters for Diffusion-Friendly Latent Manifold? 
Prior-Aligned Autoencoders for Latent Diffusion","primary_cat":"cs.CV","submitted_at":"2026-05-08T15:52:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[102] Shilong Zhang, He Zhang, Zhifei Zhang, Chongjian Ge, Shuchen Xue, Shaoteng Liu, Mengwei Ren, Soo Ye Kim, Yuqian Zhou, Qing Liu, Daniil Pakhomov, Kai Zhang, Zhe Lin, and Ping Luo. Both semantics and reconstruction matter: Making representation encoders ready for text-to-image generation and editing, 2025. URL https://arxiv.org/abs/2512.17909. [103] Yue Zhao, Yuanjun Xiong, and Philipp Krähenbühl. Image and video tokenization with binary spherical quantization, 2024. URL https://arxiv.org/abs/2406.07548. [104] Yue Zhao, Yuanjun Xiong, and Philipp Krähenbühl. Image and video tokenization with binary spherical quantization, 2024. URL https://arxiv.org/abs/2406.07548.
[105] Anlin Zheng, Xin Wen, Xuanyang Zhang, Chuofan Ma, Tiancai Wang, Gang Yu, Xiangyu"},{"citing_arxiv_id":"2605.06501","ref_index":95,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Cubit: Token Mixer with Kernel Ridge Regression","primary_cat":"cs.LG","submitted_at":"2026-05-07T16:18:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Cubit replaces Transformer's attention with a closed-form Kernel Ridge Regression token mixer and reports larger gains as training sequence length increases.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03769","ref_index":10,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Nora: Normalized Orthogonal Row Alignment for Scalable Matrix Optimizer","primary_cat":"cs.LG","submitted_at":"2026-05-05T14:00:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Nora is a matrix optimizer that stabilizes weight norms and angular velocities through row-wise momentum projection onto the orthogonal complement of the weights while approximating structured preconditioning with O(mn) complexity and proven scalability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26898","ref_index":54,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models","primary_cat":"math.PR","submitted_at":"2026-04-29T17:09:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Transformers converge pathwise to a stochastic particle system and SPDE in the scaling limit, exhibiting synchronization by noise and exponential energy dissipation when common noise is 
coercive relative to self-attention drift.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12163","ref_index":24,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Nucleus-Image: Sparse MoE for Image Generation","primary_cat":"cs.CV","submitted_at":"2026-04-14T00:43:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A 17B-parameter sparse MoE diffusion transformer activates 2B parameters per pass and reaches competitive quality on image generation benchmarks without post-training.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"than MoE. We found this essential for training stability: without dense initial layers, the layer-wise activation norms grow rapidly in early training, leading to divergence. We hypothesize that routing decisions require meaningful token representations to be effective, which are not yet available in the earliest layers. QK-Normalization. We apply RMSNorm [24] to query and key projections before computing attention scores. The normalization weights are initialized to produce identity behavior and are frozen during training. This prevents attention logit explosion without introducing additional learnable parameters. Tanh Gating. Residual connections use tanh-bounded gates: x ← x + tanh(g) ⊙ r.
The tanh nonlinearity constrains"},{"citing_arxiv_id":"2604.10098","ref_index":177,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation","primary_cat":"cs.LG","submitted_at":"2026-04-11T08:41:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04539","ref_index":94,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control","primary_cat":"cs.LG","submitted_at":"2026-04-06T09:03:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FlashSAC improves training speed and final performance of off-policy RL on high-dimensional robot tasks by reducing update frequency, increasing model scale, and bounding norms to limit critic error accumulation.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Inverted Residual Backbone. The backbone stacks inverted residual blocks inspired by the Transformer feedforward block [89] (Figure 2). Each block expands features to a higher dimension via an inverted bottleneck [26], projects back to the original dimension, and adds a residual connection [23] to stabilize gradient propagation. After the final block, we apply RMSNorm [94] to bound per-sample feature norms before value heads, preventing out-of-distribution inputs from producing unbounded activations that destabilize bootstrapping.
Pre-activation Batch Normalization. Replay data is collected by a mixture of evolving policies, inducing non-stationary input distributions. Without normalization, feature activations can saturate (e."},{"citing_arxiv_id":"2604.03044","ref_index":10,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency","primary_cat":"cs.CL","submitted_at":"2026-04-03T13:52:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"JoyAI-LLM Flash delivers a 48B MoE LLM with 2.7B active parameters per token via FiberPO RL and dense multi-token prediction, released with checkpoints on Hugging Face.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"As summarized in Table 1, JoyAI-LLM Flash is a Mixture-of-Experts (MoE) model with 48.9B total parameters, of which 3.28B are activated per token. Its micro-architecture draws inspiration from DeepSeek-V3 [8] and Kimi-K2 [9], utilizing Multi-head Latent Attention (MLA) [3] with hidden dimensions of 2048 and 768, respectively. The model incorporates standard components such as RMSNorm [10] for layer normalization, RoPE [11] for positional encoding, and SwiGLU [12] activation within the feed-forward blocks. In terms of macro-architecture, JoyAI-LLM Flash consists of 40 Transformer layers. The first layer utilizes a standard dense feed-forward network, while the remaining 39 layers are sparse MoE layers.
The MoE module employs a fine-grained architecture with 256 total experts."},{"citing_arxiv_id":"2603.03243","ref_index":46,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"HoMMI: Learning Whole-Body Mobile Manipulation from Human Demonstrations","primary_cat":"cs.RO","submitted_at":"2026-03-03T18:36:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HoMMI learns whole-body mobile manipulation policies from robot-free human demonstrations by augmenting UMI with egocentric sensing and bridging the embodiment gap through an agnostic visual representation, relaxed head actions, and a whole-body controller.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}