{"total":18,"items":[{"citing_arxiv_id":"2605.22505","ref_index":37,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Towards Direct Evaluation of Harness Optimizers via Priority Ranking","primary_cat":"cs.AI","submitted_at":"2026-05-21T13:55:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Priority ranking offers a low-cost direct evaluation for harness optimizers that correlates with their real multi-step optimization performance, supported by the Shor dataset of 182 scenarios.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20619","ref_index":75,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SURF: Steering the Scalarization Weight to Uniformly Traverse the Pareto Front","primary_cat":"cs.LG","submitted_at":"2026-05-20T02:09:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SURF derives weight sampling rules from the arc-length CDF of the scalarization path to uniformly traverse the Pareto front in multi-objective optimization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19242","ref_index":54,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"PhyWorld: Physics-Faithful World Model for Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-19T01:28:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PhyWorld improves temporal consistency and physical plausibility in video world models via flow matching fine-tuning followed by DPO on physics preference pairs, with reported gains on VBench and a custom physical-faithfulness 
benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17766","ref_index":33,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"LatentUMM: Dual Latent Alignment for Unified Multimodal Models","primary_cat":"cs.CV","submitted_at":"2026-05-18T02:35:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LatentUMM proposes dual latent alignment at modality and capacity levels plus latent dynamics stabilization to reduce semantic drift and improve consistency in unified multimodal models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11516","ref_index":37,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Agents Should Replace Narrow Predictive AI as the Orchestrator in 6G AI-RAN","primary_cat":"cs.NI","submitted_at":"2026-05-12T04:39:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Position paper proposes replacing fragmented narrow AI models with LLMs as the cognitive orchestrator in the RAN Intelligent Controller for Level 5 autonomous 6G networks.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"weight Large Telecom Models (LTMs) running entirely on-premise within the operator's edge data centers. By utilizing techniques such as Low-Rank Adaptation (LoRA) and Direct Preference Optimization (DPO), foundational open-weight models can be heavily fine-tuned on an operator's internal, anonymized datasets, creating highly specialized, sovereign reasoning engines [12], [37]. 
To achieve multi-operator learning without compromising data privacy, the community should invest in Federated Learning architectures specifically designed for LTMs, allowing agents to share abstracted diagnostic heuristics and zero-day anomaly signatures without exposing the underlying raw telemetry. D. The Sufficiency of Traditional Self-Organizing Networks"},{"citing_arxiv_id":"2605.10716","ref_index":15,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"What should post-training optimize? A test-time scaling law perspective","primary_cat":"cs.LG","submitted_at":"2026-05-11T15:25:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Tail-extrapolated estimators approximate best-of-N policy gradients from limited training rollouts by leveraging upper-tail reward statistics under structural assumptions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"with human feedback. Advances in neural information processing systems, 35:27730-27744, 2022. [14] Laura O'Mahony, Leo Grinsztajn, Hailey Schoelkopf, and Stella Biderman. Attributing mode collapse in the fine-tuning of large language models. In ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models, volume 2, page 2, 2024. 17 [15] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728-53741, 2023. [16] Noam Razin, Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D Lee, and Sanjeev Arora. 
What makes"},{"citing_arxiv_id":"2605.09640","ref_index":28,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning","primary_cat":"cs.CV","submitted_at":"2026-05-10T16:36:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multimodal models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. [27] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730-27744, 2022. [28] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728-53741, 2023. 
[29] Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang."},{"citing_arxiv_id":"2605.09397","ref_index":41,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"BadDLM: Backdooring Diffusion Language Models with Diverse Targets","primary_cat":"cs.CR","submitted_at":"2026-05-10T07:50:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BadDLM implants effective backdoors in diffusion language models across concept, attribute, alignment, and payload targets by exploiting denoising dynamics while preserving clean performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tuning aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693, 2023. [40] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728-53741, 2023. [41] Javier Rando and Florian Tramèr. Universal jailbreak backdoors from poisoned human feedback. arXiv preprint arXiv:2311.14455, 2023. [42] Subham S Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. 
Advances in Neural Information Processing Systems, 37:130136-130184, 2024."},{"citing_arxiv_id":"2605.01899","ref_index":22,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment","primary_cat":"cs.AI","submitted_at":"2026-05-03T14:28:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14379","ref_index":20,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Step-level Denoising-time Diffusion Alignment with Multiple Objectives","primary_cat":"cs.LG","submitted_at":"2026-04-15T19:52:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MSDDA derives a closed-form optimal reverse denoising distribution for multi-objective diffusion alignment that is exactly equivalent to step-level RL fine-tuning with no approximation error.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13602","ref_index":36,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges","primary_cat":"cs.LG","submitted_at":"2026-04-15T08:11:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing 
high-dimensional human objectives into proxy reward signals under optimization pressure.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12163","ref_index":1,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Nucleus-Image: Sparse MoE for Image Generation","primary_cat":"cs.CV","submitted_at":"2026-04-14T00:43:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A 17B-parameter sparse MoE diffusion transformer activates 2B parameters per pass and reaches competitive quality on image generation benchmarks without post-training.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Nucleus-Image demonstrates that sparse MoE scaling is a highly effective path to high-quality image generation, reaching the performance of models with significantly larger active parameter budgets at a fraction of the inference cost. These results are achieved without post-training optimization of any kind: no reinforcement learning, no direct preference optimization [1], and no human preference tuning. We release the full model weights, training code, and dataset to the community, making Nucleus-Image the first fully open-source MoE diffusion model at this quality tier. 
https://withnucleus.ai/image https://github.com/WithNucleusAI/Nucleus-Image https://huggingface.co/NucleusAI/NucleusMoE-Image arXiv:2604.12163v1 [cs."},{"citing_arxiv_id":"2603.27977","ref_index":16,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology","primary_cat":"cs.AI","submitted_at":"2026-03-30T02:54:48+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.16175","ref_index":53,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Learning to Discover at Test Time","primary_cat":"cs.LG","submitted_at":"2026-01-22T18:24:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TTT-Discover applies test-time RL to set new state-of-the-art results on math inequalities, GPU kernels, algorithm contests, and single-cell denoising using an open model and public code.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.16776","ref_index":21,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Kling-Omni Technical Report","primary_cat":"cs.CV","submitted_at":"2025-12-18T17:08:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Kling-Omni is a unified multimodal generative system that produces cinematic videos from diverse inputs by integrating generation, editing, and intelligent reasoning in a single end-to-end 
model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.22699","ref_index":59,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer","primary_cat":"cs.CV","submitted_at":"2025-11-27T18:52:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and streamlined training.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"a comprehensive post-training framework leveraging Reinforcement Learning with Human Feedback (RLHF). This framework hinges on a powerful, multi-dimensional reward model, which provides targeted feedback for online optimization. Guided by these feedback signals, our approach is structured into two sequential stages: an initial offline alignment phase using Direct Preference Optimization (DPO) [59], followed by an online refinement phase with Group Relative Policy Optimization (GRPO) [66, 46]. This two-stage strategy allows us to first efficiently instill robust adherence to objective standards and then leverage the fine-grained signals from our reward model for optimizing more subjective qualities. 
As illustrated in Figure 14, this comprehensive process yields substantial improvements in photorealism,"},{"citing_arxiv_id":"2509.26158","ref_index":29,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Towards Continual Expansion of Data Coverage: Automatic Text-guided Edge-case Synthesis","primary_cat":"cs.CV","submitted_at":"2025-09-30T12:11:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Automated LLM-based prompt engineering for text-to-image edge-case synthesis improves object detection robustness on the FishEye8K benchmark over naive augmentation and manual prompts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.17412","ref_index":39,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"A Ridge Too Far: Correcting Over-Shrinkage via Negative Regularization","primary_cat":"cs.LG","submitted_at":"2025-08-24T15:34:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Negative-capable ridge regression uses controlled negative regularization as anti-shrinkage to increase effective complexity along weak eigendirections and mitigate underfitting in small-data regression.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}