{"total":15,"items":[{"citing_arxiv_id":"2605.18052","ref_index":176,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Efficient 3D Content Reconstruction and Generation","primary_cat":"cs.CV","submitted_at":"2026-05-18T08:41:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Presents Instant3D for rapid text/image-to-3D generation via multi-view diffusion plus feed-forward reconstruction, and FastMap for 10x faster structure-from-motion with comparable accuracy.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Here, again, the lower pose accuracy of FastMap under the strictest metrics does not prevent FastMap poses from yielding competitive PSNR. These results suggest that pose accuracy under a strict metric could be a misleading proxy for downstream view synthesis quality, and vice versa. We also investigate the impact of different SfM poses on rendering with CamP [176], which simultaneously optimizes the radiance field and refines the camera poses. We include the results in table 4.3 for comparison. In general, CamP improves the PSNR for all the three methods, and for some scenes (e.g., flowers, garden, kitchen, etc.) the gap in rendering quality is closed and sometimes even reversed. 
4.1.2.3 More Results Additional speed benchmarking with different hardware configurations is reported in Ta-"},{"citing_arxiv_id":"2605.13618","ref_index":12,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"OpenAaaS: An Open Agent-as-a-Service Framework for Distributed Materials-Informatics Research","primary_cat":"cond-mat.mtrl-sci","submitted_at":"2026-05-13T14:47:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OpenAaaS is a hierarchical agent-as-a-service system that enables secure multi-agent collaboration for materials informatics by moving code to data rather than data to code.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11235","ref_index":3,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning","primary_cat":"cs.LG","submitted_at":"2026-05-11T20:50:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior performance and up to 67% faster convergence across math, code, and agent benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"demonstrates improved final performance, yielding up to 67% wall-clock reduction with only around 3.9% per-step overhead, providing a metacognitive framework for efficient LLM RFT. 2 Related Work Reinforcement fine-tuning (RFT). RFT methods vary in reward source and optimization objective. 
Standard RLHF aligns LLMs with human preference via PPO-style optimization [3, 17, 18], while critic-free, REINFORCE-style alternatives like RLOO, ReMax, and REINFORCE++ simplify the pipeline [19-21]. Reasoning-oriented fine-tuning increasingly relies on verifiable rewards [4, 5], where group-relative methods such as GRPO, DAPO, and GSPO estimate advantages from multiple rollouts of the same prompt [4, 6, 7], and process-level supervision provides a complementary"},{"citing_arxiv_id":"2605.08811","ref_index":31,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Learning Theory of Transformers: Local-to-Global Approximation via Softmax Partition of Unity","primary_cat":"stat.ML","submitted_at":"2026-05-09T09:02:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A shallow dense Transformer achieves uniform epsilon-approximation of alpha-Holder functions with O(epsilon^{-d/alpha}) parameters and near-minimax generalization error O(n^{-2alpha/(2alpha+d)} log n).","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[29] Ryumei Nakada and Masaaki Imaizumi. Adaptive approximation and generalization of deep neural network with intrinsic dimensionality. Journal of Machine Learning Research, 21(174):1-38, 2020. [30] Partha Niyogi, Stephen Smale, and Shmuel Weinberger. Finding the homology of submanifolds with high confidence from random samples. Discrete & Computational Geometry, 39(1):419-441, 2008. [31] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730-27744, 2022. [32] Johannes Schmidt-Hieber. 
Nonparametric regression using deep neural networks with ReLU"},{"citing_arxiv_id":"2605.07800","ref_index":20,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models","primary_cat":"cs.CV","submitted_at":"2026-05-08T14:36:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SARA introduces semantic saliency to guide relational alignment in video diffusion models, improving text following and motion quality over prior alignment methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"or video foundation encoder (REPA [7], VideoREPA [8], MoAlign [9], RefAlign [18], expanded in the next paragraph). A parallel line instead injects auxiliary modalities such as optical flow, pose, or trajectories during continual training, at the cost of requiring those conditions at inference (e.g. Tora [19]). (iii) Post-training preference optimization. Following the RLHF recipe [20], the VDM is fine-tuned against a reward model via GRPO-style on-policy exploration that turns the flow-matching ODE [21] into an SDE [22], DPO-style paired classification over preferred / rejected samples [23, 24], or ReFL-style differentiable-reward back-propagation [25]. 
Post-training is largely orthogonal to SARA's SFT-stage gains, and we leave such combinations to future work."},{"citing_arxiv_id":"2605.06230","ref_index":64,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence","primary_cat":"cs.AI","submitted_at":"2026-05-07T13:21:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"standardized, typically following the SFT-RM-PPO (Supervised Fine-Tuning, Reward Modeling, Proximal Policy Optimization) pipeline: first, the model was aligned using demonstration or preference data, then further optimized through reinforcement learning with a reward model and PPO-based methods. This approach closely mirrored earlier representative works such as InstructGPT [64] and Constitutional AI [5]. Thus, the core task during this phase was the engineering and encapsulation of standard components like SFT, RM, and PPO, rather than explicitly addressing multi-round interactive training in more complex environments. 
As model scale, training costs, and online sampling expenses continued to rise, the focus of training frameworks shifted from simply \"executing RLHF\" to \"organizing large-scale"},{"citing_arxiv_id":"2605.04243","ref_index":59,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA","primary_cat":"cs.AI","submitted_at":"2026-05-05T19:30:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks when representations are correct.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02348","ref_index":26,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Decoding-Time Debiasing via Process Reward Models: From Controlled Fill-in to Open-Ended Generation","primary_cat":"cs.CL","submitted_at":"2026-05-04T08:51:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Decoding-time use of process reward models for bias mitigation raises fairness scores by up to 0.40 on a bilingual benchmark while preserving fluency across four LLMs and extends to open-ended generation with low overhead.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16972","ref_index":3,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"MCPO: Mastery-Consolidated Policy Optimization for Large Reasoning 
Models","primary_cat":"cs.AI","submitted_at":"2026-04-18T11:43:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MCPO fixes vanishing training signals and shrinking weights in GRPO by using a hinge-KL regularizer on mastered prompts and prioritizing majority-correct prompts, yielding higher pass@1 and pass@k on math tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07413","ref_index":29,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios","primary_cat":"cs.CV","submitted_at":"2026-04-08T12:23:27+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FORGE benchmark shows domain-specific knowledge, not visual grounding, is the main bottleneck for MLLMs in manufacturing, with SFT on a 3B model delivering up to 90.8% relative accuracy improvement on held-out scenarios.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03044","ref_index":69,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency","primary_cat":"cs.CL","submitted_at":"2026-04-03T13:52:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"JoyAI-LLM Flash delivers a 48B MoE LLM with 2.7B active parameters per token via FiberPO RL and dense multi-token prediction, released with checkpoints on Hugging Face.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Large language models are no longer single, monolithic policies: they are increasingly deployed and trained as heterogeneous systems-agentic pipelines spanning domains 
and tools, mixture-of-experts (MoE) architectures with conditional routing, and distributed/asynchronous training stacks where optimization noise and data nonstationarity are structural rather than incidental. In this regime, alignment via RLHF [69] must simultaneously handle multi-scale instability: token-level stochasticity, trajectory-level drift, and system-level heterogeneity (domains/experts/agents) interacting in the same update. Existing PPO-style \"proximal\" objectives [70, 71, 72] provide only coarse local controls (mostly per-token clipping) and limited diagnostics when failures arise from global structure (e."},{"citing_arxiv_id":"2604.16403","ref_index":79,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Computational Hermeneutics: Evaluating generative AI as a cultural technology","primary_cat":"cs.AI","submitted_at":"2026-03-31T12:18:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Generative AI should be evaluated through computational hermeneutics using iterative, human-inclusive benchmarks that measure cultural context rather than isolated model outputs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"of interactions and interfaces that frame it. The effects of this collaboration are bidirectional. From human to machine, people decide what data the systems are trained on [24]; formulate objective functions that reflect a specific set of goals, values, and assumptions [59]; fine-tune system behavior through mechanisms like reinforcement learning from human feedback [79]; and \"engineer\" prompts in order to elicit certain kinds of responses [19]. 
At multiple layers of the system, human annotators-who can themselves offer conflicting interpretations [37]-can provide feedback on ambiguous cases, rank responses, or supply preference scores, effectively staging a dialogue where the AI's provisional interpretations can be contested and"},{"citing_arxiv_id":"2510.21122","ref_index":37,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation","primary_cat":"cs.CV","submitted_at":"2025-10-24T03:23:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"NoisyGRPO is an RL framework that perturbs visual inputs with Gaussian noise for exploration and computes trajectory advantages via Bayesian posterior fusion of noise prior and reward likelihood to improve multimodal CoT generalization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.23678","ref_index":44,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Grounded Reinforcement Learning for Visual Reasoning","primary_cat":"cs.CV","submitted_at":"2025-05-29T17:20:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ViGoRL introduces visually grounded RL that anchors reasoning steps to image coordinates and uses multi-turn zooming to outperform standard RL and supervised baselines on spatial and GUI reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.10978","ref_index":54,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Group-in-Group Policy Optimization for LLM Agent 
Training","primary_cat":"cs.LG","submitted_at":"2025-05-16T08:26:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while keeping the same rollout and memory footprint.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019. [53] Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008-3021, 2020. [54] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730-27744, 2022. [55] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and"}],"limit":50,"offset":0}