{"total":14,"items":[{"citing_arxiv_id":"2606.00620","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FlowNar: Scalable Streaming Narration for Long-Form Videos","primary_cat":"cs.CV","submitted_at":"2026-05-30T08:51:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FlowNar achieves bounded memory and 3x higher throughput for streaming narration on Ego4D, EgoExo4D, and EpicKitchens100 by combining dynamic historical context removal with a Cross Linear Attentive Memory module.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11516","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Agents Should Replace Narrow Predictive AI as the Orchestrator in 6G AI-RAN","primary_cat":"cs.NI","submitted_at":"2026-05-12T04:39:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Position paper proposes replacing fragmented narrow AI models with LLMs as the cognitive orchestrator in the RAN Intelligent Controller for Level 5 autonomous 6G networks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"scheduling, but its reward function and operational bounds are dynamically rewritten by the overseeing SLM based on real-time semantic context. Furthermore, rapid advancements in speculative decoding, KV-cache optimization, and dedicated neural processing units (NPUs) natively integrated into baseband hardware will continue to drive down the inference latency of these orchestrating agents [26], [40]. Closely coupled with the computational overhead is the significant capital expenditure associated with deploying Graphics Processing Units (GPUs). 
The telecommunications sector, which operates on tight profit margins and massive physical scale, is understandably hesitant to replace cost-effective, highly optimized Application-Specific Integrated Circuits (ASICs)"},{"citing_arxiv_id":"2605.06165","ref_index":168,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost","primary_cat":"cs.AI","submitted_at":"2026-05-07T12:51:49+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04901","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"On the (In-)Security of the Shuffling Defense in the Transformer Secure Inference","primary_cat":"cs.CR","submitted_at":"2026-05-06T13:31:15+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An attack aligns differently shuffled intermediate activations from secure Transformer inference queries to recover model weights with low error using roughly one dollar of queries.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Zhicong Huang, Wen-jie Lu, Cheng Hong, and Jiansheng Ding. 2022. Cheetah: Lean and fast secure two-party deep neural network inference. IACR Cryptol. ePrint Arch., 2022:207. Mitsuru Ito, Akira Saito, and Takao Nishizeki. 1989. Secret sharing scheme realizing general access structure. Electronics and Communications in Japan (Part III: Fundamental Electronic Science), 72(9):56-64. 
Matthew Jagielski, Nicholas Carlini, David Berthelot, Alex Kurakin, and Nicolas Papernot. 2020. High accuracy and high fidelity extraction of neural networks. In Proceedings of the 29th USENIX Conference on Security Symposium, pages 1345-1362. Andes YL Kei and Sherman SM Chow. 2025. Shaft: Secure, handy, accurate, and fast transformer infer-"},{"citing_arxiv_id":"2604.25975","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rethinking KV Cache Eviction via a Unified Information-Theoretic Objective","primary_cat":"cs.LG","submitted_at":"2026-04-28T12:28:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"KV cache eviction is unified under an information capacity maximization principle derived from a linear-Gaussian attention surrogate, with CapKV proposed as a leverage-score based implementation that outperforms prior heuristics in experiments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17935","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"How Much Cache Does Reasoning Need? 
Depth-Cache Tradeoffs in KV-Compressed Transformers","primary_cat":"cs.LG","submitted_at":"2026-04-20T08:15:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Transformers need depth scaling as the product of ceil(k/s) and log n terms for k-hop pointer chasing under cache size s, with a conjectured lower bound, proved upper bound via windowed pointer doubling, and an adaptive-oblivious error separation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04921","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TriAttention: Efficient Long Reasoning with Trigonometric KV Compression","primary_cat":"cs.CL","submitted_at":"2026-04-06T17:58:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TriAttention compresses KV cache by exploiting stable pre-RoPE Q/K concentration and trigonometric distance preferences to match full-attention reasoning accuracy with far lower memory and higher speed.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04500","ref_index":62,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward","primary_cat":"cs.CV","submitted_at":"2026-04-06T07:51:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"However, as our ultimate goal is to use the saliency maps to guide 
the training, the efficiency is also very important. Our proposal does not requires additional backward process or multiple forward process. Many attention implementations naturally support returning attention weights, and the termW l v,j hl−1 p is also available within KV cache [62]. Therefore, the saliency maps do not incur extra computation. In addition, previous literature pointed out that the indirect contribution is minor compared to the direct contribution [21, 24, 86]. Therefore, aligning the dominant contribution is sufficient to steer the model's attention to the correct regions. 3.2. Holistic Saliency-map Aggregation with Think-"},{"citing_arxiv_id":"2603.22910","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EchoKV: Efficient KV Cache Compression via Similarity-Based Reconstruction","primary_cat":"cs.CL","submitted_at":"2026-03-24T07:58:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EchoKV compresses LLM KV caches by reconstructing missing components from partial data via inter- and intra-layer attention similarities, outperforming prior methods on LongBench and RULER while supporting on-demand full-cache inference.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.10718","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining","primary_cat":"cs.LG","submitted_at":"2026-02-11T10:24:42+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SnapMLA achieves up to 1.91x higher throughput in long-output MLA decoding using FP8 quantization and specialized kernels while keeping benchmark quality near the BF16 
baseline.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.21623","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OjaKV: Context-Aware Online Low-Rank KV Cache Compression","primary_cat":"cs.CL","submitted_at":"2025-09-25T21:42:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OjaKV introduces hybrid full-rank storage for key tokens combined with online low-rank KV cache compression via Oja's algorithm to support memory-efficient long-context LLM inference.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.16419","ref_index":156,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models","primary_cat":"cs.CL","submitted_at":"2025-03-20T17:59:38+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[154] Yi Shen, Jian Zhang, Jieyun Huang, Shuming Shi, Wenjing Zhang, Jiangze Yan, Ning Wang, Kai Wang, and Shiguo Lian. Dast: Difficulty-adaptive slow-thinking for large reasoning models. arXiv preprint arXiv:2503.04472, 2025. 4, 6, 7 [155] Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, and Yulan He. Codi: Compressing chain-of-thought into continuous space via self-distillation. arXiv preprint arXiv:2502.21074, 2025. 4, 10 [156] Luohe Shi, Hongyi Zhang, Yao Yao, Zuchao Li, and Hai Zhao. Keep the cost down: A review on methods to optimize llm's kv-cache consumption. arXiv preprint arXiv:2407.18003, 2024. 
2 [157] Vaishnavi Shrivastava, Ahmed Awadallah, Vidhisha Balachandran, Shivam Garg, Harkirat Behl, and Dimitris Papailiopoulos. Sample more to think less: Group filtered policy optimiza-"},{"citing_arxiv_id":"2502.20295","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Judge a Book by its Cover: Investigating Multi-Modal LLMs for Multi-Page Handwritten Document Transcription","primary_cat":"cs.LG","submitted_at":"2025-02-27T17:21:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces OCR+PAGE-1 and OCR+PAGE-N prompting strategies that improve zero-shot multi-page handwritten document transcription by sharing context across pages.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.13846","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation","primary_cat":"cs.CL","submitted_at":"2024-10-17T17:58:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LightTransfer identifies lazy layers in LLMs like LLaMA and replaces their attention with streaming attention to form hybrid models, delivering up to 2.17x throughput with under 1.5% drop on LongBench and strong results on reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}