{"total":13,"items":[{"citing_arxiv_id":"2606.23043","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Deviance from a pink noise regime in the temporal organization of semantic relations in psychosis","primary_cat":"cond-mat.stat-mech","submitted_at":"2026-06-22T08:49:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Patients with psychosis exhibit elevated DFA scaling exponents in BERT-derived semantic similarity time series from transcripts, indicating excessive persistence in semantic fluctuations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31069","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Effective Long-Video Event Prediction via Multi-Level Event Semantics Mining","primary_cat":"cs.CV","submitted_at":"2026-05-29T09:38:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VISTA mines multi-level event semantics via visual prompts, knowledge-enhanced retrieval, and proposal integration to improve long-video event prediction over existing LVLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28480","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Audio-Mind: An Auditable Agentic Framework for Audio Understanding","primary_cat":"eess.AS","submitted_at":"2026-05-27T13:39:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Audio-Mind introduces a conditional, auditable agentic framework for audio understanding that preserves frontend judgment and acquires bounded external evidence only when needed, reporting 80.4% on MMAR and 82.8% on 
MSU-Bench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22732","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond Acoustic Emotion Recognition: Multimodal Pathos Analysis in Political Speech Using LLM-Based and Acoustic Emotion Models","primary_cat":"cs.AI","submitted_at":"2026-05-21T17:03:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Multimodal LLM analysis correlates better with TRUST-Pathos than acoustic SER models in a case study of one Bundestag speech, while acoustic features help with arousal.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21132","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SurgOnAir: Hierarchy-Aware Real-Time Surgical Video Commentary","primary_cat":"cs.CV","submitted_at":"2026-05-20T13:04:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SurgOnAir introduces a streaming vision-language model trained on a hierarchical surgical dataset to generate real-time, multi-level narrations with explicit transition tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16555","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MedASR: An Open-Source Model for High-Accuracy Medical Dictation","primary_cat":"eess.AS","submitted_at":"2026-05-15T18:57:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"MedASR is an open-source 105M-parameter ASR model achieving 58% relative WER reduction versus Whisper Large-v3 on medical 
dictation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27010","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Quantifying the Cost of Manual Navigation: A Comparison of Gesture-Based Magnification versus Direct Access Reading in Digital Layout-based Documents","primary_cat":"cs.HC","submitted_at":"2026-04-29T09:18:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Large-print editions of layout-based documents outperform gesture-based magnification by 18% in reading speed and 30% in target location speed while restoring natural reading strategies and reducing workload.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"significant main effects and interactions. Complete model statistics, including F-values, degrees of freedom, and detailed pairwise comparisons, are provided in the supplementary tables in the Appendix. We do not report main effects and interactions that were not significant. 5.2 Task 1 analysis We first transcribed participants' audio recordings using WhisperX [8], reconstructing a task timeline for each participant. Each timeline was segmented into reading (R) and transition (T) periods (see Fig. 5). From this timeline, we derived four metrics to test hypotheses H1 and H2: (1) Success ratio: The proportion of correctly read articles out of the total (e.g., 5/6 = 83.3% in Fig. 5). 
Note that a headline was considered correctly read if its transcription was phonetically"},{"citing_arxiv_id":"2604.25611","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition","primary_cat":"cs.CL","submitted_at":"2026-04-28T13:18:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"WhisperPipe delivers 89 ms median latency and 48% lower peak GPU memory than standard Whisper while keeping word error rate within 2% of the offline model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Whisper for continuous real-time transcription while maintaining bounded memory usage and improving output stability. WhisperPipe employs an adaptive dual-buffer design: a committed text buffer stores finalized, immutable transcripts, while a compact active audio buffer retains only the most recent audio window required for accurate recognition [26, 27]. Instead of reprocessing the entire audio history, WhisperPipe confines inference to the active window plus a fixed look-back context, achieving steady-state bounded computation and flat memory usage during extended operation [28, 29]. 
To determine when partial hypotheses can be safely finalized, WhisperPipe implements an adaptive two-tier commit policy."},{"citing_arxiv_id":"2604.24416","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Scaling Properties of Continuous Diffusion Spoken Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-27T12:45:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Continuous diffusion spoken language models follow scaling laws for loss and phoneme divergence and generate emotive multi-speaker speech at 16B scale, though long-form coherence stays difficult.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23295","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations","primary_cat":"cs.CL","submitted_at":"2026-04-25T13:18:40+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15736","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees","primary_cat":"cs.CV","submitted_at":"2026-04-17T06:22:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal 
grounding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10456","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Benchmark and Multi-Agent System for Instruction-driven Cinematic Video Compilation","primary_cat":"cs.CV","submitted_at":"2026-04-12T04:39:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CineAgents is a multi-agent system that builds hierarchical narrative memory via script reverse-engineering and uses iterative planning to produce instruction-driven cinematic video compilations with better coherence than prior methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06694","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AudioKV: KV Cache Eviction in Efficient Large Audio Language Models","primary_cat":"cs.SD","submitted_at":"2026-04-08T05:20:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AudioKV prioritizes audio-critical attention heads identified via ASR analysis and applies spectral score smoothing to evict KV cache tokens, achieving high compression with minimal accuracy loss in LALMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}