{"total":11,"items":[{"citing_arxiv_id":"2606.00603","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Toward Agentic Governance: What Shapes LLM-Agent Intervention in Public Forums?","primary_cat":"cs.CY","submitted_at":"2026-05-30T08:01:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Four deployment choices—model version, open/closed weight status, provider, and system prompt—each alter LLM-agent intervention rates on forum posts, with closed-weight models declining more on visible challenges than open-weight models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13905","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Non-Destructive Methodological Framework for Modernizing Legacy Clinical Reporting Systems for AI-Driven Pharmacoinformatics: A SAS Case Study","primary_cat":"cs.SE","submitted_at":"2026-05-13T01:15:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A metadata framework modernizes legacy SAS clinical reporting for AI by adding a non-destructive wrapper layer, achieving 92% code reduction on consolidation and high report parity in validations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06365","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work","primary_cat":"cs.AI","submitted_at":"2026-05-07T14:39:37+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state 
preservation in controlled update tasks where loop-based agents introduce churn and contamination.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"whose traces are queried after the fact. Execution lineage, as we define it, moves provenance into the execution substrate itself: lineage helps determine identity, replay, and invalidation during execution. 2.9 Reproducibility and Evaluation LLM reproducibility challenges are well documented [37]. Variance arises from sampling, model updates, and tool interactions. Prior work has correctly emphasized that reproducibility in language model systems is limited by multiple factors, including decoding stochasticity, external API variation, and infrastructure drift. Work on interactive evaluation [52] and realistic agent benchmarks [42, 43] likewise shows that even controlled environments exhibit sensitivity to rollout"},{"citing_arxiv_id":"2605.01391","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VISTA: Video Interaction Spatio-Temporal Analysis Benchmark","primary_cat":"cs.CV","submitted_at":"2026-05-02T11:28:20+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15409","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference","primary_cat":"cs.LG","submitted_at":"2026-04-16T15:59:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FP16 KV caching in transformers causes deterministic token divergence versus cache-free inference due to non-associative 
floating-point accumulation orderings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13346","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AgentSPEX: An Agent SPecification and EXecution Language","primary_cat":"cs.CL","submitted_at":"2026-04-14T23:16:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AgentSPEX is a new language and harness for explicitly specifying and running structured LLM-agent workflows with typed steps, control flow, parallel execution, and a visual editor.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15326","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Analyzing the Presentation, Content, and Utilization of References in LLM-powered Conversational AI Systems","primary_cat":"cs.HC","submitted_at":"2026-03-06T02:38:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLM chat systems show large differences in reference quantity and quality, but users rarely click or engage with them.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.15503","ref_index":22,"ref_count":4,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Guidelines for Empirical Studies in Software Engineering involving Large Language Models","primary_cat":"cs.SE","submitted_at":"2025-08-21T12:30:30+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The paper delivers a taxonomy of seven LLM study types in software engineering along with eight guidelines that separate mandatory requirements from recommended practices to address reproducibility 
challenges.","context_count":2,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Benchmarking is the process of evaluating an LLM's performance using standardized tasks and metrics, which requires high-quality reference datasets. LLM output is compared to a ground truth from the benchmark dataset using general metrics for text generation, such as ROUGE, BLEU, or METEOR [50], or task-specific metrics, such as CodeBLEU for code generation. For example, HumanEval [22] is often used to assess code generation, establishing it as a de facto standard. Example(s). In SE, benchmarking may include the evaluation of an LLM's ability to produce accurate and reliable outputs for a given input, usually a task description, which may be accompanied by data obtained from curated real-world projects or from synthetic SE-specific datasets."},{"citing_arxiv_id":"2406.06608","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Prompt Report: A Systematic Survey of Prompt Engineering Techniques","primary_cat":"cs.CL","submitted_at":"2024-06-06T18:10:11+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"This systematic survey organizes prompt engineering into a taxonomy of 58 LLM techniques and 40 others, supplies a shared vocabulary, and offers guidelines for state-of-the-art models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2310.11511","ref_index":156,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection","primary_cat":"cs.CL","submitted_at":"2023-10-17T18:18:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection 
tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.10253","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts","primary_cat":"cs.AI","submitted_at":"2023-09-19T02:19:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GPTFuzz is a black-box fuzzing framework that mutates seed jailbreak templates to automatically generate effective attacks, achieving over 90% success rates on models including ChatGPT and Llama-2.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}