{"total":15,"items":[{"citing_arxiv_id":"2606.28615","ref_index":30,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"What LLMs explain is not what they believe: Evaluating explanation sufficiency under models' own input beliefs","primary_cat":"cs.LG","submitted_at":"2026-06-26T21:14:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Proposes SCSuff metric for evaluating LLM explanation sufficiency via model-generated alternative inputs, showing explanations are typically insufficient and predictable from hidden states.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.26040","ref_index":78,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"AI translation of literary texts is \"fine\", but readers still prefer human translations","primary_cat":"cs.CL","submitted_at":"2026-06-24T17:15:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Human readers prefer human literary translations over AI-generated ones for immersion and clarity despite finding MT adequate and struggling to identify the source.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.24162","ref_index":37,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"BehaviorBench: Benchmarking Foundation Models for Behavioral Science Tasks","primary_cat":"cs.CL","submitted_at":"2026-06-23T05:30:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BehaviorBench is a benchmark for foundation models on behavioral tasks that reveals fine-tuned behavioral models outperform general models on distributional alignment while general models lead on individual-level accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.12807","ref_index":71,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Detect, Remask, Repair: Diffusion Editing for Faithful Summarization of Evolving Contexts","primary_cat":"cs.CL","submitted_at":"2026-06-11T02:05:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Diffusion-based localized editing framework for faithful summarization of evolving contexts, introducing the StreamSum benchmark and showing tradeoffs in faithfulness, speed, and preservation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.08748","ref_index":28,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"HydraQE: OSU's Submission for the IWSLT 2026 Speech Translation Metrics Shared Task","primary_cat":"cs.CL","submitted_at":"2026-06-07T17:38:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"HydraQE is a new end-to-end speech translation QE system using Qwen3-ASR backbone, sparsemax layer mixing, bidirectional Transformer, and multi-task curriculum training on human and pseudo labels that outperforms cascaded baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13596","ref_index":14,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations","primary_cat":"cs.CL","submitted_at":"2026-05-13T14:30:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Automatic evaluation tools for literary translations correlate poorly with expert human judgments on creativity and exhibit bias favoring machine-translated texts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09533","ref_index":38,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Assessment of RAG and Fine-Tuning for Industrial Question-Answering-Applications","primary_cat":"cs.CL","submitted_at":"2026-05-10T13:35:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"RAG is more effective and cost-efficient than fine-tuning for industrial QA adaptation on automotive datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09098","ref_index":4,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Dynamic Meta-Metrics: Source-Sentence Conditioned Weighting for MT Evaluation","primary_cat":"cs.CL","submitted_at":"2026-05-09T18:12:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Dynamic Meta-Metrics learns source-sentence conditioned combinations of MT metrics, with MLP-based and soft-conditioned versions showing gains over linear and GP ensembles on WMT data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19185","ref_index":32,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"SCURank: Ranking Multiple Candidate Summaries with Summary Content Units for Enhanced Summarization","primary_cat":"cs.CL","submitted_at":"2026-04-21T07:51:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SCURank ranks multiple summary candidates with Summary Content Units to outperform ROUGE and LLM-based methods in summarization distillation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19144","ref_index":45,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation","primary_cat":"cs.CL","submitted_at":"2026-04-21T06:48:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.21819","ref_index":6,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Self-Preference Bias in LLM-as-a-Judge","primary_cat":"cs.CL","submitted_at":"2024-10-29T07:42:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs judge their own outputs higher because they assign better scores to lower-perplexity text, even when the text is not self-generated.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.18796","ref_index":23,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models","primary_cat":"cs.CL","submitted_at":"2024-04-29T15:33:23+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A panel of smaller diverse LLMs outperforms a single large model as an evaluator of generations, showing less intra-model bias and over 7x lower cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2312.11805","ref_index":90,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Gemini: A Family of Highly Capable Multimodal Models","primary_cat":"cs.CL","submitted_at":"2023-12-19T02:39:27+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Gemini Ultra reaches human-expert performance on MMLU for the first time and sets new state-of-the-art results on 30 of 32 benchmarks, including all 20 multimodal ones tested.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.10403","ref_index":135,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"PaLM 2 Technical Report","primary_cat":"cs.CL","submitted_at":"2023-05-17T17:46:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2101.00190","ref_index":85,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Prefix-Tuning: Optimizing Continuous Prompts for Generation","primary_cat":"cs.CL","submitted_at":"2021-01-01T08:00:36+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Prefix-tuning matches or exceeds fine-tuning on NLG tasks by optimizing a continuous prefix using 0.1% of parameters while keeping the LM frozen.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}