archive

Every paper Pith has read.

4138 papers in cs.CL · page 1

cs.CV 2026-05-14 reviewed

One token unifies agentic and latent visual reasoning
ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

Pheng-Ann Heng +3
cs.LG 2026-05-14 reviewed

FutureSim shows top AI agents predict events at 25% accuracy
FutureSim: Replaying World Events to Evaluate Adaptive Agents

Ameya Prabhu +7
cs.CL 2026-05-14 reviewed

Grep beats vector search in most agentic tasks
Is Grep All You Need? How Agent Harnesses Reshape Agentic Search

Akhil Kasturi +4
cs.CR 2026-05-14 reviewed

Length alone triggers LLM backdoors to leak secrets
MetaBackdoor: Exploiting Positional Encoding as a Backdoor Attack Surface in LLMs

Ahmed Salem +4
cs.CL 2026-05-14 reviewed

EHR tables sharpen timing in text-based clinical timelines
Text Knows What, Tables Know When: Clinical Timeline Reconstruction via Retrieval-Augmented Multimodal Alignment

Jeremy C. Weiss +3
cs.CL 2026-05-14 reviewed

Memory model lets LLMs add knowledge without retraining
MeMo: Memory as a Model

Alfred Wei Lun Leong +8
cs.CR 2026-05-14 reviewed

The paper builds a 507-leaf taxonomy of LLM inference attacks from 932 recent security…
Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks

Alexey A. Shvets +3
cs.CL 2026-05-14 reviewed

The paper presents a framework that converts existing text-based tool-calling benchmarks…
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

Jonas Robertson +5
cs.LG 2026-05-14 reviewed

128 random demos suffice for strong RLVR results
Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance

Alexander G. Schwing +2
cs.AI 2026-05-14 reviewed

Decomposing traces boosts AI agent diagnosis accuracy up to 12x
Holistic Evaluation and Failure Diagnosis of AI Agents

Alon Mecilati +14
cs.CV 2026-05-14 reviewed

Internal masking cuts hallucinations in vision-language models
Do We Really Need External Tools to Mitigate Hallucinations? SIRA: Shared-Prefix Internal Reconstruction of Attribution

Junzhe Chen +5
cs.CL 2026-05-14 reviewed

Terminal anchors extend LLM context to 64K from short sequences
EndPrompt: Efficient Long-Context Extension via Terminal Anchoring

Dawei Yin +11
cs.CL 2026-05-14 reviewed

Denoising paths supply low-cost uncertainty scores for language diffusion models
Uncertainty Quantification for Large Language Diffusion Models

Artem Shelmanov +5
cs.SE 2026-05-14 reviewed

ML classifier beats rules at spotting BDD refactoring chances
Mining Subscenario Refactoring Opportunities in Behaviour-Driven Software Test Suites: ML Classifiers and LLM-Judge Baselines

Ali Hassaan Mughal +2
cs.SE 2026-05-14 reviewed

Memory agent keeps repo documentation consistent
Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation

Changkyu Choi +4
cs.LG 2026-05-14 reviewed

Action tokens carry the training signal in agentic RL
Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy

David Wipf +9
cs.CL 2026-05-14 reviewed

CIPO turns LLM failures into better reasoning
Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards

Boxi Cao +8
cs.CL 2026-05-14 reviewed

Optimal control reformulation gives language models fast parallel sampling at high quality
Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space

Liang Lin +5
cs.CL 2026-05-14 reviewed

Many perfect LLM scores hide dimensional intent failures
Dimension-Level Intent Fidelity Evaluation for Large Language Models: Evidence from Structured Prompt Ablation

Gang Peng
cs.CL 2026-05-14 reviewed

LLM memory systems hit only 46% on group conversations
GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations

Evgeniy Gabrilovich +5
cs.CL 2026-05-14 reviewed

Ming glossaries used flexible Chinese characters to approximate foreign sounds
Cross-Linguistic Transcription and Phonological Representation in the Huìtóngguǎnxì Huáyíyìyǔ

Ji-eun Kim
cs.SE 2026-05-14 reviewed

Stale code snippets make models output outdated helpers
When Retrieval Hurts Code Completion: A Diagnostic Study of Stale Repository Context

Haobin Pan +4
cs.CL 2026-05-14 reviewed

Probe shows RAG follows wrong context in 85 percent of conflict cases
Does RAG Know When Retrieval Is Wrong? Diagnosing Context Compliance under Knowledge Conflict

Huan Xu +6
cs.LG 2026-05-14 reviewed

Guardrails adapt from sparse noisy failures via conservative induction
LiSA: Lifelong Safety Adaptation via Conservative Policy Induction

Bharath Chandrasekhar +8
cs.LG 2026-05-14 reviewed

Orthogonal projection isolates hallucination signals in LLM answers
When Answers Stray from Questions: Hallucination Detection via Question-Answer Orthogonal Decomposition

Erhu Feng +2
cs.CV 2026-05-14 reviewed

Adaptive gate skips reasoning for simple multimodal inputs
Think When Needed: Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture

Guanghao Zhang +4
cs.CL 2026-05-14 reviewed

Calculus finds optimal vocabulary size for ASR
A Calculus-Based Framework for Determining Vocabulary Size in End-to-End ASR

Sunil Kumar Kopparapu
cs.SE 2026-05-14 reviewed

Agents resolve 45 percent of chained package upgrades
SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades

Chaozheng Wang +7
cs.CL 2026-05-14 reviewed

New scores track whether unlearning works across languages
Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation

Hyeonjin Kim +3
cs.CL 2026-05-14 reviewed

Three-tier memory raises recommender hit rate 26 percent
Agentic Recommender System with Hierarchical Belief-State Memory

Benyu Zhang +10
cs.LG 2026-05-14 reviewed

Synthetic queries trigger up to 5x higher LLM failure rates
NodeSynth: Socially Aligned Synthetic Data for AI Evaluation

Darlene Neal +7
cs.CL 2026-05-14 reviewed

Synthetic augmentation lifts defense classification to 58% accuracy
Mitigating Data Scarcity in Psychological Defense Classification with Context-Aware Synthetic Augmentation

Hoang-Thuy-Duong Vu +2
cs.CL 2026-05-14 reviewed

Geometry scores pick shallow layers for diffusion insertion in transformers
Where Should Diffusion Enter a Language Model? Geometry-Guided Hidden-State Replacement

Hyoungjoon Lee +2
cs.CL 2026-05-14 reviewed

Semantic RL adds low-resource languages without erasing prior skills
Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax

Guixian Xu +9
cs.HC 2026-05-14 reviewed

Short concern texts track with activity drops and sleep issues
A Formative Study of Brief Affective Text as a Complement to Wearable Sensing for Longitudinal Student Health Monitoring

Christopher Danforth +9
cs.CL 2026-05-14 reviewed

LLM filter and clustering finds 41 manipulative narrative clusters
LLM-based Detection of Manipulative Political Narratives

Florian Steuber +2
cs.CL 2026-05-14 reviewed

Transformers score German texts on left-right scale
Ideology Prediction of German Political Texts

Florian Steuber +3
cs.LG 2026-05-14 reviewed

Dynamic Latent Routing beats supervised fine-tuning by 6.6 points
Dynamic Latent Routing

Amir Abdullah +2
cs.CL 2026-05-14 reviewed

Exact prefix factorization removes errors in diffusion language models
Factorization-Error-Free Discrete Diffusion Language Model via Speculative Decoding

Hang Yuan +3
cs.LG 2026-05-14 reviewed

Simple diversity penalty in KV scorer beats complex designs
Minimal-Intervention KV Retention: A Design-Space Study and a Diversity-Penalty Survivor

Libo Sun +3
cs.CR 2026-05-14 reviewed

Hidden noise stops vision-language models learning real content
To See is Not to Learn: Protecting Multimodal Data from Unauthorized Fine-Tuning of Large Vision-Language Model

Chengshuai Zhao +4
cs.CR 2026-05-14 reviewed

Web agents should plan before seeing page content
Web Agents Should Adopt the Plan-Then-Execute Paradigm

Annabella Chow +7
cs.LG 2026-05-14 reviewed

MetaMoE combines independently trained expert models into one Mixture-of-Experts system…
MetaMoE: Diversity-Aware Proxy Selection for Privacy-Preserving Mixture-of-Experts Unification

Shuhao Chen +2
cs.CL 2026-05-14 reviewed

Agent harnesses allow unsafe actions even with correct final outputs
Auditing Agent Harness Safety

Chengzhi Liu +10
cs.AI 2026-05-14 reviewed

Hypergraph reasoner hits 94.7% on supply chain RCA
Hypergraph Enterprise Agentic Reasoner over Heterogeneous Business Systems

Cheng Cheng +10
cs.CL 2026-05-14 reviewed

Spelling and test design confound KVL word difficulty ratings
What Makes Words Hard? Sakura at BEA 2026 Shared Task on Vocabulary Difficulty Prediction

Adam Nohejl +5
cs.LG 2026-05-14 reviewed

Active learners raise NDCG@10 per call in PRP reranking
Active Learners as Efficient PRP Rerankers

Francisco Nattero Santiago Mauricio Barron Bucolo +4
cs.LG 2026-05-14 reviewed

Transformer predicts next disease with 0.871 median AUC across 896 categories
DT-Transformer: A Foundation Model for Disease Trajectory Prediction on a Real-world Health System

Andrew R Weckstein +3
cs.LG 2026-05-14 reviewed

Small mismatches in LLM RL rollout and optimization cause collapse
Diagnosing Training Inference Mismatch in LLM Reinforcement Learning

Geoffrey Fox +7
cs.LG 2026-05-14 reviewed

Prefill-only adapters deliver 1.9x throughput for 512 users
PreFT: Prefill-only finetuning for efficient inference

Andrew Lanpouthakoun +6