archive
Every paper Pith has read. Search by title, abstract, or pith.
4138 papers in cs.CL · page 1
-
One token unifies agentic and latent visual reasoning
ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both
-
FutureSim shows top AI agents predict events at 25% accuracy
FutureSim: Replaying World Events to Evaluate Adaptive Agents
-
Grep beats vector search in most agentic tasks
Is Grep All You Need? How Agent Harnesses Reshape Agentic Search
-
Length alone triggers LLM backdoors to leak secrets
MetaBackdoor: Exploiting Positional Encoding as a Backdoor Attack Surface in LLMs
-
EHR tables sharpen timing in text-based clinical timelines
Text Knows What, Tables Know When: Clinical Timeline Reconstruction via Retrieval-Augmented Multimodal Alignment
-
Memory model lets LLMs add knowledge without retraining
MeMo: Memory as a Model
-
The paper builds a 507-leaf taxonomy of LLM inference attacks from 932 recent security…
Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks
-
The paper presents a framework that converts existing text-based tool-calling benchmarks…
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents
-
128 random demos suffice for strong RLVR results
Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance
-
Decomposing traces boosts AI agent diagnosis accuracy up to 12x
Holistic Evaluation and Failure Diagnosis of AI Agents
-
Internal masking cuts hallucinations in vision-language models
Do We Really Need External Tools to Mitigate Hallucinations? SIRA: Shared-Prefix Internal Reconstruction of Attribution
-
Terminal anchors extend LLM context to 64K from short sequences
EndPrompt: Efficient Long-Context Extension via Terminal Anchoring
-
Denoising paths supply low-cost uncertainty scores for language diffusion models
Uncertainty Quantification for Large Language Diffusion Models
-
ML classifier beats rules at spotting BDD refactoring chances
Mining Subscenario Refactoring Opportunities in Behaviour-Driven Software Test Suites: ML Classifiers and LLM-Judge Baselines
-
Memory agent keeps repo documentation consistent
Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation
-
Action tokens carry the training signal in agentic RL
Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy
-
CIPO turns LLM failures into better reasoning
Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards
-
Optimal control reformulation gives language models fast parallel sampling at high quality
Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space
-
Many perfect LLM scores hide dimensional intent failures
Dimension-Level Intent Fidelity Evaluation for Large Language Models: Evidence from Structured Prompt Ablation
-
LLM memory systems hit only 46% on group conversations
GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations
-
Ming glossaries used flexible Chinese characters to approximate foreign sounds
Cross-Linguistic Transcription and Phonological Representation in the Huìtóngguǎnxì Huáyíyìyǔ
-
Stale code snippets make models output outdated helpers
When Retrieval Hurts Code Completion: A Diagnostic Study of Stale Repository Context
-
Probe shows RAG follows wrong context in 85 percent of conflict cases
Does RAG Know When Retrieval Is Wrong? Diagnosing Context Compliance under Knowledge Conflict
-
Guardrails adapt from sparse noisy failures via conservative induction
LiSA: Lifelong Safety Adaptation via Conservative Policy Induction
-
Orthogonal projection isolates hallucination signals in LLM answers
When Answers Stray from Questions: Hallucination Detection via Question-Answer Orthogonal Decomposition
-
Adaptive gate skips reasoning for simple multimodal inputs
Think When Needed: Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture
-
Calculus finds optimal vocabulary size for ASR
A Calculus-Based Framework for Determining Vocabulary Size in End-to-End ASR
-
Agents resolve 45 percent of chained package upgrades
SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades
-
New scores track whether unlearning works across languages
Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation
-
Three-tier memory raises recommender hit rate 26 percent
Agentic Recommender System with Hierarchical Belief-State Memory
-
Synthetic queries trigger up to 5x higher LLM failure rates
NodeSynth: Socially Aligned Synthetic Data for AI Evaluation
-
Synthetic augmentation lifts defense classification to 58% accuracy
Mitigating Data Scarcity in Psychological Defense Classification with Context-Aware Synthetic Augmentation
-
Geometry scores pick shallow layers for diffusion insertion in transformers
Where Should Diffusion Enter a Language Model? Geometry-Guided Hidden-State Replacement
-
Semantic RL adds low-resource languages without erasing prior skills
Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax
-
Short concern texts track with activity drops and sleep issues
A Formative Study of Brief Affective Text as a Complement to Wearable Sensing for Longitudinal Student Health Monitoring
-
LLM filter and clustering finds 41 manipulative narrative clusters
LLM-based Detection of Manipulative Political Narratives
-
Transformers score German texts on left-right scale
Ideology Prediction of German Political Texts
-
Dynamic Latent Routing beats supervised fine-tuning by 6.6 points
Dynamic Latent Routing
-
Exact prefix factorization removes errors in diffusion language models
Factorization-Error-Free Discrete Diffusion Language Model via Speculative Decoding
-
Simple diversity penalty in KV scorer beats complex designs
Minimal-Intervention KV Retention: A Design-Space Study and a Diversity-Penalty Survivor
-
Hidden noise stops vision-language models learning real content
To See is Not to Learn: Protecting Multimodal Data from Unauthorized Fine-Tuning of Large Vision-Language Model
-
Web agents should plan before seeing page content
Web Agents Should Adopt the Plan-Then-Execute Paradigm
-
MetaMoE combines independently trained expert models into one Mixture-of-Experts system…
MetaMoE: Diversity-Aware Proxy Selection for Privacy-Preserving Mixture-of-Experts Unification
-
Agent harnesses allow unsafe actions even with correct final outputs
Auditing Agent Harness Safety
-
Hypergraph reasoner hits 94.7% on supply chain RCA
Hypergraph Enterprise Agentic Reasoner over Heterogeneous Business Systems
-
Spelling and test design confound KVL word difficulty ratings
What Makes Words Hard? Sakura at BEA 2026 Shared Task on Vocabulary Difficulty Prediction
-
Active learners raise NDCG@10 per call in PRP reranking
Active Learners as Efficient PRP Rerankers
-
Transformer predicts next disease with 0.871 median AUC across 896 categories
DT-Transformer: A Foundation Model for Disease Trajectory Prediction on a Real-world Health System
-
Small mismatches in LLM RL rollout and optimization cause collapse
Diagnosing Training Inference Mismatch in LLM Reinforcement Learning
-
Prefill-only adapters deliver 1.9x throughput for 512 users
PreFT: Prefill-only finetuning for efficient inference