Chandra and Dexter C
Mixed citation behavior. Most common role is unclear (48%).
citing papers explorer
-
Star Complexity of Parikh Images of Languages over Infinite Alphabets
Star-height of Parikh images is bounded by 2 for one-register automata but the rational conjecture fails for multiple registers, showing Parikh's theorem does not hold over infinite alphabets.
-
On the Complexity of the Matching Problem of Regular Expressions with Backreferences
k-REWB matching cannot be solved in $O(n^{2k-\epsilon})$ time under SETH, is W[2]-hard parameterized by expression length, and 2-use 2-REWBs require superlinear time unless triangle detection does; 1-use REWBs admit an $O(n \log^2 n)$ algorithm.
-
Evaluating Very Long-Term Conversational Memory of LLM Agents
Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.
-
OpenFinGym: A Verifiable Multi-Task Gym Environment for Evaluating Quant Agents
OpenFinGym is a multi-task verifiable gym environment for quant-finance agents with automated task construction from publications, containerised runtime, paper trading engine, and support for SFT/RL training.
-
Measuring & Mitigating Over-Alignment for LLMs in Multilingual Criminal Law Courts
New benchmark quantifies language- and model-dependent over-alignment in criminal law LLM use and identifies abliteration as an effective mitigation with minimal performance cost.
-
Denoising Iterative Self-Correction: Structured Verification Loops for Reliable LLM Reasoning
DISC is a new iterative verify-judge-correct procedure for LLMs that improves accuracy on reasoning benchmarks by modeling verification as denoising signals and using a gate to control correction precision.
-
Chehre: An Emoji-Prompted Video Dataset for Perceptually Diverse Facial Expression Recognition
Chehre introduces a new emoji-prompted video dataset with multi-annotator labels to benchmark models on dominant and distributional facial expression recognition tasks.
-
NEST: Narrative Event Structures in Time for Long Video Understanding
NEST is a new benchmark dataset for narrative event structures in long videos, with baselines reporting ETD below 8%, EL under 6%, EAE below 11%, and ERE at 35-44% F1.
-
Thermodynamic Signatures of Reasoning: Free-Energy and Spectral-Form-Factor Diagnostics for Hallucination Detection in Large Language Models
Introduces thermodynamic free-energy signatures and spectral form factors from attention Laplacians for hallucination detection, with stability proofs, expressiveness results, a PAC bound, and empirical AUROC gains over baselines.
-
Who Wins the Conflict? Mechanistic Interpretability of Text Bias in Audio LLMs
Mechanistic tracing shows text suppresses but does not erase audio representations in late layers of Audio LLMs; back-patching reduces text dominance.
-
LegalWorld: A Life-Cycle Interactive Environment for Legal Agents
LegalWorld is a life-cycle interactive environment modeling Chinese civil litigation as five causally connected stages grounded in 75,309 judgments, paired with LongJud-Bench for cross-stage agent evaluation.
-
Applicability Condition Extraction for Therapeutic Drug-Disease Relations
Introduces applicability condition extraction for therapeutic drug-disease relations, creates first annotated dataset of 1,119 pairs, and proposes enhanced LoRA method outperforming baselines.
-
Augmenting Molecular Language Models with Local $n$-gram Memory
MolGram integrates a conditional n-gram memory module into molecular language models to address locality gaps in SMILES tokenization, improving performance on generation, forward prediction, and retrosynthesis while outperforming 3x larger baselines.
-
When Does Language Matter? Multilingual Instructions Reveal Step-wise Language Sensitivity in Vision-Language-Action Models
Translating LIBERO to ten languages shows VLA failures under multilingual instructions are driven by language-sensitive steps; a step-wise inference intervention improves performance.
-
HDSL: A Hierarchical Domain-Specific Language for Structured 3D Indoor Scene Generation and Localized Editing with LLM Agents
HDSL is a tree-structured DSL for 3D indoor scenes that lets LLM agents generate subtrees recursively and perform localized edits via hierarchical retrieval and deterministic merge.
-
NüshuVoice: Reviving the Voice of Endangered Nüshu with Pitch-Aware Text-to-Speech
NüshuVoice releases the first sentence-level Nüshu TTS dataset and shows that an F0-conditioned VITS model using five-level pitch notation outperforms baselines on spectral fidelity, pitch accuracy, and intelligibility.
-
Beyond Averages: Evaluating LLMs on Human Survey Replication at the Distributional Level
LLMs match condition-level patterns in a noodle purchase survey but fail to replicate distributional structure, with no model beating a pooled human baseline for purchase quantities.
-
Sample Where You Struggle: Sharpening Base Model Reasoning via Entropy-Guided Power Sampling
EGPS localizes MCMC moves to high-entropy decision points using forward-pass entropy, yielding up to 12.6× wall-clock speedup and best-or-tied accuracy on MATH500, HumanEval, and GPQA for Qwen2.5-Math-7B.
-
SurgiQ: A Large-Scale Multi-Domain Benchmark for Evaluating Surgical Understanding in Large Language Models
SurgiQ is a new 13k-question surgical benchmark showing general-purpose LLMs reach 68.1% accuracy while most biomedical models lag and smaller models stay near random baseline.
-
A Komi-Yazva–Russian Parallel Corpus and Evaluation Protocol for Zero- and Few-Shot LLM Translation
Releases a 457-sentence Komi-Yazva–Russian parallel corpus and shows that retrieval-based few-shot prompting improves LLM translation over zero-shot in this low-resource setting, with performance varying by model and metric.
-
CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments
CollabSim is a new CSCW-grounded simulation framework that enables controlled multi-agent experiments to measure collaborative competence in LLM agents.
-
LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents
LatentSkill uses a hypernetwork to generate LoRA adapters from textual skills, enabling weight-space storage that cuts prefill tokens and boosts agent success rates on ALFWorld and Search-QA.
-
Need to Know: Contextual-Integrity-Grounded Query Rewriting for Privacy-Conscious LLM Delegation
Introduces DelegateCI-Bench (3167 samples) and a CI-guided RL query rewriter that improves privacy-utility tradeoff by up to +10.1 utility over on-device baselines.
-
Can LLM Rerankers Predict Their Own Ranking Performance?
LLM rerankers can internally predict ranking quality via self-consistency of sampled outputs, matching SOTA external QPP while direct confidence is overconfident; supervised token-efficient methods improve calibration.
-
Evaluating LLMs' Effectiveness on Real-World Consumer Device Repair Questions
The paper introduces a benchmark of 991 real-world device repair questions and finds that state-of-the-art LLMs remain unreliable for high-risk repairs, with phone repair hardest and Bangla responses worse than English.
-
AUDITFLOW: Executable Symbolic Environments for Structured Financial Reporting Verification
AuditFlow combines a graph-grounded symbolic environment with a multi-agent LLM setup to reach 82.09% joint audit accuracy on structured financial reports, 14.93 points above the strongest baseline.
-
When Knowledge Is Not Free: Cost-Aware Evidence Selection in Retrieval-Augmented Generation
Defines cost-aware RAG with evidence cost tiers and shows static selectors are brittle while agentic LLM-based selection is promising but model-dependent.
-
Agentic Clustering: Controllable Text Taxonomies via Multi-Agent Refinement
Agentic multi-agent LLM system for controllable text clustering outperforms fixed-pipeline baselines by up to 32% ARI on seven public benchmarks.
-
Hidden Thoughts Are Not Secret: Reasoning Trace Exposure in LLMs
REP elicits hidden LLM reasoning traces via in-context shadow demonstrations, raising similarity to internal traces while retaining distillation utility across datasets and models.
-
TAPS: Target-Aware Prefix Tree Selection for Diffusion-Drafted Speculative Decoding
TAPS converts diffusion marginal probabilities into path-conditioned acceptance estimates to select prefix-closed subtrees under a fixed verification budget, achieving up to 7.9x end-to-end speedup over autoregressive decoding.
-
Detector-Evasive LLM Paraphrasing via Constrained Policy Optimization
DEPO formulates detector-evasive paraphrasing as a constrained MDP and solves it via Lagrangian primal-dual RL with GRPO-style updates to achieve evasion while satisfying a semantic-preservation constraint.
-
Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents
Agentic CLEAR automates multi-level evaluation of LLM agents, generating textual insights at system, trace, and node granularity that align with human annotations and predict task success.
-
From Circuit Evidence to Mechanistic Theory: An Inductive Logic Approach
Introduces Causal Functional Signatures grounded in causal evidence and ILP-learned architectural signatures to enable explicit, comparable, and portable mechanistic claims across model scales.
-
Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation
New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.
-
LongBEL: Long-Context and Document-Consistent Biomedical Entity Linking
LongBEL improves biomedical entity linking consistency by combining full-document context with memory of previous predictions trained via cross-validation rather than gold labels.
-
Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability
A new benchmark dataset drawn from Japan's National Assessment of Academic Ability supplies real exam layouts, diagrams, Japanese text, and nationwide student response distributions for evaluating multimodal LLMs.
-
The Silent Vote: Improving Zero-Shot LLM Reliability by Aggregating Semantic Neighborhoods
Semantic Softmax aggregates probabilities from semantic synonyms around target labels to correct renormalization bias in zero-shot LLM classification, lowering calibration error and raising AUROC and F1.
-
CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation
CA-SQL achieves 51.72% execution accuracy on the challenging tier of the BIRD benchmark using GPT-4o-mini by scaling exploration breadth according to estimated task difficulty, evolutionary prompt seeding, and candidate voting.
-
Accurate and Efficient Statistical Testing for Word Semantic Breadth
A new permutation test uses Householder reflection to align word embedding clouds before testing dispersion differences, cutting Type-I error by 32.5% and achieving a 23x speedup on GPU.
-
Where Do Prompt Perturbations Break Generation? A Segment-Level View of Robustness in LoRA-Tuned Language Models
S²R² improves robustness of LoRA-tuned LLMs to prompt perturbations by penalizing semantic-segment drift while preserving clean performance and cross-dataset transfer.
-
A Multi-View Media Profiling Suite: Resources, Evaluation, and Analysis
Presents MBFC-2025 dataset and multi-view embeddings with fusion methods for media bias and factuality, reporting SOTA results on ACL-2020 and new benchmarks on MBFC-2025.
-
OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving
OptiVerse is a new benchmark spanning neglected optimization domains that shows LLMs suffer sharp accuracy drops on hard problems due to modeling and logic errors, with a Dual-View Auditor Agent proposed to improve performance.
-
Decoding Text Spans for Efficient and Accurate Named-Entity Recognition
SpanDec achieves competitive NER accuracy with improved efficiency by using a final-stage lightweight decoder for span representations and early candidate filtering to reduce redundant computation.
-
ATIR: Towards Audio-Text Interleaved Contextual Retrieval
Defines ATIR task and benchmark for mixed audio-text queries; MLLM model with token compression shows substantial gains over strong baselines.
-
Memory-Augmented LLM-based Multi-Agent System for Automated Feature Generation on Tabular Data
MALMAS is a memory-augmented multi-agent LLM system that generates diverse, high-quality features for tabular data via agent decomposition, routing, and iterative memory-guided refinement.
-
Text-to-Distribution Prediction with Quantile Tokens and Neighbor Context
Quantile tokens inserted into LLM inputs combined with neighbor retrieval enable direct prediction of full distributions, yielding lower MAPE and narrower intervals than baselines on Airbnb and StackSample tasks.
-
Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages
Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharyya distance.
-
Structure Guided Retrieval-Augmented Generation for Factual Queries
SG-RAG frames retrieval as subgraph matching to ensure LLMs meet every condition in factual queries and reports large gains over baselines on a new 120k-pair ERQA dataset.
-
From Experience to Skill: Multi-Agent Generative Engine Optimization via Reusable Strategy Learning
MAGEO is a multi-agent system that distills validated editing patterns into reusable optimization skills for generative engines, outperforming heuristic baselines on visibility and fidelity via a new benchmark and evaluation protocol.
-
Cell-Based Representation of Relational Binding in Language Models
Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the matching cell.