Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.
Mixed citation behavior. Most common role is unclear (64%).
claims ledger
- background vey of graph meets large language model: progress and future directions. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, pages 8123-8131. Andrés Montoyo, Patricio Martínez-Barco, and Alexandra Balahur. 2012. Subjectivity and sentiment analysis: An overview of the current state of the area and envisaged developments. Decision Support Systems, 53(4):675-679. Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? sentiment classificatio
- background whether $u$ is semantically broader than $k$. The two samples are expressed as $X = \{x_i\}_{i=1}^{n}$ and $Y = \{y_j\}_{j=1}^{m}$ (3), where $x_i = x_{u,i}$ and $y_j = x_{k,j}$, with $n, m$ fixed (typically, we subsample to a common size to control the variance across words). A natural null hypothesis is that the two words have the same dispersion but different mean directions: $H_0 : \mathrm{disp}(X) = \mathrm{disp}(Y)$ with $E[X] \neq E[Y]$ allowed (4). This is because the mean direction is a strong nuisance factor in contextual embedding spaces. Even if two wo
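- background domain. Given $X_v \in \mathbb{R}^{S_v \times D_v}$ and $X_t \in \mathbb{R}^{S_t \times D_t}$, the goal is to refine $X_v$ by aggregating contextual information across scales. We define $N$ scales with two adapter sets: $\mathcal{G} = \{\mathcal{G}_1, \dots, \mathcal{G}_N\}$ (MGFA) and $\mathcal{C} = \{\mathcal{C}_1, \dots, \mathcal{C}_N\}$ (MCFA). At each scale $n$, features are reshaped to a grid $X_v^{(0)} \in \mathbb{R}^{H \times W \times D_v}$ and downsampled by $\mathrm{Down}(\cdot, 2^{n-1})$: $X_v^{(n)} = \mathrm{Down}(X_v^{(0)}, 2^{n-1})$ (4). Let $X_{v,n} = \mathrm{Seq}(X_v^{(n)})$ denote the flattened sequence. We then refine and fuse: $G_n = \mathcal{G}_n(X_{v,n})$, $C_n = \mathcal{C}_n(X_{v,n}, X_t)$ (5), $\tilde{X}_{v,n} = G_n + w\,C_n$ (6)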
- other Question: Eukaryotic genes tend to consist of coding regions (exons) and non-coding regions (introns). The figure shows how such a gene leads to the production of a protein. Which of the following statements is true? A. Thymine content of (1) and (2) is approximately equal. B. The process occurring between (2) and (3) takes place in the cytosol. C. (4) can hybridise with (2). D. The number of amino acid residues in (5) must equal the number of nucleotide residues in (2). E. All processes occurri
- background Question: Eukaryotic genes tend to consist of coding regions (exons) and non-coding regions (introns). The figure shows how such a gene leads to the production of a protein. Which of the following statements is true? A. Thymine content of (1) and (2) is approximately equal. B. The process occurring between (2) and (3) takes place in the cytosol. C. (4) can hybridise with (2). D. The number of amino acid residues in (5) must equal the number of nucleotide residues in (2). E. All processes occurri
- other sharing & image reaction functions are integrated to add a multi-modal dimension to the long-term dialogues. The image sharing function is called when the agent decides to send an image. This process includes: (1) Generate a caption c for the intended image using M; (2) Convert the caption c into relevant keywords w using M; (3) Use the keywords k to find an image through web search WEB(k); (4) Share the chosen image. Conversely, the image reaction function is triggered upon receiving an im
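The four-step image sharing process quoted in the last snippet can be sketched roughly as follows. This is a minimal illustration, not the cited paper's implementation: the function name `share_image`, the prompt wording, the return structure, and the stub `fake_model` and `fake_search` callables are all assumptions standing in for the model M and the web search WEB(k). The snippet names the keywords both w and k; the sketch uses a single `keywords` variable.

```python
from typing import Callable, List, Optional


def share_image(
    intent: str,
    generate: Callable[[str], str],                # hypothetical stand-in for the model M
    web_search: Callable[[List[str]], List[str]],  # hypothetical stand-in for WEB(k)
) -> Optional[dict]:
    """Run the caption -> keywords -> search -> share steps from the snippet."""
    # (1) Generate a caption for the intended image using the model.
    caption = generate(f"Write a caption for an image showing: {intent}")
    # (2) Convert the caption into search keywords using the model.
    raw = generate(f"List comma-separated search keywords for: {caption}")
    keywords = [k.strip() for k in raw.split(",") if k.strip()]
    # (3) Use the keywords to find candidate images through web search.
    results = web_search(keywords)
    if not results:
        return None  # nothing found, so nothing is shared
    # (4) Share the chosen image (here: simply the first search hit).
    return {"caption": caption, "keywords": keywords, "image_url": results[0]}


# Stub callables so the sketch runs without a real model or search backend.
def fake_model(prompt: str) -> str:
    if prompt.startswith("Write a caption"):
        return "A dog running on a beach"
    return "dog, beach, running"


def fake_search(keywords: List[str]) -> List[str]:
    return ["https://example.invalid/" + "-".join(keywords) + ".jpg"]


result = share_image("a dog at the beach", fake_model, fake_search)
```

Passing the model and the search backend in as callables keeps the control flow testable; a real system would replace the stubs with actual model and search calls.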
representative citing papers
Agentic CLEAR automates multi-level evaluation of LLM agents, generating textual insights at system, trace, and node granularity that align with human annotations and predict task success.
Introduces Causal Functional Signatures grounded in causal evidence and ILP-learned architectural signatures to enable explicit, comparable, and portable mechanistic claims across model scales.
An 8B autoregressive LM implements a language-switching backdoor via a three-phase circuit with early trigger composition, orthogonal mid-layer propagation, and final-layer MLP conversion, routed through a single-position serial bottleneck.
R3-Streaming uses cascaded control with age-aware memory forgetting and TB-GRPO reinforcement learning to reach SOTA scores of 57.92 on OVO-Bench and 76.36 on StreamingBench with 95-96% fewer visual tokens.
New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.
LongBEL improves biomedical entity linking consistency by combining full-document context with memory of previous predictions trained via cross-validation rather than gold labels.
A new benchmark dataset drawn from Japan's National Assessment of Academic Ability supplies real exam layouts, diagrams, Japanese text, and nationwide student response distributions for evaluating multimodal LLMs.
Semantic Softmax aggregates probabilities from semantic synonyms around target labels to correct renormalization bias in zero-shot LLM classification, lowering calibration error and raising AUROC and F1.
CA-SQL achieves 51.72% execution accuracy on the challenging tier of the BIRD benchmark using GPT-4o-mini by scaling exploration breadth according to estimated task difficulty, evolutionary prompt seeding, and candidate voting.
A new permutation test uses Householder reflection to align word embedding clouds before testing dispersion differences, cutting Type-I error by 32.5% and speeding up 23x on GPU.
LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.
POSTCONDBENCH is a new multilingual benchmark that evaluates LLM postcondition generation on real code using defect discrimination to assess completeness beyond surface matching.
S²R² improves robustness of LoRA-tuned LLMs to prompt perturbations by penalizing semantic-segment drift while preserving clean performance and cross-dataset transfer.
OptiVerse is a new benchmark spanning neglected optimization domains that shows LLMs suffer sharp accuracy drops on hard problems due to modeling and logic errors, with a Dual-View Auditor Agent proposed to improve performance.
SpanDec achieves competitive NER accuracy with improved efficiency by using a final-stage lightweight decoder for span representations and early candidate filtering to reduce redundant computation.
Defines ATIR task and benchmark for mixed audio-text queries; MLLM model with token compression shows substantial gains over strong baselines.
MALMAS is a memory-augmented multi-agent LLM system that generates diverse, high-quality features for tabular data via agent decomposition, routing, and iterative memory-guided refinement.
Quantile tokens inserted into LLM inputs combined with neighbor retrieval enable direct prediction of full distributions, yielding lower MAPE and narrower intervals than baselines on Airbnb and StackSample tasks.
Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharyya distance.
SG-RAG frames retrieval as subgraph matching to ensure LLMs meet every condition in factual queries and reports large gains over baselines on a new 120k-pair ERQA dataset.
MAGEO is a multi-agent system that distills validated editing patterns into reusable optimization skills for generative engines, outperforming heuristic baselines on visibility and fidelity via a new benchmark and evaluation protocol.
Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the matching cell.
LQM introduces a six-level linguistically motivated error taxonomy for MT evaluation and applies it via expert annotation to LLM outputs on a new 3,850-sentence multi-dialect Arabic corpus.
citing papers explorer
- Memory-Augmented LLM-based Multi-Agent System for Automated Feature Generation on Tabular Data
  MALMAS is a memory-augmented multi-agent LLM system that generates diverse, high-quality features for tabular data via agent decomposition, routing, and iterative memory-guided refinement.
- From Experience to Skill: Multi-Agent Generative Engine Optimization via Reusable Strategy Learning
  MAGEO is a multi-agent system that distills validated editing patterns into reusable optimization skills for generative engines, outperforming heuristic baselines on visibility and fidelity via a new benchmark and evaluation protocol.
- History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions
  A single consistency instruction with harmful prior actions causes aligned frontier LLMs to select unsafe options at 91-98% rates in high-stakes domains, with escalation and inverse scaling by model size.
- Does Theory of Mind Improvement Really Benefit Human-AI Interactions? Empirical Findings from Interactive Evaluations
  Improvements in LLM Theory of Mind on static benchmarks do not reliably improve performance in dynamic, first-person human-AI interactions across goal-oriented and experience-oriented tasks.
- SOCIA-EVO: Automated Simulator Construction via Dual-Anchored Bi-Level Optimization
  SOCIA-EVO generates statistically consistent simulators by separating structural refinement from parameter calibration via bi-level optimization and falsifying strategies through execution feedback in a Bayesian-weighted playbook.
- Towards Scalable Lifelong Knowledge Editing with Selective Knowledge Suppression
  LightEdit enables scalable lifelong knowledge editing in LLMs via selective knowledge retrieval and probability suppression during decoding, outperforming prior methods on ZSRE, Counterfact, and RIPE while reducing training costs.
- Towards Robust Argumentative Essay Understanding via TIDE: An Interactive Framework with Trial and Debate
  TIDE integrates trial and debate mechanisms to improve criteria-based prompt optimization for argumentative essay tasks including automated scoring, component detection, and relation identification.
- Enhancing Online Recruitment with Category-Aware MoE and LLM-based Data Augmentation
  LLM chain-of-thought rewriting of job postings plus category-aware MoE improves person-job fit AUC by 2.4%, GAUC by 7.5%, and live click-through conversion by 19.4%.
- KnowledgeBerg: Evaluating Systematic Knowledge Coverage and Compositional Reasoning in Large Language Models