Multilingual Knowledge Graph Completion with Self-Supervised Adaptive Graph Alignment
Pith reviewed 2026-05-08 22:40 UTC · model claude-opus-4-7 · Recognition: 4 Lean theorem links
The pith
Treating cross-language entity alignment as a learned edge type, rather than a loss that forces aligned entities to coincide, improves multilingual knowledge graph completion, especially for low-resource languages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper argues that the standard recipe for multilingual knowledge graph completion — embed each language's graph separately and add a loss that pulls aligned entities together — is the wrong shape of model. Aligned entities across languages sit in graphs of very different size and quality, so forcing them to coincide imports noise from rich graphs into sparse ones and vice versa. The authors recast cross-language alignment as just another edge type inside one fused graph, and let a relation-aware attention mechanism in a graph neural network learn how much to trust each alignment link. They pair this with a self-supervised generator that proposes new alignment edges by masking known ones and training a second encoder to recover them.
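A minimal sketch may make the fusion step concrete: every seed pair becomes an ordinary triple under a dedicated alignment relation. The names here (triples_per_lang, R_ALIGN, the lang-prefixed entity IDs) are illustrative assumptions, not the authors' code.

```python
# Hedged sketch: fuse language-specific KGs into one multi-relational graph,
# with cross-language alignment as just another edge type.
R_ALIGN = "r_align"  # hypothetical name for the dedicated alignment relation

def fuse_graphs(triples_per_lang, seed_alignment):
    """triples_per_lang: dict lang -> list of (head, relation, tail) triples,
    with entities namespaced per language, e.g. "en:Tokyo", "ja:Tokyo".
    seed_alignment: iterable of (entity_a, entity_b) cross-language pairs."""
    fused = []
    for triples in triples_per_lang.values():
        fused.extend(triples)
    # Each alignment pair becomes a pair of directed triples, so the GNN can
    # learn asymmetric attention per direction (rich KG -> sparse KG vs back).
    for a, b in seed_alignment:
        fused.append((a, R_ALIGN, b))
        fused.append((b, R_ALIGN, a))
    return fused
```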
What carries the argument
A single fused multilingual graph in which every seed alignment pair becomes a triple with a dedicated alignment relation, encoded by a relation-aware attention GNN that learns per-relation and per-pair weights, plus a second GNN trained to recover masked alignment edges and propose new ones via mutual-nearest-neighbor scoring on combined structural and multilingual BERT text embeddings.
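The per-pair, per-relation weight is the attention score quoted verbatim in the Lean theorem links below, α^r_ij ∝ (W_k h_i(r))^T (W_q h_j) / √d · β_r, normalized over the neighborhood. A hedged numpy sketch, with shapes and names assumed for illustration:

```python
import numpy as np

def attention_weights(h_neighbors, beta_r, h_j, W_k, W_q):
    """h_neighbors: (n, d) relation-transformed messages h_i(r) from j's neighbors.
    beta_r: (n,) learned per-relation significance scalars (one per edge).
    h_j: (d,) target entity embedding; W_k, W_q: (d, d) projections."""
    d = h_j.shape[0]
    scores = (h_neighbors @ W_k.T) @ (W_q @ h_j) / np.sqrt(d) * beta_r
    scores -= scores.max()            # stabilize the softmax
    exp = np.exp(scores)
    return exp / exp.sum()            # attention over j's neighborhood
```

On alignment edges, β_r and the per-pair dot product together control how much a noisy cross-KG link is trusted, which is the adaptive mechanism the pith describes.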
If this is right
- Multilingual KG completion benefits more from controlling how much each alignment link is trusted than from adding more alignment pairs of equal weight.
- Low-resource language graphs gain disproportionately when knowledge can flow in through learned, asymmetric attention rather than a symmetric pull-together loss.
- Masked-edge recovery is a usable self-supervised signal for proposing new cross-graph alignments without external supervision (sketched after this list).
- Embeddings tuned for entity alignment and embeddings tuned for triple completion should not share parameters; the ablation shows sharing hurts.
- The fused-graph view reduces multilingual KG completion to a single relational learning problem, opening it to the standard relational GNN toolkit.
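A minimal sketch of the masked-recovery signal from the third bullet above, assuming a generic score function in place of the generator's combined structural and text similarity; all names are illustrative:

```python
import random

def masked_recovery_error(seed_alignment, candidates, score, mask_frac=0.3):
    """Hide a fraction of known alignment pairs and check whether the
    generator ranks each hidden partner first among candidates."""
    masked = random.sample(seed_alignment, int(mask_frac * len(seed_alignment)))
    # In training, the encoder would be re-run on the graph with the masked
    # edges removed before scoring; omitted here for brevity.
    hits = sum(
        max(candidates, key=lambda c: score(a, c)) == b for a, b in masked
    )
    return 1.0 - hits / max(len(masked), 1)  # 0.0 = every hidden partner recovered
```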
Where Pith is reading between the lines
- The asymmetry the model exploits — rich graphs informing poor graphs more than the reverse — should show up as visibly asymmetric attention weights on alignment edges; the case-study figure hints at this but a directional analysis would make the mechanism explicit.
- Because the alignment relation is just another edge type, the same architecture should extend to fusing heterogeneous KGs that are not language variants (for example, a product KG and a domain ontology), not only to multilingual settings.
- The masked-recovery objective is structurally similar to link prediction on the alignment subgraph; the gain over purely unsupervised mutual-nearest-neighbor pairing suggests that self-supervision is mainly denoising spurious matches rather than discovering new ones.
- The reported failure mode — no useful behavior when zero seed alignment is available — points to text embeddings alone being too weak a bridge without at least a few anchor pairs to calibrate the structural channel.
Load-bearing premise
The approach assumes that even a small set of seed alignment pairs is available and roughly accurate; with no seeds, or with seeds that are systematically wrong, neither the attention weights nor the masked-recovery signal have anything to anchor to, and the authors acknowledge this case is not handled.
What would settle it
Re-run the ablation with the relation-aware attention disabled (uniform weights on the alignment edge type) on the lowest-resource graphs; if accuracy on Greek and Japanese does not drop materially relative to the full model, the claim that adaptive weighting of alignment edges — rather than simply adding more cross-graph connectivity — is what drives the gain would not survive.
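Operationally the control is a single switch in the attention over alignment edges. A hedged sketch, with learned_scores standing in for the model's pre-softmax logits:

```python
import numpy as np

def alignment_weights(learned_scores, uniform=False):
    """Full model: adaptive softmax weights. Control: equal trust per edge."""
    n = len(learned_scores)
    if uniform:
        return np.full(n, 1.0 / n)
    exp = np.exp(learned_scores - learned_scores.max())
    return exp / exp.sum()
```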
read the original abstract
Predicting missing facts in a knowledge graph (KG) is crucial as modern KGs are far from complete. Due to labor-intensive human labeling, this phenomenon deteriorates when handling knowledge represented in various languages. In this paper, we explore multilingual KG completion, which leverages limited seed alignment as a bridge, to embrace the collective knowledge from multiple languages. However, language alignment used in prior works is still not fully exploited: (1) alignment pairs are treated equally to maximally push parallel entities to be close, which ignores KG capacity inconsistency; (2) seed alignment is scarce and new alignment identification is usually in a noisily unsupervised manner. To tackle these issues, we propose a novel self-supervised adaptive graph alignment (SS-AGA) method. Specifically, SS-AGA fuses all KGs as a whole graph by regarding alignment as a new edge type. As such, information propagation and noise influence across KGs can be adaptively controlled via relation-aware attention weights. Meanwhile, SS-AGA features a new pair generator that dynamically captures potential alignment pairs in a self-supervised paradigm. Extensive experiments on both the public multilingual DBPedia KG and newly-created industrial multilingual E-commerce KG empirically demonstrate the effectiveness of SS-AG
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SS-AGA, a method for multilingual knowledge graph completion (MKGC) that (i) fuses all language-specific KGs into a single graph with cross-KG alignment treated as a new relation type, encoded by a relation-aware attention GNN, and (ii) iteratively generates new alignment pairs in a self-supervised manner via a masked-alignment recovery objective on a separate GNN encoder, combining structural and mBERT-based textual similarity with CSLS. Experiments are reported on the public DBP-5L benchmark and a newly constructed industrial multilingual e-commerce KG (E-PKG, six languages), against monolingual KG embedding baselines (TransE, RotatE, DistMult, KG-BERT) and multilingual baselines (KEnS, CG-MuA, AlignKGC). The authors claim consistent gains across H@1, H@10, and MRR, an ablation supporting each component, and improved transfer to low-resource KGs. They release code and a new dataset.
Significance. The paper addresses a practically important problem (completing low-resource KGs by sharing across languages) and contributes (a) a clean reformulation of cross-lingual alignment as a typed edge inside a fused graph, which is a sensible alternative to the dominant alignment-loss formulation, (b) a self-supervised masked-alignment recovery signal that is conceptually attractive because it provides supervision for what is otherwise a bootstrapping heuristic, and (c) a new multilingual industrial KG with text and alignment annotations, which is itself a useful resource. Code and data are released, which supports reproducibility. If the gains hold under proper variance control and matched baseline tuning, the work is a solid incremental advance over KEnS/AlignKGC; the conceptual contributions (alignment-as-edge-type with relation-aware attention; SSL-guided pair generation) would remain of interest even if absolute gains are modest.
major comments (5)
- [Tables 2–3 (main results)] The headline empirical claim rests on margins that are frequently 1–2 H@1 points over KEnS/AlignKGC, and on E-PKG SS-AGA is actually worse than KEnS on ES H@1 (21.0 vs 21.3) and IT H@1 (24.9 vs 25.1) — yet the text describes the method as outperforming baselines 'in most cases' without acknowledging these losses. No standard deviations, confidence intervals, seed counts, or significance tests are reported. Given the size of the reported gaps and the small KGs involved (e.g., Greek with 13.8K facts), variance from random seeds and negative sampling is plausibly of the same order as the claimed improvements. Please report mean ± std over multiple seeds and a paired significance test on the test triples; otherwise the central effectiveness claim is not distinguishable from noise.
- [§4.3 Baselines] The statement 'For fair comparison, we use mBERT to obtain initial embeddings of entities and relations from their text for all methods' is a substantive deviation from the original baselines (TransE/RotatE/DistMult/KG-BERT and arguably KEnS/CG-MuA). It is not specified whether learning rate, margin, negative sampling ratio, embedding dimension, training epochs, or early-stopping criteria were re-tuned per baseline under this changed initialization, nor whether KEnS's ensemble component was preserved. Under-tuned baselines would explain the small gaps in Tables 2–3. Please document the hyperparameter search protocol per baseline and, ideally, also report each baseline in its original published configuration.
- [Table 4 (ablation)] The ablation supporting the two main contributions shows R-GNN → R-GNN+NPG → SS-AGA going from 25.7 → 26.2 → 26.9 average H@1 on DBP-5L. These ~0.5–0.7 point steps are in the range of typical seed variance for KG embeddings on KGs of this size, but no variance is reported on the ablation either. Without seed-level error bars, the ablation does not support the strong claim that both NPG and the SSL recovery objective are individually necessary. Please re-run the ablation over multiple seeds with reported variance, and consider an additional control where NPG is replaced by a fixed-quality similarity threshold.
- [§3.2 / Eq. (3)] The self-supervised pair generator uses two GNN encoders (g_a for SSL, g_k for KGC) and the ablation shows that sharing them harms performance. This is presented as an empirical finding, but it raises a concern about what g_a is actually learning: if the embeddings useful for alignment are decoupled from those useful for KGC, then the new pairs proposed by g_a may not be the pairs most useful for downstream completion. Please report (i) precision/recall of the generated alignment pairs against held-out gold alignment, and (ii) how performance scales with the number of generated pairs, to substantiate that the recovered pairs are the mechanism behind the gains rather than a regularization side-effect.
- [§4.7 Case study / Figure 4] The interpretation of attention weights as evidence of cross-lingual knowledge transfer is qualitative and rests on a single normalized matrix without a baseline distribution or statistical test. Attention weights are well known not to map cleanly to feature importance. Please either temper the interpretive claim or provide a controlled intervention (e.g., zeroing alignment edges from one source language at inference and measuring the per-KG H@10 drop) to substantiate that the visualized weights correspond to functional reliance on those KGs.
minor comments (10)
- [Abstract] The abstract is truncated mid-sentence ('SS-AG'). Please fix.
- [§2.1] The split is described as '60% train, 30% validation, 10% test', which is an unusually large validation share. Please confirm this is intentional and not a typo for 60/10/30 or 80/10/10; if intentional, justify.
- [§3.1, Eq. (1)] The neighborhood N(e_j) is defined to include both incoming and outgoing edges with the same relation r, but this collapses directionality (e.g., 'founded_by' vs its inverse). Please clarify whether relation inverses are introduced as separate types, and how r is shared across the two directions in the attention computation.
- [§3.2] The CSLS computation requires k-nearest-neighbor search over the union of entities across KGs. Please report wall-clock cost and how often new pairs are regenerated during training, since this is central to scalability claims. (A minimal CSLS sketch follows these comments.)
- [§3.1 'Scalability issue'] The k-hop sampler is mentioned but k, fan-out, and any neighbor-sampling biases (e.g., toward alignment edges vs intra-KG edges) are not specified. Please add to the appendix.
- [Table 1 / Appendix A] Statistics for E-PKG list only 21 relations per language, all in English, while entities are language-specific. Please clarify whether relations are genuinely shared across languages by identity, and whether this differs from the DBP-5L setup where each KG has its own R_i.
- [§4.4] KEnS MRR is reported as '-' in both Tables 2 and 3 without explanation. Please add a footnote stating why MRR is not available for this baseline.
- [§5.2] BootEA (Sun et al., 2018) is discussed as related work for bootstrapped alignment but not included as a baseline; given conceptual similarity to NPG, an empirical comparison would strengthen the contribution claim.
- [Algorithm 1] The 'while not converged' criterion is unspecified. Please state the convergence/early-stopping rule and the number of training epochs actually used.
- [§4.5] The variant labels 'GNN' and 'R-GNN' are introduced without a precise specification of what 'GNN without relation modeling' means (e.g., GAT? GCN? same depth and width?). Please specify so the ablation is reproducible.
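For the CSLS comment above, a minimal numpy sketch of CSLS-adjusted mutual-nearest-neighbor pairing in the spirit of Conneau et al.; matrix names and the k default are illustrative, and X, Y are assumed row-normalized:

```python
import numpy as np

def csls_scores(X, Y, k=10):
    """CSLS(x, y) = 2 cos(x, y) - r_X(x) - r_Y(y), where r_* is the mean
    cosine similarity to the k nearest cross-space neighbors."""
    sim = X @ Y.T                                    # cosine, rows unit-norm
    r_x = np.sort(sim, axis=1)[:, -k:].mean(axis=1)  # hubness penalty per x
    r_y = np.sort(sim, axis=0)[-k:, :].mean(axis=0)  # hubness penalty per y
    return 2 * sim - r_x[:, None] - r_y[None, :]

def mutual_nearest_pairs(X, Y, k=10):
    s = csls_scores(X, Y, k)
    best_y = s.argmax(axis=1)        # each x's favorite y
    best_x = s.argmax(axis=0)        # each y's favorite x
    return [(i, int(best_y[i])) for i in range(len(X))
            if best_x[best_y[i]] == i]
```

The wall-clock concern in the comment comes from the dense X @ Y.T product; at industrial scale this would need approximate nearest-neighbor search rather than the exact form above.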
Simulated Author's Rebuttal
We thank the referee for a careful and constructive report. The criticisms are well-taken, particularly regarding statistical reliability of the reported gains, baseline tuning transparency, and the gap between qualitative attention-based interpretation and functional evidence of cross-lingual transfer. We agree that the headline empirical claims as currently phrased are not adequately supported by variance-controlled experiments, and we acknowledge the cases on E-PKG (ES and IT H@1) where SS-AGA does not improve over KEnS — the phrase 'in most cases' was insufficiently precise about these losses. In the revision we will (1) re-run all main-table and ablation experiments over multiple seeds with mean ± std and paired significance tests; (2) document per-baseline hyperparameter protocols and additionally report each baseline in its original published configuration; (3) add diagnostic experiments on the new-pair generator (precision/recall against held-out gold alignment, scaling with number of generated pairs, and a fixed-threshold control) to clarify whether SSL is an individually necessary mechanism or a denoising regularizer; and (4) replace the qualitative attention interpretation in §4.7 with a controlled edge-ablation intervention. We will calibrate all claims to what the variance-controlled numbers actually support, including downgrading per-language claims where significance is not established. The two conceptual contributions — alignment-as-edge-type with relation-aware attention, and self-supervised pair generation — do not rest on any single per-language margin and, we believe, remain of interest even if the variance-controlled gains prove modest.
read point-by-point responses
- Referee: Margins in Tables 2–3 are 1–2 H@1 points and SS-AGA actually loses to KEnS on E-PKG ES H@1 (21.0 vs 21.3) and IT H@1 (24.9 vs 25.1); no variance, seeds, or significance tests are reported, so improvements may be within noise.
Authors: We accept this criticism. The referee is correct that we do report two cases on E-PKG (ES and IT H@1) where SS-AGA is below KEnS, and that the phrasing 'in most cases' obscures these losses. We will (i) revise the text to explicitly enumerate the cells where SS-AGA does not win, and (ii) re-run all main-table experiments over 5 random seeds (controlling negative sampling and parameter initialization) and report mean ± std. We will additionally provide paired significance tests (Wilcoxon signed-rank over per-query reciprocal ranks on the test set, with Bonferroni correction across languages) for SS-AGA vs. each multilingual baseline. We agree that on small KGs such as EL (13.8K facts) seed variance is non-trivial, and we will calibrate the strength of claims to whatever the variance-controlled numbers support — including downgrading claims for languages where the gap is not significant. revision: yes
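The promised test is standard. A sketch using scipy, where rr_ssaga and rr_kens are hypothetical arrays of per-query reciprocal ranks aligned on the same test triples:

```python
import numpy as np
from scipy.stats import wilcoxon

def compare_models(rr_ssaga, rr_kens, n_languages=5, alpha=0.05):
    stat, p = wilcoxon(rr_ssaga, rr_kens)    # paired, two-sided by default
    return p, p < alpha / n_languages        # Bonferroni-corrected decision

# toy example with synthetic reciprocal ranks
rng = np.random.default_rng(0)
rr_a = 1.0 / rng.integers(1, 100, size=500)
rr_b = 1.0 / rng.integers(1, 110, size=500)
print(compare_models(rr_a, rr_b))
```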
- Referee: Re-initializing all baselines with mBERT text embeddings is a substantive deviation; hyperparameter re-tuning per baseline is not documented, and KEnS's ensemble component status is unclear. Under-tuned baselines could explain the small gaps.
Authors: This is a legitimate methodological concern. Our intent in unifying mBERT initialization was to isolate the architectural contribution from text-encoder differences, since AlignKGC in particular benefits from pretrained alignment-aware text embeddings. However, we did not document the per-baseline tuning protocol in the submitted version. In the revision we will (i) add an appendix specifying, per baseline, the search grid (learning rate, margin, negative-sample ratio, embedding dimension, epochs, early-stopping criterion) and the selected configuration; (ii) confirm that KEnS's ensemble (knowledge-transfer voting across language-specific models) was preserved, and report results both with and without the mBERT-initialized variant; and (iii) add a column reporting each baseline in its originally published configuration alongside the mBERT-initialized one, so that readers can see both comparisons. If a baseline becomes stronger under its native configuration, we will report that honestly rather than only the configuration favorable to us. revision: yes
- Referee: Ablation gains (25.7 → 26.2 → 26.9 avg H@1) are within typical seed variance and lack error bars, so individual necessity of NPG and SSL is not established. A control replacing NPG with a fixed similarity threshold is suggested.
Authors: We agree. We will re-run the ablation in Table 4 over 5 seeds and report mean ± std for each variant (GNN, R-GNN, R-GNN+NPG, SS-AGA, and the shared-encoder variant), and apply paired significance testing between consecutive rows. We will also add the suggested control: an R-GNN+NPG variant in which CSLS-based mutual-nearest-neighbor selection is replaced by a fixed cosine-similarity threshold τ (swept over a small grid on the validation set), so that the contribution of the SSL masked-recovery objective can be isolated from the contribution of simply adding more pseudo-pairs. If the SSL variant is not significantly above the threshold control, we will revise the claim accordingly and report SSL as a denoising regularizer rather than as an individually necessary component. revision: yes
- Referee: Using two decoupled encoders g_a (SSL) and g_k (KGC) raises the concern that pairs proposed by g_a are not those most useful for KGC. Requested: precision/recall of generated pairs vs. held-out gold alignment, and scaling of KGC performance with number of generated pairs.
Authors: This is a fair diagnostic question, and we did not include it. We will add: (i) precision and recall of the pairs generated by g_a, evaluated against a held-out subset of the seed alignment that we withhold from training (we will define this split carefully so it is disjoint from both the masked-recovery set used during training and from any pairs used at evaluation); and (ii) a curve plotting MKGC H@1/H@10 against the number of generated pairs added per epoch (controlled by the CSLS top-K), to show whether gains saturate, plateau, or degrade with more pairs. We will also report the ablation 'add K random unaligned pairs as ralign edges' as a regularization control, to test whether the benefit comes from the specific pairs proposed by g_a or from densifying the cross-KG connectivity. We acknowledge that if the random-edge control matches SS-AGA, the contribution should be reframed as structural regularization rather than alignment recovery. revision: yes
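Diagnostic (i) is plain set arithmetic once the held-out gold slice is fixed; a sketch with illustrative names:

```python
def pair_precision_recall(generated_pairs, heldout_gold):
    generated, gold = set(generated_pairs), set(heldout_gold)
    tp = len(generated & gold)                 # correctly recovered pairs
    return tp / max(len(generated), 1), tp / max(len(gold), 1)
```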
- Referee: The attention-weight visualization in Figure 4 is qualitative; attention weights do not necessarily reflect functional importance. A controlled intervention (e.g., ablating alignment edges from one source language at inference) is requested.
Authors: We accept this point and agree the current claim overreaches what an attention heatmap can support. In the revision we will: (i) soften the language in §4.7 from a causal interpretation of cross-lingual transfer to a descriptive observation about learned weights; and (ii) add the suggested intervention experiment — at inference time, for each target language, we zero out incoming r_align edges from each source language one at a time and measure the resulting drop in H@1/H@10 on that target. We will report this drop matrix alongside Figure 4 so that readers can see whether the attention-weight ranking and the functional-reliance ranking agree. If they disagree, we will say so explicitly, since the intervention measurement is the more reliable signal. revision: yes
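The intervention in (ii) reduces to an edge-filtering loop at inference time. A hedged sketch, where evaluate_hits and the lang-prefixed entity IDs are hypothetical stand-ins:

```python
def reliance_matrix(fused_triples, languages, evaluate_hits, k=10):
    """For each (source, target) pair, drop alignment edges whose head lies
    in the source KG and measure the H@k drop on the target KG."""
    base = {t: evaluate_hits(fused_triples, target=t, k=k) for t in languages}
    drops = {}
    for src in languages:
        pruned = [(h, r, t) for (h, r, t) in fused_triples
                  if not (r == "r_align" and h.startswith(src + ":"))]
        for tgt in languages:
            if tgt != src:
                drops[(src, tgt)] = base[tgt] - evaluate_hits(pruned, target=tgt, k=k)
    return drops  # large drop => functional reliance of tgt on src
```

Agreement between this drop matrix and the Figure 4 attention ranking would be the functional evidence the referee asks for.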
- We cannot retroactively guarantee that under fully native, independently re-tuned baseline configurations the H@1 margins on every DBP-5L and E-PKG language will remain positive and significant; we commit to reporting the results honestly even if some per-language claims must be withdrawn.
- The decoupled-encoder design (g_a vs. g_k) is currently justified empirically rather than theoretically; we can provide diagnostics but cannot offer a principled account of why the embedding spaces best for alignment differ from those best for KGC.
Circularity Check
No significant circularity: SS-AGA's claims are empirical and tested against external benchmarks; concerns are about effect size and baseline tuning, not circular derivation.
full rationale
This is a methods paper proposing a GNN-based multilingual KG completion model. The central claims are empirical (better Hits@1/10 and MRR on DBP-5L and the new E-PKG dataset) and are evaluated against held-out test triples on standard public data plus a new industrial dataset. The derivation chain — fuse KGs by treating alignment as a new edge type, apply a relation-aware attention GNN, add a self-supervised masked-alignment recovery loss, generate new alignment pairs via CSLS — is internally consistent and does not reduce predictions to fitted inputs. Test triples are held out (60/30/10 split) and not used for fitting; the masked alignment loss is supervised by held-out alignment edges, not by the KGC test set. Self-citations (e.g., to KEnS, CG-MuA, BootEA) are used to position prior art and to import the CSLS measure (Conneau et al., a third-party reference), not to import a load-bearing uniqueness theorem. The reader's concerns — small absolute gaps over baselines, no variance reporting, possibly under-tuned baselines after mBERT re-initialization — are legitimate but they are issues of statistical significance and experimental fairness, not circularity. No "prediction" in this paper is equivalent to its input by construction; the model is trained on training triples and evaluated on disjoint test triples. There is one minor element worth flagging: the new pair generator uses entity similarity (structural + mBERT textual) to propose new alignment edges, and the SSL recovery objective is trained on a masked subset of seed alignment, so the generator is partially evaluated on the same kind of signal it was trained on — but this is a standard masked-recovery self-supervision setup, not a circularity that contaminates the downstream KGC evaluation. Score 1.
Lean theorems connected to this paper
- Foundation.PhiForcing / Cost.Jcost · phi_forcing / Jcost_uniqueness · status: unclear · matched text: "α^r_ij = (W^l_k h^l_i(r))^T · (W^l_q h^l_j) · (1/√d) · β_r ... we introduce β_r to characterize the general significance of each relation r."
- Cost.FunctionalEquation · washburn_uniqueness_aczel · status: unclear · matched text: "J_K = Σ [f(e_h', r, e_t') − f(e_h, r, e_t) + γ]_+ ... J_A = Σ [‖ẽ^a_h − ẽ^a_t‖² − ‖ẽ^a_{h'} − ẽ^a_{t'}‖² + γ_a]_+"
Forward citations
Cited by 60 Pith papers
-
Does it Really Count? Assessing Semantic Grounding in Text-Guided Class-Agnostic Counting
Current CAC models often count the wrong objects because they misalign text prompts with visual content, as demonstrated by new negative-label and distractor tests on the MUCCA dataset.
-
APIOT: Autonomous Vulnerability Management Across Bare-Metal Industrial OT Networks
APIOT is the first LLM framework to complete the full autonomous discovery-to-remediation cycle on bare-metal OT devices, reaching 90% success across 290 runs on Zephyr RTOS.
-
Dynamic Grammar-Compressed Self-Index in $\delta$-Optimal Space
The dynamic RR-index is the first dynamic self-index to attain δ-optimal space, with locate in expected O(m + log m log² n + occ (log n / log log n)) time and updates in O(m' log² n + log³ n) time.
-
Ghost in the Agent: Redefining Information Flow Tracking for LLM Agents
NeuroTaint is the first taint tracking framework for LLM agents that uses offline auditing of semantic, causal, and persistent context to detect flows from untrusted sources to privileged sinks.
-
Suffix Random Access via Function Inversion: A Key for Asymmetric Streaming String Algorithms
A bidirectional reduction between suffix random access and function inversion enables improved asymmetric streaming algorithms for exact/approximate pattern matching and relative Lempel-Ziv compression.
-
A Counterexample to EFX; $n \ge 3$ Agents, $m \ge n + 5$ Items, Monotone Valuations; via SAT-Solving
EFX allocations do not always exist for n ≥ 3 agents and m ≥ n + 5 monotone items, as demonstrated by a SAT-derived counterexample.
-
Logical Compilation for Multi-Qubit Iceberg Patches
A new heuristic compiler for multi-qubit iceberg patches reduces circuit depth by 34 percent, cuts gate counts, and improves fidelity metrics on 71 benchmarks compared with naive mapping.
-
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
-
The Linear Representation Hypothesis and the Geometry of Large Language Models
Linear representations of high-level concepts in LLMs are formalized via counterfactuals in input and output spaces, unified under a causal inner product that enables consistent probing and steering.
-
A refined CJ--SS--RR method with a reliable removal approach of spurious Ritz values for the Hermitian eigenvalue problem
Refined SS-RRR methods with a reliable tune-free removal of spurious Ritz values improve accuracy and efficiency for computing eigenpairs of large Hermitian matrices in a target region.
-
TB-AVA: Text as a Semantic Bridge for Audio-Visual Parameter Efficient Finetuning
TB-AVA uses text as a semantic anchor with a new Text-Bridged Audio-Visual Adapter and Gated Semantic Modulation to achieve state-of-the-art results on audio-visual benchmarks through parameter-efficient fine-tuning.
-
AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation
AnomalyClaw turns single-step VLM anomaly judgments into a multi-round tool-grounded refutation process, delivering consistent macro-AUROC gains of 3.5-7.9 percentage points over direct inference across 12 cross-domai...
-
PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents
PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.
-
Near-Linear Time Generalized Sinkhorn Algorithms for Bounded Genus Graphs
GenusSink achieves near-linear time approximate generalized Sinkhorn for geodesic optimal transport on bounded-genus graphs by combining separator-based decompositions with Fourier and low-displacement-rank matrix-vec...
-
Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs
LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.
-
The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring
A 114k compositional jailbreak dataset is created, generators are fine-tuned for on-the-fly synthesis, and OPTIMUS introduces a continuous evaluator that identifies stealth-optimal regimes missed by binary attack succ...
-
MDGYM: Benchmarking AI Agents on Molecular Simulations
MDGYM benchmark shows AI agents achieve low success rates on molecular dynamics tasks, with distinct failure modes including unstable simulations and premature abandonment.
-
Hierarchical Task Network Planning with LLM-Generated Heuristics
LLM-generated heuristics for HTN planning nearly match PANDA planner coverage while reducing search effort on 83% of shared problems across six benchmark domains.
-
MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals
MASPrism attributes failures in LLM multi-agent executions by extracting token-level negative log-likelihood and attention weights from a small model's prefill pass, then ranking candidates with a second prefill, achi...
-
AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents
AI CFD Scientist autonomously finds a Spalart-Allmaras turbulence correction that lowers wall-friction error by 7.89% versus DNS on the periodic hill case using vision-language physics verification.
-
PACZero: PAC-Private Fine-Tuning of Language Models via Sign Quantization
PACZero achieves zero mutual information privacy for LLM fine-tuning via sign-quantized zeroth-order gradients, delivering near-non-private accuracy on SST-2 and SQuAD at I=0.
-
Privacy by Postprocessing the Discrete Laplace Mechanism
Postprocessing the discrete Laplace mechanism yields unbiased estimators for subexponential functions and equivalent distributions to Laplace or Staircase mechanisms under the same privacy parameters.
-
ADELIA: Automatic Differentiation for Efficient Laplace Inference Approximations
ADELIA is the first AD-enabled INLA system that computes exact hyperparameter gradients via a structure-exploiting multi-GPU backward pass, delivering 4.2-7.9x per-gradient speedups and 5-8x better energy efficiency t...
-
Gaming the Metric, Not the Harm: Certifying Safety Audits against Strategic Platform Manipulation
Semantic-envelope metrics, which assign each content variant the maximum score in its semantic class, provide the unique conservative repair that supports class-stratified certificates invariant to platform manipulati...
-
Continuous Expert Assembly: Instance-Conditioned Low-Rank Residuals for All-in-One Image Restoration
CEA assembles per-token low-rank residual updates via dense affinities over hyper-adapter-generated components to improve all-in-one image restoration on spatially non-uniform degradations.
-
GPU-Accelerated Synthesis of Mixed-Boolean Arithmetic: Beyond Caching
SIMBA is a cache-free GPU-accelerated bottom-up enumerator for mixed-boolean arithmetic that solves larger instances faster than prior CPU and cached-GPU tools.
-
Detecting Time Series Anomalies Like an Expert: A Multi-Agent LLM Framework with Specialized Analyzers
SAGE decomposes univariate time-series anomaly detection into four specialized LLM analyzers plus an evidence-grounded detector and supervisor, achieving the highest average performance on three benchmarks while using...
-
Inference-Time Budget Control for LLM Search Agents
A VOI-based controller for dual inference budgets improves multi-hop QA performance by prioritizing search actions and selectively finalizing answers.
-
The Pinocchio Dimension: Phenomenality of Experience as the Primary Axis of LLM Psychometric Differences
The primary axis of psychometric variation among LLMs is the degree to which they represent themselves as loci of phenomenal experience rather than systems of behavioral responses.
-
Beyond Gates: Pulse Level Quantum Fourier Models
Pulse-level parameterization of quantum Fourier models replaces single gate angles with multiple independent sub-angles, relaxing monomial couplings and improving gradient descent performance on Fourier series tasks.
-
Out-of-the-Box Global Optimization for Packing Problems: New Models and Improved Solutions
New nonlinear formulations for geometric packing problems, solved with FICO Xpress and SCIP, produce improved and first-known solutions for several variants.
-
Unsat Core Prediction through Polarity-Aware Representation Learning over Clause-Literal Hypergraphs
A polarity-aware hypergraph GNN framework improves unsat core prediction in SAT by separating polarity-invariant and equivariant literal representations.
-
From Video-to-PDE: Data-Driven Discovery of Nonlinear Dye Plume Dynamics
A video-to-PDE pipeline extracts the model u_t + v(t)·∇u = 9.005|∇u|^2 + 0.666Δu from grayscale ink-plume footage, outperforming advection-diffusion baselines on held-out frames and reducing to linear form via Cole-Ho...
-
Layerwise LQR for Geometry-Aware Optimization of Deep Networks
Steepest descent under divergence-induced quadratic models equals an LQR problem, enabling learning of diagonal or Kronecker-factored inverse preconditioners via a global layerwise objective for scalable geometry-awar...
-
When to Think, When to Speak: Learning Disclosure Policies for LLM Reasoning
SxS Interleaved Reasoning learns when to disclose partial reasoning during generation and improves accuracy versus content-latency trade-offs on math and science benchmarks.
-
Rethinking the Need for Source Models: Source-Free Domain Adaptation from Scratch Guided by a Vision-Language Model
The paper introduces the VODA setting for domain adaptation from scratch using vision-language models and presents TS-DRD, which achieves competitive performance on standard benchmarks without source models.
-
Bucketing the Good Apples: A Method for Diagnosing and Improving Causal Abstraction
A four-step recipe partitions the input space using interchange intervention behavior to diagnose where causal abstractions hold and to guide improvements, demonstrated by recovering a full hypothesis from scratch in ...
-
CSGuard: Toward Forgery-Resistant Watermarking in Diffusion Models via Compressed Sensing Constraint
CSGuard binds diffusion-model watermarks to a secret matrix via compressed sensing, cutting forgery attack success from 100% to 28.12% while preserving 100% detection on legitimate images.
-
ESARBench: A Benchmark for Agentic UAV Embodied Search and Rescue
ESARBench is the first unified benchmark for MLLM-driven UAV agents that must explore, locate clues, and decide on victim positions in photorealistic simulated SAR environments.
-
EO-Gym: A Multimodal, Interactive Environment for Earth Observation Agents
EO-Gym supplies an executable multimodal environment and 9k-trajectory benchmark that turns Earth Observation into a tool-using, multi-step reasoning task, revealing that current VLMs struggle on temporal and cross-se...
-
Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring
Themis builds a multilingual benchmark and large preference dataset to train code reward models that score outputs on multiple criteria like correctness, efficiency, and style.
-
Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring
Themis introduces the largest open code preference dataset with over 350k pairs and trains multilingual reward models from 600M to 32B parameters that support flexible multi-criteria scoring, with experiments showing ...
-
Recovering Hidden Reward in Diffusion-Based Policies
EnergyFlow recovers the gradient of the expert's soft Q-function from the score of a conservative energy field in diffusion policies, enabling reward extraction without adversarial training.
-
Quasar-Convex Optimization: Fundamental Properties and High-Order Proximal-Point Methods
Quasar-convex functions admit high-order proximal algorithms with linear convergence for p=2 and superlinear for p>2 under suitable conditions.
-
Training Computer Use Agents to Assess the Usability of Graphical User Interfaces
uxCUA is a trained computer use agent that assesses GUI usability more accurately than larger models by learning to prioritize and execute important user interactions on labeled interface datasets.
-
QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding
Introduces QCalEval benchmark showing best zero-shot VLM score of 72.3 on quantum calibration plots, with fine-tuning and in-context learning effects varying by model type.
-
ReTokSync: Self-Synchronizing Tokenization Disambiguation for Generative Linguistic Steganography
ReTokSync resolves tokenization ambiguity in generative linguistic steganography via targeted self-synchronizing resets, achieving over 99.7% extraction accuracy and 100% recovery with an auxiliary channel while match...
-
Hybrid Path-Sums for Hybrid Quantum Programs
Hybrid Path-Sums offer a new symbolic framework with rewriting rules and assertions to represent, simplify, and verify properties of hybrid quantum-classical programs.
-
Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis
DataPRM is a new process reward model for data analysis agents that detects silent errors via environment interaction and ternary rewards, yielding 7-11% gains on benchmarks and further RL improvements.
-
VitaminP: cross-modal learning enables whole-cell segmentation from routine histology
VitaminP uses paired H&E-mIF data to train a model that transfers molecular boundary information, enabling accurate whole-cell segmentation directly from routine H&E histology across 34 cancer types.
-
Prism-Reranker: Beyond Relevance Scoring -- Jointly Producing Contributions and Evidence for Agentic Retrieval
Prism-Reranker models output relevance, contribution statements, and evidence passages to support agentic retrieval beyond scalar scoring.
-
Hybrid Conjecture in a Mixed Shimura variety
The hybrid conjecture holds for the universal abelian scheme A_g over A_g, encompassing André-Oort, André-Pink-Zannier, mixed André-Oort, and Manin-Mumford conjectures for abelian varieties.
-
AgentSearchBench: A Benchmark for AI Agent Search in the Wild
AgentSearchBench shows a consistent gap between semantic similarity and real agent performance on tasks, with execution-aware probing substantially improving retrieval and reranking quality.
-
SketchVLM: Vision language models can annotate images to explain thoughts and guide users
SketchVLM lets VLMs generate non-destructive SVG annotations on input images to visually explain answers, raising visual reasoning accuracy by up to 28.5 points and annotation quality by 1.48x over baselines.
-
Synthesizing Multi-Agent Harnesses for Vulnerability Discovery
AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new z...
-
AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe
AttentionBender applies 2D transforms to cross-attention maps in video diffusion transformers, producing distributed distortions and glitch aesthetics that reveal entangled attention mechanisms while serving as both a...
-
A Rocq Formalization of Simplicial Lagrange Finite Elements
A Rocq formalization defines simplicial Lagrange finite elements as records with geometric data, polynomial approximations, and unisolvence proofs for any dimension and polynomial degree.
-
Learning When Not to Decide: A Framework for Overcoming Factual Presumptuousness in AI Adjudication
A new structured prompting method (SPEC) helps AI detect insufficient evidence in adjudication tasks and defer decisions appropriately, reaching 89% accuracy on a benchmark varying information completeness from Colora...
-
Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.
-
LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models
Ghost-100 benchmark shows prompt tone drives hallucination rates and intensities in VLMs, with non-monotonic peaks at intermediate pressure and task-specific differences that aggregate metrics hide.