arXiv preprint arXiv:2404.14082 (2024)

Bereska, L · 2024 · arXiv 2404.14082

13 Pith papers cite this work. Polarity classification is still indexing.

13 Pith papers citing it

representative citing papers

Data-driven Circuit Discovery for Interpretability of Language Models

cs.AI · 2026-05-09 · unverdicted · novelty 7.0

Standard circuit discovery methods produce dataset-specific circuits rather than task-general ones, and a new clustering-based method discovers multiple more faithful circuits per dataset.

Unifying Dynamical Systems and Graph Theory to Mechanistically Understand Computation in Neural Networks

cs.NE · 2026-05-05 · unverdicted · novelty 7.0 · 2 refs

RNN computation is recovered from multi-hop graph pathways, and constraining these pathways via resolvent regularization yields improved temporal sparsity and task performance over standard L1.

ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety

cs.CR · 2026-04-21 · unverdicted · novelty 7.0

ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisoned samples.

Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

cs.CL · 2026-05-12 · unverdicted · novelty 6.0

LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.

SoftSAE: Dynamic Top-K Selection for Adaptive Sparse Autoencoders

cs.LG · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

SoftSAE replaces fixed-K sparsity in autoencoders with a learned, input-dependent number of active features via a soft top-k operator.

Confidence Estimation in Automatic Short Answer Grading with LLMs

cs.CL · 2026-04-30 · unverdicted · novelty 6.0 · 2 refs

A hybrid confidence framework for LLM-based short answer grading combines model signals with aleatoric uncertainty from semantic clustering of responses and improves selective grading reliability over single-source methods.

Evaluation without Generation: Non-Generative Assessment of Harmful Model Specialization with Applications to CSAM

cs.LG · 2026-04-28 · unverdicted · novelty 6.0

Gaussian probing infers harmful model specialization from parameter perturbations and internal representation responses to Gaussian latent ensembles rather than from generated outputs.

When AI reviews science: Can we trust the referee?

cs.AI · 2026-04-26 · unverdicted · novelty 6.0

AI peer review systems are vulnerable to prompt injections, prestige biases, assertion strength effects, and contextual poisoning, as demonstrated by a new attack taxonomy and causal experiments on real conference submissions.

What Physics do Data-Driven MoCap-to-Radar Models Learn?

cs.LG · 2026-04-19 · unverdicted · novelty 6.0

Data-driven MoCap-to-radar models often fail to learn underlying physics despite low reconstruction error, with temporal attention proving critical for transformers to achieve physical consistency.

Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings

cs.LG · 2026-04-09 · unverdicted · novelty 6.0

Circuit-based metrics from Vision Transformer internals provide better label-free proxies for generalization under distribution shift than existing methods like model confidence.

Quantifying Trust: Financial Risk Management for Trustworthy AI Agents

cs.AI · 2026-04-05 · unverdicted · novelty 6.0

The paper introduces the Agentic Risk Standard (ARS) as a payment settlement framework that delivers predefined compensation for AI agent execution failures, misalignment, or unintended outcomes.

Enhancing Adversarial Robustness in Network Intrusion Detection: A Layer-wise Adaptive Regularization Approach

cs.CR · 2026-05-09 · unverdicted · novelty 3.0

LARAR enhances adversarial robustness in network intrusion detection by using layer-wise adaptive regularization and auxiliary classifiers, achieving 95.01% clean accuracy and improved defense against FGSM, PGD, and transfer attacks on UNSW-NB15.

Superposition Yields Robust Neural Scaling

cs.LG · 2025-05-15

citing papers explorer

Showing 13 of 13 citing papers.

Data-driven Circuit Discovery for Interpretability of Language Models cs.AI · 2026-05-09 · unverdicted · none · ref 1
Standard circuit discovery methods produce dataset-specific circuits rather than task-general ones, and a new clustering-based method discovers multiple more faithful circuits per dataset.
Unifying Dynamical Systems and Graph Theory to Mechanistically Understand Computation in Neural Networks cs.NE · 2026-05-05 · unverdicted · none · ref 29 · 2 links
RNN computation is recovered from multi-hop graph pathways, and constraining these pathways via resolvent regularization yields improved temporal sparsity and task performance over standard L1.
ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety cs.CR · 2026-04-21 · unverdicted · none · ref 124
ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisoned samples.
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space cs.CL · 2026-05-12 · unverdicted · none · ref 67
LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.
SoftSAE: Dynamic Top-K Selection for Adaptive Sparse Autoencoders cs.LG · 2026-05-07 · unverdicted · none · ref 11 · 2 links
SoftSAE replaces fixed-K sparsity in autoencoders with a learned, input-dependent number of active features via a soft top-k operator.
Confidence Estimation in Automatic Short Answer Grading with LLMs cs.CL · 2026-04-30 · unverdicted · none · ref 4 · 2 links
A hybrid confidence framework for LLM-based short answer grading combines model signals with aleatoric uncertainty from semantic clustering of responses and improves selective grading reliability over single-source methods.
Evaluation without Generation: Non-Generative Assessment of Harmful Model Specialization with Applications to CSAM cs.LG · 2026-04-28 · unverdicted · none · ref 1
Gaussian probing infers harmful model specialization from parameter perturbations and internal representation responses to Gaussian latent ensembles rather than from generated outputs.
When AI reviews science: Can we trust the referee? cs.AI · 2026-04-26 · unverdicted · none · ref 103
AI peer review systems are vulnerable to prompt injections, prestige biases, assertion strength effects, and contextual poisoning, as demonstrated by a new attack taxonomy and causal experiments on real conference submissions.
What Physics do Data-Driven MoCap-to-Radar Models Learn? cs.LG · 2026-04-19 · unverdicted · none · ref 13
Data-driven MoCap-to-radar models often fail to learn underlying physics despite low reconstruction error, with temporal attention proving critical for transformers to achieve physical consistency.
Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings cs.LG · 2026-04-09 · unverdicted · none · ref 5
Circuit-based metrics from Vision Transformer internals provide better label-free proxies for generalization under distribution shift than existing methods like model confidence.
Quantifying Trust: Financial Risk Management for Trustworthy AI Agents cs.AI · 2026-04-05 · unverdicted · none · ref 4
The paper introduces the Agentic Risk Standard (ARS) as a payment settlement framework that delivers predefined compensation for AI agent execution failures, misalignment, or unintended outcomes.
Enhancing Adversarial Robustness in Network Intrusion Detection: A Layer-wise Adaptive Regularization Approach cs.CR · 2026-05-09 · unverdicted · none · ref 15
LARAR enhances adversarial robustness in network intrusion detection by using layer-wise adaptive regularization and auxiliary classifiers, achieving 95.01% clean accuracy and improved defense against FGSM, PGD, and transfer attacks on UNSW-NB15.
Superposition Yields Robust Neural Scaling cs.LG · 2025-05-15 · unreviewed · ref 62

arXiv preprint arXiv:2404.14082 (2024)

fields

years

verdicts

representative citing papers

citing papers explorer