α-TCAV replaces TCAV's hard indicator with a tunable smooth function to create a unified probabilistic framework with lower variance and guidance for parameter choice or Bayes-optimal scoring.
International conference on machine learning , pages=
9 Pith papers cite this work. Polarity classification is still indexing.
years
2026 9representative citing papers
Gender bias and factual gender knowledge are severely entangled in language model circuits and neurons, making neuron ablation an unreliable method for debiasing.
SoftBlobGIN combines ESM-2 representations with protein contact graphs via a lightweight GNN and differentiable substructure pooling to achieve 92.8% accuracy on enzyme classification, raise binding-site AUROC to 0.983, and generate auditable structural explanations without retraining the language模型
LLMs show significant biases in conflict event classification, with open-weight models exhibiting false illegitimation and adapted models showing actor bias and lexical sensitivity, making them unsuitable for unsupervised deployment.
GCE-MIL is a backbone-agnostic wrapper that directly optimizes MIL evidence for sufficiency, necessity, and recoverability, yielding modest gains in Macro-F1 and C-index plus more faithful patch selection across many backbones and datasets.
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
Interpretability research should be judged by actionability—the degree to which its insights support concrete decisions and interventions—rather than explanatory power alone.
A perturbation-based metric for XAI quality that formalizes sufficiency and necessity, paired with an adapter trained via differentiable supervision to generate causal explanations on black-box models.
TabSHAP attributes feature impact in LLM tabular classifiers via sampled Shapley coalitions and JSD on output distributions, reporting higher deletion faithfulness than random or XGBoost-proxy baselines on Adult Income and Heart Disease data.
citing papers explorer
-
$\alpha$-TCAV: A Unified Framework for Testing with Concept Activation Vectors
α-TCAV replaces TCAV's hard indicator with a tunable smooth function to create a unified probabilistic framework with lower variance and guidance for parameter choice or Bayes-optimal scoring.
-
GKnow: Measuring the Entanglement of Gender Bias and Factual Gender
Gender bias and factual gender knowledge are severely entangled in language model circuits and neurons, making neuron ablation an unreliable method for debiasing.
-
Structural Interpretations of Protein Language Model Representations via Differentiable Graph Partitioning
SoftBlobGIN combines ESM-2 representations with protein contact graphs via a lightweight GNN and differentiable substructure pooling to achieve 92.8% accuracy on enzyme classification, raise binding-site AUROC to 0.983, and generate auditable structural explanations without retraining the language模型
-
Are LLMs Ready for Conflict Monitoring? Empirical Evidence from West Africa
LLMs show significant biases in conflict event classification, with open-weight models exhibiting false illegitimation and adapted models showing actor bias and lexical sensitivity, making them unsuitable for unsupervised deployment.
-
GCE-MIL: Faithful and Recoverable Evidence for Multiple Instance Learning in Whole-Slide Imaging
GCE-MIL is a backbone-agnostic wrapper that directly optimizes MIL evidence for sufficiency, necessity, and recoverability, yielding modest gains in Macro-F1 and C-index plus more faithful patch selection across many backbones and datasets.
-
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
-
Interpretability Can Be Actionable
Interpretability research should be judged by actionability—the degree to which its insights support concrete decisions and interventions—rather than explanatory power alone.
-
Learning Quantifiable Visual Explanations Without Ground-Truth
A perturbation-based metric for XAI quality that formalizes sufficiency and necessity, paired with an adapter trained via differentiable supervision to generate causal explanations on black-box models.
-
TabSHAP
TabSHAP attributes feature impact in LLM tabular classifiers via sampled Shapley coalitions and JSD on output distributions, reporting higher deletion faithfulness than random or XGBoost-proxy baselines on Adult Income and Heart Disease data.