Cohort-Based Active Modality Acquisition

Benjamin Wild; Roland Eils; Tillmann Rheude

arXiv: 2505.16791 · v4 · submitted 2025-05-22 · cs.LG · cs.AI

Pith reviewed 2026-05-22 13:32 UTC · model grok-4.3

keywords: active modality acquisition · multimodal machine learning · imputation strategies · cohort-level selection · missing modalities · test-time acquisition · UK Biobank

The pith

Imputation-based strategies guide acquisition of additional modalities more effectively than random or entropy-based selection at the cohort level.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Real-world multimodal machine learning often encounters missing modalities that are costly to obtain, creating the need to prioritize samples for acquisition under a fixed budget. The paper introduces Cohort-based Active Modality Acquisition (CAMA), a test-time setting focused on cohort-level decisions rather than per-sample or training-time choices. It develops imputation-based strategies that train models on already-observed modalities to estimate the expected improvement from acquiring one new modality for selected samples. Experiments on datasets with as many as 15 modalities show these strategies select more useful samples than baselines relying on pre-acquisition information, entropy, or chance. The method is further validated by guiding proteomics data acquisition for disease prediction in the large UK Biobank cohort.

Core claim

Imputation models trained on observed modalities can estimate the expected utility of acquiring a missing modality for a cohort, enabling selection of the samples whose acquisition will most improve overall predictions when only a limited number can be obtained.

What carries the argument

Imputation-based acquisition strategies that use models fitted to existing modalities to approximate the performance gain from adding one new modality to chosen samples in the cohort.
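
The idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the linear imputer, the two logistic predictors with fixed weights (`predict_pre`, `predict_post`), the synthetic data, and the use of a KL divergence between post- and pre-acquisition predictions as the utility proxy are all assumptions made for the example.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def fit_linear_imputer(xs, zs):
    """Least-squares fit z ~ a*x + b on samples where z was observed.

    Returns the slope, intercept, and residual standard deviation.
    """
    n = len(xs)
    mean_x, mean_z = sum(xs) / n, sum(zs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sum((x - mean_x) * (z - mean_z) for x, z in zip(xs, zs)) / sxx
    b = mean_z - a * mean_x
    resid_sd = math.sqrt(sum((z - (a * x + b)) ** 2 for x, z in zip(xs, zs)) / n)
    return a, b, resid_sd


def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between two Bernoulli distributions with means p and q."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def predict_pre(x):
    # Hypothetical predictor that only sees the observed modality x.
    return sigmoid(0.5 * x)


def predict_post(x, z):
    # Hypothetical predictor that also sees the acquired modality z.
    return sigmoid(0.5 * x + 1.5 * z)


def estimated_utility(x, imputer, rng, n_draws=50):
    """Average prediction shift over imputed draws of the missing modality."""
    a, b, resid_sd = imputer
    total = 0.0
    for _ in range(n_draws):
        z_hat = rng.gauss(a * x + b, resid_sd)  # sample from the imputer
        total += bernoulli_kl(predict_post(x, z_hat), predict_pre(x))
    return total / n_draws


def select_cohort(xs, imputer, budget, seed=0):
    """Rank samples by estimated utility and keep the top `budget` indices."""
    rng = random.Random(seed)
    scores = [estimated_utility(x, imputer, rng) for x in xs]
    order = sorted(range(len(xs)), key=lambda i: scores[i], reverse=True)
    return order[:budget], scores


# Synthetic data: z is observed for the training samples only.
rng = random.Random(42)
train_x = [rng.gauss(0.0, 1.0) for _ in range(200)]
train_z = [0.8 * x + rng.gauss(0.0, 0.5) for x in train_x]
imputer = fit_linear_imputer(train_x, train_z)

cohort_x = [rng.gauss(0.0, 1.0) for _ in range(20)]
selected, scores = select_cohort(cohort_x, imputer, budget=5)
print(selected)
```

Sampling several imputed values rather than using a single point estimate lets the score reflect the imputer's uncertainty; other utility proxies could be substituted for the KL term without changing the ranking-and-budget structure.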

If this is right

Acquisition budgets yield higher model performance when samples are chosen according to estimated utility rather than heuristics.
The approach remains practical for cohorts containing up to 15 modalities without requiring exhaustive computation.
Real-world data collection efforts such as prospective biobank studies can reduce costs by focusing new measurements on high-utility participants.
Test-time cohort decisions become feasible without retraining the entire model after each acquisition.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same imputation principle could extend to selecting which new features to measure in high-dimensional tabular or sensor data.
Iterative versions might alternate between acquiring modalities and retraining the imputation models to refine estimates over time.
Performance would likely degrade if the imputation models themselves are trained on very small or unrepresentative observed subsets.

Load-bearing premise

Imputation models trained on observed modalities can produce sufficiently accurate estimates of the expected utility of acquiring a missing modality for the cohort.

What would settle it

The claim would fail if, on a multimodal dataset with ground-truth performance gains known for every possible sample-modality pair, the samples chosen by the imputation strategies yielded no higher downstream accuracy than randomly selected samples after the same number of acquisitions.

Original abstract

Real-world multimodal machine learning often faces missing, costly-to-acquire modalities, raising the problem of which samples to prioritize for additional acquisition under a budget. Prior work mainly studies per-sample or training-time acquisition while test-time, cohort-level acquisition is less explored. We propose Cohort-based Active Modality Acquisition (CAMA), a novel test-time cohort-level modality acquisition setting, and introduce imputation-based acquisition strategies that estimate the expected utility of acquiring a missing modality, along with upper-bound heuristics for benchmarking. Experiments on datasets with up to 15 modalities demonstrate that our proposed imputation-based strategies can more effectively guide the acquisition of an additional modality for selected samples compared with methods relying solely on pre-acquisition information, entropy-based guidance, or random selection. We showcase the real-world relevance and scalability of our method by demonstrating its ability to guide the acquisition of proteomics data for disease prediction in a large prospective cohort, the UK Biobank (UKB). Our work provides an effective approach for optimizing modality acquisition at the cohort level, enabling more effective use of resources in constrained settings.
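
For contrast, the entropy-based and random baselines named in the abstract can be sketched as follows. This is a hedged illustration for a binary task: the function names, the example probabilities, and the choice of binary entropy over pre-acquisition predictions are assumptions, not details taken from the paper.

```python
import math
import random


def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli prediction with probability p."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def select_by_entropy(pre_probs, budget):
    """Pick the samples whose pre-acquisition predictions are most uncertain."""
    order = sorted(
        range(len(pre_probs)),
        key=lambda i: binary_entropy(pre_probs[i]),
        reverse=True,
    )
    return order[:budget]


def select_at_random(n_samples, budget, seed=0):
    """Pick `budget` distinct sample indices uniformly at random."""
    return random.Random(seed).sample(range(n_samples), budget)


# Hypothetical pre-acquisition probabilities for five samples.
pre_probs = [0.50, 0.90, 0.45, 0.99, 0.20]
entropy_pick = select_by_entropy(pre_probs, budget=2)
random_pick = select_at_random(len(pre_probs), budget=2)
print(entropy_pick)  # [0, 2]
```

Both baselines use only information available before acquisition, which is the property the imputation-based strategies are compared against.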

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CAMA defines a clear test-time cohort-level acquisition setting and shows imputation strategies beating simple baselines on multimodal data, but the gains rest on unexamined imputation accuracy.

The paper's main contribution is framing modality acquisition as a test-time, cohort-level problem rather than the usual per-sample or training-time versions. They estimate the value of acquiring a missing modality for selected samples by imputing it from the observed ones, then pick under a budget. This is explicitly separated from prior work in the abstract, and they add upper-bound heuristics for comparison. Experiments on sets with up to 15 modalities report better guidance than pre-acquisition information, entropy, or random selection, plus a UK Biobank proteomics example for disease prediction in a large cohort.

That practical target is where the work lands best: medical or similar settings where some modalities are expensive and you want to allocate them efficiently after initial data is in hand. The distinction from existing active learning lines is useful and the scale of the real-world demo is reasonable.

The soft spots sit mainly in the empirical grounding. The abstract gives no numbers, error bars, or details on imputation model performance, so it is hard to judge how large or stable the reported advantages are. The stress-test concern is fair here: if imputation from observed modalities does not preserve the signal that actually drives downstream improvement, the utility rankings can mislead, especially with high-dimensional inter-modality links or non-random missingness in proteomics data. Without ablations that tie imputation error directly to acquisition quality, the central claim stays provisional. Minor additional gaps include lack of variance reporting across runs and limited discussion of how the method behaves when imputation is only moderately accurate.

This paper is for people working on multimodal systems under acquisition budgets, particularly in medical cohorts. A reader focused on practical extensions of active learning or resource allocation would get the most from the new setting and the UK Biobank case.

It deserves peer review because the problem is real, the framing is distinct, and the experiments target relevant scale, even though revisions would need tighter quantification and checks on the imputation step.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Cohort-based Active Modality Acquisition (CAMA), a test-time setting for selecting a cohort of samples to acquire one additional missing modality under a fixed budget. It introduces imputation-based strategies that train models on observed modalities to estimate the expected utility of acquiring the missing modality for each sample, together with upper-bound heuristics for benchmarking. Experiments on synthetic and real multimodal datasets (up to 15 modalities) and a UK Biobank proteomics application for disease prediction are reported to show that the imputation-based methods outperform baselines that use only pre-acquisition information, entropy, or random selection.

Significance. If the empirical results are robust, the work addresses a practically relevant gap in multimodal learning: how to allocate limited acquisition resources across a cohort at test time rather than per-sample or during training. The real-world UK Biobank demonstration and the provision of upper-bound heuristics are strengths that would make the contribution useful for resource-constrained settings such as biobanks or clinical studies.

major comments (2)

[Abstract] Abstract: the central claim that imputation-based strategies 'more effectively guide the acquisition' rests on comparative experiments, yet the abstract (and the reader's summary) contains no quantitative metrics, effect sizes, error bars, or statistical tests; this absence makes it impossible to judge whether the reported gains are load-bearing or merely directional.
[Experiments] Experiments / UK Biobank section: the skeptic concern is not resolved by any reported diagnostic; there is no evidence that the imputation models recover conditional distributions whose errors preserve the downstream predictive signal for the target task, especially under the high-dimensional dependencies and potentially non-random missingness present in 15-modality or proteomics data.

minor comments (2)

[Method] Clarify the precise mathematical definition of 'expected utility' that the imputation model is trained to predict; without an explicit equation it is hard to see how imputation error propagates to acquisition ranking.
[Experiments] Add a table or figure that reports imputation model quality (e.g., MSE or correlation with held-out modality values) alongside the acquisition performance metrics.
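
The imputation-quality diagnostics requested in the last comment could be computed along these lines. This is a minimal sketch with made-up held-out values; the helper names are hypothetical and nothing here reflects numbers reported in the paper.

```python
import math


def mean_squared_error(true_values, imputed_values):
    """Average squared difference between held-out and imputed values."""
    n = len(true_values)
    return sum((t - p) ** 2 for t, p in zip(true_values, imputed_values)) / n


def pearson_correlation(true_values, imputed_values):
    """Pearson correlation between held-out and imputed values."""
    n = len(true_values)
    mean_t = sum(true_values) / n
    mean_p = sum(imputed_values) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(true_values, imputed_values))
    var_t = sum((t - mean_t) ** 2 for t in true_values)
    var_p = sum((p - mean_p) ** 2 for p in imputed_values)
    return cov / math.sqrt(var_t * var_p)


# Illustrative held-out modality values and their imputations (not real data).
held_out = [1.0, 2.0, 3.0, 4.0]
imputed = [1.1, 1.9, 3.2, 3.8]

imputation_mse = mean_squared_error(held_out, imputed)
imputation_r = pearson_correlation(held_out, imputed)
print(round(imputation_mse, 3), round(imputation_r, 3))
```

Reporting both numbers next to the acquisition metrics would show whether strategies still rank samples well when the imputer is only moderately accurate.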

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight opportunities to strengthen the presentation of empirical results and to provide additional validation for the imputation-based approach. We address each major comment below and indicate the revisions we will make.

Referee: [Abstract] Abstract: the central claim that imputation-based strategies 'more effectively guide the acquisition' rests on comparative experiments, yet the abstract (and the reader's summary) contains no quantitative metrics, effect sizes, error bars, or statistical tests; this absence makes it impossible to judge whether the reported gains are load-bearing or merely directional.

Authors: We agree that the abstract would benefit from quantitative details to better convey the strength of the results. In the revised version, we will update the abstract to include specific metrics such as average performance gains (e.g., in AUC or accuracy) over the baselines, along with standard deviations from multiple runs and notes on statistical significance where computed in the experiments.
Revision: yes

Referee: [Experiments] Experiments / UK Biobank section: the skeptic concern is not resolved by any reported diagnostic; there is no evidence that the imputation models recover conditional distributions whose errors preserve the downstream predictive signal for the target task, especially under the high-dimensional dependencies and potentially non-random missingness present in 15-modality or proteomics data.

Authors: We acknowledge the value of direct diagnostics on imputation quality. While the manuscript validates the approach through improved downstream task performance on both synthetic and real data (including UK Biobank), we will add in the revision explicit diagnostics: imputation error metrics on held-out sets, analysis of how imputation noise affects utility estimates, and checks for whether higher-error imputations correlate with reduced acquisition benefit. For the proteomics data, we will include discussion of missingness patterns and their impact on the observed gains.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation against external baselines

The paper defines CAMA as a test-time cohort-level acquisition setting and proposes imputation-based strategies to estimate expected utility of acquiring missing modalities. Evaluation consists of direct comparisons on held-out datasets (up to 15 modalities) and UK Biobank proteomics against independent baselines (pre-acquisition information, entropy-based guidance, random selection). No equations, derivations, or self-citations are shown that reduce the reported acquisition utility or performance gains to quantities fitted or defined by the same experimental data. The central claim therefore rests on observable experimental outcomes rather than any self-referential reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the ability of imputation to proxy acquisition utility and on standard supervised learning assumptions; no new physical entities or ad-hoc constants are introduced.

free parameters (1)

acquisition budget
Controls how many samples or modalities can be acquired; treated as an input constraint rather than fitted.

axioms (1)

domain assumption: Imputation models trained on observed modalities produce utility estimates that correlate with the true utility after acquisition.
Invoked to justify the imputation-based acquisition strategies described in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: imputation-based acquisition strategies that estimate the expected utility of acquiring a missing modality... KL-Divergence, Rank, Uncertainty

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: normalized area of gain... GM_full(AF) = integral (M_AF(b) - M_pre) db / (M_post - M_pre)
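
The quoted formula can be read as the area between the acquisition curve and the pre-acquisition baseline, divided by the largest possible gain. A minimal sketch follows, assuming the budget b is expressed as a fraction in [0, 1], that the integral is approximated with the trapezoidal rule, and that M_pre and M_post denote the metric before any acquisition and after acquiring the modality for every sample. The quoted passage does not state the integration limits, so these are assumptions, and the function name is hypothetical.

```python
def normalized_area_of_gain(budgets, metrics, m_pre, m_post):
    """Trapezoidal estimate of integral (M_AF(b) - M_pre) db / (M_post - M_pre).

    budgets: increasing budget fractions b at which the metric was measured
    metrics: metric M_AF(b) obtained by an acquisition function at each b
    """
    if len(budgets) != len(metrics) or len(budgets) < 2:
        raise ValueError("need at least two matching (budget, metric) points")
    if m_post == m_pre:
        raise ValueError("M_post must differ from M_pre")
    area = 0.0
    for i in range(1, len(budgets)):
        width = budgets[i] - budgets[i - 1]
        gain_left = metrics[i - 1] - m_pre
        gain_right = metrics[i] - m_pre
        area += 0.5 * width * (gain_left + gain_right)
    return area / (m_post - m_pre)


# Illustrative curves (made-up numbers): metric 0.70 before, 0.80 after full acquisition.
budgets = [0.0, 0.5, 1.0]
linear_curve = [0.70, 0.75, 0.80]        # gain grows in proportion to budget
front_loaded_curve = [0.70, 0.80, 0.80]  # all gain reached at half the budget

g_linear = normalized_area_of_gain(budgets, linear_curve, 0.70, 0.80)
g_front = normalized_area_of_gain(budgets, front_loaded_curve, 0.70, 0.80)
print(round(g_linear, 3), round(g_front, 3))  # 0.5 0.75
```

Under these assumptions a curve that improves in proportion to the budget scores 0.5, and a strategy that captures the gain earlier scores closer to 1.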

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Hidden in the Multiplicative Interaction: Uncovering Fragility in Multimodal Contrastive Learning
cs.LG · 2026-04 · unverdicted · novelty 7.0
Multimodal contrastive learning using multilinear products is fragile to single bad modalities, and a gated version improves top-1 retrieval accuracy on synthetic and real trimodal data.

Fusion or Confusion? Multimodal Complexity Is Not All You Need
cs.LG · 2025-12 · unverdicted · novelty 6.0
Complex multimodal architectures do not reliably outperform unimodal baselines or a simple multimodal baseline under standardized evaluation.