pith. machine review for the scientific record.

eess.IV

Image and Video Processing

Theory, algorithms, and architectures for the formation, capture, processing, communication, analysis, and display of images, video, and multidimensional signals in a wide variety of applications. Topics of interest include: mathematical, statistical, and perceptual image and video modeling and representation; linear and nonlinear filtering, de-blurring, enhancement, restoration, and reconstruction from degraded, low-resolution or tomographic data; lossless and lossy compression and coding; segmentation, alignment, and recognition; image rendering, visualization, and printing; computational imaging, including ultrasound, tomographic and magnetic resonance imaging; and image and video analysis, synthesis, storage, search and retrieval.

eess.IV 2026-05-13 Recognition

CycleGAN turns standard CT scans into usable low-dose training data

A Comparative Analysis of CT Degradation for LDCT Nodule Classification using Radiomics

Synthetic images raise nodule classifier AUC to 0.861 and sensitivity to 0.743 on real screening cases.

Low-dose computed tomography (LDCT) is the standard modality for lung cancer screening, known for its low radiation dose but high noise levels. While the existing literature focuses on denoising LDCT images, comparative research on simulating LDCT characteristics so that such images can be used directly for model development is lacking. This study shifts the focus from denoising images to degrading available standard-dose CT (SDCT) data, generating synthetic images for data augmentation to train classifiers for screening-detected nodules. We compare three degradation methods: (1) statistical noise insertion in the sinogram domain; (2) a Pix2Pix replication of a validated physics-based simulation; and (3) an unpaired CycleGAN. The generated images were used to simulate an LDCT screening scenario, replacing 695 SDCT cases from the LIDC-IDRI dataset, from which radiomic features were extracted to train machine learning models for lung nodule classification. Regarding image quality, CycleGAN achieved the best Fréchet inception distance (0.1734) and kernel inception distance (0.0813; 0.1002) scores, indicating distributional alignment with the target low-dose domain. In the nodule classification task, the results confirmed the necessity of domain adaptation: a baseline model trained on non-degraded SDCT data failed to generalize to the real LDCT set (AUC 0.789) with low sensitivity (0.571). Degraded images generated with the CycleGAN approach led to the most balanced classification performance with an Adam Booster classifier, achieving an AUC of 0.861, a sensitivity of 0.743, and a specificity of 0.858 on the independent test set. Our findings confirm that generating synthetic LDCT data from standard-dose scans is a viable strategy for training robust nodule classifiers for screening-detected nodules.
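The first degradation route above, statistical noise insertion in the sinogram domain, can be sketched in a few lines: forward-project a slice, scale the expected photon counts by a dose-reduction factor, inject Poisson noise, and reconstruct. A minimal sketch; the parallel-beam geometry, photon budget i0, and dose fraction are illustrative assumptions, not the paper's calibrated simulation.

```python
import numpy as np
from skimage.transform import radon, iradon

def simulate_low_dose(image, i0=1e5, dose_fraction=0.25):
    """Degrade a standard-dose slice by injecting Poisson noise in sinogram space."""
    angles = np.linspace(0.0, 180.0, max(image.shape), endpoint=False)
    sinogram = radon(image, theta=angles)            # attenuation line integrals
    # Beer-Lambert: expected detected photons at the reduced dose.
    counts = dose_fraction * i0 * np.exp(-sinogram)
    noisy = np.clip(np.random.poisson(counts).astype(float), 1, None)  # avoid log(0)
    noisy_sinogram = -np.log(noisy / (dose_fraction * i0))
    return iradon(noisy_sinogram, theta=angles, filter_name="ramp")
```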

eess.IV 2026-05-13 2 theorems

Radiomics guide diffusion model for label-free lung CT segmentation

DiffSegLung: Diffusion Radiomic Distillation for Unsupervised Lung Pathology Segmentation

Handcrafted tissue descriptors shape network features so clustering can label multiple pathologies on unlabeled scans.

Unsupervised segmentation of pulmonary pathologies in CT remains an open challenge due to the absence of annotated multi-pathology cohorts and the failure of existing diffusion-based methods to exploit the quantitative Hounsfield Unit (HU) signal that physically distinguishes tissue classes. To address this, we propose DiffSegLung, a framework that introduces Diffusion Radiomic Distillation, in which handcrafted radiomic descriptors serve as a physics-grounded teacher to shape the bottleneck of a 3D diffusion U-Net via a contrastive objective, transferring pathology-discriminative structure into the learned representation without any annotations. At inference, the teacher is discarded and multi-timestep bottleneck features are clustered by a Gaussian Mixture Model with HU-guided label assignment, followed by Sobel Diffusion Fusion for boundary refinement. Evaluated on 190 expert-annotated axial slices drawn from four heterogeneous CT cohorts, DiffSegLung improves segmentation across all four pathology classes over unsupervised baselines and improves generation fidelity over prior CT diffusion models.

eess.IV 2026-05-13 Recognition

Joint optimization raises low-field MRI quality without extra time

NexOP: Joint Optimization of NEX-Aware k-space Sampling and Image Reconstruction for Low-Field MRI

NexOP learns varying k-space patterns across repetitions to combine low-SNR scans into clearer images within a fixed budget.

Modern low-field magnetic resonance imaging (MRI) technology offers a compelling alternative to standard high-field MRI, with portable, low-cost systems. However, its clinical utility is limited by a low Signal-to-Noise Ratio (SNR), which hampers diagnostic image quality. A common approach to increasing SNR is repeated signal acquisition, known as NEX (number of excitations), but this results in excessively long scan durations. Although recent work has introduced methods to accelerate MRI scans through k-space sampling optimization, the NEX dimension remains unexploited; typically, a single sampling mask is used across all repetitions. Here we introduce NexOP, a deep-learning framework for joint optimization of sampling and reconstruction in multi-NEX acquisitions, tailored for low-SNR settings. NexOP optimizes the sampling density probabilities across the extended k-space-NEX domain under a fixed sampling-budget constraint, and introduces a new deep-learning architecture for reconstructing a single high-SNR image from multiple low-SNR measurements. Experiments with raw low-field (0.3T) brain data demonstrate that NexOP consistently outperforms competing methods, both quantitatively and qualitatively, across diverse acceleration factors and tissue contrasts. The results also show that NexOP yields non-uniform sampling strategies, with progressively decreasing sampling across repetitions, hence exploiting the NEX dimension efficiently. Moreover, we present a theoretical analysis supporting these numerical observations. Overall, this work proposes a sampling-reconstruction optimization framework highly suitable for low-field MRI, which can enable faster, higher-quality imaging with low-cost systems and contribute to advancing affordable and accessible healthcare.
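As a rough illustration of sampling a k-space x NEX grid under a fixed budget, the sketch below keeps one learnable logit per (repetition, line) cell, rescales the resulting probabilities so their expected sum matches the budget, and uses a straight-through Bernoulli sample so gradients reach the logits. The class name and relaxation are assumptions for the example, not NexOP's architecture.

```python
import torch

class KSpaceNexSampler(torch.nn.Module):
    """Learnable sampling density over a (NEX, k-space line) grid."""
    def __init__(self, n_lines=128, n_nex=4, budget=128):
        super().__init__()
        self.logits = torch.nn.Parameter(torch.zeros(n_nex, n_lines))
        self.budget = budget  # total acquired lines across all repetitions

    def forward(self):
        probs = torch.sigmoid(self.logits)
        # Rescale so the expected number of sampled lines equals the budget.
        probs = (probs * self.budget / probs.sum().clamp_min(1e-8)).clamp(max=1.0)
        hard = torch.bernoulli(probs.detach())
        # Straight-through estimator: binary mask forward, soft gradient back.
        return (hard - probs).detach() + probs

sampler = KSpaceNexSampler()
mask = sampler()          # (n_nex, n_lines), differentiable w.r.t. the logits
print(mask.sum().item())  # ~128 acquired lines in expectation
```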

eess.IV 2026-05-13 3 theorems

Frequency modules lift transformer accuracy on 3D medical scans

FEFormer: Frequency-enhanced Vision Transformer for Generic Knowledge Extraction and Adaptive Feature Fusion in Volumetric Medical Image Segmentation

FEFormer adds four frequency-aware components to capture local details and fuse features, outperforming prior methods on four segmentation tasks.

Accurate segmentation of organs and lesions in medical images is essential for clinical applications including diagnosis, prognosis, and treatment planning. While Vision Transformers (ViTs) have shown impressive segmentation performance, they face key challenges in module and architecture design. Specifically, self-attention struggles to capture fine-grained local features critical for understanding detailed anatomical structures, standard MLP modules lack explicit mechanisms to preserve spatial information, conventional encoder-decoder architectures rely on naive feature fusion strategies that cannot handle large semantic discrepancies, and existing designs lack explicit mechanisms to propagate low-level information from encoder to decoder. To address these limitations, we propose a Frequency-enhanced Vision Transformer (FEFormer) for robust and efficient volumetric medical image segmentation that explicitly models frequency information to jointly capture global context and fine structural details. FEFormer comprises four novel components: a Frequency-enhanced Dynamic Self-Attention (FDSA) module that jointly captures fine-grained local details and global long-range dependencies through locality-preserving convolution with frequency-domain attention; a Frequency-decomposed Gating MLP (FGMLP) that adaptively models low- and high-frequency components for enhanced semantic and structural representation; a Wavelet-guided Adaptive Feature Fusion (WAFF) module that enables semantically consistent encoder-decoder feature integration in the frequency domain; and a Frequency-enabled Cross-scale Stem Bridge (FCSB) that enhances low-level feature propagation across scales. Evaluated on four diverse volumetric medical image segmentation tasks, FEFormer achieved superior segmentation performance with high computational efficiency compared to state-of-the-art methods.
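The frequency-decomposition idea behind modules such as FGMLP can be illustrated with a small gate that splits a feature map into low- and high-frequency bands via the FFT and rescales each band independently. A minimal sketch; the radial cutoff and scalar gains are assumptions for the example, not FEFormer's learned design.

```python
import torch

def frequency_gate(x, cutoff=0.25, low_gain=1.0, high_gain=0.5):
    """Split (B, C, H, W) features into low/high bands in Fourier space and rescale."""
    freq = torch.fft.rfft2(x, norm="ortho")
    fy = torch.fft.fftfreq(x.shape[-2], device=x.device)[:, None]
    fx = torch.fft.rfftfreq(x.shape[-1], device=x.device)[None, :]
    low = ((fy**2 + fx**2).sqrt() <= cutoff).to(freq.dtype)  # radial low-pass mask
    gated = freq * (low_gain * low + high_gain * (1 - low))
    return torch.fft.irfft2(gated, s=x.shape[-2:], norm="ortho")

x = torch.randn(1, 8, 32, 32)
y = frequency_gate(x)  # same shape, high frequencies attenuated
```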

eess.IV 2026-05-12 2 theorems

Co-learning refines noisy labels in split federated medical segmentation

SplitFed-CL: A Split Federated Co-Learning Framework for Medical Image Segmentation with Inaccurate Labels

Global teacher guides local students to correct unreliable annotations and raise segmentation accuracy without sharing raw data.

Split Federated Learning (SplitFed) combines federated and split learning to preserve privacy while reducing client-side computation. However, in medical image segmentation, heterogeneous label quality across clients can significantly degrade performance. We propose SplitFed-CL, a co-learning framework where a global teacher guides local students to detect and refine unreliable annotations. Reliable labels supervise training directly, while unreliable labels are corrected via weighted student-teacher refinement. SplitFed-CL further incorporates consistency regularization for robustness to input perturbations and a trainable weighting module to balance loss terms adaptively. We also introduce a novel difficulty-guided strategy to simulate human-like, boundary-centric annotation errors, where the degree of perturbation is governed by shape complexity and the associated annotation difficulty. Experiments on two multiclass segmentation datasets with controlled synthetic noise, together with a binary segmentation dataset containing real-world annotation errors, demonstrate that SplitFed-CL consistently outperforms seven state-of-the-art baselines, yielding improved segmentation quality and robustness.
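A minimal sketch of the weighted student-teacher refinement described above: pixels whose annotation the teacher agrees with keep their label, while the rest receive a convex mix of teacher and student predictions. The agreement test and mixing weight here are illustrative assumptions, not SplitFed-CL's trained weighting module.

```python
import torch

def refine_labels(label_onehot, student_prob, teacher_prob, w=0.7, tau=0.5):
    """All tensors (B, C, H, W); returns soft supervision targets per pixel."""
    # Probability the teacher assigns to the annotated class at each pixel.
    agreement = (teacher_prob * label_onehot).sum(dim=1, keepdim=True)
    reliable = (agreement > tau).float()
    refined = w * teacher_prob + (1.0 - w) * student_prob
    # Keep reliable annotations; replace unreliable ones with the refined mix.
    return reliable * label_onehot + (1.0 - reliable) * refined
```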

eess.IV 2026-05-12 Recognition

Dataset turns satellite construction images into 2.3 million VQA examples

Geospatial-Temporal Sensemaking of Remote Sensing Activity Detections with Multimodal Large Language Model

SMART-HC-VQA converts Sentinel-2 chips and activity annotations into temporal questions so multimodal models can track process progression.

We introduce SMART-HC-VQA, a Sentinel-2-based visual question answering dataset derived from the IARPA SMART Heavy Construction dataset, designed for spatiotemporal analysis of human activity. The dataset transforms construction-site annotations, construction-type labels, temporal-phase labels, geographic metadata, and observation relationships into natural language question-answer triplets. This approach redefines the existing dataset as a temporally extended automatic target recognition and visual question answering (VQA) challenge, considering a fixed geospatial site as a target whose attributes and activity states evolve across sparse satellite observations. Currently, SMART-HC-VQA comprises 21,837 accessible Sentinel-2 image chips, 65,511 single-image VQA examples, and approximately 2.3 million two-image temporal comparison examples generated via our novel Image-Pairwise Combinatorial Augmentation. We detail the workflow for retrieving and processing Sentinel-2 imagery, segmenting large satellite tiles into site-centered images, maintaining traceability to SMART-HC annotations, and analyzing the distributions of site size, observation count, temporal coverage, construction type, and phase labels. Additionally, we describe an implemented multi-image MLLM training framework based on LLaVA-NeXT Mistral-7B, adapted to accept multiple dated image inputs and train on metadata-derived VQA examples. This work offers a reproducible foundation for understanding language-guided remote sensing activities, aiming not only to detect change but also to reason about the ongoing processes, their progression, and potential future developments.
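The Image-Pairwise Combinatorial Augmentation can be pictured as enumerating every chronologically ordered pair of observations of a site and phrasing each as a two-image temporal question. A hedged sketch; the field names and question template below are hypothetical, not the dataset's exact schema.

```python
from itertools import combinations

def make_pairwise_examples(site_obs):
    """site_obs: list of dicts with 'chip', 'date', and 'phase' keys for one site."""
    ordered = sorted(site_obs, key=lambda o: o["date"])
    examples = []
    for earlier, later in combinations(ordered, 2):
        examples.append({
            "images": (earlier["chip"], later["chip"]),
            "question": (f"How did construction progress between "
                         f"{earlier['date']} and {later['date']}?"),
            "answer": f"Phase changed from {earlier['phase']} to {later['phase']}.",
        })
    return examples  # n observations yield n * (n - 1) / 2 ordered pairs
```

The quadratic growth in pairs is what lets roughly 65k single-image examples expand into about 2.3 million two-image comparisons.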

eess.IV 2026-05-12 2 theorems

One network registers cardiac MRI of any length or contrast

Set-Based Groupwise Registration for Variable-Length, Variable-Contrast Cardiac MRI

Trained on a single T1 dataset, the set-based model generalizes to other protocols and improves tissue mapping quality.

Quantitative cardiac magnetic resonance imaging (MRI) enables non-invasive myocardial tissue characterization but relies on robust motion correction within these variable-length, variable-contrast image sequences. Groupwise registration, which simultaneously aligns all images, has shown greater robustness than pairwise registration for motion correction. However, current deep-learning-based groupwise registration methods cannot generalize across MRI sequences: the architecture typically encodes input data as a fixed-length channel stack, which rigidly couples network design to protocol-specific sequence length, input ordering, and contrast dynamics. At inference time, any change in imaging protocols will render the network unusable. In this work, we introduce AnyTwoReg, a new set-based groupwise registration framework that takes a quantitative MRI sequence as an unordered set. This set formulation fundamentally decouples network design from sequence length and input ordering. By utilizing a shared encoder and correlation-guided feature aggregation, AnyTwoReg constructs a permutation-invariant canonical reference for registration, and learns a permutation-equivariant mapping from images to deformation fields. Additionally, we extract contrast-insensitive image features from an existing foundation model to handle extreme contrast variations. Trained exclusively on a single public T1 mapping dataset (STONE, sequence length L = 11), AnyTwoReg generalizes to two unseen quantitative MRI datasets (MOLLI, ASL) with variable lengths (L in [11, 60]) and different contrast dynamics. It achieves strong cross-protocol generalization in a zero-shot manner, and consistently improves downstream quantitative mapping quality. Notably, while designed for quantitative MRI sequences, our framework is directly applicable to Cine MRI sequences for inter-cardiac-phase registration.

eess.IV 2026-05-12 2 theorems

Online SAR processor focuses images line by line in 16 ms

Learning to Focus Synthetic Aperture Radar On-line with State-Space Models

State-space model trained by distillation delivers 70x lower latency and 130x lower memory than block methods while supporting vessel detection and flood mapping.

Conventional focusing methods for Synthetic Aperture Radar (SAR) process data efficiently in blocks but remain latency-heavy, preventing the realisation of a closed-loop cognitive SAR vision system. We present the first Online SAR Processor (OSP), an online image-formation framework that treats SAR sensing as a stream and produces focused SAR image output line by line during acquisition. OSP uses a tiny state-space surrogate model trained with teacher-student distillation and multi-stage losses. We evaluate the method on 300 GB of SAR data from Maya4, a Sentinel-1-derived dataset containing raw, range-compressed, range-cell-migration-corrected, and azimuth-compressed products. Relative to a linewise digital-signal-processing baseline, OSP delivers approximately 70x lower latency and 130x lower memory use; on a single AMD CPU core it processes one row in 16 ms with a memory footprint of 6 MB whilst maintaining focusing quality high enough to support downstream decisions, which we illustrate with vessel detection and flood-mapping tasks.
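A diagonal linear state-space recurrence makes the streaming claim concrete: each incoming raw line updates a fixed-size hidden state and emits one focused row, so memory stays constant in the number of lines. A minimal sketch; the dimensions, scalar decay, and linear readout are assumptions, not the OSP surrogate.

```python
import torch

class LinewiseSSM(torch.nn.Module):
    def __init__(self, d_in=512, d_state=256, d_out=512):
        super().__init__()
        self.a = torch.nn.Parameter(torch.full((d_state,), 0.95))  # per-line decay
        self.B = torch.nn.Linear(d_in, d_state, bias=False)
        self.C = torch.nn.Linear(d_state, d_out)

    def forward(self, lines):
        """lines: (T, d_in) raw azimuth rows arriving one at a time."""
        h = torch.zeros(self.a.shape[0])
        out = []
        for x_t in lines:               # O(1) state per incoming line
            h = self.a * h + self.B(x_t)
            out.append(self.C(h))
        return torch.stack(out)         # (T, d_out) focused rows
```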

eess.IV 2026-05-12 2 theorems

Ray tracing lets microwave imaging see hidden targets

Polarization-Aware Ray-Tracing Enhanced Back-Projection Algorithm for Microwave Imaging in Complex Multipath Environments

Reflected paths act as virtual aperture extensions to improve resolution and reveal obstructed objects.

A ray-tracing (RT) enhanced back-projection algorithm (RT-BPA) for microwave imaging in multipath environments is presented. By tightly incorporating the concept of ray tracing into a generalized version of the traditional BPA, this method improves image quality by addressing two important issues. First, when the line-of-sight (LOS) path is obstructed, reflected paths, if available, enable imaging of hidden targets, which extends the applicability of the standard BPA beyond its normal use case. Second, considering reflected ray paths is equivalent to virtually increasing the aperture size, thus improving image resolution without requiring new measurements. A key factor in achieving these advancements is accounting for the vector nature of electromagnetic waves through polarization-dependent phase compensation, which is often ignored in scalar-wave formulations of the electromagnetic vector field. In addition, the presented method employs a shooting and bouncing rays (SBR) framework, offering better flexibility compared to manual path evaluation in existing RT-BPAs.

eess.IV 2026-05-12 Recognition

Generative priors prove stable only in select imaging inverse problems

A Stability Benchmark of Generative Regularizers for Inverse Problems

Numerical benchmarks against variational methods show limitations under out-of-distribution data and model errors in scientific imaging.

Generative (diffusion) priors demonstrate remarkable performance in addressing inverse problems in imaging. Yet, for scientific and medical imaging, it is crucial that reconstruction techniques remain stable and reliable under imperfect settings. Typical definitions of stability encompass the notion of "convergent regularization", robustness to out-of-distribution data, and robustness to inaccuracies in the forward operator or noise model. We evaluate these properties numerically. Furthermore, we benchmark generative approaches against modern optimization-based methods inspired by the widely used variational techniques. Our results give insights into the settings and applications in which generative priors can deliver state-of-the-art reconstructions, and those in which they fall short or may even be problematic.

eess.IV 2026-05-12 Recognition

Tube packages stabilize video recovery faster in semantic HARQ

Tube-Structured Incremental Semantic HARQ for Generative Video Receivers

Package-native requests reduce time-weighted costs versus blocks in moderate channels by enabling earlier trajectory stabilization.

Generative semantic communication uses receiver-side generative priors to reconstruct visual content from compact semantics, making it attractive for bandwidth-limited multimedia delivery. For video, reliable recovery remains difficult because errors accumulate over time, useful evidence is temporally correlated, and the receiver must make decisions under limited interaction, retransmission, and reconstruction budgets. Existing generative semantic communication studies mainly emphasize representation, compression, or generative reconstruction, while recent error-resilient and semantic-HARQ methods still largely operate on encoder-defined or frame-block retransmission units. This paper studies receiver-driven semantic HARQ for generative video reconstruction under a budget-constrained AoIS-AUC objective and argues that the retransmission primitive is itself an important system design variable. We propose tube-structured package-native requests, in which temporally local packages are the channel-visible HARQ objects and are transmitted, dropped, received, and committed at package granularity. Under a controlled comparison protocol with matched backbone, budgets, and channel model, this primitive yields lower time-weighted recovery cost than competitive block-based baselines in practically relevant moderate-to-harsh regimes, while the gap naturally shrinks in near-clean channels. The gain mainly appears as earlier stabilization of the recovery trajectory, while final-quality endpoints remain broadly comparable, and it persists even against a tube-aware block-ranking baseline.

eess.IV 2026-05-11 Recognition

Curated synthetic images boost real pose baselines at low cost

A Real-Calibrated Synthetic-First Data Engine

Mixed training gains appear, but synthetic data alone still trails real-only performance on pose and segmentation tasks.

Modern computer vision systems increasingly encounter performance limitations in data-scarce domains, where collecting large-scale, high-quality labeled data is costly or impractical. While controllable diffusion models enable scalable synthetic image generation, directly applying synthetic augmentation often leads to unstable performance gains due to dataset-level quality issues and insufficient feedback mechanisms. In this work, we present a Real-Calibrated Synthetic-First Data Engine, a modular data engineering framework that combines controllable diffusion generation and multi-stage curation/filtering within a unified pipeline, with optional support for uncertainty-driven selection and human verification. Instead of introducing new generative algorithms, our approach focuses on systematic dataset construction for improving the practical reliability of synthetic augmentation in low-data regimes. The framework is implemented as a modular CLI-based pipeline, where generation, filtering, selection, and validation components can be independently configured and replaced. This design emphasizes reproducibility, flexibility, and practical deployment in real-world data workflows. Through empirical evaluation centered on human pose estimation, we show that synthetic data improves a real-data baseline when used as near-zero-human-annotation-cost augmentation alongside real anchors, while synthetic-only training remains substantially below real-only performance. Supplementary segmentation diagnostics show the same domain-gap pattern. These results highlight the practical value of data-centric orchestration for low-data augmentation.

eess.IV 2026-05-11 2 theorems

Jacobian metric selects tiny U-Nets at initialization

XTinyU-Net: Training-Free U-Net Scaling via Initialization-Time Sensitivity

Dataset-specific ultralight models selected without any training match full accuracy using 400x fewer parameters.

While U-Net architectures remain the gold standard for medical image segmentation, their deployment in resource-constrained environments demands aggressive model compression. However, finding an optimally efficient configuration is computationally prohibitive, typically requiring exhaustive train-and-evaluate cycles to find the smallest model that maintains peak performance. In this paper, we introduce a training-free selection framework to automatically identify ultralightweight, dataset-specific U-Net configurations directly at initialization. We observe that systematically scaling down U-Net channel width induces a sharp transition from a stable performance plateau to representational capacity collapse. To pinpoint this boundary without training, we propose a Jacobian-based sensitivity metric that scores discrete, width-capped U-Net variants using a small set of unlabeled images. By analyzing the total variation of this sensitivity curve, we isolate the smallest stable configuration, which we denote as XTinyU-Net. Evaluated across six diverse medical datasets within the nnU-Net framework, XTinyU-Net achieves segmentation accuracy comparable to the heavy nnU-Net baseline with 400x-1600x fewer parameters, and outperforms contemporary lightweight architectures while utilizing 5x-72x fewer parameters. Code is publicly accessible on https://github.com/alvinkimbowa/nntinyunet.git.
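The training-free selection loop can be approximated with a finite-difference Jacobian score on a small unlabeled batch, evaluated at initialization for each width-capped variant. The probe-based estimator below is a simplified stand-in for the paper's metric, under assumed shapes; see the linked repository for the exact formulation.

```python
import torch

@torch.no_grad()
def jacobian_sensitivity(model, images, eps=1e-3, n_probes=4):
    """images: small unlabeled batch (B, C, H, W); returns a scalar score."""
    base = model(images)
    score = 0.0
    for _ in range(n_probes):
        delta = eps * torch.randn_like(images)
        # Ratio of output change to input change ~ directional Jacobian norm.
        score += ((model(images + delta) - base).norm() / delta.norm()).item()
    return score / n_probes

# Score every width-capped variant with the same batch at initialization and
# keep the smallest width before the score collapses.
```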

eess.IV 2026-05-11 2 theorems

Computational framework reconstructs moving heart geometries for whole-heart flow simulation

Image-Based Whole-Heart Cardiac Flow Simulations in Health and Congenital Heart Disease

An image-based whole-heart CFD framework with ML segmentation and RIS valve modeling reproduces physiologic pressures, valve timing, and ventricular vortex formation.

Intracardiac flow patterns are shaped by the coupled motion of the cardiac chambers and heart valves and provide important information about cardiac function. However, clinical flow imaging remains limited by exam times, noise, resolution, and incomplete details of the three-dimensional flow. Computational fluid dynamics (CFD) can potentially provide detailed flow quantification and predictive insight into treatment outcomes, but clinical translation requires frameworks that reproduce patient-specific measurements while balancing physiological realism, computational cost, and modeling effort. Herein, we present an image-based, patient-specific computational framework for simulating whole-heart intracardiac hemodynamics that balances physiological fidelity with computational efficiency. The framework first employs machine learning-based segmentation and mesh propagation to reconstruct moving cardiac anatomies from time-resolved images. CFD simulations are then performed to resolve blood flow in deforming domains, while resistive immersed surfaces (RIS) are used to model all four cardiac valves with physiologically realistic opening and closing dynamics. The framework was applied to model hemodynamics in a healthy adult and a pediatric patient with complex congenital heart disease (CHD). In the healthy case, the simulations reproduced physiologic pressure-volume behavior, valve timing, and ventricular vortex formation. In the CHD case, simulated chamber and vessel pressures showed agreement with cardiac catheterization measurements. Simulated flow fields were qualitatively consistent with 4D-Flow MRI, while providing higher-resolution visualization of flow structures that were partially obscured by imaging artifacts. Comparison between the healthy and CHD cases further revealed altered diastolic flow organization and elevated normalized viscous dissipation in the CHD heart.

eess.IV 2026-05-11 Recognition

AI detects fetal brain bleeds without labeled scans

Annotation-free deep learning for detection and segmentation of fetal germinal matrix-intraventricular hemorrhage in brain MRI

Model trained on synthesized hemorrhages from normal MRIs outperforms supervised methods on real cases and speeds up radiologist review.

Background: Prenatal germinal matrix-intraventricular hemorrhage (GMH-IVH) is a leading cause of infant mortality and neurodevelopmental impairment. Manual diagnosis and lesion segmentation are labor-intensive and error-prone. Deep learning models offer potential for automation but typically require large annotated datasets, which are challenging to obtain. Purpose: To develop and validate an annotation-free deep learning framework for automated detection and segmentation of GMH-IVH on brain MRI. Materials and Methods: This retrospective study analyzed 2D T2-weighted MRI data from pregnant women collected from October 2015 to October 2023 at one hospital (internal validation) and two hospitals (external validation). Eligible participants included healthy fetuses and those with GMH-IVH. FreeHemoSeg was developed and trained using pseudo GMH-IVH images synthesized from normal fetal data guided by medical priors. Primary outcomes included diagnostic accuracy (area under the ROC curve [AUROC], sensitivity, specificity) and segmentation accuracy (Dice similarity coefficient [DSC]). A reader study evaluated clinical utility. Results: A total of 1674 stacks from 558 pregnant women were analyzed. FreeHemoSeg achieved the highest performance in both internal (sensitivity: 0.914, 95% CI 0.869-0.945; specificity: 0.966, 95% CI 0.946-0.978; DSC: 0.559, 95% CI 0.546-0.571) and external validation (sensitivity: 0.824, 95% CI 0.739-0.885; specificity: 0.943, 95% CI 0.913-0.964; DSC: 0.512, 95% CI 0.497-0.526), outperforming supervised and unsupervised methods. FreeHemoSeg assistance improved radiologists' sensitivity (from 0.882 to 0.941-1.000) and diagnostic confidence while reducing interpretation time by 16.0-52.7%. Conclusion: FreeHemoSeg accurately detects and localizes fetal brain hemorrhages without annotated training data, enabling earlier diagnosis and supporting timely clinical management.

eess.IV 2026-05-11 2 theorems

Multi-layer CLIP similarities predict machine image preferences

ML-CLIPSim: Multi-Layer CLIP Similarity for Machine-Oriented Image Quality

ML-CLIPSim beats standard metrics on machine benchmarks and improves compression trade-offs for downstream tasks.

We study full-reference image quality assessment from a machine-centric perspective, where images are evaluated by how well they preserve information for downstream models. We formulate machine-oriented quality as a latent machine utility and approximate it through pairwise predictive-consistency comparisons. To this end, we construct PCMP, a dataset of PSNR-matched distortion pairs labeled by consistency votes from multiple pretrained models. We further propose ML-CLIPSim, a differentiable quality metric built on a frozen CLIP visual encoder, which aggregates intermediate patch-token similarities and global image embeddings. Experiments on machine-preference benchmarks, human-IQA datasets, and learned image compression show that ML-CLIPSim better aligns with machine-oriented preferences than conventional fidelity and perceptual metrics, while remaining competitive for human quality prediction. Used as a compression distortion term, it improves rate-task trade-offs across multiple downstream tasks.
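The aggregation idea can be sketched as cosine similarity between reference and distorted token features averaged across several encoder stages. A hedged sketch; the stage abstraction and uniform layer weighting are assumptions, not ML-CLIPSim's exact pooling.

```python
import torch
import torch.nn.functional as F

def multilayer_similarity(stages, ref, dist):
    """stages: list of callables; stage i maps stage i-1 output to (B, N, D) tokens."""
    sims, x_r, x_d = [], ref, dist
    for stage in stages:
        x_r, x_d = stage(x_r), stage(x_d)
        # Patch-token cosine similarity at this depth, averaged over tokens.
        sims.append(F.cosine_similarity(x_r, x_d, dim=-1).mean())
    return torch.stack(sims).mean()  # higher = content better preserved for machines
```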

eess.IV 2026-05-11 2 theorems

Cross-modal vector lifts DR grading to 87.5% accuracy

Cross-Modal Semantic-Enhanced Diffusion Framework for Diabetic Retinopathy Grading

Dot product of adapted image and text-grade features guides diffusion denoising better than complex visual priors on APTOS 2019.

Automated grading of diabetic retinopathy (DR) faces several critical challenges: subtle inter-grade visual distinctions in fine-grained lesion patterns, distributional discrepancies induced by heterogeneous imaging devices and acquisition conditions, and the inherent inability of purely visual approaches to exploit clinical semantic knowledge. In this paper, we propose CLIP-Guided Semantic Diffusion (CGSD), a DR grading framework that synergistically integrates vision-language pretraining with diffusion probabilistic modeling. We adopt a domain-specific vision-language model tailored for DR grading as the semantic guidance module and adapt it to the target domain via Low-Rank Adaptation (LoRA), effectively bridging the distributional gap between the pretrained model and the target dataset with only a minimal number of trainable parameters. Building on this foundation, we construct a cross-modal semantic conditioning vector by computing the dot product between image features and the text description features of each DR grade, yielding a joint representation that simultaneously encodes visual content and clinical-grade semantics. This vector serves as the conditioning signal for the diffusion denoising network, replacing the structurally complex dual-branch visual prior employed in existing diffusion-based classification methods. Experiments on the APTOS 2019 dataset demonstrate that the proposed approach achieves an accuracy of 87.5% and a macro-averaged F1 score of 0.731, outperforming a variety of representative methods. Ablation studies further validate the independent contribution of each constituent module.
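The conditioning vector itself is simple to state: normalize the adapted image embedding and the text embedding of each grade description, then take their dot products, giving one similarity per grade. A minimal sketch with the encoders left abstract; names are illustrative.

```python
import torch
import torch.nn.functional as F

def grade_conditioning(image_feat, text_feats):
    """image_feat: (B, D) adapted image embedding; text_feats: (G, D), one per grade."""
    image_feat = F.normalize(image_feat, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    return image_feat @ text_feats.t()  # (B, G) cross-modal conditioning vector
```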

eess.IV 2026-05-11 2 theorems

Open tool unifies methane analysis from five satellites

HyGAS: an Open, Sensor-Agnostic Platform for Multi-Satellite Methane Plume Retrieval, Uncertainty Propagation, and Emission-Rate Estimation

HyGAS standardizes retrieval from raw data or products, uncertainty tracking, and flux estimation for comparable results across PRISMA, EnMAP, Tanager-1, EMIT, and GHGSat.

The rapid expansion of spaceborne methane observing capabilities at the facility-scale (fostered both by public missions and commercial constellations) has created a need for harmonised, reproducible, and uncertainty-aware processing chains that support both monitoring workflows and fair inter-sensor comparisons. This paper presents HyGAS (Hyperspectral Gas Analysis Suite), an open and sensor-agnostic framework that standardises methane processing across multiple imaging spectrometers. HyGAS currently supports end-to-end processing from Level-1 radiance to methane enhancement for PRISMA, EnMAP, and Tanager-1, and it supports ingestion of Level-2 methane enhancement products from EMIT and GHGSat, which are subsequently processed through common downstream modules for background selection, plume segmentation, Integrated Mass Enhancement (IME), and emission-rate inversion. HyGAS prioritises operational robustness via (i) matched-filter variants designed to mitigate background heterogeneity and pushbroom artefacts, (ii) explicit decomposition and propagation of uncertainty from instrument noise and scene-driven clutter to IME and flux, and (iii) a scale-aware segmentation strategy defined in physical units and rescaled by ground sampling distance to improve multi-sensor comparability. Representative sample outputs are reported for PRISMA, EnMAP, and Tanager-1. Keywords: Methane emissions, hyperspectral satellites, Tanager-1, PRISMA, EnMAP, GHGSat, EMIT, Tanager, oil and gas, landfills, remote sensing, atmospheric science, greenhouse gas monitoring, spectral analysis, emission quantification, satellite synergy, environmental monitoring.
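For reference, the IME emission-rate inversion named above typically follows Q = (U_eff / L) * IME, with L the square root of the plume area. A hedged numpy sketch: the unit conversion assumes standard conditions, and the effective-wind fit is illustrative rather than HyGAS's calibration.

```python
import numpy as np

def emission_rate(enhancement_ppmm, plume_mask, pixel_area_m2, u10_ms):
    """Source rate in kg/h from a methane enhancement map in ppm·m."""
    # 1 ppm·m of CH4 ~ 1e-6 * (16.04e-3 kg/mol / 22.4e-3 m^3/mol) kg/m^2 at STP.
    ppmm_to_kg_m2 = 1e-6 * 16.04e-3 / 22.4e-3   # ~7.2e-7 kg/m^2 per ppm·m
    ime_kg = (enhancement_ppmm[plume_mask] * ppmm_to_kg_m2 * pixel_area_m2).sum()
    length_m = np.sqrt(plume_mask.sum() * pixel_area_m2)   # L = sqrt(plume area)
    u_eff = 0.33 * u10_ms + 0.45   # illustrative linear fit, not HyGAS's calibration
    return 3600.0 * u_eff * ime_kg / length_m
```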

eess.IV 2026-05-11 Recognition

Tanager-1 joins PRISMA and EnMAP in methane plume framework

Multi-Sensor Methane Mapping in a Unified Framework: Tanager-1 Integration and comparison to EnMAP and PRISMA

Column-wise filter reduces false positives from sensor artifacts in multi-satellite methane mapping.

Spaceborne imaging spectroscopy enables facility-scale methane (CH4) plume detection and quantification by exploiting absorption structure in the 1.65/2.3 µm windows. However, performance ultimately depends on both radiometric sensitivity and the mitigation of pushbroom artifacts such as column-dependent variability and striping. This paper reports the integration of Planet/Carbon Mapper Tanager-1 Level-1 radiances into a mature multi-sensor methane processing chain previously applied to PRISMA and EnMAP, and evaluates the implications of Tanager-1's radiometric regime for matched-filter retrieval, plume segmentation, and IME-based flux estimation. The retrieval is based on a Clutter Matched Filter (CMF) formulation that yields methane enhancements in concentration-path-length units (ppm·m) and propagates uncertainty from radiance noise and background variability through enhancement maps, Integrated Mass Enhancement (IME), and emission rate via the IME method. Particular emphasis is placed on a column-wise CMF (CWCMF), in which background statistics are estimated per detector column to reduce structured false positives induced by pushbroom non-uniformities. A compact radiometric comparison between PRISMA, EnMAP, and Tanager-1 is performed on homogeneous high-reflectance calibration scenes to derive reference SNR spectra and striping diagnostics for all three sensors. We then demonstrate CWCMF-only operational results on a landfill super-emitter in the Buenos Aires region, using paired Tanager-1 and EnMAP acquisitions over the same area of interest acquired on different dates. In the absence of near-simultaneous acquisitions and ground truth, results are interpreted in terms of background-limited sensitivity and uncertainty-stabilized IME/flux estimation rather than absolute accuracy.
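The column-wise CMF idea reduces to estimating background statistics independently per detector column; a compact numpy sketch of the classical matched-filter form alpha = (x - mu)^T S^-1 t / (t^T S^-1 t), with the target signature scaled by the local background mean. The regularization and the unit signature t are assumptions for the example.

```python
import numpy as np

def columnwise_matched_filter(cube, t, eps=1e-6):
    """cube: (rows, cols, bands) radiance; t: (bands,) unit methane signature."""
    rows, cols, bands = cube.shape
    alpha = np.zeros((rows, cols))
    for c in range(cols):
        x = cube[:, c, :]                     # along-track samples of one column
        mu = x.mean(axis=0)
        cov = np.cov(x, rowvar=False) + eps * np.eye(bands)
        cinv = np.linalg.inv(cov)
        target = mu * t                       # signature scaled by local background
        alpha[:, c] = (x - mu) @ cinv @ target / (target @ cinv @ target)
    return alpha                              # per-pixel enhancement-like score
```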

eess.IV 2026-05-11 2 theorems

Neural network adapts frame rate and resolution for better streamed graphics

Streaming of rendered content with adaptive frame rate and resolution

Chooses settings from content and motion to raise perceived quality when bandwidth is limited.

Streaming rendered content is an attractive way to bring high-quality graphics to billions of mobile devices that do not have sufficient rendering power. Existing solutions render content on a server at a fixed frame rate, typically 30 or 60 frames per second, and reduce resolution when bandwidth is restricted. However, this strategy leads to suboptimal rendering quality under the bandwidth constraints. In this work, we exploit the spatio-temporal limits of the human visual system to improve perceived quality while reducing rendering costs by adaptively adjusting both frame rate and resolution based on scene content and motion. Our approach is codec-agnostic and requires only minimal modifications to existing rendering infrastructure. We propose a system in which a lightweight neural network predicts the optimal combination of frame rate and resolution for a given transmission bandwidth, content, and motion velocity. This prediction significantly enhances perceptual quality while minimizing computational cost under bandwidth constraints. The network is trained on a large dataset of rendered content labeled with a perceptual video quality metric. The dataset and further information can be found at the project web page: https://www.cl.cam.ac.uk/research/rainbow/projects/adaptive_streaming/.

eess.IV 2026-05-11 Recognition

Network impairments cut surgical teleoperation success to 12%

VISTA: A Benchmark for Real-Time Video Streaming under Network Impairments in Surgical Teleoperation

Benchmark emulates hospital LAN through GEO satellite conditions and links video freezes to large drops in task success rate and speed.

Real-time video streaming is crucial in surgical teleoperation, yet reproducible evaluation under realistic network impairments remains limited. This paper presents VISTA, a benchmark designed to study how impairments along the forward video path affect received video quality, temporal continuity, and human task performance. VISTA employs Linux Traffic Control with NetEm and a Gilbert-Elliott loss model to emulate five network conditions: Hospital LAN, 5G Urban, 4G Rural, LEO Satellite, and GEO Satellite. The benchmark integrates a standardised peg transfer task with synchronized measurements of network quality of service (QoS), objective video quality (PSNR, SSIM, and VMAF), and temporal continuity through freeze rate, while maintaining a stable reverse control channel. Across 375 experimental trials, network degradation substantially reduced teleoperation performance: success rate decreased from 97% in Hospital LAN to 79% in 5G Urban, 35% in 4G Rural, 71% in LEO Satellite, and 12% in GEO Satellite, while mean task completion time for successful trials increased from 80 s in Hospital LAN to 117 s in 5G Urban, 211 s in 4G Rural, 152 s in LEO Satellite, and 255 s in GEO Satellite. These findings show that network impairments have a direct impact on task completion and success in surgical teleoperation, and provide a reproducible basis for evaluating teleoperation video under realistic network constraints. Source code available at https://github.com/Dzxx623/VISTA.

eess.IV 2026-05-11 Recognition

Thin clients stream interactive 3D Gaussian Splatting over HTTP/3

Thin-Client Interactive Gaussian Adaptive Streaming over HTTP/3

Backend renders frames in under 10 ms as an ABR algorithm adapts quality to sustain interactive latency and 0.88 SSIM.

Recent advancements in 3D Gaussian Splatting (3DGS) have enabled photorealistic rendering of complex scenes, yet widespread adoption on mobile and Extended Reality (XR) devices is hindered by substantial computational and bandwidth requirements. While existing solutions often focus on model compression for client-side rendering, they still demand significant GPU power, limiting applicability on resource-constrained hardware. We propose TIGAS (Thin-client Interactive Gaussian Adaptive Streaming), a remote rendering framework offloading rasterization to a backend. To bypass the prohibitive latencies connected to fluctuating network conditions, TIGAS streams view-dependent 2D projections to a lightweight web client over QUIC, minimizing head-of-line (HoL) blocking. A dedicated ABR algorithm adapts rendering quality to fluctuating network conditions, maintaining motion-to-photon latency within strict 6DoF interactive constraints. Furthermore, we discuss the integration of an experimental WebGPU super-resolution pipeline to analyze the trade-offs between perceptual quality enhancements and thin-client processing bottlenecks. We extensively evaluate TIGAS across multi-continental environments using 14 3DGS models and real 6DoF EyeNavGS movement traces. Powered by a backend rendering frames in under 10 milliseconds, TIGAS maintains latency within interactive thresholds while achieving an average SSIM of 0.88, serving both as a robust testbed for 3DGS streaming research and a capable delivery system. The source code is available at: https://github.com/Rekenar/GaussianAdaptiveStreamer.

eess.IV 2026-05-11 2 theorems

Masks raise attention faithfulness by over 35% in vision models

CAMAL: Improving Attention Alignment and Faithfulness with Segmentation Masks

CAMAL adds a mask-based regularizer during training to align attention with discriminative regions without slowing inference.

Many vision datasets now provide segmentation masks in addition to annotated images to support a wide range of tasks. In this work, we propose Class Activation Map Attention Learning (CAMAL), an efficient and scalable method that utilizes segmentation masks to improve attention alignment and faithfulness in vision models. Specifically, attention alignment refers to the degree to which a model's attention aligns with ground-truth discriminative regions, while attention faithfulness refers to the degree to which a model's attention influences its decision. Improving both attention alignment and faithfulness is essential for ensuring that model attention is both spatially accurate and causally meaningful. To improve attention alignment and faithfulness in vision models, CAMAL first extracts the model's attention for each image during training and then compares the attention to ground-truth discriminative regions obtained from the corresponding segmentation masks. CAMAL then acts as an auxiliary regularizer, encouraging attention that aligns with ground-truth discriminative regions, while suppressing attention elsewhere. We evaluated CAMAL across two learning paradigms -- Deep Learning (DL) and Deep Reinforcement Learning (DRL) -- and observed consistent, significant improvements in both attention alignment and faithfulness. In particular, CAMAL yields statistically significant gains in attention alignment across all settings, and improves attention faithfulness by over 35% compared to recent work. Moreover, we show that improved attention alignment and faithfulness enhance explainability, while yielding improved or comparable generalization performance without increasing inference cost. These findings demonstrate that the spatial information contained within segmentation masks can be effectively leveraged to guide model attention across learning tasks.
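A minimal version of a mask-guided attention regularizer: normalize the attention map to unit mass, then penalize whatever mass falls outside the ground-truth mask. The exact loss form and weighting in CAMAL may differ; this is a simplified stand-in under assumed tensor shapes.

```python
import torch
import torch.nn.functional as F

def attention_alignment_loss(cam, mask, lam=1.0):
    """cam: (B, H, W) non-negative attention; mask: (B, H, W) binary ground truth."""
    cam = cam / cam.flatten(1).sum(dim=1).clamp_min(1e-8)[:, None, None]
    inside = (cam * mask).flatten(1).sum(dim=1)   # attention mass on the object
    return lam * (1.0 - inside).mean()

def total_loss(logits, labels, cam, mask):
    # Auxiliary regularizer added to the task loss, as in the description above.
    return F.cross_entropy(logits, labels) + attention_alignment_loss(cam, mask)
```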

eess.IV 2026-05-11 2 theorems

Federated quantum model detects early retinopathy privately

FQPDR: Federated Quantum Neural Network for Privacy-preserving Early Detection of Diabetic Retinopathy

Lightweight models trained without sharing images identify subtle signs of diabetic eye damage across sites.

Diabetic Retinopathy (DR) is a common complication of diabetes that can lead to blindness. Detecting DR at the earliest stage is essential to prevent irreversible eye damage. Microaneurysm dots are the first signs of DR; because the dots are tiny and of low contrast, detecting mild DR is a very challenging task. Federated learning (FL) preserves data privacy, which is a major concern for medical image processing. FL is a collaborative learning method that shares only the model parameters with a server, without sharing patient data to a central server. Inspired by classical FL, we propose a federated learning-based quantum neural network (federated QNN) for this task. We implemented the models with limited samples and few learnable parameters on the E-ophtha and Retina MNIST datasets. The cross-evaluation efficiency of the proposed federated quantum neural network system for privacy-preserving early detection of diabetic retinopathy (FQPDR) on Kaggle dataset images indicates the robustness of the lightweight learning models. FQPDR's performance is encouraging when compared with existing non-FL and FL methods.

eess.IV 2026-05-11 Recognition

MCMC over DeepSDF latents yields calibrated uncertainty for heart shapes

Uncertainty Quantification for Cardiac Shape Reconstruction with Deep Signed Distance Functions via MCMC methods

Interpreting point-cloud fitting error as a likelihood produces both accurate reconstructions and uncertainty estimates that match observed reconstruction errors.

Atlas-based approaches allow high-quality, patient-specific shape reconstructions of cardiac anatomy from sparse and/or noisy data such as point clouds. However, these methods are mainly prior-driven, so the impact of uncertainty can be large, limiting their clinical reliability. We propose a probabilistic framework for uncertainty-aware cardiac shape reconstruction that combines Deep Signed Distance Functions (DeepSDFs) with Markov Chain Monte Carlo (MCMC) sampling. Cardiac geometries are modeled implicitly as zero-level sets of a neural network conditioned on learned latent codes, enabling multi-surface reconstruction of the left and right ventricles. By interpreting the reconstruction loss as a log-likelihood, we perform Bayesian inference in the latent space to obtain both maximum a posteriori (MAP) and posterior-sampled reconstructions. Experiments on a public cardiac dataset show that our approach produces accurate reconstructions and well-calibrated uncertainty estimates.
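The inference loop can be sketched as random-walk Metropolis over the latent code, with the negative point-cloud fitting error as log-likelihood and a Gaussian prior completing the log-posterior. Step size and prior scale below are illustrative assumptions, not the paper's tuned sampler.

```python
import torch

def latent_mh(recon_loss, z0, n_steps=2000, step=0.02, prior_std=1.0):
    """recon_loss: callable latent -> scalar tensor (point-cloud fitting error)."""
    def log_post(z):
        # Negative fitting error as log-likelihood plus a Gaussian latent prior.
        return -recon_loss(z) - 0.5 * (z / prior_std).pow(2).sum()

    z, lp = z0.clone(), log_post(z0)
    samples = []
    for _ in range(n_steps):
        z_prop = z + step * torch.randn_like(z)
        lp_prop = log_post(z_prop)
        if torch.rand(()).log() < lp_prop - lp:   # Metropolis accept/reject
            z, lp = z_prop, lp_prop
        samples.append(z.clone())
    return torch.stack(samples)  # decode each latent to a surface for uncertainty maps
```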

eess.IV 2026-05-11 Recognition

Distance transform on contours boosts self-supervised depth accuracy

Improved monocular depth prediction using distance transform over pre-semantic contours with self-supervised neural networks

Jointly learned pre-semantic boundaries supply extra variance in uniform areas, yielding better depth maps than other self-supervised models.

Monocular depth estimation (MDE) with self-supervised training approaches struggles in low-texture areas, where photometric losses may lead to ambiguous depth predictions. To address this, we propose a novel technique that enhances spatial information by applying a distance transform over pre-semantic contours, augmenting discriminative power in low-texture regions. Our approach jointly estimates pre-semantic contours, depth, and ego-motion. The pre-semantic contours are leveraged to produce new input images, with variance augmented by the distance transform in uniform areas. This approach results in more effective loss functions, enhancing the training process for depth and ego-motion. We demonstrate theoretically that the distance transform is the optimal variance-augmenting technique in this context. Through extensive experiments on KITTI, Cityscapes, Waymo, NYUv2 and ScanNet, our model demonstrates robust performance, surpassing competing self-supervised methods in MDE.
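The core transformation is a Euclidean distance transform over a binary contour map, which fills uniform regions with smoothly varying values. A short scipy sketch, where the contour threshold and normalization are assumptions for the example:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def contour_distance_image(contour_prob, thresh=0.5):
    """contour_prob: (H, W) in [0, 1]; returns distance-to-nearest-contour map."""
    contours = contour_prob > thresh
    # EDT measures distance to the nearest zero, so invert the contour mask.
    dist = distance_transform_edt(~contours)
    return dist / max(dist.max(), 1e-8)   # normalized auxiliary input channel
```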

eess.IV 2026-05-11 Recognition

mmWave scans predict visceral fat to 1 L accuracy

Non-intrusive Body Composition Assessment from Full-body mmWave Scans

Model trained on synthetic data achieves 1.0 L and 3.2% error on real clothed standing scans for VAT and BFP.

Body composition assessment (BCA) provides detailed information about the distribution of different tissue types in the body, enabling more precise characterization of individuals than BMI or weight alone. Consistent and frequent BCA would be valuable for personalized medicine, but the gold standard methods for BCA, such as CT and MRI, are only practical for opportunistic monitoring of patients with clinical indications for imaging and are not suitable for routine use in the general population. Here, we consider an imaging modality which is not currently used in medical applications: millimeter wave (mmWave) radar. Commonly used in security settings, mmWave scans enable fast, non-intrusive, and privacy-preserving reconstruction of full body shape without the need to remove clothing. To demonstrate the feasibility of fast and convenient BCA from mmWave scans, we present a method for BCA value regression using a multi-task learning strategy that leverages synthetic mmWave-like point clouds derived from clinical imaging and parametric human models. We evaluate the model on a pilot cohort of real mmWave scans with bioimpedance-derived body fat measurements, supporting the feasibility of estimating VAT and body fat percentage (BFP) from mmWave data acquired through clothing in a standing posture. We find that the model can predict VAT and BFP with a mean absolute error of 1.0 L and 3.2%, respectively, demonstrating the potential of mmWave scanning for routine BCA in a wide range of settings.

eess.IV 2026-05-11 2 theorems

Paired dataset lets AI upgrade low-end ultrasound scans

A Paired Point-of-Care Ultrasound Dataset for Image Quality Enhancement and Benchmarking via a cGAN Baseline

First accurate POCUS-to-high-end pairs train a cGAN that lifts SSIM from 0.29 to 0.54 and improves no-reference quality scores.

Purpose: We aim to enhance the image quality of point-of-care ultrasound (POCUS) devices using deep learning and a novel paired dataset of POCUS and high-end ultrasound images. Approach: We collected the first accurately paired dataset of low-end POCUS and high-end ultrasound images using a custom-built automated gantry system. A conditional generative adversarial network (cGAN) based on the pix2pix architecture was used, with a U-Net generator that incorporates both L1 and structural similarity index (SSIM) losses to improve perceptual quality. Pretraining on a simulation dataset further boosts performance. Evaluation was performed on 1064 paired ex vivo tissue and phantom ultrasound image sets. Results: Our approach improves SSIM from 0.29 to 0.54 and PSNR from 19.16 dB to 22.41 dB. No-reference metrics also indicate substantial enhancement, with the Natural Image Quality Evaluator (NIQE) and Perception-based Image Quality Evaluator (PIQE) scores dropping from 7.95 to 4.44 and from 31.12 to 19.99, respectively. Conclusions: This work presents the first publicly available accurately paired dataset of low-end POCUS and high-end ultrasound images. Additionally, our results demonstrate the potential of the proposed framework to overcome hardware limitations of handheld POCUS, enhancing its diagnostic value in low-resource and point-of-care settings. The POCUS-IQ Dataset is publicly available at https://github.com/NKI-MedTech-AI/POCUS-IQ.
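The generator objective described above (adversarial plus L1 plus SSIM terms) can be sketched as follows. The loss weights are assumptions, and SSIM is taken from torchmetrics purely for brevity rather than from the authors' implementation.

```python
import torch
import torch.nn.functional as F
from torchmetrics.functional import structural_similarity_index_measure as ssim

def generator_loss(disc_fake_logits, fake, target, w_l1=100.0, w_ssim=10.0):
    """fake/target: (B, 1, H, W) images in [0, 1]."""
    adv = F.binary_cross_entropy_with_logits(
        disc_fake_logits, torch.ones_like(disc_fake_logits))
    l1 = F.l1_loss(fake, target)
    ssim_term = 1.0 - ssim(fake, target, data_range=1.0)
    return adv + w_l1 * l1 + w_ssim * ssim_term
```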

eess.IV 2026-05-11 2 theorems

Neural fields cut memory for dynamic 3D MRI at 16x acceleration

Model-based Dynamic 3D MRI Reconstructions using Neural Fields and Tensor Product Expansions

Tensor-product univariate networks represent magnetization continuously to preserve motion in undersampled cardiac scans.

Conventional MRI reconstruction methods treat images and coil sensitivities as discrete objects, leading to high memory demands and limited structural awareness that hamper effective regularization. These limitations hinder accurate reconstruction in highly undersampled scenarios, such as dynamic 3D cardiac magnetic resonance (CMR). We introduce a discretization-free, memory-efficient, model-based framework for dynamic 2D and 3D MRI reconstruction from highly undersampled data. We represent magnetization and coil sensitivities as continuous objects -- differentiable functions -- using tensor products of univariate neural fields. This tensor product structure enables scalable optimization in high-dimensional spatiotemporal settings. Our method outperforms state-of-the-art model-based reconstructions in dynamic 2D and 3D MR settings, preserving structure and motion even under aggressive undersampling (e.g., acceleration factor 16).
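The tensor-product construction can be illustrated with univariate MLPs whose per-axis feature vectors are multiplied elementwise and contracted over a shared rank, giving a continuous field over (x, y, t). A minimal sketch; rank and widths are assumptions, and the paper's factorization and coil handling are richer.

```python
import torch

class UnivariateField(torch.nn.Module):
    def __init__(self, rank=32, width=64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(1, width), torch.nn.SiLU(),
            torch.nn.Linear(width, rank))

    def forward(self, coords):          # (N, 1) -> (N, rank)
        return self.net(coords)

class TensorProductField(torch.nn.Module):
    """f(x, y, t) = sum_r fx_r(x) * fy_r(y) * ft_r(t): a rank-R functional tensor."""
    def __init__(self, rank=32):
        super().__init__()
        self.fx, self.fy, self.ft = (UnivariateField(rank) for _ in range(3))

    def forward(self, x, y, t):         # each (N, 1), matched pointwise
        return (self.fx(x) * self.fy(y) * self.ft(t)).sum(dim=-1)  # (N,)
```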

eess.IV 2026-05-11 2 theorems

One bitstream delivers coarse classes early and fine details later

Coarse-to-Fine: Progressive Image Compression for Semantically Hierarchical Classification

Latent channels ordered by CLIP class hierarchies improve broad recognition at low rates without later penalties.

Recent advances in learned image compression (LIC) have enabled practical deployments, spurring active research into image compression for machines and progressive coding schemes. However, their integration remains under-explored: prior works on progressive machine codec predominantly target sample-level difficulty adaptation (i.e., easy-to-hard), without considering semantic-level scalability. In this work, we introduce a semantic hierarchy-aware progressive codec that enables semantic scalability (i.e., coarse-to-fine) from a single bitstream. We first systematically categorize ImageNet-1K classes into CLIP embedding-based semantic hierarchies. Based on a channel-wise autoregressive framework, we decompose latent representations into hierarchically ordered channel blocks, each explicitly optimized for a corresponding semantic hierarchy. Extensive experiments demonstrate that our approach substantially improves coarse-level recognition at low bitrates while maintaining fine-grained accuracy at higher bitrates. By reframing progressive transmission through the lens of semantic scalability, our work provides an efficient and interpretable solution for task-adaptive image coding, outperforming existing progressive codecs under hierarchical evaluation.

eess.IV 2026-05-08

Neural codec with FFT encoder outperforms tokenizers on sensors

LiVeAction: a Lightweight, Versatile, and Asymmetric Neural Codec Design for Real-time Operation

Imposing an FFT-like structure and a variance-based rate penalty enables versatile use on low-power devices with better compression.

Modern sensors generate rich, high-fidelity data, yet applications operating on wearable or remote sensing devices remain constrained by bandwidth and power budgets. Standardized codecs such as JPEG and MPEG achieve efficient trade-offs between bitrate and perceptual quality but are designed for human perception, limiting their applicability to machine-perception tasks and non-traditional modalities such as spatial audio arrays, hyperspectral images, and 3D medical images. General-purpose compression schemes based on scalar quantization or resolution reduction are broadly applicable but fail to exploit inherent signal redundancies, resulting in suboptimal rate-distortion performance. Recent generative neural codecs, or tokenizers, model complex signal dependencies but are often over-parameterized, data-hungry, and modality-specific, making them impractical for resource-constrained environments. We introduce a Lightweight, Versatile, and Asymmetric neural codec architecture (LiVeAction) that addresses these limitations through two key ideas. (1) To reduce encoder complexity to meet the resource constraints of the execution environment, we impose an FFT-like structure and reduce the overall size and depth of the neural-network-based analysis transform. (2) To allow arbitrary signal modalities and simplify training, we replace adversarial and perceptual losses with a variance-based rate penalty. Our design produces codecs that deliver superior rate-distortion performance compared to state-of-the-art generative tokenizers, while remaining practical for deployment on low-power sensors. We release our code, experiments, and Python library at https://github.com/UT-SysML/liveaction.
0
0
eess.IV 2026-05-08

Optimal spline sketches recover FLIM lifetimes at 256x compression

Histogramless Time-Domain Sketched Fluorescence Lifetime Imaging

Fisher-placed knots on photon timestamps match full-histogram accuracy without building per-pixel counts

Figure from the paper full image
abstract click to expand
We present a statistics-aware compression strategy that processes photon timestamps directly from time-correlated single-photon counting (TCSPC) modules for time-domain fluorescence lifetime imaging (FLIM). Rather than storing or transmitting the full histogram per pixel, timestamps are projected onto sparse, non-uniform one-dimensional spline sketches, with knot positions optimally allocated based on Fisher information. This knot allocation concentrates sketch channels where the decay signal exhibits the greatest statistical discriminability, rather than using a uniform allocation. The proposed approach is extensively validated on synthetic mono- and bi-exponential decay data and on experimental fluorescent dye data, demonstrating comparable accuracy to full-histogram non-linear least-squares fitting (NLSF) and Poisson maximum-likelihood estimation (MLE) at compression ratios of up to 256x. We further validate the feasibility of integrating the timestamp-to-sketch projection directly into firmware via fixed-point (FXP) lookup-table (LUT) simulation, targeting high-spatial-resolution single-photon avalanche diode (SPAD) arrays subject to significant data-throughput constraints.
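A toy numpy sketch of the two steps as we understand them, assuming a mono-exponential decay with a known reference lifetime (all names are ours, not the authors' implementation): knots are placed at quantiles of the Fisher-information density, and each photon timestamp splits its unit weight between the two bracketing knots of a linear spline, so no per-pixel histogram is ever built.

import numpy as np

def fisher_knots(tau_ref, t_max, n_knots, grid=4096):
    # Density of per-photon Fisher information about the lifetime tau
    # for a mono-exponential decay; knots go at its quantiles.
    t = np.linspace(0.0, t_max, grid)
    pdf = np.exp(-t / tau_ref) / tau_ref          # timestamp density
    score = (t - tau_ref) / tau_ref**2            # d/dtau log pdf
    info = pdf * score**2                         # Fisher information density
    cdf = np.cumsum(info)
    cdf /= cdf[-1]
    return np.interp(np.linspace(0.0, 1.0, n_knots), cdf, t)

def sketch_timestamps(stamps, knots):
    # Linear-spline sketch: each photon splits unit weight between
    # the two bracketing knots (a weighted, non-uniform binning).
    stamps = np.clip(stamps, knots[0], knots[-1])
    idx = np.clip(np.searchsorted(knots, stamps) - 1, 0, len(knots) - 2)
    w = (stamps - knots[idx]) / (knots[idx + 1] - knots[idx])
    sketch = np.zeros(len(knots))
    np.add.at(sketch, idx, 1.0 - w)
    np.add.at(sketch, idx + 1, w)
    return sketch

With knots concentrated where the decay is most informative about the lifetime, a few dozen sketch channels can stand in for a multi-thousand-bin histogram, which is where the 256x compression ratio comes from.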
0
0
eess.IV 2026-05-07

Tumor-aware cropping lifts rectal MRI detection to 90 percent

Tumor-aware augmentation with task-guided attention analysis improves rectal cancer segmentation from magnetic resonance images

CT-pretrained transformers gain robustness on MRI tasks once padding waste and tumor variation gaps are removed.

Figure from the paper full image
abstract click to expand
Pretraining on large-scale datasets has been shown to improve transformer generalizability, even for out-of-domain (OOD) modalities and tasks. However, two common assumptions often fail under OOD transfer: that downstream datasets can be adapted to the fixed input geometry of pretrained models and that pretrained representations transfer effectively across imaging modalities. We show that these assumptions break down through two interacting failure modes in CT-to-MRI transfer: inefficient token usage caused by zero-padding to match pretrained input dimensions and ineffective feature adaptation. These failures led to accuracy degradation despite extensive fine-tuning. We investigated these failure modes using two CT-pretrained hierarchical shifted-window transformer backbones, SMIT and Swin UNETR, pretrained with different objectives and datasets. Mechanistic analysis introduced an attention dilution index (ADI), an entropy-based metric quantifying attention diverted toward uninformative padding tokens, and centered kernel alignment (CKA) to measure feature reuse in MRI tasks. ADI increased with zero-padding, while high feature reuse did not necessarily correspond to improved accuracy. To mitigate these issues, we introduced two interventions: a tumor-aware augmentation strategy to improve tumor appearance heterogeneity coverage and an anisotropic cropping strategy to restore token efficiency. Fine-tuning on identical rectal MRI datasets improved detection rates to 224/247 (90.7%) for SMIT and 219/247 (88.7%) for Swin UNETR, demonstrating improved robustness under CT-to-MRI transfer. This study is among the first to examine when pretrained transformers fail to transfer effectively across imaging modalities and how simple mitigation strategies, motivated by mechanistic analysis of datasets, can reduce transfer limitations while improving robustness and MRI detection.
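The paper's exact ADI formula is not reproduced here; the following numpy sketch shows one plausible entropy-based formulation in its spirit, combining the attention mass diverted to padding tokens with normalized row entropy (the names and the weighting are our assumptions).

import numpy as np

def attention_dilution_index(attn, pad_mask, eps=1e-12):
    # attn: (heads, queries, keys) softmax weights, each row sums to 1.
    # pad_mask: (keys,) bool, True where the token is zero-padding.
    mass_on_pad = attn[..., pad_mask].sum(axis=-1)       # padding share per row
    entropy = -(attn * np.log(attn + eps)).sum(axis=-1)  # row entropy, in nats
    entropy /= np.log(attn.shape[-1])                    # normalize to [0, 1]
    return float((mass_on_pad * entropy).mean())         # high = diffuse and pad-heavy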
0
0
eess.IV 2026-05-07

LLMs score high on MRI MCQs but low on GE scanner operations recall

MRI-Eval: A Tiered Benchmark for Evaluating LLM Performance on MRI Physics and GE Scanner Operations Knowledge

Top models reach 93-97 percent with choices but fall to 58-61 percent on open responses, especially weak in operational knowledge.

Figure from the paper full image
abstract click to expand
Background: Existing MRI LLM benchmarks rely mainly on review-book multiple-choice questions, where top proprietary models already score highly, limiting discrimination. No systematic benchmark has evaluated vendor-specific scanner operational knowledge central to research MRI practice. Purpose: We developed MRI-Eval, a tiered benchmark for relative model comparison on MRI physics and GE scanner operations knowledge using primary multiple-choice questions (MCQ), with stem-only and primed diagnostic conditions as complementary analyses. Methods: MRI-Eval includes 1365 scored items across nine categories and three difficulty tiers from textbooks, GE scanner manuals, programming course materials, and expert-generated questions. Five model families were evaluated (GPT-5.4, Claude Opus 4.6, Claude Sonnet 4.6, Gemini 2.5 Pro, Llama 3.3 70B). MCQ was primary; stem-only removed options and used an independent LLM judge; primed stem-only tested responses to incorrect user claims. Results: Overall MCQ accuracy was 93.2% to 97.1%. GE scanner operations was the lowest category for every model (88.2% to 94.6%). In stem-only, frontier-model accuracy fell to 58.4% to 61.1%, and Llama 3.3 70B fell to 37.1%; GE scanner operations stem-only accuracy was 13.8% to 29.8%. Conclusion: High MCQ performance can mask weak free-text recall, especially for vendor-specific operational knowledge. MRI-Eval is most informative as a relative comparison benchmark rather than an absolute competency measure and supports caution in using raw LLM outputs for GE-specific protocol guidance.
0
0
eess.IV 2026-05-07

Tool gives CT brain scans MRI-level tissue maps and volumes

CTseg: A Tool for Brain CT Segmentation, Spatial Normalisation, and Volumetrics

Validation against paired MRI shows higher segmentation accuracy and better total brain volume agreement than adapting existing MRI methods

Figure from the paper full image
abstract click to expand
This paper presents and validates CTseg, a freely available software for brain CT segmentation, spatial normalisation, and volumetrics. CTseg builds on the Multi-Brain generative modelling framework, providing a CT-specific pipeline that produces tissue maps, deformation fields, and brain volume estimates in the same format as SPM's unified segmentation, thereby extending SPM's established analysis chain from MRI to CT. CTseg is designed for routine hospital CT scans without requiring preprocessing or resampling in deployment. Although CTseg has been adopted in clinical research spanning, among other things, stroke, dementia, and brain morphometry, a systematic validation against an independent reference standard has been lacking. Using paired MR/CT head scans, we evaluate CTseg across four dimensions: segmentation accuracy against an MRI-derived silver standard; spatial normalisation consistency through group-average sharpness and voxelwise coefficient of variation; brain volume agreement via intraclass correlation and Bland-Altman analysis; and downstream sex classification performance from normalised tissue maps. As a baseline, we apply SPM's MRI-based unified segmentation directly to the CT images. CTseg significantly outperformed this baseline for segmentation and normalisation, showed stronger TBV agreement, and achieved comparable TIV agreement. CTseg is freely available at https://github.com/WCHN/CTseg, and all experiment code is included in the repository for full reproducibility.
0
0
eess.IV 2026-05-07

Ultrasound AI models for breast density generalize to external data

External Validation of Deep Learning Models for BI-RADS Breast Density Prediction from Ultrasound Images

Models reach AUROCs of 0.87-0.90 for extreme density and maintain overall performance on a racially different cohort, with risk estimates not significantly different from mammography-based ones.

Figure from the paper full image
abstract click to expand
We externally validated three deep learning models (DenseNet121, ViT-B/32, and ResNet50) for predicting mammographic breast density from breast ultrasound exams on an independent cohort. The external validation set comprised 2,000 ultrasound exams, including 500 cancer cases defined by an initial negative exam (BI-RADS 1 or 2) followed by a cancer diagnosis within 6 months to 10 years, and 1,500 negative controls matched by manufacturer and study year. Performance was measured using patient-level AUROC across four density categories: A (fatty), B (scattered), C (heterogeneous), and D (extremely dense). As a downstream assessment, we also evaluated 10-year risk prediction by incorporating age and AI-derived density into the Tyrer-Cuzick model and comparing performance against a reference model using age and mammography-reported density. All three models performed best in extremely dense breasts (AUROC 0.868-0.899), with strong performance in fatty (0.814-0.838) and scattered density (0.764-0.799), and lower performance in heterogeneously dense breasts (0.699-0.729). DenseNet121 achieved the highest overall performance (micro-averaged AUROC 0.885), and performance across categories was comparable between internal and external testing. For risk modeling, age combined with AI-derived density yielded a lower AUROC than age combined with mammography-reported density (0.541 vs. 0.570; p = 0.23), with no statistically significant difference. These findings indicate that deep learning models generalize well to external data with different racial composition for breast density assessment. While performance is strongest in extremely dense breasts, heterogeneously dense remains more challenging, highlighting the need for targeted optimization.
0
0
eess.IV 2026-05-07

Quantum-fuzzy fusion improves hyperspectral anomaly detection

Hyperspectral Anomaly Detection Using Einstein Fuzzy Computing and Quantum Neural Network

Einstein operations and a quantum defuzzifier combine classical and quantum scores to raise separation without prior target spectra.

Figure from the paper full image
abstract click to expand
In the remote sensing (RS) field, hyperspectral imagery provides rich spectral information and facilitates numerous critical applications, such as material identification. Among these applications, hyperspectral anomaly detection (HAD) aims to detect substances whose spectral characteristics deviate from background spectra, which are termed anomalies. However, many widely used HAD algorithms in the RS community identify anomalies by relying on a ``background reconstruction'' strategy. Furthermore, the lack of prior target hyperspectrum and real-world limitations collectively reduce the spectral discrepancy between anomaly and background, limiting the performance of mainstream detectors. By exploring the widely applicable fuzzy theory in the RS field, this study develops an unsupervised hybrid quantum-fuzzy multi-criteria decision framework (HyFuHAD) to detect anomalies from multiple perspectives. In our HyFuHAD, each pixel is first fuzzified using multiple HAD-based membership functions (MFs), including morphological, geometrical, and statistical MFs, to obtain various types of fuzzy degrees. Then, a multi-fuzzy-rule system, empowered by Einstein fuzzy computing, infers the classical fuzzy detection from these fuzzy degrees with sub-second-level computing. The Einstein sum and product provide significantly smoother transitions compared to typical min-max-based fuzzy ``OR'' and ``AND'' during the fuzzy matching and inference steps, thereby enabling effective detections. Moreover, a lightweight quantum defuzzifier obtains the quantum fuzzy detection from fuzzy features derived from the proposed fuzzy feature aggregation network. Experiments demonstrate that our HyFuHAD algorithm achieves state-of-the-art performance by fusing the information from the quantum and classical detectors. The demo code will be publicly available at https://github.com/IHCLab/HyFuHAD.
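The Einstein sum and product referenced above are the standard Einstein t-conorm and t-norm from fuzzy set theory; a quick Python illustration of the smooth transitions they provide relative to the usual max/min operators:

def einstein_sum(a, b):       # smooth fuzzy OR (Einstein t-conorm)
    return (a + b) / (1.0 + a * b)

def einstein_product(a, b):   # smooth fuzzy AND (Einstein t-norm)
    return (a * b) / (2.0 - (a + b - a * b))

a, b = 0.49, 0.51
print(max(a, b), einstein_sum(a, b))       # 0.51 vs ~0.80: OR rewards joint evidence
print(min(a, b), einstein_product(a, b))   # 0.49 vs ~0.20: AND penalizes partial truth

Unlike min/max, both operators vary smoothly in each argument, which is what makes them differentiable and better behaved during the fuzzy matching and inference steps.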
0
0
eess.IV 2026-05-06

Multipath reflections modeled as virtual sources sharpen microwave images

Multipath Exploitation in Highly Reflective Environments for Enhanced Microwave Imaging via Inverse Source Reconstruction

Exploiting multipath reflections as virtual image sources via inverse source reconstruction and coherent combination enables effective aperture expansion and superior resolution.

abstract click to expand
Multipath effects significantly influence the quality of microwave imaging in highly reflective environments, while the physical measurement aperture size constrains resolution. It is shown that by exploiting multipath reflections, improved resolution can be achieved while maintaining acceptable artifact levels. Based on image theory, strong scattered fields from an ideal reflection plane can be represented by virtual image sources. Using a single-frequency inverse source solver, the spatially distributed original and image sources are reconstructed and separated, which allows separate application of the imaging algorithm for both of them. The coherent combination of both sets of sources together with appropriate phase correction results in an effective aperture expansion that yields superior resolution. Furthermore, this separation strategy significantly mitigates interference artifacts. Simulation results, supported by theoretical analysis and comparison with a ray-tracing enhanced backprojection algorithm are presented to verify the effectiveness of the proposed approach.
0
0
eess.IV 2026-05-06

Reference normalization enables 3D imaging from WiFi signals indoors

Phase-Corrected Near-Field Microwave Imaging via Inverse Source Reconstruction with Modulated Signals

Dividing probe data by a fixed reference signal restores coherence so a single-frequency inverse source solver can form images using narrowband Wi-Fi signals.

Figure from the paper full image
abstract click to expand
An inverse source reconstruction (ISR) based 3-D near-field (NF) passive radar microwave imaging method utilizing modulated signals is presented. The modulated signals from a non-cooperative transmitter are scattered by the targets of interest and captured by a fixed reference antenna together with an NF scanning probe at different positions. By normalizing with the reference signals, spatial coherence of the NF observations is obtained, and a single-frequency inverse source solver is subsequently utilized for ISR and image generation. A corresponding phase correction method is proposed for the coherent superposition of multi-frequency images and verified through simulations. In addition, it is shown that for realistic narrowband signals, an incoherent imaging approach is sufficient. The presented technical scheme is validated using a planar scanning system in a typical office room, where software-defined radios are employed for the transmitting and receiving of narrowband orthogonal frequency-division multiplexing signals at Wi-Fi operating frequencies. With the aid of background subtraction and reference signals, images of a mannequin placed in the office room are successfully obtained.
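At its core, the normalization step is a complex division of each probe measurement by the simultaneously captured reference signal, which cancels the unknown, time-varying modulation; a minimal numpy sketch (our naming, not the authors' code):

import numpy as np

def normalize_by_reference(probe, ref, eps=1e-12):
    # probe: complex samples at each scan position; ref: the simultaneously
    # captured reference-antenna samples. Equivalent to probe / ref, but
    # regularized against near-zero reference amplitudes.
    return probe * np.conj(ref) / (np.abs(ref) ** 2 + eps)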
0
0
eess.IV 2026-05-06

Dante cuts epochs to reach 85% of peak performance by 63% in new MRI domains

Dante: An Open Source Model Pre-Training and Fine-Tuning Tool for the Dafne Federated Framework for Medical Image Segmentation

Open-source module shows gradual unfreezing and LoRA outperform baselines for abdominal and brain lesion segmentation even with limited data

Figure from the paper full image
abstract click to expand
Adapting pre-trained deep learning segmentation models to new clinical domains is a persistent challenge in medical image analysis, particularly when annotated data at the target site are scarce. Parameter-efficient fine-tuning strategies offer a principled solution by selectively updating a controlled subset of model parameters, preserving previously acquired representations while reducing the risk of overfitting on small datasets. This paper introduces DAfNe TrainEr (Dante), an open-source module integrating with the Dafne federated segmentation ecosystem as a dedicated training and fine-tuning backend. Dante supports training from scratch with automatic architecture configuration, configurable layer freezing schedules, and Low-Rank Adaptation (LoRA) extended to N-dimensional convolutional layers through channel-wise factorization. To validate the module, Gradual Unfreezing (GU) and LoRA are assessed across realistic cross-domain MRI transfer scenarios covering abdominal organ segmentation and brain white matter lesion segmentation, under full-data and few-shot conditions. GU reduced the epochs required to reach 85% of peak performance by up to 63.6% compared to training from scratch, while LoRA achieved Dice Similarity Coefficients up to 0.957 in data-rich scenarios. Both strategies outperformed the baseline across all tested domains, with gains amplified by richer pre-training datasets. These results validate Dante as a domain-agnostic fine-tuning module for medical image segmentation in real clinical deployment conditions. Dante code is available at https://github.com/dafne-imaging/dafne-torch-trainer while Dafne ecosystem project is available at https://github.com/dafne-imaging.
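A hedged PyTorch sketch of what low-rank adaptation of a frozen 3-D convolution could look like; the channel-wise factorization below is our guess at the pattern the abstract describes, and Dante's actual implementation lives in the linked repository.

import torch
import torch.nn as nn

class LoRAConv3d(nn.Module):
    # Frozen base conv plus a trainable low-rank bypass: a 1x1x1
    # down-projection to `rank` channels followed by an up-projection
    # with the base kernel shape. Initialized to a zero update so
    # fine-tuning starts from the pretrained behavior.
    def __init__(self, base: nn.Conv3d, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # keep pretrained weights fixed
        self.down = nn.Conv3d(base.in_channels, rank, 1, bias=False)
        self.up = nn.Conv3d(rank, base.out_channels, base.kernel_size,
                            stride=base.stride, padding=base.padding, bias=False)
        nn.init.zeros_(self.up.weight)       # adapter starts as identity
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))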
0
0
eess.IV 2026-05-06 Recognition

Diffusion model generates coherent MRI and tabular patient data

Multimodal synthesis of MRI and tabular data with diffusion in a joint latent space via cross-attention

Joint latent space with cross-attention produces images whose anatomy matches synthesized clinical attributes on a large cohort dataset.

Figure from the paper full image
abstract click to expand
We propose a multimodal latent diffusion model that jointly synthesizes volumetric magnetic resonance imaging (MRI) and tabular clinical data within a shared latent space via cross-attention. This approach enables coherent joint representation learning of MRI and tabular modalities for generative modeling. Our model utilizes a variational autoencoder to fuse the two modalities before diffusion-based synthesis, allowing modality-appropriate reconstruction with separate decoders for MRI and tabular data. We evaluated the framework on data from the German National Cohort (NAKO Gesundheitsstudie), comprising over 10,000 participants with MRI scans and clinical tabular features such as age, sex, body measurements, and ethnicity. The generated MRI volumes exhibited anatomical plausibility and body composition consistent with the synthesized tabular attributes. Quantitative evaluation using Fr\'echet distance and precision-recall metrics confirmed high-fidelity image generation. In the tabular modality, our model outperformed CTGAN across standard evaluation metrics and achieved results comparable to TVAE, demonstrating competitive performance relative to established unimodal baselines. This work is, to our knowledge, the first to demonstrate the feasibility of jointly modeling MRI and mixed-type tabular data in a single latent diffusion framework, offering a proof-of-concept for generating coherent synthetic multimodal patient data and aligning with the broader goal of developing digital twins in healthcare.
0
0
eess.IV 2026-05-05

EMOVIS maps four basic emotions to real-time ISP color and tone adjustments

EMOVIS: Emotion-Optimized Image Processing

EMOVIS adds a calibrated mapping from Happy/Calm/Angry/Sad states to ISP controls and demonstrates 87 percent viewer preference in blind A/B tests when the target emotion matches the scene.

abstract click to expand
In cinematography, visual attributes such as color grading, contrast, and brightness are manipulated to reinforce the emotional narrative of a scene. However, conventional Image Signal Processors (ISPs) prioritize scene fidelity, effectively neglecting this expressive dimension. To bring this cinematic capability to real-time camera pipelines during video capture, we introduce EMOVIS (EMotion-Optimized VISual processing). We establish a systematic mapping between a compact set of high-level emotional states (Happy, Calm, Angry, Sad) and low-level ISP controls - including color saturation, local tone mapping, and sharpness - supported by a calibration user study with statistically significant effects across parameters. We propose a control framework that integrates these emotion-driven adjustments into standard ISP hardware without altering the underlying processing stages. Validation via blind A/B testing shows that viewers prefer the emotion-optimized rendering in 87% of trials when the target emotion matches the scene context, indicating that emotion-aligned ISP control improves perceived suitability for expressive visual content.
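A hypothetical Python rendering of the emotion-to-ISP lookup; the control names and numeric values below are illustrative placeholders, not the calibrated parameters from the paper's user study.

# Hypothetical parameter table in the spirit of EMOVIS' calibrated mapping.
EMOTION_TO_ISP = {
    "happy": {"saturation": 1.25, "tone_strength": 0.6, "sharpness": 1.1},
    "calm":  {"saturation": 0.95, "tone_strength": 0.3, "sharpness": 0.9},
    "angry": {"saturation": 1.10, "tone_strength": 0.9, "sharpness": 1.3},
    "sad":   {"saturation": 0.75, "tone_strength": 0.4, "sharpness": 0.8},
}

def isp_controls(emotion: str) -> dict:
    # Fall back to neutral processing for unknown emotion labels.
    return EMOTION_TO_ISP.get(
        emotion, {"saturation": 1.0, "tone_strength": 0.5, "sharpness": 1.0})

The point of the design is that only these control registers change; the ISP's processing stages themselves stay untouched, which is why the mapping can run on standard hardware during capture.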
0
0
eess.IV 2026-05-05

Cool-chic 5.0 beats VVC by 11 percent with 10x fewer encoding steps

Cool-chic 5.0: Faster Encoding and Inter-Feature Entropy Modeling for Overfitted Image Compression

The updated overfitted codec matches top autoencoders while cutting decode complexity by a factor of 250.

Figure from the paper full image
abstract click to expand
Overfitted codecs compress an image by learning a decoder tailored to the content during the encoding. As such, they trade increased encoding complexity for strong compression performance and low decoding complexity. This work introduces Cool-chic 5.0, the latest version in the Cool-chic series of overfitted codecs, featuring an updated decoder architecture and an improved optimization process. Cool-chic 5.0 outperforms all overfitted codecs with 10 times fewer encoding iterations. It offers -11% rate reduction compared to the state-of-the-art conventional codec H.266/VVC. It is also competitive with modern autoencoders such as MLIC++ while featuring a decoding complexity 250 times lower. This work is made open-source at https://github.com/Orange-OpenSource/Cool-Chic.
0
0
eess.IV 2026-05-05

Margin distance prior lifts MSI specificity to 1.000 on external slides

Biological Spatial Priors Regularize Foundation Model Representations for Cross-Site MSI Generalization in Colorectal Cancer

Peripheral encoding of tumor invasive margin regularizes foundation models to ignore site-specific imaging patterns in colorectal cancer.

Figure from the paper full image
abstract click to expand
Predicting microsatellite instability (MSI) status from routine hematoxylin and eosin (H&E) whole slide images (WSIs) offers a practical alternative to molecular testing, but models trained at one institution tend to generalize poorly to slides acquired at a different site. Foundation model representations, despite their generality, still encode site-specific texture alongside the conserved biological morphology underlying MSI. We investigate whether tile-level spatial priors derived from known MSI histology can guide these representations toward more site-invariant features. We introduce a biologically motivated spatial prior based on peripheral distance encoding, reflecting the Crohn's-like peripheral lymphocytic reaction at the tumor invasive margin, and evaluate a secondary local immune neighborhood encoding reflecting the lymphocyte-to-tumor ratio in each tile's immediate spatial neighborhood. Both priors are injected into a TransMIL aggregator before self-attention, allowing the transformer to integrate spatial biological context with UNI2-h or Virchow2 features across all attention layers. We evaluate six foundation model and MIL aggregator combinations as a reference, then assess the effect of each spatial prior. Training on TCGA-COAD (137 slides) and evaluating externally on TCGA-READ (50 slides) without retraining, peripheral distance encoding achieves MSI AUC 0.959 +/- 0.012 on COAD and MSS specificity 1.000 on READ, compared to 0.957 and 0.939 for the strongest reference configuration. Local immune neighborhood encoding achieves comparable internal AUC but lower cross-site specificity, suggesting margin proximity encodes a more site-invariant biological signal than local immune density. Results suggest biologically grounded spatial priors act as regularizers that reduce reliance on site-specific imaging patterns.
0
0
eess.IV 2026-05-05

LiDAR-camera system achieves 97% overlap in mine enclosure tests

Development and Validation of an Integrated LiDAR-Camera System for Real-Time Monitoring of Underground Longwall Operations

Flameproof design with calibration maintains accuracy and transmits colorized point clouds at 10 Hz under 25 Mb/s while managing heat.

Figure from the paper full image
abstract click to expand
Real-time spatial monitoring in underground longwall operations is challenging due to methane-related safety risks, poor visibility, elevated thermal loads, spatial confinement, and bandwidth-limited communications. Currently available camera-based monitoring provides visual context but lacks direct depth information, while standalone underground LiDAR scanners are limited to monochromatic or periodic 3D mapping. This paper presents the design, integration, and experimental validation of a LiDAR-camera monitoring system built around a certified flameproof enclosure that prevents flame propagation into the surrounding atmosphere. The system combines a solid-state LiDAR, an industrial RGB camera, and an onboard processor within a compact hardware assembly, supporting LiDAR-camera fusion, low-light image enhancement, and real-time processing. Laboratory experiments evaluated LiDAR and camera performance through the protective polycarbonate dome and quantified optical and geometric distortions introduced by the enclosure. Thermal testing showed that iterative component placement, heat sinking, and passive conduction reduced peak surface temperature from 106 {\deg}C to 70 {\deg}C, with internal temperature stabilising at 57 {\deg}C. Furthermore, a representative longwall simulation was created to evaluate the complete sensing, fusion, and transmission workflow under controlled geometric and low-light conditions. In the final configuration, more than 97% of LiDAR points fell within the camera field of view, supporting reliable colourisation. Enclosure-aware calibration and correction maintained geometric accuracy, while processed colourised point clouds were transmitted at up to 10 Hz with sustained bandwidth below 25 Mb/s.
0
0
eess.IV 2026-05-05

One-step decoder recovers wireless images at 30 ms latency

DriftDecode: One-Step Wireless Image Decoding via Drifting-Inspired Detail Recovery

Drift-inspired loss restores details from preserved coarse structure, beating 10-step methods by 4.8x and adding up to 1.13 dB PSNR under fading channels.

Figure from the paper full image
abstract click to expand
Generative receivers for wireless image transmission can improve reconstruction quality, but diffusion-based and flow-based decoding relies on iterative inference and therefore incurs substantial latency. In wireless image transmission, however, the received signal already preserves the coarse structure of the source image. Wireless decoding is therefore better viewed as a recovery task than as image generation from scratch, and the main challenge lies in restoring channel-impaired details. Motivated by this recovery-oriented perspective, this paper proposes DriftDecode, a signal-to-noise ratio (SNR)-conditioned one-step decoder for wireless image reconstruction. DriftDecode couples a one-step U-Net decoder with a drift-inspired instance-level texture loss. The loss reformulates the drifting-field mechanism from generative drifting models in perceptual feature space, guiding each reconstructed local feature toward its spatially aligned ground-truth counterpart while suppressing mismatched textures. Experiments on DIV2K and MNIST under additive white Gaussian noise (AWGN) and Rayleigh fading channels show a favorable quality-latency tradeoff. DriftDecode achieves 30~ms decoding latency, providing a 4.8$\times$ speedup over a 10-step flow-matching decoder, while consistently outperforming MSE-only training and yielding up to 1.13~dB PSNR gain on MNIST under Rayleigh fading. These results support recovery-oriented one-step decoding as an effective alternative to iterative generative decoding for low-latency wireless image transmission.
0
0
eess.IV 2026-05-04 3 theorems

One optical shot profiles nanoparticle identity

Deep Speckle Holography Redefines Label-free Nanoparticle Phenotyping

Deep speckle holography extracts multidimensional signatures from complex mixtures in 0.9 seconds without labels or preprocessing.

abstract click to expand
Nanoparticle metrology has long been constrained by the assumption that, in mixed and unprocessed fluids, particle size, morphology, composition, and species-specific abundance cannot be resolved simultaneously from a single label-free measurement. Here, we revisit this long-standing limitation by showing that complex forward speckle-holographic fields define an information-rich optical space for multidimensional particle signatures. We report deep speckle holography, a physics-informed generative framework that profiles particle identity, size, morphology, and species-resolved abundance from a single non-contact optical measurement. Across purified suspensions, mixed particle populations, environmental waters, human urine, and other unprocessed native fluids, the method enables direct nanoparticle inference without purification, labeling, or destructive preprocessing, delivering concurrent multidimensional readouts in 0.9 s over a dynamic range spanning 10 orders of magnitude. Deep speckle holography establishes a route toward direct label-free nanoparticle phenotyping in real-world fluids, moving nanoscale measurement beyond isolated-particle characterization toward multidimensional inference in complex mixtures, and expanding the scope of questions nanoscale measurement can address, from real-time tracking of nanoparticle transformations in living and environmental systems to non-invasive quality control of nanomedicine formulations, and beyond.
0
0
eess.IV 2026-05-04

One-step flow generates 3D+t cardiac four-chamber meshes

Cardiac Mesh Flow: One-Step Generation of 3D+t Cardiac Four-Chamber Meshes via Flow Matching

Warps a template via learned deformation fields to yield coherent heart cycles with volume-based control.

Figure from the paper full image
abstract click to expand
Spatio-temporal (3D+t) generative modelling of cardiac shape and motion is crucial for understanding heart structure and function at population scale. Existing generative models for cardiac shape synthesis either adopt volumetric shape representations that lack anatomical correspondence across different time points and subjects, or rely on VAE-based frameworks that suffer from a trade-off between reconstruction fidelity and generative diversity. In this work, we propose Cardiac Mesh Flow, a novel generative flow model for 3D+t cardiac four-chamber mesh generation with anatomical correspondence, temporal coherence, and periodic consistency. Leveraging the flow matching technique, Cardiac Mesh Flow performs efficient one-step generation of multi-scale free-form deformation fields, which warp a template mesh to generate cardiac four-chamber meshes across a cardiac cycle. Furthermore, Cardiac Mesh Flow enables controllable generation conditioned on cardiac chamber volumes, allowing precise control of the synthetic heart. Experimental results demonstrate that Cardiac Mesh Flow achieves high fidelity and diversity on both unconditional and conditional generation, compared to state-of-the-art 3D+t cardiac mesh generation methods.
0
0
eess.IV 2026-05-04

MRI harmonization works without target data or sharing

A Target-Free Harmonization Method for MRI

TgtFreeHarmony picks the right target style by Bayesian search on a disentangled manifold guided by downstream task scores.

Figure from the paper full image
abstract click to expand
In MRI, variations in scan parameters, sequence, or hardware can lead to discrepancies in image appearance, even for the same subject. These inconsistencies, known as domain shifts, can hinder image analysis and degrade the performance of deep learning models trained on data from specific target domains. MRI image harmonization aims to address these issues by aligning source domain images to the target domain images while preserving biological information such as anatomical structures. However, most existing harmonization approaches require access to both source and target domain data at training or test time. This dependence forces data sharing between institutions, raising concerns about patient privacy and substantially limiting the harmonization approaches that can be practically deployed in clinical settings. To overcome these limitations, we introduce TgtFreeHarmony, a harmonization framework tailored to target-free scenarios, eliminating the need for target domain data and any data sharing, enabling privacy-preserving harmonization directly within the source institution. Our approach estimates the target domain style by searching the manifold of MRI domain style constructed via a disentanglement-based generator using Bayesian optimization guided by the performance of a downstream task model, which is trained on target domain data. We evaluated our method on the brain tissue segmentation task across multiple institutes and demonstrated that it effectively harmonizes source images into target images, leading to improved downstream task performance. By enabling harmonization without any access to target-domain data, TgtFreeHarmony establishes a new direction for privacy-preserving harmonization that can be realistically deployed within clinical environments.
0
0
eess.IV 2026-05-04

Blackwell NVENC UHQ gains quality at 400% latency cost

Evolution of NVENC Efficiency: A Longitudinal Analysis of HQ and UHQ Tuning Efficiency, Latency and Energy Trade-offs

Longitudinal analysis shows hybrid pipeline with up to 7 B-frames makes UHQ unsuitable for live interactions despite 22.79% BD-Rate boost.

abstract click to expand
The rapid expansion of uplink-intensive applications necessitates video coding solutions that balance high Rate-Distortion (RD) efficiency with ultra-low latency. This paper presents a longitudinal performance analysis of NVIDIA hardware encoding (NVENC), spanning from Pascal to the emerging Blackwell generation. We specifically evaluate the operational viability of the new "Ultra High Quality" (UHQ) tuning mode against standard low-latency configurations. Our results demonstrate that while the Blackwell architecture breaks historical efficiency plateaus, achieving a 5.94% BD-Rate gain in standard modes and up to 22.79% in UHQ modes, these gains incur severe system-level penalties. We reveal that UHQ operates as a hybrid pipeline, offloading complexity to CUDA cores and enforcing aggressive temporal structures (up to 7 B-frames) that increase end-to-end latency by over 400% and GPU board power consumption by up to 40%. Consequently, while UHQ successfully bridges the quality gap with software encoders, its prohibitive serialization delay renders it unsuitable for interactive real-time communications, positioning it instead as a specialized solution for Video-on-Demand (VoD) transcoding.
1 0
0
eess.IV 2026-05-04

Unsupervised network cleans real low-dose liver CT

Unsupervised Denoising of Real Clinical Low Dose Liver CT with Perceptual Attention Networks

Cycle-GAN training with perceptual attention on unpaired scans produces images approved by physicians for diagnosis.

Figure from the paper full image
abstract click to expand
With the development of deep learning, medical image processing has been widely used to assist clinical research. This paper focuses on the denoising problem of low-dose computed tomography using deep learning. Although low-dose computed tomography reduces radiation exposure to patients, it also introduces more noise, which may interfere with visual interpretation by physicians and affect diagnostic results. To address this problem, inspired by Cycle-GAN for unsupervised learning, this paper proposes an end-to-end unsupervised low-dose computed tomography denoising framework. The proposed framework combines a U-Net structure for multi-scale feature extraction, an attention mechanism for feature fusion, and a residual network for feature transformation. It also introduces perceptual loss to improve the network for the characteristics of medical images. In addition, we construct a real low-dose computed tomography dataset and design a large number of comparative experiments to validate the proposed method, using both image-based evaluation metrics and medical evaluation criteria. Compared with classical methods, the main advantage of this paper is that it addresses the limitation that real clinical data cannot be directly used for supervised learning, while still achieving excellent performance. The experimental results are also professionally evaluated by imaging physicians and meet clinical needs.
0
0
eess.IV 2026-05-04

AI misses more small lung nodules at certain z-phases

Reconstruction Interval Z-Phase Dependence of AI Detection Sensitivity in CT Lung Nodule Screening

Detection sensitivity varies by up to 18 points when reconstruction interval nears nodule diameter.

Figure from the paper full image
abstract click to expand
Background: Sensitivity of AI-assisted lung nodule detection systems is known to vary with CT acquisition parameters including radiation dose, reconstruction kernel, and slice thickness. However, the dependence of detection probability on nodule position within the reconstruction cycle -- the z-phase -- has not, to the author's knowledge, been characterized for deep learning-based detection systems. Methods: A retrospective analysis was performed using the LIDC-IDRI dataset. Detection results from a previously validated 154-case perturbation study were re-analyzed. For each consensus nodule (>=4-reader agreement), z-phase was defined as the fractional position of the nodule center within the reconstruction cycle, folded to [0, 0.5]. Detection sensitivity was stratified by z-phase bin, reconstruction interval (1mm, 3mm, 5mm), and by the ratio of reconstruction interval to nodule diameter (d/D). Results: At 5mm reconstruction interval, sensitivity was 71.6% vs 84.8% at 1mm baseline. Within the 5mm condition, sensitivity varied by 17.6 percentage points across z-phase bins. Stratified by d/D ratio, sensitivity was 92.4% for d/D < 0.5, 78.0% for 0.5 <= d/D < 1.0, and 61.4% for d/D >= 1.0, with a systematic z-phase effect present only in the d/D >= 1.0 stratum. Conclusions: AI detection sensitivity depends on the ratio of reconstruction interval to nodule diameter. When this ratio approaches or exceeds 1.0 -- as occurs for 3-6mm nodules at 5mm reconstruction -- z-phase becomes the dominant source of per-study detection variance. This stochastic effect is invisible to protocol-level quality metrics and not reflected in AI confidence scores.
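The z-phase definition in the abstract translates directly into a few lines of Python (function name ours):

def z_phase(nodule_center_z_mm, recon_interval_mm):
    # Fractional position of the nodule center within the reconstruction
    # cycle (z measured from the reconstruction grid origin), folded to
    # [0, 0.5] as defined in the abstract.
    frac = (nodule_center_z_mm / recon_interval_mm) % 1.0
    return min(frac, 1.0 - frac)

print(z_phase(12.5, 5.0))  # 0.5: the center lands exactly between slices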
0
0
eess.IV 2026-05-04

FedKPer balances generalization and personalization in medical FL

FedKPer: Tackling Generalization and Personalization in Medical Federated Learning via Knowledge Personalization

Knowledge personalization during local training and selective weighting of reliable updates in aggregation improve the trade-off without sacrificing retention.

abstract click to expand
Federated learning (FL) holds great potential for medical applications. However, statistical heterogeneity across healthcare institutions poses a major challenge for FL, as the global model struggles both to generalize across unseen patient populations and to adapt to the unique data distributions of individual hospitals. This heterogeneity also exacerbates forgetting at both the global and local level, causing previously learned patient patterns to be misclassified after model updates. While prior work has largely treated generalization and personalization as separate challenges, we show that a better balance between the two can be achieved through selective alignment with the global model and a modified aggregation scheme, which together mitigate the effects of statistical heterogeneity. Specifically, we introduce FedKPer, which brings knowledge personalization into the training stage of each local device. Afterwards, generalization is considered via the global model aggregation process, where local updates that are reliable and label-diverse are emphasized. We evaluate the performance of FedKPer, devising additional metrics that relate to common consequences of forgetting. Overall, we demonstrate FedKPer improves the generalization-personalization trade-off without sacrificing retention.
0
0
eess.IV 2026-05-04

Recurrent network fills holes in high-speed Lissajous CLE scans

Multi-frame Restoration for High-rate Lissajous Confocal Laser Endomicroscopy

MIRA restores sparse high-rate images by training on stitched slow-scan mosaics and reusing features across frames, beating baselines at low computational cost.

Figure from the paper full image
abstract click to expand
Lissajous confocal laser endomicroscopy (CLE) is a promising solution for high speed in vivo optical biopsy for handheld scenarios. However, Lissajous scanning traces a resonant trajectory and samples only the visited pixels per frame; at high frame rates, many pixels remain unvisited, creating structured holes. In this work, we introduce the first benchmark for high-rate Lissajous CLE, consisting of low-quality video clips paired with high-quality reference images. The reference images are wide-FOV mosaics obtained by stitching stabilized, slow-scan frames of the same tissue, enabling temporally aligned supervision. Using this dataset, we propose MIRA, a lightweight recurrent framework for Lissajous CLE restoration that iteratively aggregates temporal context through feature reuse and displacement alignment. Our experiments demonstrate that MIRA outperforms both lightweight and high-complexity baselines in restoration quality while maintaining a favorable computational efficiency suitable for clinical deployment.
0
0
eess.IV 2026-05-04

CDNet fuses multi-source images via joint dictionary unfolding

Combined Dictionary Unfolding Network with Gradient-Adaptive Fidelity for Transferable Multi-Source Fusion

A block-sparse unfolding step from coupled dictionary learning cuts computation while matching or beating prior fusion quality on standard TNO and RoadScene benchmarks.

Figure from the paper full image
abstract click to expand
Deep Unfolding Network-based methods have emerged as effective solutions for multi-source image fusion by combining model-driven iterative optimization with data-driven deep learning. However, most existing deep unfolding image fusion methods are derived from alternating minimization, which updates the features of different modalities separately. This design introduces considerable computational and memory overhead, limiting deployment on resource-constrained edge devices. To address this issue, we propose CDNet, a lightweight Combined Dictionary Unfolding Network for multi-source image fusion. Rather than introducing a new sparse coding prior or empirically compressing an existing fusion network, CDNet translates the unique-common decomposition prior of coupled dictionary learning into a structurally constrained joint unfolding architecture. The resulting CDBlock follows a block-sparse interaction topology and performs a model-derived joint update of common and modality-specific representations, thereby streamlining feature learning and improving efficiency. In addition, we design a compact High- and Low-frequency Image Fidelity loss for unsupervised training without ground-truth images. We evaluate CDNet on four tasks, including multi-exposure image fusion, infrared and visible image fusion, medical image fusion, and infrared and visible image fusion for semantic segmentation. Experimental results show that CDNet achieves competitive or superior fusion performance with high efficiency. For infrared and visible image fusion, CDNet outperforms competing methods on four of six metrics on the TNO dataset and five of six metrics on the RoadScene dataset. In particular, it surpasses the second-best method by 1.23 dB and 1.59 dB in PSNR on TNO and RoadScene, respectively.
0
0
eess.IV 2026-05-01

Multitask model generates CT from MRI at different field strengths

A Proof-of-Concept Study of Multitask Learning for Cranial Synthetic CT Generation Across Heterogeneous MRI Field Strengths

Framework adapts to scanner variations while keeping anatomical detail for clinical planning and correction tasks.

abstract click to expand
Accurate synthesis of computed tomography (CT) images from magnetic resonance imaging (MRI) is clinically valuable for cranial applications such as attenuation correction, radiotherapy planning, and image-guided interventions. However, heterogeneity across MRI field strengths and acquisition protocols limits the generalizability of existing methods. In this study, we formulate cranial CT synthesis as a modular, structurally coupled problem and propose a deep learning framework to improve robustness across heterogeneous MRI conditions. The model is designed to adapt to variations in field strength and imaging protocols while preserving anatomical consistency. Experiments on multi-site datasets demonstrate improved performance and generalization compared with conventional approaches. The proposed method enables reliable CT synthesis across heterogeneous MRI settings, supporting broader clinical translation.
0
0
eess.IV 2026-05-01

Diffusion-OAMP embeds generative prior into OAMP for wireless images

Diffusion-OAMP for Joint Image Compression and Wireless Transmission

Pre-trained diffusion model serves as nonlinear estimator under SNR matching, enabling training-free recovery from compressed transmissions.

Figure from the paper full image
abstract click to expand
Joint image compression and wireless transmission remains relatively underexplored compared to generic image restoration, despite its importance in practical communication systems. We formulate this problem under an equivalent linear model, and propose Diffusion-OAMP, a training-free reconstruction framework that embeds a pre-trained diffusion model into the OAMP algorithm. In Diffusion-OAMP, the OAMP linear estimator produces pseudo-AWGN observations, while the diffusion model serves as a nonlinear estimator under an SNR-matching rule. This framework offers a way to incorporate multiple generative priors into OAMP. Experiments with varying compression ratios and noise levels show that Diffusion-OAMP performs favorably against classic methods in the evaluated settings.
0
0
eess.IV 2026-05-01

Rotary encodings cut aerodynamic prediction errors by up to 23% on cars

RETO: A Rotary-Enhanced Transformer Operator for High-Fidelity Prediction of Automotive Aerodynamics

RETO pairs sinusoidal-cosine global references with rotary positional encodings to lower relative L2 errors versus prior transformer methods

abstract click to expand
Rapid aerodynamic evaluation is crucial for modern vehicle design, yet existing neural operators struggle to capture intricate spatial correlations. We propose the rotary-enhanced transformer operator (RETO), a novel neural solver featuring a dual-stage spatial awareness mechanism: sinusoidal-cosine encodings for global referencing and rotary positional encodings (RoPE) for relative displacements. RoPE encodes spatial relations via unitary rotations, enforcing translation invariance and enhancing local gradient resolution. RETO is validated on ShapeNet and the high-fidelity DrivAerML benchmark. On ShapeNet, RETO achieves a relative $L_2$ error of 0.063, outperforming RegDGCNN at 0.125 and representing a 16\% improvement over the Transolver baseline, which yields an error of 0.075. These performance gains are further amplified on the DrivAerML dataset, where RETO achieves relative $L_2$ errors of 0.089 for surface pressure and 0.097 for velocity. In comparison, Transolver results in errors of 0.116 and 0.121 for the same metrics, indicating that RETO achieves precision enhancements of 23\% and 19\%, respectively. For comprehensive comparison, the surface pressure and velocity errors for AB-UBT are 0.102 and 0.124, while RegDGCNN yields 0.235 and 0.312, respectively. Information-theoretical analysis shows that the entropy peak of RETO at 0.35 is significantly lower than that of Transolver at 0.75 under $10^4$ resolution, indicating a focused attentional mechanism capable of preserving localized gradients against global diffusion.
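For reference, the rotary mechanism RETO builds on can be sketched in a few lines of PyTorch; this is the standard 1-D RoPE rotation, not RETO's extension to 3-D spatial coordinates.

import torch

def apply_rope(x, pos, base=10000.0):
    # x: (..., seq, dim) with even dim; pos: (seq,) token/grid positions.
    # Feature pairs are rotated by position-dependent angles, so dot
    # products of rotated vectors depend only on relative displacement.
    d = x.shape[-1]
    pos = pos.to(x.dtype)
    freqs = base ** (-torch.arange(0, d, 2, dtype=x.dtype) / d)  # (d/2,)
    ang = pos[:, None] * freqs[None, :]                          # (seq, d/2)
    cos, sin = torch.cos(ang), torch.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

Because each rotation is unitary, vector norms are preserved and the encoding is translation-invariant, which is the property the abstract credits for the improved local gradient resolution.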
0
0
eess.IV 2026-05-01

Lightweight network segments glottis at over 170 fps

A Real-time Scale-robust Network for Glottis Segmentation in Nasal Transnasal Intubation

The 19 MB model achieves 92.9 percent accuracy while handling scale changes in nasal intubation videos.

abstract click to expand
Nasotracheal intubation (NTI) is a critical clinical procedure for establishing and maintaining patient airway patency. Machine-assisted NTI has emerged as a pivotal approach for optimizing procedural efficiency and minimizing manual intervention. However, visual detection algorithms employed for NTI navigation encounter significant challenges, including complex anatomical environments and suboptimal illumination conditions surrounding the glottis. Additionally, the glottis presents considerable scale variability throughout the procedure, initially appearing as a small, difficult-to-capture structure before expanding to occupy nearly the entire field of view. Moreover, traditional visual detection methods often have high computational costs, making real-time, high-precision detection on portable devices challenging. To enhance NTI efficacy and address these challenges, this paper proposes a novel glottis segmentation framework optimized for vision-assisted NTI applications. First, we designed a lightweight, multi-receptive field feature extraction module to reduce intra-class differences, achieving robustness to scale variations of the glottis. This module was then stacked to form the backbone and neck of our network. Subsequently, we developed an advanced label assignment method and redefined the number of samples to further reduce intra-class differences and enhance accuracy in the complex NTI environment. Experiments on three distinct datasets demonstrate that our network surpasses state-of-the-art algorithms, achieving a segmentation mDice of 92.9\% with a compact model size of 19 MB and an inference speed exceeding 170 frames per second. % Our code and datasets will be open-sourced on GitHub after the manuscript is accepted. Our code and datasets are available at https://github.com/HBUT-CV/GlottisNet.
0
0
eess.IV 2026-05-01

Dynamic sparse attention improves hyperspectral super-resolution

Spectral Dynamic Attention Network for Hyperspectral Image Super-Resolution

DCSA and FE-FFN modules cut redundant spectral links and add frequency processing to reach top benchmark scores at competitive speed.

Figure from the paper full image
abstract click to expand
Hyperspectral image super-resolution is essential for enhancing the spatial fidelity of HSI data, yet existing deep learning methods often struggle with substantial spectral redundancy and the limited non-linear modeling capacity of standard feed-forward networks (FFNs). To address these challenges, we propose Spectral Dynamic Attention Network (SDANet), a framework designed to adaptively suppress redundant spectral interactions. SDANet integrates two key components: 1) Dynamic Channel Sparse Attention (DCSA) module that computes channel-wise correlations and selectively preserves the most informative attention responses through dynamic and data-dependent sparsification. 2) Frequency-Enhanced Feed-Forward Network (FE-FFN) that jointly models spatial and frequency-domain representations to enhance non-linear expressiveness. Extensive experiments on two benchmark datasets demonstrate that SDANet achieves state-of-the-art HISR performance while maintaining competitive efficiency. The code will be made publicly available at https://github.com/oucailab/SDANet.
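A speculative PyTorch sketch of what a dynamic top-k channel sparse attention could compute (our reconstruction from the abstract, not the released code): channel correlations are formed, only the k strongest links per channel survive, and the rest are masked out before the softmax.

import torch

def dcsa(x, k):
    # x: (B, C, N) flattened spatial features per channel; keep only the
    # top-k channel-to-channel correlations per query channel.
    scale = x.shape[-1] ** -0.5
    logits = torch.einsum('bcn,bdn->bcd', x, x) * scale       # (B, C, C)
    vals, idx = logits.topk(k, dim=-1)
    sparse = torch.full_like(logits, float('-inf')).scatter(-1, idx, vals)
    attn = sparse.softmax(dim=-1)                             # pruned links get weight 0
    return torch.einsum('bcd,bdn->bcn', attn, x)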
0
0
eess.IV 2026-05-01

Network selects key bands to fuse multi-source remote sensing data

Representative Spectral Correlation Network for Multi-source Remote Sensing Image Classification

RSCNet uses cross-source guidance for spectral selection and adaptive fusion to outperform prior methods with lower complexity on benchmark datasets.

Figure from the paper full image
abstract click to expand
Hyperspectral image (HSI) and SAR/LiDAR data offer complementary spectral and structural information for land-cover classification. However, their effective fusion remains challenging due to two major limitations: The spectral redundancy in high-dimensional HSI and the heterogeneous characteristics between multi-source data. To this end, we propose Representative Spectral Correlation Network (RSCNet), a novel multi-source image classification framework specifically designed to address the above challenges through spectral selection and adaptive interaction. The network incorporates two key components: (1) Key Band Selection Module (KBSM) that adaptively selects task-relevant spectral bands from the original HSI under cross-source guidance, thereby alleviating redundancy and mitigating information loss from conventional PCA-based spectral reduction. Moreover, the learned band subset exhibits highly discriminative spectral structures that align with discriminative semantic cues, promoting compact yet expressive representations. (2) Cross-source Adaptive Fusion Module (CAFM) that performs cross-source attention weighting and local-global contextual refinement to enhance cross-source feature interaction. Experiments on three public benchmark datasets demonstrate that our RSCNet achieves superior performance compared with state-of-the-art methods, while maintaining substantially lower computational complexity. Our codes are publicly available at https://github.com/oucailab/RSCNet.
0
0
eess.IV 2026-04-30

Signed distance maps confine atrial scar maps to the wall

A Two Stage Pipeline for Left Atrial Wall Constrained Scar Segmentation and Localization from LGE-MR Images

Two-stage model derives cavity and wall geometry to cut false positives outside the thin left atrial wall in LGE-MRI

Figure from the paper full image
abstract click to expand
Accurate segmentation and localization of left atrial (LA) ablation scars from Late gadolinium enhancement (LGE)-MRI is essential for assessing the lesion completeness and guiding ablation therapy. Incomplete or discontinuous lesions can increase the recurrence rate of the therapy and inaccurate localization can misguide treatment planning. However, reliable quantification and localization of scar in LGE-MRI is challenging. The severely class imbalanced scar voxels, thin structure of the LA wall, and weak tissue contrast often lead to unrealistic scar predictions. In this paper, we propose a two stage nnUNet based framework that takes LA anatomy into account to help with more precise scar localization and segmentation. In the first stage, an nnUNet model is trained to segment the LA cavity. In the second stage, patient specific cavity and wall signed distance maps (SDMs) are derived from the predicted anatomy to use as geometry aware inputs, and explicitly encode each voxel's signed spatial relationship to the atrial cavity and wall. This approach transforms scar segmentation from a solely intensity-based classification into anatomy-conditioned localization task, providing a continuous spatial prior that stabilizes learning for the thin atrial wall and suppresses topologically invalid predictions. To further address boundary ambiguity, we introduce a wall ROI-masked weighted loss combined with boundary uncertainty-aware supervision strategy that restricts learning to the atrial wall, while accounting for severe class imbalance. We evaluated our approach on the LAScarQS 2022 dataset and achieved a Dice of 61.1% and ASSD of 1.711mm. Our reliable and effective framework improves scar segmentation and localization accuracy by enforcing anatomical validity through geometry-aware supervision, and lowering the false positive detections far away from the atrial wall.
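The signed distance map inputs can be computed with standard tooling; a minimal Python sketch using scipy (the sign convention and names are ours):

import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance_map(mask, spacing=(1.0, 1.0, 1.0)):
    # Signed Euclidean distance to the mask boundary, in physical units:
    # positive outside the structure, negative inside (one common convention).
    mask = mask.astype(bool)
    dist_inside = distance_transform_edt(mask, sampling=spacing)
    dist_outside = distance_transform_edt(~mask, sampling=spacing)
    return dist_outside - dist_inside

Feeding such maps for the cavity and wall alongside the image gives every voxel an explicit, continuous notion of "how far from the wall am I", which is what lets the loss confine scar predictions to anatomically valid locations.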
0
0
eess.IV 2026-04-30

Unit-circle phase representation improves ptychography

Circular Phase Representation and Geometry-Aware Optimization for Ptychographic Image Reconstruction

Predicting cosine and sine components with geodesic loss avoids phase wrapping and preserves high frequencies better than standard deep learning methods.

Figure from the paper full image
abstract click to expand
Traditional iterative reconstruction methods are accurate but computationally expensive, limiting their use in high-throughput and real-time ptychography. Recent deep learning approaches improve speed, but often predict phase as a Euclidean scalar despite its $2\pi$ periodicity, which can introduce wrapping artifacts, discontinuities at $\pm\pi$, and a mismatch between the loss and the underlying signal geometry. We present a deep learning framework for ptychographic reconstruction that models phase on the unit circle using cosine and sine components. Phase error is optimized with a differentiable geodesic loss, which avoids branch-cut discontinuities and provides bounded gradients. The network further incorporates saturation-aware dual-gain input scaling, parallel encoder branches, and three decoders for amplitude, cosine, and sine prediction, together with a composite loss that promotes circular consistency and structural fidelity. Experiments on synthetic and experimental datasets show consistent improvements in both amplitude and phase reconstruction over existing deep learning methods. Frequency-domain analysis further shows better preservation of mid- and high-frequency phase content. The proposed method also provides substantial speedup over iterative solvers while maintaining physically consistent reconstructions.
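A hedged PyTorch sketch of a circular phase loss in this spirit; we use 1 - cos(delta), which is monotone in the geodesic angle and keeps gradients bounded, though the paper's exact loss may differ.

import torch

def circular_phase_loss(cos_pred, sin_pred, phase_true, eps=1e-8):
    # Normalize the predicted (cos, sin) pair onto the unit circle, then
    # penalize 1 - cos(delta): smooth everywhere, with no branch cut at
    # +/- pi and no unbounded gradient (unlike acos near delta = 0 or pi).
    norm = torch.sqrt(cos_pred**2 + sin_pred**2 + eps)
    c, s = cos_pred / norm, sin_pred / norm
    cos_delta = c * torch.cos(phase_true) + s * torch.sin(phase_true)
    return (1.0 - cos_delta).mean()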
eess.IV 2026-04-30

ECG attributions mapped to 3D heart space raise localization accuracy

Validating the Clinical Utility of CineECG 3D Reconstructions through Cross-Modal Feature Attribution

Cross-modal averaging lifts Dice overlap with expert annotations from 0.47 to 0.56 on 20 cases.

Deep learning models for 12-lead electrocardiogram (ECG) analysis achieve high diagnostic performance but lack the intuitive interpretability required for clinical integration. Standard feature attribution methods are limited by the inherent difficulty in mapping abstract waveform fluctuations to physical anatomical pathologies. To resolve this, we propose a cross-modal method that projects feature attributions from high-performance 12-lead ECG models onto the CineECG 3D anatomical space. Our study reveals that while models trained directly on CineECG signals suffer from reduced accuracy and incoherent attributions, the proposed mapping mechanism effectively recovers clinically relevant feature rankings. Validated against a ground-truth dataset of 20 cases annotated by domain experts, the mapped explanations yield a Dice score of 0.56, significantly outperforming the 0.47 baseline of standard 12-lead attributions. These findings indicate that cross-modal attribution averaging effectively filters attribution instability and improves the localization of pathological features, combining the diagnostic expressiveness of standard ECG with the intuitive clarity of anatomical visualization.
eess.IV 2026-04-30

Adaptive transforms from GMM compress semantic features better

Adaptive Transform Coding for Semantic Compression

Selecting mode-specific transforms and quantizers from mixture components matches neural methods on vision features.

Visual data compression is shifting from human-centered reconstruction to machine-oriented representation coding. In this setting, an image is often mapped to a compact semantic embedding, which is then compressed and transmitted for downstream inference. We propose an adaptive transform-coding method for semantic-feature compression motivated by the conditional rate-distortion function of a Gaussian mixture model. The scheme uses mode-dependent transforms and quantizers selected according to the inferred source component, enabling more efficient coding of heterogeneous feature distributions. Evaluations on features from widely used vision backbones and foundation models show that the proposed method outperforms or is competitive with state-of-the-art neural compression methods while preserving flexibility and interpretability.
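A toy version of the mode-dependent transform selection can clarify the mechanism. The sketch below fits a Gaussian mixture to feature vectors and uses each component's eigenbasis as its Karhunen-Loève transform with a uniform scalar quantizer; the component count, step size, and scikit-learn workflow are illustrative assumptions, not the paper's design.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
feats = rng.normal(size=(2000, 16))          # stand-in for semantic features

gmm = GaussianMixture(n_components=4, covariance_type="full").fit(feats)
# Per-mode KLT: eigenbasis of each component's covariance.
klts = [np.linalg.eigh(cov)[1] for cov in gmm.covariances_]

def encode(x, step=0.5):
    k = int(gmm.predict(x[None])[0])          # inferred source component
    coeffs = klts[k].T @ (x - gmm.means_[k])  # mode-dependent transform
    return k, np.round(coeffs / step)         # uniform scalar quantization

def decode(k, q, step=0.5):
    return klts[k] @ (q * step) + gmm.means_[k]

k, q = encode(feats[0])
x_hat = decode(k, q)                          # reconstruction from the code
```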
eess.IV 2026-04-29

SAM keeps high accuracy in CT scans despite simulated shifts

Robustness Evaluation of a Foundation Segmentation Model Under Simulated Domain Shifts in Abdominal CT: Implications for Health Digital Twin Deployment

Mean Dice change stays below 0.01 with no rise in failures across moderate perturbations in spleen segmentation.

Foundation segmentation models such as the Segment Anything Model (SAM) have demonstrated strong generalization across natural images; however, their robustness under clinically realistic medical imaging domain shifts remains insufficiently quantified. We present a systematic slice-level robustness audit of SAM (ViT-B) for spleen segmentation in abdominal CT using 1,051 nonempty slices from 41 volumes in the Medical Segmentation Decathlon. A standardized ground-truth-derived bounding-box protocol was used to isolate encoder robustness from prompt uncertainty. Controlled perturbations simulating inter-scanner variability, including Gaussian noise, blur, contrast scaling, gamma correction, and resolution mismatch, were applied across ten conditions. The clean baseline achieved a mean Dice score of 0.9145 (95% CI: [0.909, 0.919]) with a failure rate of 0.67%. Across all perturbations, the absolute mean ΔDice remained below 0.01. Paired Wilcoxon signed-rank tests with Benjamini-Hochberg false discovery rate correction identified statistically significant but small-magnitude changes under selected conditions, while McNemar analysis showed no significant increase in failure probability. These findings indicate that SAM exhibits stable segmentation behavior under moderate CT domain shifts, supporting its role as a robust foundation baseline for medical image segmentation research. As health digital twins increasingly incorporate foundation segmentation models for anatomical modeling and organ-level monitoring, formal characterization of robustness under real-world imaging variability is a necessary step toward trustworthy deployment.
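The perturbation battery is conceptually simple to reproduce. A minimal sketch of the five perturbation families follows, with placeholder magnitudes that are not the paper's ten calibrated conditions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def perturb(slice_hu, kind, seed=0):
    rng = np.random.default_rng(seed)
    if kind == "noise":        # additive Gaussian noise (HU)
        return slice_hu + rng.normal(0.0, 10.0, slice_hu.shape)
    if kind == "blur":         # Gaussian blur
        return gaussian_filter(slice_hu, sigma=1.5)
    if kind == "contrast":     # linear contrast scaling about the mean
        m = slice_hu.mean()
        return (slice_hu - m) * 0.8 + m
    if kind == "gamma":        # gamma correction (returned in [0, 1])
        x = (slice_hu - slice_hu.min()) / (np.ptp(slice_hu) + 1e-8)
        return x ** 1.5
    if kind == "resolution":   # down/up-sampling to mimic resolution mismatch
        return zoom(zoom(slice_hu, 0.5, order=1), 2.0, order=1)
    raise ValueError(kind)
```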
eess.IV 2026-04-29

End-to-end 3DGS codec processes stereo video without pixel reconstruction

Generalizable 3D Gaussian Splatting enabled Semantic Coding for Real-Time Immersive Video Communications

Disparity-guided semantic coding and direct latent-to-Gaussian prediction deliver real-time rendering and better compression than separate 3D reconstruction and video coding.

Real-time immersive video communications, particularly high-fidelity 3D telepresence, necessitate a synergistic balance between instantaneous dynamic scene reconstruction and high-efficiency data transmission. While recent advancements in feed-forward 3D Gaussian Splatting (3DGS) have enabled real-time rendering, performing multi-view video coding and 3D reconstruction in a decoupled manner leads to suboptimal compression efficiency and high computational complexity. To address this, we propose GS-SCNet, the first unified end-to-end framework that seamlessly integrates generalizable 3DGS reconstruction with a dedicated deep Semantic Coding pipeline. Our architecture is underpinned by two core technical contributions: (i) we introduce a Disparity-Guided Parallel Semantic Codec that exploits epipolar geometric priors to facilitate cross-view contextual interaction via disparity compensation and semantic fusion, thereby enabling real-time parallel processing of stereo streams while significantly enhancing rate-distortion performance, and (ii) we develop a Lightweight Gaussian Parameter Predictor which directly projects decoded semantic latents into 3DGS attributes, obviating the need for intermediate pixel-domain reconstruction. By coupling the codec with the task-specific predictor, our framework extracts geometric correlations only once, effectively eliminating the redundant computational bottleneck inherent in conventional decoupled paradigms. Extensive evaluations on both synthetic and real-world human datasets demonstrate that GS-SCNet achieves a superior trade-off across compression efficiency, rendering quality, and real-time performance. Notably, our framework exhibits strong cross-domain generalization and robustness against compression artifacts when applied to out-of-domain real-world data, significantly outperforming conventional decoupled transmission paradigms.
eess.IV 2026-04-28

Method turns global tissue proportions into pixel segmentations

Semantic Segmentation for Histopathology using Learned Regularization based on Global Proportions

VSLP fuses transformer confidence maps with Wasserstein fidelity and a learned regularizer to satisfy exact proportion counts without pixel-level annotations.

In pathology, the spatial distribution and proportions of tissue types are key indicators of disease progression, and are more readily available than fine-grained annotations. However, these assessments are rarely mapped to pixel-wise segmentation. The task is fundamentally underdetermined, as many spatially distinct segmentations can satisfy the same global proportions in the absence of pixel-wise constraints. To address this, we introduce Variational Segmentation from Label Proportions (VSLP), a two-stage framework that infers dense segmentations from global label proportions, without any pixel-level annotations. This framework first leverages a pre-trained transformer model with test-time augmentation to produce a pixel-wise confidence estimate. In the second stage, these estimates are fused by solving a variational optimization problem that incorporates a Wasserstein data fidelity term alongside a learned regularizer. Unlike end-to-end networks, our variational method can visualize the fidelity-regularization energy, resulting in more interpretable segmentation. We validate our approach on two public datasets, achieving superior performance over existing weakly supervised and unsupervised methods. For one of these datasets, proportions have been estimated by an experienced pathologist to provide a realistic benchmark to the community. Furthermore, the method scales to an in-house dataset with noisy pathologist labels, substantially outperforming state-of-the-art methods, thereby demonstrating practical applicability. The code and data will be made publicly available upon acceptance at https://github.com/xiaoliangpi/VSLP.
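To see why global proportions alone can drive a dense segmentation, consider the following toy fusion problem: optimize per-pixel soft assignments against confidence maps under a global-proportion penalty. The squared-error penalty and plain Adam loop are simplifying stand-ins for the paper's Wasserstein fidelity term and learned regularizer.

```python
import torch

def fuse(confidence, target_props, steps=200, lam=10.0, lr=0.1):
    # confidence: (C, H, W) per-class confidence maps; target_props: (C,)
    logits = confidence.clamp_min(1e-6).log().requires_grad_(True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        p = torch.softmax(logits, dim=0)
        # Cross-entropy fidelity to the transformer confidence maps.
        fidelity = -(confidence * p.clamp_min(1e-8).log()).sum(0).mean()
        props = p.mean(dim=(1, 2))             # predicted global proportions
        loss = fidelity + lam * ((props - target_props) ** 2).sum()
        opt.zero_grad(); loss.backward(); opt.step()
    return torch.softmax(logits, dim=0).argmax(0)   # dense label map
```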
eess.IV 2026-04-28

Physics-informed AI cuts ocean oxygen sensor error to 2 µmol/L

Deep Learning-Enabled Dissolved Oxygen Sensing in Biofouling Environments for Ocean Monitoring

Embedding the Stern-Volmer relation inside a visual transformer yields 90 percent lower error than standard methods despite algae fouling.

The escalating climate crisis and ecosystem degradation demand intelligent, low-cost sensors capable of robust, long-term monitoring in real-world environments. Absolute dissolved oxygen (DO) concentration is a key parameter for predicting climate tipping points. Inexpensive optoelectronic sensors based on microstructured polymer films doped with phosphorescent dyes could be readily deployable; however, signal drift and marine biofouling remain major challenges. Here, we introduce a sensing paradigm that combines camera-based DO sensors with a visual transformer (ViT)-based physics-informed neural network (PINN) for high-fidelity sensing under biofouling conditions. Training and testing data were obtained from an algae-laden water tank over 14 days to capture accelerated biofouling. The ViT-PINN, which embeds the Stern-Volmer (SV) equation into the loss function, reduces mean absolute error (MAE) by 92% and 89% compared to classical statistical and ML approaches, achieving ~2 µmol/L absolute error. A deep ensemble further quantifies predictive uncertainty, enabling self-diagnostic sensing.
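Embedding the Stern-Volmer relation, I0/I = 1 + K_SV·[O2], into the loss is the key physics-informed step. A hedged PyTorch sketch follows; the quenching constant, the weighting, and the L1 data term are placeholders rather than the paper's calibrated values.

```python
import torch
import torch.nn.functional as F

def sv_physics_loss(intensity_ratio, do_pred, ksv=0.02):
    # Penalize predictions inconsistent with Stern-Volmer quenching.
    residual = intensity_ratio - (1.0 + ksv * do_pred)
    return (residual ** 2).mean()

def total_loss(do_pred, do_true, intensity_ratio, lam=0.1):
    # Supervised data term plus weighted physics residual.
    return F.l1_loss(do_pred, do_true) + lam * sv_physics_loss(intensity_ratio, do_pred)
```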
eess.IV 2026-04-28

Images reconstruct uniquely from sparse Laplacian fields

Shared-kernel Wavelet Neural Networks for Poisson Image Reconstruction

A compact shared-kernel wavelet network solves the Poisson equation with under 0.0002M parameters and linear complexity.

The Laplacian operator transforms an image into its Laplacian field, which is usually sparse and follows a stable distribution. Conversely, an image can be uniquely reconstructed from its Laplacian field by solving a Poisson equation with a proper boundary condition; this uniqueness is mathematically guaranteed. Thanks to these properties, we propose to use the sparse Laplacian field to represent the image. We first show that the Laplacian field is sparse and follows a stable distribution on hundreds of images. Then, we show that the image can be accurately reconstructed from its Laplacian field. For the reconstruction task, we propose a shared-kernel wavelet neural network, which solves the Poisson equation and has three advantages. First, it has fewer than 0.0002M parameters, compact enough for most devices. Second, it has linear computational complexity, enabling real-time reconstruction. Third, it achieves higher accuracy than previous methods. Several numerical experiments demonstrate the effectiveness and efficiency of the sparse Laplacian field and the proposed Poisson solver. The proposed method can be applied in a wide range of applications such as image compression, low-light enhancement, and object tracking.
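The uniqueness claim can be checked with a classical reference solver: given the interior Laplacian and Dirichlet boundary values, Jacobi iteration on the 5-point Poisson equation recovers the image. This is a slow textbook baseline (multigrid or DST solvers are far faster), not the paper's shared-kernel wavelet network.

```python
import numpy as np

def poisson_reconstruct(lap, boundary, iters=10000):
    # Jacobi iteration: u = (neighbor sum - lap) / 4 on the interior,
    # with the border of `boundary` held fixed (Dirichlet condition).
    u = boundary.astype(float).copy()
    for _ in range(iters):
        u[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1]
                                + u[1:-1, :-2] + u[1:-1, 2:]
                                - lap[1:-1, 1:-1])
    return u

# Round trip on a random image: compute its 5-point Laplacian field,
# then reconstruct using only the field and the image's border.
img = np.random.default_rng(1).random((64, 64))
lap = np.zeros_like(img)
lap[1:-1, 1:-1] = (img[:-2, 1:-1] + img[2:, 1:-1] + img[1:-1, :-2]
                   + img[1:-1, 2:] - 4 * img[1:-1, 1:-1])
rec = poisson_reconstruct(lap, img)
```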
eess.IV 2026-04-27

Gaussian primitives reconstruct tissue absorption from scattered light

GS-DOT: Gaussian splatting-based image reconstruction for diffuse optical tomography

Replacing ray transport with diffusion functions yields accurate maps from noisy data while cutting memory use.

This work presents GS-DOT, a novel image reconstruction framework based on Gaussian Splatting (GS) for diffuse optical tomography (DOT). Inspired by GS for rendering applications, absorption coefficients are represented as a sparse sum of anisotropic Gaussian primitives optimized to fit measured time-resolved point-spread functions through analytic gradients and Adam optimization. This is the first adaptation of GS algorithms in the photon diffusion regime, where the ray transport function is replaced by diffusion functions to enable accurate modeling of light transport in highly scattering media. Validation on synthetic tissue models demonstrates high accuracy in localization and quantification of reconstructed absorption maps for both clean and noisy signals. GS-DOT shows high robustness to noise and a substantial reduction in memory demand.
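The core representation, an absorption map written as a sparse sum of anisotropic Gaussians with learnable parameters, is shown below in a 2D toy form; the paper works in 3D and couples this to a diffusion forward model and an Adam fitting loop, both omitted here.

```python
import torch

def gaussian_splat_map(means, inv_covs, weights, grid):
    # grid: (N, 2) query points; means: (K, 2); inv_covs: (K, 2, 2); weights: (K,)
    d = grid[None, :, :] - means[:, None, :]                   # (K, N, 2)
    quad = torch.einsum('kni,kij,knj->kn', d, inv_covs, d)     # Mahalanobis terms
    return (weights[:, None] * torch.exp(-0.5 * quad)).sum(0)  # (N,) absorption
```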
eess.IV 2026-04-27

Fourth-order PDE despeckling outperforms second-order baselines

A Coupled Fourth Order Telegraph Diffusion Framework Using Grayscale Indicators for Image Despeckling

Coupled model with grayscale edge indicator reduces speckle in SAR and ultrasound while preserving textures, shown by higher PSNR and MSSIM.

Speckle noise severely limits the quality of images acquired from coherent imaging systems such as Synthetic Aperture Radar (SAR) and medical ultrasound. Traditional second-order PDE-based despeckling approaches, although popular, often introduce staircase artifacts and blur fine details. To overcome these limitations, we present a nonlinear, fourth-order coupled hyperbolic-parabolic PDE model that effectively reduces noise while preserving structure. The framework consists of two evolution equations: one governing fourth-order diffusion for effective speckle reduction and smooth intensity transitions, and another refining an edge indicator to protect textures and structural features. The diffusion coefficient is adaptively constructed using both the image intensity variable u and a grayscale-based indicator function, ensuring structure-aware denoising while avoiding blocky artifacts and preserving fine details. We also prove the existence of a weak solution to the proposed model by applying the Schauder fixed-point theorem. A finite-difference scheme with Gauss-Seidel iteration is employed for efficient implementation. We compare the proposed model with the existing coupled second-order PDE model (HPCPDE) and the fourth-order telegraph diffusion model (TDFM). The results show that our model consistently outperforms these approaches. Experiments on standard grayscale images, real SAR and ultrasound data, as well as speckle-corrupted color images, demonstrate that the proposed method achieves superior performance over conventional PDE-based techniques in terms of PSNR, MSSIM, and Speckle Index.
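As a simplified stand-in for the coupled telegraph-diffusion model, here is one explicit step of You-Kaveh-style fourth-order diffusion, u_t = -Lap(c(|Lap u|) Lap u), with a Laplacian-driven diffusivity; the periodic boundaries via np.roll and the deliberately small step size are sketch-level choices, not the paper's scheme.

```python
import numpy as np

def laplacian(u):
    # 5-point Laplacian with periodic boundaries.
    return (np.roll(u, 1, 0) + np.roll(u, -1, 0)
            + np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4.0 * u)

def fourth_order_step(u, dt=0.01, k=0.1):
    lu = laplacian(u)
    c = 1.0 / (1.0 + (lu / k) ** 2)      # edge-aware diffusivity
    return u - dt * laplacian(c * lu)    # explicit Euler step (needs small dt)
```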
eess.IV 2026-04-27

One adapted model segments colorectal cancer across CT, colonoscopy, and histology

CRC-SAM: SAM-Based Multi-Modal Segmentation and Quantification of Colorectal Cancer in CT, Colonoscopy, and Histology Images

LoRA layers on a frozen foundation model enable consistent tumor outlining in three modalities with only minimal new parameters.

We present CRC-SAM, a unified framework for colorectal cancer segmentation across colonoscopy, CT, and histopathology images. Unlike prior single-modality methods, CRC-SAM provides consistent, modality-agnostic segmentation throughout the clinical workflow. Built on MedSAM, it incorporates low-rank adaptation (LoRA) layers into a frozen encoder, enabling efficient domain transfer to underrepresented modalities with minimal trainable parameters. Experiments on MSD-Colon, CVC-ClinicDB, and EBHI-Seg demonstrate superior performance across modalities, outperforming state-of-the-art baselines and highlighting the effectiveness of lightweight LoRA adaptation for foundation-model-based colorectal cancer analysis.
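The LoRA mechanism itself is generic and compact. Below is a sketch of a frozen linear layer wrapped with a trainable low-rank update, the kind of adapter CRC-SAM reportedly inserts into the frozen MedSAM encoder; the rank and scaling are placeholder hyperparameters.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # foundation weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        # Frozen path plus scaled low-rank correction B @ A.
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```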
eess.IV 2026-04-27

CT map adapts regularization for better whole-body PET registration

CT-Guided Spatially-varying Regularization for Voxel-Wise Deformable Whole-Body PET Registration

Voxel-wise strength derived from paired CT yields statistically significant gains in alignment accuracy over uniform baselines on 296 cross-tracer PET/CT patients.

Whole-body Positron Emission Tomography (PET) registration is essential for multi-parametric tumor characterization and assessment of metastatic disease progression. In deep learning-based deformable registration, the dense displacement field (DDF) regularizer is crucial for stabilizing optimization and preventing unrealistic deformations in large 3D volumes. A key challenge in whole-body deformable registration is anatomical heterogeneity: rigid structures (e.g., bones) should undergo stronger regularization, whereas soft tissues require more flexible deformation and weaker constraints. In this work, we propose a simple yet effective CT-guided spatially-varying regularization strategy for whole-body cross-tracer deformable PET registration. The key idea is to use the paired CT volume from the PET/CT acquisition to construct a voxel-wise regularization map for the DDF, replacing the conventional single global regularization weight. This yields anatomy-adaptive regularization strength across rigid and soft tissues. The proposed method is evaluated on a real clinical cross-tracer PET/CT dataset of 296 patients involving 18F-PSMA and 18F-FDG, showing that the proposed method achieves statistically significant improvements over a weakly-supervised registration baseline in both whole-body registration performance and organ-wise alignment.
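The voxel-wise regularization map can be prototyped in a few lines: derive a per-voxel penalty weight from CT Hounsfield units (stiff near bone, loose in soft tissue) and apply it to a first-order smoothness term on the displacement field. The thresholds, weights, and sigmoid blend below are illustrative assumptions, not the paper's construction.

```python
import torch

def reg_weight_from_ct(hu, bone_hu=300.0, w_soft=0.1, w_bone=10.0):
    t = torch.sigmoid((hu - bone_hu) / 50.0)   # ~0 in soft tissue, ~1 in bone
    return w_soft + (w_bone - w_soft) * t

def weighted_smoothness(ddf, weight):
    # ddf: (3, D, H, W) displacement field; weight: (D, H, W) per-voxel map
    loss = 0.0
    for dim in (1, 2, 3):
        diff = torch.diff(ddf, dim=dim)                       # finite differences
        w = weight.unsqueeze(0).narrow(dim, 0, diff.shape[dim])
        loss = loss + (w * diff ** 2).mean()
    return loss
```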
eess.IV 2026-04-27

Network synthesizes delayed liver MRI phase from earlier images

Triple-Phase Sequential Fusion Network for Hepatobiliary Phase Liver MRI Synthesis

TriPF-Net fuses T1, arterial and venous data with patient info to match real hepatobiliary images and avoid long delays.

Gadoxetate disodium-enhanced MRI is essential for the detection and characterization of hepatocellular carcinoma. However, acquisition of the hepatobiliary phase (HBP) requires a prolonged post-contrast delay, which reduces workflow efficiency and increases the risk of motion artifacts. In this study, we propose a Triple-Phase Sequential Fusion Network (TriPF-Net) to synthesize HBP images by leveraging the sequential information from pre-HBP sequences: while T1-weighted imaging serves as the indispensable baseline, the model adaptively integrates arterial-phase (AP) and venous-phase (VP) features when available. By modeling the tissue-specific contrast uptake and excretion dynamics across these three phases, TriPF-Net ensures robust HBP synthesis even under the stochastic absence of one or both dynamic contrast-enhanced sequences. The framework comprises an Enhanced Region-Guided Encoder and a Dynamic Feature Unification Module, optimized with a Region-Guided Sequential Fusion Loss to maintain physiological consistency. In addition, clinical variables, including age, sex, total bilirubin, and albumin, are incorporated to enhance physiological consistency. Compared with conventional methods, TriPF-Net achieved superior performance on datasets from two centers. On the internal dataset, the model achieved an MAE of 10.65, a PSNR of 23.27, and an SSIM of 0.76. On the external validation dataset, the corresponding values were 12.41, 23.11, and 0.78, respectively. This flexible solution enhances clinical workflow and lesion depiction, potentially eliminating the need for delayed HBP acquisition in HCC imaging.
eess.IV 2026-04-27

Useful nonrobust features are ubiquitous in biomedical images

Models using only these features exceed chance performance in standard tests yet underperform when data distributions change.

We study whether deep networks for medical imaging learn useful nonrobust features - predictive input patterns that are not human interpretable and highly susceptible to small adversarial perturbations - and how these features impact test performance. We show that models trained only on nonrobust features achieve well above chance accuracy across five MedMNIST classification tasks, confirming their predictive value in-distribution. Conversely, adversarially trained models that primarily rely on robust features sacrifice in-distribution accuracy but yield markedly better performance under controlled distribution shifts (MedMNIST-C). Overall, nonrobust features boost standard accuracy yet degrade out-of-distribution performance, revealing a practical robustness-accuracy trade-off in medical imaging classification that should be balanced against the requirements of the deployment setting.
eess.IV 2026-04-27

Foundation models match specialists on cardiac MRI yet generalize better

Are Natural-Domain Foundation Models Effective for Accelerated Cardiac MRI Reconstruction?

They stay competitive on matched data and reduce error on knee and brain scans at high acceleration without retraining.

The emergence of large-scale pretrained foundation models has transformed computer vision, enabling strong performance across diverse downstream tasks. However, their potential for physics-based inverse problems, such as accelerated cardiac MRI reconstruction, remains largely underexplored. In this work, we investigate whether natural-domain foundation models can serve as effective image priors for accelerated cardiac MRI reconstruction, and compare the performance obtained against domain-specific counterparts such as BiomedCLIP. We propose an unrolled reconstruction framework that incorporates pretrained, frozen visual encoders, such as CLIP, DINOv2, and BiomedCLIP, within each cascade to guide the reconstruction process. Through extensive experiments, we show that while task-specific state-of-the-art reconstruction models such as E2E-VarNet achieve superior performance in standard in-distribution settings, foundation-model-based approaches remain competitive. More importantly, in challenging cross-domain scenarios, where models are trained on cardiac MRI and evaluated on anatomically distinct knee and brain datasets, foundation models exhibit improved robustness, particularly under high acceleration factors and limited low-frequency sampling. We further observe that natural-image-pretrained models, such as CLIP, learn highly transferable structural representations, while domain-specific pretraining (BiomedCLIP) provides modest additional gains in more ill-posed regimes. Overall, our results suggest that pretrained foundation models offer a promising source of transferable priors, enabling improved robustness and generalization in accelerated MRI reconstruction.
eess.IV 2026-04-27

Multimodal AI ranks mouse dominance from raw videos

MTT-Bench: Predicting Social Dominance in Mice via Multimodal Large Language Models

Fine-tuned models match tube-test results on unseen mouse interactions without custom vision engineering.

Understanding social dominance in animal behavior is critical for neuroscience and behavioral studies. In this work, we explore the capability of Multimodal Large Language Models (MLLMs) to analyze raw behavioral video of mice and predict their dominance hierarchy. We introduce MTT-Bench, a novel benchmark comprising annotated videos of pairwise mouse interactions for Mouse Tube Test analysis. Building on existing MLLM architectures, we fine-tune these models to perform zero-shot inference on unseen behavioral sequences, predicting social dominance without explicit labels during testing. Our framework demonstrates promising results, showing high agreement with tube test rankings. This work opens a new direction for applying foundation models to ethology and social behavior analysis, without the need to design domain-specific models.
eess.IV 2026-04-27

Network corrects pediatric PET without CT across scanners

Generalizable CT-Free PET Attenuation and Scatter Correction for Pediatric Patients

GPCN keeps anatomical accuracy stable on unseen scanner-tracer pairs by separating fixed structures from variable noise.

Computed tomography (CT)-based attenuation and scatter correction improves quantitative PET but adds radiation exposure that is particularly undesirable in pediatric imaging. Existing CT-free methods are commonly trained in homogeneous settings and often degrade under scanner or radiotracer shifts, which limits their clinical utility. We propose the Generalizable PET Correction Network (GPCN), a dual-domain network for domain-robust CT-free PET attenuation and scatter correction. GPCN combines a multi-band contextual refinement module, which models pediatric anatomical variability through wavelet-based multiscale decomposition and long-range spatial context modeling, with a frequency-aware spectral decoupling module, which performs coordinate-conditioned amplitude/phase refinement in the Fourier domain. By synergizing multi-band spatial contextual modeling with asymmetric frequency-spectrum decoupling, the network explicitly separates invariant topological structures from domain-specific noise, thereby achieving precise quantitative recovery of both anatomical organs and focal lesions. This design aims to separate anatomy-dominant structures from domain-sensitive spectral residuals and to improve robustness across heterogeneous imaging conditions. We train and evaluate the method on 1085 pediatric whole-body PET scans acquired with two scanners and five radiotracers. In both joint training and zero-shot cross-domain evaluation, GPCN outperforms representative baselines and maintains stable quantitative accuracy on unseen scanner-tracer combinations. The method is further supported by ablation, region-wise quantitative analysis, and downstream segmentation experiments. In our cohort, the CT component of the conventional protocol corresponded to an average effective dose of 10.8 mSv, indicating the potential clinical value of reliable CT-free correction for pediatric PET.
eess.IV 2026-04-27

Phase tracking replaces full sweeps in nonlinear resonance tests

Fixed-phase Resonance Tracking for Fast Nonlinear Resonant Ultrasound Spectroscopy

A linearized model updates drive frequency from phase error to stay at instantaneous resonance while material properties evolve.

Nonlinear Resonant Ultrasound Spectroscopy (NRUS) experiments that rely on repeated sampling of resonance curves are inherently sensitive to measurement protocol due to evolution of material parameters caused by fast and slow dynamic effects. We introduce a model-assisted discrete-time resonance tracking method that maintains a system at its instantaneous resonance condition without the need to acquire full frequency sweeps. Resonance is defined through a prescribed phase relation between excitation and response, and the excitation frequency is iteratively updated using a linearized frequency-phase model. The procedure allows controlled suppression of transient wave buildup using optional feedforward correction with respect to an external control parameter. The method is demonstrated on NRUS and on a conditioning-relaxation protocol conducted on a sandstone bar, providing estimates of resonance frequency and damping. Comparison with conventional approaches shows that measurement speed and mode stability significantly influence the inferred nonlinear indicators. The proposed framework is not limited to nonlinear acoustics and can be applied to arbitrary resonant systems with slowly evolving parameters.
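The discrete-time update reduces to a one-line controller: correct the drive frequency in proportion to the phase error through the linearized frequency-phase slope. A hedged sketch with placeholder gain; the slope identification and feedforward correction described above are omitted.

```python
def track_resonance(f, phase_meas, phase_target, dphi_df, gain=0.8):
    # f_{n+1} = f_n + gain * (phase error) / (dphi/df slope at resonance)
    return f + gain * (phase_target - phase_meas) / dphi_df
```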

browse all of eess.IV → full archive · search · sub-categories