HASTE enables training-free dynamic compression of pre-trained CNNs by patch-wise LSH-based merging of redundant channels, reporting 46.2% FLOPs reduction on ResNet34 CIFAR-10 with 1.25% accuracy drop.
hub
MoCoGAN: Decomposing motion and content for video generation
21 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
Sparse 2.5D trajectory transformers with masked pretraining reach 45% top-1 on Something-Something V2 and 54% on EPIC-Kitchens while improving fusion with DINOv2 and V-JEPA by up to 8.7 points.
CFQ trains quantizer parameters and mixed-precision allocation to preserve counterfactual recourse validity, cost, and direction on Adult, German Credit, and COMPAS while matching accuracy of standard quantizers.
CEA assembles per-token low-rank residual updates via dense affinities over hyper-adapter-generated components to improve all-in-one image restoration on spatially non-uniform degradations.
Gaussian Kernel Attention replaces learned QKV projections with a Gaussian RBF kernel on per-head token features, using 0.42x parameters and 0.49x FLOPs while showing competitive language modeling performance at depth 20.
Introduces QCalEval benchmark showing best zero-shot VLM score of 72.3 on quantum calibration plots, with fine-tuning and in-context learning effects varying by model type.
DRFS is a new inversion-free editing technique for rectified flow models that models source-target velocity discrepancies and applies a time-dependent shift to improve fidelity and unify prior methods like DDS and FlowEdit.
Presents the ev-CIVIL dataset and benchmark showing that event-based cameras can support real-time detection of cracks and spalling in civil infrastructure under challenging lighting.
STBP computes exact closed-form bounds for the first convolutional layer of spatio-temporal networks and propagates scalable approximations through the rest to certify robustness under subset-frame or patch perturbations.
SOMA is presented as the first method to recover spatio-temporal muscle behavior from RGB observations, using multi-view RGB surface data, and introduces the SKIM soft-tissue deformation dataset.
AdaCodec introduces a predictive visual code that cuts visual token use in video MLLMs by sending full frames only on high predictive cost and otherwise encoding inter-frame changes as P-tokens, yielding better benchmark scores at lower budgets.
Chessformer is a unified encoder-only transformer for chess that uses square tokens, geometric attention bias, and an attention-based policy head to set new records in human move prediction accuracy, playing strength, and interpretability.
SIEVES improves selective prediction coverage by up to 3x on OOD VQA benchmarks by training a selector to score the quality of visual evidence produced by reasoner models, generalizing across benchmarks and proprietary models without internal access or per-task retraining.
Holi-DETR improves fashion item detection by integrating co-occurrence probabilities, inter-item spatial arrangements, and body keypoint relationships into the DETR architecture.
One-Forcing augments DMD with a GAN loss to enable stable one-step causal autoregressive video generation, reporting a VBench score of 83.76 as SOTA among one-step methods.
Frozen DINOv2-L features with k-NN classification and PCA/ICA refinement achieve state-of-the-art few-shot performance on four benchmarks without any backpropagation or fine-tuning.
Multi-narrow single-model ensembles outperform wide baselines in low-data image classification by learning diverse features but underperform in data-rich settings where training favors few paths.
FA-Seg delivers state-of-the-art training-free open-vocabulary segmentation performance (43.8% mIoU average) on standard benchmarks by extracting and refining attention from a single forward pass of a pretrained diffusion model.
MAPE combines a channel-attention U-Net (SAPE) with training on multi-model adversarial examples scheduled by PPSA to eliminate perturbations, reporting over 95.1% average defense on CIFAR-10 and 71.5% on Mini-ImageNet against black-box transferable attacks.
A dual-stream Transformer using frozen GazeLLE backbones and custom token fusion detects mutual gaze and joint attention from dual-camera recordings, outperforming CNN baselines and a multimodal LLM on caregiver-infant data.
Adding a P2 branch to YOLOX-Nano raises small-object AP by 31.10% on VisDrone; QIEA screens structures balancing accuracy, FLOPs, latency, memory and recall.
citing papers explorer
- Continuous Expert Assembly: Instance-Conditioned Low-Rank Residuals for All-in-One Image Restoration
CEA assembles per-token low-rank residual updates via dense affinities over hyper-adapter-generated components to improve all-in-one image restoration on spatially non-uniform degradations.
- QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding
Introduces QCalEval benchmark showing best zero-shot VLM score of 72.3 on quantum calibration plots, with fine-tuning and in-context learning effects varying by model type.
- Multi-Narrow Transformation as a Single-Model Ensemble: Boundary Conditions, Mechanisms, and Failure Modes
Multi-narrow single-model ensembles outperform wide baselines in low-data image classification by learning diverse features but underperform in data-rich settings where training favors few paths.