arxiv: 2603.19929 · v2 · submitted 2026-03-20 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

· Lean Theorem

RAM: Recover Any 3D Human Motion in-the-Wild

Sen Jia , Ning Zhu , Jinqin Zhong , Jiale Zhou , Huaping Zhang , Jenq-Neng Hwang , Lei Li

Authors on Pith no claims yet

Pith reviewed 2026-05-15 08:41 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords 3D human motion reconstructionin-the-wild videosmulti-person trackingmarkerless motion capturetemporal HMRKalman filteringpose estimationzero-shot tracking

0 comments

The pith

RAM recovers accurate 3D human motions from unconstrained multi-person videos by combining semantic tracking with memory-based reconstruction and prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RAM, a complete pipeline for markerless 3D human motion recovery that operates on videos captured in everyday, uncontrolled environments. It begins with a motion-aware semantic tracker that applies adaptive Kalman filtering to maintain correct person identities even when people occlude one another or move quickly. A memory-augmented Temporal HMR module then reconstructs poses while drawing on stored spatio-temporal patterns to keep the motion smooth and consistent across frames. A lightweight predictor generates likely future poses, and a gated combiner decides how much to trust the reconstructed versus predicted information at each step. Experiments on the PoseTrack and 3DPW benchmarks show that the integrated system delivers higher tracking stability and lower 3D error than earlier methods when tested without any fine-tuning on the target data.

Core claim

RAM achieves robust identity association under severe occlusions and dynamic interactions through a motion-aware semantic tracker with adaptive Kalman filtering. It enhances motion reconstruction with a memory-augmented Temporal HMR module that injects spatio-temporal priors for consistent and smooth estimation. A lightweight Predictor forecasts future poses to maintain continuity, while a gated combiner adaptively fuses reconstructed and predicted features. Together these components produce substantially better zero-shot tracking stability and 3D accuracy on in-the-wild multi-person benchmarks such as PoseTrack and 3DPW.

What carries the argument

The RAM pipeline, which links a motion-aware semantic tracker, a memory-augmented Temporal HMR module, a lightweight future-pose Predictor, and a gated feature combiner.

Load-bearing premise

The assumption that the four separate modules integrate without conflicts or loss of performance to produce the claimed robustness and coherence.

What would settle it

A direct comparison on PoseTrack or 3DPW in which RAM shows no improvement, or shows worse tracking stability or 3D accuracy, than the best previous method would falsify the central claim.

Figures

Figures reproduced from arXiv: 2603.19929 by Huaping Zhang, Jenq-Neng Hwang, Jiale Zhou, Jinqin Zhong, Lei Li, Ning Zhu, Sen Jia.

**Figure 2.** Figure 2: Overview of RAM. The framework integrates four components: SegFollow for motion-guided temporal tracking, Temporal HMR [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Qualitative comparison on the Olympics Boxing dataset. Both 4DHumans and CoMotion suffer from identity switches and [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

read the original abstract

RAM incorporates a motion-aware semantic tracker with adaptive Kalman filtering to achieve robust identity association under severe occlusions and dynamic interactions. A memory-augmented Temporal HMR module further enhances human motion reconstruction by injecting spatio-temporal priors for consistent and smooth motion estimation. Moreover, a lightweight Predictor module forecasts future poses to maintain reconstruction continuity, while a gated combiner adaptively fuses reconstructed and predicted features to ensure coherence and robustness. Experiments on in-the-wild multi-person benchmarks such as PoseTrack and 3DPW, demonstrate that RAM substantially outperforms previous state-of-the-art in both Zero-shot tracking stability and 3D accuracy, offering a generalizable paradigm for markerless 3D human motion capture in-the-wild.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RAM is a practical pipeline stitching Kalman tracking, memory-augmented Temporal HMR, a predictor, and gated fusion that claims big gains on multi-person 3D motion in the wild, but the abstract supplies no numbers or ablations to judge whether the integration actually works.

read the letter

RAM stitches a semantic tracker with adaptive Kalman filtering to a memory-augmented Temporal HMR, adds a lightweight pose predictor, and uses a gated combiner to fuse the outputs. The goal is stable identity tracking and coherent 3D reconstruction when people occlude each other or move unpredictably in real scenes. On PoseTrack and 3DPW it claims clear wins in zero-shot tracking stability and 3D accuracy over prior work. That combination targets a genuine pain point for anyone who needs markerless capture outside the lab. The memory module for spatio-temporal priors and the predictor for continuity are reasonable engineering choices that could reduce jitter and drift. The gated fusion step is the piece that feels most specific to this paper. It tries to adaptively balance reconstructed and forecasted features, which might help when one source is unreliable. Those elements are the parts that could be new enough to matter for follow-up work. The main weakness is the complete absence of numbers, error bars, ablation tables, or failure cases in the abstract. Without those it is impossible to tell how large the gains actually are or whether the modules interact cleanly. The central claim of substantial outperformance therefore rests on trust rather than visible evidence. If the full paper has solid quantitative comparisons and controls, the integration could be useful; right now it is hard to know. This paper is aimed at practitioners who build 3D human motion systems for animation, sports, or surveillance. A reader already working on multi-person pose estimation would get value from the module descriptions and the benchmark choices. I would bring it to a reading group to talk through the fusion logic and see the actual results. It deserves peer review because the problem is relevant and the pipeline is plausible, even though the current write-up will need heavy revision to include the missing experiments and analysis.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces RAM, a pipeline for in-the-wild 3D human motion recovery that combines a motion-aware semantic tracker with adaptive Kalman filtering for identity association under occlusion, a memory-augmented Temporal HMR module that injects spatio-temporal priors, a lightweight Predictor for future-pose forecasting, and a gated combiner that fuses reconstructed and predicted features. Experiments on PoseTrack and 3DPW are reported to show substantial gains over prior state-of-the-art in zero-shot tracking stability and 3D accuracy.

Significance. If the quantitative claims are substantiated, the work supplies a practical, modular architecture for markerless multi-person 3D capture that directly targets the failure modes of existing HMR and tracking pipelines. The explicit separation of tracking, memory-augmented reconstruction, prediction, and adaptive fusion offers a reusable template that could be adopted or extended by the community.

major comments (2)

[Abstract] Abstract: the central claim that RAM 'substantially outperforms previous state-of-the-art' on PoseTrack and 3DPW is stated without any numerical results, error bars, ablation tables, or baseline numbers. This absence makes it impossible to assess the magnitude or statistical reliability of the reported gains.
[Experiments] The manuscript provides no ablation or failure-case analysis demonstrating that the four modules (semantic tracker, memory-augmented Temporal HMR, Predictor, gated combiner) remain effective when combined; the integration claim therefore rests on an untested assumption.

minor comments (1)

Notation for the gated combiner and the memory-augmented HMR is introduced without an accompanying equation or diagram, making the fusion mechanism difficult to reproduce from the text alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and will revise the manuscript to strengthen the presentation of results and validation of the proposed modules.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that RAM 'substantially outperforms previous state-of-the-art' on PoseTrack and 3DPW is stated without any numerical results, error bars, ablation tables, or baseline numbers. This absence makes it impossible to assess the magnitude or statistical reliability of the reported gains.

Authors: We agree that the abstract would be more informative with explicit quantitative support. In the revised manuscript we will add the key performance deltas (e.g., tracking MOTA and 3D MPJPE improvements on both PoseTrack and 3DPW) together with references to the corresponding tables and any reported standard deviations or error bars from the experimental section. revision: yes
Referee: [Experiments] The manuscript provides no ablation or failure-case analysis demonstrating that the four modules (semantic tracker, memory-augmented Temporal HMR, Predictor, gated combiner) remain effective when combined; the integration claim therefore rests on an untested assumption.

Authors: We acknowledge the absence of explicit ablations and failure-case studies. While the current experiments demonstrate end-to-end gains, we will add a dedicated ablation subsection that isolates the contribution of each module and their pairwise combinations, as well as a short qualitative analysis of representative failure cases under heavy occlusion and fast motion. These additions will be included in the revised version. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes a modular pipeline (semantic tracker with adaptive Kalman filtering, memory-augmented Temporal HMR, lightweight Predictor, and gated combiner) whose integration is presented as delivering improved zero-shot tracking stability and 3D accuracy on external benchmarks PoseTrack and 3DPW. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that would reduce any central claim to its own inputs by construction. The argument remains self-contained against external data and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete; the work appears to rest on standard assumptions from prior human-mesh-recovery and tracking literature.

axioms (1)

domain assumption Existing Temporal HMR and Kalman-filter tracking methods provide reliable base performance that can be extended by memory and prediction modules.
Abstract invokes these components without re-deriving them.

pith-pipeline@v0.9.0 · 5431 in / 1281 out tokens · 32404 ms · 2026-05-15T08:41:50.080339+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

RAM incorporates a motion-aware semantic tracker with adaptive Kalman filtering... memory-augmented Temporal HMR module... lightweight Predictor module... gated combiner
IndisputableMonolith/Foundation/DimensionForcing.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Experiments on PoseTrack and 3DPW... zero-shot tracking stability and 3D accuracy

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality
cs.CV 2026-05 unverdicted novelty 6.0

MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.
INTENT: Invariance and Discrimination-aware Noise Mitigation for Robust Composed Image Retrieval
cs.CV 2026-04 unverdicted novelty 6.0

INTENT mitigates cross-modal correspondence noise and modality-inherent noise in composed image retrieval via FFT-based visual invariant composition and bi-objective discriminative learning.
HABIT: Chrono-Synergia Robust Progressive Learning Framework for Composed Image Retrieval
cs.CV 2026-04 unverdicted novelty 6.0

HABIT improves robustness in composed image retrieval under noisy triplets by quantifying sample cleanliness via mutual information transition rates and applying dual-consistency progressive learning to retain good pa...
ReTrack: Evidence-Driven Dual-Stream Directional Anchor Calibration Network for Composed Video Retrieval
cs.CV 2026-04 unverdicted novelty 6.0

ReTrack calibrates directional bias in composed video features using semantic disentanglement and bidirectional evidence alignment to improve retrieval performance on CVR and CIR tasks.
Grounding Multi-Hop Reasoning in Structural Causal Models via Group Relative Policy Optimization
cs.AI 2026-05 unverdicted novelty 5.0

SCM-GRPO grounds multi-hop fact verification in structural causal models and applies GRPO reinforcement learning to optimize reasoning chain length, outperforming baselines on HoVer and EX-FEVER.
Grounding Multi-Hop Reasoning in Structural Causal Models via Group Relative Policy Optimization
cs.AI 2026-05 unverdicted novelty 4.0

The SCM-GRPO framework models multi-hop fact verification as causal inference and applies reinforcement learning to optimize reasoning depth, reporting outperformance on HoVer and EX-FEVER.

Reference graph

Works this paper leans on

103 extracted references · 103 canonical work pages · cited by 5 Pith papers · 1 internal anchor

[1]

Posetrack: A benchmark for human pose estima- tion and tracking

Mykhaylo Andriluka, Umar Iqbal, Eldar Insafutdinov, Leonid Pishchulin, Anton Milan, Juergen Gall, and Bernt Schiele. Posetrack: A benchmark for human pose estima- tion and tracking. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 5167–5176, 2018

work page 2018
[2]

Tracking without bells and whistles

Philipp Bergmann, Tim Meinhardt, and Laura Leal-Taixe. Tracking without bells and whistles. InProceedings of the IEEE/CVF international conference on computer vision, pages 941–951, 2019

work page 2019
[3]

Simple online and realtime tracking

Alex Bewley, Zongyuan Ge, Lionel Ott, Fabio Ramos, and Ben Upcroft. Simple online and realtime tracking. In2016 IEEE international conference on image processing (ICIP), pages 3464–3468. Ieee, 2016

work page 2016
[4]

Keep it smpl: Automatic estimation of 3d human pose and shape from a single image

Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J Black. Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. InEuropean conference on computer vision, pages 561–578. Springer, 2016

work page 2016
[5]

Deep representation learning for human motion prediction and classification

Judith Butepage, Michael J Black, Danica Kragic, and Hed- vig Kjellstrom. Deep representation learning for human motion prediction and classification. InProceedings of the IEEE conference on computer vision and pattern recogni- tion, pages 6158–6166, 2017

work page 2017
[6]

Bayesian optimization for controlled image editing via llms

Chengkun Cai, Haoliang Liu, Xu Zhao, Zhongyu Jiang, Tianfang Zhang, Zongkai Wu, John Lee, Jenq-Neng Hwang, and Lei Li. Bayesian optimization for controlled image editing via llms. InProceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2025

work page 2025
[7]

The role of deductive and inductive reasoning in large language models

Chengkun Cai, Xu Zhao, Haoliang Liu, Zhongyu Jiang, Tianfang Zhang, Zongkai Wu, Jenq-Neng Hwang, and Lei Li. The role of deductive and inductive reasoning in large language models. InProceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2025

work page 2025
[8]

Pretraining on the test set is no longer all you need: A debate-driven approach to QA benchmarks

Linbo Cao and Jinman Zhao. Pretraining on the test set is no longer all you need: A debate-driven approach to QA benchmarks. InSecond Conference on Language Modeling, 2025

work page 2025
[9]

Deft: Detection embeddings for tracking

Mohamed Chaabane, Peter Zhang, J Ross Beveridge, and Stephen O’Hara. Deft: Detection embeddings for tracking. arXiv preprint arXiv:2102.02267, 2021

work page arXiv 2021
[10]

Large language models in mental health: A sys- tematic review of applications, innovations, and ethical chal- lenges.Journal of Industrial Integration and Management, 2025

Yisong Chen, Yifan Gao, Sijing Yu, Chuqing Zhao, and Yang Lu. Large language models in mental health: A sys- tematic review of applications, innovations, and ethical chal- lenges.Journal of Industrial Integration and Management, 2025

work page 2025
[11]

Graph and temporal convolutional networks for 3d multi-person pose estimation in monocular videos

Yu Cheng, Bo Wang, Bo Yang, and Robby T Tan. Graph and temporal convolutional networks for 3d multi-person pose estimation in monocular videos. InProceedings of the AAAI Conference on Artificial Intelligence, pages 1157– 1165, 2021

work page 2021
[12]

Mixformer: End-to-end tracking with iterative mixed atten- tion

Yutao Cui, Cheng Jiang, Limin Wang, and Gangshan Wu. Mixformer: End-to-end tracking with iterative mixed atten- tion. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 13608–13618, 2022

work page 2022
[13]

Tapir: Tracking any point with per-frame initialization and temporal refinement

Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush Gupta, Yusuf Aytar, Joao Carreira, and Andrew Zisserman. Tapir: Tracking any point with per-frame initialization and temporal refinement. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision, pages 10061– 10072, 2023

work page 2023
[14]

Panocontext-former: Panoramic total scene un- derstanding with a transformer

Yuan Dong, Chuan Fang, Liefeng Bo, Zilong Dong, and Ping Tan. Panocontext-former: Panoramic total scene un- derstanding with a transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28087–28097, 2024

work page 2024
[15]

Forensichub: A unified benchmark & codebase for all-domain fake image detection and localization.arXiv preprint arXiv:2505.11003, 2025

Bo Du, Xuekang Zhu, Xiaochen Ma, Chenfan Qu, Kai- wen Feng, Zhe Yang, Chi-Man Pun, Jian Liu, and Ji-Zhe Zhou. Forensichub: A unified benchmark & codebase for all-domain fake image detection and localization.arXiv preprint arXiv:2505.11003, 2025

work page arXiv 2025
[16]

Pansharpening for thin-cloud contaminated remote sensing images: a unified framework and benchmark dataset

Songcheng Du, Yang Zou, Jiaxin Li, Mingxuan Liu, Ying Li, Changjing Shang, and Qiang Shen. Pansharpening for thin-cloud contaminated remote sensing images: a unified framework and benchmark dataset. InProceedings of the AAAI Conference on Artificial Intelligence, pages 3696– 3704, 2026

work page 2026
[17]

Unsupervised hyper- spectral image super-resolution via self-supervised modality decoupling.International Journal of Computer Vision, 134: 152, 2026

Songcheng Du, Yang Zou, Zixu Wang, Xingyuan Li, Ying Li, Changjing Shang, and Qiang Shen. Unsupervised hyper- spectral image super-resolution via self-supervised modality decoupling.International Journal of Computer Vision, 134: 152, 2026

work page 2026
[18]

Humans in 4d: Re- constructing and tracking humans with transformers

Shubham Goel, Georgios Pavlakos, Jathushan Rajasegaran, Angjoo Kanazawa, and Jitendra Malik. Humans in 4d: Re- constructing and tracking humans with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14783–14794, 2023

work page 2023
[19]

Flex: extrinsic parameters-free multi-view 3d human motion reconstruction

Brian Gordon, Sigal Raab, Guy Azov, Raja Giryes, and Daniel Cohen-Or. Flex: extrinsic parameters-free multi-view 3d human motion reconstruction. InEuropean Conference on Computer Vision, pages 176–196. Springer, 2022

work page 2022
[20]

Mocount: Motion-based repetitive ac- tion counting

Ruocheng Gu, Sen Jia, Yule Ma, Jinqin Zhong, Jenq-Neng Hwang, and Lei Li. Mocount: Motion-based repetitive ac- tion counting. InProceedings of the 33rd ACM International Conference on Multimedia, pages 9026–9034, 2025

work page 2025
[21]

Learning to learn weight generation via local consistency diffusion.arXiv preprint arXiv:2502.01117, 2025

Yunchuan Guan, Yu Liu, Ke Zhou, Zhiqi Shen, Serge Be- longie, Jenq-Neng Hwang, and Lei Li. Learning to learn weight generation via local consistency diffusion.arXiv preprint arXiv:2502.01117, 2025. Accepted to CVPR 2026

work page arXiv 2025
[22]

A survey of weight space learning: Under- standing, representation, and generation.arXiv preprint arXiv:2603.10090, 2026

Xin Han et al. A survey of weight space learning: Under- standing, representation, and generation.arXiv preprint arXiv:2603.10090, 2026. Comprehensive survey covering Mc-Di-like methods

work page arXiv 2026
[23]

Per- ceiving longer sequences with bi-directional cross-attention transformers.Advances in Neural Information Processing Systems, 37:94097–94129, 2024

Markus Hiller, Krista A Ehinger, and Tom Drummond. Per- ceiving longer sequences with bi-directional cross-attention transformers.Advances in Neural Information Processing Systems, 37:94097–94129, 2024

work page 2024
[24]

Self-supervised 3d mesh reconstruction from single images

Tao Hu, Liwei Wang, Xiaogang Xu, Shu Liu, and Jiaya Jia. Self-supervised 3d mesh reconstruction from single images. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6002–6011, 2021

work page 2021
[25]

Mrag-suite: A di- agnostic evaluation platform for visual retrieval-augmented generation.arXiv preprint arXiv:2509.24253, 2025

Yuelyu Ji, Wuwei Lan, and Patrick NG. Mrag-suite: A di- agnostic evaluation platform for visual retrieval-augmented generation.arXiv preprint arXiv:2509.24253, 2025

work page arXiv 2025
[26]

Reason- to-rank: Distilling direct and comparative reasoning from large language models for document reranking

Yuelyu Ji, Zhuochun Li, Rui Meng, and Daqing He. Reason- to-rank: Distilling direct and comparative reasoning from large language models for document reranking. InProceed- ings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, page 2320–2329, New York, NY , USA, 2025. Association for Computing Machinery

work page 2025
[27]

Sadhu, Zhuochun Li, Xizhi Wu, Shyam Visweswaran, and Yanshan Wang

Yuelyu Ji, Wenhe Ma, Sonish Sivarajkumar, Hang Zhang, Eugene M. Sadhu, Zhuochun Li, Xizhi Wu, Shyam Visweswaran, and Yanshan Wang. Mitigating the risk of health inequity exacerbated by large language models.npj Digital Medicine, 8(1):246, 2025

work page 2025
[28]

Retrieval–reasoning processes for multi-hop question an- swering: A four-axis design framework and empirical trends

Yuelyu Ji, Zhuochun Li, Rui Meng, and Daqing He. Retrieval–reasoning processes for multi-hop question an- swering: A four-axis design framework and empirical trends. arXiv preprint arXiv:2601.00536, 2026

work page arXiv 2026
[29]

Adaptive masking enhances visual grounding.arXiv preprint arXiv:2410.03161, 2024

Sen Jia and Lei Li. Adaptive masking enhances visual grounding.arXiv preprint arXiv:2410.03161, 2024

work page arXiv 2024
[30]

Multiply: Reconstruction of multiple people from monocular video in the wild

Zeren Jiang, Chen Guo, Manuel Kaufmann, Tianjian Jiang, Julien Valentin, Otmar Hilliges, and Jie Song. Multiply: Reconstruction of multiple people from monocular video in the wild. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 109–118, 2024

work page 2024
[31]

Back to optimization: Diffusion-based zero-shot 3d human pose estimation

Zhongyu Jiang, Zhuoran Zhou, Lei Li, Wenhao Chai, Cheng- Yen Yang, and Jenq-Neng Hwang. Back to optimization: Diffusion-based zero-shot 3d human pose estimation. In Proceedings of the IEEE/CVF Winter Conference on Appli- cations of Computer Vision (WACV), 2024

work page 2024
[32]

Unihpr: Unified human pose representation via singular value contrastive learning

Zhongyu Jiang, Wenhao Chai, Lei Li, Zhuoran Zhou, Cheng- Yen Yang, and Jenq-Neng Hwang. Unihpr: Unified human pose representation via singular value contrastive learning. arXiv preprint arXiv:2510.19078, 2025

work page arXiv 2025
[33]

A new approach to linear filtering and prediction problems

Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. 1960

work page 1960
[34]

End-to-end recovery of human shape and pose

Angjoo Kanazawa, Michael J Black, David W Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 7122–7131, 2018

work page 2018
[35]

Lp-detr: Layer-wise progressive relation for object detection

Zhengjian Kang, Ye Zhang, Xiaoyu Deng, Xintao Li, and Yongzhe Zhang. Lp-detr: Layer-wise progressive relation for object detection. InInternational Conference on Intelligent Computing, pages 144–156. Springer, 2025

work page 2025
[36]

Multi- person physics-based pose estimation for combat sports

Hossein Feizollah Zadeh Khoiee, David Labbe, Thomas Romeas, Jocelyn Faubert, and Sheldon Andrews. Multi- person physics-based pose estimation for combat sports. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5832–5841, 2025

work page 2025
[37]

Segment any- thing

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF international con- ference on computer vision, pages 4015–4026, 2023

work page 2023
[38]

Pare: Part attention regressor for 3d human body estimation

Muhammed Kocabas, Chun-Hao P Huang, Otmar Hilliges, and Michael J Black. Pare: Part attention regressor for 3d human body estimation. InProceedings of the IEEE/CVF international conference on computer vision, pages 11127– 11137, 2021

work page 2021
[39]

Learning to reconstruct 3d human pose and shape via model-fitting in the loop

Nikos Kolotouros, Georgios Pavlakos, Michael J Black, and Kostas Daniilidis. Learning to reconstruct 3d human pose and shape via model-fitting in the loop. InProceedings of the IEEE/CVF international conference on computer vision, pages 2252–2261, 2019

work page 2019
[40]

The hungarian method for the assignment problem.Naval research logistics quarterly, 2(1-2):83–97, 1955

Harold W Kuhn. The hungarian method for the assignment problem.Naval research logistics quarterly, 2(1-2):83–97, 1955

work page 1955
[41]

Mocap everyone everywhere: Lightweight motion capture with smartwatches and a head- mounted camera

Jiye Lee and Hanbyul Joo. Mocap everyone everywhere: Lightweight motion capture with smartwatches and a head- mounted camera. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 1091–1100, 2024

work page 2024
[42]

Wav2sem: Plug-and-play audio semantic decou- pling for 3d speech-driven facial animation.arXiv preprint arXiv:2505.23290, 2025

Hao Li, Ju Dai, Xin Zhao, Feng Zhou, Junjun Pan, and Lei Li. Wav2sem: Plug-and-play audio semantic decou- pling for 3d speech-driven facial animation.arXiv preprint arXiv:2505.23290, 2025

work page arXiv 2025
[43]

Au-blendshape for fine-grained stylized 3d facial ex- pression manipulation.arXiv preprint arXiv:2507.12001, 2025

Hao Li, Ju Dai, Feng Zhou, Kaida Ning, Lei Li, and Junjun Pan. Au-blendshape for fine-grained stylized 3d facial ex- pression manipulation.arXiv preprint arXiv:2507.12001, 2025

work page arXiv 2025
[44]

Mask-fpan: Semi-supervised face parsing in the wild with de-occlusion and uv gan.arXiv preprint arXiv:2212.09098, 2022

Lei Li, Tianfang Zhang, Zhongfeng Kang, and Xikun Jiang. Mask-fpan: Semi-supervised face parsing in the wild with de-occlusion and uv gan.arXiv preprint arXiv:2212.09098, 2022

work page arXiv 2022
[45]

Chatmotion: A multimodal multi-agent for human motion analysis.arXiv preprint arXiv:2502.18180, 2025

Lei Li, Sen Jia, Jianhao Wang, Zhaochong An, Jiaang Li, Jenq-Neng Hwang, and Serge Belongie. Chatmotion: A multimodal multi-agent for human motion analysis.arXiv preprint arXiv:2502.18180, 2025

work page arXiv 2025
[46]

Human Motion Instruction Tuning

Lei Li, Sen Jia, Jianhao Wang, Zhongyu Jiang, Feng Zhou, Ju Dai, Tianfang Zhang, Zongkai Wu, and Jenq-Neng Hwang. Human Motion Instruction Tuning. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

work page 2025
[47]

Multiple human mo- tion understanding

Lei Li, Sen Jia, and Jenq-Neng Hwang. Multiple human mo- tion understanding. InProceedings of the AAAI Conference on Artificial Intelligence, pages 6297–6305, 2026

work page 2026
[48]

Time3d: End-to-end joint monocu- lar 3d object detection and tracking for autonomous driving

Peixuan Li and Jieyu Jin. Time3d: End-to-end joint monocu- lar 3d object detection and tracking for autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3885–3894, 2022

work page 2022
[49]

Sep- prune: Structured pruning for efficient deep speech separa- tion.arXiv preprint arXiv:2505.12079, 2025

Yuqi Li, Kai Li, Xin Yin, Zhifei Yang, Junhao Dong, Zeyu Dong, Chuanguang Yang, Yingli Tian, and Yao Lu. Sep- prune: Structured pruning for efficient deep speech separa- tion.arXiv preprint arXiv:2505.12079, 2025

work page arXiv 2025
[50]

Frequency-aligned knowledge distillation for lightweight spatiotemporal forecasting

Yuqi Li, Chuanguang Yang, Hansheng Zeng, Zeyu Dong, Zhulin An, Yongjun Xu, Yingli Tian, and Hao Wu. Frequency-aligned knowledge distillation for lightweight spatiotemporal forecasting. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7262– 7272, 2025

work page 2025
[51]

A comprehensive survey of interaction techniques in 3d scene generation.Authorea Preprints, 2026

Yuqi Li, Siwei Meng, Chuanguang Yang, Weilun Feng, Jun- ming Liu, Zhulin An, Yikai Wang, and Yingli Tian. A comprehensive survey of interaction techniques in 3d scene generation.Authorea Preprints, 2026

work page 2026
[52]

Cliff: Carrying location information in full frames into human pose and shape estimation

Zhihao Li, Jianzhuang Liu, Zhensong Zhang, Songcen Xu, and Youliang Yan. Cliff: Carrying location information in full frames into human pose and shape estimation. In European Conference on Computer Vision, pages 590–606. Springer, 2022

work page 2022
[53]

Graphrag under fire, 2025

Jiacheng Liang, Yuhui Wang, Changjiang Li, Rongyi Zhu, Tanqiu Jiang, Neil Gong, and Ting Wang. Graphrag under fire, 2025

work page 2025
[54]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014

work page 2014
[55]

Discovering what you can control: Interventional boundary discovery for reinforcement learning, 2026

Jiaxin Liu. Discovering what you can control: Interventional boundary discovery for reinforcement learning, 2026

work page 2026
[56]

Reasonact: Progressive training for fine-grained video reasoning in small models.Proceed- ings of the AAAI Conference on Artificial Intelligence, 40 (9):7188–7196, 2026

Jiaxin Liu and Zhaolu Kang. Reasonact: Progressive training for fine-grained video reasoning in small models.Proceed- ings of the AAAI Conference on Artificial Intelligence, 40 (9):7188–7196, 2026

work page 2026
[57]

Graph canvas for controllable 3d scene generation.arXiv preprint arXiv:2412.00091, 2024

Libin Liu, Shen Chen, Sen Jia, Jingzhe Shi, Zhongyu Jiang, Can Jin, Wu Zongkai, Jenq-Neng Hwang, and Lei Li. Graph canvas for controllable 3d scene generation.arXiv preprint arXiv:2412.00091, 2024

work page arXiv 2024
[58]

Smpl: A skinned multi- person linear model

Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: A skinned multi- person linear model. InSeminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 851–866. 2023

work page 2023
[59]

Towards generalizable and interpretable motion predic- tion: A deep variational bayes approach

Juanwu Lu, Wei Zhan, Masayoshi Tomizuka, and Yeping Hu. Towards generalizable and interpretable motion predic- tion: A deep variational bayes approach. InInternational Conference on Artificial Intelligence and Statistics, pages 4717–4725. PMLR, 2024

work page 2024
[60]

Diffmot: A real-time diffusion-based multiple object tracker with non-linear prediction

Weiyi Lv, Yuhang Huang, Ning Zhang, Ruei-Sung Lin, Mei Han, and Dan Zeng. Diffmot: A real-time diffusion-based multiple object tracker with non-linear prediction. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19321–19330, 2024

work page 2024
[61]

MS-DETR: Towards effective video mo- ment retrieval and highlight detection by joint motion- semantic learning

Hongxu Ma, Guanshuo Wang, Fufu Yu, Qiong Jia, and Shouhong Ding. MS-DETR: Towards effective video mo- ment retrieval and highlight detection by joint motion- semantic learning. InProceedings of the 33rd ACM In- ternational Conference on Multimedia (ACM MM), pages 4514–4523, 2025. Oral Presentation

work page 2025
[62]

Fine-grained zero-shot ob- ject detection

Hongxu Ma, Chenbo Zhang, Lu Zhang, Jiaogen Zhou, Ji- hong Guan, and Shuigeng Zhou. Fine-grained zero-shot ob- ject detection. InProceedings of the 33rd ACM International Conference on Multimedia (ACM MM), pages 4504–4513,

work page
[63]

GoR: A unified and extensible generative framework for ordinal regression

Hongxu Ma, Han Zhou, Kai Tian, Xuefeng Zhang, Chunjie Chen, Han Li, Jihong Guan, and Shuigeng Zhou. GoR: A unified and extensible generative framework for ordinal regression. InThe Fourteenth International Conference on Learning Representations (ICLR), 2026. ICLR 2026

work page 2026
[64]

Imdl-benco: A comprehensive bench- mark and codebase for image manipulation detection & localization.Advances in Neural Information Processing Systems, 37:134591–134613, 2025

Xiaochen Ma, Xuekang Zhu, Lei Su, Bo Du, Zhuohang Jiang, Bingkui Tong, Zeyu Lei, Xinyu Yang, Chi-Man Pun, Jiancheng Lv, et al. Imdl-benco: A comprehensive bench- mark and codebase for image manipulation detection & localization.Advances in Neural Information Processing Systems, 37:134591–134613, 2025

work page 2025
[65]

3d human pose estimation from a single image via distance matrix regression

Francesc Moreno-Noguer. 3d human pose estimation from a single image via distance matrix regression. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2823–2832, 2017

work page 2017
[66]

Comotion: Concurrent multi- person 3d motion.arXiv preprint arXiv:2504.12186, 2025

Alejandro Newell, Peiyun Hu, Lahav Lipson, Stephan R Richter, and Vladlen Koltun. Comotion: Concurrent multi- person 3d motion.arXiv preprint arXiv:2504.12186, 2025

work page arXiv 2025
[67]

Towards robust and smooth 3d multi-person pose estimation from monocular videos in the wild

Sungchan Park, Eunyi You, Inhoe Lee, and Joonseok Lee. Towards robust and smooth 3d multi-person pose estimation from monocular videos in the wild. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 14772–14782, 2023

work page 2023
[68]

Motiontrack: Learning robust short- term and long-term motions for multi-object tracking

Zheng Qin, Sanping Zhou, Le Wang, Jinghai Duan, Gang Hua, and Wei Tang. Motiontrack: Learning robust short- term and long-term motions for multi-object tracking. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 17939–17948, 2023

work page 2023
[69]

Tracking people by pre- dicting 3d appearance, location and pose

Jathushan Rajasegaran, Georgios Pavlakos, Angjoo Kanazawa, and Jitendra Malik. Tracking people by pre- dicting 3d appearance, location and pose. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2740–2749, 2022

work page 2022
[70]

SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[71]

Global-to-local modeling for video-based 3d human pose and shape estimation

Xiaolong Shen, Zongxin Yang, Xiaohan Wang, Jianxin Ma, Chang Zhou, and Yi Yang. Global-to-local modeling for video-based 3d human pose and shape estimation. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8887–8896, 2023

work page 2023
[72]

Scaling law for time series forecasting

Jingzhe Shi, Qinwei Ma, Huan Ma, and Lei Li. Scaling law for time series forecasting. InAdvances in Neural Informa- tion Processing Systems (NeurIPS), 2024

work page 2024
[73]

Explaining context length scaling and bounds for language models.arXiv preprint arXiv:2502.01481, 2025

Jingzhe Shi, Qinwei Ma, Hongyi Liu, Hang Zhao, Jeng- Neng Hwang, and Lei Li. Explaining context length scaling and bounds for language models.arXiv preprint arXiv:2502.01481, 2025

work page arXiv 2025
[74]

Medal s: Spatio-textual prompt model for medical segmentation

Pengcheng Shi, Jiawei Chen, Jiaqi Liu, Xinglin Zhang, Tao Chen, and Lei Li. Medal s: Spatio-textual prompt model for medical segmentation. InCVPR 2025: Foundation Models for 3D Biomedical Image Segmentation

work page 2025
[75]

Diffusion-based neural network weights generation.arXiv preprint arXiv:2402.18153, 2024

Bedionita Soro, Nicol `o Andreis, et al. Diffusion-based neural network weights generation.arXiv preprint arXiv:2402.18153, 2024. ICLR 2025

work page arXiv 2024
[76]

Multiphys: Multi-person physics-aware 3d motion estimation

Nicolas Ugrinovic, Boxiao Pan, Georgios Pavlakos, De- spoina Paschalidou, Bokui Shen, Jordi Sanchez-Riera, Francesc Moreno-Noguer, and Leonidas Guibas. Multiphys: Multi-person physics-aware 3d motion estimation. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2331–2340, 2024

work page 2024
[77]

Recovering accurate 3d human pose in the wild using imus and a moving camera

Timo von Marcard, Roberto Henschel, Michael Black, Bodo Rosenhahn, and Gerard Pons-Moll. Recovering accurate 3d human pose in the wild using imus and a moving camera. In European Conference on Computer Vision (ECCV), 2018

work page 2018
[78]

Reasoning or retrieval? a study of answer attribution on large reasoning models, 2025

Yuhui Wang, Changjiang Li, Guangke Chen, Jiacheng Liang, and Ting Wang. Reasoning or retrieval? a study of answer attribution on large reasoning models, 2025

work page 2025
[79]

Self-destructive language model, 2025

Yuhui Wang, Rongyi Zhu, and Ting Wang. Self-destructive language model, 2025

work page 2025
[80]

Simple online and realtime tracking with a deep association metric

Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric. In2017 IEEE international conference on image processing (ICIP), pages 3645–3649. IEEE, 2017

work page 2017

Showing first 80 references.