Benchmarking Single-Factor Physical Video-to-Audio Generation

Arushi Goel; Gantavya Bhatt; Gopala Anumanchipalli; Kevin J. Shih; Ming-Yu Liu; Sang-gil Lee; Siddharth Gururani; Tingle Li; Zhifeng Kong

arXiv: 2605.30339 · v1 · pith:WOEANIJVnew · submitted 2026-05-28 · cs.CV · cs.MM · cs.SD · eess.AS

Pith reviewed 2026-06-29 07:40 UTC · model grok-4.3

classification cs.CV · cs.MM · cs.SD · eess.AS

keywords video-to-audio generation · physical reasoning benchmark · counterfactual evaluation · temporal alignment · multimodal models · generative models · physics-based metrics

The pith

Video-to-audio models rely more on text captions than visuals for physical accuracy, but captions degrade temporal alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FlatSounds, a benchmark that uses controlled counterfactual video pairs varying one physical factor, plus single-video pattern tests, to check whether generated audio reflects specific physical properties and timings. Evaluation of current models shows they draw physics and semantics more from accompanying text captions than from the video pixels themselves. Captions raise accuracy on physical and semantic aspects yet reduce how precisely the audio syncs with video events. The work argues that this pattern indicates models are not learning physical processes directly from visual input. The introduced metrics also track human judgments closely on the benchmark data.

Core claim

FlatSounds audits V2A models through counterfactual pairs that isolate single physical factors and pattern tests that check internal consistency and trends. State-of-the-art models depend more on text captions than the visual stream; captions raise physical and semantic accuracy but lower temporal alignment, showing the need to learn physical processes directly from pixels.

What carries the argument

FlatSounds benchmark of controlled counterfactual pairs and single-video pattern tests that isolate and probe specific physical properties and timings.
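
As a rough illustration of how a counterfactual pair could be scored, the sketch below estimates a crude pitch proxy for the audio generated for each clip in a pair and checks whether the change goes in the expected direction. The function names, the zero-crossing pitch proxy, and the synthetic tones are assumptions made for illustration only; the paper's actual feature extractors and scoring rules are not described on this page.

```python
import math

def zero_crossing_pitch(samples, sample_rate):
    """Crude pitch proxy in Hz: half the zero-crossing rate of the signal."""
    crossings = sum(
        1 for prev, cur in zip(samples, samples[1:]) if (prev < 0.0) != (cur < 0.0)
    )
    duration = len(samples) / sample_rate
    return crossings / (2.0 * duration)

def direction_matches(audio_a, audio_b, sample_rate, expected_sign):
    """True if the pitch change from clip A to clip B has the expected sign (+1 or -1)."""
    delta = zero_crossing_pitch(audio_b, sample_rate) - zero_crossing_pitch(audio_a, sample_rate)
    return (delta > 0) == (expected_sign > 0) and delta != 0

def tone(freq_hz, sample_rate, seconds=1.0, phase=0.3):
    """Synthetic sine tone standing in for a model's generated audio (hypothetical data)."""
    n = int(sample_rate * seconds)
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate + phase) for i in range(n)]

# Hypothetical pair: the varied factor is expected to raise the pitch from clip A to clip B.
sr = 16000
clip_a = tone(440.0, sr)
clip_b = tone(660.0, sr)
pitch_rises = direction_matches(clip_a, clip_b, sr, expected_sign=+1)
```

A directional check like this only asks whether the feature moved the right way, not by how much, which is one plausible way to compare models without requiring an exact acoustic target.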

If this is right

Adding text captions improves physical and semantic accuracy in generated audio.
Captions reduce the temporal alignment between generated audio and video events.
Physics-based metrics from the benchmark correlate strongly with human preference ratings.
Progress requires training that extracts physical processes from pixels rather than from text.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Strengthening the visual encoder could reduce the observed accuracy-alignment trade-off.
Benchmarks that vary one factor at a time could be applied to other generation tasks such as text-to-video or audio-to-video.
Training sets that emphasize visual-physical correspondences without captions might lessen reliance on text.
Human preference alignment with the metrics suggests the benchmark can serve as a scalable proxy for subjective evaluation.

Load-bearing premise

The controlled counterfactual pairs and single-video pattern tests isolate specific physical properties and timings without confounds from how the videos are recorded and time-warped or from model internals.

What would settle it

A model that receives only raw video input and still produces audio whose physical accuracy and timing match or exceed those of caption-conditioned models on the same counterfactual pairs would falsify the claim of primary caption reliance.

Figures

Figures reproduced from arXiv: 2605.30339 by Arushi Goel, Gantavya Bhatt, Gopala Anumanchipalli, Kevin J. Shih, Ming-Yu Liu, Sang-gil Lee, Siddharth Gururani, Tingle Li, Zhifeng Kong.

**Figure 1.** FlatSounds for video-to-audio physical benchmark. We test whether current video-to-audio models generate sound that reflects controlled changes in physical factors. For counterfactual pairs, we time-warp videos so that only a single physical factor (e.g., jar fullness) differs while impact timing remains aligned, and compare generated physical features such as pitch (left). For single-video tests, we probe…

**Figure 2.** FlatSounds dataset. How should sound change when we manipulate specific visual properties of a scene? FlatSounds contains indoor recordings of everyday household objects producing sound under controlled variations. Many clips are arranged into time-aligned counterfactual pairs in which we alter a single factor (material, environment, texture, etc.) while keeping the rest of the scene fixed, allowing us to …

**Figure 3.** (no caption available)

**Figure 4.** Win-rates for pairwise preference test. Each cell (i, j) represents how often the model for row i is preferred over the model for column j. Overall, the pairwise preferences broadly support the ranking trends reported in the main paper's Human Evaluation discussion. We refer the reader there for the final Elo-based ordering. $R'_B = R_B + K \cdot (S_B - E_B)$, where $S_A$, $S_B$ are the outcome scores: $S_A = 1$, $S_B = 0$ …

**Figure 5.** Visualization of a clean example of the energy envelope…

**Figure 6.** Example of a difficult case for automatically detecting…

**Figure 7.** Distribution of (a) material types, (b) hit counts, and (c) metric-change annotations used to create counterfactual pairs. A pair…

**Figure 8.** Plot showing the Attack time and Decay Rate of a de…

**Figure 9.** Plot showing all the spectral features, and the F0 contour…

**Figure 10.** Plot showing RT60 and DRR measures of a detected…

**Figure 11.** Qualitative comparison of original and physics-aware (our) captions for train (top), banjo (middle), and cloth (bottom) samples.
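
The Figure 4 caption quotes an Elo-style update, R'_B = R_B + K · (S_B − E_B), for turning pairwise preferences into a ranking. A minimal sketch of that update is given below. The logistic expected score with a 400-point scale, the value K = 32, and the starting rating of 1000 are conventional Elo defaults assumed here; the excerpt does not state which values the paper uses.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Conventional Elo expected score of A against B (the 400-point scale is an assumption)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))

def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Apply R' = R + K * (S - E) to both sides.

    score_a is 1.0 if A is preferred, 0.0 if B is preferred, 0.5 for a tie.
    """
    e_a = expected_score(rating_a, rating_b)
    e_b = 1.0 - e_a
    score_b = 1.0 - score_a
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * (score_b - e_b)
    return new_a, new_b

# Hypothetical example: two models start level and model A wins one comparison.
new_a, new_b = elo_update(1000.0, 1000.0, score_a=1.0)
```

Because the two expected scores sum to one, the points gained by the preferred model equal the points lost by the other, so the total rating mass stays constant across comparisons.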

Original abstract

Generative video-to-audio (V2A) models produce highly plausible soundtracks, but it remains unclear whether they capture the underlying physical processes. Existing evaluations emphasize perceptual realism and overlook physical correctness under controlled interventions. In this paper, we introduce FlatSounds, a benchmark that audits the physical reasoning of V2A models through: 1) controlled counterfactual pairs in which a single physical factor is varied, and 2) single-video pattern tests that probe internal consistency and directional trends. These settings test whether the generated audio correctly reflects specific physical properties and timings. Our evaluation of state-of-the-art models reveals a consistent trade-off: models rely more on text captions than the visual stream to infer physics and semantics. Captions generally improve physical and semantic accuracy, but paradoxically degrade temporal alignment. Our results highlight the need to move beyond audio quality toward learning physical processes directly from pixels. Finally, we find that our physics-based metrics correlate strongly with human preference tests on our own data. Project webpage: https://research.nvidia.com/labs/cosmos-lab/flatsounds/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FlatSounds is a new benchmark that shows V2A models lean on captions more than visuals, with captions helping accuracy but hurting timing alignment.

This paper introduces FlatSounds, a benchmark for testing physical reasoning in video-to-audio generation models. It uses controlled counterfactual pairs where one physical factor changes and single-video tests for consistency.

What stands out is the evaluation of state-of-the-art models revealing they depend more on text captions than the visual input. Captions boost physical and semantic accuracy but reduce temporal alignment. The physics-based metrics also correlate with human preferences.

The setup looks solid for what it is. The single-factor design helps isolate effects, and the correlation with humans is a positive sign that the metrics aren't arbitrary.

One potential soft spot is whether the process used to construct the counterfactual pairs (the figure captions describe time-warping recorded videos) introduces unintended variations that could confound the results. The paper claims to control for this, but without the full details it's hard to be sure. Still, the approach is more controlled than existing perceptual tests.

This is useful for the V2A research community. Readers interested in moving these models toward better physical grounding will find the benchmark and the identified gaps helpful.

It deserves serious peer review. The contribution is a clear benchmark with concrete findings that can guide future work.

Referee Report

0 major / 2 minor

Summary. The paper introduces FlatSounds, a benchmark for auditing physical reasoning in video-to-audio (V2A) models via (1) controlled counterfactual pairs that vary a single physical factor and (2) single-video pattern tests that probe internal consistency and directional trends. Evaluation of state-of-the-art models shows consistent reliance on text captions over the visual stream: captions improve physical and semantic accuracy but degrade temporal alignment. Physics-based metrics are reported to correlate strongly with human preference judgments.

Significance. If the results hold, the work provides a needed shift from perceptual realism metrics toward controlled tests of physical correctness in generative models. The documented caption-vs-visual trade-off and the correlation between automated physics metrics and human preferences are actionable findings that can guide future model development toward direct pixel-based physical learning.

minor comments (2)

The abstract states that 'physics-based metrics correlate strongly with human preference tests on our own data' but does not report the exact correlation coefficient, sample size, or statistical test used; this detail should be added to the results section for reproducibility.
The section describing the FlatSounds construction (counterfactual pairs) would benefit from an explicit statement of how video-processing artifacts (for example, from time-warping) or model-internal confounders were ruled out, even if only as a limitations paragraph.
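
To make the first comment concrete, the sketch below shows one way a rank correlation between an automated metric and human preference could be computed and reported. Spearman's rho is an assumed choice, since the abstract does not say which statistic was used, and the per-model numbers are invented placeholders rather than results from the paper.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Placeholder values, one entry per model (not taken from the paper).
metric_scores = [0.2, 0.5, 0.4, 0.9]
human_win_rates = [0.10, 0.45, 0.50, 0.80]
rho = spearman(metric_scores, human_win_rates)
sample_size = len(metric_scores)
```

Reporting `rho` together with `sample_size` and the name of the statistic is the kind of detail the comment asks for.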

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our FlatSounds benchmark, its significance for shifting evaluation toward physical correctness in V2A models, and the recommendation of minor revision. No specific major comments were listed in the report.

Circularity Check

0 steps flagged

No significant circularity

The paper introduces the FlatSounds benchmark consisting of controlled counterfactual pairs and single-video pattern tests, then reports empirical observations on model behavior (text vs. visual reliance, accuracy vs. alignment trade-offs) and a correlation between physics metrics and human preferences. No derivation chain, equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on direct evaluation of existing models against the new benchmark rather than any self-referential reduction of results to inputs by construction. This is a standard empirical benchmarking study with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmarking paper with no mathematical derivations, free parameters, or invented entities described in the abstract.

Reference graph

Works this paper leans on

84 extracted references · 14 canonical work pages · 6 internal anchors

[1]

Counterfactual vision and language learning

Ehsan Abbasnejad, Damien Teney, Amin Parvaneh, Javen Shi, and Anton van den Hengel. Counterfactual vision and language learning. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), pages 10044–10054, 2020. 3

2020
[2]

Cosmos World Foundation Model Platform for Physical AI

Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foun- dation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

gpt-oss-120b & gpt-oss-20b Model Card

Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss- 20b model card.arXiv preprint arXiv:2508.10925, 2025. 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

The sound of water: Inferring physi- cal properties from pouring liquids

Piyush Bagad, Makarand Tapaswi, Cees GM Snoek, and Andrew Zisserman. The sound of water: Inferring physi- cal properties from pouring liquids. InIEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2025. 1

2025
[5]

The acquisition of physical knowledge in infancy: A summary in eight lessons.Blackwell Handbook of Childhood Cognitive Development, pages 47–83, 2002

Ren ´ee Baillargeon. The acquisition of physical knowledge in infancy: A summary in eight lessons.Blackwell Handbook of Childhood Cognitive Development, pages 47–83, 2002. 1

2002
[6]

Videophy: Evaluating phys- ical commonsense for video generation

Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai- Wei Chang, and Aditya Grover. Videophy: Evaluating phys- ical commonsense for video generation. InInternational Conference on Learning Representations (ICLR), 2025. 3

2025
[7]

MIT press, 1997

Jens Blauert.Spatial hearing: the psychophysics of human sound localization. MIT press, 1997. 3

1997
[8]

MIT press, 1994

Albert S Bregman.Auditory scene analysis: The perceptual organization of sound. MIT press, 1994. 3

1994
[9]

Human sensitivity to acoustic information from vessel filling.Journal of Experi- mental Psychology: Human Perception and Performance, 26 (1):313, 2000

Patrick A Cabe and John B Pittenger. Human sensitivity to acoustic information from vessel filling.Journal of Experi- mental Psychology: Human Perception and Performance, 26 (1):313, 2000. 2

2000
[10]

Vggsound: A large-scale audio-visual dataset

Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zis- serman. Vggsound: A large-scale audio-visual dataset. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 721–725, 2020. 2, 5, 9

2020
[11]

Video-guided foley sound generation with multimodal con- trols

Ziyang Chen, Prem Seetharaman, Bryan Russell, Oriol Ni- eto, David Bourgin, Andrew Owens, and Justin Salamon. Video-guided foley sound generation with multimodal con- trols. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 18770– 18781, 2025. 1

2025
[12]

Mmaudio: Taming multimodal joint training for high-quality video-to- audio synthesis

Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, and Yuki Mitsufuji. Mmaudio: Taming multimodal joint training for high-quality video-to- audio synthesis. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), pages 28901–28911, 2025. 1, 2, 3, 5, 9

2025
[13]

Pitch perception

Alain de Cheveign ´e. Pitch perception. InOxford Hand- book of Auditory Science: Hearing. Oxford University Press,
[14]

Physbench: Benchmarking and enhancing vision-language models for physical world understanding.arXiv preprint arXiv:2501.16411, 2025

Wei Chow, Jiageng Mao, Boyi Li, Daniel Seita, Vitor Guizilini, and Yue Wang. Physbench: Benchmarking and enhancing vision-language models for physical world under- standing.arXiv preprint arXiv:2501.16411, 2025. 3

work page arXiv 2025
[15]

A study of the just noticeable difference of early decay time for sym- phonic halls.The Journal of the Acoustical Society of Amer- ica, 151(1):80–94, 2022

Fernando del Solar Dorrego and Michelle C Vigeant. A study of the just noticeable difference of early decay time for sym- phonic halls.The Journal of the Acoustical Society of Amer- ica, 151(1):80–94, 2022. 3

2022
[16]

Cogview2: Faster and better text-to-image generation via hi- erarchical transformers

Ming Ding, Wendi Zheng, Wenyi Hong, and Jie Tang. Cogview2: Faster and better text-to-image generation via hi- erarchical transformers. InAdvances in Neural Information Processing Systems (NeurIPS), pages 16890–16902, 2022. 2

2022
[17]

Mit Press, 2010

Andy Farnell.Designing sound. Mit Press, 2010. 1, 3

2010
[18]

Auditory correlates of perceived mallet hard- ness for a set of recorded percussive sound events.The Jour- nal of the Acoustical Society of America, 87(1):311–322,

Daniel J Freed. Auditory correlates of perceived mallet hard- ness for a set of recorded percussive sound events.The Jour- nal of the Acoustical Society of America, 87(1):311–322,
[19]

What in the world do we hear?: An eco- logical approach to auditory event perception.Ecological Psychology, 5(1):1–29, 1993

William W Gaver. What in the world do we hear?: An eco- logical approach to auditory event perception.Ecological Psychology, 5(1):1–29, 1993. 3

1993
[20]

Shortcut learning in deep neural networks

Robert Geirhos, J ¨orn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Fe- lix A Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, 2020. 3

2020
[21]

Audio set: An ontology and human- labeled dataset for audio events

Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human- labeled dataset for audio events. InIEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 776–780, 2017. 2

2017
[22]

Giordano and Stephen McAdams

Bruno L. Giordano and Stephen McAdams. Material iden- tification of real impact sounds: Effects of size variation in steel, glass, wood, and plexiglass plates.The Journal of the Acoustical Society of America, 119(2):1171–1181, 2006. 4

2006
[23]

Imagebind: One embedding space to bind them all

Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), pages 15180–15190,
[24]

World Models

David Ha and J ¨urgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2(3), 2018. 1

work page internal anchor Pith review Pith/arXiv arXiv 2018
[25]

Av-link: Temporally-aligned diffusion features for cross-modal audio-video generation

Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Alper Canberk, Kwot Sin Lee, Vicente Ordonez, and Sergey Tulyakov. Av-link: Temporally-aligned diffusion features for cross-modal audio-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 19373–19385, 2025. 2

2025
[26]

Routledge, 2013

David Howard and Jamie Angus.Acoustics and psychoa- coustics. Routledge, 2013. 2

2013
[27]

Vbench: Comprehensive bench- mark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive bench- mark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21807–21818, 2024. 2, 3

2024
[28]

Taming visually guided sound generation

Vladimir Iashin and Esa Rahtu. Taming visually guided sound generation. InThe British Machine Vision Conference (BMVC), 2021. 2

2021
[29]

Synchformer: Efficient synchronization from sparse cues

Vladimir Iashin, Weidi Xie, Esa Rahtu, and Andrew Zisser- man. Synchformer: Efficient synchronization from sparse cues. InIEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5325–5329,
[30]

Krumhansl

Paul Iverson and Carol L. Krumhansl. Isolating the dynamic attributes of musical timbre.The Journal of the Acoustical Society of America, 94(5):2595–2603, 1993. 4

1993
[31]

Intro- ducing parselmouth: A python interface to praat.Journal of Phonetics, 71:1–15, 2018

Yannick Jadoul, Bill Thompson, and Bart De Boer. Intro- ducing parselmouth: A python interface to praat.Journal of Phonetics, 71:1–15, 2018. 3

2018
[32]

Clevr: A diagnostic dataset for compositional language and elementary visual reasoning

Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2901–2910, 2017. 3

2017
[33]

Frechet audio distance: A reference-free metric for evaluating music enhancement algorithms

Kevin Kilgour, Mauricio Zuluaga, Dominik Roblek, and Matthew Sharifi. Frechet audio distance: A reference-free metric for evaluating music enhancement algorithms. InIn- terspeech, 2019. 1, 2, 6

2019
[34]

Audiocaps: Generating captions for audios in the wild

Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. Audiocaps: Generating captions for audios in the wild. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computa- tional Linguistics: Human Language Technologies (NAACL- HLT), Volume 1 (Long and Short Papers), pages 119–132,

2019
[35]

Hearing shape.Journal of Experimental Psychology: Human Per- ception and Performance, 26(1):279, 2000

Andrew J Kunkler-Peck and Michael T Turvey. Hearing shape.Journal of Experimental Psychology: Human Per- ception and Performance, 26(1):279, 2000. 1

2000
[36]

A path towards autonomous machine intelli- gence version 0.9

Yann LeCun. A path towards autonomous machine intelli- gence version 0.9. 2, 2022-06-27.OpenReview, 62(1):1–62,

2022
[37]

Wiley-IEEE Press, Hoboken, N.J, 2 edition, 2023

Alexander Lerch.An Introduction to Audio Content Anal- ysis: Music Information Retrieval Tasks and Applications. Wiley-IEEE Press, Hoboken, N.J, 2 edition, 2023. 4

2023
[38]

Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median.Journal of Experimental Social Psychol- ogy, 49(4):764–766, 2013

Christophe Leys, Christophe Ley, Olivier Klein, Philippe Bernard, and Laurent Licata. Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median.Journal of Experimental Social Psychol- ogy, 49(4):764–766, 2013. 5

2013
[39]

Audi- oldm: Text-to-audio generation with latent diffusion models

Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D Plumbley. Audi- oldm: Text-to-audio generation with latent diffusion models. InInternational Conference on Machine Learning (ICML),
[40]

Thinksound: Chain-of- thought reasoning in multimodal large language models for audio generation and editing

Huadai Liu, Jialei Wang, Kaicheng Luo, Wen Wang, Qian Chen, Zhou Zhao, and Wei Xue. Thinksound: Chain-of- thought reasoning in multimodal large language models for audio generation and editing. InAdvances in Neural Infor- mation Processing Systems (NeurIPS), 2025. 1, 2, 5

2025
[41]

Tell what you hear from what you see-video to audio generation through text

Xiulong Liu, Kun Su, and Eli Shlizerman. Tell what you hear from what you see-video to audio generation through text. InAdvances in Neural Information Processing Systems (NeurIPS), pages 101337–101366, 2024. 2

2024
[42]

Diff-foley: Synchronized video-to-audio synthesis with la- tent diffusion models

Simian Luo, Chuanhao Yan, Chenxu Hu, and Hang Zhao. Diff-foley: Synchronized video-to-audio synthesis with la- tent diffusion models. InAdvances in Neural Information Processing Systems (NeurIPS), pages 48855–48876, 2023. 2

2023
[43]

Omni-captioner: Data pipeline, mod- els, and benchmark for omni detailed perception

Ziyang Ma, Ruiyang Xu, Zhenghao Xing, Yunfei Chu, Yux- uan Wang, Jinzheng He, Jin Xu, Pheng-Ann Heng, Kai Yu, Junyang Lin, et al. Omni-captioner: Data pipeline, mod- els, and benchmark for omni detailed perception. InInter- national Conference on Learning Representations (ICLR),
[44]

Sound texture perception via statistics of the auditory periphery: evidence from sound synthesis.Neuron, 71(5):926–940, 2011

Josh H McDermott and Eero P Simoncelli. Sound texture perception via statistics of the auditory periphery: evidence from sound synthesis.Neuron, 71(5):926–940, 2011. 1, 3

2011
[45]

Towards world simulator: Crafting physical commonsense-based benchmark for video generation

Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quan- feng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, and Ping Luo. Towards world simulator: Crafting physical commonsense-based benchmark for video generation. InIn- ternational Conference on Machine Learning (ICML), 2025. 3

2025
[46]

Brill, 2012

Brian CJ Moore.An introduction to the psychology of hear- ing. Brill, 2012. 3

2012
[47]

Do generative video models understand physical principles?

Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, and Robert Geirhos. Do generative video models understand physical principles?arXiv preprint arXiv:2501.09038, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

Synthesizing sounds from rigid-body simulations

James F O’Brien, Chen Shen, and Christine M Gatchalian. Synthesizing sounds from rigid-body simulations. InPro- ceedings of the 2002 ACM SIGGRAPH/Eurographics Sym- posium on Computer Animation, pages 175–181, 2002. 1, 3

2002
[49]

Toward verifiable and reproducible human evalu- ation for text-to-image generation

Mayu Otani, Riku Togashi, Yu Sawai, Ryosuke Ishigami, Yuta Nakashima, Esa Rahtu, Janne Heikkil ¨a, and Shin’ichi Satoh. Toward verifiable and reproducible human evalu- ation for text-to-image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14277–14286, 2023. 2

2023
[50]

Visually indicated sounds

Andrew Owens, Phillip Isola, Josh McDermott, Antonio Tor- ralba, Edward H Adelson, and William T Freeman. Visually indicated sounds. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), pages 2405–2413, 2016. 1, 2, 9

2016
[51]

Masked generative video-to-audio transformers with enhanced synchronicity

Santiago Pascual, Chunghsin Yeh, Ioannis Tsiamas, and Joan Serr`a. Masked generative video-to-audio transformers with enhanced synchronicity. InEuropean Conference on Com- puter Vision (ECCV), pages 247–264. Springer, 2024. 2

2024
[52]

The MIT press, 2017

Jonas Peters, Dominik Janzing, and Bernhard Sch ¨olkopf.El- ements of causal inference: foundations and learning algo- rithms. The MIT press, 2017. 2, 3

2017
[53]

Movie Gen: A Cast of Media Foundation Models

Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih- Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models.arXiv preprint arXiv:2410.13720,

work page internal anchor Pith review Pith/arXiv arXiv
[54]

Learn- ing transferable visual models from natural language super- vision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. InInternational Conference on Machine Learning (ICML), pages 8748–8763. PmLR, 2021. 2

2021
[55]

Improved techniques for training gans

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. InAdvances in Neural Information Pro- cessing Systems (NeurIPS), 2016. 6

2016
[56]

New method of measuring reverbera- tion time.The Journal of the Acoustical Society of America, 37(3):409–412, 1965

Manfred R Schroeder. New method of measuring reverbera- tion time.The Journal of the Acoustical Society of America, 37(3):409–412, 1965. 5

1965
[57]

Estimates of the regression coefficient based on kendall’s tau.Journal of the American Statistical Association, 63(324):1379–1389, 1968

Pranab Kumar Sen. Estimates of the regression coefficient based on kendall’s tau.Journal of the American Statistical Association, 63(324):1379–1389, 1968. 5

1968
[58]

Hunyuanvideo- foley: Multimodal diffusion with representation alignment for high-fidelity foley audio generation.arXiv preprint arXiv:2508.16930, 2025

Sizhe Shan, Qiulin Li, Yutao Cui, Miles Yang, Yuehai Wang, Qun Yang, Jin Zhou, and Zhao Zhong. Hunyuanvideo- foley: Multimodal diffusion with representation alignment for high-fidelity foley audio generation.arXiv preprint arXiv:2508.16930, 2025. 2, 5

work page arXiv 2025
[59]

I hear your true colors: Image guided audio generation

Roy Sheffer and Yossi Adi. I hear your true colors: Image guided audio generation. InIEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2023. 2

2023
[60]

The development of em- bodied cognition: Six lessons from babies.Artificial Life, 11 (1-2):13–29, 2005

Linda Smith and Michael Gasser. The development of em- bodied cognition: Six lessons from babies.Artificial Life, 11 (1-2):13–29, 2005. 1

2005
[61]

From vision to au- dio and beyond: A unified model for audio-visual representa- tion and generation

Kun Su, Xiulong Liu, and Eli Shlizerman. From vision to au- dio and beyond: A unified model for audio-visual representa- tion and generation. InInternational Conference on Machine Learning (ICML), 2024. 2

2024
[62]

Qwen3-vl: the multimodal large language model series.https://github.com/QwenLM/Qwen3- VL,

Qwen Team. Qwen3-vl: the multimodal large language model series.https://github.com/QwenLM/Qwen3- VL,
[63]

Temporally aligned audio for video with autoregression

Ilpo Viertola, Vladimir Iashin, and Esa Rahtu. Temporally aligned audio for video with autoregression. InIEEE Inter- national Conference on Acoustics, Speech and Signal Pro- cessing (ICASSP), pages 1–5, 2025. 2

2025
[64]

Evaluating generative audio systems and their metrics

Ashvala Vinay and Alexander Lerch. Evaluating generative audio systems and their metrics. InInternational Society for Music Information Retrieval Conference (ISMIR), pages 858–865, 2022. 2

[65] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, Yu Feng, Eric ... 2020.

[66] Gottfried von Bismarck. Sharpness as an attribute of the timbre of steady sounds. Acustica, 30:159–172, 1974.

[67] Jun Wang, Xijuan Zeng, Chunyu Qiang, Ruilong Chen, Shiyao Wang, Le Wang, Wangjing Zhou, Pengfei Cai, Jiahui Zhao, Nan Li, et al. Kling-foley: Multimodal diffusion transformer for high-quality video-to-audio generation. arXiv preprint arXiv:2506.19774, 2025.

[68] Renhao Wang, Haoran Geng, Tingle Li, Philipp Wu, Feishi Wang, Gopala Anumanchipalli, Trevor Darrell, Boyi Li, Pieter Abbeel, Jitendra Malik, et al. The sound of simulation: Learning multimodal sim-to-real robot policies with generative audio. In Proceedings of the 9th Conference on Robot Learning (CoRL), 2025.

[69] Yongqi Wang, Wenxiang Guo, Rongjie Huang, Jiawei Huang, Zehan Wang, Fuming You, Ruiqi Li, and Zhou Zhao. Frieren: Efficient video-to-audio generation network with rectified flow matching. In Advances in Neural Information Processing Systems (NeurIPS), pages 128118–128138,

[70] Zihan Wang, Songlin Li, Lingyan Hao, Xinyu Hu, and Bowen Song. What you see is what matters: A novel visual and physics-based metric for evaluating video generation quality. arXiv preprint arXiv:2411.13609, 2024.

[71] Peter Welch. The use of fast Fourier transform for the estimation of power spectra: A method based on time averaging over short, modified periodograms. IEEE Transactions on Audio and Electroacoustics, 15(2):70–73, 2003.

[72] Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2023.

[73] Zhen Xing, Qijun Feng, Haoran Chen, Qi Dai, Han Hu, Hang Xu, Zuxuan Wu, and Yu-Gang Jiang. A survey on video diffusion models. ACM Computing Surveys, 57(2):1–42, 2024.

[74] Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report. arXiv preprint arXiv:2509.17765, 2025.

[75] Manjie Xu, Chenxing Li, Xinyi Tu, Yong Ren, Rilin Chen, Yu Gu, Wei Liang, and Dong Yu. Video-to-audio generation with hidden alignment. arXiv preprint arXiv:2407.07464,

[76] Chenyu Zhang, Daniil Cherniavskii, Andrii Zadaianchuk, Antonios Tragoudaras, Antonios Vozikis, Thijmen Nijdam, Derck WE Prinzhorn, Mark Bodracska, Nicu Sebe, and Efstratios Gavves. Morpheus: Benchmarking physical reasoning of video generative models with real physical experiments. arXiv preprint arXiv:2504.02918, 2025.

[77] Guozhen Zhang, Zixiang Zhou, Teng Hu, Ziqiao Peng, Youliang Zhang, Yi Chen, Yuan Zhou, Qinglin Lu, and Limin Wang. Uniavgen: Unified audio and video generation with asymmetric cross-modal interactions, 2025.

[78] Yiming Zhang, Yicheng Gu, Yanhong Zeng, Zhening Xing, Yuancheng Wang, Zhizheng Wu, and Kai Chen. Foleycrafter: Bring silent videos to life with lifelike and synchronized sounds. arXiv preprint arXiv:2407.01494, 2024.

[79] Fengzhe Zhou, Jiannan Huang, Jialuo Li, Deva Ramanan, and Humphrey Shi. Pai-bench: A comprehensive benchmark for physical ai. arXiv preprint arXiv:2512.01989, 2025.

[80] Yipin Zhou, Zhaowen Wang, Chen Fang, Trung Bui, and Tamara L Berg. Visual to sound: Generating natural sound for videos in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3550–3558, 2018.
