Pith · machine review for the scientific record

arxiv: 2604.04496 · v1 · submitted 2026-04-06 · 💻 cs.CV

Recognition: 2 theorem links · Lean theorem

The Indra Representation Hypothesis for Multimodal Alignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 20:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords Indra representation · multimodal alignment · Yoneda embedding · foundation models · relational structure · cross-modal alignment · category theory

The pith

Unimodal foundation models converge on a shared relational structure that the Indra representation extracts for training-free multimodal alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that models trained on single modalities like vision or text are implicitly learning the same network of relations among data points. This shared structure, likened to Indra's Net, is more expressive than isolated sample embeddings. The authors use the V-enriched Yoneda embedding to turn each sample into a complete profile of its relations to all others under a cost function. Experiments show these Indra representations improve alignment and robustness when crossing between vision, language, and audio models without any retraining.

Core claim

The Indra Representation Hypothesis states that unimodal foundation models converge to implicitly reflect a shared relational structure underlying reality. This is formalized by defining the Indra representation of each sample as its relational profile with respect to all other samples, obtained via the V-enriched Yoneda embedding. The resulting representation is unique, complete, and structure-preserving under a given cost function, and when instantiated with angular distance it yields consistent gains in cross-model and cross-modal alignment.
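The formal device can be sketched in the Lawvere-metric instance of enrichment (assuming, as is standard for cost functions, that V is the poset ([0, ∞], ≥) with addition as tensor; the paper's exact choice of V may differ):

```latex
% A set of samples X with cost c : X \times X \to [0,\infty] is a
% V-category when c(x,x) = 0 and c(x,y) + c(y,z) \ge c(x,z).
% The enriched Yoneda embedding sends each sample to its relational profile:
\[
  y(x) \;=\; c(\,\cdot\,, x) \,:\, X \to [0,\infty].
\]
% The enriched Yoneda lemma makes y fully faithful, i.e. isometric
% (for symmetric c such as angular distance):
\[
  d_{\sup}\!\big(y(x),\, y(x')\big) \;=\; c(x, x'),
\]
% so a profile determines its sample up to isomorphism (uniqueness),
% records every relation (completeness), and preserves c (structure).
```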

What carries the argument

The V-enriched Yoneda embedding, which converts each sample into a relational profile of its distances or relations to every other sample under a chosen cost function.
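A minimal numerical sketch of that construction, assuming only that embeddings are rows of a matrix and that the cost function is the angular distance named in the abstract (function names here are illustrative, not the paper's code):

```python
import numpy as np

def indra_representation(E, anchors=None):
    # Each sample's relational profile: its angular distance (arccos of
    # cosine similarity, scaled to [0, 1]) to every anchor sample.
    A = E if anchors is None else anchors
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    An = A / np.linalg.norm(A, axis=1, keepdims=True)
    cos = np.clip(En @ An.T, -1.0, 1.0)
    return np.arccos(cos) / np.pi

rng = np.random.default_rng(0)
E = rng.normal(size=(8, 16))      # 8 samples, 16-dim embeddings
R = indra_representation(E)       # 8 x 8 matrix of relational profiles
assert R.shape == (8, 8)
assert np.allclose(np.diag(R), 0.0, atol=1e-6)  # zero self-distance
```

Each row of `R` is one sample's profile; downstream alignment then operates on these rows rather than on the raw embedding vectors.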

If this is right

  • Indra representations enable training-free alignment between different unimodal models.
  • They increase robustness when transferring across architectures and modalities.
  • The same relational profiles work for vision, language, and audio data.
  • Alignment becomes a matter of matching relational structures rather than raw embeddings.
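The last point can be made concrete. Under the hypothesis, two models that encode the same data differently should still assign near-identical relational profiles, so matching profiles aligns them with no training. A toy sketch in which an orthogonal rotation stands in for a second model (all names illustrative; the paper's actual matching procedure may differ):

```python
import numpy as np

def profiles(E, anchors):
    # Angular-distance profile of each row of E against the anchor set.
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    An = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    return np.arccos(np.clip(En @ An.T, -1.0, 1.0)) / np.pi

rng = np.random.default_rng(1)
Z = rng.normal(size=(20, 12))                  # shared underlying data
Q = np.linalg.qr(rng.normal(size=(12, 12)))[0] # random orthogonal map
Ea, Eb = Z @ Q, Z.copy()                       # two "models": rotated views

anchor_idx = np.arange(10)                     # paired anchors in both models
Pa = profiles(Ea, Ea[anchor_idx])
Pb = profiles(Eb, Eb[anchor_idx])

# Match each sample in model A to the nearest relational profile in model B.
match = np.argmin(((Pa[:, None] - Pb[None]) ** 2).sum(-1), axis=1)
# Rotation leaves angular distances unchanged, so matching is exact here.
print((match == np.arange(20)).mean())         # prints 1.0
```

Raw embedding vectors would disagree badly between `Ea` and `Eb`; the relational profiles coincide, which is the sense in which alignment reduces to matching relational structure.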

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be tested on additional modalities such as video or sensor data to check if the relational convergence holds more broadly.
  • It offers a way to compare models by their induced relation graphs instead of direct vector similarity.
  • If the hypothesis is correct, future foundation models might be designed explicitly to preserve these relational profiles from the start.
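The second extension, comparing models by their induced relation graphs, can be sketched in the spirit of representational similarity analysis (this is an editorial illustration, not a method from the paper):

```python
import numpy as np

def relation_matrix(E):
    # Pairwise angular distances: the relation graph a model induces on data.
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    return np.arccos(np.clip(En @ En.T, -1.0, 1.0)) / np.pi

def relational_agreement(Ea, Eb):
    # Correlation of off-diagonal relation entries between two models.
    Ra, Rb = relation_matrix(Ea), relation_matrix(Eb)
    off = ~np.eye(len(Ra), dtype=bool)
    return float(np.corrcoef(Ra[off], Rb[off])[0, 1])

rng = np.random.default_rng(3)
Z = rng.normal(size=(30, 8))
perturbed = Z + 0.05 * rng.normal(size=(30, 8))
print(relational_agreement(Z, perturbed))             # near 1: same graph
print(relational_agreement(Z, rng.normal(size=(30, 8))))  # near 0: unrelated
```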

Load-bearing premise

Unimodal foundation models are converging to implicitly reflect a shared relational structure underlying reality.

What would settle it

If replacing standard embeddings with Indra representations fails to improve alignment accuracy or robustness in cross-modal retrieval or classification tasks across vision, language, and audio, the hypothesis would be contradicted.

read the original abstract

Recent studies have uncovered an interesting phenomenon: unimodal foundation models tend to learn convergent representations, regardless of differences in architecture, training objectives, or data modalities. However, these representations are essentially internal abstractions of samples that characterize samples independently, leading to limited expressiveness. In this paper, we propose The Indra Representation Hypothesis, inspired by the philosophical metaphor of Indra's Net. We argue that representations from unimodal foundation models are converging to implicitly reflect a shared relational structure underlying reality, akin to the relational ontology of Indra's Net. We formalize this hypothesis using the V-enriched Yoneda embedding from category theory, defining the Indra representation as a relational profile of each sample with respect to others. This formulation is shown to be unique, complete, and structure-preserving under a given cost function. We instantiate the Indra representation using angular distance and evaluate it in cross-model and cross-modal scenarios involving vision, language, and audio. Extensive experiments demonstrate that Indra representations consistently enhance robustness and alignment across architectures and modalities, providing a theoretically grounded and practical framework for training-free alignment of unimodal foundation models. Our code is available at https://github.com/Jianglin954/Indra.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents the Indra Representation Hypothesis, which posits that representations learned by unimodal foundation models converge to reflect a shared relational structure underlying reality. The authors formalize this using the V-enriched Yoneda embedding from category theory, defining the Indra representation as a relational profile of each sample with respect to others under a cost function. They claim this formulation is unique, complete, and structure-preserving. The hypothesis is instantiated using angular distance as the cost function and evaluated in cross-model and cross-modal scenarios across vision, language, and audio modalities. Experiments reportedly show consistent improvements in robustness and alignment, offering a training-free framework for multimodal alignment, with code made available.

Significance. If the formal claims are rigorously proven and the experimental results hold with appropriate controls, this work could significantly advance multimodal learning by providing a category-theoretic foundation for aligning representations without retraining. The application of enriched category theory to capture relational profiles is an interesting approach. The provision of open-source code enhances reproducibility. However, the current lack of detailed derivations and quantitative experimental reporting reduces the immediate impact.

major comments (2)
  1. [§3] §3 (Formalization): The assertion that the Indra representation is 'unique, complete, and structure-preserving' under a given cost function is stated without derivation steps or proof sketches. In particular, it is not verified that the angular distance cost function induces a valid V-enrichment (e.g., satisfying compositionality of hom-objects) to which the enriched Yoneda lemma applies directly and yields the claimed properties.
  2. [§5] §5 (Experiments): The claims of consistent enhancement in robustness and alignment lack quantitative details, baselines, controls, statistical significance, or error bars. This undermines assessment of the magnitude and reliability of the reported improvements across architectures and modalities.
minor comments (2)
  1. The notation for the V-enriched category, hom-objects, and relational profile should be introduced with explicit equations early in the formalization section to improve clarity.
  2. [Abstract] The abstract refers to 'extensive experiments' but the manuscript should include a dedicated table or section summarizing datasets, models, and metrics used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential of the Indra Representation Hypothesis to advance multimodal alignment. We appreciate the opportunity to strengthen the manuscript and will incorporate detailed derivations and expanded experimental reporting in the revision.

read point-by-point responses
  1. Referee: [§3] §3 (Formalization): The assertion that the Indra representation is 'unique, complete, and structure-preserving' under a given cost function is stated without derivation steps or proof sketches. In particular, it is not verified that the angular distance cost function induces a valid V-enrichment (e.g., satisfying compositionality of hom-objects) to which the enriched Yoneda lemma applies directly and yields the claimed properties.

    Authors: We agree that explicit derivation steps and verification would improve rigor and clarity. The uniqueness, completeness, and structure-preserving properties follow directly from the V-enriched Yoneda lemma, which provides a fully faithful embedding that preserves the enriched hom-objects and relational structure. Angular distance defines a valid V-enrichment because it satisfies the metric axioms (non-negativity, symmetry, and the triangle inequality), allowing the hom-objects to form a V-category with appropriate compositionality. In the revised manuscript, we will add a dedicated subsection containing proof sketches that verify the V-enrichment axioms for angular distance and demonstrate how the enriched Yoneda lemma yields the stated properties without additional assumptions. revision: yes
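The metric-axiom claim in this response is easy to spot-check numerically; angular distance here is the arccos of cosine similarity, i.e. great-circle distance on the unit sphere (a sanity check on random data, not a proof):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 8))
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
D = np.arccos(np.clip(Xn @ Xn.T, -1.0, 1.0))  # angular distance, radians

assert (D >= -1e-9).all()                      # non-negativity
assert np.allclose(D, D.T)                     # symmetry
# Triangle inequality: D[i,k] <= D[i,j] + D[j,k] for all i, j, k.
assert (D[:, None, :] <= D[:, :, None] + D[None, :, :] + 1e-9).all()
```

Passing this check is necessary but not sufficient for the rebuttal's claim; the compositionality of hom-objects in the V-enriched sense is exactly the triangle inequality, but the promised proof sketch must still verify it symbolically.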

  2. Referee: [§5] §5 (Experiments): The claims of consistent enhancement in robustness and alignment lack quantitative details, baselines, controls, statistical significance, or error bars. This undermines assessment of the magnitude and reliability of the reported improvements across architectures and modalities.

    Authors: We acknowledge that the current experimental section would benefit from more granular quantitative reporting to allow full assessment of the results. The experiments evaluate Indra representations (instantiated with angular distance) against raw unimodal embeddings in cross-model and cross-modal settings across vision, language, and audio, measuring improvements in alignment and robustness metrics. In the revision, we will expand the section with detailed tables reporting specific numerical gains (e.g., mean percentage improvements in cosine similarity and robustness scores), explicit baselines (including original embeddings and alternative metrics such as Euclidean distance), controls for architecture and modality variations, statistical significance via paired t-tests with p-values, and error bars computed from multiple independent runs with different random seeds. revision: yes
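The promised reporting might look like the following sketch; the accuracy numbers are invented placeholders, and the paired t statistic is computed by hand rather than via any particular library:

```python
import numpy as np

# Hypothetical per-seed retrieval accuracies over 5 independent runs.
baseline = np.array([0.61, 0.59, 0.63, 0.60, 0.62])  # raw embeddings
indra    = np.array([0.69, 0.65, 0.71, 0.66, 0.70])  # Indra representations

diff = indra - baseline                      # paired differences per seed
mean_gain = diff.mean()
stderr = diff.std(ddof=1) / np.sqrt(len(diff))
t = mean_gain / stderr                       # paired t statistic
print(f"mean gain {mean_gain:.3f} +/- {stderr:.3f}, "
      f"paired t = {t:.2f} on {len(diff) - 1} df")
```

Reporting the per-seed pairs, the mean gain with its standard error, and the t statistic with degrees of freedom would address the referee's request directly.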

Circularity Check

0 steps flagged

No circularity: the formalization rests on an external category-theory theorem

full rationale

The derivation defines the Indra representation as the image of the V-enriched Yoneda embedding applied to samples equipped with a cost function. Uniqueness, completeness, and structure-preservation are direct consequences of the enriched Yoneda lemma, an independent result from category theory literature rather than a self-derived or fitted property. Angular distance is chosen as an explicit instantiation of the cost function, not a parameter tuned to the target metrics. Experimental evaluations on cross-model and cross-modal alignment tasks compare against external benchmarks and do not reduce to the input relational profiles by construction. No self-citations are invoked to justify the core claims, and the hypothesis is tested rather than assumed tautological.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on standard category-theory results for the Yoneda embedding plus the novel hypothesis that unimodal representations converge to relational structure; the main added entity is the Indra representation itself.

free parameters (1)
  • cost function
    The general cost function is part of the formalization; its concrete choice (angular distance) is an instantiation that affects the evaluated profiles.
axioms (1)
  • (standard math) The V-enriched Yoneda embedding is unique, complete, and structure-preserving
    Invoked directly in the formalization section of the abstract as a background result from category theory.
invented entities (1)
  • Indra representation (no independent evidence)
    purpose: Relational profile of each sample with respect to others that enables alignment
    New concept introduced to capture the hypothesized shared relational structure; no independent evidence outside the paper's experiments is supplied.

pith-pipeline@v0.9.0 · 5516 in / 1503 out tokens · 59303 ms · 2026-05-10T20:00:31.881251+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

78 extracted references · 9 canonical work pages · 6 internal anchors

  1. [1]

    Nocaps: Novel object captioning at scale

    Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. Nocaps: Novel object captioning at scale. InProceedings of the IEEE/CVF international conference on computer vision, pages 8948–8957, 2019

  2. [2]

    wav2vec 2.0: A framework for self-supervised learning of speech representations.Advances in neural information processing systems, 33:12449–12460, 2020

    Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations.Advances in neural information processing systems, 33:12449–12460, 2020

  3. [3]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

  4. [4]

    Review of particle physics.Physical Review D—Particles, Fields, Gravitation, and Cosmology, 86(1):010001, 2012

    Juerg Beringer, J-F Arguin, RM Barnett, K Copic, O Dahl, DE Groom, C-J Lin, J Lys, H Mu- rayama, CG Wohl, et al. Review of particle physics.Physical Review D—Particles, Fields, Gravitation, and Cosmology, 86(1):010001, 2012

  5. [5]

    Truth is universal: Robust detection of lies in llms.Advances in Neural Information Processing Systems, 37, 2024

    Lennart Bürger, Fred A Hamprecht, and Boaz Nadler. Truth is universal: Robust detection of lies in llms.Advances in Neural Information Processing Systems, 37, 2024

  6. [6]

    Wavlm: Large-scale self-supervised pre- training for full stack speech processing.IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518, 2022

    Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. Wavlm: Large-scale self-supervised pre- training for full stack speech processing.IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518, 2022

  7. [7]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. InInternational Conference on Machine Learning, pages 1597–1607. PMLR, 2020

  8. [8]

    Local graph convolutional networks for cross-modal hashing

    Yudong Chen, Sen Wang, Jianglin Lu, Zhi Chen, Zheng Zhang, and Zi Huang. Local graph convolutional networks for cross-modal hashing. InProceedings of the 29th ACM international conference on multimedia, pages 1921–1928, 2021

  9. [9]

    Shambhala Publications, Boston, 1993

    Thomas Cleary.The Flower Ornament Scripture: A Translation of the Avatamsaka Sutra. Shambhala Publications, Boston, 1993

  10. [10]

    Cook.Hua-Yen Buddhism: The Jewel Net of Indra

    F.H. Cook.Hua-Yen Buddhism: The Jewel Net of Indra. Iaswr Series. Pennsylvania State University Press, 1977

  11. [11]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1, pages 4171–4186, 2019

  12. [12]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations, 2021

  13. [13]

    Springer, 2006

    Eduardo J Dubuc.Kan extensions in enriched category theory, volume 145. Springer, 2006

  14. [14]

    Bernard Quaritch, London, 1859

    Michael Faraday.Experimental Researches in Electricity. Bernard Quaritch, London, 1859. Originally published as a series of papers between 1831 and 1855

  15. [15]

    Philological Society, Oxford, 1957

    John Rupert Firth, editor.Studies in Linguistic Analysis. Philological Society, Oxford, 1957. Special volume of the Philological Society

  16. [16]

    Timit acoustic-phonetic continuous speech corpus.(No Title), 1993

    John S Garofolo, Lori F Lamel, William M Fisher, David S Pallett, Nancy L Dahlgren, Victor Zue, and Jonathan G Fiscus. Timit acoustic-phonetic continuous speech corpus.(No Title), 1993

  17. [17]

    Gergen.Relational Being: Beyond Self and Community

    Kenneth J. Gergen.Relational Being: Beyond Self and Community. Oxford University Press, New York, 2009. 10

  18. [18]

    John Wiley & Sons, 2020

    David Griffiths.Introduction to elementary particles. John Wiley & Sons, 2020

  19. [19]

    Bootstrap your own latent-a new approach to self-supervised learning

    Jean-Bastien Grill, Florian Strub, Florent Altmann, Corentin Tallec, Pierre-Henri Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. InAdvances in Neural Information Processing Systems, volume 33, pages 21271–21283, 2020

  20. [20]

    arXiv preprint arXiv:2401.12181 , year=

    Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway, Neel Nanda, and Dimitris Bertsimas. Universal neurons in gpt2 language models.arXiv preprint arXiv:2401.12181, 2024

  21. [21]

    Inductive representation learning on large graphs

    Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. InAdvances in Neural Information Processing Systems, 2017

  22. [22]

    Indra’s net, 2022

    Harvard FAS CAMLab. Indra’s net, 2022

  23. [23]

    Momentum contrast for unsupervised visual representation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020

  24. [24]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  25. [25]

    DeBERTa: Decoding-enhanced BERT with Disentangled Attention

    Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with disentangled attention.arXiv preprint arXiv:2006.03654, 2020

  26. [26]

    CLIPScore: A Reference-free Evaluation Metric for Image Captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning.arXiv preprint arXiv:2104.08718, 2021

  27. [27]

    Universality of representation in biological and artificial neural networks.bioRxiv, 2024

    Eghbal Hosseini, Colton Casto, Noga Zaslavsky, Colin Conwell, Mark Richardson, and Evelina Fedorenko. Universality of representation in biological and artificial neural networks.bioRxiv, 2024

  28. [28]

    Hubert: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460, 2021

    Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460, 2021

  29. [29]

    The platonic representa- tion hypothesis

    Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representa- tion hypothesis. InProceedings of the 41st International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR, 2024

  30. [30]

    Product quantization for nearest neighbor search.IEEE transactions on pattern analysis and machine intelligence, 33(1):117–128, 2010

    Herve Jegou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search.IEEE transactions on pattern analysis and machine intelligence, 33(1):117–128, 2010

  31. [31]

    Billion-scale similarity search with gpus.IEEE Transactions on Big Data, 7(3):535–547, 2019

    Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with gpus.IEEE Transactions on Big Data, 7(3):535–547, 2019

  32. [32]

    Kasulis.Engaging Japanese Philosophy: A Short History

    Thomas P. Kasulis.Engaging Japanese Philosophy: A Short History. University of Hawai’i Press, Honolulu, 2018

  33. [33]

    Cambridge University Press, Cambridge, 1982

    Gregory Maxwell Kelly.Basic Concepts of Enriched Category Theory, volume 64 ofLondon Mathematical Society Lecture Note Series. Cambridge University Press, Cambridge, 1982

  34. [34]

    Privileged representational axes in biological and artificial neural networks.bioRxiv, pages 2024–06, 2024

    Meenakshi Khosla, Alex H Williams, Josh McDermott, and Nancy Kanwisher. Privileged representational axes in biological and artificial neural networks.bioRxiv, pages 2024–06, 2024

  35. [35]

    Kipf and Max Welling

    Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. InInternational Conference on Learning Representations, 2017

  36. [36]

    Grounding language models to images for multimodal inputs and outputs

    Jing Yu Koh, Ruslan Salakhutdinov, and Daniel Fried. Grounding language models to images for multimodal inputs and outputs. InInternational Conference on Machine Learning, pages 17283–17300. PMLR, 2023. 11

  37. [37]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. Technical Report

  38. [38]

    Imagenet classification with deep convolutional neural networks.Advances in neural information processing systems, 25, 2012

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks.Advances in neural information processing systems, 25, 2012

  39. [39]

    Metric spaces, generalized logic, and closed categories.Rendiconti del seminario matématico e fisico di Milano, 43:135–166, 1973

    F William Lawvere. Metric spaces, generalized logic, and closed categories.Rendiconti del seminario matématico e fisico di Milano, 43:135–166, 1973

  40. [40]

    Mitra, A., Khanpour, H., Rosset, C., and Awadallah, A

    Andrew Lee, Melanie Weber, Fernanda Viégas, and Martin Wattenberg. Shared global and local geometry of language model embeddings.arXiv preprint arXiv:2503.21073, 2025

  41. [41]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023

  42. [42]

    Blip: Bootstrapping language- image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language- image pre-training for unified vision-language understanding and generation. InInternational conference on machine learning, pages 12888–12900. PMLR, 2022

  43. [43]

    Deeper insights into graph convolutional networks for semi-supervised learning

    Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. InProceedings of the AAAI conference on artificial intelligence, volume 32, 2018

  44. [44]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InComputer vision–ECCV 2014: 13th European conference, zurich, Switzerland, September 6-12, 2014, proceedings, part v 13, pages 740–755. Springer, 2014

  45. [45]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

  46. [46]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach.arXiv preprint arXiv:1907.11692, 2019

  47. [47]

    A convnet for the 2020s.Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

    Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s.Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  48. [48]

    Scale-free graph-language models

    Jianglin Lu, Yixuan Liu, Yitian Zhang, and Yun Fu. Scale-free graph-language models. InThe Thirteenth International Conference on Learning Representations, 2025

  49. [49]

    Representation potentials of foundation models for multimodal alignment: A survey

    Jianglin Lu, Hailing Wang, Yi Xu, Yizhou Wang, Kuo Yang, and Yun Fu. Representation potentials of foundation models for multimodal alignment: A survey. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 16669–16684. Association for Computational Linguistics, 2025

  50. [50]

    Low-rank adaptive graph embedding for unsupervised feature extraction.Pattern Recognition, 113:107758, 2021

    Jianglin Lu, Hailing Wang, Jie Zhou, Yudong Chen, Zhihui Lai, and Qinghua Hu. Low-rank adaptive graph embedding for unsupervised feature extraction.Pattern Recognition, 113:107758, 2021

  51. [51]

    Latent graph inference with limited supervision

    Jianglin Lu, Yi Xu, Huan Wang, Yue Bai, and Yun Fu. Latent graph inference with limited supervision. InAdvances in Neural Information Processing Systems, 2023

  52. [52]

    O’Connor

    Mayug Maniparambil, Raiymbek Akshulakov, Yasser Abdelaziz Dahou Djilali, Mohamed El Amine Seddik, Sanath Narayan, Karttikeya Mangalam, and Noel E. O’Connor. Do vision and language encoders represent the world similarly? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024

  53. [53]

    Culture and the self: Implications for cognition, emotion, and motivation

    Hazel Rose Markus and Shinobu Kitayama. Culture and the self: Implications for cognition, emotion, and motivation. InCollege student development and academic life, pages 264–293. Routledge, 2014. 12

  54. [54]

    Clarendon Press, Oxford, first edition, 1873

    James Clerk Maxwell.A Treatise on Electricity and Magnetism. Clarendon Press, Oxford, first edition, 1873

  55. [55]

    Linearly mapping from image to text space

    Jack Merullo, Louis Castricato, Carsten Eickhoff, and Ellie Pavlick. Linearly mapping from image to text space. InThe Eleventh International Conference on Learning Representations, 2023

  56. [56]

    Distributed repre- sentations of words and phrases and their compositionality.Advances in neural information processing systems, 26, 2013

    Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed repre- sentations of words and phrases and their compositionality.Advances in neural information processing systems, 26, 2013

  57. [57]

    Relative representations enable zero-shot latent space communication

    Luca Moschella, Valentino Maiorca, Marco Fumero, Antonio Norelli, Francesco Locatello, and Emanuele Rodolà. Relative representations enable zero-shot latent space communication. In The Eleventh International Conference on Learning Representations, 2023

  58. [58]

    Can language models learn to listen? InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10083–10093, 2023

    Evonne Ng, Sanjay Subramanian, Dan Klein, Angjoo Kanazawa, Trevor Darrell, and Shiry Ginosar. Can language models learn to listen? InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10083–10093, 2023

  59. [59]

    What do language models hear? probing for auditory representations in language models

    Jerry Ngo and Yoon Kim. What do language models hear? probing for auditory representations in language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 5435–5448, 2024

  60. [60]

    Gpt-4 technical report, 2024

    OpenAI. Gpt-4 technical report, 2024

  61. [61]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

  62. [62]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021

  63. [63]

    Robust speech recognition via large-scale weak supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InInternational conference on machine learning, pages 28492–28518. PMLR, 2023

  64. [64]

    Courier Dover Publications, 2017

    Emily Riehl.Category theory in context. Courier Dover Publications, 2017

  65. [65]

    Geoffrey Roeder, Luke Metz, and Durk Kingma. On linear identifiability of learned representations. In International Conference on Machine Learning, pages 9030–9039. PMLR, 2021

  66. [66]

    Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. wav2vec: Unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862, 2019

  67. [67]

    Pratyusha Sharma, Tamar Rott Shaham, Manel Baradad, Stephanie Fu, Adrian Rodriguez-Munoz, Shivam Duggal, Phillip Isola, and Antonio Torralba. A vision check-up for language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14410–14419, 2024

  68. [68]

    Daniel Tan, David Chanin, Aengus Lynch, Brooks Paige, Dimitrios Kanoulas, Adrià Garriga-Alonso, and Robert Kirk. Analysing the generalisation and reliability of steering vectors. Advances in Neural Information Processing Systems, 37, 2024

  69. [69]

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  70. [70]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 2017

  71. [71]

    Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, Yoshua Bengio, et al. Graph attention networks. stat, 1050(20):10–48550, 2017

  72. [72]

    Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5018–5027, 2017

  73. [73]

    Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol, and Léon Bottou. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(12), 2010

  74. [74]

    Junxuan Wang, Xuyang Ge, Wentao Shu, Qiong Tang, Yunhua Zhou, Zhengfu He, and Xipeng Qiu. Towards universality: Studying mechanistic similarity across language model architectures. In The Thirteenth International Conference on Learning Representations, 2025

  75. [75]

    Yangbo Wang, Jie Zhou, Mingli Song, Yue Guo, and Jianglin Lu. Fuzzy multi-subspace clustering. IEEE Transactions on Fuzzy Systems, pages 1–14, 2026

  76. [76]

    John Wentworth. Testing the natural abstraction hypothesis. AI Alignment Forum, 2021

  77. [77]

    Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, and Saining Xie. ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  78. [78]

    Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023