Recognition: 2 theorem links
· Lean theorem
The Indra Representation Hypothesis for Multimodal Alignment
Pith reviewed 2026-05-10 20:00 UTC · model grok-4.3
The pith
Unimodal foundation models converge on a shared relational structure that the Indra representation extracts for training-free multimodal alignment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Indra Representation Hypothesis states that unimodal foundation models converge to implicitly reflect a shared relational structure underlying reality. This is formalized by defining the Indra representation of each sample as its relational profile with respect to all other samples, obtained via the V-enriched Yoneda embedding. The resulting representation is unique, complete, and structure-preserving under a given cost function, and when instantiated with angular distance it yields consistent gains in cross-model and cross-modal alignment.
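In symbols (our paraphrase of the construction; $d$ is the chosen cost function over $n$ samples and $h_{X_i}$ the enriched Yoneda image of sample $X_i$):

```latex
h_{X_i}(X_j) = d(X_j, X_i), \qquad
\mathrm{Indra}(X_i) = \bigl(\, d(X_1, X_i),\; d(X_2, X_i),\; \dots,\; d(X_n, X_i) \,\bigr)
```

Each sample is thus identified with the vector of its costs to every other sample, which is the relational profile the alignment machinery operates on.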
What carries the argument
The V-enriched Yoneda embedding, which converts each sample into a relational profile of its distances or relations to every other sample under a chosen cost function.
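A minimal numerical sketch of that construction, using the paper's angular-distance instantiation (function names are ours, not the released code's):

```python
import numpy as np

def angular_distance(a, b):
    """Pairwise angular distance between rows of a and rows of b, scaled to [0, 1]."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    cos = np.clip(a @ b.T, -1.0, 1.0)  # guard against rounding outside [-1, 1]
    return np.arccos(cos) / np.pi

def indra_representation(samples, anchors):
    """Relational profile: each sample is described by its distances to the anchor set."""
    return angular_distance(samples, anchors)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))          # 5 samples with 16-dim embeddings
profile = indra_representation(X, X)  # profile of each sample w.r.t. all samples
```

The resulting profile matrix is symmetric with a zero diagonal, matching the hom-object reading d(Xi, Xi) = 0.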
If this is right
- Indra representations enable training-free alignment between different unimodal models.
- They increase robustness when transferring across architectures and modalities.
- The same relational profiles work for vision, language, and audio data.
- Alignment becomes a matter of matching relational structures rather than raw embeddings.
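The last point can be made concrete with a toy experiment: if two hypothetical "models" embed the same samples up to a rotation, their raw embedding vectors differ, but their angular relational profiles coincide, so matching profiles recovers the cross-model correspondence. This is a sketch under that idealized assumption, not the paper's experimental setup:

```python
import numpy as np

def profiles(Z):
    """Angular-distance relational profile of each row of Z against all rows."""
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    return np.arccos(np.clip(Z @ Z.T, -1.0, 1.0)) / np.pi

rng = np.random.default_rng(1)
base = rng.normal(size=(8, 32))
# two hypothetical "models": rotated copies of the same latent geometry
Q1, _ = np.linalg.qr(rng.normal(size=(32, 32)))
Q2, _ = np.linalg.qr(rng.normal(size=(32, 32)))
A, B = base @ Q1, base @ Q2

# match each sample of model A to a sample of model B by nearest relational profile
PA, PB = profiles(A), profiles(B)
matches = np.argmin(((PA[:, None, :] - PB[None, :, :]) ** 2).sum(-1), axis=1)
```

Because rotations preserve inner products, the two profile matrices agree and `matches` recovers the identity permutation; direct nearest-neighbor search on `A` and `B` themselves would not.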
Where Pith is reading between the lines
- The approach could be tested on additional modalities such as video or sensor data to check if the relational convergence holds more broadly.
- It offers a way to compare models by their induced relation graphs instead of direct vector similarity.
- If the hypothesis is correct, future foundation models might be designed explicitly to preserve these relational profiles from the start.
Load-bearing premise
Unimodal foundation models are converging to implicitly reflect a shared relational structure underlying reality.
What would settle it
If replacing standard embeddings with Indra representations fails to improve alignment accuracy or robustness in cross-modal retrieval or classification tasks across vision, language, and audio, the hypothesis would be contradicted.
Original abstract
Recent studies have uncovered an interesting phenomenon: unimodal foundation models tend to learn convergent representations, regardless of differences in architecture, training objectives, or data modalities. However, these representations are essentially internal abstractions of samples that characterize samples independently, leading to limited expressiveness. In this paper, we propose The Indra Representation Hypothesis, inspired by the philosophical metaphor of Indra's Net. We argue that representations from unimodal foundation models are converging to implicitly reflect a shared relational structure underlying reality, akin to the relational ontology of Indra's Net. We formalize this hypothesis using the V-enriched Yoneda embedding from category theory, defining the Indra representation as a relational profile of each sample with respect to others. This formulation is shown to be unique, complete, and structure-preserving under a given cost function. We instantiate the Indra representation using angular distance and evaluate it in cross-model and cross-modal scenarios involving vision, language, and audio. Extensive experiments demonstrate that Indra representations consistently enhance robustness and alignment across architectures and modalities, providing a theoretically grounded and practical framework for training-free alignment of unimodal foundation models. Our code is available at https://github.com/Jianglin954/Indra.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents the Indra Representation Hypothesis, which posits that representations learned by unimodal foundation models converge to reflect a shared relational structure underlying reality. The authors formalize this using the V-enriched Yoneda embedding from category theory, defining the Indra representation as a relational profile of each sample with respect to others under a cost function. They claim this formulation is unique, complete, and structure-preserving. The hypothesis is instantiated using angular distance as the cost function and evaluated in cross-model and cross-modal scenarios across vision, language, and audio modalities. Experiments reportedly show consistent improvements in robustness and alignment, offering a training-free framework for multimodal alignment, with code made available.
Significance. If the formal claims are rigorously proven and the experimental results hold with appropriate controls, this work could significantly advance multimodal learning by providing a category-theoretic foundation for aligning representations without retraining. The application of enriched category theory to capture relational profiles is an interesting approach. The provision of open-source code enhances reproducibility. However, the current lack of detailed derivations and quantitative experimental reporting reduces the immediate impact.
major comments (2)
- [§3] §3 (Formalization): The assertion that the Indra representation is 'unique, complete, and structure-preserving' under a given cost function is stated without derivation steps or proof sketches. In particular, it is not verified that the angular distance cost function induces a valid V-enrichment (e.g., satisfying compositionality of hom-objects) to which the enriched Yoneda lemma applies directly and yields the claimed properties.
- [§5] §5 (Experiments): The claims of consistent enhancement in robustness and alignment lack quantitative details, baselines, controls, statistical significance, or error bars. This undermines assessment of the magnitude and reliability of the reported improvements across architectures and modalities.
minor comments (2)
- The notation for the V-enriched category, hom-objects, and relational profile should be introduced with explicit equations early in the formalization section to improve clarity.
- [Abstract] The abstract refers to 'extensive experiments' but the manuscript should include a dedicated table or section summarizing datasets, models, and metrics used.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the potential of the Indra Representation Hypothesis to advance multimodal alignment. We appreciate the opportunity to strengthen the manuscript and will incorporate detailed derivations and expanded experimental reporting in the revision.
Point-by-point responses
-
Referee: [§3] §3 (Formalization): The assertion that the Indra representation is 'unique, complete, and structure-preserving' under a given cost function is stated without derivation steps or proof sketches. In particular, it is not verified that the angular distance cost function induces a valid V-enrichment (e.g., satisfying compositionality of hom-objects) to which the enriched Yoneda lemma applies directly and yields the claimed properties.
Authors: We agree that explicit derivation steps and verification would improve rigor and clarity. The uniqueness, completeness, and structure-preserving properties follow directly from the V-enriched Yoneda lemma, which provides a fully faithful embedding that preserves the enriched hom-objects and relational structure. Angular distance defines a valid V-enrichment because it satisfies the metric axioms (non-negativity, symmetry, and the triangle inequality), allowing the hom-objects to form a V-category with appropriate compositionality. In the revised manuscript, we will add a dedicated subsection containing proof sketches that verify the V-enrichment axioms for angular distance and demonstrate how the enriched Yoneda lemma yields the stated properties without additional assumptions. revision: yes
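The metric-axiom claim in this response can at least be probed numerically. The sketch below checks non-negativity, symmetry, and the triangle inequality for angular distance on random vectors; it is a sanity check, not a proof of the V-enrichment conditions:

```python
import numpy as np

def ang(u, v):
    """Angular distance (geodesic distance on the unit sphere) between u and v."""
    u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)
    return np.arccos(np.clip(u @ v, -1.0, 1.0))

rng = np.random.default_rng(2)
for x, y, z in rng.normal(size=(200, 3, 8)):
    dxy, dyz, dxz = ang(x, y), ang(y, z), ang(x, z)
    assert dxy >= 0.0                    # non-negativity
    assert abs(dxy - ang(y, x)) < 1e-12  # symmetry
    assert dxz <= dxy + dyz + 1e-9       # triangle inequality
all_ok = True
```

The triangle inequality holds because angular distance is the geodesic metric on the unit sphere; the loop merely exercises it on random triples.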
-
Referee: [§5] §5 (Experiments): The claims of consistent enhancement in robustness and alignment lack quantitative details, baselines, controls, statistical significance, or error bars. This undermines assessment of the magnitude and reliability of the reported improvements across architectures and modalities.
Authors: We acknowledge that the current experimental section would benefit from more granular quantitative reporting to allow full assessment of the results. The experiments evaluate Indra representations (instantiated with angular distance) against raw unimodal embeddings in cross-model and cross-modal settings across vision, language, and audio, measuring improvements in alignment and robustness metrics. In the revision, we will expand the section with detailed tables reporting specific numerical gains (e.g., mean percentage improvements in cosine similarity and robustness scores), explicit baselines (including original embeddings and alternative metrics such as Euclidean distance), controls for architecture and modality variations, statistical significance via paired t-tests with p-values, and error bars computed from multiple independent runs with different random seeds. revision: yes
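As a sketch of the promised reporting, a paired comparison over seeds reduces to a mean gain, a standard error for the error bar, and a paired t statistic. The numbers below are illustrative placeholders, not results from the paper:

```python
import numpy as np

def paired_t(a, b):
    """Paired t statistic for per-seed scores of two methods."""
    d = np.asarray(a) - np.asarray(b)
    return d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

# hypothetical per-seed retrieval accuracies (placeholder values)
indra = np.array([0.71, 0.73, 0.72, 0.74, 0.70])
raw   = np.array([0.65, 0.66, 0.64, 0.67, 0.65])

mean_gain = (indra - raw).mean()
stderr = (indra - raw).std(ddof=1) / np.sqrt(len(indra))  # error bar for the gain
t = paired_t(indra, raw)  # compare to a t distribution with len(indra)-1 dof for a p-value
```

Reporting `mean_gain ± stderr` per setting, alongside the t statistic and its p-value, is the minimal form the referee's request implies.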
Circularity Check
No circularity: formalization rests on external category theory theorem
Full rationale
The derivation defines the Indra representation as the image of the V-enriched Yoneda embedding applied to samples equipped with a cost function. Uniqueness, completeness, and structure-preservation are direct consequences of the enriched Yoneda lemma, an independent result from category theory literature rather than a self-derived or fitted property. Angular distance is chosen as an explicit instantiation of the cost function, not a parameter tuned to the target metrics. Experimental evaluations on cross-model and cross-modal alignment tasks compare against external benchmarks and do not reduce to the input relational profiles by construction. No self-citations are invoked to justify the core claims, and the hypothesis is tested rather than assumed tautological.
Axiom & Free-Parameter Ledger
free parameters (1)
- cost function
axioms (1)
- [standard math] The V-enriched Yoneda embedding is unique, complete, and structure-preserving
invented entities (1)
- Indra representation (no independent evidence)
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (echoes)
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Definition 1 (Sample Category). ... Hom-objects: ... cost function d(Xi,Xj) ... composition ... triangle inequality. ... V-enriched Yoneda embedding ... hXi(Xj)=d(Xj,Xi)
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability (echoes)
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Theorem 1. The V-enriched Yoneda embedding Y ... is V-fully faithful. ... each sample Xi can be uniquely represented by its cost vector d(·,Xi)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Nocaps: Novel object captioning at scale
Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. Nocaps: Novel object captioning at scale. InProceedings of the IEEE/CVF international conference on computer vision, pages 8948–8957, 2019
2019
-
[2]
wav2vec 2.0: A framework for self-supervised learning of speech representations.Advances in neural information processing systems, 33:12449–12460, 2020
Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations.Advances in neural information processing systems, 33:12449–12460, 2020
2020
-
[3]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023
2023
-
[4]
Review of particle physics.Physical Review D—Particles, Fields, Gravitation, and Cosmology, 86(1):010001, 2012
Juerg Beringer, J-F Arguin, RM Barnett, K Copic, O Dahl, DE Groom, C-J Lin, J Lys, H Murayama, CG Wohl, et al. Review of particle physics. Physical Review D—Particles, Fields, Gravitation, and Cosmology, 86(1):010001, 2012
2012
-
[5]
Truth is universal: Robust detection of lies in llms.Advances in Neural Information Processing Systems, 37, 2024
Lennart Bürger, Fred A Hamprecht, and Boaz Nadler. Truth is universal: Robust detection of lies in llms.Advances in Neural Information Processing Systems, 37, 2024
2024
-
[6]
WavLM: Large-scale self-supervised pre-training for full stack speech processing
Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518, 2022
2022
-
[7]
A simple framework for contrastive learning of visual representations
Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. InInternational Conference on Machine Learning, pages 1597–1607. PMLR, 2020
2020
-
[8]
Local graph convolutional networks for cross-modal hashing
Yudong Chen, Sen Wang, Jianglin Lu, Zhi Chen, Zheng Zhang, and Zi Huang. Local graph convolutional networks for cross-modal hashing. InProceedings of the 29th ACM international conference on multimedia, pages 1921–1928, 2021
2021
-
[9]
The Flower Ornament Scripture: A Translation of the Avatamsaka Sutra
Thomas Cleary.The Flower Ornament Scripture: A Translation of the Avatamsaka Sutra. Shambhala Publications, Boston, 1993
1993
-
[10]
Hua-Yen Buddhism: The Jewel Net of Indra
F.H. Cook.Hua-Yen Buddhism: The Jewel Net of Indra. Iaswr Series. Pennsylvania State University Press, 1977
1977
-
[11]
Bert: Pre-training of deep bidirectional transformers for language understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1, pages 4171–4186, 2019
2019
-
[12]
An image is worth 16x16 words: Transformers for image recognition at scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations, 2021
2021
-
[13]
Kan extensions in enriched category theory
Eduardo J Dubuc.Kan extensions in enriched category theory, volume 145. Springer, 2006
2006
-
[14]
Experimental Researches in Electricity
Michael Faraday.Experimental Researches in Electricity. Bernard Quaritch, London, 1859. Originally published as a series of papers between 1831 and 1855
-
[15]
Studies in Linguistic Analysis
John Rupert Firth, editor.Studies in Linguistic Analysis. Philological Society, Oxford, 1957. Special volume of the Philological Society
1957
-
[16]
Timit acoustic-phonetic continuous speech corpus.(No Title), 1993
John S Garofolo, Lori F Lamel, William M Fisher, David S Pallett, Nancy L Dahlgren, Victor Zue, and Jonathan G Fiscus. Timit acoustic-phonetic continuous speech corpus.(No Title), 1993
1993
-
[17]
Relational Being: Beyond Self and Community
Kenneth J. Gergen. Relational Being: Beyond Self and Community. Oxford University Press, New York, 2009
2009
-
[18]
Introduction to elementary particles
David Griffiths.Introduction to elementary particles. John Wiley & Sons, 2020
2020
-
[19]
Bootstrap your own latent-a new approach to self-supervised learning
Jean-Bastien Grill, Florian Strub, Florent Altmann, Corentin Tallec, Pierre-Henri Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. InAdvances in Neural Information Processing Systems, volume 33, pages 21271–21283, 2020
2020
-
[20]
Universal neurons in gpt2 language models
Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway, Neel Nanda, and Dimitris Bertsimas. Universal neurons in gpt2 language models.arXiv preprint arXiv:2401.12181, 2024
-
[21]
Inductive representation learning on large graphs
Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. InAdvances in Neural Information Processing Systems, 2017
2017
-
[22]
Indra’s net, 2022
Harvard FAS CAMLab. Indra’s net, 2022
2022
-
[23]
Momentum contrast for unsupervised visual representation learning
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020
2020
-
[24]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016
2016
-
[25]
DeBERTa: Decoding-enhanced BERT with Disentangled Attention
Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with disentangled attention.arXiv preprint arXiv:2006.03654, 2020
2020
-
[26]
CLIPScore: A Reference-free Evaluation Metric for Image Captioning
Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning.arXiv preprint arXiv:2104.08718, 2021
2021
-
[27]
Universality of representation in biological and artificial neural networks.bioRxiv, 2024
Eghbal Hosseini, Colton Casto, Noga Zaslavsky, Colin Conwell, Mark Richardson, and Evelina Fedorenko. Universality of representation in biological and artificial neural networks.bioRxiv, 2024
2024
-
[28]
Hubert: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460, 2021
Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460, 2021
2021
-
[29]
The platonic representation hypothesis
Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representation hypothesis. In Proceedings of the 41st International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR, 2024
2024
-
[30]
Product quantization for nearest neighbor search.IEEE transactions on pattern analysis and machine intelligence, 33(1):117–128, 2010
Herve Jegou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search.IEEE transactions on pattern analysis and machine intelligence, 33(1):117–128, 2010
2010
-
[31]
Billion-scale similarity search with gpus.IEEE Transactions on Big Data, 7(3):535–547, 2019
Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with gpus.IEEE Transactions on Big Data, 7(3):535–547, 2019
2019
-
[32]
Engaging Japanese Philosophy: A Short History
Thomas P. Kasulis.Engaging Japanese Philosophy: A Short History. University of Hawai’i Press, Honolulu, 2018
2018
-
[33]
Basic Concepts of Enriched Category Theory
Gregory Maxwell Kelly.Basic Concepts of Enriched Category Theory, volume 64 ofLondon Mathematical Society Lecture Note Series. Cambridge University Press, Cambridge, 1982
1982
-
[34]
Privileged representational axes in biological and artificial neural networks.bioRxiv, pages 2024–06, 2024
Meenakshi Khosla, Alex H Williams, Josh McDermott, and Nancy Kanwisher. Privileged representational axes in biological and artificial neural networks.bioRxiv, pages 2024–06, 2024
2024
-
[35]
Semi-supervised classification with graph convolutional networks
Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. InInternational Conference on Learning Representations, 2017
2017
-
[36]
Grounding language models to images for multimodal inputs and outputs
Jing Yu Koh, Ruslan Salakhutdinov, and Daniel Fried. Grounding language models to images for multimodal inputs and outputs. In International Conference on Machine Learning, pages 17283–17300. PMLR, 2023
2023
-
[37]
Learning multiple layers of features from tiny images
Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. Technical Report
2009
-
[38]
Imagenet classification with deep convolutional neural networks.Advances in neural information processing systems, 25, 2012
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks.Advances in neural information processing systems, 25, 2012
2012
-
[39]
Metric spaces, generalized logic, and closed categories.Rendiconti del seminario matématico e fisico di Milano, 43:135–166, 1973
F William Lawvere. Metric spaces, generalized logic, and closed categories.Rendiconti del seminario matématico e fisico di Milano, 43:135–166, 1973
1973
-
[40]
Shared global and local geometry of language model embeddings
Andrew Lee, Melanie Weber, Fernanda Viégas, and Martin Wattenberg. Shared global and local geometry of language model embeddings.arXiv preprint arXiv:2503.21073, 2025
-
[41]
Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023
2023
-
[42]
Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pages 12888–12900. PMLR, 2022
2022
-
[43]
Deeper insights into graph convolutional networks for semi-supervised learning
Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. InProceedings of the AAAI conference on artificial intelligence, volume 32, 2018
2018
-
[44]
Microsoft coco: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InComputer vision–ECCV 2014: 13th European conference, zurich, Switzerland, September 6-12, 2014, proceedings, part v 13, pages 740–755. Springer, 2014
2014
-
[45]
Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023
2023
-
[46]
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach.arXiv preprint arXiv:1907.11692, 2019
2019
-
[47]
A convnet for the 2020s.Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s.Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
2022
-
[48]
Scale-free graph-language models
Jianglin Lu, Yixuan Liu, Yitian Zhang, and Yun Fu. Scale-free graph-language models. InThe Thirteenth International Conference on Learning Representations, 2025
2025
-
[49]
Representation potentials of foundation models for multimodal alignment: A survey
Jianglin Lu, Hailing Wang, Yi Xu, Yizhou Wang, Kuo Yang, and Yun Fu. Representation potentials of foundation models for multimodal alignment: A survey. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 16669–16684. Association for Computational Linguistics, 2025
2025
-
[50]
Low-rank adaptive graph embedding for unsupervised feature extraction.Pattern Recognition, 113:107758, 2021
Jianglin Lu, Hailing Wang, Jie Zhou, Yudong Chen, Zhihui Lai, and Qinghua Hu. Low-rank adaptive graph embedding for unsupervised feature extraction.Pattern Recognition, 113:107758, 2021
2021
-
[51]
Latent graph inference with limited supervision
Jianglin Lu, Yi Xu, Huan Wang, Yue Bai, and Yun Fu. Latent graph inference with limited supervision. InAdvances in Neural Information Processing Systems, 2023
2023
-
[52]
Do vision and language encoders represent the world similarly?
Mayug Maniparambil, Raiymbek Akshulakov, Yasser Abdelaziz Dahou Djilali, Mohamed El Amine Seddik, Sanath Narayan, Karttikeya Mangalam, and Noel E. O’Connor. Do vision and language encoders represent the world similarly? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024
2024
-
[53]
Culture and the self: Implications for cognition, emotion, and motivation
Hazel Rose Markus and Shinobu Kitayama. Culture and the self: Implications for cognition, emotion, and motivation. In College student development and academic life, pages 264–293. Routledge, 2014
2014
-
[54]
A Treatise on Electricity and Magnetism
James Clerk Maxwell.A Treatise on Electricity and Magnetism. Clarendon Press, Oxford, first edition, 1873
-
[55]
Linearly mapping from image to text space
Jack Merullo, Louis Castricato, Carsten Eickhoff, and Ellie Pavlick. Linearly mapping from image to text space. InThe Eleventh International Conference on Learning Representations, 2023
2023
-
[56]
Distributed representations of words and phrases and their compositionality
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, 26, 2013
2013
-
[57]
Relative representations enable zero-shot latent space communication
Luca Moschella, Valentino Maiorca, Marco Fumero, Antonio Norelli, Francesco Locatello, and Emanuele Rodolà. Relative representations enable zero-shot latent space communication. In The Eleventh International Conference on Learning Representations, 2023
2023
-
[58]
Can language models learn to listen? InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10083–10093, 2023
Evonne Ng, Sanjay Subramanian, Dan Klein, Angjoo Kanazawa, Trevor Darrell, and Shiry Ginosar. Can language models learn to listen? InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10083–10093, 2023
2023
-
[59]
What do language models hear? probing for auditory representations in language models
Jerry Ngo and Yoon Kim. What do language models hear? probing for auditory representations in language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 5435–5448, 2024
2024
-
[60]
Gpt-4 technical report, 2024
OpenAI. Gpt-4 technical report, 2024
2024
-
[61]
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023
2023
-
[62]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021
2021
-
[63]
Robust speech recognition via large-scale weak supervision
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InInternational conference on machine learning, pages 28492–28518. PMLR, 2023
2023
-
[64]
Category theory in context
Emily Riehl.Category theory in context. Courier Dover Publications, 2017
2017
-
[65]
On linear identifiability of learned representations
Geoffrey Roeder, Luke Metz, and Durk Kingma. On linear identifiability of learned representations. In International Conference on Machine Learning, pages 9030–9039. PMLR, 2021
2021
-
[66]
wav2vec: Unsupervised pre-training for speech recognition.arXiv preprint arXiv:1904.05862, 2019
Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. wav2vec: Unsupervised pre-training for speech recognition.arXiv preprint arXiv:1904.05862, 2019
-
[67]
A vision check-up for language models
Pratyusha Sharma, Tamar Rott Shaham, Manel Baradad, Stephanie Fu, Adrian Rodriguez-Munoz, Shivam Duggal, Phillip Isola, and Antonio Torralba. A vision check-up for language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14410–14419, 2024
2024
-
[68]
Analysing the generalisation and reliability of steering vectors
Daniel Tan, David Chanin, Aengus Lynch, Brooks Paige, Dimitrios Kanoulas, Adrià Garriga-Alonso, and Robert Kirk. Analysing the generalisation and reliability of steering vectors. Advances in Neural Information Processing Systems, 37, 2024
2024
-
[69]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023
2023
-
[70]
Attention is all you need.Advances in neural information processing systems, 2017
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, 2017
2017
-
[71]
Graph attention networks.stat, 1050(20):10–48550, 2017
Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, Yoshua Bengio, et al. Graph attention networks.stat, 1050(20):10–48550, 2017
2017
-
[72]
Deep hashing network for unsupervised domain adaptation
Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 5018–5027, 2017
2017
-
[73]
Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion.Journal of machine learning research, 11(12), 2010
Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol, and Léon Bottou. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion.Journal of machine learning research, 11(12), 2010
2010
-
[74]
Towards universality: Studying mechanistic similarity across language model architectures
Junxuan Wang, Xuyang Ge, Wentao Shu, Qiong Tang, Yunhua Zhou, Zhengfu He, and Xipeng Qiu. Towards universality: Studying mechanistic similarity across language model architectures. InThe Thirteenth International Conference on Learning Representations, 2025
2025
-
[75]
Fuzzy multi-subspace clustering.IEEE Transactions on Fuzzy Systems, pages 1–14, 2026
Yangbo Wang, Jie Zhou, Mingli Song, Yue Guo, and Jianglin Lu. Fuzzy multi-subspace clustering.IEEE Transactions on Fuzzy Systems, pages 1–14, 2026
2026
-
[76]
Testing the natural abstraction hypothesis.AI Alignment Forum, 2021
John Wentworth. Testing the natural abstraction hypothesis.AI Alignment Forum, 2021
2021
-
[77]
Convnext v2: Co-designing and scaling convnets with masked autoencoders
Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, and Saining Xie. Convnext v2: Co-designing and scaling convnets with masked autoencoders. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
2023
-
[78]
Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation
Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023
2023