Pith · machine review for the scientific record

arxiv: 2604.04496 · v1 · submitted 2026-04-06 · 💻 cs.CV

Recognition: 2 theorem links · Lean theorem

The Indra Representation Hypothesis for Multimodal Alignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 20:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords Indra representation · multimodal alignment · Yoneda embedding · foundation models · relational structure · cross-modal alignment · category theory

The pith

Unimodal foundation models converge on a shared relational structure that the Indra representation extracts for training-free multimodal alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that models trained on single modalities like vision or text are implicitly learning the same network of relations among data points. This shared structure, likened to Indra's Net, is more expressive than isolated sample embeddings. The authors use the V-enriched Yoneda embedding to turn each sample into a complete profile of its relations to all others under a cost function. Experiments show these Indra representations improve alignment and robustness when crossing between vision, language, and audio models without any retraining.

Core claim

The Indra Representation Hypothesis states that unimodal foundation models converge to implicitly reflect a shared relational structure underlying reality. This is formalized by defining the Indra representation of each sample as its relational profile with respect to all other samples, obtained via the V-enriched Yoneda embedding. The resulting representation is unique, complete, and structure-preserving under a given cost function, and when instantiated with angular distance it yields consistent gains in cross-model and cross-modal alignment.
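The formal device can be sketched in the Lawvere-metric instance of enrichment (assuming, as is standard for cost functions, that V is the poset ([0, ∞], ≥) with addition as tensor; the paper's exact choice of V may differ):

```latex
% A set of samples X with cost c : X \times X \to [0,\infty] is a
% V-category when c(x,x) = 0 and c(x,y) + c(y,z) \ge c(x,z).
% The enriched Yoneda embedding sends each sample to its relational profile:
\[
  y(x) \;=\; c(\,\cdot\,, x) \,:\, X \to [0,\infty].
\]
% The enriched Yoneda lemma makes y fully faithful, i.e. isometric
% (for symmetric c such as angular distance):
\[
  d_{\sup}\!\big(y(x),\, y(x')\big) \;=\; c(x, x'),
\]
% so a profile determines its sample up to isomorphism (uniqueness),
% records every relation (completeness), and preserves c (structure).
```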

What carries the argument

The V-enriched Yoneda embedding, which converts each sample into a relational profile of its distances or relations to every other sample under a chosen cost function.
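A minimal numerical sketch of that construction, assuming only that embeddings are rows of a matrix and that the cost function is the angular distance named in the abstract (function names here are illustrative, not the paper's code):

```python
import numpy as np

def indra_representation(E, anchors=None):
    # Each sample's relational profile: its angular distance (arccos of
    # cosine similarity, scaled to [0, 1]) to every anchor sample.
    A = E if anchors is None else anchors
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    An = A / np.linalg.norm(A, axis=1, keepdims=True)
    cos = np.clip(En @ An.T, -1.0, 1.0)
    return np.arccos(cos) / np.pi

rng = np.random.default_rng(0)
E = rng.normal(size=(8, 16))      # 8 samples, 16-dim embeddings
R = indra_representation(E)       # 8 x 8 matrix of relational profiles
assert R.shape == (8, 8)
assert np.allclose(np.diag(R), 0.0, atol=1e-6)  # zero self-distance
```

Each row of `R` is one sample's profile; downstream alignment then operates on these rows rather than on the raw embedding vectors.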

If this is right

  • Indra representations enable training-free alignment between different unimodal models.
  • They increase robustness when transferring across architectures and modalities.
  • The same relational profiles work for vision, language, and audio data.
  • Alignment becomes a matter of matching relational structures rather than raw embeddings.
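The last point can be made concrete. Under the hypothesis, two models that encode the same data differently should still assign near-identical relational profiles, so matching profiles aligns them with no training. A toy sketch in which an orthogonal rotation stands in for a second model (all names illustrative; the paper's actual matching procedure may differ):

```python
import numpy as np

def profiles(E, anchors):
    # Angular-distance profile of each row of E against the anchor set.
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    An = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    return np.arccos(np.clip(En @ An.T, -1.0, 1.0)) / np.pi

rng = np.random.default_rng(1)
Z = rng.normal(size=(20, 12))                  # shared underlying data
Q = np.linalg.qr(rng.normal(size=(12, 12)))[0] # random orthogonal map
Ea, Eb = Z @ Q, Z.copy()                       # two "models": rotated views

anchor_idx = np.arange(10)                     # paired anchors in both models
Pa = profiles(Ea, Ea[anchor_idx])
Pb = profiles(Eb, Eb[anchor_idx])

# Match each sample in model A to the nearest relational profile in model B.
match = np.argmin(((Pa[:, None] - Pb[None]) ** 2).sum(-1), axis=1)
# Rotation leaves angular distances unchanged, so matching is exact here.
print((match == np.arange(20)).mean())         # prints 1.0
```

Raw embedding vectors would disagree badly between `Ea` and `Eb`; the relational profiles coincide, which is the sense in which alignment reduces to matching relational structure.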

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be tested on additional modalities such as video or sensor data to check if the relational convergence holds more broadly.
  • It offers a way to compare models by their induced relation graphs instead of direct vector similarity.
  • If the hypothesis is correct, future foundation models might be designed explicitly to preserve these relational profiles from the start.
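The second extension, comparing models by their induced relation graphs, can be sketched in the spirit of representational similarity analysis (this is an editorial illustration, not a method from the paper):

```python
import numpy as np

def relation_matrix(E):
    # Pairwise angular distances: the relation graph a model induces on data.
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    return np.arccos(np.clip(En @ En.T, -1.0, 1.0)) / np.pi

def relational_agreement(Ea, Eb):
    # Correlation of off-diagonal relation entries between two models.
    Ra, Rb = relation_matrix(Ea), relation_matrix(Eb)
    off = ~np.eye(len(Ra), dtype=bool)
    return float(np.corrcoef(Ra[off], Rb[off])[0, 1])

rng = np.random.default_rng(3)
Z = rng.normal(size=(30, 8))
perturbed = Z + 0.05 * rng.normal(size=(30, 8))
print(relational_agreement(Z, perturbed))             # near 1: same graph
print(relational_agreement(Z, rng.normal(size=(30, 8))))  # near 0: unrelated
```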

Load-bearing premise

Unimodal foundation models are converging to implicitly reflect a shared relational structure underlying reality.

What would settle it

If replacing standard embeddings with Indra representations fails to improve alignment accuracy or robustness in cross-modal retrieval or classification tasks across vision, language, and audio, the hypothesis would be contradicted.

read the original abstract

Recent studies have uncovered an interesting phenomenon: unimodal foundation models tend to learn convergent representations, regardless of differences in architecture, training objectives, or data modalities. However, these representations are essentially internal abstractions of samples that characterize samples independently, leading to limited expressiveness. In this paper, we propose The Indra Representation Hypothesis, inspired by the philosophical metaphor of Indra's Net. We argue that representations from unimodal foundation models are converging to implicitly reflect a shared relational structure underlying reality, akin to the relational ontology of Indra's Net. We formalize this hypothesis using the V-enriched Yoneda embedding from category theory, defining the Indra representation as a relational profile of each sample with respect to others. This formulation is shown to be unique, complete, and structure-preserving under a given cost function. We instantiate the Indra representation using angular distance and evaluate it in cross-model and cross-modal scenarios involving vision, language, and audio. Extensive experiments demonstrate that Indra representations consistently enhance robustness and alignment across architectures and modalities, providing a theoretically grounded and practical framework for training-free alignment of unimodal foundation models. Our code is available at https://github.com/Jianglin954/Indra.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents the Indra Representation Hypothesis, which posits that representations learned by unimodal foundation models converge to reflect a shared relational structure underlying reality. The authors formalize this using the V-enriched Yoneda embedding from category theory, defining the Indra representation as a relational profile of each sample with respect to others under a cost function. They claim this formulation is unique, complete, and structure-preserving. The hypothesis is instantiated using angular distance as the cost function and evaluated in cross-model and cross-modal scenarios across vision, language, and audio modalities. Experiments reportedly show consistent improvements in robustness and alignment, offering a training-free framework for multimodal alignment, with code made available.

Significance. If the formal claims are rigorously proven and the experimental results hold with appropriate controls, this work could significantly advance multimodal learning by providing a category-theoretic foundation for aligning representations without retraining. The application of enriched category theory to capture relational profiles is an interesting approach. The provision of open-source code enhances reproducibility. However, the current lack of detailed derivations and quantitative experimental reporting reduces the immediate impact.

major comments (2)
  1. [§3] §3 (Formalization): The assertion that the Indra representation is 'unique, complete, and structure-preserving' under a given cost function is stated without derivation steps or proof sketches. In particular, it is not verified that the angular distance cost function induces a valid V-enrichment (e.g., satisfying compositionality of hom-objects) to which the enriched Yoneda lemma applies directly and yields the claimed properties.
  2. [§5] §5 (Experiments): The claims of consistent enhancement in robustness and alignment lack quantitative details, baselines, controls, statistical significance, or error bars. This undermines assessment of the magnitude and reliability of the reported improvements across architectures and modalities.
minor comments (2)
  1. The notation for the V-enriched category, hom-objects, and relational profile should be introduced with explicit equations early in the formalization section to improve clarity.
  2. [Abstract] The abstract refers to 'extensive experiments' but the manuscript should include a dedicated table or section summarizing datasets, models, and metrics used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential of the Indra Representation Hypothesis to advance multimodal alignment. We appreciate the opportunity to strengthen the manuscript and will incorporate detailed derivations and expanded experimental reporting in the revision.

read point-by-point responses
  1. Referee: [§3] §3 (Formalization): The assertion that the Indra representation is 'unique, complete, and structure-preserving' under a given cost function is stated without derivation steps or proof sketches. In particular, it is not verified that the angular distance cost function induces a valid V-enrichment (e.g., satisfying compositionality of hom-objects) to which the enriched Yoneda lemma applies directly and yields the claimed properties.

    Authors: We agree that explicit derivation steps and verification would improve rigor and clarity. The uniqueness, completeness, and structure-preserving properties follow directly from the V-enriched Yoneda lemma, which provides a fully faithful embedding that preserves the enriched hom-objects and relational structure. Angular distance defines a valid V-enrichment because it satisfies the metric axioms (non-negativity, symmetry, and the triangle inequality), allowing the hom-objects to form a V-category with appropriate compositionality. In the revised manuscript, we will add a dedicated subsection containing proof sketches that verify the V-enrichment axioms for angular distance and demonstrate how the enriched Yoneda lemma yields the stated properties without additional assumptions. revision: yes
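The metric-axiom claim in this response is easy to spot-check numerically; angular distance here is the arccos of cosine similarity, i.e. great-circle distance on the unit sphere (a sanity check on random data, not a proof):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 8))
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
D = np.arccos(np.clip(Xn @ Xn.T, -1.0, 1.0))  # angular distance, radians

assert (D >= -1e-9).all()                      # non-negativity
assert np.allclose(D, D.T)                     # symmetry
# Triangle inequality: D[i,k] <= D[i,j] + D[j,k] for all i, j, k.
assert (D[:, None, :] <= D[:, :, None] + D[None, :, :] + 1e-9).all()
```

Passing this check is necessary but not sufficient for the rebuttal's claim; the compositionality of hom-objects in the V-enriched sense is exactly the triangle inequality, but the promised proof sketch must still verify it symbolically.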

  2. Referee: [§5] §5 (Experiments): The claims of consistent enhancement in robustness and alignment lack quantitative details, baselines, controls, statistical significance, or error bars. This undermines assessment of the magnitude and reliability of the reported improvements across architectures and modalities.

    Authors: We acknowledge that the current experimental section would benefit from more granular quantitative reporting to allow full assessment of the results. The experiments evaluate Indra representations (instantiated with angular distance) against raw unimodal embeddings in cross-model and cross-modal settings across vision, language, and audio, measuring improvements in alignment and robustness metrics. In the revision, we will expand the section with detailed tables reporting specific numerical gains (e.g., mean percentage improvements in cosine similarity and robustness scores), explicit baselines (including original embeddings and alternative metrics such as Euclidean distance), controls for architecture and modality variations, statistical significance via paired t-tests with p-values, and error bars computed from multiple independent runs with different random seeds. revision: yes
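The promised reporting might look like the following sketch; the accuracy numbers are invented placeholders, and the paired t statistic is computed by hand rather than via any particular library:

```python
import numpy as np

# Hypothetical per-seed retrieval accuracies over 5 independent runs.
baseline = np.array([0.61, 0.59, 0.63, 0.60, 0.62])  # raw embeddings
indra    = np.array([0.69, 0.65, 0.71, 0.66, 0.70])  # Indra representations

diff = indra - baseline                      # paired differences per seed
mean_gain = diff.mean()
stderr = diff.std(ddof=1) / np.sqrt(len(diff))
t = mean_gain / stderr                       # paired t statistic
print(f"mean gain {mean_gain:.3f} +/- {stderr:.3f}, "
      f"paired t = {t:.2f} on {len(diff) - 1} df")
```

Reporting the per-seed pairs, the mean gain with its standard error, and the t statistic with degrees of freedom would address the referee's request directly.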

Circularity Check

0 steps flagged

No circularity: the formalization rests on an external category-theory theorem

full rationale

The derivation defines the Indra representation as the image of the V-enriched Yoneda embedding applied to samples equipped with a cost function. Uniqueness, completeness, and structure-preservation are direct consequences of the enriched Yoneda lemma, an independent result from category theory literature rather than a self-derived or fitted property. Angular distance is chosen as an explicit instantiation of the cost function, not a parameter tuned to the target metrics. Experimental evaluations on cross-model and cross-modal alignment tasks compare against external benchmarks and do not reduce to the input relational profiles by construction. No self-citations are invoked to justify the core claims, and the hypothesis is tested rather than assumed tautological.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on standard category-theory results for the Yoneda embedding plus the novel hypothesis that unimodal representations converge to relational structure; the main added entity is the Indra representation itself.

free parameters (1)
  • cost function
    The general cost function is part of the formalization; its concrete choice (angular distance) is an instantiation that affects the evaluated profiles.
axioms (1)
  • (standard math) The V-enriched Yoneda embedding is unique, complete, and structure-preserving
    Invoked directly in the formalization section of the abstract as a background result from category theory.
invented entities (1)
  • Indra representation (no independent evidence)
    purpose: Relational profile of each sample with respect to others that enables alignment
    New concept introduced to capture the hypothesized shared relational structure; no independent evidence outside the paper's experiments is supplied.

pith-pipeline@v0.9.0 · 5516 in / 1503 out tokens · 59303 ms · 2026-05-10T20:00:31.881251+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

78 extracted references · 9 canonical work pages · 6 internal anchors

  1. [1]

    Nocaps: Novel object captioning at scale

    Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. Nocaps: Novel object captioning at scale. InProceedings of the IEEE/CVF international conference on computer vision, pages 8948–8957, 2019

  2. [2]

    wav2vec 2.0: A framework for self-supervised learning of speech representations.Advances in neural information processing systems, 33:12449–12460, 2020

    Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations.Advances in neural information processing systems, 33:12449–12460, 2020

  3. [3]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

  4. [4]

    Review of particle physics.Physical Review D—Particles, Fields, Gravitation, and Cosmology, 86(1):010001, 2012

    Juerg Beringer, J-F Arguin, RM Barnett, K Copic, O Dahl, DE Groom, C-J Lin, J Lys, H Mu- rayama, CG Wohl, et al. Review of particle physics.Physical Review D—Particles, Fields, Gravitation, and Cosmology, 86(1):010001, 2012

  5. [5]

    Truth is universal: Robust detection of lies in llms.Advances in Neural Information Processing Systems, 37, 2024

    Lennart Bürger, Fred A Hamprecht, and Boaz Nadler. Truth is universal: Robust detection of lies in llms.Advances in Neural Information Processing Systems, 37, 2024

  6. [6]

    Wavlm: Large-scale self-supervised pre- training for full stack speech processing.IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518, 2022

    Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. Wavlm: Large-scale self-supervised pre- training for full stack speech processing.IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518, 2022

  7. [7]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. InInternational Conference on Machine Learning, pages 1597–1607. PMLR, 2020

  8. [8]

    Local graph convolutional networks for cross-modal hashing

    Yudong Chen, Sen Wang, Jianglin Lu, Zhi Chen, Zheng Zhang, and Zi Huang. Local graph convolutional networks for cross-modal hashing. InProceedings of the 29th ACM international conference on multimedia, pages 1921–1928, 2021

  9. [9]

    Shambhala Publications, Boston, 1993

    Thomas Cleary.The Flower Ornament Scripture: A Translation of the Avatamsaka Sutra. Shambhala Publications, Boston, 1993

  10. [10]

    Cook.Hua-Yen Buddhism: The Jewel Net of Indra

    F.H. Cook.Hua-Yen Buddhism: The Jewel Net of Indra. Iaswr Series. Pennsylvania State University Press, 1977

  11. [11]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1, pages 4171–4186, 2019

  12. [12]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations, 2021

  13. [13]

    Springer, 2006

    Eduardo J Dubuc.Kan extensions in enriched category theory, volume 145. Springer, 2006

  14. [14]

    Bernard Quaritch, London, 1859

    Michael Faraday.Experimental Researches in Electricity. Bernard Quaritch, London, 1859. Originally published as a series of papers between 1831 and 1855

  15. [15]

    Philological Society, Oxford, 1957

    John Rupert Firth, editor.Studies in Linguistic Analysis. Philological Society, Oxford, 1957. Special volume of the Philological Society

  16. [16]

    Timit acoustic-phonetic continuous speech corpus.(No Title), 1993

    John S Garofolo, Lori F Lamel, William M Fisher, David S Pallett, Nancy L Dahlgren, Victor Zue, and Jonathan G Fiscus. Timit acoustic-phonetic continuous speech corpus.(No Title), 1993

  17. [17]

    Gergen.Relational Being: Beyond Self and Community

    Kenneth J. Gergen.Relational Being: Beyond Self and Community. Oxford University Press, New York, 2009. 10

  18. [18]

    John Wiley & Sons, 2020

    David Griffiths.Introduction to elementary particles. John Wiley & Sons, 2020

  19. [19]

    Bootstrap your own latent-a new approach to self-supervised learning

    Jean-Bastien Grill, Florian Strub, Florent Altmann, Corentin Tallec, Pierre-Henri Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. InAdvances in Neural Information Processing Systems, volume 33, pages 21271–21283, 2020

  20. [20]

    arXiv preprint arXiv:2401.12181 , year=

    Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway, Neel Nanda, and Dimitris Bertsimas. Universal neurons in gpt2 language models.arXiv preprint arXiv:2401.12181, 2024

  21. [21]

    Inductive representation learning on large graphs

    Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. InAdvances in Neural Information Processing Systems, 2017

  22. [22]

    Indra’s net, 2022

    Harvard FAS CAMLab. Indra’s net, 2022

  23. [23]

    Momentum contrast for unsupervised visual representation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020

  24. [24]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  25. [25]

    DeBERTa: Decoding-enhanced BERT with Disentangled Attention

    Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with disentangled attention.arXiv preprint arXiv:2006.03654, 2020

  26. [26]

    CLIPScore: A Reference-free Evaluation Metric for Image Captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning.arXiv preprint arXiv:2104.08718, 2021

  27. [27]

    Universality of representation in biological and artificial neural networks.bioRxiv, 2024

    Eghbal Hosseini, Colton Casto, Noga Zaslavsky, Colin Conwell, Mark Richardson, and Evelina Fedorenko. Universality of representation in biological and artificial neural networks.bioRxiv, 2024

  28. [28]

    Hubert: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460, 2021

    Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460, 2021

  29. [29]

    The platonic representa- tion hypothesis

    Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representa- tion hypothesis. InProceedings of the 41st International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR, 2024

  30. [30]

    Product quantization for nearest neighbor search.IEEE transactions on pattern analysis and machine intelligence, 33(1):117–128, 2010

    Herve Jegou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search.IEEE transactions on pattern analysis and machine intelligence, 33(1):117–128, 2010

  31. [31]

    Billion-scale similarity search with gpus.IEEE Transactions on Big Data, 7(3):535–547, 2019

    Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with gpus.IEEE Transactions on Big Data, 7(3):535–547, 2019

  32. [32]

    Kasulis.Engaging Japanese Philosophy: A Short History

    Thomas P. Kasulis.Engaging Japanese Philosophy: A Short History. University of Hawai’i Press, Honolulu, 2018

  33. [33]

    Cambridge University Press, Cambridge, 1982

    Gregory Maxwell Kelly.Basic Concepts of Enriched Category Theory, volume 64 ofLondon Mathematical Society Lecture Note Series. Cambridge University Press, Cambridge, 1982

  34. [34]

    Privileged representational axes in biological and artificial neural networks.bioRxiv, pages 2024–06, 2024

    Meenakshi Khosla, Alex H Williams, Josh McDermott, and Nancy Kanwisher. Privileged representational axes in biological and artificial neural networks.bioRxiv, pages 2024–06, 2024

  35. [35]

    Kipf and Max Welling

    Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. InInternational Conference on Learning Representations, 2017

  36. [36]

    Grounding language models to images for multimodal inputs and outputs

    Jing Yu Koh, Ruslan Salakhutdinov, and Daniel Fried. Grounding language models to images for multimodal inputs and outputs. InInternational Conference on Machine Learning, pages 17283–17300. PMLR, 2023. 11

  37. [37]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. Technical Report

  38. [38]

    Imagenet classification with deep convolutional neural networks.Advances in neural information processing systems, 25, 2012

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks.Advances in neural information processing systems, 25, 2012

  39. [39]

    Metric spaces, generalized logic, and closed categories.Rendiconti del seminario matématico e fisico di Milano, 43:135–166, 1973

    F William Lawvere. Metric spaces, generalized logic, and closed categories.Rendiconti del seminario matématico e fisico di Milano, 43:135–166, 1973

  40. [40]

    Mitra, A., Khanpour, H., Rosset, C., and Awadallah, A

    Andrew Lee, Melanie Weber, Fernanda Viégas, and Martin Wattenberg. Shared global and local geometry of language model embeddings.arXiv preprint arXiv:2503.21073, 2025

  41. [41]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023

  42. [42]

    Blip: Bootstrapping language- image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language- image pre-training for unified vision-language understanding and generation. InInternational conference on machine learning, pages 12888–12900. PMLR, 2022

  43. [43]

    Deeper insights into graph convolutional networks for semi-supervised learning

    Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. InProceedings of the AAAI conference on artificial intelligence, volume 32, 2018

  44. [44]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InComputer vision–ECCV 2014: 13th European conference, zurich, Switzerland, September 6-12, 2014, proceedings, part v 13, pages 740–755. Springer, 2014

  45. [45]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

  46. [46]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach.arXiv preprint arXiv:1907.11692, 2019

  47. [47]

    A convnet for the 2020s.Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

    Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s.Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  48. [48]

    Scale-free graph-language models

    Jianglin Lu, Yixuan Liu, Yitian Zhang, and Yun Fu. Scale-free graph-language models. InThe Thirteenth International Conference on Learning Representations, 2025

  49. [49]

    Representation potentials of foundation models for multimodal alignment: A survey

    Jianglin Lu, Hailing Wang, Yi Xu, Yizhou Wang, Kuo Yang, and Yun Fu. Representation potentials of foundation models for multimodal alignment: A survey. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 16669–16684. Association for Computational Linguistics, 2025

  50. [50]

    Low-rank adaptive graph embedding for unsupervised feature extraction.Pattern Recognition, 113:107758, 2021

    Jianglin Lu, Hailing Wang, Jie Zhou, Yudong Chen, Zhihui Lai, and Qinghua Hu. Low-rank adaptive graph embedding for unsupervised feature extraction.Pattern Recognition, 113:107758, 2021

  51. [51]

    Latent graph inference with limited supervision

    Jianglin Lu, Yi Xu, Huan Wang, Yue Bai, and Yun Fu. Latent graph inference with limited supervision. InAdvances in Neural Information Processing Systems, 2023

  52. [52]

    O’Connor

    Mayug Maniparambil, Raiymbek Akshulakov, Yasser Abdelaziz Dahou Djilali, Mohamed El Amine Seddik, Sanath Narayan, Karttikeya Mangalam, and Noel E. O’Connor. Do vision and language encoders represent the world similarly? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024

  53. [53]

    Culture and the self: Implications for cognition, emotion, and motivation

    Hazel Rose Markus and Shinobu Kitayama. Culture and the self: Implications for cognition, emotion, and motivation. InCollege student development and academic life, pages 264–293. Routledge, 2014. 12

  54. [54]

    Clarendon Press, Oxford, first edition, 1873

    James Clerk Maxwell.A Treatise on Electricity and Magnetism. Clarendon Press, Oxford, first edition, 1873

  55. [55]

    Linearly mapping from image to text space

    Jack Merullo, Louis Castricato, Carsten Eickhoff, and Ellie Pavlick. Linearly mapping from image to text space. InThe Eleventh International Conference on Learning Representations, 2023

  56. [56]

    Distributed repre- sentations of words and phrases and their compositionality.Advances in neural information processing systems, 26, 2013

    Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed repre- sentations of words and phrases and their compositionality.Advances in neural information processing systems, 26, 2013

  57. [57]

    Relative representations enable zero-shot latent space communication

    Luca Moschella, Valentino Maiorca, Marco Fumero, Antonio Norelli, Francesco Locatello, and Emanuele Rodolà. Relative representations enable zero-shot latent space communication. In The Eleventh International Conference on Learning Representations, 2023

  58. [58]

    Can language models learn to listen? InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10083–10093, 2023

    Evonne Ng, Sanjay Subramanian, Dan Klein, Angjoo Kanazawa, Trevor Darrell, and Shiry Ginosar. Can language models learn to listen? InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10083–10093, 2023

  59. [59]

    What do language models hear? probing for auditory representations in language models

    Jerry Ngo and Yoon Kim. What do language models hear? probing for auditory representations in language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 5435–5448, 2024

  60. [60]

    Gpt-4 technical report, 2024

    OpenAI. Gpt-4 technical report, 2024

  61. [61]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

  62. [62]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021

  63. [63]

    Robust speech recognition via large-scale weak supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InInternational conference on machine learning, pages 28492–28518. PMLR, 2023

  64. [64]

    Courier Dover Publications, 2017

    Emily Riehl.Category theory in context. Courier Dover Publications, 2017

  65. [65]

    Geoffrey Roeder, Luke Metz, and Durk Kingma. On linear identifiability of learned representations. In International Conference on Machine Learning, pages 9030–9039. PMLR, 2021

  66. [66]

    Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. wav2vec: Unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862, 2019

  67. [67]

    Pratyusha Sharma, Tamar Rott Shaham, Manel Baradad, Stephanie Fu, Adrian Rodriguez-Munoz, Shivam Duggal, Phillip Isola, and Antonio Torralba. A vision check-up for language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14410–14419, 2024

  68. [68]

    Daniel Tan, David Chanin, Aengus Lynch, Brooks Paige, Dimitrios Kanoulas, Adrià Garriga-Alonso, and Robert Kirk. Analysing the generalisation and reliability of steering vectors. Advances in Neural Information Processing Systems, 37, 2024

  69. [69]

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  70. [70]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 2017

  71. [71]

    Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, Yoshua Bengio, et al. Graph attention networks. stat, 1050(20):10–48550, 2017

  72. [72]

    Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5018–5027, 2017

  73. [73]

    Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol, and Léon Bottou. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(12), 2010

  74. [74]

    Junxuan Wang, Xuyang Ge, Wentao Shu, Qiong Tang, Yunhua Zhou, Zhengfu He, and Xipeng Qiu. Towards universality: Studying mechanistic similarity across language model architectures. In The Thirteenth International Conference on Learning Representations, 2025

  75. [75]

    Yangbo Wang, Jie Zhou, Mingli Song, Yue Guo, and Jianglin Lu. Fuzzy multi-subspace clustering. IEEE Transactions on Fuzzy Systems, pages 1–14, 2026

  76. [76]

    John Wentworth. Testing the natural abstraction hypothesis. AI Alignment Forum, 2021

  77. [77]

    Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, and Saining Xie. ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  78. [78]

    Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023