Kosmos-2: Grounding Multimodal Large Language Models to the World
Pith reviewed 2026-05-12 05:13 UTC · model grok-4.3
The pith
Kosmos-2 adds visual grounding to multimodal LLMs by encoding referring expressions as Markdown links to bounding-box location tokens.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Kosmos-2 acquires grounding by representing each referring expression as a Markdown link of the form [text span](bounding boxes), where the bounding boxes are encoded as ordered sequences of location tokens. Training on the constructed GrIT corpus of grounded image-text pairs then integrates this behavior into the model without loss of its prior multimodal and in-context learning abilities, as measured across referring expression comprehension, phrase grounding, referring expression generation, and general perception-language tasks.
What carries the argument
The Markdown-link representation of referring expressions, written as [text span](bounding boxes) with bounding boxes expressed as sequences of location tokens.
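To make the format concrete, here is a minimal Python sketch of how a bounding box might be turned into a [text span](bounding boxes) link. The grid size, the `<loc_k>` token spelling, the two-corner encoding, and the function names are illustrative assumptions; this page does not specify the paper's actual token vocabulary or discretization.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # normalized (x0, y0, x1, y1) in [0, 1]

GRID = 32  # hypothetical number of bins per image side (assumption)


def box_to_location_tokens(box: Box, grid: int = GRID) -> str:
    """Map a normalized box to two location tokens (top-left, bottom-right).

    Each corner is snapped to a cell of a grid x grid lattice, and the
    row-major index of that cell becomes the token id.
    """
    x0, y0, x1, y1 = box
    if not (0.0 <= x0 <= x1 <= 1.0 and 0.0 <= y0 <= y1 <= 1.0):
        raise ValueError("box must be normalized with x0 <= x1 and y0 <= y1")

    def cell(x: float, y: float) -> int:
        # Clamp so that a coordinate of exactly 1.0 falls in the last bin.
        col = min(int(x * grid), grid - 1)
        row = min(int(y * grid), grid - 1)
        return row * grid + col

    return f"<loc_{cell(x0, y0)}><loc_{cell(x1, y1)}>"


def ground_span(text_span: str, box: Box, grid: int = GRID) -> str:
    """Write a referring expression as a Markdown-style link to its box."""
    return f"[{text_span}]({box_to_location_tokens(box, grid)})"


# A tiny 4x4 grid keeps the token ids easy to check by hand.
example = ground_span("a snowman", (0.25, 0.5, 0.75, 1.0), grid=4)
print(example)  # [a snowman](<loc_9><loc_15>)
```

Because the box becomes ordinary tokens inside the text stream, a language model can read or emit it with the same next-token machinery it uses for words, which is the property the core claim depends on.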
If this is right
- The model handles referring expression comprehension and phrase grounding tasks at competitive levels.
- It can generate referring expressions that correctly identify image regions.
- Performance on perception-language and general language tasks remains intact.
- Grounding becomes available for integration into a range of downstream multimodal applications.
Where Pith is reading between the lines
- The same link-based grounding format could be applied to video or 3-D scenes to let models refer to objects across time or depth.
- Embodied agents might use the resulting outputs to plan physical actions that reference specific world locations.
- Unified training on grounded text and world-modeling data could reduce the separation between language and spatial reasoning modules.
Load-bearing premise
Encoding referring expressions as Markdown links to location-token sequences and training on GrIT will add usable grounding without degrading the model's other language and multimodal capabilities.
What would settle it
If Kosmos-2 shows no improvement over non-grounded multimodal baselines on referring-expression comprehension or phrase-grounding benchmarks, or if scores on standard visual-question-answering tasks drop after the same training, the grounding method would be shown ineffective.
Original abstract
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world. Specifically, we represent refer expressions as links in Markdown, i.e., "[text span](bounding boxes)", where object descriptions are sequences of location tokens. Together with multimodal corpora, we construct large-scale data of grounded image-text pairs (called GrIT) to train the model. In addition to the existing capabilities of MLLMs (e.g., perceiving general modalities, following instructions, and performing in-context learning), Kosmos-2 integrates the grounding capability into downstream applications. We evaluate Kosmos-2 on a wide range of tasks, including (i) multimodal grounding, such as referring expression comprehension, and phrase grounding, (ii) multimodal referring, such as referring expression generation, (iii) perception-language tasks, and (iv) language understanding and generation. This work lays out the foundation for the development of Embodiment AI and sheds light on the big convergence of language, multimodal perception, action, and world modeling, which is a key step toward artificial general intelligence. Code and pretrained models are available at https://aka.ms/kosmos-2.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Kosmos-2, a multimodal large language model that represents referring expressions as Markdown links of the form [text span](bounding boxes) where bounding boxes are encoded as sequences of location tokens. The model is trained on a newly constructed GrIT dataset of grounded image-text pairs in addition to existing multimodal corpora. Evaluations cover referring expression comprehension, phrase grounding, referring expression generation, perception-language tasks, and language understanding/generation. The authors claim that Kosmos-2 integrates grounding into downstream applications and lays the foundation for Embodiment AI via convergence of language, perception, action, and world modeling.
Significance. If the location-token representation and GrIT training demonstrably add effective grounding while preserving other MLLM capabilities, the work would provide a practical technical contribution to multimodal alignment. The GrIT dataset construction and Markdown-link format could serve as reusable baselines. The broader Embodiment-AI claim, however, rests on an untested extrapolation from perception-language results to action and world modeling.
major comments (2)
- [Abstract] Abstract: The statement that the work 'lays out the foundation for the development of Embodiment AI' and constitutes 'a key step toward artificial general intelligence' via 'the big convergence of language, multimodal perception, action, and world modeling' is not supported by the listed evaluations ((i) referring expression comprehension/phrase grounding, (ii) referring expression generation, (iii) perception-language tasks, (iv) language understanding/generation). No action, planning, navigation, or interaction tasks are described, so the convergence claim remains aspirational.
- [Evaluation] Evaluation sections: The manuscript must supply quantitative results, baselines, and ablations for the grounding tasks (e.g., comparison of location-token accuracy against prior phrase-grounding methods). Without these, it is impossible to verify whether the Markdown-link representation adds grounding capability without degrading other MLLM performance.
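One common way to score the grounding tasks named in this comment is box-level accuracy at an intersection-over-union (IoU) threshold. The sketch below is a generic Python illustration, not the paper's evaluation code; the 0.5 threshold and the function names are assumptions.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(
    preds: Sequence[Box], golds: Sequence[Box], threshold: float = 0.5
) -> float:
    """Fraction of predictions whose IoU with the gold box meets the threshold."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not golds:
        return 0.0
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, golds))
    return hits / len(golds)


preds = [(0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 1.0, 1.0)]
golds = [(0.0, 0.0, 2.0, 2.0), (2.0, 2.0, 3.0, 3.0)]
print(accuracy_at_iou(preds, golds))  # 0.5
```

Running the same function over a grounded model and a non-grounded baseline on one benchmark split would give the side-by-side comparison the referee requests.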
minor comments (2)
- [Abstract] The abstract states that code and pretrained models are available at https://aka.ms/kosmos-2 but provides no license, version, or reproducibility details.
- [Method] Notation for location tokens should be defined once with an explicit example (e.g., the exact token vocabulary and how bounding-box coordinates are discretized) rather than introduced only in passing.
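As an illustration of the kind of explicit example this comment asks for, the Python sketch below parses Markdown-style links out of generated text and maps location tokens back to normalized boxes. The `<loc_k>` spelling, the row-major grid indexing, the 32-bin default, and decoding to cell centers are all hypothetical choices made for this sketch, not details taken from the paper.

```python
import re
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # normalized (x0, y0, x1, y1)

# Hypothetical surface form: [text span](<loc_i><loc_j>...), two tokens per box.
LINK_RE = re.compile(r"\[([^\]]+)\]\(((?:<loc_\d+>)+)\)")
TOKEN_RE = re.compile(r"<loc_(\d+)>")


def token_to_point(token_id: int, grid: int) -> Tuple[float, float]:
    """Return the normalized (x, y) center of the grid cell a token names."""
    if not 0 <= token_id < grid * grid:
        raise ValueError(f"token id {token_id} is outside a {grid}x{grid} grid")
    row, col = divmod(token_id, grid)
    return (col + 0.5) / grid, (row + 0.5) / grid


def decode_links(text: str, grid: int = 32) -> List[Tuple[str, List[Box]]]:
    """Extract (text span, boxes) pairs from Markdown-style grounded text."""
    results = []
    for span, tokens in LINK_RE.findall(text):
        ids = [int(t) for t in TOKEN_RE.findall(tokens)]
        if len(ids) % 2:
            continue  # an odd token count cannot form corner pairs; skip it
        boxes = [
            token_to_point(tl, grid) + token_to_point(br, grid)
            for tl, br in zip(ids[0::2], ids[1::2])
        ]
        results.append((span, boxes))
    return results


decoded = decode_links("[a snowman](<loc_9><loc_15>) sits by a fire", grid=4)
print(decoded)  # [('a snowman', [(0.375, 0.625, 0.875, 0.875)])]
```

Decoding to cell centers means a recovered corner can differ from the original by up to half a cell width, which is one reason the discretization granularity deserves an explicit definition.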
Simulated Author's Rebuttal
We are grateful to the referee for the constructive feedback. We respond point-by-point to the major comments below and indicate the planned revisions.
- Referee: [Abstract] Abstract: The statement that the work 'lays out the foundation for the development of Embodiment AI' and constitutes 'a key step toward artificial general intelligence' via 'the big convergence of language, multimodal perception, action, and world modeling' is not supported by the listed evaluations ((i) referring expression comprehension/phrase grounding, (ii) referring expression generation, (iii) perception-language tasks, (iv) language understanding/generation). No action, planning, navigation, or interaction tasks are described, so the convergence claim remains aspirational.
  Authors: We agree that the current evaluations are confined to perception-language tasks and do not include action, planning, or navigation experiments. The abstract statement is intended to highlight how the introduced grounding capability provides a necessary foundation for future integration with action and world modeling. To prevent any overstatement, we will revise the abstract to more precisely delineate the present contributions while framing the convergence as a direction for subsequent research.
  Revision: yes
- Referee: [Evaluation] Evaluation sections: The manuscript must supply quantitative results, baselines, and ablations for the grounding tasks (e.g., comparison of location-token accuracy against prior phrase-grounding methods). Without these, it is impossible to verify whether the Markdown-link representation adds grounding capability without degrading other MLLM performance.
  Authors: The manuscript already reports quantitative results on referring expression comprehension and phrase grounding, together with comparisons to prior methods on standard benchmarks. We recognize that additional ablations isolating the location-token representation and Markdown-link format would further clarify its incremental benefit and confirm preservation of other capabilities. We will therefore add these ablations and expanded baseline analyses in the revised version.
  Revision: partial
Circularity Check
No circularity in derivation chain; claims are empirical
The paper describes an empirical construction: referring expressions are represented as Markdown links to location-token sequences, GrIT data is built from multimodal corpora, and Kosmos-2 is trained on this data plus existing MLLM corpora. All listed evaluations (referring expression comprehension, phrase grounding, referring expression generation, perception-language tasks, language understanding) are direct experimental outcomes of this training. The Embodiment-AI foundation claim is a forward-looking statement with no equations, uniqueness theorems, or self-citations that reduce the result to its inputs by construction. No self-definitional, fitted-prediction, or ansatz-smuggling patterns appear; the work is self-contained against its own benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Multimodal corpora contain extractable grounded image-text pairs suitable for training
invented entities (2)
- GrIT dataset: no independent evidence
- Location tokens: no independent evidence
Forward citations
Cited by 42 Pith papers
- CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence
  CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed mod...
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
  Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
- Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
  INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
- RevealLayer: Disentangling Hidden and Visible Layers via Occlusion-Aware Image Decomposition
  RevealLayer decomposes natural images into multiple RGBA layers using diffusion models with region-aware attention, occlusion-guided adaptation, and a composite loss, outperforming prior methods on a new benchmark dataset.
- GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning
  GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.
- Benchmarking Layout-Guided Diffusion Models through Unified Semantic-Spatial Evaluation in Closed and Open Settings
  Introduces closed-set C-Bench and open-set O-Bench for layout-guided diffusion models, a unified semantic-spatial scoring protocol, and ranks six models after generating and evaluating 319,086 images.
- MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models
  MMR-AD is a new benchmark dataset showing that current generalist MLLMs lag industrial needs for anomaly detection, with Anomaly-R1 delivering better results through reasoning and RL.
- STORM: End-to-End Referring Multi-Object Tracking in Videos
  STORM is an end-to-end MLLM for referring multi-object tracking that uses task-composition learning to leverage sub-task data and introduces the STORM-Bench dataset, achieving SOTA results.
- MARINER: A 3E-Driven Benchmark for Fine-Grained Perception and Complex Reasoning in Open-Water Environments
  MARINER is a new benchmark dataset and evaluation framework for fine-grained perception and causal reasoning in open-water scenes using 16,629 images across 63 vessel categories, diverse environments, and maritime incidents.
- OmniSch: A Multimodal PCB Schematic Benchmark For Structured Diagram Visual Reasoning
  OmniSch is the first benchmark exposing gaps in LMMs for PCB schematic visual grounding, topology-to-graph parsing, geometric weighting, and tool-augmented reasoning.
- Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
  Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
- 3D-VLA: A 3D Vision-Language-Action Generative World Model
  3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.
- SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
  SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.
- Investigating Anisotropy in Visual Grounding under Controlled Counterfactual Perturbations
  Controlled counterfactual perturbations reveal no correlation between embedding cosine similarity and approximation behavior in two visual grounding models.
- LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
  LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and ...
- AlbumFill: Album-Guided Reasoning and Retrieval for Personalized Image Completion
  AlbumFill retrieves identity-consistent references from personal albums via VLM-inferred semantic cues to support personalized image completion.
- CoNewsReader: Supporting Comprehensive Understanding and Raising Critical Thoughts on Social Media News Through Comments
  CoNewsReader integrates user comments with an LLM to improve critical news reading on social media, with a 24-participant study showing gains in comprehension and critical thinking over baseline interfaces.
- Agentic AI for Remote Sensing: Technical Challenges and Research Directions
  Agentic AI faces structural challenges in remote sensing due to geospatial data properties and workflow constraints, requiring EO-native agents built around structured state, tool-aware reasoning, and validity-aware e...
- Latent Denoising Improves Visual Alignment in Large Multimodal Models
  A latent denoising objective with saliency-aware corruption and contrastive distillation improves visual alignment and corruption robustness in large multimodal models.
- G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval
  G-MIXER achieves state-of-the-art zero-shot composed image retrieval by using geodesic mixup to build diverse implicit candidates and MLLM-derived explicit semantics for re-ranking.
- CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
  CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
- Perception Encoder: The best visual embeddings are not at the output of the network
  Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, a...
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
  InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
- ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
  A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.
- Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
  Qwen-Audio trains a unified model on diverse audio and tasks with hierarchical tags to enable strong zero-shot performance on audio understanding benchmarks and multi-turn audio chat.
- Retentive Network: A Successor to Transformer for Large Language Models
  RetNet is a new sequence modeling architecture that delivers parallel training, constant-time inference, and competitive language modeling performance as a potential replacement for Transformers.
- CoNewsReader: Supporting Comprehensive Understanding and Raising Critical Thoughts on Social Media News Through Comments
  CoNewsReader leverages comments and LLMs to support critical news reading on social media, with a within-subjects study of 24 students showing more engaging experiences and better comprehension and critical thought pe...
- DIAGRAMS: A Review Framework for Reasoning-Level Attribution in Diagram QA
  DIAGRAMS introduces a schema-driven annotation tool that proposes reasoning-level evidence regions for Diagram QA pairs and reports 85.39% precision and 75.30% recall against human final selections on six datasets.
- Agentic AI for Remote Sensing: Technical Challenges and Research Directions
  Agentic AI for remote sensing requires new designs centered on structured geospatial state, tool-aware reasoning, verifier-guided execution, and physical validity rather than generic extensions.
- Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation
  PND mitigates object hallucination in vision-language models via dual-path contrastive decoding that boosts visual evidence and penalizes linguistic priors, yielding up to 6.5% gains on POPE, MME, and CHAIR benchmarks.
- AFMRL: Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning in E-commerce
  AFMRL uses MLLM-generated attributes in attribute-guided contrastive learning and retrieval-aware reinforcement to achieve SOTA fine-grained multimodal retrieval on e-commerce datasets.
- Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation
  Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.
- A Patch-based Cross-view Regularized Framework for Backdoor Defense in Multimodal Large Language Models
  A patch-augmented cross-view regularization method reduces backdoor attack success rates in multimodal LLMs by enforcing output differences between original and perturbed views while using entropy constraints to prese...
- SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model
  SpatialVLA adds 3D-aware position encoding and adaptive discretized action grids to visual-language-action models, enabling strong zero-shot performance and fine-tuning on new robot setups after pre-training on 1.1 mi...
- DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
  DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B a...
- MiniCPM-V: A GPT-4V Level MLLM on Your Phone
  MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
  InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
- Seed1.5-VL Technical Report
  Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
- SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
  SigLIP 2 models trained with a unified recipe of captioning, self-supervised losses, and curated diverse data outperform prior SigLIP versions on classification, retrieval, localization, dense prediction, and multilin...
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
  InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
- LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation
  This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challe...
- Scrapyard AI
  Obsolete AI models left behind by rapid development can be repurposed like scrap materials to analyze and communicate the environmental and social effects of global mining.