pith. machine review for the scientific record.

arxiv: 1707.05612 · v4 · submitted 2017-07-18 · 💻 cs.LG · cs.CL · cs.CV

Recognition: unknown

VSE++: Improving Visual-Semantic Embeddings with Hard Negatives

classification 💻 cs.LG cs.CL cs.CV
keywords retrieval, embeddings, hard, approach, functions, loss, methods, ms-coco

We present a new technique for learning visual-semantic embeddings for cross-modal retrieval. Inspired by hard negative mining, the use of hard negatives in structured prediction, and ranking loss functions, we introduce a simple change to common loss functions used for multi-modal embeddings. That, combined with fine-tuning and use of augmented data, yields significant gains in retrieval performance. We showcase our approach, VSE++, on MS-COCO and Flickr30K datasets, using ablation studies and comparisons with existing methods. On MS-COCO our approach outperforms state-of-the-art methods by 8.8% in caption retrieval and 11.3% in image retrieval (at R@1).
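The abstract does not spell out the "simple change" to the loss. The sketch below is one plausible reading, under the assumption that a common ranking loss sums hinge terms over all in-batch negatives and that the hard-negative variant keeps only the highest-scoring (hardest) negative for each image and each caption. The function names, the cosine similarity, and the margin of 0.2 are illustrative assumptions, not values taken from this page.

```python
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)


def similarity_matrix(images, captions) -> List[List[float]]:
    """scores[i][j] is the similarity of image i and caption j.

    Assumes image i and caption i form the matching (positive) pair.
    """
    return [[cosine(im, cap) for cap in captions] for im in images]


def sum_of_hinges(scores: List[List[float]], margin: float = 0.2) -> float:
    """Baseline ranking loss: every in-batch negative contributes a hinge term."""
    n = len(scores)
    total = 0.0
    for i in range(n):
        pos = scores[i][i]
        for j in range(n):
            if j == i:
                continue
            # Caption j used as a negative for image i.
            total += max(0.0, margin + scores[i][j] - pos)
            # Image j used as a negative for caption i.
            total += max(0.0, margin + scores[j][i] - pos)
    return total


def max_of_hinges(scores: List[List[float]], margin: float = 0.2) -> float:
    """Hard-negative variant (assumed): only the hardest negative contributes.

    Requires a batch of at least two pairs so that a negative exists.
    """
    n = len(scores)
    total = 0.0
    for i in range(n):
        pos = scores[i][i]
        hardest_caption = max(scores[i][j] for j in range(n) if j != i)
        hardest_image = max(scores[j][i] for j in range(n) if j != i)
        total += max(0.0, margin + hardest_caption - pos)
        total += max(0.0, margin + hardest_image - pos)
    return total
```

Because each hinge term is non-negative, the hard-negative loss in this sketch never exceeds the summed loss on the same batch; it concentrates the gradient on the single most confusable negative per query instead of spreading it over many easy ones.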

This paper has not been read by Pith yet.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Federated Cross-Modal Retrieval with Missing Modalities via Semantic Routing and Adapter Personalization

    cs.CV 2026-04 unverdicted novelty 6.0

    RCSR is a personalization-friendly federated framework that improves cross-modal retrieval accuracy and stability under missing modalities via semantic routing and adapters.

  2. PandaGPT: One Model To Instruction-Follow Them All

    cs.CL 2023-05 conditional novelty 6.0

    A single model trained only on image-text pairs gains instruction-following ability across images, video, and audio by routing all modalities through ImageBind's shared embedding space into Vicuna.

  3. Sketch and Text Synergy: Fusing Structural Contours and Descriptive Attributes for Fine-Grained Image Retrieval

    cs.CV 2026-04 unverdicted novelty 5.0

    STBIR fuses sketches and text via curriculum robustness, category optimization, and staged alignment to outperform prior methods on a new fine-grained benchmark dataset.