arxiv: 1810.12440 · v2 · pith:SGPJ6K4Snew · submitted 2018-10-29 · 💻 cs.CV

TallyQA: Answering Complex Counting Questions

classification 💻 cs.CV
keywords counting, questions, tallyqa, answering, complex, networks, relation, algorithm

Most counting questions in visual question answering (VQA) datasets are simple and require no more than object detection. Here, we study algorithms for complex counting questions that involve relationships between objects, attribute identification, reasoning, and more. To do this, we created TallyQA, the world's largest dataset for open-ended counting. We propose a new algorithm for counting that uses relation networks with region proposals. Our method lets relation networks be efficiently used with high-resolution imagery. It yields state-of-the-art results compared to baseline and recent systems on both TallyQA and the HowMany-QA benchmark.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding

    cs.CV 2025-04 unverdicted novelty 7.0

    FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperfor...

  2. DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory

    cs.CV 2023-08 unverdicted novelty 6.0

    DragNUWA integrates text, image, and trajectory controls into a diffusion video model using a Trajectory Sampler, Multiscale Fusion, and Adaptive Training to enable fine-grained open-domain video generation.