TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
Pith reviewed 2026-05-11 14:54 UTC · model grok-4.3
The pith
The paper introduces TriviaQA, a distantly supervised dataset of 95,000 trivia questions with evidence documents, on which the stronger of two baseline models reaches only 40 percent accuracy against 80 percent for humans.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present TriviaQA, a challenging reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions. We show that, in comparison to other recently introduced large-scale datasets, TriviaQA has relatively complex, compositional questions, has considerable syntactic and lexical variability between questions and corresponding answer-evidence sentences, and requires more cross sentence reasoning to find answers. Neither approach comes close to human performance (23% and 40% vs. 80%).
What carries the argument
The TriviaQA collection of question-answer pairs, each paired with independently gathered evidence documents (six per question on average), which supplies distant supervision while the questions themselves demand compositional and cross-sentence reasoning.
If this is right
- Systems must improve compositional and cross-sentence reasoning to reach high accuracy on TriviaQA.
- Distant supervision from multiple evidence documents can serve as a scalable training signal for reading-comprehension models.
- Lexical and syntactic variability between questions and evidence must be explicitly modeled to close the performance gap.
- Future benchmarks should incorporate similar independently sourced evidence to maintain difficulty.
Where Pith is reading between the lines
- The dataset's construction method could be replicated for other domains to create distant-supervision resources without manual annotation.
- Strong performance on TriviaQA may transfer to downstream tasks that require integrating scattered facts, such as multi-hop question answering.
- The observed gap invites investigation of whether hybrid feature-neural architectures can narrow it faster than either approach alone.
Load-bearing premise
The independently gathered evidence documents supply high-quality distant supervision that is sufficient to answer the questions.
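One way to probe this premise is to measure how often a question's answer string actually occurs in its evidence. The sketch below is a minimal illustration, not the paper's procedure: the normalization rules (lowercasing, stripping punctuation and English articles), the helper names, and the toy records are all assumptions made for the example.

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace (assumed rules)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def answer_in_docs(answer, docs, normalized=False):
    """True if the answer string occurs in at least one evidence document."""
    if normalized:
        answer = normalize(answer)
        docs = [normalize(d) for d in docs]
    # Plain substring test: it can over-match inside longer words.
    return any(answer in d for d in docs)

def answer_coverage(examples, normalized=False):
    """Fraction of examples whose answer appears in at least one of its documents."""
    hits = sum(answer_in_docs(ex["answer"], ex["docs"], normalized) for ex in examples)
    return hits / len(examples)

# Hypothetical toy records, not taken from TriviaQA.
examples = [
    {"answer": "The Beatles", "docs": ["the beatles formed in Liverpool.", "Unrelated text."]},
    {"answer": "Paris", "docs": ["Paris is the capital of France."]},
    {"answer": "Mount Everest", "docs": ["K2 is the second-highest peak."]},
]
exact = answer_coverage(examples)                   # case-sensitive exact match
loose = answer_coverage(examples, normalized=True)  # match after normalization
```

A large gap between the two figures would suggest that surface variation, rather than missing evidence, explains many exact-match misses. Neither figure shows that the matched passage actually supports the answer, which is why a manual check on a sample would still be needed.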
What would settle it
A single model that reaches near 80 percent accuracy on the held-out TriviaQA test set while using only the provided evidence documents would falsify the claim that the dataset remains a significant challenge.
Original abstract
We present TriviaQA, a challenging reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions. We show that, in comparison to other recently introduced large-scale datasets, TriviaQA (1) has relatively complex, compositional questions, (2) has considerable syntactic and lexical variability between questions and corresponding answer-evidence sentences, and (3) requires more cross sentence reasoning to find answers. We also present two baseline algorithms: a feature-based classifier and a state-of-the-art neural network, that performs well on SQuAD reading comprehension. Neither approach comes close to human performance (23% and 40% vs. 80%), suggesting that TriviaQA is a challenging testbed that is worth significant future study. Data and code available at -- http://nlp.cs.washington.edu/triviaqa/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TriviaQA, a large-scale reading comprehension dataset with 95K trivia questions and over 650K question-answer-evidence triples. Evidence documents (six per question on average) are gathered independently via web search to provide distant supervision. The authors claim the dataset features more complex, compositional questions than prior resources like SQuAD, with greater syntactic/lexical variability between questions and answer sentences plus a higher requirement for cross-sentence reasoning. Two baselines are evaluated—a feature-based classifier and a neural model adapted from SQuAD—achieving 23% and 40% respectively against 80% human performance, positioning TriviaQA as a challenging benchmark.
Significance. If the distant supervision quality holds and the variability/reasoning claims are substantiated, TriviaQA would be a significant addition to RC benchmarks by emphasizing multi-document settings, noisy evidence, and compositional reasoning. The public release of data and code, plus the clear performance gap, would usefully drive future model development beyond single-paragraph SQuAD-style tasks.
major comments (3)
- [Dataset Construction] Dataset Construction section: the central claim that TriviaQA provides 'high quality distant supervision' and is challenging due to reasoning demands (rather than label noise) requires explicit validation that answer spans are present in the independently gathered evidence documents. The manuscript should report the fraction of questions for which the answer appears in at least one of the six documents (via exact string match or normalized matching) and describe any manual verification on a sample; without this, the 23%/40% baseline scores cannot be confidently attributed to syntactic variability or cross-sentence reasoning.
- [Analysis] Analysis section (comparison to SQuAD and other datasets): the claims of 'considerable syntactic and lexical variability' and 'more cross sentence reasoning' are load-bearing for the 'challenging testbed' conclusion, yet the abstract and provided details give no concrete quantification method (e.g., no mention of dependency-parse distance, n-gram overlap statistics, or manual annotation protocol for reasoning hops). These metrics must be defined and reported with inter-annotator agreement if manual.
- [Baselines] Baselines section: the neural baseline is described as 'state-of-the-art' on SQuAD, but implementation details are needed on how the multi-document evidence is handled (e.g., concatenation strategy, truncation, or per-document scoring) since this directly affects whether the 40% result reflects the dataset's claimed difficulty or an incomplete adaptation.
minor comments (2)
- [Abstract] Abstract: the phrase 'six per question on average' should be accompanied by the exact mean and standard deviation of evidence documents per question for precision.
- [Dataset] The paper should include a small table or paragraph in the Dataset section reporting basic statistics on question length, answer type distribution, and evidence document lengths to aid reproducibility.
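The second major comment asks for a concrete measure of lexical variability, and n-gram overlap is one of the options it names. The following is a rough sketch of such a statistic, assuming whitespace tokenization and lowercasing; the function names and example sentences are hypothetical and are not drawn from the paper or the dataset.

```python
def ngrams(tokens, n):
    """Set of contiguous n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(question, sentence, n=1):
    """Share of the question's distinct n-grams that also occur in the sentence."""
    q = ngrams(question.lower().split(), n)
    s = ngrams(sentence.lower().split(), n)
    return len(q & s) / len(q) if q else 0.0

# Hypothetical question and two candidate answer sentences.
question = "who wrote the novel dracula"
close = "bram stoker wrote the novel dracula in 1897"
distant = "dracula is an 1897 gothic horror book by bram stoker"

unigram_close = ngram_overlap(question, close, n=1)
unigram_distant = ngram_overlap(question, distant, n=1)
bigram_close = ngram_overlap(question, close, n=2)
bigram_distant = ngram_overlap(question, distant, n=2)
```

Averaging such scores over question and answer-sentence pairs in two datasets would give one comparable number per dataset. Lower overlap would be consistent with the variability claim, though it would not by itself establish a need for cross-sentence reasoning.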
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each of the major comments below and will revise the paper to incorporate the suggested improvements.
-
Referee: [Dataset Construction] Dataset Construction section: the central claim that TriviaQA provides 'high quality distant supervision' and is challenging due to reasoning demands (rather than label noise) requires explicit validation that answer spans are present in the independently gathered evidence documents. The manuscript should report the fraction of questions for which the answer appears in at least one of the six documents (via exact string match or normalized matching) and describe any manual verification on a sample; without this, the 23%/40% baseline scores cannot be confidently attributed to syntactic variability or cross-sentence reasoning.
Authors: We agree that providing explicit statistics on answer span presence is important for validating the quality of distant supervision. Although the manuscript emphasizes the independent gathering of evidence via web search, we will add to the Dataset Construction section the requested fraction of questions where the answer appears in at least one document, computed via both exact string match and normalized matching. We will also describe the manual verification performed on a random sample of questions to confirm the presence and relevance of answers. This addition will strengthen the claim of high-quality supervision. revision: yes
-
Referee: [Analysis] Analysis section (comparison to SQuAD and other datasets): the claims of 'considerable syntactic and lexical variability' and 'more cross sentence reasoning' are load-bearing for the 'challenging testbed' conclusion, yet the abstract and provided details give no concrete quantification method (e.g., no mention of dependency-parse distance, n-gram overlap statistics, or manual annotation protocol for reasoning hops). These metrics must be defined and reported with inter-annotator agreement if manual.
Authors: The Analysis section provides qualitative and some quantitative comparisons to SQuAD, but we acknowledge that more explicit quantification methods are needed to support the claims. In the revised manuscript, we will define and report concrete metrics such as average n-gram overlap between questions and answer sentences, syntactic variability measured via dependency parse distances, and the proportion of questions requiring cross-sentence reasoning based on a manually annotated sample with reported inter-annotator agreement. This will make the claims more rigorous. revision: yes
-
Referee: [Baselines] Baselines section: the neural baseline is described as 'state-of-the-art' on SQuAD, but implementation details are needed on how the multi-document evidence is handled (e.g., concatenation strategy, truncation, or per-document scoring) since this directly affects whether the 40% result reflects the dataset's claimed difficulty or an incomplete adaptation.
Authors: We appreciate this point as it clarifies how the baseline was adapted to the multi-document setting. The neural model processes the evidence by concatenating the top-ranked documents up to the maximum sequence length, applying truncation where necessary, and selecting the highest scoring answer span across documents. We will include these specific implementation details in the Baselines section of the revised version to allow full reproducibility and proper interpretation of the results. revision: yes
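The multi-document handling described in the last response can be sketched as follows. This is an illustration of that description only, and the rebuttal is itself simulated, so it should not be read as the authors' implementation. The whitespace tokenizer, the function names, and the span scores standing in for a reader model's output are assumptions.

```python
def build_context(ranked_docs, max_tokens):
    """Concatenate documents in rank order, truncating once max_tokens is reached."""
    context = []
    for doc in ranked_docs:
        remaining = max_tokens - len(context)
        if remaining <= 0:
            break
        # Keep only as many tokens of this document as still fit.
        context.extend(doc.split()[:remaining])
    return context

def best_span(candidates):
    """Pick the highest-scoring (span_text, score) pair; None if there are no candidates."""
    return max(candidates, key=lambda c: c[1], default=None)

# Hypothetical ranked documents; single letters stand in for tokens.
ranked_docs = ["a b c d", "e f g", "h i"]
context = build_context(ranked_docs, max_tokens=6)

# Hypothetical span scores standing in for a reader model's output.
candidates = [("bram stoker", 0.71), ("mary shelley", 0.12), ("stoker", 0.64)]
winner = best_span(candidates)
```

Under this scheme any answer that occurs only beyond the truncation point is unreachable, which is the referee's concern: part of the gap to human performance could come from the adaptation rather than from the dataset.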
Circularity Check
No circularity in dataset construction or baseline evaluation
The paper introduces TriviaQA by collecting 95K trivia questions and independent evidence documents (six per question on average), then compares it directly to prior datasets and runs standard baselines (a feature-based classifier and a neural network) that achieve 23% and 40% versus 80% for humans. No equations, parameter fittings presented as predictions, self-citations that bear central claims, or ansatzes are involved. The claim that TriviaQA is a challenging testbed follows from the reported performance gap on the collected data without any reduction to self-defined quantities or imported uniqueness theorems. The work is self-contained as a data release plus empirical baselines.
Forward citations
Cited by 49 Pith papers
-
Online Learning-to-Defer with Varying Experts
Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.
-
Language Models are Few-Shot Learners
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
-
Passage Re-ranking with BERT
Fine-tuning BERT for query-passage relevance classification achieves state-of-the-art results on TREC-CAR and MS MARCO, with a 27% relative gain in MRR@10 over prior methods.
-
Task-Aware Calibration: Provably Optimal Decoding in LLMs
Task calibration aligns LLM distributions in latent task spaces to make MBR decoding provably optimal and improve generation quality.
-
VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...
-
XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation
XGRAG uses graph perturbations to quantify component contributions in GraphRAG and achieves 14.81% better explanation quality than text-based baselines on QA datasets, with correlations to graph centrality.
-
HaS: Accelerating RAG through Homology-Aware Speculative Retrieval
HaS accelerates RAG retrieval via homology-aware speculative retrieval and homologous query re-identification validation, cutting latency 24-37% with 1-2% accuracy drop on tested datasets.
-
Remask, Don't Replace: Token-to-Mask Refinement in Diffusion Large Language Models
Token-to-Mask remasking improves self-correction in diffusion LLMs by resetting erroneous commitments to masks rather than overwriting them, yielding +13.33 points on AIME 2025 and +8.56 on CMATH.
-
LoSA: Locality Aware Sparse Attention for Block-Wise Diffusion Language Models
LoSA caches prefix attention for stable tokens in block-wise DLMs and applies sparse attention only to active tokens, preserving near-dense accuracy while achieving 1.54x lower attention density and up to 4.14x speedup.
-
PolyReal: A Benchmark for Real-World Polymer Science Workflows
PolyReal benchmark shows leading MLLMs perform well on polymer knowledge reasoning but drop sharply on practical tasks like lab safety analysis and raw data extraction.
-
Do We Still Need GraphRAG? Benchmarking RAG and GraphRAG for Agentic Search Systems
Agentic search narrows the gap between dense RAG and GraphRAG but does not remove GraphRAG's advantage on complex multi-hop reasoning.
-
Group-in-Group Policy Optimization for LLM Agent Training
GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while ...
-
Moshi: a speech-text foundation model for real-time dialogue
Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
-
M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual,...
-
Multitask Prompted Training Enables Zero-Shot Task Generalization
Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.
-
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
-
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...
-
APCD: Adaptive Path-Contrastive Decoding for Reliable Large Language Model Generation
APCD reduces LLM hallucinations by expanding decoding paths adaptively when entropy signals uncertainty and by contrasting divergent paths to control their interaction.
-
Reformulating KV Cache Eviction Problem for Long-Context LLM Inference
LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.
-
Estimating the Black-box LLM Uncertainty with Distribution-Aligned Adversarial Distillation
DisAAD trains a 1%-sized proxy model via adversarial distillation to quantify uncertainty in black-box LLMs by aligning with their output distributions.
-
T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning
T²PO improves stability and performance in multi-turn agentic RL by using uncertainty dynamics at token and turn levels to guide exploration and avoid wasted rollouts.
-
Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting
Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.
-
PRAG: End-to-End Privacy-Preserving Retrieval-Augmented Generation
PRAG delivers end-to-end private RAG with 72-74% recall via non-interactive homomorphic approximations, interactive client assistance, and operation-error estimation to preserve ranking quality.
-
Mixture of Heterogeneous Grouped Experts for Language Modeling
MoHGE achieves standard MoE performance with 20% fewer parameters and balanced GPU utilization via grouped heterogeneous experts, two-level routing, and specialized auxiliary losses.
-
How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals
LLMs implement a second-order confidence architecture where the PANL activation encodes both error likelihood and the ability to correct it, beyond verbal confidence or log-probabilities.
-
SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference
SparKV reduces time-to-first-token by 1.3x-5.1x and energy use by 1.5x-3.3x for on-device LLM inference by adaptively choosing between cloud KV streaming and local computation while overlapping execution and adjusting...
-
Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text
POP bootstraps post-training signals for open-ended LLM tasks by synthesizing rubrics during self-play on pretraining corpus, yielding performance gains on Qwen-2.5-7B across healthcare QA, creative writing, and instr...
-
Unsupervised Confidence Calibration for Reasoning LLMs from a Single Generation
Unsupervised single-generation confidence calibration for reasoning LLMs via offline self-consistency proxy distillation outperforms baselines on math and QA tasks and improves selective prediction.
-
Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification
Cross-model semantic disagreement adds an epistemic uncertainty term that improves total uncertainty estimation over self-consistency alone, helping flag confident errors in LLMs.
-
BackFlush: Knowledge-Free Backdoor Detection and Elimination with Watermark Preservation in Large Language Models
BackFlush detects backdoors via susceptibility amplification and eliminates them with RoPE unlearning to reach 1% ASR and 99% clean accuracy while preserving watermarks.
-
Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts
Loss-based pruning of training data to limit facts and flatten their frequency distribution enables a 110M-parameter GPT-2 model to memorize 1.3 times more entity facts than standard training, matching a 1.3B-paramete...
-
Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing
Stochastic training with random cross-layer KV attention enables depth-wise cache sharing in transformers, cutting memory footprint while preserving or improving performance.
-
LLaDA2.0: Scaling Up Diffusion Language Models to 100B
LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.
-
Kimi Linear: An Expressive, Efficient Attention Architecture
Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
-
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
-
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
-
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
-
Towards Understanding Sycophancy in Language Models
Sycophancy is prevalent in state-of-the-art AI assistants and is likely driven in part by human preferences that favor agreement over truthfulness.
-
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.
-
Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation
Semantic entropy improves uncertainty estimation in natural language generation by incorporating semantic equivalences, outperforming standard entropy baselines on predicting model accuracy for question answering.
-
ST-MoE: Designing Stable and Transferable Sparse Expert Models
ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...
-
FreezeEmpath: Efficient Training for Empathetic Spoken Chatbots with Frozen LLMs
FreezeEmpath achieves emotionally expressive speech output and strong performance on empathetic dialogue, speech emotion recognition, and spoken QA tasks by training with a frozen LLM on existing speech datasets.
-
Kimi K2: Open Agentic Intelligence
Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.
-
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
-
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
DeepSeekMoE 2B matches GShard 2.9B performance and approaches a dense 2B model; the 16B version matches LLaMA2-7B at 40% compute by using fine-grained expert segmentation plus shared experts.
-
Ministral 3
Ministral 3 releases 3B/8B/14B parameter-efficient language models with base, instruction, and reasoning variants derived via iterative pruning and distillation, including image understanding capabilities.
-
Gemma: Open Models Based on Gemini Research and Technology
Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.
-
A Reproducibility Study of Metacognitive Retrieval-Augmented Generation
MetaRAG is only partially reproducible with lower absolute scores than originally reported, gains substantially from reranking, and shows greater robustness than SIM-RAG under extended retrieval features.
-
Gemma 2: Improving Open Language Models at a Practical Size
Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.