{"work":{"id":"3c5a795a-85bd-45a1-8e65-06a9c5641111","openalex_id":"https://openalex.org/W2889787757","doi":"10.18653/v1/d18-1259","arxiv_id":null,"raw_key":null,"title":"HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering","authors":[{"given":"Zhilin","family":"Yang","sequence":"first","affiliation":[]},{"given":"Peng","family":"Qi","sequence":"additional","affiliation":[]},{"given":"Saizheng","family":"Zhang","sequence":"additional","affiliation":[]},{"given":"Yoshua","family":"Bengio","sequence":"additional","affiliation":[]},{"given":"William","family":"Cohen","sequence":"additional","affiliation":[]},{"given":"Ruslan","family":"Salakhutdinov","sequence":"additional","affiliation":[]},{"given":"Christopher D.","family":"Manning","sequence":"additional","affiliation":[]}],"authors_text":"Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William and Salakhutdinov, Ruslan and Manning, Christopher D","year":2018,"venue":"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing","abstract":null,"external_url":"https://doi.org/10.18653/v1/d18-1259","cited_by_count":739,"metadata_source":"doi_reference","metadata_fetched_at":"2026-06-29T04:43:06.452895+00:00","pith_arxiv_id":null,"created_at":"2026-05-08T17:13:36.997258+00:00","updated_at":"2026-06-29T04:43:06.452895+00:00","title_quality_ok":true,"display_title":"Cohen and Ruslan Salakhutdinov and Christopher D","render_title":"Cohen and Ruslan Salakhutdinov and Christopher D"},"hub":{"state":{"work_id":"3c5a795a-85bd-45a1-8e65-06a9c5641111","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external 
citations","pith_inbound_count":88,"external_cited_by_count":739,"distinct_field_count":9,"first_pith_cited_at":"2023-05-23T16:49:14+00:00","last_pith_cited_at":"2026-06-26T12:46:27+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-29T06:58:26.727083+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":2},{"context_role":"dataset","n":2}],"polarity_counts":[{"context_polarity":"background","n":2},{"context_polarity":"use_dataset","n":2}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-06-28T14:48:04.246924+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension","work_id":"d05a9c57-9d88-473a-aa65-efb13f9dee25","shared_citers":22},{"title":"Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps","work_id":"fd3a2c44-aeea-48a7-b162-d3f0c6d43f35","shared_citers":21},{"title":"Dense Passage Retrieval for Open-Domain Question Answering","work_id":"083391f8-812d-430f-8d08-89a03031ce6c","shared_citers":20},{"title":"Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov","work_id":"45551929-96dc-40f3-9f89-10e76731cc24","shared_citers":17},{"title":"♫ M u S i Q ue: Multihop Questions via Single-hop Question Composition","work_id":"dd4f6eb0-477f-42c4-8d8a-8c8637815f98","shared_citers":15},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":14},{"title":"Lost in the Middle: How Language Models Use Long Contexts","work_id":"37c05e13-4a24-44f8-a1c4-da1bbe7223aa","shared_citers":12},{"title":"SQ u AD : 100,000+ questions for machine comprehension of 
text","work_id":"8e6a63f7-90ad-4b5e-8493-c26145f74b69","shared_citers":12},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":11},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":11},{"title":"When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories","work_id":"29f5140a-4d58-413f-8205-4e2d6e7cab77","shared_citers":11},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":10},{"title":"Efficient Memory Management for Large Language Model Serving with PagedAttention , booktitle =","work_id":"1b10f2a9-a178-4d23-97fb-8db2354c7e6c","shared_citers":10},{"title":"FEVER: a large-scale dataset for Fact Extraction and VERification","work_id":"b696f75f-e5ad-4555-9c12-e292e77c388f","shared_citers":9},{"title":"Qwen2.5 Technical Report","work_id":"d8432992-4980-4a81-85c7-9fa2c2b87f85","shared_citers":9},{"title":"and Lewis, Mike , editor =","work_id":"d690fac4-0cde-42d6-958a-77a77c0e7bd0","shared_citers":8},{"title":"BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"3e3c8ac8-b858-4b22-af32-393d98c883e0","shared_citers":7},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":7},{"title":"Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions","work_id":"26cdd617-acab-4f25-9b38-a106fd3bf382","shared_citers":7},{"title":"Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang","work_id":"9c9c91f9-a321-4e33-8897-d66f6ad0f659","shared_citers":7},{"title":"Retrieval-Augmented Generation for Large Language Models: A Survey","work_id":"b80d2790-6cd9-4c87-b3c4-de404f99a80e","shared_citers":7},{"title":"Robertson and Hugo Zaragoza , title 
=","work_id":"3dfaa21d-3751-420b-84f7-aeceda058b63","shared_citers":7},{"title":"Sentence- BERT : Sentence Embeddings using S iamese BERT -Networks","work_id":"cf07889b-1f35-4d81-9514-4ad3ed223c57","shared_citers":7},{"title":"Chandra and Dexter C","work_id":"c3270592-bd69-4213-95e1-4aaf8312be9b","shared_citers":6}],"time_series":[{"n":1,"year":2023},{"n":5,"year":2024},{"n":3,"year":2025},{"n":69,"year":2026}],"dependency_candidates":[{"n":1,"role":"dataset","polarity":"use_dataset","paper_title":"A$^2$TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping","primary_cat":"cs.CL","context_text":"88 + TG-Norm 47.24 50.17 22.68 52.40 46.27 + TG-Norm +D t-rescaling 47.94 50.54 22.77 52.00 46.71 + TG-Norm +D t-rescaling + Ada-Clipping(A 2TGPO) 49.42 51.29 25.21 53.60 48.06 both training and evaluation. Seven open-domain question answering benchmarks are used, or- ganized into two groups by reasoning depth.Multi-hopbenchmarks consist of HotpotQA [ 28], 2WikiMultihopQA [29], MuSiQue [30], and Bamboogle [31].Single-hopbenchmarks consist of Natural Questions (NQ) [ 32], TriviaQA [ 33], and PopQA [ 34]. We train and evaluate on three backbones: Qwen3-4B, Qwen3-8B, and Qwen2.5-7B. We reportExact Match (EM)as the primary metric on every benchmark as well as the average accuracy across all evaluation samples.","citing_arxiv_id":"2605.06200"},{"n":1,"role":"dataset","polarity":"use_dataset","paper_title":"Retrieval from Within: An Intrinsic Capability of Attention-Based Models","primary_cat":"cs.LG","context_text":"requiring no additional compressor or compression-specific training (distinct from latent-compression approaches [11]). We find that small values such as Lp ∈ {3,5,7} substantially reduce MaxSim cost while preserving the shared-representation design. 4 Benchmarks and Experimental Setup We evaluate INTRA on four Wikipedia-based QA benchmarks: HotPotQA [38], 2WikiMultihopQA [12], MuSiQue [34], and Natural Questions [19]. 
Together they span bridge and comparison reasoning, cleaner two-hop evidence chains, compositionally harder multi-hop questions, and single-hop open- domain QA. We build one shared retrieval candidate pool for all four benchmarks under a fixed budget of approximately 100M tokens, containing 759K chunks in total.","citing_arxiv_id":"2605.05806"}]},"error":null,"updated_at":"2026-06-28T14:47:58.300575+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-06-28T14:47:58.217810+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Cohen and Ruslan Salakhutdinov and Christopher D","claims":[{"claim_text":"88 + TG-Norm 47.24 50.17 22.68 52.40 46.27 + TG-Norm +D t-rescaling 47.94 50.54 22.77 52.00 46.71 + TG-Norm +D t-rescaling + Ada-Clipping(A 2TGPO) 49.42 51.29 25.21 53.60 48.06 both training and evaluation. Seven open-domain question answering benchmarks are used, or- ganized into two groups by reasoning depth.Multi-hopbenchmarks consist of HotpotQA [ 28], 2WikiMultihopQA [29], MuSiQue [30], and Bamboogle [31].Single-hopbenchmarks consist of Natural Questions (NQ) [ 32], TriviaQA [ 33], and PopQ","claim_type":"dataset","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"requiring no additional compressor or compression-specific training (distinct from latent-compression approaches [11]). We find that small values such as Lp ∈ {3,5,7} substantially reduce MaxSim cost while preserving the shared-representation design. 
4 Benchmarks and Experimental Setup We evaluate INTRA on four Wikipedia-based QA benchmarks: HotPotQA [38], 2WikiMultihopQA [12], MuSiQue [34], and Natural Questions [19]. Together they span bridge and comparison reasoning, cleaner two-hop evidence ","claim_type":"dataset","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"query involves chaining together multiple related facts across entities, CTI reports, or time (e.g., actor → uses → malware → targets → sector, or comparing campaigns over time). Dense retrieval that returns the top-𝑘 most relevant text chunks [20, 22] can fail when evidence is distributed across distant text fragments, when constraints must be satisfied jointly, or when the answer depends on chaining multiple facts [ 40]. Equally important, LLM-based CTI assistants must reliably abstain when th","claim_type":"background","confidence":0.7,"evidence_strength":"citation_context"},{"claim_text":"Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.),Advances in Neural Information Processing Systems, 2022. URLhttps://openreview.net/forum?id=R9KnuFlvnU. [69] Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains, 2024.URL https://arxiv. org/abs/2406.12045, 2024. [70] Junjie Ye, Guanyu Li, Songyang Gao, Caishuang Huang, Yilong Wu, Sixian Li, Xiaoran Fan, Shihan Dou, Tao Ji, Qi Zhang, Tao Gui, and Xua","claim_type":"background","confidence":0.5,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Cohen and Ruslan Salakhutdinov and Christopher D because it crossed a citation-hub threshold. 
Current citing contexts most often use it as background evidence (2 contexts).","role_counts":[{"n":2,"context_role":"background"},{"n":2,"context_role":"dataset"}]},"error":null,"updated_at":"2026-06-28T14:48:04.249998+00:00"}},"summary":{"title":"Cohen and Ruslan Salakhutdinov and Christopher D","claims":[{"claim_text":"88 + TG-Norm 47.24 50.17 22.68 52.40 46.27 + TG-Norm +D t-rescaling 47.94 50.54 22.77 52.00 46.71 + TG-Norm +D t-rescaling + Ada-Clipping(A 2TGPO) 49.42 51.29 25.21 53.60 48.06 both training and evaluation. Seven open-domain question answering benchmarks are used, or- ganized into two groups by reasoning depth.Multi-hopbenchmarks consist of HotpotQA [ 28], 2WikiMultihopQA [29], MuSiQue [30], and Bamboogle [31].Single-hopbenchmarks consist of Natural Questions (NQ) [ 32], TriviaQA [ 33], and PopQ","claim_type":"dataset","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"requiring no additional compressor or compression-specific training (distinct from latent-compression approaches [11]). We find that small values such as Lp ∈ {3,5,7} substantially reduce MaxSim cost while preserving the shared-representation design. 4 Benchmarks and Experimental Setup We evaluate INTRA on four Wikipedia-based QA benchmarks: HotPotQA [38], 2WikiMultihopQA [12], MuSiQue [34], and Natural Questions [19]. Together they span bridge and comparison reasoning, cleaner two-hop evidence ","claim_type":"dataset","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"query involves chaining together multiple related facts across entities, CTI reports, or time (e.g., actor → uses → malware → targets → sector, or comparing campaigns over time). Dense retrieval that returns the top-𝑘 most relevant text chunks [20, 22] can fail when evidence is distributed across distant text fragments, when constraints must be satisfied jointly, or when the answer depends on chaining multiple facts [ 40]. 
Equally important, LLM-based CTI assistants must reliably abstain when th","claim_type":"background","confidence":0.7,"evidence_strength":"citation_context"},{"claim_text":"Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.),Advances in Neural Information Processing Systems, 2022. URLhttps://openreview.net/forum?id=R9KnuFlvnU. [69] Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains, 2024.URL https://arxiv. org/abs/2406.12045, 2024. [70] Junjie Ye, Guanyu Li, Songyang Gao, Caishuang Huang, Yilong Wu, Sixian Li, Xiaoran Fan, Shihan Dou, Tao Ji, Qi Zhang, Tao Gui, and Xua","claim_type":"background","confidence":0.5,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Cohen and Ruslan Salakhutdinov and Christopher D because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (2 contexts).","role_counts":[{"n":2,"context_role":"background"},{"n":2,"context_role":"dataset"}]},"graph":{"co_cited":[{"title":"TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension","work_id":"d05a9c57-9d88-473a-aa65-efb13f9dee25","shared_citers":22},{"title":"Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps","work_id":"fd3a2c44-aeea-48a7-b162-d3f0c6d43f35","shared_citers":21},{"title":"Dense Passage Retrieval for Open-Domain Question Answering","work_id":"083391f8-812d-430f-8d08-89a03031ce6c","shared_citers":20},{"title":"Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov","work_id":"45551929-96dc-40f3-9f89-10e76731cc24","shared_citers":17},{"title":"♫ M u S i Q ue: Multihop Questions via Single-hop Question Composition","work_id":"dd4f6eb0-477f-42c4-8d8a-8c8637815f98","shared_citers":15},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":14},{"title":"Lost in the Middle: How Language Models Use Long 
Contexts","work_id":"37c05e13-4a24-44f8-a1c4-da1bbe7223aa","shared_citers":12},{"title":"SQ u AD : 100,000+ questions for machine comprehension of text","work_id":"8e6a63f7-90ad-4b5e-8493-c26145f74b69","shared_citers":12},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":11},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":11},{"title":"When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories","work_id":"29f5140a-4d58-413f-8205-4e2d6e7cab77","shared_citers":11},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":10},{"title":"Efficient Memory Management for Large Language Model Serving with PagedAttention , booktitle =","work_id":"1b10f2a9-a178-4d23-97fb-8db2354c7e6c","shared_citers":10},{"title":"FEVER: a large-scale dataset for Fact Extraction and VERification","work_id":"b696f75f-e5ad-4555-9c12-e292e77c388f","shared_citers":9},{"title":"Qwen2.5 Technical Report","work_id":"d8432992-4980-4a81-85c7-9fa2c2b87f85","shared_citers":9},{"title":"and Lewis, Mike , editor =","work_id":"d690fac4-0cde-42d6-958a-77a77c0e7bd0","shared_citers":8},{"title":"BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"3e3c8ac8-b858-4b22-af32-393d98c883e0","shared_citers":7},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":7},{"title":"Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions","work_id":"26cdd617-acab-4f25-9b38-a106fd3bf382","shared_citers":7},{"title":"Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang","work_id":"9c9c91f9-a321-4e33-8897-d66f6ad0f659","shared_citers":7},{"title":"Retrieval-Augmented Generation for Large 
Language Models: A Survey","work_id":"b80d2790-6cd9-4c87-b3c4-de404f99a80e","shared_citers":7},{"title":"Robertson and Hugo Zaragoza , title =","work_id":"3dfaa21d-3751-420b-84f7-aeceda058b63","shared_citers":7},{"title":"Sentence- BERT : Sentence Embeddings using S iamese BERT -Networks","work_id":"cf07889b-1f35-4d81-9514-4ad3ed223c57","shared_citers":7},{"title":"Chandra and Dexter C","work_id":"c3270592-bd69-4213-95e1-4aaf8312be9b","shared_citers":6}],"time_series":[{"n":1,"year":2023},{"n":5,"year":2024},{"n":3,"year":2025},{"n":69,"year":2026}],"dependency_candidates":[{"n":1,"role":"dataset","polarity":"use_dataset","paper_title":"A$^2$TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping","primary_cat":"cs.CL","context_text":"88 + TG-Norm 47.24 50.17 22.68 52.40 46.27 + TG-Norm +D t-rescaling 47.94 50.54 22.77 52.00 46.71 + TG-Norm +D t-rescaling + Ada-Clipping(A 2TGPO) 49.42 51.29 25.21 53.60 48.06 both training and evaluation. Seven open-domain question answering benchmarks are used, or- ganized into two groups by reasoning depth.Multi-hopbenchmarks consist of HotpotQA [ 28], 2WikiMultihopQA [29], MuSiQue [30], and Bamboogle [31].Single-hopbenchmarks consist of Natural Questions (NQ) [ 32], TriviaQA [ 33], and PopQA [ 34]. We train and evaluate on three backbones: Qwen3-4B, Qwen3-8B, and Qwen2.5-7B. We reportExact Match (EM)as the primary metric on every benchmark as well as the average accuracy across all evaluation samples.","citing_arxiv_id":"2605.06200"},{"n":1,"role":"dataset","polarity":"use_dataset","paper_title":"Retrieval from Within: An Intrinsic Capability of Attention-Based Models","primary_cat":"cs.LG","context_text":"requiring no additional compressor or compression-specific training (distinct from latent-compression approaches [11]). We find that small values such as Lp ∈ {3,5,7} substantially reduce MaxSim cost while preserving the shared-representation design. 
4 Benchmarks and Experimental Setup We evaluate INTRA on four Wikipedia-based QA benchmarks: HotPotQA [38], 2WikiMultihopQA [12], MuSiQue [34], and Natural Questions [19]. Together they span bridge and comparison reasoning, cleaner two-hop evidence chains, compositionally harder multi-hop questions, and single-hop open- domain QA. We build one shared retrieval candidate pool for all four benchmarks under a fixed budget of approximately 100M tokens, containing 759K chunks in total.","citing_arxiv_id":"2605.05806"}]},"authors":[]}}