{"work":{"id":"d05a9c57-9d88-473a-aa65-efb13f9dee25","openalex_id":"https://openalex.org/W2612431505","doi":"10.18653/v1/p17-1147","arxiv_id":null,"raw_key":null,"title":"In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","authors":[{"given":"Mandar","family":"Joshi","sequence":"first","affiliation":[]},{"given":"Eunsol","family":"Choi","sequence":"additional","affiliation":[]},{"given":"Daniel","family":"Weld","sequence":"additional","affiliation":[]},{"given":"Luke","family":"Zettlemoyer","sequence":"additional","affiliation":[]}],"authors_text":"Joshi, M","year":2017,"venue":"Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","abstract":null,"external_url":"https://doi.org/10.18653/v1/p17-1147","cited_by_count":606,"metadata_source":"doi_reference","metadata_fetched_at":"2026-07-01T09:05:36.090694+00:00","pith_arxiv_id":null,"created_at":"2026-05-08T17:13:36.989995+00:00","updated_at":"2026-07-01T09:05:36.090694+00:00","title_quality_ok":false,"display_title":"TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension","render_title":"TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension"},"hub":{"state":{"work_id":"d05a9c57-9d88-473a-aa65-efb13f9dee25","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external 
citations","pith_inbound_count":104,"external_cited_by_count":606,"distinct_field_count":7,"first_pith_cited_at":"2019-05-24T05:48:49+00:00","last_pith_cited_at":"2026-06-30T07:24:10+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-07-01T21:21:30.808205+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"dataset","n":6},{"context_role":"background","n":3},{"context_role":"method","n":1}],"polarity_counts":[{"context_polarity":"use_dataset","n":6},{"context_polarity":"unclear","n":2},{"context_polarity":"background","n":1},{"context_polarity":"use_method","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension","claims":[{"claim_text":"+ TG-Norm +D t-rescaling + Ada-Clipping(A 2TGPO) 49.42 51.29 25.21 53.60 48.06 both training and evaluation. Seven open-domain question answering benchmarks are used, or- ganized into two groups by reasoning depth.Multi-hopbenchmarks consist of HotpotQA [ 28], 2WikiMultihopQA [29], MuSiQue [30], and Bamboogle [31].Single-hopbenchmarks consist of Natural Questions (NQ) [ 32], TriviaQA [ 33], and PopQA [ 34]. We train and evaluate on three backbones: Qwen3-4B, Qwen3-8B, and Qwen2.5-7B. We reportEx","claim_type":"dataset","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"and TriviaQA. For Natural Questions (NQ), we use the dpr-w100 split from ir_datasets to represent open-domain, real-world user queries [34, 35, 36]. For PubMedQA, we adopt the pqa_labeled configuration to model medical question answering, where accurate technical retrieval is needed [37]. For TriviaQA, we employ the rc (reading comprehension) configuration [38]. 
Using a fixed random seed, we sample 50 benign queries from each dataset for the utility-oriented evaluation of retrieval and generatio","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"7750 248.2226 17.8641 266.0867 0.3658 BnB INT8 138.96 139.09 0.9880 56.3886 056.3886 0.0265 NF4 138.96 144.16 0.9124 155.4506 0 155.4506 0.0750 FP4 138.96 138.10 0.9196 145.1767 0 145.1767 0.1306 GPTQ GPTQ-4bit 138.96 140.37 0.9298 136.7867 0 136.7867 0.1422 Benchmarks and scoring.Five benchmarks:MMLU[ 28],ARC[ 29] (multiple-choice knowledge), TriviaQA[ 30],SQuAD[ 31] (short-horizon QA), andGSM8K[ 32] (multi-step reasoning). All risks are computed teacher-forced (prompt c and targets y scored in","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"significantly on knowledge-intensive and adversarial benchmarks, collapsing on TruthfulQA. We attribute this to the absence of a principled density model, making it unable to generalize across different instruction-tuning regimes. TruthfulQA remains the hardest setting for all methods, as its questions target misconceptions deeply encoded in pretraining weights [16]. Yet,PCNETleads across all models also on this dataset, with Mistral-7B achieving the highest AUROC, consistent with the hypothesis","claim_type":"dataset","confidence":0.8,"evidence_strength":"citation_context"},{"claim_text":"(by non-expert validators who are experts in other domains; at least 15 min, avg ~37 min, allowing Google) Part 1: answer Q (correct answer & explanations not shown) Part 2: provide feedback on the following dimensions (correct answer & explanations shown to the validator) Include this Q in the DIAMOND set because (1)2 out of 2 expert validators agree* (2)≤ 1 out of 3 non-expert validators answers correctly •Post-hoc agreement: Is the answer uncontroversial? 
•Is your background suﬃcient to answe","claim_type":"method","confidence":0.8,"evidence_strength":"citation_context"},{"claim_text":"Pang Wei Koh, Jenia Jitsev, Thomas Kollar, Alex Dimakis, Yair Carmon, Achal Dave, Ludwig Schmidt, and Vaishaal Shankar. Datacomp-LM: In search of the next generation of training sets for language models. InThe Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum? id=CNWdWn47IE. [40] Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for read","claim_type":"dataset","confidence":0.7,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension because it crossed a citation-hub threshold. Current citing contexts most often use it as dataset evidence (6 contexts).","role_counts":[{"n":6,"context_role":"dataset"},{"n":3,"context_role":"background"},{"n":1,"context_role":"method"}]},"error":null,"updated_at":"2026-06-30T12:29:54.307445+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"89203f73-bcbb-42d4-a175-dea3b337ba8b","orcid":null,"display_name":"Mandar Joshi"},{"id":"e236980d-6b87-4ec2-bfcf-98d8c2aa9594","orcid":null,"display_name":"Eunsol Choi"},{"id":"88674829-bf39-4318-b987-7d823fb4ac71","orcid":null,"display_name":"Daniel Weld"},{"id":"383382cc-d254-4d3b-92b6-03e2ff131701","orcid":null,"display_name":"Luke Zettlemoyer"}]},"error":null,"updated_at":"2026-06-30T12:29:55.012091+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-06-28T10:07:23.808771+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Dai, Jakob Uszkoreit, Quoc Le, and Slav 
Petrov","work_id":"45551929-96dc-40f3-9f89-10e76731cc24","shared_citers":26},{"title":"Cohen and Ruslan Salakhutdinov and Christopher D","work_id":"3c5a795a-85bd-45a1-8e65-06a9c5641111","shared_citers":21},{"title":"URL https:// doi.org/10.18653/v1/p19-1472","work_id":"11bfc949-547c-40f3-a86d-953eb9b2154c","shared_citers":16},{"title":"SQ u AD : 100,000+ questions for machine comprehension of text","work_id":"8e6a63f7-90ad-4b5e-8493-c26145f74b69","shared_citers":15},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":15},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":15},{"title":"Dua, D., Wang, Y ., Dasigi, P., Stanovsky, G., Singh, S., and Gardner, M","work_id":"56ac41e4-5078-4307-aa88-20a9d4e90afc","shared_citers":14},{"title":"Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","shared_citers":14},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":12},{"title":"Lost in the Middle: How Language Models Use Long Contexts","work_id":"37c05e13-4a24-44f8-a1c4-da1bbe7223aa","shared_citers":12},{"title":"Dense Passage Retrieval for Open-Domain Question Answering","work_id":"083391f8-812d-430f-8d08-89a03031ce6c","shared_citers":11},{"title":"B ool Q : Exploring the Surprising Difficulty of Natural Yes/No Questions","work_id":"b0eff16f-bbcd-4d66-a41d-d89ff07a80e5","shared_citers":10},{"title":"Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps","work_id":"fd3a2c44-aeea-48a7-b162-d3f0c6d43f35","shared_citers":10},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":10},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":10},{"title":"When Not to Trust Language Models: Investigating 
Effectiveness of Parametric and Non-Parametric Memories","work_id":"29f5140a-4d58-413f-8205-4e2d6e7cab77","shared_citers":10},{"title":"doi: 10.18653/v1/n19-1421","work_id":"628930d3-897c-43a2-8d6d-589da959e066","shared_citers":9},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":9},{"title":"Liu, and Matt Gardner","work_id":"fb5c5440-b2da-4115-91bf-0fc02e505c13","shared_citers":9},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":8},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":8},{"title":"Measuring Mathematical Problem Solving With the MATH Dataset","work_id":"50652ac6-fb7c-4675-a2c2-159c241feb17","shared_citers":8},{"title":"Mistral 7B","work_id":"eb5e1305-ad11-4875-ad8d-ad8b8f697599","shared_citers":8},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":8}],"time_series":[{"n":1,"year":2019},{"n":1,"year":2020},{"n":1,"year":2021},{"n":1,"year":2022},{"n":3,"year":2023},{"n":9,"year":2024},{"n":14,"year":2025},{"n":48,"year":2026}],"dependency_candidates":[{"n":1,"role":"dataset","polarity":"use_dataset","paper_title":"PRISM: A Geometric Risk Bound that Decomposes Drift into Scale, Shape, and Head","primary_cat":"cs.CL","context_text":"7750 248.2226 17.8641 266.0867 0.3658 BnB INT8 138.96 139.09 0.9880 56.3886 056.3886 0.0265 NF4 138.96 144.16 0.9124 155.4506 0 155.4506 0.0750 FP4 138.96 138.10 0.9196 145.1767 0 145.1767 0.1306 GPTQ GPTQ-4bit 138.96 140.37 0.9298 136.7867 0 136.7867 0.1422 Benchmarks and scoring.Five benchmarks:MMLU[ 28],ARC[ 29] (multiple-choice knowledge), TriviaQA[ 30],SQuAD[ 31] (short-horizon QA), andGSM8K[ 32] (multi-step reasoning). 
All risks are computed teacher-forced (prompt c and targets y scored in a single forward pass over the gold span), producing a deterministic per-sample CE loss whose expectation gives the model's riskRM , and|∆R|is the target-vs-proxy gap we report. Calibration and hyperparameters.","citing_arxiv_id":"2605.11608"},{"n":1,"role":"dataset","polarity":"use_dataset","paper_title":"A$^2$TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping","primary_cat":"cs.CL","context_text":"+ TG-Norm +D t-rescaling + Ada-Clipping(A 2TGPO) 49.42 51.29 25.21 53.60 48.06 both training and evaluation. Seven open-domain question answering benchmarks are used, or- ganized into two groups by reasoning depth.Multi-hopbenchmarks consist of HotpotQA [ 28], 2WikiMultihopQA [29], MuSiQue [30], and Bamboogle [31].Single-hopbenchmarks consist of Natural Questions (NQ) [ 32], TriviaQA [ 33], and PopQA [ 34]. We train and evaluate on three backbones: Qwen3-4B, Qwen3-8B, and Qwen2.5-7B. We reportExact Match (EM)as the primary metric on every benchmark as well as the average accuracy across all evaluation samples. This experiment setting deliberately avoids proprietary APIs and heavyweight tool infrastructure, keeping the evaluation reproducible and concentrating on the progress of the RL algorithm.","citing_arxiv_id":"2605.06200"},{"n":1,"role":"dataset","polarity":"use_dataset","paper_title":"Hallucination as an Anomaly: Dynamic Intervention via Probabilistic Circuits","primary_cat":"cs.CL","context_text":"significantly on knowledge-intensive and adversarial benchmarks, collapsing on TruthfulQA. We attribute this to the absence of a principled density model, making it unable to generalize across different instruction-tuning regimes. TruthfulQA remains the hardest setting for all methods, as its questions target misconceptions deeply encoded in pretraining weights [16]. 
Yet,PCNETleads across all models also on this dataset, with Mistral-7B achieving the highest AUROC, consistent with the hypothesis that more capable models develop more geometrically separable truth representations [3]. 6 Table 1: Hallucination detection performance. We report AUROC and F1 across four datasets and three seeds (mean±std). Best results per model and dataset are highlighted inbold.","citing_arxiv_id":"2605.05953"},{"n":1,"role":"dataset","polarity":"use_dataset","paper_title":"How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models","primary_cat":"cs.LG","context_text":"Pang Wei Koh, Jenia Jitsev, Thomas Kollar, Alex Dimakis, Yair Carmon, Achal Dave, Ludwig Schmidt, and Vaishaal Shankar. Datacomp-LM: In search of the next generation of training sets for language models. InThe Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum? id=CNWdWn47IE. [40] Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Regina Barzilay and Min-Yen Kan, editors,Proceedings of the 55th Annual Meeting of the Association for Com- putational Linguistics (Volume 1: Long Papers), pages 1601-1611, Vancouver, Canada, July 2017.","citing_arxiv_id":"2604.21106"},{"n":1,"role":"dataset","polarity":"use_dataset","paper_title":"Adaptive Defense Orchestration for RAG: A Sentinel-Strategist Architecture against Multi-Vector Attacks","primary_cat":"cs.CR","context_text":"and TriviaQA. For Natural Questions (NQ), we use the dpr-w100 split from ir_datasets to represent open-domain, real-world user queries [34, 35, 36]. For PubMedQA, we adopt the pqa_labeled configuration to model medical question answering, where accurate technical retrieval is needed [37]. For TriviaQA, we employ the rc (reading comprehension) configuration [38]. 
Using a fixed random seed, we sample 50 benign queries from each dataset for the utility-oriented evaluation of retrieval and generation quality. These benign queries are distinct from the attack-specific poisoning and membership-inference query/probe sets described in Section 5.3. We execute each benign query against the 700-document ingested corpus.","citing_arxiv_id":"2604.20932"},{"n":1,"role":"dataset","polarity":"use_dataset","paper_title":"Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models","primary_cat":"cs.CL","context_text":"names, and alternative spellings should all be considered the same. If the Provided Answer is correct say exactly \"True\", otherwise say \"False\". Question 1: \"when did the nfl start playing in london\" Provided Answer: \"According to the provided search results, the NFL started playing regular season games in London as part of the NFL International Series in 2007. Specifically: Document [5] states: \"The NFL International Series was inaugurated in 2007 to host NFL regular season games outside the United States. 
Played at the new Wembley Stadium in London (rebuilt and reopened in 2007), the series increased from one to two games for the 2013 season, to three games for the 2014 season, and then to four games from the 2017 season.\" Document [9] also mentions: \"Since 2007, the league has held multiple regular season games in London each season as part of NFL London Games, allowing the league to test solutions to some of the","citing_arxiv_id":"2404.18796"},{"n":1,"role":"method","polarity":"use_method","paper_title":"GPQA: A Graduate-Level Google-Proof Q&A Benchmark","primary_cat":"cs.AI","context_text":"(by non-expert validators who are experts in other domains; at least 15 min, avg ~37 min, allowing Google) Part 1: answer Q (correct answer & explanations not shown) Part 2: provide feedback on the following dimensions (correct answer & explanations shown to the validator) Include this Q in the DIAMOND set because (1)2 out of 2 expert validators agree* (2)≤ 1 out of 3 non-expert validators answers correctly •Post-hoc agreement: Is the answer uncontroversial? •Is your background suﬃcient to answer correctly? •Q diﬃculty •Did you understand Q fully, now that you see the explanations? 
•Detailed feedback •Q & answer choice revisions Revised question and choicesMethylcyclopentadiene (which exists as a ﬂuxional mixture of isomers)","citing_arxiv_id":"2311.12022"}]},"error":null,"updated_at":"2026-06-28T10:07:14.148659+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-06-28T10:07:35.750332+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension","claims":[{"claim_text":"+ TG-Norm +D t-rescaling + Ada-Clipping(A 2TGPO) 49.42 51.29 25.21 53.60 48.06 both training and evaluation. Seven open-domain question answering benchmarks are used, or- ganized into two groups by reasoning depth.Multi-hopbenchmarks consist of HotpotQA [ 28], 2WikiMultihopQA [29], MuSiQue [30], and Bamboogle [31].Single-hopbenchmarks consist of Natural Questions (NQ) [ 32], TriviaQA [ 33], and PopQA [ 34]. We train and evaluate on three backbones: Qwen3-4B, Qwen3-8B, and Qwen2.5-7B. We reportEx","claim_type":"dataset","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"and TriviaQA. For Natural Questions (NQ), we use the dpr-w100 split from ir_datasets to represent open-domain, real-world user queries [34, 35, 36]. For PubMedQA, we adopt the pqa_labeled configuration to model medical question answering, where accurate technical retrieval is needed [37]. For TriviaQA, we employ the rc (reading comprehension) configuration [38]. 
Using a fixed random seed, we sample 50 benign queries from each dataset for the utility-oriented evaluation of retrieval and generatio","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"7750 248.2226 17.8641 266.0867 0.3658 BnB INT8 138.96 139.09 0.9880 56.3886 056.3886 0.0265 NF4 138.96 144.16 0.9124 155.4506 0 155.4506 0.0750 FP4 138.96 138.10 0.9196 145.1767 0 145.1767 0.1306 GPTQ GPTQ-4bit 138.96 140.37 0.9298 136.7867 0 136.7867 0.1422 Benchmarks and scoring.Five benchmarks:MMLU[ 28],ARC[ 29] (multiple-choice knowledge), TriviaQA[ 30],SQuAD[ 31] (short-horizon QA), andGSM8K[ 32] (multi-step reasoning). All risks are computed teacher-forced (prompt c and targets y scored in","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"significantly on knowledge-intensive and adversarial benchmarks, collapsing on TruthfulQA. We attribute this to the absence of a principled density model, making it unable to generalize across different instruction-tuning regimes. TruthfulQA remains the hardest setting for all methods, as its questions target misconceptions deeply encoded in pretraining weights [16]. Yet,PCNETleads across all models also on this dataset, with Mistral-7B achieving the highest AUROC, consistent with the hypothesis","claim_type":"dataset","confidence":0.8,"evidence_strength":"citation_context"},{"claim_text":"(by non-expert validators who are experts in other domains; at least 15 min, avg ~37 min, allowing Google) Part 1: answer Q (correct answer & explanations not shown) Part 2: provide feedback on the following dimensions (correct answer & explanations shown to the validator) Include this Q in the DIAMOND set because (1)2 out of 2 expert validators agree* (2)≤ 1 out of 3 non-expert validators answers correctly •Post-hoc agreement: Is the answer uncontroversial? 
•Is your background suﬃcient to answe","claim_type":"method","confidence":0.8,"evidence_strength":"citation_context"},{"claim_text":"Pang Wei Koh, Jenia Jitsev, Thomas Kollar, Alex Dimakis, Yair Carmon, Achal Dave, Ludwig Schmidt, and Vaishaal Shankar. Datacomp-LM: In search of the next generation of training sets for language models. InThe Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum? id=CNWdWn47IE. [40] Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for read","claim_type":"dataset","confidence":0.7,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension because it crossed a citation-hub threshold. Current citing contexts most often use it as dataset evidence (6 contexts).","role_counts":[{"n":6,"context_role":"dataset"},{"n":3,"context_role":"background"},{"n":1,"context_role":"method"}]},"error":null,"updated_at":"2026-06-30T12:29:55.014621+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension","claims":[{"claim_text":"+ TG-Norm +D t-rescaling + Ada-Clipping(A 2TGPO) 49.42 51.29 25.21 53.60 48.06 both training and evaluation. Seven open-domain question answering benchmarks are used, or- ganized into two groups by reasoning depth.Multi-hopbenchmarks consist of HotpotQA [ 28], 2WikiMultihopQA [29], MuSiQue [30], and Bamboogle [31].Single-hopbenchmarks consist of Natural Questions (NQ) [ 32], TriviaQA [ 33], and PopQA [ 34]. We train and evaluate on three backbones: Qwen3-4B, Qwen3-8B, and Qwen2.5-7B. We reportEx","claim_type":"dataset","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"and TriviaQA. 
For Natural Questions (NQ), we use the dpr-w100 split from ir_datasets to represent open-domain, real-world user queries [34, 35, 36]. For PubMedQA, we adopt the pqa_labeled configuration to model medical question answering, where accurate technical retrieval is needed [37]. For TriviaQA, we employ the rc (reading comprehension) configuration [38]. Using a fixed random seed, we sample 50 benign queries from each dataset for the utility-oriented evaluation of retrieval and generatio","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"7750 248.2226 17.8641 266.0867 0.3658 BnB INT8 138.96 139.09 0.9880 56.3886 056.3886 0.0265 NF4 138.96 144.16 0.9124 155.4506 0 155.4506 0.0750 FP4 138.96 138.10 0.9196 145.1767 0 145.1767 0.1306 GPTQ GPTQ-4bit 138.96 140.37 0.9298 136.7867 0 136.7867 0.1422 Benchmarks and scoring.Five benchmarks:MMLU[ 28],ARC[ 29] (multiple-choice knowledge), TriviaQA[ 30],SQuAD[ 31] (short-horizon QA), andGSM8K[ 32] (multi-step reasoning). All risks are computed teacher-forced (prompt c and targets y scored in","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"significantly on knowledge-intensive and adversarial benchmarks, collapsing on TruthfulQA. We attribute this to the absence of a principled density model, making it unable to generalize across different instruction-tuning regimes. TruthfulQA remains the hardest setting for all methods, as its questions target misconceptions deeply encoded in pretraining weights [16]. 
Yet,PCNETleads across all models also on this dataset, with Mistral-7B achieving the highest AUROC, consistent with the hypothesis","claim_type":"dataset","confidence":0.8,"evidence_strength":"citation_context"},{"claim_text":"(by non-expert validators who are experts in other domains; at least 15 min, avg ~37 min, allowing Google) Part 1: answer Q (correct answer & explanations not shown) Part 2: provide feedback on the following dimensions (correct answer & explanations shown to the validator) Include this Q in the DIAMOND set because (1)2 out of 2 expert validators agree* (2)≤ 1 out of 3 non-expert validators answers correctly •Post-hoc agreement: Is the answer uncontroversial? •Is your background suﬃcient to answe","claim_type":"method","confidence":0.8,"evidence_strength":"citation_context"},{"claim_text":"Pang Wei Koh, Jenia Jitsev, Thomas Kollar, Alex Dimakis, Yair Carmon, Achal Dave, Ludwig Schmidt, and Vaishaal Shankar. Datacomp-LM: In search of the next generation of training sets for language models. InThe Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum? id=CNWdWn47IE. [40] Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for read","claim_type":"dataset","confidence":0.7,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension because it crossed a citation-hub threshold. 
Current citing contexts most often use it as dataset evidence (6 contexts).","role_counts":[{"n":6,"context_role":"dataset"},{"n":3,"context_role":"background"},{"n":1,"context_role":"method"}]},"error":null,"updated_at":"2026-06-28T10:07:23.812068+00:00"}},"summary":{"title":"TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension","claims":[{"claim_text":"+ TG-Norm +D t-rescaling + Ada-Clipping(A 2TGPO) 49.42 51.29 25.21 53.60 48.06 both training and evaluation. Seven open-domain question answering benchmarks are used, or- ganized into two groups by reasoning depth.Multi-hopbenchmarks consist of HotpotQA [ 28], 2WikiMultihopQA [29], MuSiQue [30], and Bamboogle [31].Single-hopbenchmarks consist of Natural Questions (NQ) [ 32], TriviaQA [ 33], and PopQA [ 34]. We train and evaluate on three backbones: Qwen3-4B, Qwen3-8B, and Qwen2.5-7B. We reportEx","claim_type":"dataset","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"and TriviaQA. For Natural Questions (NQ), we use the dpr-w100 split from ir_datasets to represent open-domain, real-world user queries [34, 35, 36]. For PubMedQA, we adopt the pqa_labeled configuration to model medical question answering, where accurate technical retrieval is needed [37]. For TriviaQA, we employ the rc (reading comprehension) configuration [38]. 
Using a fixed random seed, we sample 50 benign queries from each dataset for the utility-oriented evaluation of retrieval and generatio","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"7750 248.2226 17.8641 266.0867 0.3658 BnB INT8 138.96 139.09 0.9880 56.3886 056.3886 0.0265 NF4 138.96 144.16 0.9124 155.4506 0 155.4506 0.0750 FP4 138.96 138.10 0.9196 145.1767 0 145.1767 0.1306 GPTQ GPTQ-4bit 138.96 140.37 0.9298 136.7867 0 136.7867 0.1422 Benchmarks and scoring.Five benchmarks:MMLU[ 28],ARC[ 29] (multiple-choice knowledge), TriviaQA[ 30],SQuAD[ 31] (short-horizon QA), andGSM8K[ 32] (multi-step reasoning). All risks are computed teacher-forced (prompt c and targets y scored in","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"significantly on knowledge-intensive and adversarial benchmarks, collapsing on TruthfulQA. We attribute this to the absence of a principled density model, making it unable to generalize across different instruction-tuning regimes. TruthfulQA remains the hardest setting for all methods, as its questions target misconceptions deeply encoded in pretraining weights [16]. Yet,PCNETleads across all models also on this dataset, with Mistral-7B achieving the highest AUROC, consistent with the hypothesis","claim_type":"dataset","confidence":0.8,"evidence_strength":"citation_context"},{"claim_text":"(by non-expert validators who are experts in other domains; at least 15 min, avg ~37 min, allowing Google) Part 1: answer Q (correct answer & explanations not shown) Part 2: provide feedback on the following dimensions (correct answer & explanations shown to the validator) Include this Q in the DIAMOND set because (1)2 out of 2 expert validators agree* (2)≤ 1 out of 3 non-expert validators answers correctly •Post-hoc agreement: Is the answer uncontroversial? 
•Is your background suﬃcient to answe","claim_type":"method","confidence":0.8,"evidence_strength":"citation_context"},{"claim_text":"Pang Wei Koh, Jenia Jitsev, Thomas Kollar, Alex Dimakis, Yair Carmon, Achal Dave, Ludwig Schmidt, and Vaishaal Shankar. Datacomp-LM: In search of the next generation of training sets for language models. InThe Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum? id=CNWdWn47IE. [40] Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for read","claim_type":"dataset","confidence":0.7,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension because it crossed a citation-hub threshold. Current citing contexts most often use it as dataset evidence (6 contexts).","role_counts":[{"n":6,"context_role":"dataset"},{"n":3,"context_role":"background"},{"n":1,"context_role":"method"}]},"graph":{"co_cited":[{"title":"Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov","work_id":"45551929-96dc-40f3-9f89-10e76731cc24","shared_citers":26},{"title":"Cohen and Ruslan Salakhutdinov and Christopher D","work_id":"3c5a795a-85bd-45a1-8e65-06a9c5641111","shared_citers":21},{"title":"URL https:// doi.org/10.18653/v1/p19-1472","work_id":"11bfc949-547c-40f3-a86d-953eb9b2154c","shared_citers":16},{"title":"SQ u AD : 100,000+ questions for machine comprehension of text","work_id":"8e6a63f7-90ad-4b5e-8493-c26145f74b69","shared_citers":15},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":15},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":15},{"title":"Dua, D., Wang, Y ., Dasigi, P., Stanovsky, G., Singh, S., and Gardner, 
M","work_id":"56ac41e4-5078-4307-aa88-20a9d4e90afc","shared_citers":14},{"title":"Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","shared_citers":14},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":12},{"title":"Lost in the Middle: How Language Models Use Long Contexts","work_id":"37c05e13-4a24-44f8-a1c4-da1bbe7223aa","shared_citers":12},{"title":"Dense Passage Retrieval for Open-Domain Question Answering","work_id":"083391f8-812d-430f-8d08-89a03031ce6c","shared_citers":11},{"title":"B ool Q : Exploring the Surprising Difficulty of Natural Yes/No Questions","work_id":"b0eff16f-bbcd-4d66-a41d-d89ff07a80e5","shared_citers":10},{"title":"Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps","work_id":"fd3a2c44-aeea-48a7-b162-d3f0c6d43f35","shared_citers":10},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":10},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":10},{"title":"When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories","work_id":"29f5140a-4d58-413f-8205-4e2d6e7cab77","shared_citers":10},{"title":"doi: 10.18653/v1/n19-1421","work_id":"628930d3-897c-43a2-8d6d-589da959e066","shared_citers":9},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":9},{"title":"Liu, and Matt Gardner","work_id":"fb5c5440-b2da-4115-91bf-0fc02e505c13","shared_citers":9},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":8},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":8},{"title":"Measuring Mathematical 
Problem Solving With the MATH Dataset","work_id":"50652ac6-fb7c-4675-a2c2-159c241feb17","shared_citers":8},{"title":"Mistral 7B","work_id":"eb5e1305-ad11-4875-ad8d-ad8b8f697599","shared_citers":8},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":8}],"time_series":[{"n":1,"year":2019},{"n":1,"year":2020},{"n":1,"year":2021},{"n":1,"year":2022},{"n":3,"year":2023},{"n":9,"year":2024},{"n":14,"year":2025},{"n":48,"year":2026}],"dependency_candidates":[{"n":1,"role":"dataset","polarity":"use_dataset","paper_title":"PRISM: A Geometric Risk Bound that Decomposes Drift into Scale, Shape, and Head","primary_cat":"cs.CL","context_text":"7750 248.2226 17.8641 266.0867 0.3658 BnB INT8 138.96 139.09 0.9880 56.3886 056.3886 0.0265 NF4 138.96 144.16 0.9124 155.4506 0 155.4506 0.0750 FP4 138.96 138.10 0.9196 145.1767 0 145.1767 0.1306 GPTQ GPTQ-4bit 138.96 140.37 0.9298 136.7867 0 136.7867 0.1422 Benchmarks and scoring.Five benchmarks:MMLU[ 28],ARC[ 29] (multiple-choice knowledge), TriviaQA[ 30],SQuAD[ 31] (short-horizon QA), andGSM8K[ 32] (multi-step reasoning). All risks are computed teacher-forced (prompt c and targets y scored in a single forward pass over the gold span), producing a deterministic per-sample CE loss whose expectation gives the model's riskRM , and|∆R|is the target-vs-proxy gap we report. Calibration and hyperparameters.","citing_arxiv_id":"2605.11608"},{"n":1,"role":"dataset","polarity":"use_dataset","paper_title":"A$^2$TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping","primary_cat":"cs.CL","context_text":"+ TG-Norm +D t-rescaling + Ada-Clipping(A 2TGPO) 49.42 51.29 25.21 53.60 48.06 both training and evaluation. 
Seven open-domain question answering benchmarks are used, or- ganized into two groups by reasoning depth.Multi-hopbenchmarks consist of HotpotQA [ 28], 2WikiMultihopQA [29], MuSiQue [30], and Bamboogle [31].Single-hopbenchmarks consist of Natural Questions (NQ) [ 32], TriviaQA [ 33], and PopQA [ 34]. We train and evaluate on three backbones: Qwen3-4B, Qwen3-8B, and Qwen2.5-7B. We reportExact Match (EM)as the primary metric on every benchmark as well as the average accuracy across all evaluation samples. This experiment setting deliberately avoids proprietary APIs and heavyweight tool infrastructure, keeping the evaluation reproducible and concentrating on the progress of the RL algorithm.","citing_arxiv_id":"2605.06200"},{"n":1,"role":"dataset","polarity":"use_dataset","paper_title":"Hallucination as an Anomaly: Dynamic Intervention via Probabilistic Circuits","primary_cat":"cs.CL","context_text":"significantly on knowledge-intensive and adversarial benchmarks, collapsing on TruthfulQA. We attribute this to the absence of a principled density model, making it unable to generalize across different instruction-tuning regimes. TruthfulQA remains the hardest setting for all methods, as its questions target misconceptions deeply encoded in pretraining weights [16]. Yet,PCNETleads across all models also on this dataset, with Mistral-7B achieving the highest AUROC, consistent with the hypothesis that more capable models develop more geometrically separable truth representations [3]. 6 Table 1: Hallucination detection performance. We report AUROC and F1 across four datasets and three seeds (mean±std). Best results per model and dataset are highlighted inbold.","citing_arxiv_id":"2605.05953"},{"n":1,"role":"dataset","polarity":"use_dataset","paper_title":"How Much Is One Recurrence Worth? 
Iso-Depth Scaling Laws for Looped Language Models","primary_cat":"cs.LG","context_text":"Pang Wei Koh, Jenia Jitsev, Thomas Kollar, Alex Dimakis, Yair Carmon, Achal Dave, Ludwig Schmidt, and Vaishaal Shankar. Datacomp-LM: In search of the next generation of training sets for language models. InThe Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum? id=CNWdWn47IE. [40] Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Regina Barzilay and Min-Yen Kan, editors,Proceedings of the 55th Annual Meeting of the Association for Com- putational Linguistics (Volume 1: Long Papers), pages 1601-1611, Vancouver, Canada, July 2017.","citing_arxiv_id":"2604.21106"},{"n":1,"role":"dataset","polarity":"use_dataset","paper_title":"Adaptive Defense Orchestration for RAG: A Sentinel-Strategist Architecture against Multi-Vector Attacks","primary_cat":"cs.CR","context_text":"and TriviaQA. For Natural Questions (NQ), we use the dpr-w100 split from ir_datasets to represent open-domain, real-world user queries [34, 35, 36]. For PubMedQA, we adopt the pqa_labeled configuration to model medical question answering, where accurate technical retrieval is needed [37]. For TriviaQA, we employ the rc (reading comprehension) configuration [38]. Using a fixed random seed, we sample 50 benign queries from each dataset for the utility-oriented evaluation of retrieval and generation quality. These benign queries are distinct from the attack-specific poisoning and membership-inference query/probe sets described in Section 5.3. 
We execute each benign query against the 700-document ingested corpus.","citing_arxiv_id":"2604.20932"},{"n":1,"role":"dataset","polarity":"use_dataset","paper_title":"Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models","primary_cat":"cs.CL","context_text":"names, and alternative spellings should all be considered the same. If the Provided Answer is correct say exactly \"True\", otherwise say \"False\". Question 1: \"when did the nfl start playing in london\" Provided Answer: \"According to the provided search results, the NFL started playing regular season games in London as part of the NFL International Series in 2007. Specifically: Document [5] states: \"The NFL International Series was inaugurated in 2007 to host NFL regular season games outside the United States. Played at the new Wembley Stadium in London (rebuilt and reopened in 2007), the series increased from one to two games for the 2013 season, to three games for the 2014 season, and then to four games from the 2017 season.\" Document [9] also mentions: \"Since 2007, the league has held multiple regular season games in London each season as part of NFL London Games, allowing the league to test solutions to some of the","citing_arxiv_id":"2404.18796"},{"n":1,"role":"method","polarity":"use_method","paper_title":"GPQA: A Graduate-Level Google-Proof Q&A Benchmark","primary_cat":"cs.AI","context_text":"(by non-expert validators who are experts in other domains; at least 15 min, avg ~37 min, allowing Google) Part 1: answer Q (correct answer & explanations not shown) Part 2: provide feedback on the following dimensions (correct answer & explanations shown to the validator) Include this Q in the DIAMOND set because (1)2 out of 2 expert validators agree* (2)≤ 1 out of 3 non-expert validators answers correctly •Post-hoc agreement: Is the answer uncontroversial? •Is your background suﬃcient to answer correctly? 
•Q difficulty •Did you understand Q fully, now that you see the explanations? •Detailed feedback •Q & answer choice revisions Revised question and choices Methylcyclopentadiene (which exists as a fluxional mixture of isomers)","citing_arxiv_id":"2311.12022"}]},"authors":[{"id":"88674829-bf39-4318-b987-7d823fb4ac71","orcid":null,"display_name":"Daniel Weld","source":"manual","import_confidence":0.72},{"id":"e236980d-6b87-4ec2-bfcf-98d8c2aa9594","orcid":null,"display_name":"Eunsol Choi","source":"manual","import_confidence":0.72},{"id":"383382cc-d254-4d3b-92b6-03e2ff131701","orcid":null,"display_name":"Luke Zettlemoyer","source":"manual","import_confidence":0.72},{"id":"89203f73-bcbb-42d4-a175-dea3b337ba8b","orcid":null,"display_name":"Mandar Joshi","source":"manual","import_confidence":0.72}]}}