{"total":12,"items":[{"citing_arxiv_id":"2606.09881","ref_index":258,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Toward Calibrated, Fair, and accurate Deepfake Detection","primary_cat":"cs.LG","submitted_at":"2026-06-03T05:44:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03085","ref_index":13,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Multi-component Causal Tracing in Large Language Models","primary_cat":"cs.LG","submitted_at":"2026-06-02T03:15:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A unified multi-component causal tracing method that uses soft interventions and a metric transformation to efficiently select critical LLM components for a target performance metric.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10442","ref_index":96,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs","primary_cat":"cs.CY","submitted_at":"2026-05-11T12:12:28+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Association 
for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 8-14, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-2002. URL https://aclanthology.org/N18-2002. [95] K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99-106, 2021. [96] K. S. Shejole and P. Bhattacharyya. Stereodetect: Detecting stereotypes and anti-stereotypes the correct way using social psychological underpinnings. arXiv preprint arXiv:2504.03352, 2025. [97] K. Simbeck and M. Mahran. Mechanistic interpretability with saes: Probing religion, violence, and geography in large language models. arXiv preprint arXiv:2509."},{"citing_arxiv_id":"2605.07622","ref_index":20,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Is She Even Relevant? When BERT Ignores Explicit Gender Cues","primary_cat":"cs.CL","submitted_at":"2026-05-08T11:48:22+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A Dutch BERT model encodes gender linearly by epoch 20 but does not dynamically update its representations when explicit female cues contradict learned stereotypical associations in short sentence templates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06856","ref_index":60,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility","primary_cat":"cs.LG","submitted_at":"2026-05-07T18:56:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment 
contexts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2411.10636","ref_index":24,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Mitigating Extrinsic Gender Bias for Bangla Classification Tasks","primary_cat":"cs.CL","submitted_at":"2024-11-16T00:04:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Constructs gender-perturbed Bangla classification benchmarks and proposes RandSymKL debiasing that reduces extrinsic gender bias in pretrained models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.11794","ref_index":156,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"DataComp-LM: In search of the next generation of training sets for language models","primary_cat":"cs.LG","submitted_at":"2024-06-17T17:42:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"we train on ∼400k documents split equally between positive and negative classes. We experiment with different options for positive data and fix negative data as a random sample from a version of our RefinedWeb reproduction. For the perplexity filtering and the top-k average logits strategies, we utilize a 154M parameter causal Transformer trained on a mix of English Wikipedia, the books subset of RedPajama-v1, and peS2o [156, 168] (see Appendix J for more implementation details). We compare the aforementioned approaches in Table 4 and find that fastText-based filtering outperforms all other approaches. 
We next aim to understand how different fastText training recipes affect its effectiveness as a data filtering network [59]. Text classifier ablations. To better understand the limits of fastText, we train several variants,"},{"citing_arxiv_id":"2312.11805","ref_index":87,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Gemini: A Family of Highly Capable Multimodal Models","primary_cat":"cs.CL","submitted_at":"2023-12-19T02:39:27+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Gemini Ultra reaches human-expert performance on MMLU for the first time and sets new state-of-the-art results on 30 of 32 benchmarks, including all 20 multimodal ones tested.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2204.02311","ref_index":130,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"PaLM: Scaling Language Modeling with Pathways","primary_cat":"cs.CL","submitted_at":"2022-04-05T16:11:45+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2110.08193","ref_index":51,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"BBQ: A Hand-Built Bias Benchmark for Question Answering","primary_cat":"cs.CL","submitted_at":"2021-10-15T16:43:46+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BBQ is a new benchmark dataset showing that QA models often default to social stereotypes, achieving up to 3.4 points higher accuracy when the correct answer aligns with 
bias.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1906.10256","ref_index":30,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Good Secretaries, Bad Truck Drivers? Occupational Gender Stereotypes in Sentiment Analysis","primary_cat":"cs.CL","submitted_at":"2019-06-24T22:31:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Authors release a new 800-sentence gender-balanced profession dataset and use it to test occupational gender stereotypes in three sentiment analysis models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1905.00537","ref_index":138,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems","primary_cat":"cs.CL","submitted_at":"2019-05-02T00:41:50+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SuperGLUE is a new benchmark with more difficult language understanding tasks, a toolkit, and leaderboard to drive further progress beyond GLUE.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}