PubMedQA: A Dataset for Biomedical Research Question Answering

33 Pith papers cite this work. Polarity classification is still indexing.

Citation-role summary: background 2, dataset 1

representative citing papers

Can AI Agents Synthesize Scientific Conclusions?

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

A new benchmark and clean-room harness show frontier AI agents reach only 0.337 factual F1 when synthesizing conclusions from scientific evidence.

TaDA: Calibrated Probe Gating for Task-Domain LoRA Merging

cs.CL · 2026-06-03 · unverdicted · novelty 6.0

TaDA merges task-domain LoRAs via calibrated per-layer gating and subspace-aware merging, reaching 0.452 average accuracy on six scientific QA benchmarks and 85.9% on six image classification benchmarks.

ANN Search: Recall What Matters

cs.IR · 2026-06-03 · conditional · novelty 6.0

ANN search quality is better assessed by 1/Ratio@k than Recall@k because the former tracks downstream task utility more closely while allowing substantially lower computational cost.

When Retrieval Doesn't Help: A Large-Scale Study of Biomedical RAG

cs.CL · 2026-06-02 · accept · novelty 6.0

Large-scale evaluation shows retrieval-augmented generation yields only marginal and inconsistent gains (1-2 points) over no-retrieval baselines in biomedical QA, with model choice dominating retriever or corpus effects.

Compared to What? Baselines and Metrics for Counterfactual Prompting

cs.CL · 2026-05-01 · conditional · novelty 6.0

Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.

Can AI-Generated Text be Reliably Detected?

cs.CL · 2023-03-17 · unverdicted · novelty 6.0

Recursive paraphrasing attacks substantially lower detection rates for multiple AI text detectors with only minor quality loss, while a theoretical analysis ties best-case AUROC to total variation distance between human and AI distributions.
