PQR is a dual-module iterative framework that generates diverse and realistic queries to elicit failures in QA agents, detecting 23-78% more unhelpful responses than prior methods.
arXiv preprint arXiv:2310.13345 , year=
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
Adversarial training on simplified Vision Transformers achieves benign overfitting with near-zero robust loss and generalization error when signal-to-noise ratio and perturbation budget meet specific conditions.
Presents GradingAttack with token- and prompt-level adversarial attacks that compromise LLM educational grading agents on multiple datasets, showing prompt-level attacks succeed more while token-level are stealthier.
Optimus mitigates toxicity during LLM fine-tuning by combining repurposed LLM safety alignments for detection with synthetic data and DPO alignment, remaining effective even with highly biased classifiers and against attacks.
TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt utility.
A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.
citing papers explorer
-
PQR: A Framework to Generate Diverse and Realistic User Queries that Elicit QA Agent Failures
PQR is a dual-module iterative framework that generates diverse and realistic queries to elicit failures in QA agents, detecting 23-78% more unhelpful responses than prior methods.
-
Benign Overfitting in Adversarial Training for Vision Transformers
Adversarial training on simplified Vision Transformers achieves benign overfitting with near-zero robust loss and generalization error when signal-to-noise ratio and perturbation budget meet specific conditions.
-
GradingAttack: Exposing Security Vulnerabilities in LLM Based Educational Grading Agents
Presents GradingAttack with token- and prompt-level adversarial attacks that compromise LLM educational grading agents on multiple datasets, showing prompt-level attacks succeed more while token-level are stealthier.
-
Optimus: A Robust Defense Framework for Mitigating Toxicity while Fine-Tuning Conversational AI
Optimus mitigates toxicity during LLM fine-tuning by combining repurposed LLM safety alignments for detection with synthetic data and DPO alignment, remaining effective even with highly biased classifiers and against attacks.
-
TrustLLM: Trustworthiness in Large Language Models
TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt utility.
-
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.