On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey

Long, Lin, Wang, Rui, Xiao, Ruixuan, Zhao, Junbo, Ding, Xiao, Chen, Gang · 2024 · arXiv 2406.15126

9 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models

cs.CL · 2025-04-29 · unverdicted · novelty 7.0

The authors generate and publicly release the first large-scale open dataset of three million structured moral fables produced by small open language models together with a reproducible LLM-judge evaluation pipeline.

EHRBench: An Automated and Reliable EHR-based Benchmark for Clinical Decision Making with LLMs

cs.AI · 2026-05-28 · unverdicted · novelty 6.0

EHRBench uses an EHR-LLM-KB pipeline to automatically create 960,067 reliable QA items spanning diagnosis, treatment, and prognosis for large-scale LLM evaluation in clinical decision making.

Generalistic or Specific Embeddings, Which is Better? An Empirical Study on Search for Clinical Coding in Non-English Languages

cs.CL · 2026-05-28 · unverdicted · novelty 5.0

Fine-tuning a Spanish biomedical encoder on Gemini-generated synthetic data for multiple languages yields a bi-encoder that matches or exceeds BioBERT-ST on clinical code retrieval metrics, with further gains from cross-encoder reranking on most languages.

Occupational Prompting Reveals Cultural Bias in Large Language Models

cs.CY · 2026-05-19 · unverdicted · novelty 5.0

Occupational prompting of open-weight LLMs elicits structured value patterns in Inglehart-Welzel cultural space, extending prior nationality-based cultural bias evaluations.

STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator

cs.AI · 2026-04-27 · unverdicted · novelty 5.0

STELLAR-E modifies the TGRT Self-Instruct framework to produce tailored synthetic LLM evaluation datasets that score an average 5.7% higher on LLM-as-a-judge metrics than existing language-specific benchmarks.

Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents

cs.CL · 2026-05-11 · unverdicted · novelty 4.0

Audio language models are benchmarked on five semantic and paralinguistic reasoning tasks to reveal limitations in handling spoken audio evidence, accent variation, and domain shifts.

Synthetic Data Generation for Brain-Computer Interfaces: Overview, Benchmarking, and Future Directions

cs.LG · 2026-03-11 · accept · novelty 4.0

A survey that taxonomizes synthetic brain signal generation methods into four categories, benchmarks them on motor imagery, seizure detection, SSVEP, and auditory attention tasks, and outlines evaluation principles and future directions for data-efficient BCIs.

ShieldGemma: Generative AI Content Moderation Based on Gemma

cs.CL · 2024-07-31 · unverdicted · novelty 4.0

ShieldGemma delivers a family of Gemma2-based classifiers that outperform Llama Guard and WildGuard on public safety benchmarks while introducing a synthetic-data curation pipeline for safety tasks.

Sustainability via LLM Right-sizing

cs.CL · 2025-04-17 · unverdicted · novelty 3.0

Empirical comparison shows smaller open-weight LLMs achieve strong performance on everyday work tasks, supporting task-aware selection over always using the largest models for sustainability and cost reasons.

citing papers explorer

TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models cs.CL · 2025-04-29 · unverdicted · none · ref 3
EHRBench: An Automated and Reliable EHR-based Benchmark for Clinical Decision Making with LLMs cs.AI · 2026-05-28 · unverdicted · none · ref 62
Generalistic or Specific Embeddings, Which is Better? An Empirical Study on Search for Clinical Coding in Non-English Languages cs.CL · 2026-05-28 · unverdicted · none · ref 29
Occupational Prompting Reveals Cultural Bias in Large Language Models cs.CY · 2026-05-19 · unverdicted · none · ref 39
STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator cs.AI · 2026-04-27 · unverdicted · none · ref 19
Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents cs.CL · 2026-05-11 · unverdicted · none · ref 183
Synthetic Data Generation for Brain-Computer Interfaces: Overview, Benchmarking, and Future Directions cs.LG · 2026-03-11 · accept · none · ref 119
ShieldGemma: Generative AI Content Moderation Based on Gemma cs.CL · 2024-07-31 · unverdicted · none · ref 15
Sustainability via LLM Right-sizing cs.CL · 2025-04-17 · unverdicted · none · ref 22
