The authors generate and publicly release the first large-scale open dataset of three million structured moral fables produced by small open language models together with a reproducible LLM-judge evaluation pipeline.
On llms-driven synthetic data generation, curation, and evaluation: A survey,
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
EHRBench uses an EHR-LLM-KB pipeline to automatically create 960,067 reliable QA items spanning diagnosis, treatment, and prognosis for large-scale LLM evaluation in clinical decision making.
Fine-tuning a Spanish biomedical encoder on Gemini-generated synthetic data for multiple languages yields a bi-encoder that matches or exceeds BioBERT-ST on clinical code retrieval metrics, with further gains from cross-encoder reranking on most languages.
Occupational prompting of open-weight LLMs elicits structured value patterns in Inglehart-Welzel cultural space, extending prior nationality-based cultural bias evaluations.
STELLAR-E modifies the TGRT Self-Instruct framework to produce tailored synthetic LLM evaluation datasets that score an average 5.7% higher on LLM-as-a-judge metrics than existing language-specific benchmarks.
Audio language models are benchmarked on five semantic and paralinguistic reasoning tasks to reveal limitations in handling spoken audio evidence, accent variation, and domain shifts.
A survey that taxonomizes synthetic brain signal generation methods into four categories, benchmarks them on motor imagery, seizure detection, SSVEP, and auditory attention tasks, and outlines evaluation principles and future directions for data-efficient BCIs.
ShieldGemma delivers a family of Gemma2-based classifiers that outperform Llama Guard and WildCard on public safety benchmarks while introducing a synthetic-data curation pipeline for safety tasks.
Empirical comparison shows smaller open-weight LLMs achieve strong performance on everyday work tasks, supporting task-aware selection over always using the largest models for sustainability and cost reasons.
citing papers explorer
-
TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models
The authors generate and publicly release the first large-scale open dataset of three million structured moral fables produced by small open language models together with a reproducible LLM-judge evaluation pipeline.
-
EHRBench: An Automated and Reliable EHR-based Benchmark for Clinical Decision Making with LLMs
EHRBench uses an EHR-LLM-KB pipeline to automatically create 960,067 reliable QA items spanning diagnosis, treatment, and prognosis for large-scale LLM evaluation in clinical decision making.
-
Generalistic or Specific Embeddings, Which is Better? An Empirical Study on Search for Clinical Coding in Non-English Languages
Fine-tuning a Spanish biomedical encoder on Gemini-generated synthetic data for multiple languages yields a bi-encoder that matches or exceeds BioBERT-ST on clinical code retrieval metrics, with further gains from cross-encoder reranking on most languages.
-
Occupational Prompting Reveals Cultural Bias in Large Language Models
Occupational prompting of open-weight LLMs elicits structured value patterns in Inglehart-Welzel cultural space, extending prior nationality-based cultural bias evaluations.
-
STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator
STELLAR-E modifies the TGRT Self-Instruct framework to produce tailored synthetic LLM evaluation datasets that score an average 5.7% higher on LLM-as-a-judge metrics than existing language-specific benchmarks.
-
Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents
Audio language models are benchmarked on five semantic and paralinguistic reasoning tasks to reveal limitations in handling spoken audio evidence, accent variation, and domain shifts.
-
Synthetic Data Generation for Brain-Computer Interfaces: Overview, Benchmarking, and Future Directions
A survey that taxonomizes synthetic brain signal generation methods into four categories, benchmarks them on motor imagery, seizure detection, SSVEP, and auditory attention tasks, and outlines evaluation principles and future directions for data-efficient BCIs.
-
ShieldGemma: Generative AI Content Moderation Based on Gemma
ShieldGemma delivers a family of Gemma2-based classifiers that outperform Llama Guard and WildCard on public safety benchmarks while introducing a synthetic-data curation pipeline for safety tasks.
-
Sustainability via LLM Right-sizing
Empirical comparison shows smaller open-weight LLMs achieve strong performance on everyday work tasks, supporting task-aware selection over always using the largest models for sustainability and cost reasons.