C3: Zero-shot text-to-SQL with ChatGPT
20 Pith papers cite this work. Polarity classification is still indexing.
roles: background 3
polarities: background 3
citing papers explorer
-
ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation
ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.
-
A Semantic-Layer-Mediated Agent for Natural Language to SQL over Heterogeneous Enterprise Databases
A semantic-layer-mediated NL2SQL agent using SMQ achieves 94.15% execution accuracy on the 547-task Spider2-snow benchmark with Gemini 3 Pro.
-
Residual Skill Optimization for Text-to-SQL Ensembles
Residual skill optimization creates complementary Text-to-SQL agents by training each new skill on prior ensemble failures, yielding accuracy gains on Spider2-Lite and transfer to other dialects and tasks.
-
EXPO-SQL: Execution-based Clause-level Policy Optimization for Text-to-SQL
EXPO-SQL improves Text-to-SQL by using clause-level rewards derived from execution error messages and incremental clause execution instead of uniform query-level rewards.
-
ROSE: An Intent-Centered Evaluation Metric for NL2SQL
ROSE is an intent-centered NL2SQL metric using an adversarial Prover-Refuter cascade that achieves higher human-expert agreement than prior metrics on a new validation set.
-
NL2SQLBench: A Modular Benchmarking Framework for LLM-Enabled NL2SQL Solutions
NL2SQLBench is a new modular benchmarking framework that evaluates LLM NL2SQL methods across three core modules on existing datasets, exposing large accuracy gaps and computational inefficiency.
-
Large Language Model-Enhanced Relational Operators: Taxonomy, Benchmark, and Analysis
The authors define a taxonomy for LLM-enhanced relational operators categorized into Select, Match, Impute, Cluster and Order, and release LROBench to evaluate single and multi-operator queries on semantic database processing.
-
ZAS-SQL: Distilling Rules from Failures for Zero-Shot Text-to-SQL
ZAS-SQL distills rules from zero-shot Text-to-SQL failures to reach 87.2-88.6% execution accuracy on Spider, a new zero-shot state of the art that surpasses some GPT-4 few-shot and fine-tuned baselines.
-
EviLink: Multi-Path Schema Linking with Uncertainty-Guided Evidence Acquisition for Large-Scale Text-to-SQL
EviLink combines multi-hypothesis schema grounding with uncertainty-guided evidence acquisition, reporting 90.15% field-level recall and 123.30K average tokens on Spider2-Snow while improving downstream SQL generation.
-
EGREFINE: An Execution-Grounded Optimization Framework for Text-to-SQL Schema Refinement
EGRefine optimizes column renamings via execution-grounded verification and view materialization to recover Text-to-SQL accuracy lost to schema naming issues while guaranteeing query equivalence.
-
PiLLar: Matching for Pivot Table Schema via LLM-guided Monte-Carlo Tree Search
PiLLar is the first LLM-guided Monte-Carlo Tree Search framework for joint schema-value matching on pivot tables, achieving 87.94% average accuracy on PTbench, a new benchmark derived from real-world domains.
-
SQL Query Engine: A Self-Healing LLM Pipeline for Natural Language to PostgreSQL Translation
A self-healing LLM pipeline for natural language to PostgreSQL translation achieves up to 9.3 percentage point accuracy gains on benchmarks through error diagnosis and anti-regression mechanisms.
-
AV-SQL: Decomposing Complex Text-to-SQL Queries with Agentic Views
AV-SQL uses a pipeline of LLM agents to generate intermediate CTE views that decompose complex Text-to-SQL queries, reaching 70.38% execution accuracy on Spider 2.0.
-
Knapsack Optimization-based Schema Linking for LLM-based Text-to-SQL Generation
KaSLA applies knapsack optimization hierarchically to schema linking for LLM text-to-SQL, claiming better results than large models and improved SQL generation on Spider and BIRD.
-
RAS: Reflection-Augmented Scaling with In-Context Learning for Executable Cypher Query Generation
RAS conditions each new Cypher query attempt on prior execution errors through ICL and reduces execution error rate by 41-50% at n=5 versus 32-38% for independent scaling across three Neo4j datasets and five models.
-
SecureMCP: A Policy-Enforced LLM Data Access Framework for AIoT Systems via Model Context Protocol
SecureMCP integrates RBAC with five sequential defense modules in an MCP server to achieve 82.3% policy compliance against adversarial LLM SQL queries in AIoT while preserving execution accuracy.
-
MARS-SQL: A multi-agent reinforcement learning framework for Text-to-SQL
MARS-SQL trains a multi-agent RL system with ReAct-style interaction and generative validation to produce SQL queries, reaching 77.84% execution accuracy on BIRD dev and 89.75% on Spider test.
-
XiYan-SQL: A Novel Multi-Generator Framework For Text-to-SQL
XiYan-SQL achieves SOTA Text-to-SQL accuracy by combining schema filtering, a multi-generator ensemble fine-tuned on varied SQL formats, and a selection model.
-
CHESS: Contextual Harnessing for Efficient SQL Synthesis
CHESS deploys four LLM agents to retrieve information, prune schemas, generate refined SQL candidates, and validate via unit tests, reporting up to 71.10% accuracy on BIRD with 83% fewer calls than leading proprietary baselines.
-
Are Diffusion Language Models Good Database Analysts?
The paper introduces a standardized evaluation setup and the SQL-D1 agent for diffusion language models on NL2SQL, claiming structural robustness advantages over autoregressive models.