A survey on in-context learning
10 Pith papers cite this work.
2026: 10 representative citing papers
-
NL2SQLBench: A Modular Benchmarking Framework for LLM-Enabled NL2SQL Solutions
NL2SQLBench is a new modular benchmarking framework that evaluates LLM NL2SQL methods across three core modules on existing datasets, exposing large accuracy gaps and computational inefficiency.
-
Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling
A tabular foundation model with LLM-as-Observer features predicts AI agent decisions in controlled games, outperforming baselines by 4 AUC points and reducing error by 14% at K=16 interactions.
-
Adversarial SQL Injection Generation with LLM-Based Architectures
RADAGAS-GPT4o achieves a 22.73% bypass rate against 10 WAFs, succeeding more often against AI/ML-based firewalls than against rule-based ones.
-
Belief or Circuitry? Causal Evidence for In-Context Graph Learning
Causal evidence from representation analysis and interventions shows LLMs use both genuine structure inference and induction circuits in parallel for in-context graph learning.
-
Similar Pattern Annotation via Retrieval Knowledge for LLM-Based Test Code Fault Localization
SPARK improves LLM-based test code fault localization by retrieving similar past faults and selectively annotating suspicious lines in new failing tests.
-
GRaSp: Automatic Example Optimization for In-Context Learning in Low-Data Tasks
GRaSp optimizes in-context examples for LLMs via synthetic generation, clustering, dimensionality reduction, and a genetic algorithm with diversity-adaptive mutation, reaching 45.84% micro-F1 on real-data financial NER and outperforming zero-shot and random few-shot baselines.
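GRaSp's full pipeline (synthetic generation, clustering, genetic search) is out of scope here, but the diversity-driven selection idea it relies on can be illustrated with a much simpler greedy max-min subset over a toy token-set distance. This is a sketch of the general technique, not the paper's method; all names and the Jaccard distance choice are assumptions for illustration.

```python
def jaccard_distance(a: str, b: str) -> float:
    """Toy lexical distance: 1 minus Jaccard overlap of word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(sa & sb) / len(sa | sb)

def greedy_diverse_subset(pool: list[str], k: int) -> list[str]:
    """Greedily pick k examples, each maximizing its minimum
    distance to the examples already chosen (max-min diversity)."""
    chosen = [pool[0]]
    while len(chosen) < k:
        best = max(
            (c for c in pool if c not in chosen),
            key=lambda c: min(jaccard_distance(c, s) for s in chosen),
        )
        chosen.append(best)
    return chosen

pool = ["revenue rose in q1", "revenue rose in q1 again", "the dog ran fast"]
print(greedy_diverse_subset(pool, 2))  # near-duplicate is skipped
```

A genetic algorithm with diversity-adaptive mutation, as the paper describes, explores the same objective stochastically rather than greedily; the greedy version above is just the cheapest baseline for that objective.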
-
Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study
Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.
-
Understanding the Mechanism of Altruism in Large Language Models
A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.
-
Memory Transfer Learning: How Memories are Transferred Across Domains in Coding Agents
Cross-domain memory transfer in coding agents boosts average performance by 3.7% via meta-knowledge like validation routines, with high-level abstractions transferring better than low-level traces.
-
An Empirical Study of Many-Shot In-Context Learning for Machine Translation of Low-Resource Languages
BM25 retrieval makes many-shot ICL for low-resource MT roughly 5x more data-efficient: 50 retrieved examples match 250 random ones, and 250 match 1,000.