CodeScore: Evaluating Code Generation by Learning Code Execution
3 Pith papers cite this work. Polarity classification is still indexing.
Verdicts by year: 2026 (3 verdicts, all unverdicted).

Representative citing papers (3):
- Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy
  LLMs display clear performance stratification on formal language tasks aligned with Chomsky hierarchy complexity levels, limited by severe efficiency barriers rather than by absolute capability.
- MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems
  MASPO jointly optimizes prompts in multi-agent LLM systems via downstream-success evaluation and evolutionary beam search, delivering an average accuracy gain of 2.9 over prior methods across six tasks.
- RuC: HDL-Agnostic Rule Completion Benchmark Generation
  RuC generates language-agnostic, grammar-based benchmarks for evaluating LLMs on RTL code completion at controllable granularities, demonstrated on SystemVerilog designs from Tiny Tapeout and a RISC-V core, where Fill-in-the-Middle prompting performed best.
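The evolutionary beam search mentioned in the MASPO entry can be illustrated with a toy sketch: keep the top-k prompts by a downstream-success score, and expand each survivor with mutated children every generation. The scoring function, mutation operator, and prompt fragments below are illustrative assumptions, not MASPO's actual components.

```python
import random

def mutate(prompt: str, rng: random.Random) -> str:
    """Hypothetical mutation: append one of a few instruction fragments."""
    fragments = [" Think step by step.", " Be concise.", " Verify your answer."]
    return prompt + rng.choice(fragments)

def evolutionary_beam_search(seed_prompt, score, beam_width=3,
                             children_per_parent=4, generations=5, seed=0):
    """Keep the top-`beam_width` prompts by downstream-success `score`,
    expanding each survivor with mutated children each generation."""
    rng = random.Random(seed)
    beam = [seed_prompt]
    for _ in range(generations):
        candidates = list(beam)  # parents survive into the next round
        for parent in beam:
            candidates += [mutate(parent, rng) for _ in range(children_per_parent)]
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
    return beam[0]

# Stand-in score: reward mentions of verification, lightly penalize length
# (purely illustrative; MASPO scores prompts by downstream task success).
best = evolutionary_beam_search(
    "Solve the task.",
    score=lambda p: p.count("Verify") - 0.01 * len(p),
)
```

Since mutations only ever append to a parent, every candidate retains the seed prompt as a prefix; the beam width bounds how many lineages are explored per generation.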
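The Fill-in-the-Middle prompting that the RuC entry reports as best-performing can be sketched as follows: the code before and after a masked span is arranged around sentinel tokens so the model generates the missing middle. The `<fim_prefix>`/`<fim_suffix>`/`<fim_middle>` sentinels follow a common convention, but exact token names are model-specific assumptions, and the SystemVerilog snippet is hypothetical.

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Arrange code before and after a masked span so a FIM-trained
    model generates the missing middle after <fim_middle>."""
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

# Example: ask the model to complete the body of a clocked always block
# in a hypothetical SystemVerilog fragment.
prefix = "always_ff @(posedge clk) begin\n"
suffix = "\nend"
prompt = build_fim_prompt(prefix, suffix)
```

The model's completion is then spliced between the prefix and suffix, which is what makes this style a natural fit for the mid-file rule-completion holes RuC generates.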