IndiaFinBench is the first public benchmark for LLMs on Indian financial regulatory text, with twelve models scoring 70.4-89.7% accuracy and all outperforming a 69% human baseline.
citation dossier · hub
Financebench: A new benchmark for financial question answering
why this work matters in Pith
Pith has found this work in 17 reviewed papers. Its strongest current cluster is cs.CL (5 papers). The largest review-status bucket among citing papers is UNVERDICTED (15 papers). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.
hub tools
years
2026 17representative citing papers
BizCompass is a dual-axis benchmark evaluating LLMs on business knowledge in finance, economics, statistics, and operations management, linked to analyst, trader, and consultant roles, with public datasets released after testing open and commercial models.
Pico reduces LoRA merge interference by calibrating over-shared directions in the B matrix before merging, yielding 3.4-8.3 point accuracy gains and sometimes beating joint training.
Credo proposes representing LLM agent state as beliefs and regulating pipeline behavior with declarative policies stored in a database for adaptive, auditable control.
FinTrace supplies trajectory-level metrics for LLM financial tool calling, exposing gaps in information use and output quality, while its preference dataset enables DPO training that boosts intermediate metrics.
FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.
Training transformers with KV sparsification during continued pretraining produces representations that admit better post-hoc KV cache compression, improving quality under memory budgets for long-context tasks.
LATTICE is a scalable LLM-judge benchmark for crypto agent decision support that reveals performance trade-offs among real-world copilots across dimensions and tasks.
LLMs show low sycophancy to direct contradictions in financial tasks but high sycophancy to user preference contradictions, with input filtering as one recovery approach.
SysTradeBench evaluates 17 LLMs on 12 trading strategies, finding over 91.7% code validity but rapid convergence in iterative fixes and a continued need for human oversight on critical strategies.
PBKV predicts agent invocations in dynamic LLM workflows to manage KV-cache reuse, delivering up to 1.85x speedup over LRU and 1.26x over KVFlow.
AgenticRAG equips an LLM with iterative retrieval and navigation tools, delivering 49.6% recall@1 on BRIGHT, 0.96 factuality on WixQA, and 92% correctness on FinanceBench.
Tree reasoning outperforms vector search on complex document queries but a hybrid approach balances results across tiers, with validation showing an 11.7-point gap on real finance documents.
Systematic tests show that specific PDF parsers combined with overlapping chunking strategies better preserve structure and improve RAG answer correctness on financial QA benchmarks including the new TableQuest dataset.
Byte-exact deduplication reduces RAG context size by 0.16% to 80.34% across three regimes with zero measurable quality regression per multi-vendor LLM evaluation.
Structured memory improves precision on deterministic financial calculations while retrieval-augmented generation outperforms in conversational settings, supporting a hybrid deployment framework for resource-constrained SMEs.
FinSec is a multi-stage detection system for financial LLM dialogues that reaches 90.13% F1 score, cuts attack success rate to 9.09%, and raises AUPRC to 0.9189.
citing papers explorer
-
IndiaFinBench: An Evaluation Benchmark for Large Language Model Performance on Indian Financial Regulatory Text
IndiaFinBench is the first public benchmark for LLMs on Indian financial regulatory text, with twelve models scoring 70.4-89.7% accuracy and all outperforming a 69% human baseline.
-
BizCompass: Benchmarking the Reasoning Capabilities of LLMs in Business Knowledge and Applications
BizCompass is a dual-axis benchmark evaluating LLMs on business knowledge in finance, economics, statistics, and operations management, linked to analyst, trader, and consultant roles, with public datasets released after testing open and commercial models.
-
Crowded in B-Space: Calibrating Shared Directions for LoRA Merging
Pico reduces LoRA merge interference by calibrating over-shared directions in the B matrix before merging, yielding 3.4-8.3 point accuracy gains and sometimes beating joint training.
-
Credo: Declarative Control of LLM Pipelines via Beliefs and Policies
Credo proposes representing LLM agent state as beliefs and regulating pipeline behavior with declarative policies stored in a database for adaptive, auditable control.
-
FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks
FinTrace supplies trajectory-level metrics for LLM financial tool calling, exposing gaps in information use and output quality, while its preference dataset enables DPO training that boosts intermediate metrics.
-
FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks
FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.
-
Training Transformers for KV Cache Compressibility
Training transformers with KV sparsification during continued pretraining produces representations that admit better post-hoc KV cache compression, improving quality under memory budgets for long-context tasks.
-
LATTICE: Evaluating Decision Support Utility of Crypto Agents
LATTICE is a scalable LLM-judge benchmark for crypto agent decision support that reveals performance trade-offs among real-world copilots across dimensions and tasks.
-
The Price of Agreement: Measuring LLM Sycophancy in Agentic Financial Applications
LLMs show low sycophancy to direct contradictions in financial tasks but high sycophancy to user preference contradictions, with input filtering as one recovery approach.
-
SysTradeBench: An Iterative Build-Test-Patch Benchmark for Strategy-to-Code Trading Systems with Drift-Aware Diagnostics
SysTradeBench evaluates 17 LLMs on 12 trading strategies, finding over 91.7% code validity but rapid convergence in iterative fixes and a continued need for human oversight on critical strategies.
-
Efficient Serving for Dynamic Agent Workflows with Prediction-based KV-Cache Management
PBKV predicts agent invocations in dynamic LLM workflows to manage KV-cache reuse, delivering up to 1.85x speedup over LRU and 1.26x over KVFlow.
-
AgenticRAG: Agentic Retrieval for Enterprise Knowledge Bases
AgenticRAG equips an LLM with iterative retrieval and navigation tools, delivering 49.6% recall@1 on BRIGHT, 0.96 factuality on WixQA, and 92% correctness on FinanceBench.
-
Adaptive Query Routing: A Tier-Based Framework for Hybrid Retrieval Across Financial, Legal, and Medical Documents
Tree reasoning outperforms vector search on complex document queries but a hybrid approach balances results across tiers, with validation showing an 11.7-point gap on real finance documents.
-
Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG
Systematic tests show that specific PDF parsers combined with overlapping chunking strategies better preserve structure and improve RAG answer correctness on financial QA benchmarks including the new TableQuest dataset.
-
Byte-Exact Deduplication in Retrieval-Augmented Generation: A Three-Regime Empirical Analysis Across Public Benchmarks
Byte-exact deduplication reduces RAG context size by 0.16% to 80.34% across three regimes with zero measurable quality regression per multi-vendor LLM evaluation.
-
Architecture Matters More Than Scale: A Comparative Study of Retrieval and Memory Augmentation for Financial QA Under SME Compute Constraints
Structured memory improves precision on deterministic financial calculations while retrieval-augmented generation outperforms in conversational settings, supporting a hybrid deployment framework for resource-constrained SMEs.
-
Conversations Risk Detection LLMs in Financial Agents via Multi-Stage Generative Rollout
FinSec is a multi-stage detection system for financial LLM dialogues that reaches 90.13% F1 score, cuts attack success rate to 9.09%, and raises AUPRC to 0.9189.