BloombergGPT: A Large Language Model for Finance

Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann · 2023 · cs.LG · arXiv 2303.17564

Canonical reference. 100% of citing Pith papers cite this work as background.

69 Pith papers citing it

Background: 100% of classified citations

abstract

The use of NLP in the realm of financial technology is broad and complex, with applications ranging from sentiment analysis and named entity recognition to question answering. Large Language Models (LLMs) have been shown to be effective on a variety of tasks; however, no LLM specialized for the financial domain has been reported in literature. In this work, we present BloombergGPT, a 50 billion parameter language model that is trained on a wide range of financial data. We construct a 363 billion token dataset based on Bloomberg's extensive data sources, perhaps the largest domain-specific dataset yet, augmented with 345 billion tokens from general purpose datasets. We validate BloombergGPT on standard LLM benchmarks, open financial benchmarks, and a suite of internal benchmarks that most accurately reflect our intended usage. Our mixed dataset training leads to a model that outperforms existing models on financial tasks by significant margins without sacrificing performance on general LLM benchmarks. Additionally, we explain our modeling choices, training process, and evaluation methodology. We release Training Chronicles (Appendix C) detailing our experience in training BloombergGPT.

citation-role summary

background 11

citation-polarity summary

background 11

representative citing papers

PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data

q-fin.CP · 2026-04-03 · conditional · novelty 8.0

Only two of seven LLMs produce positive returns on live Polymarket data, with MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR while the other five lose money.

Model Context Protocol (MCP) at First Glance: Studying the Security and Maintainability of MCP Servers

cs.SE · 2025-06-16 · conditional · novelty 8.0

First study of 1,899 MCP servers finds eight distinct vulnerabilities (only three traditional), 7.2% with general issues, 5.5% with tool poisoning, and 66% with code smells, urging MCP-specific security practices.

AI Trading's Alpha Singularity: Emergent Market Reasoning through Agent-to-Agent Self-Evolution

cs.AI · 2026-06-28 · reject · novelty 7.0

Multi-agent LLM system Agora under Sealed Joint Search conditions produces +1.87 holdout Sharpe on CSI 1000 over a 91-day sealed period, exceeding the best baseline at +1.334 under favorable seed.

It's Not Always Sycophancy: Measuring LLM Conformity as a Function of Epistemic Uncertainty

cs.CL · 2026-05-26 · unverdicted · novelty 7.0

MUSE framework shows LLM conformity to user pushback arises from both sycophantic alignment and epistemic uncertainty, with both increasing when users appear expert or suggestions seem plausible.

From Feedback Loops to Policy Updates: Reinforcement Fine-Tuning for LLM-Based Alpha Factor Discovery

cs.CE · 2026-05-14 · unverdicted · novelty 7.0

QuantEvolver applies reinforcement fine-tuning to evolve an LLM policy for generating executable alpha factor expressions, yielding higher-quality and more complementary factors than prompt-based baselines on market benchmarks.

AutoRedTrader: Autonomous Red Teaming of Trading Agents through Synthetic Misinformation Injection

cs.CE · 2026-05-09 · unverdicted · novelty 7.0

AutoRedTrader generates synthetic financial misinformation via behavioral bias manipulation and agent feedback to red-team LLM trading agents, reaching 69% exposure and 26.67% attack success on Bitcoin data simulations.

From Hypotheses to Factors: Constrained LLM Agents in Cryptocurrency Markets

q-fin.PM · 2026-04-29 · unverdicted · novelty 7.0

Constrained LLM agents discover cryptocurrency factors that produce a portfolio with 44.55% annualized return and Sharpe ratio of 1.55 in pure out-of-sample 2024-2026 testing after trading costs.

Detecting Corporate AI-Washing via Cross-Modal Semantic Inconsistency Learning

cs.CY · 2026-03-24 · unverdicted · novelty 7.0

AWASH detects AI-washing via cross-modal inconsistency reasoning on a new trimodal benchmark of 88k corporate disclosure triplets, achieving F1 0.882 with a CMID network that grounds claims against patents and hiring data.

SynBench: A Benchmark for Differentially Private Text Generation

cs.AI · 2025-09-18 · conditional · novelty 7.0

SynBench benchmarks DP text generators across nine datasets and uses a new MIA to show that public pre-training on portions of private data overestimates synthetic text quality and breaks DP privacy bounds.

CLExEval: A Human-in-the-Loop Framework for Qualitative Evaluation of LLM Clinical Reasoning

cs.CL · 2026-06-30 · unverdicted · novelty 6.0

CLExEval introduces a human-annotated evaluation framework on 40 rare cases that identifies verbosity bias, hidden knowledge paradox, and 68.6% reasoning-to-output mismatch in LLMs while showing LLM-as-a-Judge overestimates reliability.

Fast Unlearning at Scale via Margin Self-Correction

cs.LG · 2026-06-01 · unverdicted · novelty 6.0

MASC achieves competitive forget-retain trade-offs in language model unlearning at lower computational cost via margin self-correction and an online stopping criterion on TOFU, MUSE News, and MUSE Books.

SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

cs.AI · 2026-06-01 · unverdicted · novelty 6.0

SafeSteer restricts reverse KL penalty to safety tokens selected via activation steering, achieving strong safety on seven benchmarks with minimal degradation on five capability benchmarks using only 100 harmful samples and no general data.

IPO-Mine: A Toolkit and Dataset for Section-Structured Analysis of Long, Multimodal IPO Documents

cs.CL · 2026-05-27 · unverdicted · novelty 6.0

IPO-Mine releases a toolkit and large multimodal dataset for structured analysis of IPO filings and shows state-of-the-art models diverge from human judgments on chart quality and misleadingness.

Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning

cs.LG · 2026-05-16 · unverdicted · novelty 6.0

Distinguishable Deletion unifies knowledge erasure and refusal for LLM unlearning via an energy index that enforces boundaries during training and enables refusal at inference.

From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

cs.CL · 2026-05-14 · unverdicted · novelty 6.0

A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance with small text-to-voice gaps.

Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

Stateful sessions with incremental KV cache and flash queries allow O(|q|) latency in streaming transformer inference, delivering up to 5.9x speedup over conventional engines while preserving full attention.

Agentic Retrieval-Augmented Generation for Financial Document Question Answering

cs.AI · 2026-05-06 · unverdicted · novelty 6.0

FinAgent-RAG achieves 76.81-78.46% execution accuracy on financial QA benchmarks by combining contrastive retrieval, program-of-thought code generation, and adaptive strategy routing, outperforming baselines by 5.62-9.32 points.

Effective Performance Measurement: Challenges and Opportunities in KPI Extraction from Earnings Calls

cs.CL · 2026-05-04 · unverdicted · novelty 6.0

Encoder models trained on SEC filings struggle with earnings calls due to domain shift, while LLMs enable open-ended KPI extraction with 79.7% human-verified precision on newly introduced benchmarks.

RouteNLP: Closed-Loop LLM Routing with Conformal Cascading and Distillation Co-Optimization

cs.CL · 2026-04-26 · unverdicted · novelty 6.0

RouteNLP is a closed-loop LLM routing framework using conformal cascading and distillation co-optimization that cut inference costs by 58% in an 8-week enterprise deployment while preserving 91% acceptance and high quality on benchmarks.

Cross-Stock Predictability via LLM-Augmented Semantic Networks

q-fin.PM · 2026-04-21 · unverdicted · novelty 6.0

LLM filtering of embedding-based stock networks raises long-short Sharpe ratio from 0.742 to 0.820 and cuts max drawdown from -10.47% to -7.85% in 2011-2019 S&P 500 backtests.

QRAFTI: An Agentic Framework for Empirical Research in Quantitative Finance

cs.MA · 2026-04-20 · unverdicted · novelty 6.0

QRAFTI is a multi-agent framework using tool-calling and reflection-based planning to emulate quant research tasks like factor replication and signal testing on financial data.

MFMDQwen: Multilingual Financial Misinformation Detection Based on Large Language Model

cs.CE · 2026-04-20 · unverdicted · novelty 6.0

MFMDQwen is the first open-source LLM for multilingual financial misinformation detection, backed by a new instruction dataset and benchmark on which it outperforms other open-source models.

SenseAI: A Human-in-the-Loop Dataset for RLHF-Aligned Financial Sentiment Reasoning

cs.CL · 2026-04-06 · unverdicted · novelty 6.0

SenseAI is a human-in-the-loop financial sentiment dataset with reasoning processes and market outcomes that reveals predictable LLM error patterns like Latent Reasoning Drift for RLHF alignment.

SysTradeBench: An Iterative Build-Test-Patch Benchmark for Strategy-to-Code Trading Systems with Drift-Aware Diagnostics

cs.SE · 2026-04-06 · unverdicted · novelty 6.0

SysTradeBench evaluates 17 LLMs on 12 trading strategies, finding over 91.7% code validity but rapid convergence in iterative fixes and a continued need for human oversight on critical strategies.

citing papers explorer

Showing 15 of 15 citing papers after filters.

Model Context Protocol (MCP) at First Glance: Studying the Security and Maintainability of MCP Servers

cs.SE · 2025-06-16 · conditional · none · ref 148 · internal anchor

First study of 1,899 MCP servers finds eight distinct vulnerabilities (only three traditional), 7.2% with general issues, 5.5% with tool poisoning, and 66% with code smells, urging MCP-specific security practices.

SynBench: A Benchmark for Differentially Private Text Generation

cs.AI · 2025-09-18 · conditional · none · ref 45 · internal anchor

SynBench benchmarks DP text generators across nine datasets and uses a new MIA to show that public pre-training on portions of private data overestimates synthetic text quality and breaks DP privacy bounds.

Understanding Structured Financial Data with LLMs: A Case Study on Fraud Detection

cs.LG · 2025-12-15 · unverdicted · none · ref 43 · internal anchor

FinFRE-RAG combines importance-guided feature reduction with label-aware retrieval-augmented generation to boost LLM performance on tabular fraud detection across four public datasets while providing human-readable rationales.

MM-Telco: Benchmarks and Multimodal Large Language Models for Telecom Applications

cs.AI · 2025-11-17 · unverdicted · none · ref 41 · internal anchor

MM-Telco creates multimodal benchmarks for telecom and demonstrates that fine-tuned LLMs and VLMs achieve significant performance gains on domain-specific tasks.

Learning to Conceal Risk: Controllable Multi-turn Red Teaming for LLMs in the Financial Domain

cs.CL · 2025-09-07 · unverdicted · none · ref 36 · internal anchor

CoRT achieves 95% average attack success rate on nine LLMs by using iterative risk-concealing prompts and a controller that scores concealment levels on a new 522-instruction financial risk benchmark.

The Dark Side of LLMs: Agent-based Attack Vectors for System-level Compromise

cs.CR · 2025-07-09 · conditional · none · ref 30 · internal anchor

Testing 18 LLMs found 94.4% vulnerable to direct prompt injection for malware installation, 83.3% to RAG backdoor attacks, and 100% to inter-agent trust exploitation in multi-agent systems.

From Time Series Analysis to Question Answering: A Survey in the LLM Era

cs.LG · 2025-06-13 · accept · none · ref 108 · internal anchor

A survey proposing a taxonomy of Injective, Bridging, and Internal Alignment paradigms to evolve TSA into user-driven Time Series Question Answering with LLMs.

BoHA: Blockwise Hadamard Product Adaptation for Parameter-Efficient Fine-Tuning

cs.LG · 2025-09-25 · unverdicted · none · ref 2 · internal anchor

BoHA partitions frozen weights into a b by b grid and applies independent low-rank Hadamard factors per block, outperforming LoRA on matched-budget single-task averages while retaining 57.66% first-stage accuracy in a commonsense-to-arithmetic continual-learning test on Llama-3.2-3B.

MulFSA: Multi-level Financial Sentiment Analysis Framework for Bond Market

cs.CE · 2025-04-03 · unverdicted · none · ref 13 · internal anchor

MulFSA combines micro-level firm sentiment, meso-level industry sentiment, and duration-aware smoothing from PLMs/LLMs to extract a daily sentiment index that reduces credit spread forecast errors by 10.25% MAE and 11.94% MAPE on a 1.35M-text Chinese bond corpus.

Multi-Model Synthetic Training for Mission-Critical Small Language Models

cs.CL · 2025-09-16 · unverdicted · none · ref 5 · internal anchor

Fine-tunes Qwen2.5-7B on 21,543 synthetic maritime Q&A pairs generated from 3.2B AIS records by GPT-4o and o3-mini, reaching 75% accuracy at 261x lower inference cost than larger models.

What Factors Affect LLMs and RLLMs in Financial Question Answering?

cs.CL · 2025-07-11 · unverdicted · none · ref 16 · internal anchor

Prompting and agent methods boost standard LLMs on financial QA by simulating long chain-of-thought reasoning, but reasoning LLMs already have this capability and show limited further gains, while multilingual alignment helps mainly by lengthening reasoning with minimal benefit for reasoning models.

Deploying Large AI Models on Resource-Limited Devices with Split Federated Learning

cs.LG · 2025-04-12 · unverdicted · none · ref 4 · internal anchor

SFLAM is a quantized split federated fine-tuning framework for large AI models that reduces device memory, energy use, and latency via split learning, optimization strategies, and simulations showing gains over conventional methods.

Towards EnergyGPT: A Large Language Model Specialized for the Energy Sector

cs.CL · 2025-09-08 · unverdicted · none · ref 4 · internal anchor

Fine-tuned LLaMA 3.1-8B variants for the energy sector outperform the base model on domain QA benchmarks, with LoRA delivering similar gains at lower training cost.

Bridging Language Models and Financial Analysis

q-fin.ST · 2025-03-14 · unverdicted · none · ref 106 · internal anchor

A survey synthesizing recent LLM research and assessing its applicability to financial data analysis.

MetaGraph: A Large-Scale Meta-Analysis of GenAI in Financial NLP (2022-2025)

cs.CL · 2025-09-11 · unreviewed · ref 49 · internal anchor
