Beyond Semantic Similarity: A Two-Phase Non-Parametric Retrieval Workflow for Corporate Credit Underwriting

Ezekiel Tee Kongquan; Kelvin Heng; Kenneth Zhu Ke; Linus Ng Junjia; Zhao Jing Yuan

arxiv: 2605.20684 · v1 · pith:GY4FKLXFnew · submitted 2026-05-20 · 💻 cs.CL

Pith reviewed 2026-05-21 05:50 UTC · model grok-4.3

classification 💻 cs.CL

keywords retrieval augmented generation · corporate credit underwriting · utility ranking · similarity-utility gap · non-parametric retrieval · multilingual financial documents · LLM as judge

The pith

A two-phase retrieval workflow using utility scoring outperforms semantic similarity for corporate credit underwriting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard RAG systems suffer from a similarity-utility gap in long financial documents: semantically similar passages often fail to provide decision-relevant information for credit analysis. To address this, it separates broad candidate retrieval from precise utility ranking in a two-phase architecture. The first phase uses multilingual lexical and dense retrieval; the second applies an adaptive controller and LLM-as-a-Judge scoring based on analytical usefulness. A sympathetic reader would care because this directly targets the time analysts spend on document review in corporate lending decisions, and the paper reports that review time fell from several hours to approximately three minutes in production.

Core claim

The authors present a two-phase non-parametric retrieval architecture for corporate credit underwriting. The first phase constructs a broad candidate pool with lexical and dense multilingual methods. The second phase applies an adaptive retrieval controller that uses query intent and document structure signals, followed by LLM-as-a-Judge utility scoring that ranks passages by analytical usefulness rather than semantic proximity. Context-aware extraction preserves structural fidelity in text and tables.
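A minimal sketch of the two-phase flow, assuming toy stand-ins for every component. The names `lexical_score`, `dense_score`, `controller_keep`, and `judge_utility` are hypothetical placeholders, not the authors' implementations, and the paper does not say how the pool is merged or how the controller and judge are built. Here phase one takes the union of each retriever's top-k results, and phase two filters the pool and sorts it by a stubbed utility score.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass(frozen=True)
class Passage:
    pid: str
    text: str
    section: str  # hypothetical structure signal, e.g. "financials" or "boilerplate"


def lexical_score(query: str, p: Passage) -> float:
    # Toy lexical signal: number of shared lowercase tokens (stand-in for BM25).
    return float(len(set(query.lower().split()) & set(p.text.lower().split())))


def dense_score(query: str, p: Passage) -> float:
    # Toy stand-in for embedding similarity: character-trigram Jaccard overlap.
    def grams(s: str) -> Set[str]:
        s = s.lower()
        return {s[i:i + 3] for i in range(len(s) - 2)}

    a, b = grams(query), grams(p.text)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def top_k(query: str, passages: List[Passage],
          scorer: Callable[[str, Passage], float], k: int) -> List[Passage]:
    return sorted(passages, key=lambda p: scorer(query, p), reverse=True)[:k]


def phase_one(query: str, passages: List[Passage], k: int = 3) -> List[Passage]:
    # High-recall pool: union of both retrievers' top-k, de-duplicated by id.
    pool = {p.pid: p for p in top_k(query, passages, lexical_score, k)}
    pool.update({p.pid: p for p in top_k(query, passages, dense_score, k)})
    return list(pool.values())


def controller_keep(query: str, p: Passage) -> bool:
    # Hypothetical controller rule: drop sections flagged as boilerplate.
    return p.section != "boilerplate"


def judge_utility(query: str, p: Passage) -> float:
    # Stub for the LLM judge: reward passages that contain concrete figures.
    return sum(ch.isdigit() for ch in p.text) / max(len(p.text), 1)


def phase_two(query: str, pool: List[Passage], top_n: int = 2) -> List[Passage]:
    kept = [p for p in pool if controller_keep(query, p)]
    return sorted(kept, key=lambda p: judge_utility(query, p), reverse=True)[:top_n]


corpus = [
    Passage("a", "The company discusses leverage and debt strategy in general terms.", "narrative"),
    Passage("b", "Net debt to EBITDA was 4.2x in 2023 versus 3.1x in 2022.", "financials"),
    Passage("c", "This report may contain forward-looking statements about debt.", "boilerplate"),
]
query = "debt leverage"
ranked = phase_two(query, phase_one(query, corpus))
print([p.pid for p in ranked])  # prints ['b', 'a']
```

In this toy run, passage `a` matches the query best lexically, but the stubbed judge ranks `b` first because it carries the figures an analyst would need. That is the similarity-utility gap in miniature.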

What carries the argument

The two-phase non-parametric retrieval architecture that separates high-recall candidate retrieval from high-precision utility ranking via LLM-as-a-Judge scoring.

If this is right

Significantly outperforms naive retrieval baselines on a multilingual corpus of proprietary financial documents with analyst-curated relevance labels.
Reduces document review time from several hours to approximately three minutes in production deployment across more than 800 credit analysts.
Supports on-premise deployment to meet enterprise data governance requirements.
Maintains fidelity for both narrative text and complex financial tables.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This workflow might extend to other domains requiring extraction of decision-useful information from heterogeneous documents, such as regulatory compliance or investment research.
Testing the correlation between LLM utility scores and actual credit decision outcomes could validate or refine the ranking mechanism.
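One way to run the correlation test suggested above is a Spearman rank correlation between judge scores and some downstream outcome signal. The sketch below uses only the standard library. The variable `outcome_signal` and all numbers are invented for illustration, and the paper does not define such a signal.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    # Spearman's rho is the Pearson correlation of the rank vectors.
    return pearson(average_ranks(x), average_ranks(y))


judge_scores = [0.9, 0.2, 0.6, 0.4, 0.8]   # invented LLM utility scores
outcome_signal = [4, 1, 3, 2, 5]           # invented downstream outcome measure
rho = spearman(judge_scores, outcome_signal)
print(round(rho, 3))  # prints 0.9
```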

Load-bearing premise

The assumption that the LLM-as-a-Judge utility scoring mechanism accurately ranks passages according to analytical usefulness for credit decisions rather than semantic proximity, and that analyst-curated relevance labels are reliable ground truth.

What would settle it

An experiment where credit analysts directly compare the decision utility of passages retrieved by the new system against those from standard semantic retrieval, measuring agreement with ground truth labels or impact on underwriting accuracy.
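Agreement between analyst labels and judge labels in such an experiment could be summarized with Cohen's kappa, which corrects raw agreement for chance. This is a sketch under assumptions: the binary labels `useful` / `not` and the example data are invented, and the paper does not report which agreement statistic, if any, was used.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(count_a[l] * count_b[l] for l in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented labels for eight passages.
analyst = ["useful", "useful", "not", "not", "useful", "not", "useful", "not"]
judge = ["useful", "not", "not", "not", "useful", "not", "useful", "useful"]
kappa = cohen_kappa(analyst, judge)
print(kappa)  # prints 0.5
```

Raw agreement here is 6 of 8 (0.75), but chance agreement with balanced labels is 0.5, so kappa comes out at 0.5.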

Figures

Figures reproduced from arXiv: 2605.20684 by Ezekiel Tee Kongquan, Kelvin Heng, Kenneth Zhu Ke, Linus Ng Junjia, Zhao Jing Yuan.

Original abstract

Corporate credit underwriting requires analysts to extract actionable evidence from long, heterogeneous financial documents spanning hundreds of pages and multiple languages. Standard Retrieval-Augmented Generation (RAG) pipelines optimize for semantic similarity, which frequently surfaces passages that are topically related but lack decision utility, a problem we term the similarity-utility gap. We propose a two-phase non-parametric retrieval architecture that separates high-recall candidate retrieval from high-precision utility ranking. The first phase combines lexical and dense multilingual retrieval to construct a broad candidate pool. The second phase applies an adaptive retrieval controller that filters candidates using query intent and document structure signals, followed by an LLM-as-a-Judge utility scoring mechanism that ranks passages by analytical usefulness rather than semantic proximity. A context-aware extraction module preserves structural fidelity across narrative text and complex financial tables. The system is deployed entirely on-premise to satisfy enterprise data governance requirements. Evaluated on a multilingual corpus of proprietary financial documents with analyst-curated relevance labels, the system significantly outperforms naive retrieval baselines. In production deployment across more than 800 credit analysts, document review time was reduced from several hours to approximately three minutes, demonstrating the practical value of utility-aware RAG architectures for document-intensive decision-support workflows.
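The abstract does not say how the lexical and dense result lists are combined into one candidate pool. One common option, shown here purely as an assumption, is reciprocal rank fusion (RRF), which scores each document by the sum of 1/(k + rank) over the lists that contain it. The function name `rrf_fuse`, the constant `k=60`, and the passage ids are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked id lists with reciprocal rank fusion (ranks start at 1)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


lexical = ["p3", "p1", "p7"]  # hypothetical lexical ranking
dense = ["p1", "p9", "p3"]    # hypothetical dense ranking
fused = rrf_fuse([lexical, dense])
print(fused)  # prints ['p1', 'p3', 'p9', 'p7']
```

Passages found by both retrievers (`p1`, `p3`) rise to the top, while passages found by only one stay in the pool, which suits a high-recall first phase.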

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper describes a practical two-phase retrieval system for long financial documents in credit underwriting with claimed production time savings, but the evaluation lacks reported metrics, validation for the LLM judge, and label reliability details.

The main takeaway is that this work adapts standard retrieval components into a two-phase setup for corporate credit analysis and reports real deployment gains, but the evidence for those gains stays thin on specifics. It combines lexical and dense multilingual retrieval for broad candidates, then uses an adaptive controller with document structure signals and an LLM-as-a-Judge step to rank by analytical usefulness instead of similarity. A context-aware extractor handles tables and narrative text, and the whole thing runs on-premise. They evaluated it on proprietary multilingual financial documents with analyst-curated labels and say it beats naive baselines, plus they deployed it to more than 800 analysts where review time dropped from hours to roughly three minutes. That combination for this exact domain and the production claim are the concrete parts worth noting.

The approach targets a clear operational issue where semantic matches often miss decision-relevant content in hundred-page reports. The on-premise choice and table handling show attention to enterprise constraints.

The soft spots sit in the evaluation. The abstract gives no quantitative metrics, baseline definitions, statistical tests, or error analysis, and the data is proprietary, so external checks are limited. There are no details on LLM judge calibration, human agreement rates, or how the analyst labels were curated and validated. The production time reduction also lacks measurement method or controls. These gaps mean the outperformance and efficiency claims rest on unexamined elements, which matches the stress-test concern.

This paper is for practitioners building retrieval tools in finance or regulated document workflows rather than for core retrieval researchers. A reader facing similar long-document review tasks could pick up usable ideas on the two-phase split and structure signals. It shows honest engagement with the problem and literature on RAG limitations, so the thinking holds up even if the results need more backing. I would bring it to a reading group as an applied case study. It deserves peer review to surface the missing validation pieces and see if the full methods add substance.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a two-phase non-parametric retrieval architecture for corporate credit underwriting to address the similarity-utility gap in standard RAG pipelines. Phase one combines lexical and dense multilingual retrieval for high-recall candidate generation; phase two uses an adaptive retrieval controller with query intent and document structure signals followed by an LLM-as-a-Judge utility scorer to rank passages by analytical usefulness. A context-aware extraction module handles narrative text and financial tables. The system is deployed on-premise and is evaluated on a proprietary multilingual corpus with analyst-curated labels, where it reportedly outperforms naive baselines; production use across >800 analysts is said to reduce document review time from hours to ~3 minutes.

Significance. If the empirical results can be substantiated with quantitative metrics and validation details, the work could demonstrate practical value for utility-aware retrieval in regulated, document-intensive domains such as financial analysis, highlighting benefits of separating recall from precision-oriented ranking and on-premise deployment.

major comments (3)

[Abstract] Abstract and Evaluation section: the central claim of significant outperformance over naive retrieval baselines supplies no quantitative metrics (e.g., precision@K, recall@K, or utility-specific scores), baseline definitions, statistical tests, or error analysis, preventing assessment of effect size or attribution to the two-phase design versus annotation artifacts.
[Abstract] Abstract and Evaluation section: the LLM-as-a-Judge utility scoring mechanism, which is load-bearing for the claim that ranking occurs by analytical usefulness rather than semantic proximity, provides no prompt details, calibration procedure, or human-LLM agreement statistics.
[Abstract] Production Deployment paragraph: the reported reduction in document review time from several hours to approximately three minutes across >800 analysts lacks a description of measurement methodology, controls, or how time savings were quantified, weakening the practical-impact claim.
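For reference, the metrics named in the first major comment can be computed as in the sketch below. The ranked list and the relevant set are invented, and the function names are illustrative. The paper's own metric definitions are not given in the abstract.

```python
from typing import List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved passages that are labelled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for d in ranked[:k] if d in relevant) / k


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant passages that appear in the top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)


ranked = ["p4", "p2", "p9", "p1", "p7"]  # hypothetical system output
relevant = {"p2", "p1", "p5"}            # hypothetical analyst-curated labels
p5 = precision_at_k(ranked, relevant, 5)
r5 = recall_at_k(ranked, relevant, 5)
print(p5, r5)
```

With these toy inputs, two of the five retrieved passages are relevant (precision@5 of 0.4), and two of the three relevant passages are retrieved (recall@5 of about 0.67).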

minor comments (2)

Add a related-work section to contrast the proposed controller and LLM judge against prior multi-stage or utility-aware retrieval methods.
Clarify operational details of the adaptive retrieval controller, including how query intent and document structure signals are extracted and combined.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate planned revisions to improve clarity and substantiation of our claims.

Referee: [Abstract] Abstract and Evaluation section: the central claim of significant outperformance over naive retrieval baselines supplies no quantitative metrics (e.g., precision@K, recall@K, or utility-specific scores), baseline definitions, statistical tests, or error analysis, preventing assessment of effect size or attribution to the two-phase design versus annotation artifacts.

Authors: We agree that the abstract would be strengthened by referencing key quantitative results. The Evaluation section defines the naive baselines as standard lexical (BM25) and dense multilingual retrievers without the adaptive controller or utility ranking stage, and reports precision@K, recall@K, and utility scores based on analyst-curated labels. We will revise the abstract to summarize relative gains and expand the Evaluation section with statistical tests and error analysis to better attribute improvements to the two-phase design. Reported values will remain relative rather than absolute because the corpus is proprietary.
revision: yes

Referee: [Abstract] Abstract and Evaluation section: the LLM-as-a-Judge utility scoring mechanism, which is load-bearing for the claim that ranking occurs by analytical usefulness rather than semantic proximity, provides no prompt details, calibration procedure, or human-LLM agreement statistics.

Authors: We will add the full prompt template for the LLM-as-a-Judge utility scorer to an appendix. We will also describe the calibration procedure and include human-LLM agreement statistics obtained during internal validation to support the distinction between utility and semantic similarity.
revision: yes

Referee: [Abstract] Production Deployment paragraph: the reported reduction in document review time from several hours to approximately three minutes across >800 analysts lacks a description of measurement methodology, controls, or how time savings were quantified, weakening the practical-impact claim.

Authors: We will expand the Production Deployment paragraph with a high-level description of the measurement approach, including system logs for document review duration and pre/post-deployment comparisons with a subset of analysts. Detailed controls and raw quantification data are limited by internal confidentiality policies, but we will clarify the basis for the reported average reduction.
revision: partial
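A pre/post comparison of the kind the authors mention could be summarized with a paired sign test, sketched below. This is an assumption about one possible analysis, not the authors' method. The per-analyst durations are invented, and a real study would also need controls for task mix and for what counts as "review time".

```python
from math import comb
from statistics import median
from typing import List, Tuple


def sign_test(pre: List[float], post: List[float]) -> Tuple[int, int, float]:
    """Exact two-sided paired sign test; returns (improvements, n, p_value)."""
    diffs = [a - b for a, b in zip(pre, post) if a != b]  # ties are dropped
    n = len(diffs)
    wins = sum(d > 0 for d in diffs)  # analysts whose review time decreased
    tail = min(wins, n - wins)
    # Probability of a result at least this extreme under a fair coin.
    one_sided = sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return wins, n, min(1.0, 2 * one_sided)


# Invented review durations in minutes for six analysts.
pre_minutes = [180, 240, 150, 200, 210, 170]
post_minutes = [4, 3, 5, 2, 3, 4]

wins, n, p_value = sign_test(pre_minutes, post_minutes)
median_saving = median(a - b for a, b in zip(pre_minutes, post_minutes))
print(wins, n, p_value, median_saving)
```

The sign test only uses the direction of each paired difference, so it makes no distributional assumption about durations; the median saving gives the effect size alongside it.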

Circularity Check

0 steps flagged

No significant circularity; empirical claims only

The paper describes a two-phase retrieval architecture and reports empirical results from a proprietary corpus evaluation plus production deployment metrics. No mathematical derivations, equations, fitted parameters, or self-referential predictions appear in the provided text. The central claims rest on observed outperformance and time reduction rather than any quantity defined by the authors' own prior equations or self-citations that reduce the result to its inputs by construction. This is the most common honest finding for applied system papers without formal derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that an LLM can serve as a reliable proxy for human analyst judgment of passage utility and that separating recall from utility ranking yields net gains without introducing new biases or missing critical evidence.

axioms (2)

domain assumption: LLM-as-a-Judge can accurately score analytical usefulness for credit decisions. Invoked in the second-phase ranking mechanism described in the abstract.
domain assumption: Analyst-curated relevance labels provide unbiased ground truth for system evaluation. Used to claim outperformance over baselines.

pith-pipeline@v0.9.0 · 5757 in / 1476 out tokens · 40028 ms · 2026-05-21T05:50:35.034386+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: two-phase non-parametric retrieval architecture that separates high-recall candidate retrieval from high-precision utility ranking... LLM-as-a-Judge utility scoring mechanism that ranks passages by analytical usefulness rather than semantic proximity

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages

[1] Utility-Focused LLM Annotation for Retrieval and Retrieval-Augmented Generation. 2025.
[2] HybridRAG: Integrating Knowledge Graphs and Vector Retrieval Augmented Generation for Efficient Information Extraction. 2024.
[3] Lewis, Patrick; Perez, Ethan; Piktus, Aleksandra; Petroni, Fabio; Karpukhin, Vladimir; Goyal, Naman; Kuksa, Pavel; et al. Retrieval-Augmented Generation for Knowledge-Intensive. 2020.
[4] Retrieval-Augmented Generation for Large Language Models: A Survey. 2024.
[5] ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. 2022.
[6] Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. 2023.
[7] Hierarchical Retrieval-Augmented Generation Model with Rethink for Multi-hop Question Answering. 2024.
[8] Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. 2023.
