arxiv: 2606.24346 · v1 · pith:YGTHSWTWnew · submitted 2026-06-23 · 💻 cs.IR · cs.CL

PETRA: Transforming Web Text for Petroleum-Engineering Domain Adaptation

Pith reviewed 2026-06-25 22:30 UTC · model grok-4.3

classification 💻 cs.IR cs.CL
keywords petroleum engineering · domain adaptation · information retrieval · synthetic supervision · dense retrieval · reranking · web text curation · energy domain classifier

The pith

PETRA turns public web text into 1.36 million petroleum-engineering chunks and synthetic training pairs that raise first-stage nDCG from 0.703 to 0.763.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Petroleum-engineering search suffers from a shortage of domain-specific relevance labels even though relevant passages exist on the public web. PETRA addresses the gap by filtering web pages with an energy-domain classifier, generating chunk-grounded queries, writing LLM hard negatives, and mining retrieval candidates to produce both embedding training rows and teacher-scored reranker data. The resulting models, when fused or adapted, deliver measurable lifts on an in-domain test set and on public Earth Science and reasoning benchmarks. The construction shows that retrieval performance improves only when the mined data is repackaged as teacher-scored lists drawn from the inference-time distribution.

Core claim

The central claim is that high-recall energy-domain curation combined with a 98.4%-accurate classifier, chunk-grounded query generation, LLM-written hard negatives, and retrieval-mined candidate lists yields a dataset of 1.36M chunks that supplies effective synthetic supervision. Training on this data raises first-stage in-domain nDCG from 0.703 to 0.763 via score fusion. Reranker adaptation produces 44% relative improvement on a public Earth Science benchmark and 23% on a six-task reasoning panel. Experiments with failed recipes demonstrate that high accuracy on synthetic labels alone does not predict retrieval gains and that retrieval-mined data helps only after being turned into teacher-scored candidate lists.

What carries the argument

The PETRA pipeline of energy-domain classification, chunk-grounded query generation, LLM hard-negative writing, and retrieval-mined candidate lists that convert web text into synthetic supervision for dense retrieval and reranking.
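To make the two kinds of supervision concrete, here is a minimal sketch of what an embedding training row and a teacher-scored reranker row might look like. The class names, field names, and the `retrieve` and `teacher_score` callables are assumptions made for illustration; the paper's actual schema is not given on this page.

```python
from dataclasses import dataclass


@dataclass
class EmbeddingRow:
    """One contrastive training row (field names are illustrative)."""
    query: str                 # query generated from the anchor chunk
    positive: str              # the anchor chunk the query was grounded in
    hard_negatives: list[str]  # LLM-written near-miss passages


@dataclass
class RerankerRow:
    """One teacher-scored candidate list for reranker training."""
    query: str
    candidates: list[str]        # passages returned by the first-stage retriever
    teacher_scores: list[float]  # one teacher relevance score per candidate


def build_reranker_row(query, retrieve, teacher_score, k=10):
    """Score the retriever's own top-k so the list resembles inference-time input.

    `retrieve(query, k)` and `teacher_score(query, passage)` are hypothetical
    callables standing in for the first-stage retriever and the teacher model.
    """
    candidates = retrieve(query, k)
    scores = [teacher_score(query, c) for c in candidates]
    return RerankerRow(query, candidates, scores)
```

Drawing candidates from the retriever itself, rather than from an arbitrary pool, is what keeps the training lists close to what the reranker sees at inference time.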

If this is right

  • Score fusion of the adapted first-stage retriever raises in-domain nDCG to 0.763.
  • Reranker adaptation yields a 44% relative gain on the public Earth Science benchmark.
  • Reranker adaptation yields a 23% gain on the six-task reasoning-intensive panel.
  • High train-holdout accuracy on synthetic labels does not predict downstream retrieval gains.
  • Retrieval-mined data improves performance only after repackaging as teacher-scored candidate lists sampled from the inference-time distribution.
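The page does not say which fusion rule produces the 0.763 figure. As one common possibility, the sketch below fuses a base and an adapted retriever by min-max normalizing each score list and taking a weighted sum. The function names and the `weight` parameter are assumptions, not the authors' method.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to [0, 1]; constant inputs map to 0.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(base_scores, adapted_scores, weight=0.5):
    """Rank docs by a weighted sum of normalized scores from two retrievers.

    Docs missing from one list contribute 0.0 for that retriever.
    `weight` is the share given to the adapted retriever (hypothetical knob).
    """
    base, adapted = min_max(base_scores), min_max(adapted_scores)
    docs = set(base) | set(adapted)
    fused = {
        d: (1 - weight) * base.get(d, 0.0) + weight * adapted.get(d, 0.0)
        for d in docs
    }
    return sorted(fused, key=fused.get, reverse=True)
```

Normalizing first matters because the two retrievers may score on different scales; without it the model with the larger raw range would dominate the sum.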

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same curation steps could be reused in other technical fields that have abundant web text but scarce labeled retrieval data.
  • If the classifier, despite its 98.4% test accuracy, systematically admits off-topic or low-quality chunks, the generated queries and negatives will misalign with real user needs.
  • Live query logs from petroleum-engineering search engines would provide a direct test of whether the synthetic supervision matches actual relevance patterns.
  • The approach implicitly assumes that an LLM can generate hard negatives that reflect the failure modes of the base retriever at inference time.

Load-bearing premise

The energy-domain classifier selects chunks whose distribution matches the actual information needs of petroleum engineers.

What would settle it

Collect human relevance judgments on a held-out set of real petroleum-engineering queries and recompute nDCG; if the 0.06-point gain disappears, the synthetic labels do not align with actual user relevance.
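Recomputing the metric on human judgments could look roughly like the sketch below, which uses the linear-gain form of nDCG with a log2 discount. The cutoff `k`, the data layout (`runs` and `qrels` keyed by query id), and the function names are assumptions; the paper's exact nDCG variant is not stated on this page.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_doc_ids, judgments, k=10):
    """nDCG@k for one query; `judgments` maps doc_id to graded relevance."""
    gains = [judgments.get(doc, 0) for doc in ranked_doc_ids[:k]]
    ideal = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def mean_ndcg(runs, qrels, k=10):
    """Average nDCG@k over all judged queries."""
    return sum(ndcg_at_k(runs[q], qrels[q], k) for q in qrels) / len(qrels)
```

Running `mean_ndcg` once for the baseline run and once for the fused run on the same human-judged queries gives the two numbers whose gap is in question.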

Figures

Figures reproduced from arXiv: 2606.24346 by Adrian Garcia-Garcia (2), Aya El Mir (1), Federico Castanedo (2), Hachem Madmoun (1), Kirill Dubovikov (1), Larry Murray (2), Martin Takac (1), Omar El Mansouri (1), Onkar Pandit (2), Salem Lahlou (1), Sandeep Kumar (1), Sunil Kumar Sahu (2), Supriyo Ghosh (2), Writabrata Bhattacharya (2), Yanda Li (1); (1) Mohamed bin Zayed University of Artificial Intelligence, (2) Inception AI.

Figure 1: PETRA data construction pipeline. Curation distills open sources into the curated corpus (§ …
Figure 2: Distribution of curated energy chunks across the oil-and-gas domain taxonomy.

Original abstract

Petroleum-engineering search exposes a supervision gap for strong general retrievers: relevant evidence exists in public web text, but domain relevance labels are scarce. To address this gap, we propose PETRA, a large-scale Petroleum Engineering Text for Retrieval Adaptation dataset and pipeline that converts noisy public web data into a curated domain corpus and synthetic supervision for dense retrieval and reranking. PETRA contains 1.36M curated chunks (approximately 2B token equivalents), $\approx$859k embedding training rows from $\approx$224k anchors, and roughly 400k teacher-scored reranker candidate rows. Its construction combines high-recall energy-domain curation, an energy-domain classifier with 98.4% test accuracy, chunk-grounded query generation, LLM-written hard negatives, and retrieval-mined candidate lists. PETRA improves first-stage in-domain Normalized Discounted Cumulative Gain (nDCG) from 0.703 to 0.763 through score fusion. Reranker adaptation improves the public Earth Science benchmark by 44% relative and a six-task reasoning-intensive panel by 23%. Failed training recipes show that high train-holdout accuracy on synthetic labels does not predict retrieval gains; retrieval-mined data helps only after being repackaged as teacher-scored candidate lists sampled from the inference-time candidate distribution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes PETRA, a 1.36M-chunk curated corpus (~2B tokens) and pipeline for petroleum-engineering domain adaptation of dense retrievers and rerankers. It uses an energy-domain classifier (98.4% test accuracy), chunk-grounded query generation, LLM hard negatives, and retrieval-mined candidates to create synthetic supervision. The central claims are an in-domain first-stage nDCG lift from 0.703 to 0.763 via score fusion, plus 44% relative reranker gain on an Earth Science benchmark and 23% on a six-task reasoning panel; negative results show that high synthetic-label accuracy does not predict retrieval gains.

Significance. If the reported gains hold under proper validation, the work supplies a concrete, large-scale recipe for closing the supervision gap in specialized technical IR domains by repurposing noisy web text. The explicit reporting of failed recipes (high train-holdout accuracy on synthetic labels failing to translate to retrieval) is a strength that helps delineate effective supervision strategies. The scale (859k embedding rows, 400k reranker rows) and cross-benchmark evaluation add practical value for domain-adaptation research.

major comments (2)
  1. [Abstract] The nDCG improvement (0.703 → 0.763) is stated without statistical significance tests, variance across random seeds, baseline implementation details, or data-split information; these omissions are load-bearing for the central claim that the PETRA pipeline produces reliable first-stage gains.
  2. [Abstract, pipeline] The energy-domain classifier is reported at 98.4% test accuracy on a held-out set, yet no experiment validates that the marginal distribution of the resulting 1.36M chunks matches the sub-topic or query distribution of actual petroleum-engineering user needs; this distributional alignment is required for the synthetic supervision to support generalizable retrieval improvements.
minor comments (1)
  1. [Abstract] The 'six-task reasoning-intensive panel' is referenced without naming the tasks or providing a citation, reducing clarity on the reranker evaluation scope.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the scale of the dataset, the value of reporting negative results, and the practical contributions to domain-adaptation research. We respond to each major comment below.

Point-by-point responses
  1. Referee: [Abstract] The nDCG improvement (0.703 → 0.763) is stated without statistical significance tests, variance across random seeds, baseline implementation details, or data-split information; these omissions are load-bearing for the central claim that the PETRA pipeline produces reliable first-stage gains.

    Authors: We agree that the abstract presentation of the nDCG numbers would be strengthened by these details. The full manuscript already describes baseline implementations and data splits in the Experiments section. However, statistical significance tests and variance across seeds were not computed for the reported first-stage results. We will add these (including p-values and standard deviations over multiple random seeds) to the results tables and revise the abstract to reference the validation. revision: yes
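The promised significance test and seed variance could be computed along the lines of the sketch below: a paired randomization (sign-flip) test on per-query nDCG, plus mean and standard deviation across training seeds. The choice of test, the function names, and the input layout are assumptions; the authors do not say which test they will use.

```python
import random
import statistics


def paired_randomization_test(baseline, adapted, trials=10_000, seed=0):
    """Two-sided p-value for the mean per-query difference via random sign flips.

    `baseline` and `adapted` are per-query metric values in the same query order.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(adapted, baseline)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(trials):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / len(diffs)) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (trials + 1)


def seed_spread(scores_by_seed):
    """Mean and sample standard deviation of one metric across training seeds."""
    return statistics.mean(scores_by_seed), statistics.stdev(scores_by_seed)
```

The test needs per-query scores for both systems on the same queries, so it only applies when both runs are evaluated on the identical test set.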

  2. Referee: [Abstract, pipeline] The energy-domain classifier is reported at 98.4% test accuracy on a held-out set, yet no experiment validates that the marginal distribution of the resulting 1.36M chunks matches the sub-topic or query distribution of actual petroleum-engineering user needs; this distributional alignment is required for the synthetic supervision to support generalizable retrieval improvements.

    Authors: The 98.4% classifier accuracy is measured on a held-out set drawn from the same web-crawl sources used for the final corpus, and query generation is performed directly from the curated chunks. This provides a content-based proxy for domain alignment. We do not possess proprietary petroleum-engineering query logs that would enable a direct distributional comparison. We will add an explicit limitations paragraph discussing this assumption and the indirect evidence from in-domain and Earth-Science benchmark gains. revision: partial
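If topic labels were available for both the curated chunks and a sample of real queries, the direct distributional comparison the referee asks for could be approximated as below, using Jensen-Shannon divergence between the two topic histograms. The topic labels, the function names, and the choice of divergence are all assumptions for illustration; neither the paper nor the rebuttal proposes this specific check.

```python
import math
from collections import Counter


def topic_distribution(labels, topics):
    """Relative frequency of each topic label, in a fixed topic order."""
    counts = Counter(labels)
    total = sum(counts[t] for t in topics)
    return [counts[t] / total for t in topics]


def kl(p, q):
    """KL divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

A value near 0 would suggest the curated corpus covers topics in roughly the proportions users ask about; a value near 1 would indicate a serious mismatch.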

Circularity Check

0 steps flagged

No circularity: gains measured on held-out and external data, not by construction

The paper builds PETRA via classifier curation (98.4% test accuracy), chunk-grounded query generation, and retrieval-mined candidates, then reports nDCG lift (0.703→0.763) and reranker gains on held-out in-domain data plus external benchmarks (Earth Science +44%, six-task panel +23%). It explicitly demonstrates that high synthetic-label accuracy does not predict retrieval gains, confirming the evaluation quantities are independent of the training objective. No step reduces a claimed result to a fitted parameter, self-definition, or self-citation chain; the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that public web text contains usable petroleum-engineering evidence and on the standard assumption that LLM-generated queries and negatives can serve as proxies for human relevance judgments. No free parameters are fitted inside the reported results, and no new physical or mathematical entities are postulated.

axioms (2)
  • domain assumption: Public web text contains relevant evidence for petroleum-engineering queries that can be recovered by high-recall curation.
    Stated in the opening sentence of the abstract as the premise for the supervision gap.
  • domain assumption: An energy-domain classifier achieving 98.4% test accuracy produces chunks whose distribution supports effective retrieval training.
    Invoked when the classifier is used to filter the corpus before query generation and negative mining.
