arxiv: 2606.17910 · v1 · pith:7R6KV5HZnew · submitted 2026-06-16 · 💻 cs.IR · cs.AI · cs.CL

Non-negative Elastic Net Decoding for Information Retrieval

Pith reviewed 2026-06-26 22:21 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.CL
keywords information retrieval · dense retrieval · elastic net decoding · sparse reconstruction · query embedding · non-negative combination · corpus context

The pith

NNN decoding recovers every query that dense retrieval handles and additional queries when documents are correlated.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Non-Negative elastic Net (NNN) decoding, which selects documents by finding a sparse non-negative linear combination of their embeddings that reconstructs the query embedding. This treats retrieval as a joint problem over the whole corpus rather than scoring documents independently via inner products. The central theoretical result proves that NNN decoding succeeds on every query that dense retrieval handles, and on strictly more queries when the corpus contains correlated documents. Experiments show that NNN decoding applied to frozen embeddings already improves results on several benchmarks, while training embeddings specifically for NNN decoding yields further gains in all metrics.

Core claim

NNN decoding selects documents whose embeddings jointly reconstruct the query embedding as a sparse non-negative linear combination. For any corpus, every query correctly handled by dense retrieval is also handled by NNN decoding, while on corpora containing correlated documents, NNN decoding additionally handles queries that dense retrieval cannot.

What carries the argument

Non-negative elastic net decoding: the selection of a sparse non-negative linear combination of document embeddings that reconstructs the query embedding.
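
For orientation, one standard way to write such a decoder is the non-negative elastic net below. This is a sketch, not the objective as stated in the paper: the symbols $D$, $\alpha$, and $q$ follow the referee summary on this page, the penalties $\lambda_1$ and $\lambda_2$ follow the Figure 2 caption, and the scaling constants (the factors of 1/2) are assumptions.

```latex
\hat{\alpha} \in \arg\min_{\alpha \in \mathbb{R}^{N},\ \alpha \ge 0}
  \; \tfrac{1}{2} \lVert q - D\alpha \rVert_2^2
  + \lambda_1 \lVert \alpha \rVert_1
  + \tfrac{\lambda_2}{2} \lVert \alpha \rVert_2^2,
\qquad
\text{retrieved set} = \operatorname{supp}(\hat{\alpha})
```

Under this reading, the $\ell_1$ term drives sparsity, the squared $\ell_2$ term keeps the problem strictly convex, and the constraint $\alpha \ge 0$ means documents can only add to the reconstruction, never subtract from it.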

If this is right

  • NNN decoding applied to frozen inner-product embeddings yields consistent improvements on several retrieval benchmarks.
  • End-to-end training that optimizes embeddings for NNN decoding produces significant performance gains over dense retrieval in all metrics.
  • Retrieval results become less redundant because documents are chosen jointly with regard to the rest of the corpus.
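
The redundancy point in the list above can be illustrated on a toy corpus. The sketch below is hypothetical throughout: the three 2-D document vectors, the query, the values of `lam1` and `lam2`, and the plain projected-gradient solver are all assumptions chosen for illustration, and the objective is the standard non-negative elastic net rather than anything confirmed by this page. Document 2 is a near-duplicate of document 0, so inner-product top-2 returns both and misses document 1, while the joint decoder keeps documents 0 and 1.

```python
import numpy as np

def dense_top_k(U, q, k):
    """Score every document independently by inner product and keep the top k."""
    scores = U.T @ q
    return [int(i) for i in np.argsort(-scores)[:k]]

def nnn_decode(U, q, lam1, lam2, n_iter=5000, tol=1e-8):
    """Projected gradient for the assumed objective
    min_{w >= 0} 0.5*||q - U w||^2 + lam1*sum(w) + 0.5*lam2*||w||^2."""
    L = np.linalg.norm(U, 2) ** 2 + lam2        # Lipschitz constant of the gradient
    w = np.zeros(U.shape[1])
    for _ in range(n_iter):
        grad = U.T @ (U @ w - q) + lam2 * w + lam1
        w = np.maximum(0.0, w - grad / L)       # gradient step, then project onto w >= 0
    support = np.flatnonzero(w > tol)
    ranked = support[np.argsort(-w[support])]   # order selected documents by weight
    return [int(i) for i in ranked], w

# Hypothetical corpus: columns are unit-norm document embeddings.
# Document 2 = (0.96, -0.28) is a near-duplicate of document 0 = (1, 0).
U = np.array([[1.0, 0.0, 0.96],
              [0.0, 1.0, -0.28]])
q = np.array([12.0, 5.0]) / 13.0                # query mixing documents 0 and 1

dense = dense_top_k(U, q, k=2)
nnn, w = nnn_decode(U, q, lam1=0.1, lam2=0.01)
print("dense top-2:", dense)                    # [0, 2]
print("NNN support:", nnn)                      # [0, 1]
```

The outcome depends on the penalties: with a larger `lam2` the elastic net's grouping effect can pull the near-duplicate back into the support, which is consistent with the hyperparameter sensitivity discussed around Figure 2.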

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • NNN decoding could be applied after any existing embedding model without retraining to increase result diversity.
  • The reconstruction view may extend to tasks such as passage re-ranking or multi-hop retrieval where corpus context matters.
  • If the cone-spanning condition fails for some queries, hybrid methods that fall back to inner-product scoring could be needed.

Load-bearing premise

The query embedding lies in the cone spanned by the document embeddings such that a sparse non-negative combination is both feasible and identifies relevant documents.

What would settle it

A query for which the highest inner-product document is the correct answer but no sparse non-negative linear combination of document embeddings reconstructs the query embedding.
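
A small worked instance may make this test concrete. It is only a sketch: the two-document corpus is made up for illustration, and the penalized objective $\tfrac{1}{2}\lVert q - D\alpha\rVert_2^2 + \lambda_1\lVert\alpha\rVert_1 + \tfrac{\lambda_2}{2}\lVert\alpha\rVert_2^2$ over $\alpha \ge 0$ is an assumed form of NNN decoding, not one stated on this page.

```latex
% Hypothetical corpus and query (unit vectors in R^2)
d_1 = (1, 0), \quad d_2 = (0, -1), \quad
q = \left( \tfrac{\sqrt{3}}{2}, \tfrac{1}{2} \right)

% Dense retrieval: d_1 has the strictly largest inner product
\langle q, d_1 \rangle = \tfrac{\sqrt{3}}{2} \approx 0.866, \qquad
\langle q, d_2 \rangle = -\tfrac{1}{2}

% Exact non-negative reconstruction is impossible: the second coordinate of
% \alpha_1 d_1 + \alpha_2 d_2 is -\alpha_2 \le 0, while q_2 = 1/2 > 0
\min_{\alpha \ge 0} \lVert q - \alpha_1 d_1 - \alpha_2 d_2 \rVert_2 = \tfrac{1}{2}
\quad \text{at } \alpha = \left( \tfrac{\sqrt{3}}{2}, 0 \right)

% Under the assumed penalized objective the minimizer is still supported on d_1
\hat{\alpha}_1 = \max\!\left( 0, \frac{\sqrt{3}/2 - \lambda_1}{1 + \lambda_2} \right),
\qquad \hat{\alpha}_2 = 0
```

In this instance exact reconstruction fails even though the top inner-product document is the right one, yet a penalized decoder would still select it whenever $\lambda_1 < \sqrt{3}/2$. Whether such a query counts against the claim therefore depends on which formulation the paper's theorem actually uses.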

Figures

Figures reproduced from arXiv: 2606.17910 by Koki Okajima, Tsukasa Yoshida, Yasuaki Nakamura, Yasutoshi Ida.

Figure 1: A conceptual comparison of dense retrieval and NNN decoding on a tool retrieval task. The …

Figure 2: Comp@5 of NNN-FIX and NNN-TR evaluated on a grid of (λ1, λ2) for ToolLens. Compared to NNN-FIX, NNN-TR is more robust.
Hyperparameter sensitivity. Tables 1 and 2 show that with frozen embeddings, squared ℓ2 regularization alone underperforms NNN-FIX substantially, while ℓ1 alone incurs a milder drop except on MultiHop-RAG. Once the embeddings are trained for NNN decoding, both variants are competitive. Con…

Figure 3: Comp@5 of DENSE, NNN-FIX, and NNN-TR against FISTA iterations during inference.
[Plot residue: panels NumpyBank, PandasBank, AWSBank, ToolLens, …; x-axis: # ground-truth items per query; y-axis: Comp@5]

Figure 4: Comp@5 of DENSE, NNN-FIX, and NNN-TR stratified over the number of ground truth items per query.
… documents, NNN decoding has little room to contribute. As |S| grows, DENSE deteriorates sharply, while both NNN decoding variants degrade far more mildly. This is particularly pronounced in the ToolBank datasets. This phenomenon can be interpreted through the mechanism explained in Section 2.3 by the following.…
Abstract

Dense retrieval has become the dominant paradigm in information retrieval, in which each document is scored against a query by the inner product of their vector embeddings, and the top-$k$ documents by score are retrieved for this query. However, since each document's score depends solely on its own embedding and that of the query, the retrieval process is oblivious to the content of the entire corpus. Therefore, dense retrieval cannot avoid selecting semantically similar documents from the corpus, which may result in a non-diverse, redundant set of retrieved documents. To address this, we approach retrieval as a joint decoding problem, in which documents are selected as a set with regard to the context of the rest of the corpus. To achieve this, we propose Non-Negative elastic Net (NNN) decoding, which selects documents whose embeddings jointly reconstruct the query embedding as a sparse non-negative linear combination. Our main theoretical result establishes a strict separation between dense retrieval and NNN decoding. For any corpus, every query correctly handled by dense retrieval is also handled by NNN decoding, while on corpora containing correlated documents, NNN decoding additionally handles queries that dense retrieval cannot. Experimental results indicate that applying NNN decoding to frozen embeddings trained for inner-product scoring yields consistent improvements across several benchmarks. Moreover, we introduce an end-to-end training procedure which optimizes the embeddings for NNN decoding, producing significant performance gains over dense retrieval in all metrics and benchmarks. Our work establishes a new paradigm for leveraging dense embeddings in information retrieval, beyond the standard practice of inner-product scoring.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper proposes Non-Negative Elastic Net (NNN) decoding as an alternative to dense retrieval: documents are selected by solving for a sparse non-negative coefficient vector α such that the document embedding matrix D satisfies Dα ≈ q (the query embedding). It claims a strict separation theorem: for any corpus, NNN handles every query that dense retrieval correctly handles, and additionally handles queries on corpora with correlated documents that dense retrieval cannot. Experiments report consistent gains when applying NNN to frozen inner-product embeddings and larger gains from an end-to-end training procedure that optimizes embeddings directly for the NNN objective.

Significance. If the separation theorem is rigorously established and the experimental protocol is reproducible, the work would introduce a new decoding paradigm that incorporates corpus-wide context via non-negative sparse reconstruction, potentially improving diversity and handling of redundant documents. The end-to-end training procedure is a concrete strength that could be adopted more broadly.

major comments (3)
  1. [Abstract] Abstract / main theoretical result: The strict separation claim (every dense-retrieval success is an NNN success, plus additional successes on correlated corpora) is load-bearing, yet the abstract states the result without a proof sketch, without defining 'correctly handled' for NNN when no feasible non-negative α exists, and without addressing whether dense-retrieval success (maximizing ⟨q, d_i⟩) implies q lies in the cone spanned by the columns of D. The skeptic correctly notes that if a counter-example corpus and query exist where the relevant document maximizes the inner product but q ∉ cone(D), the claimed inclusion fails; this must be resolved with an explicit argument or counter-example analysis.
  2. [Abstract] Formulation of NNN decoding: The optimization is feasible only when q lies in the non-negative cone of D. The manuscript provides no definition or fallback procedure for the case when the elastic-net problem is infeasible, nor any demonstration that dense-retrieval success guarantees cone membership. This directly affects whether the separation theorem can hold for arbitrary embeddings.
  3. [Abstract] Experimental protocol: The abstract reports 'consistent improvements' and 'significant performance gains' but supplies no information on the elastic-net solver, how feasibility is handled, how baselines are implemented, or how the reported metrics are computed. Without these details the experimental claims cannot be verified or reproduced.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, indicating planned revisions to the abstract where appropriate.

Point-by-point responses
  1. Referee: [Abstract] Abstract / main theoretical result: The strict separation claim (every dense-retrieval success is an NNN success, plus additional successes on correlated corpora) is load-bearing, yet the abstract states the result without a proof sketch, without defining 'correctly handled' for NNN when no feasible non-negative α exists, and without addressing whether dense-retrieval success (maximizing ⟨q, d_i⟩) implies q lies in the cone spanned by the columns of D. The skeptic correctly notes that if a counter-example corpus and query exist where the relevant document maximizes the inner product but q ∉ cone(D), the claimed inclusion fails; this must be resolved with an explicit argument or counter-example analysis.

    Authors: The full manuscript contains the formal theorem, definitions, and complete proof in Section 3. 'Correctly handled' by dense retrieval means the relevant document achieves strictly maximum inner product. For NNN it means the optimization admits a feasible non-negative α whose support contains the relevant document. The proof shows dense success implies cone membership via the optimality conditions of the inner-product maximizer and constructs an explicit non-negative combination; it proceeds by contradiction to rule out separating hyperplanes. No counter-example exists. We will add a concise proof sketch and the definitions of 'correctly handled' to the abstract. revision: yes

  2. Referee: [Abstract] Formulation of NNN decoding: The optimization is feasible only when q lies in the non-negative cone of D. The manuscript provides no definition or fallback procedure for the case when the elastic-net problem is infeasible, nor any demonstration that dense-retrieval success guarantees cone membership. This directly affects whether the separation theorem can hold for arbitrary embeddings.

    Authors: We agree the abstract omits these clarifications. Section 3 proves that dense-retrieval success guarantees cone membership, so infeasibility does not arise for queries correctly handled by dense retrieval. When infeasibility occurs for other queries the procedure falls back to the maximum-inner-product document. We will revise the abstract to include this definition and fallback note. revision: yes

  3. Referee: [Abstract] Experimental protocol: The abstract reports 'consistent improvements' and 'significant performance gains' but supplies no information on the elastic-net solver, how feasibility is handled, how baselines are implemented, or how the reported metrics are computed. Without these details the experimental claims cannot be verified or reproduced.

    Authors: All requested details (solver, feasibility handling, baseline implementations, and metric computation) appear in Sections 4 and 5. We will add one sentence to the abstract summarizing the protocol and pointing to those sections. revision: yes
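
The fallback described in the second response, together with the FISTA iterations named in the Figure 3 caption, can be sketched as follows. Everything here is an assumption made for illustration: the function names `fista_nnn` and `decode_with_fallback`, the penalized objective, the default hyperparameters, and the reading of the degenerate case as an empty support (which, for this objective, happens exactly when no inner product exceeds `lam1`). It is not the authors' implementation.

```python
import numpy as np

def fista_nnn(U, q, lam1, lam2, n_iter=500):
    """FISTA for the assumed objective
    min_{w >= 0} 0.5*||q - U w||^2 + lam1*sum(w) + 0.5*lam2*||w||^2."""
    L = np.linalg.norm(U, 2) ** 2 + lam2          # Lipschitz constant of the smooth part
    w = np.zeros(U.shape[1])
    z = w.copy()                                   # extrapolated point
    t = 1.0
    for _ in range(n_iter):
        grad = U.T @ (U @ z - q) + lam2 * z        # gradient of the smooth part at z
        # Prox of lam1*sum(w) plus the constraint w >= 0: shift, then clip at zero.
        w_next = np.maximum(0.0, z - (grad + lam1) / L)
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        z = w_next + ((t - 1.0) / t_next) * (w_next - w)   # momentum step
        w, t = w_next, t_next
    return w

def decode_with_fallback(U, q, lam1=0.1, lam2=0.01, k=5, tol=1e-8):
    """Return up to k document indices ranked by NNN weight; if the support is
    empty, fall back to the single maximum-inner-product document."""
    scores = U.T @ q
    w = fista_nnn(U, q, lam1, lam2)
    support = np.flatnonzero(w > tol)
    if support.size == 0:
        return [int(np.argmax(scores))]            # fallback to dense retrieval
    ranked = support[np.argsort(-w[support])]
    return [int(i) for i in ranked[:k]]

# Hypothetical two-document corpus with orthonormal embeddings.
U_demo = np.eye(2)
print(decode_with_fallback(U_demo, np.array([0.8, 0.6]), lam1=0.1, lam2=0.0))    # [0, 1]
print(decode_with_fallback(U_demo, np.array([0.05, 0.02]), lam1=0.1, lam2=0.0))  # [0]
```

In the second call every inner product is below `lam1`, so the all-zero vector is optimal and the decoder returns the top inner-product document instead of an empty set.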

Circularity Check

0 steps flagged

No circularity: theoretical separation derived from method definitions without reduction to fitted inputs or self-citations.

full rationale

The paper defines NNN decoding directly as non-negative elastic-net reconstruction (Dα ≈ q with α ≥ 0 sparse) and contrasts it with dense retrieval via inner products. The central claim of strict separation (NNN handles all dense-correct queries plus more on correlated corpora) is presented as a mathematical consequence of these definitions rather than a statistical prediction or self-referential fit. No load-bearing self-citations, ansatzes smuggled via prior work, or renaming of known results appear in the provided text. The derivation chain is self-contained against the stated optimization and scoring rules, with the reader's noted score of 2 reflecting only minor definitional assumptions rather than circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the modeling choice that query embeddings can be meaningfully expressed as non-negative sparse linear combinations of document embeddings; no free parameters, axioms, or invented entities are explicitly introduced beyond standard convex optimization assumptions.

pith-pipeline@v0.9.1-grok · 5815 in / 1074 out tokens · 27888 ms · 2026-06-26T22:21:32.663653+00:00 · methodology


A Proofs

Setup. Let $U = [u_1, \dots, u_N] \in \mathbb{R}^{d \times N}$ be a matrix of unit vectors and $S \subseteq [N]$ with $|S| = k \ge 1$. Define the correlation-gap region

$$\Phi_{\mathrm{DR}}(U, S) = \left\{ v \in \mathbb{S}^{d-1} : \max_{j \in S^c} u_j^\top v < \min_{i \in S} u_i^\top v \ \text{and} \ \min_{i \in S} u_i^\top v > 0 \right\},$$

and let $\Phi_{\mathrm{NNN}}(U, S)$ be the set of uni...

The inactive condition (10) for $j = 1$ gives

$$u_1^\top v - u_1^\top U_S w_S^\star = \frac{2}{3} - \frac{1}{\sqrt{2}} \left( \frac{2\sqrt{2}}{3} - \lambda_1 \right) = \frac{\lambda_1}{\sqrt{2}} < \lambda_1, \tag{20}$$

so the full KKT system is satisfied with $\operatorname{supp}(w^\star) = S$, giving $v \in \Phi_{\mathrm{NNN}}(U, S)$.

B Reproducibility Information

B.1 Datasets

We evaluate on five benchmarks. The NumpyBank, PandasBank, and AWSBank datasets in ToolBank ship with their own train / validation ...