The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice
Pith reviewed 2026-05-09 14:29 UTC · model grok-4.3
The pith
A small randomized experiment plus an offline simulator recover unbiased causal performance comparisons of language models from confounded usage logs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The main result is an identification theorem establishing that the randomized experiment and the simulator together suffice to recover causal model values; the observational log is used only afterward to reduce estimation error rather than to make the causal comparison valid. Six families of estimators are evaluated under this identification strategy.
What carries the argument
A three-source identification design: a small randomized experiment supplies unconfounded scoring, an offline simulator replays candidate models on cached contexts, and the large observational log is used only for variance reduction.
If this is right
- Causal comparisons become valid with only a modest number of randomized trials rather than full randomization of all traffic.
- Adding more observational log data improves precision of the estimates without reintroducing selection bias.
- No single estimator family dominates; performance trades off with the relative sizes of experimental and observational data.
- The approach applies to tasks where contexts can be cached and models can be replayed offline.
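The structure the review describes can be sketched as a plug-in estimate over observational contexts plus an experiment-based bias correction. This is an illustrative reconstruction, not the paper's estimator: the function name, data layout, and toy reward are all hypothetical.

```python
def three_source_estimate(exp_data, sim_reward, obs_contexts, model):
    """Sketch of the EXP + SIM identification idea with OBS used for scale.

    exp_data:     list of (context, assigned_model, live_reward) triples from
                  the randomized experiment (assignment independent of
                  user-side factors).
    sim_reward:   callable (context, model) -> scored reward of the
                  simulator's replayed generation on that cached context.
    obs_contexts: contexts cached from the large observational log; they
                  only average the simulator, they never drive assignment.
    """
    # Plug-in term: simulated reward averaged over the abundant OBS contexts.
    plug_in = sum(sim_reward(x, model) for x in obs_contexts) / len(obs_contexts)

    # Bias-correction term: on EXP units randomized to `model`, compare the
    # live reward against the simulator's reward on the same context.
    gaps = [r - sim_reward(x, model) for x, a, r in exp_data if a == model]
    correction = sum(gaps) / len(gaps) if gaps else 0.0
    return plug_in + correction


# Toy check with a hypothetical exact simulator: reward(x, a) = x + offset[a].
offset = {"A": 1.0, "B": 3.0}
reward = lambda x, a: x + offset[a]
exp = [(0.5, "A", reward(0.5, "A")), (1.5, "B", reward(1.5, "B"))]
obs = [0.0, 1.0, 2.0, 3.0]
print(three_source_estimate(exp, reward, obs, "A"))  # -> 2.5 (mean context 1.5 + 1.0)
```

When the simulator is exact, the correction term vanishes and adding OBS contexts only tightens the plug-in average, which is the claimed separation of identification from variance reduction.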
Where Pith is reading between the lines
- The same separation of identification from variance reduction could apply to evaluating recommendation systems or search engines that suffer from position or selection bias.
- Prioritizing simulator fidelity over additional confounded data collection would be the practical next step if the theorem holds.
- Adaptive allocation of experiment budget could be derived by monitoring uncertainty after the identification step is secured.
Load-bearing premise
The simulator must accurately replay what each candidate model would have produced on the actual logged contexts, and the randomized experiment must assign models independently of user-side factors.
What would settle it
A direct comparison showing that simulator outputs systematically differ from live model outputs on the same cached prompts would break the identification of causal values.
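That decisive comparison could be run as a paired diagnostic on a held-out set of cached prompts. A minimal sketch, assuming live outputs and simulator replays are scored by the same judge; the function name and the normal-approximation interval are illustrative, not from the paper.

```python
import statistics

def replay_gap(live_scores, sim_scores):
    """Paired fidelity check: score live model outputs and simulator replays
    of the same cached prompts, then test whether the mean score gap is near
    zero. Returns the mean difference and a normal-approximation 95% CI."""
    diffs = [live - sim for live, sim in zip(live_scores, sim_scores)]
    mean = statistics.fmean(diffs)
    se = statistics.stdev(diffs) / len(diffs) ** 0.5
    return mean, (mean - 1.96 * se, mean + 1.96 * se)


# If zero falls outside the interval, the replay assumption, and with it
# the identification of causal values, is in doubt on these prompts.
mean, (lo, hi) = replay_gap([0.9, 0.7, 0.8, 0.6], [0.5, 0.4, 0.3, 0.2])
```

A systematic gap here would bias the recovered causal contrast regardless of how much EXP or OBS data is collected.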
Original abstract
Offline evaluation of language models from usage logs is biased when model choice is confounded: the same user-side factors that influence which model is used can also influence how its output is judged, so raw comparisons of logged scores mix self-selected populations rather than estimating a common quantity of interest. A small randomized experiment can break this bias by overriding model choice, but in practice such experiments are scarce and costly. We study a three-source design that combines a large confounded observational log (OBS) for scale, a small randomized experiment (EXP) for unconfounded scoring, and an offline simulator (SIM) that replays candidate models on cached contexts. Our main result is an identification theorem showing that the randomized experiment and the simulator are together enough to recover causal model values; the observational log enters only afterward, to reduce estimation error rather than to make the causal comparison valid. Six estimator families are evaluated in a controlled semi-synthetic validation and in two real-task cached benchmarks for summarization and coding. No family dominates every regime; relative performance depends on the amount of unbiased EXP supervision and on how closely the target reward aligns with OBS-derived structure.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that offline evaluation of language models from confounded usage logs can be debiased using a three-source design: a large observational log (OBS) for scale, a small randomized experiment (EXP) for unconfounded measurements, and an offline simulator (SIM) that replays models on cached contexts. The central result is an identification theorem asserting that EXP and SIM together recover causal model values, with OBS entering only afterward for variance reduction rather than for validity. Six estimator families are evaluated in semi-synthetic validation and two real-task cached benchmarks (summarization and coding), with relative performance depending on the amount of EXP supervision and alignment between target reward and OBS-derived structure.
Significance. If the identification theorem is valid, the framework offers a practical route to scalable, low-bias evaluation of LM generations without requiring large-scale randomization, by leveraging abundant but confounded logs alongside limited unbiased data and replay simulators. The finding that no estimator family dominates across regimes is useful for guiding deployment choices. The work is strengthened by its explicit separation of identification from efficiency and by the controlled semi-synthetic plus real-task evaluations, though its impact hinges on the robustness of the simulator replay assumption.
major comments (2)
- [Identification theorem and semi-synthetic validation] The identification theorem (abstract and §3) asserts that EXP + SIM suffice to recover causal quantities with OBS used only for efficiency. This holds only if the simulator produces outputs whose reward distribution exactly matches what each candidate model would have generated on the precise OBS-cached contexts under randomized assignment. Any replay mismatch (un-cached state, truncation, or implementation drift) directly biases the recovered causal contrast, yet the semi-synthetic validation constructs the simulator to be exact by design and therefore does not probe this gap.
- [Empirical evaluation] The real-task benchmarks (summarization and coding) are presented as evidence that relative estimator performance depends on EXP size and reward alignment, but without reported quantitative tables or confidence intervals showing effect sizes under varying EXP supervision levels, it is difficult to assess whether the claimed practical utility is supported.
minor comments (2)
- [Introduction] Notation for the three data sources (OBS, EXP, SIM) and the six estimator families should be introduced with a single summary table early in the paper for reader orientation.
- [Abstract] The abstract states that 'no family dominates every regime' but does not preview the specific regimes (e.g., low vs. high EXP size) where each family is strongest; adding one sentence would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments on our manuscript. We respond to each major comment below, indicating where we agree and the revisions we will make to strengthen the paper.
Point-by-point responses
Referee: [Identification theorem and semi-synthetic validation] The identification theorem (abstract and §3) asserts that EXP + SIM suffice to recover causal quantities with OBS used only for efficiency. This holds only if the simulator produces outputs whose reward distribution exactly matches what each candidate model would have generated on the precise OBS-cached contexts under randomized assignment. Any replay mismatch (un-cached state, truncation, or implementation drift) directly biases the recovered causal contrast, yet the semi-synthetic validation constructs the simulator to be exact by design and therefore does not probe this gap.
Authors: We agree that the identification theorem in §3 requires the simulator to produce reward distributions that exactly match those under randomized assignment on the cached contexts; this assumption is stated explicitly in the theorem conditions. The semi-synthetic validation is constructed with an exact simulator by design to isolate and verify the identification result and estimator performance under ideal conditions where the theorem's assumptions hold. This setup does not test robustness to replay mismatches, which is a valid practical concern not addressed in the current experiments. We will add a new limitations subsection discussing the replay assumption, potential sources of mismatch (such as incomplete caching or implementation drift), and practical steps to minimize them. We will also clarify in §3 that identification holds conditional on accurate simulation. revision: partial
Referee: [Empirical evaluation] The real-task benchmarks (summarization and coding) are presented as evidence that relative estimator performance depends on EXP size and reward alignment, but without reported quantitative tables or confidence intervals showing effect sizes under varying EXP supervision levels, it is difficult to assess whether the claimed practical utility is supported.
Authors: We agree that the presentation would benefit from explicit quantitative tables and confidence intervals. The manuscript currently emphasizes trends via figures, but we will add tables (in the main text or appendix) reporting numerical values for key metrics such as bias, variance, and MSE for each estimator family across multiple EXP supervision levels. These will include 95% confidence intervals computed via bootstrap resampling to quantify effect sizes and variability. This revision will make the evidence for practical utility more transparent and easier to evaluate. revision: yes
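The bootstrap resampling the authors promise could take the form of a percentile interval per estimator family and EXP level. A minimal sketch under that assumption; the function name, resample count, and example error values are hypothetical, since the manuscript does not specify the procedure.

```python
import random
import statistics

def bootstrap_ci(values, stat=statistics.fmean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for `stat` over `values`.
    Resamples with replacement, re-computes the statistic, and reads off
    the alpha/2 and 1 - alpha/2 quantiles of the bootstrap distribution."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    reps = sorted(
        stat([rng.choice(values) for _ in values]) for _ in range(n_boot)
    )
    return reps[int(n_boot * alpha / 2)], reps[int(n_boot * (1 - alpha / 2)) - 1]


# e.g. an interval for the mean per-prompt error of one estimator family
# at one EXP supervision level (illustrative numbers only):
lo, hi = bootstrap_ci([0.12, 0.08, 0.15, 0.11, 0.09, 0.14, 0.10, 0.13])
```

Reporting such intervals across EXP sizes is what would let readers judge whether the "no family dominates" regimes are separated by more than noise.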
Circularity Check
No significant circularity; identification theorem stands independently of fitted inputs
full rationale
The paper's central claim is an identification theorem establishing that EXP + SIM suffice to recover causal model values, with OBS entering only for efficiency. No equation or derivation in the provided text reduces a claimed prediction or result to its own inputs by construction. The theorem is conditional on explicit assumptions (accurate simulator replay on cached contexts, unconfounded EXP), which are not smuggled in via self-citation or ansatz. Semi-synthetic validation uses exact simulators by design, but this tests estimator performance under the theorem's assumptions rather than making the identification itself tautological. No load-bearing self-citations or renaming of known results appear in the derivation chain.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The randomized experiment overrides model choice to break confounding.
- domain assumption: The simulator faithfully replays candidate models on cached contexts.
Reference graph
Works this paper leans on
- [1] Heejung Bang and James M. Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4): 962--973, 2005. doi:10.1111/j.1541-0420.2005.00377.x
- [2] Elias Bareinboim and Judea Pearl. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113(27): 7345--7352, 2016. doi:10.1073/pnas.1510507113. URL https://www.pnas.org/doi/10.1073/pnas.1510507113
- [3] David Cheng and Tianxi Cai. Adaptive combination of randomized and observational data, 2021. URL https://arxiv.org/abs/2111.15012
- [4] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating LLMs by human preference. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machin..., 2024.
- [5] Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, volume 30, 2017. URL https://arxiv.org/abs/1706.03741
- [6] Stephen R. Cole and Elizabeth A. Stuart. Generalizing evidence from randomized clinical trials to target populations: The ACTG 320 trial. American Journal of Epidemiology, 172(1): 107--115, 2010. doi:10.1093/aje/kwq084
- [7] Bénédicte Colnet, Imke Mayer, Guanhua Chen, Awa Dieng, Ruohong Li, Gaël Varoquaux, Jean-Philippe Vert, Julie Josse, and Shu Yang. Causal inference methods for combining randomized trials and observational studies: A review. Statistical Science, 39(1): 165--191, 2024. doi:10.1214/23-STS889. URL https://doi.org/10.1214/23-STS889
- [8] Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length-controlled AlpacaEval: A simple way to debias automatic evaluators. In First Conference on Language Modeling (COLM), 2024. URL https://openreview.net/forum?id=CybBmzWBX0
- [9] Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning, 2011. URL https://arxiv.org/abs/1103.4601
- [10] Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. Doubly robust policy evaluation and optimization. Statistical Science, 29(4): 485--511, 2014. doi:10.1214/14-STS500. URL https://doi.org/10.1214/14-STS500
- [11] Amir Feder, Katherine A. Keith, Emaad Manzoor, Reid Pryzant, Dhanya Sridhar, Zach Wood-Doughty, Jacob Eisenstein, Justin Grimmer, Roi Reichart, Margaret E. Roberts, Brandon M. Stewart, Victor Veitch, and Diyi Yang. Causal inference in natural language processing: Estimation, prediction, interpretation and beyond. Transactions of the Association for Comput...
- [12] Gemma Team. Gemma 3 technical report, 2025. URL https://arxiv.org/abs/2503.19786
- [13] Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, volume 28, 2015. URL https://arxiv.org/abs/1506.03340
- [14] Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 652--661, 2016. URL https://proceedings.mlr.press/v48/jiang16.html
- [15] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://arxiv.org/abs/2310.06770
- [16] Nathan Kallus, Aahlad Manas Puli, and Uri Shalit. Removing hidden confounding by experimental grounding. In Advances in Neural Information Processing Systems, volume 31, pp. 10911--10920, 2018. URL https://proceedings.neurips.cc/paper/2018/hash/566f0ea4f6c2e947f36795c8f58ba901-Abstract.html
- [17] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Ho... Holistic Evaluation of Language Models, 2023.
- [18] Xi Lin, Jens Magelund Tarp, and Robin J. Evans. Combining experimental and observational data through a power likelihood. Biometrics, 81(1): ujaf008, 2025. doi:10.1093/biomtc/ujaf008. URL https://academic.oup.com/biometrics/article/81/1/ujaf008/8016472
- [19] Noble Saji Mathews and Meiyappan Nagappan. Is your automated software engineer trustworthy?, 2025. URL https://arxiv.org/abs/2506.17812
- [20] Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pp. 280--290, 2016. URL https://arxiv.org/abs/1602.06023
- [21] OpenAI. Introducing SWE-bench Verified. OpenAI Blog, 2024. URL https://openai.com/index/introducing-swe-bench-verified/
- [22] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedba..., 2022.
- [23] Arjun Panickssery, Samuel R. Bowman, and Shi Feng. LLM evaluators recognize and favor their own generations, 2024. URL https://arxiv.org/abs/2404.13076
- [24] Judea Pearl and Elias Bareinboim. Transportability of causal and statistical relations: A formal approach. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011. doi:10.1609/aaai.v25i1.7861
- [25] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems, volume 36, 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstra...
- [26] James M. Robins, Andrea Rotnitzky, and Lue Ping Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427): 846--866, 1994. doi:10.1080/01621459.1994.10476818
- [27] Evan T. R. Rosenman, Guillaume Basse, Art B. Owen, and Mike Baiocchi. Combining observational and experimental datasets using shrinkage estimators. Biometrics, 79(4): 2961--2973, 2023. doi:10.1111/biom.13827. URL https://pubmed.ncbi.nlm.nih.gov/36629736/
- [28] Abigail See, Peter J. Liu, and Christopher D. Manning. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1073--1083, 2017. URL https://arxiv.org/abs/1704.04368
- [29] Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback. In Advances in Neural Information Processing Systems, volume 33, 2020. URL https://arxiv.org/abs/2009.01325
- [30] Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 814--823, 2015. URL https://proceedings.mlr.press/v37/swaminathan15.html
- [31] Philip S. Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 2139--2148, 2016. URL https://proceedings.mlr.press/v48/thomasa16.html
- [32] Victor Veitch, Dhanya Sridhar, and David M. Blei. Adapting text embeddings for causal inference. In Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), volume 124 of Proceedings of Machine Learning Research, pp. 919--928, 2020. URL https://proceedings.mlr.press/v124/veitch20a.html
- [33] Pat Verga, Sebastian Hofstätter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. Replacing judges with juries: Evaluating LLM generations with a panel of diverse models, 2024. URL https://arxiv.org/abs/2404.18796
- [34] Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not fair evaluators, 2023. URL https://arxiv.org/abs/2305.17926
- [35] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R. Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In Advances in Neural Information Processing Systems 37 (NeurIPS), 2024. URL https://openreview.net/forum?id=mXpq6ut8J3
- [36] Xuelin Yang, Licong Lin, Susan Athey, Michael I. Jordan, and Guido W. Imbens. Cross-validated causal inference: a modern method to combine experimental and observational data, 2025. URL https://arxiv.org/abs/2511.00727
- [37] Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. WildChat: 1M ChatGPT interaction logs in the wild. In The Twelfth International Conference on Learning Representations, 2024. URL https://arxiv.org/abs/2405.01470
- [38] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems, volume 36, 2023. URL https://arxiv.org/abs/2306.05685
- [39] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica, and Hao Zhang. LMSYS-Chat-1M: A large-scale real-world LLM conversation dataset. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum...
discussion (0)