pith. machine review for the scientific record.

arxiv: 2604.09737 · v1 · submitted 2026-04-09 · 💻 cs.LG · cs.AI

Recognition: unknown

STaR-DRO: Stateful Tsallis Reweighting for Group-Robust Structured Prediction

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:25 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI

keywords group-robust optimization · structured prediction · Tsallis divergence · clinical text mining · robust fine-tuning · prompt engineering · hierarchical label extraction · EPPC Miner

The pith

STaR-DRO applies stateful Tsallis reweighting to focus fine-tuning on persistently hard groups, lifting Code F1 from 79.24 to 81.47 on clinical structured extraction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a prompting strategy that uses XML instructions, disambiguation rules, and self-validation to reduce format errors and hallucinations in ontology-constrained generation. It then presents STaR-DRO, which tracks group losses over time with momentum smoothing and centers them against a neutral baseline before applying bounded multipliers inside Tsallis mirror descent. This setup upweights only groups that stay above the baseline, concentrating updates where difficulty persists. On the EPPC Miner benchmark of patient-provider messages, the combination raises F1 on the hardest label decisions while cutting group-wise validation cross-entropy by up to 29.6 percent on difficult clinical categories. These groups represent rare but consequential communication patterns, so the gains directly affect downstream reliability in care analysis.
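
The abstract supplies no equations, so the mechanism can only be sketched. A minimal reading of the described loop, assuming the smoothing is an exponential moving average, the baseline is the mean smoothed loss, the bound is a simple clip, and the Tsallis step uses the q-exponential with q < 1 to temper the usual exponentiated-gradient update (every name and default below is illustrative, not the paper's):

    import numpy as np

    def q_exp(x, q=0.5):
        """Tsallis q-exponential; recovers exp(x) as q -> 1. For q < 1 it
        grows polynomially, tempering the volatile exponential update."""
        if abs(q - 1.0) < 1e-8:
            return np.exp(x)
        return np.maximum(1.0 + (1.0 - q) * x, 0.0) ** (1.0 / (1.0 - q))

    def star_dro_step(w, ema, losses, beta=0.9, cap=2.0, q=0.5, eta=0.1):
        """One hypothetical reweighting step in the spirit of STaR-DRO.
        w: group weights on the simplex, shape (G,); ema: smoothed per-group
        losses from the previous step; losses: per-group losses seen now."""
        # Momentum smoothing damps per-batch noise in the group-loss signal.
        ema = beta * ema + (1.0 - beta) * losses
        # Centering against a neutral baseline (here: the mean smoothed
        # loss); only positive excess counts, so easier groups generate no
        # downweighting signal of their own.
        excess = np.maximum(ema - ema.mean(), 0.0)
        # Bounded excess-only multiplier: at most `cap` leverage per group.
        signal = np.minimum(excess, cap)
        # Tsallis-style multiplicative step in place of plain exponentiated
        # gradient reweighting, then renormalize onto the simplex.
        w = w * q_exp(eta * signal, q)
        return w / w.sum(), ema

    # Example: eight hypothetical groups, uniform start.
    G = 8
    w, ema = np.full(G, 1.0 / G), np.zeros(G)
    w, ema = star_dro_step(w, ema, np.random.default_rng(0).normal(1.0, 0.3, G))

Under this reading, a group whose smoothed loss sits at or below the baseline contributes zero excess, so only renormalization moves its weight; that is how the method avoids actively downweighting easier groups.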

Core claim

STaR-DRO combines Tsallis mirror descent with momentum-smoothed, centered group-loss signals and bounded excess-only multipliers so that only persistently hard groups above a neutral baseline receive higher weight, concentrating learning on the most difficult subgroups while avoiding both volatile exponentiated-gradient reweighting and the unnecessary loss incurred by downweighting easier groups.
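
Written as update rules under assumed notation (ℓ_g^t the loss of group g at step t, β the momentum, b^t the neutral baseline, c the bound, η the step size; the abstract does not supply the actual equations), the claim suggests a shape like:

    \begin{aligned}
    \bar{\ell}_g^{\,t} &= \beta\,\bar{\ell}_g^{\,t-1} + (1-\beta)\,\ell_g^{\,t}
      && \text{(momentum smoothing)} \\
    e_g^{\,t} &= \min\!\left(\bigl[\bar{\ell}_g^{\,t} - b^{\,t}\bigr]_+,\; c\right)
      && \text{(centered, bounded excess)} \\
    w_g^{\,t+1} &\propto w_g^{\,t}\,\exp_q\!\left(\eta\, e_g^{\,t}\right),
      \quad \exp_q(x) = \bigl[1 + (1-q)\,x\bigr]_+^{\frac{1}{1-q}}
      && \text{(Tsallis-style step; sketch)}
    \end{aligned}

The [·]_+ gate is what encodes "excess-only": groups at or below the baseline generate no upweighting signal at all, rather than a small negative one.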

What carries the argument

STaR-DRO, a stateful robust optimization method that uses Tsallis mirror descent driven by momentum-smoothed centered group-loss signals with bounded excess-only multipliers.

If this is right

  • Prompt engineering alone raises average F1 by 15.44 points across Code, Sub-code, and Span in zero-shot settings on four Llama models.
  • STaR-DRO on top of supervised fine-tuning further improves the hardest semantic decisions, specifically Code F1 to 81.47 and Sub-code F1 to 69.30 on Llama-3.3-70B-Instruct.
  • The method reduces group-wise validation cross-entropy by up to 29.6 percent on the most difficult clinical categories while preserving Span performance.
  • Because the improved groups correspond to clinically consequential communication behaviors, the gains strengthen reliability of communication mining for patient-centered care analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same stateful centering and bounded-multiplier logic could be tested on other structured prediction tasks that exhibit stable but heterogeneous subgroup difficulty, such as legal document parsing or scientific entity linking.
  • If the momentum smoothing window is treated as a tunable hyperparameter, shorter windows might trade stability for responsiveness in streaming clinical data (see the sketch after this list).
  • The prompting component could be paired with retrieval-augmented generation to further reduce metadata-conditioned confusion on rare label combinations.
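
On the smoothing-window point above: assuming the smoother is an exponential moving average (the paper's may differ), its effective window is roughly 1/(1−β), which makes the responsiveness-versus-noise trade-off easy to see on synthetic data:

    import numpy as np

    rng = np.random.default_rng(0)
    # Hypothetical noisy per-group loss stream with a difficulty shift
    # of +0.5 at step 300 (all values are made up for illustration).
    losses = rng.normal(1.0, 0.3, 600)
    losses[300:] += 0.5

    for beta in (0.5, 0.9, 0.99):
        ema, trace = 0.0, []
        for x in losses:
            ema = beta * ema + (1 - beta) * x
            trace.append(ema)
        # Steps after the shift until the EMA crosses the midpoint (1.25).
        lag = next((t for t, v in enumerate(trace[300:]) if v > 1.25), None)
        print(f"beta={beta}: window ~{1 / (1 - beta):.0f} steps, "
              f"midpoint crossed after {lag} steps")

A small β (short window) detects the shift within a few steps but lets batch noise leak into the multipliers; a large β waits tens of steps but yields a stable difficulty ranking.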

Load-bearing premise

Group difficulty signals remain stable enough after momentum smoothing and centering that the bounded multipliers will consistently identify and upweight only the persistently hardest groups.
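
One hypothetical way to probe this premise (not a check the paper reports): log the smoothed per-group losses at successive checkpoints and measure how much the difficulty ranking churns; a mean Spearman correlation near 1 would support it.

    import numpy as np
    from scipy.stats import spearmanr

    def ranking_stability(smoothed_losses):
        """smoothed_losses: array of shape (T, G), the momentum-smoothed
        per-group losses at T successive checkpoints. Returns the mean
        Spearman rank correlation between consecutive checkpoints; values
        near 1 mean the set of 'persistently hard' groups barely changes."""
        corrs = []
        for t in range(len(smoothed_losses) - 1):
            rho, _ = spearmanr(smoothed_losses[t], smoothed_losses[t + 1])
            corrs.append(rho)
        return float(np.mean(corrs))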

What would settle it

If STaR-DRO applied to the EPPC Miner dataset produces no reduction in group-wise validation cross-entropy on the most difficult clinical categories and no Sub-code F1 gain over standard supervised fine-tuning, the claimed advantage of the stateful excess-only reweighting would be refuted.

read the original abstract

Structured prediction requires models to generate ontology-constrained labels, grounded evidence, and valid structure under ambiguity, label skew, and heterogeneous group difficulty. We present a two-part framework for controllable inference and robust fine-tuning. First, we introduce a task-agnostic prompting strategy that combines XML-based instruction structure, disambiguation rules, verification-style reasoning, schema constraints, and self-validation to address format drift, label ambiguity, evidence hallucination, and metadata-conditioned confusion in in-context structured generation. Second, we introduce STaR-DRO, a stateful robust optimization method for group heterogeneity. It combines Tsallis mirror descent with momentum-smoothed, centered group-loss signals and bounded excess-only multipliers so that only persistently hard groups above a neutral baseline are upweighted, concentrating learning where it is most needed while avoiding volatile, dense exponentiated-gradient reweighting and unnecessary loss from downweighting easier groups. We evaluate the combined framework on EPPC Miner, a benchmark for extracting hierarchical labels and evidence spans from patient-provider secure messages. Prompt engineering improves zero-shot by +15.44 average F1 across Code, Sub-code, and Span over four Llama models. Building on supervised fine-tuning, STaR-DRO further improves the hardest semantic decisions: on Llama-3.3-70B-Instruct, Code F1 rises from 79.24 to 81.47 and Sub-code F1 from 67.78 to 69.30, while preserving Span performance and reducing group-wise validation cross-entropy by up to 29.6% on the most difficult clinical categories. Because these rare and difficult groups correspond to clinically consequential communication behaviors, these gains are not merely statistical improvements: they directly strengthen communication mining reliability for patient-centered care analysis.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a two-part framework for structured prediction under group heterogeneity: (1) a task-agnostic XML-based prompting strategy with disambiguation rules, verification reasoning, schema constraints, and self-validation; (2) STaR-DRO, which combines Tsallis mirror descent with momentum-smoothed centered group-loss signals and bounded excess-only multipliers to upweight only persistently hard groups. On the EPPC Miner benchmark for hierarchical clinical label and evidence extraction, prompting improves zero-shot F1 by +15.44 on average across four Llama models; STaR-DRO on top of supervised fine-tuning further raises Code F1 from 79.24 to 81.47 and Sub-code F1 from 67.78 to 69.30 on Llama-3.3-70B-Instruct while cutting group-wise validation cross-entropy by up to 29.6% on difficult categories.

Significance. If the empirical gains and the claimed robustness properties hold under scrutiny, the work offers a practical route to controllable structured generation and group-aware fine-tuning that concentrates capacity on clinically consequential rare behaviors without dense reweighting volatility. The combination of prompting and stateful Tsallis reweighting is a concrete contribution to robust optimization for ontology-constrained tasks, but its significance is limited by the modest absolute lifts and the absence of ablations or comparisons that would establish the method's incremental value over existing group-robust baselines.

major comments (3)
  1. [Abstract] The central performance claims (Code F1 +2.23, Sub-code F1 +1.52, up to 29.6% cross-entropy reduction) are presented without any derivation of the STaR-DRO update rule, without pseudocode or explicit equations for the momentum-smoothed centered loss and bounded excess-only multiplier, and without ablations or variance estimates; this leaves the optimization claim unverifiable from the reported numbers alone.
  2. [Method (STaR-DRO)] The description of STaR-DRO states that thresholds and smoothing parameters are chosen so that only persistently hard groups are upweighted, yet no section demonstrates that these hyperparameters are set independently of the target validation or test distributions; this creates a circularity risk for the reported group-wise improvements.
  3. [Experiments] No comparison is provided to standard group-robust baselines (Group DRO, standard DRO, or focal loss variants) on the same EPPC Miner splits; without such controls it is impossible to isolate whether the stateful Tsallis component, rather than generic reweighting or the prompting alone, drives the observed F1 and cross-entropy gains.
minor comments (2)
  1. [Abstract] The abstract mentions four Llama models for the prompting results but does not list their exact sizes or variants; adding this table or sentence would improve reproducibility.
  2. [Experiments] The number of groups, how groups are defined in EPPC Miner, and the precise clinical categories achieving the 29.6% cross-entropy reduction are not stated; a short table or footnote would clarify the scope of the robustness claim.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The central performance claims (Code F1 +2.23, Sub-code F1 +1.52, up to 29.6% cross-entropy reduction) are presented without any derivation of the STaR-DRO update rule, without pseudocode or explicit equations for the momentum-smoothed centered loss and bounded excess-only multiplier, and without ablations or variance estimates; this leaves the optimization claim unverifiable from the reported numbers alone.

    Authors: We agree that the abstract's brevity limits inclusion of full derivations and pseudocode. The complete derivation of the STaR-DRO update rule, including explicit equations for the momentum-smoothed centered group-loss signal and bounded excess-only multiplier, appears in Section 3 with pseudocode in the appendix. Ablations and variance estimates (across multiple random seeds) are reported in Section 4. To improve self-contained verifiability, we will revise the abstract to include a concise high-level description of the update rule and direct references to the relevant sections and tables. revision: partial

  2. Referee: [Method (STaR-DRO)] The description of STaR-DRO states that thresholds and smoothing parameters are chosen so that only persistently hard groups are upweighted, yet no section demonstrates that these hyperparameters are set independently of the target validation or test distributions; this creates a circularity risk for the reported group-wise improvements.

    Authors: We acknowledge the importance of demonstrating independence to avoid circularity. Hyperparameters were tuned exclusively on a held-out validation split derived from the training data, with no access to test distributions. To make this explicit, we will add a new subsection in the revised Method section that details the tuning protocol, the exact validation split used, and confirmation that test data played no role in hyperparameter selection. revision: yes

  3. Referee: [Experiments] No comparison is provided to standard group-robust baselines (Group DRO, standard DRO, or focal loss variants) on the same EPPC Miner splits; without such controls it is impossible to isolate whether the stateful Tsallis component, rather than generic reweighting or the prompting alone, drives the observed F1 and cross-entropy gains.

    Authors: We agree that comparisons to Group DRO, standard DRO, and focal loss variants on identical splits would better isolate the contribution of the stateful Tsallis mechanism. The current results emphasize incremental gains over supervised fine-tuning plus prompting. In the revised manuscript we will add these baselines to the Experiments section, reporting F1 and group-wise cross-entropy on the same EPPC Miner splits in a new table. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces STaR-DRO as an empirical combination of Tsallis mirror descent, momentum-smoothed centered group-loss signals, and bounded excess-only multipliers to upweight persistently hard groups. No equations, predictions, or first-principles results are presented that reduce by construction to the method's own inputs or fitted parameters; the reported gains (e.g., Code F1 lift from 79.24 to 81.47) are framed as experimental outcomes on EPPC Miner rather than derived quantities. The description remains self-contained without load-bearing self-citations, ansatz smuggling, or renaming of known results as novel derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the optimization method implicitly assumes existence of stable group difficulty signals and bounded multipliers but supplies no derivation or justification.

pith-pipeline@v0.9.0 · 5669 in / 1152 out tokens · 45053 ms · 2026-05-10T17:25:48.708073+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

68 extracted references · 17 canonical work pages · 4 internal anchors

  1. [1] Wec, A., Gleason, K.T., Peereboom, D., et al.: Measurement, drivers, and outcomes of patient-initiated secure messaging use and intensity: A scoping review. JAMIA Open 8(4), 087 (2025)

  2. [2] North, F., Luhman, K.E., Mallmann, E.A., et al.: A retrospective analysis of provider-to-patient secure messages: How much are they increasing, who is doing the work, and is the work happening after hours? JMIR Medical Informatics 8(7), 16521 (2020) https://doi.org/10.2196/16521

  3. [3] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, 27730–27744 (2022)

  4. [4] Gu, Y., Tinn, R., Cheng, H., Lucas, M., Usuyama, N., Liu, X., Naumann, T., Gao, J., Poon, H.: Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare 3(1), 1–23 (2021)

  5. [5] Zhang, N., Chen, M., Bi, Z., Liang, X., Li, L., Shang, X., Yin, K., Tan, C., Xu, J., Huang, F., et al.: CBLUE: A Chinese biomedical language understanding evaluation benchmark. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7888–7915 (2022)

  6. [6] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., Zhou, D.: Chain-of-thought prompting elicits reasoning in large language models. In: Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837 (2022). https://arxiv.org/abs/2201.11903

  7. [7] Sagawa, S., Koh, P.W., Hashimoto, T.B., Liang, P.: Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In: International Conference on Learning Representations (2020). https://openreview.net/forum?id=ryxGuJrFvS

  8. [8] Menon, A.K., Jayasumana, S., Rawat, A.S., Jain, H., Veit, A., Kumar, S.: Long-tail learning via logit adjustment. arXiv preprint arXiv:2007.07314 (2020)

  9. [9] Namkoong, H., Duchi, J.C.: Stochastic gradient methods for distributionally robust optimization with f-divergences. In: Advances in Neural Information Processing Systems, vol. 29 (2016). https://proceedings.neurips.cc/paper/2016/hash/4588e674d3f0faf985047d4c3f13ed0d-Abstract.html

  10. [10] Tsallis, C.: Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics 52(1–2), 479–487 (1988)

  11. [11] Peters, B., Niculae, V., Martins, A.F.T.: Sparse sequence-to-sequence models. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1504–1519 (2019). https://doi.org/10.18653/v1/P19-1146

  12. [12] Carini, E., Villani, L., Pezzullo, A.M., Gentili, A., Barbara, A., Ricciardi, W., Boccia, S.: The impact of digital patient portals on health outcomes, system efficiency, and patient attitudes: Updated systematic literature review. Journal of Medical Internet Research 23(9), 26189 (2021) https://doi.org/10.2196/26189

  13. [13] Fodeh, S., Ma, L., Wang, Y., Talakokkul, S., et al.: PVMiner: A domain-specific tool to detect the patient voice in patient generated data. arXiv preprint arXiv:2602.21165 (2026)

  14. [14] Fodeh, S., Wang, Y., Ma, L., Talakokkul, S., Alpert, J.M., Schellhorn, S.: EPPCMinerBen: A novel benchmark for evaluating large language models on electronic patient-provider communication via the patient portal. arXiv preprint arXiv:2603.00028 (2026) https://doi.org/10.48550/arXiv.2603.00028

  15. [15] Fodeh, S., Ma, L., Puthiaraju, G., Talakokkul, S., Khan, A., Hagaman, A., Lowe, S.R., Roundtree, A.K.: Tab-po: Preference optimization with a token-level adaptive barrier for token-critical structured generation. arXiv preprint arXiv:2603.00025 (2026)

  16. [16] Fodeh, S., Ma, L., Puthiaraju, G., Talakokkul, S., et al.: PVMinerLLM: Structured extraction of patient voice from patient-generated text using large language models. arXiv preprint arXiv:2603.05776 (2026)

  17. [17] Gu, Y., Tinn, R., Cheng, H., Lucas, M., Usuyama, N., Liu, X., Naumann, T., Gao, J., Poon, H.: Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH) 3(1), 1–23 (2021)

  18. [18] Liu, W., Tang, J., Cheng, Y., Li, W., Zheng, Y., Liang, X.: MedDG: An entity-centric medical consultation dataset for entity-aware medical dialogue generation. In: CCF International Conference on Natural Language Processing and Chinese Computing, pp. 447–459 (2022). Springer

  19. [19] Saley, V.V., Saha, G., Das, R.J., Raghu, D., et al.: MediTOD: An English dialogue dataset for medical history taking with comprehensive annotations. arXiv preprint arXiv:2410.14204 (2024)

  20. [20] Yan, G., Pei, J., Ren, P., Ren, Z., Xin, X., Liang, H., De Rijke, M., Chen, Z.: ReMeDi: Resources for multi-domain, multi-service, medical dialogues. In: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 3013–3024 (2022)

  21. [21] White, J., Fu, Q., Hays, S., et al.: A prompt pattern catalog to enhance prompt engineering with ChatGPT. arXiv preprint arXiv:2302.11382 (2023) https://doi.org/10.48550/arXiv.2302.11382

  22. [22] Pang, C., Cao, Y., Ding, Q., Luo, P.: Guideline learning for in-context information extraction. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (2023)

  23. [23] Sainz, O., García-Ferrero, I., Agerri, R., Lacalle, O., Rigau, G., Agirre, E.: GoLLIE: Annotation guidelines improve zero-shot information-extraction. In: The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum?id=Y3wpuxd7u9

  24. [24] Kong, A., Zhao, S., Chen, H., Li, Q., Qin, Y., Sun, R., Zhou, X., Wang, E., Dong, X.: Better zero-shot reasoning with role-play prompting. In: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 4099–4113. Association for Computational Linguistics (2024)

  25. [25] Li, Y., Ramprasad, R., Zhang, C.: A simple but effective approach to improve structured language model output for information extraction. In: Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 5133–5148. Association for Computational Linguistics, Miami, Florida, USA (2024). https://doi.org/10.18653/v1/2024.findings-emnlp.295

  26. [26] Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., Gupta, S., Majumder, B.P., Hermann, K., Welleck, S., Yazdanbakhsh, A., Clark, P.: Self-Refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651 (2023)

  27. [27] Huang, J., Chen, X., Mishra, S., Zheng, H.S., Yu, A.W., Song, X., Zhou, D.: Large language models cannot self-correct reasoning yet. In: The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum?id=IkmD3fKBPQ

  28. [28] Wang, L., Li, L., Dai, D., Chen, D., Zhou, H., Meng, F., Zhou, J., Sun, X.: Label words are anchors: An information flow perspective for understanding in-context learning. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 9840–9855. Association for Computational Linguistics, Singapore (2023). https://doi.org/...

  29. [29] Gao, L., Ghosh, D., Gimpel, K.: The benefits of label-description training for zero-shot text classification. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Singapore (2023). https://aclanthology.org/2023.emnlp-main.853/

  30. [30] Duchi, J.C., Namkoong, H.: Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics 49(3), 1378–1406 (2021) https://doi.org/10.1214/20-AOS2004

  31. [31] Zimmert, J., Seldin, Y.: Tsallis-INF: An optimal algorithm for stochastic and adversarial bandits. Journal of Machine Learning Research 22(28), 1–49 (2021)

  32. [32] Martins, A.F.T., Treviso, M., Farinhas, A., Aguiar, P.M.Q., Figueiredo, M.A.T., Blondel, M., Niculae, V.: Sparse continuous distributions and Fenchel-Young losses. Journal of Machine Learning Research 23(257), 1–74 (2022)

  33. [33] Blondel, M., Martins, A.F.T., Niculae, V.: Learning with Fenchel-Young losses. Journal of Machine Learning Research 21(35), 1–69 (2020)

  34. [34] Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., Finn, C.: Direct preference optimization: Your language model is secretly a reward model. In: Advances in Neural Information Processing Systems, vol. 36 (2023)

  35. [35] Huang, J., et al.: Group distributionally robust optimization-driven reinforcement learning for LLM reasoning. arXiv preprint (2026)

  36. [36] Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., Amodei, D.: Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020)

  37. [37] Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., Chi, E.H., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., Fedus, W.: Emergent abilities of large language models. Transactions on Machine Learning Research (2022)

  38. [38] Wei, J., Bosma, M., Zhao, V.Y., Guu, K., Yu, A.W., Lester, B., Du, N., Dai, A.M., Le, Q.V.: Finetuned language models are zero-shot learners. In: International Conference on Learning Representations (2022). https://openreview.net/forum?id=gEZrGCozdqR

  39. [39] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. In: International Conference on Learning Representations (2022). https://openreview.net/forum?id=nZeVKeeFYf9

  40.–68. [40–68] Extraction artifacts: fragments of the paper's appendix (prompt-template rules, the step-by-step verification checklist, the quality gate, and concluding notes on label encoding) captured as reference entries; no bibliographic content is recoverable.