hub
Watanabe, Tree-structured Parzen estimator: Understanding its algorithm components and their roles for better empirical performance (2023)
19 Pith papers cite this work.
abstract
Recent scientific advances require complex experiment design, necessitating the meticulous tuning of many experiment parameters. Tree-structured Parzen estimator (TPE) is a widely used Bayesian optimization method in recent parameter tuning frameworks such as Hyperopt and Optuna. Despite its popularity, the roles of each control parameter in TPE and the algorithm intuition have not been discussed so far. The goal of this paper is to identify the roles of each control parameter and their impacts on parameter tuning based on the ablation studies using diverse benchmark datasets. The recommended setting concluded from the ablation studies is demonstrated to improve the performance of TPE. Our TPE implementation used in this paper is available at https://github.com/nabenabe0928/tpe/tree/single-opt. OptunaHub now provides our standalone TPE implementation at https://hub.optuna.org/samplers/tpe_tutorial/.
representative citing papers
Optuna's constrained TPE is joint c-TPE, the same expected constrained improvement acquisition function computed from a joint likelihood instead of an independence assumption.
COCOCO is a conformal framework for NeSy-CBMs that jointly conformalizes concepts and labels, reconciles them via deduction-abduction revision, and satisfies consistency, coverage, and conciseness while retaining distribution-free guarantees.
PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.
LRP on EEG transformers reveals Clever Hans artifacts in motor imagery tasks and a recurring central electrode cluster as a candidate sensorimotor signature of arousal.
GFlowNets sample multiple valid mechanistic simulator configurations for digital twin adaptation, recovering main parameter regions and preserving uncertainty in a tomato model case study.
FluidFlow uses conditional flow-matching with U-Net and DiT architectures to predict pressure and friction coefficients on airfoils and 3D aircraft meshes, outperforming MLP baselines with better generalization.
PENEX is a new formulation of the multi-class exponential loss for neural networks that supports first-order optimization and improves generalization in low-data regimes.
A new leaf-instance dataset for soybean-cotton detection and segmentation collected across growth stages and conditions from commercial farms is presented and validated with YOLOv11.
Multi-objective Bayesian optimization with TPE tunes industrial drive current controllers to expert-level performance in minutes on real hardware without a model or firmware changes.
Time series foundation models scale under a single training recipe, with forecast quality improving from 4M to 2.5B parameters and new SOTA results on BOOM, GIFT-Eval, and TIME benchmarks.
A physics-informed neural network infers pT spectra of pi, K, p, Lambda, and Ks in unmeasured rapidity regions from PYTHIA8 pp collisions at 13.6 TeV, achieving 1.5-5.83% yield uncertainties while reproducing yield ratios and freeze-out parameters.
OrthoBO introduces an orthogonal acquisition estimator subtracting an optimally weighted score-function control variate to reduce Monte Carlo variance, preserve the acquisition target, and improve ranking stability in Bayesian hyperparameter optimization.
PLENA introduces a co-designed system with three optimization pathways for long-context agentic LLM inference, claiming up to 2.23x throughput over A100 and 4.04x energy efficiency.
A decision-support framework applies AFT models to show Nvidia L4 GPUs yield 20% longer adversarial survival time at 75% lower cost than V100, with inference latency as the strongest robustness predictor.
A quantile-regression ensemble with safety factor reduces under-allocated jobs from 4.17% to 2.89% and average overallocation from 148% to 44.51% on SAP build data.
A Gated Residual Network correction model reduces fault location error by 76% in simulated onshore wind farm collector networks compared to state-of-the-art methods.
EZR combines active Naive Bayes sampling and decision-tree distillation to reach over 90% of best-known multi-objective optimization performance on 60 datasets while producing clearer explanations than LIME, SHAP or BreakDown.
Transformer models under active learning classify high-binding epitopes from a small docking dataset more accurately than random sampling or other architectures in low-data regimes for PRRS.
citing papers explorer
ArgBench: Benchmarking LLMs on Computational Argumentation Tasks
ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.