{"total":16,"items":[{"citing_arxiv_id":"2605.18971","ref_index":3,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Shaping the Prior: How Synthetic Task Distributions Determine Tabular Foundation Model Quality","primary_cat":"cs.LG","submitted_at":"2026-05-18T18:00:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"O'Prior, a compositional synthetic prior with hierarchical SCMs, realism engines, stress modules, and curriculum protocols, improves tabular foundation model accuracy and robustness on real benchmarks when architecture and compute are held fixed.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18549","ref_index":9,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics","primary_cat":"cs.CL","submitted_at":"2026-05-18T15:29:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Probe trajectories across token positions in LRMs, combined with signal-processing features, improve prediction of future model outputs over static probes on safety and math tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15154","ref_index":12,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"RoSHAP: A Distributional Framework and Robust Metric for Stable Feature Attribution","primary_cat":"stat.ML","submitted_at":"2026-05-14T17:51:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RoSHAP is a robust feature-ranking metric that summarizes the distributional properties of SHAP values via bootstrap resampling and asymptotic normality to reward active, strong, and 
stable features.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13986","ref_index":41,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"TabPFN-3: Technical Report","primary_cat":"cs.LG","submitted_at":"2026-05-13T18:01:43+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"40% 60% 80% 100% Win Rate TabArena Win-rate Matrix Figure 12. Pairwise win rates on TabArena for a curated set of the strongest models on TabArena. See Appendix E.2.4 for the full results. list of recent models, including tree-based models like CatBoost [38], LightGBM [39] or XGBoost [40], as well as newer deep-learning models like RealMLP [32], TabM [41], ModernNCA [42] or xRFM [43], the AutoML system AutoGluon [2], and other Tabular Foundation Models like TabICL [29, 30], TabDPT [44], TabSTAR [37], LimiX [45], Mitra [46] or TabPFN v2 [17]. The benchmark contains a set of 51 datasets selected from 1053 to be representative of real-world tabular data. 
See Erickson et al. [1] for the list of datasets and Section E."},{"citing_arxiv_id":"2605.12435","ref_index":8,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Environment-Adaptive Preference Optimization for Wildfire Prediction","primary_cat":"cs.LG","submitted_at":"2026-05-12T17:31:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"EAPO adapts wildfire models to new environments via k-nearest neighbor data retrieval and hybrid fine-tuning that emphasizes rare extreme events, achieving ROC-AUC 0.7310 on real data.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Extreme wildfires are challenging to predict [41], as they emerge from the complex interplay of fire weather [9, 10, 37], topography [33], vegetation fuels [32, 37], and human factors such as ignition and fire suppression [16, 19, 26, 44], all of which are difficult to fully represent in process-based wildfire models. Whereas machine learning (ML) approaches such as XGBoost [8] have shown promise in wildfire prediction [5, 18, 21, 25, 42], outperforming process-based wildfire models [41], they typically require extensive historical fire data for training and may struggle to generalize to novel fire regimes that emerge under climate change [14]. 
As climate change causes shifts in the spatial pattern, seasonality, and statistical distribution of fire"},{"citing_arxiv_id":"2605.18791","ref_index":25,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SpecX: A Large-Scale Benchmark for Multi-Modal Spectroscopy and Cross-Paradigm Evaluation","primary_cat":"eess.IV","submitted_at":"2026-05-11T04:12:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SpecX is a new large-scale multi-modal spectroscopy benchmark with tiered datasets that supports unified evaluation across specialized models and MLLMs, showing specialized models excel at signal-level tasks while MLLMs are stronger in high-level reasoning but weaker in precise spectral grounding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07208","ref_index":9,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"FAME: Forecasting Academic Impact via Continuous-Time Manifold Evolution","primary_cat":"cs.LG","submitted_at":"2026-05-08T03:57:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FAME models scientific topic trajectories in continuous time to forecast paper impact more accurately than LLMs by aligning manuscripts with field momentum in a dynamic latent space.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"wi for each paper by log-normalizing and aggregating its GitHub stars, citation counts, influential citations, and Altmetric score. The distribution of the four log-normalized ground-truth impact metrics utilized in the dataset is shown in Figure 4. Baselines. We benchmark FAME against three distinct categories of evaluators. 
First, we evaluate ML models, including XGBoost [9], SVR [11, 27], Transformer [31] and TGCN [39], trained directly 5 Table 1: Prospective forecasting performance across an 18-month sliding window evaluation from June 2024 to November 2025. Performance is measured by the Spearman rank correlation ρs between predicted and ground-truth composite impact weights. The experiments are carried out three times,"},{"citing_arxiv_id":"2605.08272","ref_index":40,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Quantifying Exposure Information Uncertainty in Regional Risk Assessment","primary_cat":"stat.AP","submitted_at":"2026-05-08T03:30:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A methodology decomposes total uncertainty in regional risk assessment into contributions from probabilistic exposure characterization and other sources using analytical and simulation approaches.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"a: Single-span bridges do not have a bent and are not included in the ML-based predictions. b: There is a small proportion of bridges that have more than 7 columns per bent. These bridges are considered as outliers and have been excluded before the model training. Figure 7: Proposed classifier chain for imputing missing attributes. An XGBoost classifier [40] is used as the predictive model to impute the four target attributes. The hyperparameters of each classifier are tuned using Bayesian optimization [41], which is conducted within a stratified five-fold cross-validation framework. 
Early stopping is employed during the training by monitoring validation performance within each fold, such that the boosting process is terminated when no further improvement is observed after a specified number"},{"citing_arxiv_id":"2605.06117","ref_index":5,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"BoostLLM: Boosting-inspired LLM Fine-tuning for Few-shot Tabular Classification","primary_cat":"cs.LG","submitted_at":"2026-05-07T12:27:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BoostLLM trains sequential PEFT adapters in a boosting framework with tree path inputs to improve LLM performance on few-shot tabular classification, matching or exceeding XGBoost.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Despite their efficiency, such adaptation can be unstable and prone to overfitting when supervision is scarce [48, 10, 23], highlighting the need for more data-efficient and robust fine-tuning strategies. Interestingly, the tabular learning community has long relied on a different paradigm to address similar challenges: gradient boosting. Systems such as XGBoost [5], LightGBM [20], and CatBoost [35] consistently achieve strong performance across a wide range of tabular tasks and are known to be particularly robust in data-limited settings [12, 13, 49, 8, 36]. Boosting constructs models in a stage-wise manner, where learners are added sequentially to correct the residual errors of previous ones. 
This additive training strategy encourages models to focus on informative residual errors"},{"citing_arxiv_id":"2605.05270","ref_index":3,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Forecasting Oncology Demand Trends with Boosting-Based Bayesian Conjugate Models","primary_cat":"stat.ML","submitted_at":"2026-05-06T12:55:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A boosting-enhanced Bayesian conjugate model for oncology demand forecasting outperforms ARIMA, LSTM, and XGBoost in trend direction accuracy by up to 38.25% on real Brazilian hospital data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00083","ref_index":64,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Comparative Analysis of Polygon-Based and Global Machine Learning Models for Bus Occupancy Prediction","primary_cat":"cs.LG","submitted_at":"2026-04-30T15:35:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Localized polygon-based models trained on clustered bus stops achieve prediction accuracy comparable to a single global model when using ridership, spatial, weather, and temporal features.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04868","ref_index":4,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Noise Immunity in In-Context Tabular Learning: An Empirical Robustness Analysis of TabPFN's Attention Mechanisms","primary_cat":"cs.LG","submitted_at":"2026-04-06T17:16:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"TabPFN maintains high ROC-AUC and structured attention under controlled additions of irrelevant features, nonlinear 
correlations, and mislabeled targets in binary classification.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.13566","ref_index":8,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"EmDT: Embedding Diffusion Transformer for Tabular Data Generation in Fraud Detection","primary_cat":"stat.ML","submitted_at":"2026-03-13T20:13:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"EmDT combines UMAP clustering with a Transformer-based diffusion process to create synthetic fraud samples that improve XGBoost classification on credit card fraud data while preserving correlations and privacy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.18850","ref_index":35,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Cognitive Alpha Mining via LLM-Driven Code-Based Evolution","primary_cat":"cs.CL","submitted_at":"2025-11-24T07:45:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CogAlpha combines LLM reasoning with code-level evolutionary search to discover financial alphas that show higher predictive accuracy and generalization than prior methods on five stock datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.08667","ref_index":2,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"TabPFN-2.5: Advancing the State of the Art in Tabular Foundation Models","primary_cat":"cs.LG","submitted_at":"2025-11-11T18:57:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TabPFN-2.5 scales tabular foundation models to 20x larger datasets, outperforms tuned tree models on TabArena, 
achieves near-perfect win rates against default XGBoost, and adds a distillation engine for fast production deployment.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"TabArena [1] is the most curated tabular benchmark, based on the largest number of candidate datasets considered, and created by open-source contributors from a wide range of institutions. It will appear at the NeurIPS 2025 Datasets & Benchmarks track and is thus most up-to-date. In particular, it compares a large class of recent models, including tree-based models like CatBoost [3], LightGBM [4] or XGBoost [2], as well as newer deep-learning models like RealMLP [22], TabM [24], ModernNCA [25] or xRFM [26], and other Tabular Foundation Models like TabICL [27], TabDPT [28], LimiX [29], Mitra [30] or TabPFNv2 [7]. We follow the paper's recommendation to benchmark on \"TabArena-Lite\", which is a cheaper but representative version of the full benchmark using only one test fold."},{"citing_arxiv_id":"2510.10454","ref_index":46,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Traj-CoA: Patient Trajectory Modeling via Chain-of-Agents for Lung Cancer Risk Prediction","primary_cat":"cs.AI","submitted_at":"2025-10-12T05:24:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Traj-CoA is a multi-agent LLM framework that sequentially processes noisy five-year EHR data via worker agents into EHRMem for manager-agent lung cancer risk prediction and outperforms four categories of baselines in zero-shot evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}