Schema-1 is the first Data Language Model that natively understands raw tabular data and outperforms gradient-boosted ensembles, AutoML, and prior tabular foundation models on row-level prediction and imputation tasks.
8 Pith papers cite this work.
citing papers explorer
- Data Language Models: A New Foundation Model Class for Tabular Data
  Schema-1 is the first Data Language Model that natively understands raw tabular data and outperforms gradient-boosted ensembles, AutoML, and prior tabular foundation models on row-level prediction and imputation tasks.
- Metric-agnostic Learning-to-Rank via Boosting and Rank Approximation
  A new listwise learning-to-rank method uses smooth rank approximation and boosting to optimize without depending on a single metric.
- From Uniform to Learned Knots: A Study of Spline-Based Numerical Encodings for Tabular Deep Learning
  Spline encodings for numerical features show task-dependent performance in tabular deep learning, with piecewise-linear encoding robust for classification and variable results for regression depending on spline family, knot strategy, and backbone.
- Filtering Interlopers with Photometry and Diagnostic Features: A Machine Learning Framework Validated with CSST Slitless Spectroscopy
  XGBoost classifier filters interlopers in CSST slitless spectroscopy simulations, retaining 42% of galaxies with 96.6% accurate redshifts and 0.13% outliers.
- ProfiliTable: Profiling-Driven Tabular Data Processing via Agentic Workflows
  ProfiliTable is a multi-agent system with profiler, generator, and evaluator components that outperforms baselines on 18 tabular task types via dynamic profiling and closed-loop refinement.
- Stellar flare detection in XMM-Newton with gradient boosted trees
  A gradient boosted classifier on X-ray light curve features detects stellar flares at 97.1% test accuracy and generates the largest public catalog of such events.
- A Data-Centric Framework for Intraoperative Fluorescence Lifetime Imaging for Glioma Surgical Guidance
  A data-centric AI framework cleans FLIm labels via confident learning and achieves 96% accuracy classifying glioma infiltration into low, moderate, and high cellularity.
- Integrating SAINT with Tree-Based Models: A Case Study in Employee Attrition Prediction
  Standalone tree-based models outperform both SAINT and SAINT-embedding hybrids for employee attrition prediction on tabular HR data.