onasta commented on Show HN: TabPFN v2 – A SOTA foundation model for small tabular data   nature.com/articles/s4158... · Posted by u/onasta
_giorgio_ · a year ago
It's probably the same model with the same limitations, released nearly two years ago?

https://arxiv.org/abs/2207.01848

onasta · a year ago
There have been a ton of improvements! Much better performance overall, way larger data size limit (1K-->10K rows, 100-->500 features), regression support, native categorical data and missing values handling, much better support for uninformative or outlier features etc.
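
To make the size limits quoted above concrete, here is a minimal sketch in plain Python that checks a table's shape against them. The numbers come straight from the comment (1K rows and 100 features for v1, 10K rows and 500 features for v2). The names `V1_LIMITS`, `V2_LIMITS`, `fits_limits`, and `supported_versions` are hypothetical helpers made up for illustration, not part of the TabPFN package, and the package may enforce or document its limits differently.

```python
# Limits as quoted in the comment above; these are NOT read from the TabPFN package.
V1_LIMITS = {"rows": 1_000, "features": 100}
V2_LIMITS = {"rows": 10_000, "features": 500}


def fits_limits(n_rows, n_features, limits):
    """Return True if a table of the given shape is within the quoted limits."""
    return n_rows <= limits["rows"] and n_features <= limits["features"]


def supported_versions(n_rows, n_features):
    """List which quoted limit sets (v1, v2) a table of this shape falls within."""
    versions = []
    if fits_limits(n_rows, n_features, V1_LIMITS):
        versions.append("v1")
    if fits_limits(n_rows, n_features, V2_LIMITS):
        versions.append("v2")
    return versions
```

For example, a table with 5,000 rows and 200 features falls outside the quoted v1 limits but inside the quoted v2 limits, so `supported_versions(5000, 200)` lists only `"v2"`.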
onasta commented on Show HN: TabPFN v2 – A SOTA foundation model for small tabular data   nature.com/articles/s4158... · Posted by u/onasta
instanceofme · a year ago
Related: CARTE-AI, which can also deal with multiple tables.

https://soda-inria.github.io/carte/

https://arxiv.org/pdf/2402.16785

The paper includes a comparison to TabPFN v1 (among others), noting the lack of categorical & missing values handling which v2 now seems to have. Would be curious to see an updated comparison.

onasta · a year ago
TabPFN has been better on numerical data since v1 (see figure 6 in the CARTE paper). CARTE's main strength is on text features, which are now also supported in the API version of TabPFN v2 (https://github.com/PriorLabs/tabpfn-client). We compared this to CARTE and found our model to be generally better, and much faster. CARTE's multi-table approach is also very interesting, and we want to tackle this setting in the future.
onasta commented on Why do tree-based models still outperform deep learning on tabular data?   arxiv.org/abs/2207.08815... · Posted by u/isolli
coffee_am · 3 years ago
I can't explain it, but I help maintain TensorFlow Decision Forests [1] and Yggdrasil Decision Forests [2], and in an AutoML system at work that trains models on data from many different users, decision forest models get selected as best (after AutoML tries various model types and hyperparameters) somewhere between 20% and 40% of the time, systematically. It's pretty interesting. Other ML types considered are NNs, linear models (with automatic feature-crossing generation), and a couple of other variations.

[1] https://github.com/tensorflow/decision-forests [2] https://github.com/google/yggdrasil-decision-forests

onasta · 3 years ago
Super interesting! Do you know what kind of data it's usually used for? And in the remaining 60% to 80%, do NNs account for a large portion of the best models?

Bonus question: are the stats you're mentioning publicly available?

u/onasta

Karma: 63 · Cake day: August 4, 2022