CapmCrackaWaka · 3 years ago
I have a theory - tree based models require minimal feature engineering. They are capable of handling categorical data in principled ways, they can handle the most skewed/multimodal/heteroskedastic continuous numeric data just as easily as a 0-1 scaled normal distribution, and they are easy to regularize compared to a DL model (which could have untold millions of possible parameter combinations, let alone getting the thing to train to a global optimum).

I think if you spent months getting your data and model structure to a good place, you could certainly get a DL model to out-perform a gradient boosted tree. But why do that, when the GBT will be done today?
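
For a sense of how little work "done today" means, here is a minimal baseline sketch (the CSV file and column names are hypothetical placeholders, not from the paper):

    # Minimal GBT baseline: no scaling, no one-hot encoding, defaults everywhere.
    # "some_tabular_data.csv" and the "target" column are made-up placeholders.
    import pandas as pd
    from sklearn.ensemble import HistGradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    df = pd.read_csv("some_tabular_data.csv")
    X = df.drop(columns=["target"])
    y = df["target"]

    # Mark string columns as categorical so the trees can split on them natively.
    cat_cols = X.select_dtypes(include="object").columns
    X[cat_cols] = X[cat_cols].astype("category")

    # categorical_features="from_dtype" needs scikit-learn >= 1.4;
    # older versions take a list of column indices instead.
    model = HistGradientBoostingClassifier(categorical_features="from_dtype")
    print(cross_val_score(model, X, y, cv=5).mean())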

a-dub · 3 years ago
this is along the lines of my thinking. people organize and summarize data before throwing it into spreadsheets, whereas deep learning models do their thing by generating new representations from raw data.

in a sense, most data in spreadsheets is compressed and deep learning models prefer to find their own compression that best suits the task at hand.

or in human terms: "these spreadsheets are garbage. i can't work with this. can you bring me the raw data please?" :)

lr1970 · 3 years ago
> I have a theory - tree based models require minimal feature engineering.

Actually, the whole premise of Deep Learning is to learn good feature representations from data with minimal preprocessing. It works wonderfully in CV and NLP but is less performant on tabular data. The paper points to several factors that contribute to DL underperforming here.

drzoltar · 3 years ago
I think another aspect is that most modern GBT models prefer the entire dataset to be in memory, thereby doing a full scan of the data for each iteration to calculate the optimal split point. That’s hard to compete with if your batch size is small in a NN model.
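
For intuition, a bare-bones version of that exhaustive split search (one numeric feature, squared-error criterion; real libraries use histograms, but the "look at every row for every split" idea is the same):

    # Exhaustive split search on one feature: sort once, then scan every candidate
    # threshold, keeping running sums so each candidate is O(1) to evaluate.
    import numpy as np

    def best_split(x, y):
        order = np.argsort(x)
        xs, ys = x[order], y[order]
        n = len(ys)
        total_sum, total_sq = ys.sum(), (ys ** 2).sum()
        left_sum = left_sq = 0.0
        best_sse, best_thr = np.inf, None
        for i in range(1, n):                      # left = first i rows, right = rest
            left_sum += ys[i - 1]
            left_sq += ys[i - 1] ** 2
            if xs[i] == xs[i - 1]:                 # can't split between equal values
                continue
            right_sum, right_sq = total_sum - left_sum, total_sq - left_sq
            # sum of squared errors around each side's mean: sum(y^2) - (sum y)^2 / n
            sse = (left_sq - left_sum ** 2 / i) + (right_sq - right_sum ** 2 / (n - i))
            if sse < best_sse:
                best_sse, best_thr = sse, (xs[i - 1] + xs[i]) / 2
        return best_thr, best_sse

    x = np.random.rand(1000)
    y = (x > 0.6).astype(float) + np.random.normal(scale=0.1, size=1000)
    print(best_split(x, y))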
a-dub · 3 years ago
that's an interesting idea. but at the end of the paper they do an analysis of the effect of different hyperparameters for the nets with their dataset and find that the batch size doesn't seem to matter much. (although they're trying size ranges like [256, 512, 1024] as opposed to turning batching off entirely)
anothernewdude · 3 years ago
they also do subsampling of the data though.
oofbey · 3 years ago
I think you’re on the right track that trees are good at feature engineering. But the key problem is that DL researchers are horrible at feature engineering, because they have never had to do it. These folks included.

The feature engineering they do here is absolutely horrible! They use a QuantileTransformer and that's it. They don't even tune the critical hyperparameter of the number of quantiles. Do they always use the scikit-learn default of 1,000 quantiles? No wonder uninformative features are hurting: they are getting expanded into 1,000 even more uninformative features! Also, with a single quantile transform like that, the relative values of the quantiles are completely lost! If the values 86 and 87 fall into different bins, the model has literally no information that the two bins are similar to each other, or even that they come from the same raw input.

For a very large dataset a NN would learn its way around this kind of bone headed mistake. But for this size dataset, these researchers have absolutely crippled the nets with this thoughtless approach to feature engineering.

There is plenty more to criticize about their experiments, but it’s probably less important. E.g. Their HP ranges are too small to allow for the kind of nets that are known to work best in the modern era (after Double Descent theory has been worked out) - large heavily regularized nets. They don’t let the nets get very big and they don’t let the regularization get nearly big enough.

So. Bad comparison. But it’s also very true that XGB “just works” most of the time. NN’s are finicky and complicated and very few people really understand them well enough to apply them to novel situations. Those who do are working on fancy AI problems, not writing poor comparison papers like this one.

eutectic · 3 years ago
I think you misunderstand what QuantileTransformer does; it just transforms the distribution of the data to be more normal, it's not a binning technique. Too many quantiles will just result in excess noise, not loss of information.
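
A quick way to see it (toy data; the transform is monotone and continuous, so nearby raw values stay nearby):

    import numpy as np
    from sklearn.preprocessing import QuantileTransformer

    rng = np.random.default_rng(0)
    x = rng.lognormal(3, 1, size=(10_000, 1))      # heavily skewed input
    qt = QuantileTransformer(output_distribution="normal", n_quantiles=1000)
    z = qt.fit_transform(x)

    print(z.mean(), z.std())                       # roughly 0 and 1: skew is gone
    print(qt.transform([[86.0], [87.0]]))          # neighbours in, neighbours out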
btown · 3 years ago
It occurs to me that a system, trained on peer-reviewed applied-machine-learning literature and Kaggle winners, that generates candidates for structured feature-engineering specifications, based on plaintext descriptions of columns' real-world meaning, should be considered a requisite part of the "meta" here.

Ah, and then you could iterate within the resulting feature-engineering-suggestion space as a hyper-parameter between experiments, which could be optimized with e.g. https://github.com/HIPS/Spearmint . The papers write themselves!

ellisv · 3 years ago
I agree. The majority of DL layers are about feature engineering, not performing classification.
wpietri · 3 years ago
Could you say more about this?

One of the things that interests me about nominal AI applications is the extent to which they're sort of a Mechanical Turk or what I've heard called Artificial Artificial Intelligence. By which I mean it's sold as computer magic, but most of the magic is actually humans sneaking in a lot of human judgement. That can come through humans directly massaging the output or through human-driven selection of results. But I've also been wondering to what extent natural human intelligence is getting put in at the lower layers of the system, like feature engineering.

thom · 3 years ago
What are the principled ways that tree based models handle categorical data? If you end up having to do one-hot encoding it feels like you need very wide forests or very deep trees. If your categorical data is actually vaguely continuous then splits can be quite efficient but that’s rare.

I assume some day someone will be able to explain all this in information theoretic terms. I’m never sure if we’re comparing like with like (are the deep learning models we’re comparing against actually that deep, for example?) but clearly there’s something to the intuition that many small overfit models are more efficient than one big general model.

CapmCrackaWaka · 3 years ago
Some implementations will group the categories into two groups, right and left. An easy way to do this is to create the groups based on whatever grouping creates the biggest decrease in the loss. There are more complicated implementations; LightGBM groups by sum(gradient) / (sum(hessian) + smooth), where smooth is a smoothing parameter that prioritizes class size over loss decrease.

They link to this paper in their docs: https://www.tandfonline.com/doi/abs/10.1080/01621459.1958.10...
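
Roughly how that looks in practice with LightGBM's sklearn wrapper (toy data and column names are mine; cat_smooth is the smoothing term mentioned above):

    import numpy as np
    import pandas as pd
    import lightgbm as lgb

    rng = np.random.default_rng(0)
    n = 5_000
    df = pd.DataFrame({
        "city": pd.Categorical(rng.choice(list("ABCDEFGH"), size=n)),
        "income": rng.lognormal(10, 1, size=n),
    })
    # target depends on membership in an arbitrary subset of categories
    y = (df["city"].isin(["B", "E", "G"]).to_numpy() ^ (rng.random(n) < 0.1)).astype(int)

    # category-dtype columns are split on natively, no one-hot encoding needed
    model = lgb.LGBMClassifier(n_estimators=200, cat_smooth=10.0)
    model.fit(df, y)
    print(model.score(df, y))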

coffee_am · 3 years ago
One way is to use categorical set splits [1] (proposed for categorical set inputs, but works for categorical features as well), used in TF-DF [2]. Greedy, and expensive to train (cheap inference though), but it gives great results.

[1] https://arxiv.org/pdf/2009.09991.pdf [2] https://www.tensorflow.org/decision_forests/text_features

jb_s · 3 years ago
do you reckon it's possible to somehow transfer-learn from a GBT to a NN?
username_exists · 3 years ago
occam's razor
Permit · 3 years ago
> Results show that tree-based models remain state-of-the-art on medium-sized data (∼10K samples) even without accounting for their superior speed.

Is that really "medium"? That seems very small to me. MNIST has 60,000 samples and ImageNet has millions.

I think the title overstates the findings. I'd be interested to hear how these methods compare on much larger datasets. Is there a threshold at which deep learning outperforms tree-based models?

Edit: They touch on this in the appendix:

> A.2.2 Large-sized datasets

> We extend our benchmark to large-scale datasets: in Figures 9, 10, 11 and 12, we compare the results of our models on the same set of datasets, in large-size (train set truncated to 50,000 samples) and medium-size (train set truncated to 10,000 samples) settings.

> We only keep datasets with more than 50,000 samples and restrict the train set size to 50,000 samples (vs 10,000 samples for the medium-sized benchmark). Unfortunately, this excludes a lot of datasets, which makes the comparison less clear. However, it seems that, in most cases, increasing the train set size reduces the gap between neural networks and tree-based models. We leave a rigorous study of this trend to future work.

beckingz · 3 years ago
Many real world problems produce data that is decidedly medium-sized: small enough to fit in Excel, but too big to handle comfortably in Excel.
ramraj07 · 3 years ago
Then these real world problems don't actually warrant deep learning?

I thought the biggest leap in NN and deep learning in recent history was the realization that we need a ton of data to get maximal effectiveness from them; it now sounds counterproductive to forget this and cry they don’t work well with 10,000 rows.

mochomocha · 3 years ago
I've put in production numerous models with millions of tabular data points and a 10^5-10^6 feature space where tree-based models (or FF nets) outperform more complex DL approaches.
whymauri · 3 years ago
I'll chime in with billions of data points and 100-300 feature space with some smart feature engineering outperforming DL in runtime/compute (by orders of magnitude) and performance. But the domain was very specific and everything prior to the tree was doable with matrix operations, with the tree model summarizing a mixture of experts that chose optimal features.
adamsmith143 · 3 years ago
What kind of freakish tabular data do you have with a million columns??
riedel · 3 years ago
MNIST is not your typical real world tabular data. Many if not most data science problems out there are still in the range of a few k samples, from my perspective (trying to "sell" ML to the average company). From a statistical point of view I would not call these datasets small (you can decently compare two means from subsets without needing a Student's t-distribution).
nonameiguess · 3 years ago
Assuming the categories are meant to apply to any data sets, anything amenable to machine learning at all is at least medium data. "Small" data would be something like a human trial with n=6 because the length and compliance of the protocol is so onerous. There are entirely different statistical techniques for finding significance in the face of extremely low power.
_pastel · 3 years ago
It's baffling to me how little research attention there has been to improving tree-based methods, considering their effectiveness.

For example, LightGBM and XGBoost allow some regularization terms, but the variance/bias is still mostly controlled by globally setting the max depth and max node count (and then parameter searching to find good settings).

Surely there must be more powerful and sophisticated ways of deciding when to stop building each tree than counting the number of nodes? If this was neural nets there would be a hundred competing papers proposing different methods and arguing over their strengths and weaknesses.
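
For reference, the per-split knobs that do exist today are fairly blunt (LightGBM parameter names; XGBoost has close analogues):

    import lightgbm as lgb

    model = lgb.LGBMRegressor(
        num_leaves=255,          # global structural cap, as criticized above
        max_depth=-1,            # -1 = unlimited depth
        min_split_gain=0.01,     # only split if the loss improves by at least this much
        min_child_samples=50,    # minimum rows per leaf
        min_child_weight=1e-2,   # minimum sum of hessians per leaf
        reg_alpha=0.1,           # L1 penalty on leaf values
        reg_lambda=1.0,          # L2 penalty on leaf values
        learning_rate=0.05,
        n_estimators=2000,       # rely on early stopping to pick the tree count
    )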

I'm not sure whether the problem is that neural nets are just fundamentally more sexy, or that in order to make SOTA improvements in GBMs you need to dive into some gnarly C++. Probably both.

gwern · 3 years ago
Why do you think there has been little research attention? Time was, 'machine learning' was little but tree-based methods (and that was how they distinguished themselves from 'AI'). Go look at Breiman's CV or random conference proceedings. Or as tree-based method proponents love to point out, pretty much everyone on Kaggle up until recently used trees for everything non-image-based; that's a ton of effort invested in tweaking trees. And there were hardware efforts to accelerate them (I recall MS talking about how they were investing in FPGAs for MS Azure to run trees better), so 'GPUs' isn't an excuse.
zelphirkalt · 3 years ago
I think people throwing algorithms at problems on Kaggle, or combining them, is not the kind of research the GP was referring to.

Limiting a tree by its depth is a very coarse, global parameter. One could try any kind of criterion for deciding when to stop adding child nodes, depending on what information is locally available, and that depends on how the tree algorithm is actually implemented. So people doing Kaggle challenges would have to dig into the source code of the tree implementation and change things there, to expose more locally available knowledge and allow for more fine grained decision making at each node.

That is only the constructive side of things, when the tree is created. Even more powerful is the post-processing / destructive / pruning side, because theoretically the whole tree structure can be taken into account when deciding which branch to cut.

I think the GP is referring to research in the area of what other useful things one can come up with here. As far as I know, these are not the usual things people do in Kaggle challenges. Correct me if I am wrong.

_pastel · 3 years ago
Because anytime I search for literature on basic tweaks to the structure of decision trees, I find nothing.

Another example: modern GBM implementations all use binary trees. How would they perform with ternary trees? Or k-way trees for larger k plus some additional soft penalty that encourages minimizing the number branches, unless the information gain is really worth it?

(You can simulate ternary trees with binary, but the splitting behavior is different because ternary can more easily identify good splitting regions in the middle range of the histogram values.)

This seems like such a basic structural question, but the only relevant search result was this downvoted Stack Exchange question from 5 years ago: https://stats.stackexchange.com/questions/305685/ternary-dec...

There are lots of papers on ternary trees in old-school contexts like Ternary Decision Diagrams etc., but nothing relevant to the context of modern tree ensemble performance. Or maybe I'm just bad at searching?

(I implemented this and saw a small accuracy increase from ternary on large datasets, but much worse training speed because you get less mileage from the histogram subtraction trick. Maybe the accuracy would be even better with a more clever choice of soft penalty.)

mistrial9 · 3 years ago
> LightGBM not improving, meanwhile 8-figure budgets build GPUs and auto-logins..

My take? management agenda to build plug-and-play researchers (humans on jobs), rather than domain specialists. DeepLearning fits that description with all-plumbing, all-the-time.. domain specialists want graduate school, weekends and health benefits..

nl · 3 years ago
Is this supposed to be an ironic take on the "plug-and-play HN comment blaming cost cutting management for everything"?

Because it's not this.

Deep learning researchers are specialists in their specific technologies. For example, recent advances in transformers notwithstanding, it takes quite a long time for, say, someone who knows how to build deep learning models on images using CNNs to come up to speed on how to build good natural language processing models.

micro_cam · 3 years ago
There are a fair number of papers (start with DART, i.e. dropout for GBMs, and BART, Bayesian sampling of the whole GBM), but they start to look like global optimization problems, and part of the reason trees work so well is that the local greedy optimization can be made super fast on modern CPU caches.

So even if you can fit a more compact forest that performs well through clever regularization, it's usually better/faster in practice to grow more simple trees with more randomization and let the overfitting average out.

natalyarostova · 3 years ago
I think part of the problem is that the upper bound on neural nets, as far as we can tell, might very well be general intelligence, and things like self-driving cars, and other nearly magical use-cases that seem within reach. Whereas tree based models, for a series of reasons, many related to scaling, don't offer that feeling of limitless potential.
zelphirkalt · 3 years ago
Maybe. But then again we often try to solve very specific problems, which are very far from requiring anything close to a general intelligence. Heck, a general intelligence might even be bad at things, just like humans are not as good as computers at certain things.
kbelder · 3 years ago
There's a science-fiction story somewhere in there. Imagine an alien planet where brains evolved using a structure completely different than a neural net... some sort of classification model where they were incredibly efficient inside their domain, but broke unpredictably when brought outside of it.
jwilber · 3 years ago
If you’re interested in how tree-based models work, I wrote an interactive explanation on decision trees here: https://mlu-explain.github.io/decision-tree/

and random forests here: https://mlu-explain.github.io/random-forest/

It's also worth noting that a recent-ish paper shows neural networks can perform well on tabular data if well-regularized: https://arxiv.org/abs/2106.11189

lnenad · 3 years ago
That was super easy to digest, thank you!
isabellat · 3 years ago
Really nice interactive explanations!
BenoitEssiambre · 3 years ago
I like decision trees and this helps support my case for using them. I often go even further and don't use an algorithm to build the trees, but build trees myself along intuitive causal lines and use the data to train their parameters. I sometimes build a few models manually and see which fits the data better.

Prior knowledge can prevent the pitfalls of automatically built models.

Trees may be better than NNs because they overfit less, but you can overfit even less with a bespoke model. For example, I've seen an automatically generated tree, built to tune the efficiency of a factory, end up using "is it after a specific date" as a main feature, because a machine was upgraded on that date and the learning algorithm latched onto that unactionable piece of data as a main predictor for the model.

This was an easy fix (don't feed timestamp data to the model), but there are lots of more subtle cases like this, and I've seen people spend a lot of time cleaning and "tweaking" the input data to get the answers they want out of their ML models.

If you have to try to make your ML model behave by manually selecting what data to feed it, you might as well go all the way and build a clean causal model yourself that reflects your priors and domain knowledge about the subject.

I have an ML background but I often get more performance out of my models by doing something along the lines of what a Bayesian statistician would do.

Of course, with high-dimensional data like pixels in images, you have almost no choice but to use NNs. There's no way to hand-build those models.

hackerlight · 3 years ago
I want to know more about your process. How do you optimise the split point? How do you choose which feature to split on first?
BenoitEssiambre · 3 years ago
I start with common sense priors or "intuition" but also sometimes write code to try different combinations and assess accuracy.

For example, not too long ago I was trying to estimate and predict ATM usage and considered using days of the week as a feature. The simplest case is a single node tree that assumes every day of the week has similar usage. You just calculate mean and variance.

But then, a lot of ATMs have different usage patterns on weekends than on weekdays. So I considered a split into two groups, calculating a different mean/variance for weekend days and week days. I could have tried different groupings of days, but those two were easy for users to understand and covered most cases I was interested in, with less risk of overfitting given the amount of data I was working with (with more data I might have considered more splits, even taking into account special days like holidays). The more splits, the more data you need to train the model, especially if you want not just a point estimate but an idea of variance (more parameters to estimate require more data).

The decision whether to split or not can be done on a per ATM basis cross validating the splitting strategy by assessing a fleet of ATMs.

People optimize different things in these kinds of problems. Here https://www.saedsayad.com/decision_tree_reg.htm for example they try to reduce prediction variance. Oftentimes decision trees try to optimize information gain or gain ratio (https://en.wikipedia.org/wiki/Information_gain_(decision_tre...). Gain ratio seems like a pretty ad-hoc technique to me, so to prevent over-fitting and select the correct model (to split or not), I sometimes use a relative entropy penalty on the trained parameters, kinda like described here: http://www.inference.org.uk/mackay/itprnn/ps/343.355.pdf. I've never used it, but apparently if you have problems where you can't calculate that relative entropy over the parameters, there are approximation techniques such as https://en.wikipedia.org/wiki/Bayesian_information_criterion.
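
To make the "split or don't split" decision concrete, a toy sketch scored with BIC (fake numbers and an assumed Gaussian model per group; lower BIC is better):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    is_weekend = rng.random(365) < 2 / 7
    usage = np.where(is_weekend, rng.normal(40, 10, 365), rng.normal(100, 20, 365))

    def gaussian_bic(groups):
        # BIC = k*ln(n) - 2*logL, with one (mean, variance) pair per group
        n = sum(len(g) for g in groups)
        log_lik = sum(norm.logpdf(g, g.mean(), g.std()).sum() for g in groups)
        k = 2 * len(groups)
        return k * np.log(n) - 2 * log_lik

    print("no split:       ", gaussian_bic([usage]))
    print("weekend/weekday:", gaussian_bic([usage[is_weekend], usage[~is_weekend]]))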

Then there are things I sometimes do with hyperparameters to improve performance, like mixing fake data points that represent the prior for a wider group or class of ATMs into an individual ATM's data. Again, using common sense to adjust those priors (and always cross-validating my tweaks against real data).

ersiees · 3 years ago
Our lab works on changing this. I think it might still take some years for a full solution, but so far we have been successful with NNs for small datasets via meta-learning (https://arxiv.org/abs/2207.01848) and for large datasets via regularisation (https://arxiv.org/abs/2106.11189). The first is very new; the second is cited in this paper, but they didn't run it as a baseline.
civilized · 3 years ago
Why not run your own method on their benchmark? If your method is general-purpose, it should be very easy. And it would be very impressive to see your method succeed on a task that was not chosen by you to write your paper.
ersiees · 3 years ago
Yes, good point! The paper I am responsible for is targeted at smaller datasets, but I will propose this to the authors of the second paper :)
freemint · 3 years ago
Tree based algorithms have the advantage that global or robust optimization of them is possible. Do your approaches offer similar guarantees?
ersiees · 3 years ago
The first work I linked arguably gets around this issue completely. There we train a neural network to approximate Bayesian prediction directly. It accepts the dataset as input. Thus, there is no optimisation procedure per dataset, just a forward pass. So far, this seems to be pretty stable.
isolli · 3 years ago
Related, an article from 2019 [0] on how neural networks finally beat statistical models (e.g. ARIMA) in time-series forecasting.

[0] https://towardsdatascience.com/n-beats-beating-statistical-m...

nomel · 3 years ago
I can only assume this was achieved years ago, as a black project in the financial market.
mr_toad · 3 years ago
I had an econometrics professor who would have objected to calling ARIMA models statistics. He thought they were no better than data mining.
jmmcd · 3 years ago
A great paper and an important result.

However, it omits to cite the highly relevant SRBench paper from 2021, which also carefully curates a suitable set of regression benchmarks and shows that Genetic Programming approaches tend to be better than deep learning.

https://github.com/cavalab/srbench

cc u/optimalsolver