What I read in this paper blew that idea out of the water! I mean, it's still doable, but you've far exceeded it.
I love that you covered many types of structures, used 8x consumer GPUs more like OSS folks do (widely-accessible pretraining), claim no copyright infringement for pretraining, and use enough ML techniques that people can enjoy Googling stuff for days.
I do have some questions about what I might have overlooked in the paper.
1. Are the training data and code available to reproduce the model, and to iteratively improve its architectural decisions?
2. Most authors claiming their data was legal or open were actually committing copyright infringement. Your method might dodge that if users generate their own synthetic data using methods they can verify aren't themselves encumbered. Is that code available under an open license? If not, would you offer it for a fee to companies and for free to researchers?
3. What specific, common uses could amateurs try that would showcase the model's ability in a business setting? (Both to drive more research and to build products on the model.)
I thank you for your time.
Thanks :)
1. Only for the first version, not for this version. I am sorry!
2. Yeah, ours is guaranteed OK, as we wrote code to generate it basically just from plain torch ops. The code to run inference is available, just not the training code and data generation. (See the sketch below for what inference looks like.)
3. We have put it to work on time series data, which is very business-relevant, for example https://github.com/liam-sbhoo/tabpfn-time-series, and we have a table in the Appendix with all datasets we evaluate on in our main analysis to give you some ideas for possible datasets.
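For anyone wanting to try the released inference code, here is a minimal sketch assuming the sklearn-style `TabPFNClassifier` interface from the public `tabpfn` package; the exact class names, defaults, and options may differ, so check the repository.

```python
# Minimal sketch of running inference with the pre-trained model, assuming the
# sklearn-style interface of the `tabpfn` package (fit / predict_proba).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = TabPFNClassifier()           # loads the pre-trained checkpoint
clf.fit(X_train, y_train)          # "fit" essentially stores the context
proba = clf.predict_proba(X_test)  # one forward pass over context + queries
print(proba.shape)                 # (n_test_samples, n_classes)
```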
Just looking through the code a bit, it seems that the model supports a (custom) attention mechanism both between features and between rows (the code uses the term "items")? If so, does the attention between rows significantly improve accuracy?
Generally, for standard regression and classification use cases, rows (observations) are assumed to be independent, but I'm guessing cross-row attention might help the model see the gestalt of the data in some way that improves accuracy even when the independence assumption holds?
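For intuition, here is a toy sketch (not the authors' actual implementation; shapes and module choices are illustrative assumptions) of two-axis attention over a table: one pass attends across features within each row, the other across rows (items) within each feature. The cross-row pass is what lets a query row "look at" the labeled training rows.

```python
# Toy sketch of alternating feature-wise and item-wise attention on a table of
# per-cell embeddings. Dimensions and modules are illustrative assumptions.
import torch
import torch.nn as nn

batch, n_items, n_features, d = 1, 128, 10, 64
x = torch.randn(batch, n_items, n_features, d)  # one embedding per table cell

feature_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
item_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

# 1) Attention between features: fold items into the batch dimension.
xf = x.reshape(batch * n_items, n_features, d)
xf, _ = feature_attn(xf, xf, xf)
x = xf.reshape(batch, n_items, n_features, d)

# 2) Attention between rows/items: fold features into the batch dimension.
xi = x.permute(0, 2, 1, 3).reshape(batch * n_features, n_items, d)
xi, _ = item_attn(xi, xi, xi)
x = xi.reshape(batch, n_features, n_items, d).permute(0, 2, 1, 3)

print(x.shape)  # torch.Size([1, 128, 10, 64])
```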
Could you please explain, like I'm five, what the trick is? You have a model pre-trained on a large set of small datasets, and you leverage it to boost performance?
Training is fast, a few seconds, but how much time is needed to compute predictions?
How large is the model?
To draw a parallel to NLP: previously, people trained a neural network for each kind of text classification they wanted to do, but then LLMs came around that were pre-trained to perform new tasks on the fly. Similarly, TabPFN learns to do new tasks on the fly just from the context (dataset) it is given.
Training and prediction in these models are by default one and the same, similar to how predicting the next token in an LLM is not split into learning from the context and then doing the actual prediction. There is a way to split this up, though; then the predictions, I believe, take something like 1/10 s for medium-sized datasets.
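To make the "one forward pass" point concrete, here is a toy, hypothetical sketch of the call structure (the model class, signature, and shapes are stand-ins, not the actual TabPFN code): the labeled rows are the context, and all query rows are answered in the same pass, with no gradient steps on the new dataset.

```python
# Conceptual sketch: in-context prediction as a single forward pass.
import torch
import torch.nn as nn

class DummyInContextModel(nn.Module):
    """Stand-in for the pre-trained transformer; a real PFN would attend from
    the query rows to the (X_train, y_train) context, this dummy ignores it."""
    def __init__(self, n_features=4, n_classes=2):
        super().__init__()
        self.head = nn.Linear(n_features, n_classes)

    def forward(self, X_train, y_train, X_test):
        # Context and queries go in together; class probabilities come out.
        return self.head(X_test).softmax(-1)

model = DummyInContextModel()
X_train, y_train = torch.randn(100, 4), torch.randint(0, 2, (100,))
X_test = torch.randn(20, 4)

with torch.no_grad():
    proba = model(X_train, y_train, X_test)  # one pass: context + queries
print(proba.shape)  # torch.Size([20, 2])
```

Splitting it up, as mentioned above, would mean encoding the context once, caching that state, and then answering many query batches against the cache.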