Having done a bit of data engineering in my day, I'm growing more and more allergic to the DataFrame API (which I used 24/7 for years). From what I've seen over the past ~10 years, 90+% of use cases would be better served by SQL, from the development perspective as well as for debugging, onboarding, sharing, migrating, etc.
Give an analyst AWS Athena, DuckDB, Snowflake, whatever, and they won't have to worry about looking up what m6.xlarge is and how it's different from c6g.large.
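A minimal sketch of that "just hand them SQL" workflow, here with DuckDB in Python; the file and column names are made up for illustration:

```python
import duckdb

# The analyst writes SQL against a local Parquet file; no instance types involved.
top_customers = duckdb.sql("""
    SELECT customer_id, SUM(amount) AS total_spend
    FROM read_parquet('orders.parquet')
    GROUP BY customer_id
    ORDER BY total_spend DESC
    LIMIT 10
""").df()  # hand back a pandas DataFrame if that's what they want
```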
I agree with this 100%. The creator of DuckDB argues that people using pandas are missing out on the 50 years of progress in database research, in the first 5 minutes of his talk here [1].
I've been using Malloy [2], which compiles to SQL (like Typescript compiles to Javascript), so instead of editing a 1000 line SQL script, it's only 18 lines of Malloy.
I'd love to see a blog post comparing a pandas approach to cleaning to an SQL/Malloy approach.
[1]: https://www.youtube.com/watch?v=PFUZlNQIndo
[2]: https://www.malloydata.dev/
> The creator of DuckDB argues that people using pandas are missing out on the 50 years of progress in database research, in the first 5 minutes of his talk here.
That's pandas. Polars builds on much of the same 50 years of progress in database research by offering a lazy DataFrame API which does query optimization, morsel-based columnar execution, predicate pushdown into file I/O, etc, etc.
Disclaimer: I work for Polars on said query execution.
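For readers unfamiliar with the lazy API being described, a minimal sketch (file and column names are invented): the filter and projection are applied during the Parquet scan rather than after loading everything.

```python
import polars as pl

lazy = (
    pl.scan_parquet("events.parquet")       # nothing is read yet
      .filter(pl.col("country") == "NL")    # predicate pushed down into the scan
      .select(["user_id", "ts", "amount"])  # projection pushdown: only needed columns are read
      .group_by("user_id")
      .agg(pl.col("amount").sum().alias("total"))
)

print(lazy.explain())  # inspect the optimized query plan
df = lazy.collect()    # execute it
```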
In the same talk, Mark acknowledges that "for data science workflows, database systems are frustrating and slow." Granted, DuckDB is an attempt to fix that, but most data scientists don't get to choose what database the data is stored in.
Why is the dataframe approach getting hate when you’re talking about runtime details?
I get that folks appreciate the almost conversational aspect of SQL vs. that of the dataframe API, but the other points make no difference.
If you're a competent dev/data person and are productive with the dataframe, then yay. Also, setup and creating test data and such is all objects and functions after all; if anything it's better than the horribad experience of ORMs.
As a user? No, I don't have to choose. What I'm saying is that analysts (who this Polars Cloud targets, just like Coiled or Databricks) shouldn't worry about instance types, shuffling performance, join strategies, JVM versions, cross-AZ pricing etc. In most cases, they should just get a connection string and/or a web UI to run their queries, everything abstracted from them.
Sure, Python code is more testable and composable (and I do love that). Have I seen _any_ analysts write tests or compose their queries? I'm not saying these people don't exist, but I have yet to bump into any.
Fun aside: I actually used Polars for a bit. The first time I tried it, I thought it was broken, because it finished processing so quickly I assumed it had silently exited or something.
So I'm definitely a fan, IF you need the DataFrame API. My point was that most people don't need it and it's oftentimes standing in the way. That's all.
Polars is very nice. I've used it off and on. The option to write Rust UDFs for performance and the easy integration of Rust with Python via PyO3 will make it a real contender.
Yes, I know Spark and Scala exist. I use them. But the underlying Java engines and the tacky Python gateway impact performance and capacity usage. Having your primary processing engine compiled natively in the same process always helps.
I think your argument focuses a lot on the scenario where you already have cleaned data (i.e., a data warehouse). I and many other data engineers agree: you're better off hosting it in a SQL RDBMS.
However, before that, you need a lot of code to clean the data, and raw data does not fit well into a structured RDBMS. Here you choose to map your raw data into either a row view or a table view. You're now left with the choice of either inventing your own domain objects (row view) or using a dataframe (table view).
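A rough sketch of the two options on made-up raw records: a hand-rolled domain object per row versus a dataframe with column-wise cleaning.

```python
from dataclasses import dataclass
import polars as pl

raw = [
    {"id": "1", "price": " 9.99 ", "currency": "eur"},
    {"id": "2", "price": "12.50", "currency": "EUR"},
]

# Row view: invent your own domain object
@dataclass
class Order:
    id: int
    price: float
    currency: str

rows = [Order(int(r["id"]), float(r["price"].strip()), r["currency"].upper()) for r in raw]

# Table view: a dataframe with column-wise cleaning
table = pl.DataFrame(raw).with_columns(
    pl.col("id").cast(pl.Int64),
    pl.col("price").str.strip_chars().cast(pl.Float64),
    pl.col("currency").str.to_uppercase(),
)
```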
I agree, but there are other possibilities in between those two extremes, like Quivr [1]. Schemas are good, but they can be defined in Python and you get a lot more composability and modularity than you would find in SQL (or pandas, realistically).
[1]: https://github.com/B612-Asteroid-Institute/quivr
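This is not Quivr's actual API, just a sketch of the general idea of Python-defined, composable schemas, here using plain pyarrow:

```python
import pyarrow as pa

# Schema fragments defined in Python can be composed and reused across datasets
base_fields = [pa.field("id", pa.int64()), pa.field("observed_at", pa.timestamp("us"))]
position_fields = [pa.field("x", pa.float64()), pa.field("y", pa.float64())]

observation_schema = pa.schema(base_fields + position_fields)

table = pa.table(
    {"id": [1, 2], "observed_at": [None, None], "x": [0.1, 0.2], "y": [1.0, 2.0]},
    schema=observation_schema,
)
```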
100% agree. I've also worked as a data engineer and came to the same conclusion. I wrote up a blog which went into a bit more depth on the topic here: https://www.robinlinacre.com/recommend_sql/
I recently had to create a reproducible version of incredibly complicated and messy R concoctions our data scientists came up with.
I did it with pandas without much experience with it and a lot of AI help (essentially to fill in the blanks the data scientists had left, because they only had to do the calculation once).
I then created a Polars version which uses LazyFrames. It ended up being about 20x faster than the first version. I did try some optimizations by hand to make the execution planner work even better, which I believe paid off.
If you have to do a large non-interactive analytical calculation (i.e. not in a notebook), Polars seems to be way ahead imo!
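A sketch of the kind of by-hand help being described, on hypothetical data: filter and pre-aggregate before an expensive join, then check the optimized plan to confirm it stayed that way.

```python
import polars as pl

events = pl.scan_csv("events.csv").filter(pl.col("year") >= 2020)  # filter early
users = pl.scan_csv("users.csv").select(["user_id", "segment"])    # only the needed columns

per_user = events.group_by("user_id").agg(pl.len().alias("n_events"))
result = per_user.join(users, on="user_id", how="left")

print(result.explain())  # verify the plan before the real run
out = result.collect()
```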
I do wish that it was just as easy to use as a Rust library though. The focus, however, seems to be mainly on being competitive in Python land.
He means that he wants our Rust library to be as easy to use as our Python lib. Which I understand, as our focus has been mostly on Python.
It is where most of our userbase is, and it is very hard for us to have a stable Rust API, as we have a lot of internal moving parts which Rust users typically want access to (as they like to be closer to the metal) but which have no stability guarantees from us.
In Python, we are able to abstract and provide a stable API.
Still don't get why one of the biggest players in the space, Databricks, is overinvesting in Spark. For startups, Polars or DuckDB are completely sufficient. Other companies like Palantir already support bring-your-own-compute.
That's a good question! Especially after Frank McSherry's COST paper [1], it's hard to imagine where the sweet spot for Spark is. I guess for Databricks it makes sense to push Spark, since they are the ones who created it. In a way, it's their competitive advantage.
[1]: https://www.usenix.org/system/files/conference/hotos15/hotos...
I absolutely love Polars. I work on some unholy dirty data and the ease of use, chaining, and speed are a godsend. One dataset that previously took 40 minutes in pandas now takes two minutes in Polars. Granted, the pandas query could be optimized, but out of the box, Polars eats pandas when it comes to speed and efficiency.
I basically ditched SQL for most of my analytical work because it's way easier to understand for my juniors (we're not technically a tech team) so it's a total win in my eyes.
Polars is certainly better than pandas for doing things locally. But that is a low bar. I've not had a great experience using Polars on large enough datasets. I almost always end up using DuckDB. If I am using SQL at the end of the day, why bother starting with Polars? With AI these days, it's ridiculously fast to put together performant SQL. Heck, you can even make your own grammar and be done with it.
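If the data is already in a Polars (or pandas) frame, DuckDB can also query it in place, so it doesn't have to be an either/or choice. A rough sketch with made-up data:

```python
import duckdb
import polars as pl

sales = pl.DataFrame({"region": ["EU", "EU", "US"], "amount": [10.0, 5.0, 7.5]})

# DuckDB picks up the local `sales` frame by name
top_regions = duckdb.sql(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region ORDER BY total DESC"
).pl()  # result comes back as a Polars DataFrame
```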
SQL is definitely easier and faster to compose than any dataframe syntax. I think pandas syntax (via the slicing API) is faster to type and in most cases more intuitive, but I still use Polars for all df-related tasks in my workflow since it's more structured and composable (although it needs more time to construct, a cost I'm willing to take when not simply prototyping). When in an ipython session, SQL via DuckDB is king. Also: python -m chdb "describe 'file.parquet'" (or any query) is wonderful.
I guess if it's too large to be performant then SQL can be the way to go. I avoid SQL for one-off tasks though, as I can more easily grok transformations in Polars code than SQL queries.
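The same toy filter-and-aggregate written both ways, to make the contrast in the two comments above concrete (column names invented):

```python
import pandas as pd
import polars as pl

pdf = pd.DataFrame({"shop": ["a", "a", "b"], "sales": [1.0, 2.0, 3.0]})
pldf = pl.DataFrame({"shop": ["a", "a", "b"], "sales": [1.0, 2.0, 3.0]})

# pandas: terse slicing, quick to type in a notebook
out_pd = pdf[pdf["sales"] > 1.5].groupby("shop")["sales"].sum()

# Polars: more explicit expressions, easier to compose and reuse
out_pl = (
    pldf.filter(pl.col("sales") > 1.5)
        .group_by("shop")
        .agg(pl.col("sales").sum())
)
```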
I guess it could be a good contender for replacing Spark; however, I suspect the fact that Spark is free and open source, which forms a community around it, means that Polars Cloud might struggle to gain traction when it's gated by a credit card.
Especially when considering testability and composability, using a DataFrame API inside regular languages like Python is far superior IMO.
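One way that testability cashes out, with invented names: a transformation is just a function from LazyFrame to LazyFrame, so it can be unit-tested on a tiny in-memory frame and composed with other steps.

```python
import polars as pl

def add_vat(lf: pl.LazyFrame, rate: float = 0.21) -> pl.LazyFrame:
    # Pure function: takes a LazyFrame, returns a LazyFrame
    return lf.with_columns((pl.col("net") * (1 + rate)).alias("gross"))

def test_add_vat() -> None:
    out = add_vat(pl.LazyFrame({"net": [100.0]})).collect()
    assert round(out["gross"][0], 2) == 121.0
```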
Sometimes. But sometimes Python is just much easier. For example transposing rows and columns.
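For instance, assuming a small frame with made-up numbers, the transpose is a single attribute access in pandas, while the SQL equivalent is a PIVOT/UNPIVOT exercise:

```python
import pandas as pd

df = pd.DataFrame({"q1": [10, 4], "q2": [12, 5]}, index=["revenue", "cost"])
flipped = df.T  # rows become columns, columns become rows
```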