It's crystal clear that this page was written for people who already know what they're looking at; the first line of the first paragraph, far from describing the tool, is about its qualities: "Polars is written from the ground up with performance in mind"
And the rest follows the same line.
Could anyone ELI5 what this is and what needs it's a good solution for?
EDIT: So an alternative implementation of Pandas DataFrame. Google gave me [0] which explains:
> The pandas DataFrame is a structure that contains two-dimensional data and its corresponding labels. DataFrames are widely used in data science, machine learning, scientific computing, and many other data-intensive fields.
> DataFrames are similar to SQL tables or the spreadsheets that you work with in Excel or Calc. In many cases, DataFrames are faster, easier to use, and more powerful than tables or spreadsheets because they’re an integral part of the Python and NumPy ecosystems.
[0]: https://realpython.com/pandas-dataframe/
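To make the quoted definition concrete, here's a minimal sketch in Python (the column names and values are made up for illustration):

    import polars as pl

    # Two-dimensional data with labeled columns: that's all a DataFrame is.
    df = pl.DataFrame({
        "name": ["ann", "bob", "cal"],
        "age": [34, 45, 28],
    })

    # Spreadsheet-style operations become method calls.
    print(df.filter(pl.col("age") > 30))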
Yes, it's an annoying negative feature of many tech products. Of course it's natural to want to speak to your target audience (in this case, data scientists who like Pandas but find it annoyingly slow/inflexible), but it's quite alienating to newbies who might otherwise become your most enthusiastic customers.
I am the target audience for Polars and have been meaning to try it for several months, but I keep procrastinating because I feel residual loyalty to Pandas: Wes McKinney (its creator) took the time to write a helpful book about the most common analytical tools: https://wesmckinney.com/book/
Ritchie Vink (the creator of Polars) deliberately decided not to write a book so that he (and his team) can focus full time on Polars itself.
Thijs Nieuwdorp and I are currently working on the O'Reilly book "Python Polars: The Definitive Guide" [1]. It'll be a while before it gets released, but the Early Release version on the O'Reilly platform gets updated regularly. We also post draft chapters on the Polars Discord server [2].
The Discord server is also a great place to ask questions and interact with the Polars team.
[1] More information about the book: https://jeroenjanssens.com/pp/
[2] Polars Discord server: https://discord.gg/fngBqDry
> it's quite alienating to newbies who might otherwise become your most enthusiastic customers.
Newbies are your best target audience too! They aren't already ingrained in another system, forced to unlearn a framework before learning yours; they're starting from yours. If a newbie can't get through your docs, you need to improve your docs. But it's strange to me how mature Polars is while the docs are still this bad. It makes it feel like it isn't in active/continued development. Polars is certainly a great piece of software, but that doesn't mean much if you can't get people to use it. And the better your docs, the quicker you turn noobs into wizards. The quicker you do that, the quicker you offload support onto your newfound wizards.
This whole thread just comes across as unmitigated pedantry to me.
Polars is also a wonderful project!
Wes has also worked hard to improve a lot of the missteps of pandas, such as through pyarrow, which may prove even more impactful than pandas has been to date.
> Yes, it's annoying negative feature of many tech products.
Sadly it's not only tech products, but also things like security disclosures too.
It always follows the same pattern:
- Spend $X time coding/researching something.
- Spend $not_enough_time documenting it.
- Spend $far_too_much_time thinking about / "engaging with the community" in deciding on a cute name, fancy logo and cool looking website.
It’s pandas, but fast. Pandas is the original open source data frame library. Pandas is robust and widely used, but sprawling and apparently slower than this newcomer. The word “data frames” keys in people who have worked with them before.
Actually, pandas is not the original open source data frame library, except perhaps in Python. There is a very rich tradition of data.frames in R, including the unjustly neglected data.table.
Pandas has also moved to Apache Arrow as a backend [1], so it’s likely performance will be similar when comparing recent versions. But it’s great to have some friendly competition.
[1] https://datapythonista.me/blog/pandas-20-and-the-arrow-revol...
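For what it's worth, recent pandas exposes the Arrow backend as an opt-in; a rough sketch, assuming pandas 2.x and a hypothetical data.csv (the flag names have shifted between versions):

    import pandas as pd

    # Ask pandas 2.x to parse with Arrow and keep Arrow-backed dtypes.
    df = pd.read_csv("data.csv", engine="pyarrow", dtype_backend="pyarrow")
    print(df.dtypes)  # e.g. int64[pyarrow] instead of the classic numpy int64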
> Pandas is the original open source data frame library
...ehh, not quite. R and its predecessor S have Pandas beat by decades. Pandas wasn't even the first data frame library for Python. But it sure is popular now.
I may be a rare bird, starting with R dataframes (still newbie+ level), then Python Polars (intermediate-?). Frankly, whenever I have to use pandas or data frames in R, I am not convinced that these are more intuitive or easier to master. For instance, I do not like the concept of row names.
Polars can be overkill for small/medium datasets, but since I have been bitten by corrupted or badly formatted CSVs/TSVs, I love the fact that Polars will throw in the towel and complain about types, column-number mismatches, etc.
And the fact that it can scale up to millions of rows on a modest workstation compensates for the fact that sometimes one can spend hours finding a proper way to manipulate a dataset.
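A small sketch of the strictness being praised here; the ragged CSV is invented, and the exact exception class may vary by Polars version:

    import io
    import polars as pl

    # The second data row has an extra field - forgiving readers often
    # limp along silently, but Polars refuses by default.
    bad = io.StringIO("id,score\n1,9.5\n2,7.0,oops\n")
    try:
        pl.read_csv(bad)
    except Exception as exc:  # Polars raises a ComputeError here
        print("rejected:", exc)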
I was going to say - it always feels so humbling seeing pages like this. "DataFrames for the new era" okay… maybe I know what data frames are? "Multi-threaded query engine" ahh, so it’s like a database. A graph comparing it to things called pandas, modin, and vaex - I have no clue what any of these are either! I guess this really isn’t for me.
It’s a shame because I like to read about new tech or project and try and learn more, even if I don’t understand it completely. But there’s just nothing here for me.
This must be what normal people go through when I talk about my lowly web development work…
You can much more easily compose the operations you want to run.
Just think of it as an API for manipulating tabular data stored somewhere (often parquet files, though they can query many different data sources).
In fairness, the title of the page is “Dataframes for the new Era”. The “Get Started” link below the title links to a document that points to the GitHub page, which explains what the library is about to people with data analysis backgrounds: https://github.com/pola-rs/polars
I'm currently getting dragged into "data" stuff, and I get the impression it's a parallel universe, with its own background and culture. A lot of stuff is like "connect to your Antelope or Meringue instances with the usability of Nincompoop and the performance of ARSE2".
Anyway, probably the interesting things about polars are that it's like pandas, but uses a more efficient rust "backend" called Arrow (created by the main developer of pandas, and I think that part's also in pandas now), plus something like a "query planner" that makes combining operations more efficient. Typically doing things in polars is much more efficient than pandas, to the extent that things that previously required complicated infrastructure can often be done on a single machine. It's a very friendly competition.
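A sketch of what that query planner buys you; events.csv is hypothetical, and older Polars releases spell group_by as groupby:

    import polars as pl

    # Nothing executes until .collect(); the planner can push the filter
    # down into the CSV scan and read only the columns it needs.
    lazy = (
        pl.scan_csv("events.csv")
          .filter(pl.col("status") == "ok")
          .group_by("user_id")
          .agg(pl.col("amount").sum())
    )
    print(lazy.explain())  # inspect the optimized plan
    df = lazy.collect()    # run it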
As far as I can tell everybody loves it and it'll probably supplant pandas over time.
> As far as I can tell everybody loves it and it'll probably supplant pandas over time.
I've been using pandas heavily, everyday, for something like 8 years now. I also did contribute to it, as well as wrote numpy extensions. That is to say, I'm fairly familiar with the pandas/numpy ecosystem, strengths and weaknesses.
Polars is a breath of fresh air. The API of pandas is a mess:
* overuse of polymorphic parameters and return types (functions that accept lists, ndarrays, dataframes, series, or scalars) and return differently shaped dataframes or series.
* lots of indirection behind layers of trampoline functions that hide behavior behind undocumented "=None" defaults.
* abuse of half-baked immutable APIs, favoring a "copy everything" style of coding, mixed with half-supported, should-have-been-deprecated, in-place variants.
* lots and lots of regressions at every new release, of the worst kind ("oh yeah, we changed the behavior of function X when there are more than Y NaNs over the window").
* very hard to actually know what is delegated to numpy, what is Cython/pandas, and what is pure Python/pandas.
* overall, the beast seems to have won against its masters, and the maintainers seem lost as to what to fix versus what to keep backward compatible.
Polars fixes a lot of these issues, but it has some shortcomings as well. Mainly I found that:
* the API is definitely more consistent, but also more rigid than pandas. Some things can be very verbose to write. It will take some years for nicer, simpler "shortcuts" and patterns to emerge.
* the main issue IMHO is Polars' handling of cross-sectional (axis=1) computations. Polars is _very_ time-series (axis=0) oriented, and most cross-sectional computations require transposing the data frame, which is very slow. Pandas has a lot of dedicated axis=1 implementations that avoid a full transposition (see the sketch below).
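To illustrate the axis=1 point: pandas has dedicated row-wise reductions, while Polars covers the common cases with horizontal helpers (available in recent versions) and needs a costly transpose for the rest. A sketch with toy data:

    import pandas as pd
    import polars as pl

    pdf = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
    row_means_pd = pdf.mean(axis=1)  # dedicated axis=1 implementation

    pldf = pl.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
    # Common reductions avoid a transpose via the *_horizontal helpers...
    row_means_pl = pldf.select(pl.mean_horizontal("a", "b"))
    # ...but an arbitrary cross-sectional computation may force pldf.transpose().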
Arrow is an in-memory data format designed to allow zero-copy interop between languages, including Rust, C++, and Python. It's a bit more sophisticated than just some arrays, but ultimately everything is just arrays.
Polars is implemented in Rust and uses Arrow to represent some data in memory.
Pandas is often used in a way that results in a lot of processing happening in Python. Data processing in Python is comically inefficient and single-threaded. Beating that sort of pipeline by 10-100x is not that difficult if you have optimization experience and are able to work in a more suitable language.
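A sketch of that interop, assuming pyarrow is installed; conversions between Arrow tables and Polars frames are typically zero-copy for compatible types:

    import pyarrow as pa
    import polars as pl

    table = pa.table({"x": [1, 2, 3]})   # Arrow data, maybe from another system
    df = pl.from_arrow(table)            # into Polars without copying buffers
    back = df.to_arrow()                 # and back out again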
> A lot of stuff is like "connect to your Antelope or Meringue instances with the usability of Nincompoop and the performance of ARSE2".
I've been recently getting into DevOps stuff, and this is exactly what that sounds like to me. Same thing whenever I look at any new web framework:
Okay, cool, so your product brings the power of "WazzupJS" to "the edge"... But it also has "teleports" like "Blimper" but without all of the downsides of "Vormhole Blues"?
I'm sure that's really useful, but I kinda wish I knew what the product actually does
I try to use Polars each time I have to do some analysis where dataframes help. So basically any time I'd reach for pandas, which isn't too often. So each time it's fairly "new". This makes it hard for me to believe that everyone saying "Pandas but faster" has actually used Polars, because I can often write Pandas from memory.
There are enough subtle and breaking changes that it is a bit frustrating. I really think Polars would be much more popular if the learning curve weren't so high. It wouldn't be so high if there were just good docs. I'm also confused why there's a split between "User Guide" and "Docs".
To all devs:
Your docs are incredibly important! They are not an afterthought. And dear god, don't treat them as an afterthought and then tell people opening issues to RTFM. It's totally okay to point people in the right direction without hostility. It even takes less energy! It's okay to have a bad day and apologize later too, you'll even get more respect! Your docs are just as important as your code, even if you don't agree with me, they are to everyone but you. Besides technical debt there is also design debt. If you're getting the same questions over and over, you probably have poor design or you've miscommunicated somewhere. You're not expected to be a pro at everything you do and that's okay, we're all learning.
This isn't about polars, but I'm sure I'm not the only one to experience main character devs. It makes me (and presumably others) not want to open issues on __any__ project, not just bad projects, and that's bad for the whole community (including me, because users find mistakes. And we know there's 2 types of software: those with bugs and those that no one uses). Stupid people are just wizards in training and you don't get more wizards without noobs.
As another data point, I switched to Polars because I found it much more intuitive than Pandas - I couldn't remember how to do much in pandas in the rare (maybe twice a year) times I want to do data analysis. In contrast, Polars has a (to me anyway) wonderfully consistent API that reminds me a lot of SQL.
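For the curious, that SQL flavor looks roughly like this (toy data):

    import polars as pl

    df = pl.DataFrame({"city": ["NY", "NY", "SF"], "sales": [10, 20, 30]})
    out = (
        df.filter(pl.col("sales") > 5)         # WHERE
          .group_by("city")                    # GROUP BY
          .agg(pl.col("sales").sum())          # SUM(...)
          .sort("city")                        # ORDER BY
    )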
Any new library will have a learning curve. I used Pandas for many years and never could get used to it, never mind memorize most of it. It was always all over the place with no consistency.
Switched to Polars a year or so ago and never looked back. I can now usually write it from memory as the method structure is clear. Still running Pandas in a production system that doesn't get many updates, but that's all. Even if you like Pandas, you cannot ignore how incredibly slow it is and, more importantly, what a memory hog it is.
The story of Polars seems to be shaping up a bit like the story of Python 3000: everything probably could have been done in a slow series of migrations, but the BDFL was at their limit, and had to start fresh. So it takes 10 years for the community to catch up. In the mean time, there will be a lot of heartache.
Above that it says “DataFrames for a new era” hidden in their graphics. I believe it’s a competitor to the Python library “Pandas”, which makes it easy to do complex transformations on tabular data in Python.
It seems like it's a disease endemic to data products. Everybody, the big cloud providers and the small data products, build something whose selling point is "I'm the same as Apache X but better." But if you don't know what Apache X is, you have to go read up on that, and its website might say "I'm the same as Whatever Else but better," and you have to go read up on that. I don't want to figure out what a product does by walking a "like X but better" chain and applying diffs in my head. Just tell me what it does!
I get that these are general purpose tools with a lot of use cases, but some real quick examples of "this is a good use case" and "this is a bad use case, maybe prefer SQL/nosql/quasisql/hadoop/a CSV file and sed" would be really helpful, please.
I think something like dataframes suffers from having a name that isn't obscure enough. You read "dataframes" and think those are two words you know, so you should understand what it is.
If they'd called them flurzles you wouldn't feel like you should understand if it's not something you work with.
How come some submissions don't even describe what they are about beyond just the name? It's really puzzling how everyone is meant to know what something is from its name alone.
I've mentioned this before and got downvoted because of course everyone is a web dev and knows what xyz random framework (name and version number in the title, nothing else) is.
Right... but the title before the first line reads "DataFrames for the new era". If you don't know what a data frame is then, yes, it's for people who already know that.
You were right that the page is written for those who know what they are looking for, which is just fine. If you are getting started in DS/ML/etc. and you have used numpy, pandas, etc., Polars is useful in some cases. A simple one: it loads dataframes faster than pandas (from experience with a team I help).
I haven't played enough to know all its benefits, but yes, it's the next logical step if you are in the space using the above-mentioned libraries; it's something one will find.
Dataframes in Python are a wrapper around 2D numpy arrays, with labels and various accessors. Operations on them are orders of magnitude slower than using the underlying arrays.
There's a very good point here, but I don't think it's made clear.
If your data fits into numpy arrays or structured arrays (mainly if it is in numeric types), numpy is designed for this and will likely be much faster than pandas/polars (though I've also heard pandas can be faster on very large tables).
Pandas and Polars are designed for ease of use on heterogeneous data. They also include a Python 'Object' data type, which numpy very much does not. They are also designed more like a database (e.g., with 'join' operations). This allows you to work directly with imported data that numpy won't accept - after which Pandas uses numpy for the underlying operations.
So I think the point is: if you are running into speed issues in Pandas/Polars, you may find that the time-critical operations are things more efficiently done in numpy (and this would be a much bigger gain than moving from Pandas to Polars).
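In practice that often means dropping into numpy just for the hot loop; a sketch:

    import numpy as np
    import polars as pl

    df = pl.DataFrame({"x": [1.0, 2.0, 3.0]})
    x = df["x"].to_numpy()              # usually zero-copy for plain numerics
    y = np.sqrt(x) * 2.0                # keep the numeric kernel in numpy
    df = df.with_columns(pl.Series("y", y))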
I don't know where this myth originated, but I have seen it in multiple places. If you just think about it: 2D numpy arrays can't have different types for different columns.
This is true specifically for Pandas DataFrames. Numpy arrays are themselves just wrappers around c arrays, which are contiguous chunks of memory. Polars supports operations on files larger than memory.
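The larger-than-memory part works through the lazy/streaming engine; a sketch with a hypothetical huge.parquet (the streaming flag has been renamed in newer releases):

    import polars as pl

    lazy = pl.scan_parquet("huge.parquet").filter(pl.col("amount") > 0)
    # Process in batches rather than materializing the whole file at once.
    df = lazy.collect(streaming=True)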
Marketing is a skill that needs to be learned. You have to put yourself in the shoes of a person who knows nothing about your product. This does not come naturally to the engineers who make these products and are used to talking to other specialists like themselves.
This is true in general but I'm not sure it's what's going on here.
Marketing is also very concerned with understanding who your target audience(s) are and speaking their language.
I think talking about "DataFrames" is exactly that; the target audience of this project knows what that means. What they are interested in is "ok but who cares about data frames? I've been using pandas for like fifteen years", so what you want to tell them is why this is an improvement, how it would help them. Dumbing it down to spend a bunch of space describing what data frames are would just be a distraction. You'd probably lose the target audience before you ever got to the actual benefits of the project.
Noticed exactly the same - there's no description of the library whatsoever on the landing page. It is implied that it is a DataFrame library, whatever that means.
Maybe this is sort of like the opposite of how scam emails are purposefully scammy, so that only people who can't recognize scams will fall for them. Only people who know what "a DataFrame library" is - which is an enormous number of people, since this is probably the most broadly known concept in data science / engineering - will keep reading this, and they are the target audience.
I don't use dataframes in my day job but have dabbled in them enough that I found this website pretty easy to digest.
You'd really have to be a complete data engineering newbie to not understand it I think?
I mean, where do you draw the line? You wouldn't expect a software tool like this to explain what it is in language my grandma would understand, I don't think?
> You'd really have to be a complete data engineering newbie to not understand it I think?
I do occasionally use Pandas in my day job, but I honestly think very few programmers that could have use for a data frame library would describe themselves as a “data engineer” at all.
In my case, for example, I’m just a physicist - I don’t work with machine learning, big data, or in the software industry at all. I just use Pandas + Seaborn to process the results of numerical simulations and physical experiments similarly to how someone else might use Excel. Works great.
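That Excel-replacement workflow is just a few lines; a sketch with an invented results file and columns:

    import pandas as pd
    import seaborn as sns

    df = pd.read_csv("run_results.csv")           # hypothetical simulation output
    print(df.describe())                          # quick summary statistics
    sns.lineplot(data=df, x="time", y="temperature")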
Had the exact same thought seeing this. Too many of these websites are missing a simple tl;dr of what the thing actually is. Great, it's fast, but fast at what?
It has that simple tldr, it's the very first word, "DataFrames". Everyone in this thread just doesn't know what that means, and that's fine, I get that, but seriously, that's the simple summary. Data frames aren't an obscure or esoteric concept in the data analysis space; quite the opposite.
I hate this doc style that has become so popular lately. They get so wrapped up in selling you their story that they forget to tell you basic shit. Like what it is. Or how to install it.
The PMs literally simplified things so much they simplified the product right out of the docs.
Just once I’d like to see “this library was written to fulfill head-in-the clouds demands by management that we have some implementation, without regards to quality.”
> Quick install
> Polars is written from the ground up making it easy to install. Select your programming language and get started!
> As a hobby project I tried to build a DataFrame library in Rust. I got excited about the Apache Arrow project and wondered if this would succeed.
> After two months of development it is faster than pandas for groupby's and left and inner joins. I still got some ideas for the join algorithms. Eventually I'd also want to add a query planner for lazy evaluation.
Detailed Comparison Between Polars, DuckDB, Pandas, Modin, Ponder, Fugue, Daft - https://news.ycombinator.com/item?id=37087279 - Aug 2023 (1 comment)
Polars: Company Formation Announcement - https://news.ycombinator.com/item?id=36984611 - Aug 2023 (52 comments)
Replacing Pandas with Polars - https://news.ycombinator.com/item?id=34452526 - Jan 2023 (82 comments)
Fast DataFrames for Ruby - https://news.ycombinator.com/item?id=34423221 - Jan 2023 (25 comments)
Modern Polars: A comparison of the Polars and Pandas dataframe libraries - https://news.ycombinator.com/item?id=34275818 - Jan 2023 (62 comments)
Rust polars 0.26 is released - https://news.ycombinator.com/item?id=34092566 - Dec 2022 (1 comment)
Polars: Fast DataFrame library for Rust and Python - https://news.ycombinator.com/item?id=29584698 - Dec 2021 (124 comments)
Polars: Rust DataFrames Based on Apache Arrow - https://news.ycombinator.com/item?id=23768227 - July 2020 (1 comment)
> A couple of personal points that I may as well insert here.
The account I'm now using, dang, was used briefly in 2010 by someone who didn't leave an email address. I feel bad for not being able to ask them if they still want it, so I'm considering this an indefinite loan. If you're the original owner, please email me and I'll give it back to you. The reason I want the name dang is that (a) it's a version of my real name and (b) it's something you say when you make a mistake.
You're probably too late.
https://imgflip.com/memegenerator/Slowpoke
Used pandas for years and it always felt like rolling a ball uphill - just look at doing something as simple as a join (don't forget to reset the index).
Polars feels better than pandas in every way (faster + multi-core, less memory, more intuitive API). The library is still relatively young which has its downsides but in my opinion, at minimum, it deserves to be considered on any new project.
Easily being able to leverage the Rust ecosystem is also awesome - I sped up some geospatial code 100X by writing my own plugin to parallelize a function.
> just look at doing something as simple as a join (don't forget to reset the index)
It's slightly ironic that you mention this, because I always thought the biggest problem with Pandas was its documentation. Case in point: did you know there's a way to join data frames without using the index? It's called "merge" rather than "join".
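Concretely, for anyone who hasn't seen the distinction (toy frames):

    import pandas as pd

    left = pd.DataFrame({"key": [1, 2], "a": ["x", "y"]})
    right = pd.DataFrame({"key": [1, 2], "b": [10, 20]})

    m = left.merge(right, on="key")   # SQL-style join on a column, no index games
    j = left.set_index("key").join(right.set_index("key"))  # index-based join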
Pandas was originally very heavily inspired by R terminology and usage patterns, where the term "merge" to mean "join" was already commonplace. If I didn't already know R when I started learning Pandas (~2015), I don't think I'd have been able to pick it up quickly at all.
> I always thought the biggest problem with Pandas was its documentation. Case in point: did you know there's a way to join data frames without using the index? It's called "merge" rather than "join".
chatgpt (even the free tier) solved that problem for me. I ask it what I want in sql terms (or just plain english) and it tells me the pandas spell invocation. It even started to make sense after a few kLOC...
For me, Pandas fits in neatly with Matplotlib in the niche category of "R-inspired Python libraries that are somewhat counter-intuitive due to said R-inspiration"
I had to check the R documentation for merge in disbelief, because it didn't ring a bell. Between data.table's [ syntax and dplyr joins I can't remember the last time I've used merge!
I am very curious to know how you feel about PRQL (prql-lang.org) ? IMHO it gives you the ergonomics and DX of Polars or Pandas with the power and universality of SQL because you can still execute your queries on any SQL compatible query execution engine of your choice, including Polars and Pandas but also DuckDB, ClickHouse, BigQuery, Redshift, Postgres, Trino/Presto, SQLite, ... to name just a few popular ones.
The join syntax and semantics is one of the trickiest parts and is under discussion again recently. It's actually one of the key parts of any data transformation platform and is foundational to Relational Algebra, being right there in the "Relational" part and also the R in PRQL. Most of the PRQL built-in primitive transforms are just simple list manipulations like map, filter or reduce but joins require care to preserve monadic composition (see for example the design of SelectMany in LINQ or flatmap in the List Monad). See this comment for some of my thoughts on this: https://github.com/PRQL/prql/issues/3782#issuecomment-181131...
That issue is closed but I would love to hear any comments and you are welcome to open a new issue referencing that comment or simply tagging me (@snth).
Disclaimer: I'm a PRQL contributor.
First time I’ve heard of it but seems very cool. My background is data science though so being able to use DS libraries or even apply a python function is why I find myself in Pandas / Polars. This seems very powerful for a data engineer.
I also think it’s awesome you guys have a duckdb integration - maybe I’ll try it out.
The biggest advantage I found when I evaluated it was that the API was much more consistent and understandable than the pandas one. Which is probably a given: they've learned from watching 20 major versions of pandas get released. However, since it's much rarer, Copilot had trouble writing Polars code. So I'm sticking with pandas and Copilot for now. It's an interesting barrier to new libraries in general that I hadn't noticed until I tried this.
You're the first person I've ever encountered who publicly states a preference for a library because of its Copilot support.
Not making a judgement, just finding it interesting.
Anyway, for what it's worth, Copilot learns fast in your repos, very fast.
I use an extremely custom stack made of TS-Plus, a TypeScript fork that not even its author uses or recommends, and Copilot churns out very good TS-Plus code.
So don't underestimate how good Copilot can get at the boilerplate stage once it's seen a few examples.
That sounds really interesting and valuable, I just have no idea where to start.
Not a big deal because I just read the docs, but it was annoying that I couldn't have Copilot just spit out what I need.
Copilot support is a chicken and egg problem. It needs to train on others code but if people don't write Polars code without Copilot then Copilot will not get better at writing Polars code.
It is really interesting to see these two posts together. I can now imagine how AI tools could actually inhibit innovation in many domains simply because they're optimized for things that are already entrenched, and new entrants won't be in the training data. That further inhibits adoption compared to existing tools, and thus further inhibits enough growth to make it into model updates.
You recognize the API is more consistent and understandable, but you want to stay with Pandas only because Copilot makes it easier? Please, (a) for your own sake and (b) for the sake of open source innovation, use the tool that you admit is better.
About me: I've used and disliked the Pandas API for a long time. I'm very proactive about continual improvement in my learning, tooling, mindset, and skills.
> Please, (a) for your own sake and (b) for the sake of open source innovation, use the tool that you admit is better.
This is...such a strange take. To follow your logic to an extreme, everyone should use a very small handful of languages that are the "best" in their domain with ne'er a care for their personal comfort or preference.
> for your own sake
They're sticking with Pandas exactly for their own sake since they like being able to use Copilot.
> for the sake of open source innovation
Ohh by all means let's all be constantly relearning and rehashing just to follow the latest and greatest in open source innovation this week.
Tools are designed to be _used_ and if you like using a tool _and_ it does the job you require of it that seems just fine to me; especially if you're also taking the time to evaluate what else is out there occasionally.
The Polars lib changes rapidly. I am not using Copilot, but I achieved very good results with ChatGPT by setting system instructions to let it know that e.g. with_column was replaced with with_columns, etc., and adding the updated doc information to the system instructions.
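For anyone bitten by that particular rename, the current spelling takes any number of expressions; a sketch (the cumulative-sum method has also been renamed across versions):

    import polars as pl

    df = pl.DataFrame({"a": [1, 2, 3]})
    df = df.with_columns(                        # formerly with_column (singular)
        (pl.col("a") * 2).alias("double"),
        pl.col("a").cum_sum().alias("running"),  # spelled cumsum in older releases
    )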
https://blog.jupyter.org/bringing-modern-javascript-to-the-j...
I'm not personally a Data Science guy, but considering how early the JS/Jupyter ecosystem is, it was surprisingly quick to get pola.rs-based analysis up and running in TypeScript.
The JS bindings certainly need a bit of love, but hopefully now that it's more accessible we'll see some iteration on it.
I'm really excited about Polars and its speed is super impressive, buuut... it annoys me to see vaex, modin, and dask all compared on the same benchmarks.
For anyone who doesn't use those libraries: they are all targeted towards out-of-core data processing (i.e. computing across multiple machines because your data is too big). Comparing them to a single-machine data frame library makes little sense, and they will obviously be slower because they necessarily come with a lot of overhead. It just wouldn't make sense to use Polars in the same context as those libraries, so seeing them presented in benchmarks as if they were equivalents is a little silly.
And on top of that, duckdb, which you might use in the same context as polars and is faster than polars in a lot of contexts, isn't included in the benchmarks.
The software engineering behind polars is amazing work and there's no need to have misleading benchmarks like this.
I don't know about the others but you can use Dask on a single machine, and it's also the easiest way to use Dask. It allows parallelizing operations by splitting dataframes into partitions that get processed in individual cores on your machine. Performance boost over pandas can be 2x with zero config, and I've seen up to 5x on certain operations.
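A single-machine Dask sketch along those lines (sizes invented):

    import dask.dataframe as dd
    import pandas as pd

    pdf = pd.DataFrame({"k": [1, 2] * 500_000, "v": range(1_000_000)})
    ddf = dd.from_pandas(pdf, npartitions=8)   # roughly one partition per core
    result = ddf.groupby("k")["v"].mean().compute()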
Ibis, a Python dataframe library created by the creator of pandas, uses DuckDB as the default backend and generally beats Polars on these benchmarks (with exceptions on some queries).
I don’t use Polars directly, but instead I use it as a materialization format in my DuckDB workflows.
duckdb.query(sql).pl() is much faster than duckdb.query(sql).df(). It’s zero-copy to Polars and happens instantaneously, while Pandas takes quite a while if the DataFrame is big. And you can manipulate it like a Pandas DataFrame (albeit with slightly different syntax).
It's great for working with big datasets.
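In code, the pattern looks like this (trivial query for illustration):

    import duckdb

    rel = duckdb.query("SELECT 42 AS answer")  # duckdb.sql() works the same way
    pl_df = rel.pl()   # hands the result to Polars via Arrow, effectively zero-copy
    pd_df = rel.df()   # the pandas conversion, noticeably slower on big results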
There must be a corollary to Greenspun's Tenth Rule (https://en.wikipedia.org/wiki/Greenspun's_tenth_rule) that any sufficiently complicated data analysis library contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of SQL.
I use Pandas from time to time and I'll probably try this out, but I always find myself wishing I'd just started with shoving whatever data I'm working with into Postgres.
It's not like I'm some database expert either, I'm far more comfortable with Python, but the facilities for selecting, sorting, filtering, joining, etc. on tabular data are just way better in SQL.
I recommend you look at DuckDB and the duckdb-prql extension.
DuckDB allows you to work on your Polars and Pandas data (and any data in Arrow format) directly using SQL without any copying or duplication.
The duckdb-prql extension allows you to use PRQL (prql-lang.org), which gives you all the power and universality of SQL with the ergonomics and DX of Polars or Pandas (IMHO).
Disclaimer: I'm a PRQL contributor.
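The no-copy direction works too: DuckDB's replacement scans can pick up a Polars frame by its variable name. A sketch:

    import duckdb
    import polars as pl

    df = pl.DataFrame({"x": [1, 2, 3]})
    # DuckDB finds `df` in the local scope and scans it via Arrow, no copy.
    out = duckdb.sql("SELECT sum(x) AS total FROM df").pl()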
You could do that, but it would likely both perform significantly worse (if you're doing "analytical" kinds of queries) and be a lot less flexible and expressive.
But you may want to look into DuckDB, which has a sql implementation that is not ad hoc, bug ridden, slow, or incomplete (though I honestly don't know about the formality of its specification). And it is compatible with polars :)
IMO lazy dataframe syntax is a far superior frontend to a query engine. Polars also has SQL support, but really, bugs generally don't come from the frontend; they come from the query engine.
Postgres would be an order of magnitude slower than OLAP query engines for the types of queries that people do with them.
This is a better approach to Python dataframes and SQL.