> NULL indicates a missing value, and NaN is Not a Number.
That’s actually less true than it sounds. One of the primary functions of NaN is to be the result of 0/0, so there it means that there could be a value but we don’t know what it is because we didn’t take the limit properly. One of the primary functions of NULL is to say that a tuple satisfies a predicate except we don’t know what this one position is; it’s certainly something out in the real world, we just don’t know what. These ideas are what motivate the comparison shenanigans both NaN and NULL are known for.
There’s certainly an argument to be made that the actual implementation of both of these ideas is half-baked in the respective standards, and that they are half-baked differently, so we shouldn’t confuse them. But I don’t think it’s fair to say that they are just completely unrelated. If anything, it’s Python’s None that doesn’t belong.
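To make the comparison shenanigans concrete, here is a minimal sketch (Python with NumPy and the stdlib sqlite3, purely illustrative) of how the two standards behave:

```python
import math
import sqlite3

import numpy as np

# IEEE 754: 0/0 is NaN, and NaN compares unequal to everything, itself included.
with np.errstate(invalid="ignore"):
    x = np.float64(0.0) / np.float64(0.0)
print(np.isnan(x))              # True
print(x == x)                   # False
print(math.nan == math.nan)     # False

# SQL: NULL = NULL is neither true nor false, it is NULL (surfaced as None here).
con = sqlite3.connect(":memory:")
print(con.execute("SELECT NULL = NULL").fetchone())   # (None,)
print(con.execute("SELECT NULL IS NULL").fetchone())  # (1,)
```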
Related to this, someone below posted a link to Wes McKinney's blog post (https://wesmckinney.com/blog/apache-arrow-pandas-internals/) where he discussed pandas' limitations and how PyArrow works around them:
> 4. Doing missing data right
> All missing data in Arrow is represented as a packed bit array, separate from the rest of the data. This makes missing data handling simple and consistent across all data types. You can also do analytics on the null bits (AND-ing bitmaps, or counting set bits) using fast bit-wise built-in hardware operators and SIMD.
> The null count in an array is also explicitly stored in its metadata, so if data does not have nulls, we can choose faster code paths that skip null checking. With pandas, we cannot assume that arrays do not have null sentinel values and so most analytics has extra null checking which hurts performance. If you have no nulls, you don’t even need to allocate the bit array.
> Because missing data is not natively supported in NumPy, over time we have had to implement our own null-friendly versions of most key performance-critical algorithms. It would be better to have null-handling built into all algorithms and memory management from the ground up.
Note that the post is from 2017, and pandas now has (optional) support for PyArrow-backed dataframes, so there has been movement to address the critiques presented there.
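For a concrete view of what the quote describes, a small PyArrow sketch (the values are made up):

```python
import pyarrow as pa

arr = pa.array([1.0, None, 3.0])

print(arr.null_count)   # 1 -- stored with the array, so all-valid arrays can skip null checks
print(arr.is_null())    # boolean array marking the missing slot
validity, values = arr.buffers()   # first buffer is the packed validity bitmap
print(validity)         # a pyarrow Buffer, one bit per value
```

In recent pandas (2.x, with pyarrow installed) you can opt into this representation with, e.g., `pd.array([1, None, 3], dtype="int64[pyarrow]")`.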
Why don't we build a portable numeric system that just has these mathematical constants and definitions built in, in a performant manner?
> One of the primary functions of NULL is to say that a tuple satisfies a predicate except we don’t know what this one position is; it’s certainly something out in the real world, we just don’t know what.
I'm not sure I understand this. Can you explain it with an example?
What I mean is, the ivory-tower relational model is that each table (“relation”) represents a predicate (Boolean-valued function) accepting as many arguments as there are columns (“attributes”), and the rows (“tuples”) of the table are an exhaustive listing of those combinations of arguments for which the predicate yields true (“holds”). E.g. the relation Lived may contain a tuple (Edgar Codd, 1923-08-19, 2003-04-18) to represent the fact that Codd was born on 19 August 1923 and died on 18 April 2003.
One well-established approach[1] to NULLs is that they extend the above “closed-world assumption” of exhaustiveness to encompass values we don’t know. For example, the same relation could also contain the tuple (Leslie Lamport, 1941-02-07, NULL) to represent that Lamport was born on 7 February 1941 and lives to the present day.
We could then try to have some sort of three-valued logic to propagate this notion of uncertainty: define NULL = x to be (neither true nor false but) NULL (“don’t know”), and then any Boolean operation involving NULL to also yield NULL; that’s more or less what SQL does. As far as I know, it’s possible to make this consistent, but it’ll always be weaker than necessary: a complete solution would instead assign a variable to each unknown value of this sort and recognize that those values are, at the very least, equal to themselves, but dealing with this is NP-complete, as it essentially amounts to implementing Prolog.
[1] http://www.esp.org/foundations/database-theory/holdings/codd...
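pandas' newer `pd.NA` scalar follows the same three-valued (Kleene) logic described above, which makes the "don't know" propagation easy to poke at:

```python
import pandas as pd

print(pd.NA == pd.NA)   # <NA>  -- not True: two unknowns need not be equal
print(True | pd.NA)     # True  -- true OR anything is true
print(False | pd.NA)    # <NA>
print(True & pd.NA)     # <NA>
print(False & pd.NA)    # False -- false AND anything is false
```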
In the broader DB theory context and not strictly Pandas, think of a `users` table with a `middle_name` column. There's a difference between `middle_name is null` (we don't know whether they have a middle name or what it is if they do) and `middle_name = ''` (we know that they don't have a middle name).
In that case, `select * from users where middle_name is null` gives us a set of users to prod for missing information next time we talk to them. `...where middle_name = ''` gives us a list of people without a middle name, should we care.
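The same distinction carries over to pandas, where `None`/`NaN` plays the role of NULL and the empty string stays an ordinary value (a toy frame, names made up):

```python
import pandas as pd

users = pd.DataFrame({
    "name": ["Ada", "Grace", "Linus"],
    "middle_name": [None, "", "Benedict"],   # unknown, known-absent, known
})

to_ask = users[users["middle_name"].isna()]       # prod these users for info
no_middle = users[users["middle_name"] == ""]     # confirmed: no middle name
```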
A related aside is that Django's ORM's handling of NULL gives me freaking hives. From https://docs.djangoproject.com/en/5.1/ref/models/fields/#nul...:
> If a string-based field has null=True, that means it has two possible values for “no data”: NULL, and the empty string. In most cases, it’s redundant to have two possible values for “no data;” the Django convention is to use the empty string, not NULL.
I've been around when numerous Django devs found out that other DBs and ORMs do not remotely consider NULL and the empty string to be the same thing.
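For reference, the two conventions look like this on a Django model field (a sketch only; the model and field names are made up, and the `null=True` variant is the one the quoted docs discourage for string fields):

```python
from django.db import models

class Person(models.Model):
    # Django convention: "" means "no data"; the column stays NOT NULL.
    middle_name = models.CharField(max_length=100, blank=True, default="")

    # Opting into NULL as well gives two distinct "no data" values (NULL and ""),
    # which is exactly the redundancy the quoted docs warn about:
    # middle_name = models.CharField(max_length=100, blank=True, null=True)
```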
Even for medium tasks, the slow step is rarely the computation but the squishy human entering it.
Folks then ask why not jump from pandas to [insert favorite tool]?
- Existing codebases. Lots of legacy pandas floating about.
- Third party integration. Everyone supports pandas. Lots of libraries work with tools like Polars, but everything works with pandas.
- YAGNI - For lots of small data tasks, pandas is perfectly fine.
When I work on a decent 50GB+ dataset, I have to be doing something fairly naive before I get frustrated at computation time.
Edit: pandas is now Boring Technology (not a criticism). It is a solid default choice. In contrast, we are still in a Cambrian explosion of NeoPandas wannabes. I have no idea who will win, but there is a lot of fragmentation, which makes it difficult for me to jump head first into one of these alternatives. Most of them offer more speed or less RAM pressure, which is really low on my list of problems.
> YAGNI - For lots of small data tasks, pandas is perfectly fine.
There’s a real possibility the growth of single node RAM capacity will perpetually outpace the growth of a business’s data. AWS has machines with 32 TB RAM now.
It’s a real question as to whether big data systems become more accessible before single node machines become massive enough to consume almost every data set.
The hoops folks will jump through...
Not sure what "big" means here, but a combination of .pipe, pyarrow, and polars can speed up many slow Pandas operations.
Polars streaming is surprisingly good for larger than RAM. I get that clusters are cool, but I prefer to keep it on a single machine if possible.
Also, libraries like cudf can greatly speed up Pandas code on a single machine, while Snowpark can scale Pandas code to Snowflake scale.
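As an illustration of the `.pipe` plus Polars hand-off idea, a minimal sketch (column names are made up; `group_by` is the spelling in recent Polars, older releases call it `groupby`):

```python
import pandas as pd
import polars as pl

def heavy_agg(pdf: pd.DataFrame) -> pd.DataFrame:
    # Run the hot aggregation in Polars, then come back to pandas.
    return (
        pl.from_pandas(pdf)
        .group_by("key")
        .agg(pl.col("value").sum().alias("total"))
        .to_pandas()
    )

df = pd.DataFrame({"key": ["a", "b", "a"], "value": [1, 2, 3]})
result = df.pipe(heavy_agg)
```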
Nice to see! Over the past months I've replaced pandas with ibis in all new projects and I am a huge fan!
- Syntax in general feels more fluid than pandas
- Chaining operations with deferred expressions makes code snippets very portable (see the sketch below)
- DuckDB backend is super fast
- Community is very active, friendly and responsive
I'm trying to promote it to all my peers but it's not a very well known project in my circles. (Unlike Polars which seems to be the subject of 10% of the talks at all Python conferences)
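A minimal sketch of that chained, deferred style on the DuckDB backend (the file and column names are made up, and API details can vary across Ibis versions):

```python
import ibis

con = ibis.duckdb.connect()               # in-memory DuckDB
t = con.read_parquet("events.parquet")    # hypothetical file

expr = (
    t.filter(t.status == "ok")
     .group_by("user_id")
     .aggregate(n_events=t.event_id.count())
     .order_by(ibis.desc("n_events"))
)

df = expr.to_pandas()   # nothing executes until this point
```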
Pandas has been working fine for me. The most powerful feature that makes me stick to it is the multi-index (hierarchical indexes) [1]. Can be used for columns too. Not sure how the cool new kids like polars or ibis would fare in that category.
[1] https://pandas.pydata.org/docs/user_guide/advanced.html#adva...
Multi-indexes definitely have their place. In fact, I got involved in pandas development in 2013 as part of some work I was doing in graduate school, and I was a heavy user of multi-indexed columns. I loved them.
Over time, and after working on a variety of use cases, I personally have come to believe the baggage introduced by these data structures wasn't worth it. Take a look at the indexing code in pandas, and the staggering complexity of what's possible to put inside square brackets and how to decipher its meaning. The maintenance cost alone is quite high.
We don't plan to ever support multi-indexed rows or columns in Ibis. I don't think we'd fare well _at all_ there, intentionally so.
As the end-user, not quite my concern.
> and the staggering complexity of what's possible to put inside square brackets and how to decipher its meaning
I might not be aware of everything that's possible -- the usage I have of it doesn't give me an impression of staggering complexity. In fact I've found the functionality quite basic, and have been using pd.MultiIndex.from_* quite extensively for anything slightly more advanced than selecting a bunch of values at some level of the index.
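For readers who haven't used it, a small sketch of the kind of usage being discussed (toy data, made-up device names):

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["dev1", "dev2"], pd.date_range("2024-01-01", periods=3, freq="h")],
    names=["device", "time"],
)
df = pd.DataFrame({"temp": np.arange(6.0)}, index=idx)

df.loc["dev1"]                                       # everything for one device
df.xs(pd.Timestamp("2024-01-01 01:00"), level="time")  # one timestamp across devices
df.loc[pd.IndexSlice[:, :pd.Timestamp("2024-01-01 01:00")], :]  # slice the inner level
```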
I work a lot with IoT data, where basically everything is multi-variate time-series from multiple devices (at different physical locations and logical groupings). Pandas multi index is very nice for this, at least having time+space in the index.
Sorry I don't know what to answer. I don't think what I do qualifies as "workload".
I have a process that generates lots of data. I put it in a huge multi-indexed dataframe that luckily fits in RAM. I then slice out the part I need and pass it on to some computation (at which point the data usually becomes a numpy array or a torch tensor). Core-count is not really a concern as there's not much going on other than slicing in memory.
The main gain I get of this approach is prototyping velocity and flexibility. Certainly sub-optimal in terms of performance.
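Under stated assumptions (a made-up device/time layout), the slice-then-hand-off step looks roughly like this:

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["dev1", "dev2"], pd.date_range("2024-01-01", periods=600, freq="s")],
    names=["device", "time"],
)
df = pd.DataFrame(np.random.randn(len(idx), 3), index=idx, columns=["x", "y", "z"])

# Slice one device's window out of the big in-RAM frame...
window = df.xs("dev2", level="device").loc["2024-01-01 00:02":"2024-01-01 00:05"]

# ...and hand the raw block to the numeric code (NumPy here; torch.from_numpy works the same way).
arr = window.to_numpy()
```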
Polars is a great choice but, like pandas, locks you into its execution engine(s). Polars is a supported backend for Ibis, though depending on your use case DuckDB or DataFusion may scale better. We did some benchmarking recently: https://ibis-project.org/posts/1tbc/
Also, depending on your use case, Ibis may not support the part of the Polars API you actually need.
(to be clear, I'm a fan of "DuckDB via Ibis", I'm much less a fan of statements to the effect of "use Polars via Ibis instead of Polars directly", as I think they miss a lot :wink:)
DuckDB was chosen as the default because Polars was too unstable to be the default backend two years ago.
Pandas turns 10x programmers into 1x programmers.
I've been using pandas for over a decade and I'm very efficient with it.
The Modern Pandas blog series from Tom Augspurger is excellent, and the old book by Wes McKinney (the creator of pandas) gave a good glimpse into how to use it effectively.
I don't think many people learn it correctly, so it gets used in a very inefficient, spaghetti-script style.
It's definitely a tool that's showing its age, but it's still a very effective Swiss Army knife for data processing.
I am glad we're getting better alternatives; Hadley Wickham's tidyverse really showed what data processing can be.
- Consistent Expression and SQL-like API.
- Lazy execution mode, where queries are compiled and optimized before running (see the sketch below).
- Sane NaN and null handling.
- Much faster.
- Much more memory efficient.
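Assuming the list above is describing a Polars-style engine (an assumption on my part), here is a minimal sketch of the lazy mode it mentions (file and column names made up):

```python
import polars as pl

lazy = (
    pl.scan_csv("events.csv")                  # builds a plan, reads nothing yet
    .filter(pl.col("status") == "ok")
    .group_by("user_id")
    .agg(pl.col("value").sum().alias("total"))
)

print(lazy.explain())   # the optimized plan (predicate/projection pushdown, etc.)
df = lazy.collect()     # execution happens here
```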
One thing I do like about pandas is that it’s pretty extensible to columns of new types. Maybe I’m missing something, but does Polars allow this? Last time I checked there wasn’t a clear path forward.
I added a column type for full text search, searching tokenized text (https://github.com/softwaredoug/searcharray).
The pandas support for extensions is very robust, with extensive testing to make sure you can do all the data aggregation functionality you expect on any pandas column.
The front page doesn't make it clear to me what ibis is besides being an alternative to pandas/polars.
I didn’t know either. It’s a python dataframe library/api.
Never heard of ibis before, but the front page gives a pretty good overview of what it is to me at least. Looks like a dataframes api that can be executed on a variety of backend engines, both distributed and local.
Somewhat analogous to how the pandas api can be used in pyspark via pyspark pandas, but from the api -> implementation direction rather than (pandas) implementation -> api -> implementation maybe?
As far as I can tell, it is an ORM for data engineers. They can write Python code that gets translated to either SQL or some other language understood by the DB engine that actually runs it.
Check out SQLAlchemy's ORM (https://docs.sqlalchemy.org/en/20/orm/) for what that looks like.
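The ORM analogy can be seen directly: an Ibis expression compiles to SQL for whichever backend runs it. A tiny sketch (the table and column names are made up; `ibis.to_sql` falls back to a default dialect if you don't pass a backend, and details vary by Ibis version):

```python
import ibis

t = ibis.table({"user_id": "int64", "amount": "float64"}, name="payments")
expr = t.group_by("user_id").aggregate(total=t.amount.sum())

print(ibis.to_sql(expr))   # the SQL a backend such as DuckDB would execute
```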
We'd love to have a docs contribution that lays this out in detail if you'd be up for it!
Disclaimer: (I lead the project and work on it full time).
Fully agree that if pandas is working for you, then you're better off sticking with it!
(Also love your work, thank you)
(Thanks!)