mananaysiempre · a year ago
> NULL indicates a missing value, and NaN is Not a Number.

That’s actually less true than it sounds. One of the primary functions of NaN is to be the result of 0/0, so there it means that there could be a value but we don’t know what it is because we didn’t take the limit properly. One of the primary functions of NULL is to say that a tuple satisfies a predicate except that we don’t know what this one position is—it certainly is something out in the real world, we just don’t know what. These ideas are what motivate the comparison shenanigans both NaN and NULL are known for.

There’s certainly an argument to be made that the actual implementation of both of these ideas is half-baked in the respective standards, and that they are half-baked differently, so we shouldn’t confuse them. But I don’t think it’s fair to say that they are just completely unrelated. If anything, it’s Python’s None that doesn’t belong.
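
To make the “shenanigans” concrete, a quick sketch (NumPy for the 0/0 part, the stdlib sqlite3 for the SQL part):

    import sqlite3
    import numpy as np

    nan = np.float64(0.0) / np.float64(0.0)   # 0/0 -> nan (NumPy emits a RuntimeWarning)
    print(nan == nan)                         # False: NaN compares unequal even to itself

    con = sqlite3.connect(":memory:")
    print(con.execute("SELECT NULL = NULL").fetchone())   # (None,): NULL = NULL is NULL, not TRUE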

miohtama · a year ago
Related to this, someone below posted a link to Wes McKinney's blog post where he discusses pandas' limitations and how Apache Arrow works around them:

> 4. Doing missing data right

> All missing data in Arrow is represented as a packed bit array, separate from the rest of the data. This makes missing data handling simple and consistent across all data types. You can also do analytics on the null bits (AND-ing bitmaps, or counting set bits) using fast bit-wise built-in hardware operators and SIMD.

> The null count in an array is also explicitly stored in its metadata, so if data does not have nulls, we can choose faster code paths that skip null checking. With pandas, we cannot assume that arrays do not have null sentinel values and so most analytics has extra null checking which hurts performance. If you have no nulls, you don’t even need to allocate the bit array.

> Because missing data is not natively supported in NumPy, over time we have had to implement our own null-friendly versions of most key performance-critical algorithms. It would be better to have null-handling built into all algorithms and memory management from the ground up.

https://wesmckinney.com/blog/apache-arrow-pandas-internals/
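
Roughly what that looks like from Python with pyarrow (the exact buffer contents here are an illustration, not a guarantee):

    import pyarrow as pa

    arr = pa.array([1, None, 3, None, 5])
    print(arr.null_count)    # 2 -- kept in the array's metadata, no scan needed
    print(arr.is_valid())    # [true, false, true, false, true]

    validity, values = arr.buffers()        # first buffer is the packed validity bitmap
    print(bin(validity.to_pybytes()[0]))    # e.g. 0b10101 -- one bit per value, LSB first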

jononor · a year ago
Note that the post is from 2017, and pandas now has (optional) support for PyArrow-backed dataframes, so some of the critiques presented there have since been addressed.
cpcloud · a year ago
100% this.

We'd love to have a docs contribution that lays this out in detail if you'd be up for it!

Disclaimer: I lead the project and work on it full time.

aeonik · a year ago
Why don't we build a numeric system that just has these mathematical constants and definitions built in, in a portable and performant manner?
verandaguy · a year ago
We just don’t have the technology for that, else there’d be libraries floating around that let you do that.
OptionOfT · a year ago
> One of the primary functions of NULL is to say that a tuple satisfies a predicate except that we don’t know what this one position is—it certainly is something out in the real world, we just don’t know what.

I'm not sure I understand this. Can you explain it with an example?

I'm thinking something like:

    let (x, y) = NULL;
But then I'm stuck.

mananaysiempre · a year ago
What I mean is, the ivory-tower relational model is that each table (“relation”) represents a predicate (Boolean-valued function) accepting as many arguments as there are columns (“attributes”), and the rows (“tuples”) of the table are an exhaustive listing of those combinations of arguments for which the predicate yields true (“holds”). E.g. the relation Lived may contain a tuple (Edgar Codd, 1923-08-19, 2003-04-18) to represent the fact that Codd was born on 19 August 1923 and died on 18 April 2003.

One well-established approach[1] to NULLs is that they extend the above “closed-world assumption” of exhaustiveness to encompass values we don’t know. For example, the same relation could also contain the tuple (Leslie Lamport, 1941-02-07, NULL) to represent that Lamport was born on 7 February 1941 and lives to the present day.

We could then try to have some sort of three-valued logic to propagate this notion of uncertainty: define NULL = x to be (neither true nor false but) NULL (“don’t know”), and then any Boolean operation involving NULL to also yield NULL; that’s more or less what SQL does. As far as I know, it’s possible to make this consistent, but it’ll always be weaker than necessary: a complete solution would instead assign a variable to each unknown value of this sort and recognize that those values are, at the very least, equal to themselves, but dealing with this is NP-complete, as it essentially amounts to implementing Prolog.

[1] http://www.esp.org/foundations/database-theory/holdings/codd...
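
A toy version of that three-valued AND in Python, with None standing in for NULL (note that in SQL's actual rule, sketched here, a known FALSE still decides the result even when the other operand is unknown):

    def and3(a, b):
        """Three-valued (Kleene) AND: None means "don't know"."""
        if a is False or b is False:
            return False        # a known False decides the result
        if a is None or b is None:
            return None         # otherwise an unknown operand makes the result unknown
        return True

    print(and3(True, None))    # None -- don't know
    print(and3(False, None))   # False -- known regardless of the unknown operand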

kstrauser · a year ago
In the broader DB theory context and not strictly Pandas, think of a `users` table with a `middle_name` column. There's a difference between `middle_name is null` (we don't know whether they have a middle name or what it is if they do) and `middle_name = ''` (we know that they don't have a middle name).

In that case, `select * from users where middle_name is null` gives us a set of users to prod for missing information next time we talk to them. `...where middle_name = ''` gives us a list of people without a middle name, should we care.
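
The same thing, runnable against an in-memory SQLite table (toy schema):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE users (name TEXT, middle_name TEXT)")
    con.executemany("INSERT INTO users VALUES (?, ?)",
                    [("Ada", "King"), ("Bob", ""), ("Cho", None)])

    print(con.execute("SELECT name FROM users WHERE middle_name IS NULL").fetchall())  # [('Cho',)]
    print(con.execute("SELECT name FROM users WHERE middle_name = ''").fetchall())     # [('Bob',)]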

Edit:

A related aside is that Django's ORM's handling of NULL gives me freaking hives. From https://docs.djangoproject.com/en/5.1/ref/models/fields/#nul...:

> If a string-based field has null=True, that means it has two possible values for “no data”: NULL, and the empty string. In most cases, it’s redundant to have two possible values for “no data;” the Django convention is to use the empty string, not NULL.

I've been around numerous Django devs when they found out that other DBs and ORMs do not remotely consider NULL and empty string to be the same thing.

__mharrison__ · a year ago
Not surprising. There are much better compute engines than pandas.

Folks then ask why not jump from pandas to [insert favorite tool]?

- Existing codebases. Lots of legacy pandas floating about.

- Third party integration. Everyone supports pandas. Lots of libraries work with tools like Polars, but everything works with pandas.

- YAGNI - For lots of small data tasks, pandas is perfectly fine.

0cf8612b2e1e · a year ago
Even for medium tasks, the slow step is rarely the computation, but the squishy human entering it.

When I work on a decent 50GB+ dataset, I have to be doing something fairly naive before computation time becomes frustrating.

Edit: pandas is now Boring Technology (not a criticism). It is a solid default choice. In contrast, we are still in a Cambrian explosion of NeoPandas wannabes. I have no idea who will win, but there is a lot of fragmentation, which makes it difficult for me to jump head first into one of these alternatives. Most of them offer more speed or less RAM pressure, which is really low on my list of problems.

cpcloud · a year ago
All great reasons to stick with Pandas. If a thing works for you, then that's one less thing to manage.

Fully agree that if pandas is working for you, then you're better off sticking with it!

halfcat · a year ago
> YAGNI - For lots of small data tasks, pandas is perfectly fine.

There’s a real possibility the growth of single node RAM capacity will perpetually outpace the growth of a business’s data. AWS has machines with 32 TB RAM now.

It’s a real question as to whether big data systems become more accessible before single node machines become massive enough to consume almost every data set.

(Also love your work, thank you)

__mharrison__ · a year ago
But... but we have to have big data and a cluster...

The hoops folks will jump through...

(Thanks!)

whimsicalism · a year ago
If your task is too big for pandas, you should probably skip right over dask and polars for a better compute engine.
__mharrison__ · a year ago
Jump straight to cluster and skip Dask?

Not sure what "big" means here, but a combination of .pipe, pyarrow, and polars can speed up many slow Pandas operations.

Polars streaming is surprisingly good for larger than RAM. I get that clusters are cool, but I prefer to keep it on a single machine if possible.

Also, libraries like cudf can greatly speed up Pandas code on a single machine, while Snowpark can scale Pandas code to Snowflake scale.
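
For example, the larger-than-RAM pattern in Polars looks roughly like this (file and column names made up; the exact streaming knob has moved around between Polars versions):

    import polars as pl

    lazy = (
        pl.scan_parquet("events.parquet")          # nothing is read yet
          .filter(pl.col("status") == "error")
          .group_by("service")
          .agg(pl.len().alias("n_errors"))
    )
    df = lazy.collect(streaming=True)              # execute in chunks that fit in RAM
    print(df)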

riezebos · a year ago
Nice to see! Over the past months I've replaced pandas with ibis in all new projects and I am a huge fan!

- Syntax in general feels more fluid than pandas

- Chaining operations with deferred expressions makes code snippets very portable (see the sketch at the end of this comment)

- Duckdb backend is super fast

- Community is very active, friendly and responsive

I'm trying to promote it to all my peers but it's not a very well known project in my circles. (Unlike Polars which seems to be the subject of 10% of the talks at all Python conferences)
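
A small sketch of the deferred/chaining style mentioned above (made-up data; runs on the default DuckDB backend):

    import ibis
    from ibis import _

    t = ibis.memtable({"zone": ["A", "A", "B"], "fare": [5.0, 12.5, 30.0]})

    expr = (
        t.filter(_.fare > 6)
         .group_by("zone")
         .agg(avg_fare=_.fare.mean())
    )
    print(expr.to_pandas())   # executed by the default (DuckDB) backend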

kremi · a year ago
Pandas has been working fine for me. The most powerful feature that makes me stick to it is the multi-index (hierarchical indexes) [1]. Can be used for columns too. Not sure how the cool new kids like polars or ibis would fare in that category.

[1] https://pandas.pydata.org/docs/user_guide/advanced.html#adva...
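
For anyone who hasn't used them, a minimal sketch of the kind of slicing this buys you (toy data):

    import numpy as np
    import pandas as pd

    idx = pd.MultiIndex.from_product([["dev1", "dev2"], ["temp", "humidity"]],
                                     names=["device", "sensor"])
    df = pd.DataFrame({"value": np.arange(4.0)}, index=idx)

    print(df.loc["dev1"])                 # everything for one device
    print(df.xs("temp", level="sensor"))  # one sensor across all devices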

cpcloud · a year ago
Multi-indexes definitely have their place. In fact, I got involved in pandas development in 2013 as part of some work I was doing in graduate school, and I was a heavy user of multi-indexed columns. I loved them.

Over time, and after working on a variety of use cases, I personally have come to believe the baggage introduced by these data structures wasn't worth it. Take a look at the indexing code in pandas, and the staggering complexity of what's possible to put inside square brackets and how to decipher its meaning. The maintenance cost alone is quite high.

We don't plan to ever support multi-indexed rows or columns in Ibis. I don't think we'd fare well _at all_ there, intentionally so.

kremi · a year ago
> Take a look at the indexing code in pandas

As the end-user, not quite my concern.

> and the staggering complexity of what's possible to put inside square brackets and how to decipher its meaning

I might not be aware of everything that's possible -- the usage I have of it doesn't give me an impression of staggering complexity. In fact I've found the functionality quite basic, and have been using pd.MultiIndex.from_* quite extensively for anything slightly more advanced than selecting a bunch of values at some level of the index.

jononor · a year ago
I work a lot with IoT data, where basically everything is multi-variate time-series from multiple devices (at different physical locations and logical groupings). Pandas multi index is very nice for this, at least having time+space in the index.
highfrequency · a year ago
Is your workload mostly single-threaded? If so, is that due to dataset size, or machine core count?
kremi · a year ago
Sorry I don't know what to answer. I don't think what I do qualifies as "workload".

I have a process that generates lots of data. I put it in a huge multi-indexed dataframe that luckily fits in RAM. I then slice out the part I need and pass it on to some computation (at which point the data usually becomes a numpy array or a torch tensor). Core-count is not really a concern as there's not much going on other than slicing in memory.

The main gain I get of this approach is prototyping velocity and flexibility. Certainly sub-optimal in terms of performance.

frakt0x90 · a year ago
Curious if you considered Polars. That's become the de facto standard in my group as we all dislike pandas.
codyvoda · a year ago
Polars is a great choice, but like pandas it locks you into its execution engine(s). Polars is a supported backend for Ibis, though depending on your use case DuckDB or DataFusion may scale better. We did some benchmarking recently: https://ibis-project.org/posts/1tbc/
marcogorelli · a year ago
Also depending on your use case, Ibis may not support the part of the Polars API you actually need

(to be clear, I'm a fan of "DuckDB via Ibis", I'm much less a fan of statements to the effect of "use Polars via Ibis instead of Polars directly", as I think they miss a lot :wink:)

cpcloud · a year ago
We support Polars as a backend.

DuckDB was chosen as the default because Polars was too unstable to be the default backend two years ago.

amelius · a year ago
Good riddance.

Pandas turns 10x programmers into 1x programmers.

data-ottawa · a year ago
I've been using pandas for over a decade and I'm very efficient with it.

The Modern Pandas blog series from Tom Augspurger is excellent, and the old Wes McKinney book (he's the creator of pandas) gave a good glimpse into how to use it effectively.

I don't think many people learn it correctly so it gets used in a very inefficient and spaghetti script style.

It's definitely a tool that's showing its age, but it's still a very effective Swiss Army knife for data processing.

I am glad we're getting better alternatives; Hadley Wickham's tidyverse really showed what data processing can be.

__mharrison__ · a year ago
Curious if you could expand on this?
Kalanos · a year ago
Lol. As if different tools for filtering rows/columns determine a programmer's capability.
fifilura · a year ago
What is the difference, other than parallel execution?
magnio · a year ago
- Much faster and more memory efficient.

- Consistent Expression and SQL-like API.

- Lazy execution mode, where queries are compiled and optimized before running.

- Sane NaN and null handling (see the snippet below).

- Much faster.

- Much more memory efficient.
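
On the null point, a small sketch of how Polars keeps missing values and float NaN distinct:

    import polars as pl

    s = pl.Series("x", [1.0, None, float("nan")])
    print(s.is_null().to_list())   # [False, True, False]
    print(s.is_nan().to_list())    # [False, None, True] -- checking NaN on a missing value stays null
    print(s.null_count())          # 1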

ang_cire · a year ago
I wouldn't switch to Polars as a replacement to Pandas, given that Polars also seems likely to die soon. I'd also recommend avoiding Manatees.
cpcloud · a year ago
On the flip side, Ibises (Ibisii ??) have been around for years and aren't going anywhere.
highfrequency · a year ago
Why do you think polars is likely to die soon?
softwaredoug · a year ago
One thing I do like about pandas is it’s pretty extensible to columns of new types. Maybe I’m missing something, but does Polars allow this? Last time I checked there wasn’t a clear path forward.
porker · a year ago
I'm curious, when do you want to add a new column type? It's not a situation I've been in.
softwaredoug · a year ago
GeoPandas is one popular library (https://geopandas.org/en/stable/)

I added a column type for full text search, searching tokenized text (https://github.com/softwaredoug/searcharray)

The pandas support for extensions is very robust, with extensive testing to make sure you can do all the data aggregation functionality you expect on any pandas column.
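
GeoPandas is a good illustration of that: it plugs a geometry dtype into ordinary pandas machinery via the extension API (needs geopandas/shapely installed):

    import geopandas as gpd
    from shapely.geometry import Point

    gdf = gpd.GeoDataFrame({"name": ["a", "b"],
                            "geometry": [Point(0, 0), Point(1, 1)]})
    print(gdf.dtypes)       # the geometry column reports its own extension dtype
    print(gdf.geometry.x)   # extension-specific accessors still work per column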

deadlypointer · a year ago
OTOH the list and struct types allow lots of complex operations, e.g. with `.eval`, which run blazingly fast.
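
For anyone who hasn't seen it, something like this (made-up column; Polars' `list.eval` with `pl.element()`):

    import polars as pl

    df = pl.DataFrame({"vals": [[1, 2, 3], [4, 5]]})
    out = df.with_columns(doubled=pl.col("vals").list.eval(pl.element() * 2))
    print(out)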
datadrivenangel · a year ago
Ibis allows this with custom datatypes.
spywaregorilla · a year ago
What is this exactly?

The front page doesn't make it clear to me what ibis is besides being an alternative to pandas/polars.

prepend · a year ago
Check the project’s left nav, https://ibis-project.org/why

I didn’t know either. It’s a python dataframe library/api.

smu3l · a year ago
Never heard of ibis before, but the front page gives a pretty good overview of what it is, to me at least. Looks like a dataframes API that can be executed on a variety of backend engines, both distributed and local.

Somewhat analogous to how the pandas API can be used in PySpark via pyspark.pandas, but going in the API -> implementation direction rather than (pandas) implementation -> API -> implementation, maybe?

giovannibonetti · a year ago
As far as I can tell, it is an ORM for data engineers. They can write Python code that gets translated to either SQL or some other language understood by the DB engine that actually runs it.
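
Concretely, you can ask Ibis for the SQL it would emit (a sketch with a made-up unbound table):

    import ibis
    from ibis import _

    t = ibis.table({"user_id": "int64", "amount": "float64"}, name="payments")
    expr = t.group_by("user_id").agg(total=_.amount.sum())
    print(ibis.to_sql(expr))   # the SELECT ... GROUP BY ... a backend would actually run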
cpcloud · a year ago
Ibis is not an ORM. It has no opinion of or knowledge about your data model, or relationships between tables and their columns.

Check out SQLAlchemy's ORM (https://docs.sqlalchemy.org/en/20/orm/) for what that looks like.

joelschw · a year ago
Huge fan of Ibis. The value isn't that you can now use DuckDB... it's that your syntax will work when the next cool thing arrives, too.