Readit News
farhanhubble · 3 months ago
yboris · 3 months ago
Amazing! Thank you for sharing.

Reminds me of how thinking using frequencies rather than computing probabilities is easier and can avoid errors (e.g. a 99% accurate test being positive does not mean 99% likelihood of having disease for a disease with a 1/10,000 prevalence in population).
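The base-rate point above is easy to verify by thinking in counts instead of probabilities (numbers chosen to match the example: a 99% accurate test, 1/10,000 prevalence):

```python
# Base-rate check: think in counts out of 1,000,000 people
# rather than multiplying probabilities.
population = 1_000_000
prevalence = 1 / 10_000
sensitivity = 0.99   # P(positive | sick)
specificity = 0.99   # P(negative | healthy)

sick = population * prevalence                 # 100 people
healthy = population - sick                    # 999,900 people
true_positives = sick * sensitivity            # 99
false_positives = healthy * (1 - specificity)  # 9,999

# Probability of actually being sick given a positive test:
p_sick_given_positive = true_positives / (true_positives + false_positives)
print(f"{p_sick_given_positive:.3%}")  # ~0.98%, nowhere near 99%
```

The false positives from the huge healthy population swamp the true positives, which is exactly what the frequency framing makes obvious.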

ellisv · 3 months ago
These types of books are always interesting to me because they tackle so many different things. They cover a range of topics at a high level (data manipulation, visualization, machine learning) and each could have its own book. They balance teaching programming while introducing concepts (and sometimes theory).

In short I think it's hard to strike an appropriate balance between these but this seems to be a good intro level book.

trio8453 · 3 months ago
This book was absolute fire for getting started with data science in 2017-2018, Jake is a great teacher.
mayankkaizen · 3 months ago
This was a solid book for getting into the data science and ML scene. Covers everything. Jake is a fantastic teacher. I really wish he'd come out with an updated second edition.
sschnei8 · 3 months ago
Interesting choice of Pandas in this day and age. Maybe he’s after imparting general concepts that you could apply to any tabular data manipulator rather than selecting for the latest shiny tool.
msto · 3 months ago
It was originally published in 2016, and I think this is still the first edition.
mkl · 3 months ago
dahcryn · 3 months ago
Why? It's the industry standard as far as my reach goes.

What other framework would you replace it with?

No, polars or spark is not a good answer, those are optimized for data engineering performance, not a holistic approach to data science.

crystal_revenge · 3 months ago
You can assert whatever you want, but Polars is a great answer. The performance improvements are secondary to me compared to the dramatic improvement in interface.

Today all serious DS work will ultimately become data engineering work anyway. The time when DS can just fiddle around in notebooks all day has passed.

porker · 3 months ago
> No, polars or spark is not a good answer, those are optimized for data engineering performance, not a holistic approach to data science.

Can you expand on why Polars isn't optimised for a holistic approach to data science?

minimaxir · 3 months ago
What can you do more easily in pandas than polars?
maxnoe · 3 months ago
The book is actually quite old; not sure "this day and age" still applies to it.
xenophonf · 3 months ago
What's wrong with Pandas?
crystal_revenge · 3 months ago
Pandas is generally awful unless you're just living in a notebook (and even then it's probably my least favorite implementation of the 'data frame' concept).

Since Pandas lacks Polars' concept of an Expression, it's actually quite challenging to programmatically interact with non-trivial Pandas queries. In Polars the query logic can be entirely independent of the data frame while still referencing specific columns of the data frame. This makes Polars data frames work much more naturally with typical programming abstractions.

Pandas multi-index is a bad idea in nearly all contexts other than its original use case: financial time series (and I'll admit, if you're working with purely financial time series, then Pandas feels much better). Sufficiently large Pandas code bases are littered with seemingly arbitrary uses of 'reset_index', there are many times where multi-index will create bugs, and, most importantly, I've never seen any non-financial scenario where anyone has used multi-index to their advantage.
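A small sketch of the reset_index pattern the comment describes (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "year": [2020, 2021, 2020, 2021],
    "sales": [10, 20, 30, 40],
})

# Aggregating over two keys silently produces a MultiIndex...
grouped = df.groupby(["region", "year"]).sum()
print(type(grouped.index).__name__)  # MultiIndex

# ...which is why pandas code bases sprout reset_index() calls to get
# back to plain columns before the next merge/join/plot step.
flat = grouped.reset_index()
print(list(flat.columns))  # ['region', 'year', 'sales']
```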

Finally, Pandas is slow. That's honestly the lowest priority for me personally, but using Polars is so refreshing.

What other data frames have you used? Having used R's native dataframes extensively (the way they make use of indexing is so much nicer) in addition to Polars, both are drastically preferable to Pandas. My experience is that most people use Pandas because it has long been the only data frame implementation in Python. But personally I'd rather just not use data frames if I'm forced to use Pandas. Could you expand on what you like about Pandas over other data frame models you've worked with?

clickety_clack · 3 months ago
I probably wouldn’t rewrite an entire data science stack that used pandas, but most people would use polars if starting a new project today.
amelius · 3 months ago
Pandas turns 10x developers with a lust for life into 0.1x developers with grey hairs.
wesleywt · 3 months ago
Nothing, it gets the job done for most people. If you don't like it, make a better tool. Polars is not it.

pantsforbirds · 3 months ago
I used the Kernel Density Estimation (KDE) page/blog at my very first job. It was immensely useful and I've loved his work ever since.
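For readers unfamiliar with KDE: the idea is to smooth a sample into a density estimate without picking histogram bins. A minimal sketch using SciPy's `gaussian_kde` (the handbook's own treatment uses scikit-learn-style estimators; this is just the simplest illustration):

```python
import numpy as np
from scipy.stats import gaussian_kde

# A bimodal sample: two Gaussian clusters at -2 and +2.
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(2, 0.5, 500)])

kde = gaussian_kde(sample)      # bandwidth chosen by Scott's rule
xs = np.linspace(-4, 4, 200)
density = kde(xs)               # evaluate the estimated pdf on a grid

# Sanity check: the estimated density should integrate to ~1.
total = (density * (xs[1] - xs[0])).sum()
print(round(total, 2))
```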
BenGosub · 3 months ago
He's a great writer and I miss his blog. He had an awesome post on pivot tables that I think is now a part of this book.
ayhanfuat · 3 months ago
He is also the creator of the Altair visualization library (Vega-Lite in Python https://altair-viz.github.io/). I really like using it.
linhns · 3 months ago
Thanks for the fact. I've used Altair sometimes and really admire its simplicity; I didn't know it was written by Jake.

synergy20 · 3 months ago
It was written 8 years ago, though. There is a 2nd edition of the book by the same author.
phone_book · 3 months ago
The linked GitHub seems to have the 2nd edition in the form of notebooks, https://github.com/jakevdp/PythonDataScienceHandbook/blob/ma..., under the Using Code Examples section: "attribution usually includes the title, author, publisher, and ISBN. For example: 'Python Data Science Handbook, 2nd edition, by Jake VanderPlas (O'Reilly). Copyright 2023...'" The OP's link, by comparison, has "The Python Data Science Handbook by Jake VanderPlas (O'Reilly). Copyright 2016..."
mayankkaizen · 3 months ago
There is a second edition?
refactor_master · 3 months ago
I honestly don't get why you'd hate pandas more than anything else in the Python ecosystem. It's probably not the best tool in the world, and sure, like everybody else I'd rewrite the universe in Rust if I could start over, and had infinite time to catch up.

But the code base I work on has thousands and THOUSANDS of lines of Pandas churning through big data, and I can't remember the last time it led to a bug or error in production.

We use pandas + static schema wrapper + type checker, so you'll have to get exotic to break things.
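One way to read "static schema wrapper": a class that declares the expected columns and dtypes once and validates at runtime boundaries. This is a hypothetical minimal version, not the poster's actual internal library:

```python
import pandas as pd

# Hypothetical minimal schema wrapper: declare expected columns/dtypes
# once, validate wherever a DataFrame crosses a function boundary.
class SalesSchema:
    columns = {"region": "object", "sales": "int64"}

    @classmethod
    def validate(cls, df: pd.DataFrame) -> pd.DataFrame:
        missing = set(cls.columns) - set(df.columns)
        if missing:
            raise ValueError(f"missing columns: {missing}")
        for name, dtype in cls.columns.items():
            if str(df[name].dtype) != dtype:
                raise TypeError(f"{name}: expected {dtype}, got {df[name].dtype}")
        return df

df = SalesSchema.validate(pd.DataFrame({"region": ["east"], "sales": [10]}))
```

Combined with a type checker annotating functions as taking/returning validated frames, malformed data fails loudly at the boundary instead of deep inside a pipeline.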

bormaj · 3 months ago
Custom schema wrapper or some package you'd recommend from pypi?
refactor_master · 3 months ago
Originally I used Pandera, but it had several issues last I checked:

* Mypy dependency and really bad PEP compliance
* Sub-optimal runtime check decorators
* Subclasses pd.DataFrame, so using e.g. .assign(...) makes the type checker think it's still the same type, but now you've violated your own schema

So I wrote my own library that solves all these issues, but it's currently company-internal. I've been meaning to push for open-sourcing it, but just haven't had the time.