I'm really not sure how "comprehensive" I would call this. After glancing over it, it looks like one of the million basic pandas plotting guides that already exist.
What should one do if, after following those million guides, it still doesn't stick? I always end up googling what I want and landing on SO. Is there something wrong with me, or do numpy and pandas seem more difficult than they should be?
The pandas API is a mess and that's why it feels that way. It is a great tool, and very powerful, but boy does it make the user's life difficult with a pretty convoluted API that makes it very hard to automatically discover functionality. As others have said, unless you use it all the time, you're essentially in for a bad time that involves lots of Googling and SO browsing even to get basic things done in pandas. I say this as someone who has developed a few smaller projects that use pandas extensively, so I'm not just criticizing it without having used it.
I really wouldn't worry about it - I've been using Python for data work for the last 5 years or so and I have to look stuff up all the time. Eventually the basic stuff sticks, but it's like any kind of coding: I don't think anyone ever hits the point where they hardly ever have to Google stuff.
One useful tip I can suggest is to create a repo for useful code snippets, so if you ever find yourself doing something new that you think you might need again, just spend a bit of time commenting and describing it and add it to the repo. That way instead of having to spend time searching you'll hopefully remember doing it before and be able to find it easily.
It is possible to do the same things in Pandas in many different ways. It is good to have this flexibility, but it is confusing for new users. On top of that, the bracket operator ([]) is overloaded in many ways.
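To make the overloading concrete, here is a small sketch of four things plain brackets do depending on what you pass in. The DataFrame and its column names are made up for illustration.

```python
import pandas as pd

# Toy frame; the column names "a" and "b" are invented for this example.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

col = df["a"]            # a string selects one column and returns a Series
sub = df[["a", "b"]]     # a list of strings selects columns and returns a DataFrame
rows = df[0:2]           # a slice selects ROWS by position, not columns
big = df[df["a"] > 1]    # a boolean Series selects the rows where it is True

# The explicit indexers state the intent and avoid the guessing.
same_rows = df.iloc[0:2]
same_big = df.loc[df["a"] > 1]
```

Sticking to `.loc` and `.iloc` for row selection is one way to reduce the confusion, though plenty of tutorials use the bare brackets.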
Things started to make sense after I read a very good book on Pandas[1]. Reading a book is better than reading blog posts, because it is consistent. In contrast, reading small tutorials for every little thing is confusing, because every blog post is using a different way to do the same thing.
I suggest looking into matplotlib structures: figures and axes. I think this article [0] is pretty good at detailing how to work with them. They definitely can be confusing but I think most can grasp what to use after reading the article.
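As a rough sketch of the figure/axes split (not taken from the linked article): one Figure is the whole canvas, and each Axes is a single plot inside it. The data and titles here are placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

# One Figure holding two Axes side by side.
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))
left, right = axes

# Each Axes is drawn on and labelled independently.
left.plot([1, 2, 3], [1, 4, 9])
left.set_title("line")
right.bar(["a", "b"], [3, 5])
right.set_title("bars")

# Figure-level settings apply to the whole canvas.
fig.suptitle("one figure, two axes")
fig.tight_layout()
```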
>Is there something wrong with me, or do numpy and pandas seem more difficult than they should be?
If you're only an occasional user, this will be your life forever. My experience with pandas is that if you use it heavily for 3 months, then things start to "stick" and you need to look it up less often.
Unfortunately I changed jobs and have forgotten most of pandas, so I'm back to looking things up again.
This capability is at least a few years old; it was already there when I first started using pandas. I believe it uses matplotlib on the backend for the plotting by default, and it works pretty seamlessly with seaborn too.
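A minimal sketch of what "matplotlib on the backend" means in practice, assuming the default plotting backend: `DataFrame.plot` hands back a matplotlib Axes, so any further tweaking is ordinary matplotlib. The data and labels are invented.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Toy data; column names are made up for the example.
df = pd.DataFrame({"x": [1, 2, 3], "y": [2, 4, 8]})

# With the default backend, .plot() returns a matplotlib Axes.
ax = df.plot(x="x", y="y", kind="line")
ax.set_ylabel("y value")   # plain matplotlib calls from here on
ax.set_xlim(0, 4)

# pandas can also draw into an Axes you created yourself.
fig, ax2 = plt.subplots()
df.plot(x="x", y="y", kind="bar", ax=ax2)
```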
Lately, `pip install pandas` is my first step after making a new virtual environment. Its read_sql and read_csv methods are magic. The resulting DataFrames are just like DataTables in C#. And for complex joins and aggregations, I can DataFrame.to_sql into an in-memory SQLite database.
Pandas feels like the wrong tool for this job. I don't use multi-indexes or any statistical methods. I don't chart anything.
But it's so darn convenient. If the time comes to optimize I can `import csv` directly and improve performance. But nothing beats it for prototyping.
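The read_csv / to_sql / read_sql round trip described above could look roughly like this. The table name, columns and query are made up, and the CSV is inlined so the snippet is self-contained.

```python
import io
import sqlite3

import pandas as pd

# Stand-in for a CSV file on disk; the columns are hypothetical.
orders_csv = io.StringIO(
    "order_id,customer,amount\n"
    "1,ann,10.0\n"
    "2,bob,5.5\n"
    "3,ann,4.5\n"
)
orders = pd.read_csv(orders_csv)

# Push the DataFrame into an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
orders.to_sql("orders", conn, index=False)

# Do the aggregation in SQL and read the result back as a DataFrame.
totals = pd.read_sql(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY customer",
    conn,
)
conn.close()
```

Passing `index=False` keeps the DataFrame's row index from being written as an extra column in the table.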
On occasion, I've fired up pandas just to sanitize a CSV file and drop malformed rows as preparation to bulk ingesting into a database:
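import pandas as pd
# on_bad_lines='skip' replaces error_bad_lines=False, which was deprecated in pandas 1.3 and removed in 2.0
# index=False keeps the row index from being written as an extra first column
pd.read_csv('bad_file.csv', on_bad_lines='skip').to_csv('good_file.csv', index=False)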
It's not efficient (reads everything into memory), but read_csv is robust when it comes to handling embedded unescaped quotes/commas/etc., and supports dropping rows with the incorrect number of columns due to anomalies it can't handle.
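Here is a small self-contained sketch of the row dropping, assuming pandas 1.3 or later, where the option is spelled `on_bad_lines="skip"`. The data is invented and inlined instead of read from a file.

```python
import io

import pandas as pd

# Inline stand-in for a messy CSV file; the second data row has one field too many.
raw = io.StringIO(
    "id,name\n"
    "1,ann\n"
    "2,bob,unexpected\n"
    "3,cy\n"
)

# Rows with more fields than the header are skipped instead of raising an error.
clean = pd.read_csv(raw, on_bad_lines="skip")
```

One caveat worth knowing: rows with too few fields are not dropped by this option; they are kept and padded with NaN, so they need a separate check such as `dropna`.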
Genuine question - would you be willing to spend money for a better version of pandas?
Better in some generic sense of lighter, faster, better API.
I share your implied concern that pandas can be quite large and I personally disagree with a lot of the design decisions when it comes to the pandas API, but building an alternative tool would be a full time job. Unfortunately, there is no mechanism to support Python library developers and the expectation is for Python libraries to be free.
I'm curious how many people would be ok paying for a Python library.
I think that would be an uphill battle and very hard to succeed at financially. I agree with you regarding the API being a mess, but pandas is so heavily entrenched in the data science space (in Python land) that it is almost impossible for a free replacement to take over, let alone a paid library.
Seaborn isn't usable without matplotlib, nor does it aim to be. It gives you simple high-level calls and as soon as you need to tweak your plot you're back to matplotlib. (Similar to the pandas plotting shown in the article actually.)
Bokeh is nice. And has made huge improvements over the past year or two. But it still doesn't directly compete with matplotlib because it's more focused on interactive plots.
Bokeh and Plotly obviously do much, much more than matplotlib, and if you want what they provide, you would go for those. Seaborn is more presentable-looking by default and easier to get things laid out in. I personally haven't used straight-up matplotlib for a long time.
Bokeh and Seaborn are very, very thin skins over matplotlib, and honestly don't improve the user experience at all. Only Altair has changed things for me.
The MATLAB plotting syntax got transferred to Python through matplotlib and got very deeply ingrained: it was first, got popular, and was built into pandas and statsmodels, became the foundation of seaborn, etc. Recently I saw a snippet of Julia code that uses "pyplot", I assume because people find it familiar and convenient. That API just refuses to die.
Most of the things like ticks and limits are covered, which goes beyond the basics. But if you are looking for annotations or animations, you still need to code in matplotlib directly.
I strongly recommend Altair (https://altair-viz.github.io/) as an extremely Pandas-friendly alternative approach to data visualization. It's the first library that has successfully "hidden" the ugly, gnarled matplotlib layer underneath for me. It also looks killer.
Second that, but there is no matplotlib underneath AFAIK. It does HTML-based interactive graphs using a variant of the grammar of graphics, with extensions for interactions. Grammar of graphics means a great, proven (see ggplot) compromise of power and simplicity. <shameless plug> If you need a library of one-liners built on top of altair (that is, if you need some standard stats graph) I wrote altair_recipes (https://github.com/piccolbo/altair_recipes/ or pip install altair_recipes) for that. </shameless plug>
I like Altair, but it still has some annoying missing features like the inability to caption and footnote charts. Or the inability to format filters like sliders.
It makes nice-looking charts in HTML/D3, but it is a hassle to save a real image because that requires Chrome or Firefox, which happens not to work in my CI environment.
So at least matplotlib can save png without needing a bunch of stuff.
[1] https://github.com/jakevdp/PythonDataScienceHandbook
[0] http://jonathansoma.com/lede/algorithms-2017/classes/fuzzine...
And the pandas plots are ugly
Are there better options in this space?
https://dev.pandas.io/pandas2/goals.html
https://bokeh.pydata.org/en/latest/docs/dev_guide/setup.html
https://pandas.pydata.org/pandas-docs/stable/user_guide/visu...