Comprehensive Guide on Data Visualization with Pandas

lawlorino · 6 years ago

I'm really not sure how "comprehensive" I would call this. After glancing over it, it looks like one of the million basic pandas plotting guides that already exist.

abhishekjha · 6 years ago

What should one do if after following those million guides it still doesn’t stick? I always end up googling what I want and hit SO. Is there something wrong with me or does numpy and pandas seem more difficult than they should be?

_coveredInBees · 6 years ago

The pandas api is a mess and that's why it feels that way. It is a great tool, and very powerful, but boy does it make the user's life difficult with a pretty convoluted API that makes it very hard to automatically discover functionality. As others have said, unless you use it all the time, you're essentially in for a bad time that involves lots of Googling and SO browsing even to get basic things done in pandas. I say this as someone who had developed a few smaller projects that utilize pandas extensively, so I'm not just criticizing it without having used it.

lawlorino · 6 years ago

I really wouldn't worry about it - I've been using Python for data work for the last 5 years or so and I have to look stuff up all the time. Eventually the basic stuff sticks, but it's like any kind of coding: I don't think anyone ever hits the point where they hardly ever have to Google stuff.

One useful tip I can suggest is to create a repo for useful code snippets, so if you ever find yourself doing something new that you think you might need again, just spend a bit of time commenting and describing it and add it to the repo. That way instead of having to spend time searching you'll hopefully remember doing it before and be able to find it easily.

Foivos · 6 years ago

It is possible to do the same things in Pandas in many different ways. It is good to have this flexibility, but it is confusing for new users. On top of that, the bracket operator ([]) is overloaded in many ways.

Things started to make sense after I read a very good book on Pandas[1]. Reading a book is better than reading blog posts, because it is consistent. In contrast, reading small tutorials for every little thing is confusing, because every blog post is using a different way to do the same thing.

[1] https://github.com/jakevdp/PythonDataScienceHandbook
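
To make the point about the overloaded bracket operator concrete, here is a minimal sketch. The DataFrame and its column names are made up for illustration; the behaviours shown are what I'd expect from a reasonably recent pandas with the default integer index.

```python
import pandas as pd

# Toy data, purely illustrative.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

col = df["a"]            # string -> a single column, returned as a Series
sub = df[["a", "b"]]     # list of strings -> a DataFrame with those columns
rows = df[0:2]           # slice -> selects ROWS by position, not columns
big = df[df["a"] > 1]    # boolean Series -> keeps rows where the mask is True

# The explicit indexers state which kind of lookup is meant.
by_label = df.loc[0:1, "a"]   # label-based; the slice end is inclusive
by_pos = df.iloc[0:2, 0]      # position-based; the slice end is exclusive
```

Four different meanings for the same brackets is a fair part of why it doesn't stick; sticking to .loc and .iloc for row selection tends to be easier to read back later.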

squaresmile · 6 years ago

I suggest looking into matplotlib structures: figures and axes. I think this article [0] is pretty good at detailing how to work with them. They definitely can be confusing but I think most can grasp what to use after reading the article.

[0] http://jonathansoma.com/lede/algorithms-2017/classes/fuzzine...
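
A rough sketch of the figure/axes idea, for anyone who wants to try it before reading the article. This is not taken from the article; the data and variable names are invented, and the Agg backend is only selected so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt
import pandas as pd

# Toy data, purely illustrative.
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [1, 4, 9, 16]})

# One Figure (the whole canvas) holding two Axes (the individual plots).
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))

# Passing ax= tells pandas which Axes to draw on.
df.plot(x="x", y="y", ax=ax_left)
df.plot.bar(x="x", y="y", ax=ax_right)

# Per-plot settings live on the Axes, layout settings on the Figure.
ax_left.set_title("line")
ax_right.set_title("bar")
fig.tight_layout()
```

From here fig.savefig("some_name.png") would write both plots into one image.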

BeetleB · 6 years ago

>Is there something wrong with me or does numpy and pandas seem more difficult than they should be?

If you're only an occasional user, this will be your life forever. My experience with pandas is that if you use it heavily for 3 months, then things start to "stick" and you need to look it up less often.

Unfortunately I changed jobs and have forgotten most of pandas, so I'm back to looking things up again.

trts · 6 years ago

Is this actually a new feature of pandas? I've only used other libraries like seaborn and matplotlib.

fr1tkot · 6 years ago

It's not new, I remember using the .plot function 2 or 3 years ago. Seaborn is much better anyway though, so don't bother switching.

And the pandas plots are ugly

lawlorino · 6 years ago

This capability is at least a few years old which is when I first started using pandas. I believe it uses matplotlib on the backend for the plotting by default, and works pretty seamlessly with seaborn too.

nerdponx · 6 years ago

It's a matplotlib wrapper.

BeetleB · 6 years ago

I've been using pandas since 2012 and it had decent plotting capabilities back then.

mjparrott · 6 years ago

I’m not sure I would describe those graphs as “beautiful”

min2bro · 6 years ago

These are raw matplotlib graphs. They're crude. Not sure what the definition of beautiful is?

rcarrigan87 · 6 years ago

Yeah, this doesn't seem to add much...was hoping for something a little deeper.

whitehouse3 · 6 years ago

Lately, `pip install pandas` is my first step after making a new virtual environment. Its read_sql and read_csv methods are magic. The resulting DataFrames are just like DataTables in C#. And for complex joins and aggregations, I can DataFrame.to_sql into an in-memory SQLite database.

Pandas feels like the wrong tool for this job. I don't use multi-indexes or any statistical methods. I don't chart anything.

But it's so darn convenient. If the time comes to optimize I can `import csv` directly and improve performance. But nothing beats it for prototyping.

Are there better options in this space?
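
For reference, the to_sql / read_sql round trip through an in-memory SQLite database can look roughly like this. The table names, columns, and values are invented for the example; it assumes pandas accepts a plain sqlite3 connection, which it does in the versions I've used.

```python
import sqlite3
import pandas as pd

# Toy tables, purely illustrative.
orders = pd.DataFrame({"customer_id": [1, 2, 1, 3],
                       "amount": [20.0, 35.5, 12.5, 8.0]})
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Ann", "Bob", "Cy"]})

# ":memory:" gives a throwaway database that disappears when closed.
conn = sqlite3.connect(":memory:")
orders.to_sql("orders", conn, index=False)
customers.to_sql("customers", conn, index=False)

# Do the join and aggregation in SQL, then pull the result back as a DataFrame.
query = """
    SELECT c.name, SUM(o.amount) AS total
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
"""
totals = pd.read_sql(query, conn)
conn.close()
```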

danpelota · 6 years ago

On occasion, I've fired up pandas just to sanitize a CSV file and drop malformed rows as preparation to bulk ingesting into a database:

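  import pandas as pd
  # on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
  # index=False keeps the row index from being written out as an extra column
  pd.read_csv('bad_file.csv', on_bad_lines='skip').to_csv('good_file.csv', index=False)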

It's not efficient (reads everything into memory), but read_csv is robust when it comes to handling embedded unescaped quotes/commas/etc., and supports dropping rows with the incorrect number of columns due to anomalies it can't handle.
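
If memory is the concern, one possible variation is to read in chunks. This is a sketch under assumptions: it needs pandas 1.3 or later for on_bad_lines, the StringIO objects stand in for real files, and the chunk size is arbitrary.

```python
import io
import pandas as pd

# Stand-in for a messy file: the second data row has one field too many.
raw = io.StringIO("a,b\n1,2\n3,4,5\n6,7\n8,9\n10,11\n")
out = io.StringIO()  # stand-in for the cleaned output file

# chunksize makes read_csv return an iterator of DataFrames
# instead of loading the whole file at once.
reader = pd.read_csv(raw, on_bad_lines="skip", chunksize=2)
for i, chunk in enumerate(reader):
    # Write the header only once, with the first chunk.
    chunk.to_csv(out, index=False, header=(i == 0))

cleaned = out.getvalue()
```

With real files you would pass the input path to read_csv and append each chunk to the output path instead.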

maest · 6 years ago

Genuine question - would you be willing to spend money for a better version of pandas?

Better in some generic sense of lighter, faster, better API.

I share your implied concern that pandas can be quite large and I personally disagree with a lot of the design decisions when it comes to the pandas API, but building an alternative tool would be a full time job. Unfortunately, there is no mechanism to support Python library developers and the expectation is for Python libraries to be free.

I'm curious how many people would be ok paying for a Python library.

_coveredInBees · 6 years ago

I think that would be an uphill battle and very hard to succeed at financially. I agree with you regarding the API being a mess, but pandas is so heavily entrenched in the data science space (in Python land) that it is almost impossible for a free replacement to take over, let alone a paid library.

luckydata · 6 years ago

I have so many things to say about this but I also want to remind you that Wes is working on pandas 2.

https://dev.pandas.io/pandas2/goals.html

datascientist · 6 years ago

If you need to scale out or speed up pandas, there's Modin https://modin.readthedocs.io/en/latest/ (which uses Ray)

Dowwie · 6 years ago

Other graphing libs for jupyter notebooks include:

  - Bokeh
  - Plotly
  - Seaborn

These libraries were built to improve upon matplotlib or each other, weren't they? Yet, people continue to reach for "the original" ¯\(ツ)/¯

psv1 · 6 years ago

Seaborn isn't usable without matplotlib, nor does it aim to be. It gives you simple high-level calls and as soon as you need to tweak your plot you're back to matplotlib. (Similar to the pandas plotting shown in the article actually.)

Bokeh is nice. And has made huge improvements over the past year or two. But it still doesn't directly compete with matplotlib because it's more focused on interactive plots.

fr1tkot · 6 years ago

Bokeh and Plotly obviously do much, much more than matplotlib, and if you want what they provide, you would go for those. Seaborn is more presentable-looking by default and easier to get things laid out in. I personally haven't used straight-up matplotlib for a long time.

lazzlazzlazz · 6 years ago

Bokeh and Seaborn are very, very thin skins over matplotlib, and honestly don't improve the user experience at all. Only Altair has changed things for me.

wodenokoto · 6 years ago

Bokeh's rendering backend is BokehJS

https://bokeh.pydata.org/en/latest/docs/dev_guide/setup.html

psv1 · 6 years ago

Bokeh is not a skin over matplotlib.

psv1 · 6 years ago

The matlab plotting syntax got transferred to Python through matplotlib and got very deeply ingrained - was first, got popular, built into pandas and statsmodels, foundation of seaborn etc. Recently I saw a snippet of Julia code that uses "pyplot", I assume because people find it familiar and convenient. That API just refuses to die.

psv1 · 6 years ago

The pandas wrappers around matplotlib are convenient but for anything that needs customising, you'll need to reach for the full matplotlib API anyway.

min2bro · 6 years ago

Most things like ticks and limits are covered, which goes beyond the basics. But if you are looking for annotations or animations, you'll need to code directly in matplotlib.
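
As a small illustration of where the pandas wrapper stops and matplotlib takes over, here is a sketch with made-up data. The pandas call does the drawing, and everything after it is plain matplotlib on the same Axes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt
import pandas as pd

# Toy series, purely illustrative.
s = pd.Series([3, 7, 4, 9, 5], name="value")

fig, ax = plt.subplots()
s.plot(ax=ax)                      # the pandas convenience layer

# From here on it is the regular matplotlib Axes API.
ax.set_ylim(0, 12)
ax.set_xticks(range(len(s)))

peak = s.idxmax()                  # index label of the largest value
ax.annotate("peak",
            xy=(peak, s[peak]),                    # point being annotated
            xytext=(peak - 1.5, s[peak] + 1.5),    # where the label sits
            arrowprops={"arrowstyle": "->"})
```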

mayankkaizen · 6 years ago

Does anybody here think that Pandas API design is ugly and inconsistent? It feels like hack after hack.

WilliamEdward · 6 years ago

I really only use Pandas for DataFrame structures. Doesn't really bother me if the rest of it is bad.

bllguo · 6 years ago

absolutely. pandas is just something i put up with to be able to use everything else in python, i'd drop it in a heartbeat.

lazzlazzlazz · 6 years ago

I strongly recommend Altair (https://altair-viz.github.io/) as an extremely Pandas-friendly alternative approach to data visualization. It's the first library that has successfully "hidden" the ugly, gnarled matplotlib layer underneath for me. It also looks killer.

piccolbo · 6 years ago

Second that but there is no matplotlib underneath AFAIK. It does html based interactive graphs using a variant of the grammar of graphics, with extensions for interactions. Grammar of graphics means a great, proven (see ggplot) compromise of power and simplicity. <shameless plug> If you need a library of one-liners built on top of altair (that is if you need some standard stats graph) I wrote altair_recipes (https://github.com/piccolbo/altair_recipes/ or pip install altair_recipes) for that. </shameless plug>

prepend · 6 years ago

I like Altair, but it still has some annoying missing features like the inability to caption and footnote charts. Or the inability to format filters like sliders.

It makes nice-looking charts in html/d3, but it is a hassle to save a real image because it requires Chrome or Firefox, which happens to not work in my CI environment.

So at least matplotlib can save png without needing a bunch of stuff.

piccolbo · 6 years ago

Add lack of support of polar coordinates. But every lib started immature, and altair is quite new. I think they could use a few good PRs though.

Tycho · 6 years ago

I recommend reading the official docs. I think they improved the plotting interface recently, and I learned a lot from reading through the guide:

https://pandas.pydata.org/pandas-docs/stable/user_guide/visu...