Excellent library for train_test_split.
Jokes aside, this, alongside NumPy, Pandas, Jupyter, and Matplotlib, plus the DL libraries, is the reason Python is the powerhouse it is for Data Science.
I'm with you on sklearn, the DL libraries, and NumPy, but Pandas and Matplotlib are poor, poor relations of the tools available in the R ecosystem (dplyr, ggplot2, etc.).
I used to very strongly agree with you re: matplotlib, but I've recently switched from using almost exclusively ggplot2 to almost exclusively Matplotlib, and my realization is that they are very different tools serving very different purposes.
ggplot2 is obviously fantastic and makes beautiful plots, and very easily at that. However it is definitely a "convention over configuration" tool. For 99% of the typical plot you might want to create, ggplot is going to be easier and look nicer.
However, matplotlib really shines when you want to make very custom plots. If you have a plot in your mind that you want to see on paper, matplotlib will be the better tool for helping you create exactly what you are looking for.
For certain projects I've done, where I want to do a bunch of non-standard visualizations, especially ones that tend to be fairly dense, I prefer matplotlib. For day to day analytics ggplot2 is so much better it's ridiculous. The real issue is that Python doesn't really offer anything in the same league as ggplot2 for "convention over configuration" type plotting.
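To make the "very custom plots" point concrete, here is a toy sketch of the kind of low-level control being described: a two-panel figure with unequal heights, a shared x-axis, and a hand-placed annotation. The data and layout are made up for illustration; the point is that every element is individually addressable.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 200)

# A dense, non-standard layout: two differently sized panels sharing an
# x-axis, with a manually positioned annotation arrow.
fig, (ax_main, ax_small) = plt.subplots(
    2, 1, figsize=(6, 4), gridspec_kw={"height_ratios": [3, 1]}, sharex=True
)
ax_main.plot(x, np.sin(x), color="tab:blue", lw=2)
ax_main.annotate("local max", xy=(np.pi / 2, 1.0), xytext=(3, 1.2),
                 arrowprops={"arrowstyle": "->"})
ax_main.set_ylim(-1.5, 1.5)
ax_small.fill_between(x, np.abs(np.cos(x)), color="tab:orange", alpha=0.5)
ax_small.set_xlabel("x")
fig.savefig("custom_plot.png", dpi=150)
```

This is more verbose than the ggplot2 equivalent would be for a standard plot, but once the layout stops being standard, the explicitness becomes the feature.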
Fully agree on Pandas. R's native data frame + tidyverse is worlds easier. Pandas' overly complex indexing system is a persistent source of annoyance no matter how much I use that library.
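A small, contrived sketch of the kind of indexing friction the comment alludes to: with an integer index, label-based `.loc` and position-based `.iloc` silently diverge, and filtering preserves the old index unless you remember to reset it.

```python
import pandas as pd

# An integer index whose labels don't match positions.
df = pd.DataFrame({"score": [10, 20, 30]}, index=[2, 0, 1])

# .loc is label-based, .iloc is position-based -- easy to mix up.
assert df.loc[0, "score"] == 20    # row whose *label* is 0
assert df.iloc[0]["score"] == 10   # row at *position* 0

# Filtering keeps the original index, so positional assumptions break
# downstream unless you remember reset_index().
subset = df[df["score"] > 10]
flat = subset.reset_index(drop=True)
assert list(flat.index) == [0, 1]
```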
Wait how many companies are actually using R in the wild? As I understand it, R is born of academia, great for statistics/analysis but breaks down on data manipulation and isn't used in production/data engineering. Maybe my understanding is dated though?
Hehe, I used to do R. IMO you are right about ggplot, but I strongly disagree about pandas. I f*ing love it. Would love to understand your troubles with it though; after using it daily for 4 years, maybe I can offer some perspective ;)
Matplotlib is my go-to despite being mediocre. I recently found the proplot library built on top of it, which seems to solve a lot of the warts (particularly around figure layout with subplots and legends). I haven't had a chance to use it yet - does anyone know if it's worth it?
I like to stick to basic, widely used tools when possible so I'm biased against it versus just wrangling it out with matplotlib. But proplot does look compelling, like it was written for exactly my complaints.
I'm surprised you don't like pandas. I've found it to be a pretty easy-to-use and useful tool, and you can almost always use something like Dask (or, if you're lucky, cuDF from RAPIDS) if you need better performance.
I will say that my very first "real" programming experience was MATLAB at a research internship, so maybe I just got used to working in vectors and arrays for computational tasks.
If you're doing data science aren't sklearn, DL, and numpy getting you 90% of the way there anyway? Even if R has better "versions" of pandas/matplotlib (not conceding that point) it's not exactly central to the job of data science.
Early on, pandas made some unfortunate design decisions that still bite hard. For example, the choice to represent datetimes (pandas.Timestamp) as a 64-bit int with a fixed nanosecond resolution. This gives a dynamic range of ±292 years around 1970-01-01 (the epoch). That range is too small to represent the works of William Shakespeare, never mind human history. Using pandas in these areas becomes a royal pain in the neck, since one constantly needs to work around pandas' datetime limitations.
OTOH, in numpy one can choose the time resolution unit (anything from an attosecond to a year), tailoring time resolution to your task (from high-energy physics all the way to astronomy). Pandas' choice is really only good for high-frequency stock traders.
Most data is not 300 years old or in the distant future; in fact, ranges of 1970 ± 292 years are very common. That is to say, pandas' choice is good for lots of people, including many outside high-frequency stock trading.
> Most data is not 300 years old or in the distant future; in fact, ranges of 1970 ± 292 years are very common.
In what domains? Astronomy, geology, and history call for a larger time range. Laser and high-energy physics need femtosecond rather than nanosecond resolution. My point is that a fixed time resolution, whatever it is, is a bad choice. Numpy explicitly allows you to select the time resolution unit, and this is the right approach. BTW, numpy is a pandas dependency and predates it by several years.
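The difference between the two designs is easy to demonstrate. `pd.Timestamp.min`/`.max` below are the limits of the default nanosecond unit being discussed (newer pandas versions have since added coarser resolutions), while numpy's `datetime64` bakes the unit into the dtype, trading resolution for range in either direction:

```python
import numpy as np
import pandas as pd

# pandas: int64 nanoseconds since the epoch -> range hard-capped near 1677..2262
assert pd.Timestamp.min.year == 1677
assert pd.Timestamp.max.year == 2262

# numpy: the unit is part of the dtype, so day resolution easily covers history
shakespeare = np.datetime64("1616-04-23", "D")  # out of range for a ns Timestamp
far_future = np.datetime64("9999-12-31", "D")
assert far_future > shakespeare

# ...or go the other way for sub-nanosecond work (femtoseconds near the epoch)
epoch_fs = np.datetime64("1970-01-01T00:00:00", "fs")
tick = epoch_fs + np.timedelta64(5, "fs")
assert tick - epoch_fs == np.timedelta64(5, "fs")
```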
Best-documented library. It even provides examples, guidance, and best practices in the documentation. I have rarely learned so much as when I went through the scikit-learn documentation. Absolute delight.
Really, for any other ML library the best documentation is how-tos spread across the web, but scikit-learn leaves very little room for that kind of content.
Great that they finally added quantile regression. This was sorely missed.
I’m still hoping for a mixed-effects model implementation someday, like lme4 in R. The statsmodels implementation can only do predictions on fixed effects, which limits it greatly.
I’ve always wondered why mixed effect type models are not more popular in the ML world.
scikit-learn (next to numpy) is the one library I use in every single project at work. Every time I consider switching away from python I am faced with the fact that I'd lose access to this workhorse of a library.
Of course it's not all sunshine and rainbows - I had my fair share of rummaging through its internals - but its API design is a de-facto standard for a reason.
My only recurring gripe is that the serialization story (basically just pickling everything) is not optimal.
I recently ran into this issue as well. Serialization of sklearn random forests results in absolutely massive files. I had to switch to lightgbm, which is 100x faster to load from a save file and about 20x smaller.
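This doesn't fix the pickle-only serialization story, but for what it's worth, joblib (which ships as a scikit-learn dependency) can at least compress the dump. A sketch, with sizes and parameters chosen arbitrarily for illustration:

```python
import os
import pickle

import joblib  # installed as a scikit-learn dependency
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

raw_size = len(pickle.dumps(clf))
joblib.dump(clf, "forest.joblib", compress=3)  # zlib-compressed pickle
compressed_size = os.path.getsize("forest.joblib")

restored = joblib.load("forest.joblib")
assert (restored.predict(X) == clf.predict(X)).all()
assert compressed_size < raw_size  # compression helps, but forests stay large
```

It's still fundamentally pickling, with all the versioning and security caveats that implies, which is presumably why lightgbm's native save format wins so decisively.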
There is so much wrong with the API design of sklearn (how can one think "predict_proba" is a good function name?). I can understand this, since most of it was probably written by PhD students without the time and expertise to come up with a proper API, many of them without a CS background.[1]
These seem like minor gripes (reading your link), and I don't even agree with them; it seems like an OK use of mutable state (otherwise a separate object would be needed for hyperparameter state?). Maybe my expectations are low, but the way sklearn unifies the API across different estimators all across the library is already way above what you can expect, especially if you consider it to be "written by a bunch of PhD students".
I didn't want to bag on sklearn (I've already bagged on pandas enough here), but for what it's worth I agree with you. It's, ahh, not the API I would've come up with. It's what everybody has standardized on, though, and maybe there's some value in that.
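The standardization being described is concrete: every classifier honors the same `fit` / `predict` / `predict_proba` contract, so estimators are drop-in replacements for one another. A small sketch with two arbitrarily chosen classifiers:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Any classifier can be dropped in: the interface is identical.
for est in (LogisticRegression(max_iter=1000),
            RandomForestClassifier(n_estimators=50, random_state=0)):
    est.fit(X_train, y_train)
    proba = est.predict_proba(X_test)  # the much-maligned name
    assert proba.shape == (len(X_test), 2)
    assert est.score(X_test, y_test) > 0.5
```

Love or hate the naming, this uniformity is what lets pipelines, grid search, and cross-validation work generically over any estimator.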
https://github.com/hosseinmoein/DataFrame
https://github.com/scikit-learn/scikit-learn/releases/tag/1....
[1] https://www.reddit.com/r/haskell/comments/7brsuu/machine_lea...