Excellent library for train_test_split.
Jokes aside, this, alongside NumPy, Pandas, Jupyter, and Matplotlib, plus the DL libraries, is the reason Python is the powerhouse it is for Data Science.
I'm with you on sklearn, the DL libraries, and NumPy, but Pandas and Matplotlib are poor, poor relations of the tools available in the R ecosystem (dplyr, ggplot2, etc.).
I used to very strongly agree with you re: matplotlib, but I've recently switched from using almost exclusively ggplot2 to almost exclusively Matplotlib, and my realization is that they are very different tools serving very different purposes.
ggplot2 is obviously fantastic and makes beautiful plots, and very easily at that. However it is definitely a "convention over configuration" tool. For 99% of the typical plot you might want to create, ggplot is going to be easier and look nicer.
However, matplotlib really shines when you want to make very custom plots. If you have a plot in your mind that you want to see on paper, matplotlib will be the better tool for helping you create exactly what you are looking for.
For certain projects I've done, where I want to do a bunch of non-standard visualizations, especially ones that tend to be fairly dense, I prefer matplotlib. For day to day analytics ggplot2 is so much better it's ridiculous. The real issue is that Python doesn't really offer anything in the same league as ggplot2 for "convention over configuration" type plotting.
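To make the "very custom plots" point concrete, here is a toy sketch of the kind of low-level control being described: a two-panel figure with unequal heights, a shared x-axis, and a hand-placed annotation. The data and layout are made up for illustration; the point is that every element is individually addressable.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 200)

# A dense, non-standard layout: two differently sized panels sharing an
# x-axis, with a manually positioned annotation arrow.
fig, (ax_main, ax_small) = plt.subplots(
    2, 1, figsize=(6, 4), gridspec_kw={"height_ratios": [3, 1]}, sharex=True
)
ax_main.plot(x, np.sin(x), color="tab:blue", lw=2)
ax_main.annotate("local max", xy=(np.pi / 2, 1.0), xytext=(3, 1.2),
                 arrowprops={"arrowstyle": "->"})
ax_main.set_ylim(-1.5, 1.5)
ax_small.fill_between(x, np.abs(np.cos(x)), color="tab:orange", alpha=0.5)
ax_small.set_xlabel("x")
fig.savefig("custom_plot.png", dpi=150)
```

This is more verbose than the ggplot2 equivalent would be for a standard plot, but once the layout stops being standard, the explicitness becomes the feature.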
Fully agree on Pandas. R's native data frame + tidyverse is worlds easier. Pandas' overly complex indexing system is a persistent source of annoyance no matter how much I use that library.
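A small, contrived sketch of the kind of indexing friction the comment alludes to: with an integer index, label-based `.loc` and position-based `.iloc` silently diverge, and filtering preserves the old index unless you remember to reset it.

```python
import pandas as pd

# An integer index whose labels don't match positions.
df = pd.DataFrame({"score": [10, 20, 30]}, index=[2, 0, 1])

# .loc is label-based, .iloc is position-based -- easy to mix up.
assert df.loc[0, "score"] == 20    # row whose *label* is 0
assert df.iloc[0]["score"] == 10   # row at *position* 0

# Filtering keeps the original index, so positional assumptions break
# downstream unless you remember reset_index().
subset = df[df["score"] > 10]
flat = subset.reset_index(drop=True)
assert list(flat.index) == [0, 1]
```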
Wait how many companies are actually using R in the wild? As I understand it, R is born of academia, great for statistics/analysis but breaks down on data manipulation and isn't used in production/data engineering. Maybe my understanding is dated though?
Hehe, I used to do R. IMO you are right about ggplot, but I strongly disagree about pandas. I f*ing love it. Would love to understand your troubles with it though; after using it daily for 4 years, maybe I can offer some perspective ;)
Matplotlib is my go-to despite being mediocre. I recently found the proplot library built on top of it, which seems to solve a lot of the warts (particularly around figure layout with subplots and legends). I haven't had a chance to use it yet - does anyone know if it's worth it?
I like to stick to basic, widely used tools when possible so I'm biased against it versus just wrangling it out with matplotlib. But proplot does look compelling, like it was written for exactly my complaints.
I'm surprised you don't like pandas. I've found it to be a pretty easy-to-use and useful tool, and you can almost always use something like Dask (or, if you're lucky, cuDF from RAPIDS) if you need better performance.
I will say that my very first "real" programming experience was MATLAB at a research internship, so maybe I just got used to working in vectors and arrays for computational tasks.
If you're doing data science aren't sklearn, DL, and numpy getting you 90% of the way there anyway? Even if R has better "versions" of pandas/matplotlib (not conceding that point) it's not exactly central to the job of data science.
Early on, pandas made some unfortunate design decisions that still bite hard. For example, the choice to represent datetimes (pandas.Timestamp) as a 64-bit int with a fixed nanosecond resolution. This gives a dynamic range of ±292 years around 1970-01-01 (the epoch). That range is too small to represent the works of William Shakespeare, never mind human history. Using pandas in these areas becomes a royal pain in the neck, since one constantly needs to work around pandas' datetime limitations.
OTOH, in numpy one can choose the time resolution unit (anything from an attosecond to a year), tailoring time resolution to your task (from high-energy physics all the way to astronomy). Pandas' choice is really only good for high-frequency stock traders.
Most data is not 300 years old or in the distant future; in fact, ranges of 1970 ± 292 years are very common. That is to say, pandas' choice is good for lots of people, including many outside high-frequency stock trading.
> Most data is not 300 years old or in the distant future; in fact, ranges of 1970 ± 292 years are very common.
In what domains? Astronomy, geology, and history call for a larger time range. Laser and high-energy physics need femtosecond rather than nanosecond resolution. My point is that a fixed time resolution, whatever it is, is a bad choice. Numpy explicitly allows you to select the time resolution unit, and this is the right approach. BTW, numpy is a pandas dependency and predates it by several years.
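The difference between the two designs is easy to demonstrate. `pd.Timestamp.min`/`.max` below are the limits of the default nanosecond unit being discussed (newer pandas versions have since added coarser resolutions), while numpy's `datetime64` bakes the unit into the dtype, trading resolution for range in either direction:

```python
import numpy as np
import pandas as pd

# pandas: int64 nanoseconds since the epoch -> range hard-capped near 1677..2262
assert pd.Timestamp.min.year == 1677
assert pd.Timestamp.max.year == 2262

# numpy: the unit is part of the dtype, so day resolution easily covers history
shakespeare = np.datetime64("1616-04-23", "D")  # out of range for a ns Timestamp
far_future = np.datetime64("9999-12-31", "D")
assert far_future > shakespeare

# ...or go the other way for sub-nanosecond work (femtoseconds near the epoch)
epoch_fs = np.datetime64("1970-01-01T00:00:00", "fs")
tick = epoch_fs + np.timedelta64(5, "fs")
assert tick - epoch_fs == np.timedelta64(5, "fs")
```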
Best-documented library. It even provides examples, guidance, and best practices in the documentation. I have rarely learned so much as when I went through the scikit-learn documentation. Absolute delight.
Really, for any other ML library the best documentation is how-tos spread across the web, but scikit-learn leaves very little room for that kind of content.
Great that they finally added quantile regression. This was sorely missed.
I’m still hoping for a mixed-effects model implementation someday, like lme4 in R. The statsmodels implementation can only do predictions on fixed effects, which limits it greatly.
I’ve always wondered why mixed effect type models are not more popular in the ML world.
scikit-learn (next to numpy) is the one library I use in every single project at work. Every time I consider switching away from python I am faced with the fact that I'd lose access to this workhorse of a library.
Of course it's not all sunshine and rainbows - I had my fair share of rummaging through its internals - but its API design is a de-facto standard for a reason.
My only recurring gripe is that the serialization story (basically just pickling everything) is not optimal.
I recently ran into this issue as well. Serialization of sklearn random forests results in absolutely massive files. I had to switch to lightgbm, which is 100x faster to load from a save file and about 20x smaller.
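This doesn't fix the pickle-only serialization story, but for what it's worth, joblib (which ships as a scikit-learn dependency) can at least compress the dump. A sketch, with sizes and parameters chosen arbitrarily for illustration:

```python
import os
import pickle

import joblib  # installed as a scikit-learn dependency
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

raw_size = len(pickle.dumps(clf))
joblib.dump(clf, "forest.joblib", compress=3)  # zlib-compressed pickle
compressed_size = os.path.getsize("forest.joblib")

restored = joblib.load("forest.joblib")
assert (restored.predict(X) == clf.predict(X)).all()
assert compressed_size < raw_size  # compression helps, but forests stay large
```

It's still fundamentally pickling, with all the versioning and security caveats that implies, which is presumably why lightgbm's native save format wins so decisively.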
There is so much wrong with the API design of sklearn (how can one think "predict_proba" is a good function name?). I can understand this, since most of it was probably written by PhD students without the time and expertise to come up with a proper API, many of them without a CS background.[1]
These seem like minor gripes (reading your link), and I don't even agree with them; it seems like an OK use of mutable state (otherwise a separate object would be needed for hyperparameter state?). Maybe my expectations are low, but the way sklearn unifies the API across different estimators all across the library is already way above what you can expect, especially if you consider it to be "written by a bunch of PhD students".
I didn't want to bag on sklearn (I've already bagged on pandas enough here), but for what it's worth I agree with you. It's, ahh, not the API I would've come up with. It's what everybody has standardized on, though, and maybe there's some value in that.
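The standardization being described is concrete: every classifier honors the same `fit` / `predict` / `predict_proba` contract, so estimators are drop-in replacements for one another. A small sketch with two arbitrarily chosen classifiers:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Any classifier can be dropped in: the interface is identical.
for est in (LogisticRegression(max_iter=1000),
            RandomForestClassifier(n_estimators=50, random_state=0)):
    est.fit(X_train, y_train)
    proba = est.predict_proba(X_test)  # the much-maligned name
    assert proba.shape == (len(X_test), 2)
    assert est.score(X_test, y_test) > 0.5
```

Love or hate the naming, this uniformity is what lets pipelines, grid search, and cross-validation work generically over any estimator.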
https://github.com/hosseinmoein/DataFrame
https://github.com/scikit-learn/scikit-learn/releases/tag/1....
[1] https://www.reddit.com/r/haskell/comments/7brsuu/machine_lea...