This reads like a pretty "wet-behind-the-ears" professional who doesn't know what he doesn't know.
> There's no Java awfulness like ... instead it's just `cars = []`
I mean, there are very good reasons for static typing. And if he was using Kotlin, he could specify whether the variable `cars` was itself immutable and whether the list was immutable (`val/var cars: List<String>/MutableList<String>`).
> notebooks
Yeah, Jupyter kernels exist for almost every language. This is not a Python advantage.
> debugging
Good IDEs have the ability to set breakpoints, inspect variables, test methods, etc.
> type hints
"oh, forget what I said earlier about how Java had ugly boilerplate, now I have an `import` and a type def after all" - except nothing here is actually enforced
> parallelism
parallelism is relative... a lot of compiled JVM code will run much faster than Python to start with, and even with `multiprocessing`, Python won't catch up (and JVM languages have their own concurrency solutions, of course)
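For reference, the `multiprocessing` route looks roughly like this (a minimal sketch): each worker is a separate OS process, which sidesteps the GIL for CPU-bound work, but arguments and results are pickled across process boundaries, and that per-task overhead is part of why it often doesn't close the gap with compiled JVM code.

```python
from multiprocessing import Pool

def square(n: int) -> int:
    return n * n

if __name__ == "__main__":
    # Four worker processes; inputs and outputs are pickled between
    # processes, adding overhead to every task.
    with Pool(processes=4) as pool:
        print(pool.map(square, range(10)))  # [0, 1, 4, ..., 81]
```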
> I've put a large chunk of my money in leveraged index funds and etfs.
Written by a person who's never seen the slightest hint of a bear market, or rising interest rates. That's ok, you wouldn't be the first smart person to be seduced by leverage: https://www.investopedia.com/terms/m/myron-scholes.asp
> Stimulants like caffeine, adderall, and modafinil are magic... People do stay on adderall and modafinil indefinitely
Look, I'm no doctor (and I'm aware I'm out of the loop on things like this), but mental & concentration stimulants are the kinds of things associated with old people, not recent graduates.
This is an oddly snarky response to someone just sharing their experience, but you seem to be reading it matter-of-factly. She's upfront about having just graduated, and having two years of experience with a math background and not a CS background.
She clearly covered a lot of good ground in that time, and even took the time to write a 6000+ word article.
I don't think it's controversial that you can be much more concise with Python. My experience first learning Java was that everything was 2-3x as verbose as in Python. The difference is smaller if you're using type hints in Python, but it's still more concise.
I talked about REPLs/notebooks for other languages. They're still an especially great tool for Python/data science, since they make it very easy to visualize data and share analyses.
I played around with breakpoints in PyCharm and I don't think it would work for me. You need to run your code from PyCharm in debug mode for the breakpoint to trigger, whereas I always run things from the command line or a notebook.
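For what it's worth, you don't actually need the IDE for this: the built-in `breakpoint()` (Python 3.7+) drops you into pdb wherever the code was launched from, including a plain terminal, and `python -m pdb script.py` or `%debug` in a notebook work too. A minimal sketch:

```python
def mean(xs):
    total = 0.0
    for x in xs:
        total += x
    # Uncommented, the next line drops into the pdb debugger right here,
    # even when the script is run from the command line (no IDE needed).
    # breakpoint()
    return total / len(xs)

print(mean([1, 2, 3]))  # → 2.0
```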
I believe you that there are times when Python is slower. At least it's not noticeably slower when it's simply calling C behind the scenes, or when you're I/O-bound anyway.
Re investing, I mean, everyone has seen phenomenal returns since they were born: this century is unprecedented. Also there was the pandemic crash very recently, so everyone has experienced an extremely harsh (albeit brief) bear market too. LTCM was like 100x-leveraged, which I would not advocate for, since you'll almost certainly get wiped out if you hold that position for more than a few hours...
Eh, lots of kids have ADD, and like 10% of college students used Adderall in 2016, according to the first hit on Google. In any case they've been magic for me the few times I've tried them, e.g. working 12+ good hours in a day.
Seems like "data scientist" is now code for "junior Python developer".
I need a person who actually knows statistics, how to integrate, what Bayes' theorem means in practice, the difference between a confidence and a credible interval, etc.
What would this person be called in 2021? (I understand that searching for "data scientist" candidates would be a waste of my time.)
> Seems like "data scientist" is now code for "junior Python developer".
No, it isn't.
> I need a person who actually knows statistics, how to integrate...
Maybe someone with a strong mathematical education and commercial experience in research... like the author of the post? Just because they focused on some more basic SWE concepts in a blog post doesn't mean that they don't know statistics.
> What would this person be called in 2021?
Data scientist, research scientist, probably some other titles too.
Forgive me, but the blog is titled "what I learned from my first two years as a data scientist", and I'd rather take the author at their word and not speculate.
(And my experience working with dozens of data scientists corroborates this: "data science" means basic Python programming and the kind of boring trial-and-error feature engineering tasks you'd typically assign to a junior software developer.)
I want a person who actually uses statistics and math to drive business decisions, not just someone who took some statistics courses in university before becoming a software developer.
P.S. Honest question, really. I don't want to sift through 500 resumes before finding the person I want.
"Useless" accurately describes the majority of published books in the industry these days. So you can spend a lot of time finding the right book... or you can just dive straight for the spec.
The remote connection to safety (did you mean security?) would be that static source analysis tools don't work as reliably with dynamic languages. That matters at Google. But you don't even have to think as hard about it: Python is simply comparatively slow and inefficient. Google's fleet is large. It pays off to use more efficient languages.
(There's also the whole thing about Python being largely single threaded and computers being very wide these days, as well as being a terrible memory hog and memory making up half the cost of servers.)
I felt like this article was a bit light on data scientist specific advice, and while I am not one, I do herd them for a living, so thought I'd put some random thoughts together:
1) Quite often you are not training a machine to be the best at something. You're training a machine to help a human to be the best at something. Be sure to optimise for this when necessary.
2) Push predictions, don't ask others to pull them. Focus on decoupling your data science team and their customers early on. The worst thing that can happen is duplicating logic, first to craft features during training, and later to submit those features to an API from clients. Even if you hide the feature engineering behind the API, this can either slow down predictions, or still require bulky requests from the client in the case of sequence data. Instead, stream data into your feature store, and stream predictions out onto your event bus. Then your data science team can truly be a black box.
3) Unit test invariants in your model's outputs. While you can't write tests for exact outputs, you can say "such and such a case should output a higher value than some other case, all things being equal". When your model disagrees, do at least consider that the model may be correct though.
4) Do ablation tests in reverse, and unit test each addition to your model's architecture to prove it helps.
5) Often you will train a model on historical data, and content yourself that all future predictions will be outside this training set. However, don't forget that sometimes updates to historical data will trigger a prediction to be recalculated, and this might be overfit. Sometimes you can serve cached results, but small feature changes make this harder.
6) Your data scientists are probably the people who are most intimate with your data. They will be the first to stumble on bugs and biases, so give them very good channels to report QA issues. If you are a lone data scientist in a larger organisation, seek out and forge these channels early.
7) Don't treat labelling tools as grubby little hacked together apps. Resource them properly, make sure you watch and listen to the humans building and using them.
8) Have objective ways of comparing models that are thematically similar but may differ in their exact goal variables. If you can't directly compare log loss or whatever like-for-like, find some more external criteria.
9) Much of your job is building trust in your models with stakeholders. Don't be afraid to build simple stuff that captures established intuitions before going deep - show people the machine gets the basics first.
10) If you're struggling to replicate a result from a paper, either with or without the original code, welcome to academia.
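A minimal sketch of point 2's "don't duplicate feature logic" idea (the field names here are made up): keep the transformation in a single function and have both the training job and the serving path import it.

```python
def make_features(raw: dict) -> list:
    # Single source of truth: the same transformation runs at training
    # time and at serving time, so the two can never drift apart.
    return [raw["amount"] / 100.0, float(raw["n_items"]), float(raw["is_repeat"])]

# The training pipeline and the serving endpoint both call make_features(...)
# rather than re-implementing the feature engineering on each side.
print(make_features({"amount": 250, "n_items": 3, "is_repeat": True}))  # → [2.5, 3.0, 1.0]
```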
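And a sketch of point 3's invariant tests, with a toy stand-in for the model (in practice you'd load your trained model instead):

```python
def predict_price(sqft: float, bedrooms: int) -> float:
    # Stand-in for a trained model; replace with your real predict call.
    return 50_000 + 120 * sqft + 10_000 * bedrooms

def test_more_space_never_cheaper():
    # Invariant: all else being equal, more square footage
    # should not lower the predicted price.
    assert predict_price(2000, 3) >= predict_price(1500, 3)

test_more_space_never_cheaper()
```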
Good list. One thing I'd add, which you kind of hint at:
Good practices from software engineering are just as applicable to Data Science. In particular:
Notebooks are great for performing exploratory data analysis and testing out new concepts. They're not great for running production code. Put code that isn't one-off into regular source files and keep it under source control.
Break your code into separately testable, composable functions. Write unit tests to verify behavior where you can. Speaking from experience, you almost certainly will find bugs.
Implement a peer review process for the methodology used and the code. Approaches should be explainable and justifiable. Bugs and poor assumptions can lead to incorrect results.
Focus on making your model training process end-to-end reproducible. Document the training data used. Document the configuration used. Link back to the commit hash of the exact code used. Make sure your environment is reproducible.
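One way to sketch that last point (this assumes the training code lives in a git checkout; it falls back to "unknown" otherwise):

```python
import json
import platform
import subprocess
import sys

def training_manifest(config: dict) -> dict:
    """Record enough metadata to reproduce a training run later."""
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip()
    except Exception:
        commit = "unknown"  # not running inside a git checkout
    return {
        "config": config,                  # hyperparameters, data version, etc.
        "git_commit": commit,              # exact code used
        "python": sys.version.split()[0],  # environment details
        "platform": platform.platform(),
    }

print(json.dumps(training_manifest({"lr": 0.01, "data": "train_2021q1"}), indent=2))
```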
I believe the available evidence suggests that over the long term, a small amount of leverage increases returns. Obviously it increases volatility as well, but if your time horizon is long, you can cope with that.
You do not want to invest in ETFs that are themselves leveraged and rebalanced daily, however; that will eat all your money if you hold them for any length of time. Those products are designed to be held for short periods only.
The expense ratios of leveraged ETFs are nothing compared to the volatility drag. There are far cheaper and more effective ways than leveraged ETFs for buy-and-hold investors to obtain leverage, notably LEAPs and index futures. (Disclaimer: Not investing advice, do your own research, etc.)
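The daily-rebalancing drag is easy to see with two days of arithmetic: the index round-trips back to where it started, but a 3x daily-rebalanced product ends down about 5.5%.

```python
index, lev3 = 100.0, 100.0
for daily_return in [0.10, -1 / 11]:  # +10%, then back down: 110 -> 100
    index *= 1 + daily_return
    lev3 *= 1 + 3 * daily_return      # the 3x product compounds 3x each DAILY move

print(round(index, 2))  # → 100.0
print(round(lev3, 2))   # → 94.55  (flat market, ~5.5% loss from volatility drag)
```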
One correction first: it's not "he", it's "she".
* Check out `ray` as an alternative to `multiprocessing`
* Check out `tqdm`
* Use `pdb` more
* See if fast.ai or https://jalammar.github.io/illustrated-transformer/ are worthwhile
* Prioritize the papers I read better
* _Leveraged_ index funds?
- VTI/VOO: 0.6 to 0.8
- Leveraged S&P500 ETFs: 0.5 to 0.9
- Savings account: positive infinity
- Treasury bonds: 1.0 to 1.2
Leveraged index funds give approximately the same risk-adjusted return as passive index funds, but with 2-2.5x the absolute return (e.g., UPRO).
However, if you have a weak stomach (you don’t like seeing your balance drop), then leveraged ETFs are not for you.
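The figures above read like Sharpe ratios (excess return divided by volatility), which is presumably what "risk-adjusted" refers to here. A minimal sketch with made-up yearly returns:

```python
import statistics

def sharpe(returns, risk_free=0.0):
    # Mean excess return divided by its standard deviation.
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess)

print(round(sharpe([0.12, -0.05, 0.20, 0.08, 0.10]), 2))  # → 0.99
```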
Erm... How about no? Unless you read useless books.
In practice it's not even following the spec; it's just copy-pasting Stack Overflow answers.
Reading a book at least gives a more sustainable structure to what you learn.
Is this actually true? If it's true that Google doesn't allow Python in production, it seems unlikely that it's due to security concerns.
It’s probably production-level services, not just projects.
Then why did you write it as a fact?
Probably not earth shattering stuff, I grant you.
It may not be earth-shattering, but that's probably a good thing.
I especially like numbers 1 and 10. If I was looking for a job I'd like to work with you.
But I did read most of it. I thought it was a fairly honest and good blog.
I did like the diagram of the different company styles
I'll read it 100% when I have enough time.
That's quite an old internet cartoon.
> I've put a large chunk of my money in leveraged index funds and etfs
This is dangerous advice if you go all in on this.
Now, the risk (potential one-year downside) is not for everyone.