This reads like a pretty "wet-behind-the-ears" professional who doesn't know what he doesn't know.
> There's no Java awfulness like ... instead it's just `cars = []`
I mean, there are very good reasons for static typing. And if he was using Kotlin, he could specify whether the variable `cars` was itself immutable and whether the list was immutable (`val/var cars: List<String>/MutableList<String>`).
> notebooks
Yeah, Jupyter kernels exist for almost every language. This is not a Python advantage.
> debugging
Good IDEs have the ability to set breakpoints, inspect variables, test methods, etc.
> type hints
"oh, forget what I said earlier about how Java had ugly boilerplate, now I have an `import` and a type def after all" - except nothing here is actually enforced
> parallelism
parallelism is relative... a lot of compiled JVM code will run much faster than Python to start with, and even with `multiprocessing`, Python won't catch up (and JVM languages have their own concurrency solutions, of course)
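For reference, the `multiprocessing` route looks roughly like this (a minimal sketch): each worker is a separate OS process, which sidesteps the GIL for CPU-bound work, but arguments and results are pickled across process boundaries, and that per-task overhead is part of why it often doesn't close the gap with compiled JVM code.

```python
from multiprocessing import Pool

def square(n: int) -> int:
    return n * n

if __name__ == "__main__":
    # Four worker processes; inputs and outputs are pickled between
    # processes, adding overhead to every task.
    with Pool(processes=4) as pool:
        print(pool.map(square, range(10)))  # [0, 1, 4, ..., 81]
```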
> I've put a large chunk of my money in leveraged index funds and etfs.
Written by a person who's never seen the slightest hint of a bear market, or rising interest rates. That's ok, you wouldn't be the first smart person to be seduced by leverage: https://www.investopedia.com/terms/m/myron-scholes.asp
> Stimulants like caffeine, adderall, and modafinil are magic... People do stay on adderall and modafinil indefinitely
Look, I'm no doctor (and I'm aware I'm out of the loop on things like this), but mental & concentration stimulants are the kinds of things associated with old people, not recent graduates.
This is an oddly snarky response to someone just sharing their experience, but you seem to be reading it matter-of-factly. She's upfront about having just graduated, and having two years of experience with a math background and not a CS background.
She clearly covered a lot of good ground in that time, and even took the time to write a 6000+ word article.
I don't think it's controversial that you can be much more concise with Python. My experience first learning Java was that everything was 2-3x as verbose as in Python. The difference is smaller if you're using type hints in Python, but it's still more concise.
I talked about REPLs/notebooks for other languages. They're still an especially great tool for Python/data science, since they make it very easy to visualize data and share analyses.
I played around with breakpoints in PyCharm and I don't think it would work for me. You need to run your code from PyCharm in debug mode for the breakpoint to trigger, whereas I always run things from the command line or a notebook.
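For what it's worth, you don't actually need the IDE for this: the built-in `breakpoint()` (Python 3.7+) drops you into pdb wherever the code was launched from, including a plain terminal, and `python -m pdb script.py` or `%debug` in a notebook work too. A minimal sketch:

```python
def mean(xs):
    total = 0.0
    for x in xs:
        total += x
    # Uncommented, the next line drops into the pdb debugger right here,
    # even when the script is run from the command line (no IDE needed).
    # breakpoint()
    return total / len(xs)

print(mean([1, 2, 3]))  # → 2.0
```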
I believe you that there are times when Python is slower. At least it's not noticeably slower when it's simply calling C behind the scenes, or when you're I/O-bound anyway.
Re investing, I mean, everyone has seen phenomenal returns since they were born: this century is unprecedented. Also there was the pandemic crash very recently, so everyone has experienced an extremely harsh (albeit brief) bear market too. LTCM was like 100x-leveraged, which I would not advocate for, since you'll almost certainly get wiped out if you hold that position for more than a few hours...
Eh, lots of kids have ADD, and like 10% of college students used Adderall in 2016, according to the first hit on Google. In any case they've been magic for me the few times I've tried them, e.g. working 12+ good hours in a day.
Seems like "data scientist" is now code for "junior Python developer".
I need a person who actually knows statistics, how to integrate, what Bayes' theorem means in practice, the difference between a confidence and a credible interval, etc.
What would this person be called in 2021? (I understand that searching for "data scientist" candidates would be a waste of my time.)
> Seems like "data scientist" is now code for "junior Python developer".
No, it isn't.
> I need a person who actually knows statistics, how to integrate...
Maybe someone with a strong mathematical education and commercial experience in research... like the author of the post? Just because they focused on some more basic SWE concepts in a blog post doesn't mean that they don't know statistics.
> What would this person be called in 2021?
Data scientist, research scientist, probably some other titles too.
Forgive me, but the blog is titled "what I learned from my first two years as a data scientist", and I'd rather take the author at their word and not speculate.
(And my experience working with dozens of data scientists corroborates this: "data science" means basic Python programming and the kind of boring trial-and-error feature engineering tasks you'd typically assign to a junior software developer.)
I want a person who actually uses statistics and math to drive business decisions, not just someone who took some statistics courses in university before becoming a software developer.
P.S. Honest question, really. I don't want to sift through 500 resumes before finding the person I want.
"Useless" accurately describes the majority of published books in the industry these days. So you can spend a lot of time finding the right book... or you can just dive straight for the spec.
The remote connection to safety (did you mean security?) would be that static source analysis tools don't work as reliably with dynamic languages. That matters at Google. But you don't even have to think as hard about it: Python is simply comparatively slow and inefficient. Google's fleet is large. It pays off to use more efficient languages.
(There's also the whole thing about Python being largely single threaded and computers being very wide these days, as well as being a terrible memory hog and memory making up half the cost of servers.)
I felt like this article was a bit light on data scientist specific advice, and while I am not one, I do herd them for a living, so thought I'd put some random thoughts together:
1) Quite often you are not training a machine to be the best at something. You're training a machine to help a human to be the best at something. Be sure to optimise for this when necessary.
2) Push predictions, don't ask others to pull them. Focus on decoupling your data science team and their customers early on. The worst thing that can happen is duplicating logic, first to craft features during training, and later to submit those features to an API from clients. Even if you hide the feature engineering behind the API, this can either slow down predictions, or still require bulky requests from the client in the case of sequence data. Instead, stream data into your feature store, and stream predictions out onto your event bus. Then your data science team can truly be a black box.
3) Unit test invariants in your model's outputs. While you can't write tests for exact outputs, you can say "such and such a case should output a higher value than some other case, all things being equal". When your model disagrees, do at least consider that the model may be correct though.
4) Do ablation tests in reverse, and unit test each addition to your model's architecture to prove it helps.
5) Often you will train a model on historical data, and content yourself that all future predictions will be outside this training set. However, don't forget that sometimes updates to historical data will trigger a prediction to be recalculated, and this might be overfit. Sometimes you can serve cached results, but small feature changes make this harder.
6) Your data scientists are probably the people who are most intimate with your data. They will be the first to stumble on bugs and biases, so give them very good channels to report QA issues. If you are a lone data scientist in a larger organisation, seek out and forge these channels early.
7) Don't treat labelling tools as grubby little hacked together apps. Resource them properly, make sure you watch and listen to the humans building and using them.
8) Have objective ways of comparing models that are thematically similar but may differ in their exact goal variables. If you can't directly compare log loss or whatever like-for-like, find some more external criteria.
9) Much of your job is building trust in your models with stakeholders. Don't be afraid to build simple stuff that captures established intuitions before going deep - show people the machine gets the basics first.
10) If you're struggling to replicate a result from a paper, either with or without the original code, welcome to academia.
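A minimal sketch of point 2's "don't duplicate feature logic" idea (the field names here are made up): keep the transformation in a single function and have both the training job and the serving path import it.

```python
def make_features(raw: dict) -> list:
    # Single source of truth: the same transformation runs at training
    # time and at serving time, so the two can never drift apart.
    return [raw["amount"] / 100.0, float(raw["n_items"]), float(raw["is_repeat"])]

# The training pipeline and the serving endpoint both call make_features(...)
# rather than re-implementing the feature engineering on each side.
print(make_features({"amount": 250, "n_items": 3, "is_repeat": True}))  # → [2.5, 3.0, 1.0]
```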
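And a sketch of point 3's invariant tests, with a toy stand-in for the model (in practice you'd load your trained model instead):

```python
def predict_price(sqft: float, bedrooms: int) -> float:
    # Stand-in for a trained model; replace with your real predict call.
    return 50_000 + 120 * sqft + 10_000 * bedrooms

def test_more_space_never_cheaper():
    # Invariant: all else being equal, more square footage
    # should not lower the predicted price.
    assert predict_price(2000, 3) >= predict_price(1500, 3)

test_more_space_never_cheaper()
```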
Good list. One thing I'd add, which you kind of hint at:
Good practices from software engineering are just as applicable to Data Science. In particular:
Notebooks are great for performing exploratory data analysis and testing out new concepts. They're not great for running production code. Put code that isn't one-off into regular source files and keep it under source control.
Break your code into separately testable, composable functions. Write unit tests to verify behavior where you can. Speaking from experience, you almost certainly will find bugs.
Implement a peer review process for the methodology used and the code. Approaches should be explainable and justifiable. Bugs and poor assumptions can lead to incorrect results.
Focus on making your model training process end-to-end reproducible. Document the training data used. Document the configuration used. Link back to the commit hash of the exact code used. Make sure your environment is reproducible.
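One way to sketch that last point (this assumes the training code lives in a git checkout; it falls back to "unknown" otherwise):

```python
import json
import platform
import subprocess
import sys

def training_manifest(config: dict) -> dict:
    """Record enough metadata to reproduce a training run later."""
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip()
    except Exception:
        commit = "unknown"  # not running inside a git checkout
    return {
        "config": config,                  # hyperparameters, data version, etc.
        "git_commit": commit,              # exact code used
        "python": sys.version.split()[0],  # environment details
        "platform": platform.platform(),
    }

print(json.dumps(training_manifest({"lr": 0.01, "data": "train_2021q1"}), indent=2))
```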
I believe the available evidence suggests that over the long term, a small amount of leverage increases returns. Obviously it increases volatility as well, but if your time horizon is long, you can cope with that.
You do not want to invest in ETFs that are themselves leveraged and rebalanced daily, however; that will eat all your money if you hold them for any length of time. Those products are designed to be held for short periods only.
The expense ratios of leveraged ETFs are nothing compared to the volatility drag. There are far cheaper and more effective ways than leveraged ETFs for buy-and-hold investors to obtain leverage, notably LEAPs and index futures. (Disclaimer: Not investing advice, do your own research, etc.)
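The daily-rebalancing drag is easy to see with two days of arithmetic: the index round-trips back to where it started, but a 3x daily-rebalanced product ends down about 5.5%.

```python
index, lev3 = 100.0, 100.0
for daily_return in [0.10, -1 / 11]:  # +10%, then back down: 110 -> 100
    index *= 1 + daily_return
    lev3 *= 1 + 3 * daily_return      # the 3x product compounds 3x each DAILY move

print(round(index, 2))  # → 100.0
print(round(lev3, 2))   # → 94.55  (flat market, ~5.5% loss from volatility drag)
```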
One correction first: it's not "he", it's "she".
* Check out `ray` as an alternative to `multiprocessing`
* Check out `tqdm`
* Use `pdb` more
* See if fast.ai or https://jalammar.github.io/illustrated-transformer/ are worthwhile
* Prioritize the papers I read better
* _Leveraged_ index funds?
- VTI/VOO: 0.6 to 0.8
- Leveraged S&P500 ETFs: 0.5 to 0.9
- Savings account: positive infinity
- Treasury bonds: 1.0 to 1.2
Leveraged index funds give approximately the same risk-adjusted return as passive index funds, but with 2-2.5x the absolute return (e.g., UPRO).
However, if you have a weak stomach (you don’t like seeing your balance drop), then leveraged ETFs are not for you.
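The figures above read like Sharpe ratios (excess return divided by volatility), which is presumably what "risk-adjusted" refers to here. A minimal sketch with made-up yearly returns:

```python
import statistics

def sharpe(returns, risk_free=0.0):
    # Mean excess return divided by its standard deviation.
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess)

print(round(sharpe([0.12, -0.05, 0.20, 0.08, 0.10]), 2))  # → 0.99
```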
Erm... How about no? Unless you read useless books.
In practice it's not even following the spec; it's just copy-pasting Stack Overflow answers.
Reading a book at least gives a more sustainable structure to what you learn.
Is this actually true? If it's true that Google doesn't allow Python in production, it seems unlikely that it's due to security concerns.
It’s probably production-level services, not just projects.
Then why did you write it as a fact?
Probably not earth shattering stuff, I grant you.
It may not be earth-shattering, but that's probably a good thing.
I especially like numbers 1 and 10. If I was looking for a job I'd like to work with you.
But I did read most of it. I thought it was a fairly honest and good blog.
I did like the diagram of the different company styles
I'll read it 100% when I have enough time.
That's quite an old internet cartoon.
> I've put a large chunk of my money in leveraged index funds and etfs
This is dangerous advice if you go all in on this.
Now, the risk (potential one-year downside) is not for everyone.