Though I noticed some outdated info for most companies (last updated >2 months ago).
Regurgitating and Examples are both ways to lean into that and try to recover whatever has been lost by chat-based tuning.
I also made a math game 3 years ago.
https://github.com/zineanteoh/clean-the-river
I called it “Clean the river”. It is a web-based game that lets children practice forming and solving basic mathematical expressions by "cleaning the river".
So do an end-to-end project where you:
- start from a CSV dataset, with the goal of predicting some output column. A classic example is predicting whether a household's income is >$50K or not from census information.
- transform/clean the data in a Jupyter notebook and engineer features for input into a model. Export the features to disk in a format suitable for training.
- train a simple linear model using a chosen framework: a regressor if you're predicting a numerical field, a classifier if it's categorical.
- iterate on model evaluation metrics through more feature engineering, scoring the model on unseen data to see its actual performance.
- export the model in such a way it can be loaded or hosted. The format largely depends on the framework.
- construct a Docker container that exposes the model over HTTP, with a handler that receives prediction requests and transforms them into model input, plus a client that sends requests to that server.
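The steps above (minus the Docker/HTTP layer) can be sketched in a few lines of scikit-learn. This is a toy stand-in, not the real census data: the column names, the synthetic labels, and the `predict` handler's request shape are all assumptions for illustration.

```python
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 1000

# "CSV" stand-in: age, hours-per-week, years-of-education.
X = np.column_stack([
    rng.integers(18, 70, n),   # age
    rng.integers(10, 60, n),   # hours per week
    rng.integers(8, 20, n),    # years of education
]).astype(float)
# Synthetic label loosely correlated with the features (income >$50K or not).
y = (0.03 * X[:, 0] + 0.04 * X[:, 1] + 0.2 * X[:, 2]
     + rng.normal(0, 1, n) > 5.0).astype(int)

# Feature engineering: standardize each column (mean 0, std 1).
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Hold out unseen data so the score reflects actual performance.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression().fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
print(f"held-out accuracy: {acc:.3f}")

# Export the model so a serving container can load it.
blob = pickle.dumps(clf)

# Minimal "handler": turn a raw request dict into model input and predict.
def predict(request: dict) -> int:
    model = pickle.loads(blob)
    row = np.array([[request["age"], request["hours"], request["education"]]])
    return int(model.predict(row)[0])
```

In a real deployment the pickled model would be written to disk, loaded once at server startup rather than per request, and the handler wrapped in an HTTP framework inside the container.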
That's basically an end-to-end run of the entire MLE lifecycle. Every other part of the job is a series of concentric loops between these steps, scaled out along several dimensions: number of features, size of dataset, steps in a data/feature-processing pipeline to generate training datasets, model architecture and hyperparameters, latency/availability requirements for model servers...
For bonus points:
- track metrics and artifacts using a local mlflow deployment.
- compare performance for different models.
- examine feature importance to remove unnecessary (or net-negative) features.
- use an NN model and train on GPU. Use profiling tools (depends on the framework) and NVIDIA Nsight to examine performance. Optimize.
- host a big model on GPU. Profile and optimize.
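The feature-importance bonus point can be sketched like this (a toy example assuming scikit-learn, with a planted pure-noise column standing in for a "net-negative" feature you'd want to remove):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=(n, 2))   # two informative features
noise = rng.normal(size=(n, 1))    # one pure-noise feature
X = np.hstack([signal, noise])
y = (signal[:, 0] + signal[:, 1] > 0).astype(int)

clf = LogisticRegression().fit(X, y)

# For a linear model on comparable-scale features, coefficient magnitude
# is a cheap importance signal; the noise column should rank last.
importance = np.abs(clf.coef_[0])
print("coef magnitudes:", importance)

# Model-agnostic cross-check: shuffle each column, measure the score drop.
perm = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
print("permutation importances:", perm.importances_mean)
```

Coefficient magnitude only works for linear models on standardized features; permutation importance generalizes to any model, at the cost of extra scoring passes.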
IMO: the biggest missing piece for ML systems/platform engineers is how to feed GPUs. If you can right-size workloads and keep a GPU fed, you'll get hired. MLE workloads vary wildly (ratio of data volume in vs. compute; size of model; balancing CPU compute for feature processing against GPU compute for model training). We're all working under massive GPU scarcity.
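That balancing act boils down to a producer/consumer pipeline. A toy sketch with stdlib threads and a bounded queue, where the queue depth plays the role of a data loader's prefetch buffer (the sleeps are placeholders for real CPU preprocessing and GPU compute, not actual workloads):

```python
import queue
import threading
import time

BATCHES = 8

def cpu_preprocess(out_q: queue.Queue) -> None:
    for i in range(BATCHES):
        time.sleep(0.01)   # simulated feature processing on CPU
        out_q.put(i)       # hand a "batch" to the accelerator
    out_q.put(None)        # sentinel: no more data

def gpu_train(in_q: queue.Queue, results: list) -> None:
    while (batch := in_q.get()) is not None:
        time.sleep(0.01)   # simulated forward/backward pass on GPU
        results.append(batch)

q = queue.Queue(maxsize=2)  # bounded prefetch buffer
results: list = []
producer = threading.Thread(target=cpu_preprocess, args=(q,))
consumer = threading.Thread(target=gpu_train, args=(q, results))
producer.start(); consumer.start()
producer.join(); consumer.join()
print(f"processed {len(results)} batches")
```

Overlapped like this, total time approaches max(CPU time, GPU time) plus one pipeline fill, instead of their sum. When the CPU side is the slower leg, the GPU starves, which is exactly the problem frameworks attack with multi-worker data loaders and prefetching.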
Curious: which part of the pipeline does the majority of 'business' value come from?