even though I'm normally a fan of release early and often, deepseek releases often lose some impact because they tend to ship the model and the evals on different days. it wouldn't hurt anyone to wait a day and release both together so the conversations are more substantive.
ofc deepseek is getting the highest-order bit right: just train a good model and let everyone figure it out on their own time.
i love this but don't use emacs. i wish existing tools were lighter weight at video trimming. Screenflow is the lightest i know of, but it fails badly on some video formats and sometimes OOMs. if it had a streaming architecture that processed bytes as it reads them instead of loading the whole file, i suspect it wouldn't have that problem.
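for context, this is roughly what i mean by streaming: ffmpeg's stream copy remuxes packets as it reads them rather than decoding or buffering the whole file, so a trim stays cheap on memory. a minimal sketch, with made-up paths and timestamps (not anything Screenflow actually does):

```python
# Trim a clip without re-encoding by copying packets straight through.
# Paths and timestamps are illustrative only.
import subprocess

def trim(src: str, dst: str, start: str, end: str) -> None:
    """Cut [start, end] out of src into dst using ffmpeg stream copy."""
    subprocess.run(
        [
            "ffmpeg",
            "-i", src,
            "-ss", start,   # start timestamp of the clip
            "-to", end,     # end timestamp of the clip
            "-c", "copy",   # copy audio/video packets as-is: no decode, no full-file buffering
            dst,
        ],
        check=True,
    )

trim("screen_recording.mov", "clip.mp4", "00:01:30", "00:02:45")
```

the tradeoff with stream copy is that video cuts land on keyframes, so the trim points aren't frame-accurate, but memory use stays flat regardless of file size.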
Doing hyperparameter sweeps on lots of small models to find the optimal values at each size, then fitting scaling laws to predict the hyperparameters to use for larger models, seems to work reasonably well. I think https://arxiv.org/abs/2505.01618 is the latest advance in that vein.
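a minimal sketch of that sweep-then-extrapolate recipe (not the linked paper's actual method; the sizes, learning rates, and the power-law form are all assumptions for illustration):

```python
# Fit a power law lr = a * N^b to the best learning rate found at each small
# model size, then extrapolate to a larger target size.
import numpy as np

# Hypothetical sweep results: best LR found by grid search at each model size.
params = np.array([10e6, 25e6, 50e6, 100e6, 200e6])       # parameter counts
best_lr = np.array([3e-3, 2.2e-3, 1.6e-3, 1.2e-3, 9e-4])  # best LR at each size

# A power law is a straight line in log-log space, so fit it by least squares.
b, log_a = np.polyfit(np.log(params), np.log(best_lr), deg=1)

def predict_lr(n_params: float) -> float:
    """Extrapolate the fitted power law to a larger model size."""
    return float(np.exp(log_a) * n_params ** b)

# Predicted learning rate for a (hypothetical) 7B-parameter model.
print(f"exponent b = {b:.3f}, predicted LR at 7B params = {predict_lr(7e9):.2e}")
```

the same fit applies to other hyperparameters (batch size, weight decay, warmup) as long as the trend really is a power law in model size, which is itself an assumption you have to check against the small-model sweeps.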