Back when that M1 Max vs 3090 blog post was released, I ran those same tests on the M1 Pro (16GB), Google Colab Pro, and the free GPUs (RTX4000, RTX5000) on the Paperspace Pro plan.
To make a long story short, I don't think buying any M1 chip makes sense if your primary purpose is Deep Learning. If you are just learning or playing around with DL, Colab Pro and the M1 Max provide similar performance. But Colab Pro is ~$10/month, and upgrading any laptop to the M1 Max costs at least $600, which buys you five years of Colab Pro.
The "free" RTX5000 on Paperspace Pro (~$8/month) is much faster than the M1 Max and Colab Pro, especially with fp16 and XLA, although the RTX5000 isn't always available. The free RTX4000 is also faster than the M1 Max, though you need to use smaller batch sizes due to its 8GB of VRAM.
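If you want to try the fp16 + XLA combination yourself, here's a minimal Keras sketch. The toy model is just an illustration, and it assumes a reasonably recent TF (the `jit_compile` argument to `compile()` needs roughly 2.7+):

```python
import tensorflow as tf

# Compute in float16 where safe; variables stay in float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="relu", input_shape=(784,)),
    # Keep the final softmax in float32 for numerical stability.
    tf.keras.layers.Dense(10, activation="softmax", dtype="float32"),
])

# jit_compile=True routes the train/predict steps through XLA.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              jit_compile=True)
```

On cards with tensor cores (like the RTX4000/5000), the mixed-precision policy is where most of the speedup comes from; XLA adds some more on top by fusing ops.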
If you assume that the M1 Ultra doubles the performance of the M1 Max, in the same way the M1 Max seems to double the GPU performance of the M1 Pro, it still doesn't make sense from a cost perspective. If you are a serious DL practitioner, putting that money towards cloud resources or a 3090 makes a lot more sense than buying the M1 Ultra.
Apple Silicon (including the base M1) actually has great FP16 support at the hardware level, including FP16/FP32 conversions, so it is wrong to say it only supports FP32.
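You can see this for yourself with a quick sketch (assuming PyTorch 1.12+ with the MPS backend; the matrix sizes are arbitrary):

```python
import torch

# The MPS backend targets the Apple Silicon GPU (PyTorch 1.12+).
assert torch.backends.mps.is_available()

# Allocate half-precision tensors directly on the GPU...
a = torch.randn(4096, 4096, device="mps", dtype=torch.float16)
b = torch.randn(4096, 4096, device="mps", dtype=torch.float16)

# ...and the matmul runs on-device in FP16.
c = a @ b
print(c.dtype)  # torch.float16

# Conversions work too: cast back up to FP32 on the GPU.
c32 = c.float()
print(c32.dtype)  # torch.float32
```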