(Yes, I realize it's probably more than 4MB, but it's still an outrageously high markup. They could do their own caching, not tell you they're doing it, keep the difference, and make even more money.)
Size of KV cache = 2 * (num_layers) * (num_kv_heads * dim_head) * seq_length * precision
8-bit Gemma 27B KV cache = 2 * (46) * (16 * 144) * 1e6 * 1 byte ≈ 200 GB
Note that this doesn't take into account further optimizations that Google might be using.

Formula: https://developer.nvidia.com/blog/mastering-llm-techniques-i...
Gemma 27B config: https://huggingface.co/google/gemma-2-27b/blob/main/config.j...
It has an increased vocab size of 200k.
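For reference, here is a small Python sketch of the same arithmetic. The layer/head numbers are the ones quoted above; treat them as assumptions and check the Gemma 27B config for the exact values:

```python
# Rough KV-cache size estimate, following the formula above:
# 2 (keys + values) * layers * (kv_heads * head_dim) * seq_length * bytes_per_value

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_length: int, bytes_per_value: int) -> int:
    """One entry per layer, per KV head, per head dimension, per token, for both K and V."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_length * bytes_per_value

# 8-bit Gemma 27B at a 1M-token context, using the numbers quoted above
size = kv_cache_bytes(num_layers=46, num_kv_heads=16, head_dim=144,
                      seq_length=1_000_000, bytes_per_value=1)
print(f"{size / 1e9:.0f} GB")  # ~212 GB, i.e. roughly 200 GB
```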