As always, take those t/s stats with a huge boulder of salt. The demo shows a question "solved" in under 500 tokens. It's still amazing that it works at all, but you'll get nowhere near those speeds on real-world problems at the context lengths where "thinking" models become useful (8-16k tokens). Even EPYCs with lots of memory channels drop to 2-4 t/s past ~4096 tokens of context.
* pos=0 => P 138 ms S 864 kB R 1191 kB Connect
* pos=2000 => P 215 ms S 864 kB R 1191 kB .
* pos=4000 => P 256 ms S 864 kB R 1191 kB manager
* pos=6000 => P 335 ms S 864 kB R 1191 kB the
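For what it's worth, a quick back-of-the-envelope on the per-token timings quoted above (assuming P is the per-token processing time in ms and that it keeps growing roughly linearly with position, which is only an approximation):

```python
# Fit a line to the quoted per-token times (P ms at pos 0/2000/4000/6000)
# and project effective t/s at the 8-16k contexts mentioned above.
import numpy as np

pos = np.array([0, 2000, 4000, 6000])    # context position (tokens)
p_ms = np.array([138, 215, 256, 335])    # per-token processing time (ms), from the log above

slope, intercept = np.polyfit(pos, p_ms, 1)  # ms of extra latency per token of context

for ctx in (8000, 16000):
    per_token_ms = intercept + slope * ctx
    print(f"pos={ctx}: ~{per_token_ms:.0f} ms/token -> ~{1000 / per_token_ms:.1f} t/s")
```

That extrapolates to roughly 2.5 t/s at 8k and ~1.5 t/s at 16k context, i.e. right around (or below) the 2-4 t/s ballpark mentioned above, assuming the scaling stays linear.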