I get around 140 ms per token running a 13B-parameter model on a ThinkPad laptop with a 6-core Intel i7-9750H. Because it's CPU inference, the initial prompt processing takes longer than on a GPU, so total latency is still higher than I'd like. I'm working on some caching solutions that should make this bearable for things like chat.
IIRC, GPT-3 cannot do chain-of-thought.
GPT-3.5 (what's being used here) is a little better at zero-shot in-context learning because it's been instruction fine-tuned, so it only needs the general format in the context.
The wider narrative is that current language models hallucinate and lie, and there is no coherent plan to avoid this. Google? Microsoft? That's a much less important question than whether anyone is going to push this party-trick-level technology onto a largely unsuspecting public.
Search is one application, and it might be crap right now, but for Microsoft it only needs to provide incremental value; for Google it's life or death. Microsoft is also better positioned in both the enterprise (Azure, Office 365, Teams) and developer (GitHub, VS Code) markets.
(Best + Worst + 4 * Average) / 6
One nice property is that the weighting adjusts the estimate for longer-tailed risks.
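As a sketch, the formula above (which resembles PERT three-point estimation, with Average standing in for the most-likely value) is just a weighted mean:

```c
// Weighted three-point estimate: (best + worst + 4 * average) / 6.
// Resembles the PERT formula; "average" stands in for the most-likely value.
double estimate(double best, double worst, double average) {
    return (best + worst + 4.0 * average) / 6.0;
}
```

Because the middle value gets two thirds of the weight, a single pessimistic outlier pulls the result up without dominating it.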
After what felt like endless googling, it's what I decided to spend some time with. I haven't had time to do much yet, so I can't say how it performs for me, but the idea and execution really resonate with me.
Not fully featured yet, but what I'd like to eventually do is set it up similarly to the mermaidjs editor [3], which encodes the entire diagram in the URL. That makes it really easy to link to from markdown documents, and it has the nice benefit that the diagram is immutable for a given URL, so you don't need a backend to store anything.
[1]: https://www.npmjs.com/package/pikchr-js
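The URL-encoding idea can be sketched minimally. The mermaidjs editor actually deflate-compresses the source before encoding it; this hypothetical sketch shows just the URL-safe base64 step, which is enough to put short diagram source into a link fragment:

```c
#include <stddef.h>

// URL-safe base64 alphabet ('-' and '_' instead of '+' and '/').
static const char tbl[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

// Encode src[0..len) into dst (caller-provided, at least 4*(len/3)+5 bytes),
// without padding. Returns dst. Hypothetical helper, not part of pikchr-js.
char *b64url_encode(const unsigned char *src, size_t len, char *dst) {
    size_t i, j = 0;
    for (i = 0; i + 2 < len; i += 3) {
        unsigned v = (src[i] << 16) | (src[i + 1] << 8) | src[i + 2];
        dst[j++] = tbl[(v >> 18) & 63];
        dst[j++] = tbl[(v >> 12) & 63];
        dst[j++] = tbl[(v >> 6) & 63];
        dst[j++] = tbl[v & 63];
    }
    if (len - i == 1) {                       // one trailing byte
        unsigned v = src[i] << 16;
        dst[j++] = tbl[(v >> 18) & 63];
        dst[j++] = tbl[(v >> 12) & 63];
    } else if (len - i == 2) {                // two trailing bytes
        unsigned v = (src[i] << 16) | (src[i + 1] << 8);
        dst[j++] = tbl[(v >> 18) & 63];
        dst[j++] = tbl[(v >> 12) & 63];
        dst[j++] = tbl[(v >> 6) & 63];
    }
    dst[j] = '\0';
    return dst;
}
```

The encoded string can then be appended as `#encoded-source` to the editor URL, giving the immutable, backend-free links described above.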
The proper remedy is to simply wrap the string to parse with fmemopen(3), which makes the temporary FILE object explicit and persistent for the whole parse, and needs just one strlen call.
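A minimal sketch of that approach: fmemopen(3) exposes the in-memory string as a FILE*, so existing FILE-based parsing code can consume it directly. The fscanf-based integer summer here is just a stand-in for the real parser:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>

// Wrap a string in a FILE* via fmemopen(3) and parse it with stdio.
// Returns the sum of whitespace-separated integers, or -1 on error.
int sum_ints_in_string(const char *s) {
    FILE *f = fmemopen((void *)s, strlen(s), "r"); /* the one strlen call */
    if (!f)
        return -1;
    int n, total = 0;
    while (fscanf(f, "%d", &n) == 1)
        total += n;
    fclose(f);
    return total;
}
```

The FILE object lives until fclose, so the whole parse runs against one explicit stream rather than repeated temporary buffers.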
- Human Reading Speed (English): ~250 words per minute
- Human Speaking Speed (English): ~150 words per minute
These should be treated like the Doherty Threshold [1] for generative content.
[1] https://lawsofux.com/doherty-threshold/
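Converting those rates into per-word latency budgets is simple arithmetic (a rough sketch; mapping words to model tokens depends on the tokenizer, so any tokens-per-word factor would be an assumption):

```c
// Per-word latency budget implied by a words-per-minute rate.
// ~250 wpm reading -> 240 ms/word; ~150 wpm speaking -> 400 ms/word.
double ms_per_word(double wpm) {
    return 60000.0 / wpm;
}
```

A generator that streams words slower than the reader's budget makes the user wait, which is exactly the threshold-crossing the Doherty framing warns about.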