It seems like an inevitable outcome of this is elaborate system-gaming to mitigate how much of employees' work falls under S174…
I've built a lot of LLM applications with web browsing in them. Allow/block lists are easy to implement with most web search APIs, but multi-hop browsing gets really hairy (and expensive) to do well because it usually requires context from the URLs themselves.
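For reference, the allow/block list part really is just a domain filter over whatever the search API returns. The sketch below assumes a generic response shape (a list of dicts with a "url" key), not any particular vendor's API, and the domain lists are made up:

```python
from urllib.parse import urlparse

# Hypothetical domain lists for illustration only.
ALLOW_DOMAINS = {"nih.gov", "nature.com"}
BLOCK_DOMAINS = {"pinterest.com"}

def _matches(host: str, domains: set[str]) -> bool:
    """True if host is one of the domains or a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)

def filter_results(results: list[dict]) -> list[dict]:
    """Drop blocked domains; if an allowlist is set, keep only allowed domains."""
    kept = []
    for r in results:
        host = urlparse(r["url"]).netloc.lower()
        if _matches(host, BLOCK_DOMAINS):
            continue
        if ALLOW_DOMAINS and not _matches(host, ALLOW_DOMAINS):
            continue
        kept.append(r)
    return kept
```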
The thing I'm still not seeing addressed here, which is what makes LLM web browsing particularly difficult, is the mismatch between search result relevance and LLM relevance. Getting a diverse list of links is great when searching Google because there is less context per query, but what I really need from an out-of-the-box LLM web browsing API is reranking based on the richer context provided by a message thread/prompt.
For example, writing an article about the side effects of Accutane should err on the side of pulling in research articles first, for higher-quality information, rather than blog posts.
It's possible to do this reranking decently well with LLMs (I do it in the "agents" I've written), but I haven't seen it highlighted by anyone thus far, including in this announcement.
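Roughly, the reranking I mean looks like the sketch below: score each result against the full conversation context, then sort. `call_llm`, the 0–10 scoring prompt, and the `title`/`snippet` result fields are all placeholders I'm assuming for illustration, not any specific provider's API:

```python
def call_llm(prompt: str) -> str:
    # Placeholder: plug in whatever completion client you actually use.
    raise NotImplementedError

def rerank(results: list[dict], conversation: str, top_k: int = 5) -> list[dict]:
    """Rerank search results by LLM-judged usefulness for the conversation."""
    scored = []
    for r in results:
        prompt = (
            "On a scale of 0-10, how useful is this source for the task below?\n"
            f"Task context:\n{conversation}\n\n"
            f"Source title: {r['title']}\n"
            f"Source snippet: {r['snippet']}\n"
            "Reply with a single integer."
        )
        try:
            score = int(call_llm(prompt).strip())
        except ValueError:
            score = 0  # unparsable reply -> treat as irrelevant
        scored.append((score, r))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in scored[:top_k]]
```

In practice I batch these judgments and cache them per URL, since scoring every result with a separate call gets expensive fast.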
One addition: the O(n^2) compute cost is most acute during the one-time prefill of the input prompt. I think the real bottleneck, however, is the KV cache during the decode phase.
For each new token generated, the model must access the intermediate state of all previous tokens. That state is held in the KV cache, which grows linearly with sequence length and consumes an enormous amount of expensive GPU VRAM. Generation speed is therefore limited more by memory bandwidth than by compute.
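To make that concrete, the usual back-of-the-envelope estimate is 2 (keys and values) x layers x KV heads x head dim x sequence length x bytes per element. The config numbers below are made up for illustration, not any specific model's:

```python
# Back-of-the-envelope KV cache size for a hypothetical model config.
num_layers = 80
num_kv_heads = 8      # grouped-query attention
head_dim = 128
bytes_per_elem = 2    # fp16/bf16
seq_len = 128_000     # long input prompt

# 2x for keys and values; grows linearly with sequence length.
kv_cache_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem
print(f"{kv_cache_bytes / 2**30:.1f} GiB per sequence")  # ~39 GiB
```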
Viewed this way, Google's 2x price hike on input tokens is probably related to the KV cache, which supports the article’s “workload shape” hypothesis. A long input prompt creates a huge memory footprint that must be held for the entire generation, even if the output is short.