It seems like an inevitable outcome of this is elaborate system-gaming to mitigate how much of employees' work falls under S174…
I've built a lot of LLM applications with web browsing in them. Allow/block lists are easy to implement with most web search APIs, but multi-hop browsing gets really hairy (and expensive) to do well because it usually requires context from the URLs themselves.
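For reference, the allow/block list part really is just a domain filter over whatever the search API returns. The sketch below assumes a generic response shape (a list of dicts with a "url" key), not any particular vendor's API, and the domain lists are made up:

```python
from urllib.parse import urlparse

# Hypothetical domain lists for illustration only.
ALLOW_DOMAINS = {"nih.gov", "nature.com"}
BLOCK_DOMAINS = {"pinterest.com"}

def _matches(host: str, domains: set[str]) -> bool:
    """True if host is one of the domains or a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)

def filter_results(results: list[dict]) -> list[dict]:
    """Drop blocked domains; if an allowlist is set, keep only allowed domains."""
    kept = []
    for r in results:
        host = urlparse(r["url"]).netloc.lower()
        if _matches(host, BLOCK_DOMAINS):
            continue
        if ALLOW_DOMAINS and not _matches(host, ALLOW_DOMAINS):
            continue
        kept.append(r)
    return kept
```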
The thing I'm still not seeing addressed here, which is what makes LLM web browsing particularly difficult, is the mismatch between search result relevance and LLM relevance. Getting a diverse list of links is great when searching Google because there is less context per query, but what I really need from an out-of-the-box LLM web browsing API is reranking based on the richer context provided by a message thread/prompt.
For example, writing an article about the side effects of Accutane should err on the side of pulling in research articles first, for higher-quality information, rather than blog posts.
It's possible to do this reranking decently well with LLMs (I do it in the "agents" I've written), but I haven't seen it highlighted by anyone thus far, including in this announcement.
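Roughly, the reranking I mean looks like the sketch below: score each result against the full conversation context, then sort. `call_llm`, the 0–10 scoring prompt, and the `title`/`snippet` result fields are all placeholders I'm assuming for illustration, not any specific provider's API:

```python
def call_llm(prompt: str) -> str:
    # Placeholder: plug in whatever completion client you actually use.
    raise NotImplementedError

def rerank(results: list[dict], conversation: str, top_k: int = 5) -> list[dict]:
    """Rerank search results by LLM-judged usefulness for the conversation."""
    scored = []
    for r in results:
        prompt = (
            "On a scale of 0-10, how useful is this source for the task below?\n"
            f"Task context:\n{conversation}\n\n"
            f"Source title: {r['title']}\n"
            f"Source snippet: {r['snippet']}\n"
            "Reply with a single integer."
        )
        try:
            score = int(call_llm(prompt).strip())
        except ValueError:
            score = 0  # unparsable reply -> treat as irrelevant
        scored.append((score, r))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in scored[:top_k]]
```

In practice I batch these judgments and cache them per URL, since scoring every result with a separate call gets expensive fast.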
One addition: the O(n^2) compute cost is most acute during the one-time prefill of the input prompt. I think the real bottleneck, however, is the KV cache during the decode phase.
For each new token generated, the model must access the intermediate state of all previous tokens. That state is held in the KV cache, which grows linearly with sequence length and consumes an enormous amount of expensive GPU VRAM. Generation speed is therefore limited more by memory bandwidth than by compute.
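To make that concrete, the usual back-of-the-envelope estimate is 2 (keys and values) x layers x KV heads x head dim x sequence length x bytes per element. The config numbers below are made up for illustration, not any specific model's:

```python
# Back-of-the-envelope KV cache size for a hypothetical model config.
num_layers = 80
num_kv_heads = 8      # grouped-query attention
head_dim = 128
bytes_per_elem = 2    # fp16/bf16
seq_len = 128_000     # long input prompt

# 2x for keys and values; grows linearly with sequence length.
kv_cache_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem
print(f"{kv_cache_bytes / 2**30:.1f} GiB per sequence")  # ~39 GiB
```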
Viewed this way, Google's 2x price hike on input tokens is probably related to the KV cache, which supports the article’s “workload shape” hypothesis. A long input prompt creates a huge memory footprint that must be held for the entire generation, even if the output is short.