Beyond the obvious chatbots and coding copilots, curious what people are actually shipping with LLMs. Internal tools? Customer-facing features? Any economically useful agents out there in the wild?
Analyzing firehoses of data. RSS feeds, releases, stuff like that. My job involves curating information and while I still do that process by hand, LLMs make my net larger and help me find more signals. This means hallucinations or mistakes aren't a big deal, since it all ends up with me anyway. I'm quite bullish on using LLMs as extra eyes, rather than as extra hands where they can run into trouble.
Is cost a major consideration for you here? Like if you're dealing with firehose data which I'm assuming is fairly high throughput, do you see an incentive for potentially switching to a more specific NLP classifier model rather than sticking with generative LLMs? Or is it that this is good enough/the ROI of switching isn't attractive? Or is the generative aspect adding something else here?
If you do the calculations against the cheapest available models (GPT-4.1 nano, Gemini 1.5 Flash 8B, and Amazon Nova Micro, for example - I have a table at https://www.llm-prices.com/ ), it is shockingly inexpensive to process even really large volumes of text.
$20 could cover half a billion tokens with those models! That's a lot of firehose.
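The arithmetic is simple enough to sketch. The price below is an assumption (~$0.04 per million input tokens, roughly where the cheapest models named above sit); substitute current pricing for your model:

```python
# Back-of-envelope math for bulk LLM processing. The price is an
# assumption (~$0.04 per million input tokens); real pricing varies
# by model and changes over time.
PRICE_PER_MILLION_TOKENS = 0.04  # USD, assumed

def tokens_for_budget(budget_usd: float) -> int:
    """How many input tokens a budget buys at the assumed rate."""
    return int(budget_usd / PRICE_PER_MILLION_TOKENS * 1_000_000)

print(f"{tokens_for_budget(20):,}")  # 500,000,000 tokens for $20
```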
I don't think everyone's using the term 'firehose' the same here. A child comment refers to half a billion tokens for $20.
I did some really basic napkin math with some Rails logs. One request with some extra junk in it was about 400 tokens according to the OpenAI tokenizer[0]. 500M/400 = ~1.25 million log lines.
Paying linearly for logs at $20 per 1.25 million lines is not reasonable for mid-to-high scale tech environments.
I think this would be sufficient if a 'firehose of data' is a bunch of news/media/content feeds that needs to be summarized/parsed/guessed at.
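To put numbers on "not reasonable at scale", the same napkin math can be extended to a sustained request rate. The 400-tokens-per-line and $20-per-500M-token figures are the assumptions from above, not measurements:

```python
# Extends the napkin math above: ~400 tokens per log line, $20 per
# 500M tokens. Both figures are this thread's assumptions.
TOKENS_PER_LINE = 400
USD_PER_TOKEN = 20 / 500_000_000

def monthly_log_cost(lines_per_second: float) -> float:
    """Rough monthly cost of running every log line through an LLM."""
    lines_per_month = lines_per_second * 60 * 60 * 24 * 30
    return lines_per_month * TOKENS_PER_LINE * USD_PER_TOKEN

print(round(monthly_log_cost(1_000)))  # a 1,000 req/s service: ~$41,472/month
```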
No. It's a tiny expense. I mostly use GPT 4.1 Mini for what I'm doing as it's the best balance between results and cost, but Gemini Flash can do the job just as well for a little less if I need it.
As other commenters have mentioned, a firehose can mean many things. For me it might be thousands of different reasonably small things a day which is dollars a day even in the worst case. If you were processing the raw X feed or the whole of Reddit or something, then all of your questions certainly become more relevant :-)
I can’t tell you what I’m working on but I can give you a real world example of where traditional models don’t work well.
Sentiment analysis is the "Hello World" of machine learning.
But I had a use case similar to a platform like Uber Eats, where someone can be critical of the service provider or of the platform itself. I needed to distinguish sentiment about the platform from sentiment about someone on the platform, based on reviews.
No matter what you do, people are going to conflate the reviews.
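A rough sketch of that dual-target setup: instead of one overall sentiment label, ask for a separate label per target and route the conflated cases explicitly. The `classify` callable here is a placeholder for whatever model call you use:

```python
# Sketch of two-target sentiment: a single review can praise the courier
# and slam the platform (or vice versa), so we ask for one label per
# target. `classify` is a stand-in for an LLM or fine-tuned model call.
VALID = {"positive", "negative", "neutral", "mixed"}

def review_sentiment(text: str, classify) -> dict:
    """Returns {'platform': label, 'provider': label}, validated."""
    result = classify(text)  # expected to return a dict with both keys
    for target in ("platform", "provider"):
        if result.get(target) not in VALID:
            result[target] = "mixed"  # conflated/unclear reviews land here
    return result

# Stub classifier just to show the contract:
stub = lambda text: {"platform": "negative", "provider": "positive"}
print(review_sentiment("Driver was great, app double-charged me", stub))
```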
As far as costs: I mentioned in another comment that I sometimes work with online call centers. There, anytime a person has to answer a call, it costs the company $2-$5.
One call deflection that saves the company $5 can pay for a lot of inference. It's literally at least 100x cheaper to use an LLM.
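For a sense of scale - the per-turn token count and model price below are my assumptions for illustration, only the $2-$5 human-call cost comes from the comment above:

```python
# Rough deflection economics: a human-handled call costs $2-$5 (from the
# comment above); the token count and model price are assumed.
HUMAN_CALL_COST = 5.00            # USD, upper bound quoted above
TOKENS_PER_TURN = 2_000           # assumed: prompt + response for one exchange
USD_PER_MILLION_TOKENS = 0.40     # assumed: a small/mid-tier model

llm_turn_cost = TOKENS_PER_TURN / 1_000_000 * USD_PER_MILLION_TOKENS
print(f"${llm_turn_cost:.4f} per LLM turn, "
      f"{HUMAN_CALL_COST / llm_turn_cost:,.0f}x cheaper than a human call")
```

Even with generous assumptions, the ratio lands well past the 100x claimed.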
We have a prompt that takes a job description and categorizes it based on whether it's an individual contributor role, manager, leadership, or executive, and also tags it based on whether it's software, mechanical, etc.
We scrape job sites and use that prompt to create tags which are then searchable by users in our interface.
It was a bit surprising to see how Karpathy described software 3.0 in his recent presentation because that's exactly what we're doing with that prompt.
Can you elaborate on what makes this “software 3.0”? I didn’t really understand what the distinction was in Karpathy’s talk, and felt like I needed a more concrete example. What you describe sounds cool, but I still feel like I’m not understanding what makes it “3.0”. I’m not trying to criticize, I really am trying to understand this concept.
> Can you elaborate on what makes this “software 3.0”?
Software 2.0: We need to parse a bunch of different job ads. We'll have a rule engine, decide based on keywords what to return, do some filtering, maybe even use semantic similarity to descriptions we know match a certain position, and so on.
Software 3.0: We need to parse a bunch of different job ads. Create a system prompt that says "You are a job description parser. Based on the user message, return a JSON structure with title, description, salary-range, company, position, experience-level", pass it the JSON schema of the structure you want, and you have a parser that is slow and sometimes incorrect, but that (most likely) covers a much broader range than your Software 2.0 parser.
Of course, this is wildly simplified and doesn't include everything, but that's the difference Karpathy is trying to highlight. Instead of programming those rules for the parser ourselves, you "program" the LLM via prompts to do that thing.
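That "prompt as program" shape can be sketched in a few lines. `call_llm` here is a placeholder for whatever chat-completion client you use, and the field names are illustrative:

```python
# Sketch of the "Software 3.0" parser described above: the program is a
# prompt plus an output schema, with validation around the model's output.
# `call_llm` is a stand-in for a real chat-completion client.
import json

SYSTEM_PROMPT = (
    "You are a job description parser. Based on the user message, return a "
    "JSON object with title, company, position, experience_level, salary_range."
)

SCHEMA_KEYS = {"title", "company", "position", "experience_level", "salary_range"}

def parse_job_ad(ad_text: str, call_llm) -> dict:
    """'Program' the LLM via the prompt, then validate its output."""
    raw = call_llm(system=SYSTEM_PROMPT, user=ad_text)
    parsed = json.loads(raw)
    missing = SCHEMA_KEYS - parsed.keys()
    if missing:
        raise ValueError(f"LLM omitted fields: {missing}")
    return parsed

# Stub standing in for a real model, just to show the contract:
fake_llm = lambda system, user: json.dumps({k: "..." for k in SCHEMA_KEYS})
print(parse_job_ad("Senior Rust engineer at Acme, $150k-$180k", fake_llm))
```

The validation step matters: "slow, sometimes incorrect" means you check the model's output the way you'd check any untrusted input.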
Built vaporlens.app in my free time using LLMs (specifically gemini, first 2.0-flash, recently moved to 2.5-flash).
It processes Steam game reviews and provides a one-page summary of what people think about the game. I've been gradually improving it and adding features from community feedback. It has been good fun.
I usually find that if a game is rated overwhelmingly positive, I'm gonna like it. The moment it's just mostly positive, it doesn't stay as a favorite for me.
Those games are usually brilliant - but those are very rare. Like "once in a few years" kind of rare IMO. While that is a valid approach, I play way more than that haha!
What I found interesting with Vaporlens is that it surfaces the things people think about a game - and if you find games where you like all the positives and don't mind the largest negatives (which are very often quite subjective), you're in for a pretty good time.
It's also quite amusing to me that using fairly basic vector similarity on points text resulted in a pretty decent "similar games" section :D
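That "fairly basic vector similarity" amounts to something like the following - the embeddings here are tiny placeholder vectors, where in practice they'd come from an embedding model run over each game's summary points:

```python
# Minimal sketch of the "similar games" idea: embed each game's summary
# points, then rank by cosine similarity. Vectors are placeholders; a
# real version would use an embedding model's output.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(target, catalog):
    """catalog: {name: vector}; names sorted by similarity to target."""
    return sorted(catalog, key=lambda name: cosine(target, catalog[name]), reverse=True)

games = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0]}
print(most_similar([1.0, 0.05], games))  # ['A', 'B', 'C']
```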
That rating is not (just) a function of the positive-to-negative ratio. A small number of reviews (i.e. a small game) can't reach that rating, even though it might be equally well received.
One of my clients is doing M&A like crazy, and we are now using it to help with directory merging. Every HR and IT department does things a little differently, and we want to match them to our predefined roles for app licensing and access control.
You used to either budget for data entry or just graft directories together in a really ugly way. The forest used to know about 12,000 unique access roles; now there are only around 170.
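Very roughly, that consolidation can look like mapping each legacy role name to one of the predefined roles and flagging anything unplaceable for human review. The role names and the `match` callable below are made up for illustration:

```python
# Rough sketch of collapsing thousands of legacy access roles into a
# small predefined set. Role names are invented; `match` stands in for
# the LLM (or embedding lookup) doing the fuzzy mapping.
PREDEFINED_ROLES = {"engineering_ic", "finance_analyst", "hr_admin"}  # ~170 in practice

def consolidate(legacy_roles, match):
    """Returns (mapping, needs_review); unplaceable roles go to a human."""
    mapping, needs_review = {}, []
    for role in legacy_roles:
        target = match(role)
        if target in PREDEFINED_ROLES:
            mapping[role] = target
        else:
            needs_review.append(role)
    return mapping, needs_review

# Stub matcher just to show the shape:
stub = lambda r: "engineering_ic" if "Eng" in r else "unknown"
print(consolidate(["Eng Level I", "LegacyRole_47"], stub))
```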
Mostly for understanding existing code base and making changes to it. There are tons of unnecessary abstractions and indirections in it so it takes a long time for me to follow that chain. Writing Splunk queries is another use.
People use it to generate meeting notes. I don't like it and don't use it.
I created an agent to scan niche independent cinemas and create a repository of everything playing in my city. I have an LLM-heavy workflow to scrape, clean, classify and validate the data. It can handle any page I throw at it with ease, and it's very accurate as well - less than a 5% error rate right now.
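That scrape → clean → classify → validate flow might look like the skeleton below. The stages are stubs (a real pipeline would use a proper HTML parser, and `classify` is wherever the LLM plugs in); the validation stage is what keeps the error rate low:

```python
# Skeleton of a scrape -> clean -> classify -> validate pipeline.
# Stages are stubs for illustration; `classify` is the LLM call.
import re

def clean(raw_html: str) -> str:
    """Crude tag stripping; a real pipeline would use an HTML parser."""
    return re.sub(r"<[^>]+>", " ", raw_html).strip()

def validate(listing: dict) -> bool:
    """Reject incomplete results rather than storing model mistakes."""
    return bool(listing.get("title")) and bool(listing.get("showtime"))

def pipeline(pages, classify):
    results = []
    for page in pages:
        listing = classify(clean(page))
        if validate(listing):
            results.append(listing)
    return results

# Stub classifier standing in for the LLM:
stub = lambda text: {"title": text.split()[0] if text else "", "showtime": "19:30"}
print(pipeline(["<b>Dune</b> at 19:30", ""], stub))
```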
Chatbots with some nuance. I work with voice and chat call centers hosted on Amazon Connect - the AWS version of the call center that Amazon uses internally.
Traditionally (and still how it works in most call centers), you have to explicitly list out the things you can handle (intents), the sentences that trigger them (utterances), and slots - i.e., in "I want to get a flight from {origin} to {destination}", the variable parts are the slots.
Anyway, absolutely no company would or should trust an LLM to generate output directly to a customer. It never ends well. I use gen AI to categorize free-text input from a customer into a set of intents the system can handle and to fill in the slots, but the output is very much on rails.
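The "on rails" pattern can be sketched as: the LLM only classifies into a fixed intent set, and every customer-facing string is canned. Intent names and replies below are invented for illustration:

```python
# Sketch of the "on rails" pattern: the LLM classifies free text into a
# fixed intent set and fills slots; customer-facing text is never
# model-generated. Intents and replies here are invented examples.
ALLOWED_INTENTS = {"book_flight", "cancel_booking", "agent_handoff"}
CANNED_REPLIES = {
    "book_flight": "Sure - which dates work for you?",
    "cancel_booking": "I can help cancel that. What's your booking reference?",
    "agent_handoff": "Let me connect you with an agent.",
}

def route(llm_output: dict) -> str:
    """Validate the model's classification; anything odd goes to a human."""
    intent = llm_output.get("intent")
    if intent not in ALLOWED_INTENTS:
        intent = "agent_handoff"  # never trust free-form model output
    return CANNED_REPLIES[intent]

print(route({"intent": "book_flight", "slots": {"origin": "LHR"}}))
```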
[0] https://platform.openai.com/tokenizer
However, review positivity is usually the best indicator of sales - it's so accurate that there are algorithms that rely entirely on it.
We're delivering confusion and thanks to LLMs we're 30% more efficient doing it
It works a lot better than the old school method.