pclowes · 2 months ago
Directionally I think this is right. Most LLM usage at scale tends to be filling the gaps between two hardened interfaces. The reliability comes not from the LLM inference and generation but from the interfaces themselves, which only allow certain configurations to work with them.

LLM output is often coerced back into something more deterministic, such as types or DB primary keys. The value of the LLM is determined by how well your existing code and tools model the data, logic, and actions of your domain.
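As a sketch of what that coercion step can look like (assuming Pydantic v2; the schema and values here are hypothetical):

```python
from pydantic import BaseModel, ValidationError

class OrderAction(BaseModel):
    order_id: int   # coerced to the DB primary key's type
    action: str     # e.g. "refund" or "cancel"

# Stand-in for raw model output; note the stringly-typed id.
raw = '{"order_id": "42", "action": "refund"}'

try:
    parsed = OrderAction.model_validate_json(raw)
    print(parsed.order_id)  # 42, as an int
except ValidationError:
    pass  # re-prompt, retry, or escalate to a human
```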

In some ways I view LLMs today a bit like 3D printers, both in terms of hype and in terms of utility. They excel at quickly connecting parts, much as rapid prototyping does with 3D-printed parts. For reliability and scale you want either the LLM or an engineer to replace the printed/inferred connector with something durable and deterministic (metal/code) that is cheap and fast to run at scale.

Additionally, there was a minute during the 3D printer Gartner hype cycle when the notion was that we would all just print substantial amounts of consumer goods, when in reality the high-utility use cases are much narrower. There is a parallel here to LLM usage. While LLMs are extremely useful, we cannot rely on them to generate or infer our entire operational reality, or even engage meaningfully with it, without some sort of pre-existing digital modeling as an anchor.

foobarbecue · 2 months ago
Hype cycle for drones and VR was similar -- at the peak, you have people claiming drones will take over package delivery and everyone will spend their day in VR. The reality is that the applicability is much narrower.
golergka · 2 months ago
People claimed that we would spend most of our day on the internet in the mid-90s, and then the dotcom bubble burst. And then people claimed that by 2015 robo-taxis would be around all the major cities of the planet.

You can be right but too early. There was a hype wave for drones and VR (more than one for the latter), but I wouldn't be so sure we've already seen the peak of their real-world usage.

ivape · 2 months ago
You checked out drone warfare? It’s all the rage in every conflict at the moment. The hype around drones is not fake, and I’d compare it more to autonomous cars because regulation is the only reason you don’t see a million private drones flying around.
soulofmischief · 2 months ago
That's the claim for AR, not VR. And you're just noticing how research and development cycles play out; you can draw comparisons to literally any technology cycle.
freetinker · 2 months ago
Drones are delivering. Ukraine?
jangxx · 2 months ago
I mean both of these things are actually happening (drone deliveries and people spending a lot of time in VR), just at a much much smaller scale than it was hyped up to be.
threatofrain · 2 months ago
Good drones are very Chinese atm, as is casual consumer drone delivery. Americans might be more than a decade away even with concerted bipartisan war-like effort to boost domestic drone competency.

The reality is Chinese.

whiplash451 · 2 months ago
Interesting take but too bearish on LLMs in my opinion.

LLMs have already found large-scale usage (deep research, translation), which makes them more ubiquitous today than 3D printers ever will be or could have been.

benreesman · 2 months ago
What we call an LLM today (by which almost everyone means an autoregressive language model from the Generative Pretrained Transformer family tree; BERTs are still doing important work, believe that) is actually an offshoot of neural machine translation.

This isn't (intentionally at least) mere HN pedantry: they really do act like translation tools in a bunch of observable ways.

And while they have recently crossed the threshold into "yeah, I'm always going to have a gptel buffer open now" territory at the extreme high end, their utility outside of the really specific, totally non-generalizing code lookup gizmo usecase remains a claim unsupported by robust profits.

There is a hole in the ground containing something between 100 billion and a trillion dollars, and so far it has about 20B in revenue (not profit) going into it annually.

AI is going to be big (it was big ten years ago).

LLMs? Look more and more like the Metaverse every day as concerns the economics.

skeeter2020 · 2 months ago
The author is not bearish on LLMs at all; this post is about using LLMs and code vs. LLMs with autonomous tools via MCP. An example from your set would be translation. The author says you'll get better results if you do something like ask an LLM to translate documents, review the proposed approach, ask it to review its work, and maybe ask another LLM to validate the results, than if you say "you've got 10K documents in English, and these tools - I speak French".
kibwen · 2 months ago
No, 3D printers are the backbone of modern physical prototyping. They're far more important to today's global economy than LLMs are, even if you don't have the vantage point to see it from your sector. That might change in the future, but snapping your fingers to wink LLMs out of existence would change essentially nothing about how the world works today; it would be a non-traumatic non-event. There just hasn't been time to integrate them into any essential processes.
datameta · 2 months ago
Without trying to take away from your assertion, I think it is worthwhile to mention that part of this phenomenon is the unavoidable matter of meatspace being expensive and dataspace being intangibly present everywhere.
deadbabe · 2 months ago
Large-scale usage in niche domains is still small-scale overall.
dingnuts · 2 months ago
And yet you didn't provide a single reference link! Every case of LLM usage that I've seen claimed about those things has been largely a lie -- guess you won't take the opportunity to be the first to present a real example. Just another rumor.
nativeit · 2 months ago
[citation needed]
hk1337 · 2 months ago
> Directionally I think this is right.

We have a term we use at work, "directionally accurate", for when something is not entirely accurate but is headed in the right direction.

abdulhaq · 2 months ago
this is a really good take
simonw · 2 months ago
Something I've realized about LLM tool use is that it means that if you can reduce a problem to something that can be solved by an LLM in a sandbox using tools in a loop, you can brute force that problem.

The job then becomes identifying those problems and figuring out how to configure a sandbox for them, what tools to provide and how to define the success criteria for the model.

That still takes significant skill and experience, but it's at a higher level than chewing through that problem using trial and error by hand.

My assembly Mandelbrot experiment was the thing that made this click for me: https://simonwillison.net/2025/Jul/2/mandelbrot-in-x86-assem...
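A minimal sketch of that "tools in a loop" shape (everything here is hypothetical: the sandbox runner, the success check, and the `call_llm` helper):

```python
import subprocess

def run_in_sandbox(cmd: list[str]) -> str:
    """Run one command inside the sandbox and return combined output."""
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
    return result.stdout + result.stderr

def success(output: str) -> bool:
    """The success criteria defined up front, e.g. the test suite passing."""
    return "ALL TESTS PASSED" in output

transcript = []  # full history of (command, output) fed back to the model
for _ in range(50):  # brute force: many cheap attempts instead of one careful one
    cmd = call_llm(transcript)  # hypothetical helper: model proposes the next command
    output = run_in_sandbox(cmd)
    transcript.append((cmd, output))
    if success(output):
        break
```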

vunderba · 2 months ago
> The job then becomes identifying those problems and figuring out how to configure a sandbox for them, what tools to provide, and how to define the success criteria for the model.

Your test case seems like a quintessential example where you're missing that last step.

Since it is unlikely that you understand the math behind fractals or x86 assembly (apologies if I'm wrong on this), your only means for verifying the accuracy of your solution is a superficial visual inspection, e.g. "Does it look like the Mandelbrot series?"

Ideally, your evaluation criteria would be expressed as a continuous function, but at the very least, it should take the form of a sufficiently diverse quantifiable set of discrete inputs and their expected outputs.
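For the Mandelbrot case, such a discrete set could be as small as a handful of points whose membership is known analytically (a sketch; the generated renderer would be checked against this reference predicate instead of eyeballs):

```python
# Points whose Mandelbrot membership is known analytically: the expected
# outputs a generated solution must reproduce.
CASES = [
    (complex(0, 0), True),     # orbit fixed at 0
    (complex(-1, 0), True),    # orbit cycles 0, -1, 0, -1, ...
    (complex(-2, 0), True),    # orbit reaches 2 and stays; |z| never exceeds 2
    (complex(1, 0), False),    # 0, 1, 2, 5, 26, ... escapes
    (complex(0.5, 0), False),  # escapes
]

def in_mandelbrot(c: complex, max_iter: int = 100) -> bool:
    z = 0j
    for _ in range(max_iter):
        z = z * z + c
        if abs(z) > 2:
            return False
    return True

for c, expected in CASES:
    assert in_mandelbrot(c) == expected
```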

simonw · 2 months ago
That's exactly why I like using Mandelbrot as a demo: it's perfect for "superficial visual inspection".

With a bunch more work I could likely have got a vision LLM to do that visual inspection for me in the assembly example, but having a human in the loop for that was much more productive.

shepherdjerred · 2 months ago
Are fractals or x86 assembly representative of most dev work?
chamomeal · 2 months ago
That’s super cool, I’m glad you shared this!

I’ve been thinking about using LLMs for brute forcing problems too.

Like, LLMs kinda suck at TypeScript generics. They're surprisingly bad at them. Probably because it's easy to write generics that look correct but are then screwy in many scenarios. Which is also why generics are hard for humans.

If you could have any LLM actually use TSC, it could run tests, make sure things are inferring correctly, etc. it could just keep trying until it works. I’m not sure this is a way to produce understandable or maintainable generics, but it would be pretty neat.
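A rough sketch of that brute-force loop (shelling out to `tsc`; the `ask_llm` helper and the candidate generic are hypothetical, and in practice the file would also contain type-level tests):

```python
import subprocess

def type_checks(path: str) -> tuple[bool, str]:
    """Run the TypeScript compiler and capture its diagnostics."""
    result = subprocess.run(["npx", "tsc", "--noEmit", path],
                            capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

attempt = "type DeepPartial<T> = { [K in keyof T]?: T[K] };"  # initial candidate
for _ in range(20):  # keep trying until tsc is satisfied
    with open("candidate.ts", "w") as f:
        f.write(attempt)
    ok, errors = type_checks("candidate.ts")
    if ok:
        break
    # hypothetical helper: feed the compiler errors back to the model
    attempt = ask_llm(f"Fix this generic:\n{attempt}\nErrors:\n{errors}")
```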

Also, while typing this I realized that Cursor can see TypeScript errors. All I need are some utility testing types, and I could have Cursor write the tests and then brute force the problem!

If I ever actually do this I’ll update this comment lol

rasengan · 2 months ago
Makes sense.

I treat an LLM the same way I'd treat myself as it relates to context and goals when working with code.

"If I need to do __________ what do I need to know/see?"

I find that traditional tools, as per the OP, have become ever more powerful and useful in the age of LLMs (especially grep).

Furthermore, LLMs are quite good at working with shell tools and functionalities (heredoc, grep, sed, etc.).

chrisweekly · 2 months ago
Giving LLMs the right context -- eg in the form of predefined "cognitive tools", as explored with a ton of rigor here^1 -- seems like the way forward, at least to this casual observer.

1. https://github.com/davidkimai/Context-Engineering/blob/main/...

(the repo is a WIP book, I've only scratched the surface but it seems pretty brilliant to me)

nico · 2 months ago
> LLM in a sandbox using tools in a loop, you can brute force that problem

Does this require using big models through their APIs and spending a lot of tokens?

Or can this be done either with local models (probably very slow), or with subscriptions like Claude Code with Pro (without hitting the rate/usage limits)?

I saw the Mandelbrot experiment, it was very cool, but still a rather small project, not really comparable to a complex/bigger/older code base for a platform used in production

simonw · 2 months ago
The local models aren't quite good enough for this yet in my experience - the big hosted models (o3, Gemini 2.5, Claude 4) only just crossed the capability threshold for this to start working well.

I think it's possible we'll see a local model that can do this well within the next few months though - it needs good tool calling, not an encyclopedic knowledge of the world. Might be possible to fit that in a model that runs locally.

dist-epoch · 2 months ago
I've been using a VM for a sandbox, just to make sure it won't delete my files if it goes insane.

With some host data directories mounted read only inside the VM.

This creates some friction though. It feels like a tool that runs the AI agent in a VM but then copies its output to the host machine after some checks would help, so that it would feel like you are running it natively on the host.

jitl · 2 months ago
This is very easy to do with Docker. Not sure if you want the VM layer as an extra security boundary, but even so you can just specify the VM's Docker API endpoint to spawn processes and copy files in/out from shell scripts.
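A sketch of that setup with the Docker SDK for Python (image, paths, and script are placeholders):

```python
import docker

# docker.DockerClient(base_url="tcp://vm:2375") would target a VM's Docker
# API endpoint instead of the local daemon.
client = docker.from_env()

output = client.containers.run(
    "python:3.12-slim",                 # placeholder image
    ["python", "/data/agent_task.py"],  # placeholder command
    volumes={
        "/host/data": {"bind": "/data", "mode": "ro"},  # host data read-only
        "/host/out": {"bind": "/out", "mode": "rw"},    # checked, then copied out
    },
    network_disabled=True,  # tighten the sandbox further
    remove=True,
)
print(output.decode())
```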
simonw · 2 months ago
Have you tried giving the model a fresh checkout in a read-write volume?
skeeter2020 · 2 months ago
One of my biggest ongoing challenges has been to get the LLM to use the tool(s) that are appropriate for the job. It feels like teaching your kids to, say, do laundry, when you just want to tell them to step aside and let you do it.
mritchie712 · 2 months ago
> try completing a GitHub task with the GitHub MCP, then repeat it with the gh CLI tool. You'll almost certainly find the latter uses context far more efficiently and you get to your intended results quicker.

This is spot on. I have a "devops" folder with a CLAUDE.md with bash commands for common tasks (e.g. find prod / staging logs with this integration ID).

When I complete a novel task (e.g. count all the rows that were synced from stripe to duckdb) I tell Claude to update CLAUDE.md with the example. The next time I ask a similar question, Claude one-shots it.

This is the first few lines of the CLAUDE.md

    This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

    ## Purpose
    This devops folder is dedicated to Google Cloud Platform (GCP) operations, focusing on:
    - Google Cloud Composer (Airflow) DAG management and monitoring
    - Google Cloud Logging queries and analysis
    - Kubernetes cluster management (GKE)
    - Cloud Run service debugging

    ## Common DevOps Commands

    ### Google Cloud Composer
    ```bash
    # View Composer environment details
    gcloud composer environments describe meltano --location us-central1 --project definite-some-id

    # List DAGs in the environment
    gcloud composer environments storage dags list --environment meltano --location us-central1 --project definite-some-id

    # View DAG runs
    gcloud composer environments run meltano --location us-central1 dags list

    # Check Airflow logs
    gcloud logging read 'resource.type="cloud_composer_environment" AND resource.labels.environment_name="meltano"' --project definite-some-id --limit 50

jayd16 · 2 months ago
I feel like I'm taking crazy pills sometimes. You have a file with a set of snippets and you prefer to ask the AI to hopefully run them instead of just running it yourself?
lreeves · 2 months ago
The commands aren't the special sauce, it's the analytical capabilities of the LLM to view the outputs of all those commands and correlate data or whatever. You could accomplish the same by prefilling a gigantic context window with all the logs but when the commands are presented ahead of time the LLM can "decide" which one to run based on what it needs to do.
mritchie712 · 2 months ago
the snippets are examples. You can ask hundreds of variations of similar, but different, complex questions and the LLM can adjust the example for that need.

I don't have a snippet for, "find all 500's for the meltano service for duckdb syntax errors", but it'd easily nail that given the existing examples.

light_hue_1 · 2 months ago
Yes. I'm not the poster but I do something similar.

Because now the capabilities of the model grow over time. And I can ask questions that involve a handful of those snippets. When we get to something new that requires some doing, it becomes another snippet.

I can offload everything I used to know about an API and never have to think about it again.

theshrike79 · 2 months ago
The AI will run whatever command it figures out might work, which might be wasteful and taint the context with useless crap.

But when you give it tools for retrieving all client+server logs combined (for a web application), it can use it and just get what it needs as simply as possible.

Or it'll start finding a function by digging around code files with grep; if you provide a tool that just lists all functions, their parameters, and locations, it'll find the exact spot in one go.

qazxcvbnmlp · 2 months ago
You don't ask the AI to run the commands. You say "build and test this feature" and then the AI correctly iterates back and forth between the build and test commands until the thing works.
lsaferite · 2 months ago
Just as a related aside, you could literally make that bottom section into a super simple stdio MCP Server and attach that to Claude Code. Each of your operations could be a tool and have a well-defined schema for parameters. Then you are giving the LLM a more structured and defined way to access your custom commands. I'm pretty positive there are even pre-made MCP Servers that are designed for just this activity.
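A sketch of what that could look like with the official MCP Python SDK, wrapping one of the gcloud commands above as a tool (names and defaults borrowed from the example; otherwise hypothetical):

```python
import subprocess
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("devops-tools")

@mcp.tool()
def list_dags(environment: str = "meltano", location: str = "us-central1") -> str:
    """List DAGs in a Cloud Composer environment."""
    result = subprocess.run(
        ["gcloud", "composer", "environments", "storage", "dags", "list",
         "--environment", environment, "--location", location],
        capture_output=True, text=True,
    )
    return result.stdout or result.stderr

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```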

Edit: First result when looking for such an MCP Server: https://github.com/inercia/MCPShell

theshrike79 · 2 months ago
The problem with MCP is that it's not composable.

With separate tools or command snippets the LLM can run one command, feed the result to another command and grep that result for whatever it needs. One composed command or script and it gets exactly what it needed.

With MCPs it'd need to run every command separately, spending precious context for shuffling data from MCP tool to another.

gbrindisi · 2 months ago
Wouldn't this defeat the point? Claude Code already has access to the terminal; adding specific instructions in the context is enough.
chriswarbo · 2 months ago
I use a similar file, but just for myself (I've never used an LLM "agent"). I live in Emacs, but this is the only thing I use org-mode for: it lets me fold/unfold the sections, and I can press C-c C-c over any of the code snippets to execute it. Some of them are shell code, some of them are Emacs Lisp code which generates shell code, etc.
stpedgwdgfhgdd · 2 months ago
I do something similar, but the problem is that claude.md keeps on growing.

To tackle this, I converted a custom prompt into an application, but there is an interesting trade-off. The application is deterministic. It cannot deal with unknown situations. In contrast to CC, which is way slower, but can try alternative ways of dealing with an unknown situation.

I ended up adding an instruction to the custom command to run the application and fix the application code (TDD) if there is a problem. Self-healing software… who would have thought.

e12e · 2 months ago
You're letting the LLM execute privileged API calls against your production/test/staging environment, just hoping it won't corrupt something, like truncate logs, files, databases etc?

Or are you asking it to provide example commands that you can sanity check?

I'd be curious to see some more concrete examples.

tayloramurphy · 2 months ago
Fun to see Meltano mentioned here :)
mritchie712 · 2 months ago
meltano4life
antirez · 2 months ago
I have the feeling this is not really about MCP specifically vs. other approaches; it is pretty simple: at the current state of AI, having a human in the loop is much better. LLMs are great at certain tasks, but they often get trapped in local minima. If you do the back and forth via the web interface of LLMs (ask it to write a program, look at it, provide hints to improve it, test it, ...), you get much better results, and you don't cut yourself out of the loop only to find a 10k-line mess that could be 400 lines of clear code.

That's the current state of affairs, but of course many will try very hard to replace programmers, which is currently not possible. What is possible is to accelerate the work of a programmer several times over (but they must be good both at programming and at LLM usage), or to take a smart person with relatively low skill in some technology and, thanks to LLMs, make them productive in that field without the long training otherwise needed. And many other things. But "agentic coding" right now does not work well. This will change, but right now the real gain is to use the LLM as a colleague.

It is not MCP: it is autonomous agents that don't get feedback from smart humans.

rapind · 2 months ago
So I run my own business (product), I code everything, and I use claude-code. I also wear all the other hats and so I'd be happy to let Claude handle all of the coding if / when it can. I can confirm we're certainly not there yet.

It's definitely useful, but you have to read everything. I'm working in a type-safe functional compiled language too. I'd be scared to try this flow in a less "correctness enforced" language.

That being said, I do find that it works well. It's not living up to the hype, but most of that hype was obvious nonsense. It continues to surprise me with its grasp of concepts and is definitely saving me some time, and more importantly making some larger tasks more approachable, since I can split my time better.

galdre · 2 months ago
My absolute favorite use of MCP so far is Bruce Hauman's clojure-mcp. In short, it gives the LLM (a) a bash tool, (b) a persistent Clojure REPL, and (c) structural editing tools.

The effect is that it's far more efficient at editing Clojure code than any purely string-diff-based approach, and if you write a good test suite it can rapidly iterate back and forth just editing files, reloading them, and then re-running the test suite at the REPL -- just like I would. It's pretty incredible to watch.

chamomeal · 2 months ago
I was just going to comment about clojure-mcp!! It’s far and away the coolest use of mcp I’ve seen so far.

It can straight up debug your code, eval individual expressions, document return types of functions. It’s amazing.

It actually makes me think that languages with strong REPLs are a better fit for LLMs than those without. Seeing clojure-mcp do its thing is the most impressive AI feat I've seen since I saw GPT-3 in action for the first time.

victorbjorklund · 2 months ago
I think the GitHub CLI example isn't entirely fair to MCP. Yes, GitHub's CLI is extensively documented online, so of course LLMs will excel at generating code for well-known tools. But MCP shines in different scenarios.

Consider internal company tools or niche APIs with minimal online documentation. Sure, you could dump all the documentation into context for code generation, but that often requires more context than interacting with an MCP tool. More importantly, generated code for unfamiliar APIs is prone to errors, so you'd need robust testing and retry mechanisms built into the process.

With MCP, if the tools are properly designed and receive correct inputs, they work reliably. The LLM doesn't need to figure out API intricacies, authentication flows, or handle edge cases - that's already handled by the MCP server.

So I agree MCP for GitHub is probably overkill but there are many legitimate use cases where pre-built MCP tools make more sense than asking an LLM to reverse-engineer poorly documented or proprietary systems from scratch.

the_mitsuhiko · 2 months ago
> Sure, you could dump all the documentation into context for code generation, but that often requires more context than interacting with an MCP tool.

MCP works exactly that way: you dump documentation into the context. That's how the LLM knows how to call your tool. Even for custom stuff I noticed that giving the LLM things to work with that it knows (eg: python, javascript, bash) beats it using MCP tool calling, and in some ways it wastes less context.
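Concretely, each MCP tool reaches the model as a name, a description, and a JSON schema, something roughly like this (shape approximated from a `tools/list` response; the tool shown is illustrative):

```python
# One tool's entry, as the model sees it in its context window.
tool_definition = {
    "name": "get_issue",
    "description": "Get details of a specific issue in a GitHub repository.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "owner": {"type": "string", "description": "Repository owner"},
            "repo": {"type": "string", "description": "Repository name"},
            "issue_number": {"type": "integer", "description": "Issue number"},
        },
        "required": ["owner", "repo", "issue_number"],
    },
}
```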

YMMV, but I found the limit of tools available to be <15 with sonnet4. That's a super low amount. Basically the official playwright MCP alone is enough to fully exhaust your available tool space.

JyB · 2 months ago
I've never used that many. Does LLM performance collapse/degrade significantly because of too much initial context? It seems like MCP implementation updates could easily solve that, e.g. only injecting the relevant servers for the given task based on the initial user prompt.
light_hue_1 · 2 months ago
That's "handled by the MCP server" in the sense that it doesn't do authentication, etc.; it provides a simplified view of the world.

If that's what you wanted, you could have designed your poorly documented internal API differently to begin with. There's zero advantage to MCP in the scenario you describe, aside from convincing people that their original API is too hard to use.

pamelafox · 2 months ago
Regarding the Playwright example: I had the same experience this week attempting to build an agent first by using the Playwright MCP server, realizing it was slow, token-inefficient, and flaky, and rewriting with direct Playwright calls.

MCP servers might be fun to get an idea for what's possible, and good for one-off mashups, but API calls are generally more efficient and stable, when you know what you want.
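For comparison, direct calls with Playwright's sync API look like this (URL and selectors are placeholders):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/feed")        # placeholder URL
    page.fill("input[name='q']", "hello world")  # placeholder selector
    page.click("button[type='submit']")          # placeholder selector
    print(page.title())
    browser.close()
```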

Here's the agent I ended up writing: https://github.com/pamelafox/personal-linkedin-agent

Demo: https://www.youtube.com/live/ue8D7Hi4nGs

arkmm · 2 months ago
This is cool! Also have found the Playwright MCP implementation to be overkill and think of it more as a reference to an opinionated subset of the Playwright API.

LinkedIn is notorious for making it hard to build automations on top of it. Did you run into any roadblocks when building your personal LinkedIn agent?

zahlman · 2 months ago
... Ah, reading these as well as more carefully reading TFA, I understand now that there is an MCP based on Playwright, and that Playwright itself is not considered an example of something that accidentally is an MCP despite having been released all the way back in January 2020.

... But now I still feel far away from understanding what MCP really is. As in:

* What specifically do I have to implement in order to create one?

* Now that the concept exists, what are the implications as the author of, say, a traditional REST API?

* Now that the concept exists, what new problems exist to solve?

jumploops · 2 months ago
We’re playing an endless cat and mouse game of capabilities between old and new right now.

Claude Code shows that the models can excel at using “old” programmatic interfaces (CLIs) to do Real Work™.

MCP is a way to dynamically provide “new” programmatic interfaces to the models.

At some point this will start to converge, or at least appear to do so, as the majority of tools a model needs will be in its pre-training set.

Then we’ll argue about MPPP (model pre-training protocol pipeline), and how to reduce knowledge pollution of all the LLM-generated tools we’re passing to the model.

Eventually we'll publish the Merriam-Webster Model Tool Dictionary (MWMTD), surfacing all of the approved tools hidden in the pre-training set.

Then the kids will come up with Model Context Slang (MCS), in an attempt to use the models to dynamically choose unapproved tools, for much fun and enjoyment.

Ad infinitum.