simonw · 2 months ago
I wrote a bit about this the other day: https://simonwillison.net/2025/Jun/27/context-engineering/

Drew Breunig has been doing some fantastic writing on this subject - coincidentally at the same time as the "context engineering" buzzword appeared but actually unrelated to that meme.

How Long Contexts Fail - https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-ho... - talks about the various ways in which longer contexts can start causing problems (also known as "context rot")

How to Fix Your Context - https://www.dbreunig.com/2025/06/26/how-to-fix-your-context.... - gives names to a bunch of techniques for working around these problems including Tool Loadout, Context Quarantine, Context Pruning, Context Summarization, and Context Offloading.
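
To make "Tool Loadout" concrete, here's a minimal sketch of the idea (my own illustration, not code from Drew's post - `embed` is a hypothetical embedding helper, and the tool dict format is an assumption):

  # Tool Loadout sketch: only hand the model the tools that look
  # relevant to the current request, instead of the full catalogue.

  def dot(a, b):
      return sum(x * y for x, y in zip(a, b))

  def select_tools(user_message, all_tools, embed, k=5):
      # `embed` is a hypothetical helper returning an embedding vector.
      query = embed(user_message)
      scored = [
          (dot(query, embed(tool["description"])), tool)
          for tool in all_tools
      ]
      scored.sort(key=lambda pair: pair[0], reverse=True)
      return [tool for _, tool in scored[:k]]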

the_mitsuhiko · 2 months ago
Drew Breunig's posts are a must read on this. This is not only important for writing your own agents, it is also critical when using agentic coding right now. These limitations/behaviors will be with us for a while.
outofpaper · 2 months ago
They might be good reads on the topic, but Drew makes some significant etymological mistakes. For example, "loadout" doesn't come from gaming but from military terminology. It's essentially the same as kit or gear.
Daub · 2 months ago
For visual art I feel that the existing approaches in context engineering are very much lacking. An AI understands well enough such simple things as content (bird, dog, owl, etc.), color (blue, green, etc.) and has a fair understanding of foreground/background. However, the really important stuff is not addressed.

For example: in form, things like negative shape and overlap. In color contrast, things like ratio contrast and dynamic range contrast. Or how manipulating neighboring regional contrast produces tone wrap. I could go on.

One reason for this state of affairs is that artists and designers lack a consistent terminology to describe what they are doing (though this does not stop them from operating at a high level). Indeed, many of the terms I have used here are ones we (my colleagues and I) had to invent ourselves. I would love to work with an AI guru to address this developing problem.

skydhash · 2 months ago
> artists and designers lack the consistent terminology to describe what they are doing

I don't think they do. It may not be completely consistent, but open any art book and you find the same thing being explained again and again. Just for drawing humans, you will find emphasis on the skeleton and muscle volume for forms and poses, planes (especially the head) for values and shadows, some abstract things like stability and line weight, and some more concrete things like foreshortening.

Several books and courses have gone over those concepts. They are not difficult to explain, they are just difficult to master. That's because you have to apply judgement for every single line or brush stroke, deciding which factors matter most and whether you even want to make the stroke. Then there's the whole hand-eye coordination thing.

So unless you can solve judgement (which styles derive from), there's not a lot of hope there.

ADDENDUM

And when you do a study of another's work, it's not copying the data, extracting colors, or comparing labels... It's just studying judgement. You know the complete formula, a more basic version of which is being used for the work, and you only want to know the parameters. Whereas machine training is mostly going for the wrong formula with completely different variables.

daxfohl · 2 months ago
I'm surprised there isn't already an ecosystem of libraries that just do this. When building agents you either have to roll your own or copy an algorithm out of some article.

I'd expect this to be a lot more plug and play, and as swappable as LLMs themselves by EOY, along with a bunch of tooling to help with observability, A/B testing, cost and latency analysis (since changing context kills the LLM cache), etc.

daxfohl · 2 months ago
Or maybe it's that each of these things is pretty simple in itself. Clipping context is one line of code, summarizing could be a couple of lines to have an LLM summarize it for you, etc. So none of it is substantial enough for a formal library. Whereas the combination of these techniques is very application-dependent, so not reusable enough to warrant separating out as an independent library.

Or maybe it just hasn't matured yet and we'll see more of it in the future. We'll see.
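
For illustration, the individual pieces really can be that small (a rough sketch; `llm_complete` here is a hypothetical wrapper around whatever model API you use):

  def clip(messages, n_recent):
      # Clipping: keep the system prompt plus the last n turns.
      return messages[:1] + messages[-n_recent:]

  def summarize(messages, llm_complete):
      # Summarizing: one extra model call over the turns being dropped.
      transcript = "\n".join(m["content"] for m in messages)
      return llm_complete("Summarize this conversation, keeping key facts:\n" + transcript)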

risyachka · 2 months ago
“A month-long skill”, after which it won’t be a thing anymore, like so many others.
simonw · 2 months ago
Most of the LLM prompting skills I figured out ~three years ago are still useful to me today. Even the ones that I've dropped are useful because I know that things that used to be helpful aren't helpful any more, which helps me build an intuition for how the models have improved over time.
orbital-decay · 2 months ago
Many people figured this out two or three years ago, when AI-assisted coding basically wasn't a thing, and it's still relevant and will stay relevant. These are fundamental principles; all big models work similarly, not just transformers and not just LLMs.

However, many fundamental phenomena are missing from the "context engineering" scope, so neither context engineering nor prompt engineering are useful terms.

coldtea · 2 months ago
Which month-long AI skills from 2023 are obsolete now, exactly?

Surely not prompt engineering itself, for example.

tptacek · 2 months ago
If you're not writing your own agents, you can skip this skill.
storus · 2 months ago
Those issues are considered artifacts of the current crop of LLMs in academic circles; there is already research allowing LLMs to use millions of different tools at the same time, and stable long contexts, likely reducing the number of agents to one for most use cases outside interfacing with different providers.

Anyone basing their future agentic systems on current LLMs would likely face LangChain fate - built for GPT-3, made obsolete by GPT-3.5.

simonw · 2 months ago
Can you link to the research on millions of different tools and stable long contexts? I haven't come across that yet.
ZYbCRq22HbJ2y7 · 2 months ago
How would "a million different tool calls at the same time" work? For instance, MCP is HTTP based, even at low latency in incredibly parallel environments that would take forever.
Foreignborn · 2 months ago
yes, but those aren’t released and even then you’ll always need glue code.

you just need to know what glue code is needed, and build it in a way that it can scale with whatever new limits upgraded models give you.

i can’t imagine a world where people aren’t building products that try to overcome the limitations of SOTA models

dinvlad · 2 months ago
> already research allowing LLMs to use millions of different tools

Hmm first time hearing about this, could you share any examples please?

dosnem · 2 months ago
Providing context makes sense to me, but do you have any examples of providing context and then getting the AI to produce something complex? I am quite a proponent of AI, but even I find myself failing to produce significant results on complex problems, even when I have clone + memory bank, etc. It ends up being a time sink of trying to get the AI to do something, only to have me eventually take over and do it myself.
simonw · 2 months ago
Quite a few times, I've been able to give it enough context to write me an entire working piece of software in a single shot. I use that for plugins pretty often, eg this:

  llm -m openai/o3 \
    -f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
    -f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
    -s 'Write a new fragments plugin in Python that registers issue:org/repo/123 which fetches that issue
      number from the specified github repo and uses the same markdown logic as the HTML page to turn that into a fragment'
Which produced this: https://gist.github.com/simonw/249e16edffe6350f7265012bee9e3...

AnotherGoodName · 2 months ago
I had a series of prompts like “Using Manim, create an animation for formula X rearranging into formula Y, with a graph of the values of the function.”

Beautiful one-shot results, and I now have really nice animations of some complex maths to help others understand. (I’ll put them up on YouTube soon.)

I don't know the Manim library at all, so this saved me about a week of work learning and implementing it.

old_man_cato · 2 months ago
[flagged]
jknoepfler · 2 months ago
Oh, and don't forget to retain the artist to correct the ever-increasingly weird and expensive mistakes made by the context when you need to draw newer, fancier pelicans. Maybe we can just train product to draw?
d0gsg0w00f · 2 months ago
This hits too close to home.
_carbyau_ · 2 months ago
[flagged]
arbitrary_name · 2 months ago
From the first link: "Read large enough context to ensure you get what you need."

How does this actually work, and how can one better define this to further improve the prompt?

This statement feels like the 'draw the rest of the fucking owl' referred to elsewhere in the thread

simonw · 2 months ago
I'm not sure how you ended up on that page... my comment above links to https://simonwillison.net/2025/Jun/27/context-engineering/

The "Read large enough context to ensure you get what you need" quote is from a different post entirely, this one: https://simonwillison.net/2025/Jun/30/vscode-copilot-chat/

That's part of the system prompts used by the GitHub Copilot Chat extension for VS Code - from this line: https://github.com/microsoft/vscode-copilot-chat/blob/40d039...

The full line is:

  When using the {ToolName.ReadFile} tool, prefer reading a
  large section over calling the {ToolName.ReadFile} tool many
  times in sequence. You can also think of all the pieces you
  may be interested in and read them in parallel. Read large
  enough context to ensure you get what you need.
That's a hint to the tool-calling LLM that it should attempt to guess which area of the file is most likely to include the code that it needs to review.

It makes more sense if you look at the definition of the ReadFile tool:

https://github.com/microsoft/vscode-copilot-chat/blob/40d039...

  description: 'Read the contents of a file. Line numbers are
  1-indexed. This tool will truncate its output at 2000 lines
  and may be called repeatedly with offset and limit parameters
  to read larger files in chunks.'
The tool takes three arguments: filePath, offset and limit.
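
For illustration only, a tool with that shape might look something like this (my own sketch, not the actual Copilot implementation; I'm assuming the offset is a 0-based starting line):

  def read_file(filePath: str, offset: int = 0, limit: int = 2000) -> str:
      # Mirrors the documented behavior: 1-indexed line numbers in the output,
      # truncation at `limit` lines, re-callable with offset/limit for big files.
      with open(filePath, encoding="utf-8") as f:
          lines = f.readlines()
      chunk = lines[offset:offset + limit]
      return "".join(f"{offset + i + 1}: {line}" for i, line in enumerate(chunk))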

JoeOfTexas · 2 months ago
So who will develop the first Logic Core that automates the context engineer?
igravious · 2 months ago
The first rule of automation: that which can be automated will be automated.

Observation: this isn't anything that can't be automated /

TZubiri · 2 months ago
Rediscovering encapsulation
benreesman · 2 months ago
The new skill is programming, same as the old skill. To the extent these things are comprehensible, you understand them by writing programs: programs that train them, programs that run inference, programs that analyze their behavior. You get the most out of LLMs by knowing how they work in detail.

I had one view of what these things were and how they work, and a bunch of outcomes attached to that. And then I spent a bunch of time training language models in various ways and doing other related upstream and downstream work, and I had a different set of beliefs and outcomes attached to it. The second set of outcomes is much preferable.

I know people really want there to be some different answer, but it remains the case that mastering a programming tool involves implementing such a thing, to one degree or another. I've only done medium-sophistication ML programming, and my understanding is therefore kinda medium, but like with compilers, even doing a medium one is the difference between getting good results from a high-complexity one and guessing.

Go train an LLM! How do you think Karpathy figured it out? The answer is on his blog!

pyman · 2 months ago
Saying the best way to understand LLMs is by building one is like saying the best way to understand compilers is by writing one. Technically true, but most people aren't interested in going that deep.
benreesman · 2 months ago
I don't know, I've heard that meme too, but it doesn't track with the number of cool compiler projects on GitHub or that hit the HN front page, and while the LLM thing is a lot newer, you see a ton of useful/interesting stuff at the "an individual could do this on their weekends and it would mean they fundamentally know how all the pieces fit together" level.

There will always be a crowd that wants the "master XYZ in 72 hours with this ONE NEAT TRICK" course, and there will always be a..., uh, group of people serving that market need.

But most people? Especially in a place like HN? I think most people know that getting buff involves going to the gym. I have a pretty high opinion of the typical person. We're all tempted by the "most people are stupid" meme, but that's because bad interactions are memorable, not because most people are stupid or lazy or whatever. Most people are very smart if they apply themselves, and most people will work very hard if the reward for doing so is reasonably clear.

https://www.youtube.com/shorts/IQmOGlbdn8g

wickedsight · 2 months ago
The best way to understand a car is to build a car. Hardly anyone is going to do that, but we still all use them quite well in our daily lives. In large part because the companies who build them spend time and effort to improve them and take away friction and complexity.

If you want to be an F1 driver it's probably useful to understand almost every part of a car. If you're a delivery driver, it probably isn't, even if you use one 40+ hours a week.

Deleted Comment

hackable_sand · 2 months ago
It's not that deep
Davidzheng · 2 months ago
I highly, highly doubt that training an LLM like GPT-2 will help you use something the size of GPT-4. And I guess most people can't afford to train something like GPT-4. I trained some NNs back before the ChatGPT era; I don't think any of it helps in using ChatGPT or its alternatives.
benreesman · 2 months ago
With modern high-quality datasets and plummeting H100 rental costs, it is 100% a feasible undertaking for an individual to train a model with performance far closer to gpt-4-1106-preview than to GPT-2. In fact, it's difficult to train a model that performs as badly as GPT-2 without carefully selecting datasets like OpenWebText with the explicit purpose of replicating runs of historical interest: modern datasets will do better than that by default.

GPT-4 is a 1.75-teraweight MoE (or so the rumor has it), and that's probably pushing it for an individual's discretionary budget unless they're very well off, but you don't need to match that exactly to learn how these things fundamentally work.

I think you underestimate how far the technology has come. torch.distributed works out of the box now; deepspeed and other strategies that are both data- and model-parallel are weekend projects to spin up on an 8xH100 SXM interconnected cluster that you can rent from Lambda Labs, and HuggingFace has extremely high quality curated datasets (the FineWeb family I was alluding to from Karpathy's open stuff is stellar).

In just about any version of this you come to understand how tokenizers work (which makes a whole class of failure modes go from baffling to intuitive), how models behave and get evaled after pretraining and after instruct-training/SFT rounds, how convergence does and doesn't happen, and how tool use and other special tokens get used (and why they are abundant).

And no, doing all that doesn't make Opus 4 completely obvious in all aspects. But it's about 1000x more effective as a learning technique than doing prompt-engineering astrology. Opus 4 is still a bit mysterious if you don't work at a frontier lab; there's very interesting stuff going on there, and if I make claims about it I'm squarely speculating about how some of that works.

Models that look and act a lot like GPT-4 while having dramatically lower parameter counts are just completely understood in open source now. The more advanced ones require the resources of a startup rather than an individual, but you don't need to eval the same as 1106 to take all the mystery out of how it works.

The "holy shit" models are like 3-4 generations old now.

JohnMakin · 2 months ago
> Building powerful and reliable AI Agents is becoming less about finding a magic prompt or model updates.

Ok, I can buy this

> It is about the engineering of context and providing the right information and tools, in the right format, at the right time.

when the "right" format and "right" time are essentially, and maybe even necessarily, undefined, then aren't you still reaching for a "magic" solution?

If the definition of "right" information is "information which results in a sufficiently accurate answer from a language model", then I fail to see how you are doing anything fundamentally different from prompt engineering. Since these are non-deterministic machines, I fail to see any reliable heuristic that is fundamentally distinguishable from "trying and seeing" with prompts.

v3ss0n · 2 months ago
At this point, due to the non-deterministic nature and hallucination, context engineering is pretty much magic. But here are our findings.

1 - LLMs tend to pick up and understand context that comes in the top 7-12 lines. Mostly the first 1k tokens are best understood by LLMs (tested on Claude and several open-source models), so the most important context, like parsing rules, needs to be placed there.

2 - Need to keep context short. Whatever context limit they claim is not true. They may have a long context window of 1 mil tokens, but only roughly the first 10k tokens have good accuracy and recall; the rest is just bunk, just ignore it. Write the prompt and try compressing/summarizing it without losing key information, either manually or with an LLM.

3 - If you build agent-to-agent orchestration, don't build agents with long context and multiple tools; break them down into several agents with different sets of tools and then put a planning agent on top which solely does handover (a minimal sketch is below).

4 - If all else fails, write the agent handover logic in code - as it always should be.

From building 5+ agent-to-agent orchestration projects in different industries using AutoGen + Claude - those are the results.
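
Here's a rough, framework-agnostic sketch of point 3 (the agent names and the `call_llm` helper are placeholders, not AutoGen APIs):

  # Specialist agents, each with a small, focused set of tools.
  SPECIALISTS = {
      "parser":   {"tools": ["parse_doc", "extract_fields"]},
      "database": {"tools": ["run_sql", "describe_table"]},
      "reporter": {"tools": ["render_chart", "write_summary"]},
  }

  def plan_handover(task: str, call_llm) -> str:
      # The planning agent does nothing except decide who gets the task.
      prompt = (
          "You are a planning agent. Your only job is handover.\n"
          "Pick exactly one agent for this task and reply with its name only.\n"
          f"Agents: {', '.join(SPECIALISTS)}\nTask: {task}"
      )
      choice = call_llm(prompt).strip().lower()
      # Point 4: if the model's answer is unusable, fall back to plain code.
      return choice if choice in SPECIALISTS else "parser"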

zacksiri · 2 months ago
Based on my testing, the larger the model, the better it is at handling larger contexts.

I tested with 8B model, 14B model and 32B model.

I wanted it to create structured JSON, and the context was quite large, like 60k tokens.

The 8B model failed miserably despite supporting 128k context, the 14B did better, and the 32B one got almost everything correct. However, when jumping to a really large model like grok-3-mini, it got it all perfect.

The 8B, 14B, and 32B models I tried were Qwen 3. I disabled thinking on all the models I tested.

Now for my agent workflows I use small models for most of the workflow (it works quite nicely) and only use larger models when the problem is harder.

lblume · 2 months ago
I have uploaded entire books to the latest Gemini and had the model reliably accurately answer specific questions requiring knowledge of multiple chapters.

Deleted Comment

zvitiate · 2 months ago
Claude’s system prompt is SO long though that the first 1k lines might not be as relevant for Gemini, GPT, or Grok.
mentalgear · 2 months ago
It's magical thinking all the way down. Whether they call it "prompt" or "context" engineering now, it's the same tinkering to find something that "sticks" in non-deterministic space.
nonethewiser · 2 months ago
>Whether they call it "prompt" or "context" engineering now, it's the same tinkering to find something that "sticks" in non-deterministic space.

I don't quite follow. Prompts and contexts are different things. Sure, you can get things into contexts with prompts, but that doesn't mean they are entirely the same.

You could have a long running conversation with a lot in the context. A given prompt may work poorly, whereas it would have worked quite well earlier. I don't think this difference is purely semantic.

For whatever it's worth I've never liked the term "prompt engineering." It is perhaps the quintessential example of overusing the word engineering.

belter · 2 months ago
Got it...updating CV to call myself a VibeOps Engineer in a team of Context Engineers...A few of us were let go last quarter, as they could only do Prompt Engineering.
tootie · 2 months ago
You say "magic" I say "heuristic"
ironmagma · 2 months ago
What is all software but tinkering?

I mean this not as an insult to software dev but to work generally. It’s all play in the end.

surecoocoocoo · 2 months ago
We used to define a specification.

In other words; context.

But that was like old man programming.

As if the laws of physics changed between 1970 and 2009.

edwardbernays · 2 months ago
The state of the art theoretical frameworks typically separates these into two distinct exploratory and discovery phases. The first phase, which is exploratory, is best conceptualized as utilizing an atmospheric dispersion device. An easily identifiable marker material, usually a variety of feces, is metaphorically introduced at high velocity. The discovery phase is then conceptualized as analyzing the dispersal patterns of the exploratory phase. These two phases are best summarized, respectively, as "Fuck Around" followed by "Find Out."
Aeolun · 2 months ago
There is only so much you can do with prompts. To go from the 70% accuracy you can achieve with that to the 95% accuracy I see in Claude Code, the context is absolutely the most important, and it’s visible how much effort goes into making sure Claude retrieves exactly the right context, often at the expense of speed.
majormajor · 2 months ago
Why are we drawing a difference between "prompt" and "context" exactly? The linked article is a bit of puffery that redefines a commonly used term - "context" - to mean something different from what it has meant so far when we discuss "context windows". It seems to just be an attempt to generate new hype.

When you play with the APIs the prompt/context all blurs together into just stuff that goes into the text fed to the model to produce text. Like when you build your own basic chatbot UI and realize you're sending the whole transcript along with every step. Using the terms from the article, that's "State/History." Then "RAG" and "Long term memory" are ways of working around the limits of context window size and the tendency of models to lose the plot after a huge number of tokens, to help make more effective prompts. "Available tools" info also falls squarely in the "prompt engineering" category.
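
For example, a bare-bones chat loop (using the OpenAI Python client here just as an illustration; any chat-style API works the same way) re-sends the whole history on every turn:

  from openai import OpenAI

  client = OpenAI()
  messages = [{"role": "system", "content": "You are a helpful assistant."}]

  while True:
      messages.append({"role": "user", "content": input("> ")})
      response = client.chat.completions.create(
          model="gpt-4o-mini",      # any chat model
          messages=messages,        # the entire transcript, every single turn
      )
      reply = response.choices[0].message.content
      messages.append({"role": "assistant", "content": reply})
      print(reply)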

The reason prompt engineering is going the way of the dodo is because tools are doing more of the drudgery to make a good prompt themselves. E.g., finding relevant parts of a codebase. They do this with a combination of chaining multiple calls to a model together to progressively build up a "final" prompt plus various other less-LLM-native approaches (like plain old "find").

So yeah, if you want to build a useful LLM-based tool for users you have to write software to generate good prompts. But... it ain't really different than prompt engineering other than reducing the end user's need to do it manually.

It's less that we've made the AI better and more that we've made better user interfaces than just-plain-chat. A chat interface on a tool that can read your code can do more, more quickly, than one that relies on you selecting all the relevant snippets. A visual diff inside of a code editor is easier to read than a markdown-based rendering of the same in a chat transcript. Etc.

dinvlad · 2 months ago
> when the "right" format and "right" time are essentially, and maybe even necessarily, undefined, then aren't you still reaching for a "magic" solution?

Exactly the problem with all "knowing how to use AI correctly" advice out there rn. Shamans with drums, at the end of the day :-)

thomastjeffery · 2 months ago
Models are Biases.

There is no objective truth. Everything is arbitrary.

There is no such thing as "accurate" or "precise". Instead, we get to work with "consistent" and "exhaustive". Instead of "calculated", we get "decided". Instead of "defined" we get "inferred".

Really, the whole narrative about "AI" needs to be rewritten from scratch. The current canonical narrative is so backwards that it's nearly impossible to have a productive conversation about it.

andy99 · 2 months ago
It's called over-fitting, that's basically what prompt engineering is.
evjan · 2 months ago
That doesn't sound like how I understand over-fitting, but I'm intrigued! How do you mean?
felipeerias · 2 months ago
If someone asked you about the usages of a particular element in a codebase, you would probably give a more accurate answer if you were able to use a code search tool rather than reading every source file from top to bottom.

For those kinds of tasks (and there are many of them!), I don't see why you would expect something fundamentally different in the case of LLMs.

bostik · 2 months ago
In my previous job I repeatedly told people that "git grep is a superpower". Especially in a monorepo, but it works well in any big repository, really.

To this day I think the same. With the addition that knowing about "git log -S" grants you necromancy on top of the regular superpowers. The ability to do rapid code search, and especially code history search, makes you look like a wizard without the funny hat.

manishsharan · 2 months ago
I provided 'grep' as a tool to an LLM (DeepSeek) and it does a better job of finding usages. This is especially true if the code is obfuscated JavaScript.
skydhash · 2 months ago
But why not provide the search tool instead of being an imperfect interface between it and the person asking? The only reason for the latter is that you have more applied knowledge in the context and can use the tool better. For any other case, the answer should be “use this tool”.
autobodie · 2 months ago
The problem is that "right" is defined circularly.
ninetyninenine · 2 months ago
Yeah, but do we have to make a new buzzword out of it? "Context engineer"
FridgeSeal · 2 months ago
It’s just AI people moving the goalposts now that everyone has realised that “prompt engineering” isn’t a special skill.
coliveira · 2 months ago
In other words, "if AI doesn't work for you the problem is not IA, it is the user", that's what AI companies want us to believe.
j45 · 2 months ago
Everything is new to someone, and the terms of reference will evolve.
colordrops · 2 months ago
> Since these are non-deterministic machines, I fail to see any reliable heuristic that is fundamentally indistinguishable than "trying and seeing" with prompts

There are many sciences involving non-determinism that still have laws and patterns, e.g. biology and maybe psychology. It's not all or nothing.

Also, LLMs are deterministic, just not predictable. The non-determinism is injected by providers.

Anyway is there an essential difference between prompt engineering and context engineering? They seem like two names for the same thing.

simonw · 2 months ago
They arguably are two names for the same thing.

The difference is that "prompt engineering" as a term has failed, because to a lot of people the inferred definition is "a laughably pretentious term for typing text into a chatbot" - it's become indistinguishable from end-user prompting.

My hope is that "context engineering" better captures the subtle art of building applications on top of LLMs through carefully engineering their context.

phyalow · 2 months ago
“non-deterministic machines“

Not correct. They are deterministic as long as a static seed is used.

kazga · 2 months ago
That's not true in practice. Floating point arithmetic is not associative due to rounding errors, and parallel operations introduce non-determinism even at temperature 0.

Deleted Comment

pbreit · 2 months ago
What's the difference?
PeterStuer · 2 months ago
"these are non-deterministic machines"

Only if you choose so by allowing some degree of randomness with the temperature setting.

pegasus · 2 months ago
They are usually nondeterministic even at temperature 0 - due to things like parallelism and floating point rounding errors.
edflsafoiewq · 2 months ago
In the strict sense, sure, but the point is they depend not only on the seed but on seemingly minor variations in the prompt.
zelphirkalt · 2 months ago
This is what irks me so often when reading these comments. This is just software inside an ordinary computer; it always does the same thing with the same input, which includes hidden and global state. Stating that they are "non-deterministic machines" sounds like throwing in the towel and thinking "it's magic!". I am not even sure what people want to actually express when they make these false statements.

If one wants to make something give the same answers every time, one needs to control all the variables of input. This is like any other software including other machine learning algorithms.

csallen · 2 months ago
This is like telling a soccer player that no change in practice or technique is fundamentally different than another, because ultimately people are non-deterministic machines.
niemandhier · 2 months ago
LLM agents remind me of the great Nolan movie „Memento“.

The agents cannot change their internal state hence they change the encompassing system.

They do this by injecting information into it in such a way that the reaction that is triggered in them compensates for their immutability.

For this reason I call my agents „Sammy Jenkins“.

StochasticLi · 2 months ago
I think we can reasonably expect they will become non-stateless in the next few years.
tdaltonc · 2 months ago
If agents are stateful a few years from now, it will be because they accrete a layer of context engineering.
roflyear · 2 months ago
Why?
baxtr · 2 months ago
>Conclusion

Building powerful and reliable AI Agents is becoming less about finding a magic prompt or model updates. It is about the engineering of context and providing the right information and tools, in the right format, at the right time. It’s a cross-functional challenge that involves understanding your business use case, defining your outputs, and structuring all the necessary information so that an LLM can “accomplish the task."

That’s actually also true for humans: the more context (aka the right info at the right time) you provide, the better they are at solving tasks.

root_axis · 2 months ago
I am not a fan of this banal trend of superficially comparing aspects of machine learning to humans. It doesn't provide any insight and is hardly ever accurate.
furyofantares · 2 months ago
I've seen a lot of cases where, if you look at the context you're giving the model and imagine giving it to a human (just not yourself or your coworker, someone who doesn't already know what you're trying to achieve - think mechanical turk), the human would be unlikely to give the output you want.

Context is often incomplete, unclear, contradictory, or just contains too much distracting information. Those are all things that will cause an LLM to fail that can be fixed by thinking about how an unrelated human would do the job.

stefan_ · 2 months ago
There's all these philosophers popping up everywhere. This is also another one of those topics that featured in people's favorite sci-fi hyperfixation, so all discussions inevitably get ruined with sci-fi fanfic (see also: room temperature superconductivity).
ModernMech · 2 months ago
I agree, however I do appreciate comparisons to other human-made systems. For example, "providing the right information and tools, in the right format, at the right time" sounds a lot like a bureaucracy, particularly because "right" is decided for you, it's left undefined, and may change at any time with no warning or recourse.
baxtr · 2 months ago
Without my note I wouldn’t have seen this comment, which is very insightful to me at least.

https://news.ycombinator.com/item?id=44429880

layer8 · 2 months ago
The difference is that humans can actively seek to acquire the necessary context by themselves. They don't have to passively sit there and wait for someone else to do the tedious work of feeding them all necessary context upfront. And we value humans who are able to proactively do that seeking by themselves, until they are satisfied that they can do a good job.
simonw · 2 months ago
> The difference is that humans can actively seek to acquire the necessary context by themselves

These days, so can LLM systems. The tool calling pattern got really good in the last six months, and one of the most common uses of that is to let LLMs search for information they need to add to their context.

o3 and o4-mini and Claude 4 all do this with web search in their user-facing apps and it's extremely effective.

The same pattern is increasingly showing up in coding agents, giving them the ability to search for relevant files or even pull in official documentation for libraries.
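
As a rough sketch of what that pattern looks like (an illustration, not any particular product's implementation - the schema follows the OpenAI-style function-tool format, and other providers have equivalents):

  import subprocess

  # Tool definition handed to the model alongside the prompt.
  SEARCH_TOOL = {
      "type": "function",
      "function": {
          "name": "search_repo",
          "description": "Search the codebase and return matching lines with file paths.",
          "parameters": {
              "type": "object",
              "properties": {
                  "query": {"type": "string", "description": "String or regex to search for."}
              },
              "required": ["query"],
          },
      },
  }

  def search_repo(query: str) -> str:
      # Runs when the model decides to call the tool; the output goes back
      # into the model's context for the next step.
      result = subprocess.run(["git", "grep", "-n", query], capture_output=True, text=True)
      return result.stdout[:8000]  # keep the injected context bounded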

mentalgear · 2 months ago
Basically, finding the right buttons to push within the constraints of the environment. Not so much different from what (SW) engineering is, only non-deterministic in the outcomes.
QuercusMax · 2 months ago
Yeah... I'm always asking my UX and product folks for mocks, requirements, acceptance criteria, sample inputs and outputs, why we care about this feature, etc.

Until we can scan your brain and figure out what you really want, it's going to be necessary to actually describe what you want built, and not just rely on vibes.

therealdrag0 · 2 months ago
Ya reminds me of social engineering. Like we’re seeing “How to Win Programming and Influence LLMs”.
fergal · 2 months ago
This... I was about to make a similar point; the conclusion reads like a job description for a technical lead role, where they manage and define work for a team of human devs who execute the implementation.
eviks · 2 months ago
The right info at the right time is not "more", and with humans it's pretty easy to overwhelm them, so "more" does the opposite - it converts into "wrong".
lupire · 2 months ago
Not "more" context. "Better" context.

(X-Y problem, for example.)

Davidzheng · 2 months ago
I think too much context is harmful
zaptheimpaler · 2 months ago
I feel like this is incredibly obvious to anyone who's ever used an LLM or has any concept of how they work. It was equally obvious before this that the "skill" of prompt-engineering was a bunch of hacks that would quickly cease to matter. Basically they have the raw intelligence, you now have to give them the ability to get input and the ability to take actions as output and there's a lot of plumbing to make that happen.
imiric · 2 months ago
That might be the case, but these tools are marketed as having close to superhuman intelligence, with the strong implication that AGI is right around the corner. It's obvious that engineering work is required to get them to perform certain tasks, which is what the agentic trend is about. What's not so obvious is the fact that getting them to generate correct output requires some special skills or tricks. If these tools were truly intelligent and capable of reasoning, surely they would be able to inform human users when they lack contextual information instead of confidently generating garbage, and their success rate would be greater than 35%[1].

The idea that fixing this is just a matter of providing better training and contextual data, more compute or plumbing, is deeply flawed.

[1]: https://www.theregister.com/2025/06/29/ai_agents_fail_a_lot/

skort · 2 months ago
Yeah, my reaction to this was "Big deal? How is this news to anyone"

It reads like articles put out by consultants at the height of SOA. Someone thought for a few minutes about something and figured it was worth an article.

crystal_revenge · 2 months ago
Definitely mirrors my experience. One heuristic I've often used when providing context to a model is: "is this enough information for a human to solve this task?". Building some text2SQL products in the past, it was very interesting to see how often, when the model failed, a real data analyst would reply something like "oh yeah, that's an older table we don't use any more, the correct table is...". This means the model was likely making a mistake that a real human analyst would have made without the proper context.

One thing that is missing from this list is: evaluations!

I'm shocked how often I still see large AI projects being run without any regard for evals. Evals are more important for AI projects than test suites are for traditional engineering ones. You don't even need a big eval set, just one that covers your problem surface reasonably well. However, without it you're basically just "guessing" rather than iterating on your problem, and you're not even guessing in a way where each guess is an improvement on the last.
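
A minimal eval harness really can be tiny. A sketch (the cases and the `run_agent` entry point are obviously placeholders for your own system):

  EVAL_CASES = [
      {"input": "revenue by region for 2024", "expect": "fact_revenue_2024"},
      {"input": "top 10 customers by spend", "expect": "dim_customer"},
  ]

  def run_evals(run_agent):
      passed = 0
      for case in EVAL_CASES:
          output = run_agent(case["input"])
          ok = case["expect"] in output   # crude substring check; swap in an LLM judge if needed
          passed += ok
          print(("PASS" if ok else "FAIL"), case["input"])
      print(f"{passed}/{len(EVAL_CASES)} passed")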

edit: To clarify, I ask myself this question. It's frequently the case that we expect LLMs to solve problems without the necessary information for a human to solve them.

adiabatichottub · 2 months ago
A classic law of computer programming:

"Make it possible for programmers to write in English and you will find that programmers cannot write in English."

It's meant to be a bit tongue-in-cheek, but there is a certain truth to it. Most human languages fail at being precise in their expression and interpretation. If you can exactly define what you want in English, you probably could have saved yourself the time and written it in a machine-interpretable language.

hobs · 2 months ago
The thing is, all the people cosplaying as data scientists don't want evaluations, and that's why you saw so few of them in fake C-level projects, because telling people the emperor has no clothes doesn't pay.

For those actually using the products to make money, well, hey - all of those have evaluations.

shermantanktop · 2 months ago
I know this proliferation of excited wannabes is just another mark of a hype cycle, and there’s real value this time. But I find myself unreasonably annoyed by people getting high on their own supply and shouting into a megaphone.
kevin_thibedeau · 2 months ago
Asking yes no questions will get you a lie 50% of the time.
adriand · 2 months ago
I have pretty good success with asking the model this question before it starts working, as well. I’ll tell it to ask questions about anything it’s unsure of, and to ask for examples of code patterns that are already in use in the application that it can use as a template.
zacharyvoase · 2 months ago
I love how we have such a poor model of how LLMs work (or more aptly don't work) that we are developing an entire alchemical practice around them. Definitely seems healthy for the industry and the species.
simonw · 2 months ago
The stuff that's showing up under the "context engineering" banner feels a whole lot less alchemical to me than the older prompt engineering tricks.

Alchemical is "you are the world's top expert on marketing, and if you get it right I'll tip you $100, and if you get it wrong a kitten will die".

The techniques in https://www.dbreunig.com/2025/06/26/how-to-fix-your-context.... seem a whole lot more rational to me than that.

zacharyvoase · 2 months ago
As it gets more rigorous and predictable I suppose you could say it approaches psychology.
hackable_sand · 2 months ago
This is offensive to alchemy.
__MatrixMan__ · 2 months ago
Reminds me of quantum mechanics