jumploops commented on A few random notes from Claude coding quite a bit last few weeks   twitter.com/karpathy/stat... · Posted by u/bigwheels
dexdal · 13 days ago
The failure mode is missing constraints, not “coding skill”. Treat the model as a generator that must operate inside an explicit workflow: define the invariant boundaries, require a plan/diff before edits, run tests and static checks, and stop when uncertainty appears. That turns “hacky conditional” behaviour into controlled change.
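A minimal sketch of that kind of gated loop (every function here is a hypothetical stand-in for a model/toolchain call, not a real API):

    // Hypothetical gated loop around a code-generating model; all names
    // are stand-ins, not a real library.
    declare function proposePlan(task: string): Promise<{ diff: string; uncertain: boolean }>;
    declare function applyDiff(diff: string): Promise<void>;
    declare function runChecks(): Promise<{ testsPass: boolean; staticClean: boolean }>;

    async function controlledChange(task: string): Promise<boolean> {
      const plan = await proposePlan(task); // plan/diff required before any edit
      if (plan.uncertain) return false;     // stop when uncertainty appears
      await applyDiff(plan.diff);
      const { testsPass, staticClean } = await runChecks(); // tests + static checks
      return testsPass && staticClean;      // the change only counts if the gates pass
    }
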
jumploops · 13 days ago
Yes, exactly.

The LLM is onboarding to your codebase with each context window; all it knows is what it’s seen already.

jumploops commented on A few random notes from Claude coding quite a bit last few weeks   twitter.com/karpathy/stat... · Posted by u/bigwheels
mh2266 · 13 days ago
are LLM codebases not messy?

Claude by default, unless I tell it not to, will write stuff like:

    // we need something to be true
    somethingPasses = something()
    if (!somethingPasses) {
        return false
    }

    // we need somethingElse to be true
    somethingElsePasses = somethingElse()
    if (!somethingElsePasses) {
        return false
    }

    return true
instead of the very simple boolean logic that could express this in one line, with the "this code does what it obviously does" comments added all over the place.
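For contrast, the one-line version (same hypothetical functions) would just be:

    return something() && somethingElse()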

generally unless you tell it not to, it does things in very verbose ways that most humans would never do, and since there's an infinite number of ways that it can invent absurd verbosity, it is hard to preemptively prompt against all of them.

to be clear, I am getting a huge amount of value out of it for executing a bunch of large refactors and "modernization" of a (really) big legacy codebase at scale and in parallel. but it's not outputting the sort of code that I see when someone prompts it "build a new feature ...", and a big part of my prompts is screaming at it not to do certain things or to refuse the task if it at any point becomes unsure.

jumploops · 13 days ago
Yeah, to be clear, it will have the same issues as a fly-by contributor if prompted that way.

Meaning if you ask it “handle this new condition” it will happily throw in a hacky conditional and get the job done.

I’ve found the most success in having it explicitly reason about the current architecture, then propose a set of changes (2-5 options) to accomplish the task, review them, and finally implement the change that best suits the scope of the larger system.

jumploops commented on A few random notes from Claude coding quite a bit last few weeks   twitter.com/karpathy/stat... · Posted by u/bigwheels
smusamashah · 14 days ago
The code base I work on at $dayjob$ is legacy; it has a few files with 20k lines each and a few more with around 10k lines each. It's hard to find things and connect the dots in a code base like that. I don't think LLMs are able to navigate and understand code bases of that size yet, but I have seen lots of seemingly large projects shown here lately that involve thousands of files and millions of lines of code.
jumploops · 14 days ago
I’ve found that LLMs seem to work better on LLM-generated codebases.

Commercial codebases, especially private internal ones, are often messy. It seems this is mostly due to the iterative nature of development in response to customer demands.

As a product gets larger, and addresses a wider audience, there’s an ever-increasing chance of divergence between the initial assumptions and the new requirements.

We call this tech debt.

Combine this with a revolving door of developers, and you start to see Conway’s law in action, where the system resembles the organization of the developers rather than the “pure” product spec.

With this in mind, I’ve found success in using LLMs to refactor existing codebases to better match the current requirements (i.e. splitting out helpers, modularizing, renaming, etc.).

Once the legacy codebase is “LLMified”, the coding agents seem to perform more predictably.

YMMV here, as it’s hard to do large refactors without tests for correctness.

(Note: I’ve dabbled with a test-first refactor approach; I haven’t tested it enough to claim it works, but I believe it could.)
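For what it’s worth, a minimal sketch of that test-first idea, assuming a Jest-style runner (all names hypothetical): pin down today’s behavior with a characterization test before letting the agent refactor.

    // Characterization test: freeze what the legacy code does *today*, so a
    // refactor that changes behavior fails the suite. All names hypothetical.
    import { legacyTotal } from "./billing";

    const sampleInvoices = [
      { amount: 100, taxRate: 0.2 },
      { amount: 250, taxRate: 0.0 },
    ];

    test("legacyTotal matches current behavior", () => {
      expect(legacyTotal(sampleInvoices)).toMatchSnapshot();
    });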

jumploops commented on Prism   openai.com/index/introduc... · Posted by u/meetpateltech
jumploops · 14 days ago
I’ve been “testing” LLM willingness to explore novel ideas/hypotheses on a few random topics[0].

The earlier LLMs were interesting, in that their sycophantic nature made them eagerly agree, often without much criticality.

With said sycophancy reduced, I’ve found that certain LLMs (especially the reasoning models) are much more unwilling to move past the “known” science[1].

I’m curious to see how/if we can strike the right balance with an LLM focused on scientific exploration.

[0]Sediment lubrication due to organic material in specific subduction zones, potential algorithmic basis for colony collapse disorder, potential to evolve anthropomorphic kiwis, etc.

[1]Caveat: it’s very easy for me to tell when an LLM is “off-the-rails” on a topic I know a lot about; much less so, and much more dangerous, for these “tests” where I’m certainly no expert.

jumploops commented on Kimi Released Kimi K2.5, Open-Source Visual SOTA-Agentic Model   kimi.com/blog/kimi-k2-5.h... · Posted by u/nekofneko
jumploops · 14 days ago
> For complex tasks, Kimi K2.5 can self-direct an agent swarm with up to 100 sub-agents, executing parallel workflows across up to 1,500 tool calls.

> K2.5 Agent Swarm improves performance on complex tasks through parallel, specialized execution [..] leads to an 80% reduction in end-to-end runtime

Not just RL on tool calling, but RL on agent orchestration, neat!
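No idea what Kimi's internals actually look like, but the basic fan-out shape is roughly this (all names hypothetical, not Kimi's implementation):

    // Hypothetical orchestrator fan-out: run sub-agents in parallel, then merge.
    // runSubAgent is a stand-in for "spawn a model call with its own context".
    declare function runSubAgent(prompt: string): Promise<string>;

    async function runSwarm(task: string, subtasks: string[]): Promise<string> {
      // Each sub-agent gets its own context window and tool budget.
      const results = await Promise.all(subtasks.map((s) => runSubAgent(s)));
      // A final pass reconciles the parallel results into one answer.
      return runSubAgent(`Merge these results for "${task}":\n${results.join("\n")}`);
    }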

jumploops commented on AI code and software craft   alexwennerberg.com/blog/2... · Posted by u/alexwennerberg
jumploops · 14 days ago
> People have said that software engineering at large tech companies resembles "plumbing"

> AI code [..] may also free up a space for engineers seeking to restore a genuine sense of craft and creative expression

This resonates with me, as someone who joined the industry circa 2013 and discovered that most of the big tech jobs were essentially glorified plumbing.

In the 2000s, the web felt more fun, more unique, more unhinged. Websites were simple, and Flash was rampant, but it felt like the ratio of creators to consumers was higher than now.

With Claude Code/Codex, I've built a bunch of things that usually would die at a domain name purchase or init commit. Now I actually have the bandwidth to ship them!

This ease of dev also means we'll see an explosion in slopware, which we're already starting to see with App Store submissions up 60% over the last year[0].

My hope is that, with the increase of slop, we'll also see an increase in craft. Even if the proportion drops, the scale should make up for it.

We sit in prefab homes, cherishing the cathedrals of yesteryear, often forgetting that we've built skyscrapers the ancient architects could never dream of.

More software is good. Computers finally work the way we always expected them to!

[0]https://www.a16z.news/p/charts-of-the-week-the-almighty-cons...

jumploops commented on Unrolling the Codex agent loop   openai.com/index/unrollin... · Posted by u/tosh
CjHuber · 18 days ago
It depends on the API path. Chat Completions does what you describe, but isn't it legacy?

I've only used Codex with the responses v1 API, and there it's the complete opposite: already-generated reasoning tokens even persist when you send another message (without rolling back) after cancelling turns before they've finished the thought process.

Also, with responses v1, xhigh mode eats through the context window multiple times faster than the other modes, which does check out with this.

jumploops · 16 days ago
That’s what I used to think, before chatting with the OAI team.

The docs are a bit misleading/opaque, but essentially reasoning persists across multiple sequential assistant turns and is discarded upon the next user turn[0].

The diagram on that page makes it pretty clear, as does the section on caching.

[0]https://cookbook.openai.com/examples/responses_api/reasoning...

jumploops commented on Unrolling the Codex agent loop   openai.com/index/unrollin... · Posted by u/tosh
EnPissant · 17 days ago
Thanks. That's really interesting. That documentation certainly does say that reasoning from previous turns is dropped (a turn being an agentic loop between user messages), even if you include the encrypted content for those turns in the API calls.

I wonder why the second PR you linked was made, then. Maybe the documentation is outdated? Or maybe it's just to let the server be in complete control of what gets dropped and when, like it is when you're using responses statefully? That could be because the behavior has changed, or because they want the option to change it in the future. Also, Codex uses a different endpoint than the API, so maybe there are some other differences?

Also, this would mean that the tail of the KV cache that contains each new turn must be thrown away when the next turn starts. But I guess that's not a very big deal, as it only happens once for each turn.

EDIT:

This contradicts the caching documentation: https://developers.openai.com/blog/responses-api/

Specifically:

> And here’s where reasoning models really shine: Responses preserves the model’s reasoning state across those turns. In Chat Completions, reasoning is dropped between calls, like the detective forgetting the clues every time they leave the room. Responses keeps the notebook open; step‑by‑step thought processes actually survive into the next turn. That shows up in benchmarks (TAUBench +5%) and in more efficient cache utilization and latency.

jumploops · 17 days ago
I think the delta may be an overloaded use of "turn"? The Responses API does preserve reasoning across multiple "agent turns", but doesn't appear to do so across multiple "user turns" (as of November, at least).

In either case, the lack of clarity on the Responses API's inner workings isn't great. As a developer, I send all the encrypted reasoning items with the Responses API and expect them to still matter, not get silently discarded[0]:

> you can choose to include reasoning 1 + 2 + 3 in this request for ease, but we will ignore them and these tokens will not be sent to the model.

[0]https://raw.githubusercontent.com/openai/openai-cookbook/mai...

jumploops commented on Unrolling the Codex agent loop   openai.com/index/unrollin... · Posted by u/tosh
EnPissant · 18 days ago
I don't think this is true.

I'm pretty sure that Codex uses reasoning.encrypted_content=true and store=false with the responses API.

reasoning.encrypted_content=true - The server will return all the reasoning tokens in an encrypted blob you can pass along in the next call. Only OpenAI can decrypt them.

store=false - The server will not persist anything about the conversation on the server. Any subsequent calls must provide all context.

Combined, the two options above turn the responses API into a stateless one. Without these options it will still persist reasoning tokens in an agentic loop, but that's done statefully, without the client passing the reasoning along each time.
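If that's right, the call would look roughly like this with the official openai Node SDK (model name and prompt are illustrative):

    // Stateless Responses API sketch: nothing persisted server-side, and the
    // reasoning comes back encrypted so the client can resend it next turn.
    import OpenAI from "openai";

    const client = new OpenAI();

    async function statelessTurn() {
      const response = await client.responses.create({
        model: "gpt-5.1", // illustrative
        input: "Run the test suite and fix any failures.",
        store: false, // server keeps no conversation state
        include: ["reasoning.encrypted_content"], // reasoning returned as an encrypted blob
      });
      // On the next call, the client passes response.output (including the
      // encrypted reasoning items) back as part of `input`.
      return response.output;
    }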

jumploops · 17 days ago
Maybe it's changed, but this is certainly how it was back in November.

I would see my context window jump in size after each user turn (e.g. from 70% to 85% remaining).

Built a tool to analyze the requests, and sure enough the reasoning tokens were removed from past responses (but only between user turns). Here are the two relevant PRs [0][1].

While trying to get to the bottom of it, I heard from someone at OAI who said this was expected and a limitation of the Responses API (interesting sidenote: Codex uses the Responses API, but passes the full context with every request).

This is the relevant part of the docs[2]:

> In turn 2, any reasoning items from turn 1 are ignored and removed, since the model does not reuse reasoning items from previous turns.

[0]https://github.com/openai/codex/pull/5857

[1]https://github.com/openai/codex/pull/5986

[2]https://cookbook.openai.com/examples/responses_api/reasoning...

jumploops commented on Unrolling the Codex agent loop   openai.com/index/unrollin... · Posted by u/tosh
jumploops · 18 days ago
One thing that surprised me when diving into the Codex internals was that the reasoning tokens persist during the agent tool call loop, but are discarded after every user turn.

This helps preserve context over many turns, but it can also mean some context is lost between two related user turns.

A strategy that's helped me here is having the model write progress updates (along with general plans/specs/debug notes/etc.) to markdown files, acting as a sort of "snapshot" that works across many context windows.
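As a sketch, such a file might look like this (structure and contents are just one way to do it):

    # PROGRESS.md (hypothetical snapshot for a billing migration)

    ## Done
    - Extracted invoice helpers into src/billing/helpers.ts
    - Ported tax calculation; existing tests pass

    ## Next
    - Remove the legacy API client once all callers are migrated

    ## Open questions
    - Is the 2019 discount path still reachable in production?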

u/jumploops

Karma: 2206 · Cake day: March 1, 2019

About: username @ gmail