> I don’t recall what happened next. I think I slipped into a malaise of models. 4-way split-paned worktrees, experiments with cloud agents, competing model runs and combative prompting.
You’re trying to have the LLM solve some problem that you don’t really know how to solve yourself, and then you devolve into semi-random prompting in the hope that it’ll succeed. This approach has two problems:
1. It’s not systematic. There’s no way to tell if you’re getting any closer to success. You’re just trying to get the magic to work.
2. When you eventually give up after however many hours, you haven’t succeeded, you haven’t got anything to build on, and you haven’t learned anything. Those hours were completely wasted.
Contrast this with starting to do the work yourself. You might still give up, but you’d understand the source codebase better, perhaps the relationship between Perl and TypeScript, and perhaps you’d have some basics ported over that you could build on later.
When I teach programming, some students, when stuck, will start flailing around - deleting random lines of code, changing call order, adding more functions, etc - and just hoping one of those things will “fix it” eventually.
This feels like the LLM-enabled version of this behavior (except that in the former case, students will quickly realize that what they’re doing is pointless and ask a peer or teacher for help; whereas maybe the LLM is a little too good at hijacking that and making its user feel like things are still on track).
The most important thing to teach is how to build an internal model of what is happening, identify which assumptions in your model are most likely to be faulty/improperly captured by the model, what experiments to carry out to test those assumptions…
In essence, what we call an “engineering mindset” and what good education should strive to teach.
> When I teach programming, some students, when stuck, will start flailing around - deleting random lines of code, changing call order, adding more functions, etc - and just hoping one of those things will “fix it” eventually.
That sounds like a lot of people I’ve known, except they weren’t students. More like “senior engineers”.
I definitely fall into this trap sometimes. Oftentimes that simple order of ops swap will fix my issue, but when it doesn't, it's easy to get stuck in the "just one more change" mindset instead of taking a step back to re-assess.
Funny to see this show up today, since coincidentally I've had Claude Code running for the past ~15 hours attempting to port MicroQuickJS to pure dependency-free Python, mainly as an experiment in how far a porting project can go, but also because a sandboxed (memory constrained, plus time limits) JavaScript interpreter that runs in Python is something I really want to exist.
I'm currently torn on whether to actually release it - it's in a private GitHub repository at the moment. It's super-interesting and I think complies just fine with the MIT licenses on MicroQuickJS so I'm leaning towards yes.
TI had a similar idea with the TI-99/4: running interpreted BASIC programs using a BASIC interpreter written in a special interpreted language (GPL) running in its own virtual machine, with the actual CPU machine code executing from RAM accessible through a single-byte window of the video processor. Really brilliant system, turtles all the way down.
I wouldn't trust it without a deeper inspection. I've had Claude do a workaround (i.e., use a JavaScript interpreter and wrap it in Python) and then claim that it completed the task! The CoT was an interesting read on how its mind thinks about my mind ("the user wants ... but this should also achieve this ... the user however asked it to be this ... but this can get what the user wants ..."; that kind of salad).
How many tests do other JS runtimes like V8 have? ~400 tests sounds reasonable for a single data structure, but orders of magnitude off for a language runtime.
I build and distribute software in Python. My ideal solution is something that installs cleanly via pip so I can include it as a regular dependency of my other projects.
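To make that wish concrete, here is the rough shape of API I would hope for. Everything in this sketch is hypothetical: the package name mquickjs_py, the Interpreter class, and its keyword arguments are invented purely for illustration, and the actual (still unreleased) library may look nothing like this.

```python
# Hypothetical usage sketch only. The "mquickjs_py" package, the Interpreter
# class, and these keyword arguments are invented for illustration; this is
# not the API of the unreleased library discussed above.
from mquickjs_py import Interpreter, SandboxError  # hypothetical import

# The limits are the whole point: the embedded JS engine should fail safely
# instead of taking the host Python process down with it.
interp = Interpreter(
    memory_limit=16 * 1024 * 1024,  # illustrative heap cap, in bytes
    time_limit=2.0,                 # illustrative wall-clock limit, in seconds
)

try:
    result = interp.eval("JSON.stringify([1, 2, 3].map(x => x * x))")
    print(result)  # expected: "[1,4,9]"
except SandboxError as exc:
    # A limit was exceeded or the script was rejected; the host keeps running.
    print(f"sandboxed script failed: {exc}")
```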
It's amusing to think that Claude might be better at generating ASCII diagrams than at generating code to generate diagrams, despite being nominally better at generating code.
I'm generating a lot of PDFs* in Claude, so it does ASCII diagrams for those, and it's generally very good at it, but it likely has a lot of such diagrams in its training set. What it then doesn't do very well is keeping them aligned under modification. It can one-shot the diagram; it can't update it very well.
* Well, generating Typst mark-up, anyway.
The euphoric breakthrough into frustration of so-called vibe-coding is well recognised at this point. Sometimes you just have to step back and break the task down smaller. Sometimes you just have to wait a few months for an even better model which can now do what the previous one struggled at.
I won't deny OP learned something in this process, but I can't help but wonder: if they had spent the same time and effort just porting the code themselves, how much more would they have learned?
Especially considering that the output would be essentially the same: a bunch of code that doesn't work.
That may be true, but it does seem like OP's intent was to learn something about how LLM agents perform on complex engineering tasks, rather than learning about ASCII creation logic. A different but perhaps still worthy experiment.
I guess it depends on how much people want to learn things like "Perl (and C) library to web" skills. Personally, there are languages I don't want to learn, but for one reason or another I have to change some details in a project that happens to use that language. Sure, I could sit down and learn enough of the language so I can do the thing, but if I don't like or want to use that language, the knowledge will eventually atrophy anyway, so why bother?
I think the specific language in question - Perl - is really the source of OP's frustration. Perl is kind of like regular expressions - much easier to write than it is to read. I would expect LLMs to struggle with understanding Perl. It's one of the best languages for producing obfuscated code by hand. There are many subtleties and a lot of context-dependence in Perl, and they aren't immediately apparent from the raw syntax.
Edit: I totally agree with your point about not wanting to learn a language. That's definitely a situation where LLMs can excel and almost an ideal use case for them. I just think that Perl, in particular, will be hard to work with, given the current capabilities of LLM coding tools and models. It might be necessary to actually learn the language, and even that might not be enough.
While there's not a lot of meat on the bone for this post, one section of it reflects the overall problem with the idea of Claude-as-everything:
> I spent weeks casually trying to replicate what took years to build. My inability to assess the complexity of the source material was matched by the inability of the models to understand what it was generating.
When the trough of disillusionment hits, I anticipate this will become collective wisdom, and we'll tailor LLMs to the subset of uses where they can be more helpful than hurtful. Until then, we'll try to use AI to replace in weeks what took us years to build.
If LLMs stopped improving today I’m sure you would be correct - as it is, I think it’s very hard to predict what the future holds and where the advancements take us.
I don’t see a particularly good reason why LLMs wouldn’t be able to do most programming tasks, with the limitation being our ability to specify the problem sufficiently well.
I feel like we’ve been hearing this for 4 years now. The improvements to programming (IME) haven’t come from improved models; they’ve come from agents, tooling, and environment integrations.
LLM capability improvement is hitting a plateau, with recent advancements mostly relying on accessing context locally (RAG) or remotely (MCP), with a lot of extra tokens (read: drinking water and energy) being spent prompting models for "reasoning". Foundation-wise, observed improvements are incremental, not exponential.
> able to do most programming tasks, with the limitation being our ability to specify the problem sufficiently well
We've spent 80 years trying to figure that out. I'm not sure why anyone would think we're going to crack this one anytime in the next few years.
I would think/hope that the code-assist LLMs would be optimizing towards supportable/legible code solutions overall. Mostly in that they can at least provide a jumping-off point, while accepting that they more often than not won't be able to produce complete, finished solutions.
As always, the answer is "divide & conquer". Works for humans, works for LLMs. Divide the task into steps that are as small and as easy to verify as possible, ideally steps you can verify automatically by running one command. Once that's done, either do it yourself or offload it to an LLM; if the design and task splitting are done properly, it shouldn't really matter. Task too difficult? Divide it into smaller steps.
Judging from this, an approach might have been to port the 28 modules individually and check that everything returns the same data in the Perl and TS versions:
"I took a long-overdue peek at the source codebase. Over 30,000 lines of battle-tested Perl across 28 modules. A* pathfinding for edge routing, hierarchical group rendering, port configurations for node connections, bidirectional edges, collapsing multi-edges. I hadn’t expected the sheer interwoven complexity."
Well, ideally we teach the AIs how to divide-and-conquer. I don’t care whether my AI coding assistant is multiple LLMs (or other models) working together.
They already know how to. But you have to tell them that's the way you want them to operate, tell them how to keep track of it, and tell them how to determine when each step is done. You need to specify what you want both in terms of the final result and in terms of the process.
The AIs are super capable now, but they still need a lot of guiding towards the right workflow for the project. They're like a sports team: you still need to be a good coach.
It is really important that such posts exist. There is the risk that we only hear about the wild successes and never the failures. But from the failures we learn much more.
One difference between this story and the various success stories is that the latter all had comprehensive test suites as part of the source material that agents could use to gain feedback without human intervention. This doesn’t seem to exist in this case, which may simply be the deal breaker.
> This doesn’t seem to exist in this case, which may simply be the deal breaker.
Perhaps, but perhaps not. The reason tests are valuable in these scenarios is they are actually a kind of system spec. LLMs can look at them to figure out how a system should (and should not) behave, and use that to guide the implementation.
I don’t see why regular specs (e.g. markdown files) could not serve the same purpose. Of course, most GitHub projects don’t include such files, but maybe that will change as time goes on.
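For what it's worth, the gap between the two forms can be small: a rule stated in a markdown spec often maps one-for-one onto an executable test. A toy illustration (the collapse_multi_edges helper below is invented here so the example runs on its own; it is not the real codebase's API):

```python
# Spec, markdown form: "Parallel edges between the same pair of nodes are
# collapsed into a single edge whose label lists each original label."
#
# The same rule as an executable check. collapse_multi_edges is a toy
# stand-in defined here so the example is self-contained.
from collections import defaultdict

def collapse_multi_edges(edges):
    """Merge (src, dst, label) edges that share the same endpoints."""
    grouped = defaultdict(list)
    for src, dst, label in edges:
        grouped[(src, dst)].append(label)
    return [(src, dst, ", ".join(labels)) for (src, dst), labels in grouped.items()]

def test_parallel_edges_are_collapsed():
    edges = [("a", "b", "x"), ("a", "b", "y"), ("b", "c", "z")]
    assert collapse_multi_edges(edges) == [("a", "b", "x, y"), ("b", "c", "z")]
```

The test version has the advantage the thread already points out: an agent can run it and find out whether the port actually honors the rule, instead of just reading about it.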
It turns out that having a "trainer" to "coach" you is not a coincidence: these two words evolved together from the rail industry to the gym. Do "port" and "ship" have a similar history, evolving together from the maritime industry to software?
As far as I can tell, no. The relationship isn't the same; in software, the "port" is the translated software itself, not the destination platform.
The etymological roots are quite interesting, though. We aren't quite sure where the word "ship" comes from — Etymonline hazards:
> Watkins calls this a "Germanic noun of obscure origin." OED says "the ultimate etymology is uncertain." Traditionally since Pokorny it is derived from PIE root *skei- "to cut, split," perhaps on the notion of a tree cut out or hollowed out, but the semantic connection is unclear. Boutkan gives it "No certain IE etymology."
The word "port" goes back to the PIE root "*per-" meaning "forward", and thus as a verb "to lead". It seems to have emerged in Latin in multiple forms: the word "portus" ("harbor"), verb "portare" (to carry or bring). I was surprised to learn that the English "ferry" does not come from the other Latin verb with the sense of carrying (the irregular "ferre"), but from Germanic and Norse words... that are still linked back to "*per-".
Basically, transportation (same "port"!) has been important to civilization for a long time, and quite a bit of it was done by, well, shipping. And porting software is translating the code; the "lat" there comes from the past participle of the irregular Latin verb mentioned above, about which
> Presumably lātus was taken (by a process linguists call suppletion) from a different, pre-Latin verb. By the same process, in English, went became the past tense of go. Latin lātus is said by Watkins to be from *tlatos, from PIE root *tele- "to bear, carry" (see extol), but de Vaan says "No good etymology available."
> It turns out that having a "trainer" to "coach" you is not a coincidence: these two words evolved together from the rail industry to the gym.
This does not appear to be true.
Train (etymonline):
> "to discipline, teach, bring to a desired state or condition by means of instruction," 1540s, which probably is extended from the earlier sense of "draw out and manipulate in order to bring to a desired form" (Middle English trainen, attested c. 1400 as "delay, tarry" on a journey, etc.); from train (n.) For the notion of "educate" from that of "draw," compare educate.
[That train (n.) doesn't refer to the rail industry, which didn't really exist in the 1540s. It refers to a succession (as one railcar will follow another in later centuries), or to the part of your clothing that might drag on the ground behind you, or to the act of dragging anything generally. Interestingly, etymonline derives this noun from a verb train meaning to drag; given the existence of this verb, I see no reason to derive the verb train in the sense "teach" from the noun derived from the same verb in the sense "drag". The entry on the verb already noted that it isn't unexpected for "drawing" [as water from a well] to evolve into "teaching".]
Coach (wiktionary):
> The meaning "instructor/trainer" is from Oxford University slang (c. 1830) for a "tutor" who "carries" one through an exam
Coach might be a metaphor from the rail industry (or the horse-and-buggy industry), but trainer isn't.
> I'm currently torn on whether to actually release it - it's in a private GitHub repository at the moment. It's super-interesting and I think complies just fine with the MIT licenses on MicroQuickJS so I'm leaning towards yes.
It's got to 402 tests with 2 failing - the big unlock was the test suite from MicroQuickJS: https://github.com/bellard/mquickjs/tree/main/tests
It's been spitting out lines like this as it works:
I don't know how complete it is, but it solved YouTube's challenges etc. for a long time:
https://github.com/yt-dlp/yt-dlp/blob/6d92f87ddc40a319590976...
Here's the transcript showing how I built it: https://static.simonwillison.net/static/2025/claude-code-mic...
Though if you look in those files, some of them run a ton of test functions and assertions.
My new Python library executes copies of the tests from that mquickjs repo - but those only count as 7 of the 400+ other tests.
Or why not run MicroQuickJS under Fil-C? It's ideal since it has no dependencies.
> We've spent 80 years trying to figure that out. I'm not sure why anyone would think we're going to crack this one anytime in the next few years.
Such has always been the largest issue with software development projects, IMO.
"I took a long-overdue peek at the source codebase. Over 30,000 lines of battle-tested Perl across 28 modules. A* pathfinding for edge routing, hierarchical group rendering, port configurations for node connections, bidirectional edges, collapsing multi-edges. I hadn’t expected the sheer interwoven complexity."
The AI's are super capable now, but still need a lot of guiding towards the right workflow for the project. They're like a sports team, but you still need to be a good coach.
One difference between this story and the various success stories is that the latter all had comprehensive test suites as part of the source material that agents could use to gain feedback without human intervention. This doesn’t seem to exist in this case, which may simply be the deal breaker.
Perhaps, but perhaps not. The reason tests are valuable in these scenarios is they are actually a kind of system spec. LLMs can look at them to figure out how a system should (and should not) behave, and use that to guide the implementation.
> I don’t see why regular specs (e.g. markdown files) could not serve the same purpose. Of course, most GitHub projects don’t include such files, but maybe that will change as time goes on.
I think it's because they're doomed to become outdated without something actually enforcing the spec.