aray07 (u/aray07) - Readit News

aray07 commented on Agents that run while I sleep claudecodecamp.com/p/i-m-... · Posted by u/aray07

jakevoytko · 5 days ago

I’ve been playing around with these kinds of prompts. My experience is that the prompts need a lot of iteration to truly one-shot something that is halfway usable. If it’s under-spec’d it’ll just return after 15-20 minutes with something that’s not even half baked. If I give it an extremely detailed spec it’ll start dropping requirements and then finish around the 60-70 minute mark, but I needed 20 minutes to write the prompt and I need to hunt for the things it didn’t bother to do.

I’ve gotten some success iterating on the one-shot prompt until it’s less work to productionize the newest artifact than to start over, and it does have some learning benefits to iterate like this. I’m not sure if it’s any faster than just focusing on the problem directly though.

aray07 · 5 days ago

The dropping requirements problem is real. What's helped us is breaking the spec into numbered ACs and having the verification run per-criterion. If AC-3 fails you know exactly what got dropped.

aray07 commented on Agents that run while I sleep claudecodecamp.com/p/i-m-... · Posted by u/aray07

seer · 5 days ago

This seems quite amazing really, thanks for sharing

What is the scope of projects / features you’ve seen this be successful at?

Do you have a step before where an agent verifies that your new feature spec is not contradictory, ambiguous etc. Maybe as reviewed with regards to all the current feature sets?

Do you make this a cycle per step - by breaking down the feature to small implementable and verifiable sub-features and coding them in sequence, or do you tell it to write all the tests first and then have at it with implementation and refactoring?

Why not refactor-red-green-refactor cycle? E.g. a lot of the time it is worth refactoring the existing code first, to make a new implementation easier, is it worth encoding this into the harness?

aray07 · 5 days ago

I do it per feature, not per step. Write the AC for the whole feature upfront, then the agent builds against it. I haven't added a spec-validation step before coding but that's a good idea. Catching ambiguity in the spec before the agent runs with it would save a lot of rework

aray07 commented on Agents that run while I sleep claudecodecamp.com/p/i-m-... · Posted by u/aray07

aray07 · 5 days ago

Agreed. The spec file is context. Writing acceptance criteria before you prompt provides the context the agent needs to not go off in the wrong direction. Human leverage just moved up and the plan/spec is the most important step.

Parallelism on top of bad context just gets you more wrong answers faster

aray07 commented on Agents that run while I sleep claudecodecamp.com/p/i-m-... · Posted by u/aray07

josephg · 5 days ago

Testing works because tests are (essentially) a second, crappy implementation of your software. Tests only pass if both implementations of your software behave the same way. Usually that will only happen if the test and the code are both correct. Imagine if your code (without tests) has a 5% defect rate. And the tests have a 5% defect rate (with 100% test coverage). Then ideally, you will have a 5%^2 defect rate after fixing all the bugs. Which is 0.25%.

The price you pay for tests is that they need to be written and maintained. Writing and maintaining code is much more expensive than people think.

Or at least it used to be. Writing code with claude code is essentially free. But the defect rate has gone up. This makes TDD a better value proposition than ever.

TDD is also great because claude can fix bugs autonomously when it has a clear failing test case. A few weeks ago I used claude code and experts to write a big 300+ conformance test suite for JMAP. (JMAP is a protocol for email). For fun, I asked claude to implement a simple JMAP-only mail server in rust. Then I ran the test suite against claude's output. Something like 100 of the tests failed. Then I asked claude to fix all the bugs found by the test suite. It took about 45 minutes, but now the conformance test suite fully passes. I didn't need to prompt claude at all during that time. This style of TDD is a very human-time efficient way to work with an LLM.

aray07 · 5 days ago

This is great. The tests in this case are the spec. When you give the agent something concrete to fail against, it knows what done looks like.

The problem is if you skip that step and ask Claude to write the tests after.

aray07 commented on Agents that run while I sleep claudecodecamp.com/p/i-m-... · Posted by u/aray07

palmotea · 5 days ago

> it's just that there's a human limit on how much garbage they can type out in their allocated time.

Another example where removing friction and constraints is a bad thing.

aray07 · 5 days ago

i think the friction has moved upstream - now it's working on the right thing and specifying what correct looks like. i don't think we are going back to a world where we will write code by hand again.

aray07 commented on Agents that run while I sleep claudecodecamp.com/p/i-m-... · Posted by u/aray07

recroad · 5 days ago

Am I supposed to be impressed by this? I think people are now just using agents for the sake of it. I'm perfectly happy running two simple agents, one for writing and one for reviewing. I don't need to go be writing code at faster than light speed. Just focusing on the spec, and watching the agent as it does its work and intervening when it goes sideways is perfectly fine with me. I'm doing 5-7x productivity easily, and don't need more than that.

I also spend most of my time reviewing the spec to make sure the design is right. Once I'm done, the coding agent can take 10 minutes or 30 minutes. I'm not really in that much of a rush.

aray07 · 5 days ago

yup, agree - i spend most of my time reviewing the spec. The highest leverage time is now deciding what to work on and then working on the spec. I ended up building the verify skill (https://github.com/opslane/verify) because I wanted to ensure claude follows the spec. I have found that even after you have the spec - it can sometimes not follow it and it takes a lot of human review to catch those issues.

aray07 commented on Agents that run while I sleep claudecodecamp.com/p/i-m-... · Posted by u/aray07

bhouston · 6 days ago

I call this "Test Theatre" and it is real. I wrote about it last year:

https://benhouston3d.com/blog/the-rise-of-test-theater

You have to actively work against it.

aray07 · 5 days ago

Test theatre is exactly the right framing. The tests are syntactically correct, they run, they pass but do they actually prove anything?

aray07 commented on Agents that run while I sleep claudecodecamp.com/p/i-m-... · Posted by u/aray07

afro88 · 6 days ago

I guess to reach this point you have already decided you don't care what the code looks like.

Something I'm starting to struggle with is when agents can now do longer and more complex tasks, how do you review all the code?

Last week I did about 4 weeks of work over 2 days first with long running agents working against plans and checklists, then smaller task clean ups, bugfixes and refactors. But all this code needs to be reviewed by myself and members from my team. How do we do this properly? It's like 20k of line changes over 30-40 commits. There's no proper solution to this problem yet.

One solution is to start from scratch again, using this branch as a reference, to reimplement in smaller PRs. I'm not sure this would actually save time overall though.

aray07 · 6 days ago

yeah honestly thats what i am struggling with too and I dont have a a good solution. However, I do think we are going to see more of this - so it will be interesting to see how we are going to handle this.

i think we will need some kind of automated verification so humans are only reviewing the “intent” of the change. started building a claude skill for this (https://github.com/opslane/verify)