Readit News
technocrat8080 commented on Shall I implement it? No   gist.github.com/bretonium... · Posted by u/breton
dostick · 3 days ago
It's gotten so bad that Claude will pretend, 10 times out of 10, that the task is done or the on-screen bug is fixed. It will even output the screenshot in chat, where you can see pretty clearly that the bug is not fixed.

I consulted Claude chat, and it admitted this is a major problem with Claude these days. It suggested I ask for the coordinates of the UI controls on the screenshot, thus forcing it to actually look. So I did that next time, and it just gave me invented coordinates for objects on the screenshot.

I consulted Claude chat again: how else can I force it to actually look at the screenshot? It said to delegate to another "QA" agent that does only one thing: look at the screenshot and give a verdict.

I did that. Next time, again, the job was reported done, but the screenshot showed otherwise. It turns out the agent did everything as instructed: it spawned an agent, and the QA agent inspected the screenshot. But instead of adopting that agent's conclusion, the coder agent gave its own verdict that the job was done.

It will do anything: if you don't cover every possible situation, it will find a "technicality", a loophole that lets it declare the job done no matter what.

And on top of that, if you develop for native macOS, there's no official tooling for visual verification. It's as if 95% of development is web and LLM providers only care about that.

technocrat8080 · 3 days ago
You can provide the screencapture cli as a tool to Claude and it will take screenshots (of specific windows) to verify things visually.
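A minimal sketch of what such a tool wrapper might look like (the function names are my own invention; on macOS, screencapture's -l flag captures a single window by its window ID, and -x suppresses the shutter sound):

```python
import subprocess

def screencapture_cmd(window_id: int, out_path: str) -> list[str]:
    """Build the macOS screencapture invocation for one specific window."""
    return ["screencapture", "-x", f"-l{window_id}", out_path]

def capture_window(window_id: int, out_path: str) -> str:
    # Expose this function to Claude as a tool; returning the file path
    # lets the model (or a QA subagent) be pointed at the actual image.
    subprocess.run(screencapture_cmd(window_id, out_path), check=True)
    return out_path
```

Because the tool returns a concrete file path, a follow-up step can attach that image for inspection rather than trusting the model's claim that the bug is fixed.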
technocrat8080 commented on Ask HN: Embedding Claude Code as infrastructure?    · Posted by u/technocrat8080
ruso-0 · 4 days ago
I've been doing exactly this for the past few weeks. Claude Code + MCP plugins is the setup that really worked for me.

The key thing I learned was not to just point Claude at your repository and hope for the best. The raw approach burns tokens incredibly fast: Claude reads entire files when it only needs one piece, retries incorrect edits more than five times, and loses context halfway through.

What really works is giving Claude a CLAUDE.md file in the root of your repository with specific workflow instructions (which tools to prefer, when to compress versus read raw, etc.). Claude Code reads it automatically at the start of a session. Think of it as an .editorconfig, but for AI behavior.
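For illustration, a hypothetical CLAUDE.md along those lines (every instruction here is invented; adapt it to your own repo and tooling):

```markdown
# CLAUDE.md (hypothetical example)

## Workflow
- Prefer `rg` for searching; open a file only after a search hit, never to browse.
- After each edit, run only the tests for the touched module, not the full suite.
- Stop and ask after two failed attempts at the same edit.

## Context
- The architecture overview lives in docs/ARCHITECTURE.md; read it instead of
  exploring src/ from scratch.
```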

For the $25/PR review use case specifically, the bottleneck isn't Claude's intelligence but context window management. A repository of 500 files can exhaust the window before Claude finishes reviewing. You'd need some kind of indexing layer that gives Claude only the snippets relevant to each PR diff, not the entire codebase.
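A rough sketch of that indexing idea, assuming you drive it from git diff (the function names are mine; a real version would also pull in files that import the changed ones):

```python
import subprocess

def changed_files(base: str = "main") -> list[str]:
    """List files touched by the PR relative to the base branch."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line]

def review_context(repo_files: dict[str, str], touched: list[str]) -> str:
    """Concatenate only the touched files into a prompt-sized context."""
    parts = [f"### {path}\n{repo_files[path]}"
             for path in touched if path in repo_files]
    return "\n\n".join(parts)
```

The point is simply that Claude's prompt is built from `review_context(...)` rather than the whole tree, so a 500-file repo costs only as much as the files the PR actually touches.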

What kind of repositories do you have in mind? The approach varies greatly depending on the size, but I'd like to hear your thoughts.

technocrat8080 · 3 days ago
In my experience, especially with Opus 4.6, using subagents greatly mitigates the startup context hit. 4.6 has very obviously been RL'ed on subagent usage: it almost always spins up an Explore agent to get a feel for the codebase and produce a token-efficient summary. The 1M-context version of 4.6 alleviates this further.

My original question was more along the lines of implementing things like PR review yourself. I was tinkering with an internal service that spins up ephemeral CC instances to analyze PRs, but realized this easily generalizes to arbitrary tasks. I was curious what sort of things folks could use that for.
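A toy sketch of what "ephemeral CC instance per task" could mean, assuming Claude Code's non-interactive print mode (claude -p, which runs one prompt and exits); the wrapper and prompt text are invented:

```python
import subprocess

def build_review_cmd(diff_path: str) -> list[str]:
    """Build a one-shot, non-interactive Claude Code invocation.

    Each PR review becomes an ephemeral process a service can spawn,
    capture the stdout of, and then discard.
    """
    prompt = f"Review the diff in {diff_path} and summarize any bugs."
    return ["claude", "-p", prompt]

def review_pr(diff_path: str) -> str:
    result = subprocess.run(build_review_cmd(diff_path),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Generalizing past PR review is then mostly a matter of swapping the prompt template per task type.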

technocrat8080 commented on Mojo-V: Secret Computation for RISC-V   github.com/toddmaustin/mo... · Posted by u/fork-bomber
tromp · 4 months ago
This should not (so much) be compared with Fully Homomorphic Encryption (FHE) but with a Trusted Execution Environment (TEE). It is a very elegant and minimal way to implement TEEs, but suffers from the same drawbacks: a data owner has to trust the service provider to publish the public keys of actual properly constructed Mojo-V hardware rather than arbitrary public keys or public keys of maliciously constructed Mojo-V hardware.

[1] https://en.wikipedia.org/wiki/Trusted_execution_environment

technocrat8080 · 4 months ago
To be clear, it's not a TEE replacement, but it does address one of the most common use cases of TEEs.
technocrat8080 commented on OpenAI acquires Sky.app   openai.com/index/openai-a... · Posted by u/meetpateltech
technocrat8080 · 5 months ago
Seems pretty obvious Sky.app's functionality will land in the macOS ChatGPT app at some point. I wonder how Atlas fits into that story.
technocrat8080 commented on OpenAI acquires Sky.app   openai.com/index/openai-a... · Posted by u/meetpateltech
luma · 5 months ago
I'm not an iOS guy, so I'm trying to track this: from the thread I gather this allows robotic process automation on iOS, which I guess isn't easy to do? I could see the use case if you're trying to build an agent that can navigate and use apps on iOS.

Here's the question: why is this difficult on iOS? What "magic" does Sky bring to the table to make this happen?

technocrat8080 · 5 months ago
Sky is macOS only. It essentially gives an LLM access to various system APIs coupled with a floating user interface that you can access on command.
technocrat8080 commented on Claude Sonnet 4.5   anthropic.com/news/claude... · Posted by u/adocomplete
fragmede · 5 months ago
Subagents are where Claude Code shines and Codex still lags behind. Claude Code can do some things in parallel within a single session with subagents, and Codex cannot.
technocrat8080 · 5 months ago
By parallel, do you mean editing the codebase in parallel? Does it use some mechanism to prevent collisions (e.g. git worktrees)?
technocrat8080 commented on Claude Sonnet 4.5   anthropic.com/news/claude... · Posted by u/adocomplete
ChadMoran · 6 months ago
Sub-agents. I've had Claude Code run a prompt for hours on end.
technocrat8080 · 6 months ago
What kind of agents do you have set up?
technocrat8080 commented on Claude Sonnet 4.5   anthropic.com/news/claude... · Posted by u/adocomplete
Bjorkbat · 6 months ago
> Practically speaking, we’ve observed it maintaining focus for more than 30 hours on complex, multi-step tasks.

Really curious about this, since people keep bringing it up on Twitter. They mention it pretty much off-handedly in their press release, and it doesn't show up at all in their system card. It's only through an article on The Verge that we get more context: apparently they told it to build a Slack clone and left it unattended for 30 hours, and it built one in 11,000 lines of code (https://www.theverge.com/ai-artificial-intelligence/787524/a...)

I have very low expectations for what would happen if you let an LLM run unattended on a task for 30 hours, so I have a lot of questions about the quality of the output.

technocrat8080 · 6 months ago
Curious about this too – does it use the standard context management tools that ship with Claude Code? At 200K context size (or 1M for the beta version), I'm really interested in the techniques used to run it for 30 hours.
technocrat8080 commented on Sandboxing AI agents at the kernel level   greptile.com/blog/sandbox... · Posted by u/dakshgupta
technocrat8080 · 6 months ago
A bit confused, all this to say you folks use standard containerization?

u/technocrat8080

Karma: 42 · Cake day: September 28, 2023
About
Reach out at throwawayapples409 on gmail.