binalpatel (u/binalpatel)

binalpatel commented on Moltworker: a self-hosted personal AI agent, minus the minis blog.cloudflare.com/moltw... · Posted by u/ghostwriternr

jjice · 12 days ago

The most interesting part of it to me (that isn't anything particularly special, but I hadn't seen it before) is giving it full file system access so it'll write it's own tools to come back to later.

It's an obvious move in hindsight, but I hadn't thought of it. Now, the amount of people running it outside of a sandbox or isolated machine and giving it that kind of access would probably make me cry.

binalpatel · 11 days ago

The agent making it's own harness idea is really powerful, I gave it a try here with some opinionated choices:

https://github.com/caesarnine/binsmith

Been running it on a locked down Hetzner server + using Tailscale to interact with it and it's been surprisingly useful even just defaulting to Gemini 3 Flash.

It feels like the general shape of things to come - if agents can code then why can't they make their own harness for the very specific environments they end up in (whether it's a business, or a super personalized agent for a user, etc). How to make it not a security nightmare is probably the biggest open question and why I assume Anthropic/others haven't gone full bore into it.

binalpatel commented on Show HN: Ocrbase – pdf → .md/.json document OCR and structured extraction API github.com/majcheradam/oc... · Posted by u/adammajcher

binalpatel · 20 days ago

This is admittedly dated but even back in December 2023 GPT-4 with it's Vision preview was able to very reliably do structured extraction, and I'd imagine Gemini 3 Flash is much better than back then.

https://binal.pub/2023/12/structured-ocr-with-gpt-vision/

Back of the napkin math (which I could be messing up completely) but I think you could process a 100 page PDF for ~$0.50 or less using Gemini 3 Flash?

>560 input tokens per page * 100 pages = 56000 tokens = $0.028 input ($0.5/m input tokens) >~1000 output tokens per page * 100 pages = $0.30 output ($3/m output tokens)

(https://ai.google.dev/gemini-api/docs/gemini-3#media_resolut...)

binalpatel commented on The Code-Only Agent rijnard.com/blog/the-code... · Posted by u/emersonmacro

meander_water · 22 days ago

Have you done a comparison on token usage + cost? I'd imagine there would be some level of re-inventing the wheel (i.e. rewriting code for very similar tasks) for common tasks, or do you re-use previously generated code?

binalpatel · 22 days ago

It reuses previously generated code, so tools it creates persists from session to session. It also lets the LLM avoid actually “seeing” the tokens in some cases since it can pipe directly between tools/write to disk instead of getting returned into the LLMs context window.

binalpatel commented on The Code-Only Agent rijnard.com/blog/the-code... · Posted by u/emersonmacro

skybrian · 22 days ago

That’s pretty cool. Is it practical? What have you used it for?

binalpatel · 22 days ago

I've been using it daily, so far it's built CLIs for hackernews, BBC news, weather, a todo manager, fetching/parsing webpages etc. I asked it to make a daily briefing one that just composes some of them. So the first thing it runs when I message it in the morning is the daily briefing which gives me a summary of top tech news/non-tech news, the weather, my open tasks between work/personal. I can ask for follow ups like "summarize the top 5 stories on HN" and it can fetch the content and show it to me in full or give me a bullet list of the key points.

Right now I'm thinking through how to make it more "proactive" even if it's just a cron that wakes it up, so it can do things like query my emails/calendar on an ongoing basis + send me alerts/messages I can respond to instead of me always having to message it first.

binalpatel commented on The Code-Only Agent rijnard.com/blog/the-code... · Posted by u/emersonmacro

binalpatel · 22 days ago

I went down (continue to do down) this rabbit hole and agree with the author.

I tried a few different ideas and the most stable/useful so far has been giving the agent a single run_bash tool, explicitly prompting it to create and improve composable CLIs, and injecting knowledge about these CLIs back into it's system prompt (similar to have agent skills work).

This leads to really cool pattens like: 1. User asks for something

2. Agent can't do it, so it creates a CLI

3. Next time it's aware of the CLI and uses it. If the user asks for something it can't do it either improves the CLI it made, or creates a new CLI.

4. Each interaction results in updated/improved toolkits for the things you ask it for.

You as the user can use all these CLIs as well which ends up an interesting side-channel way of interacting with the agent (you add a todo using the same CLI as what it uses for example).

It's also incredibly flexible, yesterday I made a "coding agent" by having it create tools to inspect/analyze/edit a codebase and it could go off and do most things a coding agent can.

https://github.com/caesarnine/binsmith

binalpatel commented on Show HN: Webctl – Browser automation for agents based on CLI instead of MCP github.com/cosinusalpha/w... · Posted by u/cosinusalpha

cosinusalpha · a month ago

I am also not sure if MCP will eventually be fixed to allow more control over context, or if the CLI approach really is the future for Agentic AI.

Nevertheless, I prefer the CLI for other reasons: it is built for humans and is much easier to debug.

binalpatel · a month ago

100% - sharing CLIs with the agent has felt like another channel to interact with them once I’ve done it enough, like a task manager the agent and I can both use using the same interface

binalpatel commented on Show HN: Webctl – Browser automation for agents based on CLI instead of MCP github.com/cosinusalpha/w... · Posted by u/cosinusalpha

binalpatel · a month ago

Cool to see lots of people independently come to "CLIs are all you need". I'm still not sure if it's a short-term bandaid because agents are so good at terminal use or if it's part of a longer term trend but it's definitely felt much more seamless to me then MCPs.

(my one of many contribution https://github.com/caesarnine/binsmith)