https://binal.pub/2023/12/structured-ocr-with-gpt-vision/
Back-of-the-napkin math (which I could be messing up completely), but I think you could process a 100-page PDF for ~$0.50 or less using Gemini 3 Flash?
> 560 input tokens per page * 100 pages = 56,000 tokens = $0.028 input ($0.50/M input tokens)
> ~1,000 output tokens per page * 100 pages = 100,000 tokens = $0.30 output ($3/M output tokens)
(https://ai.google.dev/gemini-api/docs/gemini-3#media_resolut...)
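For anyone who wants to plug in their own numbers, here is the same napkin math as a small Python sketch. The per-page token counts and per-million prices are just the figures quoted above, not values pulled from an API, and `estimate_cost` is a made-up helper name, so treat it as a rough estimate only.

```python
def estimate_cost(
    pages,
    input_tokens_per_page=560,    # assumed: figure quoted above
    output_tokens_per_page=1000,  # assumed: rough guess quoted above
    input_price_per_m=0.50,       # assumed: USD per 1M input tokens
    output_price_per_m=3.00,      # assumed: USD per 1M output tokens
):
    """Return (input_cost, output_cost, total_cost) in USD."""
    input_cost = pages * input_tokens_per_page / 1_000_000 * input_price_per_m
    output_cost = pages * output_tokens_per_page / 1_000_000 * output_price_per_m
    return input_cost, output_cost, input_cost + output_cost


input_cost, output_cost, total = estimate_cost(100)
print(f"input ${input_cost:.3f}, output ${output_cost:.3f}, total ${total:.3f}")
```

With these assumptions the total comes to about $0.33 for 100 pages, and output tokens account for most of it, so the estimate is most sensitive to how verbose the extracted output is.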
It's an obvious move in hindsight, but I hadn't thought of it. Now, the number of people running it outside of a sandbox or isolated machine and giving it that kind of access would probably make me cry.
https://github.com/caesarnine/binsmith
Been running it on a locked-down Hetzner server, using Tailscale to interact with it, and it's been surprisingly useful even just defaulting to Gemini 3 Flash.
It feels like the general shape of things to come: if agents can code, then why can't they make their own harness for the very specific environments they end up in (whether that's a business, a super personalized agent for a user, etc.)? How to make it not a security nightmare is probably the biggest open question, and I assume that's why Anthropic and others haven't gone full bore into it.