someguy101010 (u/someguy101010)

someguy101010 commented on Show HN: A MitM proxy to see what your LLM tools are sending github.com/jmuncor/sherlo... · Posted by u/jmuncor

someguy101010 · 16 days ago

Does this support bedrock?

someguy101010 commented on Skills for organizations, partners, the ecosystem claude.com/blog/organizat... · Posted by u/adocomplete

someguy101010 · 2 months ago

Is it possible to provide an LLM a skill through the MCP resource feature?

someguy101010 commented on Why Windows XP is the ultimate AI benchmark cuabench.ai... · Posted by u/frabonacci

frabonacci · 2 months ago

We spent the last few months trying to understand why computer-use agents (Claude Computer-Use, OpenAI CUA, Gemini 2.5 Computer-Use) fail so inconsistently.

The pattern we kept seeing: same agent, same task, different OS theme = notably different results.

Claude Sonnet 4 scores 31.9% on OSWorld and Windows Agent Arena (2 of the most relevant benchmarks for computer-use agents) — but with massive variance. An agent trained on Windows 11 light mode fails on dark mode. Works on macOS Ventura, breaks on Monterey. Works on Win11, collapses on Vista.

The root cause: training data lacks visual diversity. Current benchmarks (OSWorld, Windows Agent Arena) rely on static VM snapshots with fixed configurations. They don't capture the reality of diverse OS themes, window layouts, resolution differences, or desktop clutter.

We built cua-bench — HTML-based simulated environments that render across 10+ OS themes (macOS, Win11, WinXP, Win98, Vista, iOS, Android). Define a task once, generate thousands of visual variations.

This enables:
- Oracle trajectory generation via a Playwright-like API (verified ground truth for training)
- Trajectory replotting: record 1 demo → re-render across 10 OS themes = 10 training trajectories

The technical report covers our approach to trajectory generation, Android/iOS environments, cross-platform HTML snapshots, and a comparison with existing benchmarks.

We’re currently working with research labs on training data generation and benchmarks, but we’d really value input from the HN community:
- What tasks or OS environments should be standardized to actually stress computer-use agents?
- Legacy OSes? Weird resolutions? Broken themes? Cluttered desktops? Modal hell?

Curious what people here think are the real failure modes we should be benchmarking.

someguy101010 · 2 months ago

as an infrastructure engineer the idea of being able to train computer use agents without provisioning infrastructure sounds amazing!

a common use case i run into is i want to be able to configure corporate vpn software on windows machines. is there a link for a getting started guide i could try this out with?

someguy101010 commented on Dagger: Define software delivery workflows and dev environments dagger.io/... · Posted by u/ahamez

leetrout · 2 months ago

Curious if anyone in the thread has / is using windmill?

They don't seem to have jumped for AI hype (yet?)...

https://www.windmill.dev/

someguy101010 · 2 months ago

have used it, and i do like it, but the licensing situation is not great. it's open source but it's not free software by any means.

someguy101010 commented on The "confident idiot" problem: Why AI needs hard rules, not vibe checks steerlabs.substack.com/p/... · Posted by u/steer_dev

someguy101010 · 2 months ago

wrote about this a bit too in https://www.robw.fyi/2025/10/24/simple-control-flow-for-auto...

ran into this when writing agents to fix unit tests. often times they would just give up early so i started writing the verifiers directly into the agent's control flow and this produced much more reliable results. i believe claude code has hooks that do something similar as well.
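
a minimal sketch of what i mean by putting the verifier in the control flow, in python. everything named here (`run_with_verifier`, `fake_agent`, `fake_verifier`) is made up for illustration and is not from the linked post or from claude code, and the "agent" is a stand-in function rather than a real model call. the point is only that the loop, not the agent, decides when the work is done.

```python
from typing import Callable, Optional, Tuple


def run_with_verifier(
    attempt_fix: Callable[[Optional[str]], str],
    verify: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 3,
) -> Tuple[bool, str, int]:
    """Call the agent until the verifier passes or attempts run out.

    The agent never gets to declare success: only `verify` can end the
    loop early, and its failure message is fed into the next attempt.
    """
    feedback: Optional[str] = None
    candidate = ""
    for attempt in range(1, max_attempts + 1):
        candidate = attempt_fix(feedback)
        ok, feedback = verify(candidate)
        if ok:
            return True, candidate, attempt
    return False, candidate, max_attempts


# Hypothetical stand-in for an agent: wrong on the first try,
# corrected once it receives verifier feedback.
def fake_agent(feedback: Optional[str]) -> str:
    if feedback is None:
        return "def add(a, b): return a - b"
    return "def add(a, b): return a + b"


# Hypothetical stand-in for a unit test run: executes the candidate
# source and checks a single expected value.
def fake_verifier(source: str) -> Tuple[bool, str]:
    namespace: dict = {}
    exec(source, namespace)
    if namespace["add"](2, 3) == 5:
        return True, ""
    return False, "add(2, 3) should be 5"


result = run_with_verifier(fake_agent, fake_verifier)
```

in a real setup `verify` would shell out to the test runner and `attempt_fix` would call the model with the failure output. the cap on attempts is there so a stuck agent fails loudly instead of looping forever.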

someguy101010 commented on Isn't WSL2 just a VM? ssg.dev/isnt-wsl2-just-a-... · Posted by u/sedatk

bogwog · 2 months ago

> and is only supported on Windows Server.

Imagine licensing and installing Windows Server to run Linux software through WSL

someguy101010 · 2 months ago

clearly you have never worked in enterprise

someguy101010 commented on Ghostty compiled to WASM with xterm.js API compatibility github.com/coder/ghostty-... · Posted by u/kylecarbs

someguy101010 · 2 months ago

nice one kyle! you could add https://github.com/wasmerio/webassembly.sh and have a fully featured in browser shell with support for installing packages!

someguy101010 commented on The Thinking Game Film – Google DeepMind documentary thinkinggamefilm.com... · Posted by u/ChrisArchitect

someguy101010 · 2 months ago

reposting this from a youtube comment

From 1:14:55-1:15:20, within the span of 25 seconds, the way Demis spoke about releasing all known sequences without a shred of doubt was so amazing to see. There wasn't a single second where he worried about the business side of it (profits, earnings, shareholders, investors); he just knew it had to be open source for the betterment of the world. Gave me goosebumps. I watched that on repeat more than 10 times.