Posted by u/wujerry2000 17 days ago
Launch HN: Halluminate (YC S25) – Simulating the internet to train computer use
Hi everyone, Jerry and Wyatt here from Halluminate (https://halluminate.ai/). We help AI labs train computer use agents with high quality data and RL environments.

Training AI agents to use computers, browsers, and software is one of the highest-potential opportunities for AI. To date, however, this capability is still unreliable. The emerging method for improving it is Reinforcement Learning with Verifiable Rewards (RLVR), but researchers are currently bottlenecked by a lack of high-quality simulators and task + verifier pairs.

To solve this problem, we’re building Westworld, a fully-simulated internet made up of synthetic versions of the most common consumer and enterprise apps. Agents use Westworld to learn how to do economically valuable tasks.

For example, AI agents can practice planning vacations on a simulated flight booking site (https://flights.halluminate.ai/), learn how to reorganize outdated information in a sales platform, or train to do financial modeling directly in a spreadsheet.

Here’s a demo showing our flight booking simulation: https://www.loom.com/share/74a3b28067e24c1b886054ba90a90aa5.

How it works: AI agents access our environment and are given a task + verifier. A task is an objective for the agent to achieve, for example "Book me a flight from SF to NYC on this date with x, y, z filters." A verifier is a programmatic way to determine whether the task was completed successfully; in this case, it might be a JSON check that compares the final flight data against expectations. These signals can then be used to calculate a reward in RL.
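To make this concrete, here's a minimal sketch of what a task + verifier pair could look like in Python (the field names and final_state shape are illustrative, not our actual schema):

    # Illustrative task + verifier pair; all names/shapes are made up.
    task = {
        "instruction": "Book me a flight from SF to NYC on 2026-05-05, nonstop, under $400.",
        "expected": {
            "origin": "SFO",
            "destination": "JFK",
            "departure_date": "2026-05-05",
            "nonstop": True,
            "max_price": 400.00,
        },
    }

    def verify(final_state: dict, expected: dict) -> bool:
        """Programmatically check the agent's final booking against the task spec."""
        booking = final_state.get("booking", {})
        return all([
            booking.get("origin") == expected["origin"],
            booking.get("destination") == expected["destination"],
            booking.get("departure_date") == expected["departure_date"],
            (not expected["nonstop"]) or booking.get("stops") == 0,
            booking.get("price", float("inf")) <= expected["max_price"],
        ])

    # Example final environment state after an agent rollout:
    final_state = {"booking": {"origin": "SFO", "destination": "JFK",
                               "departure_date": "2026-05-05", "stops": 0, "price": 342.00}}

    # The verifier's pass/fail becomes the episode's reward signal for RL.
    reward = 1.0 if verify(final_state, task["expected"]) else 0.0  # -> 1.0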

The more simulators we build, the more AI labs can improve on capabilities that computer use agents are currently weak at. One of our customers saw a ~20% improvement in date-picking performance when training on our flight booking simulator.

Two things make this hard:

(1) The simulations have to be realistic. You can’t get away with a vibe-coded “80% solution” because even small divergences impact performance. Generating simulated data is even harder; for example, massaging flight data to look realistic took a lot of trial and error.

(2) The tasks you train agents on have to be well-chosen. They are only valuable if they reflect work that people actually want solved. We need a lot of feedback from domain experts to get this right.

That said, we find this work incredibly interesting and are excited to tackle these issues. A few things we're pumped to ship in the near term:

- The ability to train on long-horizon tasks by stringing multiple simulators together into extended workflows.

- Procedural data generation. Instead of synthetically generating all the data upfront, we want to model data generation so that our simulators are populated procedurally as agents explore (think Minecraft) - see the sketch below.

- Open source! We plan to release our environments to the public so developers/researchers can hack on them for their own experimentation.
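For a rough sense of the procedural idea, here's a hypothetical sketch: derive a seed from whatever the agent queries, so data is generated lazily as the world is explored but is identical on every revisit (all names are illustrative):

    import hashlib
    import random

    def _seed_for(*parts: str) -> int:
        """Derive a stable 64-bit seed from the agent's query."""
        digest = hashlib.sha256("|".join(parts).encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def flights_for(origin: str, dest: str, date: str) -> list[dict]:
        """Generate flight results the moment an agent searches this route/date.
        Same inputs always yield the same results, so the world feels endless
        but ground truth stays stable."""
        rng = random.Random(_seed_for(origin, dest, date))
        return [
            {
                "flight_number": f"HL{rng.randint(100, 999)}",
                "departure_time": f"{rng.randint(5, 22):02d}:{rng.choice(['00', '15', '30', '45'])}",
                "stops": rng.choice([0, 0, 1, 2]),
                "price": round(rng.uniform(89, 650), 2),
            }
            for _ in range(rng.randint(4, 9))
        ]

    # Nothing is generated until an agent explores this route, but every
    # revisit sees identical data.
    assert flights_for("SFO", "JFK", "2026-05-05") == flights_for("SFO", "JFK", "2026-05-05")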

RL simulators are just one part of our business. The other part is human data creation (think Scale AI, but for computer use). We produce off-the-shelf pre-training/fine-tuning datasets, provide expert human evaluation/error analysis, and cover other data needs for our customers. There are also a lot of exciting overlaps between the two - for example, using human experts to help create our simulators/tasks. Happy to go into more detail, but we thought simulators would make for the more interesting Hacker News post :)

Finally, about us: Wyatt and I met while studying CS at Cornell and have been living and working together for over 7 years. I previously led product/research at Capital One Labs, where I launched one of the first AI agents in banking. Wyatt was previously a Cornell Milstein Scholar and did large-scale data engineering for two early-stage startups in NYC. We left our jobs last year and faced these problems first-hand while building evals for customers who were browser/computer-use agent companies.

If anyone has any questions, feedback, or thoughts please let us know! Looking forward to your comments.

zebomon · 17 days ago
This is very interesting. I think a lot of people may be quick to overlook the value of such simulators when thinking about AI agents at the extremes. (Either they're not good enough to trust or they're so good they'll leapfrog over any economic value here.)

My own experience makes me lean toward thinking that the truth is somewhere in the middle in this situation, and that simulators like these will be valuable. I've been experimenting a lot with computer use on my website Bingeclock, passing through different prompts along the lines of "make a movie marathon based on X." The newest agents are consistently impressive, while also being consistently imperfect in surprising and interesting ways.

Whether or not all the labs are already running this kind of thing internally for themselves, you would know better than I. But it's an idea that seems very useful nonetheless. Congratulations on the launch!

wujerry2000 · 17 days ago
Computer use agents are starting to perform well on websites/apps that are in their training distribution, but they still struggle a lot with tasks outside that distribution. A big reason is that many of the more niche/enterprise applications are really hard to test on in the real world, hence the need for sims!

re: labs doing this internally. They definitely are! However, the scale of the sims buildout is going to be massive, probably many orders of magnitude above what we have today. We think it makes sense for one central player to do this, because a really good simulator can be used by many people at once. It doesn’t make sense for every AI lab/company to build out their own environments if an industry-standard catalog exists.

zebomon · 17 days ago
Intriguing analysis. I'll be following along with interest!
reactordev · 17 days ago
Conway’s law strikes again…
mandeepj · 17 days ago
Have you looked at agents from OpenAI and Perplexity? Sure, they aren't perfect, but at the same time, they aren't far from ready.

Is this simulation really required? There's another YC startup that processes PDFs, I believe. They didn't train their systems on any simulation.

Edited to reword and add more context.

wujerry2000 · 17 days ago
OpenAI's agent is very impressive!

That being said, there are still a lot of use cases it's not good at, and looking at long-trajectory tasks, enterprise work tasks, etc., I imagine those are all still very nascent.

I think we're still very early in computer use. Being "production ready" probably requires close to 95%+ accuracy on most tasks, and we're not there yet for most use cases.

davecyen · 17 days ago
Very cool - is it possible to simulate this on a live production site (i.e. instead of Halluminate Flights, just test the agent live on Expedia)? Even though you don't have access to the backend JSON, presumably you could verify that the right values were entered in the frontend/UI?
wm2 · 17 days ago
yup, though without access to the code it's much harder to pull the state of the components - it becomes more like a web scraping problem, which is brittle and much hackier than intentionally exposing component state like we can do in the sim.
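for illustration, here's roughly what exposing component state could look like - a verifier reading a structured debug endpoint instead of scraping the DOM (the endpoint and response shape here are hypothetical):

    import json
    import urllib.request

    # hypothetical: the sim serves its internal component state at a debug
    # endpoint, so verifiers get structured data instead of parsing HTML.
    def read_sim_state(base_url: str) -> dict:
        with urllib.request.urlopen(f"{base_url}/api/_sim/state") as resp:
            return json.load(resp)

    state = read_sim_state("http://localhost:3000")
    print(state["booking"])  # e.g. {"origin": "SFO", "destination": "JFK", ...}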

more importantly, though, there are use cases that depend on the data. the data on the real google flights/expedia is constantly changing, so it's impossible to build datasets with stable ground truth, e.g. the answer to a task like "Find the cheapest round-trip flight option from Bologna (BLQ) to Dushanbe (DYU) if I leave on 2026-05-05 and come back on 2026-05-15. Return the total price and the flight numbers for all flights." isn't stable. on our site, we control the data, so that answer is stable (deterministically random). so controlling the whole clone, rather than running on the prod site, unlocks richer and more repeatable tasks/testing.
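as a toy illustration (reusing the hypothetical flights_for generator sketched earlier in the thread), the ground truth for that task can be computed once and never drifts:

    # because the sim's data is a pure function of the query, this answer is
    # fixed forever; on the live sites it would change hourly.
    outbound = flights_for("BLQ", "DYU", "2026-05-05")
    inbound = flights_for("DYU", "BLQ", "2026-05-15")

    cheapest_out = min(outbound, key=lambda f: f["price"])
    cheapest_back = min(inbound, key=lambda f: f["price"])

    ground_truth = {
        "total_price": round(cheapest_out["price"] + cheapest_back["price"], 2),
        "flight_numbers": [cheapest_out["flight_number"], cheapest_back["flight_number"]],
    }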

lastly, our site runs exactly the same locally as deployed and has zero internet dependencies, so it can run offline directly on the cluster with no issues from network latency/failures.

DearAll · 17 days ago
Love what you’re doing. Are you currently open to interns? Would love to connect with you and chat more about using high quality data to help people better train and evaluate their ai agents!
wm2 · 17 days ago
hey not hiring right now but connect with me on twitter and we can talk more there: https://x.com/wgm752
orliesaurus · 17 days ago
Good luck Jerry!!! Interesting pivot for sure. Playgrounds for AI seem like a good idea; I wish someone tackled them in 3D too (not just for browser/computer agents, I mean) :P
whymauri · 17 days ago
Are these simulations shared between your customers, or are you building bespoke environments per client/user? How does the creation of environments scale?
wujerry2000 · 17 days ago
These are really good questions!

we share the public/consumer simulators, but we also build bespoke environments on a per-customer basis (think enterprise sites or even full VMs loaded with applications and data).

environment creation scalability is a big priority for us. we currently automate most of the process, but it still takes a fair bit of manual work to finish environments and get the details right. there is some reusability across environments - for example, we can use the flight results generation code in any travel/flight-booking sim. we also have some semi-automated approaches for creating tasks and verifiers, but there's still lots of work to be done here.

whymauri · 17 days ago
Super interesting, thank you.
BobbyJo · 17 days ago
Had this exact idea recently, applied to various software tooling. I think agents of all types are going to follow a similar path to self-driving cars: the first 80% comes in a big boom, and the last 20% comes over a decade of training and simulations.

I think each agent use case is going to need a simulation for its reward to eke out the last 20%.

Edit: Realized I forgot to say Great Work! Looks Cool!

wujerry2000 · 17 days ago
Self-driving cars are a really good place to derive intuition from. Robotics as well!

Both of those spaces are still optimizing for the last-mile performance gains, which get exponentially harder.

The good thing about computer use is that building software environments is faster and more repeatable, so hopefully we see quicker improvements here. :)

nasmorn · 17 days ago
This is very interesting, but I would worry that if this proves to be an important part of the solution, Expedia could just release its own sandbox that returns validation once agent use becomes valuable to them.
wujerry2000 · 16 days ago
This is a really important question.

I definitely think that as companies begin optimizing for an "agent-first" economy, they will start figuring out how to optimize their sites for agent traffic.

They could definitely do this themselves, but building RL envs requires engineering work/expertise that I imagine they might want to partner with an external provider for.

ALSO, the value of Westworld isn't any standalone env but many strung together for long-trajectory workflows. That is why they may be inclined to work with another provider.

Those are just our thoughts though - we'll see how the market plays out.