All data is persisted via RocksDB: not only the materialized state of invocations and journals, but the log itself uses RocksDB as the storage layer for its sequence of events. We do that to benefit from the extensive testing and hardening that Meta has done (they run millions of instances). We are currently even trying to understand which operations and code paths Meta exercises most, so we can adapt our code to use those and stay on the best-tested paths possible.
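A common trick when storing a log in an ordered key-value store like RocksDB is to key entries by a fixed-width big-endian sequence number, so lexicographic byte order matches numeric order and a range scan replays the log in sequence. A minimal sketch of that idea (the `OrderedKV` class is a toy stand-in for a RocksDB iterator, not the actual storage engine):

```python
import struct

# Toy stand-in for an ordered KV store such as RocksDB:
# keys iterate in lexicographic byte order, like a RocksDB iterator.
class OrderedKV:
    def __init__(self):
        self._data = {}

    def put(self, key: bytes, value: bytes):
        self._data[key] = value

    def scan(self, start: bytes):
        # Yield (key, value) pairs >= start, in byte order.
        for k in sorted(self._data):
            if k >= start:
                yield k, self._data[k]

def seq_key(seq: int) -> bytes:
    # Big-endian fixed-width encoding: byte order == numeric order,
    # so a range scan replays log entries in sequence order.
    return struct.pack(">Q", seq)

log = OrderedKV()
for seq, event in enumerate([b"created", b"suspended", b"completed"]):
    log.put(seq_key(seq), event)

# Replay everything from sequence number 1 onward.
replayed = [v for _, v in log.scan(seq_key(1))]
```

The same scan works unchanged on a real RocksDB iterator, since it also walks keys in byte order.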
The more sensitive part is the consensus log, which only comes into play in distributed deployments. In a way, that puts us in a similar boat to companies like Neon: having a reliable single-node storage engine, but having to build the replication and failover around it. That is also where the value-add over most databases lies.
We do actually use Jepsen internally for a lot of testing.
(Side note: Jepsen might be one of the most valuable things that this industry has - the value it adds cannot be overstated)
It is a fun exercise; there are many interesting patterns and micro-problems to solve:
* Durable steps (no retry spaghetti)
* Managing sessions across workflows (remember conversations)
* Interrupting an ongoing coding task to add new context
* Robust life cycles for resources (sandboxes)
* Scalable serverless deployments
* Tracing / replay / metrics
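The "durable steps" pattern can be sketched in a few lines: record each step's result in a journal before moving on, so when the workflow is retried, completed steps replay their recorded result instead of re-executing. A minimal sketch with an in-memory dict as the journal (in a real system this would be persisted, e.g. in RocksDB; the names here are illustrative, not any particular framework's API):

```python
# Hypothetical journal: step name -> recorded result. Persisted storage
# in a real system; an in-memory dict here for illustration.
journal = {}
calls = []  # tracks actual executions, to show retries skip completed steps

def durable_step(name):
    def decorator(fn):
        def wrapper(*args, **kwargs):
            if name in journal:            # step already ran: replay result
                return journal[name]
            result = fn(*args, **kwargs)   # run the side effect once
            journal[name] = result         # record it before moving on
            return result
        return wrapper
    return decorator

@durable_step("charge_card")
def charge_card(amount):
    calls.append(amount)
    return {"status": "charged", "amount": amount}

first = charge_card(42)
retry = charge_card(42)  # workflow retried: replayed from the journal
```

The side effect runs exactly once (`calls == [42]`), while both invocations see the same result, which is what removes the "retry spaghetti" from workflow code.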
Sharing our learnings here for anyone who builds agents at scale, beyond "hello world"