The core idea: geo-replication should be a deployment concern, not something you architect into every line of application code. You write normal business logic, then configure replication policies at deployment time and let Restate handle the rest.
The configuration is straightforward: `default-replication = "{region: 2, node: 3}"` guarantees that data is replicated to at least 2 regions and at least 3 nodes, so your apps can tolerate a region outage or the loss of any two nodes while staying fully available. Behind the scenes, Restate handles leader election, log replication, and state synchronization. We use S3 cross-region replication for snapshots, with delayed log trimming to ensure consistency.
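For reference, the entire policy is that one line in the server config (where exactly it sits in your config file depends on how you deploy Restate, so treat this as a sketch rather than copy-paste config):

```toml
# Replicate every log entry to at least 2 regions and at least 3 nodes.
# With a 6-node / 3-region cluster, this is what lets a whole region
# (or any two nodes) disappear without losing data or availability.
default-replication = "{region: 2, node: 3}"
```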
We tested this with a 6-node cluster across 3 AWS regions under a 400 req/s load. Killing an entire region triggered automatic failover in under 60 seconds, with zero downtime and no data loss; only 1% of requests saw latency spikes during the failover window. After the us-east-1 nodes went away, P50 latency did increase, because replication shifted from the nearby us-east-1/us-east-2 pair to the more distant us-east-2/us-west-1 pair.
Happy to answer technical questions or discuss tradeoffs!
The core insight: durability matters more than you'd think for agents. When an agent takes 5-10 minutes on a task, crashes become inevitable. Rate limits hit. Sandboxes time out. Users interrupt mid-task. Traditional retry logic gets messy fast.
Our approach uses Restate for durable execution (workflows continue from the last completed step) and Modal for ephemeral sandboxes. We get automatic failure recovery, the ability to interrupt a running agent with new input, great scalability, and scale-to-zero, all without custom retry code. The tradeoffs: coupling to Restate's execution model and the discipline required around deterministic replay.
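Concretely, the pattern looks roughly like this with the Restate TypeScript SDK (a simplified sketch, not our actual handlers; `planTask` and `runInSandbox` are hypothetical stand-ins for the real LLM and Modal sandbox calls):

```typescript
import * as restate from "@restatedev/restate-sdk";

// Hypothetical stand-ins for the real LLM planning call and Modal sandbox API.
async function planTask(task: string): Promise<string[]> {
  return [`research: ${task}`, `draft: ${task}`];
}
async function runInSandbox(step: string): Promise<string> {
  return `completed: ${step}`;
}

const agent = restate.service({
  name: "agent",
  handlers: {
    run: async (ctx: restate.Context, task: string) => {
      // Each ctx.run result is journaled by Restate. If the process crashes,
      // hits a rate limit, or gets interrupted, the invocation is retried and
      // replays the journal, resuming after the last completed step instead
      // of redoing 5-10 minutes of work.
      const steps = await ctx.run("plan", () => planTask(task));

      const results: string[] = [];
      for (const [i, step] of steps.entries()) {
        results.push(await ctx.run(`step-${i}`, () => runInSandbox(step)));
      }
      return results;
    },
  },
});

// Expose the handler over HTTP; the Restate runtime drives invocations,
// retries, and recovery.
restate.endpoint().bind(agent).listen(9080);
```

The step names and their order have to stay deterministic across replays, which is where the replay discipline mentioned above comes in.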
How are you handling long-running agent workflows to keep them reliable at scale?