Readit News
Posted by u/ranacseruet 18 days ago
Show HN: Clawphone – Twilio voice/SMS gateway for AI agents using TwiML polling (github.com/ranacseruet/cl...)
clawphone bridges Twilio phone calls and SMS to an OpenClaw AI agent using plain TwiML webhooks — no WebSocket server, no external STT/TTS APIs (OpenAI, ElevenLabs, etc.), no audio encoding pipeline. The design intentionally trades latency for operational simplicity. When a call comes in, Twilio handles speech-to-text via <Gather input="speech">, the agent runs asynchronously, and the reply is polled and spoken back via <Say>. This adds a couple of seconds of round-trip latency versus a Media Streams approach, but you only need one Twilio account and one Node process. Features:

- Standalone server (Node / PM2) or OpenClaw plugin mode
- SMS support with fast-path (sync) and async fallback
- Twilio webhook signature validation
- Per-number rate limiting
- Graceful shutdown with in-flight voice call drain
- Structured JSON logging + optional Discord channel logging
- 166 tests using Node's built-in node:test (no external framework)
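The polling flow described in the post can be sketched as two small TwiML builders. This is a hypothetical illustration, not clawphone's actual API: the first webhook answers the call with <Gather input="speech"> (Twilio does the speech-to-text), and a second webhook is re-hit via <Pause> + <Redirect> until the async agent reply is ready, then spoken with <Say>.

```javascript
// Minimal XML escaping so agent text can't break the TwiML document.
const escapeXml = (s) =>
  s.replace(/[<>&'"]/g, (c) =>
    ({ '<': '&lt;', '>': '&gt;', '&': '&amp;', "'": '&apos;', '"': '&quot;' }[c]));

// Initial webhook: let Twilio transcribe speech and POST the transcript
// to actionUrl, where the agent request is kicked off asynchronously.
function answerTwiml(actionUrl) {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<Response>',
    `  <Gather input="speech" action="${actionUrl}" method="POST">`,
    '    <Say>How can I help you?</Say>',
    '  </Gather>',
    '</Response>',
  ].join('\n');
}

// Polling webhook: while the agent has no reply yet, pause briefly and
// redirect back to the same URL; once a reply exists, speak it and gather
// the caller's next utterance.
function pollTwiml(reply, pollUrl) {
  if (reply === null) {
    return [
      '<?xml version="1.0" encoding="UTF-8"?>',
      '<Response>',
      '  <Pause length="2"/>',
      `  <Redirect method="POST">${pollUrl}</Redirect>`,
      '</Response>',
    ].join('\n');
  }
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<Response>',
    `  <Say>${escapeXml(reply)}</Say>`,
    `  <Gather input="speech" action="${pollUrl}" method="POST"/>`,
    '</Response>',
  ].join('\n');
}
```

The <Pause>/<Redirect> loop is what buys the "no WebSocket server" simplicity: each poll is an ordinary stateless webhook round-trip, at the cost of the couple of seconds of latency the post mentions.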

It's zero-dependency at the HTTP layer — raw node:http, ES Modules only. I built this because the official OpenClaw voice plugin requires a WebSocket gateway plus external TTS/STT accounts. For a personal assistant or low-traffic deployment, that's a lot of infrastructure. This is the minimal path.

GitHub: https://github.com/ranacseruet/clawphone
npm: @ranacseruet/clawphone

Happy to answer questions about the architecture or the TwiML polling approach.
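On the webhook signature validation feature: Twilio's documented scheme is an HMAC-SHA1 over the full request URL concatenated with the POST parameters (sorted by name, each name immediately followed by its value), keyed by the account's auth token and base64-encoded, compared against the X-Twilio-Signature header. A sketch of that check using only node:crypto (not clawphone's actual code) looks like:

```javascript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Validate an incoming Twilio webhook per Twilio's signing scheme:
// data = full URL + each POST param's name+value, params sorted by name.
function validateTwilioSignature(authToken, signature, url, params) {
  const data = url + Object.keys(params).sort()
    .map((k) => k + params[k])
    .join('');
  const expected = createHmac('sha1', authToken)
    .update(data)
    .digest('base64');
  // Constant-time comparison; lengths must match for timingSafeEqual.
  const a = Buffer.from(signature);
  const b = Buffer.from(expected);
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Because the signed payload includes the URL exactly as Twilio saw it, deployments behind a proxy or tunnel need to reconstruct the public URL (scheme, host, path, query) rather than the one the Node process observes locally.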

PranayKumarJain · 17 days ago
Nice—this is a very pragmatic “works with just TwiML” approach.

A couple questions / thoughts from building voice agents in production:

- How do you handle barge-in / interruptions? With <Gather input="speech"> + polling, it's hard to do true full-duplex + partial ASR. Have you considered a hybrid mode where you keep the TwiML simplicity for setup, but optionally switch to <Stream> (Media Streams) when people want sub-second turn-taking?
- Twilio's built-in speech recognition is convenient, but in my experience it can be the first thing teams outgrow (accuracy, language coverage, costs, and lack of token-level partials). Do you expose an interface so people can swap STT later without reworking the call control?
- For long agent responses: do you chunk <Say> / keep the call alive with <Pause>? Any gotchas around Twilio timeouts while the agent is "thinking"?
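The chunk-<Say> idea in the last bullet could look like the sketch below: split a long agent reply into sentence-aligned chunks and emit one <Say> per chunk. This is a hypothetical illustration, not something from the clawphone repo; the 300-character chunk size is arbitrary (Twilio documents a cap on <Say> text of roughly 4,096 characters, so the real limit would come from there).

```javascript
// Split a long reply into sentence-aligned chunks of at most maxLen chars
// (a single sentence longer than maxLen still becomes its own chunk).
function chunkReply(text, maxLen = 300) {
  // Grab runs ending in sentence punctuation, with trailing whitespace.
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) ?? [text];
  const chunks = [];
  let current = '';
  for (const s of sentences) {
    if ((current + s).length > maxLen && current) {
      chunks.push(current.trim());
      current = '';
    }
    current += s;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```

Each chunk would then map to one <Say> verb, optionally with a short <Pause> between chunks to make long answers easier to follow and interrupt.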

We’ve run into the same infra-vs-latency tradeoff at eboo.ai (real-time voice agents / telephony + WebRTC). If you ever want a sanity check on the lowest-latency Twilio path (Media Streams + incremental STT + barge-in), happy to compare notes.

ranacseruet · 17 days ago
Thanks for the compliment. As the project's README explains, this is motivated by users who want a lightweight setup for their OpenClaw deployment (local VM/VPC) without any complex or heavy TTS/STT infrastructure on their OpenClaw server. As the project grows and the lightweight path stabilizes, Media Streams support could definitely be a logical next step.

About barge-in/interruptions: we have partial support. You can look at the codebase and/or the documentation covering the architecture, research, and what's planned: https://github.com/ranacseruet/clawphone/tree/main/docs . Feel free to engage on the repo through the issue tracker with suggestions.

Hope that helps. Thanks!