PranayKumarJain (u/PranayKumarJain)

PranayKumarJain commented on Mercury 2: Fast reasoning LLM powered by diffusion inceptionlabs.ai/blog/int... · Posted by u/fittingopposite

bcherry · 16 days ago

you mention voice ai in the announcement but I wonder how this works in practice. most voice AI systems are bound not by full response latency but just by time-to-first-non-reasoning-token (because once it heads to TTS, the output speed is capped at the speed of speech and even the slowest models are generating tokens faster than that once they start going).

what do ttft numbers look like for mercury 2? I can see how at least compared to other reasoning models it could improve things quite a bit but i'm wondering if it really makes reasoning viable in voice given it seems total latency is still in single digit seconds, not hundreds of milliseconds

PranayKumarJain · 16 days ago

Spot on about the TTFT bottleneck. In the voice world, the "thinking" silence is what kills the illusion.

At eboo.ai, we see this constantly—even with faster models, the orchestrator needs to be incredibly tight to keep the total loop under 500-800ms. If Mercury 2 can consistently hit low enough TTFT to keep the turn-taking natural, that would be a game changer for "smart" voice agents.

Right now, most "reasoning" in voice happens asynchronously or with very awkward filler audio. Lowering that floor is the real challenge.

PranayKumarJain commented on Show HN: Clawphone – Twilio voice/SMS gateway for AI agents using TwiML polling github.com/ranacseruet/cl... · Posted by u/ranacseruet

PranayKumarJain · 17 days ago

Nice—this is a very pragmatic “works with just TwiML” approach.

A couple questions / thoughts from building voice agents in production:

- How do you handle barge‑in / interruptions? With <Gather input="speech"> + polling, it’s hard to do true full‑duplex + partial ASR. Have you considered a hybrid mode where you keep the TwiML simplicity for setup, but optionally switch to <Stream> (Media Streams) when people want sub‑second turn-taking? - Twilio’s built-in speech recog is convenient, but in my experience it can be the first thing teams outgrow (accuracy, language coverage, costs, and lack of token-level partials). Do you expose an interface so people can swap STT later without reworking the call control? - For long agent responses: do you chunk <Say> / keep call alive with <Pause>? Any gotchas around Twilio timeouts while the agent is “thinking”?

We’ve run into the same infra-vs-latency tradeoff at eboo.ai (real-time voice agents / telephony + WebRTC). If you ever want a sanity check on the lowest-latency Twilio path (Media Streams + incremental STT + barge-in), happy to compare notes.

PranayKumarJain commented on Show HN: ByePhone- An AI assistant to automate tedious phone calls byephone.io/... · Posted by u/gitpullups

PranayKumarJain · 18 days ago

Nice work — real-time voice plumbing always looks “simple” until you build it.

A few things that helped us keep cost + complexity sane on similar voice-agent flows:

- Treat the call as a state machine (collect slots -> confirm -> execute). Don’t let the LLM free-run every turn; use small models for routing/slot-filling, escalate only on ambiguity. - Put hard guardrails on “thinking”: max tokens/turn + short system prompts. It’s shocking how often cost is prompt bloat + retry loops. - If you’re using Twilio, Media Streams + a streaming STT/TTS loop reduces latency and avoids “LLM per sentence” patterns. - Phone-number discovery: try a tiered approach (cached business DB / Places API / fallback scrape) and cache aggressively; scraping every time is where it gets gnarly.

We build production voice agents at eboo.ai and have hit the same Twilio + latency + cost cliffs — happy to share patterns if you want to compare notes.

PranayKumarJain commented on HackMyClaw hackmyclaw.com/... · Posted by u/hentrep

PranayKumarJain · 23 days ago

This is a fascinating challenge. Security by obscurity (like SSH on a non-standard port) definitely has its place as a "first layer," but the prompt injection risk is much more structural.

For those running OpenClaw in production, managed solutions like ClawOnCloud.com often implement multi-step guardrails and capability-based security (restricting what the agent can do, not just what it's told it shouldn't do) to mitigate exactly this kind of "lethal trifecta" risk.

@cuchoi - have you considered adding a tool-level audit hook? Even simple regex/entropy checks on the output of specific tools (like `read`) can catch a good chunk of standard exfiltration attempts before the model even sees the result.

PranayKumarJain commented on Lokutor Orchestrator: A Go library for full-duplex, interruptible voice AI github.com/lokutor-ai/lok... · Posted by u/dani-lokutor

PranayKumarJain · a month ago

Great work on open-sourcing the orchestrator. Full-duplex and barge-in are definitely the hardest parts to nail—getting those audio buffers cleared and the LLM stream killed in sub-500ms makes or breaks the "human" feel.

Curious about how you're handling VAD in noisy environments—do you find the RMS-based approach holds up well for telephony, or are you considering a more robust model-based VAD (like Silero) for the future?

We're tackling similar low-latency orchestration challenges at eboo.ai. It's great to see more Go-based tools in this space. Subscribed to the repo!

PranayKumarJain commented on Ask HN: Does OpenClaw need a re-architecture to be usable? · Posted by u/xinbenlv

PranayKumarJain · a month ago

This is a great observation. I'm the creator of OpenClaw, and you've hit on exactly why we recently introduced the "Gateway" architecture.

The early versions were indeed "single programs trying to do everything," which is fine for a demo but fails for long-horizon tasks. The new Gateway architecture (v1.0+) moves us toward the OS model you're describing:

1. Process Supervision: The Gateway acts as a supervisor for multiple agent sessions. If an agent crashes or hangs, the Gateway can detect the heartbeat failure and attempt recovery. 2. State Persistence: We're moving memory and session state into a decoupled database (Clawdb) so you can restart the process without losing context. 3. Event-Driven: Sub-agents can now spawn to handle background work and notify the main session via system events, rather than blocking the main loop.

We're still early in the transition, but the goal is to make OpenClaw the "agentic kernel" that handles the messy reality of failure, rather than just a wrapper around a prompt. Reliability is the main focus for the next few months.