moatmoat (u/moatmoat)

moatmoat · 3 months ago

TL;DR — Anthropic Postmortem of Three Recent Issues

In Aug–Sep 2025, Claude users saw degraded output quality due to infrastructure bugs, not intentional changes.

The Three Issues

1. *Context window routing error* - Short-context requests were sometimes routed to servers configured for long context.

   - The share of affected requests started small, then grew after load-balancing changes.

2. *Output corruption* - TPU misconfigurations produced odd outputs, such as wrong-language text and syntax errors in code.

   - Runtime optimizations wrongly assigned high probability to tokens that should rarely appear.

3. *Approximate top-k miscompilation* - A compiler bug in the TPU/XLA stack corrupted token selection during sampling.

   - The approximate top-k step occasionally dropped the most probable token.
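To make the third issue concrete, here is a minimal sketch of the invariant that was violated: whatever shortcut a top-k routine takes, the single most probable token should still be among the candidates. The `bucketed_top_k` function is a toy approximation written for illustration only, not the XLA algorithm, and all names and sizes are assumptions.

```python
import random

def exact_top_k(logits, k):
    """Indices of the k largest logits, highest first."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]

def bucketed_top_k(logits, k, num_buckets):
    """Toy approximation: keep the best index of each strided bucket,
    then return the top k of those bucket winners."""
    winners = []
    for b in range(num_buckets):
        bucket = range(b, len(logits), num_buckets)
        winners.append(max(bucket, key=lambda i: logits[i]))
    return sorted(winners, key=lambda i: logits[i], reverse=True)[:k]

def keeps_true_top(logits, candidates):
    """Invariant check: the argmax must be among the candidates."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return best in candidates

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # made-up logits
exact = exact_top_k(logits, k=8)
approx = bucketed_top_k(logits, k=8, num_buckets=32)

# A healthy approximation may miss some lower-ranked tokens,
# but it should never lose the top one.
print(keeps_true_top(logits, approx))
# Simulated faulty selector that drops the top token.
print(keeps_true_top(logits, exact[1:]))
```

A check like `keeps_true_top` is cheap enough to run on sampled requests, which is why it is a reasonable candidate for the token-selection validation mentioned in the summary.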

Why It Was Hard to Detect

- The bugs were subtle, intermittent, and platform-dependent.

- Standard benchmarks did not catch these degradations.

- Privacy and safety rules limited engineers' access to real user data, which slowed debugging.
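A rough back-of-the-envelope sketch shows why an intermittent bug can slip past a small eval set. It assumes requests hit the bug independently at a fixed rate; the 1% rate below is hypothetical and is not a figure from the postmortem.

```python
import math

def detection_probability(failure_rate, num_samples):
    """Chance that at least one of num_samples independent requests hits the bug."""
    return 1.0 - (1.0 - failure_rate) ** num_samples

def samples_needed(failure_rate, target_probability):
    """Smallest sample count whose detection probability reaches the target."""
    return math.ceil(math.log(1.0 - target_probability) / math.log(1.0 - failure_rate))

# Hypothetical 1% failure rate, illustrative only.
print(round(detection_probability(0.01, 100), 3))  # 0.634
print(samples_needed(0.01, 0.95))                  # 299
```

Even this is optimistic: it only counts whether a bad response shows up at all, not whether an aggregate benchmark score moves enough for anyone to notice.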

Fixes and Next Steps

- More sensitive, continuous evals on production.

- Better tools to debug user feedback safely.

- Stronger validation of routing, output correctness, and token selection.
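As one illustration of what an output-correctness check could look like, here is a crude sketch that flags scripts appearing in a response but not in the prompt, the kind of symptom described under output corruption. The function names are hypothetical, and taking the first word of a character's Unicode name as its script is a rough heuristic, not a real script-detection API.

```python
import unicodedata

def scripts_used(text):
    """Approximate the scripts in text by the first word of each letter's Unicode name."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])  # e.g. "LATIN", "THAI", "CJK"
    return scripts

def unexpected_scripts(prompt, response):
    """Scripts that show up in the response but never in the prompt."""
    return scripts_used(response) - scripts_used(prompt)

print(unexpected_scripts("Sort this list in Python", "Use sorted(items) สวัสดี"))
print(unexpected_scripts("Translate 你好", "你好 means hello"))
```

A check like this would produce false positives on legitimate multilingual answers, so it makes more sense as a rate to track over time than as a per-response hard failure.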