AI Infrastructure · Published
An OpenAI-compatible LLM gateway: routing, caching, and a circuit breaker
Point any OpenAI client at <host>/v1 and it just works, with the same request shape and the same response shape, except that there is now a cache in front of it, a circuit breaker watching each provider, and a fallback path that takes over when a provider fails. The interesting part isn't that it proxies an LLM; it's that the client can't tell when a provider has failed. It serves real traffic on Groq's free tier today, returns a cache hit in effectively 0 ms against a ~350 ms miss, and exposes a live cost/latency dashboard at /gateway. It runs inside the same Spring Boot backend as the rest of this site, so there is no second deployable.
AI infrastructure is the most crowded corner of the industry right now, so I want to be concrete rather than topical. This is a small, honest control plane: it does four things — route, cache, protect, observe — and it does them transparently.
The problem (and why it's actually hard)
Calling a provider directly couples your application to one vendor, one failure mode, and no cost control. If that provider rate-limits you, has a bad five minutes, or returns 5xx, your app inherits the outage. The standard answer is a gateway: centralize routing, resilience, caching, and observability behind one stable contract so the app downstream never has to know.
The genuinely hard requirement is transparent resilience. When a provider falls over, the client should not see it — the gateway should fail fast off the broken provider and route to a healthy one without the caller changing a line of code. The naive fix is to retry the failed provider, but that's exactly wrong when the provider is down: every retry burns part of your latency budget waiting for a timeout that you already know is coming. You want to stop hammering a dead upstream quickly, and you want the contract — the OpenAI shape — to stay identical whether the answer came from the cache, the primary, or the fallback.
Streaming makes this sharper. Holding the OpenAI Server-Sent Events contract means relaying tokens to the client as they arrive, which preserves the token-by-token UX people expect — but it also means that once the first byte is on the wire, there is no clean way to fail over mid-stream without the client noticing.
How it works
A request enters through an OpenAI-compatible controller at POST /v1/chat/completions. It passes a per-client token-bucket rate limiter, then the cache. On a miss, the service builds an ordered list of provider candidates (the providers that support the requested model and whose circuit breaker currently allows traffic) and tries them in order. Each provider call is retried with exponential backoff on transient failures; if a provider exhausts its retries, the gateway records the failure against that provider's breaker and falls back to the next candidate. A success is cached and returned.
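The per-provider retry step can be sketched roughly as follows. This is a minimal illustration and not the gateway's actual callWithRetry, which the article does not show: the TransientException type, the attempt limit, and the base delay are all assumptions made for the example.

```java
import java.time.Duration;
import java.util.function.Supplier;

/** Hypothetical helper; names and parameters are assumptions for illustration. */
final class RetrySketch {

    /** Assumed marker type for failures worth retrying (timeouts, 429s, 5xx). */
    static final class TransientException extends RuntimeException {
        TransientException(String message) { super(message); }
    }

    /**
     * Runs the call up to maxAttempts times, doubling the delay between
     * attempts. Any other exception type propagates immediately.
     */
    static <T> T callWithRetry(Supplier<T> call, int maxAttempts, Duration baseDelay) {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        long delayMs = baseDelay.toMillis();
        for (int attempt = 1; ; attempt++) {
            try {
                return call.get();
            } catch (TransientException e) {
                if (attempt >= maxAttempts) {
                    throw e; // retries exhausted: the caller moves on to the next provider
                }
                sleep(delayMs);
                delayMs *= 2; // exponential backoff
            }
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during backoff", e);
        }
    }
}
```

Keeping the attempt count small matters here: retries are meant to absorb a brief blip, while a sustained outage is the circuit breaker's job.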
Streaming takes a deliberately simpler path: when "stream": true, the gateway routes to a single available provider and relays each upstream SSE data: chunk straight through to the client as it arrives, read off the response on a virtual thread (no reactive stack). Every non-streaming response carries gateway metadata on X-LLM-Provider, X-LLM-Latency-Ms, and X-LLM-Cache (HIT/MISS) headers, which is also what feeds the dashboard.
The design decisions that mattered
An OpenAI-compatible surface over a custom API
I chose to mimic the OpenAI /v1/chat/completions contract exactly rather than design a cleaner, gateway-native API. The reason is migration cost: every LLM SDK, every internal tool, every curl snippet already speaks this shape. Pointing an existing client at the gateway is a base-URL change and nothing else. A custom API would be tidier on paper and would let me surface gateway concepts as first-class fields — but it would also make adoption a rewrite, and a gateway nobody can drop in is a gateway nobody uses. The trade-off I accepted is that gateway-specific signal (which provider served you, was it a cache hit, how long it took) has to ride on response headers instead of the body, so as not to pollute the OpenAI shape.
A per-provider circuit breaker plus automatic fallback, over naive retry
This is the decision the whole project is organized around. Retrying a down provider is the obvious move and the wrong one — it spends your latency budget waiting on an upstream you already have evidence is unhealthy. Instead, each provider gets its own circuit breaker: after a configurable number of consecutive failures (default 4) its circuit opens and it is skipped entirely for a cooldown (default 30 s), after which a single probe is allowed (half-open); any success resets it. The router only ever considers providers whose breaker currently allows traffic, so a broken provider is removed from rotation before a request pays its timeout, and the request falls straight through to the next candidate.
// GatewayService.complete: the routed candidate list IS the resilience.
List<LlmProvider> candidates = providers.stream()
        .filter(p -> p.supports(request.model()))        // routing
        .filter(p -> circuitBreaker.allows(p.name()))    // skip open circuits
        .toList();
for (LlmProvider provider : candidates) {
    try {
        ChatCompletionResponse response = callWithRetry(provider, request);
        circuitBreaker.recordSuccess(provider.name());
        cache.put(cacheKey, response, provider.name());
        return new GatewayResult(response, provider.name(), latencyMs, false);
    } catch (ExternalApiException e) {
        circuitBreaker.recordFailure(provider.name());
        // fall through to the next candidate
    }
}

What I traded away: breaker tuning is genuinely fiddly. The failure threshold, the cooldown length, and the half-open probe behavior all interact, and there's no single right setting: too sensitive and a couple of blips evict a healthy provider, too lenient and you keep routing into a brownout. I made all of them config-driven (gateway.circuit.*) rather than pretend I'd found universal values. The breaker also tracks consecutive failures and resets on any success, which is simple and predictable but coarser than a rolling error-rate window would be.
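A consecutive-failure breaker with the allows / recordSuccess / recordFailure surface used above might look something like the sketch below. The class name, the injected clock, and the way the half-open probe is rationed (one admitted call per elapsed cooldown) are assumptions for illustration; the real implementation may differ.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical per-provider breaker; a sketch, not the gateway's actual class. */
final class CircuitBreakerSketch {

    private static final class State {
        int consecutiveFailures;
        boolean open;
        long openedAtMs;
    }

    private final int failureThreshold;   // e.g. 4, the default cited in the article
    private final long cooldownMs;        // e.g. 30_000
    private final LongSupplier clockMs;   // injected so the behavior is testable
    private final Map<String, State> states = new ConcurrentHashMap<>();

    CircuitBreakerSketch(int failureThreshold, long cooldownMs, LongSupplier clockMs) {
        this.failureThreshold = failureThreshold;
        this.cooldownMs = cooldownMs;
        this.clockMs = clockMs;
    }

    boolean allows(String provider) {
        State s = stateFor(provider);
        synchronized (s) {
            if (!s.open) return true;                           // closed: normal traffic
            long now = clockMs.getAsLong();
            if (now - s.openedAtMs < cooldownMs) return false;  // open: skip this provider
            s.openedAtMs = now;  // half-open: admit one probe, then block for another cooldown
            return true;
        }
    }

    void recordSuccess(String provider) {
        State s = stateFor(provider);
        synchronized (s) {
            s.consecutiveFailures = 0;   // any success resets the breaker
            s.open = false;
        }
    }

    void recordFailure(String provider) {
        State s = stateFor(provider);
        synchronized (s) {
            s.consecutiveFailures++;
            if (s.consecutiveFailures >= failureThreshold) {
                s.open = true;           // opens, or re-opens after a failed probe
                s.openedAtMs = clockMs.getAsLong();
            }
        }
    }

    private State stateFor(String provider) {
        return states.computeIfAbsent(provider, k -> new State());
    }
}
```

Restarting the cooldown when a probe is admitted means a probe that never reports back cannot leave the provider stuck: the next probe is simply allowed one cooldown later.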
SSE streaming passthrough over buffer-then-forward
For streaming I relay chunks straight through instead of buffering the full completion and re-emitting it. Buffering would have made fallback trivial — you don't commit to a provider until you have the whole answer — but it destroys the entire reason to stream: the token-by-token UX. So I pass the upstream SSE through in real time on a virtual thread. The cost of that choice, stated plainly: there is no mid-stream fallback. Once I've routed a streaming request to a provider and the first chunk is out, I'm committed to it; if it dies halfway, the stream dies. Buffering and re-streaming, or a checkpoint-and-resume scheme, are the ways to fix that, and both are heavier than this demo needs. Non-streaming requests get the full breaker-and-fallback treatment; streaming requests get a single routed provider, and I think that's the right line for now.
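The passthrough itself can be sketched as a plain blocking loop. This is an illustrative version under stated assumptions: the class and method names are hypothetical, the upstream is modeled as a BufferedReader over the provider's response body, and the client as a Writer over the servlet response.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;

/** Hypothetical relay loop; a sketch of SSE passthrough, not the gateway's code. */
final class SseRelaySketch {

    /**
     * Copies upstream SSE lines to the client as they arrive and flushes at
     * each blank line, which is the SSE event delimiter. Returns the number
     * of "data:" lines relayed.
     */
    static int relay(BufferedReader upstream, Writer client) {
        int dataLines = 0;
        try {
            String line;
            while ((line = upstream.readLine()) != null) {
                client.write(line);
                client.write('\n');
                if (line.isEmpty()) {
                    client.flush();   // end of one event: push it to the client now
                } else if (line.startsWith("data:")) {
                    dataLines++;
                }
            }
            client.flush();
        } catch (IOException e) {
            // The stream is already committed to this provider, so there is
            // nothing to fall back to; the failure surfaces to the caller.
            throw new UncheckedIOException("stream broke mid-relay", e);
        }
        return dataLines;
    }
}
```

Because the loop blocks on readLine, it suits a virtual thread (for example one started with Thread.ofVirtual().start(...) on Java 21 or later): the blocking read parks the virtual thread rather than tying up a platform thread.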
Exact-match cache and a token-bucket rate limiter — and where each adds little
The cache is exact-match: it keys on the canonical request (model + messages + sampling params, ignoring the stream flag), so an identical non-streaming call returns the stored completion in ~0 ms and never touches a provider. I chose exact-match over semantic matching because it is simple and safe — it can't return a subtly-wrong answer for a different prompt. It's Caffeine-backed, TTL- and size-bounded, and the same caching library I already use elsewhere on this backend. Honestly: at low scale and for non-deterministic workloads it earns little, which is why it's one config flag (gateway.cache.enabled=false) to turn off. The token-bucket rate limiter is per-client (keyed on the Authorization header, refilling continuously) and returns a real 429 once the budget is spent — correct and necessary in front of a paid API, but at single-instance scale it's mostly there to demonstrate the behavior. Both of these are per-instance, like the rest of the site; a shared Redis would make them cluster-wide.
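One way to build such a key is to hash a canonical, length-prefixed rendering of the fields that affect the completion. The sketch below is an assumption about how that could be done, not the gateway's actual key function: the record shapes and the particular sampling parameters (temperature, topP, maxTokens) are illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

/** Hypothetical cache-key builder; request shape is simplified for illustration. */
final class CacheKeySketch {

    record Message(String role, String content) {}

    record ChatRequest(String model, List<Message> messages,
                       Double temperature, Double topP, Integer maxTokens,
                       boolean stream) {}

    static String cacheKey(ChatRequest r) {
        StringBuilder canonical = new StringBuilder();
        field(canonical, r.model());
        for (Message m : r.messages()) {
            field(canonical, m.role());
            field(canonical, m.content());
        }
        field(canonical, String.valueOf(r.temperature()));
        field(canonical, String.valueOf(r.topP()));
        field(canonical, String.valueOf(r.maxTokens()));
        // r.stream() is deliberately left out: it changes delivery, not content.
        return sha256Hex(canonical.toString());
    }

    /** Length-prefixing keeps ("ab","c") and ("a","bc") from colliding. */
    private static void field(StringBuilder sb, String value) {
        sb.append(value.length()).append(':').append(value).append(';');
    }

    private static String sha256Hex(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

The resulting string can serve directly as the key of a Caffeine cache; any change to the model, a message, or a sampling parameter yields a different key, which is what makes exact matching safe.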
Does it actually work?
I verified the gateway end-to-end against live Groq (Llama 3.3 70B, the default model). A real completion comes back with X-LLM-Provider: groq and a measured latency of roughly 350 ms for a cache miss. Re-issuing the identical request returns X-LLM-Cache: HIT at effectively 0 ms, with no upstream call at all. Hammer the endpoint past the per-minute budget and you get a 429 with a Retry-After, exactly as a real provider would return. SSE streaming relays tokens through in real time. The live dashboard at /gateway polls real counters off the running process: total requests, cache hit rate, tokens served and saved, p50/p99 latency, and a per-provider breakdown.
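The 429 behavior comes from the token bucket. A minimal sketch of a continuously refilling, per-client bucket is below; the class name, the injected nanosecond clock, and the choice to return the Retry-After value directly from tryAcquire are assumptions for illustration, not the gateway's actual API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical per-client limiter; a sketch of a continuously refilling bucket. */
final class TokenBucketSketch {

    private static final class Bucket {
        double tokens;
        long lastRefillNanos;
    }

    private final double capacity;
    private final double refillPerSecond;
    private final LongSupplier nanoClock;   // injected so the behavior is testable
    private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();

    TokenBucketSketch(double capacity, double refillPerSecond, LongSupplier nanoClock) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.nanoClock = nanoClock;
    }

    /**
     * Returns 0 if the request is admitted; otherwise the number of whole
     * seconds to wait, suitable for a Retry-After header on a 429.
     */
    long tryAcquire(String clientKey) {
        long now = nanoClock.getAsLong();
        Bucket b = buckets.computeIfAbsent(clientKey, k -> {
            Bucket fresh = new Bucket();
            fresh.tokens = capacity;         // new clients start with a full bucket
            fresh.lastRefillNanos = now;
            return fresh;
        });
        synchronized (b) {
            double elapsedSeconds = Math.max(0L, now - b.lastRefillNanos) / 1e9;
            b.tokens = Math.min(capacity, b.tokens + elapsedSeconds * refillPerSecond);
            b.lastRefillNanos = now;
            if (b.tokens >= 1.0) {
                b.tokens -= 1.0;
                return 0L;
            }
            return (long) Math.ceil((1.0 - b.tokens) / refillPerSecond);
        }
    }
}
```

In a controller, the client key would be derived from the Authorization header, and a non-zero return would translate into a 429 response carrying that value as Retry-After.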
What those numbers prove and don't: the ~0 ms hit versus ~350 ms miss is a real before/after on the same request, and the 429 is genuine backpressure. They are single-instance, single-user measurements, not a load test — the p99 is computed over a bounded recent sample, and the cost figure is an illustrative estimate (tokens saved × a configurable price per 1K), not a billing integration. It demonstrates the mechanism honestly; it does not claim a production SLO.
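To make the caveat concrete, here is roughly what "p99 over a bounded recent sample" and "tokens saved × price per 1K" amount to. The class name, the ring-buffer window, and the nearest-rank percentile method are assumptions for this sketch; the dashboard's actual computation may differ.

```java
import java.util.Arrays;

/** Hypothetical dashboard math; window size and method are illustrative choices. */
final class GatewayStatsSketch {

    private final long[] window;   // ring buffer of the most recent latencies
    private int count;
    private int next;

    GatewayStatsSketch(int windowSize) {
        this.window = new long[windowSize];
    }

    synchronized void recordLatency(long millis) {
        window[next] = millis;               // overwrites the oldest sample once full
        next = (next + 1) % window.length;
        if (count < window.length) count++;
    }

    /** Nearest-rank percentile over the retained samples; -1 if there are none. */
    synchronized long percentile(double p) {
        if (count == 0) return -1L;
        long[] sorted = Arrays.copyOf(window, count);
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * count / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Illustrative estimate only: tokens saved times a configured price per 1K tokens. */
    static double estimatedSavings(long tokensSaved, double pricePer1kTokens) {
        return tokensSaved / 1000.0 * pricePer1kTokens;
    }
}
```

With a small window, a single slow request can dominate the p99, which is one more reason to read these figures as a demonstration of the mechanism and not as an SLO.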
What I'd do differently, and what's next
The biggest honest gap is mid-stream fallback — today a streaming request is committed to one provider once it starts. The other deliberate limits are scope, not oversight: per-instance metrics and rate limits (a shared Redis makes them cluster-wide), exact-match caching, and a consecutive-failure breaker rather than a rolling-error-rate one.
The next steps I'd actually build, in order: semantic caching (embedding-keyed, to catch near-duplicate prompts the exact-match cache misses; probably the largest available gain in hit rate, at the cost of the safety that exact matching guarantees); per-tenant budgets (real spend caps per API key, turning the illustrative cost number into enforcement); and provider-health-aware routing (route on observed latency and error rate, not just a binary open/closed breaker). Adding a second provider today is already config, not code: set GATEWAY_PROVIDER2_* to any OpenAI-compatible upstream (Cerebras, OpenRouter, etc.) and the router claims exactly the model prefixes it's told to.
Try it: the live gateway and dashboard are at /gateway; the source is on GitHub.