An OpenAI-compatible LLM gateway: routing, caching, and a circuit breaker

LLM Gateway · 2026 · ~8 min read

Point any OpenAI client at <host>/v1 and it just works — same request shape, same response shape — except now there's a cache in front of it, a circuit breaker watching each provider, and a fallback path that takes over when one falls down. The interesting part isn't that it proxies an LLM; it's that the client can't tell when a provider has failed. It serves real traffic on free Groq today, returns a cache hit in roughly 0 ms against a ~350 ms miss, and exposes a live cost/latency dashboard at /gateway. It runs inside the same Spring Boot backend as the rest of this site — no second deployable.

AI infrastructure is the most crowded corner of the industry right now, so I want to be concrete rather than topical. This is a small, honest control plane: it does four things — route, cache, protect, observe — and it does them transparently.

The problem (and why it's actually hard)

Calling a provider directly couples your application to one vendor and one failure mode, and leaves you with no cost control. If that provider rate-limits you, has a bad five minutes, or returns 5xx, your app inherits the outage. The standard answer is a gateway: centralize routing, resilience, caching, and observability behind one stable contract so the app downstream never has to know.

The genuinely hard requirement is transparent resilience. When a provider falls over, the client should not see it — the gateway should fail fast off the broken provider and route to a healthy one without the caller changing a line of code. The naive fix is to retry the failed provider, but that's exactly wrong when the provider is down: every retry burns part of your latency budget waiting for a timeout that you already know is coming. You want to stop hammering a dead upstream quickly, and you want the contract — the OpenAI shape — to stay identical whether the answer came from the cache, the primary, or the fallback.

Streaming makes this sharper. Holding the OpenAI Server-Sent Events contract means relaying tokens to the client as they arrive, which preserves the token-by-token UX people expect — but it also means that once the first byte is on the wire, there is no clean way to fail over mid-stream without the client noticing.

How it works

A request enters through an OpenAI-compatible controller at POST /v1/chat/completions. It passes a per-client token-bucket rate limiter, then the cache. On a miss, the service builds an ordered list of provider candidates — the providers that support the requested model and whose circuit breaker currently allows traffic — and tries them in order. Each provider call is retried with exponential backoff on a transient failure; if one provider exhausts its retries, the gateway falls back to the next candidate and records the failure against the first one's breaker. A success caches the response and returns.

client (OpenAI SDK)
       │ POST /v1/chat/completions
       ▼
┌──────────────┐      429        ┌───────────────┐
│ rate limiter ├────────────────▶│ budget spent  │
└──────┬───────┘                 └───────────────┘
       ▼
┌──────────────┐   HIT (~0 ms)   ┌───────────────┐
│ exact-match  ├────────────────▶│ return cached │
│ cache        │                 └───────────────┘
└──────┬───────┘
       │ MISS
       ▼
┌───────────────────────────────────────────────┐
│ router: providers that support(model)         │
│         AND breaker.allows(name)              │
└──────┬────────────────────────────────────────┘
       │ candidate #1
       ▼
┌──────────────┐  retries exhausted /    ┌─────────────────┐
│ provider A   │  circuit OPEN           │ provider B      │
│ (e.g. Groq)  ├═══════ FALLBACK ═══════▶│ (config-driven) │
└──────┬───────┘                         └─────────────────┘
       │ success
       ▼
cache.put + return
X-LLM-Provider / -Cache / -Latency

A non-streaming request: rate limit → cache → routed candidates with per-provider breakers. The highlighted path is the automatic fallback when the primary's circuit is open or its retries are exhausted.
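The per-provider retry step in the flow above could look roughly like this. It is a minimal sketch, not the gateway's actual callWithRetry: the class name RetrySketch, the attempt count, and the base delay are assumptions for illustration, and it treats every RuntimeException as transient where real code would presumably be more selective.

```java
import java.util.function.Supplier;

// Hypothetical sketch of retry with exponential backoff; names and defaults are assumptions.
final class RetrySketch {

    // Delay before retry number `retry` (0-based): base, 2*base, 4*base, ...
    static long backoffMillis(long baseMillis, int retry) {
        return baseMillis << retry;
    }

    // Runs `call` up to maxAttempts times, sleeping between failed attempts.
    // Rethrows the last failure once attempts are exhausted, so the caller can fall back.
    static <T> T callWithRetry(Supplier<T> call, int maxAttempts, long baseMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e; // assumed transient for this sketch
                if (attempt < maxAttempts - 1) {
                    sleep(backoffMillis(baseMillis, attempt));
                }
            }
        }
        throw last;
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            throw new IllegalStateException("interrupted while backing off", e);
        }
    }
}
```

The point of bounding the attempts is that the exception escaping callWithRetry is the signal to move on to the next candidate rather than keep waiting on the same upstream.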

Streaming takes a deliberately simpler path. When "stream": true, the gateway routes to a single available provider, reads the upstream response on a virtual thread (no reactive stack), and relays each SSE data: chunk straight through to the client as it arrives. Every non-streaming response carries gateway metadata in the X-LLM-Provider, X-LLM-Latency-Ms, and X-LLM-Cache (HIT/MISS) headers, which also feed the dashboard.

The design decisions that mattered

An OpenAI-compatible surface over a custom API

I chose to mimic the OpenAI /v1/chat/completions contract exactly rather than design a cleaner, gateway-native API. The reason is migration cost: every LLM SDK, every internal tool, every curl snippet already speaks this shape. Pointing an existing client at the gateway is a base-URL change and nothing else. A custom API would be tidier on paper and would let me surface gateway concepts as first-class fields — but it would also make adoption a rewrite, and a gateway nobody can drop in is a gateway nobody uses. The trade-off I accepted is that gateway-specific signal (which provider served you, was it a cache hit, how long it took) has to ride on response headers instead of the body, so as not to pollute the OpenAI shape.
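To make the "base-URL change and nothing else" point concrete, here is a minimal sketch of a client request built with the JDK's java.net.http API. The class name GatewayClientSketch, the base URL, the API key, and the model id used in the usage note are assumptions for illustration; only the /v1/chat/completions path and the Authorization header come from the design described here.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical client-side sketch: the only gateway-specific input is the base URL.
final class GatewayClientSketch {

    static HttpRequest chatRequest(String baseUrl, String apiKey, String model, String prompt) {
        // Hand-built JSON keeps the sketch dependency-free; a real client would use a JSON library.
        String body = "{\"model\":\"" + escape(model) + "\","
                + "\"messages\":[{\"role\":\"user\",\"content\":\"" + escape(prompt) + "\"}]}";
        return HttpRequest.newBuilder(URI.create(baseUrl + "/v1/chat/completions"))
                .header("Content-Type", "application/json")
                .header("Authorization", "Bearer " + apiKey) // also the rate limiter's client key
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    // Minimal escaping of backslashes and quotes only.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

Sending it with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()) and reading response.headers().firstValue("X-LLM-Cache") would surface the gateway metadata without touching the response body.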

A per-provider circuit breaker plus automatic fallback, over naive retry

This is the decision the whole project is organized around. Retrying a down provider is the obvious move and the wrong one — it spends your latency budget waiting on an upstream you already have evidence is unhealthy. Instead, each provider gets its own circuit breaker: after a configurable number of consecutive failures (default 4) its circuit opens and it is skipped entirely for a cooldown (default 30 s), after which a single probe is allowed (half-open); any success resets it. The router only ever considers providers whose breaker currently allows traffic, so a broken provider is removed from rotation before a request pays its timeout, and the request falls straight through to the next candidate.

java

// GatewayService.complete — the routed candidate list IS the resilience.
List<LlmProvider> candidates = providers.stream()
      .filter(p -> p.supports(request.model()))      // routing
      .filter(p -> circuitBreaker.allows(p.name()))  // skip open circuits
      .toList();

for (LlmProvider provider : candidates) {
    try {
        ChatCompletionResponse response = callWithRetry(provider, request);
        circuitBreaker.recordSuccess(provider.name());
        cache.put(cacheKey, response, provider.name());
        return new GatewayResult(response, provider.name(), latencyMs, false);
    } catch (ExternalApiException e) {
        circuitBreaker.recordFailure(provider.name());
        // fall through to the next candidate
    }
}

What I traded away: breaker tuning is genuinely fiddly. The failure threshold, the cooldown length, and the half-open probe behavior all interact, and there's no single right setting — too sensitive and a couple of blips evict a healthy provider, too lenient and you keep routing into a brownout. I made all of them config-driven (gateway.circuit.*) rather than pretend I'd found universal values. The breaker also tracks consecutive failures and resets on any success, which is simple and predictable but coarser than a rolling error-rate window would be.
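The breaker behavior described above (consecutive-failure threshold, cooldown, one probe after the cooldown, reset on any success) can be sketched as a small state machine. This is an illustrative sketch under assumptions, not the gateway's actual class: BreakerSketch, its constructor, and the injected clock are hypothetical, and the way it limits probing (restarting the cooldown whenever a probe is granted) is one possible reading of "a single probe is allowed".

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

// Hypothetical consecutive-failure circuit breaker, one state per provider name.
final class BreakerSketch {
    private static final class State {
        int consecutiveFailures;
        long openedAtMillis = -1; // -1 means the circuit is closed
    }

    private final int failureThreshold;
    private final long cooldownMillis;
    private final LongSupplier clock; // injectable so the cooldown can be tested
    private final Map<String, State> states = new HashMap<>();

    BreakerSketch(int failureThreshold, long cooldownMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.cooldownMillis = cooldownMillis;
        this.clock = clock;
    }

    synchronized boolean allows(String provider) {
        State s = states.computeIfAbsent(provider, k -> new State());
        if (s.openedAtMillis < 0) {
            return true; // closed: traffic flows
        }
        long now = clock.getAsLong();
        if (now - s.openedAtMillis < cooldownMillis) {
            return false; // open: skip this provider
        }
        // Half-open: grant one probe and restart the cooldown, so other callers
        // keep being rejected until the probe succeeds or another cooldown passes.
        s.openedAtMillis = now;
        return true;
    }

    synchronized void recordSuccess(String provider) {
        states.put(provider, new State()); // any success resets the breaker
    }

    synchronized void recordFailure(String provider) {
        State s = states.computeIfAbsent(provider, k -> new State());
        s.consecutiveFailures++;
        if (s.consecutiveFailures >= failureThreshold) {
            s.openedAtMillis = clock.getAsLong(); // open, or re-open after a failed probe
        }
    }
}
```

Restarting the cooldown on a granted probe avoids a stuck half-open state when allows() is called while building the candidate list but the probe is never actually sent, for example because an earlier candidate already succeeded.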

SSE streaming passthrough over buffer-then-forward

For streaming I relay chunks straight through instead of buffering the full completion and re-emitting it. Buffering would have made fallback trivial — you don't commit to a provider until you have the whole answer — but it destroys the entire reason to stream: the token-by-token UX. So I pass the upstream SSE through in real time on a virtual thread. The cost of that choice, stated plainly: there is no mid-stream fallback. Once I've routed a streaming request to a provider and the first chunk is out, I'm committed to it; if it dies halfway, the stream dies. Buffering and re-streaming, or a checkpoint-and-resume scheme, are the ways to fix that, and both are heavier than this demo needs. Non-streaming requests get the full breaker-and-fallback treatment; streaming requests get a single routed provider, and I think that's the right line for now.
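A passthrough relay of this kind can be sketched with plain blocking I/O. The following is a minimal illustration under assumptions, not the gateway's code: SseRelaySketch and its relay method are hypothetical names, and it only copies lines and flushes at each blank line, which is the SSE event terminator.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical SSE passthrough: copy upstream events to the client as they arrive.
final class SseRelaySketch {

    // Returns the number of events relayed. An I/O failure mid-stream propagates,
    // which mirrors the "no mid-stream fallback" trade-off: the stream simply ends.
    static int relay(InputStream upstream, OutputStream client) {
        int events = 0;
        boolean pending = false; // true once the current event has at least one line
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(upstream, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                client.write((line + "\n").getBytes(StandardCharsets.UTF_8));
                if (!line.isEmpty()) {
                    pending = true;
                } else if (pending) {
                    client.flush(); // blank line ends an event: push it out now
                    events++;
                    pending = false;
                }
            }
            client.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return events;
    }
}
```

Because the loop blocks on readLine, it suits a virtual thread: on Java 21 or later it could be started with Thread.ofVirtual().start(() -> SseRelaySketch.relay(upstream, client)), where upstream and client are whatever streams the surrounding server code provides.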

Exact-match cache and a token-bucket rate limiter — and where each adds little

The cache is exact-match: it keys on the canonical request (model + messages + sampling params, ignoring the stream flag), so an identical non-streaming call returns the stored completion in ~0 ms and never touches a provider. I chose exact-match over semantic matching because it is simple and safe: it can't return a subtly wrong answer for a different prompt. It's backed by Caffeine, the same caching library I already use elsewhere on this backend, and is bounded by both TTL and size. Honestly: at low scale and for non-deterministic workloads it earns little, which is why it takes one config flag (gateway.cache.enabled=false) to turn off. The token-bucket rate limiter is per-client (keyed on the Authorization header, refilling continuously) and returns a real 429 once the budget is spent. That is correct and necessary in front of a paid API, but at single-instance scale it's mostly there to demonstrate the behavior. Both the cache and the limiter are per-instance, like the rest of the site; a shared Redis would make them cluster-wide.
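One way to derive an exact-match key from the canonical request is to hash its fields in a fixed order while leaving the stream flag out. This is a sketch under assumptions: CacheKeySketch is a hypothetical name, messages are modeled as {role, content} pairs, and temperature stands in for the full set of sampling parameters.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

// Hypothetical exact-match cache key: same model, messages, and sampling params give the same key.
final class CacheKeySketch {

    static String cacheKey(String model, List<String[]> messages, double temperature, boolean stream) {
        StringBuilder canonical = new StringBuilder();
        append(canonical, model);
        for (String[] message : messages) { // each entry is {role, content}
            append(canonical, message[0]);
            append(canonical, message[1]);
        }
        append(canonical, Double.toString(temperature));
        // `stream` is deliberately not part of the key.
        return sha256Hex(canonical.toString());
    }

    // Length-prefix every field so adjacent fields cannot blur into each other.
    private static void append(StringBuilder sb, String field) {
        sb.append(field.length()).append(':').append(field);
    }

    private static String sha256Hex(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

The safety property of exact matching falls out of the key: any change to the prompt, the model, or the sampling parameters produces a different key, so a hit can only ever be the answer to the same question.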

Does it actually work?

I verified the gateway end-to-end against live Groq (Llama 3.3 70B, the default model). A real completion comes back with X-LLM-Provider: groq and a measured latency of roughly 350 ms for a cache miss. Re-issuing the identical request returns X-LLM-Cache: HIT at effectively 0 ms, with no upstream call at all. Hammer the endpoint past the per-minute budget and you get a 429 with a Retry-After, exactly as a real provider would. SSE streaming relays tokens through in real time. The live dashboard at /gateway polls real counters off the running process: total requests, cache hit rate, tokens served and saved, p50/p99 latency, and a per-provider breakdown.
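The 429-with-Retry-After behavior follows naturally from a continuously refilling token bucket. Here is a minimal sketch, assuming a hypothetical TokenBucketSketch class with a per-minute refill rate and an injected clock; a per-client limiter would keep one bucket per Authorization header value in a map.

```java
import java.util.function.LongSupplier;

// Hypothetical continuously refilling token bucket for one client.
final class TokenBucketSketch {
    private final double capacity;
    private final double millisPerToken;
    private final LongSupplier clock; // injectable so refill can be tested
    private double tokens;
    private long lastRefillMillis;

    TokenBucketSketch(int capacity, int refillPerMinute, LongSupplier clock) {
        this.capacity = capacity;
        this.millisPerToken = 60_000.0 / refillPerMinute;
        this.clock = clock;
        this.tokens = capacity; // start with a full budget
        this.lastRefillMillis = clock.getAsLong();
    }

    // Returns 0 when the request is admitted; otherwise the number of whole
    // seconds to wait, suitable for a Retry-After header on a 429 response.
    synchronized long tryAcquire() {
        long now = clock.getAsLong();
        tokens = Math.min(capacity, tokens + (now - lastRefillMillis) / millisPerToken);
        lastRefillMillis = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return 0;
        }
        return (long) Math.ceil((1.0 - tokens) * millisPerToken / 1000.0);
    }
}
```

Retry-After is expressed in seconds, so the sketch rounds the wait up: telling a client to retry slightly late is harmless, while telling it to retry early just earns another 429.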

What those numbers prove and don't: the ~0 ms hit versus ~350 ms miss is a real before/after on the same request, and the 429 is genuine backpressure. They are single-instance, single-user measurements, not a load test. The p99 is computed over a bounded recent sample, and the cost figure is an illustrative estimate (tokens saved × a configurable price per 1K), not a billing integration. The demo shows the mechanism honestly; it does not claim a production SLO.
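For readers who want the arithmetic behind those caveats, here is a minimal sketch of a percentile over a bounded recent sample and the tokens-saved cost estimate. MetricsSketch, the window size, and the nearest-rank percentile method are assumptions for illustration, not necessarily what the dashboard uses.

```java
import java.util.ArrayDeque;

// Hypothetical dashboard metrics: latency percentiles over the most recent N samples.
final class MetricsSketch {
    private final int maxSamples;
    private final ArrayDeque<Long> latencies = new ArrayDeque<>();

    MetricsSketch(int maxSamples) {
        this.maxSamples = maxSamples;
    }

    synchronized void record(long latencyMs) {
        if (latencies.size() == maxSamples) {
            latencies.removeFirst(); // drop the oldest sample to stay bounded
        }
        latencies.addLast(latencyMs);
    }

    // Nearest-rank percentile over the current window; returns 0 when empty.
    synchronized long percentile(double p) {
        if (latencies.isEmpty()) {
            return 0;
        }
        long[] sorted = latencies.stream().mapToLong(Long::longValue).sorted().toArray();
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    // Illustrative estimate only: tokens saved times a configurable price per 1K tokens.
    static double estimatedSavings(long tokensSaved, double pricePer1kTokens) {
        return tokensSaved / 1000.0 * pricePer1kTokens;
    }
}
```

With a small window the p99 is effectively the maximum of the recent samples, which is one reason single-user numbers like these should not be read as a production latency distribution.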

What I'd do differently, and what's next

The biggest honest gap is mid-stream fallback — today a streaming request is committed to one provider once it starts. The other deliberate limits are scope, not oversight: per-instance metrics and rate limits (a shared Redis makes them cluster-wide), exact-match caching, and a consecutive-failure breaker rather than a rolling-error-rate one.

The next steps I'd actually build, in order: semantic caching (embedding-keyed, to catch near-duplicate prompts the exact-match cache misses; probably the largest available gain in hit rate, at the cost of the safety the exact match guarantees); per-tenant budgets (real spend caps per API key, turning the illustrative cost number into enforcement); and provider-health-aware routing (route on observed latency and error rate, not just a binary open/closed breaker). Adding a second provider today is already config, not code: set GATEWAY_PROVIDER2_* to any OpenAI-compatible upstream (Cerebras, OpenRouter, etc.) and the router claims exactly the model prefixes it's told to.
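The "claims exactly the model prefixes it's told to" rule can be sketched as a prefix match per provider. This is an assumption-laden illustration: PrefixRoutingSketch, its Provider type, and the prefixes in the usage note are hypothetical, and how the real configuration is parsed is not shown here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical prefix-based routing: a provider supports a model if it claims a matching prefix.
final class PrefixRoutingSketch {

    static final class Provider {
        final String name;
        final List<String> modelPrefixes;

        Provider(String name, String... modelPrefixes) {
            this.name = name;
            this.modelPrefixes = Arrays.asList(modelPrefixes);
        }

        boolean supports(String model) {
            for (String prefix : modelPrefixes) {
                if (model.startsWith(prefix)) {
                    return true;
                }
            }
            return false;
        }
    }

    // First provider, in configured order, that claims the model.
    static Optional<String> route(List<Provider> providers, String model) {
        for (Provider provider : providers) {
            if (provider.supports(model)) {
                return Optional.of(provider.name);
            }
        }
        return Optional.empty();
    }
}
```

For example, with a provider named groq claiming the made-up prefix "llama-" and a second provider claiming "qwen-", a request for "llama-x" routes to groq, while a model that no provider claims yields an empty result and would be rejected before any upstream call.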

Try it: the live gateway and dashboard are at /gateway; the source is on GitHub.