
Features

What the gateway does today (delivered milestones M0–M6 + M2.5 — Phase 0 complete).

Work-type → model routing (M1)

Requests are routed to a logical model route by header/tag: requests tagged workload: bulk go to the cheaper bulk model and everything else goes to the coding default, with a documented fallback to a second provider. This is native Gateway API config (no plugin). Logical aliases (coding-default, bulk) are the stable client contract; the real provider/model behind each alias is data you can change without touching clients.
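The alias indirection described above can be sketched in a few lines of Python. This is an illustrative model only, not the gateway's configuration: the real routing is native Gateway API config, and the header name `workload`, the provider and model names, and the `pick_alias` / `resolve` helpers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    provider: str
    model: str


# Hypothetical route table: alias -> ordered backends (primary first, then fallback).
# Clients only ever see the alias; the backends are data that can change freely.
ROUTES = {
    "coding-default": [Backend("provider-a", "model-large"),
                       Backend("provider-b", "model-large-alt")],
    "bulk": [Backend("provider-a", "model-small"),
             Backend("provider-b", "model-small-alt")],
}


def pick_alias(headers: dict) -> str:
    """workload: bulk selects the bulk alias; anything else gets the coding default."""
    workload = headers.get("workload", "").strip().lower()
    return "bulk" if workload == "bulk" else "coding-default"


def resolve(headers: dict, unavailable=frozenset()):
    """Return (alias, backend), skipping providers marked unavailable."""
    alias = pick_alias(headers)
    for backend in ROUTES[alias]:
        if backend.provider not in unavailable:
            return alias, backend
    raise RuntimeError(f"no backend available for alias {alias!r}")
```

Swapping the model behind `bulk` means editing one entry in `ROUTES`; callers that send `workload: bulk` are unaffected, which is the point of keeping aliases as the client contract.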

Per-user / per-group token limits (M2)

Token-per-minute rate limits are enforced by the built-in ai-statistics + ai-token-ratelimit plugins (Redis-backed), keyed by the caller's identity and namespaced by {org, project}. Over the limit → HTTP 429. Redis is managed by the Opstree operator (standalone for dev, HA + Sentinel available).
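As a rough illustration of what a token-per-minute limit keyed by identity means, here is a minimal fixed-window counter in Python. It is a sketch under stated assumptions, not the plugins' actual algorithm: an in-memory dict stands in for Redis, the key layout `(org, project, consumer, window)` is hypothetical, and the real ai-token-ratelimit plugin may account for tokens differently (for example, after the response is known).

```python
import time
from collections import defaultdict


class TokenRateLimiter:
    """Fixed one-minute window token counter (in-memory stand-in for Redis)."""

    def __init__(self, tokens_per_minute: int, clock=time.time):
        self.limit = tokens_per_minute
        self.clock = clock  # injectable so the window can be tested
        self._used = defaultdict(int)  # (org, project, consumer, window) -> tokens

    def check(self, org: str, project: str, consumer: str, tokens: int) -> int:
        """Return 200 if the tokens fit in the current window, else 429."""
        window = int(self.clock() // 60)
        key = (org, project, consumer, window)
        if self._used[key] + tokens > self.limit:
            return 429
        self._used[key] += tokens
        return 200
```

Because the org and project are part of the key, two consumers with the same name in different projects never share a counter. A production store would also expire old windows, which this sketch omits.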

API-key authentication + USD budgets (M2.5)

The machine API (/v1) is authenticated with API keys (Authorization: Bearer …, OpenAI-compatible): each key maps to a consumer (project.user) that every policy keys on. On top of token limits, each consumer gets a real dollar budget: a controller continuously reads token usage, prices it with a per-model USD price table, and cuts the consumer off once they exceed their budget → HTTP 403. It's all in-cluster (no proprietary component), and the budget/price config lives in the Project spec.
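The budget check boils down to simple arithmetic: tokens used, times a per-model price, compared against a cap. The Python sketch below shows that calculation. The price figures, the per-million-token unit, the input/output split, and the function names are assumptions made for illustration; the real values live in the Project spec.

```python
from decimal import Decimal

# Hypothetical price table: USD per 1,000,000 tokens, keyed by logical model alias.
PRICES = {
    "coding-default": {"input": Decimal("3.00"), "output": Decimal("15.00")},
    "bulk": {"input": Decimal("0.25"), "output": Decimal("1.25")},
}
MILLION = Decimal(1_000_000)


def spend_usd(usage) -> Decimal:
    """Price a list of (model, input_tokens, output_tokens) records in USD."""
    total = Decimal("0")
    for model, input_tokens, output_tokens in usage:
        price = PRICES[model]
        total += (price["input"] * input_tokens
                  + price["output"] * output_tokens) / MILLION
    return total


def budget_status(usage, budget_usd: Decimal) -> int:
    """Return 403 once spend exceeds the budget, otherwise 200."""
    return 403 if spend_usd(usage) > budget_usd else 200
```

`Decimal` is used instead of `float` so that accumulated spend does not drift through binary rounding, which matters when the result decides whether a consumer is cut off.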

Per-group model allow-list (M3)

A small custom Wasm guard rejects any request whose model isn't in the caller's group allow-list → HTTP 403. The allow-list is data in the Project spec, scoped per Project, and on-prem self-contained (no cloud dependency). Non-LLM requests pass through untouched.
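The guard's decision logic can be approximated as follows. This is a Python sketch of the behaviour described above, not the Wasm guard's source: the group names, the allow-list contents, and the rule that a request without a string `model` field counts as non-LLM are all assumptions.

```python
import json

# Hypothetical allow-list, as it might be derived from one Project spec.
ALLOW_LIST = {
    "engineering": {"coding-default", "bulk"},
    "support": {"bulk"},
}


def guard(group: str, body: bytes) -> int:
    """Return 403 if the requested model is not allowed for the group, else 200."""
    try:
        payload = json.loads(body)
    except ValueError:
        return 200  # not JSON: assumed non-LLM traffic, pass through
    model = payload.get("model") if isinstance(payload, dict) else None
    if not isinstance(model, str):
        return 200  # no model field: assumed non-LLM traffic, pass through
    return 200 if model in ALLOW_LIST.get(group, set()) else 403
```

An unknown group gets an empty allow-list, so the sketch fails closed for LLM requests while leaving other traffic alone.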

Guardrails — PII masking + prompt-injection (M5)

Two guards screen every prompt before the model sees it:

  • PII masking uses the built-in ai-data-masking plugin to mask sensitive data (emails, phone numbers, IPs, API keys) in both the request and the model's response. It runs entirely in-cluster with local rules — no data leaves the cluster — and the masking rules are data in the Project spec, toggleable per Project.
  • Prompt-injection blocking is a small custom Wasm guard that rejects known jailbreak / instruction-override prompts (e.g. "ignore previous instructions", "reveal your system prompt") → HTTP 403, with a configurable pattern list.

Both are on-prem and self-contained — no third-party cloud call — meeting the privacy requirement.
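To make the two guards concrete, here is a minimal Python sketch of rule-based masking and pattern-based blocking. It is an illustration under assumptions rather than the plugins' implementation: the regular expressions, the placeholder strings, and the order in which the two checks run are hypothetical, and only two of the PII types listed above are covered.

```python
import re

# Hypothetical local masking rules: (pattern, replacement).
MASK_RULES = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP]"),
]

# Hypothetical configurable pattern list for instruction-override prompts.
INJECTION_PATTERNS = [
    re.compile(r"ignore (?:all )?previous instructions", re.IGNORECASE),
    re.compile(r"reveal your system prompt", re.IGNORECASE),
]


def mask_pii(text: str) -> str:
    """Replace every match of each masking rule with its placeholder."""
    for pattern, replacement in MASK_RULES:
        text = pattern.sub(replacement, text)
    return text


def screen_prompt(prompt: str):
    """Return (status, text): 403 with no text if blocked, else 200 with masked text."""
    if any(pattern.search(prompt) for pattern in INJECTION_PATTERNS):
        return 403, ""
    return 200, mask_pii(prompt)
```

Both steps use only local pattern matching, which is what allows them to run without any call outside the cluster. The same `mask_pii` function could be applied to the model's response as well as the request.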

Observability (M4)

A full Grafana LGTM stack — Loki (logs), Mimir (metrics), Tempo (traces) — with Grafana Alloy scraping the gateway's token metrics and tailing its access logs. Grafana ships with an AI Gateway overview dashboard (tokens & latency by model/route, rate-limit rejections) and is reachable at its own hostname. The LGTM stores sit behind a basic-auth proxy (tenant credential), with org → tenant isolation ready for multi-tenancy.

Storage is host-mounted local disk for dev (no NFS) with an object-storage (S3 / SeaweedFS) path for HA.

Single sign-on (M6)

Human access to dashboards (and the future admin console) is gated by Google Workspace SSO, restricted to your company domain (hd / email-domain enforced server-side — outside accounts are denied). Unauthenticated visitors are redirected to Google; on login, the user's identity and group flow through as the same identity tuple the gateway's limits and guardrails already use, so SSO simply becomes the source of identity. Built on oauth2-proxy (no proprietary component), it runs entirely in your cluster.
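The server-side domain restriction amounts to checking claims in the Google ID token after its signature has been verified. The Python sketch below shows one conservative way to do that. It is not oauth2-proxy's code: `is_allowed` is a hypothetical helper, and requiring both the `hd` claim and a verified email in the company domain is an assumption about how strict the check is.

```python
def is_allowed(claims: dict, company_domain: str) -> bool:
    """Accept only verified identities whose hd and email domain match the company."""
    domain = company_domain.lower()
    hd = str(claims.get("hd", "")).lower()
    email = str(claims.get("email", "")).lower()
    local_part, separator, email_domain = email.rpartition("@")
    email_ok = (
        claims.get("email_verified") is True
        and separator == "@"
        and local_part != ""
        and email_domain == domain
    )
    return hd == domain and email_ok
```

Personal Google accounts carry no `hd` claim, so they fail the first comparison even if the email string looks plausible. The check must run on the server against verified token claims, never on values supplied by the browser.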

In-cluster TLS + remote access (M0.5)

TLS is terminated in-cluster by Higress using a Let's Encrypt certificate from cert-manager (ACME DNS-01 via Cloudflare). A Cloudflare Tunnel is the dev front door; in production it's removed and the same certificate serves clients directly — zero manifest change.

Coming next

See the Roadmap: Phase 0 (the single-tenant engine) is complete; next is the multi-tenant control plane. USD budgets and API keys have already shipped in M2.5, described above.

Self-hosted AI gateway on Higress — Infrastructure-as-Code, on-prem, no cloud lock-in.