Deployments and Inference
How Bitstride routes deployment traffic and models runtime capabilities.
Deployments and inference
Bitstride uses a deployment-aware inference control plane. A deployment describes where traffic should go, how that runtime is reached, and what it can do.
Deployment model
Each deployment can carry:
subdomain— custom subdomain for the deploymentname— human-readable nameendpoint— backend endpoint URLbackend_type— which inference engine to useprotocol— communication protocolregion— deployment regionmin_replicas/max_replicas— scaling boundsconcurrency_limit— max concurrent requestscapabilities— supported features (streaming, adapters, etc.)status— current deployment state
Backend types
dynamo | triton | vllm | sglang | tgi | ollama | smg | custom
Supported protocols
http | grpc | openai
Capabilities
Capabilities let the control plane reason about runtime behavior:
streaming— supports streaming responsesadapter_types— supported LoRA/adapter typescache_affinity— cache preference hintssupported_protocols— which protocols this deployment accepts
Runtime surfaces
The API binary can run in two modes:
| Runtime | Purpose |
|---|---|
| Full runtime | Control plane routes plus deployment inference routes |
| Gateway runtime | Health, readiness, metrics, and deployment-scoped chat inference only |
Use the gateway runtime when you want a dedicated inference serving layer without the full control plane surface.
Inference endpoints
Bitstride exposes deployment-scoped chat inference in two shapes:
POST /v1/inference/{deployment_id}/chat
POST /v1/chat/completions
/v1/inference/{deployment_id}/chat routes through the control plane with full authz, rate
limiting, and usage recording.
/v1/chat/completions is intended for deployment-hostname routing with an OpenAI-compatible chat
shape.
What the gateway does
The gateway is responsible for:
- Authentication and authorization checks
- API-key rate limiting
- Deployment resolution (model → deployment → backend)
- Backend dispatch through the runtime adapter layer
- Inference usage recording
- Health and readiness reporting
Control plane endpoints for deployments
GET /v1/deployments
POST /v1/deployments
GET /v1/deployments/{id}
PUT /v1/deployments/{id}
That surface is where operators manage deployment metadata and scaling constraints. The inference gateway is where runtime traffic is executed.
Multi-backend support
Bitstride abstracts backend execution so you can swap inference engines without changing client code:
| Backend | Best for |
|---|---|
| NVIDIA Dynamo | Production gRPC serving with dynamic batching |
| vLLM | High-throughput LLM serving with PagedAttention |
| TGI | HuggingFace ecosystem compatibility |
| SGLang | Structured generation and program execution |
| Triton | Multi-model ensembles and custom backends |
| Ollama | Local development and small deployments |
| SMG | Bitstride's internal router abstraction |
| Custom | Bring your own inference runtime |
Region and routing
Current operation is effectively single-region, with region modeled explicitly in deployment metadata so future multi-region policy can be added without changing deployment contracts.
The DeploymentResolver selects backends based on:
- Model and organization
- Protocol and streaming requirements
- Capabilities and adapter types
- Region preference
Usage and observability
Every inference request is metered and recorded:
- Per-API-key usage attribution
- Per-deployment request counts and latency
- Prometheus metrics for operational monitoring
- PostgreSQL-backed usage records for billing