Deployments and inference

Bitstride uses a deployment-aware inference control plane. A deployment describes where traffic should go, how that runtime is reached, and what it can do.

Deployment model

Each deployment can carry:

subdomain — custom subdomain for the deployment
name — human-readable name
endpoint — backend endpoint URL
backend_type — which inference engine to use
protocol — communication protocol
region — deployment region
min_replicas / max_replicas — scaling bounds
concurrency_limit — max concurrent requests
capabilities — supported features (streaming, adapters, etc.)
status — current deployment state

Backend types

dynamo | triton | vllm | sglang | tgi | ollama | smg | custom

Supported protocols

http | grpc | openai

Capabilities

Capabilities let the control plane reason about runtime behavior:

streaming — supports streaming responses
adapter_types — supported LoRA/adapter types
cache_affinity — cache preference hints
supported_protocols — which protocols this deployment accepts

Runtime surfaces

The API binary can run in two modes:

Runtime	Purpose
Full runtime	Control plane routes plus deployment inference routes
Gateway runtime	Health, readiness, metrics, and deployment-scoped chat inference only

Use the gateway runtime when you want a dedicated inference serving layer without the full control plane surface.

Inference endpoints

Bitstride exposes deployment-scoped chat inference in two shapes:

POST /v1/inference/{deployment_id}/chat
POST /v1/chat/completions

/v1/inference/{deployment_id}/chat routes through the control plane with full authz, rate limiting, and usage recording.

/v1/chat/completions is intended for deployment-hostname routing with an OpenAI-compatible chat shape.

What the gateway does

The gateway is responsible for:

Authentication and authorization checks
API-key rate limiting
Deployment resolution (model → deployment → backend)
Backend dispatch through the runtime adapter layer
Inference usage recording
Health and readiness reporting

Control plane endpoints for deployments

GET  /v1/deployments
POST /v1/deployments
GET  /v1/deployments/{id}
PUT  /v1/deployments/{id}

That surface is where operators manage deployment metadata and scaling constraints. The inference gateway is where runtime traffic is executed.

Multi-backend support

Bitstride abstracts backend execution so you can swap inference engines without changing client code:

Backend	Best for
NVIDIA Dynamo	Production gRPC serving with dynamic batching
vLLM	High-throughput LLM serving with PagedAttention
TGI	HuggingFace ecosystem compatibility
SGLang	Structured generation and program execution
Triton	Multi-model ensembles and custom backends
Ollama	Local development and small deployments
SMG	Bitstride's internal router abstraction
Custom	Bring your own inference runtime

Region and routing

Current operation is effectively single-region, with region modeled explicitly in deployment metadata so future multi-region policy can be added without changing deployment contracts.

The DeploymentResolver selects backends based on:

Model and organization
Protocol and streaming requirements
Capabilities and adapter types
Region preference

Usage and observability

Every inference request is metered and recorded:

Per-API-key usage attribution
Per-deployment request counts and latency
Prometheus metrics for operational monitoring
PostgreSQL-backed usage records for billing

Deployments and Inference