BitstrideBitstride

Deployments and Inference

How Bitstride routes deployment traffic and models runtime capabilities.

Deployments and inference

Bitstride uses a deployment-aware inference control plane. A deployment describes where traffic should go, how that runtime is reached, and what it can do.

Deployment model

Each deployment can carry:

  • subdomain — custom subdomain for the deployment
  • name — human-readable name
  • endpoint — backend endpoint URL
  • backend_type — which inference engine to use
  • protocol — communication protocol
  • region — deployment region
  • min_replicas / max_replicas — scaling bounds
  • concurrency_limit — max concurrent requests
  • capabilities — supported features (streaming, adapters, etc.)
  • status — current deployment state

Backend types

dynamo | triton | vllm | sglang | tgi | ollama | smg | custom

Supported protocols

http | grpc | openai

Capabilities

Capabilities let the control plane reason about runtime behavior:

  • streaming — supports streaming responses
  • adapter_types — supported LoRA/adapter types
  • cache_affinity — cache preference hints
  • supported_protocols — which protocols this deployment accepts

Runtime surfaces

The API binary can run in two modes:

RuntimePurpose
Full runtimeControl plane routes plus deployment inference routes
Gateway runtimeHealth, readiness, metrics, and deployment-scoped chat inference only

Use the gateway runtime when you want a dedicated inference serving layer without the full control plane surface.

Inference endpoints

Bitstride exposes deployment-scoped chat inference in two shapes:

POST /v1/inference/{deployment_id}/chat
POST /v1/chat/completions

/v1/inference/{deployment_id}/chat routes through the control plane with full authz, rate limiting, and usage recording.

/v1/chat/completions is intended for deployment-hostname routing with an OpenAI-compatible chat shape.

What the gateway does

The gateway is responsible for:

  • Authentication and authorization checks
  • API-key rate limiting
  • Deployment resolution (model → deployment → backend)
  • Backend dispatch through the runtime adapter layer
  • Inference usage recording
  • Health and readiness reporting

Control plane endpoints for deployments

GET  /v1/deployments
POST /v1/deployments
GET  /v1/deployments/{id}
PUT  /v1/deployments/{id}

That surface is where operators manage deployment metadata and scaling constraints. The inference gateway is where runtime traffic is executed.

Multi-backend support

Bitstride abstracts backend execution so you can swap inference engines without changing client code:

BackendBest for
NVIDIA DynamoProduction gRPC serving with dynamic batching
vLLMHigh-throughput LLM serving with PagedAttention
TGIHuggingFace ecosystem compatibility
SGLangStructured generation and program execution
TritonMulti-model ensembles and custom backends
OllamaLocal development and small deployments
SMGBitstride's internal router abstraction
CustomBring your own inference runtime

Region and routing

Current operation is effectively single-region, with region modeled explicitly in deployment metadata so future multi-region policy can be added without changing deployment contracts.

The DeploymentResolver selects backends based on:

  • Model and organization
  • Protocol and streaming requirements
  • Capabilities and adapter types
  • Region preference

Usage and observability

Every inference request is metered and recorded:

  • Per-API-key usage attribution
  • Per-deployment request counts and latency
  • Prometheus metrics for operational monitoring
  • PostgreSQL-backed usage records for billing