ai/doc/gateway_container_creation.md
Gateway Container Creation

Overview

The gateway automatically provisions user agent containers when users authenticate. This ensures each user has their own isolated environment running their MCP server with persistent storage.

Authentication Flow with Container Creation

User connects (WebSocket/Telegram)
         ↓
   Send "Authenticating..." status
         ↓
   Verify token/channel link
         ↓
   Lookup user license from DB
         ↓
   Send "Starting workspace..." status
         ↓
┌────────────────────────────────────┐
│  ensureContainerRunning()         │
│  ┌──────────────────────────────┐ │
│  │ Check if deployment exists   │ │
│  └──────────────────────────────┘ │
│           ↓                        │
│     Does it exist?                 │
│     ↙         ↘                    │
│   Yes          No                  │
│    │            │                  │
│    │      ┌──────────────────┐    │
│    │      │ Create deployment│    │
│    │      │ Create PVC       │    │
│    │      │ Create service   │    │
│    │      └──────────────────┘    │
│    │            │                  │
│    └────────────┘                  │
│         ↓                          │
│  Wait for deployment ready         │
│  (polls every 2s, timeout 2min)    │
│         ↓                          │
│  Compute MCP endpoint URL          │
│  (internal k8s service DNS)        │
└────────────────────────────────────┘
         ↓
   Update license.mcpServerUrl
         ↓
   Send "Connected" status
         ↓
   Initialize AgentHarness
         ↓
   Connect to user's MCP server
         ↓
   Ready for messages
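
The "wait for deployment ready" step in the flow above (poll every 2s, time out after 2min) can be sketched as a generic polling loop; here `isReady` is a stand-in for the real Kubernetes deployment-status check:

```typescript
// Sketch of the readiness wait: poll every 2s, give up after 2min.
// `isReady` stands in for the real Kubernetes deployment-status query.
async function waitForReady(
  isReady: () => Promise<boolean>,
  pollMs = 2_000,
  timeoutMs = 120_000,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await isReady()) return; // deployment reports ready
    await new Promise<void>((resolve) => setTimeout(resolve, pollMs));
  }
  throw new Error(`deployment not ready after ${timeoutMs}ms`);
}
```

On timeout the gateway surfaces a retryable failure rather than blocking the connection indefinitely (see Error Handling below).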

Container Naming Convention

All resources follow a consistent naming pattern based on userId:

userId: "user-abc123"
  
deploymentName: "agent-user-abc123"
serviceName: "agent-user-abc123"
pvcName: "agent-user-abc123-data"
mcpEndpoint: "http://agent-user-abc123.dexorder-agents.svc.cluster.local:3000"

User IDs are sanitized to be Kubernetes-compliant (lowercase alphanumeric + hyphens).
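
A minimal sketch of the sanitization and naming helpers (the real implementations live in gateway/src/k8s/client.ts; the sanitization rules here follow RFC 1123 label requirements, which is an assumption about the exact regex used):

```typescript
// Kubernetes object names must be lowercase alphanumerics and hyphens,
// and must not start or end with a hyphen (RFC 1123 labels).
function sanitizeUserId(userId: string): string {
  return userId
    .toLowerCase()
    .replace(/[^a-z0-9-]/g, '-') // replace disallowed characters with hyphens
    .replace(/^-+|-+$/g, '');    // trim leading/trailing hyphens
}

function getDeploymentName(userId: string): string {
  return `agent-${sanitizeUserId(userId)}`;
}

function getPvcName(userId: string): string {
  return `${getDeploymentName(userId)}-data`;
}

function getMcpEndpoint(userId: string, namespace = 'dexorder-agents'): string {
  // Internal cluster DNS: <service>.<namespace>.svc.cluster.local
  return `http://${getDeploymentName(userId)}.${namespace}.svc.cluster.local:3000`;
}
```

For `userId: "user-abc123"` these produce exactly the names shown above.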

Templates by License Tier

Templates are located in gateway/src/k8s/templates/:

  • free-tier.yaml
  • pro-tier.yaml
  • enterprise-tier.yaml

Variable Substitution

Templates use simple string replacement:

  • {{userId}} - User ID
  • {{deploymentName}} - Computed deployment name
  • {{serviceName}} - Computed service name
  • {{pvcName}} - Computed PVC name
  • {{agentImage}} - Agent container image (from env)
  • {{sidecarImage}} - Lifecycle sidecar image (from env)
  • {{storageClass}} - Kubernetes storage class (from env)

Resource Limits

Tier        Memory Request  Memory Limit  CPU Request  CPU Limit  Storage  Idle Timeout
Free        256Mi           512Mi         100m         500m       1Gi      15min
Pro         512Mi           2Gi           250m         2000m      10Gi     60min
Enterprise  1Gi             4Gi           500m         4000m      50Gi     Never (shutdown disabled)

Components

KubernetesClient (gateway/src/k8s/client.ts)

Low-level k8s API wrapper:

  • deploymentExists(name) - Check if deployment exists
  • createAgentDeployment(spec) - Create deployment/service/PVC from template
  • waitForDeploymentReady(name, timeout) - Poll until ready
  • getServiceEndpoint(name) - Get service URL
  • deleteAgentDeployment(userId) - Cleanup (for testing)

Static helpers:

  • getDeploymentName(userId) - Generate deployment name
  • getServiceName(userId) - Generate service name
  • getPvcName(userId) - Generate PVC name
  • getMcpEndpoint(userId, namespace) - Compute internal service URL

ContainerManager (gateway/src/k8s/container-manager.ts)

High-level orchestration:

  • ensureContainerRunning(userId, license) - Main entry point
    • Returns: { mcpEndpoint, wasCreated }
    • Creates deployment if missing
    • Waits for ready state
    • Returns endpoint URL
  • getContainerStatus(userId) - Check status without creating
  • deleteContainer(userId) - Manual cleanup
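
The `ensureContainerRunning()` orchestration can be sketched against a stand-in client interface (`KubeClient` here is illustrative shorthand for the KubernetesClient wrapper, not its actual signature):

```typescript
// Stand-in for the KubernetesClient wrapper in gateway/src/k8s/client.ts.
interface KubeClient {
  deploymentExists(name: string): Promise<boolean>;
  createAgentDeployment(userId: string, tier: string): Promise<void>;
  waitForDeploymentReady(name: string, timeoutMs: number): Promise<void>;
  getMcpEndpoint(userId: string): string;
}

async function ensureContainerRunning(
  k8s: KubeClient,
  userId: string,
  tier: string,
): Promise<{ mcpEndpoint: string; wasCreated: boolean }> {
  const name = `agent-${userId}`;
  const wasCreated = !(await k8s.deploymentExists(name));
  if (wasCreated) {
    await k8s.createAgentDeployment(userId, tier); // deployment + service + PVC
  }
  await k8s.waitForDeploymentReady(name, 120_000); // 2-minute timeout
  return { mcpEndpoint: k8s.getMcpEndpoint(userId), wasCreated };
}
```

Note that the ready-wait runs on both paths: an existing deployment may still be starting up when the user reconnects.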

Authenticator (gateway/src/auth/authenticator.ts)

Updated to call container manager:

  • authenticateWebSocket() - Calls ensureContainerRunning() before returning AuthContext
  • authenticateTelegram() - Same for Telegram webhooks

WebSocketHandler (gateway/src/channels/websocket-handler.ts)

Multi-phase connection protocol:

  1. Send {type: 'status', status: 'authenticating'}
  2. Authenticate (may take 30-120s if creating container)
  3. Send {type: 'status', status: 'initializing'}
  4. Initialize agent harness
  5. Send {type: 'connected', ...}

This gives the client visibility into the startup process.
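
A hypothetical client-side handler for this protocol might look like the following (the `{type, status}` frames match the list above; the `error` shape and the display strings are assumptions):

```typescript
// Message frames from the gateway's multi-phase connection protocol.
// The 'error' variant is assumed; 'status' and 'connected' match the doc.
type GatewayMessage =
  | { type: 'status'; status: 'authenticating' | 'initializing' }
  | { type: 'connected' }
  | { type: 'error'; message: string };

function describeStartup(msg: GatewayMessage): string {
  switch (msg.type) {
    case 'status':
      // 'authenticating' may take 30-120s if a container is being created.
      return msg.status === 'authenticating'
        ? 'Authenticating...'
        : 'Starting workspace...';
    case 'connected':
      return 'Connected';
    case 'error':
      return `Error: ${msg.message}`;
  }
}
```

The key point is that the client should not treat a long pause before `connected` as a failure; the status frames tell it which phase is in progress.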

Configuration

Environment variables:

# Kubernetes
KUBERNETES_NAMESPACE=dexorder-agents
KUBERNETES_IN_CLUSTER=true         # false for local dev
KUBERNETES_CONTEXT=minikube        # for local dev only

# Container images
AGENT_IMAGE=ghcr.io/dexorder/agent:latest
SIDECAR_IMAGE=ghcr.io/dexorder/lifecycle-sidecar:latest

# Storage
AGENT_STORAGE_CLASS=standard
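
Reading these variables into a typed config might look like this sketch (which variables are required versus defaulted is an assumption; the default values mirror the ones shown above):

```typescript
interface K8sConfig {
  namespace: string;
  inCluster: boolean;
  context?: string;     // local dev only
  agentImage: string;
  sidecarImage: string;
  storageClass: string;
}

// Pass process.env in the gateway; a plain record keeps this testable.
function loadK8sConfig(env: Record<string, string | undefined>): K8sConfig {
  const required = (key: string): string => {
    const value = env[key];
    if (!value) throw new Error(`missing required env var: ${key}`);
    return value;
  };
  return {
    namespace: env.KUBERNETES_NAMESPACE ?? 'dexorder-agents',
    inCluster: env.KUBERNETES_IN_CLUSTER !== 'false', // default true in-cluster
    context: env.KUBERNETES_CONTEXT,
    agentImage: required('AGENT_IMAGE'),
    sidecarImage: required('SIDECAR_IMAGE'),
    storageClass: env.AGENT_STORAGE_CLASS ?? 'standard',
  };
}
```
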

Security

The gateway uses a restricted ServiceAccount with RBAC:

Can do:

  • Create deployments in dexorder-agents namespace
  • Create services in dexorder-agents namespace
  • Create PVCs in dexorder-agents namespace
  • Read pod status and logs (debugging)
  • Update deployments (future: resource scaling)

Cannot do:

  • Delete deployments (handled by lifecycle sidecar)
  • Delete PVCs (handled by lifecycle sidecar)
  • Exec into pods
  • Access secrets or configmaps
  • Create resources in other namespaces
  • Access Kubernetes API from agent containers (blocked by NetworkPolicy)

See deploy/k8s/base/gateway-rbac.yaml for full configuration.

Lifecycle

Container Creation (Gateway)

  • User authenticates
  • Gateway checks if deployment exists
  • If missing, creates from template
  • Waits for ready (2min timeout)
  • Returns MCP endpoint

Container Deletion (Lifecycle Sidecar)

  • Container tracks activity and triggers
  • When idle (no triggers + timeout), exits with code 42
  • Sidecar detects exit code 42
  • Sidecar deletes deployment + optional PVC via k8s API
  • Gateway creates fresh container on next authentication
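
The exit-code contract between agent and sidecar can be sketched in a few lines (the constant name is illustrative):

```typescript
// Exit code 42 is the agent's "I am idle, tear me down" signal.
const IDLE_SHUTDOWN_EXIT_CODE = 42;

function shouldDeleteDeployment(exitCode: number): boolean {
  // Any other exit code (crash, OOM-kill, etc.) lets Kubernetes restart the
  // pod in place instead of deleting the deployment.
  return exitCode === IDLE_SHUTDOWN_EXIT_CODE;
}
```
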

See doc/container_lifecycle_management.md for full lifecycle details.

Error Handling

Error                            Gateway Action                         User Experience
Deployment creation fails        Log error, return auth failure         "Authentication failed"
Wait timeout (image pull, etc.)  Log warning, return 503                "Service unavailable, retry"
Service not found                Retry with backoff                     Transparent retry
MCP connection fails             Return error                           "Failed to connect to workspace"
Existing deployment not ready    Wait 30s, continue if still not ready  May connect to partially-ready container
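
The table above can be sketched as a mapping from failure kind to client response; the error-kind names are illustrative, and only the 503 status is stated in the table (the other numeric codes are assumptions):

```typescript
type ProvisionError =
  | 'creation_failed'
  | 'ready_timeout'
  | 'service_not_found'
  | 'mcp_connect_failed';

function toClientResponse(err: ProvisionError): { status: number; message: string } {
  switch (err) {
    case 'creation_failed':
      return { status: 401, message: 'Authentication failed' };
    case 'ready_timeout':
      return { status: 503, message: 'Service unavailable, retry' };
    case 'service_not_found':
      // Only surfaced after the backoff retries are exhausted.
      return { status: 503, message: 'Service unavailable, retry' };
    case 'mcp_connect_failed':
      return { status: 502, message: 'Failed to connect to workspace' };
  }
}
```
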

Local Development

For local development (outside k8s):

  1. Start minikube:
minikube start
minikube addons enable storage-provisioner
  2. Apply security policies:
kubectl apply -k deploy/k8s/dev
  3. Configure gateway for local k8s:
# .env
KUBERNETES_IN_CLUSTER=false
KUBERNETES_CONTEXT=minikube
KUBERNETES_NAMESPACE=dexorder-agents
  4. Run gateway:
cd gateway
npm run dev
  5. Connect via WebSocket:
wscat -c "ws://localhost:3000/ws/chat" -H "Authorization: Bearer your-jwt"

The gateway will create deployments in minikube. View with:

kubectl get deployments -n dexorder-agents
kubectl get pods -n dexorder-agents
kubectl logs -n dexorder-agents deploy/agent-user-abc123 -c agent

Production Deployment

  1. Build and push gateway image:
cd gateway
docker build -t ghcr.io/dexorder/gateway:latest .
docker push ghcr.io/dexorder/gateway:latest
  2. Deploy to k8s:
kubectl apply -k deploy/k8s/prod
  3. Gateway runs in dexorder-system namespace
  4. Creates agent containers in dexorder-agents namespace
  5. Admission policies enforce image allowlist and security constraints

Monitoring

Useful metrics to track:

  • Container creation latency (time from auth to ready)
  • Container creation failure rate
  • Active containers by license tier
  • Resource usage per tier
  • Idle shutdown rate

These can be exported via Prometheus or logged to a monitoring service.
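
As a self-contained sketch of the first metric (creation latency, auth to ready), a small in-memory tracker works; a real deployment would use a Prometheus client library instead:

```typescript
// Records container-creation latencies and reports simple aggregates.
class LatencyTracker {
  private samples: number[] = [];

  record(ms: number): void {
    this.samples.push(ms);
  }

  count(): number {
    return this.samples.length;
  }

  p95(): number {
    if (this.samples.length === 0) return 0;
    const sorted = [...this.samples].sort((a, b) => a - b);
    const idx = Math.min(sorted.length - 1, Math.floor(sorted.length * 0.95));
    return sorted[idx];
  }
}
```

The gateway would call `record()` with the elapsed time between the authentication request and the "Connected" status.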

Future Enhancements

  1. Pre-warming: Create containers for active users before they connect
  2. Image updates: Handle agent image version migrations with user consent
  3. Multi-region: Geo-distributed container placement
  4. Cost tracking: Per-user resource usage and billing
  5. Auto-scaling: Scale down to 0 replicas instead of deletion (faster restart)
  6. Container pools: Shared warm containers for anonymous users