# Gateway Container Creation

## Overview

The gateway automatically provisions user agent containers when users authenticate. This ensures each user has their own isolated environment running their MCP server with persistent storage.

## Authentication Flow with Container Creation

```
User connects (WebSocket/Telegram)
  ↓
Send "Authenticating..." status
  ↓
Verify token/channel link
  ↓
Lookup user license from DB
  ↓
Send "Starting workspace..." status
  ↓
┌────────────────────────────────────────┐
│ ContainerManager.ensureRunning()       │
│                                        │
│  Check if deployment exists            │
│    ↓                                   │
│  Does it exist?                        │
│    ↙ Yes: skip creation                │
│    ↘ No:  create deployment            │
│           create PVC                   │
│           create service               │
│    ↓                                   │
│  Wait for deployment ready             │
│  (polls every 2s, timeout 2min)        │
│    ↓                                   │
│  Compute MCP endpoint URL              │
│  (internal k8s service DNS)            │
└────────────────────────────────────────┘
  ↓
Update license.mcpServerUrl
  ↓
Send "Connected" status
  ↓
Initialize AgentHarness
  ↓
Connect to user's MCP server
  ↓
Ready for messages
```

## Container Naming Convention

All resources follow a consistent naming pattern based on `userId`:

```typescript
userId: "user-abc123"
  ↓
deploymentName: "agent-user-abc123"
serviceName:    "agent-user-abc123"
pvcName:        "agent-user-abc123-data"
mcpEndpoint:    "http://agent-user-abc123.dexorder-agents.svc.cluster.local:3000"
```

User IDs are sanitized to be Kubernetes-compliant (lowercase alphanumeric + hyphens).
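A minimal sketch of how this mapping might be implemented. The helper names mirror the static helpers on `KubernetesClient`; the sanitization details (hyphen substitution, trimming) are illustrative assumptions rather than the exact implementation:

```typescript
// Sketch of the userId → resource-name mapping described above.
// Assumes sanitization lowercases and replaces disallowed characters
// with hyphens to satisfy Kubernetes DNS-1123 label rules.

function sanitizeUserId(userId: string): string {
  return userId
    .toLowerCase()
    .replace(/[^a-z0-9-]/g, "-") // keep lowercase alphanumerics + hyphens
    .replace(/^-+|-+$/g, "");    // names must start/end alphanumeric
}

function getDeploymentName(userId: string): string {
  return `agent-${sanitizeUserId(userId)}`;
}

function getServiceName(userId: string): string {
  return getDeploymentName(userId); // service shares the deployment name
}

function getPvcName(userId: string): string {
  return `${getDeploymentName(userId)}-data`;
}

function getMcpEndpoint(userId: string, namespace = "dexorder-agents"): string {
  // Internal cluster DNS: <service>.<namespace>.svc.cluster.local
  return `http://${getServiceName(userId)}.${namespace}.svc.cluster.local:3000`;
}
```

Deriving every resource name from a single sanitized `userId` keeps lookups stateless: any component can recompute a user's deployment, service, or PVC name without a registry.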
## Templates by License Tier

Templates are located in `gateway/src/k8s/templates/`:

- `free-tier.yaml`
- `pro-tier.yaml`
- `enterprise-tier.yaml`

### Variable Substitution

Templates use simple string replacement:

- `{{userId}}` - User ID
- `{{deploymentName}}` - Computed deployment name
- `{{serviceName}}` - Computed service name
- `{{pvcName}}` - Computed PVC name
- `{{agentImage}}` - Agent container image (from env)
- `{{sidecarImage}}` - Lifecycle sidecar image (from env)
- `{{storageClass}}` - Kubernetes storage class (from env)

### Resource Limits

| Tier | Memory Request | Memory Limit | CPU Request | CPU Limit | Storage | Idle Timeout |
|------|----------------|--------------|-------------|-----------|---------|--------------|
| **Free** | 256Mi | 512Mi | 100m | 500m | 1Gi | 15min |
| **Pro** | 512Mi | 2Gi | 250m | 2000m | 10Gi | 60min |
| **Enterprise** | 1Gi | 4Gi | 500m | 4000m | 50Gi | Never (shutdown disabled) |

## Components

### KubernetesClient (`gateway/src/k8s/client.ts`)

Low-level k8s API wrapper:

- `deploymentExists(name)` - Check if deployment exists
- `createAgentDeployment(spec)` - Create deployment/service/PVC from template
- `waitForDeploymentReady(name, timeout)` - Poll until ready
- `getServiceEndpoint(name)` - Get service URL
- `deleteAgentDeployment(userId)` - Cleanup (for testing)

Static helpers:

- `getDeploymentName(userId)` - Generate deployment name
- `getServiceName(userId)` - Generate service name
- `getPvcName(userId)` - Generate PVC name
- `getMcpEndpoint(userId, namespace)` - Compute internal service URL

### ContainerManager (`gateway/src/k8s/container-manager.ts`)

High-level orchestration:

- `ensureContainerRunning(userId, license)` - Main entry point
  - Returns: `{ mcpEndpoint, wasCreated }`
  - Creates deployment if missing
  - Waits for ready state
  - Returns endpoint URL
- `getContainerStatus(userId)` - Check status without creating
- `deleteContainer(userId)` - Manual cleanup

### Authenticator (`gateway/src/auth/authenticator.ts`)
Updated to call the container manager:

- `authenticateWebSocket()` - Calls `ensureContainerRunning()` before returning `AuthContext`
- `authenticateTelegram()` - Same for Telegram webhooks

### WebSocketHandler (`gateway/src/channels/websocket-handler.ts`)

Multi-phase connection protocol:

1. Send `{type: 'status', status: 'authenticating'}`
2. Authenticate (may take 30-120s if creating container)
3. Send `{type: 'status', status: 'initializing'}`
4. Initialize agent harness
5. Send `{type: 'connected', ...}`

This gives the client visibility into the startup process.

## Configuration

Environment variables:

```bash
# Kubernetes
KUBERNETES_NAMESPACE=dexorder-agents
KUBERNETES_IN_CLUSTER=true    # false for local dev
KUBERNETES_CONTEXT=minikube   # for local dev only

# Container images
AGENT_IMAGE=ghcr.io/dexorder/agent:latest
SIDECAR_IMAGE=ghcr.io/dexorder/lifecycle-sidecar:latest

# Storage
AGENT_STORAGE_CLASS=standard
```

## Security

The gateway uses a restricted ServiceAccount with RBAC:

**Can do:**

- ✅ Create deployments in `dexorder-agents` namespace
- ✅ Create services in `dexorder-agents` namespace
- ✅ Create PVCs in `dexorder-agents` namespace
- ✅ Read pod status and logs (debugging)
- ✅ Update deployments (future: resource scaling)

**Cannot do:**

- ❌ Delete deployments (handled by lifecycle sidecar)
- ❌ Delete PVCs (handled by lifecycle sidecar)
- ❌ Exec into pods
- ❌ Access secrets or configmaps
- ❌ Create resources in other namespaces
- ❌ Access Kubernetes API from agent containers (blocked by NetworkPolicy)

See `deploy/k8s/base/gateway-rbac.yaml` for full configuration.
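A client consuming the multi-phase connection protocol described above might map status frames to user-facing messages like this. The message shapes follow the protocol sketch; field names beyond `type` and `status` are illustrative assumptions:

```typescript
// Sketch of a client-side handler for the gateway's multi-phase
// connection protocol. Because container creation can take 30-120s,
// surfacing each phase keeps the user informed during startup.

type GatewayMessage =
  | { type: "status"; status: "authenticating" | "initializing" }
  | { type: "connected" }
  | { type: "error"; message: string };

function describeStartupPhase(msg: GatewayMessage): string {
  switch (msg.type) {
    case "status":
      return msg.status === "authenticating"
        ? "Authenticating..."
        : "Starting your workspace...";
    case "connected":
      return "Connected";
    case "error":
      return `Connection failed: ${msg.message}`;
  }
}
```

In a real client this function would feed a progress indicator attached to the WebSocket's `message` event, with the `authenticating` phase expected to dominate whenever a fresh container must be created.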
## Lifecycle

### Container Creation (Gateway)

- User authenticates
- Gateway checks if deployment exists
- If missing, creates from template
- Waits for ready (2min timeout)
- Returns MCP endpoint

### Container Deletion (Lifecycle Sidecar)

- Container tracks activity and triggers
- When idle (no triggers + timeout), exits with code 42
- Sidecar detects exit code 42
- Sidecar deletes deployment + optional PVC via k8s API
- Gateway creates a fresh container on next authentication

See `doc/container_lifecycle_management.md` for full lifecycle details.

## Error Handling

| Error | Gateway Action | User Experience |
|-------|----------------|-----------------|
| Deployment creation fails | Log error, return auth failure | "Authentication failed" |
| Wait timeout (image pull, etc.) | Log warning, return 503 | "Service unavailable, retry" |
| Service not found | Retry with backoff | Transparent retry |
| MCP connection fails | Return error | "Failed to connect to workspace" |
| Existing deployment not ready | Wait 30s, continue if still not ready | May connect to partially-ready container |

## Local Development

For local development (outside k8s):

1. Start minikube:

   ```bash
   minikube start
   minikube addons enable storage-provisioner
   ```

2. Apply security policies:

   ```bash
   kubectl apply -k deploy/k8s/dev
   ```

3. Configure the gateway for local k8s:

   ```bash
   # .env
   KUBERNETES_IN_CLUSTER=false
   KUBERNETES_CONTEXT=minikube
   KUBERNETES_NAMESPACE=dexorder-agents
   ```

4. Run the gateway:

   ```bash
   cd gateway
   npm run dev
   ```

5. Connect via WebSocket:

   ```bash
   wscat -c "ws://localhost:3000/ws/chat" -H "Authorization: Bearer your-jwt"
   ```

The gateway will create deployments in minikube. View with:

```bash
kubectl get deployments -n dexorder-agents
kubectl get pods -n dexorder-agents
kubectl logs -n dexorder-agents agent-user-abc123 -c agent
```

## Production Deployment

1. Build and push the gateway image:

   ```bash
   cd gateway
   docker build -t ghcr.io/dexorder/gateway:latest .
   docker push ghcr.io/dexorder/gateway:latest
   ```

2. Deploy to k8s:

   ```bash
   kubectl apply -k deploy/k8s/prod
   ```

3. Gateway runs in the `dexorder-system` namespace
4. Creates agent containers in the `dexorder-agents` namespace
5. Admission policies enforce the image allowlist and security constraints

## Monitoring

Useful metrics to track:

- Container creation latency (time from auth to ready)
- Container creation failure rate
- Active containers by license tier
- Resource usage per tier
- Idle shutdown rate

These can be exported via Prometheus or logged to a monitoring service.

## Future Enhancements

1. **Pre-warming**: Create containers for active users before they connect
2. **Image updates**: Handle agent image version migrations with user consent
3. **Multi-region**: Geo-distributed container placement
4. **Cost tracking**: Per-user resource usage and billing
5. **Auto-scaling**: Scale down to 0 replicas instead of deletion (faster restart)
6. **Container pools**: Shared warm containers for anonymous users
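A minimal in-process sketch of the first two monitoring metrics (creation latency and failure rate). The class name and aggregation choices are illustrative; a real deployment would export these through a Prometheus client rather than keep them in memory:

```typescript
// Illustrative in-memory tracker for container-creation metrics.
// A production gateway would register these with a Prometheus
// registry (e.g. a histogram plus a counter) instead.

class CreationMetrics {
  private latenciesMs: number[] = [];
  private failures = 0;
  private attempts = 0;

  // Record one successful creation with its auth-to-ready latency.
  recordSuccess(latencyMs: number): void {
    this.attempts++;
    this.latenciesMs.push(latencyMs);
  }

  recordFailure(): void {
    this.attempts++;
    this.failures++;
  }

  // Failure rate over all creation attempts (0 when none recorded).
  failureRate(): number {
    return this.attempts === 0 ? 0 : this.failures / this.attempts;
  }

  // Median creation latency; undefined until a success is recorded.
  medianLatencyMs(): number | undefined {
    if (this.latenciesMs.length === 0) return undefined;
    const sorted = [...this.latenciesMs].sort((a, b) => a - b);
    return sorted[Math.floor(sorted.length / 2)];
  }
}
```

Tracking latency from authentication to ready state (rather than from the k8s API call) captures the full delay a user actually experiences, including template rendering and the 2s readiness polling loop.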