# Gateway Container Creation

## Overview

The gateway automatically provisions user agent containers when users authenticate. This ensures each user has their own isolated environment running their MCP server with persistent storage.

## Authentication Flow with Container Creation

```
User connects (WebSocket/Telegram)
                  ↓
Send "Authenticating..." status
                  ↓
Verify token/channel link
                  ↓
Lookup user license from DB
                  ↓
Send "Starting workspace..." status
                  ↓
┌────────────────────────────────────────────┐
│ ContainerManager.ensureContainerRunning()  │
│ ┌──────────────────────────────┐           │
│ │ Check if deployment exists   │           │
│ └──────────────────────────────┘           │
│                 ↓                          │
│          Does it exist?                    │
│            ↙          ↘                    │
│          Yes          No                   │
│           │            │                   │
│           │   ┌──────────────────┐         │
│           │   │ Create deployment│         │
│           │   │ Create PVC       │         │
│           │   │ Create service   │         │
│           │   └──────────────────┘         │
│           │            │                   │
│           └────────────┘                   │
│                 ↓                          │
│     Wait for deployment ready              │
│  (polls every 2s, timeout 2min)            │
│                 ↓                          │
│     Compute MCP endpoint URL               │
│    (internal k8s service DNS)              │
└────────────────────────────────────────────┘
                  ↓
Update license.mcpServerUrl
                  ↓
Send "Connected" status
                  ↓
Initialize AgentHarness
                  ↓
Connect to user's MCP server
                  ↓
Ready for messages
```

## Container Naming Convention

All resources follow a consistent naming pattern based on `userId`:

```
userId:         "user-abc123"
                     ↓
deploymentName: "agent-user-abc123"
serviceName:    "agent-user-abc123"
pvcName:        "agent-user-abc123-data"
mcpEndpoint:    "http://agent-user-abc123.dexorder-agents.svc.cluster.local:3000"
```

User IDs are sanitized to be Kubernetes-compliant (lowercase alphanumeric characters and hyphens).

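
The naming rules above can be sketched as a pair of small functions. This is a minimal illustration, not the gateway's actual code: `sanitizeUserId` and `agentResourceNames` are hypothetical names, and the exact sanitization rules (replacing disallowed characters with hyphens, trimming hyphens at the ends) are assumptions. The 63-character check reflects the Kubernetes limit on Service names, which must be valid DNS labels.

```typescript
// Hypothetical sanitizer: the source only states "lowercase alphanumeric + hyphens".
function sanitizeUserId(userId: string): string {
  const cleaned = userId
    .toLowerCase()
    .replace(/[^a-z0-9-]+/g, "-") // replace runs of disallowed characters
    .replace(/-{2,}/g, "-") // collapse repeated hyphens
    .replace(/^-+|-+$/g, ""); // trim hyphens at either end
  if (cleaned.length === 0) {
    throw new Error(`User ID has no usable characters: ${userId}`);
  }
  return cleaned;
}

interface AgentResourceNames {
  deploymentName: string;
  serviceName: string;
  pvcName: string;
  mcpEndpoint: string;
}

// Derives every resource name from the user ID, following the pattern above.
// The namespace and port defaults are taken from the example in this document.
function agentResourceNames(
  userId: string,
  namespace: string = "dexorder-agents",
  port: number = 3000,
): AgentResourceNames {
  const base = `agent-${sanitizeUserId(userId)}`;
  if (base.length > 63) {
    // Service names are DNS labels, which Kubernetes limits to 63 characters.
    throw new Error(`Resource name too long (${base.length} > 63): ${base}`);
  }
  return {
    deploymentName: base,
    serviceName: base,
    pvcName: `${base}-data`,
    mcpEndpoint: `http://${base}.${namespace}.svc.cluster.local:${port}`,
  };
}
```

Note that any lossy sanitization can map two different user IDs to the same name; a real implementation would need to rule that out, for example by only issuing IDs that are already compliant.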
## Templates by License Tier

Templates are located in `gateway/src/k8s/templates/`:
- `free-tier.yaml`
- `pro-tier.yaml`
- `enterprise-tier.yaml`

### Variable Substitution

Templates use simple string replacement:
- `{{userId}}` - User ID
- `{{deploymentName}}` - Computed deployment name
- `{{serviceName}}` - Computed service name
- `{{pvcName}}` - Computed PVC name
- `{{agentImage}}` - Agent container image (from env)
- `{{sidecarImage}}` - Lifecycle sidecar image (from env)
- `{{storageClass}}` - Kubernetes storage class (from env)

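
As a rough sketch of what "simple string replacement" could look like, the function below substitutes `{{name}}` placeholders from a plain object. `renderTemplate` is a hypothetical name, and failing on an unknown placeholder is an assumed design choice rather than documented behavior.

```typescript
// Replaces every {{name}} placeholder with the matching entry in `vars`.
// Throws on an unknown placeholder so a typo cannot reach the cluster as
// a literal "{{...}}" string.
function renderTemplate(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_match: string, key: string) => {
    if (!Object.prototype.hasOwnProperty.call(vars, key)) {
      throw new Error(`Missing template variable: ${key}`);
    }
    return vars[key];
  });
}

// Example with a two-line fragment (not the real template contents).
const rendered = renderTemplate(
  "name: {{deploymentName}}\nimage: {{agentImage}}",
  {
    deploymentName: "agent-user-abc123",
    agentImage: "ghcr.io/dexorder/agent:latest",
  },
);
```

Plain substitution performs no YAML escaping, so it relies on every value being safe to embed, which is one reason the user ID is sanitized first.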
### Resource Limits

| Tier | Memory Request | Memory Limit | CPU Request | CPU Limit | Storage | Idle Timeout |
|------|----------------|--------------|-------------|-----------|---------|--------------|
| **Free** | 256Mi | 512Mi | 100m | 500m | 1Gi | 15min |
| **Pro** | 512Mi | 2Gi | 250m | 2000m | 10Gi | 60min |
| **Enterprise** | 1Gi | 4Gi | 500m | 4000m | 50Gi | Never (shutdown disabled) |

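
For code that needs these values outside the YAML templates (for example, to pick a template file or report limits to a client), the table could be mirrored as a typed constant. The type and constant names below are hypothetical; the values are copied from the table, and `null` is an assumed encoding for "never".

```typescript
type LicenseTier = "free" | "pro" | "enterprise";

interface TierLimits {
  template: string; // file name under gateway/src/k8s/templates/
  memoryRequest: string;
  memoryLimit: string;
  cpuRequest: string;
  cpuLimit: string;
  storage: string;
  idleTimeoutMinutes: number | null; // null = idle shutdown disabled
}

const TIER_LIMITS: Record<LicenseTier, TierLimits> = {
  free: {
    template: "free-tier.yaml",
    memoryRequest: "256Mi", memoryLimit: "512Mi",
    cpuRequest: "100m", cpuLimit: "500m",
    storage: "1Gi", idleTimeoutMinutes: 15,
  },
  pro: {
    template: "pro-tier.yaml",
    memoryRequest: "512Mi", memoryLimit: "2Gi",
    cpuRequest: "250m", cpuLimit: "2000m",
    storage: "10Gi", idleTimeoutMinutes: 60,
  },
  enterprise: {
    template: "enterprise-tier.yaml",
    memoryRequest: "1Gi", memoryLimit: "4Gi",
    cpuRequest: "500m", cpuLimit: "4000m",
    storage: "50Gi", idleTimeoutMinutes: null,
  },
};
```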
## Components

### KubernetesClient (`gateway/src/k8s/client.ts`)

Low-level k8s API wrapper:
- `deploymentExists(name)` - Check if deployment exists
- `createAgentDeployment(spec)` - Create deployment/service/PVC from template
- `waitForDeploymentReady(name, timeout)` - Poll until ready
- `getServiceEndpoint(name)` - Get service URL
- `deleteAgentDeployment(userId)` - Cleanup (for testing; the gateway's RBAC role described under Security does not grant delete)

Static helpers:
- `getDeploymentName(userId)` - Generate deployment name
- `getServiceName(userId)` - Generate service name
- `getPvcName(userId)` - Generate PVC name
- `getMcpEndpoint(userId, namespace)` - Compute internal service URL

### ContainerManager (`gateway/src/k8s/container-manager.ts`)

High-level orchestration:
- `ensureContainerRunning(userId, license)` - Main entry point
  - Returns: `{ mcpEndpoint, wasCreated }`
  - Creates deployment if missing
  - Waits for ready state
  - Returns endpoint URL
- `getContainerStatus(userId)` - Check status without creating
- `deleteContainer(userId)` - Manual cleanup

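
The orchestration described above might be sketched as follows. This is an illustration under stated assumptions, not the real `ContainerManager`: the `K8sOps` interface is a hypothetical slice of the client, `isDeploymentReady` is an assumed single-shot readiness check, and the user ID is assumed to be sanitized already. The 2-second poll, 2-minute timeout, and 30-second grace period for an existing deployment come from this document.

```typescript
// Hypothetical subset of the KubernetesClient surface used by this sketch.
interface K8sOps {
  deploymentExists(name: string): Promise<boolean>;
  createAgentDeployment(spec: { userId: string; tier: string }): Promise<void>;
  isDeploymentReady(name: string): Promise<boolean>; // assumed helper
}

interface EnsureResult {
  mcpEndpoint: string;
  wasCreated: boolean;
}

type Pause = (ms: number) => Promise<void>;
const sleep: Pause = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls until the deployment is ready or the timeout is used up.
// Elapsed time is approximated by the sum of the pauses, which keeps the
// loop deterministic when `pause` is stubbed out.
async function waitForReady(
  k8s: K8sOps, name: string, timeoutMs: number, pollMs: number, pause: Pause,
): Promise<boolean> {
  let waited = 0;
  while (true) {
    if (await k8s.isDeploymentReady(name)) return true;
    if (waited >= timeoutMs) return false;
    await pause(pollMs);
    waited += pollMs;
  }
}

async function ensureContainerRunning(
  k8s: K8sOps,
  userId: string,
  tier: string,
  namespace: string = "dexorder-agents",
  pause: Pause = sleep,
): Promise<EnsureResult> {
  const name = `agent-${userId}`;
  const wasCreated = !(await k8s.deploymentExists(name));
  if (wasCreated) {
    await k8s.createAgentDeployment({ userId, tier });
    // New deployment: poll every 2s, give up after 2 minutes.
    const ready = await waitForReady(k8s, name, 120000, 2000, pause);
    if (!ready) throw new Error(`Timed out waiting for deployment ${name}`);
  } else {
    // Existing deployment: wait up to 30s, then continue regardless.
    await waitForReady(k8s, name, 30000, 2000, pause);
  }
  return {
    mcpEndpoint: `http://${name}.${namespace}.svc.cluster.local:3000`,
    wasCreated,
  };
}
```

Passing `pause` as a parameter is only a testing convenience; production code would use the default `sleep`.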
### Authenticator (`gateway/src/auth/authenticator.ts`)

Calls the container manager during authentication:
- `authenticateWebSocket()` - Calls `ensureContainerRunning()` before returning `AuthContext`
- `authenticateTelegram()` - Same for Telegram webhooks

### WebSocketHandler (`gateway/src/channels/websocket-handler.ts`)

Multi-phase connection protocol:
1. Send `{type: 'status', status: 'authenticating'}`
2. Authenticate (may take 30-120s if creating container)
3. Send `{type: 'status', status: 'initializing'}`
4. Initialize agent harness
5. Send `{type: 'connected', ...}`

This gives the client visibility into the startup process.

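
The five phases above can be sketched as a single function that takes its dependencies as parameters. Everything here is illustrative: `runConnectionHandshake`, the `userId` field on the `connected` message, and the `error` message type are assumptions, since the document does not specify the remaining fields or the failure message.

```typescript
type ServerMessage =
  | { type: "status"; status: "authenticating" | "initializing" }
  | { type: "connected"; userId: string } // extra fields are not specified in this document
  | { type: "error"; message: string }; // hypothetical failure message

interface AuthResult {
  userId: string;
  mcpEndpoint: string;
}

// Runs the phases in order, reporting each one to the client before the
// slow step that follows it. Returns false if any phase fails.
async function runConnectionHandshake(
  send: (msg: ServerMessage) => void,
  authenticate: () => Promise<AuthResult>,
  initHarness: (auth: AuthResult) => Promise<void>,
): Promise<boolean> {
  send({ type: "status", status: "authenticating" });
  try {
    const auth = await authenticate(); // may take 30-120s when a container is created
    send({ type: "status", status: "initializing" });
    await initHarness(auth);
    send({ type: "connected", userId: auth.userId });
    return true;
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    send({ type: "error", message });
    return false;
  }
}
```

Because each status is sent before its slow step, a client can show progress during the long container-creation wait instead of an unexplained pause.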
## Configuration

Environment variables:

```bash
# Kubernetes
KUBERNETES_NAMESPACE=dexorder-agents
KUBERNETES_IN_CLUSTER=true  # false for local dev
KUBERNETES_CONTEXT=minikube  # for local dev only

# Container images
AGENT_IMAGE=ghcr.io/dexorder/agent:latest
SIDECAR_IMAGE=ghcr.io/dexorder/lifecycle-sidecar:latest

# Storage
AGENT_STORAGE_CLASS=standard
```

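
One way the gateway could read these variables is a small loader that validates them once at startup. `loadK8sConfig` and `K8sConfig` are hypothetical names, and the choice of which variables are required versus defaulted is an assumption; the defaults shown simply reuse the example values above.

```typescript
interface K8sConfig {
  namespace: string;
  inCluster: boolean;
  context: string | undefined; // only meaningful for local development
  agentImage: string;
  sidecarImage: string;
  storageClass: string;
}

// Takes the environment as a plain object (e.g. process.env in Node) so the
// function itself has no runtime dependencies.
function loadK8sConfig(env: Record<string, string | undefined>): K8sConfig {
  const required = (key: string): string => {
    const value = env[key];
    if (value === undefined || value === "") {
      throw new Error(`Missing required environment variable: ${key}`);
    }
    return value;
  };
  return {
    namespace: env.KUBERNETES_NAMESPACE ?? "dexorder-agents", // assumed default
    inCluster: (env.KUBERNETES_IN_CLUSTER ?? "true") === "true", // assumed default
    context: env.KUBERNETES_CONTEXT,
    agentImage: required("AGENT_IMAGE"),
    sidecarImage: required("SIDECAR_IMAGE"),
    storageClass: env.AGENT_STORAGE_CLASS ?? "standard", // assumed default
  };
}
```

Failing fast on a missing image variable surfaces misconfiguration at startup rather than on the first user's authentication.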
## Security

The gateway uses a restricted ServiceAccount with RBAC:

**Can do:**
- ✅ Create deployments in `dexorder-agents` namespace
- ✅ Create services in `dexorder-agents` namespace
- ✅ Create PVCs in `dexorder-agents` namespace
- ✅ Read pod status and logs (debugging)
- ✅ Update deployments (future: resource scaling)

**Cannot do:**
- ❌ Delete deployments (handled by lifecycle sidecar)
- ❌ Delete PVCs (handled by lifecycle sidecar)
- ❌ Exec into pods
- ❌ Access secrets or configmaps
- ❌ Create resources in other namespaces
- ❌ Access Kubernetes API from agent containers (blocked by NetworkPolicy)

See `deploy/k8s/base/gateway-rbac.yaml` for full configuration.

## Lifecycle

### Container Creation (Gateway)
- User authenticates
- Gateway checks if deployment exists
- If missing, creates it from the tier template
- Waits for ready (2min timeout)
- Returns MCP endpoint

### Container Deletion (Lifecycle Sidecar)
- The agent container tracks activity and triggers
- When idle (no triggers and the idle timeout has elapsed), it exits with code 42
- Sidecar detects exit code 42
- Sidecar deletes the deployment and, optionally, the PVC via the k8s API
- Gateway creates a fresh container on next authentication

See `doc/container_lifecycle_management.md` for full lifecycle details.

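
The idle rule in the deletion steps (no triggers, timeout elapsed, exit with code 42) can be illustrated with a small tracker. This is a sketch only: `IdleTracker` and its method names are hypothetical, it is written in TypeScript for consistency with the gateway code even though this document does not state the agent's implementation language, and `null` is an assumed way to represent the Enterprise tier's disabled shutdown.

```typescript
const IDLE_EXIT_CODE = 42; // exit code the sidecar watches for, per this document

class IdleTracker {
  private lastActivityMs: number;
  private activeTriggers = 0;
  private readonly idleTimeoutMs: number | null;

  // idleTimeoutMs = null disables idle shutdown entirely.
  constructor(idleTimeoutMs: number | null, nowMs: number) {
    this.idleTimeoutMs = idleTimeoutMs;
    this.lastActivityMs = nowMs;
  }

  recordActivity(nowMs: number): void {
    this.lastActivityMs = nowMs;
  }

  addTrigger(): void {
    this.activeTriggers += 1;
  }

  removeTrigger(): void {
    this.activeTriggers = Math.max(0, this.activeTriggers - 1);
  }

  // True only when shutdown is enabled, no triggers are registered,
  // and the idle timeout has fully elapsed since the last activity.
  shouldExit(nowMs: number): boolean {
    if (this.idleTimeoutMs === null) return false;
    if (this.activeTriggers > 0) return false;
    return nowMs - this.lastActivityMs >= this.idleTimeoutMs;
  }
}
```

A process using this would check `shouldExit(Date.now())` periodically and exit with `IDLE_EXIT_CODE` when it returns true, leaving the actual deletion to the sidecar.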
## Error Handling

| Error | Gateway Action | User Experience |
|-------|----------------|-----------------|
| Deployment creation fails | Log error, return auth failure | "Authentication failed" |
| Wait timeout (image pull, etc.) | Log warning, return 503 | "Service unavailable, retry" |
| Service not found | Retry with backoff | Transparent retry |
| MCP connection fails | Return error | "Failed to connect to workspace" |
| Existing deployment not ready | Wait 30s, continue if still not ready | May connect to partially-ready container |

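
The "retry with backoff" action in the table could be implemented with a generic helper like the one below. It is a sketch under assumptions: `retryWithBackoff` is a hypothetical name, and the attempt count, base delay, and doubling schedule are illustrative values that this document does not specify.

```typescript
type PauseFn = (ms: number) => Promise<void>;
const defaultPause: PauseFn = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retries `op` until it succeeds or `maxAttempts` is reached, doubling the
// delay after each failure (baseDelayMs, 2x, 4x, ...). The last error is
// rethrown if every attempt fails.
async function retryWithBackoff<T>(
  op: () => Promise<T>,
  maxAttempts: number = 5,
  baseDelayMs: number = 500,
  pause: PauseFn = defaultPause,
): Promise<T> {
  let lastError: unknown = new Error("retryWithBackoff: maxAttempts must be >= 1");
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      const isLastAttempt = attempt === maxAttempts - 1;
      if (!isLastAttempt) {
        await pause(baseDelayMs * Math.pow(2, attempt));
      }
    }
  }
  throw lastError;
}
```

Wrapping a service lookup in such a helper keeps the retry invisible to the user, matching the "Transparent retry" entry in the table.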
## Local Development

For local development (outside k8s):

1. Start minikube:
   ```bash
   minikube start
   minikube addons enable storage-provisioner
   ```

2. Apply security policies:
   ```bash
   kubectl apply -k deploy/k8s/dev
   ```

3. Configure gateway for local k8s:
   ```bash
   # .env
   KUBERNETES_IN_CLUSTER=false
   KUBERNETES_CONTEXT=minikube
   KUBERNETES_NAMESPACE=dexorder-agents
   ```

4. Run gateway:
   ```bash
   cd gateway
   npm run dev
   ```

5. Connect via WebSocket:
   ```bash
   wscat -c "ws://localhost:3000/ws/chat" -H "Authorization: Bearer your-jwt"
   ```

The gateway will create deployments in minikube. View them with:
```bash
kubectl get deployments -n dexorder-agents
kubectl get pods -n dexorder-agents
# agent-user-abc123 is a deployment name, so address it as deployment/<name>
kubectl logs -n dexorder-agents deployment/agent-user-abc123 -c agent
```

## Production Deployment

1. Build and push gateway image:
   ```bash
   cd gateway
   docker build -t ghcr.io/dexorder/gateway:latest .
   docker push ghcr.io/dexorder/gateway:latest
   ```

2. Deploy to k8s:
   ```bash
   kubectl apply -k deploy/k8s/prod
   ```

3. Gateway runs in `dexorder-system` namespace
4. Creates agent containers in `dexorder-agents` namespace
5. Admission policies enforce image allowlist and security constraints

## Monitoring

Useful metrics to track:
- Container creation latency (time from auth to ready)
- Container creation failure rate
- Active containers by license tier
- Resource usage per tier
- Idle shutdown rate

These can be exported via Prometheus or logged to a monitoring service.

## Future Enhancements

1. **Pre-warming**: Create containers for active users before they connect
2. **Image updates**: Handle agent image version migrations with user consent
3. **Multi-region**: Geo-distributed container placement
4. **Cost tracking**: Per-user resource usage and billing
5. **Auto-scaling**: Scale down to 0 replicas instead of deletion (faster restart)
6. **Container pools**: Shared warm containers for anonymous users