# Gateway Container Creation

## Overview

The gateway automatically provisions user agent containers when users authenticate. This ensures each user has their own isolated environment running their MCP server with persistent storage.

## Authentication Flow with Container Creation

```
User connects (WebSocket/Telegram)
        ↓
Send "Authenticating..." status
        ↓
Verify token/channel link
        ↓
Lookup user license from DB
        ↓
Send "Starting workspace..." status
        ↓
┌────────────────────────────────────┐
│  ContainerManager.ensureRunning()  │
│  ┌──────────────────────────────┐  │
│  │ Check if deployment exists   │  │
│  └──────────────────────────────┘  │
│               ↓                    │
│        Does it exist?              │
│         ↙        ↘                 │
│       Yes         No               │
│        │           │               │
│        │  ┌──────────────────┐     │
│        │  │ Create deployment│     │
│        │  │ Create PVC       │     │
│        │  │ Create service   │     │
│        │  └──────────────────┘     │
│        │           │               │
│        └───────────┘               │
│               ↓                    │
│    Wait for deployment ready       │
│    (polls every 2s, timeout 2min)  │
│               ↓                    │
│    Compute MCP endpoint URL        │
│    (internal k8s service DNS)      │
└────────────────────────────────────┘
        ↓
Update license.mcpServerUrl
        ↓
Send "Connected" status
        ↓
Initialize AgentHarness
        ↓
Connect to user's MCP server
        ↓
Ready for messages
```

## Container Naming Convention

All resources follow a consistent naming pattern based on `userId`:

```typescript
userId: "user-abc123"
    ↓
deploymentName: "agent-user-abc123"
serviceName:    "agent-user-abc123"
pvcName:        "agent-user-abc123-data"
mcpEndpoint:    "http://agent-user-abc123.dexorder-agents.svc.cluster.local:3000"
```

User IDs are sanitized to be Kubernetes-compliant (lowercase alphanumeric + hyphens).

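The sanitization rule can be sketched as follows. This is an illustrative sketch, not the actual implementation; `sanitizeUserId` and `deploymentNameFor` are hypothetical names, and the 52-character cap is an assumption motivated by Kubernetes' 63-character limit for DNS labels.

```typescript
// Sketch: make a userId safe for Kubernetes resource names (RFC 1123 labels).
// Hypothetical helpers; the real code lives in gateway/src/k8s/client.ts.
function sanitizeUserId(userId: string): string {
  return userId
    .toLowerCase()
    .replace(/[^a-z0-9-]/g, "-") // replace disallowed characters with hyphens
    .replace(/^-+|-+$/g, "")     // names must start and end alphanumeric
    .slice(0, 52);               // leave room for "agent-" / "-data" affixes within 63 chars
}

function deploymentNameFor(userId: string): string {
  return `agent-${sanitizeUserId(userId)}`;
}
```
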
## Templates by License Tier

Templates are located in `gateway/src/k8s/templates/`:

- `free-tier.yaml`
- `pro-tier.yaml`
- `enterprise-tier.yaml`

### Variable Substitution

Templates use simple string replacement:

- `{{userId}}` - User ID
- `{{deploymentName}}` - Computed deployment name
- `{{serviceName}}` - Computed service name
- `{{pvcName}}` - Computed PVC name
- `{{agentImage}}` - Agent container image (from env)
- `{{sidecarImage}}` - Lifecycle sidecar image (from env)
- `{{storageClass}}` - Kubernetes storage class (from env)

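The simple string replacement described above can be sketched like this. `renderTemplate` is a hypothetical name, and throwing on a missing variable is an assumed policy; the real helper may behave differently.

```typescript
// Sketch of {{placeholder}} substitution over a YAML template string.
// Hypothetical helper; assumes unknown placeholders are an error.
function renderTemplate(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_match, key: string) => {
    const value = vars[key];
    if (value === undefined) {
      throw new Error(`Missing template variable: ${key}`);
    }
    return value;
  });
}
```
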
### Resource Limits

| Tier | Memory Request | Memory Limit | CPU Request | CPU Limit | Storage | Idle Timeout |
|------|----------------|--------------|-------------|-----------|---------|--------------|
| **Free** | 256Mi | 512Mi | 100m | 500m | 1Gi | 15min |
| **Pro** | 512Mi | 2Gi | 250m | 2000m | 10Gi | 60min |
| **Enterprise** | 1Gi | 4Gi | 500m | 4000m | 50Gi | Never (shutdown disabled) |

## Components

### KubernetesClient (`gateway/src/k8s/client.ts`)

Low-level k8s API wrapper:

- `deploymentExists(name)` - Check if a deployment exists
- `createAgentDeployment(spec)` - Create deployment/service/PVC from template
- `waitForDeploymentReady(name, timeout)` - Poll until ready
- `getServiceEndpoint(name)` - Get service URL
- `deleteAgentDeployment(userId)` - Cleanup (for testing)

Static helpers:

- `getDeploymentName(userId)` - Generate deployment name
- `getServiceName(userId)` - Generate service name
- `getPvcName(userId)` - Generate PVC name
- `getMcpEndpoint(userId, namespace)` - Compute internal service URL

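A sketch of the endpoint helper, derived from the naming convention section: the service is assumed to be named `agent-<userId>` and the MCP server assumed to listen on port 3000. The actual static methods may differ.

```typescript
// Sketch of the static naming helpers; follows the documented
// "agent-<userId>.<namespace>.svc.cluster.local:3000" pattern.
function getServiceName(userId: string): string {
  return `agent-${userId}`;
}

function getMcpEndpoint(userId: string, namespace: string): string {
  return `http://${getServiceName(userId)}.${namespace}.svc.cluster.local:3000`;
}
```
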
### ContainerManager (`gateway/src/k8s/container-manager.ts`)

High-level orchestration:

- `ensureContainerRunning(userId, license)` - Main entry point
  - Returns: `{ mcpEndpoint, wasCreated }`
  - Creates deployment if missing
  - Waits for ready state
  - Returns endpoint URL
- `getContainerStatus(userId)` - Check status without creating
- `deleteContainer(userId)` - Manual cleanup

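The orchestration above can be sketched as follows. This is a simplified sketch, not the actual implementation: the `K8sClient` interface mirrors the method names listed for `KubernetesClient`, the license argument is reduced to a tier string, and the 2-minute timeout comes from the flow diagram.

```typescript
// Sketch of ensureContainerRunning: create-if-missing, then wait for ready.
interface K8sClient {
  deploymentExists(name: string): Promise<boolean>;
  createAgentDeployment(spec: { userId: string; tier: string }): Promise<void>;
  waitForDeploymentReady(name: string, timeoutMs: number): Promise<void>;
  getMcpEndpoint(userId: string, namespace: string): string;
}

async function ensureContainerRunning(
  client: K8sClient,
  userId: string,
  tier: string,
  namespace = "dexorder-agents",
): Promise<{ mcpEndpoint: string; wasCreated: boolean }> {
  const name = `agent-${userId}`;
  const wasCreated = !(await client.deploymentExists(name));
  if (wasCreated) {
    await client.createAgentDeployment({ userId, tier });
  }
  // Wait in both paths: an existing deployment may still be starting up.
  await client.waitForDeploymentReady(name, 120_000);
  return { mcpEndpoint: client.getMcpEndpoint(userId, namespace), wasCreated };
}
```
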
### Authenticator (`gateway/src/auth/authenticator.ts`)

Updated to call the container manager:

- `authenticateWebSocket()` - Calls `ensureContainerRunning()` before returning `AuthContext`
- `authenticateTelegram()` - Same for Telegram webhooks

### WebSocketHandler (`gateway/src/channels/websocket-handler.ts`)

Multi-phase connection protocol:

1. Send `{type: 'status', status: 'authenticating'}`
2. Authenticate (may take 30-120s if a container is being created)
3. Send `{type: 'status', status: 'initializing'}`
4. Initialize agent harness
5. Send `{type: 'connected', ...}`

This gives the client visibility into the startup process.

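On the client side, these messages can drive a simple progress display. A minimal sketch, assuming the message shapes documented above; the full `connected` payload and any additional statuses are not shown here.

```typescript
// Sketch of client-side handling of the gateway status protocol.
// Message union is assumed from the documented payloads.
type GatewayMessage =
  | { type: "status"; status: "authenticating" | "initializing" }
  | { type: "connected" };

function describeStartupPhase(msg: GatewayMessage): string {
  switch (msg.type) {
    case "status":
      return msg.status === "authenticating"
        ? "Authenticating..."
        : "Starting workspace...";
    case "connected":
      return "Ready";
  }
}
```
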
## Configuration

Environment variables:

```bash
# Kubernetes
KUBERNETES_NAMESPACE=dexorder-agents
KUBERNETES_IN_CLUSTER=true   # false for local dev
KUBERNETES_CONTEXT=minikube  # for local dev only

# Container images
AGENT_IMAGE=ghcr.io/dexorder/agent:latest
SIDECAR_IMAGE=ghcr.io/dexorder/lifecycle-sidecar:latest

# Storage
AGENT_STORAGE_CLASS=standard
```

## Security

The gateway uses a restricted ServiceAccount with RBAC:

**Can do:**

- ✅ Create deployments in the `dexorder-agents` namespace
- ✅ Create services in the `dexorder-agents` namespace
- ✅ Create PVCs in the `dexorder-agents` namespace
- ✅ Read pod status and logs (debugging)
- ✅ Update deployments (future: resource scaling)

**Cannot do:**

- ❌ Delete deployments (handled by the lifecycle sidecar)
- ❌ Delete PVCs (handled by the lifecycle sidecar)
- ❌ Exec into pods
- ❌ Access secrets or configmaps
- ❌ Create resources in other namespaces
- ❌ Access the Kubernetes API from agent containers (blocked by NetworkPolicy)

See `deploy/k8s/base/gateway-rbac.yaml` for the full configuration.

## Lifecycle

### Container Creation (Gateway)

- User authenticates
- Gateway checks if the deployment exists
- If missing, creates it from the template
- Waits for ready (2min timeout)
- Returns the MCP endpoint

### Container Deletion (Lifecycle Sidecar)

- The container tracks activity and triggers
- When idle (no triggers and the timeout has elapsed), it exits with code 42
- The sidecar detects exit code 42
- The sidecar deletes the deployment (and optionally the PVC) via the k8s API
- The gateway creates a fresh container on the next authentication

See `doc/container_lifecycle_management.md` for full lifecycle details.

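The idle-shutdown decision can be sketched as below. The per-tier timeouts come from the Resource Limits table; the map and function names are assumptions, and the actual trigger-tracking logic in the agent is more involved.

```typescript
// Sketch of the idle-shutdown check; exit code 42 is the documented
// "idle, please tear me down" signal to the lifecycle sidecar.
const IDLE_TIMEOUT_MS: Record<string, number | null> = {
  free: 15 * 60_000,       // 15min
  pro: 60 * 60_000,        // 60min
  enterprise: null,        // shutdown disabled
};

function shouldShutDown(tier: string, lastActivityMs: number, nowMs: number): boolean {
  const timeout = IDLE_TIMEOUT_MS[tier];
  if (timeout === null || timeout === undefined) return false;
  return nowMs - lastActivityMs >= timeout;
}

// When shouldShutDown(...) returns true, the agent would call
// process.exit(42), which the sidecar translates into deleting the deployment.
```
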
## Error Handling

| Error | Gateway Action | User Experience |
|-------|----------------|-----------------|
| Deployment creation fails | Log error, return auth failure | "Authentication failed" |
| Wait timeout (image pull, etc.) | Log warning, return 503 | "Service unavailable, retry" |
| Service not found | Retry with backoff | Transparent retry |
| MCP connection fails | Return error | "Failed to connect to workspace" |
| Existing deployment not ready | Wait 30s, continue if still not ready | May connect to partially-ready container |

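The "retry with backoff" action in the table can be sketched generically. The attempt count and delays here are illustrative assumptions, not the gateway's actual tuning.

```typescript
// Sketch of retry with exponential backoff for transient errors
// (e.g. "service not found" shortly after creation).
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  attempts = 5,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // exponential backoff: baseDelay, 2x, 4x, ...
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
  throw lastError;
}
```
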
## Local Development

For local development (outside k8s):

1. Start minikube:
```bash
minikube start
minikube addons enable storage-provisioner
```

2. Apply security policies:
```bash
kubectl apply -k deploy/k8s/dev
```

3. Configure the gateway for local k8s:
```bash
# .env
KUBERNETES_IN_CLUSTER=false
KUBERNETES_CONTEXT=minikube
KUBERNETES_NAMESPACE=dexorder-agents
```

4. Run the gateway:
```bash
cd gateway
npm run dev
```

5. Connect via WebSocket:
```bash
wscat -c "ws://localhost:3000/ws/chat" -H "Authorization: Bearer your-jwt"
```

The gateway will create deployments in minikube. View them with:

```bash
kubectl get deployments -n dexorder-agents
kubectl get pods -n dexorder-agents
kubectl logs -n dexorder-agents deployment/agent-user-abc123 -c agent
```

## Production Deployment

1. Build and push the gateway image:
```bash
cd gateway
docker build -t ghcr.io/dexorder/gateway:latest .
docker push ghcr.io/dexorder/gateway:latest
```

2. Deploy to k8s:
```bash
kubectl apply -k deploy/k8s/prod
```

3. The gateway runs in the `dexorder-system` namespace.
4. It creates agent containers in the `dexorder-agents` namespace.
5. Admission policies enforce an image allowlist and security constraints.

## Monitoring

Useful metrics to track:

- Container creation latency (time from auth to ready)
- Container creation failure rate
- Active containers by license tier
- Resource usage per tier
- Idle shutdown rate

These can be exported via Prometheus or logged to a monitoring service.

## Future Enhancements

1. **Pre-warming**: Create containers for active users before they connect
2. **Image updates**: Handle agent image version migrations with user consent
3. **Multi-region**: Geo-distributed container placement
4. **Cost tracking**: Per-user resource usage and billing
5. **Auto-scaling**: Scale down to 0 replicas instead of deletion (faster restart)
6. **Container pools**: Shared warm containers for anonymous users