# Gateway Container Creation

## Overview

The gateway automatically provisions user agent containers when users authenticate. This ensures each user has their own isolated environment running their MCP server with persistent storage.

## Authentication Flow with Container Creation

```
User connects (WebSocket/Telegram)
  ↓
Send "Authenticating..." status
  ↓
Verify token/channel link
  ↓
Lookup user license from DB
  ↓
Send "Starting workspace..." status
  ↓
┌────────────────────────────────────────┐
│ ContainerManager.ensureRunning()       │
│                                        │
│  Check if deployment exists            │
│    ↓                                   │
│  Does it exist?                        │
│    ↙ Yes: skip creation                │
│    ↘ No:  create deployment            │
│           create PVC                   │
│           create service               │
│    ↓                                   │
│  Wait for deployment ready             │
│  (polls every 2s, timeout 2min)        │
│    ↓                                   │
│  Compute MCP endpoint URL              │
│  (internal k8s service DNS)            │
└────────────────────────────────────────┘
  ↓
Update license.mcpServerUrl
  ↓
Send "Connected" status
  ↓
Initialize AgentHarness
  ↓
Connect to user's MCP server
  ↓
Ready for messages
```

## Container Naming Convention

All resources follow a consistent naming pattern based on `userId`:

```typescript
userId: "user-abc123"
  ↓
deploymentName: "agent-user-abc123"
serviceName:    "agent-user-abc123"
pvcName:        "agent-user-abc123-data"
mcpEndpoint:    "http://agent-user-abc123.dexorder-agents.svc.cluster.local:3000"
```

User IDs are sanitized to be Kubernetes-compliant (lowercase alphanumeric + hyphens).
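A minimal sketch of how this mapping might be implemented. The helper names mirror the static helpers on `KubernetesClient`; the sanitization details (hyphen substitution, trimming) are illustrative assumptions rather than the exact implementation:

```typescript
// Sketch of the userId → resource-name mapping described above.
// Assumes sanitization lowercases and replaces disallowed characters
// with hyphens to satisfy Kubernetes DNS-1123 label rules.

function sanitizeUserId(userId: string): string {
  return userId
    .toLowerCase()
    .replace(/[^a-z0-9-]/g, "-") // keep lowercase alphanumerics + hyphens
    .replace(/^-+|-+$/g, "");    // names must start/end alphanumeric
}

function getDeploymentName(userId: string): string {
  return `agent-${sanitizeUserId(userId)}`;
}

function getServiceName(userId: string): string {
  return getDeploymentName(userId); // service shares the deployment name
}

function getPvcName(userId: string): string {
  return `${getDeploymentName(userId)}-data`;
}

function getMcpEndpoint(userId: string, namespace = "dexorder-agents"): string {
  // Internal cluster DNS: <service>.<namespace>.svc.cluster.local
  return `http://${getServiceName(userId)}.${namespace}.svc.cluster.local:3000`;
}
```

Deriving every resource name from a single sanitized `userId` keeps lookups stateless: any component can recompute a user's deployment, service, or PVC name without a registry.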
## Templates by License Tier

Templates are located in `gateway/src/k8s/templates/`:

- `free-tier.yaml`
- `pro-tier.yaml`
- `enterprise-tier.yaml`

### Variable Substitution

Templates use simple string replacement:

- `{{userId}}` - User ID
- `{{deploymentName}}` - Computed deployment name
- `{{serviceName}}` - Computed service name
- `{{pvcName}}` - Computed PVC name
- `{{agentImage}}` - Agent container image (from env)
- `{{sidecarImage}}` - Lifecycle sidecar image (from env)
- `{{storageClass}}` - Kubernetes storage class (from env)

### Resource Limits

| Tier | Memory Request | Memory Limit | CPU Request | CPU Limit | Storage | Idle Timeout |
|------|----------------|--------------|-------------|-----------|---------|--------------|
| **Free** | 256Mi | 512Mi | 100m | 500m | 1Gi | 15min |
| **Pro** | 512Mi | 2Gi | 250m | 2000m | 10Gi | 60min |
| **Enterprise** | 1Gi | 4Gi | 500m | 4000m | 50Gi | Never (shutdown disabled) |

## Components

### KubernetesClient (`gateway/src/k8s/client.ts`)

Low-level k8s API wrapper:

- `deploymentExists(name)` - Check if deployment exists
- `createAgentDeployment(spec)` - Create deployment/service/PVC from template
- `waitForDeploymentReady(name, timeout)` - Poll until ready
- `getServiceEndpoint(name)` - Get service URL
- `deleteAgentDeployment(userId)` - Cleanup (for testing)

Static helpers:

- `getDeploymentName(userId)` - Generate deployment name
- `getServiceName(userId)` - Generate service name
- `getPvcName(userId)` - Generate PVC name
- `getMcpEndpoint(userId, namespace)` - Compute internal service URL

### ContainerManager (`gateway/src/k8s/container-manager.ts`)

High-level orchestration:

- `ensureContainerRunning(userId, license)` - Main entry point
  - Returns: `{ mcpEndpoint, wasCreated }`
  - Creates deployment if missing
  - Waits for ready state
  - Returns endpoint URL
- `getContainerStatus(userId)` - Check status without creating
- `deleteContainer(userId)` - Manual cleanup

### Authenticator (`gateway/src/auth/authenticator.ts`)
Updated to call the container manager:

- `authenticateWebSocket()` - Calls `ensureContainerRunning()` before returning `AuthContext`
- `authenticateTelegram()` - Same for Telegram webhooks

### WebSocketHandler (`gateway/src/channels/websocket-handler.ts`)

Multi-phase connection protocol:

1. Send `{type: 'status', status: 'authenticating'}`
2. Authenticate (may take 30-120s if creating container)
3. Send `{type: 'status', status: 'initializing'}`
4. Initialize agent harness
5. Send `{type: 'connected', ...}`

This gives the client visibility into the startup process.

## Configuration

Environment variables:

```bash
# Kubernetes
KUBERNETES_NAMESPACE=dexorder-agents
KUBERNETES_IN_CLUSTER=true    # false for local dev
KUBERNETES_CONTEXT=minikube   # for local dev only

# Container images
AGENT_IMAGE=ghcr.io/dexorder/agent:latest
SIDECAR_IMAGE=ghcr.io/dexorder/lifecycle-sidecar:latest

# Storage
AGENT_STORAGE_CLASS=standard
```

## Security

The gateway uses a restricted ServiceAccount with RBAC:

**Can do:**

- ✅ Create deployments in `dexorder-agents` namespace
- ✅ Create services in `dexorder-agents` namespace
- ✅ Create PVCs in `dexorder-agents` namespace
- ✅ Read pod status and logs (debugging)
- ✅ Update deployments (future: resource scaling)

**Cannot do:**

- ❌ Delete deployments (handled by lifecycle sidecar)
- ❌ Delete PVCs (handled by lifecycle sidecar)
- ❌ Exec into pods
- ❌ Access secrets or configmaps
- ❌ Create resources in other namespaces
- ❌ Access Kubernetes API from agent containers (blocked by NetworkPolicy)

See `deploy/k8s/base/gateway-rbac.yaml` for full configuration.
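A client consuming the multi-phase connection protocol described above might map status frames to user-facing messages like this. The message shapes follow the protocol sketch; field names beyond `type` and `status` are illustrative assumptions:

```typescript
// Sketch of a client-side handler for the gateway's multi-phase
// connection protocol. Because container creation can take 30-120s,
// surfacing each phase keeps the user informed during startup.

type GatewayMessage =
  | { type: "status"; status: "authenticating" | "initializing" }
  | { type: "connected" }
  | { type: "error"; message: string };

function describeStartupPhase(msg: GatewayMessage): string {
  switch (msg.type) {
    case "status":
      return msg.status === "authenticating"
        ? "Authenticating..."
        : "Starting your workspace...";
    case "connected":
      return "Connected";
    case "error":
      return `Connection failed: ${msg.message}`;
  }
}
```

In a real client this function would feed a progress indicator attached to the WebSocket's `message` event, with the `authenticating` phase expected to dominate whenever a fresh container must be created.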
## Lifecycle

### Container Creation (Gateway)

- User authenticates
- Gateway checks if deployment exists
- If missing, creates from template
- Waits for ready (2min timeout)
- Returns MCP endpoint

### Container Deletion (Lifecycle Sidecar)

- Container tracks activity and triggers
- When idle (no triggers + timeout), exits with code 42
- Sidecar detects exit code 42
- Sidecar deletes deployment + optional PVC via k8s API
- Gateway creates a fresh container on next authentication

See `doc/container_lifecycle_management.md` for full lifecycle details.

## Error Handling

| Error | Gateway Action | User Experience |
|-------|----------------|-----------------|
| Deployment creation fails | Log error, return auth failure | "Authentication failed" |
| Wait timeout (image pull, etc.) | Log warning, return 503 | "Service unavailable, retry" |
| Service not found | Retry with backoff | Transparent retry |
| MCP connection fails | Return error | "Failed to connect to workspace" |
| Existing deployment not ready | Wait 30s, continue if still not ready | May connect to partially-ready container |

## Local Development

For local development (outside k8s):

1. Start minikube:

   ```bash
   minikube start
   minikube addons enable storage-provisioner
   ```

2. Apply security policies:

   ```bash
   kubectl apply -k deploy/k8s/dev
   ```

3. Configure the gateway for local k8s:

   ```bash
   # .env
   KUBERNETES_IN_CLUSTER=false
   KUBERNETES_CONTEXT=minikube
   KUBERNETES_NAMESPACE=dexorder-agents
   ```

4. Run the gateway:

   ```bash
   cd gateway
   npm run dev
   ```

5. Connect via WebSocket:

   ```bash
   wscat -c "ws://localhost:3000/ws/chat" -H "Authorization: Bearer your-jwt"
   ```

The gateway will create deployments in minikube. View with:

```bash
kubectl get deployments -n dexorder-agents
kubectl get pods -n dexorder-agents
kubectl logs -n dexorder-agents agent-user-abc123 -c agent
```

## Production Deployment

1. Build and push the gateway image:

   ```bash
   cd gateway
   docker build -t ghcr.io/dexorder/gateway:latest .
   docker push ghcr.io/dexorder/gateway:latest
   ```

2. Deploy to k8s:

   ```bash
   kubectl apply -k deploy/k8s/prod
   ```

3. Gateway runs in the `dexorder-system` namespace
4. Creates agent containers in the `dexorder-agents` namespace
5. Admission policies enforce the image allowlist and security constraints

## Monitoring

Useful metrics to track:

- Container creation latency (time from auth to ready)
- Container creation failure rate
- Active containers by license tier
- Resource usage per tier
- Idle shutdown rate

These can be exported via Prometheus or logged to a monitoring service.

## Future Enhancements

1. **Pre-warming**: Create containers for active users before they connect
2. **Image updates**: Handle agent image version migrations with user consent
3. **Multi-region**: Geo-distributed container placement
4. **Cost tracking**: Per-user resource usage and billing
5. **Auto-scaling**: Scale down to 0 replicas instead of deletion (faster restart)
6. **Container pools**: Shared warm containers for anonymous users
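A minimal in-process sketch of the first two monitoring metrics (creation latency and failure rate). The class name and aggregation choices are illustrative; a real deployment would export these through a Prometheus client rather than keep them in memory:

```typescript
// Illustrative in-memory tracker for container-creation metrics.
// A production gateway would register these with a Prometheus
// registry (e.g. a histogram plus a counter) instead.

class CreationMetrics {
  private latenciesMs: number[] = [];
  private failures = 0;
  private attempts = 0;

  // Record one successful creation with its auth-to-ready latency.
  recordSuccess(latencyMs: number): void {
    this.attempts++;
    this.latenciesMs.push(latencyMs);
  }

  recordFailure(): void {
    this.attempts++;
    this.failures++;
  }

  // Failure rate over all creation attempts (0 when none recorded).
  failureRate(): number {
    return this.attempts === 0 ? 0 : this.failures / this.attempts;
  }

  // Median creation latency; undefined until a success is recorded.
  medianLatencyMs(): number | undefined {
    if (this.latenciesMs.length === 0) return undefined;
    const sorted = [...this.latenciesMs].sort((a, b) => a - b);
    return sorted[Math.floor(sorted.length / 2)];
  }
}
```

Tracking latency from authentication to ready state (rather than from the k8s API call) captures the full delay a user actually experiences, including template rendering and the 2s readiness polling loop.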