# Container Lifecycle Management
## Overview
User agent containers self-manage their lifecycle to optimize resource usage. Containers automatically shut down when idle (no triggers + no recent activity) and clean themselves up using a lifecycle sidecar.
## Architecture
```
┌──────────────────────────────────────────────────────────┐
│                        Agent Pod                         │
│  ┌───────────────────┐    ┌──────────────────────┐       │
│  │ Agent Container   │    │ Lifecycle Sidecar    │       │
│  │ ───────────────   │    │ ──────────────────   │       │
│  │                   │    │                      │       │
│  │ Lifecycle Manager │    │ Watches exit code    │       │
│  │ - Track activity  │    │ - Detects exit 42    │       │
│  │ - Track triggers  │    │ - Calls k8s API      │       │
│  │ - Exit 42 if idle │    │ - Deletes deployment │       │
│  └───────────────────┘    └──────────────────────┘       │
│            │                          │                  │
│            │ writes exit_code         │                  │
│            └────►/var/run/agent/exit_code                │
│                                       │                  │
└───────────────────────────────────────┼──────────────────┘
                                        ▼ k8s API (RBAC)
                             ┌─────────────────────┐
                             │ Delete Deployment   │
                             │ Delete PVC (if anon)│
                             └─────────────────────┘
```
## Components
### 1. Lifecycle Manager (Python)
**Location**: `client-py/dexorder/lifecycle_manager.py`
Runs inside the agent container and tracks:
- **Activity**: MCP tool/resource/prompt calls reset the idle timer
- **Triggers**: Data subscriptions, CEP patterns, etc.
- **Idle state**: No triggers + idle timeout exceeded
**Configuration** (via environment variables):
- `IDLE_TIMEOUT_MINUTES`: Minutes before shutdown (default: 15)
- `IDLE_CHECK_INTERVAL_SECONDS`: Check frequency (default: 60)
- `ENABLE_IDLE_SHUTDOWN`: Enable/disable shutdown (default: true)
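As a rough illustration of how these variables could be read with their documented defaults, here is a minimal sketch. `LifecycleConfig` and `load_config` are hypothetical names, not taken from `lifecycle_manager.py`, and the set of accepted truthy spellings is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LifecycleConfig:
    """Hypothetical container for the three documented settings."""
    idle_timeout_minutes: int = 15
    check_interval_seconds: int = 60
    enable_shutdown: bool = True


def load_config(env: Optional[Mapping[str, str]] = None) -> LifecycleConfig:
    """Read the settings from the environment, falling back to the documented defaults."""
    env = os.environ if env is None else env
    enabled = env.get("ENABLE_IDLE_SHUTDOWN", "true").strip().lower()
    return LifecycleConfig(
        idle_timeout_minutes=int(env.get("IDLE_TIMEOUT_MINUTES", "15")),
        check_interval_seconds=int(env.get("IDLE_CHECK_INTERVAL_SECONDS", "60")),
        # Assumption: these spellings count as "enabled"; anything else disables shutdown.
        enable_shutdown=enabled in {"1", "true", "yes", "on"},
    )
```

Passing a plain dict instead of `os.environ` keeps the function easy to exercise in unit tests.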
**Usage in agent code**:
```python
from dexorder.lifecycle_manager import get_lifecycle_manager
# On startup (call from within the agent's async entry point)
manager = get_lifecycle_manager()
await manager.start()
# On MCP calls (tool/resource/prompt)
manager.record_activity()
# When triggers change
manager.add_trigger("data_sub_BTC_USDT")
manager.remove_trigger("data_sub_BTC_USDT")
# Or batch update
manager.update_triggers({"trigger_1", "trigger_2"})
```
**Exit behavior**:
- Idle shutdown: Exit with code `42`
- Signal (SIGTERM/SIGINT): Exit with code `0` (allows restart)
- Errors/crashes: Exit with error code (allows restart)
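The mapping from shutdown reason to exit code can be sketched as below. `ShutdownReason` and `exit_code_for` are hypothetical names used for illustration, and the generic error code `1` is an assumption: the text above only requires that errors exit with a nonzero code other than `42`.

```python
from enum import Enum


class ShutdownReason(Enum):
    IDLE = "idle"      # no triggers and idle timeout exceeded
    SIGNAL = "signal"  # SIGTERM / SIGINT
    ERROR = "error"    # unhandled error or crash


# Only 42 tells the sidecar to clean up; every other code leads to a restart.
EXIT_CODES = {
    ShutdownReason.IDLE: 42,
    ShutdownReason.SIGNAL: 0,
    ShutdownReason.ERROR: 1,  # assumption: any nonzero code other than 42 would do
}


def exit_code_for(reason: ShutdownReason) -> int:
    """Return the process exit code for a given shutdown reason."""
    return EXIT_CODES[reason]
```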
### 2. Lifecycle Sidecar (Go)
**Location**: `lifecycle-sidecar/`
Runs alongside the agent container with shared PID namespace. Monitors the main container process and:
- On exit code `42`: Deletes deployment (and PVC if anonymous user)
- On any other exit code: Exits with the same code (k8s restarts the containers)
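The sidecar itself is written in Go; the Python sketch below only mirrors the decision rule described above so it can be read and tested in isolation. `plan_cleanup`, the action names, and the `pvc_name` parameter are hypothetical and not taken from the sidecar source.

```python
from typing import List, Tuple

IDLE_EXIT_CODE = 42


def plan_cleanup(
    exit_code: int,
    user_type: str,
    deployment_name: str,
    pvc_name: str,
) -> List[Tuple[str, str]]:
    """Return the ordered (action, target) steps for a given agent exit code."""
    if exit_code != IDLE_EXIT_CODE:
        # Not an idle shutdown: propagate the code and let k8s handle the restart.
        return [("exit", str(exit_code))]

    steps = [("delete_deployment", deployment_name)]
    if user_type == "anonymous":
        # Anonymous sessions are ephemeral, so their storage is removed as well.
        steps.append(("delete_pvc", pvc_name))
    return steps
```

Returning a plan rather than calling the k8s API directly keeps the rule separate from the side effects, which makes the anonymous-only PVC deletion easy to verify.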
**Configuration** (via environment, injected by downward API):
- `NAMESPACE`: Pod's namespace
- `DEPLOYMENT_NAME`: Deployment name (from pod label)
- `USER_TYPE`: License tier (`anonymous`, `free`, `paid`, `enterprise`)
- `MAIN_CONTAINER_PID`: PID of main container (default: 1)
**RBAC**: The sidecar has permission to delete deployments and PVCs **only in the `dexorder-agents` namespace**. It does not delete other deployments because:
1. It only knows its own deployment name (from env)
2. Its RBAC role is scoped to the namespace
3. It has no cross-pod communication
### 3. Gateway (TypeScript)
**Location**: `gateway/src/harness/agent-harness.ts`
Creates agent deployments when users connect. Has permissions to:
- ✅ Create deployments, services, PVCs
- ✅ Read pod status and logs
- ✅ Update deployments (e.g., resource limits)
- ❌ Delete deployments (handled by sidecar)
- ❌ Exec into pods
- ❌ Access secrets
## Lifecycle States
```
┌─────────────┐
│   CREATED   │ ← Gateway creates deployment
└──────┬──────┘
       ▼
┌─────────────┐
│   RUNNING   │ ← User interacts, has triggers
└──────┬──────┘
       ▼
┌─────────────┐
│    IDLE     │ ← No triggers + timeout exceeded
└──────┬──────┘
       ▼
┌─────────────┐
│  SHUTDOWN   │ ← Exit code 42
└──────┬──────┘
       ▼
┌─────────────┐
│   DELETED   │ ← Sidecar deletes deployment
└─────────────┘
```
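The state sequence above can be modelled as a small linear state machine. This is a sketch under the assumption that transitions follow the diagram strictly in order; `LifecycleState` and `advance` are hypothetical names, and the real system does not necessarily track state this way.

```python
from enum import Enum


class LifecycleState(Enum):
    CREATED = "created"
    RUNNING = "running"
    IDLE = "idle"
    SHUTDOWN = "shutdown"
    DELETED = "deleted"


# Assumption: each state has exactly one successor, as drawn in the diagram.
NEXT_STATE = {
    LifecycleState.CREATED: LifecycleState.RUNNING,
    LifecycleState.RUNNING: LifecycleState.IDLE,
    LifecycleState.IDLE: LifecycleState.SHUTDOWN,
    LifecycleState.SHUTDOWN: LifecycleState.DELETED,
}


def advance(state: LifecycleState) -> LifecycleState:
    """Return the state that follows `state`, or raise if it is terminal."""
    try:
        return NEXT_STATE[state]
    except KeyError:
        raise ValueError(f"{state.name} is a terminal state") from None
```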
## Idle Detection Logic
Container is **IDLE** when:
1. `active_triggers` is empty AND
2. `(now - last_activity) > idle_timeout`
Container is **ACTIVE** when:
1. Has any active triggers (data subscriptions, CEP patterns, etc.) OR
2. Recent user activity (MCP calls within timeout)
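The two conditions combine into a single predicate. The following is a minimal sketch of that check; `is_idle` and its parameter names are illustrative and may differ from what `lifecycle_manager.py` uses.

```python
from datetime import datetime, timedelta
from typing import Set


def is_idle(
    active_triggers: Set[str],
    last_activity: datetime,
    idle_timeout: timedelta,
    now: datetime,
) -> bool:
    """True only when there are no triggers AND the idle timeout has been exceeded."""
    return not active_triggers and (now - last_activity) > idle_timeout
```

Taking `now` as an argument rather than calling `datetime.now()` inside the function keeps the check deterministic in tests.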
## Cleanup Policies by License Tier
| User Type | Idle Timeout | PVC Policy | Notes |
|--------------|--------------|------------|-------|
| Anonymous | 15 minutes | Delete | Ephemeral, no data retention |
| Free | 15 minutes | Retain | Can resume session |
| Paid | 60 minutes | Retain | Longer grace period |
| Enterprise | No shutdown | Retain | Always-on containers |
The tier is set via the `USER_TYPE` env var in the deployment.
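One way to encode the table is a lookup keyed by `USER_TYPE`. This is a sketch only: `CleanupPolicy`, `POLICIES`, and `policy_for` are hypothetical names, and `None` is used here as an assumed representation of "no shutdown".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CleanupPolicy:
    idle_timeout_minutes: Optional[int]  # None means the container is never shut down
    delete_pvc: bool


# Values copied from the table above.
POLICIES = {
    "anonymous": CleanupPolicy(idle_timeout_minutes=15, delete_pvc=True),
    "free": CleanupPolicy(idle_timeout_minutes=15, delete_pvc=False),
    "paid": CleanupPolicy(idle_timeout_minutes=60, delete_pvc=False),
    "enterprise": CleanupPolicy(idle_timeout_minutes=None, delete_pvc=False),
}


def policy_for(user_type: str) -> CleanupPolicy:
    """Look up the cleanup policy for a USER_TYPE value."""
    try:
        return POLICIES[user_type.strip().lower()]
    except KeyError:
        raise ValueError(f"unknown USER_TYPE: {user_type!r}") from None
```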
## Security
### Principle of Least Privilege
**Gateway**:
- Can create agent resources
- Cannot delete agent resources
- Cannot access other namespaces
- Cannot exec into pods
**Lifecycle Sidecar**:
- Can delete its own deployment only
- Cannot delete other deployments
- Scoped to dexorder-agents namespace
- No exec, no secrets access
### Admission Control
All deployments in `dexorder-agents` namespace are subject to:
- Image allowlist (only approved images)
- Security context enforcement (non-root, drop caps, read-only rootfs)
- Resource limits required
- PodSecurity standards (restricted profile)
See `deploy/k8s/base/admission-policy.yaml`
### Network Isolation
Agents are network-isolated via NetworkPolicy:
- Can connect to gateway (MCP)
- Can connect to Redpanda (data streams)
- Can make outbound HTTPS (exchanges, LLM APIs)
- Cannot access k8s API
- Cannot access system namespace
- Cannot access other agent pods
See `deploy/k8s/base/network-policies.yaml`
## Deployment
### 1. Apply Security Policies
```bash
kubectl apply -k deploy/k8s/dev # or prod
```
This creates:
- Namespaces (`dexorder-system`, `dexorder-agents`)
- RBAC (gateway, lifecycle sidecar)
- Admission policies
- Network policies
- Resource quotas
### 2. Build and Push Lifecycle Sidecar
```bash
cd lifecycle-sidecar
docker build -t ghcr.io/dexorder/lifecycle-sidecar:latest .
docker push ghcr.io/dexorder/lifecycle-sidecar:latest
```
### 3. Gateway Creates Agent Deployments
When a user connects, the gateway creates:
- Deployment with agent + sidecar
- PVC for persistent data
- Service for MCP endpoint
See `deploy/k8s/base/agent-deployment-example.yaml` for template.
## Testing
### Test Lifecycle Manager Locally
```python
import asyncio

from dexorder.lifecycle_manager import LifecycleManager


async def main():
    # Disable actual shutdown for testing
    manager = LifecycleManager(
        idle_timeout_minutes=1,
        check_interval_seconds=10,
        enable_shutdown=False,  # Only log, don't exit
    )
    await manager.start()

    # Simulate activity
    manager.record_activity()

    # Simulate triggers: an active trigger should prevent idle detection
    manager.add_trigger("test_trigger")
    await asyncio.sleep(70)  # Wait past timeout

    manager.remove_trigger("test_trigger")
    await asyncio.sleep(70)  # Should detect idle
    await manager.stop()


asyncio.run(main())
```
### Test Sidecar Locally
```bash
# Build
cd lifecycle-sidecar
go build -o lifecycle-sidecar main.go
# Run (requires k8s config)
export NAMESPACE=dexorder-agents
export DEPLOYMENT_NAME=agent-test
export USER_TYPE=free
./lifecycle-sidecar
```
### Integration Test
1. Deploy test agent with sidecar
2. Verify agent starts and is healthy
3. Stop sending MCP calls and remove all triggers
4. Wait for idle timeout + check interval
5. Verify deployment is deleted
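Steps 4 and 5 amount to polling until the deployment disappears, bounded by the idle timeout plus one check interval. The helper below is a generic sketch of that wait: `wait_until`, `max_shutdown_wait_seconds`, and the 30-second grace period are assumptions, and the predicate that actually queries the cluster is left to the caller.

```python
import time
from typing import Callable


def max_shutdown_wait_seconds(
    idle_timeout_minutes: int,
    check_interval_seconds: int,
    grace_seconds: int = 30,  # assumption: slack for the sidecar's delete call
) -> int:
    """Upper bound on how long an idle agent may take to disappear."""
    return idle_timeout_minutes * 60 + check_interval_seconds + grace_seconds


def wait_until(
    predicate: Callable[[], bool],
    timeout_seconds: float,
    poll_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `predicate` until it returns True or the timeout elapses."""
    deadline = clock() + timeout_seconds
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)
```

In an integration test the predicate would check that the deployment no longer exists, for example by inspecting the result of a `kubectl get deployment` call.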
## Troubleshooting
### Container not shutting down when idle
Check logs:
```bash
kubectl logs -n dexorder-agents agent-user-abc123 -c agent
```
Verify:
- `ENABLE_IDLE_SHUTDOWN=true`
- No active triggers: `manager.active_triggers` should be empty
- Idle timeout exceeded
### Sidecar not deleting deployment
Check sidecar logs:
```bash
kubectl logs -n dexorder-agents agent-user-abc123 -c lifecycle-sidecar
```
Verify:
- Exit code file exists: `/var/run/agent/exit_code` contains `42`
- RBAC permissions: `kubectl auth can-i delete deployments --as=system:serviceaccount:dexorder-agents:agent-lifecycle -n dexorder-agents`
- Deployment name matches: Check `DEPLOYMENT_NAME` env var
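When checking the exit code file by hand is inconvenient, a small reader like the one below can help. It is a sketch that assumes the file holds a single decimal integer; `read_exit_code` is a hypothetical helper, not part of the agent or sidecar code.

```python
from pathlib import Path
from typing import Optional, Union

EXIT_CODE_FILE = Path("/var/run/agent/exit_code")


def read_exit_code(path: Union[str, Path] = EXIT_CODE_FILE) -> Optional[int]:
    """Return the recorded exit code, or None if the file is missing or malformed."""
    try:
        text = Path(path).read_text().strip()
    except FileNotFoundError:
        return None
    try:
        return int(text)
    except ValueError:
        # Assumption: anything other than a plain integer is treated as "no code yet".
        return None
```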
### Gateway can't create deployments
Check gateway logs and verify:
- ServiceAccount exists: `kubectl get sa gateway -n dexorder-system`
- RoleBinding exists: `kubectl get rolebinding gateway-agent-creator -n dexorder-agents`
- Admission policy allows image: Check image name matches allowlist in `admission-policy.yaml`
## Future Enhancements
1. **Graceful shutdown notifications**: Warn users before shutdown via websocket
2. **Predictive scaling**: Keep frequently-used containers warm
3. **Tiered storage**: Move old PVCs to cheaper storage class
4. **Metrics**: Expose lifecycle metrics (idle rate, shutdown count, etc.)
5. **Cost allocation**: Track resource usage per user/license tier