container lifecycle management
doc/container_lifecycle_management.md
# Container Lifecycle Management

## Overview

User agent containers manage their own lifecycle to optimize resource usage. A container shuts down automatically when it is idle (no active triggers and no recent activity), and a lifecycle sidecar then cleans up its resources.

## Architecture

```
┌──────────────────────────────────────────────────────────┐
│                        Agent Pod                         │
│  ┌───────────────────┐      ┌──────────────────────┐     │
│  │ Agent Container   │      │ Lifecycle Sidecar    │     │
│  │ ───────────────   │      │ ──────────────────   │     │
│  │                   │      │                      │     │
│  │ Lifecycle Manager │      │ Watches exit code    │     │
│  │ - Track activity  │      │ - Detects exit 42    │     │
│  │ - Track triggers  │      │ - Calls k8s API      │     │
│  │ - Exit 42 if idle │      │ - Deletes deployment │     │
│  └───────────────────┘      └──────────────────────┘     │
│           │                           │                  │
│           │ writes exit_code          │                  │
│           └────►/var/run/agent/exit_code                 │
│                                       │                  │
└───────────────────────────────────────┼──────────────────┘
                                        │
                                        ▼ k8s API (RBAC)
                             ┌─────────────────────┐
                             │ Delete Deployment   │
                             │ Delete PVC (if anon)│
                             └─────────────────────┘
```

## Components

### 1. Lifecycle Manager (Python)

**Location**: `client-py/dexorder/lifecycle_manager.py`

Runs inside the agent container and tracks:
- **Activity**: MCP tool/resource/prompt calls reset the idle timer
- **Triggers**: Data subscriptions, CEP patterns, etc.
- **Idle state**: No triggers + idle timeout exceeded

**Configuration** (via environment variables):
- `IDLE_TIMEOUT_MINUTES`: Minutes before shutdown (default: 15)
- `IDLE_CHECK_INTERVAL_SECONDS`: Check frequency (default: 60)
- `ENABLE_IDLE_SHUTDOWN`: Enable/disable shutdown (default: true)
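
As a rough illustration of how these variables and their defaults could be read, here is a minimal sketch. The `LifecycleConfig` dataclass and `load_config` helper are hypothetical names used only for this example, not the module's actual API, and the accepted boolean spellings are an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LifecycleConfig:
    """Hypothetical container for the three documented settings."""
    idle_timeout_minutes: int = 15
    check_interval_seconds: int = 60
    enable_idle_shutdown: bool = True


def load_config(env: Mapping[str, str] = os.environ) -> LifecycleConfig:
    """Read the documented variables, falling back to the documented defaults."""
    # Assumption: "1", "true" and "yes" (case-insensitive) count as enabled.
    raw_enabled = env.get("ENABLE_IDLE_SHUTDOWN", "true").strip().lower()
    return LifecycleConfig(
        idle_timeout_minutes=int(env.get("IDLE_TIMEOUT_MINUTES", "15")),
        check_interval_seconds=int(env.get("IDLE_CHECK_INTERVAL_SECONDS", "60")),
        enable_idle_shutdown=raw_enabled in {"1", "true", "yes"},
    )
```

Passing the environment in as a mapping keeps the parsing testable without mutating `os.environ`.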

**Usage in agent code**:
```python
from dexorder.lifecycle_manager import get_lifecycle_manager

# On startup
manager = get_lifecycle_manager()
await manager.start()

# On MCP calls (tool/resource/prompt)
manager.record_activity()

# When triggers change
manager.add_trigger("data_sub_BTC_USDT")
manager.remove_trigger("data_sub_BTC_USDT")

# Or batch update
manager.update_triggers({"trigger_1", "trigger_2"})
```

**Exit behavior**:
- Idle shutdown: Exit with code `42`
- Signal (SIGTERM/SIGINT): Exit with code `0` (allows restart)
- Errors/crashes: Exit with error code (allows restart)

### 2. Lifecycle Sidecar (Go)

**Location**: `lifecycle-sidecar/`

Runs alongside the agent container in a shared PID namespace. It monitors the main container process and:
- On exit code `42`: Deletes the deployment (and the PVC if the user is anonymous)
- On any other exit code: Exits with the same code (k8s restarts pod)
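
The branching above can be sketched as a small pure function. The real sidecar is written in Go; this Python version is only an illustration, and `SidecarAction` and `decide_action` are hypothetical names. The exit code the sidecar itself returns after a successful cleanup is not stated in this document, so `0` is an assumption.

```python
from dataclasses import dataclass

IDLE_EXIT_CODE = 42  # documented idle-shutdown exit code


@dataclass(frozen=True)
class SidecarAction:
    """What the sidecar does after the agent process exits (illustrative)."""
    delete_deployment: bool
    delete_pvc: bool
    sidecar_exit_code: int


def decide_action(agent_exit_code: int, user_type: str) -> SidecarAction:
    """Map the agent's exit code and USER_TYPE to a cleanup decision."""
    if agent_exit_code == IDLE_EXIT_CODE:
        # Idle shutdown: remove the deployment; the PVC only for anonymous users.
        # Assumption: the sidecar exits 0 once cleanup has been requested.
        return SidecarAction(True, user_type == "anonymous", 0)
    # Any other code is propagated unchanged so Kubernetes can restart the pod.
    return SidecarAction(False, False, agent_exit_code)
```

Keeping the decision separate from the Kubernetes API calls makes the policy easy to unit test without a cluster.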

**Configuration** (via environment variables, injected via the Downward API):
- `NAMESPACE`: Pod's namespace
- `DEPLOYMENT_NAME`: Deployment name (from pod label)
- `USER_TYPE`: License tier (`anonymous`, `free`, `paid`, `enterprise`)
- `MAIN_CONTAINER_PID`: PID of main container (default: 1)

**RBAC**: Has permission to delete deployments and PVCs **only in the `dexorder-agents` namespace**. It cannot delete other deployments because:
1. It only knows its own deployment name (from env)
2. RBAC is scoped to the namespace
3. There is no cross-pod communication

### 3. Gateway (TypeScript)

**Location**: `gateway/src/harness/agent-harness.ts`

Creates agent deployments when users connect. Has permissions to:
- ✅ Create deployments, services, PVCs
- ✅ Read pod status and logs
- ✅ Update deployments (e.g., resource limits)
- ❌ Delete deployments (handled by sidecar)
- ❌ Exec into pods
- ❌ Access secrets

## Lifecycle States

```
┌─────────────┐
│   CREATED   │ ← Gateway creates deployment
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   RUNNING   │ ← User interacts, has triggers
└──────┬──────┘
       │
       ▼
┌─────────────┐
│    IDLE     │ ← No triggers + timeout exceeded
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  SHUTDOWN   │ ← Exit code 42
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   DELETED   │ ← Sidecar deletes deployment
└─────────────┘
```
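
The progression in the diagram can be modeled as a simple linear state machine. This is a sketch only: `State`, `next_state`, and `lifecycle_path` are hypothetical names, and the table encodes just the arrows drawn above, not any other transitions the real system may allow.

```python
from enum import Enum
from typing import Dict, List, Optional


class State(Enum):
    CREATED = "created"
    RUNNING = "running"
    IDLE = "idle"
    SHUTDOWN = "shutdown"
    DELETED = "deleted"


# Only the forward arrows shown in the diagram are modeled here.
TRANSITIONS: Dict[State, State] = {
    State.CREATED: State.RUNNING,
    State.RUNNING: State.IDLE,
    State.IDLE: State.SHUTDOWN,
    State.SHUTDOWN: State.DELETED,
}


def next_state(state: State) -> Optional[State]:
    """Return the state that follows `state`, or None if it is terminal."""
    return TRANSITIONS.get(state)


def lifecycle_path(start: State = State.CREATED) -> List[State]:
    """Walk the diagram from `start` until the terminal state is reached."""
    path = [start]
    while (following := next_state(path[-1])) is not None:
        path.append(following)
    return path
```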

## Idle Detection Logic

A container is **IDLE** when:
1. `active_triggers.isEmpty()` AND
2. `(now - last_activity) > idle_timeout`

A container is **ACTIVE** when:
1. It has any active triggers (data subscriptions, CEP patterns, etc.) OR
2. There was recent user activity (MCP calls within the timeout)
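
The two conditions above reduce to a single predicate. The following is a minimal sketch, assuming triggers are tracked as a set of string identifiers and timestamps are `datetime` values; the function name `is_idle` is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Set


def is_idle(
    active_triggers: Set[str],
    last_activity: datetime,
    now: datetime,
    idle_timeout: timedelta,
) -> bool:
    """True only when there are no triggers AND the idle timeout has been exceeded."""
    return not active_triggers and (now - last_activity) > idle_timeout
```

Note the strict inequality: a container whose last activity was exactly `idle_timeout` ago still counts as active under this definition, matching the `>` in the rule above.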

## Cleanup Policies by License Tier

| User Type  | Idle Timeout | PVC Policy | Notes                        |
|------------|--------------|------------|------------------------------|
| Anonymous  | 15 minutes   | Delete     | Ephemeral, no data retention |
| Free       | 15 minutes   | Retain     | Can resume session           |
| Paid       | 60 minutes   | Retain     | Longer grace period          |
| Enterprise | No shutdown  | Retain     | Always-on containers         |

Configured via `USER_TYPE` env var in deployment.
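
For illustration, the table can be expressed as a lookup keyed by the `USER_TYPE` value. This is a sketch under assumptions: `CleanupPolicy`, `POLICIES`, and `policy_for` are hypothetical names, and representing "No shutdown" as `None` is a choice made for this example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class CleanupPolicy:
    idle_timeout_minutes: Optional[int]  # None means the container never idles out
    delete_pvc: bool                     # True = delete the PVC, False = retain it


# Values copied from the table above.
POLICIES: Dict[str, CleanupPolicy] = {
    "anonymous": CleanupPolicy(idle_timeout_minutes=15, delete_pvc=True),
    "free": CleanupPolicy(idle_timeout_minutes=15, delete_pvc=False),
    "paid": CleanupPolicy(idle_timeout_minutes=60, delete_pvc=False),
    "enterprise": CleanupPolicy(idle_timeout_minutes=None, delete_pvc=False),
}


def policy_for(user_type: str) -> CleanupPolicy:
    """Look up the policy for a USER_TYPE value, rejecting unknown tiers."""
    try:
        return POLICIES[user_type]
    except KeyError:
        raise ValueError(f"unknown USER_TYPE: {user_type!r}") from None
```

Failing loudly on an unknown tier avoids silently applying the wrong retention policy to a user's data.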

## Security

### Principle of Least Privilege

**Gateway**:
- Can create agent resources
- Cannot delete agent resources
- Cannot access other namespaces
- Cannot exec into pods

**Lifecycle Sidecar**:
- Can delete its own deployment only
- Cannot delete other deployments
- Scoped to the `dexorder-agents` namespace
- No exec, no secrets access

### Admission Control

All deployments in the `dexorder-agents` namespace are subject to:
- Image allowlist (only approved images)
- Security context enforcement (non-root, drop caps, read-only rootfs)
- Resource limits required
- PodSecurity standards (restricted profile)

See `deploy/k8s/base/admission-policy.yaml`.

### Network Isolation

Agents are network-isolated via NetworkPolicy:
- Can connect to gateway (MCP)
- Can connect to Redpanda (data streams)
- Can make outbound HTTPS (exchanges, LLM APIs)
- Cannot access k8s API
- Cannot access system namespace
- Cannot access other agent pods

See `deploy/k8s/base/network-policies.yaml`.

## Deployment

### 1. Apply Security Policies

```bash
kubectl apply -k deploy/k8s/dev  # or prod
```

This creates:
- Namespaces (`dexorder-system`, `dexorder-agents`)
- RBAC (gateway, lifecycle sidecar)
- Admission policies
- Network policies
- Resource quotas

### 2. Build and Push Lifecycle Sidecar

```bash
cd lifecycle-sidecar
docker build -t ghcr.io/dexorder/lifecycle-sidecar:latest .
docker push ghcr.io/dexorder/lifecycle-sidecar:latest
```

### 3. Gateway Creates Agent Deployments

When a user connects, the gateway creates:
- Deployment with agent + sidecar
- PVC for persistent data
- Service for MCP endpoint

See `deploy/k8s/base/agent-deployment-example.yaml` for a template.

## Testing

### Test Lifecycle Manager Locally

```python
import asyncio

from dexorder.lifecycle_manager import LifecycleManager


async def main():
    # Disable actual shutdown for testing
    manager = LifecycleManager(
        idle_timeout_minutes=1,
        check_interval_seconds=10,
        enable_shutdown=False,  # Only log, don't exit
    )

    await manager.start()

    # Simulate activity
    manager.record_activity()

    # Simulate triggers: an active trigger should keep the container alive
    manager.add_trigger("test_trigger")
    await asyncio.sleep(70)  # Wait past timeout
    manager.remove_trigger("test_trigger")
    await asyncio.sleep(70)  # Should detect idle

    await manager.stop()


asyncio.run(main())
```

### Test Sidecar Locally

```bash
# Build
cd lifecycle-sidecar
go build -o lifecycle-sidecar main.go

# Run (requires k8s config)
export NAMESPACE=dexorder-agents
export DEPLOYMENT_NAME=agent-test
export USER_TYPE=free
./lifecycle-sidecar
```

### Integration Test

1. Deploy test agent with sidecar
2. Verify agent starts and is healthy
3. Stop sending MCP calls and remove all triggers
4. Wait for idle timeout + check interval
5. Verify deployment is deleted
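
Steps 4 and 5 amount to polling until the deployment disappears or a deadline passes. The helper below is a sketch, not part of the repository: `wait_until_deleted` is a hypothetical name, and `deployment_exists` is a caller-supplied callable (for example, one that wraps a Kubernetes client lookup) so the polling logic stays independent of any particular client library.

```python
import time
from typing import Callable


def wait_until_deleted(
    deployment_exists: Callable[[], bool],
    timeout_seconds: float,
    poll_interval_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the deployment is gone; return False if the deadline passes first."""
    deadline = clock() + timeout_seconds
    while True:
        if not deployment_exists():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_seconds)
```

A reasonable `timeout_seconds` is the idle timeout plus the check interval plus some margin for the sidecar's API call, since the idle check only runs once per interval.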

## Troubleshooting

### Container not shutting down when idle

Check logs:
```bash
kubectl logs -n dexorder-agents agent-user-abc123 -c agent
```

Verify:
- `ENABLE_IDLE_SHUTDOWN=true`
- No active triggers: `manager.active_triggers` should be empty
- Idle timeout exceeded

### Sidecar not deleting deployment

Check sidecar logs:
```bash
kubectl logs -n dexorder-agents agent-user-abc123 -c lifecycle-sidecar
```

Verify:
- Exit code file exists: `/var/run/agent/exit_code` contains `42`
- RBAC permissions: `kubectl auth can-i delete deployments --as=system:serviceaccount:dexorder-agents:agent-lifecycle -n dexorder-agents`
- Deployment name matches: Check `DEPLOYMENT_NAME` env var

### Gateway can't create deployments

Check gateway logs and verify:
- ServiceAccount exists: `kubectl get sa gateway -n dexorder-system`
- RoleBinding exists: `kubectl get rolebinding gateway-agent-creator -n dexorder-agents`
- Admission policy allows image: Check image name matches allowlist in `admission-policy.yaml`

## Future Enhancements

1. **Graceful shutdown notifications**: Warn users before shutdown via websocket
2. **Predictive scaling**: Keep frequently-used containers warm
3. **Tiered storage**: Move old PVCs to cheaper storage class
4. **Metrics**: Expose lifecycle metrics (idle rate, shutdown count, etc.)
5. **Cost allocation**: Track resource usage per user/license tier