# Container Lifecycle Management

## Overview

User agent containers manage their own lifecycle to limit resource usage. A container shuts down automatically when it is idle (no active triggers and no recent activity), and a lifecycle sidecar then cleans up its Kubernetes resources.

## Architecture

```
┌──────────────────────────────────────────────────────────┐
│                        Agent Pod                         │
│  ┌───────────────────┐      ┌──────────────────────┐     │
│  │ Agent Container   │      │ Lifecycle Sidecar    │     │
│  │ ───────────────   │      │ ──────────────────   │     │
│  │                   │      │                      │     │
│  │ Lifecycle Manager │      │ Watches exit code    │     │
│  │ - Track activity  │      │ - Detects exit 42    │     │
│  │ - Track triggers  │      │ - Calls k8s API      │     │
│  │ - Exit 42 if idle │      │ - Deletes deployment │     │
│  └───────────────────┘      └──────────────────────┘     │
│          │                            │                  │
│          │ writes exit_code           │                  │
│          └────►/var/run/agent/exit_code                  │
│                                       │                  │
└───────────────────────────────────────┼──────────────────┘
                                        │
                                        ▼ k8s API (RBAC)
                             ┌─────────────────────┐
                             │ Delete Deployment   │
                             │ Delete PVC (if anon)│
                             └─────────────────────┘
```

## Components

### 1. Lifecycle Manager (Python)

**Location**: `client-py/dexorder/lifecycle_manager.py`

Runs inside the agent container and tracks:
- **Activity**: MCP tool/resource/prompt calls reset the idle timer
- **Triggers**: Data subscriptions, CEP patterns, etc.
- **Idle state**: No triggers + idle timeout exceeded

**Configuration** (via environment variables):
- `IDLE_TIMEOUT_MINUTES`: Minutes before shutdown (default: 15)
- `IDLE_CHECK_INTERVAL_SECONDS`: Check frequency (default: 60)
- `ENABLE_IDLE_SHUTDOWN`: Enable/disable shutdown (default: true)

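As a rough illustration of how these variables might be read, the sketch below parses them into a small config object. `LifecycleConfig` and `load_config` are hypothetical names, not taken from `lifecycle_manager.py`, and the accepted boolean spellings are an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LifecycleConfig:
    """Hypothetical container for the three documented settings."""
    idle_timeout_minutes: int = 15
    idle_check_interval_seconds: int = 60
    enable_idle_shutdown: bool = True


def load_config(env: Mapping[str, str] = os.environ) -> LifecycleConfig:
    """Read the documented variables, falling back to the documented defaults."""
    enabled = env.get("ENABLE_IDLE_SHUTDOWN", "true").strip().lower()
    return LifecycleConfig(
        idle_timeout_minutes=int(env.get("IDLE_TIMEOUT_MINUTES", "15")),
        idle_check_interval_seconds=int(env.get("IDLE_CHECK_INTERVAL_SECONDS", "60")),
        # Assumption: "true", "1" and "yes" enable shutdown; anything else disables it.
        enable_idle_shutdown=enabled in {"true", "1", "yes"},
    )
```

Passing a plain dict instead of `os.environ` makes the parsing easy to exercise in tests without touching the real environment.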
**Usage in agent code**:
```python
from dexorder.lifecycle_manager import get_lifecycle_manager

# On startup (inside an async function, since start() is awaited)
manager = get_lifecycle_manager()
await manager.start()

# On MCP calls (tool/resource/prompt)
manager.record_activity()

# When triggers change
manager.add_trigger("data_sub_BTC_USDT")
manager.remove_trigger("data_sub_BTC_USDT")

# Or batch update
manager.update_triggers({"trigger_1", "trigger_2"})
```

**Exit behavior**:
- Idle shutdown: Exit with code `42`
- Signal (SIGTERM/SIGINT): Exit with code `0` (allows restart)
- Errors/crashes: Exit with error code (allows restart)

### 2. Lifecycle Sidecar (Go)

**Location**: `lifecycle-sidecar/`

Runs alongside the agent container in a shared PID namespace. It monitors the main container process and:
- On exit code `42`: Deletes the deployment (and the PVC if the user is anonymous)
- On any other exit code: Exits with the same code (k8s restarts pod)

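The two rules above amount to a small decision function. The sketch below shows that logic in Python for readability; the real sidecar is written in Go, and `SidecarAction`, `decide_action`, and the choice to exit `0` after a successful cleanup are all assumptions for illustration.

```python
from dataclasses import dataclass

IDLE_EXIT_CODE = 42  # exit code the agent uses for an idle shutdown


@dataclass(frozen=True)
class SidecarAction:
    """Hypothetical description of what the sidecar should do next."""
    delete_deployment: bool
    delete_pvc: bool
    sidecar_exit_code: int


def decide_action(agent_exit_code: int, user_type: str) -> SidecarAction:
    """Map the agent's exit code and the user type to a cleanup decision."""
    if agent_exit_code == IDLE_EXIT_CODE:
        # Idle shutdown: remove the deployment; only anonymous users lose their PVC.
        # Assumption: the sidecar itself exits 0 once cleanup is requested.
        return SidecarAction(True, user_type == "anonymous", 0)
    # Any other exit: touch nothing and propagate the same code.
    return SidecarAction(False, False, agent_exit_code)
```

Keeping the decision separate from the Kubernetes API calls means it can be checked without a cluster.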
**Configuration** (via environment, injected by downward API):
- `NAMESPACE`: Pod's namespace
- `DEPLOYMENT_NAME`: Deployment name (from pod label)
- `USER_TYPE`: License tier (`anonymous`, `free`, `paid`, `enterprise`)
- `MAIN_CONTAINER_PID`: PID of main container (default: 1)

**RBAC**: Has permission to delete deployments and PVCs **only in the `dexorder-agents` namespace**. It cannot delete other deployments because:
1. It only knows its own deployment name (from env)
2. RBAC is scoped to the namespace
3. There is no cross-pod communication

### 3. Gateway (TypeScript)

**Location**: `gateway/src/harness/agent-harness.ts`

Creates agent deployments when users connect. Has permissions to:
- ✅ Create deployments, services, PVCs
- ✅ Read pod status and logs
- ✅ Update deployments (e.g., resource limits)
- ❌ Delete deployments (handled by sidecar)
- ❌ Exec into pods
- ❌ Access secrets

## Lifecycle States

```
┌─────────────┐
│   CREATED   │  ← Gateway creates deployment
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   RUNNING   │  ← User interacts, has triggers
└──────┬──────┘
       │
       ▼
┌─────────────┐
│    IDLE     │  ← No triggers + timeout exceeded
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  SHUTDOWN   │  ← Exit code 42
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   DELETED   │  ← Sidecar deletes deployment
└─────────────┘
```

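The diagram can be read as a linear state machine. The sketch below encodes only the transitions drawn above; `LifecycleState` and `next_state` are hypothetical names, and the document does not say whether these states exist as an explicit type in the codebase.

```python
from enum import Enum
from typing import Optional


class LifecycleState(Enum):
    CREATED = "created"
    RUNNING = "running"
    IDLE = "idle"
    SHUTDOWN = "shutdown"
    DELETED = "deleted"


# Only the forward transitions shown in the diagram are modelled here.
_NEXT = {
    LifecycleState.CREATED: LifecycleState.RUNNING,
    LifecycleState.RUNNING: LifecycleState.IDLE,
    LifecycleState.IDLE: LifecycleState.SHUTDOWN,
    LifecycleState.SHUTDOWN: LifecycleState.DELETED,
}


def next_state(state: LifecycleState) -> Optional[LifecycleState]:
    """Return the state that follows `state`, or None for the terminal state."""
    return _NEXT.get(state)
```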
## Idle Detection Logic

Container is **IDLE** when:
1. `active_triggers` is empty AND
2. `(now - last_activity) > idle_timeout`

Container is **ACTIVE** when:
1. Has any active triggers (data subscriptions, CEP patterns, etc.) OR
2. Recent user activity (MCP calls within timeout)

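The two conditions translate directly into a predicate. This is a minimal sketch of the rule as stated above, not the code in `lifecycle_manager.py`; the function name `is_idle` and its parameter types are assumptions.

```python
from datetime import datetime, timedelta
from typing import AbstractSet


def is_idle(
    active_triggers: AbstractSet[str],
    last_activity: datetime,
    now: datetime,
    idle_timeout: timedelta,
) -> bool:
    """True only when there are no triggers AND the timeout has been exceeded."""
    return not active_triggers and (now - last_activity) > idle_timeout
```

Because the comparison is strict, a container whose last activity was exactly `idle_timeout` ago still counts as active until the next check.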
## Cleanup Policies by License Tier

| User Type    | Idle Timeout | PVC Policy | Notes |
|--------------|--------------|------------|-------|
| Anonymous    | 15 minutes   | Delete     | Ephemeral, no data retention |
| Free         | 15 minutes   | Retain     | Can resume session |
| Paid         | 60 minutes   | Retain     | Longer grace period |
| Enterprise   | No shutdown  | Retain     | Always-on containers |

Configured via `USER_TYPE` env var in deployment.

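One way to picture the table is as a lookup keyed by `USER_TYPE`. The sketch below is illustrative only: `CleanupPolicy`, `POLICIES`, and `policy_for` are hypothetical names, and raising `KeyError` for an unknown tier is an assumption rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CleanupPolicy:
    idle_timeout_minutes: Optional[int]  # None means the container never idles out
    delete_pvc: bool


# Values copied from the table above.
POLICIES = {
    "anonymous": CleanupPolicy(15, True),
    "free": CleanupPolicy(15, False),
    "paid": CleanupPolicy(60, False),
    "enterprise": CleanupPolicy(None, False),
}


def policy_for(user_type: str) -> CleanupPolicy:
    """Look up the policy for a USER_TYPE value (raises KeyError if unknown)."""
    return POLICIES[user_type.strip().lower()]
```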
## Security

### Principle of Least Privilege

**Gateway**:
- Can create agent resources
- Cannot delete agent resources
- Cannot access other namespaces
- Cannot exec into pods

**Lifecycle Sidecar**:
- Can delete its own deployment only
- Cannot delete other deployments
- Scoped to dexorder-agents namespace
- No exec, no secrets access

### Admission Control

All deployments in `dexorder-agents` namespace are subject to:
- Image allowlist (only approved images)
- Security context enforcement (non-root, drop caps, read-only rootfs)
- Resource limits required
- PodSecurity standards (restricted profile)

See `deploy/k8s/base/admission-policy.yaml`

### Network Isolation

Agents are network-isolated via NetworkPolicy:
- Can connect to gateway (MCP)
- Can connect to Redpanda (data streams)
- Can make outbound HTTPS (exchanges, LLM APIs)
- Cannot access k8s API
- Cannot access system namespace
- Cannot access other agent pods

See `deploy/k8s/base/network-policies.yaml`

## Deployment

### 1. Apply Security Policies

```bash
kubectl apply -k deploy/k8s/dev  # or prod
```

This creates:
- Namespaces (`dexorder-system`, `dexorder-agents`)
- RBAC (gateway, lifecycle sidecar)
- Admission policies
- Network policies
- Resource quotas

### 2. Build and Push Lifecycle Sidecar

```bash
cd lifecycle-sidecar
docker build -t ghcr.io/dexorder/lifecycle-sidecar:latest .
docker push ghcr.io/dexorder/lifecycle-sidecar:latest
```

### 3. Gateway Creates Agent Deployments

When a user connects, the gateway creates:
- Deployment with agent + sidecar
- PVC for persistent data
- Service for MCP endpoint

See `deploy/k8s/base/agent-deployment-example.yaml` for template.

## Testing

### Test Lifecycle Manager Locally

```python
import asyncio

from dexorder.lifecycle_manager import LifecycleManager


async def main():
    # Disable actual shutdown for testing
    manager = LifecycleManager(
        idle_timeout_minutes=1,
        check_interval_seconds=10,
        enable_shutdown=False,  # Only log, don't exit
    )

    await manager.start()

    # Simulate activity
    manager.record_activity()

    # Simulate triggers
    manager.add_trigger("test_trigger")
    await asyncio.sleep(70)  # Wait past timeout; the trigger keeps the container active
    manager.remove_trigger("test_trigger")
    await asyncio.sleep(70)  # Should detect idle

    await manager.stop()


asyncio.run(main())
```

### Test Sidecar Locally

```bash
# Build
cd lifecycle-sidecar
go build -o lifecycle-sidecar main.go

# Run (requires k8s config)
export NAMESPACE=dexorder-agents
export DEPLOYMENT_NAME=agent-test
export USER_TYPE=free
./lifecycle-sidecar
```

### Integration Test

1. Deploy test agent with sidecar
2. Verify agent starts and is healthy
3. Stop sending MCP calls and remove all triggers
4. Wait for idle timeout + check interval
5. Verify deployment is deleted

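The wait in step 4 can be bounded with simple arithmetic: the idle condition is only evaluated once per check interval, so detection can lag the timeout by up to one interval. The helper below is a hypothetical sketch for test scripts; the 30-second safety margin for the deletion call itself is an assumption, not a measured figure.

```python
def max_wait_seconds(
    idle_timeout_minutes: int,
    check_interval_seconds: int,
    margin_seconds: int = 30,  # assumed slack for the k8s delete to complete
) -> int:
    """Upper bound on how long a test should wait before checking for deletion."""
    return idle_timeout_minutes * 60 + check_interval_seconds + margin_seconds
```

With the documented defaults (15 minutes, 60 seconds) the bound is 960 seconds before the margin is added.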
## Troubleshooting

### Container not shutting down when idle

Check logs:
```bash
kubectl logs -n dexorder-agents agent-user-abc123 -c agent
```

Verify:
- `ENABLE_IDLE_SHUTDOWN=true`
- No active triggers: `manager.active_triggers` should be empty
- Idle timeout exceeded

### Sidecar not deleting deployment

Check sidecar logs:
```bash
kubectl logs -n dexorder-agents agent-user-abc123 -c lifecycle-sidecar
```

Verify:
- Exit code file exists: `/var/run/agent/exit_code` contains `42`
- RBAC permissions: `kubectl auth can-i delete deployments --as=system:serviceaccount:dexorder-agents:agent-lifecycle -n dexorder-agents`
- Deployment name matches: Check `DEPLOYMENT_NAME` env var

### Gateway can't create deployments

Check gateway logs and verify:
- ServiceAccount exists: `kubectl get sa gateway -n dexorder-system`
- RoleBinding exists: `kubectl get rolebinding gateway-agent-creator -n dexorder-agents`
- Admission policy allows image: Check image name matches allowlist in `admission-policy.yaml`

## Future Enhancements

1. **Graceful shutdown notifications**: Warn users before shutdown via WebSocket
2. **Predictive scaling**: Keep frequently-used containers warm
3. **Tiered storage**: Move old PVCs to cheaper storage class
4. **Metrics**: Expose lifecycle metrics (idle rate, shutdown count, etc.)
5. **Cost allocation**: Track resource usage per user/license tier