Container Lifecycle Management
Overview
User agent containers self-manage their lifecycle to optimize resource usage. Containers automatically shut down when idle (no triggers + no recent activity) and clean themselves up using a lifecycle sidecar.
Architecture
┌──────────────────────────────────────────────────────────┐
│ Agent Pod │
│ ┌───────────────────┐ ┌──────────────────────┐ │
│ │ Agent Container │ │ Lifecycle Sidecar │ │
│ │ ─────────────── │ │ ────────────────── │ │
│ │ │ │ │ │
│ │ Lifecycle Manager │ │ Watches exit code │ │
│ │ - Track activity │ │ - Detects exit 42 │ │
│ │ - Track triggers │ │ - Calls k8s API │ │
│ │ - Exit 42 if idle │ │ - Deletes deployment │ │
│ └───────────────────┘ └──────────────────────┘ │
│ │ │ │
│ │ writes exit_code │ │
│ └────►/var/run/agent/exit_code │
│ │ │
└───────────────────────────────────────┼──────────────────┘
│
▼ k8s API (RBAC)
┌─────────────────────┐
│ Delete Deployment │
│ Delete PVC (if anon)│
└─────────────────────┘
Components
1. Lifecycle Manager (Python)
Location: client-py/dexorder/lifecycle_manager.py
Runs inside the agent container and tracks:
- Activity: MCP tool/resource/prompt calls reset the idle timer
- Triggers: Data subscriptions, CEP patterns, etc.
- Idle state: No triggers + idle timeout exceeded
Configuration (via environment variables):
- IDLE_TIMEOUT_MINUTES: Minutes before shutdown (default: 15)
- IDLE_CHECK_INTERVAL_SECONDS: Check frequency (default: 60)
- ENABLE_IDLE_SHUTDOWN: Enable/disable shutdown (default: true)
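As an illustration of how these variables might be read, here is a minimal sketch that falls back to the documented defaults. The `LifecycleConfig` and `load_lifecycle_config` names are hypothetical and are not taken from `lifecycle_manager.py`; the set of accepted truthy strings is also an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LifecycleConfig:
    """Hypothetical container for the three documented settings."""
    idle_timeout_minutes: int = 15
    check_interval_seconds: int = 60
    enable_idle_shutdown: bool = True


def load_lifecycle_config(env: Optional[Mapping[str, str]] = None) -> LifecycleConfig:
    """Read settings from the environment, using the documented defaults."""
    env = os.environ if env is None else env
    # Assumption: "true", "1" and "yes" (any case) enable shutdown;
    # every other value disables it.
    raw_flag = env.get("ENABLE_IDLE_SHUTDOWN", "true").strip().lower()
    return LifecycleConfig(
        idle_timeout_minutes=int(env.get("IDLE_TIMEOUT_MINUTES", "15")),
        check_interval_seconds=int(env.get("IDLE_CHECK_INTERVAL_SECONDS", "60")),
        enable_idle_shutdown=raw_flag in {"true", "1", "yes"},
    )
```

Passing a plain dict instead of `os.environ` keeps the function easy to exercise in tests.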
Usage in agent code:
from dexorder.lifecycle_manager import get_lifecycle_manager
# On startup
manager = get_lifecycle_manager()
await manager.start()
# On MCP calls (tool/resource/prompt)
manager.record_activity()
# When triggers change
manager.add_trigger("data_sub_BTC_USDT")
manager.remove_trigger("data_sub_BTC_USDT")
# Or batch update
manager.update_triggers({"trigger_1", "trigger_2"})
Exit behavior:
- Idle shutdown: Exit with code 42
- Signal (SIGTERM/SIGINT): Exit with code 0 (allows restart)
- Errors/crashes: Exit with error code (allows restart)
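The exit behavior above can be sketched as a small mapping from shutdown reason to process exit code. `ShutdownReason` and `exit_code_for` are hypothetical names used for illustration. The guard that stops an error code from colliding with 0 or 42 is an assumption, not documented behavior.

```python
import enum


class ShutdownReason(enum.Enum):
    IDLE = "idle"
    SIGNAL = "signal"  # SIGTERM / SIGINT
    ERROR = "error"


IDLE_EXIT_CODE = 42  # documented: signals the sidecar to clean up


def exit_code_for(reason: ShutdownReason, error_code: int = 1) -> int:
    """Map a shutdown reason to the exit code described above."""
    if reason is ShutdownReason.IDLE:
        return IDLE_EXIT_CODE
    if reason is ShutdownReason.SIGNAL:
        return 0
    # Assumption: an error must never be reported as 0 (clean exit) or
    # 42 (idle), otherwise a crash could be mistaken for an idle shutdown.
    if error_code in (0, IDLE_EXIT_CODE):
        return 1
    return error_code
```

Reserving 42 for the idle case matters because it is the only code that leads to deletion of the deployment.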
2. Lifecycle Sidecar (Go)
Location: lifecycle-sidecar/
Runs alongside the agent container with shared PID namespace. Monitors the main container process and:
- On exit code 42: Deletes deployment (and PVC if anonymous user)
- On any other exit code: Exits with same code (k8s restarts pod)
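The sidecar itself is written in Go. The following Python sketch only illustrates the decision rule described above, expressed as a pure function that returns a plan instead of calling the k8s API. The function name and the action strings are hypothetical, and the order (deployment first, then PVC) is an assumption.

```python
from typing import List

IDLE_EXIT_CODE = 42


def plan_sidecar_actions(exit_code: int, user_type: str) -> List[str]:
    """Return the actions the sidecar would take for a given exit code."""
    if exit_code != IDLE_EXIT_CODE:
        # Propagate the code unchanged so k8s restarts the pod.
        return [f"exit:{exit_code}"]
    actions = ["delete_deployment"]
    if user_type == "anonymous":
        # Only anonymous users lose their persistent volume claim.
        actions.append("delete_pvc")
    return actions
```

Keeping the decision separate from the API calls makes the rule testable without a cluster.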
Configuration (via environment, injected by downward API):
- NAMESPACE: Pod's namespace
- DEPLOYMENT_NAME: Deployment name (from pod label)
- USER_TYPE: License tier (anonymous, free, paid, enterprise)
- MAIN_CONTAINER_PID: PID of main container (default: 1)
RBAC: Has permission to delete deployments and PVCs only in the dexorder-agents namespace. It cannot delete other deployments because:
- Only knows its own deployment name (from env)
- RBAC scoped to namespace
- No cross-pod communication
3. Gateway (TypeScript)
Location: gateway/src/harness/agent-harness.ts
Creates agent deployments when users connect. Has permissions to:
- ✅ Create deployments, services, PVCs
- ✅ Read pod status and logs
- ✅ Update deployments (e.g., resource limits)
- ❌ Delete deployments (handled by sidecar)
- ❌ Exec into pods
- ❌ Access secrets
Lifecycle States
┌─────────────┐
│ CREATED │ ← Gateway creates deployment
└──────┬──────┘
│
▼
┌─────────────┐
│ RUNNING │ ← User interacts, has triggers
└──────┬──────┘
│
▼
┌─────────────┐
│ IDLE │ ← No triggers + timeout exceeded
└──────┬──────┘
│
▼
┌─────────────┐
│ SHUTDOWN │ ← Exit code 42
└──────┬──────┘
│
▼
┌─────────────┐
│ DELETED │ ← Sidecar deletes deployment
└─────────────┘
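The diagram above describes a strictly linear progression. One way to represent it in code is an enum with a successor table, as in this sketch. The `LifecycleState` and `next_state` names are hypothetical, and the sketch assumes there are no transitions beyond the ones drawn.

```python
import enum
from typing import Optional


class LifecycleState(enum.Enum):
    CREATED = "created"
    RUNNING = "running"
    IDLE = "idle"
    SHUTDOWN = "shutdown"
    DELETED = "deleted"


# Successor of each state, following the arrows in the diagram.
_NEXT = {
    LifecycleState.CREATED: LifecycleState.RUNNING,
    LifecycleState.RUNNING: LifecycleState.IDLE,
    LifecycleState.IDLE: LifecycleState.SHUTDOWN,
    LifecycleState.SHUTDOWN: LifecycleState.DELETED,
}


def next_state(state: LifecycleState) -> Optional[LifecycleState]:
    """Return the following state, or None once the deployment is deleted."""
    return _NEXT.get(state)
```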
Idle Detection Logic
Container is IDLE when:
active_triggers.isEmpty() AND (now - last_activity) > idle_timeout
Container is ACTIVE when:
- Has any active triggers (data subscriptions, CEP patterns, etc.) OR
- Recent user activity (MCP calls within timeout)
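The idle condition translates directly into a predicate. This is a minimal sketch under the assumption that triggers are tracked as a set of names and activity as a single timestamp. `is_idle` is an illustrative name, not necessarily a function in `lifecycle_manager.py`.

```python
from datetime import datetime, timedelta
from typing import Set


def is_idle(
    active_triggers: Set[str],
    last_activity: datetime,
    now: datetime,
    idle_timeout: timedelta,
) -> bool:
    """True only when there are no triggers AND the idle timeout has passed."""
    return not active_triggers and (now - last_activity) > idle_timeout
```

Passing `now` in explicitly, instead of reading the clock inside the function, makes the boundary cases easy to check.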
Cleanup Policies by License Tier
| User Type | Idle Timeout | PVC Policy | Notes |
|---|---|---|---|
| Anonymous | 15 minutes | Delete | Ephemeral, no data retention |
| Free | 15 minutes | Retain | Can resume session |
| Paid | 60 minutes | Retain | Longer grace period |
| Enterprise | No shutdown | Retain | Always-on containers |
Configured via USER_TYPE env var in deployment.
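The table can be encoded as a lookup keyed by `USER_TYPE`. The sketch below is illustrative: `CleanupPolicy`, `POLICIES` and `policy_for` are hypothetical names, and rejecting unknown tiers (instead of falling back to a default) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CleanupPolicy:
    idle_timeout_minutes: Optional[int]  # None means the container never shuts down
    delete_pvc: bool


# Values copied from the table above.
POLICIES = {
    "anonymous": CleanupPolicy(idle_timeout_minutes=15, delete_pvc=True),
    "free": CleanupPolicy(idle_timeout_minutes=15, delete_pvc=False),
    "paid": CleanupPolicy(idle_timeout_minutes=60, delete_pvc=False),
    "enterprise": CleanupPolicy(idle_timeout_minutes=None, delete_pvc=False),
}


def policy_for(user_type: str) -> CleanupPolicy:
    """Look up the cleanup policy for a USER_TYPE value."""
    try:
        return POLICIES[user_type.strip().lower()]
    except KeyError:
        # Assumption: fail loudly on unknown tiers so a typo can never
        # silently select a policy that deletes user data.
        raise ValueError(f"unknown USER_TYPE: {user_type!r}") from None
```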
Security
Principle of Least Privilege
Gateway:
- Can create agent resources
- Cannot delete agent resources
- Cannot access other namespaces
- Cannot exec into pods
Lifecycle Sidecar:
- Can delete its own deployment only
- Cannot delete other deployments
- Scoped to dexorder-agents namespace
- No exec, no secrets access
Admission Control
All deployments in dexorder-agents namespace are subject to:
- Image allowlist (only approved images)
- Security context enforcement (non-root, drop caps, read-only rootfs)
- Resource limits required
- PodSecurity standards (restricted profile)
See deploy/k8s/base/admission-policy.yaml
Network Isolation
Agents are network-isolated via NetworkPolicy:
- Can connect to gateway (MCP)
- Can connect to Redpanda (data streams)
- Can make outbound HTTPS (exchanges, LLM APIs)
- Cannot access k8s API
- Cannot access system namespace
- Cannot access other agent pods
See deploy/k8s/base/network-policies.yaml
Deployment
1. Apply Security Policies
kubectl apply -k deploy/k8s/dev # or prod
This creates:
- Namespaces (dexorder-system, dexorder-agents)
- RBAC (gateway, lifecycle sidecar)
- Admission policies
- Network policies
- Resource quotas
2. Build and Push Lifecycle Sidecar
cd lifecycle-sidecar
docker build -t ghcr.io/dexorder/lifecycle-sidecar:latest .
docker push ghcr.io/dexorder/lifecycle-sidecar:latest
3. Gateway Creates Agent Deployments
When a user connects, the gateway creates:
- Deployment with agent + sidecar
- PVC for persistent data
- Service for MCP endpoint
See deploy/k8s/base/agent-deployment-example.yaml for template.
Testing
Test Lifecycle Manager Locally
import asyncio
from dexorder.lifecycle_manager import LifecycleManager

async def main():
    # Disable actual shutdown for testing
    manager = LifecycleManager(
        idle_timeout_minutes=1,
        check_interval_seconds=10,
        enable_shutdown=False,  # Only log, don't exit
    )
    await manager.start()
    # Simulate activity
    manager.record_activity()
    # Simulate triggers
    manager.add_trigger("test_trigger")
    await asyncio.sleep(70)  # Past the timeout, but the trigger keeps the container active
    manager.remove_trigger("test_trigger")
    await asyncio.sleep(70)  # No triggers and no recent activity: should detect idle
    await manager.stop()

asyncio.run(main())
Test Sidecar Locally
# Build
cd lifecycle-sidecar
go build -o lifecycle-sidecar main.go
# Run (requires k8s config)
export NAMESPACE=dexorder-agents
export DEPLOYMENT_NAME=agent-test
export USER_TYPE=free
./lifecycle-sidecar
Integration Test
- Deploy test agent with sidecar
- Verify agent starts and is healthy
- Stop sending MCP calls and remove all triggers
- Wait for idle timeout + check interval
- Verify deployment is deleted
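For the waiting step, a test needs an upper bound on how long deletion can take and a way to poll for it. The sketch below shows one possible shape. Both helper names and the `slack_seconds` allowance are assumptions, and the predicate that actually checks whether the deployment still exists is left to the caller.

```python
import time
from typing import Callable


def max_wait_seconds(
    idle_timeout_minutes: int,
    check_interval_seconds: int,
    slack_seconds: int = 30,
) -> int:
    """Worst case: full idle timeout, one more check interval, plus slack."""
    return idle_timeout_minutes * 60 + check_interval_seconds + slack_seconds


def wait_until(
    predicate: Callable[[], bool],
    timeout_seconds: float,
    poll_seconds: float = 5.0,
) -> bool:
    """Poll predicate until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
```

With the documented defaults (15 minutes, 60 seconds) and the assumed 30 seconds of slack, the bound is 990 seconds.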
Troubleshooting
Container not shutting down when idle
Check logs:
kubectl logs -n dexorder-agents agent-user-abc123 -c agent
Verify:
- ENABLE_IDLE_SHUTDOWN=true
- No active triggers: manager.active_triggers should be empty
- Idle timeout exceeded
Sidecar not deleting deployment
Check sidecar logs:
kubectl logs -n dexorder-agents agent-user-abc123 -c lifecycle-sidecar
Verify:
- Exit code file exists: /var/run/agent/exit_code contains 42
- RBAC permissions: kubectl auth can-i delete deployments --as=system:serviceaccount:dexorder-agents:agent-lifecycle -n dexorder-agents
- Deployment name matches: Check DEPLOYMENT_NAME env var
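When checking the exit code file, it helps to distinguish "file missing", "file unreadable as a number" and "file contains 42". The sketch below assumes the file holds the exit code as decimal text, possibly with surrounding whitespace. That format is inferred from the check above and not confirmed by the source, and `read_exit_code` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional

EXIT_CODE_FILE = Path("/var/run/agent/exit_code")  # path from the architecture diagram


def read_exit_code(path: Path = EXIT_CODE_FILE) -> Optional[int]:
    """Return the recorded exit code, or None if missing or not an integer."""
    try:
        text = path.read_text().strip()
    except OSError:
        # Covers a missing file as well as permission problems.
        return None
    try:
        return int(text)
    except ValueError:
        return None
```

A result of `None` points at the agent never writing the file (or the shared volume not being mounted), while any integer other than 42 means the sidecar is expected to exit instead of deleting the deployment.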
Gateway can't create deployments
Check gateway logs and verify:
- ServiceAccount exists: kubectl get sa gateway -n dexorder-system
- RoleBinding exists: kubectl get rolebinding gateway-agent-creator -n dexorder-agents
- Admission policy allows image: Check image name matches allowlist in admission-policy.yaml
Future Enhancements
- Graceful shutdown notifications: Warn users before shutdown via websocket
- Predictive scaling: Keep frequently-used containers warm
- Tiered storage: Move old PVCs to cheaper storage class
- Metrics: Expose lifecycle metrics (idle rate, shutdown count, etc.)
- Cost allocation: Track resource usage per user/license tier