# Container Lifecycle Management

## Overview

User agent containers self-manage their lifecycle to optimize resource usage. Containers automatically shut down when idle (no triggers + no recent activity) and clean themselves up using a lifecycle sidecar.

## Architecture

```
┌──────────────────────────────────────────────────────────┐
│                        Agent Pod                         │
│  ┌───────────────────┐      ┌──────────────────────┐     │
│  │ Agent Container   │      │ Lifecycle Sidecar    │     │
│  │ ───────────────   │      │ ──────────────────   │     │
│  │                   │      │                      │     │
│  │ Lifecycle Manager │      │ Watches exit code    │     │
│  │ - Track activity  │      │ - Detects exit 42    │     │
│  │ - Track triggers  │      │ - Calls k8s API      │     │
│  │ - Exit 42 if idle │      │ - Deletes deployment │     │
│  └───────────────────┘      └──────────────────────┘     │
│       │                               │                  │
│       │ writes exit_code              │                  │
│       └────►/var/run/agent/exit_code  │                  │
│                                       │                  │
└───────────────────────────────────────┼──────────────────┘
                                        │
                                        ▼ k8s API (RBAC)
                             ┌─────────────────────┐
                             │ Delete Deployment   │
                             │ Delete PVC (if anon)│
                             └─────────────────────┘
```

## Components

### 1. Lifecycle Manager (Python)

**Location**: `client-py/dexorder/lifecycle_manager.py`

Runs inside the agent container and tracks:
- **Activity**: MCP tool/resource/prompt calls reset the idle timer
- **Triggers**: Data subscriptions, CEP patterns, etc.
- **Idle state**: No triggers + idle timeout exceeded

**Configuration** (via environment variables):
- `IDLE_TIMEOUT_MINUTES`: Minutes before shutdown (default: 15)
- `IDLE_CHECK_INTERVAL_SECONDS`: Check frequency (default: 60)
- `ENABLE_IDLE_SHUTDOWN`: Enable/disable shutdown (default: true)

**Usage in agent code**:

```python
from dexorder.lifecycle_manager import get_lifecycle_manager

# On startup
manager = get_lifecycle_manager()
await manager.start()

# On MCP calls (tool/resource/prompt)
manager.record_activity()

# When triggers change
manager.add_trigger("data_sub_BTC_USDT")
manager.remove_trigger("data_sub_BTC_USDT")

# Or batch update
manager.update_triggers({"trigger_1", "trigger_2"})
```

**Exit behavior**:
- Idle shutdown: Exit with code `42`
- Signal (SIGTERM/SIGINT): Exit with code `0` (allows restart)
- Errors/crashes: Exit with error code (allows restart)

### 2. Lifecycle Sidecar (Go)

**Location**: `lifecycle-sidecar/`

Runs alongside the agent container with shared PID namespace. Monitors the main container process and:
- On exit code `42`: Deletes deployment (and PVC if anonymous user)
- On any other exit code: Exits with same code (k8s restarts pod)

**Configuration** (via environment, injected by downward API):
- `NAMESPACE`: Pod's namespace
- `DEPLOYMENT_NAME`: Deployment name (from pod label)
- `USER_TYPE`: License tier (`anonymous`, `free`, `paid`, `enterprise`)
- `MAIN_CONTAINER_PID`: PID of main container (default: 1)

**RBAC**: Has permission to delete deployments and PVCs **only in dexorder-agents namespace**. Cannot delete other deployments due to:
1. Only knows its own deployment name (from env)
2. RBAC scoped to namespace
3. No cross-pod communication

### 3. Gateway (TypeScript)

**Location**: `gateway/src/harness/agent-harness.ts`

Creates agent deployments when users connect.
Has permissions to:
- ✅ Create deployments, services, PVCs
- ✅ Read pod status and logs
- ✅ Update deployments (e.g., resource limits)
- ❌ Delete deployments (handled by sidecar)
- ❌ Exec into pods
- ❌ Access secrets

## Lifecycle States

```
┌─────────────┐
│   CREATED   │  ← Gateway creates deployment
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   RUNNING   │  ← User interacts, has triggers
└──────┬──────┘
       │
       ▼
┌─────────────┐
│    IDLE     │  ← No triggers + timeout exceeded
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  SHUTDOWN   │  ← Exit code 42
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   DELETED   │  ← Sidecar deletes deployment
└─────────────┘
```

## Idle Detection Logic

Container is **IDLE** when:
1. `active_triggers.isEmpty()` AND
2. `(now - last_activity) > idle_timeout`

Container is **ACTIVE** when:
1. Has any active triggers (data subscriptions, CEP patterns, etc.) OR
2. Recent user activity (MCP calls within timeout)

## Cleanup Policies by License Tier

| User Type  | Idle Timeout | PVC Policy | Notes                        |
|------------|--------------|------------|------------------------------|
| Anonymous  | 15 minutes   | Delete     | Ephemeral, no data retention |
| Free       | 15 minutes   | Retain     | Can resume session           |
| Paid       | 60 minutes   | Retain     | Longer grace period          |
| Enterprise | No shutdown  | Retain     | Always-on containers         |

Configured via `USER_TYPE` env var in deployment.

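
The idle rule and the tier table above can be combined into one small decision sketch. This is an illustration only, not the code in `lifecycle_manager.py` or the Go sidecar: `TierPolicy`, `TIER_POLICIES`, `is_idle`, and `cleanup_actions` are hypothetical names, and timestamps are assumed to be plain seconds (for example from `time.monotonic()`).

```python
from dataclasses import dataclass
from typing import List, Optional, Set

IDLE_EXIT_CODE = 42  # exit code the document reserves for idle shutdown


@dataclass(frozen=True)
class TierPolicy:
    idle_timeout_minutes: Optional[int]  # None means "never shut down"
    delete_pvc: bool


# Values copied from the tier table; the mapping itself is hypothetical.
TIER_POLICIES = {
    "anonymous": TierPolicy(idle_timeout_minutes=15, delete_pvc=True),
    "free": TierPolicy(idle_timeout_minutes=15, delete_pvc=False),
    "paid": TierPolicy(idle_timeout_minutes=60, delete_pvc=False),
    "enterprise": TierPolicy(idle_timeout_minutes=None, delete_pvc=False),
}


def is_idle(active_triggers: Set[str], last_activity: float,
            now: float, policy: TierPolicy) -> bool:
    """Idle = no active triggers AND inactivity strictly longer than the timeout."""
    if policy.idle_timeout_minutes is None:
        return False  # always-on tier
    if active_triggers:
        return False  # any trigger keeps the container active
    return (now - last_activity) > policy.idle_timeout_minutes * 60


def cleanup_actions(exit_code: int, user_type: str) -> List[str]:
    """Resources the sidecar would delete for a given agent exit code."""
    if exit_code != IDLE_EXIT_CODE:
        return []  # not an idle shutdown: nothing is deleted, k8s restarts the pod
    actions = ["delete_deployment"]
    if TIER_POLICIES[user_type].delete_pvc:
        actions.append("delete_pvc")
    return actions
```

For instance, `cleanup_actions(42, "anonymous")` yields both deletions, while `cleanup_actions(42, "free")` keeps the PVC so the session can be resumed.
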
## Security

### Principle of Least Privilege

**Gateway**:
- Can create agent resources
- Cannot delete agent resources
- Cannot access other namespaces
- Cannot exec into pods

**Lifecycle Sidecar**:
- Can delete its own deployment only
- Cannot delete other deployments
- Scoped to dexorder-agents namespace
- No exec, no secrets access

### Admission Control

All deployments in `dexorder-agents` namespace are subject to:
- Image allowlist (only approved images)
- Security context enforcement (non-root, drop caps, read-only rootfs)
- Resource limits required
- PodSecurity standards (restricted profile)

See `deploy/k8s/base/admission-policy.yaml`

### Network Isolation

Agents are network-isolated via NetworkPolicy:
- Can connect to gateway (MCP)
- Can connect to Redpanda (data streams)
- Can make outbound HTTPS (exchanges, LLM APIs)
- Cannot access k8s API
- Cannot access system namespace
- Cannot access other agent pods

See `deploy/k8s/base/network-policies.yaml`

## Deployment

### 1. Apply Security Policies

```bash
kubectl apply -k deploy/k8s/dev # or prod
```

This creates:
- Namespaces (`dexorder-system`, `dexorder-agents`)
- RBAC (gateway, lifecycle sidecar)
- Admission policies
- Network policies
- Resource quotas

### 2. Build and Push Lifecycle Sidecar

```bash
cd lifecycle-sidecar
docker build -t ghcr.io/dexorder/lifecycle-sidecar:latest .
docker push ghcr.io/dexorder/lifecycle-sidecar:latest
```

### 3. Gateway Creates Agent Deployments

When a user connects, the gateway creates:
- Deployment with agent + sidecar
- PVC for persistent data
- Service for MCP endpoint

See `deploy/k8s/base/agent-deployment-example.yaml` for template.

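
To make the "Deployment with agent + sidecar" shape more concrete, here is a rough sketch that builds such a manifest as a plain Python dict. It is not the gateway's TypeScript code and not the contents of `agent-deployment-example.yaml`. The container names, namespace, service account, and exit-code path come from this document; the label key `dexorder.io/deployment`, the `agent-run` `emptyDir` volume, and the function name are assumptions made for illustration. The PVC, resource limits, and security context that the admission policy requires are left out for brevity.

```python
from typing import Any, Dict

NAMESPACE = "dexorder-agents"                # namespace named in this document
EXIT_CODE_DIR = "/var/run/agent"             # directory holding exit_code
DEPLOYMENT_LABEL = "dexorder.io/deployment"  # hypothetical label key


def build_agent_deployment(name: str, user_type: str,
                           agent_image: str, sidecar_image: str) -> Dict[str, Any]:
    """Return a Deployment manifest (as a dict) with agent + lifecycle sidecar."""
    labels = {DEPLOYMENT_LABEL: name}
    # Assumed: both containers share an emptyDir so the sidecar can read exit_code.
    shared_mount = {"name": "agent-run", "mountPath": EXIT_CODE_DIR}
    sidecar_env = [
        # Downward API: the pod's own namespace.
        {"name": "NAMESPACE",
         "valueFrom": {"fieldRef": {"fieldPath": "metadata.namespace"}}},
        # Downward API: deployment name read back from a pod label.
        {"name": "DEPLOYMENT_NAME",
         "valueFrom": {"fieldRef": {
             "fieldPath": f"metadata.labels['{DEPLOYMENT_LABEL}']"}}},
        {"name": "USER_TYPE", "value": user_type},
    ]
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "namespace": NAMESPACE, "labels": labels},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "serviceAccountName": "agent-lifecycle",
                    "shareProcessNamespace": True,  # shared PID namespace
                    "containers": [
                        {"name": "agent", "image": agent_image,
                         "volumeMounts": [shared_mount]},
                        {"name": "lifecycle-sidecar", "image": sidecar_image,
                         "env": sidecar_env, "volumeMounts": [shared_mount]},
                    ],
                    "volumes": [{"name": "agent-run", "emptyDir": {}}],
                },
            },
        },
    }
```

Serializing the returned dict with a YAML or JSON dumper gives something that can be compared against the real template; any differences there should be resolved in favor of the template.
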
## Testing

### Test Lifecycle Manager Locally

```python
import asyncio

from dexorder.lifecycle_manager import LifecycleManager


async def main():
    # Disable actual shutdown for testing
    manager = LifecycleManager(
        idle_timeout_minutes=1,
        check_interval_seconds=10,
        enable_shutdown=False  # Only log, don't exit
    )

    await manager.start()

    # Simulate activity
    manager.record_activity()

    # Simulate triggers
    manager.add_trigger("test_trigger")
    await asyncio.sleep(70)  # Wait past timeout; trigger keeps it active
    manager.remove_trigger("test_trigger")
    await asyncio.sleep(70)  # Should detect idle

    await manager.stop()


asyncio.run(main())
```

### Test Sidecar Locally

```bash
# Build
cd lifecycle-sidecar
go build -o lifecycle-sidecar main.go

# Run (requires k8s config)
export NAMESPACE=dexorder-agents
export DEPLOYMENT_NAME=agent-test
export USER_TYPE=free
./lifecycle-sidecar
```

### Integration Test

1. Deploy test agent with sidecar
2. Verify agent starts and is healthy
3. Stop sending MCP calls and remove all triggers
4. Wait for idle timeout + check interval
5. Verify deployment is deleted

## Troubleshooting

### Container not shutting down when idle

Check logs:

```bash
kubectl logs -n dexorder-agents agent-user-abc123 -c agent
```

Verify:
- `ENABLE_IDLE_SHUTDOWN=true`
- No active triggers: `manager.active_triggers` should be empty
- Idle timeout exceeded

### Sidecar not deleting deployment

Check sidecar logs:

```bash
kubectl logs -n dexorder-agents agent-user-abc123 -c lifecycle-sidecar
```

Verify:
- Exit code file exists: `/var/run/agent/exit_code` contains `42`
- RBAC permissions: `kubectl auth can-i delete deployments --as=system:serviceaccount:dexorder-agents:agent-lifecycle -n dexorder-agents`
- Deployment name matches: Check `DEPLOYMENT_NAME` env var

### Gateway can't create deployments

Check gateway logs and verify:
- ServiceAccount exists: `kubectl get sa gateway -n dexorder-system`
- RoleBinding exists: `kubectl get rolebinding gateway-agent-creator -n dexorder-agents`
- Admission policy allows image: Check image name matches allowlist in `admission-policy.yaml`

## Future Enhancements

1. **Graceful shutdown notifications**: Warn users before shutdown via websocket
2. **Predictive scaling**: Keep frequently-used containers warm
3. **Tiered storage**: Move old PVCs to cheaper storage class
4. **Metrics**: Expose lifecycle metrics (idle rate, shutdown count, etc.)
5. **Cost allocation**: Track resource usage per user/license tier