Container Lifecycle Management

Overview

User agent containers manage their own lifecycle to limit resource usage. A container shuts itself down when it is idle (no active triggers and no recent activity), and a lifecycle sidecar in the same pod then deletes the deployment.

Architecture

┌──────────────────────────────────────────────────────────┐
│                   Agent Pod                              │
│  ┌───────────────────┐       ┌──────────────────────┐   │
│  │  Agent Container  │       │ Lifecycle Sidecar    │   │
│  │  ───────────────  │       │ ──────────────────   │   │
│  │                   │       │                      │   │
│  │ Lifecycle Manager │       │ Watches exit code    │   │
│  │ - Track activity  │       │ - Detects exit 42    │   │
│  │ - Track triggers  │       │ - Calls k8s API      │   │
│  │ - Exit 42 if idle │       │ - Deletes deployment │   │
│  └───────────────────┘       └──────────────────────┘   │
│           │                           │                  │
│           │ writes exit_code          │                  │
│           └────►/var/run/agent/exit_code                │
│                                       │                  │
└───────────────────────────────────────┼──────────────────┘
                                        │
                                        ▼ k8s API (RBAC)
                              ┌─────────────────────┐
                              │ Delete Deployment   │
                              │ Delete PVC (if anon)│
                              └─────────────────────┘

Components

1. Lifecycle Manager (Python)

Location: client-py/dexorder/lifecycle_manager.py

Runs inside the agent container and tracks:

  • Activity: MCP tool/resource/prompt calls reset the idle timer
  • Triggers: Data subscriptions, CEP patterns, etc.
  • Idle state: No triggers + idle timeout exceeded

Configuration (via environment variables):

  • IDLE_TIMEOUT_MINUTES: Minutes before shutdown (default: 15)
  • IDLE_CHECK_INTERVAL_SECONDS: Check frequency (default: 60)
  • ENABLE_IDLE_SHUTDOWN: Enable/disable shutdown (default: true)
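
As a rough illustration, the three variables above could be read with their documented defaults as follows. This is a minimal sketch, not the actual code in lifecycle_manager.py: the `LifecycleConfig` class, the `load_config` helper, and the set of strings accepted as "true" are assumptions made for the example.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class LifecycleConfig:
    """Hypothetical container for the documented settings."""
    idle_timeout_minutes: int = 15
    check_interval_seconds: int = 60
    enable_shutdown: bool = True


def load_config(env=None) -> LifecycleConfig:
    """Read the settings from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    # Assumption: "true", "1" and "yes" (case-insensitive) enable shutdown.
    raw_flag = env.get("ENABLE_IDLE_SHUTDOWN", "true").strip().lower()
    return LifecycleConfig(
        idle_timeout_minutes=int(env.get("IDLE_TIMEOUT_MINUTES", "15")),
        check_interval_seconds=int(env.get("IDLE_CHECK_INTERVAL_SECONDS", "60")),
        enable_shutdown=raw_flag in {"true", "1", "yes"},
    )
```

Passing a plain dict instead of `os.environ` keeps the helper easy to exercise in unit tests.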

Usage in agent code:

from dexorder.lifecycle_manager import get_lifecycle_manager

# On startup
manager = get_lifecycle_manager()
await manager.start()

# On MCP calls (tool/resource/prompt)
manager.record_activity()

# When triggers change
manager.add_trigger("data_sub_BTC_USDT")
manager.remove_trigger("data_sub_BTC_USDT")

# Or batch update
manager.update_triggers({"trigger_1", "trigger_2"})
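
One way to avoid forgetting `record_activity()` in individual handlers is to wrap them. The decorator below is a hypothetical helper, not part of dexorder; it assumes only that the manager exposes the `record_activity()` method shown above and that MCP handlers are async functions.

```python
import functools


def tracks_activity(manager):
    """Return a decorator that resets the idle timer before each handler call."""
    def decorator(handler):
        @functools.wraps(handler)
        async def wrapper(*args, **kwargs):
            # Record activity first so a slow handler still counts as recent use.
            manager.record_activity()
            return await handler(*args, **kwargs)
        return wrapper
    return decorator
```

A handler would then be declared as `@tracks_activity(manager)` above its `async def`.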

Exit behavior:

  • Idle shutdown: Exit with code 42
  • Signal (SIGTERM/SIGINT): Exit with code 0 (allows restart)
  • Errors/crashes: Exit with error code (allows restart)
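
The mapping from shutdown reason to exit code can be sketched as below. The `ShutdownReason` enum and `exit_code_for` function are illustrative names, and the guard that rejects 0 and 42 as error codes is an assumption added so that a crash can never be mistaken for an idle shutdown.

```python
from enum import Enum

IDLE_EXIT_CODE = 42  # the code the sidecar treats as "idle, clean me up"


class ShutdownReason(Enum):
    IDLE = "idle"
    SIGNAL = "signal"  # SIGTERM / SIGINT
    ERROR = "error"


def exit_code_for(reason: ShutdownReason, error_code: int = 1) -> int:
    """Pick the process exit code for a given shutdown reason."""
    if reason is ShutdownReason.IDLE:
        return IDLE_EXIT_CODE
    if reason is ShutdownReason.SIGNAL:
        return 0
    # Assumption: error codes must not collide with the two reserved values.
    if error_code in (0, IDLE_EXIT_CODE):
        raise ValueError(f"error_code {error_code} is reserved")
    return error_code
```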

2. Lifecycle Sidecar (Go)

Location: lifecycle-sidecar/

Runs alongside the agent container in a shared PID namespace. It monitors the main container process and:

  • On exit code 42: Deletes deployment (and PVC if anonymous user)
  • On any other exit code: Exits with same code (k8s restarts pod)
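
The sidecar itself is written in Go; the branching described above is shown here in Python only to keep the examples in one language. The `SidecarAction` type and `decide` function are hypothetical, and the sidecar's own exit code of 0 after a successful cleanup is an assumption that the text does not state.

```python
from dataclasses import dataclass

IDLE_EXIT_CODE = 42


@dataclass(frozen=True)
class SidecarAction:
    delete_deployment: bool
    delete_pvc: bool
    exit_code: int  # the code the sidecar itself exits with


def decide(main_exit_code: int, user_type: str) -> SidecarAction:
    """Decide what the sidecar does once the main container has exited."""
    if main_exit_code == IDLE_EXIT_CODE:
        # Idle shutdown: remove the deployment, and the PVC for anonymous users.
        return SidecarAction(
            delete_deployment=True,
            delete_pvc=(user_type == "anonymous"),
            exit_code=0,  # assumption: cleanup counts as success
        )
    # Any other code is propagated unchanged so Kubernetes can restart.
    return SidecarAction(False, False, main_exit_code)
```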

Configuration (via environment variables, injected by the Downward API):

  • NAMESPACE: Pod's namespace
  • DEPLOYMENT_NAME: Deployment name (from pod label)
  • USER_TYPE: License tier (anonymous, free, paid, enterprise)
  • MAIN_CONTAINER_PID: PID of main container (default: 1)

RBAC: The sidecar has permission to delete deployments and PVCs in the dexorder-agents namespace only. It cannot delete other deployments because:

  1. It only knows its own deployment name (from env)
  2. Its RBAC role is scoped to the namespace
  3. There is no cross-pod communication

3. Gateway (TypeScript)

Location: gateway/src/harness/agent-harness.ts

Creates agent deployments when users connect.

Can:

  • Create deployments, services, PVCs
  • Read pod status and logs
  • Update deployments (e.g., resource limits)

Cannot:

  • Delete deployments (handled by sidecar)
  • Exec into pods
  • Access secrets

Lifecycle States

┌─────────────┐
│   CREATED   │ ← Gateway creates deployment
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   RUNNING   │ ← User interacts, has triggers
└──────┬──────┘
       │
       ▼
┌─────────────┐
│    IDLE     │ ← No triggers + timeout exceeded
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  SHUTDOWN   │ ← Exit code 42
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   DELETED   │ ← Sidecar deletes deployment
└─────────────┘
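
The diagram above is a strictly linear progression, which can be modeled as a small lookup table. The `State` enum and `advance` function below are illustrative names only. The sketch follows the diagram literally, so it has no transition from IDLE back to RUNNING.

```python
from enum import Enum


class State(Enum):
    CREATED = "created"
    RUNNING = "running"
    IDLE = "idle"
    SHUTDOWN = "shutdown"
    DELETED = "deleted"


# Each state has exactly one successor; DELETED is terminal.
NEXT_STATE = {
    State.CREATED: State.RUNNING,    # gateway creates the deployment
    State.RUNNING: State.IDLE,       # no triggers + timeout exceeded
    State.IDLE: State.SHUTDOWN,      # agent exits with code 42
    State.SHUTDOWN: State.DELETED,   # sidecar deletes the deployment
}


def advance(state: State) -> State:
    """Return the next lifecycle state, or raise if the state is terminal."""
    if state not in NEXT_STATE:
        raise ValueError(f"{state.name} is a terminal state")
    return NEXT_STATE[state]
```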

Idle Detection Logic

Container is IDLE when:

  1. active_triggers is empty AND
  2. (now - last_activity) > idle_timeout

Container is ACTIVE when:

  1. Has any active triggers (data subscriptions, CEP patterns, etc.) OR
  2. Recent user activity (MCP calls within timeout)
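
The two conditions translate directly into a predicate. The function below is a sketch under the assumption that triggers are held in a set and timestamps are timezone-aware `datetime` values; the name `is_idle` and its signature are not taken from the actual lifecycle manager.

```python
from datetime import datetime, timedelta, timezone


def is_idle(active_triggers, last_activity, idle_timeout, now=None) -> bool:
    """True when there are no triggers and the idle timeout has been exceeded."""
    if now is None:
        now = datetime.now(timezone.utc)
    no_triggers = len(active_triggers) == 0
    timed_out = (now - last_activity) > idle_timeout
    return no_triggers and timed_out
```

Because the comparison is strict, a container whose last activity was exactly `idle_timeout` ago is still considered active.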

Cleanup Policies by License Tier

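| User Type  | Idle Timeout | PVC Policy | Notes                        |
|------------|--------------|------------|------------------------------|
| Anonymous  | 15 minutes   | Delete     | Ephemeral, no data retention |
| Free       | 15 minutes   | Retain     | Can resume session           |
| Paid       | 60 minutes   | Retain     | Longer grace period          |
| Enterprise | No shutdown  | Retain     | Always-on containers         |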

The tier is set via the USER_TYPE env var in the deployment.
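
The tier policies can also be expressed as data keyed by USER_TYPE. This is a sketch only: `TierPolicy`, `POLICIES`, and `policy_for` are hypothetical names, and using `None` to represent "No shutdown" is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TierPolicy:
    idle_timeout_minutes: Optional[int]  # None means the container never shuts down
    delete_pvc: bool


POLICIES = {
    "anonymous": TierPolicy(idle_timeout_minutes=15, delete_pvc=True),
    "free": TierPolicy(idle_timeout_minutes=15, delete_pvc=False),
    "paid": TierPolicy(idle_timeout_minutes=60, delete_pvc=False),
    "enterprise": TierPolicy(idle_timeout_minutes=None, delete_pvc=False),
}


def policy_for(user_type: str) -> TierPolicy:
    """Look up the cleanup policy for a USER_TYPE value."""
    try:
        return POLICIES[user_type.strip().lower()]
    except KeyError:
        raise ValueError(f"unknown USER_TYPE: {user_type!r}") from None
```

Rejecting unknown tiers outright, rather than falling back to a default, avoids silently retaining or deleting data for a mistyped value.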

Security

Principle of Least Privilege

Gateway:

  • Can create agent resources
  • Cannot delete agent resources
  • Cannot access other namespaces
  • Cannot exec into pods

Lifecycle Sidecar:

  • Can delete its own deployment only
  • Cannot delete other deployments
  • Scoped to dexorder-agents namespace
  • No exec, no secrets access

Admission Control

All deployments in the dexorder-agents namespace are subject to:

  • Image allowlist (only approved images)
  • Security context enforcement (non-root, drop caps, read-only rootfs)
  • Resource limits required
  • PodSecurity standards (restricted profile)

See deploy/k8s/base/admission-policy.yaml

Network Isolation

Agents are network-isolated via NetworkPolicy:

  • Can connect to gateway (MCP)
  • Can connect to Redpanda (data streams)
  • Can make outbound HTTPS (exchanges, LLM APIs)
  • Cannot access k8s API
  • Cannot access system namespace
  • Cannot access other agent pods

See deploy/k8s/base/network-policies.yaml

Deployment

1. Apply Security Policies

kubectl apply -k deploy/k8s/dev  # or prod

This creates:

  • Namespaces (dexorder-system, dexorder-agents)
  • RBAC (gateway, lifecycle sidecar)
  • Admission policies
  • Network policies
  • Resource quotas

2. Build and Push Lifecycle Sidecar

cd lifecycle-sidecar
docker build -t ghcr.io/dexorder/lifecycle-sidecar:latest .
docker push ghcr.io/dexorder/lifecycle-sidecar:latest

3. Gateway Creates Agent Deployments

When a user connects, the gateway creates:

  • Deployment with agent + sidecar
  • PVC for persistent data
  • Service for MCP endpoint

See deploy/k8s/base/agent-deployment-example.yaml for template.

Testing

Test Lifecycle Manager Locally

import asyncio

from dexorder.lifecycle_manager import LifecycleManager


async def main():
    # Disable actual shutdown for testing
    manager = LifecycleManager(
        idle_timeout_minutes=1,
        check_interval_seconds=10,
        enable_shutdown=False,  # Only log, don't exit
    )

    await manager.start()

    # Simulate activity (resets the idle timer)
    manager.record_activity()

    # With an active trigger the container stays ACTIVE, even past the timeout
    manager.add_trigger("test_trigger")
    await asyncio.sleep(70)  # Wait past the 1-minute timeout

    # No triggers and no recent activity: idle should now be detected
    manager.remove_trigger("test_trigger")
    await asyncio.sleep(70)  # Should detect idle

    await manager.stop()


asyncio.run(main())

Test Sidecar Locally

# Build
cd lifecycle-sidecar
go build -o lifecycle-sidecar main.go

# Run (requires k8s config)
export NAMESPACE=dexorder-agents
export DEPLOYMENT_NAME=agent-test
export USER_TYPE=free
./lifecycle-sidecar

Integration Test

  1. Deploy test agent with sidecar
  2. Verify agent starts and is healthy
  3. Stop sending MCP calls and remove all triggers
  4. Wait for idle timeout + check interval
  5. Verify deployment is deleted
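
Steps 4 and 5 amount to polling until the deployment disappears or a deadline passes. The helper below is a generic sketch: `wait_until_deleted` and `expected_wait_seconds` are hypothetical names, the 30-second slack is an arbitrary assumption, and the `exists` callable is left to the test author (for example, a wrapper around a Kubernetes client call).

```python
import time


def expected_wait_seconds(idle_timeout_minutes, check_interval_seconds, slack_seconds=30):
    """Upper bound for step 4: idle timeout + one check interval + slack."""
    return idle_timeout_minutes * 60 + check_interval_seconds + slack_seconds


def wait_until_deleted(exists, timeout_seconds, poll_seconds=5.0,
                       sleep=time.sleep, clock=time.monotonic) -> bool:
    """Poll exists() until it returns False; give up after timeout_seconds."""
    deadline = clock() + timeout_seconds
    while True:
        if not exists():
            return True   # deployment is gone
        if clock() >= deadline:
            return False  # still present after the deadline
        sleep(poll_seconds)
```

The `sleep` and `clock` parameters are injectable so the helper itself can be tested without real delays.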

Troubleshooting

Container not shutting down when idle

Check logs:

kubectl logs -n dexorder-agents deployment/agent-user-abc123 -c agent

Verify:

  • ENABLE_IDLE_SHUTDOWN=true
  • No active triggers: manager.active_triggers should be empty
  • Idle timeout exceeded

Sidecar not deleting deployment

Check sidecar logs:

kubectl logs -n dexorder-agents deployment/agent-user-abc123 -c lifecycle-sidecar

Verify:

  • Exit code file exists: /var/run/agent/exit_code contains 42
  • RBAC permissions: kubectl auth can-i delete deployments --as=system:serviceaccount:dexorder-agents:agent-lifecycle -n dexorder-agents
  • Deployment name matches: Check DEPLOYMENT_NAME env var

Gateway can't create deployments

Check gateway logs and verify:

  • ServiceAccount exists: kubectl get sa gateway -n dexorder-system
  • RoleBinding exists: kubectl get rolebinding gateway-agent-creator -n dexorder-agents
  • Admission policy allows image: Check image name matches allowlist in admission-policy.yaml

Future Enhancements

  1. Graceful shutdown notifications: Warn users before shutdown via websocket
  2. Predictive scaling: Keep frequently-used containers warm
  3. Tiered storage: Move old PVCs to cheaper storage class
  4. Metrics: Expose lifecycle metrics (idle rate, shutdown count, etc.)
  5. Cost allocation: Track resource usage per user/license tier