ai/doc/CLUSTER_SETUP.md
2026-04-01 18:34:08 -04:00

Production Cluster Setup Guide

This guide covers setting up the Dexorder AI platform from scratch on a fresh Kubernetes cluster.


Overview

The platform runs across two namespaces:

| Namespace | Contents |
| --- | --- |
| ai | Gateway, web UI, all infrastructure services (postgres, minio, kafka, flink, relay, ingestor, qdrant, dragonfly, iceberg-catalog) |
| sandbox | Per-user sandbox containers (created dynamically by the gateway) |

Secrets are managed via 1Password CLI (op inject). All .tpl.yaml files in deploy/k8s/prod/secrets/ contain op:// references and are safe to commit; actual values are never stored in git.


Prerequisites

Tooling

| Tool | Purpose | Min Version |
| --- | --- | --- |
| kubectl | Cluster management | 1.30+ |
| kustomize | Manifest rendering | 5.x |
| op | 1Password CLI | 2.x |
| docker | Image builds | - |

Cluster Requirements

  • Kubernetes: 1.30+ (required for ValidatingAdmissionPolicy GA)
  • nginx-ingress-controller: For ingress routing and WebSocket support
  • cert-manager: For TLS certificate provisioning (with letsencrypt-prod ClusterIssuer)
  • Persistent volume provisioner: StorageClass standard must exist and be functional
  • DNS: dexorder.ai must resolve to the cluster's ingress IP/load balancer

Container Registry Access

Images are hosted at git.dxod.org/dexorder/dexorder/. The cluster must be able to pull from this registry. If the registry requires authentication, create an image pull secret before deploying.
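
If the registry does require authentication, a pull secret could be created along the following lines. This is a minimal sketch, not part of the project's scripts: the secret name regcred and the REGISTRY_USER / REGISTRY_TOKEN variables are assumptions, and it assumes the prod context from Step 1 and that the namespaces from Step 4 already exist. The workloads (or their service accounts) would also need to reference the secret through imagePullSecrets.

```shell
# Create a docker-registry pull secret in one namespace.
# "regcred" is a placeholder name, not one taken from the project's manifests.
create_pull_secret() {
  ns="$1"
  kubectl --context=prod -n "$ns" create secret docker-registry regcred \
    --docker-server=git.dxod.org \
    --docker-username="$REGISTRY_USER" \
    --docker-password="$REGISTRY_TOKEN"
}

# Usage (after the namespaces exist):
#   export REGISTRY_USER=... REGISTRY_TOKEN=...
#   create_pull_secret ai && create_pull_secret sandbox
```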


Step 1 — Configure kubectl Context

Create a dedicated context named prod that defaults to the ai namespace:

# Add cluster credentials (replace with your actual kubeconfig details)
kubectl config set-cluster prod-cluster \
  --server=https://<your-cluster-api-endpoint> \
  --certificate-authority=/path/to/ca.crt

kubectl config set-credentials prod-user \
  --client-certificate=/path/to/client.crt \
  --client-key=/path/to/client.key

kubectl config set-context prod \
  --cluster=prod-cluster \
  --user=prod-user \
  --namespace=ai

# Verify
kubectl --context=prod cluster-info

All bin/ scripts use kubectl --context=prod for production operations.


Step 2 — Install Cluster Prerequisites

nginx-ingress-controller

kubectl --context=prod apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.10.0/deploy/static/provider/cloud/deploy.yaml
kubectl --context=prod -n ingress-nginx wait --for=condition=ready pod -l app.kubernetes.io/component=controller --timeout=120s

cert-manager

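kubectl --context=prod apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.0/cert-manager.yaml
# Wait for the controller, cainjector, and webhook pods; the webhook must be up before a ClusterIssuer can be created
kubectl --context=prod -n cert-manager wait --for=condition=ready pod -l app.kubernetes.io/instance=cert-manager --timeout=120s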

Then create the letsencrypt-prod ClusterIssuer. Edit the email address:

# Save as /tmp/clusterissuer.yaml and apply
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: your-email@dexorder.ai
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
      - http01:
          ingress:
            class: nginx

kubectl --context=prod apply -f /tmp/clusterissuer.yaml
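
After applying, it can be worth confirming that the issuer registered with Let's Encrypt before moving on. The helper below is a small sketch that assumes the prod context from Step 1; the function name and the 60s default timeout are illustrative choices.

```shell
# Block until the ClusterIssuer reports Ready, or fail after the timeout.
# $1: issuer name (defaults to letsencrypt-prod), $2: timeout (defaults to 60s)
wait_for_issuer() {
  kubectl --context=prod wait --for=condition=Ready \
    "clusterissuer/${1:-letsencrypt-prod}" --timeout="${2:-60s}"
}

# Usage:
#   wait_for_issuer
```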

Step 3 — Set Up 1Password Vault

All production secrets are stored under the AI Prod vault in 1Password. The bin/op-setup script creates the vault and all required items with placeholder values so you can fill them in before deploying.

# Sign in to 1Password
op signin

# Preview what will be created (no changes)
bin/op-setup --dry-run

# Create the vault and all items
bin/op-setup

After running the script, open 1Password and update each item in the AI Prod vault with real values:

| Item | Fields | Where to get the value |
| --- | --- | --- |
| PostgreSQL | password | Generate: openssl rand -base64 32 |
| MinIO | access_key, secret_key | access_key can stay minio-admin; generate a strong secret_key |
| Gateway | anthropic_api_key | Anthropic Console → API Keys |
| Gateway | jwt_secret | Generate: openssl rand -base64 48 |
| Gateway | openai_api_key | OpenAI Platform → API Keys (optional) |
| Gateway | google_api_key | Google AI Studio (optional) |
| Gateway | openrouter_api_key | OpenRouter (optional) |
| Telegram | bot_token | BotFather → /newbot (optional) |
| Ingestor | binance_api_key/secret | Binance API Management (optional) |
| Ingestor | coinbase_api_key/secret | Coinbase CDP Portal (optional) |
| Ingestor | kraken_api_key/secret | Kraken API Settings (optional) |

Verify the references resolve before continuing:

op inject -i deploy/k8s/prod/secrets/gateway-secrets.tpl.yaml | head -20
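
Note that the command above prints resolved secret values to the terminal. To check every template without displaying anything, a loop like the following could be used. It is a sketch: the function name is hypothetical, and it relies only on op inject exiting non-zero when a reference cannot be resolved.

```shell
# Try to resolve every secret template and list the ones that fail.
# Rendered output is discarded so no secret values reach the terminal.
check_templates() {
  dir="${1:-deploy/k8s/prod/secrets}"
  failed=0
  for tpl in "$dir"/*.tpl.yaml; do
    [ -e "$tpl" ] || continue          # no templates matched the glob
    if ! op inject -i "$tpl" >/dev/null; then
      echo "unresolved: $tpl"
      failed=1
    fi
  done
  return "$failed"
}

# Usage:
#   check_templates && echo "all templates resolve"
```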

Step 4 — Apply Base Manifests

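This creates the namespaces, RBAC, network policies, admission policies, and resource quotas, along with the infrastructure workloads defined in deploy/k8s/prod/infrastructure.yaml (see Step 7).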
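
To preview what the kustomization contains without touching the cluster, the rendered manifests can be summarized by kind. The helper below is an illustrative sketch (count_kinds is a hypothetical name); it only counts top-level kind: lines, so nested fields such as roleRef.kind are ignored.

```shell
# Summarize rendered manifests read from stdin as "<count> <kind>" lines.
count_kinds() {
  awk '/^kind:/ { n[$2]++ } END { for (k in n) print n[k], k }' | sort -k2
}

# Usage:
#   kubectl kustomize deploy/k8s/prod/ | count_kinds
```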

kubectl --context=prod apply -k deploy/k8s/prod/

Verify the namespaces and key resources are created:

kubectl --context=prod get namespaces ai sandbox
kubectl --context=prod -n ai get serviceaccount gateway
kubectl --context=prod -n sandbox get serviceaccount sandbox-lifecycle
kubectl --context=prod get validatingadmissionpolicy dexorder-sandbox-image-policy

Step 5 — Apply Secrets

# Apply all secrets (uses op inject to resolve op:// references)
bin/secret-update prod

This will prompt for confirmation, then apply all 7 secrets:

  • ai-secrets (Anthropic API key)
  • postgres-secret (PostgreSQL password)
  • minio-secret (MinIO credentials)
  • ingestor-secrets (exchange API keys)
  • flink-secrets (MinIO credentials for Flink)
  • gateway-secrets (gateway application secrets)
  • sandbox-secrets (secrets mounted in sandbox pods)

Verify:

kubectl --context=prod -n ai get secrets
kubectl --context=prod -n sandbox get secret sandbox-secrets
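
A scripted version of this check can report exactly which secret is missing. The sketch below assumes the namespace placement implied by the verify commands: sandbox-secrets in sandbox and the other six in ai. The function name is hypothetical.

```shell
# Report any expected secret that is missing.
# Assumption: sandbox-secrets lives in "sandbox", everything else in "ai".
check_secrets() {
  missing=0
  for s in ai-secrets postgres-secret minio-secret ingestor-secrets \
           flink-secrets gateway-secrets; do
    kubectl --context=prod -n ai get secret "$s" >/dev/null 2>&1 \
      || { echo "missing: ai/$s"; missing=1; }
  done
  kubectl --context=prod -n sandbox get secret sandbox-secrets >/dev/null 2>&1 \
    || { echo "missing: sandbox/sandbox-secrets"; missing=1; }
  return "$missing"
}

# Usage:
#   check_secrets && echo "all 7 secrets present"
```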

Step 6 — Apply Configs

# Apply all configs (gateway-config uses op inject; others are plain YAML)
bin/config-update prod

This applies:

  • relay-config — ZMQ relay configuration
  • ingestor-config — CCXT ingestor configuration
  • flink-config — Flink job configuration
  • gateway-config — Gateway config (DB credentials resolved via op inject)

Verify:

kubectl --context=prod -n ai get configmaps

Step 7 — Deploy Infrastructure

Infrastructure services (postgres, minio, kafka, iceberg-catalog, dragonfly, qdrant, relay, ingestor, flink) are defined in deploy/k8s/prod/infrastructure.yaml and were applied in Step 4.

Wait for the StatefulSets and Deployments to become ready:

kubectl --context=prod -n ai rollout status statefulset/postgres
kubectl --context=prod -n ai rollout status statefulset/minio
kubectl --context=prod -n ai rollout status statefulset/kafka
kubectl --context=prod -n ai rollout status statefulset/qdrant
kubectl --context=prod -n ai rollout status deployment/dragonfly
kubectl --context=prod -n ai rollout status deployment/iceberg-catalog
kubectl --context=prod -n ai rollout status deployment/relay
kubectl --context=prod -n ai rollout status deployment/ingestor
kubectl --context=prod -n ai rollout status deployment/flink-jobmanager
kubectl --context=prod -n ai rollout status deployment/flink-taskmanager
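
Without a timeout, rollout status waits indefinitely on a workload that never becomes ready. The same list can be wrapped in a loop that stops at the first failure; this is a sketch, and the function name and the 300s default are arbitrary choices.

```shell
# Wait for each workload in turn; stop at the first one that fails to roll out.
# $1: per-workload timeout (defaults to 300s)
wait_for_rollouts() {
  limit="${1:-300s}"
  for w in statefulset/postgres statefulset/minio statefulset/kafka \
           statefulset/qdrant deployment/dragonfly deployment/iceberg-catalog \
           deployment/relay deployment/ingestor \
           deployment/flink-jobmanager deployment/flink-taskmanager; do
    kubectl --context=prod -n ai rollout status "$w" --timeout="$limit" \
      || { echo "rollout failed: $w" >&2; return 1; }
  done
}

# Usage:
#   wait_for_rollouts 600s
```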

MinIO will automatically run a Job to create the warehouse bucket on first start. Confirm it completes:

kubectl --context=prod -n ai get jobs
kubectl --context=prod -n ai wait --for=condition=complete job/minio-init --timeout=120s

Step 8 — Deploy Application Images

Build and push the application images:

# Build and push all services
bin/deploy gateway prod
bin/deploy web prod
bin/deploy sandbox prod
bin/deploy lifecycle-sidecar prod
bin/deploy flink prod
bin/deploy relay prod
bin/deploy ingestor prod

Each bin/deploy command builds the Docker image, tags it with the current git SHA, pushes to git.dxod.org/dexorder/dexorder/, and updates the live deployment via kubectl set image.

Wait for the gateway and web to be ready:

kubectl --context=prod -n ai rollout status deployment/gateway
kubectl --context=prod -n ai rollout status deployment/ai-web
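
Since images are tagged with the git SHA, one way to confirm that the new build is what is actually running is to print the deployment's image and compare its tag with the local commit. This sketch does not assume whether bin/deploy uses the short or full SHA, so the comparison is left to the reader; the function name is hypothetical.

```shell
# Print the image(s) a deployment in the ai namespace is currently running.
deployed_images() {
  kubectl --context=prod -n ai get deployment "$1" \
    -o jsonpath='{.spec.template.spec.containers[*].image}'
}

# Usage: compare the image tag against the local commit.
#   deployed_images gateway
#   git rev-parse HEAD
```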

Step 9 — Initialize Schema and Admin User

bin/init prod

This will:

  1. Wait for postgres to be ready
  2. Check if the schema exists; apply gateway/schema.sql if not
  3. Prompt for admin user credentials (email, password, display name, license tier)
  4. Register the user via the API
  5. Insert the license record into the database
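
The first step can also be reproduced by hand when debugging a failed init. The following is a sketch of a readiness wait, not the contents of bin/init: it assumes the app=postgres label and postgres user shown elsewhere in this guide, and that pg_isready is available in the postgres image.

```shell
# Retry a command up to N times, sleeping DELAY seconds between attempts.
retry() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while ! "$@"; do
    if [ "$i" -ge "$attempts" ]; then
      return 1
    fi
    i=$((i + 1))
    sleep "$delay"
  done
}

# Succeeds once postgres accepts connections.
pg_ready() {
  pod=$(kubectl --context=prod -n ai get pods -l app=postgres \
    -o jsonpath='{.items[0].metadata.name}') || return 1
  kubectl --context=prod -n ai exec "$pod" -- pg_isready -U postgres
}

# Usage: up to 30 attempts, 2 seconds apart
#   retry 30 2 pg_ready
```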

Step 10 — Verify TLS and Ingress

cert-manager should automatically provision TLS certificates via Let's Encrypt once the ingress resources are applied and DNS is resolving correctly.

# Check certificate status
kubectl --context=prod -n ai get certificates
kubectl --context=prod -n ai describe certificate dexorder-ai-tls

# Certificates are ready when READY=True
# This can take 1-2 minutes for HTTP-01 challenge completion

Once ready, verify the application is accessible:

curl -I https://dexorder.ai/api/health
# Expected: HTTP/2 200
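
For scripted checks, both verifications can be expressed as commands that fail with a non-zero exit code. This is a sketch under the names used above (the dexorder-ai-tls certificate and the /api/health endpoint); the function names and the 300s timeout are illustrative.

```shell
# Block until cert-manager marks the certificate Ready.
wait_for_cert() {
  kubectl --context=prod -n ai wait --for=condition=Ready \
    certificate/dexorder-ai-tls --timeout="${1:-300s}"
}

# Return success only if the URL answers with HTTP 200.
check_health() {
  code=$(curl -s -o /dev/null -w '%{http_code}' "$1")
  [ "$code" = "200" ] || { echo "unexpected status: $code" >&2; return 1; }
}

# Usage:
#   wait_for_cert && check_health https://dexorder.ai/api/health
```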

Day-2 Operations

Update a Service After Code Changes

# Rebuild and redeploy a single service
bin/deploy gateway prod
bin/deploy web prod

Update Secrets

# Update all secrets
bin/secret-update prod

# Update a specific secret
bin/secret-update prod ai-secrets

Update Config

# Update all configs (triggers pod restarts)
bin/config-update prod

# Update a specific config
bin/config-update prod gateway-config

Add a New User

# Re-run init to add another user
bin/init prod

Or open a psql session and insert the records directly:

PG_POD=$(kubectl --context=prod -n ai get pods -l app=postgres -o jsonpath='{.items[0].metadata.name}')
kubectl --context=prod -n ai exec -it "$PG_POD" -- psql -U postgres -d iceberg

View Logs

kubectl --context=prod -n ai logs -f deployment/gateway
kubectl --context=prod -n ai logs -f deployment/ingestor
kubectl --context=prod -n ai logs -f deployment/flink-jobmanager
kubectl --context=prod -n sandbox logs -l dexorder.io/component=sandbox

Check Sandbox Status

# List all running sandboxes
kubectl --context=prod -n sandbox get deployments
kubectl --context=prod -n sandbox get pods

# Check resource usage in sandbox namespace (requires metrics-server in the cluster)
kubectl --context=prod -n sandbox top pods

Namespace & Security Architecture

Internet
    │
    ▼
nginx-ingress (dexorder.ai)
    │
    ├──/──────────────────► ai-web:5173 (Vue.js UI)
    │
    └──/api/───────────────► gateway:3000 (Node.js API)
                                  │
                                  │  Creates/manages via k8s API
                                  ▼
                             sandbox namespace
                             ┌──────────────────────┐
                             │  sandbox-<userId>     │
                             │  ├── sandbox          │
                             │  │   (MCP server)     │
                             │  └── lifecycle-sidecar│
                             └──────────────────────┘
                                  │
                                  │  Egress: only ai namespace
                                  │  services + external HTTPS
                                  ▼
                             ai namespace services:
                             gateway:5571 (ZMQ events)
                             iceberg-catalog:8181
                             minio:9000
                             relay:5559

Network Isolation

  • Sandbox pods run under a default-deny network policy
  • Sandboxes can reach: gateway (ZMQ + callbacks), iceberg-catalog, minio, relay, external HTTPS (port 443)
  • Sandboxes cannot reach: other sandbox pods, the Kubernetes API, private IP ranges
  • The admission policy (dexorder-sandbox-image-policy) prevents non-approved images from running in the sandbox namespace

Troubleshooting

Pods stuck in Pending

kubectl --context=prod -n ai describe pod <pod-name>
# Look for: resource quota exceeded, PVC not bound, image pull errors
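
The three causes listed above can be checked directly. The helper below is a sketch that gathers the relevant output for one pod in the ai namespace; the function name is hypothetical.

```shell
# Collect the usual suspects for a Pending pod in the ai namespace.
# $1: pod name
pending_diagnostics() {
  # Scheduling and image pull events for this pod, oldest first
  kubectl --context=prod -n ai get events \
    --field-selector "involvedObject.name=$1" --sort-by=.lastTimestamp
  # Unbound claims show STATUS=Pending
  kubectl --context=prod -n ai get pvc
  # Used vs. hard limits for the namespace quota
  kubectl --context=prod -n ai describe resourcequota
}

# Usage:
#   pending_diagnostics <pod-name>
```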

Certificate not issuing

kubectl --context=prod -n ai describe certificaterequest
kubectl --context=prod -n cert-manager logs -l app=cert-manager
# Common cause: DNS not pointing to cluster ingress IP yet

Gateway can't create sandboxes

# Verify RBAC is correct
kubectl --context=prod auth can-i create deployments \
  --as=system:serviceaccount:ai:gateway -n sandbox

# Should return: yes
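
The gateway manages sandboxes over their whole lifecycle, so it likely needs more than create. The loop below checks any list of verbs; it is a sketch, and the verbs in the usage line are an assumption about what the gateway requires, not taken from the RBAC manifests.

```shell
# Check each given verb on deployments in the sandbox namespace
# for the gateway service account; list the ones that are denied.
check_gateway_rbac() {
  denied=0
  for verb in "$@"; do
    # can-i exits non-zero on "no", so guard the assignment
    answer=$(kubectl --context=prod auth can-i "$verb" deployments \
      --as=system:serviceaccount:ai:gateway -n sandbox 2>/dev/null) || true
    [ "$answer" = "yes" ] || { echo "denied: $verb deployments"; denied=1; }
  done
  return "$denied"
}

# Usage (verb list is illustrative):
#   check_gateway_rbac create get list patch delete
```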

Sandbox pod fails to start with "configmap not found"

This indicates a leftover reference to the sandbox-config ConfigMap, which has been removed from the template. Check the sandbox deployment spec:

kubectl --context=prod -n sandbox describe deployment sandbox-<userId>

1Password auth expired

op signin
bin/secret-update prod