Production Cluster Setup Guide
This guide covers setting up the Dexorder AI platform from scratch on a fresh Kubernetes cluster.
Overview
The platform runs across two namespaces:
| Namespace | Contents |
|---|---|
| ai | Gateway, web UI, all infrastructure services (postgres, minio, kafka, flink, relay, ingestor, qdrant, dragonfly, iceberg-catalog) |
| sandbox | Per-user sandbox containers (created dynamically by the gateway) |
Secrets are managed via 1Password CLI (op inject). All .tpl.yaml files in deploy/k8s/prod/secrets/ contain op:// references and are safe to commit; actual values are never stored in git.
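To see which 1Password items a template depends on without resolving any values, the op:// references can be listed with standard text tools. The helper below is a minimal sketch: list_op_refs is a hypothetical name, not a script in this repository, and it assumes each reference ends at a quote, a closing brace, or the end of the line.

```shell
# list_op_refs FILE
# Print every unique op:// reference found in a template, one per line.
# Assumption: a reference ends at a quote, a closing brace, or end of line.
list_op_refs() {
  grep -oE "op://[^\"'}]+" "$1" | sed -E 's/[[:space:]]+$//' | sort -u
}
```

For example, list_op_refs deploy/k8s/prod/secrets/gateway-secrets.tpl.yaml prints the references in that file without contacting 1Password.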
Prerequisites
Tooling
| Tool | Purpose | Min Version |
|---|---|---|
| kubectl | Cluster management | 1.30+ |
| kustomize | Manifest rendering | 5.x |
| op | 1Password CLI | 2.x |
| docker | Image builds | - |
Cluster Requirements
- Kubernetes: 1.30+ (required for ValidatingAdmissionPolicy GA)
- nginx-ingress-controller: For ingress routing and WebSocket support
- cert-manager: For TLS certificate provisioning (with letsencrypt-prod ClusterIssuer)
- Persistent volume provisioner: StorageClass standard must exist and be functional
- DNS: dexorder.ai resolves to the cluster's ingress IP/load balancer
Container Registry Access
Images are hosted at git.dxod.org/dexorder/dexorder/. The cluster must be able to pull from this registry. If the registry requires authentication, create an image pull secret before deploying.
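If authentication is required, a pull secret could be created along the following lines. This is a sketch under stated assumptions: the secret name registry-pull and the REGISTRY_USER and REGISTRY_TOKEN environment variables are placeholders, and the guide does not say whether the manifests already reference a pull secret through imagePullSecrets.

```shell
# create_pull_secret NAMESPACE
# Create a docker-registry secret for git.dxod.org in the given namespace.
# Assumptions: the secret name "registry-pull" and the REGISTRY_USER /
# REGISTRY_TOKEN environment variables are placeholders, not project names.
create_pull_secret() {
  kubectl --context=prod -n "$1" create secret docker-registry registry-pull \
    --docker-server=git.dxod.org \
    --docker-username="$REGISTRY_USER" \
    --docker-password="$REGISTRY_TOKEN"
}
```

The target namespace must already exist, so in practice this would run once the namespaces are in place, for example create_pull_secret ai followed by create_pull_secret sandbox, since pods in both namespaces pull images. Pods only use the secret if their spec or service account lists it under imagePullSecrets.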
Step 1 — Configure kubectl Context
Create a dedicated context named prod that defaults to the ai namespace:
# Add cluster credentials (replace with your actual kubeconfig details)
kubectl config set-cluster prod-cluster \
  --server=https://<your-cluster-api-endpoint> \
  --certificate-authority=/path/to/ca.crt
kubectl config set-credentials prod-user \
  --client-certificate=/path/to/client.crt \
  --client-key=/path/to/client.key
kubectl config set-context prod \
  --cluster=prod-cluster \
  --user=prod-user \
  --namespace=ai
# Verify
kubectl --context=prod cluster-info
All bin/ scripts use kubectl --context=prod for production operations.
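Because every later step relies on the prod context defaulting to the ai namespace, it can be worth confirming that before continuing. The functions below are a small sketch; their names are hypothetical and they only read the local kubeconfig.

```shell
# prod_context_namespace
# Print the default namespace recorded for the "prod" context in kubeconfig.
prod_context_namespace() {
  kubectl config view \
    -o jsonpath='{.contexts[?(@.name=="prod")].context.namespace}'
}

# check_prod_context
# Succeed only if the prod context defaults to the ai namespace.
check_prod_context() {
  ns=$(prod_context_namespace)
  if [ "$ns" = "ai" ]; then
    echo "prod context OK (namespace: ai)"
  else
    echo "prod context has namespace '${ns:-<unset>}', expected 'ai'" >&2
    return 1
  fi
}
```

Running check_prod_context after the set-context command gives a quick yes/no answer without touching the cluster.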
Step 2 — Install Cluster Prerequisites
nginx-ingress-controller
kubectl --context=prod apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.10.0/deploy/static/provider/cloud/deploy.yaml
kubectl --context=prod -n ingress-nginx wait --for=condition=ready pod -l app.kubernetes.io/component=controller --timeout=120s
cert-manager
kubectl --context=prod apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.0/cert-manager.yaml
kubectl --context=prod -n cert-manager wait --for=condition=ready pod -l app=cert-manager --timeout=120s
Then create the letsencrypt-prod ClusterIssuer. Edit the email address:
# Save as /tmp/clusterissuer.yaml and apply
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: your-email@dexorder.ai
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
      - http01:
          ingress:
            class: nginx
kubectl --context=prod apply -f /tmp/clusterissuer.yaml
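The issuer registers an ACME account in the background, so it may not be usable the instant it is applied. One way to block until it reports Ready is sketched below; wait_for_issuer is a hypothetical helper name and the 60s default is an arbitrary choice.

```shell
# wait_for_issuer [TIMEOUT]
# Block until the letsencrypt-prod ClusterIssuer reports the Ready condition.
# Assumption: the 60s default timeout is arbitrary; pass another value if needed.
wait_for_issuer() {
  kubectl --context=prod wait --for=condition=Ready \
    clusterissuer/letsencrypt-prod --timeout="${1:-60s}"
}
```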
Step 3 — Set Up 1Password Vault
All production secrets are stored under the AI Prod vault in 1Password. The bin/op-setup script creates the vault and all required items with placeholder values so you can fill them in before deploying.
# Sign in to 1Password
op signin
# Preview what will be created (no changes)
bin/op-setup --dry-run
# Create the vault and all items
bin/op-setup
After running the script, open 1Password and update each item in the AI Prod vault with real values:
| Item | Fields | Where to get the value |
|---|---|---|
| PostgreSQL | password | Generate: openssl rand -base64 32 |
| MinIO | access_key, secret_key | access_key can stay minio-admin; generate a strong secret_key |
| Gateway | anthropic_api_key | Anthropic Console → API Keys |
| Gateway | jwt_secret | Generate: openssl rand -base64 48 |
| Gateway | openai_api_key | OpenAI Platform → API Keys (optional) |
| Gateway | google_api_key | Google AI Studio (optional) |
| Gateway | openrouter_api_key | OpenRouter (optional) |
| Telegram | bot_token | BotFather → /newbot (optional) |
| Ingestor | binance_api_key/secret | Binance API Management (optional) |
| Ingestor | coinbase_api_key/secret | Coinbase CDP Portal (optional) |
| Ingestor | kraken_api_key/secret | Kraken API Settings (optional) |
Verify the references resolve before continuing:
op inject -i deploy/k8s/prod/secrets/gateway-secrets.tpl.yaml | head -20
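The command above prints resolved values to the terminal and covers a single template. A variation that checks every template while discarding the rendered output might look like this; check_templates is a hypothetical helper, not one of the bin/ scripts.

```shell
# check_templates DIR
# Run op inject on every *.tpl.yaml in DIR, discarding the rendered output so
# secret values never reach the terminal. Errors from op stay visible on
# stderr. Returns non-zero if any template fails to resolve.
check_templates() {
  failed=0
  for tpl in "$1"/*.tpl.yaml; do
    [ -e "$tpl" ] || continue
    if op inject -i "$tpl" >/dev/null; then
      echo "ok    $tpl"
    else
      echo "FAIL  $tpl"
      failed=1
    fi
  done
  return "$failed"
}
```

Calling check_templates deploy/k8s/prod/secrets reports one line per template and exits non-zero if any reference is missing or misspelled.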
Step 4 — Apply Base Manifests
This creates namespaces, RBAC, network policies, admission policies, and resource quotas.
kubectl --context=prod apply -k deploy/k8s/prod/
Verify the namespaces and key resources are created:
kubectl --context=prod get namespaces ai sandbox
kubectl --context=prod -n ai get serviceaccount gateway
kubectl --context=prod -n sandbox get serviceaccount sandbox-lifecycle
kubectl --context=prod get validatingadmissionpolicy dexorder-sandbox-image-policy
Step 5 — Apply Secrets
# Apply all secrets (uses op inject to resolve op:// references)
bin/secret-update prod
This will prompt for confirmation, then apply all 7 secrets:
- ai-secrets (Anthropic API key)
- postgres-secret (PostgreSQL password)
- minio-secret (MinIO credentials)
- ingestor-secrets (exchange API keys)
- flink-secrets (MinIO credentials for Flink)
- gateway-secrets (gateway application secrets)
- sandbox-secrets (secrets mounted in sandbox pods)
Verify:
kubectl --context=prod -n ai get secrets
kubectl --context=prod -n sandbox get secret sandbox-secrets
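Rather than reading the listing by eye, the seven names can be checked in a loop. The sketch below assumes sandbox-secrets lives in the sandbox namespace and the other six in ai, which matches the verification commands above but is not spelled out elsewhere; check_secrets is a hypothetical name.

```shell
# check_secrets
# Report any of the seven expected secrets that is missing from its namespace.
# Assumption: sandbox-secrets is in "sandbox", all others are in "ai".
check_secrets() {
  missing=0
  for entry in \
    ai/ai-secrets ai/postgres-secret ai/minio-secret ai/ingestor-secrets \
    ai/flink-secrets ai/gateway-secrets sandbox/sandbox-secrets
  do
    ns=${entry%%/*}
    name=${entry#*/}
    if ! kubectl --context=prod -n "$ns" get secret "$name" >/dev/null 2>&1; then
      echo "missing: $ns/$name"
      missing=1
    fi
  done
  return "$missing"
}
```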
Step 6 — Apply Configs
# Apply all configs (gateway-config uses op inject; others are plain YAML)
bin/config-update prod
This applies:
- relay-config: ZMQ relay configuration
- ingestor-config: CCXT ingestor configuration
- flink-config: Flink job configuration
- gateway-config: Gateway config (DB credentials resolved via op inject)
Verify:
kubectl --context=prod -n ai get configmaps
Step 7 — Deploy Infrastructure
Infrastructure services (postgres, minio, kafka, iceberg-catalog, dragonfly, qdrant, relay, ingestor, flink) are defined in deploy/k8s/prod/infrastructure.yaml and were applied in Step 4.
Wait for the StatefulSets and Deployments to become ready:
kubectl --context=prod -n ai rollout status statefulset/postgres
kubectl --context=prod -n ai rollout status statefulset/minio
kubectl --context=prod -n ai rollout status statefulset/kafka
kubectl --context=prod -n ai rollout status statefulset/qdrant
kubectl --context=prod -n ai rollout status deployment/dragonfly
kubectl --context=prod -n ai rollout status deployment/iceberg-catalog
kubectl --context=prod -n ai rollout status deployment/relay
kubectl --context=prod -n ai rollout status deployment/ingestor
kubectl --context=prod -n ai rollout status deployment/flink-jobmanager
kubectl --context=prod -n ai rollout status deployment/flink-taskmanager
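The same checks can be run as a loop with a timeout, so that a workload that never becomes ready is reported instead of blocking indefinitely. This is a sketch; wait_for_infra is a hypothetical helper and the 300s default is an arbitrary choice.

```shell
# wait_for_infra [TIMEOUT]
# Wait for every infrastructure workload in the ai namespace, stopping at the
# first one that does not become ready within TIMEOUT (default 300s).
wait_for_infra() {
  wait_timeout=${1:-300s}
  for workload in \
    statefulset/postgres statefulset/minio statefulset/kafka statefulset/qdrant \
    deployment/dragonfly deployment/iceberg-catalog deployment/relay \
    deployment/ingestor deployment/flink-jobmanager deployment/flink-taskmanager
  do
    kubectl --context=prod -n ai rollout status "$workload" \
      --timeout="$wait_timeout" || {
        echo "not ready: $workload" >&2
        return 1
      }
  done
}
```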
A Job named minio-init creates the warehouse bucket automatically on first start. Confirm it completes:
kubectl --context=prod -n ai get jobs
kubectl --context=prod -n ai wait --for=condition=complete job/minio-init --timeout=120s
Step 8 — Deploy Application Images
Build and push the application images:
# Build and push all services
bin/deploy gateway prod
bin/deploy web prod
bin/deploy sandbox prod
bin/deploy lifecycle-sidecar prod
bin/deploy flink prod
bin/deploy relay prod
bin/deploy ingestor prod
Each bin/deploy command builds the Docker image, tags it with the current git SHA, pushes to git.dxod.org/dexorder/dexorder/, and updates the live deployment via kubectl set image.
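The sequence described above can be pictured roughly as follows. This is an illustration only, not the contents of bin/deploy: the function names, the use of the short SHA, the build context directory, and the container name inside the Deployment are all assumptions.

```shell
REGISTRY=git.dxod.org/dexorder/dexorder

# image_ref SERVICE TAG
# Compose the full image reference for a service.
image_ref() {
  echo "$REGISTRY/$1:$2"
}

# deploy_sketch SERVICE DEPLOYMENT [CONTEXT_DIR]
# Build, push, and roll out one image tagged with the current git SHA.
# Assumptions: the build context defaults to a directory named after the
# service, the short SHA is used as the tag, and the container inside the
# Deployment shares the service name.
deploy_sketch() {
  service=$1
  deployment=$2
  context_dir=${3:-$service}
  tag=$(git rev-parse --short HEAD) || return 1
  image=$(image_ref "$service" "$tag")

  docker build -t "$image" "$context_dir" || return 1
  docker push "$image" || return 1
  kubectl --context=prod -n ai set image "deployment/$deployment" \
    "$service=$image"
}
```

The Deployment name is a separate argument because it does not always match the service name; the web UI, for instance, runs as deployment/ai-web.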
Wait for the gateway and web to be ready:
kubectl --context=prod -n ai rollout status deployment/gateway
kubectl --context=prod -n ai rollout status deployment/ai-web
Step 9 — Initialize Schema and Admin User
bin/init prod
This will:
- Wait for postgres to be ready
- Check if the schema exists; apply gateway/schema.sql if not
- Prompt for admin user credentials (email, password, display name, license tier)
- Register the user via the API
- Insert the license record into the database
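The first step, waiting for postgres, is a plain retry loop. A sketch of how such a wait could be written is shown below; it is not taken from bin/init, and it assumes pg_isready is available in the postgres image and that the pod can be addressed as statefulset/postgres.

```shell
# retry ATTEMPTS DELAY COMMAND...
# Run COMMAND until it succeeds, sleeping DELAY seconds between attempts.
# Gives up and returns 1 after ATTEMPTS failed tries.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "gave up after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# postgres_ready
# Ask the postgres pod whether the server accepts connections.
# Assumption: pg_isready exists in the image; statefulset/postgres resolves
# to the running pod.
postgres_ready() {
  kubectl --context=prod -n ai exec statefulset/postgres -- \
    pg_isready -U postgres >/dev/null 2>&1
}
```

With these definitions, retry 30 2 postgres_ready polls every two seconds for up to about a minute.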
Step 10 — Verify TLS and Ingress
cert-manager should automatically provision TLS certificates via Let's Encrypt once the ingress resources are applied and DNS is resolving correctly.
# Check certificate status
kubectl --context=prod -n ai get certificates
kubectl --context=prod -n ai describe certificate dexorder-ai-tls
# Certificates are ready when READY=True
# This can take 1-2 minutes for HTTP-01 challenge completion
Once ready, verify the application is accessible:
curl -I https://dexorder.ai/api/health
# Expected: HTTP/2 200
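For scripted checks, the certificate wait and the health probe can be wrapped so they return an exit code instead of output to read. The helper names below are hypothetical and the 300s default is an arbitrary choice.

```shell
# wait_for_certificate [TIMEOUT]
# Block until the dexorder-ai-tls Certificate reports the Ready condition.
wait_for_certificate() {
  kubectl --context=prod -n ai wait --for=condition=Ready \
    certificate/dexorder-ai-tls --timeout="${1:-300s}"
}

# http_status URL
# Print only the HTTP status code returned for URL.
http_status() {
  curl -s -o /dev/null -w '%{http_code}' "$1"
}

# check_health
# Succeed when the health endpoint answers 200 over HTTPS.
check_health() {
  code=$(http_status https://dexorder.ai/api/health)
  [ "$code" = "200" ] || {
    echo "health check returned $code" >&2
    return 1
  }
}
```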
Day-2 Operations
Update a Service After Code Changes
# Rebuild and redeploy a single service
bin/deploy gateway prod
bin/deploy web prod
Update Secrets
# Update all secrets
bin/secret-update prod
# Update a specific secret
bin/secret-update prod ai-secrets
Update Config
# Update all configs (triggers pod restarts)
bin/config-update prod
# Update a specific config
bin/config-update prod gateway-config
Add a New User
# Re-run init to add another user
bin/init prod
Or insert directly via psql:
PG_POD=$(kubectl --context=prod -n ai get pods -l app=postgres -o jsonpath='{.items[0].metadata.name}')
kubectl --context=prod -n ai exec -it "$PG_POD" -- psql -U postgres -d iceberg
View Logs
kubectl --context=prod -n ai logs -f deployment/gateway
kubectl --context=prod -n ai logs -f deployment/ingestor
kubectl --context=prod -n ai logs -f deployment/flink-jobmanager
kubectl --context=prod -n sandbox logs -l dexorder.io/component=sandbox
Check Sandbox Status
# List all running sandboxes
kubectl --context=prod -n sandbox get deployments
kubectl --context=prod -n sandbox get pods
# Check resource usage in sandbox namespace
kubectl --context=prod -n sandbox top pods
Namespace & Security Architecture
Internet
    │
    ▼
nginx-ingress (dexorder.ai)
    │
    ├──/──────────────────► ai-web:5173 (Vue.js UI)
    │
    └──/api/───────────────► gateway:3000 (Node.js API)
                                  │
                                  │ Creates/manages via k8s API
                                  ▼
                          sandbox namespace
                          ┌──────────────────────┐
                          │ sandbox-<userId>     │
                          │ ├── sandbox          │
                          │ │   (MCP server)     │
                          │ └── lifecycle-sidecar│
                          └──────────────────────┘
                                  │
                                  │ Egress: only ai namespace
                                  │ services + external HTTPS
                                  ▼
                          ai namespace services:
                            gateway:5571 (ZMQ events)
                            iceberg-catalog:8181
                            minio:9000
                            relay:5559
Network Isolation
- Sandbox pods run under a default-deny network policy
- Sandboxes can reach: gateway (ZMQ + callbacks), iceberg-catalog, minio, relay, external HTTPS (port 443)
- Sandboxes cannot reach: other sandbox pods, the Kubernetes API, private IP ranges
- The admission policy (dexorder-sandbox-image-policy) prevents non-approved images from running in the sandbox namespace
Troubleshooting
Pods stuck in Pending
kubectl --context=prod -n ai describe pod <pod-name>
# Look for: resource quota exceeded, PVC not bound, image pull errors
Certificate not issuing
kubectl --context=prod -n ai describe certificaterequest
kubectl --context=prod -n cert-manager logs -l app=cert-manager
# Common cause: DNS not pointing to cluster ingress IP yet
Gateway can't create sandboxes
# Verify RBAC is correct
kubectl --context=prod auth can-i create deployments \
--as=system:serviceaccount:ai:gateway -n sandbox
# Should return: yes
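Managing a sandbox usually involves more than creating it, so it can help to check several verbs at once. The loop below is a sketch; check_gateway_rbac is a hypothetical name and the verb list is a guess at what the gateway needs, not something taken from the RBAC manifests.

```shell
# check_gateway_rbac
# Ask the API server whether the gateway service account may perform each
# verb on Deployments in the sandbox namespace. Returns non-zero if any
# verb is denied.
# Assumption: the verb list is illustrative; adjust it to the actual Role.
check_gateway_rbac() {
  denied=0
  for verb in create get list patch delete; do
    answer=$(kubectl --context=prod auth can-i "$verb" deployments \
      --as=system:serviceaccount:ai:gateway -n sandbox)
    echo "$verb deployments: $answer"
    [ "$answer" = "yes" ] || denied=1
  done
  return "$denied"
}
```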
Sandbox pod fails to start with "configmap not found"
This would indicate a leftover reference to sandbox-config (removed from the template). Check the sandbox deployment spec:
kubectl --context=prod -n sandbox describe deployment sandbox-<userId>
1Password auth expired
op signin
bin/secret-update prod