dexorder/ai


Tim Olson eab581f8cb prod deployment

2026-04-01 18:34:08 -04:00


Production Cluster Setup Guide

This guide covers setting up the Dexorder AI platform from scratch on a fresh Kubernetes cluster.

Overview

The platform runs across two namespaces:

Namespace	Contents
`ai`	Gateway, web UI, all infrastructure services (postgres, minio, kafka, flink, relay, ingestor, qdrant, dragonfly, iceberg-catalog)
`sandbox`	Per-user sandbox containers (created dynamically by the gateway)

Secrets are managed via 1Password CLI (op inject). All .tpl.yaml files in deploy/k8s/prod/secrets/ contain op:// references and are safe to commit; actual values are never stored in git.

Prerequisites

Tooling

Tool	Purpose	Min Version
`kubectl`	Cluster management	1.30+
`kustomize`	Manifest rendering	5.x
`op`	1Password CLI	2.x
`docker`	Image builds	-

Cluster Requirements

Kubernetes: 1.30+ (required for ValidatingAdmissionPolicy GA)
nginx-ingress-controller: For ingress routing and WebSocket support
cert-manager: For TLS certificate provisioning (with letsencrypt-prod ClusterIssuer)
Persistent volume provisioner: StorageClass standard must exist and be functional
DNS: dexorder.ai resolves to the cluster's ingress IP/load balancer
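
The version requirement above can be checked before going further. The following is a minimal sketch, not part of the repository: the function name `version_at_least` is hypothetical, and the usage comments assume `jq` is installed, which the tooling table does not list.

```shell
# Hypothetical helper: succeed if version $1 (e.g. "v1.31.2") is at least
# the major.minor version $2 (e.g. "1.30").
version_at_least() {
  have=${1#v}                        # strip a leading "v"
  have_major=${have%%.*}             # text before the first dot
  rest=${have#*.}
  have_minor=${rest%%.*}             # text between the first and second dots
  have_minor=${have_minor%%[!0-9]*}  # drop suffixes such as "+"
  want_major=${2%%.*}
  want_minor=${2#*.}
  [ "$have_major" -gt "$want_major" ] ||
    { [ "$have_major" -eq "$want_major" ] && [ "$have_minor" -ge "$want_minor" ]; }
}

# Example usage (needs cluster access; jq is an assumption):
#   server=$(kubectl version -o json | jq -r .serverVersion.gitVersion)
#   version_at_least "$server" 1.30 || echo "Kubernetes 1.30+ required" >&2
#   kubectl get storageclass standard
```

Comparing the major and minor numbers as integers avoids the trap of string comparison, where "1.9" would sort after "1.30".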

Container Registry Access

Images are hosted at git.dxod.org/dexorder/dexorder/. The cluster must be able to pull from this registry. If the registry requires authentication, create an image pull secret before deploying.

Step 1 — Configure kubectl Context

Create a dedicated context named prod that defaults to the ai namespace:

# Add cluster credentials (replace with your actual kubeconfig details)
kubectl config set-cluster prod-cluster \
  --server=https://<your-cluster-api-endpoint> \
  --certificate-authority=/path/to/ca.crt

kubectl config set-credentials prod-user \
  --client-certificate=/path/to/client.crt \
  --client-key=/path/to/client.key

kubectl config set-context prod \
  --cluster=prod-cluster \
  --user=prod-user \
  --namespace=ai

# Verify
kubectl --context=prod cluster-info

All bin/ scripts use kubectl --context=prod for production operations.
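
For ad-hoc commands it can help to mirror that convention in your own shell. The sketch below is an illustration only: `require_context` and `kprod` are hypothetical names and are not defined by the bin/ scripts.

```shell
# Hypothetical guard: fail early when the named kubectl context is missing.
require_context() {
  if ! kubectl config get-contexts "$1" -o name >/dev/null 2>&1; then
    echo "kubectl context '$1' not found; complete Step 1 first" >&2
    return 1
  fi
}

# Hypothetical shorthand: run kubectl against the prod context explicitly,
# regardless of which context is currently selected.
kprod() {
  kubectl --context=prod "$@"
}

# Example usage:
#   require_context prod && kprod get pods
```

Passing `--context` on every call, rather than switching the current context, means a stray `kubectl config use-context` in another terminal cannot redirect a production command.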

Step 2 — Install Cluster Prerequisites

nginx-ingress-controller

kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.10.0/deploy/static/provider/cloud/deploy.yaml
kubectl -n ingress-nginx wait --for=condition=ready pod -l app.kubernetes.io/component=controller --timeout=120s

cert-manager

kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.0/cert-manager.yaml
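# Wait for the controller, webhook, and cainjector pods; the webhook must be up before the ClusterIssuer is created
kubectl -n cert-manager wait --for=condition=ready pod -l app.kubernetes.io/instance=cert-manager --timeout=120s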

Then create the letsencrypt-prod ClusterIssuer, replacing the email address with your own:

# Save as /tmp/clusterissuer.yaml and apply
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: your-email@dexorder.ai
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
      - http01:
          ingress:
            class: nginx

kubectl apply -f /tmp/clusterissuer.yaml

Step 3 — Set Up 1Password Vault

All production secrets are stored under the AI Prod vault in 1Password. The bin/op-setup script creates the vault and all required items with placeholder values so you can fill them in before deploying.

# Sign in to 1Password
op signin

# Preview what will be created (no changes)
bin/op-setup --dry-run

# Create the vault and all items
bin/op-setup

After running the script, open 1Password and update each item in the AI Prod vault with real values:

Item	Fields	Where to get the value
`PostgreSQL`	`password`	Generate: `openssl rand -base64 32`
`MinIO`	`access_key`, `secret_key`	`access_key` can stay `minio-admin`; generate a strong `secret_key`
`Gateway`	`anthropic_api_key`	Anthropic Console → API Keys
`Gateway`	`jwt_secret`	Generate: `openssl rand -base64 48`
`Gateway`	`openai_api_key`	OpenAI Platform → API Keys (optional)
`Gateway`	`google_api_key`	Google AI Studio (optional)
`Gateway`	`openrouter_api_key`	OpenRouter (optional)
`Telegram`	`bot_token`	BotFather → `/newbot` (optional)
`Ingestor`	`binance_api_key/secret`	Binance API Management (optional)
`Ingestor`	`coinbase_api_key/secret`	Coinbase CDP Portal (optional)
`Ingestor`	`kraken_api_key/secret`	Kraken API Settings (optional)

Verify the references resolve before continuing:

op inject -i deploy/k8s/prod/secrets/gateway-secrets.tpl.yaml | head -20
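
The same check can be run over every template at once. This is a minimal sketch under stated assumptions: the function name `check_templates` is hypothetical, and it assumes all templates in the directory end in `.tpl.yaml` as described in the Overview.

```shell
# Hypothetical helper: render every template in a directory, discard the
# rendered output, and report the templates that fail to resolve.
check_templates() {
  dir=${1:-deploy/k8s/prod/secrets}
  failed=0
  for tpl in "$dir"/*.tpl.yaml; do
    [ -e "$tpl" ] || continue   # the directory contained no templates
    if ! op inject -i "$tpl" >/dev/null; then
      echo "failed to resolve: $tpl" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Example usage (requires a signed-in op CLI):
#   check_templates deploy/k8s/prod/secrets && echo "all references resolve"
```

Sending the rendered output to /dev/null keeps real secret values out of the terminal scrollback while still surfacing resolution errors.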

Step 4 — Apply Base Manifests

This creates namespaces, RBAC, network policies, admission policies, and resource quotas.

kubectl --context=prod apply -k deploy/k8s/prod/

Verify the namespaces and key resources are created:

kubectl --context=prod get namespaces ai sandbox
kubectl --context=prod -n ai get serviceaccount gateway
kubectl --context=prod -n sandbox get serviceaccount sandbox-lifecycle
kubectl --context=prod get validatingadmissionpolicy dexorder-sandbox-image-policy

Step 5 — Apply Secrets

# Apply all secrets (uses op inject to resolve op:// references)
bin/secret-update prod

This prompts for confirmation, then applies all seven secrets:

ai-secrets (Anthropic API key)
postgres-secret (PostgreSQL password)
minio-secret (MinIO credentials)
ingestor-secrets (exchange API keys)
flink-secrets (MinIO credentials for Flink)
gateway-secrets (gateway application secrets)
sandbox-secrets (secrets mounted in sandbox pods)

Verify:

kubectl --context=prod -n ai get secrets
kubectl --context=prod -n sandbox get secret sandbox-secrets
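
To confirm each of the seven secrets by name rather than scanning the listing by eye, a loop like the following could be used. It is a sketch only: `check_secrets` is a hypothetical name, and the namespace mapping (everything in `ai` except `sandbox-secrets`) is an assumption inferred from the verify commands above.

```shell
# Hypothetical check: report any expected Secret that is missing.
# Assumption: every secret lives in `ai` except sandbox-secrets (`sandbox`).
check_secrets() {
  missing=0
  for name in ai-secrets postgres-secret minio-secret ingestor-secrets \
              flink-secrets gateway-secrets sandbox-secrets; do
    if [ "$name" = "sandbox-secrets" ]; then
      ns=sandbox
    else
      ns=ai
    fi
    if ! kubectl --context=prod -n "$ns" get secret "$name" >/dev/null 2>&1; then
      echo "missing: $ns/$name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example usage:
#   check_secrets && echo "all seven secrets present"
```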

Step 6 — Apply Configs

# Apply all configs (gateway-config uses op inject; others are plain YAML)
bin/config-update prod

This applies:

relay-config — ZMQ relay configuration
ingestor-config — CCXT ingestor configuration
flink-config — Flink job configuration
gateway-config — Gateway config (DB credentials resolved via op inject)

Verify:

kubectl --context=prod -n ai get configmaps

Step 7 — Deploy Infrastructure

Infrastructure services (postgres, minio, kafka, iceberg-catalog, dragonfly, qdrant, relay, ingestor, flink) are defined in deploy/k8s/prod/infrastructure.yaml and were already applied in Step 4, so this step only waits for them to become ready.

Wait for the StatefulSets and Deployments to become ready:

kubectl --context=prod -n ai rollout status statefulset/postgres
kubectl --context=prod -n ai rollout status statefulset/minio
kubectl --context=prod -n ai rollout status statefulset/kafka
kubectl --context=prod -n ai rollout status statefulset/qdrant
kubectl --context=prod -n ai rollout status deployment/dragonfly
kubectl --context=prod -n ai rollout status deployment/iceberg-catalog
kubectl --context=prod -n ai rollout status deployment/relay
kubectl --context=prod -n ai rollout status deployment/ingestor
kubectl --context=prod -n ai rollout status deployment/flink-jobmanager
kubectl --context=prod -n ai rollout status deployment/flink-taskmanager
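
The rollout checks above can also be wrapped in a loop that stops at the first workload that fails to become ready. This is a minimal sketch: `wait_for_rollouts` is a hypothetical name, and the 300-second default timeout is an arbitrary assumption rather than a project setting.

```shell
# Hypothetical helper: wait for each infrastructure workload in turn and
# stop at the first one that does not become ready within the timeout.
wait_for_rollouts() {
  timeout=${1:-300s}   # assumed default; pass another value as $1
  for workload in \
    statefulset/postgres statefulset/minio statefulset/kafka statefulset/qdrant \
    deployment/dragonfly deployment/iceberg-catalog deployment/relay \
    deployment/ingestor deployment/flink-jobmanager deployment/flink-taskmanager
  do
    kubectl --context=prod -n ai rollout status "$workload" --timeout="$timeout" || return 1
  done
}

# Example usage:
#   wait_for_rollouts 600s || echo "infrastructure not ready" >&2
```

Without `--timeout`, `kubectl rollout status` blocks until the rollout finishes, so a pod stuck in Pending would hang the check indefinitely.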

On first start, the minio-init Job creates the warehouse bucket in MinIO. Confirm it completes:

kubectl --context=prod -n ai get jobs
kubectl --context=prod -n ai wait --for=condition=complete job/minio-init --timeout=120s

Step 8 — Deploy Application Images

Build and push the application images:

# Build and push all services
bin/deploy gateway prod
bin/deploy web prod
bin/deploy sandbox prod
bin/deploy lifecycle-sidecar prod
bin/deploy flink prod
bin/deploy relay prod
bin/deploy ingestor prod

Each bin/deploy command builds the Docker image, tags it with the current git SHA, pushes to git.dxod.org/dexorder/dexorder/, and updates the live deployment via kubectl set image.

Wait for the gateway and web UI deployments to be ready:

kubectl --context=prod -n ai rollout status deployment/gateway
kubectl --context=prod -n ai rollout status deployment/ai-web

Step 9 — Initialize Schema and Admin User

bin/init prod

This will:

Wait for postgres to be ready
Check if the schema exists; apply gateway/schema.sql if not
Prompt for admin user credentials (email, password, display name, license tier)
Register the user via the API
Insert the license record into the database

Step 10 — Verify TLS and Ingress

cert-manager should automatically provision TLS certificates via Let's Encrypt once the ingress resources are applied and DNS is resolving correctly.

# Check certificate status
kubectl --context=prod -n ai get certificates
kubectl --context=prod -n ai describe certificate dexorder-ai-tls

# Certificates are ready when READY=True
# This can take 1-2 minutes for HTTP-01 challenge completion

Once ready, verify the application is accessible:

curl -I https://dexorder.ai/api/health
# Expected: HTTP/2 200
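
Because certificate issuance can lag by a minute or two, the health check may need a few retries. The following sketch polls until the endpoint answers 200; `wait_for_health` is a hypothetical name, and the attempt count and delay defaults are arbitrary assumptions.

```shell
# Hypothetical helper: poll a URL until it returns HTTP 200 or the
# attempts run out. Arguments: URL, attempts (default 30), delay in
# seconds between attempts (default 10).
wait_for_health() {
  url=$1
  attempts=${2:-30}
  delay=${3:-10}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -w prints only the status code; curl reports 000 on connection failure
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url") || code=000
    if [ "$code" = "200" ]; then
      return 0
    fi
    echo "attempt $i/$attempts: HTTP $code" >&2
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example usage:
#   wait_for_health https://dexorder.ai/api/health 12 10
```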

Day-2 Operations

Update a Service After Code Changes

# Rebuild and redeploy a single service
bin/deploy gateway prod
bin/deploy web prod

Update Secrets

# Update all secrets
bin/secret-update prod

# Update a specific secret
bin/secret-update prod ai-secrets

Update Config

# Update all configs (triggers pod restarts)
bin/config-update prod

# Update a specific config
bin/config-update prod gateway-config

Add a New User

# Re-run init to add another user
bin/init prod

Or insert directly via psql:

PG_POD=$(kubectl --context=prod -n ai get pods -l app=postgres -o jsonpath='{.items[0].metadata.name}')
kubectl --context=prod -n ai exec -it "$PG_POD" -- psql -U postgres -d iceberg

View Logs

kubectl --context=prod -n ai logs -f deployment/gateway
kubectl --context=prod -n ai logs -f deployment/ingestor
kubectl --context=prod -n ai logs -f deployment/flink-jobmanager
kubectl --context=prod -n sandbox logs -l dexorder.io/component=sandbox

Check Sandbox Status

# List all running sandboxes
kubectl --context=prod -n sandbox get deployments
kubectl --context=prod -n sandbox get pods

# Check resource usage in the sandbox namespace (requires metrics-server in the cluster)
kubectl --context=prod -n sandbox top pods

Namespace & Security Architecture

Internet
    │
    ▼
nginx-ingress (dexorder.ai)
    │
    ├──/──────────────────► ai-web:5173 (Vue.js UI)
    │
    └──/api/───────────────► gateway:3000 (Node.js API)
                                  │
                                  │  Creates/manages via k8s API
                                  ▼
                             sandbox namespace
                             ┌──────────────────────┐
                             │  sandbox-<userId>     │
                             │  ├── sandbox          │
                             │  │   (MCP server)     │
                             │  └── lifecycle-sidecar│
                             └──────────────────────┘
                                  │
                                  │  Egress: only ai namespace
                                  │  services + external HTTPS
                                  ▼
                             ai namespace services:
                             gateway:5571 (ZMQ events)
                             iceberg-catalog:8181
                             minio:9000
                             relay:5559

Network Isolation

Sandbox pods have a default-deny network policy
Sandboxes can reach: gateway (ZMQ + callbacks), iceberg-catalog, minio, relay, external HTTPS (port 443)
Sandboxes cannot reach: other sandbox pods, the Kubernetes API, private IP ranges
The admission policy (dexorder-sandbox-image-policy) prevents non-approved images from running in the sandbox namespace

Troubleshooting

Pods stuck in `Pending`

kubectl --context=prod -n ai describe pod <pod-name>
# Look for: resource quota exceeded, PVC not bound, image pull errors

Certificate not issuing

kubectl --context=prod -n ai describe certificaterequest
kubectl --context=prod -n cert-manager logs -l app=cert-manager
# Common cause: DNS not pointing to cluster ingress IP yet

Gateway can't create sandboxes

# Verify RBAC is correct
kubectl --context=prod auth can-i create deployments \
  --as=system:serviceaccount:ai:gateway -n sandbox

# Should return: yes
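
To probe several verbs in one pass, the same `can-i` call can be looped. This is a sketch only: `check_gateway_rbac` is a hypothetical name, and the verbs in the usage example are illustrative, since this guide does not list the exact permissions the gateway's role grants.

```shell
# Hypothetical helper: check that the gateway service account may perform
# each given verb on a resource in the sandbox namespace.
# Arguments: resource, then one or more verbs.
check_gateway_rbac() {
  resource=$1
  shift
  denied=0
  for verb in "$@"; do
    # can-i prints "yes" or "no" and exits non-zero on "no"
    answer=$(kubectl --context=prod auth can-i "$verb" "$resource" \
      --as=system:serviceaccount:ai:gateway -n sandbox 2>/dev/null) || true
    if [ "$answer" != "yes" ]; then
      echo "denied: $verb $resource" >&2
      denied=1
    fi
  done
  return "$denied"
}

# Example usage (verbs are illustrative):
#   check_gateway_rbac deployments create get delete
```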

Sandbox pod fails to start with "configmap not found"

This indicates a leftover reference to sandbox-config, which has been removed from the template. Check the sandbox deployment spec:

kubectl --context=prod -n sandbox describe deployment sandbox-<userId>

1Password auth expired

op signin
bin/secret-update prod
