# Production Cluster Setup Guide

This guide covers setting up the Dexorder AI platform from scratch on a fresh Kubernetes cluster.

---

## Overview

The platform runs across two namespaces:

| Namespace | Contents |
|-----------|----------|
| `ai` | Gateway, web UI, all infrastructure services (postgres, minio, kafka, flink, relay, ingestor, dragonfly, iceberg-catalog) |
| `sandbox` | Per-user sandbox containers (created dynamically by the gateway) |

Secrets are managed via 1Password CLI (`op inject`). All `.tpl.yaml` files in `deploy/k8s/prod/secrets/` contain `op://` references and are safe to commit; actual values are never stored in git.

---

## Prerequisites

### Tooling

| Tool | Purpose | Min Version |
|------|---------|-------------|
| `kubectl` | Cluster management | 1.30+ |
| `kustomize` | Manifest rendering | 5.x |
| `op` | 1Password CLI | 2.x |
| `docker` | Image builds | - |

### Cluster Requirements

- **Kubernetes**: 1.30+ (required for `ValidatingAdmissionPolicy` GA)
- **nginx-ingress-controller**: For ingress routing and WebSocket support
- **cert-manager**: For TLS certificate provisioning (with `letsencrypt-prod` ClusterIssuer)
- **Persistent volume provisioner**: StorageClass `standard` must exist and be functional
- **DNS**: `dexorder.ai` resolves to the cluster's ingress IP/load balancer

### Container Registry Access

Images are hosted at `git.dxod.org/dexorder/dexorder/`. The cluster must be able to pull from this registry. If the registry requires authentication, create an image pull secret before deploying.
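If authentication is needed, one possible way to create the pull secret is sketched below. The secret name `regcred`, the helper name `create_pull_secret`, and the `REGISTRY_USER` / `REGISTRY_TOKEN` variables are illustrative assumptions, not names defined by this repository. The helper also assumes the `prod` context (Step 1) and the target namespace (Step 4) already exist.

```bash
# Hypothetical helper: regcred, REGISTRY_USER and REGISTRY_TOKEN are
# placeholder names chosen for this sketch, not part of the repo.
create_pull_secret() {
  local namespace="$1"
  kubectl --context=prod -n "$namespace" create secret docker-registry regcred \
    --docker-server=git.dxod.org \
    --docker-username="$REGISTRY_USER" \
    --docker-password="$REGISTRY_TOKEN"
}
```

With the two variables set, `create_pull_secret ai` and `create_pull_secret sandbox` would create the secret in each namespace. Pods only use it if their spec or service account lists it under `imagePullSecrets`, which this guide does not cover.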

---

## Step 1 — Configure kubectl Context

Create a dedicated context named `prod` that defaults to the `ai` namespace:

```bash
# Add cluster credentials (replace with your actual kubeconfig details)
kubectl config set-cluster prod-cluster \
  --server=https:// \
  --certificate-authority=/path/to/ca.crt

kubectl config set-credentials prod-user \
  --client-certificate=/path/to/client.crt \
  --client-key=/path/to/client.key

kubectl config set-context prod \
  --cluster=prod-cluster \
  --user=prod-user \
  --namespace=ai

# Verify
kubectl --context=prod cluster-info
```

All `bin/` scripts use `kubectl --context=prod` for production operations.

---

## Step 2 — Install Cluster Prerequisites

### nginx-ingress-controller

```bash
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.10.0/deploy/static/provider/cloud/deploy.yaml
kubectl -n ingress-nginx wait --for=condition=ready pod -l app.kubernetes.io/component=controller --timeout=120s
```

### cert-manager

```bash
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.0/cert-manager.yaml
kubectl -n cert-manager wait --for=condition=ready pod -l app=cert-manager --timeout=120s
```

Then create the `letsencrypt-prod` ClusterIssuer. Edit the email address:

```yaml
# Save as /tmp/clusterissuer.yaml and apply
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: your-email@dexorder.ai
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
      - http01:
          ingress:
            class: nginx
```

```bash
kubectl apply -f /tmp/clusterissuer.yaml
```

---

## Step 3 — Set Up 1Password Vault

All production secrets are stored under the **AI Prod** vault in 1Password. The `bin/op-setup` script creates the vault and all required items with placeholder values so you can fill them in before deploying.

```bash
# Sign in to 1Password
op signin

# Preview what will be created (no changes)
bin/op-setup --dry-run

# Create the vault and all items
bin/op-setup
```

After running the script, open 1Password and update each item in the **AI Prod** vault with real values:

| Item | Fields | Where to get the value |
|------|--------|------------------------|
| `PostgreSQL` | `password` | Generate: `openssl rand -base64 32` |
| `MinIO` | `access_key`, `secret_key` | `access_key` can stay `minio-admin`; generate a strong `secret_key` |
| `Gateway` | `anthropic_api_key` | [Anthropic Console](https://console.anthropic.com) → API Keys |
| `Gateway` | `jwt_secret` | Generate: `openssl rand -base64 48` |
| `Gateway` | `openai_api_key` | [OpenAI Platform](https://platform.openai.com) → API Keys (optional) |
| `Gateway` | `google_api_key` | Google AI Studio (optional) |
| `Gateway` | `openrouter_api_key` | [OpenRouter](https://openrouter.ai) (optional) |
| `Telegram` | `bot_token` | BotFather → `/newbot` (optional) |
| `Ingestor` | `binance_api_key/secret` | Binance API Management (optional) |
| `Ingestor` | `coinbase_api_key/secret` | Coinbase CDP Portal (optional) |
| `Ingestor` | `kraken_api_key/secret` | Kraken API Settings (optional) |

Verify the references resolve before continuing:

```bash
op inject -i deploy/k8s/prod/secrets/gateway-secrets.tpl.yaml | head -20
```

---

## Step 4 — Apply Base Manifests

This creates namespaces, RBAC, network policies, admission policies, and resource quotas.

```bash
kubectl --context=prod apply -k deploy/k8s/prod/
```

Verify the namespaces and key resources are created:

```bash
kubectl --context=prod get namespaces ai sandbox
kubectl --context=prod -n ai get serviceaccount gateway
kubectl --context=prod -n sandbox get serviceaccount sandbox-lifecycle
kubectl --context=prod get validatingadmissionpolicy dexorder-sandbox-image-policy
```

---

## Step 5 — Apply Secrets

```bash
# Apply all secrets (uses op inject to resolve op:// references)
bin/secret-update prod
```

This will prompt for confirmation, then apply all 7 secrets:

- `ai-secrets` (Anthropic API key)
- `postgres-secret` (PostgreSQL password)
- `minio-secret` (MinIO credentials)
- `ingestor-secrets` (exchange API keys)
- `flink-secrets` (MinIO credentials for Flink)
- `gateway-secrets` (gateway application secrets)
- `sandbox-secrets` (secrets mounted in sandbox pods)

Verify:

```bash
kubectl --context=prod -n ai get secrets
kubectl --context=prod -n sandbox get secret sandbox-secrets
```

---

## Step 6 — Apply Configs

```bash
# Apply all configs (gateway-config uses op inject; others are plain YAML)
bin/config-update prod
```

This applies:

- `relay-config` — ZMQ relay configuration
- `ingestor-config` — CCXT ingestor configuration
- `flink-config` — Flink job configuration
- `gateway-config` — Gateway config (DB credentials resolved via op inject)

Verify:

```bash
kubectl --context=prod -n ai get configmaps
```

---

## Step 7 — Deploy Infrastructure

Infrastructure services (postgres, minio, kafka, iceberg-catalog, dragonfly, relay, ingestor, flink) are defined in `deploy/k8s/prod/infrastructure.yaml` and were applied in Step 4.

Wait for the StatefulSets and Deployments to become ready:

```bash
kubectl --context=prod -n ai rollout status statefulset/postgres
kubectl --context=prod -n ai rollout status statefulset/minio
kubectl --context=prod -n ai rollout status statefulset/kafka
kubectl --context=prod -n ai rollout status deployment/dragonfly
kubectl --context=prod -n ai rollout status deployment/iceberg-catalog
kubectl --context=prod -n ai rollout status deployment/relay
kubectl --context=prod -n ai rollout status deployment/ingestor
kubectl --context=prod -n ai rollout status deployment/flink-jobmanager
kubectl --context=prod -n ai rollout status deployment/flink-taskmanager
```

MinIO will automatically run a Job to create the `warehouse` bucket on first start. Confirm it completes:

```bash
kubectl --context=prod -n ai get jobs
kubectl --context=prod -n ai wait --for=condition=complete job/minio-init --timeout=120s
```

---

## Step 8 — Deploy Application Images

Build and push the application images:

```bash
# Build and push all services
bin/deploy gateway prod
bin/deploy web prod
bin/deploy sandbox prod
bin/deploy lifecycle-sidecar prod
bin/deploy flink prod
bin/deploy relay prod
bin/deploy ingestor prod
```

Each `bin/deploy` command builds the Docker image, tags it with the current git SHA, pushes to `git.dxod.org/dexorder/dexorder/`, and updates the live deployment via `kubectl set image`.

Wait for the gateway and web to be ready:

```bash
kubectl --context=prod -n ai rollout status deployment/gateway
kubectl --context=prod -n ai rollout status deployment/ai-web
```

---

## Step 9 — Initialize Schema and Admin User

```bash
bin/init prod
```

This will:

1. Wait for postgres to be ready
2. Check if the schema exists; apply `gateway/schema.sql` if not
3. Prompt for admin user credentials (email, password, display name, license tier)
4. Register the user via the API
5. Insert the license record into the database

---

## Step 10 — Verify TLS and Ingress

cert-manager should automatically provision TLS certificates via Let's Encrypt once the ingress resources are applied and DNS is resolving correctly.

```bash
# Check certificate status
kubectl --context=prod -n ai get certificates
kubectl --context=prod -n ai describe certificate dexorder-ai-tls

# Certificates are ready when READY=True
# This can take 1-2 minutes for HTTP-01 challenge completion
```

Once ready, verify the application is accessible:

```bash
curl -I https://dexorder.ai/api/health
# Expected: HTTP/2 200
```

---

## Day-2 Operations

### Update a Service After Code Changes

```bash
# Rebuild and redeploy a single service
bin/deploy gateway prod
bin/deploy web prod
```

### Update Secrets

```bash
# Update all secrets
bin/secret-update prod

# Update a specific secret
bin/secret-update prod ai-secrets
```

### Update Config

```bash
# Update all configs (triggers pod restarts)
bin/config-update prod

# Update a specific config
bin/config-update prod gateway-config
```

### Add a New User

```bash
# Re-run init to add another user
bin/init prod
```

Or insert directly via psql:

```bash
PG_POD=$(kubectl --context=prod -n ai get pods -l app=postgres -o jsonpath='{.items[0].metadata.name}')
kubectl --context=prod -n ai exec -it "$PG_POD" -- psql -U postgres -d iceberg
```

### View Logs

```bash
kubectl --context=prod -n ai logs -f deployment/gateway
kubectl --context=prod -n ai logs -f deployment/ingestor
kubectl --context=prod -n ai logs -f deployment/flink-jobmanager
kubectl --context=prod -n sandbox logs -l dexorder.io/component=sandbox
```

### Check Sandbox Status

```bash
# List all running sandboxes
kubectl --context=prod -n sandbox get deployments
kubectl --context=prod -n sandbox get pods

# Check resource usage in sandbox namespace
kubectl --context=prod -n sandbox top pods
```

---

## Namespace & Security Architecture

```
Internet
    │
    ▼
nginx-ingress (dexorder.ai)
    │
    ├──/──────────────────► ai-web:5173  (Vue.js UI)
    │
    └──/api/───────────────► gateway:3000  (Node.js API)
                                  │
                                  │ Creates/manages via k8s API
                                  ▼
                         sandbox namespace
                         ┌──────────────────────┐
                         │ sandbox-             │
                         │ ├── sandbox          │
                         │ │   (MCP server)     │
                         │ └── lifecycle-sidecar│
                         └──────────────────────┘
                                  │
                                  │ Egress: only ai namespace
                                  │ services + external HTTPS
                                  ▼
                         ai namespace services:
                           gateway:5571 (ZMQ events)
                           iceberg-catalog:8181
                           minio:9000
                           relay:5559
```

### Network Isolation

- Sandbox pods have a default-deny network policy
- Sandboxes can reach: gateway (ZMQ + callbacks), iceberg-catalog, minio, relay, external HTTPS (port 443)
- Sandboxes cannot reach: other sandbox pods, the Kubernetes API, private IP ranges
- The admission policy (`dexorder-sandbox-image-policy`) prevents non-approved images from running in the sandbox namespace

---

## Troubleshooting

### Pods stuck in `Pending`

```bash
kubectl --context=prod -n ai describe pod
# Look for: resource quota exceeded, PVC not bound, image pull errors
```

### Certificate not issuing

```bash
kubectl --context=prod -n ai describe certificaterequest
kubectl --context=prod -n cert-manager logs -l app=cert-manager
# Common cause: DNS not pointing to cluster ingress IP yet
```

### Gateway can't create sandboxes

```bash
# Verify RBAC is correct
kubectl --context=prod auth can-i create deployments \
  --as=system:serviceaccount:ai:gateway -n sandbox
# Should return: yes
```

### Sandbox pod fails to start with "configmap not found"

This indicates a leftover reference to `sandbox-config`, which has been removed from the template. Check the sandbox deployment spec:

```bash
kubectl --context=prod -n sandbox describe deployment sandbox-
```

### 1Password auth expired

```bash
op signin
bin/secret-update prod
```
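
To avoid signing in again when the session is still valid, a small guard can check it first. This is a minimal sketch that assumes `op whoami` exits non-zero when no account is signed in. The function name `ensure_op_session` is hypothetical and is not one of the `bin/` scripts.

```bash
# Hypothetical helper: sign in only when the current op session is not valid.
# Assumes `op whoami` returns a non-zero exit status when signed out.
ensure_op_session() {
  if ! op whoami >/dev/null 2>&1; then
    op signin
  fi
}
```

Running `ensure_op_session && bin/secret-update prod` would then re-authenticate only when needed before applying the secrets.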