# Production Deployment Guide

This document describes the full process for deploying the AI platform to the production Kubernetes cluster, including the special steps required when the Iceberg schema has changed.

## Overview

The production cluster runs under `kubectl --context prod`, defaulting to the `ai` namespace. The `sandbox` namespace is shared between dev and prod.

Deployment consists of two parts:

1. **Standard deploy** — rebuild and push all images, apply k8s manifests, roll out services
2. **Iceberg schema wipe** *(when schema has changed)* — clear both the Iceberg REST catalog (postgres) and the MinIO data warehouse before deploying

---

## Standard Deployment (no schema changes)

```bash
bin/deploy-all --sandboxes
```

This script (hardcoded to `--context=prod`) performs:

1. Applies base kustomize manifests (`deploy/k8s/prod/`) — namespaces, RBAC, policies
2. Applies `deploy/k8s/prod/infrastructure.yaml` — statefulsets, deployments
3. Runs `bin/config-update prod` — updates ConfigMaps
4. Builds and pushes images for all 7 services: `gateway`, `web`, `sandbox`, `lifecycle-sidecar`, `flink`, `relay`, `ingestor`
5. *(with `--sandboxes`)* Deletes sandbox Deployments and Services in the `sandbox` namespace (PVCs are retained; gateway recreates them on next login)
6. Waits for rollouts on all 6 main deployments

> **Secrets are NOT updated by this script.** Run `bin/secret-update prod` separately if secrets have changed.

### Post-deploy: refresh user licenses

After any deploy that changes license tier templates (`gateway/src/types/user.ts`), run:

```bash
bin/create-all-users prod
```

This upserts all alpha users and re-applies the current tier template to their `user_licenses` row. It is safe to run on an existing database — it will not delete users or lose data. New sandbox deployments pick up the updated resource limits on next login.

---

## Full Deploy with Iceberg Schema Wipe

Use this when the Iceberg table schema has changed (e.g. protobuf/column changes in the `trading.ohlc` table).

### Architecture note

The Iceberg REST catalog uses **two storage layers** that must both be cleared:

| Layer | What it stores | How to clear |
|---|---|---|
| PostgreSQL `iceberg` database | Table/namespace metadata (catalog) | Drop and recreate the database |
| MinIO `warehouse` bucket | Parquet data files | `mc rm --recursive --force` |

**Important:** The gateway also uses the `iceberg` postgres database for its own auth tables (`user`, `user_licenses`, `session`, etc.). Wiping the database removes all user accounts. After the wipe, the schema must be re-applied and users recreated.

### Step-by-step

#### 1. Scale down Iceberg consumers

```bash
kubectl --context prod -n ai scale deployment iceberg-catalog flink-jobmanager flink-taskmanager --replicas=0
```

This prevents in-flight writes during the wipe.

#### 2. Wipe the Iceberg PostgreSQL catalog

```bash
kubectl --context prod -n ai exec postgres-0 -- psql -U postgres -c "DROP DATABASE iceberg;"
kubectl --context prod -n ai exec postgres-0 -- psql -U postgres -c "CREATE DATABASE iceberg;"
```

If the `DROP` fails with `database "iceberg" is being accessed by other users`, something still holds open connections — typically the gateway, which shares this database (see the architecture note above). Scale that deployment to zero first, or on PostgreSQL 13+ use `DROP DATABASE iceberg WITH (FORCE);`.
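Before moving on to the MinIO wipe, it can be worth confirming the recreated database is actually empty. This is not one of the guide's scripts, just a plain `psql` check:

```bash
# Optional sanity check: the freshly recreated database should contain no tables.
# psql prints "Did not find any relations." when the catalog is truly empty.
kubectl --context prod -n ai exec postgres-0 -- psql -U postgres -d iceberg -c '\dt'
```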
#### 3. Wipe the MinIO warehouse bucket

Get MinIO credentials from the cluster secret:

```bash
kubectl --context prod -n ai get secret minio-secret -o jsonpath='{.data.root-user}' | base64 -d
kubectl --context prod -n ai get secret minio-secret -o jsonpath='{.data.root-password}' | base64 -d
```

Configure the `mc` client inside the MinIO pod and remove all objects. `mc alias set` takes the access key and secret key as its final two arguments — substitute the values decoded above:

```bash
kubectl --context prod -n ai exec minio-0 -- mc alias set local http://localhost:9000 <root-user> <root-password>
kubectl --context prod -n ai exec minio-0 -- mc rm --recursive --force local/warehouse/
```

#### 4. Delete sandbox deployments and wipe sandbox PVCs

Sandbox PVCs have a finalizer that prevents deletion until the sandbox pod is gone. Delete the deployments first, then the PVCs:

```bash
kubectl --context prod -n sandbox delete deployments --all
kubectl --context prod -n sandbox delete pvc --all
```

The PVC deletion completes once the pods finish terminating (Ceph cleanup can take ~30s). You can proceed to the deploy immediately — it does not depend on PVC termination completing.

#### 5. Run the full deploy

```bash
bin/deploy-all --sandboxes
```

This rebuilds and redeploys all services, including `iceberg-catalog`, `flink-jobmanager`, and `flink-taskmanager` (which were scaled to zero above — `deploy-all` restores them to their manifest replica counts). The `--sandboxes` flag also cleans up any remaining sandbox Services.

#### 6. Re-apply the gateway database schema

The gateway does **not** auto-migrate. After the `iceberg` database is recreated, the schema must be applied manually:

```bash
kubectl --context prod -n ai exec -i postgres-0 -- psql -U postgres -d iceberg < gateway/schema.sql
```

This creates the `user`, `session`, `user_licenses`, and related tables.

#### 7. Recreate all users

```bash
bin/create-all-users prod
```

This registers all alpha test users via the gateway API and assigns their licenses. Users are defined in the script itself (`bin/create-all-users`). To add or modify users, edit that file or run `bin/create-user prod` interactively.
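To confirm the recreated auth tables were populated, a quick hand-rolled check against the tables `gateway/schema.sql` creates (row counts will vary with the user list in `bin/create-all-users`):

```bash
# Count recreated accounts and licenses. "user" is a reserved word in
# PostgreSQL, hence the double quotes around the table name.
kubectl --context prod -n ai exec postgres-0 -- psql -U postgres -d iceberg \
  -c 'SELECT count(*) AS users FROM "user";' \
  -c 'SELECT count(*) AS licenses FROM user_licenses;'
```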
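For repeat use, the whole sequence can be strung together. The sketch below is just steps 1–7 from this guide wrapped in a script — the `kc` helper and the `MINIO_USER`/`MINIO_PASSWORD` variables are conveniences introduced here, not part of the repo's tooling:

```bash
#!/usr/bin/env bash
# Iceberg schema wipe + full redeploy: steps 1-7 from this guide, in order.
# Assumes MINIO_USER and MINIO_PASSWORD are exported, decoded from
# minio-secret as shown in step 3.
set -euo pipefail

kc() { kubectl --context prod "$@"; }

# 1. Stop Iceberg writers.
kc -n ai scale deployment iceberg-catalog flink-jobmanager flink-taskmanager --replicas=0

# 2. Drop and recreate the catalog database (also removes gateway auth tables).
kc -n ai exec postgres-0 -- psql -U postgres -c "DROP DATABASE iceberg;"
kc -n ai exec postgres-0 -- psql -U postgres -c "CREATE DATABASE iceberg;"

# 3. Clear the MinIO warehouse bucket.
kc -n ai exec minio-0 -- mc alias set local http://localhost:9000 "$MINIO_USER" "$MINIO_PASSWORD"
kc -n ai exec minio-0 -- mc rm --recursive --force local/warehouse/

# 4. Remove sandbox deployments and their PVCs. --wait=false returns
#    immediately; the deploy does not depend on PVC termination finishing.
kc -n sandbox delete deployments --all
kc -n sandbox delete pvc --all --wait=false

# 5. Rebuild and redeploy everything (the confirmation prompt accepts piped
#    input; see Common Issues).
echo "yes" | bin/deploy-all --sandboxes

# 6. Re-apply the gateway schema, then 7. recreate users.
kc -n ai exec -i postgres-0 -- psql -U postgres -d iceberg < gateway/schema.sql
bin/create-all-users prod
```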
---

## Verification

```bash
curl -I https://dexorder.ai/api/health
```

Check gateway logs for errors:

```bash
kubectl --context prod -n ai logs deployment/gateway --tail=100
```

---

## Common Issues

### Login fails after Iceberg wipe

**Symptom:** `Sign in failed` (401) or `User creation failed` (postgres error `42P01: undefined table`)

**Cause:** Dropping the `iceberg` database removes the gateway's auth tables along with the Iceberg catalog metadata — they share the same database.

**Fix:** Re-apply the schema and recreate users (steps 6 and 7 above).

### Gateway shows `42P01` errors but pod is running

The gateway does not auto-migrate on startup. The schema file must be applied manually after any database recreation. A gateway restart alone will not fix this.

### Gateway CrashLoopBackOff — `ECONNREFUSED postgresql://localhost/dexorder`

**Symptom:** New gateway pod crashes immediately with `Database connection failed`, and the logs show `databaseUrl: "postgresql://localhost/dexorder"`.

**Cause:** The gateway reads `database.url` from `config.yaml` (via `configData`). Older builds fell back to the localhost default whenever that key was absent — even if `secrets.yaml` had `database.url`. As of `c8fa99c` the code checks `configData.database?.url || secretsData.database?.url || ...`, so both sources work, but both files must be present and correctly mounted.

**What to check:**

1. Does the `gateway-config` ConfigMap have a `database:` section? (It should not — credentials belong in secrets as of the nautilus branch.)
2. Does `gateway-secrets` have `database.url`? Verify: `kubectl --context prod -n ai get secret gateway-secrets -o jsonpath='{.data.secrets\.yaml}' | base64 -d`
3. If the secret is missing the database section, run `bin/secret-update prod` (requires the 1Password desktop app to be unlocked — it must run interactively, not via a pipe).

### `bin/secret-update prod` fails with "authorization prompt dismissed"

1Password's `op inject` requires interactive desktop authentication. Running it via `echo "yes" | bin/secret-update prod` or any background/piped invocation will fail silently (the script prints `✓` even though `kubectl apply` received empty input).

**Fix:** Run `bin/secret-update prod` in an interactive terminal with 1Password unlocked.

### Config validation warnings during `bin/deploy-all`

**Symptom:** Step 3 (config update) prints errors like:

```
error: error validating "deploy/k8s/prod/configs/relay-config.yaml": error validating data: [apiVersion not set, kind not set]
```

for `relay-config`, `ingestor-config`, and `flink-config`.

**Cause:** These config files are raw data files (not Kubernetes manifests), so `kubectl` can't validate their structure. The underlying `kubectl create configmap` command succeeds regardless.

**Impact:** None — the configs are applied correctly and the script reports `✓ All configs updated successfully`. These warnings are expected and can be ignored.

### Flink image build produces many Maven shading warnings

**Symptom:** During Step 4, the Flink image build outputs dozens of `[WARNING] Discovered module-info.class` and overlapping class/resource warnings from Maven.

**Impact:** None — these are pre-existing warnings from bundling Iceberg, AWS SDK, and Flink dependencies into a shaded JAR. The build completes successfully.

### `bin/deploy-all` confirmation prompt

Unlike `bin/secret-update`, the `bin/deploy-all` confirmation prompt (`Are you sure you want to continue? (yes/no)`) works fine with `echo "yes" | bin/deploy-all --sandboxes` from a script or non-interactive context.
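Since the confirmation prompt tolerates piped input, a standard deploy can run unattended. A minimal sketch combining commands already shown in this guide:

```bash
#!/usr/bin/env bash
# Unattended standard deploy (no schema change). bin/secret-update cannot run
# non-interactively (see the 1Password issue above), so secrets must already
# be up to date before this runs.
set -euo pipefail

echo "yes" | bin/deploy-all --sandboxes

# Smoke test: -f makes curl exit non-zero on an HTTP error status,
# then surface recent gateway logs for manual review.
curl -fsSI https://dexorder.ai/api/health
kubectl --context prod -n ai logs deployment/gateway --tail=100
```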