# Production Deployment Guide

This document describes the full process for deploying the AI platform to the production Kubernetes cluster, including the special steps required when the Iceberg schema has changed.

## Overview

The production cluster runs under `kubectl --context prod`, defaulting to the `ai` namespace. The `sandbox` namespace is shared between dev and prod.

Deployment consists of two parts:

1. **Standard deploy**: rebuild and push all images, apply k8s manifests, roll out services
2. **Iceberg schema wipe** *(when schema has changed)*: clear both the Iceberg REST catalog (postgres) and the MinIO data warehouse before deploying

---

## Standard Deployment (no schema changes)

```bash
bin/deploy-all --sandboxes
```

This script (hardcoded to `--context=prod`) performs:

1. Applies base kustomize manifests (`deploy/k8s/prod/`): namespaces, RBAC, policies
2. Applies `deploy/k8s/prod/infrastructure.yaml`: statefulsets, deployments
3. Runs `bin/config-update prod`: updates ConfigMaps
4. Builds and pushes images for all 7 services: `gateway`, `web`, `sandbox`, `lifecycle-sidecar`, `flink`, `relay`, `ingestor`
5. *(with `--sandboxes`)* Deletes sandbox Deployments and Services in the `sandbox` namespace (PVCs are retained; gateway recreates them on next login)
6. Waits for rollouts on all 6 main deployments

> **Secrets are NOT updated by this script.** Run `bin/secret-update prod` separately if secrets have changed.

---

## Full Deploy with Iceberg Schema Wipe

Use this when the Iceberg table schema has changed (e.g. protobuf/column changes in the `trading.ohlc` table).

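Because the steps in this section are destructive, it can help to confirm that the `prod` context is actually configured locally before starting. The following is a minimal sketch, not part of `bin/deploy-all`; the function name `require_context` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not one of the repo's scripts): verify that a kubectl
# context exists in the local kubeconfig before running destructive commands.
require_context() {
  local wanted="$1"
  # `kubectl config get-contexts -o name` prints one context name per line;
  # grep -Fx matches the whole line as a fixed string.
  if kubectl config get-contexts -o name | grep -Fqx -- "$wanted"; then
    echo "context '${wanted}' found"
  else
    echo "context '${wanted}' not found in kubeconfig" >&2
    return 1
  fi
}
```

A typical call would be `require_context prod || exit 1` at the top of a shell session or wrapper script.
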
### Architecture note

The Iceberg REST catalog uses **two storage layers** that must both be cleared:

| Layer | What it stores | How to clear |
|---|---|---|
| PostgreSQL `iceberg` database | Table/namespace metadata (catalog) | Drop and recreate the database |
| MinIO `warehouse` bucket | Parquet data files | `mc rm --recursive --force` |

**Important:** The gateway also uses the `iceberg` postgres database for its own auth tables (`user`, `user_licenses`, `session`, etc.). Wiping the database removes all user accounts. After the wipe, the schema must be re-applied and users recreated.

### Step-by-step

#### 1. Scale down Iceberg consumers

```bash
kubectl --context prod -n ai scale deployment iceberg-catalog flink-jobmanager flink-taskmanager --replicas=0
```

This prevents in-flight writes during the wipe.

#### 2. Wipe the Iceberg PostgreSQL catalog

```bash
kubectl --context prod -n ai exec postgres-0 -- psql -U postgres -c "DROP DATABASE iceberg;"
kubectl --context prod -n ai exec postgres-0 -- psql -U postgres -c "CREATE DATABASE iceberg;"
```

#### 3. Wipe the MinIO warehouse bucket

Get MinIO credentials from the cluster secret and keep them in shell variables:

```bash
# Read the MinIO root credentials from the cluster secret
MINIO_USER=$(kubectl --context prod -n ai get secret minio-secret -o jsonpath='{.data.root-user}' | base64 -d)
MINIO_PASSWORD=$(kubectl --context prod -n ai get secret minio-secret -o jsonpath='{.data.root-password}' | base64 -d)
```

Configure the `mc` client inside the MinIO pod with those credentials and remove all objects:

```bash
# mc alias set takes the alias, the endpoint URL, the access key, and the secret key
kubectl --context prod -n ai exec minio-0 -- mc alias set local http://localhost:9000 "$MINIO_USER" "$MINIO_PASSWORD"
kubectl --context prod -n ai exec minio-0 -- mc rm --recursive --force local/warehouse/
```

#### 4. Run the full deploy

```bash
bin/deploy-all --sandboxes
```

This rebuilds and redeploys all services, including `iceberg-catalog`, `flink-jobmanager`, and `flink-taskmanager`. These were scaled to zero in step 1; `deploy-all` will restore them to their manifest replica counts.

#### 5. Re-apply the gateway database schema

The gateway does **not** auto-migrate.
After the `iceberg` database is recreated, the schema must be applied manually:

```bash
kubectl --context prod -n ai exec -i postgres-0 -- psql -U postgres -d iceberg < gateway/schema.sql
```

This creates the `user`, `session`, `user_licenses`, and related tables.

#### 6. Recreate all users

```bash
bin/create-all-users prod
```

This registers all alpha test users via the gateway API and assigns their licenses.

Users are defined in the script itself (`bin/create-all-users`). To add or modify users, edit that file or run `bin/create-user prod` interactively.

---

## Verification

```bash
curl -I https://dexorder.ai/api/health
```

Check gateway logs for errors:

```bash
kubectl --context prod -n ai logs deployment/gateway --tail=100
```

---

## Common Issues

### Login fails after Iceberg wipe

**Symptom:** `Sign in failed` (401) or `User creation failed` (postgres error `42P01: undefined table`)

**Cause:** Dropping the `iceberg` database removes the gateway's auth tables along with the Iceberg catalog metadata, because they share the same database.

**Fix:** Re-apply the schema and recreate users (steps 5 and 6 above).

### Gateway shows `42P01` errors but pod is running

The gateway does not auto-migrate on startup. The schema file must be applied manually after any database recreation. A gateway restart alone will not fix this.
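
One way to tell whether the schema has been re-applied is to count the expected gateway tables directly in the `iceberg` database. The sketch below is illustrative only: it assumes the tables are created in the `public` schema (this guide does not say which schema `gateway/schema.sql` uses), and the function names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of the repo's scripts).

# Print how many of the three gateway tables named in this guide exist.
# Assumption: the tables live in the `public` schema.
count_gateway_tables() {
  kubectl --context prod -n ai exec postgres-0 -- \
    psql -U postgres -d iceberg -tAc \
    "SELECT count(*) FROM information_schema.tables
     WHERE table_schema = 'public'
       AND table_name IN ('user', 'session', 'user_licenses');"
}

# Return 0 when all three tables are present, 1 otherwise.
gateway_schema_applied() {
  local found
  found="$(count_gateway_tables)" || return 1
  if [ "$found" = "3" ]; then
    echo "gateway schema present"
  else
    echo "gateway schema missing (${found}/3 tables found); re-apply gateway/schema.sql"
    return 1
  fi
}
```

Running `gateway_schema_applied` after step 5 should report the schema as present. If it reports missing tables, a `42P01` error from the gateway is the expected symptom.
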