Production Deployment Guide
This document describes the full process for deploying the AI platform to the production Kubernetes cluster, including the special steps required when the Iceberg schema has changed.
Overview
The production cluster is accessed with kubectl --context prod, which defaults to the ai namespace. The sandbox namespace is shared between dev and prod.
Deployment consists of two parts:
- Standard deploy — rebuild and push all images, apply k8s manifests, roll out services
- Iceberg schema wipe (when schema has changed) — clear both the Iceberg REST catalog (postgres) and the MinIO data warehouse before deploying
Standard Deployment (no schema changes)
bin/deploy-all --sandboxes
This script (hardcoded to --context=prod) performs:
- Applies base kustomize manifests (deploy/k8s/prod/): namespaces, RBAC, policies
- Applies deploy/k8s/prod/infrastructure.yaml: statefulsets, deployments
- Runs bin/config-update prod: updates ConfigMaps
- Builds and pushes images for all 7 services: gateway, web, sandbox, lifecycle-sidecar, flink, relay, ingestor
- (with --sandboxes) Deletes sandbox Deployments and Services in the sandbox namespace (PVCs are retained; gateway recreates them on next login)
- Waits for rollouts on all 6 main deployments
Secrets are NOT updated by this script. Run bin/secret-update prod separately if secrets have changed.
Full Deploy with Iceberg Schema Wipe
Use this when the Iceberg table schema has changed (e.g. protobuf/column changes in the trading.ohlc table).
Architecture note
The Iceberg REST catalog uses two storage layers that must both be cleared:
| Layer | What it stores | How to clear |
|---|---|---|
| PostgreSQL iceberg database | Table/namespace metadata (catalog) | Drop and recreate the database |
| MinIO warehouse bucket | Parquet data files | mc rm --recursive --force |
Important: The gateway also uses the iceberg postgres database for its own auth tables (user, user_licenses, session, etc.). Wiping the database removes all user accounts. After the wipe, the schema must be re-applied and users recreated.
Step-by-step
1. Scale down Iceberg consumers
kubectl --context prod -n ai scale deployment iceberg-catalog flink-jobmanager flink-taskmanager --replicas=0
This prevents in-flight writes during the wipe.
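Before moving on, it may help to confirm that the scale-down has actually taken effect. The following is a minimal sketch, not part of the existing tooling: replicas_of and all_scaled_down are hypothetical helper names, and the check relies on the Kubernetes API omitting .status.replicas once a deployment reports zero replicas. Pods can still be terminating for a short time after the count reaches zero.

```shell
# Hypothetical helpers (not part of bin/): check that step 1 took effect.

# Print the replica count a deployment currently reports.
# .status.replicas is omitted by the API when it is zero, so an empty
# result is treated as 0.
replicas_of() {
  local n
  n="$(kubectl --context prod -n ai get deployment "$1" \
    -o jsonpath='{.status.replicas}')" || return 1
  echo "${n:-0}"
}

# Succeed only when all three deployments from step 1 report zero replicas.
all_scaled_down() {
  local d
  for d in iceberg-catalog flink-jobmanager flink-taskmanager; do
    if [ "$(replicas_of "$d")" != "0" ]; then
      echo "$d still reports replicas" >&2
      return 1
    fi
  done
  echo "all scaled down"
}
```

Calling all_scaled_down after the scale command gives a quick yes/no answer before the wipe starts.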
2. Wipe the Iceberg PostgreSQL catalog
kubectl --context prod -n ai exec postgres-0 -- psql -U postgres -c "DROP DATABASE iceberg;"
kubectl --context prod -n ai exec postgres-0 -- psql -U postgres -c "CREATE DATABASE iceberg;"
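PostgreSQL refuses to drop a database while other sessions are connected to it. Because the gateway uses the same iceberg database and is not scaled down in step 1, the DROP may fail with an "is being accessed by other users" error. The sketch below counts open sessions first; active_iceberg_sessions and safe_to_drop are hypothetical helper names, and the query runs against the default postgres database so it does not count itself.

```shell
# Hypothetical helpers (not part of bin/): pre-check for DROP DATABASE.

# Print the number of sessions currently connected to the iceberg database.
# -A (unaligned) and -t (tuples only) make psql print just the number.
active_iceberg_sessions() {
  kubectl --context prod -n ai exec postgres-0 -- \
    psql -U postgres -At -c \
    "SELECT count(*) FROM pg_stat_activity WHERE datname = 'iceberg';"
}

# Succeed only when nothing is connected to the iceberg database.
safe_to_drop() {
  local n
  n="$(active_iceberg_sessions)" || return 1
  if [ "$n" = "0" ]; then
    echo "no active sessions"
  else
    echo "$n active session(s) on iceberg" >&2
    return 1
  fi
}
```

If safe_to_drop reports open sessions, one option is to also scale the gateway deployment to zero before dropping, on the assumption that deploy-all restores its replica count as it does for the other deployments. On PostgreSQL 13 or later, DROP DATABASE iceberg WITH (FORCE) is another possibility; this guide does not state the server version.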
3. Wipe the MinIO warehouse bucket
Get MinIO credentials from the cluster secret:
kubectl --context prod -n ai get secret minio-secret -o jsonpath='{.data.root-user}' | base64 -d
kubectl --context prod -n ai get secret minio-secret -o jsonpath='{.data.root-password}' | base64 -d
Configure the mc client inside the MinIO pod and remove all objects:
kubectl --context prod -n ai exec minio-0 -- mc alias set local http://localhost:9000 <user> <password>
kubectl --context prod -n ai exec minio-0 -- mc rm --recursive --force local/warehouse/
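The credential lookup and the two mc commands above can be combined so the values never have to be copied by hand. This is a sketch under the same assumptions as the commands in this step (secret minio-secret, keys root-user and root-password, pod minio-0); minio_secret_field and wipe_warehouse are hypothetical helper names.

```shell
# Hypothetical helpers (not part of bin/): step 3 as one function.

# Print one decoded field of the minio-secret secret.
minio_secret_field() {
  kubectl --context prod -n ai get secret minio-secret \
    -o jsonpath="{.data.$1}" | base64 -d
}

# Configure the mc alias inside the MinIO pod, then remove every object
# under the warehouse bucket. Stops at the first failing command.
wipe_warehouse() {
  local user pass
  user="$(minio_secret_field root-user)" || return 1
  pass="$(minio_secret_field root-password)" || return 1
  kubectl --context prod -n ai exec minio-0 -- \
    mc alias set local http://localhost:9000 "$user" "$pass" || return 1
  kubectl --context prod -n ai exec minio-0 -- \
    mc rm --recursive --force local/warehouse/
}
```

After wipe_warehouse returns, running mc ls --recursive local/warehouse/ inside the pod should list nothing if the wipe succeeded.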
4. Run the full deploy
bin/deploy-all --sandboxes
This rebuilds and redeploys all services, including iceberg-catalog, flink-jobmanager, and flink-taskmanager (which were scaled to zero above — deploy-all will restore them to their manifest replica counts).
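To confirm that the three deployments scaled down in step 1 are back at a nonzero replica count, their desired replicas can be printed after the deploy finishes. This is an illustrative sketch; iceberg_replica_report is a hypothetical helper name.

```shell
# Hypothetical helper (not part of bin/): print "<deployment> <desired replicas>"
# for each deployment that step 1 scaled to zero.
iceberg_replica_report() {
  local d spec
  for d in iceberg-catalog flink-jobmanager flink-taskmanager; do
    spec="$(kubectl --context prod -n ai get deployment "$d" \
      -o jsonpath='{.spec.replicas}')" || return 1
    echo "$d ${spec:-0}"
  done
}
```

Any line ending in 0 means that deployment was not restored and needs to be scaled up by hand.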
5. Re-apply the gateway database schema
The gateway does not auto-migrate. After the iceberg database is recreated, the schema must be applied manually:
kubectl --context prod -n ai exec -i postgres-0 -- psql -U postgres -d iceberg < gateway/schema.sql
This creates the user, session, user_licenses, and related tables.
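A quick way to check that the schema was applied is to look for the tables named above. The sketch below assumes they live in the public schema, which this guide does not state; missing_gateway_tables is a hypothetical helper name, and it checks only the three tables listed here, not the "related tables".

```shell
# Hypothetical helper (not part of bin/): print the gateway tables that are
# still missing from the iceberg database (empty output means none missing).
# Assumption: the tables are created in the "public" schema.
missing_gateway_tables() {
  local existing t missing=""
  existing="$(kubectl --context prod -n ai exec postgres-0 -- \
    psql -U postgres -d iceberg -At -c \
    "SELECT tablename FROM pg_tables WHERE schemaname = 'public';")" || return 1
  for t in user session user_licenses; do
    # -x matches whole lines, so "user" does not match "user_licenses".
    if ! printf '%s\n' "$existing" | grep -qx "$t"; then
      missing="$missing $t"
    fi
  done
  echo "${missing# }"
}
```

An empty result suggests step 5 worked; a non-empty one lists what to investigate before recreating users.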
6. Recreate all users
bin/create-all-users prod
This registers all alpha test users via the gateway API and assigns their licenses. Users are defined in the script itself (bin/create-all-users).
To add or modify users, edit that file or run bin/create-user prod interactively.
Verification
curl -I https://dexorder.ai/api/health
Check gateway logs for errors:
kubectl --context prod -n ai logs deployment/gateway --tail=100
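Right after a deploy the health endpoint may take a moment to come up, so a single curl can give a false negative. The following sketch polls it a few times. It assumes the endpoint answers HTTP 200 when healthy, which this guide does not state; wait_healthy and its retry defaults are hypothetical.

```shell
# Hypothetical helper (not part of bin/): poll the health endpoint with HEAD
# requests until it returns 200 or the attempts run out.
# Arguments: URL, number of attempts, delay between attempts in seconds.
wait_healthy() {
  local url="${1:-https://dexorder.ai/api/health}" attempts="${2:-10}" delay="${3:-5}"
  local i=1 code=""
  while [ "$i" -le "$attempts" ]; do
    # -w prints only the status code; headers are discarded via -o.
    code="$(curl -s -I -o /dev/null -w '%{http_code}' "$url")" || code="000"
    if [ "$code" = "200" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not healthy after $attempts attempts (last status $code)" >&2
  return 1
}
```

Called with no arguments, wait_healthy checks the production URL above up to 10 times, 5 seconds apart.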
Common Issues
Login fails after Iceberg wipe
Symptom: Sign in failed (401) or User creation failed (postgres error 42P01: undefined table)
Cause: Dropping the iceberg database removes the gateway's auth tables along with the Iceberg catalog metadata — they share the same database.
Fix: Re-apply the schema and recreate users (steps 5 and 6 above).
Gateway shows 42P01 errors but pod is running
The gateway does not auto-migrate on startup. The schema file must be applied manually after any database recreation. A gateway restart alone will not fix this.
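To tell this case apart from other gateway failures, the recent logs can be scanned for the 42P01 code. This is a minimal sketch that assumes the gateway writes the PostgreSQL error code into its logs, as the symptom above suggests; count_undefined_table_errors is a hypothetical helper name.

```shell
# Hypothetical helper (not part of bin/): count recent gateway log lines that
# mention PostgreSQL error 42P01 (undefined table).
# Argument: number of log lines to inspect (default 100).
count_undefined_table_errors() {
  local logs
  logs="$(kubectl --context prod -n ai logs deployment/gateway \
    --tail="${1:-100}")" || return 1
  # grep -c exits non-zero when the count is 0; "|| true" keeps that a success.
  printf '%s\n' "$logs" | grep -c '42P01' || true
}
```

A nonzero count after a wipe points to the missing schema (step 5) rather than a crash or configuration problem.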