ai/doc/prod_deployment.md

Production Deployment Guide

This document describes the full process for deploying the AI platform to the production Kubernetes cluster, including the special steps required when the Iceberg schema has changed.

Overview

The production cluster is reached with kubectl --context prod, and that context defaults to the ai namespace. The sandbox namespace is shared between dev and prod.
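
Every command in this guide targets the prod context, so it can help to confirm that the context exists in the local kubeconfig before starting. The following is a minimal sketch for a POSIX shell; the helper name have_context is hypothetical and not part of the repository.

```shell
# Hypothetical helper (not part of the repo): succeed only if the named
# context is present in the local kubeconfig.
have_context() {
  kubectl config get-contexts -o name | grep -qx -- "$1"
}
```

For example, `have_context prod || echo "no prod context configured" >&2` prints a warning when the context is missing.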

Deployment consists of two parts:

  1. Standard deploy — rebuild and push all images, apply k8s manifests, roll out services
  2. Iceberg schema wipe (when schema has changed) — clear both the Iceberg REST catalog (postgres) and the MinIO data warehouse before deploying

Standard Deployment (no schema changes)

bin/deploy-all --sandboxes

This script (hardcoded to --context=prod) performs the following steps:

  1. Applies base kustomize manifests (deploy/k8s/prod/) — namespaces, RBAC, policies
  2. Applies deploy/k8s/prod/infrastructure.yaml — statefulsets, deployments
  3. Runs bin/config-update prod — updates ConfigMaps
  4. Builds and pushes images for all 7 services: gateway, web, sandbox, lifecycle-sidecar, flink, relay, ingestor
  5. (with --sandboxes) Deletes sandbox Deployments and Services in the sandbox namespace (PVCs are retained; the gateway recreates the Deployments and Services on the next login)
  6. Waits for rollouts on all 6 main deployments

Secrets are NOT updated by this script. Run bin/secret-update prod separately if secrets have changed.


Full Deploy with Iceberg Schema Wipe

Use this when the Iceberg table schema has changed (e.g. protobuf/column changes in the trading.ohlc table).

Architecture note

The Iceberg REST catalog uses two storage layers that must both be cleared:

| Layer | What it stores | How to clear |
| --- | --- | --- |
| PostgreSQL iceberg database | Table/namespace metadata (catalog) | Drop and recreate the database |
| MinIO warehouse bucket | Parquet data files | mc rm --recursive --force |

Important: The gateway also uses the iceberg postgres database for its own auth tables (user, user_licenses, session, etc.). Wiping the database removes all user accounts. After the wipe, the schema must be re-applied and users recreated.

Step-by-step

1. Scale down Iceberg consumers

kubectl --context prod -n ai scale deployment iceberg-catalog flink-jobmanager flink-taskmanager --replicas=0

This prevents in-flight writes during the wipe.
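
The scale command returns before the pods have actually stopped, so one option is to check the reported replica counts before wiping anything. This is a sketch under that assumption; both helper names are hypothetical.

```shell
# Hypothetical helpers (not part of the repo).

# Print the replica count currently reported for a deployment.
# Kubernetes may omit .status.replicas when it is zero, so empty output
# is treated as 0.
current_replicas() {
  n=$(kubectl --context prod -n ai get deployment "$1" \
        -o jsonpath='{.status.replicas}')
  echo "${n:-0}"
}

# Succeed only if every named deployment reports zero replicas.
all_scaled_down() {
  for d in "$@"; do
    [ "$(current_replicas "$d")" -eq 0 ] || return 1
  done
  return 0
}
```

A simple polling loop would be `until all_scaled_down iceberg-catalog flink-jobmanager flink-taskmanager; do sleep 2; done`. The reported count is a reasonable signal that the writers have stopped, though not an airtight one.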

2. Wipe the Iceberg PostgreSQL catalog

kubectl --context prod -n ai exec postgres-0 -- psql -U postgres -c "DROP DATABASE iceberg;"
kubectl --context prod -n ai exec postgres-0 -- psql -U postgres -c "CREATE DATABASE iceberg;"
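
PostgreSQL refuses to drop a database while any session is connected to it, reporting that the database "is being accessed by other users". The gateway also uses this database and is not scaled down in step 1, so the plain DROP may fail. One possible workaround, assuming the cluster runs PostgreSQL 13 or newer (this guide does not state the version), is the WITH (FORCE) option, which terminates existing connections first. The function name below is hypothetical.

```shell
# Hypothetical helper (not part of the repo).
# Drop and recreate the iceberg database, terminating open sessions first.
# WITH (FORCE) requires PostgreSQL 13+; the server version is an assumption.
# ON_ERROR_STOP makes psql stop before CREATE if the DROP fails.
recreate_iceberg_db() {
  kubectl --context prod -n ai exec postgres-0 -- \
    psql -U postgres -v ON_ERROR_STOP=1 \
      -c "DROP DATABASE IF EXISTS iceberg WITH (FORCE);" \
      -c "CREATE DATABASE iceberg;"
}
```

On older PostgreSQL versions, scaling the gateway down before the drop would be the alternative.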

3. Wipe the MinIO warehouse bucket

Get MinIO credentials from the cluster secret:

kubectl --context prod -n ai get secret minio-secret -o jsonpath='{.data.root-user}' | base64 -d
kubectl --context prod -n ai get secret minio-secret -o jsonpath='{.data.root-password}' | base64 -d

Configure the mc client inside the MinIO pod and remove all objects:

kubectl --context prod -n ai exec minio-0 -- mc alias set local http://localhost:9000 <user> <password>
kubectl --context prod -n ai exec minio-0 -- mc rm --recursive --force local/warehouse/
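
To avoid copying the credentials by hand, the same commands can be chained through shell variables. This is a sketch only; minio_cred and wipe_warehouse are hypothetical names, and the secret keys are the ones shown above.

```shell
# Hypothetical helper: print one decoded field of the minio-secret Secret.
# $1 is the key name, e.g. root-user or root-password.
minio_cred() {
  kubectl --context prod -n ai get secret minio-secret \
    -o "jsonpath={.data.$1}" | base64 -d
}

# Hypothetical wrapper around the two mc commands from this step.
wipe_warehouse() {
  user=$(minio_cred root-user) || return 1
  pass=$(minio_cred root-password) || return 1
  kubectl --context prod -n ai exec minio-0 -- \
    mc alias set local http://localhost:9000 "$user" "$pass" &&
  kubectl --context prod -n ai exec minio-0 -- \
    mc rm --recursive --force local/warehouse/
}
```

Calling `wipe_warehouse` then performs the alias setup and the recursive delete in one go, stopping if the alias command fails.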

4. Run the full deploy

bin/deploy-all --sandboxes

This rebuilds and redeploys all services. The iceberg-catalog, flink-jobmanager, and flink-taskmanager deployments that were scaled to zero in step 1 are restored to their manifest replica counts by deploy-all.
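
To confirm that the three deployments scaled down in step 1 are running again, their rollouts can be checked explicitly. This is an optional sketch; the function name and the 300-second timeout are arbitrary choices, not values taken from deploy-all.

```shell
# Hypothetical helper (not part of the repo): wait for the three
# deployments scaled down in step 1 to finish rolling out.
# Returns non-zero as soon as one rollout fails or times out.
wait_iceberg_rollouts() {
  for d in iceberg-catalog flink-jobmanager flink-taskmanager; do
    kubectl --context prod -n ai rollout status "deployment/$d" \
      --timeout=300s || return 1
  done
  return 0
}
```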

5. Re-apply the gateway database schema

The gateway does not auto-migrate. After the iceberg database is recreated, the schema must be applied manually:

kubectl --context prod -n ai exec -i postgres-0 -- psql -U postgres -d iceberg < gateway/schema.sql

This creates the user, session, user_licenses, and related tables.
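
A quick way to confirm that the schema was applied is to count the auth tables named in this guide. The query below is a sketch: it checks only those three table names, in any schema, and the helper name is hypothetical.

```shell
# Hypothetical helper: succeed only if the three auth tables named in this
# guide (user, session, user_licenses) exist in the iceberg database.
auth_tables_present() {
  count=$(kubectl --context prod -n ai exec postgres-0 -- \
    psql -U postgres -d iceberg -tA -c \
    "SELECT count(DISTINCT table_name) FROM information_schema.tables
     WHERE table_name IN ('user', 'session', 'user_licenses');")
  [ "$count" = "3" ]
}
```

For example, `auth_tables_present || echo "gateway schema missing" >&2` flags a database that still needs gateway/schema.sql.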

6. Recreate all users

bin/create-all-users prod

This registers all alpha test users via the gateway API and assigns their licenses. Users are defined in the script itself (bin/create-all-users).

To add or modify users, edit that file or run bin/create-user prod interactively.


Verification

curl -I https://dexorder.ai/api/health

Check gateway logs for errors:

kubectl --context prod -n ai logs deployment/gateway --tail=100
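
For scripted checks, the health request can be reduced to a pass/fail result. This sketch assumes that a healthy gateway answers with HTTP 200, which this guide does not state explicitly; the helper name is hypothetical.

```shell
# Hypothetical helper: succeed only if the health endpoint answers 200.
# Assumption: a healthy gateway returns HTTP 200 for this request.
health_ok() {
  code=$(curl -s -I -o /dev/null -w '%{http_code}' \
    https://dexorder.ai/api/health)
  [ "$code" = "200" ]
}
```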

Common Issues

Login fails after Iceberg wipe

Symptom: Sign in failed (401) or User creation failed (postgres error 42P01: undefined table)

Cause: Dropping the iceberg database removes the gateway's auth tables along with the Iceberg catalog metadata — they share the same database.

Fix: Re-apply the schema and recreate users (steps 5 and 6 above).

Gateway shows 42P01 errors but pod is running

The gateway does not auto-migrate on startup. The schema file must be applied manually after any database recreation. A gateway restart alone will not fix this.
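
To check for this condition without reading the logs by hand, the recent gateway output can be searched for the error code. This sketch assumes that the gateway writes the SQLSTATE code 42P01 into its log lines, as the symptom text above suggests; the helper name is hypothetical.

```shell
# Hypothetical helper: succeed if the last 100 gateway log lines mention
# SQLSTATE 42P01 (PostgreSQL "undefined_table").
gateway_has_42p01() {
  kubectl --context prod -n ai logs deployment/gateway --tail=100 |
    grep -q 42P01
}
```

If `gateway_has_42p01` succeeds, re-applying gateway/schema.sql as in step 5 is the likely fix.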