Production Deployment Guide

This document describes the full process for deploying the AI platform to the production Kubernetes cluster, including the special steps required when the Iceberg schema has changed.

Overview

The production cluster runs under kubectl --context prod, defaulting to the ai namespace. The sandbox namespace is shared between dev and prod.

Deployment consists of two parts:

  1. Standard deploy — rebuild and push all images, apply k8s manifests, roll out services
  2. Iceberg schema wipe (when schema has changed) — clear both the Iceberg REST catalog (postgres) and the MinIO data warehouse before deploying

Standard Deployment (no schema changes)

bin/deploy-all --sandboxes

This script (hardcoded to --context=prod) performs:

  1. Applies base kustomize manifests (deploy/k8s/prod/) — namespaces, RBAC, policies
  2. Applies deploy/k8s/prod/infrastructure.yaml — statefulsets, deployments
  3. Runs bin/config-update prod — updates ConfigMaps
  4. Builds and pushes images for all 7 services: gateway, web, sandbox, lifecycle-sidecar, flink, relay, ingestor
  5. (with --sandboxes) Deletes sandbox Deployments and Services in the sandbox namespace (PVCs are retained; the gateway recreates the sandbox Deployments and Services on each user's next login)
  6. Waits for rollouts on all 6 main deployments

Secrets are NOT updated by this script. Run bin/secret-update prod separately if secrets have changed.
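
To re-check a rollout independently of the script's own wait, the usual kubectl status command works (gateway shown here; substitute any of the deployments listed above):

kubectl --context prod -n ai rollout status deployment/gateway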

Post-deploy: refresh user licenses

After any deploy that changes license tier templates (gateway/src/types/user.ts), run:

bin/create-all-users prod

This upserts all alpha users and re-applies the current tier template to their user_licenses row. Safe to run on an existing database — it will not delete users or lose data. New sandbox deployments will pick up the updated resource limits on next login.
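
A quick way to spot-check that the upsert ran is to count rows in the gateway's license table (this only confirms rows exist; it does not verify the tier values themselves):

kubectl --context prod -n ai exec postgres-0 -- psql -U postgres -d iceberg -c "SELECT count(*) FROM user_licenses;"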


Full Deploy with Iceberg Schema Wipe

Use this when the Iceberg table schema has changed (e.g. protobuf/column changes in the trading.ohlc table).

Architecture note

The Iceberg REST catalog uses two storage layers that must both be cleared:

| Layer | What it stores | How to clear |
| --- | --- | --- |
| PostgreSQL iceberg database | Table/namespace metadata (catalog) | Drop and recreate the database |
| MinIO warehouse bucket | Parquet data files | mc rm --recursive --force |

Important: The gateway also uses the iceberg postgres database for its own auth tables (user, user_licenses, session, etc.). Wiping the database removes all user accounts. After the wipe, the schema must be re-applied and users recreated.
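
One way to see this sharing directly is to list the tables in the iceberg database: the gateway's auth tables appear alongside the catalog's metadata tables (exact Iceberg table names depend on the REST catalog implementation):

kubectl --context prod -n ai exec postgres-0 -- psql -U postgres -d iceberg -c "\dt"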

Step-by-step

1. Scale down Iceberg consumers

kubectl --context prod -n ai scale deployment iceberg-catalog flink-jobmanager flink-taskmanager --replicas=0

This prevents in-flight writes during the wipe.
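
Before wiping, confirm all three report 0 ready replicas:

kubectl --context prod -n ai get deployment iceberg-catalog flink-jobmanager flink-taskmanager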

2. Wipe the Iceberg PostgreSQL catalog

kubectl --context prod -n ai exec postgres-0 -- psql -U postgres -c "DROP DATABASE iceberg;"
kubectl --context prod -n ai exec postgres-0 -- psql -U postgres -c "CREATE DATABASE iceberg;"

3. Wipe the MinIO warehouse bucket

Get MinIO credentials from the cluster secret:

kubectl --context prod -n ai get secret minio-secret -o jsonpath='{.data.root-user}' | base64 -d
kubectl --context prod -n ai get secret minio-secret -o jsonpath='{.data.root-password}' | base64 -d

Configure the mc client inside the MinIO pod and remove all objects:

kubectl --context prod -n ai exec minio-0 -- mc alias set local http://localhost:9000 <user> <password>
kubectl --context prod -n ai exec minio-0 -- mc rm --recursive --force local/warehouse/
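
To confirm the warehouse is empty afterwards (the listing should print nothing):

kubectl --context prod -n ai exec minio-0 -- mc ls --recursive local/warehouse/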

4. Delete sandbox deployments and wipe sandbox PVCs

Sandbox PVCs have a finalizer that prevents deletion until the sandbox pod is gone. Delete the deployments first, then the PVCs:

kubectl --context prod -n sandbox delete deployments --all
kubectl --context prod -n sandbox delete pvc --all

The PVC deletion will complete once the pods finish terminating (Ceph cleanup can take ~30s). You can proceed to the deploy immediately — it does not depend on PVC termination completing.
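
If you do want to watch the PVC cleanup finish:

kubectl --context prod -n sandbox get pvc -w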

5. Run the full deploy

bin/deploy-all --sandboxes

This rebuilds and redeploys all services, including iceberg-catalog, flink-jobmanager, and flink-taskmanager (which were scaled to zero above — deploy-all will restore them to their manifest replica counts). The --sandboxes flag also cleans up any remaining sandbox Services.
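
To confirm everything came back up, including the consumers that were scaled to zero in step 1:

kubectl --context prod -n ai get deployments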

6. Re-apply the gateway database schema

The gateway does not auto-migrate. After the iceberg database is recreated, the schema must be applied manually:

kubectl --context prod -n ai exec -i postgres-0 -- psql -U postgres -d iceberg < gateway/schema.sql

This creates the user, session, user_licenses, and related tables.

7. Recreate all users

bin/create-all-users prod

This registers all alpha test users via the gateway API and assigns their licenses. Users are defined in the script itself (bin/create-all-users).

To add or modify users, edit that file or run bin/create-user prod interactively.


Verification

curl -I https://dexorder.ai/api/health

Check gateway logs for errors:

kubectl --context prod -n ai logs deployment/gateway --tail=100
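
A quick overall check that every pod is Running and ready:

kubectl --context prod -n ai get pods
kubectl --context prod -n sandbox get pods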

Common Issues

Login fails after Iceberg wipe

Symptom: Sign in failed (401) or User creation failed (postgres error 42P01: undefined table)

Cause: Dropping the iceberg database removes the gateway's auth tables along with the Iceberg catalog metadata — they share the same database.

Fix: Re-apply the schema and recreate users (steps 6 and 7 above).

Gateway shows 42P01 errors but pod is running

The gateway does not auto-migrate on startup. The schema file must be applied manually after any database recreation. A gateway restart alone will not fix this.

Gateway CrashLoopBackOff — ECONNREFUSED postgresql://localhost/dexorder

Symptom: New gateway pod crashes immediately with Database connection failed and logs show databaseUrl: "postgresql://localhost/dexorder".

Cause: The gateway resolves its database URL as configData.database?.url || secretsData.database?.url || ... (as of c8fa99c), so either config.yaml or secrets.yaml can supply database.url, but both files must be present and correctly mounted. If neither mounted source provides the key (for example, the in-cluster secret is missing its database section even though the source secrets.yaml defines one), the gateway falls back to the postgresql://localhost/dexorder default and crashes.

What to check:

  1. Does the gateway-config ConfigMap have a database: section? (It should not — credentials belong in secrets as of the nautilus branch.)
  2. Does gateway-secrets have database.url? Verify: kubectl --context prod -n ai get secret gateway-secrets -o jsonpath='{.data.secrets\.yaml}' | base64 -d
  3. If the secret is missing the database section, run bin/secret-update prod (requires 1Password desktop to be unlocked — must run interactively, not via pipe).
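
For check 1, the ConfigMap contents can be inspected directly (this assumes the ConfigMap stores its contents under a config.yaml data key, mirroring the secret's secrets.yaml key):

kubectl --context prod -n ai get configmap gateway-config -o jsonpath='{.data.config\.yaml}' | grep -A2 'database:'

An empty result here is the expected (correct) state.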

bin/secret-update prod fails with "authorization prompt dismissed"

1Password's op inject requires interactive desktop authentication. Running it via echo "yes" | bin/secret-update prod or any background/piped invocation will fail silently: the script still prints its success message even though kubectl apply received empty input.

Fix: Run bin/secret-update prod in an interactive terminal with 1Password unlocked.

Config validation warnings during bin/deploy-all

Symptom: Step 3 (config update) prints errors like:

error: error validating "deploy/k8s/prod/configs/relay-config.yaml": error validating data: [apiVersion not set, kind not set]

for relay-config, ingestor-config, and flink-config.

Cause: These config files are raw data files (not Kubernetes manifests), so kubectl can't validate their structure. The underlying kubectl create configmap command succeeds regardless.

Impact: None — the configs are applied correctly and the script reports ✓ All configs updated successfully. These warnings are expected and can be ignored.

Maven warnings during the Flink image build

Symptom: During Step 4, the Flink image build outputs dozens of [WARNING] Discovered module-info.class and overlapping class/resource warnings from Maven.

Impact: None — these are pre-existing warnings from bundling Iceberg, AWS SDK, and Flink dependencies together into a shaded JAR. The build completes successfully.

bin/deploy-all confirmation prompt

Unlike bin/secret-update, the bin/deploy-all confirmation prompt (Are you sure you want to continue? (yes/no)) works fine with echo "yes" | bin/deploy-all --sandboxes from a script or non-interactive context.