# Production Deployment Guide
This document describes the full process for deploying the AI platform to the production Kubernetes cluster, including the special steps required when the Iceberg schema has changed.
## Overview
The production cluster is accessed with `kubectl --context prod`, which defaults to the `ai` namespace. The `sandbox` namespace is shared between dev and prod.
Deployment consists of two parts:
1. **Standard deploy** — rebuild and push all images, apply k8s manifests, roll out services
2. **Iceberg schema wipe** *(when schema has changed)* — clear both the Iceberg REST catalog (postgres) and the MinIO data warehouse before deploying
---
## Standard Deployment (no schema changes)
```bash
bin/deploy-all --sandboxes
```
This script (hardcoded to `--context=prod`) performs the following steps:
1. Applies base kustomize manifests (`deploy/k8s/prod/`) — namespaces, RBAC, policies
2. Applies `deploy/k8s/prod/infrastructure.yaml` — statefulsets, deployments
3. Runs `bin/config-update prod` — updates ConfigMaps
4. Builds and pushes images for all 7 services: `gateway`, `web`, `sandbox`, `lifecycle-sidecar`, `flink`, `relay`, `ingestor`
5. *(with `--sandboxes`)* Deletes sandbox Deployments and Services in the `sandbox` namespace (PVCs are retained; the gateway recreates the Deployments and Services on next login)
6. Waits for rollouts on all 6 main deployments
> **Secrets are NOT updated by this script.** Run `bin/secret-update prod` separately if secrets have changed.
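
Step 6 of the list above can also be repeated by hand after the script finishes, for example when a rollout was interrupted. The following is a minimal sketch and not part of `bin/deploy-all`: `wait_rollouts` is a hypothetical helper, the 300-second timeout is an arbitrary choice, and the deployment names are passed as arguments because this guide does not list all six. Calling it as `wait_rollouts gateway` checks the gateway deployment only.

```bash
# Hypothetical helper: wait for each named deployment in the `ai` namespace
# to finish rolling out. Names are passed as arguments.
wait_rollouts() {
  failed=0
  for name in "$@"; do
    # `kubectl rollout status` blocks until the rollout completes or the timeout expires.
    if ! kubectl --context prod -n ai rollout status "deployment/$name" --timeout=300s; then
      echo "rollout not complete: $name" >&2
      failed=1
    fi
  done
  # Non-zero when at least one deployment did not finish in time.
  return "$failed"
}
```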
---
## Full Deploy with Iceberg Schema Wipe
Use this when the Iceberg table schema has changed (e.g. protobuf/column changes in the `trading.ohlc` table).
### Architecture note
The Iceberg REST catalog uses **two storage layers** that must both be cleared:
| Layer | What it stores | How to clear |
|---|---|---|
| PostgreSQL `iceberg` database | Table/namespace metadata (catalog) | Drop and recreate the database |
| MinIO `warehouse` bucket | Parquet data files | `mc rm --recursive --force` |
**Important:** The gateway also uses the `iceberg` postgres database for its own auth tables (`user`, `user_licenses`, `session`, etc.). Wiping the database removes all user accounts. After the wipe, the schema must be re-applied and users recreated.
### Step-by-step
#### 1. Scale down Iceberg consumers
```bash
kubectl --context prod -n ai scale deployment iceberg-catalog flink-jobmanager flink-taskmanager --replicas=0
```
This prevents in-flight writes during the wipe.
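
Before moving on to the wipe, it may be worth confirming that the scale-down has taken effect. The sketch below is one way to check, assuming the three deployment names from the command above; `live_replicas` and `confirm_scaled_down` are hypothetical helper names. It reads `.status.replicas`, which Kubernetes omits when the count is zero, so an empty result is treated as `0`.

```bash
# Print the replica count the deployment currently reports; an absent field means 0.
live_replicas() {
  count=$(kubectl --context prod -n ai get deployment "$1" -o jsonpath='{.status.replicas}')
  echo "${count:-0}"
}

# Return non-zero if any of the three scaled-down deployments still reports replicas.
# Note: this reflects the controller's count; pods that are still terminating
# may not be included, so a short extra wait can be prudent.
confirm_scaled_down() {
  for name in iceberg-catalog flink-jobmanager flink-taskmanager; do
    n=$(live_replicas "$name")
    if [ "$n" != "0" ]; then
      echo "$name still has $n replica(s)" >&2
      return 1
    fi
  done
}
```

Running `confirm_scaled_down && echo "safe to wipe"` prints the message only when all three deployments report zero replicas.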
#### 2. Wipe the Iceberg PostgreSQL catalog
```bash
kubectl --context prod -n ai exec postgres-0 -- psql -U postgres -c "DROP DATABASE iceberg;"
kubectl --context prod -n ai exec postgres-0 -- psql -U postgres -c "CREATE DATABASE iceberg;"
```
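PostgreSQL refuses `DROP DATABASE` while other sessions are connected to the target database. Because the gateway shares the `iceberg` database and is not scaled down in step 1, it may still hold connections. The following sketch, with hypothetical helper names, counts the open sessions via the standard `pg_stat_activity` view so the situation can be checked before the drop is attempted; how to stop any remaining clients is left to the operator.

```bash
# Count sessions currently connected to the `iceberg` database.
iceberg_connections() {
  kubectl --context prod -n ai exec postgres-0 -- \
    psql -U postgres -Atc "SELECT count(*) FROM pg_stat_activity WHERE datname = 'iceberg';"
}

# Refuse to proceed when something is still connected or the count cannot be read.
check_safe_to_drop() {
  n=$(iceberg_connections)
  case "$n" in
    ''|*[!0-9]*)
      echo "could not read connection count" >&2
      return 2
      ;;
  esac
  if [ "$n" -gt 0 ]; then
    echo "$n session(s) still connected to iceberg; stop those clients first" >&2
    return 1
  fi
  echo "no active sessions on iceberg"
}
```

`check_safe_to_drop` returns 0 only when the count is exactly zero, so it can be chained in front of the `DROP DATABASE` command with `&&`.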
#### 3. Wipe the MinIO warehouse bucket
Get MinIO credentials from the cluster secret:
```bash
kubectl --context prod -n ai get secret minio-secret -o jsonpath='{.data.root-user}' | base64 -d
kubectl --context prod -n ai get secret minio-secret -o jsonpath='{.data.root-password}' | base64 -d
```
Configure the `mc` client inside the MinIO pod and remove all objects, replacing `<user>` and `<password>` with the values printed above:
```bash
kubectl --context prod -n ai exec minio-0 -- mc alias set local http://localhost:9000 <user> <password>
kubectl --context prod -n ai exec minio-0 -- mc rm --recursive --force local/warehouse/
```
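To avoid copying the credentials by hand, the same commands can be wrapped in small shell functions. This is a sketch under the assumptions already made in this step (secret `minio-secret`, keys `root-user` and `root-password`, alias `local`); the function names are hypothetical. It also counts the objects left in the bucket so the wipe can be verified.

```bash
# Read one key from the `minio-secret` Secret and decode it.
minio_cred() {
  kubectl --context prod -n ai get secret minio-secret -o "jsonpath={.data.$1}" | base64 -d
}

# Register the `local` alias inside the MinIO pod using the decoded credentials.
set_minio_alias() {
  kubectl --context prod -n ai exec minio-0 -- \
    mc alias set local http://localhost:9000 "$(minio_cred root-user)" "$(minio_cred root-password)"
}

# Count objects left in the warehouse bucket; assumes the `local` alias is already set.
warehouse_object_count() {
  kubectl --context prod -n ai exec minio-0 -- mc ls --recursive local/warehouse/ | wc -l | tr -d ' '
}
```

After `set_minio_alias` and the `mc rm` command above, `warehouse_object_count` should print `0` if the bucket is empty.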
#### 4. Run the full deploy
```bash
bin/deploy-all --sandboxes
```
This rebuilds and redeploys all services. `deploy-all` also restores `iceberg-catalog`, `flink-jobmanager`, and `flink-taskmanager`, which were scaled to zero in step 1, to their manifest replica counts.
#### 5. Re-apply the gateway database schema
The gateway does **not** auto-migrate. After the `iceberg` database is recreated, the schema must be applied manually:
```bash
kubectl --context prod -n ai exec -i postgres-0 -- psql -U postgres -d iceberg < gateway/schema.sql
```
This creates the `user`, `session`, `user_licenses`, and related tables.
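
A quick way to confirm the schema was applied is to look for the three tables named above in `information_schema.tables`. The sketch below uses hypothetical helper names and checks only those three tables, not the "related tables" that `gateway/schema.sql` may also create; it also assumes the names appear in a single schema.

```bash
# Count how many of the three named auth tables exist in the `iceberg` database.
auth_table_count() {
  kubectl --context prod -n ai exec postgres-0 -- psql -U postgres -d iceberg -Atc \
    "SELECT count(*) FROM information_schema.tables WHERE table_name IN ('user', 'session', 'user_licenses');"
}

# Succeed only when all three tables are present.
verify_schema() {
  n=$(auth_table_count)
  if [ "$n" = "3" ]; then
    echo "auth tables present"
  else
    echo "expected 3 auth tables, found ${n:-none}" >&2
    return 1
  fi
}
```

If `verify_schema` fails, re-run the `psql ... < gateway/schema.sql` command above and check its output for errors before recreating users.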
#### 6. Recreate all users
```bash
bin/create-all-users prod
```
This registers all alpha test users via the gateway API and assigns their licenses. Users are defined in the script itself (`bin/create-all-users`).
To add or modify users, edit that file or run `bin/create-user prod` interactively.
---
## Verification
```bash
curl -I https://dexorder.ai/api/health
```
Check gateway logs for errors:
```bash
kubectl --context prod -n ai logs deployment/gateway --tail=100
```
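Services may take a short while to become ready after a rollout, so a single `curl` can fail spuriously. The sketch below polls the same health URL until it answers; `wait_healthy` is a hypothetical helper, the default of 10 attempts spaced 5 seconds apart is illustrative, and treating HTTP 200 as "healthy" is an assumption about the endpoint. It can be called as `wait_healthy` or, with explicit values, `wait_healthy 20 3`.

```bash
# Poll the health endpoint until it returns HTTP 200 or the attempts run out.
# Arguments: number of attempts (default 10), seconds between attempts (default 5).
wait_healthy() {
  attempts=${1:-10}
  delay=${2:-5}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -I sends a HEAD request as in the command above; -w prints only the status code.
    code=$(curl -sI -o /dev/null -w '%{http_code}' https://dexorder.ai/api/health)
    if [ "$code" = "200" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "still unhealthy after $attempts attempts (last status: $code)" >&2
  return 1
}
```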
---
## Common Issues
### Login fails after Iceberg wipe
**Symptom:** `Sign in failed` (401) or `User creation failed` (postgres error `42P01: undefined table`)
**Cause:** Dropping the `iceberg` database removes the gateway's auth tables along with the Iceberg catalog metadata — they share the same database.
**Fix:** Re-apply the schema and recreate users (steps 5 and 6 above).
### Gateway shows `42P01` errors but pod is running
The gateway does not auto-migrate on startup. The schema file must be applied manually after any database recreation. A gateway restart alone will not fix this.
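
To tell whether this is the problem at hand, the gateway logs can be searched for the error code. A minimal sketch, assuming the code `42P01` appears verbatim in the log lines; `count_42p01` is a hypothetical helper and the default of 100 lines mirrors the log command in the Verification section.

```bash
# Count recent gateway log lines that mention the undefined-table error code.
# Optional argument: how many trailing log lines to inspect (default 100).
count_42p01() {
  # `grep -c` exits non-zero when nothing matches; `|| true` keeps the function's status at 0.
  kubectl --context prod -n ai logs deployment/gateway --tail="${1:-100}" | grep -c '42P01' || true
}
```

A non-zero count after a database recreation points to the missing schema; once step 5 has been run, new log lines should no longer contain the code.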