# Production Deployment Guide

This document describes the full process for deploying the AI platform to the production Kubernetes cluster, including the special steps required when the Iceberg schema has changed.

## Overview

The production cluster runs under `kubectl --context prod`, defaulting to the `ai` namespace. The `sandbox` namespace is shared between dev and prod.

Deployment consists of two parts:

1. **Standard deploy** — rebuild and push all images, apply k8s manifests, roll out services
2. **Iceberg schema wipe** *(when the schema has changed)* — clear both the Iceberg REST catalog (postgres) and the MinIO data warehouse before deploying

---

## Standard Deployment (no schema changes)
```bash
bin/deploy-all --sandboxes
```
This script (hardcoded to `--context=prod`) performs:

1. Applies base kustomize manifests (`deploy/k8s/prod/`) — namespaces, RBAC, policies
2. Applies `deploy/k8s/prod/infrastructure.yaml` — statefulsets, deployments
3. Runs `bin/config-update prod` — updates ConfigMaps
4. Builds and pushes images for all 7 services: `gateway`, `web`, `sandbox`, `lifecycle-sidecar`, `flink`, `relay`, `ingestor`
5. *(with `--sandboxes`)* Deletes sandbox Deployments and Services in the `sandbox` namespace (PVCs are retained; the gateway recreates them on next login)
6. Waits for rollouts on all 6 main deployments

> **Secrets are NOT updated by this script.** Run `bin/secret-update prod` separately if secrets have changed.
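The sequence above can be sketched as a shell outline. This is a hypothetical reconstruction, not the real `bin/deploy-all`: the registry hostname and per-service build directories are assumptions, and only one of the six rollout waits is shown.

```bash
#!/usr/bin/env bash
# Hypothetical outline of the deploy-all sequence described above.
set -euo pipefail

CTX=(--context prod)
SERVICES=(gateway web sandbox lifecycle-sidecar flink relay ingestor)

kubectl "${CTX[@]}" apply -k deploy/k8s/prod/                     # 1. base manifests
kubectl "${CTX[@]}" apply -f deploy/k8s/prod/infrastructure.yaml  # 2. infrastructure
bin/config-update prod                                            # 3. ConfigMaps

for svc in "${SERVICES[@]}"; do                                   # 4. build + push images
  docker build -t "registry.example.com/${svc}:latest" "services/${svc}"  # paths assumed
  docker push "registry.example.com/${svc}:latest"
done

if [[ "${1:-}" == "--sandboxes" ]]; then                          # 5. clear sandbox workloads
  kubectl "${CTX[@]}" -n sandbox delete deployments,services --all
fi

kubectl "${CTX[@]}" -n ai rollout status deployment/gateway       # 6. wait (one of six shown)
```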
---

## Full Deploy with Iceberg Schema Wipe

Use this when the Iceberg table schema has changed (e.g. protobuf/column changes in the `trading.ohlc` table).
### Architecture note

The Iceberg REST catalog uses **two storage layers** that must both be cleared:

| Layer | What it stores | How to clear |
|---|---|---|
| PostgreSQL `iceberg` database | Table/namespace metadata (catalog) | Drop and recreate the database |
| MinIO `warehouse` bucket | Parquet data files | `mc rm --recursive --force` |

**Important:** The gateway also uses the `iceberg` postgres database for its own auth tables (`user`, `user_licenses`, `session`, etc.). Wiping the database removes all user accounts. After the wipe, the schema must be re-applied and users recreated.
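Both layers can be inspected read-only before wiping. A sketch, assuming the pod names used in this guide and an `mc` alias named `local` (configured in step 3 below):

```bash
# List catalog metadata plus the gateway's auth tables (they share one database).
kubectl --context prod -n ai exec postgres-0 -- psql -U postgres -d iceberg -c '\dt'

# List the Parquet data files held in MinIO (requires the `local` alias to be set).
kubectl --context prod -n ai exec minio-0 -- mc ls --recursive local/warehouse/
```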
### Step-by-step

#### 1. Scale down Iceberg consumers

```bash
kubectl --context prod -n ai scale deployment iceberg-catalog flink-jobmanager flink-taskmanager --replicas=0
```
This prevents in-flight writes during the wipe.
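To confirm the scale-down took effect before wiping, a small helper can report the replica counts (a sketch; the deployment names are those from the command above):

```bash
# Prints "<deployment>: <replicas>" for each Iceberg consumer.
# All three lines should read 0 before proceeding with the wipe.
check_scaled_down() {
  local d n
  for d in iceberg-catalog flink-jobmanager flink-taskmanager; do
    # status.replicas is omitted entirely once the last pod is gone, hence the :-0 default
    n=$(kubectl --context prod -n ai get deployment "$d" -o jsonpath='{.status.replicas}')
    echo "$d: ${n:-0}"
  done
}
```

Run `check_scaled_down` and wait until every line reports 0.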
#### 2. Wipe the Iceberg PostgreSQL catalog

```bash
kubectl --context prod -n ai exec postgres-0 -- psql -U postgres -c "DROP DATABASE iceberg;"
kubectl --context prod -n ai exec postgres-0 -- psql -U postgres -c "CREATE DATABASE iceberg;"
```

If the `DROP` fails because sessions are still connected, terminate them first; on PostgreSQL 13 and later, `DROP DATABASE iceberg WITH (FORCE);` does both in one step.
#### 3. Wipe the MinIO warehouse bucket

Get MinIO credentials from the cluster secret:

```bash
kubectl --context prod -n ai get secret minio-secret -o jsonpath='{.data.root-user}' | base64 -d
kubectl --context prod -n ai get secret minio-secret -o jsonpath='{.data.root-password}' | base64 -d
```
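The two values can be captured into shell variables instead of being pasted by hand (same commands as above, wrapped in command substitution):

```bash
# Decode the MinIO root credentials into variables for the next step.
MINIO_USER=$(kubectl --context prod -n ai get secret minio-secret -o jsonpath='{.data.root-user}' | base64 -d)
MINIO_PASS=$(kubectl --context prod -n ai get secret minio-secret -o jsonpath='{.data.root-password}' | base64 -d)
```

These can then be substituted for `<user>` and `<password>` in the `mc alias set` command below.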
Configure the `mc` client inside the MinIO pod and remove all objects:

```bash
kubectl --context prod -n ai exec minio-0 -- mc alias set local http://localhost:9000 <user> <password>
kubectl --context prod -n ai exec minio-0 -- mc rm --recursive --force local/warehouse/
```
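After the removal, the bucket should list as empty (`mc rm` deletes objects, not the bucket itself):

```bash
# Should print nothing once the wipe has completed.
kubectl --context prod -n ai exec minio-0 -- mc ls --recursive local/warehouse/
```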
#### 4. Run the full deploy

```bash
bin/deploy-all --sandboxes
```

This rebuilds and redeploys all services, including `iceberg-catalog`, `flink-jobmanager`, and `flink-taskmanager` (which were scaled to zero above — `deploy-all` will restore them to their manifest replica counts).
#### 5. Re-apply the gateway database schema

The gateway does **not** auto-migrate. After the `iceberg` database is recreated, the schema must be applied manually:

```bash
kubectl --context prod -n ai exec -i postgres-0 -- psql -U postgres -d iceberg < gateway/schema.sql
```

This creates the `user`, `session`, `user_licenses`, and related tables.
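A quick check that the tables exist (`user` is a reserved word in PostgreSQL, so it must be double-quoted):

```bash
# The count should be 0 at this point; a 42P01 error means the schema was not applied.
kubectl --context prod -n ai exec postgres-0 -- psql -U postgres -d iceberg -c 'SELECT count(*) FROM "user";'
```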
#### 6. Recreate all users

```bash
bin/create-all-users prod
```

This registers all alpha test users via the gateway API and assigns their licenses. Users are defined in the script itself (`bin/create-all-users`).

To add or modify users, edit that file or run `bin/create-user prod` interactively.
---

## Verification

```bash
curl -I https://dexorder.ai/api/health
```
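For scripted checks, the same probe can be wrapped so it fails non-zero on anything but HTTP 200 (a sketch using the endpoint above):

```bash
# Returns 0 and prints "ok" when the health endpoint answers 200, else returns 1.
check_health() {
  local code
  code=$(curl -s -o /dev/null -w '%{http_code}' "${1:-https://dexorder.ai/api/health}")
  if [ "$code" = "200" ]; then
    echo "ok"
  else
    echo "health check failed: HTTP $code" >&2
    return 1
  fi
}
```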
Check gateway logs for errors:

```bash
kubectl --context prod -n ai logs deployment/gateway --tail=100
```
---

## Common Issues

### Login fails after Iceberg wipe

**Symptom:** `Sign in failed` (401) or `User creation failed` (postgres error `42P01: undefined table`)

**Cause:** Dropping the `iceberg` database removes the gateway's auth tables along with the Iceberg catalog metadata — they share the same database.

**Fix:** Re-apply the schema and recreate users (steps 5 and 6 above).

### Gateway shows `42P01` errors but pod is running

The gateway does not auto-migrate on startup. The schema file must be applied manually after any database recreation. A gateway restart alone will not fix this.
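To confirm the diagnosis, list the tables in the `iceberg` database; an empty listing means the schema was never re-applied after the wipe:

```bash
# "Did not find any relations" indicates the gateway schema is missing.
kubectl --context prod -n ai exec postgres-0 -- psql -U postgres -d iceberg -c '\dt'
```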