diff --git a/doc/prod_deployment.md b/doc/prod_deployment.md
index 0c256154..5f9ea744 100644
--- a/doc/prod_deployment.md
+++ b/doc/prod_deployment.md
@@ -90,15 +90,26 @@ kubectl --context prod -n ai exec minio-0 -- mc alias set local http://localhost
 kubectl --context prod -n ai exec minio-0 -- mc rm --recursive --force local/warehouse/
 ```
 
-#### 4. Run the full deploy
+#### 4. Delete sandbox deployments and wipe sandbox PVCs
+
+Sandbox PVCs have a finalizer that prevents deletion until the sandbox pod is gone. Delete the deployments first, then the PVCs:
+
+```bash
+kubectl --context prod -n sandbox delete deployments --all
+kubectl --context prod -n sandbox delete pvc --all
+```
+
+The PVC deletion will complete once the pods finish terminating (Ceph cleanup can take ~30s). You can proceed to the deploy immediately — it does not depend on PVC termination completing.
+
+#### 5. Run the full deploy
 
 ```bash
 bin/deploy-all --sandboxes
 ```
 
-This rebuilds and redeploys all services, including `iceberg-catalog`, `flink-jobmanager`, and `flink-taskmanager` (which were scaled to zero above — `deploy-all` will restore them to their manifest replica counts).
+This rebuilds and redeploys all services, including `iceberg-catalog`, `flink-jobmanager`, and `flink-taskmanager` (which were scaled to zero above — `deploy-all` will restore them to their manifest replica counts). The `--sandboxes` flag also cleans up any remaining sandbox Services.
 
-#### 5. Re-apply the gateway database schema
+#### 6. Re-apply the gateway database schema
 
 The gateway does **not** auto-migrate. After the `iceberg` database is recreated, the schema must be applied manually:
 
@@ -108,7 +119,7 @@ kubectl --context prod -n ai exec -i postgres-0 -- psql -U postgres -d iceberg <
 
 This creates the `user`, `session`, `user_licenses`, and related tables.
 
-#### 6. Recreate all users
+#### 7. Recreate all users
 
 ```bash
 bin/create-all-users prod
@@ -142,7 +153,7 @@ kubectl --context prod -n ai logs deployment/gateway --tail=100
 
 **Cause:** Dropping the `iceberg` database removes the gateway's auth tables along with the Iceberg catalog metadata — they share the same database.
 
-**Fix:** Re-apply the schema and recreate users (steps 5 and 6 above).
+**Fix:** Re-apply the schema and recreate users (steps 6 and 7 above).
 
 ### Gateway shows `42P01` errors but pod is running
 
@@ -164,3 +175,25 @@ The gateway does not auto-migrate on startup. The schema file must be applied ma
 1Password's `op inject` requires interactive desktop authentication. Running it via `echo "yes" | bin/secret-update prod` or any background/piped invocation will fail silently (the script prints `✓` even though `kubectl apply` received empty input).
 
 **Fix:** Run `bin/secret-update prod` in an interactive terminal with 1Password unlocked.
+
+### Config validation warnings during `bin/deploy-all`
+
+**Symptom:** Step 3 (config update) prints errors like:
+```
+error: error validating "deploy/k8s/prod/configs/relay-config.yaml": error validating data: [apiVersion not set, kind not set]
+```
+for `relay-config`, `ingestor-config`, and `flink-config`.
+
+**Cause:** These config files are raw data files (not Kubernetes manifests), so `kubectl` can't validate their structure. The underlying `kubectl create configmap` command succeeds regardless.
+
+**Impact:** None — the configs are applied correctly and the script reports `✓ All configs updated successfully`. These warnings are expected and can be ignored.
+
+### Flink image build produces many Maven shading warnings
+
+**Symptom:** During Step 4, the Flink image build outputs dozens of `[WARNING] Discovered module-info.class` and overlapping class/resource warnings from Maven.
+
+**Impact:** None — these are pre-existing warnings from bundling Iceberg, AWS SDK, and Flink dependencies together into a shaded JAR. The build completes successfully.
+
+### `bin/deploy-all` confirmation prompt
+
+Unlike `bin/secret-update`, the `bin/deploy-all` confirmation prompt (`Are you sure you want to continue? (yes/no)`) works fine with `echo "yes" | bin/deploy-all --sandboxes` from a script or non-interactive context.