Update prod_deployment.md

2026-04-24 21:16:29 -04:00
parent fecefa15ef
commit 2268ef0d3f


@@ -90,15 +90,26 @@ kubectl --context prod -n ai exec minio-0 -- mc alias set local http://localhost
kubectl --context prod -n ai exec minio-0 -- mc rm --recursive --force local/warehouse/
```
#### 4. Delete sandbox deployments and wipe sandbox PVCs
Sandbox PVCs have a finalizer that prevents deletion until the sandbox pod is gone. Delete the deployments first, then the PVCs:
```bash
kubectl --context prod -n sandbox delete deployments --all
kubectl --context prod -n sandbox delete pvc --all
```
The PVC deletion will complete once the pods finish terminating (Ceph cleanup can take ~30s). You can proceed to the deploy immediately — it does not depend on PVC termination completing.
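If you do want to confirm the cleanup finished before moving on, a small polling helper works. This is a sketch only (the `wait_for_pvc_cleanup` name and the 5s/120s intervals are illustrative, not part of the repo); `kubectl wait --for=delete pvc --all` is a built-in alternative.

```bash
# Hypothetical helper: poll until no PVCs remain in the sandbox namespace.
wait_for_pvc_cleanup() {
  local timeout=${1:-120} elapsed=0
  # `-o name` prints one line per remaining PVC; `grep -q .` succeeds while any remain.
  while kubectl --context prod -n sandbox get pvc -o name 2>/dev/null | grep -q .; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out after ${timeout}s waiting for PVC deletion" >&2
      return 1
    fi
    sleep 5
    elapsed=$((elapsed + 5))
  done
  echo "all sandbox PVCs deleted"
}
```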
#### 5. Run the full deploy
```bash
bin/deploy-all --sandboxes
```
This rebuilds and redeploys all services, including `iceberg-catalog`, `flink-jobmanager`, and `flink-taskmanager` (which were scaled to zero above — `deploy-all` will restore them to their manifest replica counts). The `--sandboxes` flag also cleans up any remaining sandbox Services.
#### 6. Re-apply the gateway database schema
The gateway does **not** auto-migrate. After the `iceberg` database is recreated, the schema must be applied manually:
@@ -108,7 +119,7 @@ kubectl --context prod -n ai exec -i postgres-0 -- psql -U postgres -d iceberg <
This creates the `user`, `session`, `user_licenses`, and related tables.
#### 7. Recreate all users
```bash
bin/create-all-users prod
@@ -142,7 +153,7 @@ kubectl --context prod -n ai logs deployment/gateway --tail=100
**Cause:** Dropping the `iceberg` database removes the gateway's auth tables along with the Iceberg catalog metadata — they share the same database.
**Fix:** Re-apply the schema and recreate users (steps 6 and 7 above).
### Gateway shows `42P01` errors but pod is running
@@ -164,3 +175,25 @@ The gateway does not auto-migrate on startup. The schema file must be applied ma
1Password's `op inject` requires interactive desktop authentication. Running it via `echo "yes" | bin/secret-update prod` or any background/piped invocation will fail silently (the script prints `✓` even though `kubectl apply` received empty input).
**Fix:** Run `bin/secret-update prod` in an interactive terminal with 1Password unlocked.
### Config validation warnings during `bin/deploy-all`
**Symptom:** Step 3 (config update) prints errors like:
```
error: error validating "deploy/k8s/prod/configs/relay-config.yaml": error validating data: [apiVersion not set, kind not set]
```
for `relay-config`, `ingestor-config`, and `flink-config`.
**Cause:** These config files are raw data files (not Kubernetes manifests), so `kubectl` can't validate their structure. The underlying `kubectl create configmap` command succeeds regardless.
**Impact:** None — the configs are applied correctly and the script reports `✓ All configs updated successfully`. These warnings are expected and can be ignored.
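For context, these files carry plain application settings rather than a Kubernetes object. An entirely hypothetical illustration of that shape, with none of the `apiVersion`/`kind` fields `kubectl` tries to validate:

```yaml
# Illustrative only; the real relay-config.yaml contents are not shown here.
listen_port: 8080
log_level: info
```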
### Flink image build produces many Maven shading warnings
**Symptom:** During Step 4, the Flink image build outputs dozens of `[WARNING] Discovered module-info.class` and overlapping class/resource warnings from Maven.
**Impact:** None — these are pre-existing warnings from bundling Iceberg, AWS SDK, and Flink dependencies together into a shaded JAR. The build completes successfully.
### `bin/deploy-all` confirmation prompt
Unlike `bin/secret-update`, the `bin/deploy-all` confirmation prompt (`Are you sure you want to continue? (yes/no)`) works fine with `echo "yes" | bin/deploy-all --sandboxes` from a script or non-interactive context.
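That works because a plain `read` on stdin happily consumes piped input. A sketch of that style of prompt (not `bin/deploy-all`'s actual code):

```bash
# Minimal stdin-based confirmation: piped "yes" satisfies `read`, whereas
# 1Password's desktop authentication never looks at stdin at all.
confirm() {
  printf 'Are you sure you want to continue? (yes/no) '
  read -r answer
  [ "$answer" = "yes" ]
}
```

`echo "yes" | confirm` then succeeds, which is exactly why the same trick cannot rescue `bin/secret-update`.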