bugfixes; research subproc; higher sandbox limits
doc/plan.md (new file, +10 lines)
@@ -0,0 +1,10 @@
# Development Plan

* Realtime data
* Triggers
* Strategy UI
* Backtesting TV integration
* Paper Trading
* User secrets
* Live Execution
* Sandbox <=> Dexorder auth
doc/prod_deployment.md (new file, +139 lines)
@@ -0,0 +1,139 @@
# Production Deployment Guide

This document describes the full process for deploying the AI platform to the production Kubernetes cluster, including the special steps required when the Iceberg schema has changed.

## Overview

The production cluster runs under `kubectl --context prod`, defaulting to the `ai` namespace. The `sandbox` namespace is shared between dev and prod.
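
To sanity-check the context and namespaces before a deploy (optional; it uses only the names above):

```bash
# Confirm the prod context is configured and both namespaces exist.
kubectl config get-contexts prod
kubectl --context prod get namespace ai sandbox
```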

Deployment consists of two parts:

1. **Standard deploy** — rebuild and push all images, apply k8s manifests, roll out services
2. **Iceberg schema wipe** *(when schema has changed)* — clear both the Iceberg REST catalog (postgres) and the MinIO data warehouse before deploying

---

## Standard Deployment (no schema changes)

```bash
bin/deploy-all --sandboxes
```

This script (hardcoded to `--context=prod`) performs:

1. Applies base kustomize manifests (`deploy/k8s/prod/`) — namespaces, RBAC, policies
2. Applies `deploy/k8s/prod/infrastructure.yaml` — statefulsets, deployments
3. Runs `bin/config-update prod` — updates ConfigMaps
4. Builds and pushes images for all 7 services: `gateway`, `web`, `sandbox`, `lifecycle-sidecar`, `flink`, `relay`, `ingestor`
5. *(with `--sandboxes`)* Deletes sandbox Deployments and Services in the `sandbox` namespace (PVCs are retained; gateway recreates them on next login)
6. Waits for rollouts on all 6 main deployments (a manual equivalent is sketched below)
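
A rough manual equivalent of step 6, useful when checking a rollout by hand; the deployment names below are illustrative, so list the real ones with `kubectl --context prod -n ai get deploy`:

```bash
# Wait for each main deployment to finish rolling out (names are illustrative).
for d in gateway web relay ingestor; do
  kubectl --context prod -n ai rollout status deployment/"$d" --timeout=300s
done
```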

> **Secrets are NOT updated by this script.** Run `bin/secret-update prod` separately if secrets have changed.

---

## Full Deploy with Iceberg Schema Wipe

Use this when the Iceberg table schema has changed (e.g. protobuf/column changes in the `trading.ohlc` table).

### Architecture note

The Iceberg REST catalog uses **two storage layers** that must both be cleared:

| Layer | What it stores | How to clear |
|---|---|---|
| PostgreSQL `iceberg` database | Table/namespace metadata (catalog) | Drop and recreate the database |
| MinIO `warehouse` bucket | Parquet data files | `mc rm --recursive --force` |

**Important:** The gateway also uses the `iceberg` postgres database for its own auth tables (`user`, `user_licenses`, `session`, etc.). Wiping the database removes all user accounts. After the wipe, the schema must be re-applied and users recreated.
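
To see this shared layout directly (an optional check, using the same pod name as the commands below), list the tables in the `iceberg` database:

```bash
# Both the Iceberg catalog metadata and the gateway auth tables live in the same database.
kubectl --context prod -n ai exec postgres-0 -- psql -U postgres -d iceberg -c '\dt'
```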

### Step-by-step

#### 1. Scale down Iceberg consumers

```bash
kubectl --context prod -n ai scale deployment iceberg-catalog flink-jobmanager flink-taskmanager --replicas=0
```

This prevents in-flight writes during the wipe.
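
To confirm the consumers are fully scaled down before wiping (optional):

```bash
# All three should report 0/0 ready replicas before proceeding.
kubectl --context prod -n ai get deployment iceberg-catalog flink-jobmanager flink-taskmanager
```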

#### 2. Wipe the Iceberg PostgreSQL catalog

```bash
kubectl --context prod -n ai exec postgres-0 -- psql -U postgres -c "DROP DATABASE iceberg;"
kubectl --context prod -n ai exec postgres-0 -- psql -U postgres -c "CREATE DATABASE iceberg;"
```
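
If the drop fails with `database "iceberg" is being accessed by other users` (for example, the gateway is still connected), one option, assuming PostgreSQL 13 or newer, is to terminate the remaining sessions as part of the drop:

```bash
# Forcibly disconnect remaining sessions while dropping (PostgreSQL 13+ only).
kubectl --context prod -n ai exec postgres-0 -- psql -U postgres -c 'DROP DATABASE iceberg WITH (FORCE);'
```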

#### 3. Wipe the MinIO warehouse bucket

Get MinIO credentials from the cluster secret:

```bash
kubectl --context prod -n ai get secret minio-secret -o jsonpath='{.data.root-user}' | base64 -d
kubectl --context prod -n ai get secret minio-secret -o jsonpath='{.data.root-password}' | base64 -d
```
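
If you prefer not to paste the values by hand, the same two lookups can feed the alias command from the next block directly (an equivalent convenience variant):

```bash
# Capture the credentials in variables and configure the mc alias in one go.
MINIO_USER=$(kubectl --context prod -n ai get secret minio-secret -o jsonpath='{.data.root-user}' | base64 -d)
MINIO_PASS=$(kubectl --context prod -n ai get secret minio-secret -o jsonpath='{.data.root-password}' | base64 -d)
kubectl --context prod -n ai exec minio-0 -- mc alias set local http://localhost:9000 "$MINIO_USER" "$MINIO_PASS"
```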

Configure the `mc` client inside the MinIO pod and remove all objects:

```bash
kubectl --context prod -n ai exec minio-0 -- mc alias set local http://localhost:9000 <user> <password>
kubectl --context prod -n ai exec minio-0 -- mc rm --recursive --force local/warehouse/
```
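
To verify the bucket is actually empty afterwards (optional; no output means nothing is left):

```bash
# An empty listing confirms the warehouse data files are gone.
kubectl --context prod -n ai exec minio-0 -- mc ls --recursive local/warehouse/
```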

#### 4. Run the full deploy

```bash
bin/deploy-all --sandboxes
```

This rebuilds and redeploys all services, including `iceberg-catalog`, `flink-jobmanager`, and `flink-taskmanager` (which were scaled to zero above — `deploy-all` will restore them to their manifest replica counts).

#### 5. Re-apply the gateway database schema

The gateway does **not** auto-migrate. After the `iceberg` database is recreated, the schema must be applied manually:

```bash
kubectl --context prod -n ai exec -i postgres-0 -- psql -U postgres -d iceberg < gateway/schema.sql
```

This creates the `user`, `session`, `user_licenses`, and related tables.

#### 6. Recreate all users

```bash
bin/create-all-users prod
```

This registers all alpha test users via the gateway API and assigns their licenses. Users are defined in the script itself (`bin/create-all-users`).

To add or modify users, edit that file or run `bin/create-user prod` interactively.
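
A quick spot-check that the accounts exist again (the `user` table name is quoted because it is a reserved word in PostgreSQL):

```bash
# Should return the number of recreated alpha test users.
kubectl --context prod -n ai exec postgres-0 -- psql -U postgres -d iceberg -c 'SELECT count(*) FROM "user";'
```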

---

## Verification

```bash
curl -I https://dexorder.ai/api/health
```

Check gateway logs for errors:

```bash
kubectl --context prod -n ai logs deployment/gateway --tail=100
```
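
For a broader look beyond the gateway, check that pods in both namespaces are healthy (sandbox workloads reappear as users log in):

```bash
# Everything in the ai namespace should be Running; sandbox may be empty until users log in.
kubectl --context prod -n ai get pods
kubectl --context prod -n sandbox get pods
```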

---

## Common Issues

### Login fails after Iceberg wipe

**Symptom:** `Sign in failed` (401) or `User creation failed` (postgres error `42P01: undefined table`)

**Cause:** Dropping the `iceberg` database removes the gateway's auth tables along with the Iceberg catalog metadata — they share the same database.

**Fix:** Re-apply the schema and recreate users (steps 5 and 6 above).
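
The same fix as a compact sequence (identical to the commands in steps 5 and 6):

```bash
# Re-apply the gateway schema, then recreate the alpha users.
kubectl --context prod -n ai exec -i postgres-0 -- psql -U postgres -d iceberg < gateway/schema.sql
bin/create-all-users prod
```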

### Gateway shows `42P01` errors but pod is running

The gateway does not auto-migrate on startup. The schema file must be applied manually after any database recreation. A gateway restart alone will not fix this.

@@ -81,18 +81,29 @@ All sockets bind on **Relay** (well-known endpoint). Components connect to relay
- Relay publishes DataRequest to ingestor work queue
- No request tracking - relay is stateless

### 2. Ingestor Work Queue (Relay → Ingestors)
**Pattern**: PUB/SUB with exchange prefix filtering
- **Socket Type**: Relay uses PUB (bind), Ingestors use SUB (connect)
- **Endpoint**: `tcp://*:5555` (Relay binds)
- **Message Types**: `DataRequest` (historical or realtime)
- **Topic Prefix**: Market name (e.g., `BTC/USDT.`, `ETH/BTC.`)
- **Behavior**:
  - Relay publishes work with exchange prefix from ticker
  - Ingestors subscribe only to exchanges they support
  - Multiple ingestors can compete for same exchange
  - Ingestors write data to Kafka only (no direct response)
  - Flink processes Kafka → Iceberg → notification
### 2. Ingestor Work Queue (Flink ↔ Ingestors)
**Pattern**: ROUTER/DEALER slot-based broker
- **Socket Type**: Flink `IngestorBroker` uses ROUTER (bind), Ingestors use DEALER (connect)
- **Endpoint**: `tcp://*:5567` (Flink binds)
- **Message Types**: `WorkerReady` (slot offer), `DataRequest` (work assignment), `WorkComplete`, `WorkHeartbeat`, `WorkReject`, `WorkStop`
- **Capacity model**:
  - Each `WorkerReady` (0x20) is ONE slot offer for one exchange and one job type (`SlotType`: `HISTORICAL=1`, `REALTIME=2`, `ANY=0`)
  - Ingestors send N `WorkerReady` messages at startup — one per available slot per exchange per type
  - Flink dispatches a job by matching the slot's exchange and SlotType to the request
  - The slot is consumed on dispatch; the ingestor re-offers it (new `WorkerReady`) when the job ends
  - Rate-limit backoff: if the exchange returns a 429, the ingestor delays the re-offer by the `Retry-After` duration from the response header
- **Historical job lifecycle**:
  - Flink dispatches `DataRequest` (HISTORICAL_OHLC) → ingestor fetches and writes to Kafka → sends `WorkComplete` (0x21) → sends new `WorkerReady` for that slot
- **Realtime job lifecycle**:
  - Flink dispatches `DataRequest` (REALTIME_TICKS) → ingestor polls exchange and writes ticks to Kafka → sends `WorkHeartbeat` (0x22) every 5 s → on `WorkStop` (0x25) from Flink: cancels and sends new `WorkerReady`
- **Slot configuration** (per ingestor, per exchange):
  ```yaml
  exchange_capacity:
    BINANCE: { historical_slots: 3, realtime_slots: 5 }
    KRAKEN: { historical_slots: 2, realtime_slots: 3 }
    COINBASE: { historical_slots: 2, realtime_slots: 4 }
  ```
- **Flink restart**: when Flink restarts, its `freeSlots` deque is cleared; all in-flight jobs time out on the ingestor side, releasing their slots, which then re-offer via `WorkerReady`

### 3. Market Data Fanout (Relay ↔ Flink ↔ Clients)
**Pattern**: XPUB/XSUB proxy

@@ -1,4 +1 @@
What conclusions can you make by analyzing historical data on ETH price direction changes near market session overlaps and market session changes on Monday and Tuesday?

---

What conclusions can you make by analyzing historical data on ETH price direction changes near market session overlaps and market session changes on Monday and Tuesday?