bugfixes; research subproc; higher sandbox limits
doc/plan.md (new file, +10 lines)
@@ -0,0 +1,10 @@
# Development Plan

* Realtime data
* Triggers
* Strategy UI
* Backtesting TV integration
* Paper Trading
* User secrets
* Live Execution
* Sandbox <=> Dexorder auth
doc/prod_deployment.md (new file, +139 lines)
@@ -0,0 +1,139 @@
# Production Deployment Guide

This document describes the full process for deploying the AI platform to the production Kubernetes cluster, including the special steps required when the Iceberg schema has changed.

## Overview

The production cluster runs under `kubectl --context prod`, defaulting to the `ai` namespace. The `sandbox` namespace is shared between dev and prod.
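
To sanity-check the context and namespaces before a deploy (optional; it uses only the names above):

```bash
# Confirm the prod context is configured and both namespaces exist.
kubectl config get-contexts prod
kubectl --context prod get namespace ai sandbox
```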

Deployment consists of two parts:

1. **Standard deploy** — rebuild and push all images, apply k8s manifests, roll out services
2. **Iceberg schema wipe** *(when schema has changed)* — clear both the Iceberg REST catalog (postgres) and the MinIO data warehouse before deploying

---

## Standard Deployment (no schema changes)

```bash
bin/deploy-all --sandboxes
```

This script (hardcoded to `--context=prod`) performs:

1. Applies base kustomize manifests (`deploy/k8s/prod/`) — namespaces, RBAC, policies
2. Applies `deploy/k8s/prod/infrastructure.yaml` — statefulsets, deployments
3. Runs `bin/config-update prod` — updates ConfigMaps
4. Builds and pushes images for all 7 services: `gateway`, `web`, `sandbox`, `lifecycle-sidecar`, `flink`, `relay`, `ingestor`
5. *(with `--sandboxes`)* Deletes sandbox Deployments and Services in the `sandbox` namespace (PVCs are retained; gateway recreates them on next login)
6. Waits for rollouts on all 6 main deployments (a manual equivalent is sketched below)
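
A rough manual equivalent of step 6, useful when checking a rollout by hand; the deployment names below are illustrative, so list the real ones with `kubectl --context prod -n ai get deploy`:

```bash
# Wait for each main deployment to finish rolling out (names are illustrative).
for d in gateway web relay ingestor; do
  kubectl --context prod -n ai rollout status deployment/"$d" --timeout=300s
done
```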

> **Secrets are NOT updated by this script.** Run `bin/secret-update prod` separately if secrets have changed.

---

## Full Deploy with Iceberg Schema Wipe

Use this when the Iceberg table schema has changed (e.g. protobuf/column changes in the `trading.ohlc` table).

### Architecture note

The Iceberg REST catalog uses **two storage layers** that must both be cleared:

| Layer | What it stores | How to clear |
|---|---|---|
| PostgreSQL `iceberg` database | Table/namespace metadata (catalog) | Drop and recreate the database |
| MinIO `warehouse` bucket | Parquet data files | `mc rm --recursive --force` |

**Important:** The gateway also uses the `iceberg` postgres database for its own auth tables (`user`, `user_licenses`, `session`, etc.). Wiping the database removes all user accounts. After the wipe, the schema must be re-applied and users recreated.
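
To see this shared layout directly (an optional check, using the same pod name as the commands below), list the tables in the `iceberg` database:

```bash
# Both the Iceberg catalog metadata and the gateway auth tables live in the same database.
kubectl --context prod -n ai exec postgres-0 -- psql -U postgres -d iceberg -c '\dt'
```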

### Step-by-step

#### 1. Scale down Iceberg consumers

```bash
kubectl --context prod -n ai scale deployment iceberg-catalog flink-jobmanager flink-taskmanager --replicas=0
```

This prevents in-flight writes during the wipe.
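
To confirm the consumers are fully scaled down before wiping (optional):

```bash
# All three should report 0/0 ready replicas before proceeding.
kubectl --context prod -n ai get deployment iceberg-catalog flink-jobmanager flink-taskmanager
```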

#### 2. Wipe the Iceberg PostgreSQL catalog

```bash
kubectl --context prod -n ai exec postgres-0 -- psql -U postgres -c "DROP DATABASE iceberg;"
kubectl --context prod -n ai exec postgres-0 -- psql -U postgres -c "CREATE DATABASE iceberg;"
```
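
If the drop fails with `database "iceberg" is being accessed by other users` (for example, the gateway is still connected), one option, assuming PostgreSQL 13 or newer, is to terminate the remaining sessions as part of the drop:

```bash
# Forcibly disconnect remaining sessions while dropping (PostgreSQL 13+ only).
kubectl --context prod -n ai exec postgres-0 -- psql -U postgres -c 'DROP DATABASE iceberg WITH (FORCE);'
```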

#### 3. Wipe the MinIO warehouse bucket

Get MinIO credentials from the cluster secret:

```bash
kubectl --context prod -n ai get secret minio-secret -o jsonpath='{.data.root-user}' | base64 -d
kubectl --context prod -n ai get secret minio-secret -o jsonpath='{.data.root-password}' | base64 -d
```
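
If you prefer not to paste the values by hand, the same two lookups can feed the alias command from the next block directly (an equivalent convenience variant):

```bash
# Capture the credentials in variables and configure the mc alias in one go.
MINIO_USER=$(kubectl --context prod -n ai get secret minio-secret -o jsonpath='{.data.root-user}' | base64 -d)
MINIO_PASS=$(kubectl --context prod -n ai get secret minio-secret -o jsonpath='{.data.root-password}' | base64 -d)
kubectl --context prod -n ai exec minio-0 -- mc alias set local http://localhost:9000 "$MINIO_USER" "$MINIO_PASS"
```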

Configure the `mc` client inside the MinIO pod and remove all objects:

```bash
kubectl --context prod -n ai exec minio-0 -- mc alias set local http://localhost:9000 <user> <password>
kubectl --context prod -n ai exec minio-0 -- mc rm --recursive --force local/warehouse/
```
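
To verify the bucket is actually empty afterwards (optional; no output means nothing is left):

```bash
# An empty listing confirms the warehouse data files are gone.
kubectl --context prod -n ai exec minio-0 -- mc ls --recursive local/warehouse/
```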

#### 4. Run the full deploy

```bash
bin/deploy-all --sandboxes
```

This rebuilds and redeploys all services, including `iceberg-catalog`, `flink-jobmanager`, and `flink-taskmanager` (which were scaled to zero above — `deploy-all` will restore them to their manifest replica counts).

#### 5. Re-apply the gateway database schema

The gateway does **not** auto-migrate. After the `iceberg` database is recreated, the schema must be applied manually:

```bash
kubectl --context prod -n ai exec -i postgres-0 -- psql -U postgres -d iceberg < gateway/schema.sql
```

This creates the `user`, `session`, `user_licenses`, and related tables.

#### 6. Recreate all users

```bash
bin/create-all-users prod
```

This registers all alpha test users via the gateway API and assigns their licenses. Users are defined in the script itself (`bin/create-all-users`).

To add or modify users, edit that file or run `bin/create-user prod` interactively.
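
A quick spot-check that the accounts exist again (the `user` table name is quoted because it is a reserved word in PostgreSQL):

```bash
# Should return the number of recreated alpha test users.
kubectl --context prod -n ai exec postgres-0 -- psql -U postgres -d iceberg -c 'SELECT count(*) FROM "user";'
```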

---

## Verification

```bash
curl -I https://dexorder.ai/api/health
```

Check gateway logs for errors:

```bash
kubectl --context prod -n ai logs deployment/gateway --tail=100
```
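
For a broader look beyond the gateway, check that pods in both namespaces are healthy (sandbox workloads reappear as users log in):

```bash
# Everything in the ai namespace should be Running; sandbox may be empty until users log in.
kubectl --context prod -n ai get pods
kubectl --context prod -n sandbox get pods
```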

---

## Common Issues

### Login fails after Iceberg wipe

**Symptom:** `Sign in failed` (401) or `User creation failed` (postgres error `42P01: undefined table`)

**Cause:** Dropping the `iceberg` database removes the gateway's auth tables along with the Iceberg catalog metadata — they share the same database.

**Fix:** Re-apply the schema and recreate users (steps 5 and 6 above).
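
The same fix as a compact sequence (identical to the commands in steps 5 and 6):

```bash
# Re-apply the gateway schema, then recreate the alpha users.
kubectl --context prod -n ai exec -i postgres-0 -- psql -U postgres -d iceberg < gateway/schema.sql
bin/create-all-users prod
```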

### Gateway shows `42P01` errors but pod is running

The gateway does not auto-migrate on startup. The schema file must be applied manually after any database recreation. A gateway restart alone will not fix this.

@@ -81,18 +81,29 @@ All sockets bind on **Relay** (well-known endpoint). Components connect to relay
- Relay publishes DataRequest to ingestor work queue
- No request tracking - relay is stateless

### 2. Ingestor Work Queue (Relay → Ingestors)
**Pattern**: PUB/SUB with exchange prefix filtering
- **Socket Type**: Relay uses PUB (bind), Ingestors use SUB (connect)
- **Endpoint**: `tcp://*:5555` (Relay binds)
- **Message Types**: `DataRequest` (historical or realtime)
- **Topic Prefix**: Market name (e.g., `BTC/USDT.`, `ETH/BTC.`)
- **Behavior**:
  - Relay publishes work with exchange prefix from ticker
  - Ingestors subscribe only to exchanges they support
  - Multiple ingestors can compete for same exchange
  - Ingestors write data to Kafka only (no direct response)
  - Flink processes Kafka → Iceberg → notification
### 2. Ingestor Work Queue (Flink ↔ Ingestors)
**Pattern**: ROUTER/DEALER slot-based broker
- **Socket Type**: Flink `IngestorBroker` uses ROUTER (bind), Ingestors use DEALER (connect)
- **Endpoint**: `tcp://*:5567` (Flink binds)
- **Message Types**: `WorkerReady` (slot offer), `DataRequest` (work assignment), `WorkComplete`, `WorkHeartbeat`, `WorkReject`, `WorkStop`
- **Capacity model**:
  - Each `WorkerReady` (0x20) is ONE slot offer for one exchange and one job type (`SlotType`: `HISTORICAL=1`, `REALTIME=2`, `ANY=0`)
  - Ingestors send N `WorkerReady` messages at startup — one per available slot per exchange per type
  - Flink dispatches a job by matching the slot's exchange and SlotType to the request
  - The slot is consumed on dispatch; the ingestor re-offers it (new `WorkerReady`) when the job ends
  - Rate-limit backoff: if the exchange returns a 429, the ingestor delays the re-offer by the `Retry-After` duration from the response header
- **Historical job lifecycle**:
  - Flink dispatches `DataRequest` (HISTORICAL_OHLC) → ingestor fetches and writes to Kafka → sends `WorkComplete` (0x21) → sends new `WorkerReady` for that slot
- **Realtime job lifecycle**:
  - Flink dispatches `DataRequest` (REALTIME_TICKS) → ingestor polls exchange and writes ticks to Kafka → sends `WorkHeartbeat` (0x22) every 5 s → on `WorkStop` (0x25) from Flink: cancels and sends new `WorkerReady`
- **Slot configuration** (per ingestor, per exchange):
  ```yaml
  exchange_capacity:
    BINANCE: { historical_slots: 3, realtime_slots: 5 }
    KRAKEN: { historical_slots: 2, realtime_slots: 3 }
    COINBASE: { historical_slots: 2, realtime_slots: 4 }
  ```
- **Flink restart**: when Flink restarts, its `freeSlots` deque is cleared; all in-flight jobs time out on the ingestor side, releasing their slots, which then re-offer via `WorkerReady`

### 3. Market Data Fanout (Relay ↔ Flink ↔ Clients)
**Pattern**: XPUB/XSUB proxy

@@ -1,4 +1 @@
What conclusions can you make by analyzing historical data on ETH price direction changes near market session overlaps and market session changes on Monday and Tuesday?

---

What conclusions can you make by analyzing historical data on ETH price direction changes near market session overlaps and market session changes on Monday and Tuesday?