# DexOrder AI Platform Architecture

## Overview

DexOrder is an AI-powered trading platform that combines real-time market data processing, user-specific AI agents, and a flexible data pipeline. The system is designed for scalability, isolation, and extensibility.

## High-Level Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                          User Clients                           │
│               (Web, Mobile, Telegram, External MCP)             │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                             Gateway                             │
│  • WebSocket/HTTP/Telegram handlers                             │
│  • Authentication & session management                          │
│  • Agent Harness (LangChain/LangGraph orchestration)            │
│    - MCP client connector to user containers                    │
│    - RAG retriever (Qdrant)                                     │
│    - Model router (LLM selection)                               │
│    - Skills & subagents framework                               │
│  • Dynamic user container provisioning                          │
│  • Event routing (informational & critical)                     │
└────────┬──────────────────┬────────────────────┬────────────────┘
         │                  │                    │
         ▼                  ▼                    ▼
┌──────────────────┐  ┌──────────────┐  ┌──────────────────────┐
│  User Containers │  │    Relay     │  │    Infrastructure    │
│  (per-user pods) │  │ (ZMQ Router) │  │ • DragonflyDB (cache)│
│                  │  │              │  │ • Qdrant (vectors)   │
│  • MCP Server    │  │ • Market data│  │ • PostgreSQL (meta)  │
│  • User files:   │  │   fanout     │  │ • MinIO (S3)         │
│    - Indicators  │  │ • Work queue │  │                      │
│    - Strategies  │  │ • Stateless  │  │                      │
│    - Preferences │  │              │  │                      │
│ • Event Publisher│  │              │  │                      │
│ • Lifecycle Mgmt │  │              │  │                      │
└──────────────────┘  └──────┬───────┘  └──────────────────────┘
                             │
              ┌──────────────┴──────────────┐
              │                             │
              ▼                             ▼
    ┌──────────────────┐        ┌──────────────────────┐
    │    Ingestors     │        │    Flink Cluster     │
    │ • CCXT adapters  │        │ • Deduplication      │
    │ • Exchange APIs  │        │ • OHLC aggregation   │
    │ • Push to Kafka  │        │ • CEP engine         │
    └────────┬─────────┘        │ • Writes to Iceberg  │
             │                  │ • Market data PUB    │
             │                  └──────────┬───────────┘
             ▼                             │
┌─────────────────────────────────────────▼────────────┐
│                        Kafka                         │
│  • Durable append log                                │
│  • Topic-based streams                               │
│  • Event sourcing                                    │
└──────────────────────┬───────────────────────────────┘
                       │
                       ▼
              ┌─────────────────┐
              │ Iceberg Catalog │
              │ • Historical    │
              │   OHLC storage  │
              │ • Query API     │
              └─────────────────┘
```

## Core Components

### 1. Gateway

**Location:** `gateway/`
**Language:** TypeScript (Node.js)
**Purpose:** Entry point for all user interactions

**Responsibilities:**
- **Authentication:** JWT tokens, Telegram OAuth, multi-tier licensing
- **Session Management:** WebSocket connections, Telegram webhooks, multi-channel support
- **Container Orchestration:** Dynamic provisioning of user agent pods ([[gateway_container_creation]])
- **Event Handling:**
  - Subscribe to user container events (XPUB/SUB for informational)
  - Route critical events (ROUTER/DEALER with ack) ([[user_container_events]])
- **Agent Harness (LangChain/LangGraph):** ([[agent_harness]])
  - Stateless LLM orchestration
  - MCP client connector to user containers
  - RAG retrieval from Qdrant (global + user-specific knowledge)
  - Model routing based on license tier and complexity
  - Skills and subagents framework
  - Workflow state machines with validation loops

**Key Features:**
- **Stateless design:** All conversation state lives in user containers or Qdrant
- **Multi-channel support:** WebSocket, Telegram (future: mobile, Discord, Slack)
- **Kubernetes-native:** Uses the k8s API for container management
- **Three-tier memory:**
  - Redis: Hot storage, active sessions, LangGraph checkpoints (1-hour TTL)
  - Qdrant: Vector search, RAG, global + user knowledge, GDPR-compliant
  - Iceberg: Cold storage, full history, analytics, time-travel queries

**Infrastructure:**
- Deployed in the `dexorder-system` namespace
- RBAC: Can create but not delete user containers
- Network policies: Access to the k8s API, user containers, and infrastructure

---

### 2. User Containers

**Location:** `client-py/`
**Language:** Python
**Purpose:** Per-user isolated workspace and data storage

**Architecture:**
- One pod per user (auto-provisioned by the gateway)
- Persistent storage (PVC) for user data
- Multi-container pod:
  - **Agent container:** MCP server + event publisher + user files
  - **Lifecycle sidecar:** Auto-shutdown and cleanup

**Components:**

#### MCP Server
Exposes user-specific resources and tools via the Model Context Protocol.

**Resources (Context for LLM):**
The gateway fetches these before each LLM call:
- `context://user-profile` - Trading preferences, style, risk tolerance
- `context://conversation-summary` - Recent conversation with semantic context
- `context://workspace-state` - Current chart, watchlist, positions, alerts
- `context://system-prompt` - User's custom AI instructions

**Tools (Actions with side effects):**
The gateway proxies these to the user's MCP server:
- `save_message(role, content)` - Save to conversation history
- `search_conversation(query)` - Semantic search over past conversations
- `list_strategies()`, `read_strategy(name)`, `write_strategy(name, code)`
- `list_indicators()`, `read_indicator(name)`, `write_indicator(name, code)`
- `run_backtest(strategy, params)` - Execute a backtest
- `get_watchlist()`, `execute_trade(params)`, `get_positions()`
- `run_python(code)` - Execute Python with data science libraries

**User Files:**
- `indicators/*.py` - Custom technical indicators
- `strategies/*.py` - Trading strategies with entry/exit rules
- Watchlists and preferences
- Git-versioned in the persistent volume

#### Event Publisher ([[user_container_events]])
Publishes user events (order fills, alerts, workspace changes) via dual-channel ZMQ:
- **XPUB:** Informational events (fire-and-forget to active sessions)
- **DEALER:** Critical events (guaranteed delivery with ack)

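The XPUB half of this design relies on a ZeroMQ behavior: on an XPUB socket, every subscribe arrives as a frame whose first byte is `0x01` (or `0x00` for unsubscribe) followed by the topic prefix. The sketch below shows how a publisher could track listeners from those frames; the class and method names are illustrative, not the actual `client-py` API.

```python
class SubscriptionTracker:
    """Tracks which topic prefixes currently have at least one listener."""

    def __init__(self):
        self._topics: dict[bytes, int] = {}  # topic prefix -> subscriber count

    def handle_frame(self, frame: bytes) -> None:
        """Process one subscription frame received on the XPUB socket."""
        op, topic = frame[0], frame[1:]
        if op == 1:  # subscribe
            self._topics[topic] = self._topics.get(topic, 0) + 1
        elif op == 0:  # unsubscribe
            count = self._topics.get(topic, 0) - 1
            if count > 0:
                self._topics[topic] = count
            else:
                self._topics.pop(topic, None)

    def has_listeners(self, topic: bytes) -> bool:
        """Publish an informational event only when this returns True.

        ZMQ SUB matching is by prefix, so any subscribed prefix of the
        event topic counts as a listener.
        """
        return any(topic.startswith(prefix) for prefix in self._topics)
```

With this in place the publisher can skip serializing and sending informational events entirely when no session is attached, which is what makes the channel safely fire-and-forget.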
#### Lifecycle Manager ([[container_lifecycle_management]])
Tracks activity and triggers; auto-shuts down when idle:
- Configurable idle timeouts by license tier
- Exit code 42 signals intentional shutdown
- The sidecar deletes the deployment and optionally the PVC

**Isolation:**
- Network policies: Cannot access the k8s API, other users, or system services
- PodSecurity: Non-root, read-only rootfs, dropped capabilities
- Resource limits enforced by license tier

---

### 3. Data Pipeline

#### Relay (ZMQ Router)

**Location:** `relay/`
**Language:** Rust
**Purpose:** Stateless message router for market data and requests

**Architecture:**
- Well-known bind point (all components connect to it)
- No request tracking or state
- Topic-based routing

**Channels:**
1. **Client Requests (ROUTER):** Port 5559 - Historical data requests
2. **Ingestor Work Queue (PUB):** Port 5555 - Work distribution with exchange prefix
3. **Market Data Fanout (XPUB/XSUB):** Port 5558 - Realtime data + notifications
4. **Responses (SUB → PUB proxy):** Notifications from Flink to clients

See [[protocol]] for detailed ZMQ patterns and message formats.

---

#### Ingestors

**Location:** `ingestor/`
**Language:** Python
**Purpose:** Fetch market data from exchanges

**Features:**
- CCXT-based exchange adapters
- Subscribes to the work queue via exchange prefix (e.g., `BINANCE:`)
- Writes raw data to Kafka only (no direct client responses)
- Supports realtime ticks and historical OHLC

**Data Flow:**
```
Exchange API → Ingestor → Kafka → Flink → Iceberg
                                    ↓
                    Notification → Relay → Clients
```

---

#### Kafka

**Deployment:** KRaft mode (no ZooKeeper)
**Purpose:** Durable event log and stream-processing backbone

**Topics:**
- Raw market data streams (per exchange/symbol)
- Processed OHLC data
- Notification events
- User events (orders, alerts)

**Retention:**
- Configurable per topic (default: 7 days for raw data)
- Longer retention for aggregated data

---

#### Flink

**Deployment:** JobManager + TaskManager(s)
**Purpose:** Stream processing and aggregation

**Jobs:**
1. **Deduplication:** Remove duplicate ticks from multiple ingestors
2. **OHLC Aggregation:** Build candles from tick streams
3. **CEP (Complex Event Processing):** Pattern detection and alerts
4. **Iceberg Writer:** Batch write to long-term storage
5. **Notification Publisher:** ZMQ PUB for async client notifications

**State:**
- Checkpointing to MinIO (S3-compatible)
- Exactly-once processing semantics

**Scaling:**
- Multiple TaskManagers for parallelism
- Headless service for ZMQ discovery (see [[protocol#TODO: Flink-to-Relay ZMQ Discovery]])

---

#### Apache Iceberg

**Deployment:** REST catalog with PostgreSQL backend
**Purpose:** Historical data lake for OHLC and analytics

**Features:**
- Schema evolution
- Time-travel queries
- Partitioning by date/symbol
- Efficient columnar storage (Parquet)

**Storage:** MinIO (S3-compatible object storage)

---

### 4. Infrastructure Services

#### DragonflyDB
- Redis-compatible in-memory cache
- Session state, rate limiting, hot data

#### Qdrant
- Vector database for RAG
- **Global knowledge** (user_id="0"): Platform capabilities, trading concepts, strategy patterns
- **User knowledge** (user_id=specific): Personal conversations, preferences, strategies
- GDPR-compliant (indexed by user_id for fast deletion)

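A RAG query against this layout has to match both the user's own points and the shared knowledge under user_id="0". A minimal sketch of the payload filter, written as a plain dict in the shape of Qdrant's REST filter schema (the `user_id` field name comes from the layout above; the function name and sentinel constant are illustrative):

```python
GLOBAL_USER_ID = "0"  # shared platform knowledge lives under this user_id

def knowledge_filter(user_id: str) -> dict:
    """Build a Qdrant payload filter matching user_id IN (user_id, "0")."""
    return {
        "should": [  # OR semantics: either condition may match
            {"key": "user_id", "match": {"value": user_id}},
            {"key": "user_id", "match": {"value": GLOBAL_USER_ID}},
        ]
    }
```

The same `user_id` payload index that makes this filter fast also supports the GDPR path: deleting a user is a single delete-by-filter on `user_id`.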
#### PostgreSQL
- Iceberg catalog metadata
- User accounts and license info (gateway)
- Per-user data lives in user containers

#### MinIO
- S3-compatible object storage
- Iceberg table data
- Flink checkpoints
- User file uploads

---

## Data Flow Patterns

### Historical Data Query (Async)

```
1. Client → Gateway → User Container MCP: User requests data
2. Gateway → Relay (REQ/REP): Submit historical request
3. Relay → Ingestors (PUB/SUB): Broadcast work with exchange prefix
4. Ingestor → Exchange API: Fetch data
5. Ingestor → Kafka: Write OHLC batch with metadata
6. Flink → Kafka: Read, process, dedupe
7. Flink → Iceberg: Write to table
8. Flink → Relay (PUB): Publish HistoryReadyNotification
9. Relay → Client (SUB): Notification delivered
10. Client → Iceberg: Query data directly
```

**Key Design:**
- Client subscribes to the notification topic BEFORE submitting the request (prevents a race)
- Notification topics are deterministic: `RESPONSE:{client_id}` or `HISTORY_READY:{request_id}`
- No state in the Relay (fully topic-based routing)

See [[protocol#Historical Data Query Flow]] for details.

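The subscribe-before-submit ordering can be sketched as follows. The topic string follows the deterministic `HISTORY_READY:{request_id}` format above; the socket arguments are duck-typed stand-ins for pyzmq SUB/request sockets, and the function names are illustrative.

```python
def history_topic(request_id: str) -> bytes:
    """Deterministic notification topic for one historical request."""
    return f"HISTORY_READY:{request_id}".encode()

def submit_history_request(sub_socket, req_socket,
                           request_id: str, payload: bytes) -> None:
    """Subscribe first, then submit, so the notification cannot be missed.

    If the request were sent first, a fast pipeline could publish the
    HistoryReadyNotification before the subscription is in place and the
    client would wait forever.
    """
    sub_socket.subscribe(history_topic(request_id))  # 1. open the mailbox
    req_socket.send(payload)                         # 2. only then ask
```

Because the topic is a pure function of `request_id`, any gateway replica (or a reconnecting client) can re-derive it without shared state, which is what keeps the Relay stateless.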
---

### Realtime Market Data

```
1. Ingestor → Kafka: Write realtime ticks
2. Flink → Kafka: Read and aggregate OHLC
3. Flink → Relay (PUB): Publish market data
4. Relay → Clients (XPUB/SUB): Fanout to subscribers
```

**Topic Format:** `{exchange}:{symbol}|{data_type}` (e.g., `BINANCE:BTC/USDT|tick`)

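The topic layout above can be captured in two small helpers. Only the `{exchange}:{symbol}|{data_type}` format comes from this document; the function names are illustrative.

```python
def make_topic(exchange: str, symbol: str, data_type: str) -> str:
    """Build a market-data topic, e.g. BINANCE:BTC/USDT|tick."""
    return f"{exchange}:{symbol}|{data_type}"

def parse_topic(topic: str) -> tuple[str, str, str]:
    """Split a topic back into (exchange, symbol, data_type)."""
    ticker, data_type = topic.rsplit("|", 1)  # data_type is after the last |
    exchange, symbol = ticker.split(":", 1)   # symbol itself may contain /
    return exchange, symbol, data_type
```

Since ZMQ SUB matching is by prefix, a client that subscribes to `BINANCE:BTC/USDT` receives every data type for that market, while `BINANCE:BTC/USDT|tick` narrows to ticks only.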
---

### User Events

User containers emit events (order fills, alerts) that must reach users reliably.

**Dual-Channel Design:**

1. **Informational Events (XPUB/SUB):**
   - Container tracks active subscriptions via XPUB
   - Publishes only if someone is listening
   - Low latency, fire-and-forget

2. **Critical Events (DEALER/ROUTER):**
   - Container sends to the gateway ROUTER with an event ID
   - Gateway delivers via Telegram/email/push
   - Gateway sends an EventAck back to the container
   - Container retries on timeout
   - Persisted to disk on shutdown

See [[user_container_events]] for implementation.

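The retry half of the critical channel can be sketched as a resend-until-ack loop with exponential backoff. The `send`/`ack_received` callables stand in for the DEALER socket and the EventAck check; the retry counts and delays are illustrative defaults, not values from the implementation.

```python
import time

def deliver_critical_event(send, ack_received, event_id: str,
                           max_attempts: int = 5,
                           base_delay: float = 0.5) -> bool:
    """Resend a critical event until the gateway acks it.

    Returns True once an EventAck for event_id arrives, False if all
    attempts are exhausted (at which point the event should be persisted
    to disk so it survives a shutdown).
    """
    for attempt in range(max_attempts):
        send(event_id)                           # (re)send over DEALER
        if ack_received(event_id):               # EventAck from gateway
            return True
        time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    return False
```

Deduplication by `event_id` on the gateway side is what makes this at-least-once loop safe: a duplicate resend is delivered at most once to the user.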
---

## Container Lifecycle

### Creation ([[gateway_container_creation]])

```
User authenticates → Gateway checks if deployment exists
  → If missing, create from template (based on license tier)
  → Wait for ready (2min timeout)
  → Return MCP endpoint
```

**Templates by Tier:**

| Tier | Memory | CPU | Storage | Idle Timeout |
|------|--------|-----|---------|--------------|
| Free | 512Mi | 500m | 1Gi | 15min |
| Pro | 2Gi | 2000m | 10Gi | 60min |
| Enterprise | 4Gi | 4000m | 50Gi | Never |

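A tier lookup mirroring the table above might look like this. The resource values come from the table; the dict layout and function name are illustrative (the gateway itself is TypeScript), and `None` stands for the "Never" idle timeout.

```python
TIER_TEMPLATES = {
    "free":       {"memory": "512Mi", "cpu": "500m",  "storage": "1Gi",
                   "idle_timeout_min": 15},
    "pro":        {"memory": "2Gi",   "cpu": "2000m", "storage": "10Gi",
                   "idle_timeout_min": 60},
    "enterprise": {"memory": "4Gi",   "cpu": "4000m", "storage": "50Gi",
                   "idle_timeout_min": None},  # never shut down for idleness
}

def template_for(tier: str) -> dict:
    """Resolve the pod resource template for a license tier.

    Unknown tiers fall back to the cheapest (free) template so a bad
    license record can never over-provision.
    """
    return TIER_TEMPLATES.get(tier.lower(), TIER_TEMPLATES["free"])
```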
---

### Lifecycle Management ([[container_lifecycle_management]])

**Idle Detection:**
- A container is idle when: no active triggers + no recent MCP activity
- The lifecycle manager tracks:
  - MCP tool/resource calls (reset the idle timer)
  - Active triggers (data subscriptions, CEP patterns)

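The idle rule above (no active triggers AND no recent MCP activity) can be sketched as a small tracker. Clock injection keeps the logic testable; the class and method names are illustrative, not the actual `client-py` API.

```python
import time

class IdleTracker:
    """A container is idle only when it has no active triggers and its
    last MCP activity is older than the configured timeout."""

    def __init__(self, idle_timeout_s: float, now=None):
        self._now = now or time.monotonic   # injectable clock for tests
        self._timeout = idle_timeout_s
        self._last_activity = self._now()
        self._active_triggers: set[str] = set()

    def record_mcp_call(self) -> None:
        """Any MCP tool/resource call resets the idle timer."""
        self._last_activity = self._now()

    def add_trigger(self, trigger_id: str) -> None:
        self._active_triggers.add(trigger_id)

    def remove_trigger(self, trigger_id: str) -> None:
        self._active_triggers.discard(trigger_id)

    def is_idle(self) -> bool:
        return (not self._active_triggers
                and self._now() - self._last_activity >= self._timeout)
```

When `is_idle()` turns true, the container exits with code 42 and the sidecar takes over cleanup, as described in the Shutdown steps.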
**Shutdown:**
- On idle timeout: exit with code 42
- The lifecycle sidecar detects exit code 42
- The sidecar calls the k8s API to delete the deployment
- Optionally deletes the PVC (anonymous users only)

**Security:**
- The sidecar has RBAC to delete its own deployment only
- Cannot delete other deployments or access other namespaces
- The gateway cannot delete deployments (separation of concerns)

---

## Security Architecture

### Network Isolation

**NetworkPolicies:**
- User containers:
  - ✅ Connect to gateway (MCP)
  - ✅ Connect to relay (market data)
  - ✅ Outbound HTTPS (exchanges, LLM APIs)
  - ❌ No k8s API access
  - ❌ No system namespace access
  - ❌ No inter-user communication

- Gateway:
  - ✅ k8s API (create containers)
  - ✅ User containers (MCP client)
  - ✅ Infrastructure (Postgres, Redis)
  - ✅ Outbound (Anthropic API)

---

### RBAC

**Gateway ServiceAccount:**
- Create deployments/services/PVCs in the `dexorder-agents` namespace
- Read pod status and logs
- Cannot delete, exec, or access secrets

**Lifecycle Sidecar ServiceAccount:**
- Delete deployments in the `dexorder-agents` namespace
- Delete PVCs (conditional on user type)
- Cannot access other resources

---

### Admission Control

All pods in the `dexorder-agents` namespace must:
- Use approved images only (allowlist)
- Run as non-root
- Drop all capabilities
- Use a read-only root filesystem
- Have resource limits

See `deploy/k8s/base/admission-policy.yaml`.

---

## Agent Harness Flow

The gateway's agent harness (LangChain/LangGraph) orchestrates LLM interactions with full context.

```
1. User sends message → Gateway (WebSocket/Telegram)
         ↓
2. Authenticator validates user and gets license info
         ↓
3. Container Manager ensures user's MCP container is running
         ↓
4. Agent Harness processes message:
   │
   ├─→ a. MCPClientConnector fetches context resources from user's MCP:
   │      - context://user-profile
   │      - context://conversation-summary
   │      - context://workspace-state
   │      - context://system-prompt
   │
   ├─→ b. RAGRetriever searches Qdrant for relevant memories:
   │      - Embeds user query
   │      - Searches: user_id IN (current_user, "0")
   │      - Returns user-specific + global platform knowledge
   │
   ├─→ c. Build system prompt:
   │      - Base platform prompt
   │      - User profile context
   │      - Workspace state
   │      - Custom user instructions
   │      - Relevant RAG memories
   │
   ├─→ d. ModelRouter selects LLM:
   │      - Based on license tier
   │      - Query complexity
   │      - Routing strategy (cost/speed/quality)
   │
   ├─→ e. LLM invocation with tool support:
   │      - Send messages to LLM
   │      - If tool calls requested:
   │        • Platform tools → handled by gateway
   │        • User tools → proxied to MCP container
   │      - Loop until no more tool calls
   │
   ├─→ f. Save conversation to MCP:
   │      - mcp.callTool('save_message', user_message)
   │      - mcp.callTool('save_message', assistant_message)
   │
   └─→ g. Return response to user via channel
```

**Key Architecture:**
- **Gateway is stateless:** No conversation history stored in the gateway
- **User context in MCP:** All user-specific data lives in the user's container
- **Global knowledge in Qdrant:** Platform documentation loaded from `gateway/knowledge/`
- **RAG at gateway level:** Semantic search combines global + user knowledge
- **Skills vs Subagents:**
  - Skills: Well-defined, single-purpose tasks
  - Subagents: Complex domain expertise with multi-file context
- **Workflows:** LangGraph state machines for multi-step processes

See [[agent_harness]] for detailed implementation.

---

## Configuration Management

All services use dual YAML files:
- `config.yaml` - Non-sensitive configuration (mounted from a ConfigMap)
- `secrets.yaml` - Credentials and tokens (mounted from a Secret)

**Environment Variables:**
- K8s downward API for pod metadata
- Service discovery via DNS (e.g., `kafka:9092`)

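One way to combine the two files is a deep merge where values from `secrets.yaml` overlay `config.yaml`. In a pod the two dicts would come from `yaml.safe_load()` on the mounted files; the sketch below shows just the merge rule on plain dicts, and the function name is illustrative.

```python
def deep_merge(base: dict, overlay: dict) -> dict:
    """Recursively merge overlay into base; overlay wins on conflicts.

    Neither input is mutated, so the same base config can be merged
    against different secrets in tests.
    """
    merged = dict(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # recurse into maps
        else:
            merged[key] = value                           # overlay wins
    return merged
```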
---

## Deployment

### Development

```bash
# Start local k8s
minikube start

# Apply infrastructure
kubectl apply -k deploy/k8s/dev

# Build and load images
docker build -t dexorder/gateway:latest gateway/
minikube image load dexorder/gateway:latest

# Port-forward for access
kubectl port-forward -n dexorder-system svc/gateway 3000:3000
```

---

### Production

```bash
# Push images to registry (before applying, so pods can pull them)
docker push ghcr.io/dexorder/gateway:latest
docker push ghcr.io/dexorder/agent:latest
docker push ghcr.io/dexorder/lifecycle-sidecar:latest

# Apply production configs
kubectl apply -k deploy/k8s/prod
```

**Namespaces:**
- `dexorder-system` - Platform services (gateway, infrastructure)
- `dexorder-agents` - User containers (isolated)

---

## Observability

### Metrics (Prometheus)
- Container creation/deletion rates
- Idle shutdown counts
- MCP call latency and errors
- Event delivery rates and retries
- Kafka lag and throughput
- Flink checkpoint duration

### Logging
- Structured JSON logs
- User ID in all agent logs
- Aggregated via Loki or CloudWatch

### Tracing
- OpenTelemetry spans across gateway → MCP → LLM
- User-scoped traces for debugging

---

## Scalability

### Horizontal Scaling

**Stateless Components:**
- Gateway: Add replicas behind a load balancer
- Relay: Single instance (stateless router)
- Ingestors: Scale by exchange workload

**Stateful Components:**
- Flink: Scale TaskManagers
- User containers: One per user (thousands of pods)

**Bottlenecks:**
- Flink → Relay ZMQ: Requires a discovery protocol (see [[protocol#TODO: Flink-to-Relay ZMQ Discovery]])
- Kafka: Partition by symbol for parallelism
- Iceberg: Partition by date/symbol

---

### Cost Optimization

**Tiered Resources:**
- Free users: Aggressive idle shutdown (15min)
- Pro users: Longer timeout (60min)
- Enterprise: Always-on containers

**Storage:**
- PVC deletion for anonymous users
- Tiered storage classes (fast SSD → cheap HDD)

**LLM Costs:**
- Rate limiting per license tier
- Caching of MCP resources (1-5min TTL)
- Conversation summarization to reduce context size

---

## Development Roadmap

See [[backend_redesign]] for detailed notes.

**Phase 1: Foundation (Complete)**
- Gateway with k8s integration
- User container provisioning
- MCP protocol implementation
- Basic market data pipeline

**Phase 2: Data Pipeline (In Progress)**
- Kafka topic schemas
- Flink jobs for aggregation
- Iceberg integration
- Historical backfill service

**Phase 3: Agent Features**
- RAG integration (Qdrant)
- Strategy backtesting
- Risk management tools
- Portfolio analytics

**Phase 4: Production Hardening**
- Multi-region deployment
- HA for infrastructure
- Comprehensive monitoring
- Performance optimization

---

## Related Documentation

- [[protocol]] - ZMQ message protocols and data flow
- [[gateway_container_creation]] - Dynamic container provisioning
- [[container_lifecycle_management]] - Idle shutdown and cleanup
- [[user_container_events]] - Event system implementation
- [[agent_harness]] - LLM orchestration flow
- [[m_c_p_tools_architecture]] - User MCP tools specification
- [[user_mcp_resources]] - Context resources and RAG
- [[m_c_p_client_authentication_modes]] - MCP authentication patterns
- [[backend_redesign]] - Design notes and TODO items