redesign fully scaffolded and web login works

This commit is contained in:
2026-03-17 20:10:47 -04:00
parent b9cc397e05
commit f6bd22a8ef
143 changed files with 17317 additions and 693 deletions

doc/agent_harness.md Normal file

@@ -0,0 +1,392 @@
# Agent Harness Architecture
The Agent Harness is the core orchestration layer for the Dexorder AI platform, built on LangChain.js and LangGraph.js.
## Architecture Overview
```
┌─────────────────────────────────────────────────────────────┐
│ Gateway (Fastify) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ WebSocket │ │ Telegram │ │ Event │ │
│ │ Handler │ │ Handler │ │ Router │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ └──────────────────┴──────────────────┘ │
│ │ │
│ ┌───────▼────────┐ │
│ │ Agent Harness │ │
│ │ (Stateless) │ │
│ └───────┬────────┘ │
│ │ │
│ ┌──────────────────┼──────────────────┐ │
│ │ │ │ │
│ ┌────▼─────┐ ┌────▼─────┐ ┌────▼─────┐ │
│ │ MCP │ │ LLM │ │ RAG │ │
│ │ Connector│ │ Router │ │ Retriever│ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │
└─────────┼──────────────────┼──────────────────┼─────────────┘
│ │ │
▼ ▼ ▼
┌────────────┐ ┌───────────┐ ┌───────────┐
│ User's │ │ LLM │ │ Qdrant │
│ MCP │ │ Providers │ │ (Vectors) │
│ Container │ │(Anthropic,│ │ │
│ (k8s pod) │ │ OpenAI, │ │ Global + │
│ │ │ etc) │ │ User │
└────────────┘ └───────────┘ └───────────┘
```
## Message Processing Flow
When a user sends a message:
```
1. Gateway receives message via channel (WebSocket/Telegram)
2. Authenticator validates user and gets license info
3. Container Manager ensures user's MCP container is running
4. Agent Harness processes message:
├─→ a. MCPClientConnector fetches context resources:
│ - context://user-profile
│ - context://conversation-summary
│ - context://workspace-state
│ - context://system-prompt
├─→ b. RAGRetriever searches for relevant memories:
│ - Embeds user query
│ - Searches Qdrant: user_id = current_user OR user_id = "0"
│ - Returns user-specific + global platform knowledge
├─→ c. Build system prompt:
│ - Base platform prompt
│ - User profile context
│ - Workspace state
│ - Custom user instructions
│ - Relevant RAG memories
├─→ d. ModelRouter selects LLM:
│ - Based on license tier
│ - Query complexity
│ - Configured routing strategy
├─→ e. LLM invocation with tool support:
│ - Send messages to LLM
│ - If tool calls requested:
│ • Platform tools → handled by gateway
│ • User tools → proxied to MCP container
│ - Loop until no more tool calls
├─→ f. Save conversation to MCP:
│ - mcp.callTool('save_message', user_message)
│ - mcp.callTool('save_message', assistant_message)
└─→ g. Return response to user via channel
```
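The loop above can be sketched as a single stateless function. This is an illustrative skeleton, not the actual harness code: `McpClient`, `Llm`, and the reply shape are hypothetical stand-ins for the real MCP client and LLM bindings, and the RAG and model-routing steps (b and d) are omitted for brevity.

```typescript
// Hypothetical interfaces standing in for the real MCP client and LLM bindings.
interface McpClient {
  readResource(uri: string): Promise<string>;
  callTool(name: string, args: unknown): Promise<string>;
}
interface LlmReply { text: string; toolCalls: { name: string; args: unknown }[] }
type Llm = (system: string, messages: string[]) => Promise<LlmReply>;

async function handleMessage(mcp: McpClient, llm: Llm, userMessage: string): Promise<string> {
  // a. Fetch context resources from the user's MCP container
  const uris = ["context://user-profile", "context://conversation-summary",
                "context://workspace-state", "context://system-prompt"];
  const context = await Promise.all(uris.map((u) => mcp.readResource(u)));
  // c. Build the system prompt (RAG memories omitted in this sketch)
  const system = ["BASE_PROMPT", ...context].join("\n\n");
  const messages = [userMessage];
  // e. Loop until the model stops requesting tools
  let reply = await llm(system, messages);
  while (reply.toolCalls.length > 0) {
    for (const call of reply.toolCalls) {
      // User tools are proxied to the MCP container; platform tools
      // would be handled locally by the gateway.
      messages.push(await mcp.callTool(call.name, call.args));
    }
    reply = await llm(system, messages);
  }
  // f. Persist both sides of the exchange
  await mcp.callTool("save_message", { role: "user", content: userMessage });
  await mcp.callTool("save_message", { role: "assistant", content: reply.text });
  return reply.text;
}
```

Because all state comes in through `mcp` and goes back out through `save_message`, any gateway replica can serve any user's next message.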
## Core Components
### 1. Agent Harness (`gateway/src/harness/agent-harness.ts`)
**Stateless orchestrator**: all state lives in the user's MCP container or in RAG.
**Responsibilities:**
- Fetch context from user's MCP resources
- Query RAG for relevant memories
- Build prompts with full context
- Route to appropriate LLM
- Handle tool calls (platform vs user)
- Save conversation back to MCP
- Stream responses to user
**Key Methods:**
- `handleMessage()`: Process single message (non-streaming)
- `streamMessage()`: Process with streaming response
- `initialize()`: Connect to user's MCP server
### 2. MCP Client Connector (`gateway/src/harness/mcp-client.ts`)
Connects to the user's MCP container using the Model Context Protocol.
**Features:**
- Resource reading (context://, indicators://, strategies://)
- Tool execution (save_message, run_backtest, etc.)
- Automatic reconnection on container restarts
- Error handling and fallbacks
### 3. Model Router (`gateway/src/llm/router.ts`)
Routes queries to appropriate LLM based on:
- **License tier**: Free users → smaller models, paid → better models
- **Complexity**: Simple queries → fast models, complex → powerful models
- **Cost optimization**: Balance performance vs cost
**Routing Strategies:**
- `COST`: Minimize cost
- `COMPLEXITY`: Match model to query complexity
- `SPEED`: Prioritize fast responses
- `QUALITY`: Best available model
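A routing table along these lines would implement the strategies above. The tier names, strategy names, and especially the model IDs below are placeholders for illustration, not the platform's actual configuration:

```typescript
type Tier = "free" | "pro" | "enterprise";
type Strategy = "COST" | "COMPLEXITY" | "SPEED" | "QUALITY";

// Illustrative routing sketch; model IDs are placeholders.
function selectModel(tier: Tier, strategy: Strategy, complexity = 0): string {
  // Free tier is capped at the small model regardless of strategy.
  if (tier === "free") return "small-model";
  switch (strategy) {
    case "COST": return "small-model";
    case "SPEED": return "fast-model";
    case "QUALITY": return "best-model";
    case "COMPLEXITY":
      // Match the model to an estimated query-complexity score in [0, 1].
      return complexity > 0.7 ? "best-model" : "fast-model";
  }
}
```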
### 4. Memory Layer
#### Three-Tier Storage:
**Redis** (Hot Storage)
- Active session state
- Recent conversation history (last 50 messages)
- LangGraph checkpoints (1 hour TTL)
- Fast reads for active conversations
**Qdrant** (Vector Search)
- Conversation embeddings
- User-specific memories (user_id = actual user ID)
- **Global platform knowledge** (user_id = "0")
- RAG retrieval with cosine similarity
- GDPR-compliant (indexed by user_id for fast deletion)
**Iceberg** (Cold Storage)
- Full conversation history (partitioned by user_id, session_id)
- Checkpoint snapshots for replay
- Analytics and time-travel queries
- GDPR-compliant with compaction
#### RAG System:
**Global Knowledge** (user_id="0"):
- Platform capabilities and architecture
- Trading concepts and fundamentals
- Indicator development guides
- Strategy patterns and examples
- Loaded from `gateway/knowledge/` markdown files
**User Knowledge** (user_id=specific user):
- Personal conversation history
- Trading preferences and style
- Custom indicators and strategies
- Workspace state and context
**Query Flow:**
1. User query is embedded using EmbeddingService
2. Qdrant searches: `user_id IN (current_user, "0")`
3. Top-K relevant chunks returned
4. Added to LLM context automatically
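The `user_id IN (current_user, "0")` condition in step 2 maps naturally onto a Qdrant `should` (OR) filter. The sketch below builds that filter object; the payload field name `user_id` follows this document, and the filter shape is Qdrant's standard JSON filter format:

```typescript
// Filter for step 2: match the current user's memories OR the global
// knowledge partition (user_id = "0").
function buildMemoryFilter(userId: string) {
  return {
    should: [
      { key: "user_id", match: { value: userId } },
      { key: "user_id", match: { value: "0" } },
    ],
  };
}
```

The same `user_id` index that powers this filter also makes GDPR deletion a single filtered delete.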
### 5. Skills vs Subagents
#### Skills (`gateway/src/harness/skills/`)
**Use for**: Well-defined, specific tasks
- Market analysis
- Strategy validation
- Single-purpose capabilities
- Defined in markdown + TypeScript
**Structure:**
```typescript
class MarketAnalysisSkill extends BaseSkill {
async execute(context, parameters) {
// Implementation
}
}
```
#### Subagents (`gateway/src/harness/subagents/`)
**Use for**: Complex domain expertise with context
- Code reviewer with review guidelines
- Risk analyzer with risk models
- Multi-file knowledge base in memory/ directory
- Custom system prompts
**Structure:**
```
subagents/
code-reviewer/
config.yaml # Model, memory files, capabilities
system-prompt.md # Specialized instructions
memory/
review-guidelines.md
common-patterns.md
best-practices.md
index.ts
```
**Recommendation**: Prefer skills for most tasks. Use subagents when you need:
- Substantial domain-specific knowledge
- Multi-file context management
- Specialized system prompts
### 6. Workflows (`gateway/src/harness/workflows/`)
LangGraph state machines for multi-step orchestration:
**Features:**
- Validation loops (retry with fixes)
- Human-in-the-loop (approval gates)
- Error recovery
- State persistence via checkpoints
**Example Workflows:**
- Strategy validation: review → backtest → risk → approval
- Trading request: analysis → risk → approval → execute
## User Context Structure
Every interaction includes rich context:
```typescript
interface UserContext {
userId: string;
sessionId: string;
license: UserLicense;
// Multi-channel support
activeChannel: {
type: 'websocket' | 'telegram' | 'slack' | 'discord';
channelUserId: string;
capabilities: {
supportsMarkdown: boolean;
supportsImages: boolean;
supportsButtons: boolean;
maxMessageLength: number;
};
};
// Retrieved from MCP + RAG
conversationHistory: BaseMessage[];
relevantMemories: MemoryChunk[];
workspaceState: WorkspaceContext;
}
```
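The `capabilities` block above lets the harness adapt one response to many channels. A hypothetical helper (not part of the actual codebase) might strip markdown where it is unsupported and respect the channel's length limit:

```typescript
interface ChannelCapabilities {
  supportsMarkdown: boolean;
  maxMessageLength: number;
}

// Hypothetical rendering helper driven by UserContext.activeChannel.capabilities.
function renderForChannel(text: string, caps: ChannelCapabilities): string {
  // Drop basic markdown emphasis if the channel can't render it.
  const body = caps.supportsMarkdown ? text : text.replace(/[*_`]/g, "");
  // Truncate with an ellipsis when over the channel's limit.
  return body.length <= caps.maxMessageLength
    ? body
    : body.slice(0, caps.maxMessageLength - 1) + "…";
}
```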
## User-Specific Files and Tools
The user's MCP container provides access to:
**Indicators** (`indicators/*.py`)
- Custom technical indicators
- Pure functions: DataFrame → Series/DataFrame
- Version controlled in user's git repo
**Strategies** (`strategies/*.py`)
- Trading strategies with entry/exit rules
- Position sizing and risk management
- Backtestable and deployable
**Watchlists**
- Saved ticker lists
- Market monitoring
**Preferences**
- Trading style and risk tolerance
- Chart settings and colors
- Notification preferences
**Executors** (sub-strategies)
- Tactical order generators (TWAP, iceberg, etc.)
- Smart order routing
## Global Knowledge Management
### Document Loading
At gateway startup:
1. DocumentLoader scans `gateway/knowledge/` directory
2. Markdown files chunked by headers (~1000 tokens/chunk)
3. Embeddings generated via EmbeddingService
4. Stored in Qdrant with user_id="0"
5. Content hashing enables incremental updates
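Steps 2 and 5 can be sketched together: split a markdown document at headers and hash each chunk, so an unchanged chunk can be skipped on reload. The header-split rule below is a simplification (the real loader targets roughly 1000 tokens per chunk):

```typescript
import { createHash } from "node:crypto";

// Simplified sketch of the document loader's chunk-and-hash step.
function chunkByHeaders(markdown: string): { text: string; hash: string }[] {
  return markdown
    .split(/^(?=#{1,3} )/m)             // new chunk at each #, ##, ### header
    .filter((c) => c.trim().length > 0) // drop empty fragments
    .map((text) => ({
      text,
      // Stable content hash enables incremental updates in Qdrant.
      hash: createHash("sha256").update(text).digest("hex"),
    }));
}
```

On reload, a chunk whose hash already exists in Qdrant needs no re-embedding.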
### Directory Structure
```
gateway/knowledge/
├── platform/ # Platform capabilities
├── trading/ # Trading fundamentals
├── indicators/ # Indicator development
└── strategies/ # Strategy patterns
```
### Updating Knowledge
**Development:**
```bash
curl -X POST http://localhost:3000/admin/reload-knowledge
```
**Production:**
- Update markdown files
- Deploy new version
- Auto-loaded on startup
**Monitoring:**
```bash
curl http://localhost:3000/admin/knowledge-stats
```
## Container Lifecycle
### User Container Creation
When user connects:
1. Gateway checks if container exists (ContainerManager)
2. If not, creates Kubernetes pod with:
- Agent container (Python + conda)
- Lifecycle sidecar (container management)
- Persistent volume (git repo)
3. Waits for MCP server ready (~5-10s cold start)
4. Establishes MCP connection
5. Begins message processing
### Container Shutdown
**Free users:** 15-minute idle timeout
**Paid users:** longer timeouts based on license tier
**On shutdown:**
- Graceful save of all state
- Persistent storage retained
- Fast restart on next connection
### MCP Authentication Modes
1. **Public Mode** (Free tier): No auth, read-only, anonymous session
2. **Gateway Auth** (Standard): Gateway authenticates, container trusts gateway
3. **Direct Auth** (Enterprise): User authenticates directly with container
## Implementation Status
### ✅ Completed
- Agent Harness with MCP integration
- Model routing with license tiers
- RAG retriever with Qdrant
- Document loader for global knowledge
- EmbeddingService (Ollama/OpenAI)
- Skills and subagents framework
- Multi-channel support (WebSocket, Telegram)
- Container lifecycle management
- Event system with ZeroMQ
### 🚧 In Progress
- Iceberg integration (checkpoint-saver, conversation-store)
- More subagents (risk-analyzer, market-analyst)
- LangGraph workflows with interrupts
- Platform tools (market data, charting)
### 📋 Planned
- File watcher for hot-reload in development
- Advanced RAG strategies (hybrid search, re-ranking)
- Caching layer for expensive operations
- Performance monitoring and metrics
## References
- Implementation: `gateway/src/harness/`
- Documentation: `gateway/src/harness/README.md`
- Knowledge base: `gateway/knowledge/`
- LangGraph: https://langchain-ai.github.io/langgraphjs/
- Qdrant: https://qdrant.tech/documentation/
- MCP Spec: https://modelcontextprotocol.io/


@@ -1,21 +0,0 @@
┌─────────────────────────────────────────────────┐
│ Agent Harness (your servers) │
│ │
│ on_message(user_id, message): │
│ 1. Look up user's MCP endpoint from Postgres │
│ 2. mcp.call("get_context_summary") │
│ 3. mcp.call("get_conversation_history", 20) │
│ 4. Build prompt: │
│ system = BASE_PROMPT │
│ + context_summary │
│ + user_agent_prompt (from MCP) │
│ messages = history + new message │
│ 5. LLM call (your API key) │
│ 6. While LLM wants tool calls: │
│ - Platform tools → handle locally │
│ - User tools → proxy to MCP │
│ - LLM call again with results │
│ 7. mcp.call("save_message", ...) │
│ 8. Return response to user │
│ │
└─────────────────────────────────────────────────┘


@@ -1,11 +0,0 @@
Generally use skills instead of subagents, except for the analysis subagent.
## User-specific files and tools
* Indicators
* Strategies
* Watchlists
* Preferences
* Trading style
* Charting / colors
* Executors (really just sub-strategies)
* tactical-level order generators e.g. TWAP, iceberg, etc.

doc/architecture.md Normal file

@@ -0,0 +1,656 @@
# DexOrder AI Platform Architecture
## Overview
DexOrder is an AI-powered trading platform that combines real-time market data processing, user-specific AI agents, and a flexible data pipeline. The system is designed for scalability, isolation, and extensibility.
## High-Level Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│ User Clients │
│ (Web, Mobile, Telegram, External MCP) │
└────────────────────────────┬────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Gateway │
│ • WebSocket/HTTP/Telegram handlers │
│ • Authentication & session management │
│ • Agent Harness (LangChain/LangGraph orchestration) │
│ - MCP client connector to user containers │
│ - RAG retriever (Qdrant) │
│ - Model router (LLM selection) │
│ - Skills & subagents framework │
│ • Dynamic user container provisioning │
│ • Event routing (informational & critical) │
└────────┬──────────────────┬────────────────────┬────────────────┘
│ │ │
▼ ▼ ▼
┌──────────────────┐ ┌──────────────┐ ┌──────────────────────┐
│ User Containers │ │ Relay │ │ Infrastructure │
│ (per-user pods) │ │ (ZMQ Router) │ │ • DragonflyDB (cache)│
│ │ │ │ │ • Qdrant (vectors) │
│ • MCP Server │ │ • Market data│ │ • PostgreSQL (meta) │
│ • User files: │ │ fanout │ │ • MinIO (S3) │
│ - Indicators │ │ • Work queue │ │ │
│ - Strategies │ │ • Stateless │ │ │
│ - Preferences │ │ │ │ │
│ • Event Publisher│ │ │ │ │
│ • Lifecycle Mgmt │ │ │ │ │
└──────────────────┘ └──────┬───────┘ └──────────────────────┘
┌──────────────┴──────────────┐
│ │
▼ ▼
┌──────────────────┐ ┌──────────────────────┐
│ Ingestors │ │ Flink Cluster │
│ • CCXT adapters │ │ • Deduplication │
│ • Exchange APIs │ │ • OHLC aggregation │
│ • Push to Kafka │ │ • CEP engine │
└────────┬─────────┘ │ • Writes to Iceberg │
│ │ • Market data PUB │
│ └──────────┬───────────┘
▼ │
┌─────────────────────────────────────▼────────────┐
│ Kafka │
│ • Durable append log │
│ • Topic-based streams │
│ • Event sourcing │
└──────────────────────┬───────────────────────────┘
┌─────────────────┐
│ Iceberg Catalog │
│ • Historical │
│ OHLC storage │
│ • Query API │
└─────────────────┘
```
## Core Components
### 1. Gateway
**Location:** `gateway/`
**Language:** TypeScript (Node.js)
**Purpose:** Entry point for all user interactions
**Responsibilities:**
- **Authentication:** JWT tokens, Telegram OAuth, multi-tier licensing
- **Session Management:** WebSocket connections, Telegram webhooks, multi-channel support
- **Container Orchestration:** Dynamic provisioning of user agent pods ([[gateway_container_creation]])
- **Event Handling:**
- Subscribe to user container events (XPUB/SUB for informational)
- Route critical events (ROUTER/DEALER with ack) ([[user_container_events]])
- **Agent Harness (LangChain/LangGraph):** ([[agent_harness]])
- Stateless LLM orchestration
- MCP client connector to user containers
- RAG retrieval from Qdrant (global + user-specific knowledge)
- Model routing based on license tier and complexity
- Skills and subagents framework
- Workflow state machines with validation loops
**Key Features:**
- **Stateless design:** All conversation state lives in user containers or Qdrant
- **Multi-channel support:** WebSocket, Telegram (future: mobile, Discord, Slack)
- **Kubernetes-native:** Uses k8s API for container management
- **Three-tier memory:**
- Redis: Hot storage, active sessions, LangGraph checkpoints (1 hour TTL)
- Qdrant: Vector search, RAG, global + user knowledge, GDPR-compliant
- Iceberg: Cold storage, full history, analytics, time-travel queries
**Infrastructure:**
- Deployed in `dexorder-system` namespace
- RBAC: Can create but not delete user containers
- Network policies: Access to k8s API, user containers, infrastructure
---
### 2. User Containers
**Location:** `client-py/`
**Language:** Python
**Purpose:** Per-user isolated workspace and data storage
**Architecture:**
- One pod per user (auto-provisioned by gateway)
- Persistent storage (PVC) for user data
- Multi-container pod:
- **Agent container:** MCP server + event publisher + user files
- **Lifecycle sidecar:** Auto-shutdown and cleanup
**Components:**
#### MCP Server
Exposes user-specific resources and tools via Model Context Protocol.
**Resources (Context for LLM):**
Gateway fetches these before each LLM call:
- `context://user-profile` - Trading preferences, style, risk tolerance
- `context://conversation-summary` - Recent conversation with semantic context
- `context://workspace-state` - Current chart, watchlist, positions, alerts
- `context://system-prompt` - User's custom AI instructions
**Tools (Actions with side effects):**
Gateway proxies these to user's MCP server:
- `save_message(role, content)` - Save to conversation history
- `search_conversation(query)` - Semantic search over past conversations
- `list_strategies()`, `read_strategy(name)`, `write_strategy(name, code)`
- `list_indicators()`, `read_indicator(name)`, `write_indicator(name, code)`
- `run_backtest(strategy, params)` - Execute backtest
- `get_watchlist()`, `execute_trade(params)`, `get_positions()`
- `run_python(code)` - Execute Python with data science libraries
**User Files:**
- `indicators/*.py` - Custom technical indicators
- `strategies/*.py` - Trading strategies with entry/exit rules
- Watchlists and preferences
- Git-versioned in persistent volume
#### Event Publisher ([[user_container_events]])
Publishes user events (order fills, alerts, workspace changes) via dual-channel ZMQ:
- **XPUB:** Informational events (fire-and-forget to active sessions)
- **DEALER:** Critical events (guaranteed delivery with ack)
#### Lifecycle Manager ([[container_lifecycle_management]])
Tracks MCP activity and active triggers, and shuts the container down when idle:
- Configurable idle timeouts by license tier
- Exit code 42 signals intentional shutdown
- Sidecar deletes deployment and optionally PVC
**Isolation:**
- Network policies: Cannot access k8s API, other users, or system services
- PodSecurity: Non-root, read-only rootfs, dropped capabilities
- Resource limits enforced by license tier
---
### 3. Data Pipeline
#### Relay (ZMQ Router)
**Location:** `relay/`
**Language:** Rust
**Purpose:** Stateless message router for market data and requests
**Architecture:**
- Well-known bind point (all components connect to it)
- No request tracking or state
- Topic-based routing
**Channels:**
1. **Client Requests (ROUTER):** Port 5559 - Historical data requests
2. **Ingestor Work Queue (PUB):** Port 5555 - Work distribution with exchange prefix
3. **Market Data Fanout (XPUB/XSUB):** Port 5558 - Realtime data + notifications
4. **Responses (SUB → PUB proxy):** Notifications from Flink to clients
See [[protocol]] for detailed ZMQ patterns and message formats.
---
#### Ingestors
**Location:** `ingestor/`
**Language:** Python
**Purpose:** Fetch market data from exchanges
**Features:**
- CCXT-based exchange adapters
- Subscribes to work queue via exchange prefix (e.g., `BINANCE:`)
- Writes raw data to Kafka only (no direct client responses)
- Supports realtime ticks and historical OHLC
**Data Flow:**
```
Exchange API → Ingestor → Kafka → Flink → Iceberg
Notification → Relay → Clients
```
---
#### Kafka
**Deployment:** KRaft mode (no Zookeeper)
**Purpose:** Durable event log and stream processing backbone
**Topics:**
- Raw market data streams (per exchange/symbol)
- Processed OHLC data
- Notification events
- User events (orders, alerts)
**Retention:**
- Configurable per topic (default: 7 days for raw data)
- Longer retention for aggregated data
---
#### Flink
**Deployment:** JobManager + TaskManager(s)
**Purpose:** Stream processing and aggregation
**Jobs:**
1. **Deduplication:** Remove duplicate ticks from multiple ingestors
2. **OHLC Aggregation:** Build candles from tick streams
3. **CEP (Complex Event Processing):** Pattern detection and alerts
4. **Iceberg Writer:** Batch write to long-term storage
5. **Notification Publisher:** ZMQ PUB for async client notifications
**State:**
- Checkpointing to MinIO (S3-compatible)
- Exactly-once processing semantics
**Scaling:**
- Multiple TaskManagers for parallelism
- Headless service for ZMQ discovery (see [[protocol#TODO: Flink-to-Relay ZMQ Discovery]])
---
#### Apache Iceberg
**Deployment:** REST catalog with PostgreSQL backend
**Purpose:** Historical data lake for OHLC and analytics
**Features:**
- Schema evolution
- Time travel queries
- Partitioning by date/symbol
- Efficient columnar storage (Parquet)
**Storage:** MinIO (S3-compatible object storage)
---
### 4. Infrastructure Services
#### DragonflyDB
- Redis-compatible in-memory cache
- Session state, rate limiting, hot data
#### Qdrant
- Vector database for RAG
- **Global knowledge** (user_id="0"): Platform capabilities, trading concepts, strategy patterns
- **User knowledge** (user_id=specific): Personal conversations, preferences, strategies
- GDPR-compliant (indexed by user_id for fast deletion)
#### PostgreSQL
- Iceberg catalog metadata
- User accounts and license info (gateway)
- Per-user data lives in user containers
#### MinIO
- S3-compatible object storage
- Iceberg table data
- Flink checkpoints
- User file uploads
---
## Data Flow Patterns
### Historical Data Query (Async)
```
1. Client → Gateway → User Container MCP: User requests data
2. Gateway → Relay (REQ/REP): Submit historical request
3. Relay → Ingestors (PUB/SUB): Broadcast work with exchange prefix
4. Ingestor → Exchange API: Fetch data
5. Ingestor → Kafka: Write OHLC batch with metadata
6. Flink → Kafka: Read, process, dedupe
7. Flink → Iceberg: Write to table
8. Flink → Relay (PUB): Publish HistoryReadyNotification
9. Relay → Client (SUB): Notification delivered
10. Client → Iceberg: Query data directly
```
**Key Design:**
- Client subscribes to notification topic BEFORE submitting request (prevents race)
- Notification topics are deterministic: `RESPONSE:{client_id}` or `HISTORY_READY:{request_id}`
- No state in Relay (fully topic-based routing)
See [[protocol#Historical Data Query Flow]] for details.
---
### Realtime Market Data
```
1. Ingestor → Kafka: Write realtime ticks
2. Flink → Kafka: Read and aggregate OHLC
3. Flink → Relay (PUB): Publish market data
4. Relay → Clients (XPUB/SUB): Fanout to subscribers
```
**Topic Format:** `{ticker}|{data_type}` (e.g., `BINANCE:BTC/USDT|tick`)
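A small helper pair makes the topic convention concrete. Splitting on the *last* `|` matters because tickers such as `BINANCE:BTC/USDT` already contain `:` and `/`; this is a sketch of the convention as documented, not relay code:

```typescript
// Build a market-data topic in the {ticker}|{data_type} format.
function buildTopic(ticker: string, dataType: string): string {
  return `${ticker}|${dataType}`;
}

// Parse a topic; split on the last "|" so exchange-prefixed tickers survive.
function parseTopic(topic: string): { ticker: string; dataType: string } {
  const i = topic.lastIndexOf("|");
  if (i < 0) throw new Error(`malformed topic: ${topic}`);
  return { ticker: topic.slice(0, i), dataType: topic.slice(i + 1) };
}
```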
---
### User Events
User containers emit events (order fills, alerts) that must reach users reliably.
**Dual-Channel Design:**
1. **Informational Events (XPUB/SUB):**
- Container tracks active subscriptions via XPUB
- Publishes only if someone is listening
- Zero latency, fire-and-forget
2. **Critical Events (DEALER/ROUTER):**
- Container sends to gateway ROUTER with event ID
- Gateway delivers via Telegram/email/push
- Gateway sends EventAck back to container
- Container retries on timeout
- Persisted to disk on shutdown
See [[user_container_events]] for implementation.
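The critical-event path above reduces to a send-await-ack-retry loop. The following is a sketch under an abstracted transport: the real implementation uses ZMQ DEALER/ROUTER frames and persists unacked events to disk on shutdown, and `Transport` is a hypothetical interface:

```typescript
// Hypothetical transport: resolves true if the gateway returned an EventAck.
interface Transport {
  send(eventId: string, payload: string): Promise<boolean>;
}

// Retry loop for critical events; returns false once retries are exhausted,
// at which point the caller would persist the event for later replay.
async function deliverCritical(
  t: Transport,
  eventId: string,
  payload: string,
  maxRetries = 3,
): Promise<boolean> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    if (await t.send(eventId, payload)) return true; // acked
  }
  return false; // unacked: persist to disk for redelivery
}
```

Carrying the same `eventId` on every retry lets the gateway deduplicate if an ack is lost rather than the event itself.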
---
## Container Lifecycle
### Creation ([[gateway_container_creation]])
```
User authenticates → Gateway checks if deployment exists
→ If missing, create from template (based on license tier)
→ Wait for ready (2min timeout)
→ Return MCP endpoint
```
**Templates by Tier:**
| Tier | Memory | CPU | Storage | Idle Timeout |
|------|--------|-----|---------|--------------|
| Free | 512Mi | 500m | 1Gi | 15min |
| Pro | 2Gi | 2000m | 10Gi | 60min |
| Enterprise | 4Gi | 4000m | 50Gi | Never |
---
### Lifecycle Management ([[container_lifecycle_management]])
**Idle Detection:**
- Container is idle when: no active triggers + no recent MCP activity
- Lifecycle manager tracks:
- MCP tool/resource calls (reset idle timer)
- Active triggers (data subscriptions, CEP patterns)
**Shutdown:**
- On idle timeout: exit with code 42
- Lifecycle sidecar detects exit code 42
- Sidecar calls k8s API to delete deployment
- Optionally deletes PVC (anonymous users only)
**Security:**
- Sidecar has RBAC to delete its own deployment only
- Cannot delete other deployments or access other namespaces
- Gateway cannot delete deployments (separation of concerns)
---
## Security Architecture
### Network Isolation
**NetworkPolicies:**
- User containers:
- ✅ Connect to gateway (MCP)
- ✅ Connect to relay (market data)
- ✅ Outbound HTTPS (exchanges, LLM APIs)
- ❌ No k8s API access
- ❌ No system namespace access
- ❌ No inter-user communication
- Gateway:
- ✅ k8s API (create containers)
- ✅ User containers (MCP client)
- ✅ Infrastructure (Postgres, Redis)
- ✅ Outbound (Anthropic API)
---
### RBAC
**Gateway ServiceAccount:**
- Create deployments/services/PVCs in `dexorder-agents` namespace
- Read pod status and logs
- Cannot delete, exec, or access secrets
**Lifecycle Sidecar ServiceAccount:**
- Delete deployments in `dexorder-agents` namespace
- Delete PVCs (conditional on user type)
- Cannot access other resources
---
### Admission Control
All pods in `dexorder-agents` namespace must:
- Use approved images only (allowlist)
- Run as non-root
- Drop all capabilities
- Use read-only root filesystem
- Have resource limits
See `deploy/k8s/base/admission-policy.yaml`
---
## Agent Harness Flow
The gateway's agent harness (LangChain/LangGraph) orchestrates LLM interactions with full context.
```
1. User sends message → Gateway (WebSocket/Telegram)
2. Authenticator validates user and gets license info
3. Container Manager ensures user's MCP container is running
4. Agent Harness processes message:
├─→ a. MCPClientConnector fetches context resources from user's MCP:
│ - context://user-profile
│ - context://conversation-summary
│ - context://workspace-state
│ - context://system-prompt
├─→ b. RAGRetriever searches Qdrant for relevant memories:
│ - Embeds user query
│ - Searches: user_id IN (current_user, "0")
│ - Returns user-specific + global platform knowledge
├─→ c. Build system prompt:
│ - Base platform prompt
│ - User profile context
│ - Workspace state
│ - Custom user instructions
│ - Relevant RAG memories
├─→ d. ModelRouter selects LLM:
│ - Based on license tier
│ - Query complexity
│ - Routing strategy (cost/speed/quality)
├─→ e. LLM invocation with tool support:
│ - Send messages to LLM
│ - If tool calls requested:
│ • Platform tools → handled by gateway
│ • User tools → proxied to MCP container
│ - Loop until no more tool calls
├─→ f. Save conversation to MCP:
│ - mcp.callTool('save_message', user_message)
│ - mcp.callTool('save_message', assistant_message)
└─→ g. Return response to user via channel
```
**Key Architecture:**
- **Gateway is stateless:** No conversation history stored in gateway
- **User context in MCP:** All user-specific data lives in user's container
- **Global knowledge in Qdrant:** Platform documentation loaded from `gateway/knowledge/`
- **RAG at gateway level:** Semantic search combines global + user knowledge
- **Skills vs Subagents:**
- Skills: Well-defined, single-purpose tasks
- Subagents: Complex domain expertise with multi-file context
- **Workflows:** LangGraph state machines for multi-step processes
See [[agent_harness]] for detailed implementation.
---
## Configuration Management
All services use dual YAML files:
- `config.yaml` - Non-sensitive configuration (mounted from ConfigMap)
- `secrets.yaml` - Credentials and tokens (mounted from Secret)
**Environment Variables:**
- K8s downward API for pod metadata
- Service discovery via DNS (e.g., `kafka:9092`)
---
## Deployment
### Development
```bash
# Start local k8s
minikube start
# Apply infrastructure
kubectl apply -k deploy/k8s/dev
# Build and load images
docker build -t dexorder/gateway:latest gateway/
minikube image load dexorder/gateway:latest
# Port-forward for access
kubectl port-forward -n dexorder-system svc/gateway 3000:3000
```
---
### Production
```bash
# Apply production configs
kubectl apply -k deploy/k8s/prod
# Push images to registry
docker push ghcr.io/dexorder/gateway:latest
docker push ghcr.io/dexorder/agent:latest
docker push ghcr.io/dexorder/lifecycle-sidecar:latest
```
**Namespaces:**
- `dexorder-system` - Platform services (gateway, infrastructure)
- `dexorder-agents` - User containers (isolated)
---
## Observability
### Metrics (Prometheus)
- Container creation/deletion rates
- Idle shutdown counts
- MCP call latency and errors
- Event delivery rates and retries
- Kafka lag and throughput
- Flink checkpoint duration
### Logging
- Structured JSON logs
- User ID in all agent logs
- Aggregated via Loki or CloudWatch
### Tracing
- OpenTelemetry spans across gateway → MCP → LLM
- User-scoped traces for debugging
---
## Scalability
### Horizontal Scaling
**Stateless Components:**
- Gateway: Add replicas behind load balancer
- Relay: Single instance (stateless router)
- Ingestors: Scale by exchange workload
**Stateful Components:**
- Flink: Scale TaskManagers
- User containers: One per user (1000s of pods)
**Bottlenecks:**
- Flink → Relay ZMQ: Requires discovery protocol (see [[protocol#TODO: Flink-to-Relay ZMQ Discovery]])
- Kafka: Partition by symbol for parallelism
- Iceberg: Partition by date/symbol
---
### Cost Optimization
**Tiered Resources:**
- Free users: Aggressive idle shutdown (15min)
- Pro users: Longer timeout (60min)
- Enterprise: Always-on containers
**Storage:**
- PVC deletion for anonymous users
- Tiered storage classes (fast SSD → cheap HDD)
**LLM Costs:**
- Rate limiting per license tier
- Caching of MCP resources (1-5min TTL)
- Conversation summarization to reduce context size
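The MCP-resource caching mentioned above only needs a small TTL cache keyed by resource URI. This is a minimal sketch with an injectable clock (the 1-5 min TTL is the figure suggested here; eviction is lazy, on read):

```typescript
// Minimal TTL cache for MCP resource reads. `now` is injectable for testing.
class TtlCache<V> {
  private store = new Map<string, { value: V; expires: number }>();
  constructor(private ttlMs: number, private now: () => number = Date.now) {}

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry || entry.expires <= this.now()) {
      this.store.delete(key); // lazily evict expired entries
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.store.set(key, { value, expires: this.now() + this.ttlMs });
  }
}
```

A miss falls through to `mcp.readResource()`, so a container restart costs at most one stale-free refetch per resource.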
---
## Development Roadmap
See [[backend_redesign]] for detailed notes.
**Phase 1: Foundation (Complete)**
- Gateway with k8s integration
- User container provisioning
- MCP protocol implementation
- Basic market data pipeline
**Phase 2: Data Pipeline (In Progress)**
- Kafka topic schemas
- Flink jobs for aggregation
- Iceberg integration
- Historical backfill service
**Phase 3: Agent Features**
- RAG integration (Qdrant)
- Strategy backtesting
- Risk management tools
- Portfolio analytics
**Phase 4: Production Hardening**
- Multi-region deployment
- HA for infrastructure
- Comprehensive monitoring
- Performance optimization
---
## Related Documentation
- [[protocol]] - ZMQ message protocols and data flow
- [[gateway_container_creation]] - Dynamic container provisioning
- [[container_lifecycle_management]] - Idle shutdown and cleanup
- [[user_container_events]] - Event system implementation
- [[agent_harness]] - LLM orchestration flow
- [[m_c_p_tools_architecture]] - User MCP tools specification
- [[user_mcp_resources]] - Context resources and RAG
- [[m_c_p_client_authentication_modes]] - MCP authentication patterns
- [[backend_redesign]] - Design notes and TODO items

doc/auth.md Normal file

@@ -0,0 +1,468 @@
# Authentication System Setup
This document describes the multi-channel authentication system for the Dexorder AI Gateway.
## Overview
The gateway now implements a comprehensive authentication system using **Better Auth** with support for:
- ✅ Email/Password authentication
- ✅ Passkey/WebAuthn (passwordless biometric auth)
- ✅ JWT token-based sessions
- ✅ Multi-channel support (WebSocket, Telegram, REST API)
- ✅ PostgreSQL-based user management
- ✅ Secure password hashing with Argon2
## Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ Client Apps │
│ (Web, Mobile, CLI, Telegram, etc.) │
└────────────┬────────────────────────────────┬───────────────┘
│ │
│ HTTP/REST │ WebSocket
│ │
┌────────────▼────────────────────────────────▼───────────────┐
│ Gateway (Fastify) │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌─────────────────┐ │
│ │ Auth Routes │ │ WebSocket │ │ Telegram │ │
│ │ /auth/* │ │ Handler │ │ Handler │ │
│ └──────┬───────┘ └──────┬───────┘ └────────┬────────┘ │
│ │ │ │ │
│ └─────────────────┴────────────────────┘ │
│ │ │
│ ┌────────▼──────────┐ │
│ │ Auth Service │ │
│ │ (Better Auth) │ │
│ └────────┬──────────┘ │
│ │ │
└───────────────────────────┼────────────────────────────────┘
┌────────▼────────┐
│ PostgreSQL │
│ - users │
│ - sessions │
│ - passkeys │
│ - credentials │
└─────────────────┘
```
## Database Schema
The authentication system uses the following PostgreSQL tables:
### Core Tables
1. **users** - Core user accounts
- `id` (PRIMARY KEY)
- `email` (UNIQUE)
- `email_verified`
- `name`
- `created_at`, `updated_at`
2. **user_credentials** - Password hashes
- `user_id` (FOREIGN KEY → users.id)
- `password_hash` (Argon2)
3. **sessions** - JWT sessions
- `id` (PRIMARY KEY)
- `user_id` (FOREIGN KEY → users.id)
- `expires_at`
- `ip_address`, `user_agent`
4. **passkeys** - WebAuthn credentials
- `id` (PRIMARY KEY)
- `user_id` (FOREIGN KEY → users.id)
- `credential_id` (UNIQUE)
- `credential_public_key`
- `counter`, `transports`
5. **verification_tokens** - Email verification, password reset
- `identifier`, `token`, `expires_at`
### Integration Tables
6. **user_licenses** - User authorization & feature flags
- `user_id` (FOREIGN KEY → users.id)
- `license_type` (free, pro, enterprise)
- `features` (JSONB)
- `resource_limits` (JSONB)
7. **user_channel_links** - Multi-channel support
- `user_id` (FOREIGN KEY → users.id)
- `channel_type` (telegram, slack, discord, websocket)
- `channel_user_id`
## Installation
1. **Install dependencies:**
```bash
cd gateway
npm install
```
The following packages are added:
- `better-auth` - Main authentication framework
- `@simplewebauthn/server` - WebAuthn/passkey support
- `@simplewebauthn/browser` - Client-side passkey helpers
- `@fastify/jwt` - JWT utilities
- `argon2` - Secure password hashing
2. **Apply database schema:**
```bash
psql $DATABASE_URL -f schema.sql
```
3. **Configure secrets:**
Copy `secrets.example.yaml` to your actual secrets file and update:
```yaml
auth:
secret: "YOUR-SUPER-SECRET-KEY-HERE" # Generate with: openssl rand -base64 32
```
4. **Configure server:**
Update `config.yaml`:
```yaml
server:
base_url: http://localhost:3000 # Or your production URL
trusted_origins:
- http://localhost:3000
- http://localhost:5173 # Your web app
- https://yourdomain.com
```
## API Endpoints
### Authentication Routes
All of Better Auth's automatic routes are mounted at `/api/auth/*`:
- `POST /api/auth/sign-up/email` - Register with email/password
- `POST /api/auth/sign-in/email` - Sign in with email/password
- `POST /api/auth/sign-out` - Sign out
- `GET /api/auth/session` - Get current session
- `POST /api/auth/passkey/register` - Register passkey
- `POST /api/auth/passkey/authenticate` - Authenticate with passkey
### Custom Routes (Simplified)
- `POST /auth/register` - Register and auto sign-in
- `POST /auth/login` - Sign in
- `POST /auth/logout` - Sign out
- `GET /auth/session` - Get session
- `GET /auth/health` - Auth system health check
### Example Usage
#### Register a new user
```bash
curl -X POST http://localhost:3000/auth/register \
-H "Content-Type: application/json" \
-d '{
"email": "user@example.com",
"password": "SecurePassword123!",
"name": "John Doe"
}'
```
Response:
```json
{
"success": true,
"userId": "user_1234567890_abc123",
"token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..."
}
```
#### Sign in
```bash
curl -X POST http://localhost:3000/auth/login \
-H "Content-Type: application/json" \
-d '{
"email": "user@example.com",
"password": "SecurePassword123!"
}'
```
Response:
```json
{
"success": true,
"userId": "user_1234567890_abc123",
"token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..."
}
```
#### Connect to WebSocket with JWT
```javascript
const ws = new WebSocket('ws://localhost:3000/ws/chat');
ws.addEventListener('open', () => {
// Send auth token in initial message
ws.send(JSON.stringify({
type: 'auth',
token: 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...'
}));
});
```
Or, from Node.js, use an `Authorization` header (the browser `WebSocket` constructor does not accept custom headers; a client library such as the `ws` package is required):
```javascript
const ws = new WebSocket('ws://localhost:3000/ws/chat', {
headers: {
'Authorization': 'Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...'
}
});
```
#### Get current session
```bash
curl http://localhost:3000/auth/session \
-H "Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..."
```
## Passkey (WebAuthn) Support
### Server Setup
Passkeys are automatically configured in `better-auth-config.ts`:
```typescript
passkey({
rpName: 'Dexorder AI',
rpID: new URL(config.baseUrl).hostname,
origin: config.baseUrl,
})
```
### Client-Side Integration
```typescript
import { startRegistration, startAuthentication } from '@simplewebauthn/browser';
// 1. Register a passkey (user must be logged in)
async function registerPasskey(token: string) {
// Get options from server
const optionsResponse = await fetch('/auth/passkey/register/options', {
method: 'POST',
headers: {
'Authorization': `Bearer ${token}`
}
});
const options = await optionsResponse.json();
// Start WebAuthn registration
const credential = await startRegistration(options);
// Send credential to server
const response = await fetch('/api/auth/passkey/register', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'Authorization': `Bearer ${token}`
},
body: JSON.stringify({ credential })
});
return response.json();
}
// 2. Authenticate with passkey
async function authenticateWithPasskey() {
// Get challenge from server
const optionsResponse = await fetch('/api/auth/passkey/authenticate/options');
const options = await optionsResponse.json();
// Start WebAuthn authentication
const credential = await startAuthentication(options);
// Verify with server
const response = await fetch('/auth/passkey/authenticate', {
method: 'POST',
headers: {
'Content-Type': 'application/json'
},
body: JSON.stringify({ credential })
});
const { token, userId } = await response.json();
return { token, userId };
}
```
## Multi-Channel Support
### WebSocket Authentication
The WebSocket handler (`websocket-handler.ts`) now properly verifies JWT tokens:
```typescript
// User connects with JWT token in Authorization header
const authContext = await authenticator.authenticateWebSocket(request);
// Returns: { userId, sessionId, license, ... }
```
### Telegram Bot Authentication
Users link their Telegram account via the `user_channel_links` table:
```sql
INSERT INTO user_channel_links (user_id, channel_type, channel_user_id)
VALUES ('user_1234567890_abc123', 'telegram', '987654321');
```
The `authenticator.authenticateTelegram()` method resolves the user from their Telegram ID.
### API Authentication
All REST API calls use the `Authorization: Bearer <token>` header.
## Security Considerations
### Production Checklist
- [ ] Generate a strong random secret: `openssl rand -base64 32`
- [ ] Enable email verification: Set `requireEmailVerification: true`
- [ ] Configure HTTPS only in production
- [ ] Set proper `trusted_origins` for CORS
- [ ] Implement rate limiting (consider adding `@fastify/rate-limit`)
- [ ] Set up email service for password reset
- [ ] Configure session expiry based on security requirements
- [ ] Enable 2FA for sensitive operations
- [ ] Implement audit logging for auth events
- [ ] Set up monitoring for failed login attempts
### Password Security
- Uses **Argon2** (winner of the Password Hashing Competition)
- Automatically salted and hashed by Better Auth
- Never stored or logged in plain text
### JWT Security
- Tokens expire after 7 days (configurable)
- Sessions update every 24 hours
- Tokens signed with HMAC-SHA256
- Store secret in k8s secrets, never in code
### Passkey Security
- Uses FIDO2/WebAuthn standards
- Hardware-backed authentication
- Phishing-resistant
- No passwords to leak or forget
## Migration Guide
If you have existing users with a different auth system:
1. **Create users in new schema:**
```sql
INSERT INTO users (id, email, email_verified, name)
SELECT user_id, email, true, name FROM old_users_table;
```
2. **Migrate licenses:**
```sql
-- Ensure user_licenses references users.id
UPDATE user_licenses SET user_id = users.id WHERE ...;
```
3. **Users must reset their password** (or register a passkey) on first login with the new system
## Troubleshooting
### "Authentication failed" on WebSocket
- Check that the JWT token is valid and not expired
- Verify the Authorization header format: `Bearer <token>`
- Check server logs for detailed error messages
### "Invalid credentials" on login
- Verify the user exists in the `users` table
- Check that `user_credentials` has a password_hash for the user
- Passwords are case-sensitive
### Passkey registration fails
- Check browser support for WebAuthn
- Verify HTTPS is enabled (required for WebAuthn in production)
- Check `rpID` matches your domain
- Ensure user is authenticated before registering passkey
## Development Tips
### Testing with dev user
A development user is created automatically:
```javascript
// Email: dev@example.com
// User ID: dev-user-001
// License: pro
```
Generate a token for testing:
```bash
curl -X POST http://localhost:3000/auth/login \
-H "Content-Type: application/json" \
-d '{"email":"dev@example.com","password":"<set-in-db>"}'
```
### Inspecting tokens
```bash
# Decode the JWT payload for inspection (no signature verification)
echo "eyJhbGc..." | cut -d. -f2 | base64 -d | jq
```
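The `base64 -d` one-liner can choke on base64url payloads whose length is not a multiple of 4; an equivalent Python helper that restores the padding first (helper name is illustrative):

```python
import base64
import json

def decode_jwt_payload(token: str) -> dict:
    # Middle segment of header.payload.signature, with base64url padding restored
    payload = token.split(".")[1]
    return json.loads(base64.urlsafe_b64decode(payload + "=" * (-len(payload) % 4)))
```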
### Database queries
```sql
-- List all users
SELECT id, email, name, created_at FROM users;
-- List active sessions
SELECT s.id, s.user_id, u.email, s.expires_at
FROM sessions s
JOIN users u ON s.user_id = u.id
WHERE s.expires_at > NOW();
-- List passkeys
SELECT p.id, p.name, u.email, p.created_at
FROM passkeys p
JOIN users u ON p.user_id = u.id;
```
## Future Enhancements
Potential additions to consider:
- [ ] OAuth providers (Google, GitHub, etc.)
- [ ] Magic link authentication
- [ ] Two-factor authentication (TOTP)
- [ ] Session management dashboard
- [ ] Audit log for security events
- [ ] IP-based restrictions
- [ ] Device management (trusted devices)
- [ ] Anonymous authentication for trials
## References
- [Better Auth Documentation](https://better-auth.com/)
- [SimpleWebAuthn Guide](https://simplewebauthn.dev/)
- [WebAuthn Guide](https://webauthn.guide/)
- [FIDO Alliance](https://fidoalliance.org/)
- [Fastify Authentication](https://fastify.dev/docs/latest/Guides/Getting-Started/#your-first-plugin)

View File

@@ -100,7 +100,7 @@ Ingestion API
* RAG namespace
* Agents
* Top-level coordinator
-* TradingView agent
+* TradingView skill
* Indicators, Drawings, Annotations
* Research Agent
* Pandas/Polars analysis

View File

@@ -151,6 +151,310 @@ The two-frame envelope is the **logical protocol format**, but physical transmis
| 0x11 | SubmitResponse | Immediate ack with notification topic |
| 0x12 | HistoryReadyNotification | Notification that data is ready in Iceberg |
## User Container Event System
User containers emit events (order executions, alerts, workspace changes) that must be delivered to users via their active session or external channels (Telegram, email, push). This requires two ZMQ patterns with different delivery guarantees.
### Event Flow Overview
```
┌─────────────────────────────────────────────────────────────┐
│ User Container │
│ │
│ Strategy/Indicator Engine │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Event Publisher │ │
│ │ │ │
│ │ 1. Check delivery spec │ │
│ │ 2. If INFORMATIONAL or has_active_subscriber(): │ │
│ │ → XPUB (fast path) │ │
│ │ 3. Else (CRITICAL or no active session): │ │
│ │ → DEALER (guaranteed delivery) │ │
│ └─────────────────────────────────────────────────────┘ │
│ │ │ │
│ XPUB socket DEALER socket │
│ (port 5570) (port 5571) │
└─────────┼───────────────────────────┼───────────────────────┘
│ │
▼ ▼
┌─────────────────────────────────────────────────────────────┐
│ Gateway Pool │
│ │
│ ┌──────────────────┐ ┌──────────────────────────┐ │
│ │ SUB socket │ │ ROUTER socket │ │
│ │ (per-session) │ │ (shared, any gateway) │ │
│ │ │ │ │ │
│ │ Subscribe to │ │ Pull event, deliver, │ │
│ │ USER:{user_id} │ │ send EventAck back │ │
│ │ on connect │ │ │ │
│ └────────┬─────────┘ └─────────────┬────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────────────────┐ │
│ │ Active WS/ │ │ Telegram API / Email / │ │
│ │ Telegram │ │ Push Notification │ │
│ └─────────────┘ └─────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
```
### 5. User Event Channel - Informational (Container → Gateway)
**Pattern**: XPUB/SUB with subscription tracking
- **Socket Type**: Container uses XPUB (bind), Gateway uses SUB (connect)
- **Endpoint**: `tcp://*:5570` (Container binds)
- **Message Types**: `UserEvent`
- **Topic Format**: `USER:{user_id}` (e.g., `USER:user-abc123`)
- **Behavior**:
- Gateway subscribes to `USER:{user_id}` when user's WebSocket/Telegram session connects
- Gateway unsubscribes when session disconnects
- Container uses XPUB with `ZMQ_XPUB_VERBOSE` to track active subscriptions
- Container checks subscription set before publishing
- If no subscriber, message is either dropped (INFORMATIONAL) or routed to critical channel
- Zero coordination, fire-and-forget for active sessions
### 6. User Event Channel - Critical (Container → Gateway)
**Pattern**: DEALER/ROUTER with acknowledgment
- **Socket Type**: Container uses DEALER (connect), Gateway uses ROUTER (bind)
- **Endpoint**: `tcp://gateway:5571` (Gateway binds, containers connect)
- **Message Types**: `UserEvent`, `EventAck`
- **Behavior**:
- Container sends `UserEvent` with `event_id` via DEALER
- DEALER round-robins to available gateway ROUTER sockets
- Gateway processes event (sends to Telegram, email, etc.)
- Gateway sends `EventAck` back to container
- Container tracks pending events with timeout (30s default)
- On timeout without ack: resend (DEALER routes to next gateway)
- On container shutdown: persist pending to disk, reload on startup
- Provides at-least-once delivery guarantee
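The timeout/resend bookkeeping described above can be sketched as pure logic (the 30s timeout matches the stated default; the retry cap is an assumed safeguard, not specified here):

```python
from dataclasses import dataclass

ACK_TIMEOUT_S = 30.0   # matches the 30s default above
MAX_RETRIES = 5        # assumed cap; the spec above does not fix one

@dataclass
class PendingEvent:
    payload: bytes      # serialized UserEvent
    sent_at: float
    retries: int = 0

class PendingTracker:
    """Bookkeeping for critical events awaiting an EventAck."""

    def __init__(self) -> None:
        self.pending: dict[str, PendingEvent] = {}

    def sent(self, event_id: str, payload: bytes, now: float) -> None:
        self.pending[event_id] = PendingEvent(payload, now)

    def acked(self, event_id: str) -> None:
        self.pending.pop(event_id, None)

    def due_for_retry(self, now: float) -> list[bytes]:
        """Payloads whose ack timed out; bumps retry count, resets the clock."""
        due = []
        for p in self.pending.values():
            if now - p.sent_at >= ACK_TIMEOUT_S and p.retries < MAX_RETRIES:
                p.retries += 1
                p.sent_at = now
                due.append(p.payload)
        return due
```

Each payload returned by `due_for_retry` is re-sent on the DEALER socket, which routes it to the next available gateway.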
### Subscription Tracking (Container Side)
Container uses XPUB to detect active sessions:
```python
# Container event publisher initialization
xpub_socket = ctx.socket(zmq.XPUB)
xpub_socket.setsockopt(zmq.XPUB_VERBOSE, 1) # Receive all sub/unsub
xpub_socket.bind("tcp://*:5570")
active_subscriptions: set[str] = set()
# In event loop, handle subscription messages
def process_subscriptions():
while xpub_socket.poll(0):
msg = xpub_socket.recv()
topic = msg[1:].decode() # Skip first byte (sub/unsub flag)
if msg[0] == 1: # Subscribe
active_subscriptions.add(topic)
elif msg[0] == 0: # Unsubscribe
active_subscriptions.discard(topic)
def has_active_subscriber(user_id: str) -> bool:
return f"USER:{user_id}" in active_subscriptions
```
### Event Routing Logic (Container Side)
```python
def publish_event(event: UserEvent):
topic = f"USER:{event.user_id}"
if event.delivery.priority == Priority.INFORMATIONAL:
# Fire and forget - drop if nobody's listening
if has_active_subscriber(event.user_id):
xpub_socket.send_multipart([topic.encode(), serialize(event)])
# else: silently drop
elif has_active_subscriber(event.user_id):
# Active session exists - use fast path
xpub_socket.send_multipart([topic.encode(), serialize(event)])
else:
# No active session - use guaranteed delivery
send_via_dealer(event)
def send_via_dealer(event: UserEvent):
pending_events[event.event_id] = PendingEvent(
event=event,
sent_at=time.time(),
retries=0
)
dealer_socket.send(serialize(event))
```
### Message Type IDs (User Events)
| Type ID | Message Type | Description |
|---------|-----------------|------------------------------------------------|
| 0x20 | UserEvent | Container → Gateway event |
| 0x21 | EventAck | Gateway → Container acknowledgment |
### UserEvent Message
```protobuf
message UserEvent {
string user_id = 1;
string event_id = 2; // UUID for dedup/ack
int64 timestamp = 3; // Unix millis
EventType event_type = 4;
bytes payload = 5; // JSON or nested protobuf
DeliverySpec delivery = 6;
}
enum EventType {
ORDER_PLACED = 0;
ORDER_FILLED = 1;
ORDER_CANCELLED = 2;
ALERT_TRIGGERED = 3;
POSITION_UPDATED = 4;
WORKSPACE_CHANGED = 5;
STRATEGY_LOG = 6;
}
message DeliverySpec {
Priority priority = 1;
repeated ChannelPreference channels = 2; // Ordered preference list
}
enum Priority {
INFORMATIONAL = 0; // Drop if no active session
NORMAL = 1; // Best effort, short queue
CRITICAL = 2; // Must deliver, retry, escalate
}
message ChannelPreference {
ChannelType channel = 1;
bool only_if_active = 2; // true = skip if not connected
}
enum ChannelType {
ACTIVE_SESSION = 0; // Whatever's currently connected
WEB = 1;
TELEGRAM = 2;
EMAIL = 3;
PUSH = 4; // Mobile push notification
}
```
### EventAck Message
```protobuf
message EventAck {
string event_id = 1;
AckStatus status = 2;
string error_message = 3; // If status is ERROR
}
enum AckStatus {
DELIVERED = 0; // Successfully sent to at least one channel
QUEUED = 1; // Accepted, will retry (e.g., Telegram rate limit)
ERROR = 2; // Permanent failure
}
```
### Delivery Examples
```python
# "Show on screen if they're watching, otherwise don't bother"
# → Uses XPUB path only, dropped if no subscriber
UserEvent(
delivery=DeliverySpec(
priority=Priority.INFORMATIONAL,
channels=[ChannelPreference(ChannelType.ACTIVE_SESSION, only_if_active=True)]
)
)
# "Active session preferred, fallback to Telegram"
# → Tries XPUB first (if subscribed), else DEALER for Telegram delivery
UserEvent(
delivery=DeliverySpec(
priority=Priority.NORMAL,
channels=[
ChannelPreference(ChannelType.ACTIVE_SESSION, only_if_active=True),
ChannelPreference(ChannelType.TELEGRAM, only_if_active=False),
]
)
)
# "Order executed - MUST get through"
# → Always uses DEALER path for guaranteed delivery
UserEvent(
delivery=DeliverySpec(
priority=Priority.CRITICAL,
channels=[
ChannelPreference(ChannelType.ACTIVE_SESSION, only_if_active=True),
ChannelPreference(ChannelType.TELEGRAM, only_if_active=False),
ChannelPreference(ChannelType.PUSH, only_if_active=False),
ChannelPreference(ChannelType.EMAIL, only_if_active=False),
]
)
)
```
### Gateway Event Processing
Gateway maintains:
1. **Session registry**: Maps user_id → active WebSocket/channel connections
2. **Channel credentials**: Telegram bot token, email service keys, push certificates
3. **SUB socket per user session**: Subscribes to `USER:{user_id}` on container's XPUB
4. **Shared ROUTER socket**: Receives critical events from any container
```typescript
// On user WebSocket connect
async onSessionConnect(userId: string, ws: WebSocket) {
// Subscribe to user's informational events
subSocket.subscribe(`USER:${userId}`);
sessions.set(userId, ws);
}
// On user WebSocket disconnect
async onSessionDisconnect(userId: string) {
subSocket.unsubscribe(`USER:${userId}`);
sessions.delete(userId);
}
// Handle informational events (from SUB socket)
subSocket.on('message', (topic, payload) => {
const event = deserialize(payload);
const ws = sessions.get(event.userId);
if (ws) {
ws.send(JSON.stringify({ type: 'event', ...event }));
}
});
// Handle critical events (from ROUTER socket)
routerSocket.on('message', (identity, payload) => {
const event = deserialize(payload);
deliverEvent(event).then(status => {
routerSocket.send([identity, serialize(EventAck(event.eventId, status))]);
});
});
async function deliverEvent(event: UserEvent): Promise<AckStatus> {
  for (const pref of event.delivery.channels) {
    if (pref.onlyIfActive && !sessions.has(event.userId)) continue;
    switch (pref.channel) {
      case ChannelType.ACTIVE_SESSION: {
        const ws = sessions.get(event.userId);
        if (ws) {
          ws.send(JSON.stringify({ type: 'event', ...event }));
          return AckStatus.DELIVERED;
        }
        break; // no live session: fall through to the next preference
      }
      case ChannelType.TELEGRAM:
        await telegramBot.sendMessage(event.userId, formatEvent(event));
        return AckStatus.DELIVERED;
      case ChannelType.EMAIL:
        await emailService.send(event.userId, formatEvent(event));
        return AckStatus.DELIVERED;
      // ... other channels (PUSH, WEB)
    }
  }
  return AckStatus.ERROR;
}
```
## Error Handling
**Async Architecture Error Handling**:
@@ -162,7 +466,51 @@ The two-frame envelope is the **logical protocol format**, but physical transmis
- PUB/SUB has no delivery guarantees (Kafka provides durability)
- No response routing needed - all notifications via topic-based pub/sub
**User Event Error Handling**:
- Informational events: dropped silently if no active session (by design)
- Critical events: container retries on ack timeout (30s default)
- Gateway tracks event_id for deduplication (5 minute window)
- If all channels fail: return ERROR ack, container may escalate or log
- Container persists pending critical events to disk on shutdown
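The gateway-side deduplication is language-neutral bookkeeping (the gateway itself is TypeScript); a Python sketch of the 5-minute window, with a simple evict-on-check strategy as an assumed implementation detail:

```python
DEDUP_WINDOW_S = 300.0  # 5-minute window, as stated above

class DedupWindow:
    """Remembers recently seen event_ids so a resent event is not delivered twice."""

    def __init__(self) -> None:
        self.first_seen: dict[str, float] = {}

    def is_duplicate(self, event_id: str, now: float) -> bool:
        # Evict entries that have aged out of the window
        self.first_seen = {
            eid: t for eid, t in self.first_seen.items() if now - t < DEDUP_WINDOW_S
        }
        if event_id in self.first_seen:
            return True
        self.first_seen[event_id] = now
        return False
```

A duplicate still gets an `EventAck` (so the container stops retrying); it just is not delivered again.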
**Durability**:
- All data flows through Kafka for durability
- Flink checkpointing ensures exactly-once processing
- Client can retry request with new request_id if notification not received
- Critical user events use DEALER/ROUTER with ack for at-least-once delivery
## Scaling
### TODO: Flink-to-Relay ZMQ Discovery
Currently Relay connects to Flink via XSUB on a single endpoint. With multiple Flink instances behind a K8s service, we need many-to-many connectivity.
**Problem**: K8s service load balancing doesn't help ZMQ since connections are persistent. Relay needs to connect to ALL Flink instances to receive all published messages.
**Proposed Solution**: Use a K8s headless service for Flink workers:
```yaml
apiVersion: v1
kind: Service
metadata:
name: flink-workers
spec:
clusterIP: None
selector:
app: flink
```
Relay implementation:
1. On startup and periodically (every N seconds), resolve `flink-workers.namespace.svc.cluster.local`
2. DNS returns A records for all Flink pod IPs
3. Diff against current XSUB connections
4. Connect to new pods, disconnect from removed pods
**Alternative approaches considered**:
- XPUB/XSUB broker: Adds single point of failure and latency
- Service discovery (etcd/Redis): More complex, requires additional infrastructure
**Open questions**:
- Appropriate polling interval for DNS resolution (5-10 seconds?)
- Handling of brief disconnection during pod replacement
- Whether to use K8s Endpoints API watch instead of DNS polling for faster reaction
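Steps 1-3 of the proposed Relay logic reduce to resolving the headless service and diffing the result against current connections; a sketch under those assumptions (hostname and port are illustrative):

```python
import socket

def resolve_pod_ips(hostname: str, port: int) -> set[str]:
    """Resolve a K8s headless service; each A record is one pod IP."""
    infos = socket.getaddrinfo(hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM)
    return {info[4][0] for info in infos}

def diff_endpoints(current: set[str], resolved: set[str]) -> tuple[set[str], set[str]]:
    """Return (to_connect, to_disconnect) for the Relay's XSUB socket."""
    return resolved - current, current - resolved
```

The Relay would call `resolve_pod_ips("flink-workers.namespace.svc.cluster.local", 5556)` on a timer, then `connect()`/`disconnect()` the XSUB socket for each IP in the two diff sets.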

1304
doc/user_container_events.md Normal file

File diff suppressed because it is too large