DexOrder AI Platform Architecture
Overview
DexOrder is an AI-powered trading platform that combines real-time market data processing, user-specific AI agents, and a flexible data pipeline. The system is designed for scalability, isolation, and extensibility.
High-Level Architecture
┌─────────────────────────────────────────────────────────────────┐
│ User Clients │
│ (Web, Mobile, Telegram, External MCP) │
└────────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Gateway │
│ • WebSocket/HTTP/Telegram handlers │
│ • Authentication & session management │
│ • Agent Harness (LangChain/LangGraph orchestration) │
│ - MCP client connector to user containers │
│ - RAG retriever (Qdrant) │
│ - Model router (LLM selection) │
│ - Skills & subagents framework │
│ • Dynamic user container provisioning │
│ • Event routing (informational & critical) │
└────────┬──────────────────┬────────────────────┬────────────────┘
│ │ │
▼ ▼ ▼
┌──────────────────┐ ┌──────────────┐ ┌──────────────────────┐
│ User Containers │ │ Relay │ │ Infrastructure │
│ (per-user pods) │ │ (ZMQ Router) │ │ • DragonflyDB (cache)│
│ │ │ │ │ • Qdrant (vectors) │
│ • MCP Server │ │ • Market data│ │ • PostgreSQL (meta) │
│ • User files: │ │ fanout │ │ • MinIO (S3) │
│ - Indicators │ │ • Work queue │ │ │
│ - Strategies │ │ • Stateless │ │ │
│ - Preferences │ │ │ │ │
│ • Event Publisher│ │ │ │ │
│ • Lifecycle Mgmt │ │ │ │ │
└──────────────────┘ └──────┬───────┘ └──────────────────────┘
│
┌──────────────┴──────────────┐
│ │
▼ ▼
┌──────────────────┐ ┌──────────────────────┐
│ Ingestors │ │ Flink Cluster │
│ • CCXT adapters │ │ • Deduplication │
│ • Exchange APIs │ │ • OHLC aggregation │
│ • Push to Kafka │ │ • CEP engine │
└────────┬─────────┘ │ • Writes to Iceberg │
│ │ • Market data PUB │
│ └──────────┬───────────┘
▼ │
┌─────────────────────────────────────▼────────────┐
│ Kafka │
│ • Durable append log │
│ • Topic-based streams │
│ • Event sourcing │
└──────────────────────┬───────────────────────────┘
│
▼
┌─────────────────┐
│ Iceberg Catalog │
│ • Historical │
│ OHLC storage │
│ • Query API │
└─────────────────┘
Core Components
1. Gateway
Location: gateway/
Language: TypeScript (Node.js)
Purpose: Entry point for all user interactions
Responsibilities:
- Authentication: JWT tokens, Telegram OAuth, multi-tier licensing
- Session Management: WebSocket connections, Telegram webhooks, multi-channel support
- Container Orchestration: Dynamic provisioning of user agent pods (gateway_container_creation)
- Event Handling:
- Subscribe to user container events (XPUB/SUB for informational)
- Route critical events (ROUTER/DEALER with ack) (user_container_events)
- Agent Harness (LangChain/LangGraph): (agent_harness)
- Stateless LLM orchestration
- MCP client connector to user containers
- RAG retrieval from Qdrant (global + user-specific knowledge)
- Model routing based on license tier and complexity
- Skills and subagents framework
- Workflow state machines with validation loops
Key Features:
- Stateless design: All conversation state lives in user containers or Qdrant
- Multi-channel support: WebSocket, Telegram (future: mobile, Discord, Slack)
- Kubernetes-native: Uses k8s API for container management
- Three-tier memory:
- Redis (DragonflyDB): Hot storage, active sessions, LangGraph checkpoints (1-hour TTL)
- Qdrant: Vector search, RAG, global + user knowledge, GDPR-compliant
- Iceberg: Cold storage, full history, analytics, time-travel queries
Infrastructure:
- Deployed in `dexorder-system` namespace
- RBAC: Can create but not delete user containers
- Network policies: Access to k8s API, user containers, infrastructure
2. User Containers
Location: client-py/
Language: Python
Purpose: Per-user isolated workspace and data storage
Architecture:
- One pod per user (auto-provisioned by gateway)
- Persistent storage (PVC) for user data
- Multi-container pod:
- Agent container: MCP server + event publisher + user files
- Lifecycle sidecar: Auto-shutdown and cleanup
Components:
MCP Server
Exposes user-specific resources and tools via Model Context Protocol.
Resources (Context for LLM): Gateway fetches these before each LLM call:
- `context://user-profile` - Trading preferences, style, risk tolerance
- `context://conversation-summary` - Recent conversation with semantic context
- `context://workspace-state` - Current chart, watchlist, positions, alerts
- `context://system-prompt` - User's custom AI instructions
Tools (Actions with side effects): Gateway proxies these to user's MCP server:
- `save_message(role, content)` - Save to conversation history
- `search_conversation(query)` - Semantic search over past conversations
- `list_strategies()`, `read_strategy(name)`, `write_strategy(name, code)`
- `list_indicators()`, `read_indicator(name)`, `write_indicator(name, code)`
- `run_backtest(strategy, params)` - Execute backtest
- `get_watchlist()`, `execute_trade(params)`, `get_positions()`
- `run_python(code)` - Execute Python with data science libraries
User Files:
- `indicators/*.py` - Custom technical indicators
- `strategies/*.py` - Trading strategies with entry/exit rules
- Watchlists and preferences
- Git-versioned in persistent volume
Event Publisher (user_container_events)
Publishes user events (order fills, alerts, workspace changes) via dual-channel ZMQ:
- XPUB: Informational events (fire-and-forget to active sessions)
- DEALER: Critical events (guaranteed delivery with ack)
Lifecycle Manager (container_lifecycle_management)
Tracks activity and triggers; auto-shuts down when idle:
- Configurable idle timeouts by license tier
- Exit code 42 signals intentional shutdown
- Sidecar deletes deployment and optionally PVC
Isolation:
- Network policies: Cannot access k8s API, other users, or system services
- PodSecurity: Non-root, read-only rootfs, dropped capabilities
- Resource limits enforced by license tier
3. Data Pipeline
Relay (ZMQ Router)
Location: relay/
Language: Rust
Purpose: Stateless message router for market data and requests
Architecture:
- Well-known bind point (all components connect to it)
- No request tracking or state
- Topic-based routing
Channels:
- Client Requests (ROUTER): Port 5559 - Historical data requests
- Ingestor Work Queue (PUB): Port 5555 - Work distribution with exchange prefix
- Market Data Fanout (XPUB/XSUB): Port 5558 - Realtime data + notifications
- Responses (SUB → PUB proxy): Notifications from Flink to clients
See protocol for detailed ZMQ patterns and message formats.
Ingestors
Location: ingestor/
Language: Python
Purpose: Fetch market data from exchanges
Features:
- CCXT-based exchange adapters
- Subscribes to work queue via exchange prefix (e.g., `BINANCE:`)
- Writes raw data to Kafka only (no direct client responses)
- Supports realtime ticks and historical OHLC
Data Flow:
Exchange API → Ingestor → Kafka → Flink → Iceberg
↓
Notification → Relay → Clients
Kafka
Deployment: KRaft mode (no ZooKeeper)
Purpose: Durable event log and stream-processing backbone
Topics:
- Raw market data streams (per exchange/symbol)
- Processed OHLC data
- Notification events
- User events (orders, alerts)
Retention:
- Configurable per topic (default: 7 days for raw data)
- Longer retention for aggregated data
Flink
Deployment: JobManager + TaskManager(s)
Purpose: Stream processing and aggregation
Jobs:
- Deduplication: Remove duplicate ticks from multiple ingestors
- OHLC Aggregation: Build candles from tick streams
- CEP (Complex Event Processing): Pattern detection and alerts
- Iceberg Writer: Batch write to long-term storage
- Notification Publisher: ZMQ PUB for async client notifications
State:
- Checkpointing to MinIO (S3-compatible)
- Exactly-once processing semantics
Scaling:
- Multiple TaskManagers for parallelism
- Headless service for ZMQ discovery (see protocol#TODO: Flink-to-Relay ZMQ Discovery)
Apache Iceberg
Deployment: REST catalog with PostgreSQL backend
Purpose: Historical data lake for OHLC and analytics
Features:
- Schema evolution
- Time travel queries
- Partitioning by date/symbol
- Efficient columnar storage (Parquet)
Storage: MinIO (S3-compatible object storage)
4. Infrastructure Services
DragonflyDB
- Redis-compatible in-memory cache
- Session state, rate limiting, hot data
Qdrant
- Vector database for RAG
- Global knowledge (user_id="0"): Platform capabilities, trading concepts, strategy patterns
- User knowledge (user_id=specific): Personal conversations, preferences, strategies
- GDPR-compliant (indexed by user_id for fast deletion)
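Both the combined global-plus-user retrieval and the GDPR deletion path reduce to a `user_id` payload filter. A minimal sketch of the two filter payloads, assuming Qdrant's REST-style filter shape (the exact client API is not shown):

```python
def build_memory_filter(user_id: str) -> dict:
    """Filter matching the current user's memories plus global
    knowledge (user_id="0"), for RAG retrieval."""
    return {
        "should": [
            {"key": "user_id", "match": {"value": user_id}},
            {"key": "user_id", "match": {"value": "0"}},
        ]
    }

def build_gdpr_delete_filter(user_id: str) -> dict:
    """Filter selecting every point belonging to one user, so a
    deletion request removes all of that user's vectors at once."""
    return {"must": [{"key": "user_id", "match": {"value": user_id}}]}
```

Because every point is indexed by `user_id`, deletion is a single filtered call rather than a scan.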
PostgreSQL
- Iceberg catalog metadata
- User accounts and license info (gateway)
- Per-user data lives in user containers
MinIO
- S3-compatible object storage
- Iceberg table data
- Flink checkpoints
- User file uploads
Data Flow Patterns
Historical Data Query (Async)
1. Client → Gateway → User Container MCP: User requests data
2. Gateway → Relay (REQ/REP): Submit historical request
3. Relay → Ingestors (PUB/SUB): Broadcast work with exchange prefix
4. Ingestor → Exchange API: Fetch data
5. Ingestor → Kafka: Write OHLC batch with metadata
6. Flink → Kafka: Read, process, dedupe
7. Flink → Iceberg: Write to table
8. Flink → Relay (PUB): Publish HistoryReadyNotification
9. Relay → Client (SUB): Notification delivered
10. Client → Iceberg: Query data directly
Key Design:
- Client subscribes to notification topic BEFORE submitting request (prevents race)
- Notification topics are deterministic: `RESPONSE:{client_id}` or `HISTORY_READY:{request_id}`
- No state in Relay (fully topic-based routing)
See protocol#Historical Data Query Flow for details.
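The subscribe-before-request ordering can be illustrated with an in-memory stand-in for the relay (`FakeRelay` and `request_history` are purely illustrative; the real transport is ZMQ XPUB/XSUB):

```python
class FakeRelay:
    """In-memory model of the relay: topic-based fanout, no state."""
    def __init__(self):
        self.subs = {}  # topic -> list of subscriber inboxes

    def subscribe(self, topic, inbox):
        self.subs.setdefault(topic, []).append(inbox)

    def publish(self, topic, msg):
        # Fire-and-forget: delivered only to current subscribers
        for inbox in self.subs.get(topic, []):
            inbox.append((topic, msg))

def request_history(relay, request_id):
    """Subscribe to the deterministic notification topic BEFORE
    submitting the request, so a fast notification cannot be lost."""
    inbox = []
    relay.subscribe(f"HISTORY_READY:{request_id}", inbox)
    # ...submit the historical request to the relay here...
    return inbox
```

If the request were submitted first, a notification published before the subscription landed would be silently dropped, since the relay keeps no state.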
Realtime Market Data
1. Ingestor → Kafka: Write realtime ticks
2. Flink → Kafka: Read and aggregate OHLC
3. Flink → Relay (PUB): Publish market data
4. Relay → Clients (XPUB/SUB): Fanout to subscribers
Topic Format: `{ticker}|{data_type}` (e.g., `BINANCE:BTC/USDT|tick`)
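A small helper pair for building and parsing these topics (a sketch; it assumes the symbol may contain `/` and `:` but not `|`):

```python
def make_topic(exchange: str, symbol: str, data_type: str) -> str:
    """Build a market-data topic like BINANCE:BTC/USDT|tick."""
    return f"{exchange}:{symbol}|{data_type}"

def parse_topic(topic: str) -> tuple[str, str, str]:
    """Split a topic back into (exchange, symbol, data_type)."""
    ticker, data_type = topic.rsplit("|", 1)   # data type after the last |
    exchange, symbol = ticker.split(":", 1)    # exchange before the first :
    return exchange, symbol, data_type
```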
User Events
User containers emit events (order fills, alerts) that must reach users reliably.
Dual-Channel Design:
- Informational Events (XPUB/SUB):
  - Container tracks active subscriptions via XPUB
  - Publishes only if someone is listening
  - Zero latency, fire-and-forget
- Critical Events (DEALER/ROUTER):
  - Container sends to gateway ROUTER with event ID
  - Gateway delivers via Telegram/email/push
  - Gateway sends EventAck back to container
  - Container retries on timeout
  - Persisted to disk on shutdown
See user_container_events for implementation.
Container Lifecycle
Creation (gateway_container_creation)
User authenticates → Gateway checks if deployment exists
→ If missing, create from template (based on license tier)
→ Wait for ready (2min timeout)
→ Return MCP endpoint
Templates by Tier:
| Tier | Memory | CPU | Storage | Idle Timeout |
|---|---|---|---|---|
| Free | 512Mi | 500m | 1Gi | 15min |
| Pro | 2Gi | 2000m | 10Gi | 60min |
| Enterprise | 4Gi | 4000m | 50Gi | Never |
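The table above can be expressed as a simple template lookup (a sketch; real templates would also carry image, environment, and PVC settings):

```python
# Resource templates by license tier (from the table above);
# idle_timeout_min=None means the container never idles out.
TIER_TEMPLATES = {
    "free":       {"memory": "512Mi", "cpu": "500m",  "storage": "1Gi",  "idle_timeout_min": 15},
    "pro":        {"memory": "2Gi",   "cpu": "2000m", "storage": "10Gi", "idle_timeout_min": 60},
    "enterprise": {"memory": "4Gi",   "cpu": "4000m", "storage": "50Gi", "idle_timeout_min": None},
}

def container_template(tier: str) -> dict:
    """Select the pod template for a user's license tier."""
    try:
        return TIER_TEMPLATES[tier]
    except KeyError:
        raise ValueError(f"unknown license tier: {tier!r}")
```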
Lifecycle Management (container_lifecycle_management)
Idle Detection:
- Container is idle when: no active triggers + no recent MCP activity
- Lifecycle manager tracks:
- MCP tool/resource calls (reset idle timer)
- Active triggers (data subscriptions, CEP patterns)
Shutdown:
- On idle timeout: exit with code 42
- Lifecycle sidecar detects exit code 42
- Sidecar calls k8s API to delete deployment
- Optionally deletes PVC (anonymous users only)
Security:
- Sidecar has RBAC to delete its own deployment only
- Cannot delete other deployments or access other namespaces
- Gateway cannot delete deployments (separation of concerns)
Security Architecture
Network Isolation
NetworkPolicies:
- User containers:
  - ✅ Connect to gateway (MCP)
  - ✅ Connect to relay (market data)
  - ✅ Outbound HTTPS (exchanges, LLM APIs)
  - ❌ No k8s API access
  - ❌ No system namespace access
  - ❌ No inter-user communication
- Gateway:
  - ✅ k8s API (create containers)
  - ✅ User containers (MCP client)
  - ✅ Infrastructure (Postgres, Redis)
  - ✅ Outbound (Anthropic API)
RBAC
Gateway ServiceAccount:
- Create deployments/services/PVCs in `dexorder-agents` namespace
- Read pod status and logs
- Cannot delete, exec, or access secrets
Lifecycle Sidecar ServiceAccount:
- Delete deployments in `dexorder-agents` namespace
- Delete PVCs (conditional on user type)
- Cannot access other resources
Admission Control
All pods in the `dexorder-agents` namespace must:
- Use approved images only (allowlist)
- Run as non-root
- Drop all capabilities
- Use read-only root filesystem
- Have resource limits
See `deploy/k8s/base/admission-policy.yaml`
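An illustrative pod-spec fragment that satisfies these rules (field names are standard Kubernetes; the enforced policy itself lives in the admission manifest, and the image/limit values here are examples only):

```yaml
spec:
  securityContext:
    runAsNonRoot: true
  containers:
    - name: agent
      image: ghcr.io/dexorder/agent:latest   # must be on the image allowlist
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
      resources:
        limits:
          memory: "512Mi"
          cpu: "500m"
```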
Agent Harness Flow
The gateway's agent harness (LangChain/LangGraph) orchestrates LLM interactions with full context.
1. User sends message → Gateway (WebSocket/Telegram)
↓
2. Authenticator validates user and gets license info
↓
3. Container Manager ensures user's MCP container is running
↓
4. Agent Harness processes message:
│
├─→ a. MCPClientConnector fetches context resources from user's MCP:
│ - context://user-profile
│ - context://conversation-summary
│ - context://workspace-state
│ - context://system-prompt
│
├─→ b. RAGRetriever searches Qdrant for relevant memories:
│ - Embeds user query
│ - Searches: user_id IN (current_user, "0")
│ - Returns user-specific + global platform knowledge
│
├─→ c. Build system prompt:
│ - Base platform prompt
│ - User profile context
│ - Workspace state
│ - Custom user instructions
│ - Relevant RAG memories
│
├─→ d. ModelRouter selects LLM:
│ - Based on license tier
│ - Query complexity
│ - Routing strategy (cost/speed/quality)
│
├─→ e. LLM invocation with tool support:
│ - Send messages to LLM
│ - If tool calls requested:
│ • Platform tools → handled by gateway
│ • User tools → proxied to MCP container
│ - Loop until no more tool calls
│
├─→ f. Save conversation to MCP:
│ - mcp.callTool('save_message', user_message)
│ - mcp.callTool('save_message', assistant_message)
│
└─→ g. Return response to user via channel
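The tool loop in step (e) can be sketched as follows; `llm` and `mcp_call` are hypothetical callables standing in for the real LLM client and the MCP proxy:

```python
def run_tool_loop(llm, messages, platform_tools, mcp_call):
    """Invoke the LLM, dispatch requested tool calls (platform tools
    locally, user tools via MCP), and loop until the LLM returns a
    plain response with no further tool calls."""
    while True:
        reply = llm(messages)
        tool_calls = reply.get("tool_calls", [])
        if not tool_calls:
            return reply["content"]
        for call in tool_calls:
            name, args = call["name"], call["args"]
            if name in platform_tools:
                result = platform_tools[name](**args)   # handled by gateway
            else:
                result = mcp_call(name, args)           # proxied to user's MCP server
            messages.append({"role": "tool", "name": name, "content": result})
```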
Key Architecture:
- Gateway is stateless: No conversation history stored in gateway
- User context in MCP: All user-specific data lives in user's container
- Global knowledge in Qdrant: Platform documentation loaded from `gateway/knowledge/`
- RAG at gateway level: Semantic search combines global + user knowledge
- Skills vs Subagents:
- Skills: Well-defined, single-purpose tasks
- Subagents: Complex domain expertise with multi-file context
- Workflows: LangGraph state machines for multi-step processes
See agent_harness for detailed implementation.
Configuration Management
All services use dual YAML files:
- `config.yaml` - Non-sensitive configuration (mounted from ConfigMap)
- `secrets.yaml` - Credentials and tokens (mounted from Secret)
Environment Variables:
- K8s downward API for pod metadata
- Service discovery via DNS (e.g., `kafka:9092`)
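Combining the two files comes down to a nested merge in which secret values win. A minimal sketch (parsing the YAML files themselves would typically use a library such as PyYAML, not shown here):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Merge secrets.yaml values over config.yaml values: nested dicts
    are merged key-by-key, scalars in `override` win, and neither
    input dict is mutated."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged
```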
Deployment
Development
# Start local k8s
minikube start
# Apply infrastructure
kubectl apply -k deploy/k8s/dev
# Build and load images
docker build -t dexorder/gateway:latest gateway/
minikube image load dexorder/gateway:latest
# Port-forward for access
kubectl port-forward -n dexorder-system svc/gateway 3000:3000
Production
# Apply production configs
kubectl apply -k deploy/k8s/prod
# Push images to registry
docker push ghcr.io/dexorder/gateway:latest
docker push ghcr.io/dexorder/agent:latest
docker push ghcr.io/dexorder/lifecycle-sidecar:latest
Namespaces:
- `dexorder-system` - Platform services (gateway, infrastructure)
- `dexorder-agents` - User containers (isolated)
Observability
Metrics (Prometheus)
- Container creation/deletion rates
- Idle shutdown counts
- MCP call latency and errors
- Event delivery rates and retries
- Kafka lag and throughput
- Flink checkpoint duration
Logging
- Structured JSON logs
- User ID in all agent logs
- Aggregated via Loki or CloudWatch
Tracing
- OpenTelemetry spans across gateway → MCP → LLM
- User-scoped traces for debugging
Scalability
Horizontal Scaling
Stateless Components:
- Gateway: Add replicas behind load balancer
- Relay: Single instance (stateless router)
- Ingestors: Scale by exchange workload
Stateful Components:
- Flink: Scale TaskManagers
- User containers: One per user (1000s of pods)
Bottlenecks:
- Flink → Relay ZMQ: Requires discovery protocol (see protocol#TODO: Flink-to-Relay ZMQ Discovery)
- Kafka: Partition by symbol for parallelism
- Iceberg: Partition by date/symbol
Cost Optimization
Tiered Resources:
- Free users: Aggressive idle shutdown (15min)
- Pro users: Longer timeout (60min)
- Enterprise: Always-on containers
Storage:
- PVC deletion for anonymous users
- Tiered storage classes (fast SSD → cheap HDD)
LLM Costs:
- Rate limiting per license tier
- Caching of MCP resources (1-5min TTL)
- Conversation summarization to reduce context size
Development Roadmap
See backend_redesign for detailed notes.
Phase 1: Foundation (Complete)
- Gateway with k8s integration
- User container provisioning
- MCP protocol implementation
- Basic market data pipeline
Phase 2: Data Pipeline (In Progress)
- Kafka topic schemas
- Flink jobs for aggregation
- Iceberg integration
- Historical backfill service
Phase 3: Agent Features
- RAG integration (Qdrant)
- Strategy backtesting
- Risk management tools
- Portfolio analytics
Phase 4: Production Hardening
- Multi-region deployment
- HA for infrastructure
- Comprehensive monitoring
- Performance optimization
Related Documentation
- protocol - ZMQ message protocols and data flow
- gateway_container_creation - Dynamic container provisioning
- container_lifecycle_management - Idle shutdown and cleanup
- user_container_events - Event system implementation
- agent_harness - LLM orchestration flow
- m_c_p_tools_architecture - User MCP tools specification
- user_mcp_resources - Context resources and RAG
- m_c_p_client_authentication_modes - MCP authentication patterns
- backend_redesign - Design notes and TODO items