# DexOrder AI Platform Architecture

## Overview

DexOrder is an AI-powered trading platform that combines real-time market data processing, user-specific AI agents, and a flexible data pipeline. The system is designed for scalability, isolation, and extensibility.

## High-Level Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                          User Clients                           │
│               (Web, Mobile, Telegram, External MCP)             │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                             Gateway                             │
│  • WebSocket/HTTP/Telegram handlers                             │
│  • Authentication & session management                          │
│  • Agent Harness (LangChain/LangGraph orchestration)            │
│    - MCP client connector to user containers                    │
│    - RAG retriever (Qdrant)                                     │
│    - Model router (LLM selection)                               │
│    - Skills & subagents framework                               │
│  • Dynamic user container provisioning                          │
│  • Event routing (informational & critical)                     │
└────────┬──────────────────┬────────────────────┬────────────────┘
         │                  │                    │
         ▼                  ▼                    ▼
┌──────────────────┐  ┌──────────────┐  ┌──────────────────────┐
│  User Containers │  │    Relay     │  │    Infrastructure    │
│  (per-user pods) │  │ (ZMQ Router) │  │ • DragonflyDB (cache)│
│                  │  │              │  │ • Qdrant (vectors)   │
│  • MCP Server    │  │ • Market data│  │ • PostgreSQL (meta)  │
│  • User files:   │  │   fanout     │  │ • MinIO (S3)         │
│    - Indicators  │  │ • Work queue │  │                      │
│    - Strategies  │  │ • Stateless  │  │                      │
│    - Preferences │  │              │  │                      │
│ • Event Publisher│  │              │  │                      │
│ • Lifecycle Mgmt │  │              │  │                      │
└──────────────────┘  └──────┬───────┘  └──────────────────────┘
                             │
              ┌──────────────┴──────────────┐
              │                             │
              ▼                             ▼
    ┌──────────────────┐        ┌──────────────────────┐
    │    Ingestors     │        │    Flink Cluster     │
    │ • CCXT adapters  │        │ • Deduplication      │
    │ • Exchange APIs  │        │ • OHLC aggregation   │
    │ • Push to Kafka  │        │ • CEP engine         │
    └────────┬─────────┘        │ • Writes to Iceberg  │
             │                  │ • Market data PUB    │
             │                  └──────────┬───────────┘
             ▼                             │
┌─────────────────────────────────────────▼────────────┐
│                        Kafka                         │
│  • Durable append log                                │
│  • Topic-based streams                               │
│  • Event sourcing                                    │
└──────────────────────┬───────────────────────────────┘
                       │
                       ▼
              ┌─────────────────┐
              │ Iceberg Catalog │
              │ • Historical    │
              │   OHLC storage  │
              │ • Query API     │
              └─────────────────┘
```

## Core Components

### 1. Gateway

**Location:** `gateway/`
**Language:** TypeScript (Node.js)
**Purpose:** Entry point for all user interactions

**Responsibilities:**
- **Authentication:** JWT tokens, Telegram OAuth, multi-tier licensing
- **Session Management:** WebSocket connections, Telegram webhooks, multi-channel support
- **Container Orchestration:** Dynamic provisioning of user agent pods ([[gateway_container_creation]])
- **Event Handling:**
  - Subscribe to user container events (XPUB/SUB for informational)
  - Route critical events (ROUTER/DEALER with ack) ([[user_container_events]])
- **Agent Harness (LangChain/LangGraph):** ([[agent_harness]])
  - Stateless LLM orchestration
  - MCP client connector to user containers
  - RAG retrieval from Qdrant (global + user-specific knowledge)
  - Model routing based on license tier and complexity
  - Skills and subagents framework
  - Workflow state machines with validation loops

**Key Features:**
- **Stateless design:** All conversation state lives in user containers or Qdrant
- **Multi-channel support:** WebSocket, Telegram (future: mobile, Discord, Slack)
- **Kubernetes-native:** Uses the k8s API for container management
- **Three-tier memory:**
  - Redis: Hot storage, active sessions, LangGraph checkpoints (1-hour TTL)
  - Qdrant: Vector search, RAG, global + user knowledge, GDPR-compliant
  - Iceberg: Cold storage, full history, analytics, time-travel queries

**Infrastructure:**
- Deployed in the `dexorder-system` namespace
- RBAC: Can create but not delete user containers
- Network policies: Access to the k8s API, user containers, and infrastructure

---

### 2. User Containers

**Location:** `client-py/`
**Language:** Python
**Purpose:** Per-user isolated workspace and data storage

**Architecture:**
- One pod per user (auto-provisioned by the gateway)
- Persistent storage (PVC) for user data
- Multi-container pod:
  - **Agent container:** MCP server + event publisher + user files
  - **Lifecycle sidecar:** Auto-shutdown and cleanup

**Components:**

#### MCP Server
Exposes user-specific resources and tools via the Model Context Protocol.

**Resources (Context for LLM):**
The gateway fetches these before each LLM call:
- `context://user-profile` - Trading preferences, style, risk tolerance
- `context://conversation-summary` - Recent conversation with semantic context
- `context://workspace-state` - Current chart, watchlist, positions, alerts
- `context://system-prompt` - User's custom AI instructions

**Tools (Actions with side effects):**
The gateway proxies these to the user's MCP server:
- `save_message(role, content)` - Save to conversation history
- `search_conversation(query)` - Semantic search over past conversations
- `list_strategies()`, `read_strategy(name)`, `write_strategy(name, code)`
- `list_indicators()`, `read_indicator(name)`, `write_indicator(name, code)`
- `run_backtest(strategy, params)` - Execute a backtest
- `get_watchlist()`, `execute_trade(params)`, `get_positions()`
- `run_python(code)` - Execute Python with data science libraries

**User Files:**
- `indicators/*.py` - Custom technical indicators
- `strategies/*.py` - Trading strategies with entry/exit rules
- Watchlists and preferences
- Git-versioned in the persistent volume

#### Event Publisher ([[user_container_events]])
Publishes user events (order fills, alerts, workspace changes) via dual-channel ZMQ:
- **XPUB:** Informational events (fire-and-forget to active sessions)
- **DEALER:** Critical events (guaranteed delivery with ack)

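The XPUB half of this design relies on a ZeroMQ behavior: on an XPUB socket, every subscribe arrives as a frame whose first byte is `0x01` (or `0x00` for unsubscribe) followed by the topic prefix. The sketch below shows how a publisher could track listeners from those frames; the class and method names are illustrative, not the actual `client-py` API.

```python
class SubscriptionTracker:
    """Tracks which topic prefixes currently have at least one listener."""

    def __init__(self):
        self._topics: dict[bytes, int] = {}  # topic prefix -> subscriber count

    def handle_frame(self, frame: bytes) -> None:
        """Process one subscription frame received on the XPUB socket."""
        op, topic = frame[0], frame[1:]
        if op == 1:  # subscribe
            self._topics[topic] = self._topics.get(topic, 0) + 1
        elif op == 0:  # unsubscribe
            count = self._topics.get(topic, 0) - 1
            if count > 0:
                self._topics[topic] = count
            else:
                self._topics.pop(topic, None)

    def has_listeners(self, topic: bytes) -> bool:
        """Publish an informational event only when this returns True.

        ZMQ SUB matching is by prefix, so any subscribed prefix of the
        event topic counts as a listener.
        """
        return any(topic.startswith(prefix) for prefix in self._topics)
```

With this in place the publisher can skip serializing and sending informational events entirely when no session is attached, which is what makes the channel safely fire-and-forget.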
#### Lifecycle Manager ([[container_lifecycle_management]])
Tracks activity and triggers; auto-shuts down when idle:
- Configurable idle timeouts by license tier
- Exit code 42 signals intentional shutdown
- The sidecar deletes the deployment and optionally the PVC

**Isolation:**
- Network policies: Cannot access the k8s API, other users, or system services
- PodSecurity: Non-root, read-only rootfs, dropped capabilities
- Resource limits enforced by license tier

---

### 3. Data Pipeline

#### Relay (ZMQ Router)

**Location:** `relay/`
**Language:** Rust
**Purpose:** Stateless message router for market data and requests

**Architecture:**
- Well-known bind point (all components connect to it)
- No request tracking or state
- Topic-based routing

**Channels:**
1. **Client Requests (ROUTER):** Port 5559 - Historical data requests
2. **Ingestor Work Queue (PUB):** Port 5555 - Work distribution with exchange prefix
3. **Market Data Fanout (XPUB/XSUB):** Port 5558 - Realtime data + notifications
4. **Responses (SUB → PUB proxy):** Notifications from Flink to clients

See [[protocol]] for detailed ZMQ patterns and message formats.

---

#### Ingestors

**Location:** `ingestor/`
**Language:** Python
**Purpose:** Fetch market data from exchanges

**Features:**
- CCXT-based exchange adapters
- Subscribes to the work queue via exchange prefix (e.g., `BINANCE:`)
- Writes raw data to Kafka only (no direct client responses)
- Supports realtime ticks and historical OHLC

**Data Flow:**
```
Exchange API → Ingestor → Kafka → Flink → Iceberg
                                    ↓
                    Notification → Relay → Clients
```

---

#### Kafka

**Deployment:** KRaft mode (no ZooKeeper)
**Purpose:** Durable event log and stream-processing backbone

**Topics:**
- Raw market data streams (per exchange/symbol)
- Processed OHLC data
- Notification events
- User events (orders, alerts)

**Retention:**
- Configurable per topic (default: 7 days for raw data)
- Longer retention for aggregated data

---

#### Flink

**Deployment:** JobManager + TaskManager(s)
**Purpose:** Stream processing and aggregation

**Jobs:**
1. **Deduplication:** Remove duplicate ticks from multiple ingestors
2. **OHLC Aggregation:** Build candles from tick streams
3. **CEP (Complex Event Processing):** Pattern detection and alerts
4. **Iceberg Writer:** Batch write to long-term storage
5. **Notification Publisher:** ZMQ PUB for async client notifications

**State:**
- Checkpointing to MinIO (S3-compatible)
- Exactly-once processing semantics

**Scaling:**
- Multiple TaskManagers for parallelism
- Headless service for ZMQ discovery (see [[protocol#TODO: Flink-to-Relay ZMQ Discovery]])

---

#### Apache Iceberg

**Deployment:** REST catalog with PostgreSQL backend
**Purpose:** Historical data lake for OHLC and analytics

**Features:**
- Schema evolution
- Time-travel queries
- Partitioning by date/symbol
- Efficient columnar storage (Parquet)

**Storage:** MinIO (S3-compatible object storage)

---

### 4. Infrastructure Services

#### DragonflyDB
- Redis-compatible in-memory cache
- Session state, rate limiting, hot data

#### Qdrant
- Vector database for RAG
- **Global knowledge** (user_id="0"): Platform capabilities, trading concepts, strategy patterns
- **User knowledge** (user_id=specific): Personal conversations, preferences, strategies
- GDPR-compliant (indexed by user_id for fast deletion)

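A RAG query against this layout has to match both the user's own points and the shared knowledge under user_id="0". A minimal sketch of the payload filter, written as a plain dict in the shape of Qdrant's REST filter schema (the `user_id` field name comes from the layout above; the function name and sentinel constant are illustrative):

```python
GLOBAL_USER_ID = "0"  # shared platform knowledge lives under this user_id

def knowledge_filter(user_id: str) -> dict:
    """Build a Qdrant payload filter matching user_id IN (user_id, "0")."""
    return {
        "should": [  # OR semantics: either condition may match
            {"key": "user_id", "match": {"value": user_id}},
            {"key": "user_id", "match": {"value": GLOBAL_USER_ID}},
        ]
    }
```

The same `user_id` payload index that makes this filter fast also supports the GDPR path: deleting a user is a single delete-by-filter on `user_id`.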
#### PostgreSQL
- Iceberg catalog metadata
- User accounts and license info (gateway)
- Per-user data lives in user containers

#### MinIO
- S3-compatible object storage
- Iceberg table data
- Flink checkpoints
- User file uploads

---

## Data Flow Patterns

### Historical Data Query (Async)

```
1. Client → Gateway → User Container MCP: User requests data
2. Gateway → Relay (REQ/REP): Submit historical request
3. Relay → Ingestors (PUB/SUB): Broadcast work with exchange prefix
4. Ingestor → Exchange API: Fetch data
5. Ingestor → Kafka: Write OHLC batch with metadata
6. Flink → Kafka: Read, process, dedupe
7. Flink → Iceberg: Write to table
8. Flink → Relay (PUB): Publish HistoryReadyNotification
9. Relay → Client (SUB): Notification delivered
10. Client → Iceberg: Query data directly
```

**Key Design:**
- Client subscribes to the notification topic BEFORE submitting the request (prevents a race)
- Notification topics are deterministic: `RESPONSE:{client_id}` or `HISTORY_READY:{request_id}`
- No state in the Relay (fully topic-based routing)

See [[protocol#Historical Data Query Flow]] for details.

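The subscribe-before-submit ordering can be sketched as follows. The topic string follows the deterministic `HISTORY_READY:{request_id}` format above; the socket arguments are duck-typed stand-ins for pyzmq SUB/request sockets, and the function names are illustrative.

```python
def history_topic(request_id: str) -> bytes:
    """Deterministic notification topic for one historical request."""
    return f"HISTORY_READY:{request_id}".encode()

def submit_history_request(sub_socket, req_socket,
                           request_id: str, payload: bytes) -> None:
    """Subscribe first, then submit, so the notification cannot be missed.

    If the request were sent first, a fast pipeline could publish the
    HistoryReadyNotification before the subscription is in place and the
    client would wait forever.
    """
    sub_socket.subscribe(history_topic(request_id))  # 1. open the mailbox
    req_socket.send(payload)                         # 2. only then ask
```

Because the topic is a pure function of `request_id`, any gateway replica (or a reconnecting client) can re-derive it without shared state, which is what keeps the Relay stateless.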
---

### Realtime Market Data

```
1. Ingestor → Kafka: Write realtime ticks
2. Flink → Kafka: Read and aggregate OHLC
3. Flink → Relay (PUB): Publish market data
4. Relay → Clients (XPUB/SUB): Fanout to subscribers
```

**Topic Format:** `{exchange}:{symbol}|{data_type}` (e.g., `BINANCE:BTC/USDT|tick`)

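The topic layout above can be captured in two small helpers. Only the `{exchange}:{symbol}|{data_type}` format comes from this document; the function names are illustrative.

```python
def make_topic(exchange: str, symbol: str, data_type: str) -> str:
    """Build a market-data topic, e.g. BINANCE:BTC/USDT|tick."""
    return f"{exchange}:{symbol}|{data_type}"

def parse_topic(topic: str) -> tuple[str, str, str]:
    """Split a topic back into (exchange, symbol, data_type)."""
    ticker, data_type = topic.rsplit("|", 1)  # data_type is after the last |
    exchange, symbol = ticker.split(":", 1)   # symbol itself may contain /
    return exchange, symbol, data_type
```

Since ZMQ SUB matching is by prefix, a client that subscribes to `BINANCE:BTC/USDT` receives every data type for that market, while `BINANCE:BTC/USDT|tick` narrows to ticks only.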
---

### User Events

User containers emit events (order fills, alerts) that must reach users reliably.

**Dual-Channel Design:**

1. **Informational Events (XPUB/SUB):**
   - Container tracks active subscriptions via XPUB
   - Publishes only if someone is listening
   - Low latency, fire-and-forget

2. **Critical Events (DEALER/ROUTER):**
   - Container sends to the gateway ROUTER with an event ID
   - Gateway delivers via Telegram/email/push
   - Gateway sends an EventAck back to the container
   - Container retries on timeout
   - Persisted to disk on shutdown

See [[user_container_events]] for implementation.

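The retry half of the critical channel can be sketched as a resend-until-ack loop with exponential backoff. The `send`/`ack_received` callables stand in for the DEALER socket and the EventAck check; the retry counts and delays are illustrative defaults, not values from the implementation.

```python
import time

def deliver_critical_event(send, ack_received, event_id: str,
                           max_attempts: int = 5,
                           base_delay: float = 0.5) -> bool:
    """Resend a critical event until the gateway acks it.

    Returns True once an EventAck for event_id arrives, False if all
    attempts are exhausted (at which point the event should be persisted
    to disk so it survives a shutdown).
    """
    for attempt in range(max_attempts):
        send(event_id)                           # (re)send over DEALER
        if ack_received(event_id):               # EventAck from gateway
            return True
        time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    return False
```

Deduplication by `event_id` on the gateway side is what makes this at-least-once loop safe: a duplicate resend is delivered at most once to the user.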
---

## Container Lifecycle

### Creation ([[gateway_container_creation]])

```
User authenticates → Gateway checks if deployment exists
  → If missing, create from template (based on license tier)
  → Wait for ready (2min timeout)
  → Return MCP endpoint
```

**Templates by Tier:**

| Tier | Memory | CPU | Storage | Idle Timeout |
|------|--------|-----|---------|--------------|
| Free | 512Mi | 500m | 1Gi | 15min |
| Pro | 2Gi | 2000m | 10Gi | 60min |
| Enterprise | 4Gi | 4000m | 50Gi | Never |

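A tier lookup mirroring the table above might look like this. The resource values come from the table; the dict layout and function name are illustrative (the gateway itself is TypeScript), and `None` stands for the "Never" idle timeout.

```python
TIER_TEMPLATES = {
    "free":       {"memory": "512Mi", "cpu": "500m",  "storage": "1Gi",
                   "idle_timeout_min": 15},
    "pro":        {"memory": "2Gi",   "cpu": "2000m", "storage": "10Gi",
                   "idle_timeout_min": 60},
    "enterprise": {"memory": "4Gi",   "cpu": "4000m", "storage": "50Gi",
                   "idle_timeout_min": None},  # never shut down for idleness
}

def template_for(tier: str) -> dict:
    """Resolve the pod resource template for a license tier.

    Unknown tiers fall back to the cheapest (free) template so a bad
    license record can never over-provision.
    """
    return TIER_TEMPLATES.get(tier.lower(), TIER_TEMPLATES["free"])
```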
---

### Lifecycle Management ([[container_lifecycle_management]])

**Idle Detection:**
- A container is idle when: no active triggers + no recent MCP activity
- The lifecycle manager tracks:
  - MCP tool/resource calls (reset the idle timer)
  - Active triggers (data subscriptions, CEP patterns)

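The idle rule above (no active triggers AND no recent MCP activity) can be sketched as a small tracker. Clock injection keeps the logic testable; the class and method names are illustrative, not the actual `client-py` API.

```python
import time

class IdleTracker:
    """A container is idle only when it has no active triggers and its
    last MCP activity is older than the configured timeout."""

    def __init__(self, idle_timeout_s: float, now=None):
        self._now = now or time.monotonic   # injectable clock for tests
        self._timeout = idle_timeout_s
        self._last_activity = self._now()
        self._active_triggers: set[str] = set()

    def record_mcp_call(self) -> None:
        """Any MCP tool/resource call resets the idle timer."""
        self._last_activity = self._now()

    def add_trigger(self, trigger_id: str) -> None:
        self._active_triggers.add(trigger_id)

    def remove_trigger(self, trigger_id: str) -> None:
        self._active_triggers.discard(trigger_id)

    def is_idle(self) -> bool:
        return (not self._active_triggers
                and self._now() - self._last_activity >= self._timeout)
```

When `is_idle()` turns true, the container exits with code 42 and the sidecar takes over cleanup, as described in the Shutdown steps.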
**Shutdown:**
- On idle timeout: exit with code 42
- The lifecycle sidecar detects exit code 42
- The sidecar calls the k8s API to delete the deployment
- Optionally deletes the PVC (anonymous users only)

**Security:**
- The sidecar has RBAC to delete its own deployment only
- Cannot delete other deployments or access other namespaces
- The gateway cannot delete deployments (separation of concerns)

---

## Security Architecture

### Network Isolation

**NetworkPolicies:**
- User containers:
  - ✅ Connect to gateway (MCP)
  - ✅ Connect to relay (market data)
  - ✅ Outbound HTTPS (exchanges, LLM APIs)
  - ❌ No k8s API access
  - ❌ No system namespace access
  - ❌ No inter-user communication

- Gateway:
  - ✅ k8s API (create containers)
  - ✅ User containers (MCP client)
  - ✅ Infrastructure (Postgres, Redis)
  - ✅ Outbound (Anthropic API)

---

### RBAC

**Gateway ServiceAccount:**
- Create deployments/services/PVCs in the `dexorder-agents` namespace
- Read pod status and logs
- Cannot delete, exec, or access secrets

**Lifecycle Sidecar ServiceAccount:**
- Delete deployments in the `dexorder-agents` namespace
- Delete PVCs (conditional on user type)
- Cannot access other resources

---

### Admission Control

All pods in the `dexorder-agents` namespace must:
- Use approved images only (allowlist)
- Run as non-root
- Drop all capabilities
- Use a read-only root filesystem
- Have resource limits

See `deploy/k8s/base/admission-policy.yaml`.

---

## Agent Harness Flow

The gateway's agent harness (LangChain/LangGraph) orchestrates LLM interactions with full context.

```
1. User sends message → Gateway (WebSocket/Telegram)
         ↓
2. Authenticator validates user and gets license info
         ↓
3. Container Manager ensures user's MCP container is running
         ↓
4. Agent Harness processes message:
   │
   ├─→ a. MCPClientConnector fetches context resources from user's MCP:
   │      - context://user-profile
   │      - context://conversation-summary
   │      - context://workspace-state
   │      - context://system-prompt
   │
   ├─→ b. RAGRetriever searches Qdrant for relevant memories:
   │      - Embeds user query
   │      - Searches: user_id IN (current_user, "0")
   │      - Returns user-specific + global platform knowledge
   │
   ├─→ c. Build system prompt:
   │      - Base platform prompt
   │      - User profile context
   │      - Workspace state
   │      - Custom user instructions
   │      - Relevant RAG memories
   │
   ├─→ d. ModelRouter selects LLM:
   │      - Based on license tier
   │      - Query complexity
   │      - Routing strategy (cost/speed/quality)
   │
   ├─→ e. LLM invocation with tool support:
   │      - Send messages to LLM
   │      - If tool calls requested:
   │        • Platform tools → handled by gateway
   │        • User tools → proxied to MCP container
   │      - Loop until no more tool calls
   │
   ├─→ f. Save conversation to MCP:
   │      - mcp.callTool('save_message', user_message)
   │      - mcp.callTool('save_message', assistant_message)
   │
   └─→ g. Return response to user via channel
```

**Key Architecture:**
- **Gateway is stateless:** No conversation history stored in the gateway
- **User context in MCP:** All user-specific data lives in the user's container
- **Global knowledge in Qdrant:** Platform documentation loaded from `gateway/knowledge/`
- **RAG at gateway level:** Semantic search combines global + user knowledge
- **Skills vs Subagents:**
  - Skills: Well-defined, single-purpose tasks
  - Subagents: Complex domain expertise with multi-file context
- **Workflows:** LangGraph state machines for multi-step processes

See [[agent_harness]] for detailed implementation.

---

## Configuration Management

All services use dual YAML files:
- `config.yaml` - Non-sensitive configuration (mounted from a ConfigMap)
- `secrets.yaml` - Credentials and tokens (mounted from a Secret)

**Environment Variables:**
- K8s downward API for pod metadata
- Service discovery via DNS (e.g., `kafka:9092`)

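One way to combine the two files is a deep merge where values from `secrets.yaml` overlay `config.yaml`. In a pod the two dicts would come from `yaml.safe_load()` on the mounted files; the sketch below shows just the merge rule on plain dicts, and the function name is illustrative.

```python
def deep_merge(base: dict, overlay: dict) -> dict:
    """Recursively merge overlay into base; overlay wins on conflicts.

    Neither input is mutated, so the same base config can be merged
    against different secrets in tests.
    """
    merged = dict(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # recurse into maps
        else:
            merged[key] = value                           # overlay wins
    return merged
```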
---

## Deployment

### Development

```bash
# Start local k8s
minikube start

# Apply infrastructure
kubectl apply -k deploy/k8s/dev

# Build and load images
docker build -t dexorder/gateway:latest gateway/
minikube image load dexorder/gateway:latest

# Port-forward for access
kubectl port-forward -n dexorder-system svc/gateway 3000:3000
```

---

### Production

```bash
# Push images to registry (before applying, so pods can pull them)
docker push ghcr.io/dexorder/gateway:latest
docker push ghcr.io/dexorder/agent:latest
docker push ghcr.io/dexorder/lifecycle-sidecar:latest

# Apply production configs
kubectl apply -k deploy/k8s/prod
```

**Namespaces:**
- `dexorder-system` - Platform services (gateway, infrastructure)
- `dexorder-agents` - User containers (isolated)

---

## Observability

### Metrics (Prometheus)
- Container creation/deletion rates
- Idle shutdown counts
- MCP call latency and errors
- Event delivery rates and retries
- Kafka lag and throughput
- Flink checkpoint duration

### Logging
- Structured JSON logs
- User ID in all agent logs
- Aggregated via Loki or CloudWatch

### Tracing
- OpenTelemetry spans across gateway → MCP → LLM
- User-scoped traces for debugging

---

## Scalability

### Horizontal Scaling

**Stateless Components:**
- Gateway: Add replicas behind a load balancer
- Relay: Single instance (stateless router)
- Ingestors: Scale by exchange workload

**Stateful Components:**
- Flink: Scale TaskManagers
- User containers: One per user (thousands of pods)

**Bottlenecks:**
- Flink → Relay ZMQ: Requires a discovery protocol (see [[protocol#TODO: Flink-to-Relay ZMQ Discovery]])
- Kafka: Partition by symbol for parallelism
- Iceberg: Partition by date/symbol

---

### Cost Optimization

**Tiered Resources:**
- Free users: Aggressive idle shutdown (15min)
- Pro users: Longer timeout (60min)
- Enterprise: Always-on containers

**Storage:**
- PVC deletion for anonymous users
- Tiered storage classes (fast SSD → cheap HDD)

**LLM Costs:**
- Rate limiting per license tier
- Caching of MCP resources (1-5min TTL)
- Conversation summarization to reduce context size

---

## Development Roadmap

See [[backend_redesign]] for detailed notes.

**Phase 1: Foundation (Complete)**
- Gateway with k8s integration
- User container provisioning
- MCP protocol implementation
- Basic market data pipeline

**Phase 2: Data Pipeline (In Progress)**
- Kafka topic schemas
- Flink jobs for aggregation
- Iceberg integration
- Historical backfill service

**Phase 3: Agent Features**
- RAG integration (Qdrant)
- Strategy backtesting
- Risk management tools
- Portfolio analytics

**Phase 4: Production Hardening**
- Multi-region deployment
- HA for infrastructure
- Comprehensive monitoring
- Performance optimization

---

## Related Documentation

- [[protocol]] - ZMQ message protocols and data flow
- [[gateway_container_creation]] - Dynamic container provisioning
- [[container_lifecycle_management]] - Idle shutdown and cleanup
- [[user_container_events]] - Event system implementation
- [[agent_harness]] - LLM orchestration flow
- [[m_c_p_tools_architecture]] - User MCP tools specification
- [[user_mcp_resources]] - Context resources and RAG
- [[m_c_p_client_authentication_modes]] - MCP authentication patterns
- [[backend_redesign]] - Design notes and TODO items