feat: add @tag model override support and remove Qdrant dependencies

- Add model-tags parser for @tag syntax in chat messages
- Support Anthropic models (Sonnet, Haiku, Opus) via @tag
- Remove Qdrant vector database from infrastructure and configs
- Simplify license model config to use null fallbacks
- Add greeting stream after model switch via @tag
- Fix protobuf field names to camelCase for v7 compatibility
- Add 429 rate limit retry logic with exponential backoff
- Remove RAG references from agent harness documentation
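A rough sketch of how the @tag parser described above might work (the tag names and the Haiku/Opus model IDs are illustrative assumptions; only `claude-sonnet-4-6` appears in this repo's config, and the actual implementation may differ):

```typescript
// Hypothetical tag-to-model mapping; model IDs other than
// claude-sonnet-4-6 are assumptions for illustration.
const MODEL_TAGS: Record<string, string> = {
  sonnet: "claude-sonnet-4-6",
  haiku: "claude-haiku-4-5",
  opus: "claude-opus-4-1",
};

interface ParsedMessage {
  model?: string; // resolved model ID, if a known tag was found
  text: string;   // message with the leading @tag stripped
}

// Match a leading "@tag " token, case-insensitively; unknown tags
// are left in the message untouched.
function parseModelTag(message: string): ParsedMessage {
  const match = message.match(/^@(\w+)\s*/);
  if (match) {
    const model = MODEL_TAGS[match[1].toLowerCase()];
    if (model) {
      return { model, text: message.slice(match[0].length) };
    }
  }
  return { text: message };
}
```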
2026-04-27 20:55:18 -04:00
parent 6f937f9e5e
commit d41fcd0499
50 changed files with 956 additions and 798 deletions


@@ -58,7 +58,6 @@ Multi-channel gateway with agent harness for the Dexorder AI platform.
- **Streaming responses**: Real-time chat with WebSocket and Telegram
- **Complex workflows**: LangGraph for stateful trading analysis (backtest → risk → approval)
- **Agent harness**: Stateless orchestrator (all context lives in user's MCP container)
- **MCP resource integration**: User's RAG, conversation history, and preferences
## Container Management
@@ -91,9 +90,7 @@ Containers self-manage their lifecycle using the lifecycle sidecar (see `../life
- OpenAI GPT
- Google Gemini
- OpenRouter (one key for 300+ models)
- Ollama (for embeddings): https://ollama.com/download
- Redis (for session/hot storage)
- Qdrant (for RAG vector search)
- Kafka + Flink + Iceberg (for durable storage)
### Development
@@ -123,20 +120,7 @@ DEFAULT_MODEL_PROVIDER=anthropic
DEFAULT_MODEL=claude-sonnet-4-6
```
4. Start Ollama and pull embedding model:
```bash
# Install Ollama (one-time): https://ollama.com/download
# Or with Docker: docker run -d -p 11434:11434 ollama/ollama
# Pull the all-minilm embedding model (90MB, CPU-friendly)
ollama pull all-minilm
# Alternative models:
# ollama pull nomic-embed-text # 8K context length
# ollama pull mxbai-embed-large # Higher accuracy, slower
```
5. Run development server:
4. Run development server:
```bash
npm run dev
```
@@ -217,138 +201,6 @@ ws.send(JSON.stringify({
**`GET /health`**
- Returns server health status
## Ollama Deployment Options
The gateway requires Ollama for embedding generation in RAG queries. You have two deployment options:
### Option 1: Ollama in Gateway Container (Recommended for simplicity)
Install Ollama directly in the gateway container. This keeps all dependencies local and simplifies networking.
**Dockerfile additions:**
```dockerfile
FROM node:22-slim
# node:22-slim ships without curl; the Ollama install script needs it
RUN apt-get update && apt-get install -y --no-install-recommends curl ca-certificates \
    && rm -rf /var/lib/apt/lists/*
# Install Ollama
RUN curl -fsSL https://ollama.com/install.sh | sh
# Pull the embedding model at build time so the image starts ready
RUN ollama serve & \
    sleep 5 && \
    ollama pull all-minilm && \
    pkill ollama
# ... rest of your gateway Dockerfile
```
**Start script (entrypoint.sh):**
```bash
#!/bin/bash
# Start Ollama in background
ollama serve &
# Wait until Ollama answers before starting the gateway
until curl -sf http://localhost:11434/ >/dev/null; do sleep 1; done
# Start gateway in the foreground so it receives container signals
exec node dist/main.js
```
**Pros:**
- Simple networking (localhost:11434)
- No extra K8s resources
- Self-contained deployment
**Cons:**
- Larger container image (~200MB extra)
- CPU/memory shared with gateway process
**Resource requirements:**
- ~200 MB additional memory
- ~0.2 additional CPU cores for embedding inference
### Option 2: Ollama as Separate Pod/Sidecar
Deploy Ollama as a separate container in the same pod (sidecar) or as its own deployment.
**K8s Deployment (sidecar pattern):**
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gateway
spec:
  selector:
    matchLabels:
      app: gateway
  template:
    metadata:
      labels:
        app: gateway
    spec:
      containers:
        - name: gateway
          image: ghcr.io/dexorder/gateway:latest
          env:
            - name: OLLAMA_URL
              value: http://localhost:11434
        - name: ollama
          image: ollama/ollama:latest
          command: ["/bin/sh", "-c"]
          args:
            - |
              ollama serve &
              sleep 5
              ollama pull all-minilm
              wait
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
```
**K8s Deployment (separate service):**
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          # ... same as above
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
```
Gateway `.env`:
```bash
OLLAMA_URL=http://ollama:11434
```
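Whichever option you choose, the gateway reaches Ollama at `OLLAMA_URL`. A minimal sketch of an embedding call against Ollama's `/api/embeddings` endpoint (the `embedText` and `embeddingRequest` helpers are hypothetical, not part of the gateway codebase):

```typescript
// Assumed helper names; endpoint and payload shape follow
// Ollama's /api/embeddings API ({ model, prompt } -> { embedding }).
const OLLAMA_URL = process.env.OLLAMA_URL ?? "http://localhost:11434";

// Pure request builder, separated out so it is easy to unit-test.
function embeddingRequest(model: string, prompt: string): { url: string; body: string } {
  return {
    url: `${OLLAMA_URL}/api/embeddings`,
    body: JSON.stringify({ model, prompt }),
  };
}

// Send the request and return the embedding vector.
async function embedText(prompt: string): Promise<number[]> {
  const { url, body } = embeddingRequest("all-minilm", prompt);
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body,
  });
  if (!res.ok) throw new Error(`Ollama embedding failed: ${res.status}`);
  const data = (await res.json()) as { embedding: number[] };
  return data.embedding;
}
```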
**Pros:**
- Isolated resource limits
- Can scale separately
- Easier to monitor/debug
**Cons:**
- More K8s resources
- Network hop (minimal latency)
- More complex deployment
### Recommendation
For most deployments: **Use Option 1 (in-container)** for simplicity, unless you need to:
- Share Ollama across multiple services
- Scale embedding inference independently
- Run Ollama on GPU nodes (gateway on CPU nodes)
## TODO