# Dexorder Gateway
Multi-channel gateway with agent harness for the Dexorder AI platform.
## Architecture
```
┌─────────────────────────────────────────────────────────┐
│                    Platform Gateway                     │
│                   (Node.js/Fastify)                     │
│                                                         │
│  ┌────────────────────────────────────────────────┐     │
│  │ Channels                                       │     │
│  │  - WebSocket (/ws/chat)                        │     │
│  │  - Telegram Webhook (/webhook/telegram)        │     │
│  └────────────────────────────────────────────────┘     │
│                            ↕                            │
│  ┌────────────────────────────────────────────────┐     │
│  │ Authenticator                                  │     │
│  │  - JWT verification (WebSocket)                │     │
│  │  - Channel linking (Telegram)                  │     │
│  │  - User license lookup (PostgreSQL)            │     │
│  └────────────────────────────────────────────────┘     │
│                            ↕                            │
│  ┌────────────────────────────────────────────────┐     │
│  │ Agent Harness (per-session)                    │     │
│  │  - Claude API integration                      │     │
│  │  - MCP client connector                        │     │
│  │  - Conversation state                          │     │
│  └────────────────────────────────────────────────┘     │
│                            ↕                            │
│  ┌────────────────────────────────────────────────┐     │
│  │ MCP Client                                     │     │
│  │  - User container connection                   │     │
│  │  - Tool routing                                │     │
│  └────────────────────────────────────────────────┘     │
└─────────────────────────────────────────────────────────┘
                             ↕
             ┌───────────────────────────────┐
             │ User MCP Server (Python)      │
             │  - Strategies, indicators     │
             │  - Memory, preferences       │
             │  - Backtest sandbox           │
             └───────────────────────────────┘
```
## Features
- **Automatic container provisioning**: Creates user agent containers on-demand via Kubernetes
- **Multi-channel support**: WebSocket and Telegram webhooks
- **Per-channel authentication**: JWT for web, channel linking for chat apps
- **User license management**: Feature flags and resource limits from PostgreSQL
- **Container lifecycle management**: Auto-shutdown on idle (handled by container sidecar)
- **License-based resources**: Different memory/CPU/storage limits per tier
- **Multi-model LLM support**: Anthropic Claude, OpenAI GPT, Google Gemini, OpenRouter (300+ models)
- **Zero vendor lock-in**: Switch models with one line, powered by LangChain.js
- **Intelligent routing**: Auto-select models based on complexity, license tier, or user preference
- **Streaming responses**: Real-time chat with WebSocket and Telegram
- **Complex workflows**: LangGraph for stateful trading analysis (backtest → risk → approval)
- **Agent harness**: Stateless orchestrator (all context lives in user's MCP container)
- **MCP resource integration**: User's RAG, conversation history, and preferences
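The "intelligent routing" feature can be pictured as a small decision function. This is a hypothetical sketch, not the gateway's actual routing table: the model IDs, the 0–1 `complexity` score, and the threshold are all illustrative.

```typescript
// Hypothetical routing rule: pick a model based on license tier and a
// rough task-complexity score in [0, 1]. Model IDs and the 0.7 threshold
// are illustrative assumptions, not the gateway's real configuration.
type Tier = "free" | "pro" | "enterprise";

interface ModelChoice {
  provider: string;
  model: string;
}

function routeModel(tier: Tier, complexity: number): ModelChoice {
  // Free users always stay on the cheapest model.
  if (tier === "free") {
    return { provider: "anthropic", model: "claude-3-5-haiku-20241022" };
  }
  // Paid tiers escalate to a stronger model for complex requests.
  if (complexity > 0.7) {
    return { provider: "anthropic", model: "claude-3-5-sonnet-20241022" };
  }
  return { provider: "anthropic", model: "claude-3-5-haiku-20241022" };
}
```

User preference and per-provider availability would layer on top of a rule like this.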
## Container Management
When a user authenticates, the gateway:
1. **Checks for existing container**: Queries Kubernetes for deployment
2. **Creates if missing**: Renders YAML template based on license tier
3. **Waits for ready**: Polls deployment status until healthy
4. **Returns MCP endpoint**: Computed from service name
5. **Connects to MCP server**: Proceeds with normal authentication flow
Container templates by license tier:
| Tier | Memory | CPU | Storage | Idle Timeout |
|------|--------|-----|---------|--------------|
| Free | 512Mi | 500m | 1Gi | 15min |
| Pro | 2Gi | 2000m | 10Gi | 60min |
| Enterprise | 4Gi | 4000m | 50Gi | Never |
Containers self-manage their lifecycle using the lifecycle sidecar (see `../lifecycle-sidecar/`).
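The tier table above can be expressed as a lookup used when rendering the container template. A minimal sketch; the field names and `tierLimits` helper are assumptions, since the real gateway renders these values into a YAML template:

```typescript
// Per-tier container limits, mirroring the table above.
type Tier = "free" | "pro" | "enterprise";

interface ContainerLimits {
  memory: string;
  cpu: string;
  storage: string;
  idleTimeoutMinutes: number | null; // null = never shut down
}

const TIER_LIMITS: Record<Tier, ContainerLimits> = {
  free:       { memory: "512Mi", cpu: "500m",  storage: "1Gi",  idleTimeoutMinutes: 15 },
  pro:        { memory: "2Gi",   cpu: "2000m", storage: "10Gi", idleTimeoutMinutes: 60 },
  enterprise: { memory: "4Gi",   cpu: "4000m", storage: "50Gi", idleTimeoutMinutes: null },
};

function tierLimits(tier: Tier): ContainerLimits {
  return TIER_LIMITS[tier];
}
```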
## Setup
### Prerequisites
- Node.js >= 22.0.0
- PostgreSQL database
- At least one LLM provider API key:
  - Anthropic Claude
  - OpenAI GPT
  - Google Gemini
  - OpenRouter (one key for 300+ models)
- Ollama (for embeddings): https://ollama.com/download
- Redis (for session/hot storage)
- Qdrant (for RAG vector search)
- Kafka + Flink + Iceberg (for durable storage)
### Development
1. Install dependencies:
```bash
npm install
```
2. Copy environment template:
```bash
cp .env.example .env
```
3. Configure `.env` (see `.env.example`):
```bash
DATABASE_URL=postgresql://postgres:postgres@localhost:5432/dexorder
# Configure at least one provider
ANTHROPIC_API_KEY=sk-ant-xxxxx
# OPENAI_API_KEY=sk-xxxxx
# GOOGLE_API_KEY=xxxxx
# OPENROUTER_API_KEY=sk-or-xxxxx
# Optional: Set default model
DEFAULT_MODEL_PROVIDER=anthropic
DEFAULT_MODEL=claude-3-5-sonnet-20241022
```
4. Start Ollama and pull embedding model:
```bash
# Install Ollama (one-time): https://ollama.com/download
# Or with Docker: docker run -d -p 11434:11434 ollama/ollama
# Pull the all-minilm embedding model (90MB, CPU-friendly)
ollama pull all-minilm
# Alternative models:
# ollama pull nomic-embed-text # 8K context length
# ollama pull mxbai-embed-large # Higher accuracy, slower
```
5. Run development server:
```bash
npm run dev
```
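At startup the gateway has to pick a provider from the keys configured in step 3. A minimal sketch of that resolution, assuming the env var names from `.env.example`; the `resolveProvider` helper is hypothetical:

```typescript
// Map each provider to the env var holding its API key (names from .env.example).
const PROVIDER_KEYS: Record<string, string> = {
  anthropic: "ANTHROPIC_API_KEY",
  openai: "OPENAI_API_KEY",
  google: "GOOGLE_API_KEY",
  openrouter: "OPENROUTER_API_KEY",
};

// Prefer DEFAULT_MODEL_PROVIDER if its key is set, else fall back to the
// first configured provider; fail loudly if none is configured.
function resolveProvider(env: Record<string, string | undefined>): string {
  const preferred = env.DEFAULT_MODEL_PROVIDER;
  if (preferred && env[PROVIDER_KEYS[preferred]]) return preferred;
  for (const [provider, key] of Object.entries(PROVIDER_KEYS)) {
    if (env[key]) return provider;
  }
  throw new Error("No LLM provider API key configured");
}
```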
### Production Build
```bash
npm run build
npm start
```
### Docker
```bash
docker build -t dexorder/gateway:latest .
docker run -p 3000:3000 --env-file .env dexorder/gateway:latest
```
## Database Schema
Required PostgreSQL tables (full DDL will be documented separately):
### `user_licenses`
- `user_id` (text, primary key)
- `email` (text)
- `license_type` (text: 'free', 'pro', 'enterprise')
- `features` (jsonb)
- `resource_limits` (jsonb)
- `mcp_server_url` (text)
- `expires_at` (timestamp, nullable)
- `created_at` (timestamp)
- `updated_at` (timestamp)
### `user_channel_links`
- `id` (serial, primary key)
- `user_id` (text, foreign key)
- `channel_type` (text: 'telegram', 'slack', 'discord')
- `channel_user_id` (text)
- `created_at` (timestamp)
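For illustration, the `user_licenses` row maps to a TypeScript shape like the sketch below, including the expiry check implied by the nullable `expires_at` column. The interface and `isLicenseActive` helper are assumptions mirroring the schema above, not the gateway's actual types:

```typescript
// Shape of a user_licenses row (names mirror the schema above).
interface UserLicense {
  user_id: string;
  email: string;
  license_type: "free" | "pro" | "enterprise";
  features: Record<string, unknown>;
  resource_limits: Record<string, unknown>;
  mcp_server_url: string;
  expires_at: Date | null; // null = never expires
  created_at: Date;
  updated_at: Date;
}

// A license is active when expires_at is NULL or still in the future.
function isLicenseActive(license: UserLicense, now: Date = new Date()): boolean {
  return license.expires_at === null || license.expires_at > now;
}
```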
## API Endpoints
### WebSocket
**`GET /ws/chat`**
- WebSocket connection for web client
- Auth: Bearer token in headers
- Protocol: JSON messages
Example:
```javascript
// Using the `ws` package (Node.js) — the browser WebSocket API cannot set
// custom headers, so pass the JWT differently for browser clients.
import WebSocket from 'ws';

const ws = new WebSocket('ws://localhost:3000/ws/chat', {
  headers: {
    'Authorization': 'Bearer your-jwt-token'
  }
});

ws.on('open', () => {
  // Send only after the connection is established
  ws.send(JSON.stringify({
    type: 'message',
    content: 'Hello, AI!'
  }));
});

ws.on('message', (data) => {
  const msg = JSON.parse(data);
  console.log(msg);
});
```
### Telegram Webhook
**`POST /webhook/telegram`**
- Telegram bot webhook endpoint
- Auth: Telegram user linked to platform user
- Automatically processes incoming messages
### Health Check
**`GET /health`**
- Returns server health status
## Ollama Deployment Options
The gateway requires Ollama for embedding generation in RAG queries. You have two deployment options:
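For context, embedding generation is a single HTTP call to Ollama's `/api/embeddings` endpoint (`{ model, prompt }` in, `{ embedding }` out). A minimal sketch; the `buildEmbeddingRequest`/`embed` helper names are assumptions, not the gateway's actual client code:

```typescript
// Build the request for Ollama's embeddings endpoint.
function buildEmbeddingRequest(baseUrl: string, text: string) {
  return {
    url: `${baseUrl}/api/embeddings`,
    body: JSON.stringify({ model: "all-minilm", prompt: text }),
  };
}

// Call Ollama and return the embedding vector. baseUrl is whatever
// OLLAMA_URL points at (localhost:11434 for Option 1 below).
async function embed(baseUrl: string, text: string): Promise<number[]> {
  const { url, body } = buildEmbeddingRequest(baseUrl, text);
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body,
  });
  if (!res.ok) throw new Error(`Ollama embedding failed: ${res.status}`);
  const data = (await res.json()) as { embedding: number[] };
  return data.embedding;
}
```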
### Option 1: Ollama in Gateway Container (Recommended for simplicity)
Install Ollama directly in the gateway container. This keeps all dependencies local and simplifies networking.
**Dockerfile additions:**
```dockerfile
FROM node:22-slim

# node:22-slim does not ship curl; install it before fetching Ollama
RUN apt-get update && apt-get install -y --no-install-recommends curl ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Install Ollama
RUN curl -fsSL https://ollama.com/install.sh | sh

# Pull embedding model at build time
RUN ollama serve & \
    sleep 5 && \
    ollama pull all-minilm && \
    pkill ollama

# ... rest of your gateway Dockerfile
```
**Start script (entrypoint.sh):**
```bash
#!/bin/bash
set -e

# Start Ollama in background
ollama serve &

# Wait until Ollama accepts requests before starting the gateway
until curl -sf http://localhost:11434/api/tags > /dev/null; do
  sleep 1
done

# Start gateway
node dist/main.js
```
**Pros:**
- Simple networking (localhost:11434)
- No extra K8s resources
- Self-contained deployment
**Cons:**
- Larger container image (~200MB extra)
- CPU/memory shared with gateway process
**Resource requirements:**
- Add +200MB memory
- Add +0.2 CPU cores for embedding inference
### Option 2: Ollama as Separate Pod/Sidecar
Deploy Ollama as a separate container in the same pod (sidecar) or as its own deployment.
**K8s Deployment (sidecar pattern):**
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gateway
spec:
  selector:
    matchLabels:
      app: gateway
  template:
    metadata:
      labels:
        app: gateway
    spec:
      containers:
        - name: gateway
          image: ghcr.io/dexorder/gateway:latest
          env:
            - name: OLLAMA_URL
              value: http://localhost:11434
        - name: ollama
          image: ollama/ollama:latest
          command: ["/bin/sh", "-c"]
          args:
            - |
              ollama serve &
              sleep 5
              ollama pull all-minilm
              wait
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
```
**K8s Deployment (separate service):**
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          # ... same as above
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
```
Gateway `.env`:
```bash
OLLAMA_URL=http://ollama:11434
```
**Pros:**
- Isolated resource limits
- Can scale separately
- Easier to monitor/debug
**Cons:**
- More K8s resources
- Network hop (minimal latency)
- More complex deployment
### Recommendation
For most deployments: **Use Option 1 (in-container)** for simplicity, unless you need to:
- Share Ollama across multiple services
- Scale embedding inference independently
- Run Ollama on GPU nodes (gateway on CPU nodes)
## TODO
- [ ] Implement JWT verification with JWKS
- [ ] Implement MCP HTTP/SSE transport
- [ ] Add rate limiting per user license
- [ ] Add message usage tracking
- [ ] Add streaming responses for WebSocket
- [ ] Add Slack and Discord channel handlers
- [ ] Add session cleanup/timeout logic