- OpenAI GPT
- Google Gemini
- OpenRouter (one key for 300+ models)
- Ollama (for embeddings): https://ollama.com/download
- Redis (for session/hot storage)
- Qdrant (for RAG vector search)
- Kafka + Flink + Iceberg (for durable storage)
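For local development, the Redis and Qdrant services listed above can be started with Docker. A minimal sketch, assuming the default images and ports (the `redis:7` tag and container names are illustrative; adjust to your setup):

```shell
# Redis on its default port (session/hot storage)
docker run -d --name redis -p 6379:6379 redis:7

# Qdrant on its default HTTP port (RAG vector search)
docker run -d --name qdrant -p 6333:6333 qdrant/qdrant
```

These are setup commands only; the durable-storage stack (Kafka + Flink + Iceberg) needs its own deployment.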

### Development

```bash
DEFAULT_MODEL_PROVIDER=anthropic
DEFAULT_MODEL=claude-3-5-sonnet-20241022
```

4. Start Ollama and pull the embedding model:

```bash
# Install Ollama (one-time): https://ollama.com/download
# Or with Docker: docker run -d -p 11434:11434 ollama/ollama

# Pull the all-minilm embedding model (90MB, CPU-friendly)
ollama pull all-minilm

# Alternative models:
# ollama pull nomic-embed-text   # 8K context length
# ollama pull mxbai-embed-large  # Higher accuracy, slower
```

5. Run the development server:

```bash
npm run dev
```
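Once Ollama is running, the embedding model can be sanity-checked from the shell. A quick smoke test against Ollama's embeddings endpoint, assuming the default port and the `all-minilm` model pulled above (this needs a live Ollama server, so it is a manual check rather than part of the build):

```shell
# Request an embedding for a test prompt from the local Ollama server;
# the response is JSON containing an "embedding" array of floats
curl -s http://localhost:11434/api/embeddings \
  -d '{"model": "all-minilm", "prompt": "hello world"}'
```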
**`GET /health`**
- Returns server health status
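A quick check from the shell (the port here is illustrative; substitute whatever port your gateway is configured to listen on):

```shell
# Returns the server health status
curl -s http://localhost:3000/health
```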

## Ollama Deployment Options

The gateway requires Ollama for embedding generation in RAG queries. You have two deployment options:

### Option 1: Ollama in Gateway Container (Recommended for simplicity)

Install Ollama directly in the gateway container. This keeps all dependencies local and simplifies networking.

**Dockerfile additions:**
```dockerfile
FROM node:22-slim

# node:22-slim is minimal: install curl (needed by the install script)
# and procps (provides pkill)
RUN apt-get update \
    && apt-get install -y --no-install-recommends curl ca-certificates procps \
    && rm -rf /var/lib/apt/lists/*

# Install Ollama
RUN curl -fsSL https://ollama.com/install.sh | sh

# Pull the embedding model at build time so the image ships with it
RUN ollama serve & \
    sleep 5 && \
    ollama pull all-minilm && \
    pkill ollama

# ... rest of your gateway Dockerfile
```

**Start script (entrypoint.sh):**
```bash
#!/bin/bash
# Start Ollama in background
ollama serve &

# Start gateway
node dist/main.js
```
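One caveat with this script: `node` may start before Ollama is ready to accept requests. A hedged sketch of a readiness wait that could run between the two steps, polling Ollama's `/api/tags` endpoint (the function name, default URL, and retry count are illustrative):

```shell
# Poll the Ollama API until it answers, or give up after N one-second retries
wait_for_ollama() {
  local url="${1:-http://localhost:11434}"
  local retries="${2:-30}"
  local i=0
  while [ "$i" -lt "$retries" ]; do
    if curl -fsS "$url/api/tags" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```

In `entrypoint.sh` this would be called as `wait_for_ollama || exit 1` after `ollama serve &` and before `node dist/main.js`.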

**Pros:**
- Simple networking (localhost:11434)
- No extra K8s resources
- Self-contained deployment

**Cons:**
- Larger container image (~200MB extra)
- CPU/memory shared with the gateway process

**Resource requirements:**
- +200MB memory
- +0.2 CPU cores for embedding inference

### Option 2: Ollama as Separate Pod/Sidecar

Deploy Ollama as a separate container in the same pod (sidecar) or as its own deployment.

**K8s Deployment (sidecar pattern):**
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gateway
spec:
  # A Deployment requires a selector matching the pod template labels
  selector:
    matchLabels:
      app: gateway
  template:
    metadata:
      labels:
        app: gateway
    spec:
      containers:
        - name: gateway
          image: ghcr.io/dexorder/gateway:latest
          env:
            - name: OLLAMA_URL
              value: http://localhost:11434

        - name: ollama
          image: ollama/ollama:latest
          command: ["/bin/sh", "-c"]
          args:
            - |
              ollama serve &
              sleep 5
              ollama pull all-minilm
              wait
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
```

**K8s Deployment (separate service):**
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      # These labels must match the Service selector below
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          # ... same as above
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
```

Gateway `.env`:
```bash
OLLAMA_URL=http://ollama:11434
```
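To verify the in-cluster DNS wiring before pointing the gateway at it, one option is a throwaway curl pod (the pod name is arbitrary and any curl-capable image works; this requires a running cluster, so it is a manual check):

```shell
# Hit the Ollama Service from inside the cluster, then clean up the pod
kubectl run ollama-check --rm -i --restart=Never \
  --image=curlimages/curl -- \
  curl -fsS http://ollama:11434/api/tags
```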

**Pros:**
- Isolated resource limits
- Can scale separately
- Easier to monitor/debug

**Cons:**
- More K8s resources
- Network hop (minimal latency)
- More complex deployment

### Recommendation

For most deployments: **Use Option 1 (in-container)** for simplicity, unless you need to:
- Share Ollama across multiple services
- Scale embedding inference independently
- Run Ollama on GPU nodes (gateway on CPU nodes)

## TODO

- [ ] Implement JWT verification with JWKS
- [ ] Implement MCP HTTP/SSE transport
- [ ] Add Redis for session persistence
- [ ] Add rate limiting per user license
- [ ] Add message usage tracking
- [ ] Add streaming responses for WebSocket