- OpenAI GPT
- Google Gemini
- OpenRouter (one key for 300+ models)
- Ollama (for embeddings): https://ollama.com/download
- Redis (for session/hot storage)
- Qdrant (for RAG vector search)
- Kafka + Flink + Iceberg (for durable storage)
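For local development, the Redis and Qdrant services listed above can be started with Docker. A minimal sketch, assuming the default images and ports (the `redis:7` tag and container names are illustrative; adjust to your setup):

```shell
# Redis on its default port (session/hot storage)
docker run -d --name redis -p 6379:6379 redis:7

# Qdrant on its default HTTP port (RAG vector search)
docker run -d --name qdrant -p 6333:6333 qdrant/qdrant
```

These are setup commands only; the durable-storage stack (Kafka + Flink + Iceberg) needs its own deployment.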

### Development

```bash
DEFAULT_MODEL_PROVIDER=anthropic
DEFAULT_MODEL=claude-3-5-sonnet-20241022
```

4. Start Ollama and pull the embedding model:

```bash
# Install Ollama (one-time): https://ollama.com/download
# Or with Docker: docker run -d -p 11434:11434 ollama/ollama

# Pull the all-minilm embedding model (90MB, CPU-friendly)
ollama pull all-minilm

# Alternative models:
# ollama pull nomic-embed-text   # 8K context length
# ollama pull mxbai-embed-large  # Higher accuracy, slower
```

5. Run the development server:

```bash
npm run dev
```
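Once Ollama is running, the embedding model can be sanity-checked from the shell. A quick smoke test against Ollama's embeddings endpoint, assuming the default port and the `all-minilm` model pulled above (this needs a live Ollama server, so it is a manual check rather than part of the build):

```shell
# Request an embedding for a test prompt from the local Ollama server;
# the response is JSON containing an "embedding" array of floats
curl -s http://localhost:11434/api/embeddings \
  -d '{"model": "all-minilm", "prompt": "hello world"}'
```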
**`GET /health`**
- Returns server health status
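A quick check from the shell (the port here is illustrative; substitute whatever port your gateway is configured to listen on):

```shell
# Returns the server health status
curl -s http://localhost:3000/health
```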

## Ollama Deployment Options

The gateway requires Ollama for embedding generation in RAG queries. You have two deployment options:

### Option 1: Ollama in Gateway Container (Recommended for simplicity)

Install Ollama directly in the gateway container. This keeps all dependencies local and simplifies networking.

**Dockerfile additions:**
```dockerfile
FROM node:22-slim

# node:22-slim is minimal: install curl (needed by the install script)
# and procps (provides pkill)
RUN apt-get update \
    && apt-get install -y --no-install-recommends curl ca-certificates procps \
    && rm -rf /var/lib/apt/lists/*

# Install Ollama
RUN curl -fsSL https://ollama.com/install.sh | sh

# Pull the embedding model at build time so the image ships with it
RUN ollama serve & \
    sleep 5 && \
    ollama pull all-minilm && \
    pkill ollama

# ... rest of your gateway Dockerfile
```

**Start script (entrypoint.sh):**
```bash
#!/bin/bash
# Start Ollama in background
ollama serve &

# Start gateway
node dist/main.js
```
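One caveat with this script: `node` may start before Ollama is ready to accept requests. A hedged sketch of a readiness wait that could run between the two steps, polling Ollama's `/api/tags` endpoint (the function name, default URL, and retry count are illustrative):

```shell
# Poll the Ollama API until it answers, or give up after N one-second retries
wait_for_ollama() {
  local url="${1:-http://localhost:11434}"
  local retries="${2:-30}"
  local i=0
  while [ "$i" -lt "$retries" ]; do
    if curl -fsS "$url/api/tags" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```

In `entrypoint.sh` this would be called as `wait_for_ollama || exit 1` after `ollama serve &` and before `node dist/main.js`.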

**Pros:**
- Simple networking (localhost:11434)
- No extra K8s resources
- Self-contained deployment

**Cons:**
- Larger container image (~200MB extra)
- CPU/memory shared with the gateway process

**Resource requirements:**
- +200MB memory
- +0.2 CPU cores for embedding inference

### Option 2: Ollama as Separate Pod/Sidecar

Deploy Ollama as a separate container in the same pod (sidecar) or as its own deployment.

**K8s Deployment (sidecar pattern):**
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gateway
spec:
  # A Deployment requires a selector matching the pod template labels
  selector:
    matchLabels:
      app: gateway
  template:
    metadata:
      labels:
        app: gateway
    spec:
      containers:
        - name: gateway
          image: ghcr.io/dexorder/gateway:latest
          env:
            - name: OLLAMA_URL
              value: http://localhost:11434

        - name: ollama
          image: ollama/ollama:latest
          command: ["/bin/sh", "-c"]
          args:
            - |
              ollama serve &
              sleep 5
              ollama pull all-minilm
              wait
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
```

**K8s Deployment (separate service):**
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      # These labels must match the Service selector below
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          # ... same as above
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
```

Gateway `.env`:
```bash
OLLAMA_URL=http://ollama:11434
```
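To verify the in-cluster DNS wiring before pointing the gateway at it, one option is a throwaway curl pod (the pod name is arbitrary and any curl-capable image works; this requires a running cluster, so it is a manual check):

```shell
# Hit the Ollama Service from inside the cluster, then clean up the pod
kubectl run ollama-check --rm -i --restart=Never \
  --image=curlimages/curl -- \
  curl -fsS http://ollama:11434/api/tags
```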

**Pros:**
- Isolated resource limits
- Can scale separately
- Easier to monitor/debug

**Cons:**
- More K8s resources
- Network hop (minimal latency)
- More complex deployment

### Recommendation

For most deployments: **Use Option 1 (in-container)** for simplicity, unless you need to:
- Share Ollama across multiple services
- Scale embedding inference independently
- Run Ollama on GPU nodes (gateway on CPU nodes)

## TODO

- [ ] Implement JWT verification with JWKS
- [ ] Implement MCP HTTP/SSE transport
- [ ] Add Redis for session persistence
- [ ] Add rate limiting per user license
- [ ] Add message usage tracking
- [ ] Add streaming responses for WebSocket