redesign fully scaffolded and web login works

2026-03-17 20:10:47 -04:00
parent b9cc397e05
commit f6bd22a8ef
143 changed files with 17317 additions and 693 deletions


@@ -151,6 +151,310 @@ The two-frame envelope is the **logical protocol format**, but physical transmis
| 0x11 | SubmitResponse | Immediate ack with notification topic |
| 0x12 | HistoryReadyNotification | Notification that data is ready in Iceberg |
## User Container Event System
User containers emit events (order executions, alerts, workspace changes) that must be delivered to users via their active session or external channels (Telegram, email, push). This requires two ZMQ patterns with different delivery guarantees.
### Event Flow Overview
```
┌─────────────────────────────────────────────────────────────┐
│                       User Container                        │
│                                                             │
│      Strategy/Indicator Engine                              │
│                  │                                          │
│                  ▼                                          │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                    Event Publisher                    │  │
│  │                                                       │  │
│  │  1. Check delivery spec                               │  │
│  │  2. If INFORMATIONAL or has_active_subscriber():      │  │
│  │       → XPUB (fast path)                              │  │
│  │  3. Else (CRITICAL or no active session):             │  │
│  │       → DEALER (guaranteed delivery)                  │  │
│  └───────────────────────────────────────────────────────┘  │
│            │                           │                    │
│       XPUB socket                DEALER socket              │
│       (port 5570)                 (port 5571)               │
└────────────┼───────────────────────────┼────────────────────┘
             │                           │
             ▼                           ▼
┌─────────────────────────────────────────────────────────────┐
│                        Gateway Pool                         │
│                                                             │
│  ┌───────────────────┐   ┌───────────────────────────┐      │
│  │ SUB socket        │   │ ROUTER socket             │      │
│  │ (per-session)     │   │ (shared, any gateway)     │      │
│  │                   │   │                           │      │
│  │ Subscribe to      │   │ Pull event, deliver,      │      │
│  │ USER:{user_id}    │   │ send EventAck back        │      │
│  │ on connect        │   │                           │      │
│  └─────────┬─────────┘   └─────────────┬─────────────┘      │
│            │                           │                    │
│            ▼                           ▼                    │
│     ┌─────────────┐       ┌─────────────────────────┐       │
│     │ Active WS/  │       │ Telegram API / Email /  │       │
│     │ Telegram    │       │ Push Notification       │       │
│     └─────────────┘       └─────────────────────────┘       │
└─────────────────────────────────────────────────────────────┘
```
### 5. User Event Channel - Informational (Container → Gateway)
**Pattern**: XPUB/SUB with subscription tracking
- **Socket Type**: Container uses XPUB (bind), Gateway uses SUB (connect)
- **Endpoint**: `tcp://*:5570` (Container binds)
- **Message Types**: `UserEvent`
- **Topic Format**: `USER:{user_id}` (e.g., `USER:user-abc123`)
- **Behavior**:
  - Gateway subscribes to `USER:{user_id}` when the user's WebSocket/Telegram session connects
  - Gateway unsubscribes when the session disconnects
  - Container uses XPUB with `ZMQ_XPUB_VERBOSE` to track active subscriptions
  - Container checks the subscription set before publishing
  - If there is no subscriber, the message is either dropped (INFORMATIONAL) or routed to the critical channel
  - Zero coordination, fire-and-forget for active sessions
### 6. User Event Channel - Critical (Container → Gateway)
**Pattern**: DEALER/ROUTER with acknowledgment
- **Socket Type**: Container uses DEALER (connect), Gateway uses ROUTER (bind)
- **Endpoint**: `tcp://gateway:5571` (Gateway binds, containers connect)
- **Message Types**: `UserEvent`, `EventAck`
- **Behavior**:
  - Container sends `UserEvent` with `event_id` via DEALER
  - DEALER round-robins to available gateway ROUTER sockets
  - Gateway processes event (sends to Telegram, email, etc.)
  - Gateway sends `EventAck` back to container
  - Container tracks pending events with timeout (30s default)
  - On timeout without ack: resend (DEALER routes to next gateway)
  - On container shutdown: persist pending to disk, reload on startup
  - Provides at-least-once delivery guarantee
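The pending-event bookkeeping described above can be sketched roughly as follows. This is a minimal illustration that assumes an in-memory map keyed by `event_id`; the `PendingTracker` class and its method names are hypothetical and not part of the protocol. The 30-second default is the timeout listed above, and disk persistence on shutdown is left out.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingEvent:
    payload: bytes      # serialized UserEvent, kept so it can be resent
    sent_at: float      # time of the most recent send
    retries: int = 0


class PendingTracker:
    """Hypothetical helper: tracks critical events that have not been acked yet."""

    def __init__(self, ack_timeout: float = 30.0):
        self.ack_timeout = ack_timeout
        self.pending: dict[str, PendingEvent] = {}

    def track(self, event_id: str, payload: bytes, now: Optional[float] = None) -> None:
        """Record an event right after it was handed to the DEALER socket."""
        sent_at = time.time() if now is None else now
        self.pending[event_id] = PendingEvent(payload, sent_at)

    def ack(self, event_id: str) -> bool:
        """Forget the event once an EventAck arrives; False for unknown or repeated acks."""
        return self.pending.pop(event_id, None) is not None

    def due_for_retry(self, now: Optional[float] = None) -> list[tuple[str, bytes]]:
        """Return (event_id, payload) pairs whose ack timed out, restarting their timers."""
        now = time.time() if now is None else now
        due = []
        for event_id, entry in self.pending.items():
            if now - entry.sent_at >= self.ack_timeout:
                entry.sent_at = now
                entry.retries += 1
                due.append((event_id, entry.payload))
        return due
```

A caller would invoke `due_for_retry()` from the event loop and send each returned payload on the DEALER socket again. Because a resend can arrive after the original was already delivered, the receiving side has to tolerate duplicates.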
### Subscription Tracking (Container Side)
Container uses XPUB to detect active sessions:
```python
import zmq

# Container event publisher initialization
ctx = zmq.Context.instance()
xpub_socket = ctx.socket(zmq.XPUB)
xpub_socket.setsockopt(zmq.XPUB_VERBOSE, 1)  # Pass every subscribe upstream, including duplicates
xpub_socket.bind("tcp://*:5570")

active_subscriptions: set[str] = set()

# In event loop, handle subscription messages
def process_subscriptions():
    while xpub_socket.poll(0):
        msg = xpub_socket.recv()
        topic = msg[1:].decode()  # Skip first byte (sub/unsub flag)
        if msg[0] == 1:  # Subscribe
            active_subscriptions.add(topic)
        elif msg[0] == 0:  # Unsubscribe
            active_subscriptions.discard(topic)

def has_active_subscriber(user_id: str) -> bool:
    return f"USER:{user_id}" in active_subscriptions
```
### Event Routing Logic (Container Side)
```python
import time

def publish_event(event: UserEvent):
    topic = f"USER:{event.user_id}"
    if event.delivery.priority == Priority.INFORMATIONAL:
        # Fire and forget - drop if nobody's listening
        if has_active_subscriber(event.user_id):
            xpub_socket.send_multipart([topic.encode(), serialize(event)])
        # else: silently drop
    elif has_active_subscriber(event.user_id):
        # Active session exists - use fast path
        xpub_socket.send_multipart([topic.encode(), serialize(event)])
    else:
        # No active session - use guaranteed delivery
        send_via_dealer(event)

def send_via_dealer(event: UserEvent):
    # Remember the event until the gateway acks it
    pending_events[event.event_id] = PendingEvent(
        event=event,
        sent_at=time.time(),
        retries=0
    )
    dealer_socket.send(serialize(event))
```
### Message Type IDs (User Events)
| Type ID | Message Type | Description |
|---------|-----------------|------------------------------------------------|
| 0x20 | UserEvent | Container → Gateway event |
| 0x21 | EventAck | Gateway → Container acknowledgment |
### UserEvent Message
```protobuf
message UserEvent {
  string user_id = 1;
  string event_id = 2;        // UUID for dedup/ack
  int64 timestamp = 3;        // Unix millis
  EventType event_type = 4;
  bytes payload = 5;          // JSON or nested protobuf
  DeliverySpec delivery = 6;
}

enum EventType {
  ORDER_PLACED = 0;
  ORDER_FILLED = 1;
  ORDER_CANCELLED = 2;
  ALERT_TRIGGERED = 3;
  POSITION_UPDATED = 4;
  WORKSPACE_CHANGED = 5;
  STRATEGY_LOG = 6;
}

message DeliverySpec {
  Priority priority = 1;
  repeated ChannelPreference channels = 2;  // Ordered preference list
}

enum Priority {
  INFORMATIONAL = 0;  // Drop if no active session
  NORMAL = 1;         // Best effort, short queue
  CRITICAL = 2;       // Must deliver, retry, escalate
}

message ChannelPreference {
  ChannelType channel = 1;
  bool only_if_active = 2;  // true = skip if not connected
}

enum ChannelType {
  ACTIVE_SESSION = 0;  // Whatever's currently connected
  WEB = 1;
  TELEGRAM = 2;
  EMAIL = 3;
  PUSH = 4;            // Mobile push notification
}
```
### EventAck Message
```protobuf
message EventAck {
  string event_id = 1;
  AckStatus status = 2;
  string error_message = 3;  // If status is ERROR
}

enum AckStatus {
  DELIVERED = 0;  // Successfully sent to at least one channel
  QUEUED = 1;     // Accepted, will retry (e.g., Telegram rate limit)
  ERROR = 2;      // Permanent failure
}
```
### Delivery Examples
```python
# "Show on screen if they're watching, otherwise don't bother"
# → Uses XPUB path only, dropped if no subscriber
UserEvent(
    delivery=DeliverySpec(
        priority=Priority.INFORMATIONAL,
        channels=[ChannelPreference(channel=ChannelType.ACTIVE_SESSION, only_if_active=True)]
    )
)

# "Active session preferred, fallback to Telegram"
# → Tries XPUB first (if subscribed), else DEALER for Telegram delivery
UserEvent(
    delivery=DeliverySpec(
        priority=Priority.NORMAL,
        channels=[
            ChannelPreference(channel=ChannelType.ACTIVE_SESSION, only_if_active=True),
            ChannelPreference(channel=ChannelType.TELEGRAM, only_if_active=False),
        ]
    )
)

# "Order executed - MUST get through"
# → Always uses DEALER path for guaranteed delivery
UserEvent(
    delivery=DeliverySpec(
        priority=Priority.CRITICAL,
        channels=[
            ChannelPreference(channel=ChannelType.ACTIVE_SESSION, only_if_active=True),
            ChannelPreference(channel=ChannelType.TELEGRAM, only_if_active=False),
            ChannelPreference(channel=ChannelType.PUSH, only_if_active=False),
            ChannelPreference(channel=ChannelType.EMAIL, only_if_active=False),
        ]
    )
)
```
### Gateway Event Processing
Gateway maintains:
1. **Session registry**: Maps user_id → active WebSocket/channel connections
2. **Channel credentials**: Telegram bot token, email service keys, push certificates
3. **SUB socket per user session**: Subscribes to `USER:{user_id}` on container's XPUB
4. **Shared ROUTER socket**: Receives critical events from any container
```typescript
// On user WebSocket connect
async function onSessionConnect(userId: string, ws: WebSocket) {
  // Subscribe to user's informational events
  subSocket.subscribe(`USER:${userId}`);
  sessions.set(userId, ws);
}

// On user WebSocket disconnect
async function onSessionDisconnect(userId: string) {
  subSocket.unsubscribe(`USER:${userId}`);
  sessions.delete(userId);
}

// Handle informational events (from SUB socket)
subSocket.on('message', (topic, payload) => {
  const event = deserialize(payload);
  const ws = sessions.get(event.userId);
  if (ws) {
    ws.send(JSON.stringify({ type: 'event', ...event }));
  }
});

// Handle critical events (from ROUTER socket)
routerSocket.on('message', (identity, payload) => {
  const event = deserialize(payload);
  deliverEvent(event).then(status => {
    routerSocket.send([identity, serialize(EventAck(event.eventId, status))]);
  });
});

// Try each channel in preference order; the first success wins
async function deliverEvent(event: UserEvent): Promise<AckStatus> {
  for (const pref of event.delivery.channels) {
    if (pref.onlyIfActive && !sessions.has(event.userId)) continue;
    try {
      switch (pref.channel) {
        case ChannelType.ACTIVE_SESSION: {
          const ws = sessions.get(event.userId);
          if (ws) { ws.send(...); return AckStatus.DELIVERED; }
          break;
        }
        case ChannelType.TELEGRAM:
          await telegramBot.sendMessage(event.userId, formatEvent(event));
          return AckStatus.DELIVERED;
        case ChannelType.EMAIL:
          await emailService.send(event.userId, formatEvent(event));
          return AckStatus.DELIVERED;
        // ... etc
      }
    } catch (err) {
      // This channel failed; fall through to the next preference
    }
  }
  // Every listed channel was skipped or failed
  return AckStatus.ERROR;
}
```
## Error Handling
**Async Architecture Error Handling**:
@@ -162,7 +466,51 @@ The two-frame envelope is the **logical protocol format**, but physical transmis
- PUB/SUB has no delivery guarantees (Kafka provides durability)
- No response routing needed - all notifications via topic-based pub/sub
**User Event Error Handling**:
- Informational events: dropped silently if no active session (by design)
- Critical events: container retries on ack timeout (30s default)
- Gateway tracks event_id for deduplication (5 minute window)
- If all channels fail: return ERROR ack, container may escalate or log
- Container persists pending critical events to disk on shutdown
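Because critical events are resent on ack timeout, the gateway can see the same `event_id` more than once. One way the 5-minute deduplication window could look is sketched below; `DedupWindow` is a hypothetical name, and the in-process dictionary is an assumption rather than the documented design.

```python
import time
from typing import Optional


class DedupWindow:
    """Hypothetical helper: remembers event_ids for a fixed window (default 5 minutes)."""

    def __init__(self, window_seconds: float = 300.0):
        self.window_seconds = window_seconds
        self._seen: dict[str, float] = {}  # event_id -> time it was first seen

    def is_duplicate(self, event_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Evict entries that have aged out of the window
        cutoff = now - self.window_seconds
        for expired in [eid for eid, seen in self._seen.items() if seen <= cutoff]:
            del self._seen[expired]
        if event_id in self._seen:
            return True
        self._seen[event_id] = now
        return False
```

A gateway would call `is_duplicate(event.event_id)` before delivering and, for a duplicate, skip delivery while still sending an ack. Note that an in-process map only sees events that reach the same gateway; since a resend may be routed to a different gateway, deduplicating across the pool would need shared state, which this section does not specify.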
**Durability**:
- All data flows through Kafka for durability
- Flink checkpointing ensures exactly-once processing
- Client can retry request with new request_id if notification not received
- Critical user events use DEALER/ROUTER with ack for at-least-once delivery
## Scaling
### TODO: Flink-to-Relay ZMQ Discovery
Currently Relay connects to Flink via XSUB on a single endpoint. With multiple Flink instances behind a K8s service, we need many-to-many connectivity.
**Problem**: K8s service load balancing doesn't help ZMQ since connections are persistent. Relay needs to connect to ALL Flink instances to receive all published messages.
**Proposed Solution**: Use a K8s headless service for Flink workers:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: flink-workers
spec:
  clusterIP: None
  selector:
    app: flink
```
Relay implementation:
1. On startup and periodically (every N seconds), resolve `flink-workers.namespace.svc.cluster.local`
2. DNS returns A records for all Flink pod IPs
3. Diff against current XSUB connections
4. Connect to new pods, disconnect from removed pods
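The four steps above can be sketched as a resolve-and-diff loop. This is only an illustration: the function names are hypothetical, the port is a placeholder parameter (this section does not name the Flink publisher port), and `connect`/`disconnect` stand in for the XSUB socket's methods of the same name.

```python
import socket
from typing import Callable


def resolve_pod_ips(hostname: str, port: int) -> set[str]:
    """Resolve every IPv4 address currently behind a headless service name."""
    infos = socket.getaddrinfo(hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM)
    return {info[4][0] for info in infos}


def diff_endpoints(current: set[str], resolved: set[str]) -> tuple[set[str], set[str]]:
    """Return (pods to connect to, pods to disconnect from)."""
    return resolved - current, current - resolved


def reconcile(
    current: set[str],
    resolved: set[str],
    connect: Callable[[str], None],
    disconnect: Callable[[str], None],
    port: int,
) -> set[str]:
    """Apply the diff and return the new set of connected pod IPs."""
    to_add, to_remove = diff_endpoints(current, resolved)
    for ip in sorted(to_add):
        connect(f"tcp://{ip}:{port}")
    for ip in sorted(to_remove):
        disconnect(f"tcp://{ip}:{port}")
    return set(resolved)
```

In the Relay, `reconcile` would run on startup and then on a timer, with `resolved` coming from `resolve_pod_ips("flink-workers.namespace.svc.cluster.local", port)`. Keeping the diff as a pure function makes it straightforward to swap DNS polling for an Endpoints watch later.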
**Alternative approaches considered**:
- XPUB/XSUB broker: Adds single point of failure and latency
- Service discovery (etcd/Redis): More complex, requires additional infrastructure
**Open questions**:
- Appropriate polling interval for DNS resolution (5-10 seconds?)
- Handling of brief disconnection during pod replacement
- Whether to use K8s Endpoints API watch instead of DNS polling for faster reaction