# ZeroMQ Protocol Architecture

Our data transfer protocol uses ZeroMQ with Protobufs. Every message is a small two-frame envelope: the first frame carries a one-byte protocol version, and the second frame carries a one-byte message type ID followed by the protobuf payload. OHLC periods are represented in seconds.

## Data Flow Overview

**Relay as Gateway**: The Relay is a well-known bind point that all components connect to. It routes messages between clients, ingestors, and Flink.

### Historical Data Query Flow (Async Event-Driven Architecture)

* Client generates request_id and/or client_id (both are client-generated)
* Client computes the notification topic: `RESPONSE:{client_id}` or `HISTORY_READY:{request_id}`
* **Client subscribes to the notification topic BEFORE sending the request (prevents a race condition)**
* Client sends SubmitHistoricalRequest to Relay (REQ/REP)
* Relay returns an immediate SubmitResponse with request_id and notification_topic (for confirmation)
* Relay publishes DataRequest to the ingestor work queue with an exchange prefix (PUB/SUB)
* Ingestor receives the request and fetches data from the exchange
* Ingestor writes OHLC data to Kafka with `__metadata` in the first record
* Flink reads from Kafka, processes the data, and writes to Iceberg
* Flink task manager sends HistoryReadyNotification via PUSH to job manager PULL (port 5561)
* Job manager `HistoryNotificationForwarder` republishes on MARKET_DATA_PUB (port 5558)
* Relay proxies the notification via XSUB → XPUB to clients
* Client receives the notification (already subscribed) and queries Iceberg for the data

**Key Architectural Change**: The Relay is completely stateless. No request/response correlation is needed; all notification routing is topic-based (e.g., `RESPONSE:{client_id}`).

**Race Condition Prevention**: Notification topics are deterministic, derived from client-generated values (request_id or client_id). Clients MUST subscribe to the notification topic BEFORE submitting the request to avoid missing notifications.

**Two Notification Patterns**:

1. **Per-client topic** (`RESPONSE:{client_id}`): Subscribe once during connection; reuse for all requests from this client. Recommended for most clients.
2. **Per-request topic** (`HISTORY_READY:{request_id}`): Subscribe immediately before each request. Use when you need per-request isolation or don't have a persistent client_id.
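To make the subscribe-first ordering concrete, here is a minimal client-side sketch of the per-request pattern using pyzmq. The relay hostname, the serialized payload, and the concrete values of the version/type constants are illustrative assumptions; only the topic format, ports, and frame layout come from this document.

```python
import uuid
import zmq

PROTOCOL_VERSION = b"\x01"        # assumed value of the version byte
TYPE_SUBMIT_HISTORICAL = b"\x10"  # SubmitHistoricalRequest (see type ID table)

ctx = zmq.Context.instance()

# 1. Compute the deterministic topic and subscribe BEFORE submitting.
request_id = uuid.uuid4().hex
topic = f"HISTORY_READY:{request_id}".encode()
sub = ctx.socket(zmq.SUB)
sub.connect("tcp://relay:5558")            # Relay XPUB fanout (hostname assumed)
sub.setsockopt(zmq.SUBSCRIBE, topic)

# 2. Submit the request over REQ/REP and read the immediate ack.
req = ctx.socket(zmq.REQ)
req.connect("tcp://relay:5559")            # Relay ROUTER
payload = b"..."                           # serialized SubmitHistoricalRequest
req.send_multipart([PROTOCOL_VERSION, TYPE_SUBMIT_HISTORICAL + payload])
version, ack = req.recv_multipart()        # SubmitResponse with notification_topic

# 3. Wait for the readiness notification, then query Iceberg for the data.
if sub.poll(timeout=30_000):
    frames = sub.recv_multipart()          # [topic][version][type ID + payload]
    # parse HistoryReadyNotification here, then read the data from Iceberg
else:
    pass  # no notification within the timeout: assume failure (see Error Handling)
```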
### Realtime Data Flow (Flink → Relay → Clients)

* Ingestors write realtime ticks to Kafka
* Flink reads from Kafka, processes OHLC aggregations and CEP triggers
* Flink publishes market data via ZMQ PUB (port 5558)
* Relay subscribes to Flink (XSUB) and fans out to clients (XPUB)
* Clients subscribe to specific tickers

### Symbol Metadata Update Flow (Flink → Gateways)

* Ingestors write symbol metadata to Kafka
* Flink reads from Kafka and writes to the Iceberg symbol_metadata table
* After committing to Iceberg, Flink publishes a SymbolMetadataUpdated notification on MARKET_DATA_PUB
* Gateways subscribe to the METADATA_UPDATE topic on startup
* Upon receiving the notification, gateways reload symbol metadata from Iceberg
* This prevents race conditions where gateways start before symbol metadata is available

### Data Processing (Kafka → Flink → Iceberg)

* All market data flows through Kafka (durable event log)
* Flink processes streams for aggregations and CEP
* Flink writes historical data to Apache Iceberg tables
* Clients can query Iceberg for historical data (an alternative to ingestor backfill)

**Key Design Principles**:

* Relay is the well-known bind point; all other components connect to it
* Relay is completely stateless: no request tracking, only topic-based routing
* Exchange prefix filtering allows ingestor specialization (e.g., BINANCE-only ingestors)
* Historical data flows through Kafka (durable processing) only; there is no direct data response
* Async event-driven notifications via pub/sub (Flink → Relay → Clients)
* Protobufs over ZMQ for all inter-service communication
* Kafka for durability and Flink stream processing
* Iceberg for long-term historical storage and client queries

## ZeroMQ Channels and Patterns

Sockets bind on the **Relay** (the well-known endpoint) unless noted otherwise; components connect to the Relay.

### 1. Client Request Channel (Clients → Relay)

**Pattern**: ROUTER/REQ

- **Socket Type**: Relay uses ROUTER (bind), clients use REQ (connect)
- **Endpoint**: `tcp://*:5559` (Relay binds)
- **Message Types**: `SubmitHistoricalRequest` → `SubmitResponse`
- **Behavior**:
  - Client generates request_id and/or client_id
  - Client computes the notification topic deterministically
  - **Client subscribes to the notification topic FIRST (prevents the race)**
  - Client sends a REQ for historical OHLC data
  - Relay validates the request and returns an immediate acknowledgment
  - The response includes notification_topic for client confirmation
  - Relay publishes DataRequest to the ingestor work queue
  - No request tracking; the Relay is stateless (see the sketch below)
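A sketch of the relay side of this channel shows what "stateless" means in practice: the ROUTER acknowledges and forwards, keeping no per-request state. This is a minimal illustration under assumptions, not the actual Relay implementation; validation and work-queue forwarding are stubbed, and the constants' values are assumed.

```python
import zmq

PROTOCOL_VERSION = b"\x01"      # assumed value of the version byte
TYPE_SUBMIT_RESPONSE = b"\x11"  # SubmitResponse (see type ID table)

ctx = zmq.Context.instance()
router = ctx.socket(zmq.ROUTER)
router.bind("tcp://*:5559")

while True:
    # A REQ peer arrives as [identity][empty delimiter][version][type + payload].
    identity, empty, version, body = router.recv_multipart()
    msg_type, payload = body[:1], body[1:]

    # Validate, then acknowledge immediately; nothing is stored per request.
    ack = b"..."  # serialized SubmitResponse carrying request_id and notification_topic
    router.send_multipart(
        [identity, empty, PROTOCOL_VERSION, TYPE_SUBMIT_RESPONSE + ack]
    )
    # Forward the DataRequest to the ingestor work queue here (not shown).
```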
### 2. Ingestor Work Queue (Flink ↔ Ingestors)

**Pattern**: ROUTER/DEALER slot-based broker

- **Socket Type**: Flink `IngestorBroker` uses ROUTER (bind), ingestors use DEALER (connect)
- **Endpoint**: `tcp://*:5567` (Flink binds)
- **Message Types**: `WorkerReady` (slot offer), `DataRequest` (work assignment), `WorkComplete`, `WorkHeartbeat`, `WorkReject`, `WorkStop`
- **Capacity model**:
  - Each `WorkerReady` (0x20) is ONE slot offer for one exchange and one job type (`SlotType`: `HISTORICAL=1`, `REALTIME=2`, `ANY=0`)
  - Ingestors send N `WorkerReady` messages at startup — one per available slot per exchange per type
  - Flink dispatches a job by matching the slot's exchange and SlotType to the request
  - The slot is consumed on dispatch; the ingestor re-offers it (new `WorkerReady`) when the job ends
  - Rate-limit backoff: if the exchange returns a 429, the ingestor delays the re-offer by the `Retry-After` duration from the response header
- **Historical job lifecycle**:
  - Flink dispatches `DataRequest` (HISTORICAL_OHLC) → ingestor fetches and writes to Kafka → sends `WorkComplete` (0x21) → sends a new `WorkerReady` for that slot
- **Realtime job lifecycle**:
  - Flink dispatches `DataRequest` (REALTIME_TICKS) → ingestor polls the exchange and writes ticks to Kafka → sends `WorkHeartbeat` (0x22) every 5 s → on `WorkStop` (0x25) from Flink: cancels and sends a new `WorkerReady`
- **Slot configuration** (per ingestor, per exchange):

  ```yaml
  exchange_capacity:
    BINANCE:  { historical_slots: 3, realtime_slots: 5 }
    KRAKEN:   { historical_slots: 2, realtime_slots: 3 }
    COINBASE: { historical_slots: 2, realtime_slots: 4 }
  ```

- **Flink restart**: when Flink restarts, its `freeSlots` deque is cleared; all in-flight jobs time out on the ingestor side, releasing their slots, which are then re-offered via `WorkerReady`

### 3. Market Data Fanout (Relay ↔ Flink ↔ Clients)

**Pattern**: XPUB/XSUB proxy

- **Socket Type**:
  - Relay XPUB (bind) ← Clients SUB (connect), port 5558
  - Relay XSUB (connect) → Flink MARKET_DATA_PUB (bind), port 5558
- **Message Types**: `Tick`, `OHLC`, `HistoryReadyNotification`, `SymbolMetadataUpdated`
- **Topic Formats**:
  - Market data: `{ticker}|{data_type}` (e.g., `BTC/USDT.BINANCE|tick`)
  - Notifications: `RESPONSE:{client_id}` or `HISTORY_READY:{request_id}`
  - System notifications: `METADATA_UPDATE` (for symbol metadata updates)
- **Behavior**:
  - Clients subscribe to ticker topics and notification topics via Relay XPUB
  - Relay forwards subscriptions to Flink via XSUB
  - Flink publishes processed market data and notifications
  - Relay proxies data to subscribed clients (stateless forwarding)
  - Dynamic subscription management (no pre-registration)

**Internal Flink notification path (port 5561)**:

- Flink task managers send `HistoryReadyNotification` via PUSH to the job manager's PULL socket (port 5561)
- `HistoryNotificationForwarder` (job manager) receives it and republishes on MARKET_DATA_PUB (port 5558)
- This decouples task manager instances from direct pub/sub and supports multi-task-manager setups

### 4. User Event Channels (User Containers → Gateway)

See [user-events.md](user-events.md) for the full spec, including ZMQ patterns, protobuf schemas, and delivery semantics for ports 5570 and 5571.

## Message Envelope Format

The core protocol uses two ZeroMQ frames:

```
Frame 1: [1 byte: protocol version]
Frame 2: [1 byte: message type ID][N bytes: protobuf message]
```

This two-frame approach allows receivers to check the protocol version before parsing the message type and protobuf payload.
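Expressed as code, the envelope reduces to a pair of pack/unpack helpers. This is a minimal sketch, assuming standard protobuf-generated classes with `SerializeToString()`; the helper names and the version constant's value are not part of the protocol.

```python
PROTOCOL_VERSION = b"\x01"  # assumed value of the version byte

def pack(msg_type_id, proto_msg):
    """Build the two logical frames: [version], [type ID + protobuf payload]."""
    return [PROTOCOL_VERSION, bytes([msg_type_id]) + proto_msg.SerializeToString()]

def unpack(frames):
    """Split the two frames into (version, type ID, raw protobuf payload)."""
    version_frame, body = frames
    return version_frame[0], body[0], body[1:]

# Usage: send a single OHLC candle (type 0x05) over any ZMQ socket `sock`:
#   sock.send_multipart(pack(0x05, ohlc_msg))
# and on the receiving side, after any transport frames are stripped:
#   version, type_id, payload = unpack(sock.recv_multipart())
```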
**Important**: Some ZeroMQ socket patterns (PUB/SUB, XPUB/XSUB) may prepend additional frames for routing purposes. For example:

- **PUB/SUB with topic filtering**: SUB sockets receive `[topic frame][version frame][message frame]`
- **ROUTER sockets**: Prepend identity frames before the message

Components must handle these additional frames appropriately:

- SUB sockets: Skip the first frame (topic), then parse the remaining frames as the standard two-frame envelope
- ROUTER sockets: Extract the identity frames, then parse the standard two-frame envelope

The two-frame envelope is the **logical protocol format**; physical transmission may include additional ZeroMQ transport frames.

## Message Type IDs

| Type ID | Message Type               | Description                                     |
|---------|----------------------------|-------------------------------------------------|
| 0x01    | DataRequest                | Request for historical or realtime data         |
| 0x02    | DataResponse (deprecated)  | Historical data response (no longer used)       |
| 0x03    | IngestorControl            | Control messages for ingestors                  |
| 0x04    | Tick                       | Individual trade tick data                      |
| 0x05    | OHLC                       | Single OHLC candle with volume                  |
| 0x06    | Market                     | Market metadata                                 |
| 0x07    | OHLCRequest (deprecated)   | Client request (replaced by SubmitHistorical)   |
| 0x08    | Response (deprecated)      | Generic response (replaced by SubmitResponse)   |
| 0x09    | CEPTriggerRequest          | Register CEP trigger                            |
| 0x0A    | CEPTriggerAck              | CEP trigger acknowledgment                      |
| 0x0B    | CEPTriggerEvent            | CEP trigger fired callback                      |
| 0x0C    | OHLCBatch                  | Batch of OHLC rows with metadata (Kafka)        |
| 0x10    | SubmitHistoricalRequest    | Client request for historical data (async)      |
| 0x11    | SubmitResponse             | Immediate ack with notification topic           |
| 0x12    | HistoryReadyNotification   | Notification that data is ready in Iceberg      |
| 0x13    | SymbolMetadataUpdated      | Notification that symbol metadata refreshed     |
| 0x20    | UserEvent                  | Container → Gateway event (see user-events.md)  |
| 0x21    | EventAck                   | Gateway → Container acknowledgment              |

## Error Handling

**Async Architecture Error Handling**:

- Failed historical requests: the ingestor writes an error marker to Kafka
- Flink reads the error marker and publishes a HistoryReadyNotification with ERROR status
- Client timeout: if no notification is received within the timeout, assume failure
- REQ/REP timeouts: 30 seconds default for client request submission
- PUB/SUB has no delivery guarantees (Kafka provides durability)
- No response routing is needed; all notifications go via topic-based pub/sub

**Durability**:

- All data flows through Kafka for durability
- Flink checkpointing ensures exactly-once processing
- A client can retry a request with a new request_id if no notification is received (see the sketch below)
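The retry rule is worth pinning down in code: each attempt uses a fresh request_id, and the subscription is re-established before resubmitting. A minimal sketch, assuming a hypothetical `submit_request` helper for the REQ/REP submission and the illustrative relay hostname used earlier:

```python
import uuid
import zmq

NOTIFY_TIMEOUT_MS = 30_000  # matches the 30-second default above

def fetch_history(ctx, submit_request, attempts=3):
    """Retry the async historical flow until a readiness notification arrives."""
    for _ in range(attempts):
        request_id = uuid.uuid4().hex               # fresh id on every attempt
        topic = f"HISTORY_READY:{request_id}".encode()

        sub = ctx.socket(zmq.SUB)
        sub.connect("tcp://relay:5558")             # Relay XPUB (hostname assumed)
        sub.setsockopt(zmq.SUBSCRIBE, topic)        # subscribe BEFORE submitting
        try:
            submit_request(request_id)              # hypothetical REQ/REP submit helper
            if sub.poll(NOTIFY_TIMEOUT_MS):
                return sub.recv_multipart()         # [topic][version][type + payload]
        finally:
            sub.close()
        # No notification within the timeout: assume failure and retry.
    raise TimeoutError("no HistoryReadyNotification after retries")
```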