# Scaling Notes

## TODO: Flink-to-Relay ZMQ Discovery

Currently Relay connects to Flink via XSUB on a single endpoint. With multiple Flink instances behind a K8s service, we need many-to-many connectivity.

**Problem**: K8s service load balancing doesn't help ZMQ since connections are persistent. Relay needs to connect to ALL Flink instances to receive all published messages.

**Proposed Solution**: Use a K8s headless service for Flink workers:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: flink-workers
spec:
  clusterIP: None
  selector:
    app: flink
```

Relay implementation:

1. On startup and periodically (every N seconds), resolve `flink-workers.namespace.svc.cluster.local`
2. DNS returns A records for all Flink pod IPs
3. Diff against current XSUB connections
4. Connect to new pods, disconnect from removed pods

**Alternative approaches considered**:

- XPUB/XSUB broker: Adds single point of failure and latency
- Service discovery (etcd/Redis): More complex, requires additional infrastructure

**Open questions**:

- Appropriate polling interval for DNS resolution (5–10 seconds?)
- Handling of brief disconnection during pod replacement
- Whether to use K8s Endpoints API watch instead of DNS polling for faster reaction
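
**Sketch of the Relay polling loop**: The four Relay steps under the proposed solution could look roughly like the following. This is a minimal sketch, not the actual Relay code: the function names (`resolve_endpoints`, `diff_endpoints`, `sync_connections`, `run`), the `tcp://` transport, the port, and the 5-second default interval are all assumptions for illustration. The `sock` argument is assumed to be any object with `connect(endpoint)` and `disconnect(endpoint)` methods, which a pyzmq XSUB socket provides.

```python
import socket
import time


def resolve_endpoints(hostname, port):
    """Resolve a headless service name to one tcp:// endpoint per A record."""
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry's sockaddr is (ip, port); a set removes duplicate records.
    return {f"tcp://{info[4][0]}:{port}" for info in infos}


def diff_endpoints(current, resolved):
    """Return (to_connect, to_disconnect) for the two endpoint sets."""
    return resolved - current, current - resolved


def sync_connections(sock, current, resolved):
    """Connect to new pods, disconnect from removed pods, return the new set."""
    to_connect, to_disconnect = diff_endpoints(current, resolved)
    for endpoint in sorted(to_connect):
        sock.connect(endpoint)
    for endpoint in sorted(to_disconnect):
        sock.disconnect(endpoint)
    return set(resolved)


def run(sock, hostname, port, interval_s=5.0):
    """Poll DNS forever and keep the socket's connections in sync."""
    current = set()
    while True:
        try:
            resolved = resolve_endpoints(hostname, port)
        except socket.gaierror:
            # Assumption: on a failed lookup, keep existing connections
            # rather than tearing everything down.
            resolved = current
        current = sync_connections(sock, current, resolved)
        time.sleep(interval_s)
```

With pyzmq, `sock` would be an XSUB socket created from a `zmq.Context`, and `run` would be called with the headless service hostname and the Flink publisher port. Keeping the diff in a pure function makes the connect/disconnect logic testable without a cluster, and the same `sync_connections` step could be reused if the DNS poll is later replaced by an Endpoints API watch.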