prod alpha deploy

2026-04-10 16:09:39 -04:00
parent 7d231169d9
commit 6418729b16
17 changed files with 375 additions and 967 deletions
--- a/doc/scaling.md
+++ b/doc/scaling.md
@@ -0,0 +1,35 @@
+# Scaling Notes
+
+## TODO: Flink-to-Relay ZMQ Discovery
+
+Relay currently connects its XSUB socket to a single Flink endpoint. Once multiple Flink instances run behind a K8s service, every Relay must connect to every Flink instance (many-to-many connectivity).
+
+**Problem**: A regular K8s service balances load per connection, and ZMQ connections are persistent, so each Relay connection stays pinned to whichever Flink pod it first reached. Relay needs to connect to ALL Flink instances to receive all published messages.
+
+**Proposed Solution**: Use a K8s headless service (`clusterIP: None`) for Flink workers, so that DNS exposes the individual pod IPs instead of a single virtual IP:
+
+```yaml
+apiVersion: v1
+kind: Service
+metadata:
+  name: flink-workers
+spec:
+  clusterIP: None
+  selector:
+    app: flink
+```
+
+Relay implementation:
+1. On startup and periodically (every N seconds), resolve `flink-workers.<namespace>.svc.cluster.local`
+2. DNS returns one A record per ready Flink pod IP
+3. Diff against current XSUB connections
+4. Connect to new pods, disconnect from removed pods
+
+**Alternative approaches considered**:
+- XPUB/XSUB broker: Adds a single point of failure and an extra hop of latency
+- Service discovery (etcd/Redis): More complex, requires additional infrastructure
+
+**Open questions**:
+- Appropriate polling interval for DNS resolution (5–10 seconds?)
+- Handling of brief disconnection during pod replacement
+- Whether to watch the K8s Endpoints (or EndpointSlice) API instead of polling DNS, for faster reaction
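
A minimal sketch of steps 1 to 4 under "Relay implementation", written in Python purely for illustration and not taken from the Relay codebase. The names `resolve_pod_ips`, `reconcile`, and `SupportsConnect` are hypothetical, as is any port number passed in. The socket argument is assumed to be something like a pyzmq XSUB socket, which exposes `connect(addr)` and `disconnect(addr)`; the sketch itself only uses the standard library.

```python
import socket
from typing import Iterable, Protocol, Set


class SupportsConnect(Protocol):
    """Minimal socket interface (hypothetical); a pyzmq socket provides both methods."""

    def connect(self, addr: str) -> None: ...
    def disconnect(self, addr: str) -> None: ...


def resolve_pod_ips(hostname: str) -> Set[str]:
    """Steps 1-2: resolve the headless service name to its current IPv4 addresses."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def reconcile(
    sock: SupportsConnect, connected: Set[str], resolved: Iterable[str], port: int
) -> Set[str]:
    """Steps 3-4: diff resolved IPs against current connections and apply the changes.

    Returns the new set of connected IPs, to be passed in on the next poll.
    """
    resolved = set(resolved)
    # Connect to pods that appeared since the last poll.
    for ip in sorted(resolved - connected):
        sock.connect(f"tcp://{ip}:{port}")
    # Disconnect from pods that no longer resolve; the endpoint string
    # must match the one used for connect().
    for ip in sorted(connected - resolved):
        sock.disconnect(f"tcp://{ip}:{port}")
    return resolved
```

A polling loop would keep the returned set as its state, calling something like `connected = reconcile(xsub, connected, resolve_pod_ips(hostname), port)` every N seconds. If a lookup fails, `socket.getaddrinfo` raises `socket.gaierror`; catching it and skipping that round would avoid tearing down working connections on a transient DNS error, though how Relay should behave there is part of the open questions above.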