ai/doc/scaling.md
2026-04-10 16:09:39 -04:00

Scaling Notes

Currently Relay connects to Flink via XSUB on a single endpoint. With multiple Flink instances behind a K8s service, we need many-to-many connectivity.

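Problem: K8s service load balancing doesn't help ZMQ because it balances per TCP connection and ZMQ connections are persistent, so a Relay that connects through the service stays pinned to a single Flink pod. Relay needs to connect to ALL Flink instances to receive all published messages.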

Proposed Solution: Use a K8s headless service for Flink workers:

apiVersion: v1
kind: Service
metadata:
  name: flink-workers
spec:
  clusterIP: None
  selector:
    app: flink

Relay implementation:

  1. On startup and periodically (every N seconds), resolve flink-workers.<namespace>.svc.cluster.local
  2. DNS returns A records for all Flink pod IPs
  3. Diff against current XSUB connections
  4. Connect to new pods, disconnect from removed pods
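
The steps above could be sketched roughly as follows. This is a minimal illustration, not the Relay implementation: the port number, the function names, and the polling default are all assumptions, and `sock` is assumed to be any object with `connect(endpoint)` and `disconnect(endpoint)` methods, such as a pyzmq XSUB socket.

```python
import socket
import time

PORT = 5556  # hypothetical Flink publisher port; not specified in these notes


def resolve_endpoints(hostname, port=PORT):
    """Return a set of tcp:// endpoints, one per A record behind hostname.

    Returns None if the lookup fails, so the caller can keep its
    existing connections instead of tearing everything down.
    """
    try:
        infos = socket.getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        return None
    # Each entry is (family, type, proto, canonname, (ip, port)).
    return {f"tcp://{info[4][0]}:{port}" for info in infos}


def reconcile(sock, current, desired):
    """Connect to endpoints that are new, disconnect from ones that are gone."""
    for endpoint in sorted(desired - current):
        sock.connect(endpoint)
    for endpoint in sorted(current - desired):
        sock.disconnect(endpoint)
    return set(desired)


def poll_forever(sock, hostname, interval_s=5.0):
    """Resolve on startup, then re-resolve every interval_s seconds."""
    current = set()
    while True:
        desired = resolve_endpoints(hostname)
        if desired is not None:
            current = reconcile(sock, current, desired)
        time.sleep(interval_s)
```

With pyzmq, `sock` would presumably be created with `zmq.Context().socket(zmq.XSUB)`. The endpoint string passed to `disconnect` has to match the one passed to `connect`, which is why the sketch tracks the exact strings in a set.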

Alternative approaches considered:

  • XPUB/XSUB broker: Adds a single point of failure and extra latency
  • Service discovery (etcd/Redis): More complex, requires additional infrastructure

Open questions:

  • Appropriate polling interval for DNS resolution (5-10 seconds?)
  • Handling of brief disconnection during pod replacement
  • Whether to use K8s Endpoints API watch instead of DNS polling for faster reaction