prod alpha deploy
35
doc/scaling.md
Normal file
@@ -0,0 +1,35 @@
# Scaling Notes
## TODO: Flink-to-Relay ZMQ Discovery

Currently, Relay connects to Flink through an XSUB socket on a single endpoint. With multiple Flink instances behind a K8s service, we need many-to-many connectivity: every Relay must reach every Flink instance.

**Problem**: K8s service load balancing doesn't help ZMQ because connections are persistent. A connection made through the service is pinned to one Flink pod for its lifetime, so Relay only sees that pod's messages. Relay needs to connect to ALL Flink instances to receive all published messages.

**Proposed Solution**: Use a K8s headless service for Flink workers, so that DNS returns the individual pod IPs instead of a single virtual IP:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: flink-workers
spec:
  clusterIP: None
  selector:
    app: flink
```

Relay implementation:

1. On startup and periodically (every N seconds), resolve `flink-workers.<namespace>.svc.cluster.local`
2. DNS returns A records for all Flink pod IPs
3. Diff the returned IPs against the current XSUB connections
4. Connect to new pods, disconnect from removed pods

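
The resolve/diff/connect steps above could be sketched roughly as follows. This is a minimal illustration, not the Relay code: `resolve_pod_ips`, `endpoint`, `reconcile`, and the port `5556` are hypothetical names and values, and `sock` is assumed to be any object exposing `connect(addr)` and `disconnect(addr)`, as a pyzmq socket does.

```python
import socket
from typing import Iterable, Set


def resolve_pod_ips(hostname: str) -> Set[str]:
    """Return the set of IPv4 addresses DNS currently publishes for hostname."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def endpoint(ip: str, port: int) -> str:
    """Build the ZMQ TCP endpoint string for one pod (port is an assumption)."""
    return f"tcp://{ip}:{port}"


def reconcile(sock, connected: Set[str], resolved: Iterable[str], port: int) -> Set[str]:
    """Connect to newly seen pods, disconnect from vanished ones.

    Returns the new set of connected IPs so the caller can pass it to the next round.
    """
    wanted = set(resolved)
    for ip in sorted(wanted - connected):   # pods that appeared since last poll
        sock.connect(endpoint(ip, port))
    for ip in sorted(connected - wanted):   # pods that disappeared since last poll
        sock.disconnect(endpoint(ip, port))
    return wanted
```

A caller would keep the returned set between rounds, for example `connected = reconcile(xsub, connected, resolve_pod_ips(host), 5556)`. Connecting before disconnecting is a deliberate choice in this sketch so that a rolling replacement never leaves Relay with fewer peers than necessary.
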

**Alternative approaches considered**:

- XPUB/XSUB broker: Adds a single point of failure and extra latency
- Service discovery (etcd/Redis): More complex, requires additional infrastructure


**Open questions**:

- Appropriate polling interval for DNS resolution (5–10 seconds?)
- Handling of brief disconnections during pod replacement
- Whether to use K8s Endpoints API watch instead of DNS polling for faster reaction

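
To make the polling-interval question concrete, one possible shape for the DNS polling loop is sketched below. It is only an illustration under assumptions: `poll_membership`, `interval_s`, `max_polls`, and the callback signatures are hypothetical, and the choice to keep existing connections when resolution fails is one option for riding out brief DNS hiccups, not a settled decision.

```python
import time
from typing import Callable, Optional, Set


def poll_membership(
    resolve: Callable[[], Set[str]],
    on_change: Callable[[Set[str], Set[str]], None],
    interval_s: float = 5.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Set[str]:
    """Poll resolve() every interval_s seconds and report membership changes.

    on_change(added, removed) is called only when the set of IPs differs from
    the previous successful poll. max_polls=None means loop forever.
    """
    current: Set[str] = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        try:
            latest = set(resolve())
        except OSError:
            # socket.gaierror is an OSError subclass: on a failed lookup,
            # keep the existing connections rather than dropping every peer.
            latest = current
        added, removed = latest - current, current - latest
        if added or removed:
            on_change(added, removed)
            current = latest
        polls += 1
        if max_polls is None or polls < max_polls:
            sleep(interval_s)
    return current
```

With this structure the worst-case delay before Relay notices a new Flink pod is roughly `interval_s` plus DNS propagation time, which is the trade-off an Endpoints API watch would avoid.
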