# Gateway Iceberg Schemas

Gateway persistence layer tables for conversation history and LangGraph checkpoints.

## Tables

### gateway.conversations

Stores all conversation messages from the agent harness across all channels (WebSocket, Telegram, etc.).

**Schema**:

| Column | Type | Description |
|--------|------|-------------|
| id | string | Unique message ID: `{user_id}:{session_id}:{timestamp_ms}` |
| user_id | string | User identifier (for GDPR compliance) |
| session_id | string | Conversation session identifier |
| role | string | Message role: 'user', 'assistant', 'system', 'tool' |
| content | string | Message content (text, JSON for tool calls/results) |
| metadata | string | JSON metadata (channel, model, tokens, attachments, etc.) |
| timestamp | long | Message timestamp in microseconds (UTC) |

**Natural Key**: `(id)` - unique per message; includes the timestamp for ordering

**Partitioning**: `(user_id, days(timestamp))`
- Partition by user_id for efficient GDPR deletion
- Partition by days for time-range queries
- Hidden partitioning - not exposed in queries

**Iceberg Version**: Format v1 (1.10.1)
- Append-only writes (no updates/deletes via Flink)
- Copy-on-write mode for query performance
- Schema evolution supported

**Storage Format**: Parquet with Snappy compression

**Write Pattern**:
- Gateway → Kafka topic `gateway_conversations` → Flink → Iceberg
- Flink job buffers and writes micro-batches
- Near real-time persistence (5-10 second latency)
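The natural key and the hidden day partition are both derived from the same microsecond timestamp. A minimal sketch of the derivation - the helper names are illustrative, not part of the gateway code:

```python
from datetime import datetime, timezone

US_PER_DAY = 86_400_000_000  # microseconds in one day

def message_id(user_id: str, session_id: str, ts_us: int) -> str:
    """Build the natural key `{user_id}:{session_id}:{timestamp_ms}`."""
    return f"{user_id}:{session_id}:{ts_us // 1000}"

def partition_day(ts_us: int) -> int:
    """Days since the Unix epoch, matching what Iceberg's days()
    transform computes from a microsecond timestamp."""
    return ts_us // US_PER_DAY

ts = int(datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc).timestamp() * 1_000_000)
print(message_id("user123", "session456", ts))  # user123:session456:1735732800000
print(partition_day(ts))  # 20089 (days since epoch)
```

Because the partitioning is hidden, queries filter on `timestamp` directly and Iceberg prunes day partitions automatically.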

**Query Patterns**:
```sql
-- Get recent conversation history for a session
SELECT role, content, timestamp
FROM gateway.conversations
WHERE user_id = 'user123'
  AND session_id = 'session456'
  AND timestamp > (UNIX_MICROS(CURRENT_TIMESTAMP()) - 86400000000) -- Last 24h
ORDER BY timestamp DESC
LIMIT 50;

-- Search user's conversations across all sessions
SELECT session_id, role, content, timestamp
FROM gateway.conversations
WHERE user_id = 'user123'
  AND timestamp BETWEEN 1735689600000000 AND 1736294399000000
ORDER BY timestamp DESC;

-- Count messages by channel
SELECT
  JSON_EXTRACT_SCALAR(metadata, '$.channel') AS channel,
  COUNT(*) AS message_count
FROM gateway.conversations
WHERE user_id = 'user123'
GROUP BY JSON_EXTRACT_SCALAR(metadata, '$.channel');
```
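The microsecond literals in the queries above map directly to UTC datetimes. A small helper (illustrative only) for building such bounds:

```python
from datetime import datetime, timezone

def to_micros(dt: datetime) -> int:
    """UTC datetime -> microseconds since epoch, the unit of `timestamp`."""
    return int(dt.timestamp() * 1_000_000)

# The BETWEEN bounds in the second query correspond to this one-week range:
start = to_micros(datetime(2025, 1, 1, tzinfo=timezone.utc))
end = to_micros(datetime(2025, 1, 7, 23, 59, 59, tzinfo=timezone.utc))
print(start, end)  # 1735689600000000 1736294399000000

# The 24-hour window constant used in the first query:
print(24 * 60 * 60 * 1_000_000)  # 86400000000
```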

**GDPR Compliance**:
```sql
-- Delete all messages for a user (via catalog API, not Flink)
DELETE FROM gateway.conversations WHERE user_id = 'user123';
```

---

### gateway.checkpoints

Stores LangGraph checkpoints for agent workflow state persistence.

**Schema**:

| Column | Type | Description |
|--------|------|-------------|
| id | string | Checkpoint ID: `{user_id}:{session_id}:{checkpoint_ns}:{timestamp_ms}` |
| user_id | string | User identifier (for GDPR compliance) |
| session_id | string | Conversation session identifier |
| checkpoint_ns | string | LangGraph checkpoint namespace (default: 'default') |
| thread_id | string | LangGraph thread identifier (often same as session_id) |
| parent_checkpoint_id | string | Parent checkpoint ID for replay/branching |
| checkpoint_data | string | Serialized checkpoint state (JSON) |
| metadata | string | JSON metadata (step_count, node_name, status, etc.) |
| timestamp | long | Checkpoint timestamp in microseconds (UTC) |

**Natural Key**: `(id)` - unique per checkpoint

**Partitioning**: `(user_id, days(timestamp))`
- Partition by user_id for GDPR compliance
- Partition by days for efficient pruning of old checkpoints
- Hidden partitioning

**Iceberg Version**: Format v1 (1.10.1)
- Append-only writes
- Checkpoints are immutable once written
- Copy-on-write mode

**Storage Format**: Parquet with Snappy compression

**Write Pattern**:
- Gateway → Kafka topic `gateway_checkpoints` → Flink → Iceberg
- Checkpoints written on each LangGraph step
- Critical for workflow resumption after failures
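Because checkpoints are immutable and each row carries `parent_checkpoint_id`, a workflow branch can be reconstructed client-side by walking the parent chain after fetching a session's rows. A minimal in-memory sketch (the row layout is illustrative):

```python
def lineage(rows: list[dict], checkpoint_id: str) -> list[str]:
    """Walk parent_checkpoint_id links from a checkpoint back to the
    root, returning checkpoint ids in execution order (root first)."""
    by_id = {r["id"]: r for r in rows}
    chain = []
    cur = checkpoint_id
    while cur is not None:
        chain.append(cur)
        cur = by_id[cur].get("parent_checkpoint_id")
    return list(reversed(chain))

rows = [
    {"id": "u1:s1:default:100", "parent_checkpoint_id": None},
    {"id": "u1:s1:default:200", "parent_checkpoint_id": "u1:s1:default:100"},
    {"id": "u1:s1:default:300", "parent_checkpoint_id": "u1:s1:default:200"},
]
print(lineage(rows, "u1:s1:default:300"))
# ['u1:s1:default:100', 'u1:s1:default:200', 'u1:s1:default:300']
```

Resuming after a failure means loading the latest checkpoint; replay or branching means restarting from any ancestor in this chain.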

**Query Patterns**:
```sql
-- Get latest checkpoint for a session
SELECT checkpoint_data, metadata, timestamp
FROM gateway.checkpoints
WHERE user_id = 'user123'
  AND session_id = 'session456'
  AND checkpoint_ns = 'default'
ORDER BY timestamp DESC
LIMIT 1;

-- Get checkpoint history for debugging
SELECT id, parent_checkpoint_id,
       JSON_EXTRACT_SCALAR(metadata, '$.node_name') AS node,
       JSON_EXTRACT_SCALAR(metadata, '$.step_count') AS step,
       timestamp
FROM gateway.checkpoints
WHERE user_id = 'user123'
  AND session_id = 'session456'
ORDER BY timestamp;

-- Find checkpoints for a specific workflow node
SELECT *
FROM gateway.checkpoints
WHERE user_id = 'user123'
  AND JSON_EXTRACT_SCALAR(metadata, '$.node_name') = 'human_approval'
  AND JSON_EXTRACT_SCALAR(metadata, '$.status') = 'pending'
ORDER BY timestamp DESC;
```

**GDPR Compliance**:
```sql
-- Delete all checkpoints for a user
DELETE FROM gateway.checkpoints WHERE user_id = 'user123';
```

---

## Kafka Topics

### gateway_conversations
- **Partitions**: 6 (partitioned by user_id hash)
- **Replication Factor**: 3
- **Retention**: 7 days (Kafka), unlimited (Iceberg)
- **Schema**: Protobuf (see `protobuf/gateway_messages.proto`)

### gateway_checkpoints
- **Partitions**: 6 (partitioned by user_id hash)
- **Replication Factor**: 3
- **Retention**: 7 days (Kafka), unlimited (Iceberg)
- **Schema**: Protobuf (see `protobuf/gateway_checkpoints.proto`)
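Keying records by user_id guarantees that all of one user's events land in the same partition, preserving per-user ordering. Kafka's default partitioner murmur2-hashes the key; the sketch below uses CRC32 as a stand-in to demonstrate the stability property, not the exact partition placement:

```python
import zlib

NUM_PARTITIONS = 6

def partition_for(user_id: str) -> int:
    # Stand-in for Kafka's murmur2 key hashing; actual partition numbers
    # will differ, but the same key always maps to the same partition.
    return zlib.crc32(user_id.encode("utf-8")) % NUM_PARTITIONS

p = partition_for("user123")
assert all(partition_for("user123") == p for _ in range(10))
assert 0 <= p < NUM_PARTITIONS
```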

---

## Flink Integration

### SchemaInitializer.java

Add to `flink/src/main/java/com/dexorder/flink/iceberg/SchemaInitializer.java` (imports belong at the top of the file):

```java
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.types.Types;

import static org.apache.iceberg.types.Types.NestedField.optional;
import static org.apache.iceberg.types.Types.NestedField.required;

// Initialize gateway.conversations table
TableIdentifier conversationsTable = TableIdentifier.of("gateway", "conversations");
Schema conversationsSchema = new Schema(
    required(1, "id", Types.StringType.get()),
    required(2, "user_id", Types.StringType.get()),
    required(3, "session_id", Types.StringType.get()),
    required(4, "role", Types.StringType.get()),
    required(5, "content", Types.StringType.get()),
    optional(6, "metadata", Types.StringType.get()),
    required(7, "timestamp", Types.LongType.get())
);

PartitionSpec conversationsPartitionSpec = PartitionSpec.builderFor(conversationsSchema)
    .identity("user_id")
    .day("timestamp", "timestamp_day")
    .build();

catalog.createTable(conversationsTable, conversationsSchema, conversationsPartitionSpec);

// Initialize gateway.checkpoints table
TableIdentifier checkpointsTable = TableIdentifier.of("gateway", "checkpoints");
Schema checkpointsSchema = new Schema(
    required(1, "id", Types.StringType.get()),
    required(2, "user_id", Types.StringType.get()),
    required(3, "session_id", Types.StringType.get()),
    required(4, "checkpoint_ns", Types.StringType.get()),
    required(5, "thread_id", Types.StringType.get()),
    optional(6, "parent_checkpoint_id", Types.StringType.get()),
    required(7, "checkpoint_data", Types.StringType.get()),
    optional(8, "metadata", Types.StringType.get()),
    required(9, "timestamp", Types.LongType.get())
);

PartitionSpec checkpointsPartitionSpec = PartitionSpec.builderFor(checkpointsSchema)
    .identity("user_id")
    .day("timestamp", "timestamp_day")
    .build();

catalog.createTable(checkpointsTable, checkpointsSchema, checkpointsPartitionSpec);
```

### Flink Jobs

**ConversationsSink.java** - Read from `gateway_conversations` Kafka topic and write to Iceberg:

```java
TableLoader tableLoader = TableLoader.fromCatalog(
    CatalogLoader.rest("gateway", catalogUri),
    TableIdentifier.of("gateway", "conversations")
);

DataStream<Row> messageRows = // ... from Kafka gateway_conversations

// FlinkSink.forRow expects a Flink TableSchema describing the rows;
// append() builds and attaches the append-only sink.
FlinkSink.forRow(messageRows, conversationsSchema)
    .tableLoader(tableLoader)
    .append();
```

**CheckpointsSink.java** - Read from `gateway_checkpoints` Kafka topic and write to Iceberg:

```java
TableLoader tableLoader = TableLoader.fromCatalog(
    CatalogLoader.rest("gateway", catalogUri),
    TableIdentifier.of("gateway", "checkpoints")
);

DataStream<Row> checkpointRows = // ... from Kafka gateway_checkpoints

// append() builds and attaches the append-only sink.
FlinkSink.forRow(checkpointRows, checkpointsSchema)
    .tableLoader(tableLoader)
    .append();
```

---

## Access Patterns

### Gateway (Write)
- Buffers messages/checkpoints in Redis (hot layer, 1-hour TTL)
- Async sends to Kafka topics (fire-and-forget)
- Flink streams Kafka → Iceberg (durable storage)

### Gateway (Read)
- Recent data: Read from Redis (fast)
- Historical data: Query Iceberg via REST catalog (slower, cold storage)
- Use iceberg-js for JavaScript/Node.js queries

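The hot/cold split behaves like a read-through cache: check the Redis hot layer first, and fall back to Iceberg only on a miss. Sketched below with in-memory dicts standing in for both stores (names are illustrative):

```python
def read_history(session_key: str, hot: dict, cold: dict) -> list:
    """Serve recent data from the hot layer; fall back to cold storage."""
    if session_key in hot:              # Redis hit (recent, within 1h TTL)
        return hot[session_key]
    return cold.get(session_key, [])    # Iceberg query (historical)

hot = {"user123:session456": ["recent message"]}
cold = {"user123:session456": ["old message"], "user123:old": ["archived"]}
print(read_history("user123:session456", hot, cold))  # ['recent message']
print(read_history("user123:old", hot, cold))         # ['archived']
```

Because writes are fire-and-forget into Kafka, a just-written message may be visible in Redis seconds before it is queryable in Iceberg.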
### Analytics/Debugging
- Query Iceberg tables directly via REST API
- Use Spark or Trino for complex analytics
- GDPR deletions via catalog API

---

## Catalog Configuration

Same as trading namespace (see main README):

```yaml
catalog:
  type: rest
  uri: http://iceberg-catalog:8181
  warehouse: s3://gateway-warehouse/
  s3:
    endpoint: http://minio:9000
    access-key-id: ${S3_ACCESS_KEY}
    secret-access-key: ${S3_SECRET_KEY}
```
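The `${S3_ACCESS_KEY}`-style placeholders are resolved from the environment before the config reaches the catalog client. A minimal expansion sketch (the function name is illustrative; the actual loader may differ):

```python
import os
import re

def expand_env(value: str) -> str:
    """Replace ${VAR} placeholders with values from the environment."""
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), value)

os.environ["S3_ACCESS_KEY"] = "minio-access"
print(expand_env("${S3_ACCESS_KEY}"))  # minio-access
print(expand_env("no-placeholder"))    # no-placeholder
```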

---

## References

- [LangGraph Checkpointer](https://langchain-ai.github.io/langgraph/concepts/persistence/)
- [Apache Iceberg Partitioning](https://iceberg.apache.org/docs/latest/partitioning/)
- [Flink Iceberg Connector](https://iceberg.apache.org/docs/latest/flink/)