# Gateway Iceberg Schemas

Gateway persistence layer tables for conversation history and LangGraph checkpoints.

## Tables

### gateway.conversations

Stores all conversation messages from the agent harness across all channels (WebSocket, Telegram, etc.).

**Schema**:

| Column | Type | Description |
|--------|------|-------------|
| id | string | Unique message ID: `{user_id}:{session_id}:{timestamp_ms}` |
| user_id | string | User identifier (for GDPR compliance) |
| session_id | string | Conversation session identifier |
| role | string | Message role: 'user', 'assistant', 'system', 'tool' |
| content | string | Message content (text, JSON for tool calls/results) |
| metadata | string | JSON metadata (channel, model, tokens, attachments, etc.) |
| timestamp | long | Message timestamp in microseconds (UTC) |

**Natural Key**: `(id)` - unique per message; includes the timestamp for ordering

**Partitioning**: `(user_id, days(timestamp))`
- Partition by user_id for efficient GDPR deletion
- Partition by days for time-range queries
- Hidden partitioning - not exposed in queries

**Iceberg Version**: Format v1 (1.10.1)
- Append-only writes (no updates/deletes via Flink)
- Copy-on-write mode for query performance
- Schema evolution supported

**Storage Format**: Parquet with Snappy compression

**Write Pattern**:
- Gateway → Kafka topic `gateway_conversations` → Flink → Iceberg
- Flink job buffers and writes micro-batches
- Near real-time persistence (5-10 second latency)
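The natural key and the hidden day partition are both derived from the same microsecond timestamp. A minimal sketch of the derivation - the helper names are illustrative, not part of the gateway code:

```python
from datetime import datetime, timezone

US_PER_DAY = 86_400_000_000  # microseconds in one day

def message_id(user_id: str, session_id: str, ts_us: int) -> str:
    """Build the natural key `{user_id}:{session_id}:{timestamp_ms}`."""
    return f"{user_id}:{session_id}:{ts_us // 1000}"

def partition_day(ts_us: int) -> int:
    """Days since the Unix epoch, matching what Iceberg's days()
    transform computes from a microsecond timestamp."""
    return ts_us // US_PER_DAY

ts = int(datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc).timestamp() * 1_000_000)
print(message_id("user123", "session456", ts))  # user123:session456:1735732800000
print(partition_day(ts))  # 20089 (days since epoch)
```

Because the partitioning is hidden, queries filter on `timestamp` directly and Iceberg prunes day partitions automatically.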

**Query Patterns**:
```sql
-- Get recent conversation history for a session
SELECT role, content, timestamp
FROM gateway.conversations
WHERE user_id = 'user123'
  AND session_id = 'session456'
  AND timestamp > (UNIX_MICROS(CURRENT_TIMESTAMP()) - 86400000000) -- Last 24h
ORDER BY timestamp DESC
LIMIT 50;

-- Search user's conversations across all sessions
SELECT session_id, role, content, timestamp
FROM gateway.conversations
WHERE user_id = 'user123'
  AND timestamp BETWEEN 1735689600000000 AND 1736294399000000
ORDER BY timestamp DESC;

-- Count messages by channel
SELECT
  JSON_EXTRACT_SCALAR(metadata, '$.channel') AS channel,
  COUNT(*) AS message_count
FROM gateway.conversations
WHERE user_id = 'user123'
GROUP BY JSON_EXTRACT_SCALAR(metadata, '$.channel');
```
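The microsecond literals in the queries above map directly to UTC datetimes. A small helper (illustrative only) for building such bounds:

```python
from datetime import datetime, timezone

def to_micros(dt: datetime) -> int:
    """UTC datetime -> microseconds since epoch, the unit of `timestamp`."""
    return int(dt.timestamp() * 1_000_000)

# The BETWEEN bounds in the second query correspond to this one-week range:
start = to_micros(datetime(2025, 1, 1, tzinfo=timezone.utc))
end = to_micros(datetime(2025, 1, 7, 23, 59, 59, tzinfo=timezone.utc))
print(start, end)  # 1735689600000000 1736294399000000

# The 24-hour window constant used in the first query:
print(24 * 60 * 60 * 1_000_000)  # 86400000000
```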

**GDPR Compliance**:
```sql
-- Delete all messages for a user (via catalog API, not Flink)
DELETE FROM gateway.conversations WHERE user_id = 'user123';
```

---

### gateway.checkpoints

Stores LangGraph checkpoints for agent workflow state persistence.

**Schema**:

| Column | Type | Description |
|--------|------|-------------|
| id | string | Checkpoint ID: `{user_id}:{session_id}:{checkpoint_ns}:{timestamp_ms}` |
| user_id | string | User identifier (for GDPR compliance) |
| session_id | string | Conversation session identifier |
| checkpoint_ns | string | LangGraph checkpoint namespace (default: 'default') |
| thread_id | string | LangGraph thread identifier (often same as session_id) |
| parent_checkpoint_id | string | Parent checkpoint ID for replay/branching |
| checkpoint_data | string | Serialized checkpoint state (JSON) |
| metadata | string | JSON metadata (step_count, node_name, status, etc.) |
| timestamp | long | Checkpoint timestamp in microseconds (UTC) |

**Natural Key**: `(id)` - unique per checkpoint

**Partitioning**: `(user_id, days(timestamp))`
- Partition by user_id for GDPR compliance
- Partition by days for efficient pruning of old checkpoints
- Hidden partitioning

**Iceberg Version**: Format v1 (1.10.1)
- Append-only writes
- Checkpoints are immutable once written
- Copy-on-write mode

**Storage Format**: Parquet with Snappy compression

**Write Pattern**:
- Gateway → Kafka topic `gateway_checkpoints` → Flink → Iceberg
- Checkpoints written on each LangGraph step
- Critical for workflow resumption after failures
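Because checkpoints are immutable and each row carries `parent_checkpoint_id`, a workflow branch can be reconstructed client-side by walking the parent chain after fetching a session's rows. A minimal in-memory sketch (the row layout is illustrative):

```python
def lineage(rows: list[dict], checkpoint_id: str) -> list[str]:
    """Walk parent_checkpoint_id links from a checkpoint back to the
    root, returning checkpoint ids in execution order (root first)."""
    by_id = {r["id"]: r for r in rows}
    chain = []
    cur = checkpoint_id
    while cur is not None:
        chain.append(cur)
        cur = by_id[cur].get("parent_checkpoint_id")
    return list(reversed(chain))

rows = [
    {"id": "u1:s1:default:100", "parent_checkpoint_id": None},
    {"id": "u1:s1:default:200", "parent_checkpoint_id": "u1:s1:default:100"},
    {"id": "u1:s1:default:300", "parent_checkpoint_id": "u1:s1:default:200"},
]
print(lineage(rows, "u1:s1:default:300"))
# ['u1:s1:default:100', 'u1:s1:default:200', 'u1:s1:default:300']
```

Resuming after a failure means loading the latest checkpoint; replay or branching means restarting from any ancestor in this chain.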

**Query Patterns**:
```sql
-- Get latest checkpoint for a session
SELECT checkpoint_data, metadata, timestamp
FROM gateway.checkpoints
WHERE user_id = 'user123'
  AND session_id = 'session456'
  AND checkpoint_ns = 'default'
ORDER BY timestamp DESC
LIMIT 1;

-- Get checkpoint history for debugging
SELECT id, parent_checkpoint_id,
       JSON_EXTRACT_SCALAR(metadata, '$.node_name') AS node,
       JSON_EXTRACT_SCALAR(metadata, '$.step_count') AS step,
       timestamp
FROM gateway.checkpoints
WHERE user_id = 'user123'
  AND session_id = 'session456'
ORDER BY timestamp;

-- Find checkpoints for a specific workflow node
SELECT *
FROM gateway.checkpoints
WHERE user_id = 'user123'
  AND JSON_EXTRACT_SCALAR(metadata, '$.node_name') = 'human_approval'
  AND JSON_EXTRACT_SCALAR(metadata, '$.status') = 'pending'
ORDER BY timestamp DESC;
```

**GDPR Compliance**:
```sql
-- Delete all checkpoints for a user
DELETE FROM gateway.checkpoints WHERE user_id = 'user123';
```

---

## Kafka Topics

### gateway_conversations
- **Partitions**: 6 (partitioned by user_id hash)
- **Replication Factor**: 3
- **Retention**: 7 days (Kafka), unlimited (Iceberg)
- **Schema**: Protobuf (see `protobuf/gateway_messages.proto`)

### gateway_checkpoints
- **Partitions**: 6 (partitioned by user_id hash)
- **Replication Factor**: 3
- **Retention**: 7 days (Kafka), unlimited (Iceberg)
- **Schema**: Protobuf (see `protobuf/gateway_checkpoints.proto`)
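Keying records by user_id guarantees that all of one user's events land in the same partition, preserving per-user ordering. Kafka's default partitioner murmur2-hashes the key; the sketch below uses CRC32 as a stand-in to demonstrate the stability property, not the exact partition placement:

```python
import zlib

NUM_PARTITIONS = 6

def partition_for(user_id: str) -> int:
    # Stand-in for Kafka's murmur2 key hashing; actual partition numbers
    # will differ, but the same key always maps to the same partition.
    return zlib.crc32(user_id.encode("utf-8")) % NUM_PARTITIONS

p = partition_for("user123")
assert all(partition_for("user123") == p for _ in range(10))
assert 0 <= p < NUM_PARTITIONS
```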

---

## Flink Integration

### SchemaInitializer.java

Add to `flink/src/main/java/com/dexorder/flink/iceberg/SchemaInitializer.java` (imports belong at the top of the file):

```java
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.types.Types;

import static org.apache.iceberg.types.Types.NestedField.optional;
import static org.apache.iceberg.types.Types.NestedField.required;

// Initialize gateway.conversations table
TableIdentifier conversationsTable = TableIdentifier.of("gateway", "conversations");
Schema conversationsSchema = new Schema(
    required(1, "id", Types.StringType.get()),
    required(2, "user_id", Types.StringType.get()),
    required(3, "session_id", Types.StringType.get()),
    required(4, "role", Types.StringType.get()),
    required(5, "content", Types.StringType.get()),
    optional(6, "metadata", Types.StringType.get()),
    required(7, "timestamp", Types.LongType.get())
);

PartitionSpec conversationsPartitionSpec = PartitionSpec.builderFor(conversationsSchema)
    .identity("user_id")
    .day("timestamp", "timestamp_day")
    .build();

catalog.createTable(conversationsTable, conversationsSchema, conversationsPartitionSpec);

// Initialize gateway.checkpoints table
TableIdentifier checkpointsTable = TableIdentifier.of("gateway", "checkpoints");
Schema checkpointsSchema = new Schema(
    required(1, "id", Types.StringType.get()),
    required(2, "user_id", Types.StringType.get()),
    required(3, "session_id", Types.StringType.get()),
    required(4, "checkpoint_ns", Types.StringType.get()),
    required(5, "thread_id", Types.StringType.get()),
    optional(6, "parent_checkpoint_id", Types.StringType.get()),
    required(7, "checkpoint_data", Types.StringType.get()),
    optional(8, "metadata", Types.StringType.get()),
    required(9, "timestamp", Types.LongType.get())
);

PartitionSpec checkpointsPartitionSpec = PartitionSpec.builderFor(checkpointsSchema)
    .identity("user_id")
    .day("timestamp", "timestamp_day")
    .build();

catalog.createTable(checkpointsTable, checkpointsSchema, checkpointsPartitionSpec);
```

### Flink Jobs

**ConversationsSink.java** - Read from `gateway_conversations` Kafka topic and write to Iceberg:

```java
TableLoader tableLoader = TableLoader.fromCatalog(
    CatalogLoader.rest("gateway", catalogUri),
    TableIdentifier.of("gateway", "conversations")
);

DataStream<Row> messageRows = // ... from Kafka gateway_conversations

// FlinkSink.forRow expects a Flink TableSchema describing the rows;
// append() builds and attaches the append-only sink.
FlinkSink.forRow(messageRows, conversationsSchema)
    .tableLoader(tableLoader)
    .append();
```

**CheckpointsSink.java** - Read from `gateway_checkpoints` Kafka topic and write to Iceberg:

```java
TableLoader tableLoader = TableLoader.fromCatalog(
    CatalogLoader.rest("gateway", catalogUri),
    TableIdentifier.of("gateway", "checkpoints")
);

DataStream<Row> checkpointRows = // ... from Kafka gateway_checkpoints

// append() builds and attaches the append-only sink.
FlinkSink.forRow(checkpointRows, checkpointsSchema)
    .tableLoader(tableLoader)
    .append();
```

---

## Access Patterns

### Gateway (Write)
- Buffers messages/checkpoints in Redis (hot layer, 1-hour TTL)
- Async sends to Kafka topics (fire-and-forget)
- Flink streams Kafka → Iceberg (durable storage)

### Gateway (Read)
- Recent data: Read from Redis (fast)
- Historical data: Query Iceberg via REST catalog (slower, cold storage)
- Use iceberg-js for JavaScript/Node.js queries

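The hot/cold split behaves like a read-through cache: check the Redis hot layer first, and fall back to Iceberg only on a miss. Sketched below with in-memory dicts standing in for both stores (names are illustrative):

```python
def read_history(session_key: str, hot: dict, cold: dict) -> list:
    """Serve recent data from the hot layer; fall back to cold storage."""
    if session_key in hot:              # Redis hit (recent, within 1h TTL)
        return hot[session_key]
    return cold.get(session_key, [])    # Iceberg query (historical)

hot = {"user123:session456": ["recent message"]}
cold = {"user123:session456": ["old message"], "user123:old": ["archived"]}
print(read_history("user123:session456", hot, cold))  # ['recent message']
print(read_history("user123:old", hot, cold))         # ['archived']
```

Because writes are fire-and-forget into Kafka, a just-written message may be visible in Redis seconds before it is queryable in Iceberg.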
### Analytics/Debugging
- Query Iceberg tables directly via REST API
- Use Spark or Trino for complex analytics
- GDPR deletions via catalog API

---

## Catalog Configuration

Same as trading namespace (see main README):

```yaml
catalog:
  type: rest
  uri: http://iceberg-catalog:8181
  warehouse: s3://gateway-warehouse/
  s3:
    endpoint: http://minio:9000
    access-key-id: ${S3_ACCESS_KEY}
    secret-access-key: ${S3_SECRET_KEY}
```
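The `${S3_ACCESS_KEY}`-style placeholders are resolved from the environment before the config reaches the catalog client. A minimal expansion sketch (the function name is illustrative; the actual loader may differ):

```python
import os
import re

def expand_env(value: str) -> str:
    """Replace ${VAR} placeholders with values from the environment."""
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), value)

os.environ["S3_ACCESS_KEY"] = "minio-access"
print(expand_env("${S3_ACCESS_KEY}"))  # minio-access
print(expand_env("no-placeholder"))    # no-placeholder
```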

---

## References

- [LangGraph Checkpointer](https://langchain-ai.github.io/langgraph/concepts/persistence/)
- [Apache Iceberg Partitioning](https://iceberg.apache.org/docs/latest/partitioning/)
- [Flink Iceberg Connector](https://iceberg.apache.org/docs/latest/flink/)