# Gateway Iceberg Schemas

Gateway persistence layer tables for conversation history and LangGraph checkpoints.

## Tables

### gateway.conversations

Stores all conversation messages from the agent harness across all channels (WebSocket, Telegram, etc.).

**Schema**:

| Column | Type | Description |
|--------|------|-------------|
| id | string | Unique message ID: `{user_id}:{session_id}:{timestamp_ms}` |
| user_id | string | User identifier (for GDPR compliance) |
| session_id | string | Conversation session identifier |
| role | string | Message role: 'user', 'assistant', 'system', 'tool' |
| content | string | Message content (text, JSON for tool calls/results) |
| metadata | string | JSON metadata (channel, model, tokens, attachments, etc.) |
| timestamp | long | Message timestamp in microseconds (UTC) |

**Natural Key**: `(id)` - unique per message, includes timestamp for ordering

**Partitioning**: `(user_id, days(timestamp))`

- Partition by user_id for efficient GDPR deletion
- Partition by days for time-range queries
- Hidden partitioning - not exposed in queries

**Iceberg Version**: Format v1 (1.10.1)

- Append-only writes (no updates/deletes via Flink)
- Copy-on-write mode for query performance
- Schema evolution supported

**Storage Format**: Parquet with Snappy compression

**Write Pattern**:

- Gateway → Kafka topic `gateway_conversations` → Flink → Iceberg
- Flink job buffers and writes micro-batches
- Near real-time persistence (5-10 second latency)

**Query Patterns**:

```sql
-- Get recent conversation history for a session
SELECT role, content, timestamp
FROM gateway.conversations
WHERE user_id = 'user123'
  AND session_id = 'session456'
  AND timestamp > (UNIX_MICROS(CURRENT_TIMESTAMP()) - 86400000000)  -- Last 24h
ORDER BY timestamp DESC
LIMIT 50;

-- Search user's conversations across all sessions
SELECT session_id, role, content, timestamp
FROM gateway.conversations
WHERE user_id = 'user123'
  AND timestamp BETWEEN 1735689600000000 AND 1736294399000000
ORDER BY
timestamp DESC;

-- Count messages by channel
SELECT JSON_EXTRACT_SCALAR(metadata, '$.channel') AS channel,
       COUNT(*) AS message_count
FROM gateway.conversations
WHERE user_id = 'user123'
GROUP BY JSON_EXTRACT_SCALAR(metadata, '$.channel');
```

**GDPR Compliance**:

```sql
-- Delete all messages for a user (via catalog API, not Flink)
DELETE FROM gateway.conversations WHERE user_id = 'user123';
```

---

### gateway.checkpoints

Stores LangGraph checkpoints for agent workflow state persistence.

**Schema**:

| Column | Type | Description |
|--------|------|-------------|
| id | string | Checkpoint ID: `{user_id}:{session_id}:{checkpoint_ns}:{timestamp_ms}` |
| user_id | string | User identifier (for GDPR compliance) |
| session_id | string | Conversation session identifier |
| checkpoint_ns | string | LangGraph checkpoint namespace (default: 'default') |
| thread_id | string | LangGraph thread identifier (often same as session_id) |
| parent_checkpoint_id | string | Parent checkpoint ID for replay/branching |
| checkpoint_data | string | Serialized checkpoint state (JSON) |
| metadata | string | JSON metadata (step_count, node_name, status, etc.) |
| timestamp | long | Checkpoint timestamp in microseconds (UTC) |

**Natural Key**: `(id)` - unique per checkpoint

**Partitioning**: `(user_id, days(timestamp))`

- Partition by user_id for GDPR compliance
- Partition by days for efficient pruning of old checkpoints
- Hidden partitioning

**Iceberg Version**: Format v1 (1.10.1)

- Append-only writes
- Checkpoints are immutable once written
- Copy-on-write mode

**Storage Format**: Parquet with Snappy compression

**Write Pattern**:

- Gateway → Kafka topic `gateway_checkpoints` → Flink → Iceberg
- Checkpoints written on each LangGraph step
- Critical for workflow resumption after failures

**Query Patterns**:

```sql
-- Get latest checkpoint for a session
SELECT checkpoint_data, metadata, timestamp
FROM gateway.checkpoints
WHERE user_id = 'user123'
  AND session_id = 'session456'
  AND checkpoint_ns = 'default'
ORDER BY timestamp DESC
LIMIT 1;

-- Get checkpoint history for debugging
SELECT id,
       parent_checkpoint_id,
       JSON_EXTRACT_SCALAR(metadata, '$.node_name') AS node,
       JSON_EXTRACT_SCALAR(metadata, '$.step_count') AS step,
       timestamp
FROM gateway.checkpoints
WHERE user_id = 'user123'
  AND session_id = 'session456'
ORDER BY timestamp;

-- Find checkpoints for a specific workflow node
SELECT *
FROM gateway.checkpoints
WHERE user_id = 'user123'
  AND JSON_EXTRACT_SCALAR(metadata, '$.node_name') = 'human_approval'
  AND JSON_EXTRACT_SCALAR(metadata, '$.status') = 'pending'
ORDER BY timestamp DESC;
```

**GDPR Compliance**:

```sql
-- Delete all checkpoints for a user
DELETE FROM gateway.checkpoints WHERE user_id = 'user123';
```

---

## Kafka Topics

### gateway_conversations

- **Partitions**: 6 (partitioned by user_id hash)
- **Replication Factor**: 3
- **Retention**: 7 days (Kafka), unlimited (Iceberg)
- **Schema**: Protobuf (see `protobuf/gateway_messages.proto`)

### gateway_checkpoints

- **Partitions**: 6 (partitioned by user_id hash)
- **Replication Factor**: 3
- **Retention**: 7 days (Kafka), unlimited (Iceberg)
- **Schema**: Protobuf
(see `protobuf/gateway_checkpoints.proto`)

---

## Flink Integration

### SchemaInitializer.java

Add to `flink/src/main/java/com/dexorder/flink/iceberg/SchemaInitializer.java`:

```java
// Initialize gateway.conversations table
TableIdentifier conversationsTable = TableIdentifier.of("gateway", "conversations");
Schema conversationsSchema = new Schema(
    required(1, "id", Types.StringType.get()),
    required(2, "user_id", Types.StringType.get()),
    required(3, "session_id", Types.StringType.get()),
    required(4, "role", Types.StringType.get()),
    required(5, "content", Types.StringType.get()),
    optional(6, "metadata", Types.StringType.get()),
    required(7, "timestamp", Types.LongType.get())
);
PartitionSpec conversationsPartitionSpec = PartitionSpec.builderFor(conversationsSchema)
    .identity("user_id")
    .day("timestamp", "timestamp_day")
    .build();
catalog.createTable(conversationsTable, conversationsSchema, conversationsPartitionSpec);

// Initialize gateway.checkpoints table
TableIdentifier checkpointsTable = TableIdentifier.of("gateway", "checkpoints");
Schema checkpointsSchema = new Schema(
    required(1, "id", Types.StringType.get()),
    required(2, "user_id", Types.StringType.get()),
    required(3, "session_id", Types.StringType.get()),
    required(4, "checkpoint_ns", Types.StringType.get()),
    required(5, "thread_id", Types.StringType.get()),
    optional(6, "parent_checkpoint_id", Types.StringType.get()),
    required(7, "checkpoint_data", Types.StringType.get()),
    optional(8, "metadata", Types.StringType.get()),
    required(9, "timestamp", Types.LongType.get())
);
PartitionSpec checkpointsPartitionSpec = PartitionSpec.builderFor(checkpointsSchema)
    .identity("user_id")
    .day("timestamp", "timestamp_day")
    .build();
catalog.createTable(checkpointsTable, checkpointsSchema, checkpointsPartitionSpec);
```

### Flink Jobs

**ConversationsSink.java** - Read from `gateway_conversations` Kafka topic and write to Iceberg:

```java
TableLoader tableLoader = TableLoader.fromCatalog(
    CatalogLoader.rest("gateway",
catalogUri),
    TableIdentifier.of("gateway", "conversations")
);

DataStream<Row> messageRows = ...; // from Kafka gateway_conversations

FlinkSink.forRow(messageRows, conversationsSchema)
    .tableLoader(tableLoader)
    .append(); // Append-only; append() is the terminal call that builds the sink
```

**CheckpointsSink.java** - Read from `gateway_checkpoints` Kafka topic and write to Iceberg:

```java
TableLoader tableLoader = TableLoader.fromCatalog(
    CatalogLoader.rest("gateway", catalogUri),
    TableIdentifier.of("gateway", "checkpoints")
);

DataStream<Row> checkpointRows = ...; // from Kafka gateway_checkpoints

FlinkSink.forRow(checkpointRows, checkpointsSchema)
    .tableLoader(tableLoader)
    .append(); // Append-only; append() is the terminal call that builds the sink
```

---

## Access Patterns

### Gateway (Write)

- Buffers messages/checkpoints in Redis (hot layer, 1-hour TTL)
- Async sends to Kafka topics (fire-and-forget)
- Flink streams Kafka → Iceberg (durable storage)

### Gateway (Read)

- Recent data: read from Redis (fast)
- Historical data: query Iceberg via REST catalog (slower, cold storage)
- Use iceberg-js for JavaScript/Node.js queries

### Analytics/Debugging

- Query Iceberg tables directly via REST API
- Use Spark or Trino for complex analytics
- GDPR deletions via catalog API

---

## Catalog Configuration

Same as the trading namespace (see main README):

```yaml
catalog:
  type: rest
  uri: http://iceberg-catalog:8181
  warehouse: s3://gateway-warehouse/
  s3:
    endpoint: http://minio:9000
    access-key-id: ${S3_ACCESS_KEY}
    secret-access-key: ${S3_SECRET_KEY}
```

---

## References

- [LangGraph Checkpointer](https://langchain-ai.github.io/langgraph/concepts/persistence/)
- [Apache Iceberg Partitioning](https://iceberg.apache.org/docs/latest/partitioning/)
- [Flink Iceberg Connector](https://iceberg.apache.org/docs/latest/flink/)
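The id layout (`{user_id}:{session_id}:{timestamp_ms}`) and the hidden `days(timestamp)` partitioning described above can be sanity-checked with a small stdlib-only sketch. The class and helper names here are hypothetical, the day computation only mirrors what Iceberg's `days()` transform does to a microsecond timestamp, and the sample constant reuses the epoch value from the range-query example:

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

// Illustrative helpers (hypothetical names) for the id format and the
// days(timestamp) partition transform; uses only java.time.
public class GatewayIds {
    // Build a conversations row id: {user_id}:{session_id}:{timestamp_ms}
    static String messageId(String userId, String sessionId, long timestampMs) {
        return userId + ":" + sessionId + ":" + timestampMs;
    }

    // The timestamp column stores microseconds (UTC); Iceberg's days()
    // transform maps each value to its UTC calendar date.
    static LocalDate dayPartition(long timestampMicros) {
        return Instant.ofEpochMilli(timestampMicros / 1000)
                .atZone(ZoneOffset.UTC)
                .toLocalDate();
    }

    public static void main(String[] args) {
        long tsMicros = 1735689600000000L;          // 2025-01-01T00:00:00Z
        long tsMs = tsMicros / 1000;                // id suffix is milliseconds
        System.out.println(messageId("user123", "session456", tsMs));
        // → user123:session456:1735689600000
        System.out.println(dayPartition(tsMicros)); // → 2025-01-01
    }
}
```

Note the unit mismatch to keep in mind: the `id` suffix is milliseconds while the `timestamp` column is microseconds, so the two differ by a factor of 1000.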
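The buffered micro-batch write pattern (Gateway → Kafka → Flink → Iceberg) can be illustrated in miniature with a stdlib-only sketch. This only models the buffering behaviour the Flink job provides — it is not the Flink API, and all names are hypothetical; the flush target stands in for an Iceberg append commit:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Stdlib-only illustration (hypothetical names) of micro-batched appends.
public class MicroBatchBuffer {
    private final int batchSize;
    private final Consumer<List<String>> flushTarget; // stands in for an Iceberg append
    private final List<String> buffer = new ArrayList<>();

    MicroBatchBuffer(int batchSize, Consumer<List<String>> flushTarget) {
        this.batchSize = batchSize;
        this.flushTarget = flushTarget;
    }

    void add(String row) {
        buffer.add(row);
        if (buffer.size() >= batchSize) {
            flush(); // full micro-batch: commit eagerly
        }
    }

    void flush() {
        if (!buffer.isEmpty()) {
            flushTarget.accept(new ArrayList<>(buffer)); // one commit per micro-batch
            buffer.clear();
        }
    }

    public static void main(String[] args) {
        List<List<String>> commits = new ArrayList<>();
        MicroBatchBuffer sink = new MicroBatchBuffer(2, commits::add);
        sink.add("msg-1");
        sink.add("msg-2"); // triggers a flush
        sink.add("msg-3");
        sink.flush();      // drain the remainder (Flink would do this on checkpoint)
        System.out.println(commits.size()); // → 2
    }
}
```

Batching this way is what keeps Iceberg snapshot counts manageable while still giving the 5-10 second persistence latency quoted above.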