# Iceberg Schema Definitions

We use Apache Iceberg for historical data storage. The catalog's metadata is stored in a PostgreSQL database. This directory contains the schema files and database setup.

## Tables

### trading.ohlc

Historical OHLC (Open, High, Low, Close, Volume) candle data for all periods in a single table.

**Schema**: `ohlc_schema.sql`

**Natural Key**: `(ticker, period_seconds, timestamp)` - uniqueness enforced by the application

**Partitioning**: `(ticker, days(timestamp))`

- Partition by ticker to isolate different markets
- Partition by day for efficient time-range queries
- Hidden partitioning - queries filter on the source columns; partition values are derived automatically and never appear in queries

**Iceberg Version**: Format v2 (1.10.1)

- Uses equality delete files for deduplication (row-level deletes require format v2)
- Flink upsert mode generates equality deletes
- Last-write-wins semantics for duplicates
- Writes are merge-on-read (delete files are applied at read time); periodic compaction keeps query performance high

**Deduplication**:

- Flink Iceberg sink with upsert mode
- Equality delete files on `(ticker, period_seconds, timestamp)`
- PyIceberg automatically filters deleted rows during queries

**Storage Format**: Parquet with Snappy compression

**Supported Periods**: Any period in seconds (60, 300, 900, 3600, 14400, 86400, 604800, etc.)
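The last-write-wins semantics that the equality deletes implement can be sketched as a toy model in plain Python (illustrative only — not Iceberg code; the row layout and sample values are hypothetical):

```python
# Toy model of last-write-wins upsert semantics on the natural key
# (ticker, period_seconds, timestamp). Each incoming row replaces any
# earlier row with the same key, which is what the Flink upsert sink
# achieves with equality delete files.

def upsert(table: dict, rows: list[dict]) -> dict:
    for row in rows:
        key = (row["ticker"], row["period_seconds"], row["timestamp"])
        table[key] = row  # last write wins
    return table

candles: dict = {}
upsert(candles, [
    {"ticker": "BINANCE:BTC/USDT", "period_seconds": 60,
     "timestamp": 1735689600000000, "close": 93000.0},
    # Same natural key: this row supersedes the one above
    {"ticker": "BINANCE:BTC/USDT", "period_seconds": 60,
     "timestamp": 1735689600000000, "close": 93010.0},
])
```

After the upsert, `candles` holds a single row per natural key, with the later write's values.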
**Usage**:

```sql
-- Query 1-hour candles for a specific ticker and time range
SELECT *
FROM trading.ohlc
WHERE ticker = 'BINANCE:BTC/USDT'
  AND period_seconds = 3600
  AND timestamp BETWEEN 1735689600000000 AND 1736294399000000
ORDER BY timestamp;

-- Query the most recent 1-minute candles
SELECT *
FROM trading.ohlc
WHERE ticker = 'BINANCE:BTC/USDT'
  AND period_seconds = 60
  AND timestamp > (UNIX_MICROS(CURRENT_TIMESTAMP()) - 3600000000)
ORDER BY timestamp DESC
LIMIT 60;

-- Count candles per period for a ticker
SELECT period_seconds, COUNT(*) AS candle_count
FROM trading.ohlc
WHERE ticker = 'BINANCE:BTC/USDT'
GROUP BY period_seconds;
```

## Access Patterns

### Flink (Write)

- Reads OHLCBatch from Kafka
- Writes rows to the Iceberg table
- Uses the Iceberg Flink connector
- Upsert mode to handle duplicates

### Client-Py (Read)

- Queries historical data after receiving HistoryReadyNotification
- Uses PyIceberg or the Iceberg REST API
- Read-only access via the Iceberg catalog

### Web UI (Read)

- Queries for chart display
- Time-series queries with partition pruning
- Read-only access

## Catalog Configuration

The Iceberg catalog is accessed via its REST API:

```yaml
catalog:
  type: rest
  uri: http://iceberg-catalog:8181
  warehouse: s3://trading-warehouse/
  s3:
    endpoint: http://minio:9000
    access-key-id: ${S3_ACCESS_KEY}
    secret-access-key: ${S3_SECRET_KEY}
```

## Table Naming Convention

`{namespace}.ohlc` where:

- `namespace`: Trading namespace (default: "trading")
- All OHLC data is stored in a single table
- Partitioned by ticker and date for efficient queries

## Integration Examples

### Flink Write

```java
TableLoader tableLoader = TableLoader.fromCatalog(
    CatalogLoader.rest("trading", catalogUri),
    TableIdentifier.of("trading", "ohlc")
);

DataStream<Row> ohlcRows = // ...
// converted from OHLCBatch

FlinkSink.forRow(ohlcRows, schema)
    .tableLoader(tableLoader)
    .equalityFieldColumns(Arrays.asList("ticker", "period_seconds", "timestamp"))
    .upsert(true)
    .append();
```

Note that upsert mode requires the equality field columns (the natural key), and the builder is finalized with `.append()`.

### Python Read

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import And, EqualTo, GreaterThanOrEqual

catalog = load_catalog("trading", uri="http://iceberg-catalog:8181")
table = catalog.load_table("trading.ohlc")

# Query with filters; deleted rows are filtered out automatically
df = table.scan(
    row_filter=And(
        EqualTo("ticker", "BINANCE:BTC/USDT"),
        EqualTo("period_seconds", 3600),
        GreaterThanOrEqual("timestamp", 1735689600000000),
    )
).to_pandas()
```

## References

- [Apache Iceberg Documentation](https://iceberg.apache.org/)
- [Flink Iceberg Connector](https://iceberg.apache.org/docs/latest/flink/)
- [PyIceberg](https://py.iceberg.apache.org/)
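## Appendix: Timestamp Units

The query examples above filter on epoch timestamps in microseconds. A small stdlib helper (an illustrative convenience, not part of the project) for producing those bounds:

```python
from datetime import datetime, timezone

def to_epoch_micros(dt: datetime) -> int:
    """Convert a timezone-aware datetime to microseconds since the
    Unix epoch, the unit used in trading.ohlc timestamp filters."""
    return int(dt.timestamp() * 1_000_000)

# The one-week range used in the SQL example above:
start = to_epoch_micros(datetime(2025, 1, 1, tzinfo=timezone.utc))
end = to_epoch_micros(datetime(2025, 1, 7, 23, 59, 59, tzinfo=timezone.utc))
```

`start` is 1735689600000000 (2025-01-01T00:00:00Z) and `end` is 1736294399000000 (2025-01-07T23:59:59Z), matching the `BETWEEN` bounds in the usage example.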