# Iceberg Schema Definitions

We use Apache Iceberg for historical data storage. The catalog's metadata is stored in a PostgreSQL database. This directory contains the schema files and database setup.

## Tables

### trading.ohlc

Historical OHLC (Open, High, Low, Close, Volume) candle data for all periods in a single table.

**Schema**: `ohlc_schema.sql`

**Natural Key**: `(ticker, period_seconds, timestamp)` - uniqueness enforced by the application

**Partitioning**: `(ticker, days(timestamp))`

- Partition by ticker to isolate different markets
- Partition by day for efficient time-range queries
- Hidden partitioning - queries filter on the source columns; partition values are derived automatically and never appear in queries

**Iceberg Version**: Format v2 (1.10.1)

- Uses equality delete files for deduplication (row-level deletes require format v2)
- Flink upsert mode generates equality deletes
- Last-write-wins semantics for duplicates
- Writes are merge-on-read (delete files are applied at read time); periodic compaction keeps query performance high

**Deduplication**:

- Flink Iceberg sink with upsert mode
- Equality delete files on `(ticker, period_seconds, timestamp)`
- PyIceberg automatically filters deleted rows during queries

**Storage Format**: Parquet with Snappy compression

**Supported Periods**: Any period in seconds (60, 300, 900, 3600, 14400, 86400, 604800, etc.)
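The last-write-wins semantics that the equality deletes implement can be sketched as a toy model in plain Python (illustrative only — not Iceberg code; the row layout and sample values are hypothetical):

```python
# Toy model of last-write-wins upsert semantics on the natural key
# (ticker, period_seconds, timestamp). Each incoming row replaces any
# earlier row with the same key, which is what the Flink upsert sink
# achieves with equality delete files.

def upsert(table: dict, rows: list[dict]) -> dict:
    for row in rows:
        key = (row["ticker"], row["period_seconds"], row["timestamp"])
        table[key] = row  # last write wins
    return table

candles: dict = {}
upsert(candles, [
    {"ticker": "BINANCE:BTC/USDT", "period_seconds": 60,
     "timestamp": 1735689600000000, "close": 93000.0},
    # Same natural key: this row supersedes the one above
    {"ticker": "BINANCE:BTC/USDT", "period_seconds": 60,
     "timestamp": 1735689600000000, "close": 93010.0},
])
```

After the upsert, `candles` holds a single row per natural key, with the later write's values.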
**Usage**:

```sql
-- Query 1-hour candles for a specific ticker and time range
SELECT *
FROM trading.ohlc
WHERE ticker = 'BINANCE:BTC/USDT'
  AND period_seconds = 3600
  AND timestamp BETWEEN 1735689600000000 AND 1736294399000000
ORDER BY timestamp;

-- Query the most recent 1-minute candles
SELECT *
FROM trading.ohlc
WHERE ticker = 'BINANCE:BTC/USDT'
  AND period_seconds = 60
  AND timestamp > (UNIX_MICROS(CURRENT_TIMESTAMP()) - 3600000000)
ORDER BY timestamp DESC
LIMIT 60;

-- Count candles per period for a ticker
SELECT period_seconds, COUNT(*) AS candle_count
FROM trading.ohlc
WHERE ticker = 'BINANCE:BTC/USDT'
GROUP BY period_seconds;
```

## Access Patterns

### Flink (Write)

- Reads OHLCBatch from Kafka
- Writes rows to the Iceberg table
- Uses the Iceberg Flink connector
- Upsert mode to handle duplicates

### Client-Py (Read)

- Queries historical data after receiving HistoryReadyNotification
- Uses PyIceberg or the Iceberg REST API
- Read-only access via the Iceberg catalog

### Web UI (Read)

- Queries for chart display
- Time-series queries with partition pruning
- Read-only access

## Catalog Configuration

The Iceberg catalog is accessed via its REST API:

```yaml
catalog:
  type: rest
  uri: http://iceberg-catalog:8181
  warehouse: s3://trading-warehouse/
  s3:
    endpoint: http://minio:9000
    access-key-id: ${S3_ACCESS_KEY}
    secret-access-key: ${S3_SECRET_KEY}
```

## Table Naming Convention

`{namespace}.ohlc` where:

- `namespace`: Trading namespace (default: "trading")
- All OHLC data is stored in a single table
- Partitioned by ticker and date for efficient queries

## Integration Examples

### Flink Write

```java
TableLoader tableLoader = TableLoader.fromCatalog(
    CatalogLoader.rest("trading", catalogUri),
    TableIdentifier.of("trading", "ohlc")
);

DataStream<Row> ohlcRows = // ...
// converted from OHLCBatch

FlinkSink.forRow(ohlcRows, schema)
    .tableLoader(tableLoader)
    .equalityFieldColumns(Arrays.asList("ticker", "period_seconds", "timestamp"))
    .upsert(true)
    .append();
```

Note that upsert mode requires the equality field columns (the natural key), and the builder is finalized with `.append()`.

### Python Read

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import And, EqualTo, GreaterThanOrEqual

catalog = load_catalog("trading", uri="http://iceberg-catalog:8181")
table = catalog.load_table("trading.ohlc")

# Query with filters; deleted rows are filtered out automatically
df = table.scan(
    row_filter=And(
        EqualTo("ticker", "BINANCE:BTC/USDT"),
        EqualTo("period_seconds", 3600),
        GreaterThanOrEqual("timestamp", 1735689600000000),
    )
).to_pandas()
```

## References

- [Apache Iceberg Documentation](https://iceberg.apache.org/)
- [Flink Iceberg Connector](https://iceberg.apache.org/docs/latest/flink/)
- [PyIceberg](https://py.iceberg.apache.org/)
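## Appendix: Timestamp Units

The query examples above filter on epoch timestamps in microseconds. A small stdlib helper (an illustrative convenience, not part of the project) for producing those bounds:

```python
from datetime import datetime, timezone

def to_epoch_micros(dt: datetime) -> int:
    """Convert a timezone-aware datetime to microseconds since the
    Unix epoch, the unit used in trading.ohlc timestamp filters."""
    return int(dt.timestamp() * 1_000_000)

# The one-week range used in the SQL example above:
start = to_epoch_micros(datetime(2025, 1, 1, tzinfo=timezone.utc))
end = to_epoch_micros(datetime(2025, 1, 7, 23, 59, 59, tzinfo=timezone.utc))
```

`start` is 1735689600000000 (2025-01-01T00:00:00Z) and `end` is 1736294399000000 (2025-01-07T23:59:59Z), matching the `BETWEEN` bounds in the usage example.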