backend redesign

This commit is contained in:
2026-03-11 18:47:11 -04:00
parent 8ff277c8c6
commit e99ef5d2dd
210 changed files with 12147 additions and 155 deletions

iceberg/README.md (new file, 138 lines)
# Iceberg Schema Definitions
We use Apache Iceberg for historical data storage; the Iceberg REST catalog stores its metadata in a PostgreSQL database.
This directory stores schema files and database setup.
## Tables
### trading.ohlc
Historical OHLC (Open, High, Low, Close, Volume) candle data for all periods in a single table.
**Schema**: `ohlc_schema.sql`
**Natural Key**: `(ticker, period_seconds, timestamp)` - uniqueness enforced by application
**Partitioning**: `(ticker, days(timestamp))`
- Partition by ticker to isolate different markets
- Partition by days for efficient time-range queries
- Hidden partitioning - queries filter on `ticker` and `timestamp` directly; the partition transform never appears in query predicates
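The `days(timestamp)` transform is computed by Iceberg itself, never by the query. As a rough sketch (assuming `timestamp` is epoch microseconds, as in the query examples below), the transform maps each row to a day ordinal:

```python
# Sketch of Iceberg's days() partition transform (assumes epoch microseconds).
MICROS_PER_DAY = 86_400_000_000

def days_partition(ts_micros: int) -> int:
    """Days since 1970-01-01 UTC; all rows in the same UTC day share a partition."""
    return ts_micros // MICROS_PER_DAY

# 2025-01-01T00:00:00Z and 2025-01-01T12:00:00Z land in the same partition
same_day = days_partition(1735689600000000) == days_partition(1735732800000000)
```

A time-range predicate on `timestamp` therefore prunes to the matching day partitions without the query ever mentioning the transform.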
**Iceberg Version**: Format v2 (Iceberg 1.10.1)
- Format v2 is required for the equality delete files used for deduplication
- Flink upsert mode generates equality deletes
- Last-write-wins semantics for duplicates
- Merge-on-read with periodic compaction to keep query performance high
**Deduplication**:
- Flink Iceberg sink with upsert mode
- Equality delete files on (ticker, period_seconds, timestamp)
- PyIceberg automatically filters deleted rows during queries
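Last-write-wins on the natural key can be illustrated in plain Python (a semantic sketch only; in production the Flink sink's equality deletes enforce this, not application code):

```python
# Illustrative only: last-write-wins dedup on the natural key
# (ticker, period_seconds, timestamp), matching the equality delete columns.
def last_write_wins(rows: list[dict]) -> list[dict]:
    latest = {}
    for row in rows:  # rows in write order; later rows overwrite earlier ones
        key = (row["ticker"], row["period_seconds"], row["timestamp"])
        latest[key] = row
    return list(latest.values())
```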
**Storage Format**: Parquet with Snappy compression
**Supported Periods**: Any period in seconds (60, 300, 900, 3600, 14400, 86400, 604800, etc.)
**Usage**:
```sql
-- Query 1-hour candles for specific ticker and time range
SELECT * FROM trading.ohlc
WHERE ticker = 'BINANCE:BTC/USDT'
AND period_seconds = 3600
AND timestamp BETWEEN 1735689600000000 AND 1736294399000000
ORDER BY timestamp;
-- Query most recent 1-minute candles
SELECT * FROM trading.ohlc
WHERE ticker = 'BINANCE:BTC/USDT'
AND period_seconds = 60
AND timestamp > (UNIX_MICROS(CURRENT_TIMESTAMP()) - 3600000000)
ORDER BY timestamp DESC
LIMIT 60;
-- Query all periods for a ticker
SELECT period_seconds, COUNT(*) as candle_count
FROM trading.ohlc
WHERE ticker = 'BINANCE:BTC/USDT'
GROUP BY period_seconds;
```
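The `timestamp` bounds in the queries above are epoch microseconds. A small helper (hypothetical, not part of the schema) for building those bounds from UTC datetimes:

```python
from datetime import datetime, timezone

def to_micros(dt: datetime) -> int:
    """Convert a naive-UTC or timezone-aware datetime to epoch microseconds."""
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return int(dt.timestamp()) * 1_000_000

start = to_micros(datetime(2025, 1, 1))  # 1735689600000000, as used above
```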
## Access Patterns
### Flink (Write)
- Reads OHLCBatch from Kafka
- Writes rows to Iceberg table
- Uses Iceberg Flink connector
- Upsert mode to handle duplicates
### Client-Py (Read)
- Queries historical data after receiving HistoryReadyNotification
- Uses PyIceberg or Iceberg REST API
- Read-only access via Iceberg catalog
### Web UI (Read)
- Queries for chart display
- Time-series queries with partition pruning
- Read-only access
## Catalog Configuration
The Iceberg catalog is accessed via REST API:
```yaml
catalog:
type: rest
uri: http://iceberg-catalog:8181
warehouse: s3://trading-warehouse/
s3:
endpoint: http://minio:9000
access-key-id: ${S3_ACCESS_KEY}
secret-access-key: ${S3_SECRET_KEY}
```
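The same settings can be passed to PyIceberg as catalog properties (a sketch: the property keys follow PyIceberg's `s3.*` convention, and credentials are read from the environment as in the YAML above):

```python
import os

# Hypothetical property dict mirroring the YAML catalog config above.
catalog_props = {
    "type": "rest",
    "uri": "http://iceberg-catalog:8181",
    "warehouse": "s3://trading-warehouse/",
    "s3.endpoint": "http://minio:9000",
    "s3.access-key-id": os.environ.get("S3_ACCESS_KEY", ""),
    "s3.secret-access-key": os.environ.get("S3_SECRET_KEY", ""),
}

# from pyiceberg.catalog import load_catalog
# catalog = load_catalog("trading", **catalog_props)
```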
## Table Naming Convention
`{namespace}.ohlc` where:
- `namespace`: Trading namespace (default: "trading")
- All OHLC data is stored in a single table
- Partitioned by ticker and date for efficient queries
## Integration Examples
### Flink Write
```java
// CatalogLoader has no rest() helper; the REST catalog implementation
// is supplied via custom().
Map<String, String> catalogProps = Map.of(
        "uri", catalogUri,
        "warehouse", "s3://trading-warehouse/");
TableLoader tableLoader = TableLoader.fromCatalog(
    CatalogLoader.custom(
        "trading", catalogProps, new Configuration(),
        "org.apache.iceberg.rest.RESTCatalog"),
    TableIdentifier.of("trading", "ohlc")
);
DataStream<Row> ohlcRows = // ... from OHLCBatch
FlinkSink.forRow(ohlcRows, schema)
    .tableLoader(tableLoader)
    .upsert(true)  // emit equality deletes on the key columns
    .equalityFieldColumns(List.of("ticker", "period_seconds", "timestamp"))
    .append();     // the builder's terminal call is append(), not build()
```
### Python Read
```python
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import And, EqualTo, GreaterThanOrEqual

catalog = load_catalog("trading", uri="http://iceberg-catalog:8181")
table = catalog.load_table("trading.ohlc")

# Query with filters; rows covered by delete files are filtered automatically
df = table.scan(
    row_filter=And(
        EqualTo("ticker", "BINANCE:BTC/USDT"),
        EqualTo("period_seconds", 3600),
        GreaterThanOrEqual("timestamp", 1735689600000000),
    )
).to_pandas()
```
## References
- [Apache Iceberg Documentation](https://iceberg.apache.org/)
- [Flink Iceberg Connector](https://iceberg.apache.org/docs/latest/flink/)
- [PyIceberg](https://py.iceberg.apache.org/)