---
description: "Predicts a stock's future T-day cumulative return using the K-nearest-neighbor algorithm on normalized price and volume features, then trades based on the predicted return signal."
tags: [stocks, machine-learning, knn, prediction]
---

# Machine Learning — Single-Stock KNN

**Section**: 3.17 | **Asset Class**: Stocks | **Type**: Machine Learning / Prediction

## Overview
This single-stock strategy uses the k-nearest-neighbor (KNN) algorithm to predict future cumulative stock returns based on a set of predictor (feature) variables derived from the stock's own price and volume history. For each stock, the model is trained independently using only that stock's data (no cross-sectional information). The predicted return is then used to generate long/short signals.

## Construction / Signal
**Target variable** (cumulative return over the next T trading days):
```
Y(t) = P(t-T) / P(t) - 1                                  (332)
```
Time is indexed backward: `t = 0` is today and larger `t` lies further in the past, so `P(t-T)` is the price T trading days after `P(t)`.

**Predictor variables** (moving averages of volume `V` and price `P` over varying windows T_1, T_2, T_3, ...):
```
X_1(t) = (1/T_1) * sum_{s=1}^{T_1} V(t+s)                 (333)  [volume MA]
X_2(t) = (1/T_2) * sum_{s=1}^{T_2} P(t+s)                 (334)  [price MA 1]
X_3(t) = (1/T_3) * sum_{s=1}^{T_3} P(t+s)                 (335)  [price MA 2]
...                                                       (336)
```
Each sum covers the days `t+1, ..., t+T_a`, i.e. the T_a trading days before t.

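As a rough illustration of Eqs. 332-335, the sketch below computes the target and the moving-average predictors with plain Python lists. It assumes `P[t]` and `V[t]` hold the price and volume t trading days ago (index 0 is today), mirroring the backward time index used here. The function names and the window arguments are hypothetical and not part of the source.

```python
def target_return(P, t, T):
    """Eq. 332: cumulative return over the T days after day t.

    P[t] is the price t trading days ago, so P[t - T] is T days later.
    """
    if t - T < 0:
        raise ValueError("return not yet realized: need t >= T")
    return P[t - T] / P[t] - 1.0


def moving_average(series, t, window):
    """Eqs. 333-335: mean of series[t+1 .. t+window], the `window` days before t."""
    if t + window >= len(series):
        raise ValueError("not enough history for this window")
    return sum(series[t + s] for s in range(1, window + 1)) / window


def features(P, V, t, vol_window, price_windows):
    """Feature vector [X_1, X_2, X_3, ...]: one volume MA, then one price MA per window."""
    row = [moving_average(V, t, vol_window)]
    row += [moving_average(P, t, w) for w in price_windows]
    return row
```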
Predictor variables are normalized to [0, 1] using the training period's min/max:
```
X_tilde_a(t) = (X_a(t) - X_a^-) / (X_a^+ - X_a^-)         (337)
```
where `X_a^+` and `X_a^-` are the max and min of `X_a(t)` over the training period.

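A minimal sketch of Eq. 337, assuming each training point is a list of m raw feature values. The helper names are hypothetical. Because the bounds come from the training period only, a point outside that period can map outside [0, 1]; the sketch leaves such values unclipped, and mapping a constant feature to 0.0 is an assumption made here to avoid dividing by zero.

```python
def fit_min_max(train_rows):
    """Per-feature (min, max) over the training rows; each row is one X(t')."""
    columns = list(zip(*train_rows))
    return [(min(col), max(col)) for col in columns]


def normalize(row, bounds):
    """Eq. 337 applied feature by feature, using training-period bounds."""
    out = []
    for x, (lo, hi) in zip(row, bounds):
        # Assumption of this sketch: a constant feature (hi == lo) maps to 0.0.
        out.append((x - lo) / (hi - lo) if hi > lo else 0.0)
    return out
```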
**KNN prediction**: for a given t, find the k nearest neighbors of `X_tilde_a(t)` among the training points `X_tilde_a(t')`, `t' = t+1, t+2, ..., t+T_*`, using the Euclidean distance over the m features:
```
[D(t, t')]^2 = sum_{a=1}^{m} (X_tilde_a(t) - X_tilde_a(t'))^2      (338)
```

**Predicted return** (simple average over the k nearest neighbors `t'_alpha(t)`, `alpha = 1, ..., k`):
```
Y(t) = (1/k) * sum_{alpha=1}^{k} Y(t'_alpha(t))           (339)
```

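The neighbor search and the uniform average of Eqs. 338-339 can be sketched as follows. The function names and the list-of-lists layout are assumptions for illustration. Note that by Eq. 332 the return `Y(t')` is only observable at time t once `t' >= t + T`, so a live implementation would presumably restrict the training set to such points; the sketch leaves that filtering to the caller.

```python
import math


def euclidean(u, v):
    """Eq. 338: Euclidean distance between two normalized feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def knn_predict(x_query, train_X, train_Y, k):
    """Eq. 339: mean target of the k training points nearest to x_query.

    train_X[i] is a normalized feature vector and train_Y[i] its realized return.
    """
    if not 1 <= k <= len(train_X):
        raise ValueError("k must be between 1 and the number of training points")
    # Rank training points by distance to the query and keep the k closest.
    order = sorted(range(len(train_X)), key=lambda i: euclidean(x_query, train_X[i]))
    nearest = order[:k]
    return sum(train_Y[i] for i in nearest) / k
```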
Alternatively, fit a linear model with weights `w_alpha` and intercept `v`:
```
Y(t) = sum_{alpha=1}^{k} Y(t'_alpha(t)) w_alpha + v       (340)
```
The coefficients `w_alpha` and `v` are fitted by regressing the realized `Y(t)` on the k neighbor returns `Y(t'_alpha(t))` over a sample of M values of t.

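One way to fit Eq. 340 is ordinary least squares on the M samples. The sketch below solves the normal equations with a small Gaussian elimination in pure Python, which is only reasonable for small k; a library least-squares routine would be the more robust choice. The function names, the input layout (one row of k neighbor returns per sample, ordered nearest first), and the singularity tolerance are assumptions.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:  # tolerance is an arbitrary choice
            raise ValueError("singular system: regressors are collinear")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_linear_knn(neighbor_returns, realized):
    """Least-squares estimate of (w, v) in Eq. 340.

    neighbor_returns[j] holds the k neighbor returns for the j-th sample t,
    realized[j] is the realized Y(t) for that sample.
    """
    rows = [list(r) + [1.0] for r in neighbor_returns]  # last column: intercept
    p = len(rows[0])
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Atb = [sum(r[i] * y for r, y in zip(rows, realized)) for i in range(p)]
    coef = solve_linear(AtA, Atb)
    return coef[:-1], coef[-1]


def predict_linear(neighbor_row, w, v):
    """Eq. 340 for one sample."""
    return sum(wi * yi for wi, yi in zip(w, neighbor_row)) + v
```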
**Trading signal** (z_1, z_2 are trader-defined thresholds):
```
Signal = { Establish long position     if Y > z_1
         { Liquidate long position     if Y <= z_2
         { Establish short position    if Y < -z_1
         { Liquidate short position    if Y >= -z_2        (341)
```

## Entry / Exit Rules
- **Long entry**: Predicted cumulative return `Y = Y(0) > z_1`
- **Long exit**: Predicted return `Y <= z_2` (where z_2 <= z_1)
- **Short entry**: Predicted return `Y < -z_1`
- **Short exit**: Predicted return `Y >= -z_2`
- All thresholds must be backtested out-of-sample.

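The rules above amount to a small state machine with hysteresis between z_2 and z_1. The sketch below is one possible reading, with positions encoded as -1 (short), 0 (flat), and +1 (long). The function name is hypothetical, and letting a liquidation and an opposite entry happen in the same step is an assumption the source does not spell out.

```python
def next_position(position, y_pred, z1, z2):
    """Apply Eq. 341 to one predicted return; assumes z2 <= z1.

    position: -1 (short), 0 (flat) or +1 (long).
    """
    # Exit rules first.
    if position == 1 and y_pred <= z2:
        position = 0  # liquidate long
    elif position == -1 and y_pred >= -z2:
        position = 0  # liquidate short
    # Entry rules apply only when flat (possibly just liquidated: an assumption).
    if position == 0:
        if y_pred > z1:
            position = 1  # establish long
        elif y_pred < -z1:
            position = -1  # establish short
    return position
```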
## Key Parameters
- **Number of neighbors k**: Typically `k = floor(sqrt(T_*))` or `k = ceiling(sqrt(T_*))` (T_* = training sample size)
- **Training sample size T_***: Number of historical time points used for training
- **Prediction horizon T**: Number of trading days for the target return
- **Feature set m**: Number and type of predictor variables (volume MAs, price MAs)
- **Thresholds z_1, z_2**: Entry and exit thresholds for signals (backtested)
- **Train/validation split**: E.g., 60% training, 40% cross-validation
- **Distance metric**: Euclidean (Eq. 338) or Manhattan distance

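For concreteness, the k heuristic and the 60/40 split could be computed as below. The function names are hypothetical. Splitting chronologically, with the oldest points used for training and the most recent for validation, is an assumption of this sketch; index 0 is the most recent point, following the backward time index.

```python
import math


def heuristic_k(train_size):
    """Return (floor(sqrt(T_*)), ceiling(sqrt(T_*))) using exact integer arithmetic."""
    root = math.isqrt(train_size)
    ceil_root = root if root * root == train_size else root + 1
    return root, ceil_root


def chronological_split(n_points, train_frac=0.6):
    """Split indices 0..n_points-1 (0 = most recent) into (training, validation)."""
    n_train = round(n_points * train_frac)
    n_valid = n_points - n_train
    validation = list(range(0, n_valid))       # most recent points
    training = list(range(n_valid, n_points))  # oldest points
    return training, validation
```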
## Variations
- **Weighted KNN**: Weight the k neighbors by their distance instead of averaging them uniformly as in Eq. 339; unlike the regression-fitted weights of Eq. 340, these weights follow directly from the distances
- **Cross-sectional extension**: Compute expected returns Y_i for N stocks and use as inputs to cross-sectional mean-reversion or other multi-stock strategies
- **Alternative features**: Fundamental data, earnings surprises, sentiment indicators in addition to price/volume

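As one example of the weighted variant, the sketch below uses inverse-distance weights. The source only says "distance-based", so the specific weighting scheme, the function name, and the small `eps` guard against a zero distance are all assumptions.

```python
import math


def weighted_knn_predict(x_query, train_X, train_Y, k, eps=1e-12):
    """Inverse-distance-weighted average of the k nearest neighbors' returns."""
    if not 1 <= k <= len(train_X):
        raise ValueError("k must be between 1 and the number of training points")
    dists = [math.dist(x_query, x) for x in train_X]  # Euclidean, as in Eq. 338
    nearest = sorted(range(len(train_X)), key=lambda i: dists[i])[:k]
    # Closer neighbors get larger weights; eps avoids division by zero.
    weights = [1.0 / (dists[i] + eps) for i in nearest]
    total = sum(weights)
    return sum(w * train_Y[i] for w, i in zip(weights, nearest)) / total
```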
## Notes
- This is a single-stock strategy: each stock's model is trained on that stock's own price/volume data only.
- The strategy must be backtested strictly out-of-sample; data leakage is a critical risk.
- Simple uniform KNN (Eq. 339) has no fitted parameters beyond the choice of k; the linear model (Eq. 340) must fit `w_alpha` and `v`, requires cross-validation, and is prone to out-of-sample instability.
- k can be optimized via backtesting; common heuristic: `k = floor(sqrt(T_*))`.
- Typical holding period: T trading days (matching the prediction horizon).
- Training/cross-validation split: e.g., 60%/40%.