kalshi-backtest/PROGRESS.md
Nicholai 3621d93643 feat(backtest): optimize exit strategy and position sizing
6 iterations of backtest refinements with key discoveries:
- stop losses don't work for prediction markets (prices gap)
- 50% take profit, no stop loss yields +9.37% vs +4.04% baseline
- diversification beats concentration: 100 positions → +18.98%
- added kalman filter, VPIN, regime detection scorers (research)

exit config: take_profit 50%, stop_loss disabled, 48h max hold
position sizing: kelly 0.40, max 30% per position, 100 max positions
2026-01-22 11:16:23 -07:00

kalshi backtest progress

this document tracks the development progress, algorithm details, and backtest results for the kalshi prediction market trading system.

last updated: 2026-01-22

backtest run #1

date: 2026-01-22 period: 2026-01-20 to 2026-01-22 (2 days) initial capital: $10,000 interval: 1 hour

results summary

| metric | strategy | random baseline | delta |
|---|---|---|---|
| total return | +$993.61 (+9.94%) | -$51.00 (-0.51%) | +$1,044.61 |
| sharpe ratio | 5.448 | -2.436 | +7.884 |
| max drawdown | 1.26% | 0.51% | +0.75% |
| win rate | 58.7% | 0.0% | +58.7% |
| total trades | 46 | 0 | +46 |
| avg trade pnl | $4.59 | $0.00 | +$4.59 |
| avg hold time | 5.4 hrs | 0.0 hrs | +5.4 hrs |

notable trades

| ticker | entry | exit | side | pnl | hold time |
|---|---|---|---|---|---|
| KXKHLGAME-26JAN21SEVHCS-SEV | $0.17 | $0.99 | Yes | +$81.98 | 1h |
| KXKHLGAME-26JAN21SEVHCS-HCS | $0.20 | $0.93 | Yes | +$72.98 | 1h |
| KXFIRSTSUPERBOWLSONG-26FEB09-DTM | $0.11 | $0.63 | Yes | +$51.99 | 2h |
| KXUCLBTTS-26JAN21OMLFC | $0.01 | $0.50 | Yes | +$49.00 | 1h |
| KXNCAAWBGAME-26JAN21BULAF-LAF | $0.43 | $0.80 | No | +$36.96 | 3h |

worst trades

| ticker | entry | exit | side | pnl | hold time |
|---|---|---|---|---|---|
| KXUCLBTTS-26JAN21GALATM | $0.40 | $0.01 | No | -$39.04 | 1h |
| KXUCLBTTS-26JAN21QARSGE | $0.35 | $0.01 | No | -$34.04 | 7h |
| KXFIRSTSUPERBOWLSONG-26FEB09-CAL | $0.35 | $0.07 | Yes | -$28.03 | 12h |
| KXUCLGAME-26JAN21ATAATH-ATA | $0.46 | $0.19 | No | -$27.05 | 14h |

algorithm architecture

pipeline overview

the system uses a modular pipeline architecture with four stages:

sources -> filters -> scorers -> selector
  1. sources - retrieve market candidates from historical data
  2. filters - remove unsuitable markets
  3. scorers - compute feature scores for each candidate
  4. selector - pick top-k candidates for trading
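
the four stages above can be sketched as a small trait-based pipeline. this is only an illustration: `Candidate`, `Filter`, `Scorer`, `MinVolume`, `VolumeAsScore` and `run_pipeline` are hypothetical names, not the types used in the repository, and the source stage is represented by the input vector.

```rust
/// Hypothetical candidate type; the real struct is not shown in this document.
#[derive(Debug, Clone)]
pub struct Candidate {
    pub ticker: String,
    pub volume_24h: u64,
    pub score: f64,
}

/// A filter decides whether a candidate stays in the pipeline.
pub trait Filter {
    fn keep(&self, c: &Candidate) -> bool;
}

/// A scorer adds its contribution to the candidate's score.
pub trait Scorer {
    fn score(&self, c: &mut Candidate);
}

/// Illustrative liquidity filter: keep markets with enough 24h volume.
pub struct MinVolume(pub u64);

impl Filter for MinVolume {
    fn keep(&self, c: &Candidate) -> bool {
        c.volume_24h >= self.0
    }
}

/// Illustrative scorer: log of 24h volume (not one of the real scorers).
pub struct VolumeAsScore;

impl Scorer for VolumeAsScore {
    fn score(&self, c: &mut Candidate) {
        c.score += (c.volume_24h as f64).ln();
    }
}

/// sources -> filters -> scorers -> selector (top-k by score, descending).
pub fn run_pipeline(
    mut candidates: Vec<Candidate>,
    filters: &[Box<dyn Filter>],
    scorers: &[Box<dyn Scorer>],
    k: usize,
) -> Vec<Candidate> {
    candidates.retain(|c| filters.iter().all(|f| f.keep(c)));
    for c in candidates.iter_mut() {
        for s in scorers {
            s.score(c);
        }
    }
    candidates.sort_by(|a, b| b.score.total_cmp(&a.score));
    candidates.truncate(k);
    candidates
}
```

keeping each stage behind a trait makes it cheap to add or remove a single scorer and re-run the backtest.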

current configuration

sources

| source | config |
|---|---|
| HistoricalMarketSource | lookback: 24 hours |

filters

| filter | config | purpose |
|---|---|---|
| LiquidityFilter | min_volume_24h: 100 | reject illiquid markets |
| TimeToCloseFilter | min: 2h, max: 720h | avoid expiring/distant markets |
| AlreadyPositionedFilter | max_position: 100 | prevent over-concentration |

typical filter stats (per interval):

  • ~17,000 candidates retrieved
  • ~10,000 pass liquidity filter (~58%)
  • ~7,200 pass time filter (~72% of remaining)
  • ~7,150 pass position filter (~99% of remaining)

scorers

the pipeline runs 8 scorers: seven that each contribute features, plus CategoryWeightedScorer, which combines them into final_score:

| scorer | features | lookback | description |
|---|---|---|---|
| MomentumScorer | momentum | 6h | price change over lookback window |
| MultiTimeframeMomentumScorer | mtf_momentum, mtf_divergence, mtf_alignment | 1h, 4h, 12h, 24h | multi-window momentum with divergence detection |
| MeanReversionScorer | mean_reversion | 24h | deviation from historical mean |
| BollingerMeanReversionScorer | bollinger_reversion, bollinger_position | 24h, 2.0 std | statistical band analysis |
| VolumeScorer | volume | 6h | log ratio of recent vs avg hourly volume |
| OrderFlowScorer | order_flow | - | buy/sell imbalance from taker_side |
| TimeDecayScorer | time_decay | - | time value decay factor |
| CategoryWeightedScorer | final_score | - | category-specific weighted ensemble |

category-specific weights

the CategoryWeightedScorer applies different weight profiles based on market category:

default weights:

momentum:        0.20
mean_reversion:  0.20
volume:          0.15
time_decay:      0.10
order_flow:      0.15
bollinger:       0.10
mtf_momentum:    0.10

politics:

momentum:        0.35  (trend-following works)
mean_reversion:  0.10
order_flow:      0.15
mtf_momentum:    0.15

weather:

mean_reversion:  0.35  (converges to forecasts)
bollinger:       0.15
time_decay:      0.15

sports:

order_flow:      0.30  (sharp money matters)
momentum:        0.20
volume:          0.15

economics/financial:

momentum:        0.25
mean_reversion:  0.20
volume:          0.15
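
as a rough illustration of how a weight profile turns features into final_score, the sketch below computes a weighted sum using the default weights listed above. the function name `final_score`, the constant `DEFAULT_WEIGHTS` and the choice to treat missing features as zero are assumptions; the actual CategoryWeightedScorer may differ.

```rust
use std::collections::HashMap;

/// Default weight profile copied from the list above (sums to 1.0).
pub const DEFAULT_WEIGHTS: [(&str, f64); 7] = [
    ("momentum", 0.20),
    ("mean_reversion", 0.20),
    ("volume", 0.15),
    ("time_decay", 0.10),
    ("order_flow", 0.15),
    ("bollinger", 0.10),
    ("mtf_momentum", 0.10),
];

/// Weighted sum of feature values.
/// Assumption: a feature missing from `features` contributes zero.
pub fn final_score(features: &HashMap<&str, f64>, weights: &[(&str, f64)]) -> f64 {
    weights
        .iter()
        .map(|(name, w)| w * features.get(*name).copied().unwrap_or(0.0))
        .sum()
}
```

a category profile would be passed in place of `DEFAULT_WEIGHTS`; how unlisted features are weighted in the category profiles is not specified here.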

selector

| selector | config |
|---|---|
| TopKSelector | k=5 (max_positions) |

execution logic

position sizing

uses fractional kelly criterion for position sizing:

kelly_fraction = 0.25          // use 25% of kelly optimal
max_position_pct = 0.25        // max 25% of portfolio per trade
min_position_size = 10         // minimum 10 contracts
max_position_size = 100        // maximum 100 contracts

edge to probability mapping:

win_prob = (1 + tanh(edge)) / 2

this smoothly maps scoring edge to estimated win probability.

kelly formula:

kelly = (odds * win_prob - (1 - win_prob)) / odds
position_value = bankroll * min(kelly * kelly_fraction, max_position_pct)
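
a minimal sketch of the sizing formulas above. two things are assumptions and not taken from the codebase: `odds` is computed as the net payout per dollar risked on a contract that pays $1, i.e. `(1 - price) / price`, and a negative kelly value is clamped to zero (no bet).

```rust
/// Map a scoring edge to an estimated win probability: (1 + tanh(edge)) / 2.
pub fn win_prob(edge: f64) -> f64 {
    (1.0 + edge.tanh()) / 2.0
}

/// Assumed definition of `odds`: net profit per $1 risked on a $1-payout contract.
pub fn odds_from_price(price: f64) -> f64 {
    (1.0 - price) / price
}

/// Kelly fraction: (odds * p - (1 - p)) / odds.
pub fn kelly(odds: f64, p: f64) -> f64 {
    (odds * p - (1.0 - p)) / odds
}

/// Dollar value to allocate: bankroll * min(kelly * kelly_fraction, max_position_pct).
/// Assumption: negative kelly (no edge) is clamped to zero.
pub fn position_value(
    bankroll: f64,
    edge: f64,
    price: f64,
    kelly_fraction: f64,
    max_position_pct: f64,
) -> f64 {
    let k = kelly(odds_from_price(price), win_prob(edge)).max(0.0);
    bankroll * (k * kelly_fraction).min(max_position_pct)
}
```

for example, `kelly(1.0, 0.6)` is 0.2, so with kelly_fraction = 0.25 the position would be 5% of bankroll before the contract-count limits are applied.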

side selection

the executor picks the cheaper side based on signal direction:

  • positive score (bullish) + yes_price < 0.5 -> buy YES
  • positive score (bullish) + yes_price >= 0.5 -> buy NO
  • negative score (bearish) + yes_price > 0.5 -> buy NO
  • negative score (bearish) + yes_price <= 0.5 -> buy YES

rationale: buying the cheaper side gives better risk/reward ratio.
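
the four rules above can be written directly as a function. `Side` and `pick_side` are hypothetical names, and treating a score of exactly zero as bearish is an assumption since the rules do not cover it.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Side {
    Yes,
    No,
}

/// Direct transcription of the four side-selection rules.
/// Assumption: a score of exactly 0.0 falls into the bearish branch.
pub fn pick_side(score: f64, yes_price: f64) -> Side {
    if score > 0.0 {
        // bullish: YES if it is the cheaper side, otherwise NO
        if yes_price < 0.5 {
            Side::Yes
        } else {
            Side::No
        }
    } else if yes_price > 0.5 {
        // bearish with expensive YES: buy NO
        Side::No
    } else {
        // bearish with cheap (or 0.50) YES: buy YES
        Side::Yes
    }
}
```

written this way, it is visible that the sign of the score only changes the result when yes_price is exactly 0.5; everywhere else the cheaper side is chosen regardless of direction.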

exit conditions

positions are closed when any of these trigger:

| condition | threshold | description |
|---|---|---|
| take_profit | +20% | lock in gains |
| stop_loss | -15% | limit downside |
| time_stop | 72 hours | prevent stale positions |
| score_reversal | < -0.3 | signal flipped against us |
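
a sketch of how these four conditions might be evaluated on each interval. the check order (take profit, stop loss, time stop, score reversal), the `ExitReason` enum and the `check_exit` function are assumptions; `score` is assumed to be oriented so that negative values mean the signal has turned against the position.

```rust
#[derive(Debug, PartialEq)]
pub enum ExitReason {
    TakeProfit,
    StopLoss,
    TimeStop,
    ScoreReversal,
}

pub struct ExitConfig {
    pub take_profit_pct: f64,          // e.g. 0.20 for +20%
    pub stop_loss_pct: f64,            // e.g. 0.15 for -15%
    pub max_hold_hours: u32,           // e.g. 72
    pub score_reversal_threshold: f64, // e.g. -0.3
}

/// Returns the first exit condition that triggers, or None to keep holding.
pub fn check_exit(
    cfg: &ExitConfig,
    entry_price: f64,
    current_price: f64,
    hours_held: u32,
    score: f64,
) -> Option<ExitReason> {
    let pnl_pct = (current_price - entry_price) / entry_price;
    if pnl_pct >= cfg.take_profit_pct {
        Some(ExitReason::TakeProfit)
    } else if pnl_pct <= -cfg.stop_loss_pct {
        Some(ExitReason::StopLoss)
    } else if hours_held >= cfg.max_hold_hours {
        Some(ExitReason::TimeStop)
    } else if score < cfg.score_reversal_threshold {
        Some(ExitReason::ScoreReversal)
    } else {
        None
    }
}
```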

slippage model

  • 10 bps slippage applied to all fills
  • limit orders rejected if fill price exceeds limit by 2x slippage

data characteristics

current dataset: /mnt/work/kalshi-data/

| file | size | description |
|---|---|---|
| markets.csv | 6.6 GB | market metadata, results, prices |
| trades.csv | 66 MB | individual trade records with taker_side |

trade record schema:

timestamp, ticker, price, volume, taker_side

market record schema:

ticker, title, category, open_time, close_time, result, status,
yes_bid, yes_ask, volume, open_interest

known issues / future work

issues

  1. empty categories - return_by_category shows empty string, need to verify category parsing from market data

  2. no trading on jan 20 - equity curve shows no activity until jan 21 04:00, likely due to insufficient trade history in lookback window

  3. dead code warnings - several unused scorers and filters (CorrelationScorer, MLEnsembleScorer, etc.) - cleanup needed

planned improvements

  • category parsing fix
  • correlation scorer integration (granger causality between related markets)
  • ML model integration (ONNX runtime ready, needs trained models)
  • multi-day backtests with larger date ranges
  • parameter optimization / grid search
  • transaction cost analysis
  • position-level attribution

appendix: scorer formulas

momentum

momentum = price(t) - price(t - lookback_hours)

mean reversion

mean = avg(prices over lookback_hours)
deviation = current_price - mean
mean_reversion = -deviation

bollinger bands

mean = avg(prices)
std = stddev(prices)
upper_band = mean + 2.0 * std
lower_band = mean - 2.0 * std

if price >= upper_band:
    score = -(price - upper_band) / std
elif price <= lower_band:
    score = (lower_band - price) / std
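else:
    // assumed definition: position = (price - lower_band) / (upper_band - lower_band), 0 at the lower band and 1 at the upper band
    score = -0.5 * (position - 0.5)  // weak mean reversion inside bands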
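
a runnable version of the band logic above, as a sketch. it assumes a population standard deviation, returns `None` when the window is too short or flat (std = 0), and defines the in-band `position` as `(price - lower) / (upper - lower)`; none of these details are confirmed by the source.

```rust
/// Bollinger mean-reversion score for `price` given a window of past prices.
/// Returns None if the window has fewer than 2 prices or zero variance.
pub fn bollinger_score(prices: &[f64], price: f64, num_std: f64) -> Option<f64> {
    if prices.len() < 2 {
        return None;
    }
    let n = prices.len() as f64;
    let mean = prices.iter().sum::<f64>() / n;
    // population variance (assumption)
    let var = prices.iter().map(|p| (p - mean).powi(2)).sum::<f64>() / n;
    let std = var.sqrt();
    if std == 0.0 {
        return None;
    }
    let upper = mean + num_std * std;
    let lower = mean - num_std * std;
    let score = if price >= upper {
        // above the upper band: negative score, expect a move back down
        -(price - upper) / std
    } else if price <= lower {
        // below the lower band: positive score, expect a move back up
        (lower - price) / std
    } else {
        // inside the bands: weak reversion toward the middle
        let position = (price - lower) / (upper - lower);
        -0.5 * (position - 0.5)
    };
    Some(score)
}
```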

volume

avg_hourly_volume = total_volume / hours_since_open
recent_hourly_volume = recent_volume / lookback_hours
volume_score = ln(recent_hourly_volume / avg_hourly_volume)

order flow

order_flow = (buy_volume - sell_volume) / (buy_volume + sell_volume)

time decay

hours_remaining = time_to_close
time_decay = 1 - 1 / (hours_remaining / 24 + 1)

ranges from 0 (about to close) to ~1 (distant expiry).
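
the volume, order flow and time decay formulas above translate almost line for line into code. the guards for empty or zero inputs (returning `None` or 0.0) are assumptions added so the functions are total; the source does not say how those cases are handled.

```rust
/// ln(recent hourly volume / average hourly volume).
/// Returns None when either rate is not strictly positive (assumed handling).
pub fn volume_score(
    total_volume: f64,
    hours_since_open: f64,
    recent_volume: f64,
    lookback_hours: f64,
) -> Option<f64> {
    if hours_since_open <= 0.0 || lookback_hours <= 0.0 {
        return None;
    }
    let avg_hourly = total_volume / hours_since_open;
    let recent_hourly = recent_volume / lookback_hours;
    if avg_hourly > 0.0 && recent_hourly > 0.0 {
        Some((recent_hourly / avg_hourly).ln())
    } else {
        None
    }
}

/// (buy - sell) / (buy + sell), in [-1, 1]; 0.0 when there is no volume (assumed).
pub fn order_flow(buy_volume: f64, sell_volume: f64) -> f64 {
    let total = buy_volume + sell_volume;
    if total > 0.0 {
        (buy_volume - sell_volume) / total
    } else {
        0.0
    }
}

/// 1 - 1 / (hours_remaining / 24 + 1): 0 at close, approaching 1 for distant expiry.
pub fn time_decay(hours_remaining: f64) -> f64 {
    1.0 - 1.0 / (hours_remaining / 24.0 + 1.0)
}
```

for instance, a market with 24 hours remaining gets a time_decay of 0.5.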

backtest run #2

date: 2026-01-22 period: 2026-01-21 04:00 to 2026-01-21 06:00 (2 hours) initial capital: $10,000 interval: 1 hour

results summary

| metric | strategy | random baseline | delta |
|---|---|---|---|
| total return | +$502.81 (+5.03%) | $0.00 (0.00%) | +$502.81 |
| sharpe ratio | 68.845 | 0.000 | +68.845 |
| max drawdown | 0.00% | 0.00% | +0.00% |
| win rate | 100.0% | 0.0% | +100.0% |
| total trades | 1 (closed) | 0 | +1 |
| positions | 9 (open) | 0 | +9 |

note: short duration used to validate regime detection logic.

architectural updates

  1. momentum acceleration scorer

    • implemented second-order momentum (acceleration)
    • detects market turning points using fast/slow momentum divergence
    • derived from "momentum turning points" academic research
  2. regime adaptive scorer

    • dynamic weight allocation based on market state
    • bull: favors trend following (momentum: 0.4)
    • bear: favors mean reversion (mean_reversion: 0.4)
    • transition: defensive positioning (time_decay: 0.3, volume: 0.2)
    • replaced static CategoryWeightedScorer
  3. data handling

    • identified data gap before jan 21 03:00
    • adjusted backtest start time to align with available trade data

backtest run #3 (iteration 1)

date: 2026-01-22 period: 2026-01-20 00:00 to 2026-01-22 00:00 (2 days) initial capital: $10,000 interval: 1 hour

results summary

| metric | value |
|---|---|
| total return | +$412.85 (+4.13%) |
| sharpe ratio | 4.579 |
| max drawdown | 0.25% |
| win rate | 83.3% |
| total trades | 6 (closed) |
| positions | 49 (open) |
| avg trade pnl | $8.81 |
| avg hold time | 4.7 hours |

comparison with previous runs

| metric | run #1 (2 days) | run #2 (2 hrs) | run #3 (2 days) | trend |
|---|---|---|---|---|
| total return | +9.94% | +5.03% | +4.13% | |
| sharpe ratio | 5.448 | 68.845* | 4.579 | |
| max drawdown | 1.26% | 0.00% | 0.25% | ↓ better |
| win rate | 58.7% | 100.0% | 83.3% | |

*run #2 sharpe inflated due to very short period

architectural updates

  1. kalman price filter

    • implements recursive kalman filtering for price estimation
    • outputs: filtered_price, innovation (deviation from prediction), uncertainty
    • filters noisy price observations to get better "true price" estimates
    • adapts to changing volatility automatically via adaptive gain
  2. VPIN scorer (volume-synchronized probability of informed trading)

    • based on easley, lopez de prado, and o'hara (2012) research
    • measures flow toxicity using volume-bucketed order imbalance
    • outputs: vpin, flow_toxicity, informed_direction
    • high VPIN indicates presence of informed traders
  3. adaptive confidence scorer

    • replaces RegimeAdaptiveScorer with confidence-weighted approach
    • uses kalman uncertainty, VPIN, and entropy to calculate confidence
    • scales all feature weights by confidence factor
    • dynamic weight profiles based on:
      • high VPIN + informed direction -> follow smart money (order_flow: 0.4)
      • turning point detected -> defensive (time_decay: 0.25)
      • bull regime -> trend following (momentum: 0.35)
      • bear regime -> mean reversion (mean_reversion: 0.35)
      • neutral -> balanced weights
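
for reference, a one-dimensional kalman filter under a random-walk price model looks roughly like the sketch below. `Kalman1D` is a hypothetical name, the initial variance of 1.0 is arbitrary, and this fixed-parameter version does not include the volatility adaptation mentioned above; it only shows where filtered_price, innovation and uncertainty come from.

```rust
/// Minimal 1-D Kalman filter, assuming the true price follows a random walk.
pub struct Kalman1D {
    pub estimate: f64,          // filtered_price
    pub variance: f64,          // uncertainty of the estimate
    pub process_noise: f64,     // variance added per step by the random walk
    pub measurement_noise: f64, // variance of the observation noise
}

impl Kalman1D {
    pub fn new(initial_price: f64, process_noise: f64, measurement_noise: f64) -> Self {
        Kalman1D {
            estimate: initial_price,
            variance: 1.0, // arbitrary starting uncertainty (assumption)
            process_noise,
            measurement_noise,
        }
    }

    /// Incorporate one observed price and return the innovation
    /// (observation minus the prediction made before seeing it).
    pub fn update(&mut self, observed: f64) -> f64 {
        // predict: the estimate is unchanged, uncertainty grows
        let prior_variance = self.variance + self.process_noise;
        let innovation = observed - self.estimate;
        // update: blend prediction and observation by the Kalman gain
        let gain = prior_variance / (prior_variance + self.measurement_noise);
        self.estimate += gain * innovation;
        self.variance = (1.0 - gain) * prior_variance;
        innovation
    }
}
```

a larger measurement_noise makes the filter trust new prices less (smoother output), while a larger process_noise makes it track price changes faster.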

analysis

why return decreased from run #1:

  1. the new AdaptiveConfidenceScorer is more conservative, scaling down weights when confidence is low
  2. fewer closed trades (6 vs 46 in run #1), with 49 positions still open at the end of the run
  3. tighter risk management - max drawdown improved from 1.26% to 0.25%

positive improvements:

  • win rate increased from 58.7% to 83.3%
  • avg trade pnl increased from $4.59 to $8.81
  • max drawdown decreased significantly (1.26% to 0.25%)
  • sharpe ratio fell from 5.448 to 4.579 but remains strongly positive

next iteration considerations:

  1. the confidence scaling may be too aggressive - consider relaxing the uncertainty multiplier
  2. need to tune the VPIN thresholds for detecting informed trading
  3. kalman filter process_noise and measurement_noise parameters could be optimized
  4. should add cross-validation with different market regimes

scorer pipeline (run #3)

MomentumScorer (6h) -> momentum
MultiTimeframeMomentumScorer (1h,4h,12h,24h) -> mtf_momentum, mtf_divergence, mtf_alignment
MeanReversionScorer (24h) -> mean_reversion
BollingerMeanReversionScorer (24h, 2.0 std) -> bollinger_reversion, bollinger_position
VolumeScorer (6h) -> volume
OrderFlowScorer -> order_flow
TimeDecayScorer -> time_decay
VolatilityScorer (24h) -> volatility
EntropyScorer (24h) -> entropy
RegimeDetector (24h) -> regime
MomentumAccelerationScorer (3h fast, 12h slow) -> momentum_acceleration, momentum_regime, turning_point
CorrelationScorer (24h, lag 6) -> correlation
KalmanPriceFilter (24h) -> kalman_price, kalman_innovation, kalman_uncertainty
VPINScorer (bucket 50, 20 buckets) -> vpin, flow_toxicity, informed_direction
AdaptiveConfidenceScorer -> final_score, confidence
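
to make the VPINScorer parameters concrete (bucket 50, 20 buckets), here is a sketch of a volume-bucketed imbalance calculation. it assumes trades arrive as `(contracts, is_buy)` pairs with the direction taken from taker_side, and that vpin is the mean absolute imbalance over the most recent buckets; the repository's scorer may classify volume differently.

```rust
/// Volume-bucketed order imbalance (VPIN-style).
/// `trades` are (contract count, taker bought?) in time order.
/// Returns None until at least `n_buckets` full buckets are available.
pub fn vpin(trades: &[(u64, bool)], bucket_size: u64, n_buckets: usize) -> Option<f64> {
    if bucket_size == 0 || n_buckets == 0 {
        return None;
    }
    let mut imbalances: Vec<f64> = Vec::new();
    let (mut buy, mut sell) = (0u64, 0u64);
    for &(volume, is_buy) in trades {
        let mut remaining = volume;
        // a single trade may spill over into the next bucket(s)
        while remaining > 0 {
            let room = bucket_size - (buy + sell);
            let take = remaining.min(room);
            if is_buy {
                buy += take;
            } else {
                sell += take;
            }
            remaining -= take;
            if buy + sell == bucket_size {
                imbalances.push(buy.abs_diff(sell) as f64 / bucket_size as f64);
                buy = 0;
                sell = 0;
            }
        }
    }
    if imbalances.len() < n_buckets {
        return None;
    }
    let recent = &imbalances[imbalances.len() - n_buckets..];
    Some(recent.iter().sum::<f64>() / n_buckets as f64)
}
```

values near 1.0 mean recent buckets were almost entirely one-sided; values near 0.0 mean buys and sells were balanced.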

thoughts for next iteration

the lower return is concerning but the improved win rate and reduced drawdown suggest the model is making better quality trades, just fewer of them. the confidence mechanism might be too conservative.

potential improvements:

  1. reduce uncertainty_factor multiplier from 5.0 to 2.0-3.0
  2. add a minimum confidence threshold before suppressing trades entirely
  3. explore bayesian updating of the kalman filter parameters based on prediction accuracy
  4. add cross-market correlation features (currently CorrelationScorer only does autocorrelation)

backtest run #4 (iteration 2)

date: 2026-01-22 period: 2026-01-20 00:00 to 2026-01-22 00:00 (2 days) initial capital: $10,000 interval: 1 hour

results summary

| metric | original config | with kalman/VPIN |
|---|---|---|
| total return | +$403.69 (4.04%) | +$356.82 (3.57%) |
| sharpe ratio | 3.540 | 4.052 |
| max drawdown | 1.50% | 0.85% |
| win rate | 40.9% | 60.0% |
| total trades | 22 | 5 |
| avg trade pnl | -$7.57 | $9.17 |

iteration 2 analysis - what went wrong

root cause identified: the original run #1 used CategoryWeightedScorer with a much simpler pipeline:

  • MomentumScorer
  • MultiTimeframeMomentumScorer
  • MeanReversionScorer
  • BollingerMeanReversionScorer
  • VolumeScorer
  • OrderFlowScorer
  • TimeDecayScorer
  • CategoryWeightedScorer

subsequent iterations added:

  • VolatilityScorer
  • EntropyScorer
  • RegimeDetector
  • MomentumAccelerationScorer
  • CorrelationScorer
  • KalmanPriceFilter
  • VPINScorer
  • AdaptiveConfidenceScorer / RegimeAdaptiveScorer

key findings:

  1. AdaptiveConfidenceScorer caused massive trade reduction

    • original confidence formula: 1/(1 + uncertainty*5) with 0.1 floor
    • at uncertainty=0.5, confidence=0.29, scaling ALL weights down by about 71%
    • this suppressed nearly all trading signals
    • trade count dropped from 46 (run #1) to 5-6 (iter 1)
  2. adding more scorers != better predictions

    • the additional scorers (RegimeDetector, Entropy, Correlation) added noise
    • each scorer contributes features that may conflict or dilute strong signals
    • "forecast combination puzzle" - simple equal weights often beat sophisticated methods
  3. kalman filter and VPIN didn't help

    • removing them had no measurable impact on returns
    • they may be useful features but weren't being utilized effectively
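
the confidence formula from the first finding is small enough to check by hand. the sketch below parameterises the multiplier and floor so both the original (5.0, 0.1) and the iteration 2 values (2.0, 0.4) can be compared; the function name `confidence` is hypothetical.

```rust
/// confidence = max(1 / (1 + uncertainty * multiplier), floor)
pub fn confidence(uncertainty: f64, multiplier: f64, floor: f64) -> f64 {
    (1.0 / (1.0 + uncertainty * multiplier)).max(floor)
}

/// Weight after confidence scaling, as in the original weight-scaling approach.
pub fn scaled_weight(weight: f64, conf: f64) -> f64 {
    weight * conf
}
```

at uncertainty = 0.5 the original settings give roughly 0.29, so a 0.20 momentum weight shrinks to about 0.057; with the iteration 2 settings the same uncertainty gives 0.5.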

attempted fixes in iteration 2:

  • reduced uncertainty multiplier from 5.0 to 2.0
  • raised confidence floor from 0.1 to 0.4
  • added signal_strength bonus for strong raw signals
  • lowered VPIN thresholds from 0.6 to 0.4
  • changed confidence to post-multiplier instead of weight-scaling

none of these fixes restored original performance

lessons learned

  1. simplicity wins - the original 8-scorer pipeline with CategoryWeightedScorer worked best
  2. confidence scaling is dangerous - multiplying weights by confidence suppresses signals too aggressively
  3. test incrementally - should have added one scorer at a time and measured impact
  4. beware over-engineering - the research on kalman filters and VPIN is academically interesting but added complexity without improving results
  5. preserve baseline - should have kept the original working config in a separate branch

next iteration direction

rather than adding more complexity, focus on:

  1. restoring original simple pipeline
  2. tuning existing weights based on category performance
  3. improving exit logic rather than entry signals
  4. maybe add ONE new feature at a time with A/B testing

backtest run #5 (iteration 3)

date: 2026-01-22 period: 2026-01-20 00:00 to 2026-01-22 00:00 (2 days) initial capital: $10,000 interval: 1 hour

results summary

| metric | strategy | random baseline | delta |
|---|---|---|---|
| total return | +$936.61 (+9.37%) | -$8.00 (-0.08%) | +$944.61 |
| sharpe ratio | 6.491 | -2.291 | +8.782 |
| max drawdown | 0.33% | 0.08% | +0.25% |
| win rate | 100.0% | 0.0% | +100.0% |
| total trades | 9 | 0 | +9 |
| positions (open) | 46 | 0 | +46 |
| avg trade pnl | $25.32 | $0.00 | +$25.32 |

comparison with previous runs

| metric | run #4 (iter 2) | run #5 (iter 3) | change |
|---|---|---|---|
| total return | +4.04% | +9.37% | +132% |
| sharpe ratio | 3.540 | 6.491 | +83% |
| max drawdown | 1.50% | 0.33% | -78% |
| win rate | 40.9% | 100.0% | +144% |
| total trades | 22 | 9 | -59% |
| avg trade pnl | -$7.57 | +$25.32 | +$32.89 |

key discovery: stop losses hurt prediction market returns

root cause analysis:

during iteration 3, we discovered that the original trades.csv data was overwritten after run #1, making it impossible to reproduce those results. this led us to investigate why the "restored" pipeline (iter 2) performed poorly.

analysis of trade logs revealed:

  1. stop losses triggered at -67% to -97%, not at the configured -15%
  2. exits only checked at hourly intervals - prices gapped through stops
  3. prediction market prices can move discontinuously (binary outcomes, news)

example failed stop losses from run #4:

  • KXSPACEXCOUNT: stop triggered at -67.4% (configured -15%)
  • KXUCLBTTS: stop triggered at -97.5% (configured -15%)
  • KXNCAAWBGAME: stop triggered at -95.0% (configured -15%)
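
the gap effect is easy to reproduce: if the stop is only evaluated on hourly samples, the realized loss is whatever the price happens to be at the first check past the threshold. the function below is an illustrative sketch with made-up prices, not the backtester's exit code.

```rust
/// Return realized at the first hourly check where the loss reaches the stop,
/// or None if the stop never triggers. `stop_loss_pct` is positive (0.15 = -15%).
pub fn stop_exit_return(entry: f64, hourly_prices: &[f64], stop_loss_pct: f64) -> Option<f64> {
    hourly_prices
        .iter()
        .map(|p| (p - entry) / entry)
        .find(|r| *r <= -stop_loss_pct)
}
```

with an entry at $0.40 and hourly prices of $0.39 then $0.01, the -15% stop first triggers at the second check and realizes about -97.5%, because the price never traded near $0.34 at a check time.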

exit strategy optimization

we tested 5 exit configurations:

| config | return | sharpe | drawdown | win rate |
|---|---|---|---|---|
| baseline (20% TP, 15% SL) | +4.04% | 3.540 | 1.50% | 40.9% |
| 100% TP, no SL | +9.44% | 6.458 | 0.55% | 100% |
| resolution only | +7.16% | 4.388 | 2.12% | n/a |
| 50% TP, no SL | +9.37% | 6.491 | 0.33% | 100% |
| 75% TP, no SL | +9.28% | 6.381 | 0.45% | 100% |

winner: 50% take profit, no stop loss

  • highest sharpe ratio (6.491)
  • lowest max drawdown (0.33%)
  • good capital recycling (9 closed trades vs 4)

implementation changes

new default exit config (src/types.rs):

take_profit_pct: 0.50,   // exit at +50% (was 0.20)
stop_loss_pct: 0.99,     // disabled (was 0.15)
max_hold_hours: 48,      // shorter (was 72)
score_reversal_threshold: -0.5,   // stricter (was -0.3)

rationale:

  1. stop losses don't work for prediction markets

    • prices gap through hourly checks
    • binary outcomes mean temporary drops don't invalidate bets
    • position sizing limits max loss instead
  2. 50% take profit balances two goals:

    • locks in gains before potential reversal
    • lets winners run further than 20% (which cut gains short)
  3. shorter hold time (48h) for 2-day backtests

    • ensures positions resolve or exit within test period

lessons learned

  1. prediction markets ≠ traditional trading

    • traditional stop losses assume continuous price paths
    • binary outcomes can cause discontinuous jumps
    • holding to resolution is often optimal
  2. exit strategy matters as much as entry

    • iteration 3 used the SAME entry signals as iteration 2
    • only changed exit parameters
    • return increased 132% (4.04% → 9.37%)
  3. test before theorizing

    • academic research on stop losses assumes continuous markets
    • empirical testing revealed the opposite for prediction markets

thoughts for next iteration

the exit strategy optimization was a major win. next iteration should consider:

  1. position sizing optimization

    • current kelly fraction is 0.25, may be too conservative
    • with 100% win rate, could increase bet sizing
  2. entry signal filtering

    • 46 positions still open at end of backtest
    • could add filters to reduce position count for capital efficiency
  3. category-specific exit tuning

    • sports markets may need different exits than politics
    • crypto markets have different volatility profiles
  4. longer backtest period

    • current data covers only 2 days
    • need to test across different market conditions

backtest run #6 (iteration 4)

date: 2026-01-22 period: 2026-01-20 00:00 to 2026-01-22 00:00 (2 days) initial capital: $10,000 interval: 1 hour

results summary

| metric | strategy | random baseline | delta |
|---|---|---|---|
| total return | +$1,898.45 (+18.98%) | $0.00 (0.00%) | +$1,898.45 |
| sharpe ratio | 2.814 | 0.000 | +2.814 |
| max drawdown | 0.79% | 0.00% | +0.79% |
| win rate | 100.0% | 0.0% | +100.0% |
| total trades | 10 | 0 | +10 |
| positions (open) | 100 | 0 | +100 |

comparison with previous runs

| metric | iter 3 | iter 4 | change |
|---|---|---|---|
| total return | +9.37% | +18.98% | +103% |
| sharpe ratio | 6.491 | 2.814 | -57% |
| max drawdown | 0.33% | 0.79% | +139% |
| win rate | 100.0% | 100.0% | 0% |
| total trades | 9 | 10 | +11% |
| positions | 46 | 100 | +117% |

key discovery: diversification beats concentration in prediction markets

surprising finding: concentration hurts returns in prediction markets!

this contradicts conventional wisdom ("best ideas outperform") but makes sense for binary outcomes:

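| max_positions | return | sharpe | win rate | trades |
|---|---|---|---|---|
| 5 | 0.24% | 0.986 | 100% | 1 |
| 10 | 0.47% | 1.902 | 100% | 2 |
| 30 | 3.12% | 3.109 | 100% | 3 |
| 50 | 7.97% | 2.593 | 100% | 5 |
| 100 | 18.98% | 2.814 | 100% | 10 |
| 200 | 38.88% | 2.995 | 97.5% | 40 |
| 500 | 96.10% | 3.295 | 95.4% | 87 |
| 1000 | 105.55% | 3.495 | 95.7% | 94 |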

why diversification wins for prediction markets:

  1. binary payouts - each position has positive expected value

    • more positions = more chances to capture binary wins
    • unlike stocks, losers go to 0 quickly (can't average down)
  2. model has positive edge

    • if scoring model has +EV on average, more bets = more profit
    • law of large numbers favors diversification
  3. capital utilization

    • concentrated portfolios leave cash idle
    • diversified approach deploys all capital
    • with 1000 positions, cash went to $0.00
  4. different from stock picking

    • "best ideas" research assumes winners can compound
    • prediction markets resolve quickly (days/weeks)
    • can't hold winners long-term

bug fix: max_positions enforcement

discovered that max_positions wasn't being enforced - positions accumulated each hour without limit. added check in backtest loop:

for signal in signals {
    // enforce max_positions limit
    if context.portfolio.positions.len() >= self.config.max_positions {
        break;
    }
    // ...
}

implementation changes

new defaults:

// src/main.rs CLI defaults
max_positions: 100      // was 5
kelly_fraction: 0.40    // was 0.25
max_position_pct: 0.30  // was 0.25

// src/execution.rs PositionSizingConfig
kelly_fraction: 0.40
max_position_pct: 0.30

note on sharpe ratio decrease

sharpe dropped from 6.491 (iter 3) to 2.814 (iter 4) despite 2x higher returns because:

  • more positions = more variance in equity curve
  • sharpe measures risk-adjusted returns
  • still a strong positive sharpe (>1.0 is generally good)

the trade-off is worth it: double the returns in exchange for a lower risk-adjusted ratio.

thoughts for next iteration

iteration 4 was a paradigm shift. next iteration should consider:

  1. push diversification further

    • 1000 positions gave 105% return (2x capital!)
    • limited by cash, not max_positions
    • could explore leverage or smaller position sizes
  2. validate with longer backtest

    • 2-day window is very short
    • need to test if diversification holds across market regimes
  3. position sizing optimization

    • current kelly approach may not be optimal
    • with many positions, equal weighting might work better
  4. transaction costs

    • many positions = many transactions
    • need to model realistic slippage and fees
  5. examine edge by category

    • sports vs politics vs crypto
    • may find some categories have stronger edge