Backtest trading strategies on historical data and interpret performance metrics. Provides run_backtest (crypto strategies) and run_prediction_market_backtest (Polymarket strategies). Fast execution (20-60s), minimal cost ($0.001). Returns Sharpe ratio, max drawdown, win rate, profit factor, and trade statistics. Use this skill after building or improving strategies to validate performance before deploying. NEVER deploy without thorough backtesting (6+ months recommended).
npx skill4agent add robonet-tech/skills test-trading-strategies

Use MCPSearch to select: mcp__workbench__run_backtest
Use MCPSearch to select: mcp__workbench__get_latest_backtest_results

run_backtest(
    strategy_name="MyStrategy",
    start_date="2024-01-01",
    end_date="2024-12-31",
    symbol="BTC-USDT",
    timeframe="1h"
)

run_backtest parameters: strategy_name, start_date, end_date, symbol, timeframe, config (fee, slippage, leverage)
run_prediction_market_backtest parameters: strategy_name, start_date, end_date, condition_id, asset, interval, initial_balance, timeframe
get_latest_backtest_results parameters: strategy_name, limit, include_equity_curve, equity_curve_max_points

Sharpe ratio formula: (Mean Return - Risk-Free Rate) / Standard Deviation of Returns
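The Sharpe formula can be sketched in a few lines of Python. This is a minimal illustration, not the calculation run_backtest performs internally; the function name `sharpe_ratio`, the per-period return list, and the `periods_per_year` annualization factor are assumptions introduced here.

```python
import math
import statistics


def sharpe_ratio(returns, risk_free_rate=0.0, periods_per_year=1):
    """Sharpe ratio of per-period returns (hypothetical helper, not the tool's API).

    returns: per-period returns as fractions (0.01 == 1%).
    risk_free_rate: risk-free return per period, in the same units.
    periods_per_year: e.g. 365 for daily crypto returns; 1 leaves the
        ratio un-annualized.
    """
    if len(returns) < 2:
        raise ValueError("need at least two returns to estimate volatility")
    excess = [r - risk_free_rate for r in returns]
    volatility = statistics.stdev(excess)  # sample standard deviation
    if volatility == 0:
        raise ValueError("zero volatility; Sharpe ratio is undefined")
    return statistics.mean(excess) / volatility * math.sqrt(periods_per_year)


# Two made-up series with the same average return (1.5% per period).
steady = [0.01, 0.02, 0.01, 0.02]
choppy = [0.10, -0.07, 0.09, -0.06]
print(sharpe_ratio(steady) > sharpe_ratio(choppy))  # True
```

Both series earn the same on average, but the steadier one scores far higher because the denominator penalizes volatility.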
Interpretation:
>2.0 → Excellent (very rare for algo strategies)
1.0-2.0 → Good (achievable with solid strategy)
0.5-1.0 → Acceptable (worth testing further)
<0.5 → Poor (likely not profitable after costs)
Why it matters:
- Accounts for volatility (high return with high volatility = lower Sharpe)
- Industry standard for comparing strategies
- More useful than total return alone

Example: Strategy grows from $10k → $15k → $12k
Drawdown: ($15k - $12k) / $15k = 20%
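The same calculation generalizes to a full equity curve by tracking the running peak. The sketch below is illustrative only; `max_drawdown` and `recovery_return` are hypothetical helper names, and the equity curve is assumed to be a list of positive account values.

```python
def max_drawdown(equity_curve):
    """Largest peak-to-trough decline, as a fraction of the peak."""
    if not equity_curve:
        raise ValueError("equity curve is empty")
    peak = equity_curve[0]
    worst = 0.0
    for value in equity_curve:
        peak = max(peak, value)                     # highest value seen so far
        worst = max(worst, (peak - value) / peak)   # decline from that peak
    return worst


def recovery_return(drawdown):
    """Return needed to climb back to the previous peak after a drawdown."""
    return drawdown / (1.0 - drawdown)


print(max_drawdown([10_000, 15_000, 12_000]))  # 0.2
print(recovery_return(0.5))                    # 1.0
```

The recovery helper shows why deep drawdowns are hard to recover from: the required gain grows faster than the loss itself.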
Interpretation:
<10% → Conservative (lower returns, safer)
10-20% → Moderate (balanced risk/reward)
20-40% → Aggressive (higher returns, higher risk)
>40% → Very risky (difficult to recover from)
Why it matters:
- Measures worst-case scenario
- Predicts emotional difficulty of holding strategy
- 50% drawdown requires 100% return to recover

Win rate formula: (Winning Trades / Total Trades) × 100%
Interpretation:
45-65% → Realistic for most strategies
>70% → Suspicious (possible overfitting or unrealistic fills)
<40% → Needs improvement (unless very high profit factor)
Why it matters:
- High win rate doesn't guarantee profitability
- Can have 40% win rate but profitable (if winners > losers)
- Very high win rate (>75%) often indicates overfitting
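Why a low win rate can still be profitable becomes clearer when win rate is combined with the size of winners relative to losers. The following is a minimal sketch of expected profit per trade, measured in units of the amount risked; `expectancy` is a hypothetical helper and assumes every loser costs exactly one unit of risk.

```python
def expectancy(win_rate, reward_to_risk):
    """Average profit per trade in units of risk (R).

    win_rate: fraction of winning trades (0.40 == 40%).
    reward_to_risk: average winner size divided by average loser size.
    """
    if not 0.0 <= win_rate <= 1.0:
        raise ValueError("win_rate must be between 0 and 1")
    return win_rate * reward_to_risk - (1.0 - win_rate)


low_win_rate = expectancy(0.40, 3.0)   # 40% winners, each 3x the size of a loser
high_win_rate = expectancy(0.60, 1.0)  # 60% winners, same size as losers
print(round(low_win_rate, 2), round(high_win_rate, 2))  # 0.6 0.2
```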
Common misconception: Higher is always better
Reality: A 40% win rate with 3:1 reward:risk beats a 60% win rate with 1:1

Profit factor formula: Gross Profit (sum of all winning trades) / Gross Loss (sum of all losing trades)
Interpretation:
>2.0 → Excellent
1.5-2.0 → Good
1.2-1.5 → Acceptable
<1.2 → Marginal (risky to deploy)
<1.0 → Unprofitable (losses exceed profits)
Why it matters:
- Simple profitability measure
- <1.5 means small edge, vulnerable to slippage/fees
- Combines win rate and win size into single metric
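As a rough illustration, profit factor and win rate can both be computed from a list of per-trade profit and loss values. The helper names and the sample trades below are made up for this sketch and are not part of the backtest tool's output format.

```python
def profit_factor(trade_pnls):
    """Gross profit divided by gross loss for per-trade P&L values."""
    gross_profit = sum(p for p in trade_pnls if p > 0)
    gross_loss = -sum(p for p in trade_pnls if p < 0)
    if gross_loss == 0:
        # No losing trades: the ratio is unbounded (or 0 with no winners).
        return float("inf") if gross_profit > 0 else 0.0
    return gross_profit / gross_loss


def win_rate(trade_pnls):
    """Share of trades that closed with a profit, as a fraction."""
    if not trade_pnls:
        raise ValueError("no trades")
    return sum(1 for p in trade_pnls if p > 0) / len(trade_pnls)


pnls = [200, 200, 200, -150, -150]  # made-up trades: 3 winners, 2 losers
print(profit_factor(pnls))  # 2.0
print(win_rate(pnls))       # 0.6
```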
Example:
10 trades: 6 winners ($100 each), 4 losers ($50 each)
Gross profit: $600, Gross loss: $200
Profit factor: $600 / $200 = 3.0 (excellent)

Total Return: 50% over 6 months
Annual Return: ~100% (simple extrapolation to 12 months; ~125% if compounded)
Why both matter:
- Total return: Actual profit over test period
- Annual return: Standardized for comparison across time periods
- Longer test periods are more reliable (6-12 months minimum)

Quick test: 1-3 months
- Limited validation
- Use for initial screening only
- High risk of luck/overfitting
Standard test: 6-12 months (RECOMMENDED MINIMUM)
- Captures multiple market regimes
- Sufficient trades for statistical significance
- Industry standard for strategy validation
Robust test: 12-24 months
- Ideal for high-confidence validation
- Includes bull, bear, and ranging markets
- Best for strategies before live deployment

1. Train period: 2024-01-01 to 2024-08-31
run_backtest(..., start_date="2024-01-01", end_date="2024-08-31")
→ Sharpe: 1.5
2. Validation period: 2024-09-01 to 2024-12-31
run_backtest(..., start_date="2024-09-01", end_date="2024-12-31")
→ Sharpe: 1.3
3. Compare:
Performance similar → Robust strategy ✓
Performance degraded significantly → Overfit to train period ✗

Test strategy across different market conditions:
1. Trending up (bull market): 2023-10 to 2024-03
→ Sharpe: 1.8
2. Trending down (bear market): 2024-04 to 2024-07
→ Sharpe: 0.9
3. Ranging (sideways): 2024-08 to 2024-12
→ Sharpe: 1.1
Analysis:
- Works well in all regimes → Robust ✓
- Works only in a specific regime → Acceptable if intended (e.g., trend-following does well in trends)
- Fails in all regimes → Fundamentally broken ✗

Win rate: 82%
Problem: Markets are noisy; >70% suggests strategy memorized past data
Solution: Test on out-of-sample data; expect performance degradation

Total trades: 8 over 6 months
Problem: Not enough data for statistical significance; could be luck
Solution: Test longer period or adjust strategy to generate more trades

Train period (Jan-Aug): Sharpe 2.5
Test period (Sep-Dec): Sharpe 0.3
Problem: Overfitted to training data
Solution: Simplify strategy, reduce parameters, test on more data

6-month test: 50% return
- Month 1-5: -5% return
- Month 6: 55% return (one lucky trade)
Problem: Performance driven by single event, not consistent edge
Solution: Analyze equity curve; look for consistent growth, not spikes

Strategy uses 12 indicators, 20+ parameters
Sharpe ratio: 1.3 (only modest improvement)
Problem: Added complexity should buy a clear performance gain over a simpler version; here it does not
Solution: Simplify; complexity without performance = overfitting

Backtest: Win rate 75%, Sharpe 2.3
Live: Win rate 45%, Sharpe 0.6
Problem: Backtest didn't account for slippage, fees, execution delays
Solution: Use realistic fees (0.05-0.1%), slippage (0.05-0.1%), and test on higher timeframes

config = {
    "fee": 0.0005,       # 0.05% per trade (Hyperliquid taker fee)
    "slippage": 0.0005,  # 0.05% slippage (liquid markets)
    "leverage": 1        # Start with 1x (no leverage)
}

run_backtest(
    ...,
    config=config
)

Without fees/slippage:
- Backtest: 50% return, Sharpe 2.0
- Reality: Fees and slippage cut the return by about 10 percentage points → 40% return, Sharpe 1.5
With realistic fees/slippage:
- Backtest: 40% return, Sharpe 1.5
- Reality: Matches expectation → 38-42% return

# Test without leverage first
run_backtest(..., config={"leverage": 1})
→ Sharpe: 1.5, Drawdown: 12%
# Then test with leverage (if deploying with leverage)
run_backtest(..., config={"leverage": 2})
→ Sharpe: 1.4, Drawdown: 24% (doubled)
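To see why drawdown scales with leverage, the sketch below applies a leverage multiplier to a made-up series of per-period returns and measures the resulting drawdown. It is a simplified model under stated assumptions: it ignores funding costs, fees, and liquidation, and the helper names and return values are hypothetical.

```python
def equity_curve(returns, leverage=1.0, start=1.0):
    """Compound per-period returns, each scaled by the leverage multiplier.

    Simplification: no funding costs, fees, or liquidation are modeled.
    """
    curve = [start]
    for r in returns:
        curve.append(curve[-1] * (1.0 + leverage * r))
    return curve


def max_drawdown(curve):
    """Largest peak-to-trough decline, as a fraction of the peak."""
    peak = curve[0]
    worst = 0.0
    for value in curve:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


returns = [0.04, -0.06, -0.03, 0.05, 0.02]  # made-up per-period returns
for lev in (1, 2):
    dd = max_drawdown(equity_curve(returns, lev))
    print(f"{lev}x leverage: max drawdown {dd:.1%}")
```

In this toy series the 2x drawdown is close to, but not exactly, twice the 1x figure, because losses compound on a leveraged base.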
Risk assessment:
- Leverage amplifies returns AND drawdowns
- 2x leverage doesn't mean 2x Sharpe (returns and volatility scale together; funding costs and liquidation risk can lower it)
- Start deployment at 1x, increase cautiously

1. Backtest all versions on SAME date range:
run_backtest(strategy_name="Strategy_v1", start_date="2024-01-01", end_date="2024-12-31", ...)
run_backtest(strategy_name="Strategy_v2", start_date="2024-01-01", end_date="2024-12-31", ...)
run_backtest(strategy_name="Strategy_v3", start_date="2024-01-01", end_date="2024-12-31", ...)
2. Compare all metrics (not just one):
| Version | Sharpe | Drawdown | Win Rate | Profit Factor |
|---------|--------|----------|----------|---------------|
| v1 | 1.2 | 15% | 50% | 1.6 |
| v2 | 1.5 | 12% | 52% | 1.8 |
| v3 | 1.8 | 25% | 48% | 2.2 |
3. Analyze trade-offs:
v1: Baseline (acceptable)
v2: Better across all metrics ✓ (clear winner)
v3: Higher Sharpe but excessive drawdown ✗ (too risky)
4. Decision:
Deploy v2 (balanced improvement without excessive risk)

1. Before running backtest:
get_latest_backtest_results(strategy_name="MyStrategy")
2. Review results:
- If recent backtest exists with same parameters → Use cached result
- If parameters differ (date range, symbol, timeframe) → Run new backtest
3. Saves time and reduces clutter:
- Backtests are fast (20-40s) but avoiding duplicates is cleaner
- Easier to find specific backtest results later

1. Check data availability (use browse-robonet-data):
get_data_availability(symbols=["BTC-USDT"], only_with_data=true)
→ Verify 6+ months of history available
2. Run initial backtest (6+ months):
run_backtest(
    strategy_name="NewStrategy",
    start_date="2024-06-01",
    end_date="2024-12-31",
    symbol="BTC-USDT",
    timeframe="1h",
    config={"fee": 0.0005, "slippage": 0.0005, "leverage": 1}
)
3. Evaluate results:
Sharpe: 1.3 ✓ (good)
Drawdown: 14% ✓ (moderate)
Win rate: 51% ✓ (realistic)
Profit factor: 1.7 ✓ (profitable)
Total trades: 87 ✓ (sufficient)
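The evaluation in step 3 can be turned into a repeatable check. The sketch below is one possible way to do it: `review_backtest` is a hypothetical helper, its thresholds mirror the interpretation ranges given earlier in this guide, and the `min_trades=30` cutoff is an assumption, since the guide only says that 8 trades is too few and 87 is sufficient.

```python
def review_backtest(sharpe, max_drawdown, win_rate, profit_factor,
                    total_trades, min_trades=30):
    """Collect warnings for one backtest result (hypothetical helper).

    Ratios are fractions (0.14 == 14%). `min_trades` is an assumed cutoff.
    """
    warnings = []
    if sharpe < 0.5:
        warnings.append("Sharpe below 0.5: likely not profitable after costs")
    if max_drawdown > 0.40:
        warnings.append("Drawdown above 40%: very risky")
    if win_rate > 0.70:
        warnings.append("Win rate above 70%: possible overfitting")
    if profit_factor < 1.2:
        warnings.append("Profit factor below 1.2: marginal edge")
    if total_trades < min_trades:
        warnings.append("Too few trades for statistical significance")
    return warnings


# Values from the evaluation step above: no warnings.
print(review_backtest(1.3, 0.14, 0.51, 1.7, 87))  # []
# An 82% win rate on only 8 trades trips two checks.
print(len(review_backtest(2.3, 0.10, 0.82, 2.5, 8)))  # 2
```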
4. Decision:
→ Strong initial results
→ Proceed to multi-period validation (Workflow 2)

1. Test Period 1 (Train):
run_backtest(..., start_date="2024-01-01", end_date="2024-06-30")
→ Sharpe: 1.5, Drawdown: 12%
2. Test Period 2 (Validation):
run_backtest(..., start_date="2024-07-01", end_date="2024-12-31")
→ Sharpe: 1.3, Drawdown: 15%
3. Compare:
Period 2 slightly worse but consistent ✓
Sharpe drop: 13% (acceptable variation)
Drawdown increase: 3 percentage points (acceptable)
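The comparison in step 3 amounts to measuring the relative drop in Sharpe between periods. A minimal sketch follows; the helper names are hypothetical, and the 30% cutoff for "significant" degradation is an assumption, since this guide does not state a specific limit.

```python
def sharpe_degradation(train_sharpe, validation_sharpe):
    """Relative drop in Sharpe from train to validation (positive = worse)."""
    if train_sharpe <= 0:
        raise ValueError("train Sharpe must be positive to compare")
    return (train_sharpe - validation_sharpe) / train_sharpe


def looks_overfit(train_sharpe, validation_sharpe, max_drop=0.30):
    """Flag a strategy whose Sharpe falls by more than `max_drop` (assumed)."""
    return sharpe_degradation(train_sharpe, validation_sharpe) > max_drop


print(f"{sharpe_degradation(1.5, 1.3):.0%}", looks_overfit(1.5, 1.3))
print(f"{sharpe_degradation(2.5, 0.3):.0%}", looks_overfit(2.5, 0.3))
```

The first call uses the figures from this workflow; the second uses the Sharpe 2.5 to 0.3 collapse described in the overfitting warning signs.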
4. Test Period 3 (Recent):
run_backtest(..., start_date="2024-10-01", end_date="2024-12-31")
→ Sharpe: 1.4, Drawdown: 11%
5. Analysis:
Consistent performance across all periods ✓
No significant degradation ✓
Strategy is robust ✓
6. Decision:
→ Ready for deployment consideration
→ Review pre-deployment checklist

1. Baseline (before improvement):
run_backtest(
    strategy_name="Strategy_original",
    start_date="2024-01-01",
    end_date="2024-12-31",
    ...
)
→ Sharpe: 1.0, Drawdown: 18%, Win rate: 48%
2. Improve strategy (use improve-trading-strategies):
refine_strategy(strategy_name="Strategy_original", changes="Add trailing stop", mode="new")
3. Test improvement:
run_backtest(
    strategy_name="Strategy_original_refined",
    start_date="2024-01-01",  # SAME date range!
    end_date="2024-12-31",
    ...
)
→ Sharpe: 1.3, Drawdown: 14%, Win rate: 52%
4. Compare (apples-to-apples on same data):
Sharpe: +0.3 (+30%) ✓
Drawdown: -4 pts (-22%) ✓
Win rate: +4 pts (+8%) ✓
→ Clear improvement across all metrics
5. Validate on different period (avoid overfitting to test data):
run_backtest(
    strategy_name="Strategy_original_refined",
    start_date="2023-07-01",  # Different period
    end_date="2023-12-31",
    ...
)
→ Sharpe: 1.2 (still better than original's 1.0)
→ Improvement is real, not overfitted
6. Decision:
→ Keep improved version
→ Consider further optimization or deployment

1. Baseline (default parameters):
Strategy uses RSI(14) threshold of 30
run_backtest(...) → Sharpe: 1.3
2. Test parameter variations:
Create variants: RSI threshold 25, 30, 35
run_backtest(strategy_name="Strategy_RSI25", ...) → Sharpe: 1.1
run_backtest(strategy_name="Strategy_RSI30", ...) → Sharpe: 1.3
run_backtest(strategy_name="Strategy_RSI35", ...) → Sharpe: 1.2
3. Analysis:
Performance varies only slightly (1.1 to 1.3)
→ Strategy is robust (not overly sensitive to exact parameters) ✓
vs. High sensitivity:
RSI25: Sharpe 2.5
RSI30: Sharpe 1.3
RSI35: Sharpe 0.4
→ Overfitted to specific parameter value ✗
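The two cases in step 3 differ in how far Sharpe moves when the parameter changes. One rough way to quantify that is the spread between the best and worst variant, sketched below. The helper names are hypothetical, and the `max_spread=0.5` cutoff is an assumption chosen only to separate the two examples above.

```python
def sharpe_spread(sharpe_by_variant):
    """Difference between the best and worst Sharpe across variants."""
    values = list(sharpe_by_variant.values())
    if not values:
        raise ValueError("no variants to compare")
    return max(values) - min(values)


def is_parameter_sensitive(sharpe_by_variant, max_spread=0.5):
    """True when results swing more than `max_spread` (assumed cutoff)."""
    return sharpe_spread(sharpe_by_variant) > max_spread


robust = {"RSI25": 1.1, "RSI30": 1.3, "RSI35": 1.2}
fragile = {"RSI25": 2.5, "RSI30": 1.3, "RSI35": 0.4}
print(is_parameter_sensitive(robust))   # False
print(is_parameter_sensitive(fragile))  # True
```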
4. Decision:
Robust strategy (small variation) → Safe to deploy
Sensitive strategy (large variation) → Likely overfit, risky to deploy

1. Check data availability first (use browse-robonet-data):
get_data_availability(symbols=["YOUR-SYMBOL"], only_with_data=true)
2. Adjust date range:
- BTC-USDT, ETH-USDT: Available from 2020-present
- Altcoins: Typically 6-24 months
- Use date range within available data
3. Try different symbol:
- BTC-USDT and ETH-USDT have longest history
- Start testing on these, then expand to altcoins

1. Entry conditions too restrictive:
- Review strategy code (use browse-robonet-data: get_strategy_code)
- Conditions may never be met simultaneously
- Example: "RSI < 20 AND price > 200 EMA" (RSI rarely gets to 20)
2. Test on longer period:
- 6 months may not have ideal conditions
- Try 12-24 months
3. Adjust thresholds (use improve-trading-strategies):
- Loosen entry conditions slightly
- Example: Change "RSI < 25" to "RSI < 30"

1. Long date range + high-frequency timeframe:
- 2+ years on 1m timeframe = slow
- Solution: Test shorter range or use 5m/15m timeframe
2. Complex strategy with many indicators:
- Some indicators are computationally expensive
- Solution: Simplify strategy if possible
3. Normal for prediction markets:
- run_prediction_market_backtest can take 30-60s
- This is expected

1. Likely overfitted to historical data
2. Test on out-of-sample period (different dates)
3. Check for look-ahead bias (using future data)
4. Verify realistic fees and slippage configured
5. If too-good-to-be-true persists, be very skeptical
6. Start with tiny deployment size to validate in live market

Related skills:
- improve-trading-strategies
- deploy-live-trading
- browse-robonet-data