The systems hold across models.
EDGE STACKS is model-agnostic by design. To test that claim, we built a structured validation harness that runs every analysis system (MLB, NBA, NFL, NHL) against three frontier models (Claude Sonnet 4, Gemini 3.1 Pro, and DeepSeek V4) on adversarial slates designed to break them, then scores every output on hard rules and reasoning quality.
v1 baseline (current — June 2, 2026): First validation battery for the EDGE STACKS app. Every sport is tested on both ML Parlay and SGP tiers across all three models — 24 scored runs total. Overall average: 87.4%. In-season sports (MLB, NHL) averaged 93.6%.
Cross-sport scorecard
Each row aggregates scored runs across all three models. Tier thresholds: ≥90% excellent, ≥80% good, ≥70% acceptable, ≥60% weak, <60% fail.
24 runs across 4 sports × 2 tiers × 3 models. Of the 24 outputs, 21 scored good (≥80%) or better and 14 scored excellent (≥90%). All 6 MLB runs scored good or better, and the 3 MLB SGP runs scored a perfect 100%. In-season sports (MLB, NHL) averaged 93.6%. Off-season sports (NBA, NFL) averaged 81.2%, with ML Parlay tiers depressed by empty live data feeds: models correctly reported no games available rather than fabricating data, which is the desired behavior. SGP outputs scored consistently high across all sports (avg 93.8%). Average response time: 26 seconds per analysis.
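The tier thresholds above can be expressed as a small lookup. This is a minimal sketch for illustration only; the function name `tier` is an assumption and is not taken from the harness.

```python
def tier(score: float) -> str:
    """Map a percentage score to the scorecard tier label."""
    if score >= 90:
        return "excellent"
    if score >= 80:
        return "good"
    if score >= 70:
        return "acceptable"
    if score >= 60:
        return "weak"
    return "fail"


# Averages quoted in this section.
for label, score in [("overall", 87.4), ("in-season", 93.6), ("off-season", 81.2)]:
    print(f"{label}: {score}% -> {tier(score)}")
```

Under these thresholds the overall and off-season averages land in the good tier and the in-season average lands in the excellent tier.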
Off-season context: NBA & NFL ML Parlay
NBA and NFL ML Parlay scores (72.5% and 66.1%) reflect off-season conditions where live data feeds return empty slates. Models correctly reported “no games available” rather than fabricating data — which is the desired behavior and validates the system’s data-integrity guardrails.
SGP scores for the same off-season sports remained strong (NBA SGP: 96.1%, NFL SGP: 90.0%) because the SGP system’s structural requirements (correlation blueprints, game-script terminology) are partially satisfied even with limited game data. We expect off-season scores to improve when these sports are re-validated during their active seasons.
Per-model breakdown
Average score by model for each sport and tier. All scores are out of 100%.
| Sport | Tier | Claude Sonnet 4 | Gemini 3.1 Pro | DeepSeek V4 |
|---|---|---|---|---|
| MLB | ML Parlay | 91.4% | 87.1% | 95.7% |
| MLB | SGP | 100.0% | 100.0% | 100.0% |
| NBA | ML Parlay | 89.5% | 45.6% | 82.5% |
| NBA | SGP | 100.0% | 93.3% | 95.0% |
| NFL | ML Parlay | 61.7% | 43.3% | 93.3% |
| NFL | SGP | 81.0% | 88.9% | 100.0% |
| NHL | ML Parlay | 100.0% | 81.4% | 100.0% |
| NHL | SGP | 95.2% | 80.6% | 91.9% |
| Overall Average | | 89.9% | 77.5% | 94.8% |
| In-season only (MLB + NHL) | | 96.6% | 87.3% | 96.9% |
Claude Sonnet 4: Most detailed reasoning and longest outputs. 100% on NHL ML Parlay and on both NBA and MLB SGP.
Gemini 3.1 Pro: Fastest response times. Strong SGP performance (avg 90.7%). Off-season ML Parlay scores pull the average down.
DeepSeek V4: Top performer. 100% on 3 of 8 sport/tier combos. Excels at structural compliance and format adherence. Never scored below 82.5%.
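The aggregate rows can be recomputed directly from the per-model table. The sketch below copies the published scores; the names `SCORES`, `MODELS`, `IN_SEASON`, and `model_average` are illustrative assumptions, not part of the harness.

```python
from statistics import mean

# (sport, tier) -> scores for (Claude Sonnet 4, Gemini 3.1 Pro, DeepSeek V4),
# copied from the per-model table.
SCORES = {
    ("MLB", "ML Parlay"): (91.4, 87.1, 95.7),
    ("MLB", "SGP"): (100.0, 100.0, 100.0),
    ("NBA", "ML Parlay"): (89.5, 45.6, 82.5),
    ("NBA", "SGP"): (100.0, 93.3, 95.0),
    ("NFL", "ML Parlay"): (61.7, 43.3, 93.3),
    ("NFL", "SGP"): (81.0, 88.9, 100.0),
    ("NHL", "ML Parlay"): (100.0, 81.4, 100.0),
    ("NHL", "SGP"): (95.2, 80.6, 91.9),
}
MODELS = ("Claude Sonnet 4", "Gemini 3.1 Pro", "DeepSeek V4")
IN_SEASON = {"MLB", "NHL"}


def model_average(index, sports=None):
    """Average one model's scores, optionally restricted to a set of sports."""
    rows = [
        scores[index]
        for (sport, _tier), scores in SCORES.items()
        if sports is None or sport in sports
    ]
    return mean(rows)


all_runs = [score for row in SCORES.values() for score in row]
overall = mean(all_runs)
good_or_better = sum(score >= 80 for score in all_runs)  # 21 of 24
excellent = sum(score >= 90 for score in all_runs)       # 14 of 24

for i, name in enumerate(MODELS):
    print(f"{name}: {model_average(i):.1f}% overall, "
          f"{model_average(i, IN_SEASON):.1f}% in-season")
```

Each sport/tier cell holds one run per model, so a simple unweighted mean is enough here.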
Why output varies — and why that’s expected
Large language models are probabilistic systems, not calculators. Even with identical prompts and inputs, different models will produce slightly different outputs due to differences in training data, architecture, tokenization, and sampling behavior. Scores will also shift between runs of the same model because of temperature-based randomness in token generation.
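As a toy illustration of temperature-based randomness (a generic sketch, not the decoding code of any model listed here), the function below turns raw token scores into sampling probabilities. The logits are made-up values.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.2)
warm = softmax_with_temperature(logits, 1.0)
print([round(p, 3) for p in cold])
print([round(p, 3) for p in warm])
```

At the lower temperature almost all probability sits on the top token, so repeated runs rarely differ. At the higher temperature the alternatives keep meaningful probability, which is one reason two runs on the same slate can diverge.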
Output also varies based on when you run the analysis. Odds move, late scratches drop, weather forecasts update, and lineup news breaks throughout the day. The system analyzes whatever data the model can see at the moment of the run — so a 2 pm run and a 6 pm run on the same slate can produce different parlays for entirely rational reasons. That’s a feature, not a flaw: the system is designed to react to live information, not echo a stale answer.
Current model lineup: Claude Sonnet 4 (Anthropic), Gemini 3.1 Pro (Google), DeepSeek V4 (DeepSeek/Fireworks). All three models route through the same infrastructure. Performance varies by sport and data availability: in-season sports with live data scored 80% or higher on every run across all three models.
This is expected and normal. What matters isn’t that every model produces byte-identical output — it’s that the system’s structure is durable enough to guide any frontier model, on any slate, at any moment, toward disciplined, well-reasoned analysis.
Methodology
Three frontier models, tested independently
Each sport's analysis system — both the ML Parlay tier and the SGP tier — is run through three independent models: Claude Sonnet 4, Gemini 3.1 Pro, and DeepSeek V4. Each model receives the same master prompt and slate, producing output scored against sport-specific rubrics.
Live data pipeline, not paste-based
The v1 baseline uses the inline PWA analysis system. Each run calls the /api/analyze endpoint, which loads the sport-specific master prompt from the database, pre-fetches 7–8 live data sources in parallel (including RotoWire lineups, ESPN schedule, Covers odds, Baseball Savant, injury feeds, and standings), injects all fetched data as a LIVE RESEARCH DATA appendix, and makes a single streaming LLM call. This data-first architecture replaced the old paste-based workflow.
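The fetch-then-inject flow described above could look roughly like the following. This is a sketch under stated assumptions: the source names, `fetch_source`, and `build_prompt` are hypothetical stand-ins, the fetcher returns placeholder text instead of making network requests, and the final LLM call is omitted.

```python
import asyncio

# Hypothetical source identifiers; the real fetchers are not shown in this document.
SOURCES = ["rotowire_lineups", "espn_schedule", "covers_odds", "injury_feed", "standings"]


async def fetch_source(name: str) -> tuple[str, str]:
    """Stand-in fetcher: a real one would perform an HTTP request here."""
    await asyncio.sleep(0)  # yield control, simulating I/O
    return name, f"<data from {name}>"


async def build_prompt(master_prompt: str) -> str:
    """Fetch all sources concurrently and append them as one data appendix."""
    results = await asyncio.gather(
        *(fetch_source(name) for name in SOURCES),
        return_exceptions=True,  # one failed source should not sink the whole run
    )
    sections = []
    for name, result in zip(SOURCES, results):
        if isinstance(result, Exception):
            sections.append(f"## {name}\nUNAVAILABLE")
        else:
            sections.append(f"## {result[0]}\n{result[1]}")
    return master_prompt + "\n\nLIVE RESEARCH DATA\n" + "\n\n".join(sections)


prompt = asyncio.run(build_prompt("MASTER PROMPT (MLB)"))
print(prompt)
```

Marking a failed source as unavailable, rather than dropping it silently, gives the model an explicit signal that data is missing, which fits the anti-fabrication goal described in this section.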
Hard rules scored programmatically
Each output is scored on a weighted deterministic rubric covering: first-token compliance (no preamble leaks), output-format completeness (PARLAY ANALYSIS / YOUR PARLAY headers for ML, SGP / GAME SCRIPT headers for SGP tier), sport-specific structural checks (SP confirmation, BvP, bullpen for MLB; goalie tiers for NHL; QB/weather for NFL; correlation blueprints for SGP), anti-fabrication tripwires, tool-call garbage detection, max-favorite ceiling (−200), free-roll anchor sizing, and output length thresholds.
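A weighted deterministic rubric of this kind can be sketched as a set of boolean checks with weights. The weights, the minimum length, the odds format `(-150)`, and the sample output are all assumptions made for illustration; the published rubric weights are not given here.

```python
import re

# Hypothetical weights; the real rubric weights are not stated in this document.
CHECKS = {
    "first_token": 2.0,
    "ml_headers": 3.0,
    "max_favorite": 3.0,
    "min_length": 1.0,
}
MAX_FAVORITE = -200  # American odds: anything more negative than -200 fails
MIN_LENGTH = 80      # hypothetical length threshold, in characters


def run_checks(output: str) -> dict[str, bool]:
    """Run each deterministic check on an ML Parlay output."""
    # Assumes odds are written with an ASCII sign inside parentheses, e.g. (-150).
    odds = [int(m) for m in re.findall(r"\(([+-]\d{3,})\)", output)]
    return {
        "first_token": output.startswith("PARLAY ANALYSIS"),  # no preamble leak
        "ml_headers": "PARLAY ANALYSIS" in output and "YOUR PARLAY" in output,
        "max_favorite": all(o >= MAX_FAVORITE for o in odds),
        "min_length": len(output) >= MIN_LENGTH,
    }


def weighted_score(results: dict[str, bool]) -> float:
    """Return the share of total weight earned, as a percentage."""
    earned = sum(CHECKS[name] for name, passed in results.items() if passed)
    return 100 * earned / sum(CHECKS.values())


sample = (
    "PARLAY ANALYSIS\n"
    "Yankees (-150): confirmed starter, rested bullpen.\n"
    "Dodgers (-260): heavy chalk.\n"
    "YOUR PARLAY\n"
    "Yankees ML + Dodgers ML\n"
)
results = run_checks(sample)
print(results, round(weighted_score(results), 1))
```

In this made-up sample the -260 leg breaks the max-favorite ceiling, so the output earns 6 of 9 weight points.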
Soft dimensions scored by judge model
A separate LLM judge grades reasoning depth, framework adherence, situational integration, and output completeness on a 10-point scale per dimension. Scores below threshold flag the output for manual review.
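One way to combine the judge's per-dimension grades and flag weak outputs is sketched below. The threshold value, the 0 to 10 range, and the choice to flag per dimension are assumptions; the document does not state them.

```python
DIMENSIONS = (
    "reasoning_depth",
    "framework_adherence",
    "situational_integration",
    "output_completeness",
)
REVIEW_THRESHOLD = 6  # hypothetical: any dimension below this triggers manual review


def summarize(judge_scores: dict[str, int]) -> dict:
    """Validate judge grades, convert them to a percentage, and list flagged dimensions."""
    for dim in DIMENSIONS:
        if dim not in judge_scores:
            raise ValueError(f"missing dimension: {dim}")
        if not 0 <= judge_scores[dim] <= 10:
            raise ValueError(f"{dim} must be between 0 and 10")
    flagged = [d for d in DIMENSIONS if judge_scores[d] < REVIEW_THRESHOLD]
    percent = 100 * sum(judge_scores[d] for d in DIMENSIONS) / (10 * len(DIMENSIONS))
    return {"percent": percent, "flagged": flagged, "needs_review": bool(flagged)}


summary = summarize({
    "reasoning_depth": 8,
    "framework_adherence": 9,
    "situational_integration": 5,
    "output_completeness": 7,
})
print(summary)
```

Validating the judge's grades before using them guards against a malformed judge response being scored as if it were a real grade.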
Transparent scoring — every number published
Every combination is scored and recorded. We publish every number — because transparent validation is the point. When models update or prompts change, we re-run the full battery and publish fresh results.
What each sport’s rubric actually enforces
Sample of the deterministic checks run against every output. Different sports have different failure modes; the rubrics track them.
NFL
- Timestamp-first eligibility (games already kicked off are rejected; applies to ML and SGP)
- Slate verification (all games, QBs, and stats sourced from RotoWire + ESPN — partial data triggers multi-source recovery)
- Starter QB confirmation (no backups, injured starters, or capable-backup loopholes)
- Weather threshold enforcement (wind 15+ mph fades totals-dependent legs)
- SGP structural checks: all legs from a single game, correlation/game-script terminology required
- Multi-point missed-edge safeguard system (form gate, H2H, schedule fatigue)
- Max favorite ceiling (−200) on ML tier — heavy chalk excluded
- Free-roll anchor sizing on the strongest parlay leg (both tiers)
MLB
- Timestamp-first eligibility (games already at first pitch are rejected; applies to ML and SGP)
- Slate verification (all games, pitchers, and stats sourced from RotoWire + ESPN — partial data triggers multi-source recovery)
- Starting pitcher confirmation (drop if not confirmed — mandatory for SGP)
- Multi-point ML safeguards + multi-point SGP game-selection checks (weather, form, BvP, SP Floor, Bullpen Risk Scan & more)
- SGP pitching-driven game script with 3 correlation blueprints (ace dominance, offensive explosion, blowout)
- F5 (First 5 Innings) ML routing when bullpen risk is high
- Max favorite ceiling (−200) on ML tier — heavy chalk excluded
- Free-roll anchor sizing (both ML and SGP tiers)
NBA
- Timestamp-first eligibility (games already tipped off are rejected; applies to ML and SGP)
- Slate verification (all games, players, and stats sourced from RotoWire + ESPN — partial data triggers multi-source recovery)
- Multi-point ML safeguards + multi-point SGP game-selection checks
- SGP legs all in one game (no cross-game contamination)
- 3 named SGP correlation blueprints (Star Dominance, Pace Mismatch, Blowout Build)
- Max favorite ceiling (−200) on ML tier — heavy chalk excluded
- Free-roll anchor sizing (both ML and SGP tiers)
NHL
- Timestamp-first eligibility (games already at puck drop are rejected; applies to ML and SGP)
- Slate verification (all games, goalies, and stats sourced from RotoWire + ESPN — partial data triggers multi-source recovery)
- Goalie status classification (Tier A/B/C — only Tier C excluded)
- SGP structural checks: all legs from a single game, B2B fatigue avoidance, correlation terminology
- Multi-point missed-edge safeguard system (form gate, goal differential, H2H, fatigue)
- Elimination game modifier (scoring/prop suppression for Game 5/6/7)
- Max favorite ceiling (−200) on ML tier — heavy chalk excluded
- Free-roll anchor sizing on the strongest parlay leg (both tiers)
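Several of the checks listed above reduce to simple comparisons. The sketch below shows timestamp-first eligibility, the −200 max-favorite ceiling, the 15 mph wind threshold, and the single-game SGP rule. The `Leg` structure, the function names, and the fixture games are hypothetical, and treating a wind fade as a reported problem is an assumption about how the rubric applies it.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

MAX_FAVORITE = -200  # American odds ceiling on the ML tier
WIND_FADE_MPH = 15   # wind threshold for totals-dependent legs (listed for NFL)


@dataclass
class Leg:
    game_id: str
    start: datetime            # scheduled kickoff, first pitch, tip-off, or puck drop
    american_odds: int
    wind_mph: float = 0.0
    totals_dependent: bool = False


def ml_leg_problems(leg: Leg, now: datetime) -> list[str]:
    """Return the rule violations for one ML leg (an empty list means eligible)."""
    problems = []
    if leg.start <= now:
        problems.append("already started")
    if leg.american_odds < MAX_FAVORITE:
        problems.append("exceeds max-favorite ceiling")
    if leg.totals_dependent and leg.wind_mph >= WIND_FADE_MPH:
        problems.append("wind fade")
    return problems


def sgp_single_game(legs: list[Leg]) -> bool:
    """An SGP is structurally valid only if every leg comes from the same game."""
    return len({leg.game_id for leg in legs}) == 1


# Made-up fixture data for illustration.
now = datetime(2026, 6, 2, 18, 0, tzinfo=timezone.utc)
upcoming = Leg("NYY@BOS", datetime(2026, 6, 2, 23, 5, tzinfo=timezone.utc), -150)
started = Leg("LAD@SF", datetime(2026, 6, 2, 17, 0, tzinfo=timezone.utc), -260)

print(ml_leg_problems(upcoming, now))
print(ml_leg_problems(started, now))
print(sgp_single_game([upcoming, started]))
```

Using timezone-aware datetimes avoids comparing a local start time against a UTC clock, which would otherwise let a game that has already started slip through.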
Bottom line
24 runs. Three architecturally distinct models. Four sports. Two analysis tiers each. Adversarial slates built to break the prompt. Every output scored on hard rules and reasoning quality. We publish every number — because transparent validation is the point.
The system averaged 87.4% overall and 93.6% on in-season sports with active data feeds. DeepSeek V4 led at 94.8%, followed by Claude Sonnet 4 at 89.9% and Gemini 3.1 Pro at 77.5%. All three models achieved a perfect 100% on MLB SGP analysis. Off-season ML Parlay scores (NBA/NFL) were depressed by empty live data feeds: models correctly reported that no games were available instead of hallucinating a slate, which validates the system’s data-integrity guardrails.
The analysis system is model-agnostic under the hood — your edge holds regardless of which AI runs it.
Validation runs were performed using the inline PWA analysis system with live and synthetic data feeds. Results should not be interpreted as guarantees of real-world betting outcomes. We re-test whenever a master prompt is updated or a new model ships.