v1 Validation Baseline

The systems hold across models.

EDGE STACKS is model-agnostic by design. To test that claim, we built a structured validation harness that stresses every analysis system (MLB, NBA, NFL, NHL) against two frontier models, Claude Sonnet 4 and Gemini 3.1 Pro, on adversarial slates designed to break them, then scores every output on hard rules and reasoning quality.

v1 baseline (current — June 2, 2026): First validation battery for the EDGE STACKS app. Every sport is tested on both ML Parlay and SGP tiers across both models — 16 scored runs total. Overall average: 83.7%. In-season sports (MLB, NHL) averaged 92.0%.

4 sports, 2 frontier models, 16 scored runs, 83.7% overall average.

Cross-sport scorecard

Each row averages the scored runs for one sport and tier across both active models. Grade thresholds: ≥90% excellent, 80–89.9% good, 70–79.9% acceptable, 60–69.9% weak, <60% fail.

Sport	Tier	Avg %	Grade
MLB	SGP	100.0%	excellent
NBA	SGP (off-season)	96.7%	excellent
NHL	ML Parlay	90.7%	excellent
MLB	ML Parlay	89.2%	good
NHL	SGP	87.9%	good
NFL	SGP (off-season)	85.0%	good
NBA	ML Parlay (off-season)	67.5%	weak
NFL	ML Parlay (off-season)	52.5%	fail

16 runs across 4 sports × 2 tiers × 2 models. 6 of 8 sport/tier combos scored good (≥80%) or better. 3 scored excellent (≥90%). Both MLB SGP runs scored perfect 100%. In-season sports (MLB, NHL) averaged 92.0%. Off-season sports (NBA, NFL) had ML Parlay tiers depressed by empty live data feeds — models correctly reported no games available rather than fabricating data, which is the desired behavior. SGP outputs scored consistently high across all sports. Average response time: 26 seconds per analysis.
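The summary figures above can be rechecked from the scorecard rows. The following is a minimal sketch, assuming the published row averages are the only inputs; the names `SCORECARD`, `grade`, and `mean` are illustrative and not part of the EDGE STACKS codebase. Because the inputs are already rounded to one decimal, recomputed averages may differ from the published ones in the last digit.

```python
# Scorecard rows copied from the cross-sport table: (sport, tier, avg %).
SCORECARD = [
    ("MLB", "SGP", 100.0),
    ("NBA", "SGP", 96.7),
    ("NHL", "ML Parlay", 90.7),
    ("MLB", "ML Parlay", 89.2),
    ("NHL", "SGP", 87.9),
    ("NFL", "SGP", 85.0),
    ("NBA", "ML Parlay", 67.5),
    ("NFL", "ML Parlay", 52.5),
]

IN_SEASON = {"MLB", "NHL"}


def grade(score: float) -> str:
    """Map a percentage to the published grade bands."""
    if score >= 90:
        return "excellent"
    if score >= 80:
        return "good"
    if score >= 70:
        return "acceptable"
    if score >= 60:
        return "weak"
    return "fail"


def mean(values) -> float:
    values = list(values)
    return sum(values) / len(values)


overall = mean(score for _, _, score in SCORECARD)
in_season = mean(score for sport, _, score in SCORECARD if sport in IN_SEASON)
good_or_better = sum(1 for _, _, score in SCORECARD if score >= 80)

print(f"overall {overall:.4f}, in-season {in_season:.4f}, good or better {good_or_better}/8")
for sport, tier, score in SCORECARD:
    print(f"{sport:4} {tier:10} {score:6.1f}  {grade(score)}")
```

The overall mean works out to about 83.69 and the in-season mean to about 91.95, consistent with the published 83.7% and 92.0%.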

Off-season context: NBA & NFL ML Parlay

NBA and NFL ML Parlay scores (67.5% and 52.5%) reflect off-season conditions where live data feeds return empty slates. Models correctly reported “no games available” rather than fabricating data, which is the desired behavior and validates the system’s data-integrity guardrails.

SGP scores for the same off-season sports remained strong (NBA SGP: 96.7%, NFL SGP: 85.0%) because the SGP system’s structural requirements (correlation blueprints, game-script terminology) are partially satisfied even with limited game data. We expect the off-season scores to improve when these sports are re-validated during their active seasons.

Per-model breakdown

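Average score by model for each sport and tier. All scores are out of 100%. The retired DeepSeek V4 column is shown for reference only; the cross-sport scorecard averages above use the two active models.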

Sport	Tier	Claude Sonnet 4	Gemini 3.1 Pro	DeepSeek V4 (retired)
MLB	ML Parlay	91.4%	87.1%	95.7%
MLB	SGP	100.0%	100.0%	100.0%
NBA	ML Parlay	89.5%	45.6%	82.5%
NBA	SGP	100.0%	93.3%	95.0%
NFL	ML Parlay	61.7%	43.3%	93.3%
NFL	SGP	81.0%	88.9%	100.0%
NHL	ML Parlay	100.0%	81.4%	100.0%
NHL	SGP	95.2%	80.6%	91.9%
Overall Average		89.9%	77.5%	94.8%
In-season only (MLB + NHL)		96.6%	87.3%	96.9%
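As a cross-check, the scorecard rows and the per-model overall averages can be recomputed from the table above. This is a sketch under the assumption that each scorecard row is the plain mean of the two active models; `PER_MODEL`, `active_average`, and `model_average` are hypothetical names used only for illustration.

```python
# Per-model scores copied from the table above, keyed by (sport, tier).
PER_MODEL = {
    ("MLB", "ML Parlay"): {"Claude Sonnet 4": 91.4, "Gemini 3.1 Pro": 87.1, "DeepSeek V4": 95.7},
    ("MLB", "SGP"): {"Claude Sonnet 4": 100.0, "Gemini 3.1 Pro": 100.0, "DeepSeek V4": 100.0},
    ("NBA", "ML Parlay"): {"Claude Sonnet 4": 89.5, "Gemini 3.1 Pro": 45.6, "DeepSeek V4": 82.5},
    ("NBA", "SGP"): {"Claude Sonnet 4": 100.0, "Gemini 3.1 Pro": 93.3, "DeepSeek V4": 95.0},
    ("NFL", "ML Parlay"): {"Claude Sonnet 4": 61.7, "Gemini 3.1 Pro": 43.3, "DeepSeek V4": 93.3},
    ("NFL", "SGP"): {"Claude Sonnet 4": 81.0, "Gemini 3.1 Pro": 88.9, "DeepSeek V4": 100.0},
    ("NHL", "ML Parlay"): {"Claude Sonnet 4": 100.0, "Gemini 3.1 Pro": 81.4, "DeepSeek V4": 100.0},
    ("NHL", "SGP"): {"Claude Sonnet 4": 95.2, "Gemini 3.1 Pro": 80.6, "DeepSeek V4": 91.9},
}

# Assumption: only the two active models feed the cross-sport scorecard.
ACTIVE = ("Claude Sonnet 4", "Gemini 3.1 Pro")


def active_average(scores: dict) -> float:
    """Mean of the active models for one sport/tier row."""
    return sum(scores[model] for model in ACTIVE) / len(ACTIVE)


def model_average(model: str) -> float:
    """Mean of one model's scores across all eight sport/tier rows."""
    return sum(row[model] for row in PER_MODEL.values()) / len(PER_MODEL)


for key, scores in PER_MODEL.items():
    print(key, f"{active_average(scores):.2f}")
for model in ACTIVE:
    print(model, f"{model_average(model):.2f}")
```

Under that assumption the NFL ML Parlay row comes out at 52.5 and the NBA ML Parlay row at about 67.55, which lines up with the scorecard to within rounding.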

Claude Sonnet 4

89.9%

Most detailed reasoning and longest outputs. 100% on NHL ML Parlay and both NBA/MLB SGP. In-season average: 96.6%.

Gemini 3.1 Pro

77.5%

Fastest response times. Strong SGP performance (avg 90.7%). Off-season ML Parlay scores pull the average down. In-season average: 87.3%.

Why output varies — and why that’s expected

Large language models are probabilistic systems, not calculators. Even with identical prompts and inputs, different models will produce slightly different outputs due to differences in training data, architecture, tokenization, and sampling behavior. Scores will also shift between runs of the same model because of temperature-based randomness in token generation.
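To make the role of temperature concrete, here is a toy sketch of temperature-scaled sampling over a three-token vocabulary. It illustrates the general technique only; the tokens, logits, and function names are invented for the example and say nothing about how either model is actually configured.

```python
import math
import random


def softmax(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample(tokens, logits, temperature, rng):
    """Draw one token according to the temperature-scaled distribution."""
    weights = softmax(logits, temperature)
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token candidates and logits.
tokens = ["Yankees", "Dodgers", "Braves"]
logits = [2.0, 1.5, 0.5]

cold = softmax(logits, temperature=0.2)  # sharply peaked on the top token
warm = softmax(logits, temperature=1.5)  # flatter, more varied picks

rng = random.Random()
picks = [sample(tokens, logits, 1.5, rng) for _ in range(5)]
print(cold, warm, picks)
```

At low temperature almost all probability mass sits on the highest-logit token, so repeated runs agree; at higher temperature the alternatives are drawn often enough that two runs on the same input can differ.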

Output also varies based on when you run the analysis. Odds move, late scratches drop, weather forecasts update, and lineup news breaks throughout the day. The system analyzes whatever data the model can see at the moment of the run — so a 2 pm run and a 6 pm run on the same slate can produce different parlays for entirely rational reasons. That’s a feature, not a flaw: the system is designed to react to live information, not echo a stale answer.

Current model lineup: Claude Sonnet 4 (Anthropic) and Gemini 3.1 Pro (Google). Both models route through the same infrastructure. Performance varies by sport and data availability: every in-season sport and tier scored above 80% on both models, and the in-season averages were 96.6% (Claude Sonnet 4) and 87.3% (Gemini 3.1 Pro).

This is expected and normal. What matters isn’t that every model produces byte-identical output — it’s that the system’s structure is durable enough to guide any frontier model, on any slate, at any moment, toward disciplined, well-reasoned analysis.

Methodology

Two frontier models, tested independently

Each sport's analysis system — both the ML Parlay tier and the SGP tier — is run through two independent models: Claude Sonnet 4 and Gemini 3.1 Pro. Each model receives the same master prompt and slate, producing output scored against sport-specific rubrics.

Live data pipeline, not paste-based

The v1 baseline uses the inline PWA analysis system. Each run calls the /api/analyze endpoint which loads the sport-specific master prompt from the database, pre-fetches 7–8 live data sources in parallel (RotoWire lineups, ESPN schedule, Covers odds, Baseball Savant, injury feeds, standings), injects all fetched data as a LIVE RESEARCH DATA appendix, and makes a single streaming LLM call. This data-first architecture replaced the old paste-based workflow.
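The data-first flow described above (parallel pre-fetch, appendix injection, then one model call) can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual /api/analyze implementation: `fetch_source` and `build_prompt` are hypothetical, the fetcher returns placeholder text instead of making a request, and marking a failed source as unavailable is an assumed behavior.

```python
import asyncio

# Source names taken from the description above; the fetcher itself is a stand-in.
SOURCES = [
    "RotoWire lineups",
    "ESPN schedule",
    "Covers odds",
    "Baseball Savant",
    "injury feeds",
    "standings",
]


async def fetch_source(name: str) -> str:
    """Hypothetical fetcher; a real one would make an HTTP request."""
    await asyncio.sleep(0)  # stand-in for network latency
    return f"<data from {name}>"


async def build_prompt(master_prompt: str, sources: list) -> str:
    # Fetch every source concurrently; exceptions are returned, not raised.
    results = await asyncio.gather(
        *(fetch_source(s) for s in sources), return_exceptions=True
    )
    sections = []
    for source, result in zip(sources, results):
        # Assumption: a failed source is labeled, never filled with guesses.
        body = "(unavailable)" if isinstance(result, Exception) else result
        sections.append(f"## {source}\n{body}")
    appendix = "LIVE RESEARCH DATA\n" + "\n\n".join(sections)
    # The combined text would then go to a single streaming LLM call.
    return f"{master_prompt}\n\n{appendix}"


prompt = asyncio.run(build_prompt("MASTER PROMPT", SOURCES))
print(prompt)
```

Gathering with `return_exceptions=True` keeps one slow or failing feed from aborting the whole run, which fits the requirement that models report missing data rather than invent it.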

Hard rules scored programmatically

Each output is scored on a weighted deterministic rubric covering: first-token compliance (no preamble leaks), output-format completeness (PARLAY ANALYSIS / YOUR PARLAY headers for ML, SGP / GAME SCRIPT headers for SGP tier), sport-specific structural checks (SP confirmation, BvP, bullpen for MLB; goalie tiers for NHL; QB/weather for NFL; correlation blueprints for SGP), anti-fabrication tripwires, tool-call garbage detection, max-favorite ceiling (−200), free-roll anchor sizing, and output length thresholds.
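A weighted deterministic rubric of this kind might look like the sketch below, which covers three of the listed checks for the ML Parlay tier. The weights, function names, and the odds-matching regular expression are assumptions made for illustration; the real rubric's weights and parsing rules are not published here.

```python
import re

MAX_FAVORITE = -200  # heaviest favorite allowed on the ML tier (American odds)


def starts_with_header(output: str) -> bool:
    """First-token compliance: no preamble before the first header."""
    return output.lstrip().startswith("PARLAY ANALYSIS")


def has_required_headers(output: str) -> bool:
    return "PARLAY ANALYSIS" in output and "YOUR PARLAY" in output


def respects_favorite_ceiling(output: str) -> bool:
    """No quoted moneyline may be a heavier favorite than MAX_FAVORITE."""
    text = output.replace("\u2212", "-")  # normalize the Unicode minus sign
    odds = [int(m) for m in re.findall(r"(?<![\w.])[+-]\d{3,4}\b", text)]
    return all(o >= MAX_FAVORITE for o in odds)


# Hypothetical integer weights that sum to 100.
CHECKS = {
    "first_token": (30, starts_with_header),
    "headers": (30, has_required_headers),
    "max_favorite": (40, respects_favorite_ceiling),
}


def score(output: str):
    results = {name: check(output) for name, (_, check) in CHECKS.items()}
    earned = sum(weight for name, (weight, _) in CHECKS.items() if results[name])
    return earned, results


good = "PARLAY ANALYSIS\nYankees -150: starter confirmed.\nYOUR PARLAY\nYankees -150, Braves +110"
bad = "Sure, here is your parlay!\nPARLAY ANALYSIS\nDodgers \u2212260\nYOUR PARLAY\nDodgers \u2212260"

good_score, _ = score(good)
bad_score, bad_results = score(bad)
print(good_score, bad_score, bad_results)
```

Here the second sample loses points for the preamble and for the −260 favorite while still earning credit for including both headers.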

Soft dimensions scored by judge model

A separate LLM judge grades reasoning depth, framework adherence, situational integration, and output completeness on a 10-point scale per dimension. Scores below threshold flag the output for manual review.
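The judge output can be represented as four per-dimension scores with a review flag. The sketch below assumes the threshold applies per dimension and uses a placeholder value of 6; the actual threshold and whether it applies per dimension or to the total are not stated, so treat both as assumptions.

```python
from dataclasses import dataclass

DIMENSIONS = (
    "reasoning_depth",
    "framework_adherence",
    "situational_integration",
    "output_completeness",
)

REVIEW_THRESHOLD = 6  # hypothetical cut-off on the 10-point scale


@dataclass
class JudgeResult:
    scores: dict  # dimension name -> score from 0 to 10

    def percent(self) -> float:
        """Total across dimensions, expressed as a percentage."""
        total = sum(self.scores[d] for d in DIMENSIONS)
        return 100 * total / (10 * len(DIMENSIONS))

    def flagged(self) -> list:
        """Dimensions that fall below the review threshold."""
        return [d for d in DIMENSIONS if self.scores[d] < REVIEW_THRESHOLD]

    def needs_manual_review(self) -> bool:
        return bool(self.flagged())


result = JudgeResult(
    scores={
        "reasoning_depth": 9,
        "framework_adherence": 8,
        "situational_integration": 5,
        "output_completeness": 10,
    }
)
print(result.percent(), result.flagged(), result.needs_manual_review())
```

In this example the output scores 80% on soft dimensions overall but is still routed to manual review because one dimension falls below the assumed threshold.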

Transparent scoring — every number published

Every combination is scored and recorded. We publish every number — because transparent validation is the point. When models update or prompts change, we re-run the full battery and publish fresh results.

What each sport’s rubric actually enforces

Sample of the deterministic checks run against every output. Different sports have different failure modes; the rubrics track them.

NFL

Timestamp-first eligibility (games already kicked off are rejected — ML and SGP)
Slate verification (all games, QBs, and stats sourced from RotoWire + ESPN — partial data triggers multi-source recovery)
Starter QB confirmation (no backups, injured starters, or capable-backup loopholes)
Weather threshold enforcement (wind 15+ mph fades totals-dependent legs)
SGP structural checks: all legs from a single game, correlation/game-script terminology required
Multi-point missed-edge safeguard system (form gate, H2H, schedule fatigue)
Max favorite ceiling (−200) on ML tier — heavy chalk excluded
Free-roll anchor sizing on the strongest parlay leg (both tiers)
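Two of the NFL checks above, timestamp-first eligibility and the wind threshold, are simple enough to sketch directly. The game records and function names below are hypothetical, and treating a kickoff time equal to the current time as already started is an assumption.

```python
from datetime import datetime, timezone

WIND_FADE_MPH = 15  # threshold quoted in the NFL rubric


def is_eligible(kickoff: datetime, now: datetime) -> bool:
    """Timestamp-first eligibility: reject games that have already kicked off."""
    return kickoff > now


def fades_totals(wind_mph: float) -> bool:
    """Wind at or above the threshold fades totals-dependent legs."""
    return wind_mph >= WIND_FADE_MPH


# Hypothetical slate evaluated at a fixed moment.
now = datetime(2026, 9, 13, 18, 0, tzinfo=timezone.utc)
games = [
    {"game": "A @ B", "kickoff": datetime(2026, 9, 13, 17, 0, tzinfo=timezone.utc), "wind_mph": 6},
    {"game": "C @ D", "kickoff": datetime(2026, 9, 13, 20, 25, tzinfo=timezone.utc), "wind_mph": 18},
    {"game": "E @ F", "kickoff": datetime(2026, 9, 14, 0, 20, tzinfo=timezone.utc), "wind_mph": 4},
]

eligible = [g["game"] for g in games if is_eligible(g["kickoff"], now)]
wind_fades = [
    g["game"]
    for g in games
    if is_eligible(g["kickoff"], now) and fades_totals(g["wind_mph"])
]
print(eligible, wind_fades)
```

Using timezone-aware datetimes avoids comparing a local kickoff time against a UTC clock, which would silently admit or reject the wrong games.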

MLB

Timestamp-first eligibility (games already at first pitch are rejected — ML and SGP)
Slate verification (all games, pitchers, and stats sourced from RotoWire + ESPN — partial data triggers multi-source recovery)
Starting pitcher confirmation (drop if not confirmed — mandatory for SGP)
Multi-point ML safeguards + multi-point SGP game-selection checks (weather, form, BvP, SP Floor, Bullpen Risk Scan & more)
SGP pitching-driven game script with 3 correlation blueprints (ace dominance, offensive explosion, blowout)
F5 (First 5 Innings) ML routing when bullpen risk is high
Max favorite ceiling (−200) on ML tier — heavy chalk excluded
Free-roll anchor sizing (both ML and SGP tiers)

NBA

Timestamp-first eligibility (games already tipped off are rejected — ML and SGP)
Slate verification (all games, players, and stats sourced from RotoWire + ESPN — partial data triggers multi-source recovery)
Multi-point ML safeguards + multi-point SGP game-selection checks
SGP legs all in one game (no cross-game contamination)
3 named SGP correlation blueprints (Star Dominance, Pace Mismatch, Blowout Build)
Max favorite ceiling (−200) on ML tier — heavy chalk excluded
Free-roll anchor sizing (both ML and SGP tiers)

NHL

Timestamp-first eligibility (games already at puck drop are rejected — ML and SGP)
Slate verification (all games, goalies, and stats sourced from RotoWire + ESPN — partial data triggers multi-source recovery)
Goalie status classification (Tier A/B/C — only Tier C excluded)
SGP structural checks: all legs from a single game, B2B fatigue avoidance, correlation terminology
Multi-point missed-edge safeguard system (form gate, goal differential, H2H, fatigue)
Elimination game modifier (scoring/prop suppression for Game 5/6/7)
Max favorite ceiling (−200) on ML tier — heavy chalk excluded
Free-roll anchor sizing on the strongest parlay leg (both tiers)
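The single-game SGP requirement and the goalie tier rule can be expressed as two small predicates. This sketch encodes only what the list states (all legs from one game, only Tier C excluded); the leg structure, the `game_id` field, and the function names are hypothetical, and the criteria that place a goalie in a given tier are not covered.

```python
ALLOWED_GOALIE_TIERS = {"A", "B"}  # only Tier C is excluded
KNOWN_GOALIE_TIERS = {"A", "B", "C"}


def legs_share_one_game(legs: list) -> bool:
    """SGP structural check: every leg must come from the same game."""
    return len({leg["game_id"] for leg in legs}) == 1


def goalie_allowed(tier: str) -> bool:
    if tier not in KNOWN_GOALIE_TIERS:
        raise ValueError(f"unknown goalie tier: {tier!r}")
    return tier in ALLOWED_GOALIE_TIERS


# Hypothetical SGP legs.
clean_sgp = [
    {"game_id": "NHL-001", "leg": "Team ML"},
    {"game_id": "NHL-001", "leg": "Over 5.5"},
]
mixed_sgp = clean_sgp + [{"game_id": "NHL-002", "leg": "Player 1+ points"}]

print(legs_share_one_game(clean_sgp), legs_share_one_game(mixed_sgp))
print([tier for tier in sorted(KNOWN_GOALIE_TIERS) if goalie_allowed(tier)])
```

Raising on an unrecognized tier, rather than quietly allowing it, keeps a malformed classification from slipping past the filter.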

Bottom line

16 runs. Two architecturally distinct models. Four sports. Two analysis tiers each. Adversarial slates built to break the prompt. Every output scored on hard rules and reasoning quality. We publish every number — because transparent validation is the point.

The system averaged 83.7% overall and 92.0% on in-season sports with active data feeds. Claude Sonnet 4 led at 89.9%, with Gemini 3.1 Pro at 77.5%. Both models achieved perfect 100% on MLB SGP analysis. Off-season ML Parlay scores (NBA/NFL) were depressed by empty live data feeds — models correctly declined to fabricate games rather than hallucinating a slate, which validates the system’s data-integrity guardrails.

The analysis system is model-agnostic under the hood — your edge holds regardless of which AI runs it.

Validation runs were performed using the inline PWA analysis system with live and synthetic data feeds. Results should not be interpreted as guarantees of real-world betting outcomes. We re-test whenever a master prompt is updated or a new model ships.