Storage Experiment: Making NFL Simulation Data Fast¶
A plain-English walkthrough of what we're building and why.
The Problem¶
We built a football simulation engine that can play an NFL season thousands of times. Each simulation world is a full alternative reality — different plays, different winners, different MVP races. We want to answer questions like:
- How likely is a given quarterback to hit 4,000 passing yards this season?
- What's Kansas City's chance of beating Philadelphia by more than 7 points?
- Who's most likely to score the first touchdown in the playoffs?
The way we answer these questions at scale is to simulate many thousands of alternative seasons and look at the distribution of outcomes. If 6,800 out of 10,000 simulated seasons have Mahomes clearing 4,000 yards, that's a 68% probability.
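The counting step can be sketched in a few lines. This is a minimal illustration and not the engine's actual code: the function name `estimate_probability` and the toy numbers are assumptions made for the example.

```python
from typing import Sequence


def estimate_probability(season_totals: Sequence[float], threshold: float) -> float:
    """Fraction of simulated worlds in which the total reaches the threshold."""
    if not season_totals:
        raise ValueError("need at least one simulated world")
    hits = sum(1 for total in season_totals if total >= threshold)
    return hits / len(season_totals)


# Toy data: one season passing-yard total per simulated world (made-up values).
toy_worlds = [4210, 3890, 4005, 4550, 3720, 4100, 3999, 4300, 4000, 3650]
p = estimate_probability(toy_worlds, 4000)
print(f"{p:.0%} of worlds reach 4,000 yards")
```

With real output the list would hold one entry per simulated season, so 6,800 hits in 10,000 worlds gives 0.68.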
That simple idea runs into a wall: data volume.
How Much Data?¶
For one simulated season:
| Thing | Count |
|---|---|
| Games in the regular season | 288 |
| Plays per game (roughly) | ~130 |
| Plays in the whole season | ~37,000 |
| Statistics we track per game | ~833 |
Now multiply by the number of alternative seasons we want to simulate:
| Scale | Plays across all worlds | Statistic values for one game, across all worlds |
|---|---|---|
| 10,000 worlds | 370 million | ~8 million |
| 100,000 worlds | 3.7 billion | ~83 million |
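The figures in both tables are plain multiplication, and a short sketch makes the arithmetic easy to re-run at other scales. The constant and function names are illustrative. The second figure is read here as 833 values for a single game multiplied by the number of worlds, which is what the ~8 million and ~83 million entries work out to.

```python
# Per-season constants taken from the first table.
GAMES_PER_SEASON = 288
PLAYS_PER_GAME = 130   # rough average
STATS_PER_GAME = 833   # statistics tracked for one game

PLAYS_PER_SEASON = GAMES_PER_SEASON * PLAYS_PER_GAME  # 37,440, quoted as ~37,000


def volumes(worlds: int) -> dict:
    """Record counts at a given number of simulated worlds."""
    return {
        "plays_all_worlds": PLAYS_PER_SEASON * worlds,
        "stat_values_one_game": STATS_PER_GAME * worlds,
    }


for worlds in (10_000, 100_000):
    v = volumes(worlds)
    print(f"{worlds:>7,} worlds: {v['plays_all_worlds']:,} plays, "
          f"{v['stat_values_one_game']:,} stat values per game")
```

The exact products come out slightly above the rounded table entries because the table multiplies the rounded ~37,000 plays per season.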
At 100,000 worlds we're producing billions of records. That volume doesn't fit in memory; it doesn't fit on a single laptop's disk in any convenient way; and it needs to be queryable quickly so that downstream probability engines can get answers in seconds, not hours.
That's the problem the storage experiment is trying to solve.
The Shape of the Data¶
Two different shapes of data come out of the simulation:
Play-by-play. Every single play from every simulated game: kickoffs, third-and-shorts, the whole script. This is useful for deep questions such as "which plays went for 50+ yards?" or "in how many worlds did the kicker miss a game-winning field goal?" A typical game has ~130 plays. Across 288 games × 100,000 worlds, that's roughly 3.7 billion rows.
Outcome Context. A summarised, tabular view. Instead of every play, it's
"here are the totals per player and per team per game — passing yards, rushing
attempts, points scored, and so on." For each statistic — say,
PASSING_YARDS_GAME_KC_QB_1 (Kansas City's starter's passing yards in this
game) — we store an array of 100,000 numbers, one per simulated world. This is
the shape our probability engines actually want to consume.
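One way to picture the Outcome Context shape is as a pivot: per-world results go in, and one array per statistic comes out, indexed by world. The sketch below is an in-memory illustration only. It assumes every world reports every statistic, `pivot_worlds` is a hypothetical helper, and `POINTS_GAME_KC` is an illustrative ID that follows the naming pattern above.

```python
from array import array
from typing import Dict, List


def pivot_worlds(world_results: List[Dict[str, float]]) -> Dict[str, array]:
    """Turn one dict per world into one float64 array per statistic.

    Position i in every array belongs to world i. Assumes each world
    contains every statistic (a missing key raises KeyError).
    """
    stat_ids = sorted({stat_id for world in world_results for stat_id in world})
    return {
        stat_id: array("d", (world[stat_id] for world in world_results))
        for stat_id in stat_ids
    }


# Three toy worlds for one game (made-up values).
worlds = [
    {"PASSING_YARDS_GAME_KC_QB_1": 301, "POINTS_GAME_KC": 27},
    {"PASSING_YARDS_GAME_KC_QB_1": 254, "POINTS_GAME_KC": 17},
    {"PASSING_YARDS_GAME_KC_QB_1": 322, "POINTS_GAME_KC": 31},
]
columns = pivot_worlds(worlds)
print(columns["PASSING_YARDS_GAME_KC_QB_1"].tolist())
```

At full scale each array would hold 100,000 entries instead of three.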
Both shapes go into the same database: ClickHouse, a columnar analytical database designed for exactly this kind of workload.
The Two Pipelines¶
So we've got two pipelines going into ClickHouse:
┌─────────────────────┐        ┌──────────────────────┐
│  Simulation engine  │──┬──→  │ Play-by-play stream  │ ─→ ClickHouse
└─────────────────────┘  │     └──────────────────────┘
                         │     ┌──────────────────────┐
                         └──→  │ Outcome accumulator  │ ─→ ClickHouse
                               └──────────────────────┘
Play-by-play goes in raw. Every play becomes a row. Fast path, no aggregation.
Outcome Context takes a little more work. For each game, we walk through every play and accumulate — we keep running totals ("this QB has 287 passing yards so far, this RB has 3 TDs, this team is up by 14"). At the end of the game we hand off a bundle of ~600 statistics to ClickHouse. Then we move on to the next game.
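The accumulate step can be sketched as a single pass over a game's plays that keeps running totals keyed by statistic ID. This is a simplified illustration, not the real accumulator: the `Play` fields, the slot label format, and the two statistics tracked are assumptions chosen to keep the example short.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Play:
    """Hypothetical, heavily reduced play record."""
    team: str                  # offence on the play, e.g. "DEN"
    passer: Optional[str]      # slot label such as "QB_1", or None for non-pass plays
    passing_yards: int
    points: int                # points scored on this play


def accumulate_game(plays: Iterable[Play]) -> Dict[str, int]:
    """Walk every play once and return the end-of-game totals."""
    totals: Dict[str, int] = defaultdict(int)
    for play in plays:
        if play.points:
            totals[f"POINTS_GAME_{play.team}"] += play.points
        if play.passer is not None:
            totals[f"PASSING_YARDS_GAME_{play.team}_{play.passer}"] += play.passing_yards
    return dict(totals)


game = [
    Play("DEN", "QB_1", 12, 0),
    Play("DEN", "QB_1", 25, 7),   # touchdown pass, extra point folded in
    Play("DAL", None, 0, 3),      # field goal
]
bundle = accumulate_game(game)
print(bundle)
```

The returned dictionary plays the role of the per-game bundle that is handed off for storage before the next game starts.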
Where We Are Right Now¶
We've completed the Outcome accumulator pipeline and verified it runs end-to-end on real simulated data. We've also completed the season-level accumulator (win/loss records, milestones like "reached 4,000 passing yards for the season", playoff seedings).
What we haven't done yet: the actual ClickHouse write experiments. The storage experiment harness exists, the ClickHouse connection works, but the measurements — "how long does it take to write a season at 100,000 worlds" — are still ahead of us. That's the next phase.
What a Real Simulation Looks Like¶
Here's what the pipeline produces for one simulated game (Dallas vs Denver, Week 1, final score DAL 38, DEN 33):
POINTS_GAME_DAL = 38
POINTS_GAME_DEN = 33
PASSING_YARDS_GAME_DEN_QB_1 = 382
RUSHING_YARDS_GAME_DAL_RB_1 = 161
RECEIVING_YARDS_GAME_DEN_WR_3 = 193
Each of those values, at 100,000-world scale, becomes an array of 100,000 numbers — the full distribution of that statistic across every alternative reality we simulated. That's exactly the shape a probability engine needs to compute "how often does Denver's WR3 go over 150 receiving yards" — count the entries in the array that exceed 150, divide by 100,000, done.
Across a whole season simulation (one world), the accumulator emits:
- 288 games with around 600 statistics each — roughly 175,000 total rows of game-level data
- 1,669 season-level statistics — team standings, playoff seedings, player
season totals, milestone flags like
REACHED_4000_PASSING_KC_QB_1
The Catalogue¶
The team has defined a catalogue of statistics that the system is supposed to track per game. This is the source of truth for "what should exist." It covers:
- Team scoring — points by quarter, half, game; winner flags; point differential; largest lead
- Team offense — passing yards, rushing yards, total yards, first downs, penalties
- Team defense — sacks, interceptions, turnovers forced, fumbles recovered
- Team game flow — number of drives, punts, time of possession
- QB stats — passing yards, TDs, interceptions, completion %, passer rating, yards per attempt, longest completion
- RB stats — rushing yards, attempts, TDs, yards per carry, longest rush, plus receiving (RBs catch passes too)
- WR / TE stats — receiving yards, receptions, targets, TDs, longest reception
- Kicker stats — field goals made, attempts, longest, extra points, kicking points
- TD ordinal stats — who scored first, second, third; anytime-TD indicators
- Fantasy stats — PPR, half-PPR, standard scoring
For the Dallas vs Denver matchup above, the catalogue declares 833 distinct statistics that should exist.
What the Diagnostic Shows¶
The accumulator runs and produces a concrete set of rows for the game. How does that compare to the catalogue? That's what the diagnostic test measures.
Here are the results from a sample Week 1 game (DAL vs DEN):
| Metric | Count |
|---|---|
| Catalogue-declared outcomes | 833 |
| Accumulator-produced outcomes | 610 |
| Produced and declared (matching) | 593 |
| Produced but not declared (unexpected) | 17 |
| Declared but not produced (the gap) | 240 |
So the accumulator produces 610 statistics for the game, 593 of which match the catalogue. The catalogue declares 833, so 240 declared outcomes (833 - 593) have no row. Where are those 240?
The diagnostic splits that gap into three categories:
| Classification | Count | What it means |
|---|---|---|
| sparse-player-inactive | 96 | Player had no activity in this game at all |
| sparse-zero-stat | 144 | Player played, but this specific stat was legitimately zero |
| missing-implementation | 0 | Catalogue declares it, accumulator doesn't produce it |
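The comparison behind these two tables is set arithmetic, followed by a per-ID classification of the gap. The sketch below shows one way that could look. It is not the diagnostic's actual code: `classify_gap`, the `slot_of` mapping, and the slot labels are hypothetical, and the rule order is an assumption.

```python
from typing import Dict, Optional, Set


def compare(declared: Set[str], produced: Set[str]) -> Dict[str, int]:
    """Count matching, unexpected, and missing outcome IDs."""
    return {
        "matching": len(declared & produced),
        "unexpected": len(produced - declared),
        "gap": len(declared - produced),
    }


def classify_gap(outcome_id: str, slot_of: Dict[str, str],
                 active_slots: Set[str], supported: Set[str]) -> str:
    """Explain why a declared outcome has no row."""
    if outcome_id not in supported:
        return "missing-implementation"
    slot: Optional[str] = slot_of.get(outcome_id)  # None for team-level stats
    if slot is not None and slot not in active_slots:
        return "sparse-player-inactive"
    return "sparse-zero-stat"


# Toy example with four declared outcomes.
declared = {"POINTS_GAME_DAL", "PASSING_YARDS_GAME_DAL_QB_1",
            "PASSING_YARDS_GAME_DAL_QB_2", "TEAM_DEF_INTERCEPTIONS_GAME_DAL"}
produced = {"POINTS_GAME_DAL", "PASSING_YARDS_GAME_DAL_QB_1"}
slot_of = {"PASSING_YARDS_GAME_DAL_QB_1": "DAL_QB_1",
           "PASSING_YARDS_GAME_DAL_QB_2": "DAL_QB_2"}
active_slots = {"DAL_QB_1"}
supported = set(declared)

counts = compare(declared, produced)
labels = {oid: classify_gap(oid, slot_of, active_slots, supported)
          for oid in sorted(declared - produced)}
print(counts, labels)

# The reported counts satisfy the same identities.
assert 593 + 17 == 610 and 833 - 593 == 240 and 96 + 144 + 0 == 240
```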
Sparse-player-inactive (the smaller bucket)¶
Not every player touches the ball in every game. A team's third-string tight
end might never see the field. A backup quarterback on the sideline isn't
throwing any passes. The catalogue faithfully declares entries for every
listed player (PASSING_YARDS_GAME_DAL_QB_2, RECEIVING_YARDS_GAME_DAL_TE_2,
and so on), but the accumulator only emits rows when something actually
happens — there's no passing-yards row for a QB who didn't attempt a pass.
This is expected sparse data, not a bug.
Sparse-zero-stat (the other bucket)¶
A QB who throws for 382 yards but doesn't throw an interception will not
produce an INTERCEPTIONS_GAME_QB row. He played, the stat is declared, and
it genuinely zeroed out. Same story for a team's
TEAM_DEF_INTERCEPTIONS_GAME_DAL when the defence didn't pick off a pass.
Zero values are dropped on purpose: in the Outcome Context schema, only rows with meaningful data get stored. The probability engine reads the absence of a row as "this stat is zero in this world." Storing 100,000 zeros for every player × every stat that zeroed out would be enormously wasteful.
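The drop-on-write, default-on-read convention can be sketched as two small helpers. This is an in-memory illustration under the assumption that a game's statistics for one world are held as a dictionary. `drop_zero_rows` and `read_stat` are hypothetical names, and the real schema may apply the rule at a different granularity.

```python
from typing import Dict, List


def drop_zero_rows(stats: Dict[str, float]) -> Dict[str, float]:
    """Write side: keep only statistics with a non-zero value."""
    return {stat_id: value for stat_id, value in stats.items() if value != 0}


def read_stat(stored: Dict[str, float], stat_id: str) -> float:
    """Read side: a missing row means the stat was zero in this world."""
    return stored.get(stat_id, 0)


# Two toy worlds for the same game (made-up values).
raw_worlds: List[Dict[str, float]] = [
    {"PASSING_YARDS_GAME_DEN_QB_1": 382, "TEAM_DEF_INTERCEPTIONS_GAME_DAL": 0},
    {"PASSING_YARDS_GAME_DEN_QB_1": 240, "TEAM_DEF_INTERCEPTIONS_GAME_DAL": 2},
]
stored_worlds = [drop_zero_rows(world) for world in raw_worlds]

# Rebuild the dense per-world array for one statistic.
interceptions = [read_stat(world, "TEAM_DEF_INTERCEPTIONS_GAME_DAL")
                 for world in stored_worlds]
print(interceptions)
```

The zero in the first world is never stored, yet the reader recovers it from the missing key.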
Missing-implementation (the zero bucket, for now)¶
These are outcomes the catalogue declares but the accumulator doesn't produce at all. In the current run this count is zero — the catalogue is honest about what the accumulator supports. But there are deliberate gaps the catalogue also leaves out (for now):
- Per-defender statistics (individual sacks, tackles, pass deflections by player)
- Drive-level metrics (time of possession, longest drive in yards)
- Some detailed team totals (first downs, penalty counts, penalty yards have placeholders but are reported as zero — the plumbing isn't connected yet)
The catalogue and the accumulator are kept in sync so there are no surprising gaps. If we want to add defender-level stats later, we'll extend both sides together.
Coverage by section¶
The diagnostic also reports coverage section by section:
| Section | Declared | Produced | Coverage |
|---|---|---|---|
| §1 Team Scoring | 14 | 13 | 93% |
| §2 Team Game Outcomes | 28 | 28 | 100% |
| §3 Team Scoring Derived | 10 | 10 | 100% |
| §4 Team Passing | 16 | 15 | 94% |
| §5 Team Rushing | 10 | 10 | 100% |
| §6 Team Total Offense | 12 | 12 | 100% |
| §7 Team Defence / ST | 14 | 6 | 43% |
| §8 Team Game Flow | 5 | 4 | 80% |
| §9 QB Passing | 44 | 39 | 89% |
| §11 RB Rushing | 56 | 43 | 77% |
| §13 WR Receiving | 144 | 102 | 71% |
| §17 TD Ordinal | 180 | 153 | 85% |
| §18 Fantasy | 100 | 85 | 85% |
Percentages below 100% are almost entirely sparsity — the stat has a row only for players who actually did the thing. Team-level sections (§1-§6) hit near-100% because there are exactly 2 teams in every game and they both do things.
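The coverage column is produced divided by declared, rounded to a whole percent. A short sketch reproduces three rows of the table as a sanity check. The function name is illustrative.

```python
def coverage_pct(declared: int, produced: int) -> int:
    """Produced outcomes as a whole-number percentage of declared outcomes."""
    if declared <= 0:
        raise ValueError("declared must be positive")
    return round(100 * produced / declared)


# (section, declared, produced) copied from the table.
rows = [
    ("§2 Team Game Outcomes", 28, 28),
    ("§7 Team Defence / ST", 14, 6),
    ("§13 WR Receiving", 144, 102),
]
report = {name: coverage_pct(declared, produced) for name, declared, produced in rows}
for name, pct in report.items():
    print(f"{name}: {pct}%")
```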
Why This Matters¶
The accumulator is the first half of the pipeline; ClickHouse is the second. Together they turn "we simulated 100,000 alternative Week 1s" into "tell me the probability that DAL scores first and beats DEN by 3+."
Concretely, the pipeline unlocks:
- Fast probability queries. Instead of re-simulating to answer a question, we pre-simulate once and query the stored arrays. That turns a probability calculation from minutes into milliseconds.
- Confident pricing. Sports-betting prices depend on how confident we are in the probability. More simulated worlds = tighter confidence intervals = better prices. Storage that can handle 100,000 worlds lets us stop compromising.
- Repeatable experiments. The same stored dataset can be queried many times with different prediction models. We don't pay the simulation cost twice.
- Backwards-compatible analytics. Stats we didn't think to track today can be derived tomorrow from the stored play-by-play — we keep the raw material.
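The link between world count and confidence can be made concrete with the standard error of an estimated proportion. The sketch below uses textbook statistics, not a description of the pricing model: it assumes simulated worlds are independent draws and uses the normal approximation for a 95% interval.

```python
import math


def standard_error(p: float, worlds: int) -> float:
    """Standard error of a probability estimated from `worlds` independent trials."""
    return math.sqrt(p * (1 - p) / worlds)


def ci95_half_width(p: float, worlds: int) -> float:
    """Half-width of the approximate 95% confidence interval."""
    return 1.96 * standard_error(p, worlds)


for worlds in (10_000, 100_000):
    half = ci95_half_width(0.68, worlds)
    print(f"{worlds:>7,} worlds: 68% ± {half * 100:.2f} percentage points")
```

For an estimate of 68%, this gives roughly ±0.9 percentage points at 10,000 worlds and roughly ±0.3 at 100,000. Ten times the worlds narrows the interval by a factor of √10, about 3.2.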
The Storage Experiment (What's Next)¶
The infrastructure to write to ClickHouse exists but the actual benchmarks are still to come. The experiment will answer, concretely:
- Can ClickHouse write a full NFL season at 10,000 worlds? How long does it take? How much disk does it use?
- Can it write a full season at 100,000 worlds? At this scale, the per-game data no longer fits comfortably in memory, so we're testing a streaming approach where we flush partial results and merge them inside ClickHouse.
- How fast can it serve reads? When a probability engine wants one game's worth of data — all 600 statistics × 100,000 values — how long does the query take? Target: under 5 seconds for the 100K case.
- How well does it compress? Float64 arrays should compress 5-6x on simulation data. We'll validate that and try different codec options (LZ4, ZSTD, Delta + ZSTD).
- Does it hold up under concurrent reads? Ten probability engines hitting the same game at once — does latency stay flat, or does it fall over?
Results will go into a ClickHouse Cloud instance for reference and comparison. At the end of the experiment we'll have the data to decide: is ClickHouse the right store for this workload, or do we need something else?
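Before any benchmark runs, a back-of-envelope calculation gives a sense of what the read target implies. The sketch treats one game as a dense block of 600 statistics × 100,000 Float64 values. That is an upper bound, since zero rows are not stored, and the 5-6x compression ratio is the expectation stated above, not a measured result.

```python
BYTES_PER_FLOAT64 = 8


def dense_game_bytes(stats: int, worlds: int) -> int:
    """Uncompressed size of one game's arrays if every value were stored."""
    return stats * worlds * BYTES_PER_FLOAT64


raw_bytes = dense_game_bytes(600, 100_000)
print(f"uncompressed: {raw_bytes / 1e6:.0f} MB per game")

for ratio in (5, 6):
    print(f"at {ratio}x compression: {raw_bytes / ratio / 1e6:.0f} MB per game")

# Throughput needed to deliver the uncompressed payload within the 5-second target.
TARGET_SECONDS = 5
required_mb_per_s = raw_bytes / 1e6 / TARGET_SECONDS
print(f"required: {required_mb_per_s:.0f} MB/s of decoded data")
```

Under these assumptions a game is about 480 MB uncompressed, 80 to 96 MB on disk, and the 5-second target needs about 96 MB/s of decoded data reaching the probability engine.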
What's Complete vs What's Planned¶
| Component | Status |
|---|---|
| Simulation engine (plays, games, seasons) | Done |
| Game-level Outcome accumulator | Done |
| Season-level accumulator (standings, milestones) | Done |
| Outcome catalogue (list of declared statistics) | Done |
| Diagnostic test (coverage + sparsity classification) | Done |
| ClickHouse schemas (game, season, play-by-play, staging) | Designed |
| ClickHouse .NET client integration | Prototype ready |
| Write experiments at 10K scale | Planned |
| Write experiments at 100K scale | Planned |
| Read latency experiments | Planned |
| Compression and codec comparisons | Planned |
| Streaming-write + merge path | Planned |
The accumulator pipeline is the foundation. Everything from here is measuring how well ClickHouse holds up at scale.
For Reference¶
The full diagnostic report is written to
first-game-diagnostic.md every time the test runs. It lists every outcome
ID that was declared but not produced, grouped by reason. If you want to see
the shape of real data end-to-end, that file is a good place to start.