Storage Experiment: Making NFL Simulation Data Fast

A plain-English walkthrough of what we're building and why.

The Problem

We built a football simulation engine that can play an NFL season thousands of times. Each simulation world is a full alternative reality — different plays, different winners, different MVP races. We want to answer questions like:

How likely is a given quarterback to hit 4,000 passing yards this season?
What's Kansas City's chance of beating Philadelphia by more than 7 points?
Who's most likely to score the first touchdown in the playoffs?

The way we answer these questions at scale is to simulate many thousands of alternative seasons and look at the distribution of outcomes. If 6,800 out of 10,000 simulated seasons have Mahomes clearing 4,000 yards, that's a 68% probability.

That simple idea runs into a wall: data volume.

How Much Data?

For one simulated season:

Thing	Count
Games in the regular season	288
Plays per game (roughly)	~130
Plays in the whole season	~37,000
Statistics we track per game	~833

Now multiply by the number of alternative seasons we want to simulate:

Scale	Plays across all worlds (full season)	Game-level statistic values across all worlds (one game)
10,000 worlds	370 million	~8 million
100,000 worlds	3.7 billion	~83 million

At 100,000 worlds we're producing billions of records. That volume doesn't fit in memory; it doesn't fit on a single laptop's disk in any convenient way; and it needs to be queryable quickly so that downstream probability engines can get answers in seconds, not hours.
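
The figures in the tables above are plain multiplication, and a few lines of Python make the inputs explicit. This is only a back-of-envelope sketch; the function names are made up for illustration. Note that the game-level figures (~8 million and ~83 million) correspond to one game's 833 statistics multiplied by the number of worlds, while the play counts cover a whole season.

```python
# Inputs taken from the tables above.
GAMES_PER_SEASON = 288
PLAYS_PER_GAME = 130   # rough average
STATS_PER_GAME = 833   # statistics tracked per game


def plays_across_worlds(worlds: int) -> int:
    """Play-by-play rows for a full season, across all worlds."""
    return GAMES_PER_SEASON * PLAYS_PER_GAME * worlds


def stat_values_per_game(worlds: int) -> int:
    """Stored values for one game's statistics, across all worlds."""
    return STATS_PER_GAME * worlds


def stat_values_per_season(worlds: int) -> int:
    """Upper bound for a full season if every statistic had a value."""
    return GAMES_PER_SEASON * stat_values_per_game(worlds)


for worlds in (10_000, 100_000):
    print(
        f"{worlds:>7,} worlds: "
        f"{plays_across_worlds(worlds):,} plays, "
        f"{stat_values_per_game(worlds):,} values per game, "
        f"{stat_values_per_season(worlds):,} values per season"
    )
```

At 10,000 worlds this gives about 374 million plays and 8.3 million values per game, which matches the rounded table entries.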

That's the problem the storage experiment is trying to solve.

The Shape of the Data

Two different shapes of data come out of the simulation:

Play-by-play. Every single play from every simulated game: kickoffs, third-and-shorts, the whole script. This is useful for deep questions such as "which plays went for 50+ yards" or "in how many worlds did the kicker miss a game-winning field goal?" A typical game has ~130 plays. Across 288 games × 100,000 worlds, that's roughly 3.7 billion rows.

Outcome Context. A summarised, tabular view. Instead of every play, it's "here are the totals per player and per team per game — passing yards, rushing attempts, points scored, and so on." For each statistic — say, PASSING_YARDS_GAME_KC_QB_1 (Kansas City's starter's passing yards in this game) — we store an array of 100,000 numbers, one per simulated world. This is the shape our probability engines actually want to consume.

Both shapes go into the same database: ClickHouse, a columnar analytical database designed for exactly this kind of workload.

The Two Pipelines

So we've got two pipelines going into ClickHouse:

┌─────────────────────┐       ┌──────────────────────┐
│  Simulation engine  │──┬──→ │  Play-by-play stream │ ─→ ClickHouse
└─────────────────────┘  │    └──────────────────────┘
                          │    ┌──────────────────────┐
                          └──→ │  Outcome accumulator │ ─→ ClickHouse
                               └──────────────────────┘

Play-by-play goes in raw. Every play becomes a row. Fast path, no aggregation.

Outcome Context takes a little more work. For each game, we walk through every play and accumulate — we keep running totals ("this QB has 287 passing yards so far, this RB has 3 TDs, this team is up by 14"). At the end of the game we hand off a bundle of ~600 statistics to ClickHouse. Then we move on to the next game.
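
As a rough illustration of the accumulation step, here is a minimal sketch, assuming each play can be reduced to a (team, player slot, stat, amount) tuple. That record shape and the function name are hypothetical; the real accumulator tracks far more than this. The statistic IDs follow the naming pattern used elsewhere in this document.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical play record: (team, player slot, stat name, amount).
Play = Tuple[str, str, str, int]


def accumulate_game(plays: Iterable[Play]) -> Dict[str, int]:
    """Walk the plays once, keeping a running total per statistic ID."""
    totals: Dict[str, int] = defaultdict(int)
    for team, slot, stat, amount in plays:
        totals[f"{stat}_GAME_{team}_{slot}"] += amount
    # Only non-zero totals are handed off at the end of the game.
    return {stat_id: value for stat_id, value in totals.items() if value != 0}


plays = [
    ("DEN", "QB_1", "PASSING_YARDS", 12),
    ("DEN", "QB_1", "PASSING_YARDS", 25),
    ("DAL", "RB_1", "RUSHING_YARDS", 7),
    ("DAL", "RB_1", "RUSHING_YARDS", 4),
]
print(accumulate_game(plays))
# {'PASSING_YARDS_GAME_DEN_QB_1': 37, 'RUSHING_YARDS_GAME_DAL_RB_1': 11}
```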

Where We Are Right Now

We've completed the Outcome accumulator pipeline and verified it runs end-to-end on real simulated data. We've also completed the season-level accumulator (win/loss records, milestones like "reached 4,000 passing yards for the season", playoff seedings).

What we haven't done yet: the actual ClickHouse write experiments. The storage experiment harness exists, the ClickHouse connection works, but the measurements — "how long does it take to write a season at 100,000 worlds" — are still ahead of us. That's the next phase.

What a Real Simulation Looks Like

Here's what the pipeline produces for one simulated game (Dallas vs Denver, Week 1, final score DAL 38, DEN 33):

POINTS_GAME_DAL                          = 38
POINTS_GAME_DEN                          = 33
PASSING_YARDS_GAME_DEN_QB_1              = 382
RUSHING_YARDS_GAME_DAL_RB_1              = 161
RECEIVING_YARDS_GAME_DEN_WR_3            = 193

Each of those values, at 100,000-world scale, becomes an array of 100,000 numbers — the full distribution of that statistic across every alternative reality we simulated. That's exactly the shape a probability engine needs to compute "how often does Denver's WR3 go over 150 receiving yards" — count the entries in the array that exceed 150, divide by 100,000, done.
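
The "count and divide" step can be sketched in a few lines. The function name is illustrative and the eight-element array is a toy stand-in for a real 100,000-entry array; only the first value (193) comes from the sample game above.

```python
from typing import Sequence


def prob_over(values: Sequence[float], threshold: float) -> float:
    """Share of simulated worlds in which the statistic exceeds the threshold."""
    if not values:
        raise ValueError("need at least one simulated world")
    return sum(v > threshold for v in values) / len(values)


# Toy stand-in for RECEIVING_YARDS_GAME_DEN_WR_3 across 8 worlds.
wr3_yards = [193, 88, 151, 150, 42, 167, 120, 201]
print(prob_over(wr3_yards, 150))  # 4 of 8 worlds exceed 150 -> 0.5
```

Note the strict comparison: a world with exactly 150 yards does not count as "over 150".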

Across a whole season simulation (one world), the accumulator emits:

288 games with around 600 statistics each — roughly 175,000 total rows of game-level data
1,669 season-level statistics — team standings, playoff seedings, player season totals, milestone flags like REACHED_4000_PASSING_KC_QB_1

The Catalogue

The team has defined a catalogue of statistics that the system is supposed to track per game. This is the source of truth for "what should exist." It covers:

Team scoring — points by quarter, half, game; winner flags; point differential; largest lead
Team offense — passing yards, rushing yards, total yards, first downs, penalties
Team defense — sacks, interceptions, turnovers forced, fumbles recovered
Team game flow — number of drives, punts, time of possession
QB stats — passing yards, TDs, interceptions, completion %, passer rating, yards per attempt, longest completion
RB stats — rushing yards, attempts, TDs, yards per carry, longest rush, plus receiving (RBs catch passes too)
WR / TE stats — receiving yards, receptions, targets, TDs, longest reception
Kicker stats — field goals made, attempts, longest, extra points, kicking points
TD ordinal stats — who scored first, second, third; anytime-TD indicators
Fantasy stats — PPR, half-PPR, standard scoring

For the Dallas vs Denver matchup above, the catalogue declares 833 distinct statistics that should exist.

What the Diagnostic Shows

The accumulator runs and produces a concrete set of rows for the game. How does that compare to the catalogue? That's what the diagnostic test measures.

Here are the results from a sample Week 1 game (DAL vs DEN):

Metric	Count
Catalogue-declared outcomes	833
Accumulator-produced outcomes	610
Produced and declared (matching)	593
Produced but not declared (unexpected)	17
Declared but not produced (the gap)	240

So the accumulator produces 610 statistics for the game, 593 of which match catalogue entries. The catalogue declares 833, which leaves 240 declared statistics with no row. Where are those 240?
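
The five counts in the table are related by simple set arithmetic: matching + gap = declared (593 + 240 = 833) and matching + unexpected = produced (593 + 17 = 610). A minimal sketch of that comparison, assuming both sides can be represented as sets of outcome IDs (the function name and toy IDs are illustrative):

```python
from typing import Dict, Set


def compare(declared: Set[str], produced: Set[str]) -> Dict[str, int]:
    """Compare catalogue-declared outcome IDs with accumulator-produced ones."""
    return {
        "declared": len(declared),
        "produced": len(produced),
        "matching": len(declared & produced),    # produced and declared
        "unexpected": len(produced - declared),  # produced but not declared
        "gap": len(declared - produced),         # declared but not produced
    }


declared = {"POINTS_GAME_DAL", "POINTS_GAME_DEN", "PASSING_YARDS_GAME_DAL_QB_2"}
produced = {"POINTS_GAME_DAL", "POINTS_GAME_DEN", "SOME_UNDECLARED_STAT"}
print(compare(declared, produced))
# {'declared': 3, 'produced': 3, 'matching': 2, 'unexpected': 1, 'gap': 1}
```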

The diagnostic splits that gap into three categories:

Classification	Count	What it means
sparse-player-inactive	96	Player had no activity in this game at all
sparse-zero-stat	144	Player played, but this specific stat was legitimately zero
missing-implementation	0	Catalogue declares it, accumulator doesn't produce it
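
One way the three-way split could work is sketched below. This is not the diagnostic's actual code: the `Declared` record, the set of active players, and the set of implemented stat names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class Declared:
    """Hypothetical catalogue entry that has no produced row."""
    outcome_id: str
    stat: str               # e.g. "PASSING_YARDS"
    player: Optional[str]   # e.g. "DAL_QB_2"; None for team-level stats


def classify_gap(entry: Declared, active_players: Set[str],
                 implemented_stats: Set[str]) -> str:
    """Assign a declared-but-not-produced entry to one of the three buckets."""
    if entry.stat not in implemented_stats:
        return "missing-implementation"
    if entry.player is not None and entry.player not in active_players:
        return "sparse-player-inactive"
    # The player (or team) was active, so the stat must have been zero.
    return "sparse-zero-stat"


active = {"DAL_QB_1", "DEN_QB_1"}
implemented = {"PASSING_YARDS", "INTERCEPTIONS"}
backup_qb = Declared("PASSING_YARDS_GAME_DAL_QB_2", "PASSING_YARDS", "DAL_QB_2")
print(classify_gap(backup_qb, active, implemented))  # sparse-player-inactive
```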

Sparse-player-inactive (the smaller bucket)

Not every player touches the ball in every game. A team's third-string tight end might never see the field. A backup quarterback on the sideline isn't throwing any passes. The catalogue faithfully declares entries for every listed player (PASSING_YARDS_GAME_DAL_QB_2, RECEIVING_YARDS_GAME_DAL_TE_2, and so on), but the accumulator only emits rows when something actually happens — there's no passing-yards row for a QB who didn't attempt a pass. This is expected sparse data, not a bug.

Sparse-zero-stat (the other bucket)

A QB who throws for 382 yards but doesn't throw an interception will not produce an INTERCEPTIONS_GAME_QB row. He played, the stat is declared, and it genuinely zeroed out. Same story for TEAM_DEF_INTERCEPTIONS_GAME_DAL when the Dallas defence didn't pick off a pass.

Zero values are dropped on purpose — in the Outcome Context schema, only rows with meaningful data get stored. The probability engine reads the absence of a row as "this stat is zero in this world." Storing 100,000 zeros for every player × every stat that zeroed-out would be enormously wasteful.
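
On the reading side, "absence means zero" might look like the sketch below. It assumes the sparse data for one statistic arrives as a mapping from world index to value; that layout and the function name are hypothetical, not the actual schema.

```python
from typing import Dict, List


def densify(sparse: Dict[int, float], worlds: int) -> List[float]:
    """Expand {world_index: value} into a full array, reading absence as zero."""
    return [sparse.get(world, 0.0) for world in range(worlds)]


# Toy example: interceptions thrown, recorded in only 2 of 5 worlds.
interceptions = densify({0: 1.0, 3: 2.0}, worlds=5)
print(interceptions)  # [1.0, 0.0, 0.0, 2.0, 0.0]

# Probability of at least one interception across the 5 worlds.
print(sum(v >= 1 for v in interceptions) / len(interceptions))  # 0.4
```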

Missing-implementation (the zero bucket, for now)

These are outcomes the catalogue declares but the accumulator doesn't produce at all. In the current run this count is zero — the catalogue is honest about what the accumulator supports. But there are deliberate gaps the catalogue also leaves out (for now):

Per-defender statistics (individual sacks, tackles, pass deflections by player)
Drive-level metrics (time of possession, longest drive in yards)
Some detailed team totals (first downs, penalty counts, penalty yards have placeholders but are reported as zero — the plumbing isn't connected yet)

The catalogue and the accumulator are kept in sync so there are no surprising gaps. If we want to add defender-level stats later, we'll extend both sides together.

Coverage by section

The diagnostic also reports coverage section by section:

Section	Declared	Produced	Coverage
§1 Team Scoring	14	13	93%
§2 Team Game Outcomes	28	28	100%
§3 Team Scoring Derived	10	10	100%
§4 Team Passing	16	15	94%
§5 Team Rushing	10	10	100%
§6 Team Total Offense	12	12	100%
§7 Team Defence / ST	14	6	43%
§8 Team Game Flow	5	4	80%
§9 QB Passing	44	39	89%
§11 RB Rushing	56	43	77%
§13 WR Receiving	144	102	71%
§17 TD Ordinal	180	153	85%
§18 Fantasy	100	85	85%

Percentages below 100% are almost entirely sparsity — the stat has a row only for players who actually did the thing. Team-level sections (§1-§6) hit near-100% because there are exactly 2 teams in every game and they both do things.

Why This Matters

The accumulator is the first half of the pipeline; ClickHouse is the second. Together they turn "we simulated 100,000 alternative Week 1s" into "tell me the probability that DAL scores first and beats DEN by 3+."

Concretely, the pipeline unlocks:

Fast probability queries. Instead of re-simulating to answer a question, we pre-simulate once and query the stored arrays. The aim is to turn a probability calculation from minutes into milliseconds.
Confident pricing. Sports-betting prices depend on how confident we are in the probability. More simulated worlds = tighter confidence intervals = better prices. Storage that can handle 100,000 worlds lets us stop compromising.
Repeatable experiments. The same stored dataset can be queried many times with different prediction models. We don't pay the simulation cost twice.
Backwards-compatible analytics. Stats we didn't think to track today can be derived tomorrow from the stored play-by-play — we keep the raw material.
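
To put a rough number on "more worlds = tighter confidence intervals", here is a sketch using the standard error of a proportion with a normal approximation. It assumes the simulated worlds are independent draws, which is an assumption of this example rather than something measured; the function names are illustrative.

```python
import math


def standard_error(p: float, worlds: int) -> float:
    """Standard error of an estimated probability p from `worlds` independent worlds."""
    return math.sqrt(p * (1.0 - p) / worlds)


def ci95_half_width(p: float, worlds: int) -> float:
    """Half-width of an approximate 95% confidence interval (normal approximation)."""
    return 1.96 * standard_error(p, worlds)


# The 68% example from earlier, at both scales.
print(round(ci95_half_width(0.68, 10_000), 4))   # 0.0091
print(round(ci95_half_width(0.68, 100_000), 4))  # 0.0029
```

Under these assumptions, going from 10,000 to 100,000 worlds narrows the interval from roughly ±0.9 to ±0.3 percentage points, a factor of √10.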

The Storage Experiment (What's Next)

The infrastructure to write to ClickHouse exists but the actual benchmarks are still to come. The experiment will answer, concretely:

Can ClickHouse write a full NFL season at 10,000 worlds? How long does it take? How much disk does it use?
Can it write a full season at 100,000 worlds? At this scale, the per-game data no longer fits comfortably in memory, so we're testing a streaming approach where we flush partial results and merge them inside ClickHouse.
How fast can it serve reads? When a probability engine wants one game's worth of data — all 600 statistics × 100,000 values — how long does the query take? Target: under 5 seconds for the 100K case.
How well does it compress? Float64 arrays should compress 5-6x on simulation data. We'll validate that and try different codec options (LZ4, ZSTD, Delta + ZSTD).
Does it hold up under concurrent reads? Ten probability engines hitting the same game at once — does latency stay flat, or does it fall over?
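
For a sense of scale on the questions above, the sketch below works out the raw size of one game's data and shows one way to slice the worlds into partial flushes. It assumes dense Float64 arrays (ignoring sparsity); the 10,000-world chunk size is an arbitrary placeholder, not a setting from the experiment harness.

```python
from typing import Iterator, Tuple

BYTES_PER_FLOAT64 = 8


def game_payload_bytes(stats: int, worlds: int) -> int:
    """Raw size of one game's dense Float64 arrays, before compression."""
    return stats * worlds * BYTES_PER_FLOAT64


def world_chunks(total_worlds: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open [start, end) world ranges, one per partial flush."""
    for start in range(0, total_worlds, chunk_size):
        yield start, min(start + chunk_size, total_worlds)


raw = game_payload_bytes(600, 100_000)
print(raw / 1e6)      # 480.0 (MB, decimal), uncompressed
print(raw / 5 / 1e6)  # 96.0 MB if the hoped-for 5x compression holds

# Hypothetical chunk size of 10,000 worlds per flush.
print(len(list(world_chunks(100_000, 10_000))))  # 10
```

So a single 100K-world game read is on the order of half a gigabyte uncompressed, which is the payload the 5-second target has to cover.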

Results will go into a ClickHouse Cloud instance for reference and comparison. At the end of the experiment we'll have the data to decide: is ClickHouse the right store for this workload, or do we need something else?

What's Complete vs What's Planned

Component	Status
Simulation engine (plays, games, seasons)	Done
Game-level Outcome accumulator	Done
Season-level accumulator (standings, milestones)	Done
Outcome catalogue (list of declared statistics)	Done
Diagnostic test (coverage + sparsity classification)	Done
ClickHouse schemas (game, season, play-by-play, staging)	Designed
ClickHouse .NET client integration	Prototype ready
Write experiments at 10K scale	Planned
Write experiments at 100K scale	Planned
Read latency experiments	Planned
Compression and codec comparisons	Planned
Streaming-write + merge path	Planned

The accumulator pipeline is the foundation. Everything from here is measuring how well ClickHouse holds up at scale.

For Reference

The full diagnostic report is written to first-game-diagnostic.md every time the test runs. It lists every outcome ID that was declared but not produced, grouped by reason. If you want to see the shape of real data end-to-end, that file is a good place to start.