The V2 Game Engine

How one NFL game becomes a pure state machine, and how a run drives that machine across tens of thousands of parallel worlds — fast, deterministic, and without a single lock on the hot path.

Zero heap allocation per game · No locks on the simulation path · Deterministic regardless of core count · Shared state, exclusive writes

0  ·  The big picture

The engine simulates a single NFL game as a pure state machine. A run drives that same machine across tens of thousands of parallel "worlds" — typically 10,000 to 100,000. Three ideas make that cheap: the engine owns and reuses all its working memory, parallel workers never share mutable state, and every result is written into shared, index-addressed storage.

[Diagram: the hot path (zero allocation, no locks). Invariants (read-only: roster · matchup · schedule · seed) ⟶ Workers (disjoint worlds 0 .. k, each lane = 1 worker) ⟶ Engine (pure FSM over caller-owned state; context pooled and reused) ⟶ Shared store (one slot per world).]
Invariants flow in; each worker runs the engine over its own caller-owned state, handing it the same read-only invariant object (shared, never copied); results land in a shared store addressed by world index.


See it run: the whole process, two ways at once

A fast-food kitchen and the real engine, stepped together — two cooks (workers) cook burgers (games) and box one meal per customer (the store), all from a single Back / Next.

This is the entire process in one place, shown two ways that step together. Picture a franchise where an order lands for four cheeseburgers (four worlds). With two cooks on shift, the work splits two and two — Cook A takes burgers 0–1, Cook B takes 2–3, and no burger is ever cooked by both (workers own disjoint world ranges). Each cook keeps one station and wipes it clean between burgers instead of grabbing new equipment — that is a worker reusing its pooled engine context. Both work from the same order ticket (the read-only invariant), but no two burgers come out identical: one is a touch under, one over, one just right — that variation is the per-world randomness. Each finished meal is boxed for its own customer and never shared — that is each world's result written to its own slot in the outcome store. The two cooks work their own tickets in parallel without reaching into each other's pans (no locks), and one can finish before the other.

Step through with Next: the kitchen (analogy) and the engine (real) advance in lock-step.

[Interactive walkthrough, initial state. Two panels advance together; three counters sit under the caption.]

Counters
  • Share, lock nothing: invariant read by 0 now · 0 writers, 0 locks
  • Own slot: 0/4 results written · 0 collisions
  • Pooling: 2 contexts, reused for 0 games · 0 new allocations

Kitchen (the analogy)
  • Order ticket (read-only): 4 × Double Cheeseburger · 2 cooks → 2 each · seed 42. Shared by both cooks, never edited.
  • Cooks: one station each (badge stn #A, stn #B), wiped and reused, never a new one.
  • Pickup counter: one boxed meal per customer (burgers 0–3), never shared.

Engine (what actually happens)
  • Invariant (read-only): matchup HOME vs AWAY · 4 worlds → 2 workers · seed 42. Shared by all workers, never written.
  • Workers: one context each (badge ctx #A, ctx #B), reset and reused, 0 allocations.
  • Outcome store: one slot per world, with columns world, home_pts, away_pts (Worker A → worlds 0–1, Worker B → worlds 2–3).
One Back / Next drives both rows. Two cooks reuse their stations across burgers; each meal is boxed for its own customer — mirroring two workers reusing one context each and writing each world to its own store slot. Watch one cook finish before the other.
Key idea
The same station (context) is wiped clean and reused for every burger — never a new one built — which is memory pooling and zero per-game allocation (concepts 6, 2); the stn #A / ctx #A badge keeps its identity across both games to prove it. Doneness varies per burger — the per-world randomness (concept 3). Every meal is boxed for its own customer, never shared — each world owns its store slot (concept 5). Two cooks never share a pan, read the one ticket without ever locking it, and one finishes first — workers over disjoint ranges, no locks (concept 4). The chips under the caption keep these countable: readers climb but writers and locks stay 0, results fill their own slots with 0 collisions, and the context count never grows.

1  ·  Invariants in, engine owns its state

Read-only facts cross the boundary one way; the engine drives a caller-supplied state object and returns nothing.

Two kinds of data cross the boundary into the engine, and the distinction is the foundation everything else builds on:

  • Invariants (read-only). The roster, the matchup, the schedule, the random seed. The engine reads these as immutable facts — it never writes them. Many worlds can read the same invariants at once because reads never conflict.
  • Working state (engine-owned). Everything that changes during a game — score, clock, possession, down and distance, per-play scratch, per-player stats — lives in a context object the caller creates and hands in. The engine mutates it in place.

The engine exposes one entry point: it takes the read-only invariants and the context it should drive, and returns nothing (the exact signature is incidental — any shape carrying those inputs works). It is a pure function over caller-owned state: same inputs produce the same in-place mutations, with no hidden globals or static buffers. Before each game it resets the context in place — wiping the durable state, re-stamping the random source, clearing the per-play scratch — rather than constructing a fresh one.

That single decision — the caller owns the state and the engine only drives it — is what unlocks everything downstream: the state can be pooled and reused, workers can each own one privately, and the engine itself stays a stateless function you can reason about in isolation.
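
The boundary described above can be sketched in C#. This is a minimal illustration, not the engine's real API: the names `GameInvariants`, `GameContext`, and `Engine.Run` are hypothetical, the "game" is a toy loop, and the seed is taken straight from the invariants for brevity.

```csharp
using System;

// Hypothetical names throughout; the source does not show the real signature.
var invariants = new GameInvariants(homeTeamId: 1, awayTeamId: 2, seed: 42);
var context = new GameContext();              // created once, by the caller

Engine.Run(invariants, context);
int firstHome = context.HomePoints;
int firstAway = context.AwayPoints;

Engine.Run(invariants, context);              // same context, reset in place
Console.WriteLine(firstHome == context.HomePoints
               && firstAway == context.AwayPoints);   // prints True

// Read-only facts: get-only properties, set once in the constructor.
sealed class GameInvariants
{
    public int HomeTeamId { get; }
    public int AwayTeamId { get; }
    public ulong Seed { get; }
    public GameInvariants(int homeTeamId, int awayTeamId, ulong seed)
    {
        HomeTeamId = homeTeamId; AwayTeamId = awayTeamId; Seed = seed;
    }
}

// Working state: owned by the caller, mutated by the engine.
sealed class GameContext
{
    public int HomePoints;
    public int AwayPoints;
    public int PlaysRemaining;
    public ulong RngState;

    public void Reset(ulong seed)
    {
        HomePoints = 0;
        AwayPoints = 0;
        PlaysRemaining = 120;                 // arbitrary toy game length
        RngState = seed == 0 ? 1UL : seed;    // xorshift must not start at 0
    }
}

static class Engine
{
    // One entry point, returns nothing, no static or hidden state.
    public static void Run(GameInvariants invariants, GameContext ctx)
    {
        ctx.Reset(invariants.Seed);
        while (ctx.PlaysRemaining > 0)
        {
            ulong roll = Next(ref ctx.RngState) % 100;
            if (roll < 3) ctx.HomePoints += 7;
            else if (roll < 6) ctx.AwayPoints += 7;
            ctx.PlaysRemaining--;
        }
    }

    // xorshift64, used here only as a small deterministic generator.
    static ulong Next(ref ulong s)
    {
        s ^= s << 13; s ^= s >> 7; s ^= s << 17;
        return s;
    }
}
```

Because the context is a parameter, the second call replays the same game into the same object, which is the property the pooling and parallelism sections rely on.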

[Diagram: Invariants (roster · matchup · schedule · seed; read only) ⟶ Engine (one entry point, returns nothing, mutates the context in place) ⟶ Context (caller-owned: score · clock · possession · scratch · player stats), reset in place and replayed for the next game.]
The engine reads invariants, drives a caller-owned context, and is reset and replayed for the next game — never reconstructed.
Key idea
A stateless engine over externally-owned state is testable, poolable, and trivially parallel. The state is a parameter, not a property.

2  ·  Zero heap allocation per game

The hot path allocates nothing on the managed heap — and a sentinel test fails the build if it ever does.

Running a matchup across tens of thousands of worlds means the per-game cost is multiplied tens of thousands of times over. The biggest lever is the garbage collector: if each game allocated even a little, the GC would dominate the run. So the design target is blunt — zero bytes allocated per game. The strategies that get there:

  • Reuse, don't allocate. One context per worker is reset and replayed for every game it runs (see concept 1). The expensive objects exist once and live for the whole run.
  • Mutate in place. State objects are cleared and refilled each game, never reconstructed.
  • Stack, not heap. Short-lived per-decision buffers — like the probability vector a predictor fills before sampling — live on the stack, so they never touch the GC.
  • Value types for records. Per-play log rows are value types copied into a pre-sized buffer rather than freshly allocated objects.
  • No language features that allocate invisibly on the hot path — in C#, that rules out LINQ, closures, and params arrays. Each allocates quietly behind the scenes, so they're kept off the per-play and per-game path entirely.
  • A recording hook that's free. Games optionally record play-by-play through a simple virtual hook; benchmarking showed it costs the same as the more complex generic alternative, so the simpler, still-zero-allocation form was chosen.
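
Two of the strategies above, stack buffers and value-type log rows, can be sketched together. The types here (`PlayRecord`, `PlayLog`, `Predictor`) and the three-way probability split are assumptions made for illustration; the source does not show the engine's own types.

```csharp
using System;

// Hypothetical types; only the techniques (stackalloc, struct rows) come from the text.
var log = new PlayLog(capacity: 256);         // allocated once, outside the hot path
ulong rng = 42;

for (int play = 0; play < 5; play++)
{
    int choice = Predictor.SamplePlayType(ref rng);
    log.Append(new PlayRecord(play, choice)); // struct copied into the buffer
}
Console.WriteLine(log.Count);                 // prints 5

// A value type: storing one writes into the array, it does not create a heap object.
readonly struct PlayRecord
{
    public readonly int PlayIndex;
    public readonly int PlayType;
    public PlayRecord(int playIndex, int playType)
    {
        PlayIndex = playIndex; PlayType = playType;
    }
}

sealed class PlayLog
{
    private readonly PlayRecord[] _rows;      // pre-sized buffer
    public int Count { get; private set; }
    public PlayLog(int capacity) => _rows = new PlayRecord[capacity];
    public void Clear() => Count = 0;         // reuse between games
    public void Append(in PlayRecord row) => _rows[Count++] = row;
}

static class Predictor
{
    public static int SamplePlayType(ref ulong rng)
    {
        // The probability vector lives on the stack and vanishes on return.
        Span<float> probs = stackalloc float[3];
        probs[0] = 0.55f; probs[1] = 0.40f; probs[2] = 0.05f;

        float u = NextUnit(ref rng);
        float cumulative = 0f;
        for (int i = 0; i < probs.Length; i++)
        {
            cumulative += probs[i];
            if (u < cumulative) return i;
        }
        return probs.Length - 1;              // guard against rounding
    }

    // xorshift64 step mapped to [0, 1) using the top 24 bits.
    static float NextUnit(ref ulong s)
    {
        s ^= s << 13; s ^= s >> 7; s ^= s << 17;
        return (s >> 40) / (float)(1UL << 24);
    }
}
```

The only `new` on a reference type is the one-time `PlayLog` buffer; everything inside the loop is stack or in-place array writes.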

This is enforced, not hoped for. A sentinel test warms the engine up (so the runtime has fully optimised every hot method — on .NET, JIT-compiled to its top tier), then runs hundreds of games while measuring exactly how many bytes were allocated. The budget is a hard zero.
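
A sentinel of this shape can be sketched with `GC.GetAllocatedBytesForCurrentThread()`, which reports the bytes the calling thread has allocated so far. Everything else below is an assumption: `FakeGame` stands in for the engine plus its context, and the warm-up count is a guess, since a real test may need a longer or time-based warm-up before the JIT settles.

```csharp
using System;

var game = new FakeGame();                    // hypothetical stand-in for engine + context
long bytes = AllocationSentinel.MeasureBytes(game, warmUpGames: 2_000, measuredGames: 500);

// Hard zero: compare total bytes, not an average that could round a leak away.
if (bytes != 0)
    throw new InvalidOperationException($"hot path allocated {bytes} bytes over 500 games");
Console.WriteLine("allocation budget held");

sealed class FakeGame
{
    private int _home, _away;

    // Touches only pre-existing fields, so it should allocate nothing.
    public void Play()
    {
        _home = 0; _away = 0;
        for (int i = 0; i < 100; i++)
        {
            _home += i & 1;
            _away += (i >> 1) & 1;
        }
    }
}

static class AllocationSentinel
{
    public static long MeasureBytes(FakeGame game, int warmUpGames, int measuredGames)
    {
        // Warm-up: give the runtime a chance to compile the hot methods fully.
        for (int i = 0; i < warmUpGames; i++) game.Play();

        long before = GC.GetAllocatedBytesForCurrentThread();
        for (int i = 0; i < measuredGames; i++) game.Play();
        long after = GC.GetAllocatedBytesForCurrentThread();

        return after - before;
    }
}
```

In a test project the `throw` would be an assertion, so any stray allocation, boxing, or closure on the measured path fails CI.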

[Diagram: warm-up (JIT to top tier) ⟶ measured window (hundreds of games) ⟶ allocation budget of 0 B / game. Any stray allocation, boxing, or closure breaks the build.]
After warm-up, the measured window asserts a hard zero-bytes-per-game budget.
Enforced by
An allocation sentinel test with a hard 0 bytes/game budget. The invariant can't silently rot — a regression fails CI immediately.

3  ·  Determinism, independent of the hardware

Each world's randomness is derived from stable identifiers, so the same world produces the same game on any machine.

Randomness and parallelism are usually in tension: if worlds pulled from a shared random source, the result of any given world would depend on the order workers happened to run — which depends on how many CPUs the host has. That makes runs irreproducible.

The engine sidesteps this entirely. Each world's random stream is seeded from stable identifiers — the run's seed combined with the matchup and the world index — rather than from a shared counter or the wall clock. Because the seed is computed purely from which matchup and which world, world 7 of matchup 3 always plays out the same game, no matter how many cores ran the season or in what order.

Reproducibility is therefore a property of the identifiers, not of the schedule. You can run the same season on a laptop and a 64-core server and get byte-identical results.
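
As a sketch of what "seeded from stable identifiers" can look like: the function below folds the run seed, matchup id, and world index through the SplitMix64 finalizer. The choice of mixer and the names `SeedDerivation.ForWorld`, `matchupId`, and `worldIndex` are assumptions; the source only says the seed is a pure function of those identifiers.

```csharp
using System;

ulong a = SeedDerivation.ForWorld(runSeed: 42, matchupId: 3, worldIndex: 7);
ulong b = SeedDerivation.ForWorld(runSeed: 42, matchupId: 3, worldIndex: 7);
ulong c = SeedDerivation.ForWorld(runSeed: 42, matchupId: 3, worldIndex: 8);

Console.WriteLine(a == b);   // prints True: same identifiers, same seed
Console.WriteLine(a != c);   // prints True: a different world gets a different seed

static class SeedDerivation
{
    // Pure function: no clock, no shared counter, no thread id.
    public static ulong ForWorld(ulong runSeed, uint matchupId, uint worldIndex)
    {
        ulong h = Mix(runSeed);
        h = Mix(h ^ matchupId);
        h = Mix(h ^ worldIndex);
        return h;
    }

    // SplitMix64 finalizer: a bijective 64-bit mixing step.
    static ulong Mix(ulong z)
    {
        unchecked
        {
            z += 0x9E3779B97F4A7C15UL;
            z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9UL;
            z = (z ^ (z >> 27)) * 0x94D049BB133111EBUL;
            return z ^ (z >> 31);
        }
    }
}
```

Nothing in `ForWorld` depends on which thread calls it or when, so world 7 of matchup 3 gets the same stream on 4 cores or 64.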

[Diagram: run seed + matchup id · world id ⟶ derive seed (pure function) ⟶ per-world stream (this world only) ⟶ identical result on a laptop (4 cores) and a server (64 cores).]
The seed is a pure function of identifiers, so the same world yields the same game on any machine, in any order.
Key idea
Tie randomness to identity, not to execution order. Reproducibility then survives any amount of parallelism.

4  ·  Share everything, lock nothing

Workers own disjoint slices of the work, so no two threads ever touch the same mutable object — and locks become unnecessary.

A run spans tens of thousands of worlds. They're divided into contiguous ranges, one per worker, and each worker is single-threaded over its own range. The invariant that makes this safe is simple and absolute: no two threads ever modify the same object.

Each worker carries its own private working set — its own engine context, its own pool of player handles, its own memory pools. The only things shared across workers are either read-only (the invariants) or written at slots that belong exclusively to one worker (the next concept). There is no overlap by construction.

The payoff: there are no locks, no atomic operations, and no concurrent collections anywhere on the simulation path. They aren't optimised away — they were never needed. Eliminating the need for synchronisation is both faster and far simpler to reason about than synchronising correctly.
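
The partitioning can be sketched as follows. The helper `Partition.RangeFor` and the use of raw threads are assumptions for illustration; the source only states that worlds are split into contiguous, disjoint ranges with one single-threaded worker each.

```csharp
using System;
using System.Threading;

const int worldCount = 10;
const int workerCount = 4;
var touched = new int[worldCount];            // stands in for per-world results

var threads = new Thread[workerCount];
for (int w = 0; w < workerCount; w++)
{
    var (start, end) = Partition.RangeFor(w, workerCount, worldCount);
    // Setup code, not the hot path, so a capturing lambda is fine here.
    threads[w] = new Thread(() =>
    {
        // A real worker would create its private context and pools here.
        for (int world = start; world < end; world++)
            touched[world]++;                 // no lock: only this worker owns this index
    });
    threads[w].Start();
}
foreach (var t in threads) t.Join();

Console.WriteLine(string.Join(",", touched)); // prints 1,1,1,1,1,1,1,1,1,1

static class Partition
{
    // Half-open range [Start, End). The first (worldCount % workerCount)
    // workers take one extra world so every world is covered exactly once.
    public static (int Start, int End) RangeFor(int worker, int workerCount, int worldCount)
    {
        int baseSize = worldCount / workerCount;
        int extra = worldCount % workerCount;
        int start = worker * baseSize + Math.Min(worker, extra);
        int size = baseSize + (worker < extra ? 1 : 0);
        return (start, start + size);
    }
}
```

With 10 worlds and 4 workers the ranges are 0–2, 3–5, 6–7, and 8–9: every world is visited once and no index belongs to two workers.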

[Diagram: worlds 0 .. N, partitioned into disjoint ranges across Workers A, B, C, and D. Each worker has its own context, its own pools, and its own players.]
Disjoint world ranges, single-threaded workers, private state each — nothing mutable is ever shared.
Holds because
The partition has no overlap by construction, so the absence of synchronisation on the hot path (no locks, atomics, or concurrent collections) is a design guarantee, not a tuning choice.

5  ·  Shared-but-indexed objects

Collecting results needs no locks, because writes land in disjoint slots. Two layouts achieve that — a world-indexed shared store, or per-worker private stores stitched back together — and the world count decides which.

Every world produces a result, and they all have to be collected somewhere. The obvious wrong answer is one result object per world — tens of thousands of tiny objects to allocate and reassemble. There are two good answers, and both keep writes to disjoint slots, so both are lock-free.

Strategy A — one shared store, laid out as columns. Each column is a single array with one slot per world, and a world writes only its own slot: world i touches element i and nothing else. The store is shared, but because it's addressed by world index, two workers recording two different worlds write two different elements and never collide — no atomics, no locking, and nothing to stitch back together afterwards. Picture it as one 2D table: a row per world, a column per outcome family, each worker filling a contiguous block of rows (see below).
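
A minimal sketch of Strategy A, assuming a two-column store: the class name `OutcomeColumns` and the use of `Parallel.Invoke` to stand in for two workers are illustrative choices, not the engine's code.

```csharp
using System;
using System.Threading.Tasks;

var store = new OutcomeColumns(worldCount: 6);

// Two workers over disjoint world ranges: A takes 0..2, B takes 3..5.
Parallel.Invoke(
    () => { for (int w = 0; w < 3; w++) store.Record(w, homePts: 20 + w, awayPts: 10 + w); },
    () => { for (int w = 3; w < 6; w++) store.Record(w, homePts: 20 + w, awayPts: 10 + w); });

Console.WriteLine(string.Join(",", store.HomePts));   // prints 20,21,22,23,24,25

sealed class OutcomeColumns
{
    // One array per outcome column, one element per world.
    public readonly int[] HomePts;
    public readonly int[] AwayPts;

    public OutcomeColumns(int worldCount)
    {
        HomePts = new int[worldCount];
        AwayPts = new int[worldCount];
    }

    // World i writes element i of each column and nothing else,
    // so concurrent callers with different worlds never collide.
    public void Record(int world, int homePts, int awayPts)
    {
        HomePts[world] = homePts;
        AwayPts[world] = awayPts;
    }
}
```

The store is shared by both workers, yet `Record` needs no lock because the world index already decides who may write which element.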

Strategy B — a private store per worker (what we do today). Strategy A has one catch at our scale: every column must be a single array spanning every world. At 10,000–100,000 worlds per matchup, across dozens of outcome columns, that's a large block of memory that has to stay fully resident until the whole matchup finishes. So instead each worker owns a private store sized to just its slice of worlds. It fills the local slots of its own buffer and can flush and recycle that buffer the moment its slice of the matchup is done, which keeps every allocation small and the footprint flat.

The trade-off is reassembly. With a per-worker store, one matchup's results are split across however many workers ran it, so the full per-matchup picture has to be stitched back together in world order downstream. Today that means writing each worker's slice to a staging table and merging the slices per matchup in the analytics store (ClickHouse). Strategy A needed none of that. So the choice is a function of world count: at today's tens of thousands, per-worker partitioning wins; if the world count were small, the single shared store is simpler and needs no stitching, so it stays a documented strategy to revisit, not a discarded one. (At one worker, the two collapse: that single private store spans every world, which is Strategy A.)
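
The reassembly step can be illustrated in memory. This is only a sketch of the idea: the production merge happens in the analytics store, and `WorkerSlice` and `Stitcher.Merge` are hypothetical names. The sample values are the home_pts of worlds 0–5 from the table in this section.

```csharp
using System;

// Slices may arrive in any order; each records the first world it covers.
var slices = new[]
{
    new WorkerSlice(firstWorld: 3, homePts: new[] { 27, 20, 34 }),   // Worker B
    new WorkerSlice(firstWorld: 0, homePts: new[] { 24, 31, 13 }),   // Worker A
};

int[] merged = Stitcher.Merge(slices, worldCount: 6);
Console.WriteLine(string.Join(",", merged));   // prints 24,31,13,27,20,34

// A worker's private store: sized to its own slice, not to the whole matchup.
sealed class WorkerSlice
{
    public readonly int FirstWorld;
    public readonly int[] HomePts;             // local slot j holds world FirstWorld + j

    public WorkerSlice(int firstWorld, int[] homePts)
    {
        FirstWorld = firstWorld; HomePts = homePts;
    }
}

static class Stitcher
{
    // Places each slice at its world offset, restoring world order.
    public static int[] Merge(WorkerSlice[] slices, int worldCount)
    {
        var all = new int[worldCount];
        foreach (var s in slices)
            Array.Copy(s.HomePts, 0, all, s.FirstWorld, s.HomePts.Length);
        return all;
    }
}
```

The extra bookkeeping is the `FirstWorld` offset: it is what lets a small private buffer be mapped back to global world indices afterwards.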

Each worker fills its own rows top to bottom at its own pace — watch them finish at different times.
worker                  world  home_pts  away_pts  total_yds  pass_yds  turnovers
Worker A (worlds 0–2)       0        24        17        351       268          1
                            1        31        28        402       305          2
                            2        13        20        288       190          3
Worker B (worlds 3–5)       3        27        24        366       244          0
                            4        20        23        312       221          2
                            5        34        31        421       330          1
Worker C (worlds 6–8)       6        17        14        276       198          2
                            7        28        21        358       251          1
                            8        10        13        244       160          3
Worker D (worlds 9–11)      9        23        20        333       240          1
                           10        38        35        455       366          0
                           11        21        24        301       205          2
Rows = worlds, columns = outcome families. Each worker fills only its own contiguous block of rows (A → 0–2, B → 3–5, C → 6–8, D → 9–11). Whether that's one shared table with each worker writing disjoint rows (Strategy A) or each worker's private slice concatenated back in world order (Strategy B), no two workers ever write the same row — so no locks either way.
Key idea
"Shared" and "needs a lock" are not the same thing. Partition who writes where — disjoint slots of one buffer, or a private buffer per worker — and concurrent writers never contend. The only question left is whether you reassemble the pieces afterwards.

6  ·  Memory pooling

Backing arrays are rented from per-worker pools and returned, so a run reuses a handful of buffers instead of allocating tens of thousands.

Concept 2 keeps the per-game path at zero allocation, but a season still needs sizeable backing arrays for those outcome columns. Allocating them fresh, per matchup, would reintroduce exactly the GC pressure we removed. So those arrays come from pools:

  • Typed pools. Separate pools hand out the integer, boolean, and floating-point arrays the columns need.
  • Per-worker and lock-free. Each worker gets its own pool. Because the worker is single-threaded, that pool can be unsynchronised — no locking — and exact-size, handing back an array of precisely the requested length with no power-of-two rounding waste.
  • A simple lifecycle. Rent an array and clear it, fill it during the matchup, then return it to the pool when the matchup is done. The next matchup rents the same memory back.
  • Players too. Per-player handles are pooled the same way — minted once per worker and reused across every game it simulates, with their stats reset in place rather than reallocated.

Pooling converts "tens of thousands of allocations" into "rent the same handful of buffers over and over." It's the season-scale counterpart to the per-game zero-allocation rule.
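
The rent, clear, fill, return lifecycle can be sketched with a small exact-size pool. `ExactSizePool<T>` is a hypothetical class written for this example; it is deliberately unsynchronised, which is only safe because each instance is assumed to belong to one single-threaded worker. (The runtime's shared `ArrayPool<T>` is thread-safe and may hand back a larger array than requested, which is the rounding this design avoids.)

```csharp
using System;
using System.Collections.Generic;

var pool = new ExactSizePool<int>();          // one instance per worker, never shared

int[] first = pool.Rent(1000);
first[0] = 99;                                // fill during the matchup
pool.Return(first);                           // matchup done

int[] second = pool.Rent(1000);               // next matchup rents the same memory
Console.WriteLine(object.ReferenceEquals(first, second));   // prints True
Console.WriteLine(second.Length);                           // prints 1000
Console.WriteLine(second[0]);                               // prints 0

sealed class ExactSizePool<T>
{
    // Free arrays bucketed by exact length. No locks: single-threaded owner.
    private readonly Dictionary<int, Stack<T[]>> _free = new();

    public T[] Rent(int length)
    {
        if (_free.TryGetValue(length, out var stack) && stack.Count > 0)
        {
            T[] array = stack.Pop();
            Array.Clear(array, 0, array.Length);   // rent and clear
            return array;
        }
        return new T[length];                  // first request of this size only
    }

    public void Return(T[] array)
    {
        if (!_free.TryGetValue(array.Length, out var stack))
        {
            stack = new Stack<T[]>();
            _free[array.Length] = stack;
        }
        stack.Push(array);
    }
}
```

After the first matchup has populated the pool, steady-state rents allocate nothing, provided later matchups request the same sizes.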

[Diagram: per-worker pool (exact-size · lock-free) ⟶ rent & clear ⟶ fill (matchup) ⟶ return to pool ⟶ reused by the next matchup.]
Rent → clear → fill → return, then reuse: the same buffers serve matchup after matchup.
Key idea
Single-threaded ownership (concept 4) is what lets the pools be lock-free and exact-size — the parallelism model and the memory model reinforce each other.

7  ·  Why it stays true: auditable invariants

The performance properties survive ongoing change because each one reduces to a cheap review rule.

A fast design is only valuable if it stays fast as the code evolves. What keeps these properties intact is that each one collapses into a rule a reviewer (or a test) can check at a glance:

  • Any allocation on the per-game path is a red flag — in C#, a stray new X(). The zero-allocation sentinel catches it automatically; a reviewer catches the intent.
  • Any lock or atomic on the simulation path means the ownership model was violated. If you reach for synchronisation, two threads are touching the same object — back up and re-partition instead.
  • Randomness comes from derived seeds, never a shared source. A shared counter would couple a world's result to the schedule and break reproducibility.

These are review rules, not just good intentions — which is exactly why the engine's speed and determinism don't quietly erode over time.

Review red flags
an allocation on the hot path (in C#, new) · synchronisation in the simulation core (a lock or atomic) · randomness from a shared counter. Spotting any one means an invariant is about to break.