Play Decision Model Integration: Performance Comparison (LBS-1657)

This document records the measured cost of the Play Decision neural network at each step of its integration into the engine, against the pseudo predictor it replaced. The model itself — weights, outputs, sampled play calls — is identical at every neural-network step; the steps differ only in how the same network is executed. All numbers were measured back-to-back on the same host on 2026-06-04.

The intent is to make the cost of each decision visible at the scales the platform actually runs: one call, one game, one fixture priced across 10,000 and 100,000 parallel worlds, and one season.

Background

The pseudo predictor (LBS-1654) is a hand-written rule/probability table: a few branches and multiplies per call. It was the play-call predictor at the branch point.
The Play Decision network is the quants' PyTorch MLP (~17K parameters: two embedding lookups, concat, four Gemm/ReLU layers), trained and exported to ONNX independently of the engine. LBS-1657 integrated it as the first real model behind the Predictors seam, replacing the pseudo predictor outright. Behavioural parity with the PyTorch reference is pinned at every integration step by a 1024-case fixture (top-1 agreement 100%, logit differences < 1e-4, probability differences < 1e-5), and the LBS-1654 stat sanity harness passes with the model live.
The engine simulates one game in tens of microseconds and runs thousands of worlds per fixture, so per-call predictor cost multiplies directly into wall time. The engine's hot path allocates zero managed bytes per game (LBS-1698), a property each integration step had to preserve.
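The parity contract described above reduces to three numbers per fixture run. The sketch below is Python rather than the engine's C#, purely for brevity. It assumes the thresholds are maximum absolute differences, which the text does not state, and `parity_report` and the toy logits are hypothetical stand-ins for the real 1024-case fixture.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one logit vector."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def parity_report(reference_logits, candidate_logits):
    """Compare per-case logit vectors from a reference and a candidate executor."""
    top1_matches = 0
    max_logit_err = 0.0
    max_prob_err = 0.0
    for ref, cand in zip(reference_logits, candidate_logits):
        if argmax(ref) == argmax(cand):
            top1_matches += 1
        max_logit_err = max(max_logit_err,
                            max(abs(r - c) for r, c in zip(ref, cand)))
        ref_p, cand_p = softmax(ref), softmax(cand)
        max_prob_err = max(max_prob_err,
                           max(abs(r - c) for r, c in zip(ref_p, cand_p)))
    return {
        "top1_agreement": top1_matches / len(reference_logits),
        "max_logit_err": max_logit_err,
        "max_prob_err": max_prob_err,
    }


# Toy two-case fixture (made-up values, not the real fixture data).
reference = [[2.0, 0.5, -1.0], [0.1, 0.3, 0.2]]
candidate = [[2.00001, 0.49999, -1.0], [0.1, 0.30002, 0.2]]
report = parity_report(reference, candidate)

# Thresholds taken from the text; their exact definition is an assumption.
passes = (report["top1_agreement"] == 1.0
          and report["max_logit_err"] < 1e-4
          and report["max_prob_err"] < 1e-5)
```

Running the same report after every transport change is what makes "same outputs at every step" a checked property rather than a claim.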

Integration steps compared

Step	What it is
Pseudo (LBS-1654)	Rule-based predictor at the branch point. No neural network.
ONNX Runtime	First integration: ORT `InferenceSession`, pre-bound per-thread IOBinding, `IntraOpNumThreads=1`.
ONNX Runtime, tuned flags	Same, plus `EnableMemoryPattern=false`, `EnableCpuMemArena=false`.
Kernel, row-major	Hand-rolled forward pass over the same `.onnx` weights: one SIMD dot product per output row.
Kernel, SAXPY + zero-skip	Column-major accumulation; inputs that are exactly zero are skipped (32 of 33 one-hot inputs, ~half of post-ReLU activations).
Kernel, register-tiled (ships)	SAXPY + zero-skip with four-vector register tiles, so accumulators stay in registers across the input loop.

Every neural-network step executes the same embedded play_decision.onnx bytes and produces the same play-call distribution; the parity fixture is the contract throughout. The two ONNX Runtime steps and the three kernel steps are transport changes only.
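The difference between the row-major and SAXPY + zero-skip kernel shapes can be illustrated on one small layer. This is a Python sketch, not the shipped PlayDecisionKernel: the layer size and weights are made up, there is no SIMD, and register tiling is not expressible at this level, so it is omitted.

```python
def gemv_row_major(weights, bias, x):
    """Row-major: one dot product per output row. weights[o][i]."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def gemv_saxpy_zero_skip(weights_t, bias, x):
    """Column-major: add x[i] * column i into the outputs, skipping zero inputs.

    weights_t[i][o] is the transpose of the row-major layout.
    Returns the outputs and the number of input columns actually visited.
    """
    y = list(bias)
    visited = 0
    for xi, column in zip(x, weights_t):
        if xi == 0.0:
            continue  # a zero input contributes nothing to any output
        visited += 1
        for o, w in enumerate(column):
            y[o] += xi * w
    return y, visited


# Hypothetical 3-output, 4-input layer fed a one-hot input.
weights = [
    [0.5, -1.0, 2.0, 0.25],
    [1.5, 0.75, -0.5, 1.0],
    [-2.0, 0.5, 1.0, -0.25],
]
bias = [0.125, -0.25, 0.5]
weights_t = [list(column) for column in zip(*weights)]
one_hot = [0.0, 0.0, 1.0, 0.0]

dense = gemv_row_major(weights, bias, one_hot)
sparse, visited = gemv_saxpy_zero_skip(weights_t, bias, one_hot)
```

Both functions return the same outputs, but the second touches one weight column instead of four. On a 33-wide one-hot input the same skip removes 32 of 33 columns, which is the saving the table attributes to the zero-skip step.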

Measurement setup

Apple M5 Max (18 physical / 18 logical cores, 128 GB), macOS 26.4.1, .NET 10.0.7 (Arm64 RyuJIT AdvSIMD), BenchmarkDotNet v0.14.0.
Engine-scale cases run with the OutcomeSink bound to a pooled GameOutcomeStore — the per-game flush hot path — so recording cost is included, matching production shape.
Per-call, single-game and single-world-season cases are single-threaded. The multi-world cases run with degree of parallelism 18 (Environment.ProcessorCount, all cores): one pooled engine context per worker, worlds statically chunked across workers, each world folding into its own store slot — the same per-world fold pattern SeasonRunner uses in production.
Benchmarks: PlayDecisionPredictorBenchmark (per call), EngineGameBenchmark (1 game single-threaded; 1 game x 10,000 / 100,000 worlds across all cores), EngineSeasonBenchmark.SimulateSeasonOutcomeRecording (288-game season).
States reproduced from git: pseudo at c4514af7, ONNX Runtime at babdc16a; the flag-tuned ORT and the two intermediate kernel shapes were reconstructed as one-method variants of the shipped PlayDecisionKernel for this measurement.
Call counts measured with an instrumented 288-game run: the engine makes 142.8 play-call predictor calls per game with the model (41,137 per 288-game season; 144.9 and 41,723 with the pseudo predictor), at ~158 recorded plays per game.
BenchmarkDotNet standard errors are below 2% of the mean for every number shown. Per-call numbers drift run-to-run by roughly 5-10% (the shipped kernel has measured 554-596 ns across days); each table below is one same-day matrix.
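The multi-world shape described above (worlds statically chunked across workers, each world writing only its own store slot) can be sketched as follows. This is illustrative Python, not SeasonRunner: `simulate_world` is a hypothetical stand-in for one simulated game, and Python threads do not give real CPU parallelism, so the sketch shows the partitioning pattern only, not the performance.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_bounds(world_count, worker_count):
    """Split [0, world_count) into contiguous, near-equal ranges, one per worker."""
    base, extra = divmod(world_count, worker_count)
    bounds = []
    start = 0
    for worker in range(worker_count):
        size = base + (1 if worker < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def simulate_world(world_index):
    # Hypothetical stand-in for one simulated game; returns a fake outcome.
    return world_index * 2


def run_fixture(world_count, worker_count):
    # One slot per world: no two workers ever write the same slot,
    # so the fold needs no locking.
    store = [None] * world_count

    def work(bounds):
        start, stop = bounds
        for world in range(start, stop):
            store[world] = simulate_world(world)

    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(work, chunk_bounds(world_count, worker_count)))
    return store


bounds = chunk_bounds(10_000, 18)
store = run_fixture(world_count=10_000, worker_count=18)
```

With 10,000 worlds and 18 workers, the first ten chunks hold 556 worlds and the remaining eight hold 555.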

Results

1. Play decision cost: per call, per game, per season

Per call is measured (BenchmarkDotNet, full predictor path: features, forward pass, softmax, remap, sample). Per game and per season are derived from the measured call counts (142.8 calls/game for the model states, 144.9 for pseudo; 288-game season).

Step	Per call	Per game	Per season	vs pseudo
Pseudo (LBS-1654)	10.0 ns	1.5 us	0.42 ms	1x
ONNX Runtime	3,583 ns	512 us	147.4 ms	358x
ONNX Runtime, tuned flags	3,610 ns	516 us	148.5 ms	361x
Kernel, row-major	2,075 ns	296 us	85.4 ms	207x
Kernel, SAXPY + zero-skip	837 ns	119 us	34.4 ms	84x
Kernel, register-tiled (ships)	596 ns	85 us	24.5 ms	60x

Every step allocates zero managed bytes per call.

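The isolated per-call benchmark tends to understate in-engine cost, because in the engine other work competes for cache between calls. Taking the engine-scale delta as measured game time minus the ~13.6 us engine baseline, the deltas below run from roughly level with these derived figures (ONNX Runtime, row-major kernel) to ~13% above them (shipped kernel: ~96 us measured against 85 us derived). The deltas are the ground truth.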
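The derived columns in the table above are plain multiplication of the measured per-call time by the measured call counts. A short Python sketch of that derivation, using only numbers from this document (the dictionary layout and function name are illustrative):

```python
# Measured call counts from the instrumented 288-game run.
CALLS_PER_GAME = {"model": 142.8, "pseudo": 144.9}
CALLS_PER_SEASON = {"model": 41_137, "pseudo": 41_723}

# Measured per-call cost in nanoseconds, and which call count applies.
PER_CALL_NS = {
    "Pseudo (LBS-1654)": (10.0, "pseudo"),
    "ONNX Runtime": (3_583, "model"),
    "ONNX Runtime, tuned flags": (3_610, "model"),
    "Kernel, row-major": (2_075, "model"),
    "Kernel, SAXPY + zero-skip": (837, "model"),
    "Kernel, register-tiled (ships)": (596, "model"),
}


def derive(per_call_ns, kind):
    """Return (per-game microseconds, per-season milliseconds)."""
    per_game_us = per_call_ns * CALLS_PER_GAME[kind] / 1_000
    per_season_ms = per_call_ns * CALLS_PER_SEASON[kind] / 1_000_000
    return per_game_us, per_season_ms


derived = {step: derive(ns, kind) for step, (ns, kind) in PER_CALL_NS.items()}

for step, (game_us, season_ms) in derived.items():
    print(f"{step:32s} {game_us:8.1f} us/game {season_ms:8.2f} ms/season")
```

The output reproduces the table to within rounding of the inputs.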

2. Game, 1 world (OutcomeSink)

Step	One game	vs pseudo	vs ONNX Runtime
Pseudo (LBS-1654)	15.0 us	1x	—
ONNX Runtime	535.4 us	35.6x	1x
ONNX Runtime, tuned flags	534.4 us	35.5x	1.00x
Kernel, row-major	307.0 us	20.4x	1.7x faster
Kernel, SAXPY + zero-skip	141.2 us	9.4x	3.8x faster
Kernel, register-tiled (ships)	110.0 us	7.3x	4.9x faster

Allocated: 0 B at every step.

The engine without a play-call model costs ~13.6 us per game (pseudo baseline minus its predictor cost). With the shipped kernel, the play-call model accounts for ~96 us of the 110 us game — about 88% of single-game wall time. Under ONNX Runtime it was ~522 us of 535 us, about 97%.
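The share figures above come from subtracting the engine baseline from each measured game time. A Python sketch of that arithmetic, using the table values (function and variable names are illustrative):

```python
PSEUDO_GAME_US = 15.0
PSEUDO_PER_CALL_NS = 10.0
PSEUDO_CALLS_PER_GAME = 144.9

# Engine cost with no play-call model: pseudo game time minus its predictor cost.
pseudo_predictor_us = PSEUDO_PER_CALL_NS * PSEUDO_CALLS_PER_GAME / 1_000
engine_base_us = PSEUDO_GAME_US - pseudo_predictor_us


def model_share(game_us, base_us):
    """Return (microseconds attributable to the model, fraction of game time)."""
    model_us = game_us - base_us
    return model_us, model_us / game_us


kernel_us, kernel_share = model_share(110.0, engine_base_us)
ort_us, ort_share = model_share(535.4, engine_base_us)

print(f"engine baseline: {engine_base_us:.1f} us")
print(f"shipped kernel:  {kernel_us:.0f} us, {kernel_share:.0%} of the game")
print(f"ONNX Runtime:    {ort_us:.0f} us, {ort_share:.0%} of the game")
```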

3. Game, 10,000 worlds (OutcomeSink, 18 cores)

The production shape for pricing one fixture: the same matchup simulated into 10,000 world slots of the outcome store. The "one thread" column runs the worlds sequentially; the "18 cores" column runs the same work with degree of parallelism 18. The vs columns compare the 18-core numbers.

Step	One thread	18 cores	Parallel scaling	vs pseudo	vs ONNX Runtime
Pseudo (LBS-1654)	0.152 s	17.9 ms	8.5x	1x	—
ONNX Runtime	5.318 s	3.343 s	1.6x	187x	1x
ONNX Runtime, tuned flags	5.342 s	1.186 s	4.5x	66x	2.8x faster
Kernel, row-major	3.192 s	0.308 s	10.4x	17.2x	10.9x faster
Kernel, SAXPY + zero-skip	1.425 s	0.136 s	10.5x	7.6x	24.6x faster
Kernel, register-tiled (ships)	1.105 s	0.102 s	10.8x	5.7x	32.7x faster

The single-threaded runs scale linearly from one game (each lands within a few percent of 10,000x its single-game time) and allocate 0 B, as expected from a zero-allocation loop. The parallel runs allocate ~6-9 KB per pricing run — Parallel.For orchestration only; the per-world loop itself still allocates nothing.

Parallelism is where the transports diverge hardest. The kernel steps scale 10.4-10.8x on 18 cores (the kernel is a pure function over immutable weights — workers share nothing). ONNX Runtime as first integrated scales 1.6x: its CPU memory arena serialises the workers. Disabling the arena — the tuned-flags step, which measured as noise on one thread — buys 2.8x at 18 cores, and still scales no better than 4.5x. The pseudo predictor's 8.5x reflects orchestration overhead on very short games rather than contention.
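The parallel scaling column is the one-thread time divided by the 18-core time. The Python sketch below recomputes it from the table and adds parallel efficiency (scaling divided by core count), a derived figure that is not in the original table.

```python
CORES = 18

# (one-thread seconds, 18-core seconds) for the 10,000-world fixture.
RUNS = {
    "Pseudo (LBS-1654)": (0.152, 0.0179),
    "ONNX Runtime": (5.318, 3.343),
    "ONNX Runtime, tuned flags": (5.342, 1.186),
    "Kernel, row-major": (3.192, 0.308),
    "Kernel, SAXPY + zero-skip": (1.425, 0.136),
    "Kernel, register-tiled (ships)": (1.105, 0.102),
}


def scaling(one_thread_s, parallel_s):
    return one_thread_s / parallel_s


def efficiency(one_thread_s, parallel_s, cores=CORES):
    # Fraction of ideal linear speedup actually achieved.
    return scaling(one_thread_s, parallel_s) / cores


for step, (one, par) in RUNS.items():
    print(f"{step:32s} {scaling(one, par):5.1f}x  "
          f"{efficiency(one, par):5.0%} of ideal")
```

On these numbers ONNX Runtime as first integrated reaches about 9% of ideal scaling and the shipped kernel about 60%.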

4. Game, 100,000 worlds (OutcomeSink, 18 cores)

The same fixture at 10x the world count, all cores. Times are ~10x the 10,000-world column across the board — linear in worlds at a fixed core count.

Step	100,000 worlds	vs pseudo	vs ONNX Runtime
Pseudo (LBS-1654)	0.162 s	1x	—
ONNX Runtime	33.98 s	210x	1x
ONNX Runtime, tuned flags	11.56 s	71x	2.9x faster
Kernel, row-major	3.28 s	20.3x	10.4x faster
Kernel, SAXPY + zero-skip	1.38 s	8.6x	24.6x faster
Kernel, register-tiled (ships)	1.03 s	6.4x	32.9x faster
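The linear-in-worlds claim can be checked by dividing each 100,000-world time by the matching 18-core time from the 10,000-world table. A Python sketch using the two tables' values:

```python
# 18-core wall times in seconds.
TEN_K_S = {
    "Pseudo (LBS-1654)": 0.0179,
    "ONNX Runtime": 3.343,
    "ONNX Runtime, tuned flags": 1.186,
    "Kernel, row-major": 0.308,
    "Kernel, SAXPY + zero-skip": 0.136,
    "Kernel, register-tiled (ships)": 0.102,
}
HUNDRED_K_S = {
    "Pseudo (LBS-1654)": 0.162,
    "ONNX Runtime": 33.98,
    "ONNX Runtime, tuned flags": 11.56,
    "Kernel, row-major": 3.28,
    "Kernel, SAXPY + zero-skip": 1.38,
    "Kernel, register-tiled (ships)": 1.03,
}

# A ratio of 10 means perfectly linear scaling in world count.
ratios = {step: HUNDRED_K_S[step] / TEN_K_S[step] for step in TEN_K_S}

for step, ratio in ratios.items():
    print(f"{step:32s} {ratio:5.1f}x")
```

Every ratio falls between 9x and 11x. The pseudo predictor sits lowest, at about 9x.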

5. Season, 1 world (OutcomeSink, 288 games)

Step	One season	vs pseudo	vs ONNX Runtime
Pseudo (LBS-1654)	4.29 ms	1x	—
ONNX Runtime	155.4 ms	36.2x	1x
ONNX Runtime, tuned flags	153.8 ms	35.9x	1.01x faster
Kernel, row-major	85.5 ms	19.9x	1.8x faster
Kernel, SAXPY + zero-skip	40.8 ms	9.5x	3.8x faster
Kernel, register-tiled (ships)	31.9 ms	7.4x	4.9x faster

Allocated: 0 B at every step.

Observations

The first integration multiplied engine cost by ~36x on one thread — and ~190x at production parallelism. The network's arithmetic is ~34K floating-point operations per call — under a microsecond of math — but ONNX Runtime's fixed per-call harness (native interop, graph-executor walk, kernel dispatch) put the realized cost at ~3.6 us per call, and its CPU memory arena serialised the engine's 18 workers on top (1.6x parallel scaling, against the kernel's 10.8x).
The configuration-level remedy was worth ~1% on one thread and 2.8x at 18 cores. Tuning ORT session flags — the cheap, documented knob — measured as noise in every single-threaded case; disabling the memory arena only shows up under parallelism, and even then the path scales no better than 4.5x on 18 cores. Recovering the rest meant replacing the transport: a custom ONNX parser plus a hand-rolled, SIMD, sparsity-aware forward pass, built and validated against the parity fixture across three iterations.
That engineering recovered 4.9x single-threaded and 32.7x at production parallelism, without touching the model. Same .onnx bytes, same outputs, same sampled plays at every step — the difference between 3.3 s and 0.10 s per 10,000-world pricing run was entirely in how the model was executed, not in what the model computes.
The remaining ~6-7x over the pseudo predictor is the model itself. After the transport work, play-call inference still accounts for ~88% of engine wall time per game. This is the genuine price of the network's capacity at the engine's call rate (~143 calls per game), and it now sets the floor on simulation throughput. The known next lever — batching inference across worlds — changes both the export shape and the engine's world loop.
The zero-allocation invariant survived every step, but not by default: the ORT path needed per-thread pre-bound IOBindings to get there, and the kernel achieves it by construction (stackalloc scratch, no per-call state).
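The ~34K operations figure in the first observation is consistent with counting one multiply and one add per parameter. The Python sketch below makes that assumption explicit and converts each measured per-call time into a dense-equivalent arithmetic rate. Two caveats: the measured time covers the full predictor path (features, softmax, remap, sample), and the zero-skip kernels perform fewer real operations than the dense count, so these rates describe the cost per nominal operation rather than hardware throughput.

```python
PARAMETERS = 17_000        # ~17K, from the model description
FLOPS_PER_PARAMETER = 2    # assumption: one multiply and one add per weight

flops_per_call = PARAMETERS * FLOPS_PER_PARAMETER


def dense_equivalent_gflops(per_call_ns):
    # Operations per nanosecond is numerically equal to GFLOP/s.
    return flops_per_call / per_call_ns


ort_rate = dense_equivalent_gflops(3_583)
kernel_rate = dense_equivalent_gflops(596)

print(f"{flops_per_call} nominal operations per call")
print(f"ONNX Runtime:   {ort_rate:.1f} dense-equivalent GFLOP/s")
print(f"shipped kernel: {kernel_rate:.1f} dense-equivalent GFLOP/s")
```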

Notes for future model handoffs

Facts from this integration that were knowable before it started, recorded here as checklist material for the next model:

The engine's per-call budget was ~10 ns; the model arrived at ~3,600 ns as first integrated. Neither figure was on the table when the model was designed, trained, or exported. A stated per-call budget (or even the engine's calls-per-game and worlds-per-fixture numbers) would have made the cost conversation possible at design time rather than after integration.
Most of the integration cost was recoverable, but only by specialised engineering work: profiling, a transport rewrite, and three measured kernel iterations. The standard runtime's tuning surface offered ~1% on one thread and 2.8x at 18 cores, against the 4.9x and 32.7x the transport rewrite recovered.
The irreducible model cost (~60x the incumbent per call) only becomes visible at engine scale. Per-call microseconds read as negligible in isolation; at 143 calls per game and 10,000 worlds per fixture they are seconds per pricing run. Model capacity and simulation throughput trade off against each other and are cheapest to weigh jointly, before architecture and export shape are fixed.
Runtime constraints shape what is integrable. There are three: zero managed allocation per game; thread-safe single-sample inference at batch=1 that actually scales across the engine's parallel worlds; and, for the eventual batching lever, an export whose input shape the engine can actually feed. The second is the easiest to miss: a runtime whose allocator serialises 18 workers fails it even when each call is fast, and that failure is invisible in single-threaded measurement. These constraints existed before the model did; they cost nothing to state up front.
Artifact provenance needed reconstruction after the fact — the normalization stats were recovered from the quants' repo history rather than handed over alongside the exported model (see the integration contract for the open confirmation). A model handoff that includes weights, normalization, a parity fixture, and training-encoding notes removes a class of integration risk outright.
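A stated per-call budget can be turned into an estimated pricing-run time, and back again, with a few lines of arithmetic. The Python sketch below is one way to frame that conversation at design time. Both function names are hypothetical, and the model is deliberately simple: it uses the ~13.6 us engine baseline and the measured parallel scaling from this document, and it inherits the isolated benchmark's tendency to understate in-engine cost.

```python
def pricing_run_seconds(per_call_ns, calls_per_game, engine_base_us,
                        worlds, parallel_scaling):
    """Estimate wall time to price one fixture across `worlds` worlds."""
    game_us = engine_base_us + per_call_ns * calls_per_game / 1_000
    one_thread_s = game_us * worlds / 1_000_000
    return one_thread_s / parallel_scaling


def max_per_call_ns(target_seconds, calls_per_game, engine_base_us,
                    worlds, parallel_scaling):
    """Invert the estimate: largest per-call cost that meets a target time."""
    game_budget_us = target_seconds * parallel_scaling * 1_000_000 / worlds
    return (game_budget_us - engine_base_us) * 1_000 / calls_per_game


# Shipped kernel and first ORT integration, 10,000 worlds.
kernel_est = pricing_run_seconds(596, 142.8, 13.6, 10_000, 10.8)
ort_est = pricing_run_seconds(3_583, 142.8, 13.6, 10_000, 1.6)

# Round trip: the budget that yields the kernel estimate is the kernel's cost.
kernel_budget = max_per_call_ns(kernel_est, 142.8, 13.6, 10_000, 10.8)

print(f"shipped kernel estimate: {kernel_est:.3f} s (measured 0.102 s)")
print(f"ONNX Runtime estimate:   {ort_est:.2f} s (measured 3.343 s)")
```

The estimates land near the measured figures, with the kernel about 10% low. That is close enough to tell, before integration, whether a candidate model is within an order of magnitude of the budget.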

Reproduction

# Per call (run in the tree under measurement)
dotnet run -c Release --project src/Models/AmericanFootball/LBS.Model.AmericanFootball.Benchmarks \
  -- --filter '*PlayDecisionPredictorBenchmark*'

# Game x 1 world (single-threaded) and game x 10,000 / 100,000 worlds (all cores)
dotnet run -c Release --project src/Models/AmericanFootball/LBS.Model.AmericanFootball.Benchmarks \
  -- --filter '*EngineGameBenchmark*'

# Season x 1 world (OutcomeSink)
dotnet run -c Release --project src/Models/AmericanFootball/LBS.Model.AmericanFootball.Benchmarks \
  -- --filter '*SimulateSeasonOutcomeRecording*'

States: pseudo at commit c4514af7, ONNX Runtime at babdc16a, shipped kernel on the LBS-1657 branch. The flag-tuned ORT step is babdc16a plus EnableMemoryPattern=false / EnableCpuMemArena=false on the SessionOptions; the row-major and untiled-SAXPY steps are single-method Gemv variants of the shipped PlayDecisionKernel. Call counts came from a temporary counter on PlayDecisionPredictor.Sample over an instrumented 288-game run. The slowest ORT multi-world runs used --warmupCount 2 --iterationCount 6; everything else ran BenchmarkDotNet defaults (standard errors stayed under ~2-6% of the mean throughout). If BenchmarkDotNet reports duplicate project names, run it from inside the Benchmarks project directory (it scans the working directory's subtree for the project file, and worktrees under the repo root confuse it).