OutcomeContext: Per-Phase Breakdown

Companion to the executive roadmap. The roadmap names nine phases on a single page; this doc breaks each phase into scope, definition of done, dependencies, rough sizing, and the Linear tickets that already cover slices of it.

Sizing is rough: S ≈ a few days, M ≈ 1–2 weeks, L ≈ 3–6 weeks, XL ≈ a quarter or more. These are engineering shapes, not commitments — every phase still needs its own planning pass before it's picked up.

Phase 1: Bitvector evaluation engine

Business outcome: boolean expressions (the most common query shape) evaluate in single-digit ms instead of tens-to-hundreds of ms.

Scope (in):
- Inner ring under PostfixEvaluator that packs 64 world results per ulong for boolean leaves and walks AND / OR / NOT via bitwise ops.
- Per-leaf threshold pre-computation: a leaf like PASSING_YARDS >= 250 is reduced to a 100K-bit vector once, then composed with other vectors in O(N/64) instead of O(N).
- Fast popcount for matchingWorlds.
- Fallback to the existing array-walk evaluator for numeric / mixed expressions (delta distributions, mean / median outputs).
- Benchmark harness in LBS.OutcomeContext.Query.Tests (or sibling benchmark project) that asserts the new path is no slower than the array path on small inputs and meaningfully faster on the typical boolean shape.
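The packing and composition described in the scope above can be sketched as follows. This is a minimal illustration in Python for brevity, not the PostfixEvaluator implementation (which is .NET and would use ulong words and a hardware popcount). The function names and the sample per-world arrays are assumptions made for the example.

```python
from typing import Callable, List, Sequence

WORD_BITS = 64
WORD_MASK = (1 << WORD_BITS) - 1


def pack_leaf(values: Sequence[float], predicate: Callable[[float], bool]) -> List[int]:
    """Evaluate one boolean leaf per world and pack 64 results into each word."""
    words = [0] * ((len(values) + WORD_BITS - 1) // WORD_BITS)
    for world, value in enumerate(values):
        if predicate(value):
            words[world // WORD_BITS] |= 1 << (world % WORD_BITS)
    return words


def bv_and(a: List[int], b: List[int]) -> List[int]:
    # One bitwise op per 64 worlds: O(N/64) word operations.
    return [x & y for x, y in zip(a, b)]


def bv_or(a: List[int], b: List[int]) -> List[int]:
    return [x | y for x, y in zip(a, b)]


def bv_not(a: List[int], world_count: int) -> List[int]:
    out = [~x & WORD_MASK for x in a]
    tail = world_count % WORD_BITS
    if tail and out:
        out[-1] &= (1 << tail) - 1  # clear padding bits past the last world
    return out


def popcount(words: List[int]) -> int:
    """Count matching worlds across all words."""
    return sum(bin(w).count("1") for w in words)


# Hypothetical per-world stats (6 worlds here; 100K in the real shape).
passing_yards = [300, 200, 260, 240, 251, 100]
passing_tds = [3, 1, 1, 2, 2, 0]

yards_leaf = pack_leaf(passing_yards, lambda v: v >= 250)  # computed once
tds_leaf = pack_leaf(passing_tds, lambda v: v >= 2)        # computed once
matching_worlds = popcount(bv_and(yards_leaf, tds_leaf))
```

Note the padding mask in `bv_not`: without it, NOT would set bits for worlds that do not exist whenever the world count is not a multiple of 64, and popcount would over-report.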

Out of scope:
- Persisting pre-computed bitvectors to storage: that's a phase 8 optimisation, not a phase 1 deliverable.
- Cross-context bitvector composition (already covered by the existing per-world index alignment rule).

Definition of done:
- Unit tests cover all 15 binary + 3 unary operators against the array-path evaluator (parity required).
- Benchmark shows ≥5× improvement on the canonical "Same Game compound AND" shape at 100K worlds.
- The evaluator is selected automatically based on whether the top-level expression is boolean.

Dependencies: none. Can start now.

Sizing: M (1–2 weeks).

Tickets: LBS-1361.

Phase 2: Historical data ingest

Business outcome: real past outcomes live in the same shape as simulated outcomes, so any "how accurate were we?" claim has data to compare against.

Scope (in):
- Decide the historical scope: how many seasons back, which sport(s). This is the single biggest sizing input for ClickHouse storage cost, so it's the first decision the phase has to settle.
- ESPN (or alternative) historical schedule + roster + per-game box score ingest into a new historical_game_outcome_context table or partition.
- Decide between a single canonical OC table per scope-type with a kind column (SIMULATED vs HISTORICAL) and parallel tables. This choice drives every query that wants to overlay prediction onto actual.
- Cohort identity story: are 2023 Mahomes and 2025 Mahomes the same participantId? A stable identity model is needed before mass ingest.

Out of scope:
- Live in-game feed ingest (that's phase 4).
- The actual prediction-vs-actual UX (that's the phase 3 prototype).
- Multi-sport ingest beyond the first chosen scope (covered by phase 5 generalisation).

Definition of done:
- One full historical season ingested end-to-end and queryable via gameContext / seasonContext using the same outcome IDs the simulator produces.
- Storage sizing measurement: bytes-per-season-historical recorded so multi-season + multi-sport extrapolation is grounded.
- Identity story documented (decision recorded in docs/adr/).
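Once bytes-per-season has been measured, the extrapolation is simple arithmetic. The sketch below (Python for brevity) assumes storage scales linearly with seasons and sports, which is itself an assumption to validate; the function name is hypothetical and the 2 GiB figure is a placeholder, not a measurement.

```python
def extrapolate_storage_bytes(measured_bytes_per_season: int, seasons: int, sports: int = 1) -> int:
    """Linear extrapolation from one measured season (assumes similar size per sport)."""
    if measured_bytes_per_season < 0 or seasons < 0 or sports < 0:
        raise ValueError("inputs must be non-negative")
    return measured_bytes_per_season * seasons * sports


GIB = 1024 ** 3
placeholder_measurement = 2 * GIB  # placeholder only; replace with the recorded figure
projected = extrapolate_storage_bytes(placeholder_measurement, seasons=5, sports=2)
```

A second sport with a different schedule length or roster size would break the linear assumption, so a per-sport measurement is the safer input once one exists.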

Dependencies:
- Product / engineering decision on historical scope (this is a scope decision, not an engineering one; see "Critical dependencies" in the roadmap).
- Reliable historical data source (ESPN, alternative, or paid feed).

Sizing: L (3–6 weeks), driven mostly by scope decision and data-provider integration.

Tickets: none yet — first to raise when this phase starts.

Phase 3: Prediction-vs-actual prototype

Business outcome: settles, before any significant build, how customers will see "this is what we predicted, here's what actually happened", without requiring everyone to agree on a UX in advance.

Scope (in):
- A throwaway-grade end-to-end slice: pick one season already ingested in phase 2, one canonical question (e.g. "P(Mahomes ≥ 30 passing TDs for the season)"), surface both prediction distribution and the actual value via the same GraphQL surface.
- Compare-mode contract: does evaluate return an actual-bucket alongside the prediction distribution? Does the API surface a separate historicalContext query? This is what the prototype is for: pick one shape, ship it, see if it sticks.
- Stakeholder review with whoever owns the product surface (likely Daisy + product); the output is a "this is the shape we'll build for real in phase 4+" decision.
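One of the candidate compare-mode shapes (evaluate returning the actual alongside the prediction) could look roughly like this. It is a sketch of one option, in Python for brevity, not a decided contract; `CompareResult`, `evaluate_with_actual`, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CompareResult:
    probability: float             # share of simulated worlds where the outcome holds
    actual_value: Optional[float]  # None when no historical value exists yet
    actual_hit: Optional[bool]     # did the outcome happen in reality?


def evaluate_with_actual(world_values: Sequence[float], threshold: float,
                         actual_value: Optional[float] = None) -> CompareResult:
    """Evaluate 'value >= threshold' over simulated worlds and attach the actual."""
    matching = sum(1 for v in world_values if v >= threshold)
    probability = matching / len(world_values)
    hit = None if actual_value is None else actual_value >= threshold
    return CompareResult(probability, actual_value, hit)


# Hypothetical season passing-TD totals across five simulated worlds.
result = evaluate_with_actual([28, 31, 35, 29, 30], threshold=30, actual_value=27)
```

Keeping the actual fields optional lets the same shape serve future fixtures, where only the prediction exists.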

Out of scope:
- Production-quality code. The prototype is allowed to be ugly.
- Calibration metrics themselves (those are phase 7).

Definition of done:
- One sample question demoable end-to-end against a real historical season.
- Written decision on the API contract for prediction-vs-actual.
- Lessons captured in docs/ so the prototype isn't lost when the branch is closed.

Dependencies: phase 2 (at least one historical season ingested).

Sizing: S (a few days for the slice + a stakeholder session).

Tickets: none yet.

Phase 4: Live data ingest

Business outcome: the simulator runs against current rosters, injury status, and (eventually) in-game state — not the snapshot the prototype was built against.

Scope (in):
- Live roster sync (daily or on-change cadence). Source TBD; depends on the commercial provider chosen.
- Injury / inactive status feed at fixture-eligibility granularity.
- Scheduled-vs-completed game state: knowing a game is live, final, or yet-to-play affects what gameContext / seasonContext return for that fixture.
- Decide cadence: is the simulator re-run nightly? On every roster change? On every game close? Affects ClickHouse data freshness story and is the input to the data-version semantics (LBS-1364).

Out of scope:
- In-play / play-by-play live feed (separate, much heavier effort; not required for season-level prediction).
- The actual provider commercial / contractual side (engineering unblocks once provider is chosen).

Definition of done:
- Live feed wired to a re-simulation trigger; one full end-to-end run has executed against fresh-as-of-today inputs.
- Stale-data semantics defined: how does a query know whether the simulation it's reading is current?
- Feed reliability runbook (what happens when the feed lags or fails).
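One possible answer to the stale-data question is to compare the timestamp of the inputs a simulation was built from against the latest change seen on the feed. The sketch below (Python for brevity) assumes both timestamps are available; the function name, parameters, and tolerance are hypothetical, and the real semantics are what this phase has to define.

```python
from datetime import datetime, timedelta, timezone


def is_simulation_stale(inputs_as_of: datetime, latest_feed_change: datetime,
                        tolerance: timedelta = timedelta(0)) -> bool:
    """Stale when the feed changed after the simulation's inputs were captured,
    by more than the allowed tolerance."""
    return latest_feed_change - inputs_as_of > tolerance


simulated_from = datetime(2025, 1, 1, 6, 0, tzinfo=timezone.utc)
roster_changed = datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc)

stale_now = is_simulation_stale(simulated_from, roster_changed)
stale_with_slack = is_simulation_stale(simulated_from, roster_changed, timedelta(hours=4))
```

The tolerance is where the cadence decision shows up: a nightly re-run implies a tolerance of up to a day, an on-change re-run implies close to zero.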

Dependencies:
- Commercial decision on data provider (parallel track).
- Phase 5 (real model) ideally landed first; otherwise live feeds fire a placeholder simulator, which adds work for no signal.

Sizing: L (3–6 weeks), provider-integration-heavy.

Tickets: none yet.

Phase 5: Real model integration

Business outcome: the placeholder pseudo-simulator is replaced with the data-science team's predictive model. From here on, "accuracy" is a question the system can answer for real.

Scope (in):
- DS-team-provided model (form TBD: Python service, ONNX binary, REST endpoint, in-process .NET binding) replaces LBS.Model.AmericanFootball.Simulation.GameEngine / SeasonEngine / PlayoffEngine.
- Contract negotiation with DS: what does the model produce per game? Per drive? Per play? Does it produce world-indexed samples directly, or a parametric distribution we sample from?
- Sport-agnostic generalisation (catalyst for catching architecture gaps): wire a second sport (likely NRL or NBA) through the same contracts so any single-sport assumptions surface and get fixed.
- Determinism / RNG seed strategy: reproducible runs given the same inputs are a precondition for phase 7's accuracy claims.
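One common way to get reproducible runs is to derive an independent seed per (run, fixture, world) from a hash, so a world's result does not depend on the order or parallelism of the run. This is a sketch of that general technique in Python, not the agreed determinism contract; `world_seed`, `simulate_world`, and the Gaussian draw standing in for the model are hypothetical.

```python
import hashlib
import random


def world_seed(run_seed: int, fixture_id: str, world_index: int) -> int:
    """Derive a stable 64-bit seed from the run seed, fixture, and world index."""
    material = f"{run_seed}:{fixture_id}:{world_index}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(material).digest()[:8], "big")


def simulate_world(run_seed: int, fixture_id: str, world_index: int) -> float:
    rng = random.Random(world_seed(run_seed, fixture_id, world_index))
    # Placeholder for a real model output (e.g. passing yards in this world).
    return rng.gauss(250.0, 60.0)


# Same inputs give the same sample, regardless of evaluation order.
forward = [simulate_world(42, "fixture-1", w) for w in range(5)]
backward = [simulate_world(42, "fixture-1", w) for w in reversed(range(5))]
```

If the DS model owns its own RNG, the equivalent contract would be that it accepts the derived seed as an input rather than seeding itself.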

Out of scope:
- Building the model itself (that's the DS team).
- Multi-model ensemble / A-B serving (that's a phase 7+ concern).

Definition of done:
- Real model in production-grade integration, replacing the placeholder.
- One full simulation run executed against the real model at 100K worlds and confirmed structurally correct (same OC shape, same query surface).
- Second sport wired through to at least the storage layer (full query path optional).
- Determinism contract documented and tested.

Dependencies:
- DS team has a production-grade model. This is the single biggest external risk in the whole plan.

Sizing: XL. Has the widest scope and the most external dependency. Sport-agnostic generalisation alone is L.

Tickets: none yet (gated on DS readiness).

Phase 6: Progressive season simulation

Business outcome: mid-season prediction — answer "what's the chance of X" after week 8 conditional on weeks 1–7's actuals. The core product use case.

Scope (in):
- Re-simulation primitive that takes a "world state up to week N" snapshot and projects from there. Distinct from the current "simulate the whole season from scratch" runner.
- Storage shape for partial-season state: which weeks are actual, which weeks are simulated, where the boundary sits. This needs to compose with phase 4's live feed.
- Query surface: does seasonContext flag which fixtures are resolved vs simulated? Does the customer see a single probability or a "P(X | through week N)" breakdown?
- Re-run cadence: weekly? On every game close?
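The re-simulation primitive can be illustrated with a toy model: results for played weeks are fixed to their actuals, and only the remaining weeks are simulated. The sketch below is Python for brevity and uses a flat per-game win probability as a stand-in for the real model; the function name and all parameters are hypothetical.

```python
import random
from typing import Sequence


def p_wins_at_least(actual_results: Sequence[bool], total_weeks: int, target_wins: int,
                    win_prob: float, worlds: int = 10_000, seed: int = 0) -> float:
    """Estimate P(season wins >= target | actual results through week N)."""
    rng = random.Random(seed)            # fixed seed keeps the estimate reproducible
    banked = sum(actual_results)         # wins already resolved by actual games
    remaining = total_weeks - len(actual_results)
    hits = 0
    for _ in range(worlds):
        simulated = sum(rng.random() < win_prob for _ in range(remaining))
        if banked + simulated >= target_wins:
            hits += 1
    return hits / worlds


# After week 7 with a 4-3 record: chance of reaching 9 wins in a 17-week season.
through_week_7 = [True, False, True, True, False, True, False]
estimate = p_wins_at_least(through_week_7, total_weeks=17, target_wins=9, win_prob=0.5)
```

As more weeks resolve, the simulated share shrinks and the estimate converges to 0 or 1, which is the week-on-week behaviour the definition of done asks for.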

Out of scope:
- Intra-game progressive simulation (way out of scope; that's a different product entirely).

Definition of done:
- Mid-season run executes against a known partial-season input and produces probabilities that change appropriately week-on-week.
- Sample query exists for the canonical mid-season question.

Dependencies: phase 4 (live data), phase 5 (real model).

Sizing: L (3–6 weeks), more if the model integration adds constraints we haven't predicted.

Tickets: none yet.

Phase 7: Foundation + accuracy validation

Business outcome: observability is in place; predictions are measured against reality; calibration story is documented and defensible.

Scope (in):
- Observability foundation (LBS-1363): structured logging, metrics, distributed tracing, operator dashboard, run-failure alert.
- Data-version / epoch semantics (LBS-1364): retires the read-time SELECT FINAL workaround, enables atomic cutover between simulation runs, supports historical-version queries.
- Outcome template registry completion (LBS-1365): closes the three known coverage gaps in AmericanFootballOutcomeTemplates and re-enables the skipped TemplatesCoverRealSimulationOutput test so future drift causes CI failure.
- Calibration metrics: define what "accurate" means in numbers (Brier score, log-loss, reliability diagram bucketing). Apply against the historical seasons ingested in phase 2.
- Year-over-year stability claims: can we say "this model's errors are consistent across years"? Needs at least 2 historical seasons.
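The three calibration measures named above have standard definitions, sketched here in Python for brevity. Inputs are predicted probabilities and binary outcomes (1 if the event happened); the function names and the equal-width bucketing choice are assumptions, not a decided methodology.

```python
import math
from typing import List, Sequence, Tuple


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted probability and outcome (lower is better)."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def log_loss(probs: Sequence[float], outcomes: Sequence[int], eps: float = 1e-15) -> float:
    """Mean negative log-likelihood; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for p, o in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total -= o * math.log(p) + (1 - o) * math.log(1.0 - p)
    return total / len(probs)


def reliability_buckets(probs: Sequence[float], outcomes: Sequence[int],
                        n_buckets: int = 10) -> List[Tuple[float, float, int]]:
    """Per non-empty bucket: (mean predicted probability, observed frequency, count)."""
    sums = [[0.0, 0, 0] for _ in range(n_buckets)]
    for p, o in zip(probs, outcomes):
        i = min(int(p * n_buckets), n_buckets - 1)  # p == 1.0 falls in the last bucket
        sums[i][0] += p
        sums[i][1] += o
        sums[i][2] += 1
    return [(s[0] / s[2], s[1] / s[2], s[2]) for s in sums if s[2]]
```

A well-calibrated model has observed frequency close to mean predicted probability in every bucket; sparse buckets should be read with their counts in mind.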

Out of scope:
- Performance optimisation (phase 8): accuracy first, speed second.
- Cost-per-query observability (phase 7/8 overlap; listed under the roadmap's known gaps).

Definition of done:
- Observability deployed; one operator can answer "is the system healthy?" from a dashboard.
- A calibration report exists for at least one historical season.
- Version / epoch cutover demonstrated under simulated read load.

Dependencies:
- Phase 2 (historical data) for calibration to mean anything.
- Phase 5 (real model) for accuracy to mean anything.

Sizing: L (3–6 weeks across the three sub-streams).

Tickets: LBS-1363, LBS-1364, LBS-1365.

Phase 8: Performance

Business outcome: the ≤200ms latency contract is met at production load, not just for one warm query.

Scope (in):
- Expression-result cache (LBS-1362): FusionCache in-process L1 keyed on expressionHash; optional Redis L2 added when multi-replica QueryApi deploys demand it.
- In-region latency benchmark (LBS-1359 / LBS-1360): the canonical Same Game and Same Comp queries measured from an in-region client, asserted in CI.
- Pre-computed bitvectors for hot outcomes: write-time pre-aggregation of common boolean shapes. Decided by what the observability data shows is actually hot.
- Query-cost / fair-use design (roadmap known gap): bounded query shapes (basket-size limits, expression-depth limits, evaluator budget) so a single abusive query can't pin the cluster.
- Multi-region read replicas if cross-region client volume warrants (decided by phase 7 observability, not assumed up-front).
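The bounded-query-shape idea can be sketched as a pre-flight check that rejects an expression before any evaluation work is done. This is an illustration in Python for brevity; the tuple-based expression tree, the limit values, and every name here are hypothetical, since the actual rules are a known design gap.

```python
from typing import Tuple, Union

# Hypothetical expression tree: a leaf is an outcome ID string,
# a node is (operator, operand, operand, ...).
Expr = Union[str, Tuple]

MAX_DEPTH = 8    # assumption: expression-depth limit
MAX_LEAVES = 20  # assumption: basket-size limit


class QueryTooExpensive(Exception):
    """Raised before evaluation when a query exceeds the configured bounds."""


def depth(expr: Expr) -> int:
    if isinstance(expr, str):
        return 1
    return 1 + max(depth(child) for child in expr[1:])


def leaf_count(expr: Expr) -> int:
    if isinstance(expr, str):
        return 1
    return sum(leaf_count(child) for child in expr[1:])


def check_query_cost(expr: Expr) -> None:
    if depth(expr) > MAX_DEPTH:
        raise QueryTooExpensive(f"expression depth {depth(expr)} exceeds {MAX_DEPTH}")
    if leaf_count(expr) > MAX_LEAVES:
        raise QueryTooExpensive(f"basket size {leaf_count(expr)} exceeds {MAX_LEAVES}")


typical = ("AND", "PASSING_YARDS_GE_250", ("NOT", "INTERCEPTIONS_GE_2"))
check_query_cost(typical)  # passes: depth 3, two leaves
```

An evaluator budget (time or operation count) would need a runtime check as well, since shape limits alone do not bound the cost of each leaf.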

Out of scope:
- Schema-level optimisations (already done in the prototype phase).
- Anything that needs a new model (that's phase 5).

Definition of done:
- ≤200ms p95 for canonical Same Game + Same Comp at production concurrency. Measured and asserted in CI.
- Cache hit rate ≥ X% on typical traffic (X set by phase 7 metrics).
- Query-cost rules deployed and a load test demonstrates they prevent worst-case shapes.

Dependencies: phase 1 (bitvector), phase 7 (observability for "what's hot to pre-compute"), LBS-1364 (epoch semantics, delivered in phase 7; cache invalidation needs a version contract).
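The reason the cache needs a version contract can be shown with a small sketch: if the data version is part of the cache key, a cutover to a new simulation run invalidates old entries without any explicit purge. This is Python for brevity with a plain dictionary standing in for FusionCache; the class, the integer `data_version`, and the hashing choice are assumptions.

```python
import hashlib
from typing import Callable, Dict, Tuple


def expression_hash(expression: str) -> str:
    return hashlib.sha256(expression.encode("utf-8")).hexdigest()


class VersionedResultCache:
    """Results are keyed on (expression hash, data version)."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, int], float] = {}

    def get_or_compute(self, expression: str, data_version: int,
                       compute: Callable[[], float]) -> float:
        key = (expression_hash(expression), data_version)
        if key not in self._entries:
            self._entries[key] = compute()  # miss: evaluate and store
        return self._entries[key]


evaluations = []


def evaluate() -> float:
    evaluations.append(1)  # count how often the expensive path runs
    return 0.42


cache = VersionedResultCache()
cache.get_or_compute("A AND B", 1, evaluate)  # miss
cache.get_or_compute("A AND B", 1, evaluate)  # hit, same version
cache.get_or_compute("A AND B", 2, evaluate)  # miss, new simulation run
```

A real cache would also evict entries for retired versions; that eviction policy is not shown here.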

Sizing: L (3–6 weeks).

Tickets: LBS-1359, LBS-1360, LBS-1362.

Phase 9: Production hardening

Business outcome: the system runs unattended; production posture is documented; consumers integrate against a stable contract.

Scope (in):
- Authentication / authorisation: API-key or JWT on the public ingress; scope-based authz (customer X can read certain leagues only).
- Rate limiting at the ingress, separate from the phase 8 query-cost rules (this is about volume per caller, not cost per query).
- Infrastructure-as-code: Container Apps + ACR + cluster config defined in Bicep / Terraform, not click-ops.
- Backup / disaster recovery for ClickHouse Cloud: procurement + config (called out in the roadmap's known gaps).
- Runbook: how to operate, when to page, how to recover.
- GraphQL schema-versioning policy: deprecation discipline, breaking-change process. Documented before external consumers integrate.
- Customer / consumer integration shape decision (roadmap known gap): direct GraphQL, wrapper API, SDK, federated; pick one before hardening locks the public surface.
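Per-caller volume limiting is commonly implemented as a token bucket; the sketch below shows that technique in Python for brevity. It is one possible approach, not the chosen ingress mechanism, and the class names, capacity, and refill rate are hypothetical. The clock is passed in explicitly so the behaviour is easy to test.

```python
from typing import Dict


class TokenBucket:
    def __init__(self, capacity: float, refill_per_second: float, now: float) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity  # start full so a new caller can burst
        self.updated_at = now

    def allow(self, now: float) -> bool:
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0  # spend one token per request
            return True
        return False


class PerCallerLimiter:
    """One bucket per caller, so one noisy caller cannot starve another."""

    def __init__(self, capacity: float, refill_per_second: float) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, caller_id: str, now: float) -> bool:
        bucket = self._buckets.get(caller_id)
        if bucket is None:
            bucket = TokenBucket(self.capacity, self.refill_per_second, now)
            self._buckets[caller_id] = bucket
        return bucket.allow(now)


limiter = PerCallerLimiter(capacity=2, refill_per_second=1.0)
decisions = [limiter.allow("customer-x", 0.0) for _ in range(3)]
```

A request rejected here would typically map to an HTTP 429 at the ingress, distinct from the 401 returned for unauthenticated requests.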

Out of scope:
- New features. Hardening is closing loops, not opening new ones.

Definition of done:
- Auth enforced; an anonymous request returns 401.
- One staged failure (kill a replica, lose a feed, drop the cluster) is recovered without losing data.
- Infra defined as code in the repo; one teammate can deploy a fresh environment from a clean checkout.
- Consumer integration pattern documented; one customer-side integration (real or sample) exists.

Dependencies: phase 8 (cache + perf), phase 7 (observability), phase 5 (real model — hardening a placeholder is throwaway work).

Sizing: L (3–6 weeks across sub-streams).

Tickets: none yet.

What can start this week

The roadmap allows phases 1 and 2 to run in parallel. Concretely, this week's planning conversation should pick:

- Phase 1, LBS-1361: engineering-only, no external dependency, immediate start. Sized M; one engineer can finish in 1–2 weeks.
- Phase 2 scope-decision call: not engineering work yet; it needs the "how many seasons back, which sport" decision from product + engineering before any code is written. Worth a 30-minute meeting this week.

Everything past those two is dependency-gated and shouldn't be picked up speculatively.