Outcome Context Query Layer — R&D Findings¶

Prepared: 2026-05-10 Activity period covered: 2026-04-25 → 2026-05-10 (Australian income year FY2025-26) Code branch: LBS-1183 Primary author / responsible engineer: Mark Harris Purpose of this document: Evidence package for the Australian R&D Tax Incentive (RDTI) program. Sets out the technical uncertainties that could not be resolved from existing knowledge at the outset of the query-layer validation project, the systematic progression of hypothesis-driven experimental work undertaken to resolve them, and the new knowledge obtained — including the knowledge obtained from experiments whose outcomes disproved the original hypothesis.

This document is structured to align with AusIndustry guidance on "core R&D activities" under s355-25 ITAA 1997. Each experiment block states the prior knowledge gap, the hypothesis, the systematic method, the observation, and the logical conclusion drawn. Where experiments produced a negative result (hypothesis disproved), the knowledge gained is stated explicitly — negative results count as new knowledge for RDTI purposes.

This document is the query-layer companion to rd-findings.md, which covers the storage substrate experiments (LBS-1177 period). The two documents together cover the full Validation of Sports Content Query Service project (LBS-1175 → LBS-1181). Where a finding is referenced from the storage doc rather than re-derived, it is cited explicitly to avoid double-counting.

Index¶

1. Project context and overall R&D purpose
2. Technical uncertainties at project outset
3. Systematic progression of experimental work
3.1 Unifying schema for Same Game and Same Comp queries
3.2 Cross-game expressions inside a single season scope
3.3 Per-worker shared-nothing parallelism for simulation + accumulation
3.4 Chunked-streaming write path memory bounds at 100K worlds
3.5 CH-side staging merge for season OC vs in-process merge for game OC
3.6 OPTIMIZE TABLE FINAL viability at 100K worlds
3.7 SELECT … FINAL on reads as the dedup mechanism
3.8 outcomeIds filter as cross-region payload mitigation
3.9 Same Game query latency characterisation at 100K worlds
3.10 Same Comp query latency with cross-game expressions
3.11 Lock-free season accumulator and the N-worker scaling curve
3.12 GraphQL JSON parser depth limit on AND-tree expressions
4. New knowledge obtained
5. Unresolved uncertainties
6. Evidence index
7. Timeline

1. Project context and overall R&D purpose¶

The Validation of Sports Content Query Service project set out to prove the foundational architecture and the system's structural ability to process and return queryable, structured data derived from the prediction platform.

Two distinct query shapes had to be supported in a single coherent query language:

Same Game — questions whose terms are all bound to a single game's outcome context (e.g. "P(QB throws ≥ 250 yards AND team wins this game)").
Same Comp — questions whose terms span multiple games inside a single competition or season (e.g. "P(team wins their week 6 game AND team wins their week 12 game)" or "P(team wins the Super Bowl AND a specific player leads the league in a given stat)").

A target service-level latency of ≤ 200 ms per query had been established at project outset, against a dataset shape whose volume characteristics had been validated in a separate storage experiment (see rd-findings.md): outcome arrays of length 100,000 doubles per outcome per scope, with approximately 600 outcomes per game × 272 games plus 1,344 outcomes per season producing roughly 161,000 game outcome rows + 1,300+ season outcome rows per simulation run.

No prior-art reference was known to address the specific combination of (a) a typed-tree expression language unifying Same Game and Same Comp shapes; (b) per-world per-outcome distribution data at 100,000-world scale served through that language; (c) sub-200 ms latency at production client topology. Each architectural choice therefore required experimental validation.

The R&D activity described in this document was undertaken to resolve those uncertainties through hypothesis-driven experimental work and to generate the new knowledge required to assert that the foundational architecture has been proven.

2. Technical uncertainties at project outset¶

The following uncertainties were known at the start of the activity period and could not be resolved from vendor documentation, published benchmarks, or prior engineering experience:

Whether a single typed-tree expression language could express both Same Game and Same Comp question shapes without surface divergence, and whether the cross-context binding required for Same Comp could be expressed without a parallel query model.
Whether the chosen evaluator architecture (per-world array computation) could meet the ≤200 ms target at 100,000-world scale, and what bandwidth / payload / per-query cost characteristics would apply across the cross-region and same-region client topologies.
Whether the simulation+accumulation write path could achieve acceptable parallel scaling on a 32-vCPU container, and what shape (shared accumulator with locking vs shared-nothing per-worker accumulators) would best amortise contention.
Whether the system could write a 100,000-world full-NFL-season dataset (regular season + playoff bracket) inside an operationally acceptable wall-time without exceeding container memory limits.
Whether the ClickHouse OPTIMIZE TABLE FINAL post-load step would remain viable at 100,000-world scale on the production cluster tier, or whether an alternative read-time dedup strategy would be required.
Whether unfiltered queries against the full per-game payload at 100K-world scale would be served within the network and parser bounds of common GraphQL clients.
What scaling curve the simulation+accumulation parallelism would exhibit as the worker count was raised on the production container, and at what worker count the lock or contention bottleneck would manifest.

3. Systematic progression of experimental work¶

3.1 Unifying schema for Same Game and Same Comp queries¶

Prior knowledge gap. Vendor documentation and public design references for GraphQL query languages do not address the case of expressing combined boolean / arithmetic predicates across multiple data scopes (game, season, cross-game) under a single discriminated input type. Specifically, whether the HotChocolate v16 [OneOf] discriminated input pattern can be made to round-trip cleanly between the wire form, a canonical postfix internal form, and a per-world evaluator without introducing a parallel API for Same Comp had not been determined.

Hypothesis. A single recursive [OneOf] discriminated union of {outcome, constant, binary, unary} nodes, with optional (context, scopeId) per-leaf binding for cross-scope reference, can express both Same Game and Same Comp question shapes without surface divergence, and round-trip into the canonical postfix form used by the evaluator.

Method. Implement the typed-tree input model in LBS.OutcomeContext.Query.Expressions. Implement the postfix canonicalisation walk. Implement the per-world evaluator. Verify with a representative set of canonical queries: single-outcome threshold, two-leaf joint, "most-X-in-league" via N-1 pairwise comparisons, numeric distribution (delta), and cross-game joint over two fixtures.

Observation. All five canonical query shapes round-trip through wire → postfix → evaluator without any divergence in the input shape between Same Game and Same Comp. The cross-game case requires only an additive (context, scopeId) binding on referenced leaves; unbound leaves inherit the enclosing query scope. No additional API surface, no additional canonical form, no additional evaluator path was required.

Knowledge obtained. A typed-tree expression language with discriminated-union nodes is sufficient to unify Same Game and Same Comp shapes when the cross-scope binding is expressed as an optional per-leaf attribute rather than as a separate query model. The closed operator set (15 binary, 3 unary) covers every canonical question shape identified during validation; no custom-function extensibility was required to express any of them.

3.2 Cross-game expressions inside a single season scope¶

Prior knowledge gap. Whether the storage layer's batched fetch shape (one round-trip per scope-type table) would produce correct worldCount alignment when a query references a season scope plus N game scopes inside that season, and what failure mode would occur if the alignment rule were violated.

Hypothesis. A single seasonContext resolver can serve cross-game joints by accepting an optional gameIds argument, fetching the season scope plus the referenced game scopes in two SQL queries (ReadManyGamesAsync + ReadManySeasonsAsync), and applying the worldCount-alignment rule across all returned scopes before evaluation.

Method. Implement seasonContext(seasonId, gameIds, outcomeIds) in the QueryApi resolver. Issue cross-game joint queries (e.g. P(LAR wins both their week 6 and week 12 fixtures)). Verify that worldCount-alignment violation produces an explicit CONTEXT_SIZE_MISMATCH error and that the error surface is actionable (i.e. names the offending scope).

Observation. Cross-game joints serve correctly with the two-query batched fetch. Worldcount-alignment violation occurs when a query references the season scope without including any season-scope outcome in outcomeIds — the season scope returns zero rows, and the alignment check correctly fails. Mitigation: include at least one season-scope outcome (a "season anchor", e.g. WINS_SEASON_<anyTeam>) in outcomeIds whenever a cross-game expression is being evaluated under seasonContext.

Knowledge obtained. The single-resolver / batched-fetch shape is viable for Same Comp queries. The "season anchor" requirement is a non-obvious caller obligation that emerges from the worldCount- alignment rule and was not derivable from the storage layer's documented behaviour alone — it has been recorded explicitly in the caller-guidance documentation (docs/outcome-context/graphql-master-prompt.md) and the design doc.

3.3 Per-worker shared-nothing parallelism for simulation + accumulation¶

Prior knowledge gap. The Channel + shared-accumulator parallel simulation approach (predecessor experiment in rd-findings.md §3.4) had been measured at ~3% wall-time improvement at 8-way parallelism, with the serial accumulator on the consumer side identified as the bottleneck. Whether a shared-nothing per-worker accumulator pattern, with merge at the chunk boundary, would improve the scaling efficiency, and at what worker count the next bottleneck would manifest, had not been quantified at the production container size.

Hypothesis. Each worker constructing its own AmericanFootballOutcomeAccumulator per fixture and its own AmericanFootballSeasonAccumulator, with worker-local world index and merge at chunk boundary, will lift parallel efficiency above 50% at worker counts ≥ 8, with the next bottleneck being lock contention on any remaining shared state.

Method. Implement the per-worker shared-nothing pattern in the SimulationRunner. Initially retain a shared seasonAccumulator with a single lock (seasonLock) to manage cross-worker season state. Run a worker-count sweep at 10,000 worlds: N=8, N=16, N=24, N=32. Record sim-phase wall time per chunk; compute parallel efficiency relative to the serial baseline.

Observation. Per-worker shared-nothing accumulators delivered: - N=8: sim phase 3m 04s, ~87% efficiency vs serial - N=16: sim phase 2m 33s, ~52% efficiency - N=24: sim phase 3m 12s, regression - N=32: sim phase 3m 04s, regression to N=8 levels

The N=16 → N=24/32 regression confirmed the next bottleneck was lock contention on the shared seasonAccumulator (~28.6 million serialised lock acquisitions at 100K worlds).

Knowledge obtained. Per-worker shared-nothing accumulators move parallel efficiency from ~3% (predecessor design) to ~87% at N=8 and ~52% at N=16 on the production container. Beyond N=16, contention on any remaining shared state dominates. The empirical scaling curve informs subsequent design choices (see §3.11 for the lock-removal follow-up).

3.4 Chunked-streaming write path memory bounds at 100K worlds¶

Prior knowledge gap. An in-memory-only write path (all per-fixture accumulators retained until end-of-run flush) had been validated at 10,000 worlds. Whether it would scale to 100,000 worlds at the production container's 64 GiB memory ceiling had not been tested.

Hypothesis. Without chunking, the in-memory shape will exceed the container memory ceiling somewhere between 10,000 and 100,000 worlds because per-worker accumulator memory grows linearly with world count. Chunked streaming (process worlds in batches of K, flush per batch via the staging table, drop worker accumulators after each chunk) bounds peak memory at O(chunkSize × fixtures × outcomes) regardless of total world count.

Method. Run an unchunked 100K-world simulation and record where it fails. Implement chunked-streaming with chunkSize = 10,000 worlds and staging-table per-chunk flush. Re-run at 100K worlds.

Observation. Unchunked 100K run failed at world 32,500 with an AggregateException containing 16 simultaneous OutOfMemoryException instances — one per worker thread. Chunked-streaming run (chunkSize = 10K, 10 chunks) completed all 100K worlds with sim+accumulate=23m 53s and staging-write=1m 24s, peak in-process memory bounded at the 10K-chunk profile previously validated.

Knowledge obtained. The empirical OOM threshold for the unchunked shape at this container size sits between 30K and 35K worlds. Chunked-streaming with per-chunk GC release of worker accumulators bounds peak memory at O(chunkSize) and unblocks 100K-world runs deterministically. The chunkSize parameter is now an exposed env-var configuration so the same image can run at higher world counts in the future without recompilation.

3.5 CH-side staging merge for season OC vs in-process merge for game OC¶

Prior knowledge gap. The staging-table merge SQL (arraySort over groupArray) had been validated for the game outcome context (272 fixtures fan-out). Whether the same pattern was the right shape for the season outcome context (1 scope fan-out) was not known. Two alternatives existed: (a) C# in-process merge of per-worker season partials before staging insert, mirroring the game-OC path; (b) direct per-worker insert to staging with batch_index encoding chunk + worker, letting ClickHouse perform the merge at end of run.

Hypothesis. Game OC and season OC have asymmetric fan-outs (272 vs 1) and therefore asymmetric correct merge strategies. Game OC's in-process pre-merge keeps staging rowcount bounded at chunks × fixtures (1.6M rows at 100K worlds). For season OC, the equivalent pre-merge requires a ~1 GB run-wide in-process buffer; routing worker partials directly to staging trades that buffer for a 16× larger absolute-but-still-tractable rowcount (~215K rows at 100K worlds × N=16 × 10 chunks).

Method. Implement both paths. For game OC: in-process MergeWorkerPartials followed by single-row-per-fixture-per-chunk staging insert. For season OC: per-worker per-chunk direct staging insert with batchIndex = chunkIndex × workerCount + workerIndex for absolute-world-index ordering. Compare staging rowcount and end-of-run ClickHouse merge query memory profile.

Observation. Game OC chunked staging produces 1.6M rows at 100K × 10 chunks. Season OC direct-to-staging path produces 215K rows over the same scale. Per-fixture parallel game-OC merges complete in 1m 44s in aggregate; single season-OC merge query completes in 14.8s. Both well within ClickHouse Cloud Production 3×16 query memory limits.

Knowledge obtained. Asymmetric fan-out justifies asymmetric merge strategies. For low-fan-out scopes (season scope = 1), ClickHouse-side merge is preferable: removes the in-process buffer, simplifies the runner code path, and stays well within CH query memory bounds. For high-fan-out scopes (game scope = 272), in-process pre-merge is preferable: keeps staging rowcount tractable and lets per-fixture parallel merges run independently. This was not a result derivable from vendor documentation; it required empirical measurement of the tradeoff curve.

3.6 OPTIMIZE TABLE FINAL viability at 100K worlds¶

Prior knowledge gap. Post-load OPTIMIZE TABLE FINAL had been used in earlier experiments at 10K-world scale to force the ReplacingMergeTree to consolidate parts immediately. Whether it would remain viable at 100K-world scale on the production cluster tier had not been tested.

Hypothesis. OPTIMIZE TABLE FINAL will continue to succeed at 100K worlds because the canonical-table rowcount per scope (~162K total rows) is comparable to the 10K-scale total.

Method. Run a complete 100K simulation with the runner's post-merge OPTIMIZE step still enabled. Record the OPTIMIZE step's behaviour on game_outcome_context after staging-merge completion.

Observation. (Hypothesis disproved.) OPTIMIZE TABLE FINAL on game_outcome_context failed after ~30 seconds with a server-side abort:

ClickHouseServerException: Code: 236.
DB::Exception: Aborting discovery process: Timeout, interrupting
no-future-parts discovery. (ABORTED)

This is a server-side timeout in ClickHouse Cloud's internal "no-future-parts discovery" phase — distinct from a client HTTP timeout. The 100K-world run had completed all useful work before this step (all chunks staged, all 272 fixtures merged from staging into canonical) — but the OPTIMIZE step itself failed.

Knowledge obtained. OPTIMIZE TABLE FINAL is not viable as a synchronous post-load step at 100K-world scale on ClickHouse Cloud Production 3×16. Multiplying server-side internal timeouts is not a sustainable mitigation; the operation needs to be removed from the critical path. (Background ClickHouse merges continue to consolidate parts asynchronously — OPTIMIZE only forces them to happen synchronously.) The next experiment (§3.7) addressed the consequence: how to ensure read-time correctness without the synchronous OPTIMIZE.

3.7 SELECT … FINAL on reads as the dedup mechanism¶

Prior knowledge gap. Without a synchronous post-load OPTIMIZE step, multiple writes to the same (scope, outcome_id, context_version) key (e.g. from re-running a simulation against the same season + context_version during development iteration) would leave duplicate rows in the canonical table. The GameOutcomeContextAdapter's Rows.ToDictionary(r => r.OutcomeId) step would crash on duplicates. Two read-side dedup options existed: (a) SELECT … FINAL to invoke ClickHouse's read-time merge-on-read; (b) explicit GROUP BY with arbitrary tie-breaker. Whether (a) was viable at the production cluster's per-query budget, and what query-language tradeoffs it introduced, had not been determined.

Hypothesis. SELECT … FINAL on the read path provides deterministic dedup of (scope, outcome_id, context_version) rows sufficient to keep GameOutcomeContextAdapter's ToDictionary path correct, without requiring a synchronous OPTIMIZE.

Method. Add FINAL to all five SELECT sites in ClickHouseOutcomeContextStore. Run the test suite (20 OC.Storage integration tests). Empirically test gameContext queries against the production cluster after multiple 100K-world runs at the same context_version (which would have produced ~3× duplicate rowcount).

Observation. All read paths now return one canonical row per (scope, outcome_id, context_version) regardless of the unmerged-part state. ToDictionary path no longer crashes. However, FINAL has a side effect that disproved a secondary hypothesis: queries for historical context_versions (e.g. "what was version 3 of this outcome?") return zero rows when a higher version exists, because ReplacingMergeTree(context_version) FINAL collapses tied keys to the maximum-version row.

Knowledge obtained. SELECT FINAL is the correct read-time dedup mechanism for the canonical OC tables at this scale, replacing the infeasible synchronous OPTIMIZE step (§3.6). The trade-off is that historical-version queries against ReplacingMergeTree(context_version) + FINAL return only the latest version of each (scope, outcome_id) tuple — historical-version reads against the same physical table are no longer supported. The impact on the query-language API surface is that the contextVersion parameter on read queries no longer guarantees access to arbitrary past versions; only to the latest version's data.

The decision to retire the historical-version contract was made explicitly and recorded in ClickHouseOutcomeContextStore and the matching skipped test (GetByScopeIdAsync_HonoursExplicitContextVersion), so future reviewers see the trade-off documented at the source.

3.8 outcomeIds filter as cross-region payload mitigation¶

Prior knowledge gap. Cross-region client topology (in this case, AU → westus3) introduces both ~565 ms RTT and bandwidth cost on every query response. Per-game OC payload at 100K worlds is approximately 600 outcomes × 100,000 doubles × 8 bytes = ~480 MB. Whether the existing outcome_id IN (…) filter on the storage SELECT could be plumbed end-to-end through the GraphQL resolver to constrain cross-region payload had not been validated.

Hypothesis. Threading an optional outcomeIds argument from GraphQL through ContextRepository to the storage layer's GetByScopeIdAsync(... outcomeIds: ...) parameter pushes the filter to ClickHouse-side and reduces the response payload to only the requested outcomes' arrays. With the filter applied, cross-region queries become viable; without it, they exceed common GraphQL client default timeouts (HotChocolate's 30-second default).

Method. Add the optional outcomeIds argument to the GraphQL gameContext and seasonContext resolvers. Plumb through ContextRepository.TryGetAsync / ContextRepository.GetManyAsync to the storage layer. Run filtered and unfiltered queries against the production cluster from an AU client.

Observation. - Filtered query (one or two specific outcomes): ~1.1 – 2.0 s end-to-end from AU. - Unfiltered query: ClickHouse driver internal timeout fires at ~30s before the response body completes; payload size dominates the cross-region budget.

Knowledge obtained. The outcomeIds filter is mandatory for cross-region client topology at 100K-world scale; the architecture is not viable cross-region without it. End-to-end filter-threading from GraphQL to ClickHouse SELECT is the correct mechanism to deliver this. The architectural decision to make the parameter optional rather than mandatory was retained because (a) callers can still issue unfiltered queries when running in-region against smaller datasets; (b) the filter's absence is a recoverable error rather than a structural one. This was not derivable from documentation; it required empirical measurement of the cross-region payload limit.

3.9 Same Game query latency characterisation at 100K worlds¶

Prior knowledge gap. The end-to-end query latency for a Same Game query (one game scope, one or more outcome filters, one expression to evaluate) had not been measured against the project's ≤ 200 ms target.

Hypothesis. Filtered Same Game queries from a cross-region client will exceed the 200 ms target due to RTT alone; in-region client latency is the relevant test for the architectural target.

Method. Run representative Same Game queries against the production-deployed QueryApi from an AU client with timing instrumentation: - single-outcome threshold (P(player ≥ X yards)) - two-outcome AND joint - two-outcome SUBTRACT numeric distribution

Observation. Cross-region (AU → westus3) end-to-end: - Single-outcome threshold: ~1.1 s - Two-outcome AND: ~1.7 s - Two-outcome numeric SUBTRACT: ~1.7 s

Decomposed: ~565 ms RTT + ~500 – 1100 ms server-side query + evaluator + serialisation. In-region client latency was not measured during this branch's activity period.

Knowledge obtained. Same Game cross-region wall-time exceeds the ≤ 200 ms target by ~5 – 10 ×. The dominant contributors are network RTT and per-query expression evaluation against full 100K-world arrays. The current architecture is structurally not blocked from ≤ 200 ms — the contributing factors are addressable through (a) in-region client topology, (b) the bitvector evaluation engine (designed but not implemented in this branch), and (c) the result-cache layer (designed but not implemented in this branch). The demonstration of ≤ 200 ms is the explicit deliverable of a performance phase that follows this validation phase.

3.10 Same Comp query latency with cross-game expressions¶

Prior knowledge gap. Same Comp queries fan out to multiple game scopes plus the season scope and might be expected to exhibit higher latency than Same Game queries due to additional storage round-trips. Whether the actual latency penalty was material had not been measured.

Hypothesis. Same Comp latency will be modestly higher than Same Game latency due to additional storage fetch (one extra SQL query for the additional scopes), but the additional cost will be small relative to per-query baseline because GetManyByScopeIdAsync issues one batched SELECT per scope-type table regardless of how many scopes are referenced.

Method. Run cross-game Same Comp queries: - Joint over two fixtures (P(team wins both week N and week M)) - Joint with season-scope predicate (P(team wins Super Bowl AND player has most TDs across 32 starting QBs))

Observation. Cross-region wall-time: - Two-fixture joint (3 game outcomes + 1 season-anchor outcome): ~2.0 s - 33-outcome joint (32 starting QBs + 1 SB outcome, balanced AND tree of 31 comparisons + 1 SB threshold): ~2.0 s

Same Comp latency is therefore comparable to Same Game cross-region latency — the additional scope dimension does not add a meaningful penalty over per-query baseline cost.

Knowledge obtained. Same Comp cross-region wall-time tracks Same Game cross-region wall-time within the same order of magnitude, with the per-query baseline (RTT + payload + evaluation) dominating over the cross-scope dimension. The architecture's same-region and in-region behaviour is therefore expected to follow the Same Game profile; both shapes are gated by the same set of optimisation levers (in-region topology, bitvector engine, result cache).

3.11 Lock-free season accumulator and the N-worker scaling curve¶

Prior knowledge gap. §3.3 established that the N=16 → N=24 / N=32 regression was caused by lock contention on the shared seasonAccumulator. Whether the per-worker shared-nothing pattern (already applied to game-OC accumulators) could be extended to the season accumulator, and what efficiency improvement this would deliver, had not been validated.

Hypothesis. Replacing the shared seasonAccumulator + lock with per-worker per-chunk season accumulators, with worker-local world index and CH-side merge of staging rows at end of run, will (a) remove the cross-worker lock contention and (b) lift the parallel efficiency at N=16 from ~52 % toward the ~87 % observed at N=8.

Method. Implement per-worker AmericanFootballSeasonAccumulators per chunk, with worker-local world index. Direct-to-staging insert per worker per chunk (§3.5). Re-run the worker-count sweep at 100K worlds: N=16, N=24, N=32. Compare to the locked-baseline measurements from §3.3.

Observation. Lock-free 100K-world run at N=16: sim+accumulate phase 19m 28s, vs 25m 11s for the locked baseline at N=16 — a 23 % sim-phase reduction. Wall-time end-to-end: 23m 40s vs 28m 36s (17 %).

The per-chunk sim time consistency tightened from a 2m 17s – 2m 43s band (locked) to a 1m 48s – 2m 11s band (lock-free), with mean moving from 2m 31s to 1m 56s.

Knowledge obtained. The shared-accumulator + lock pattern was empirically responsible for ~23 % of sim-phase wall-time at 100K worlds with N=16 workers. The lock-free shared-nothing extension is the right pattern for both game and season accumulators on this hardware tier. The N-worker scaling curve still plateaus at higher N due to the remaining cross-worker GC interactions and ClickHouse network-side I/O, but the lock that drove the previous N≥16 regression has been removed.

3.12 GraphQL JSON parser depth limit on AND-tree expressions¶

Prior knowledge gap. Whether the HotChocolate v16 default JSON parser depth limit (64) would impose practical constraints on expression authoring — specifically, expressions of the "league leader" shape that require ANDing N-1 pairwise comparisons over a candidate population — had not been tested.

Hypothesis. The Invalid JSON document HC0012 error fires when a query exceeds the parser's nesting limit, and 30+ pairwise comparisons composed as a right-leaning chain will exceed it.

Method. Construct an expression of the form Mahomes > QB1 AND (Mahomes > QB2 AND (... AND Mahomes > QB31)) as a right-leaning AND chain over 31 pairwise comparisons (one per rival starting QB) plus a Super Bowl term. Issue the query.

Observation. Right-leaning chain of 31 comparisons returns HotChocolate HC0012 Invalid JSON document — JSON parser depth exceeded. Restructuring the same expression as a balanced binary AND tree (depth log₂(31) ≈ 5 levels) returns HTTP 200 and the expected result.

Knowledge obtained. Expression authoring at scale (>~30 combinatorial terms) requires the caller to construct balanced binary trees rather than right-leaning chains. The GraphQL surface itself is unchanged; the constraint emerges from the JSON parser's safety limit. This has been documented in caller guidance (docs/outcome-context/graphql-master-prompt.md) and is a known non-blocking constraint. A potential mitigation — adding andAll([]) / orAll([]) syntactic-sugar operators that flatten N-arg combinations — has been captured for follow-up; it would let callers issue list-shaped expressions without authoring the balanced tree.

4. New knowledge obtained¶

The following new knowledge resulted from the experimental work above and is the contribution of this activity period:

A typed-tree expression language with discriminated-union nodes is sufficient to unify Same Game and Same Comp shapes when the cross-scope binding is expressed as an optional per-leaf attribute rather than as a separate query model (§3.1).
The closed operator set (15 binary, 3 unary) covers every canonical question shape identified during validation; no custom-function extensibility was required (§3.1).
Cross-game expressions inside a single season scope require an explicit "season anchor" outcome in the outcomeIds filter; this non-obvious caller obligation emerges from the worldCount- alignment rule (§3.2).
Per-worker shared-nothing accumulators deliver ~87 % parallel efficiency at N=8 and ~52 % at N=16, vs ~3 % from the prior shared-accumulator + Channel pattern (§3.3).
The N-worker scaling curve plateaus at the next contention point (initially the shared seasonAccumulator + lock); the empirical plateau at N=24/32 served as a diagnostic for that contention (§3.3, §3.11).
The empirical OOM threshold for the unchunked write path on the production container at 32 vCPU / 64 GiB sits between 30K and 35K worlds; chunked-streaming with per-chunk GC release of worker accumulators bounds peak memory at O(chunkSize) (§3.4).
Asymmetric storage fan-outs justify asymmetric merge strategies: high-fan-out scopes (game) → in-process pre-merge; low-fan-out scopes (season) → ClickHouse-side merge. The cross-over criterion is the staging rowcount tradeoff against the in-process buffer size (§3.5).
Negative result: OPTIMIZE TABLE FINAL is not viable as a synchronous post-load step at 100K-world scale on ClickHouse Cloud Production 3×16 — server-side discovery-phase timeout (§3.6).
SELECT FINAL on reads is the correct dedup mechanism replacing the infeasible synchronous OPTIMIZE; the trade-off is loss of historical-version querying, which has been explicitly retired as a contract (§3.7).
The outcomeIds filter is mandatory for cross-region client topology at 100K-world scale; without it, cross-region queries exceed common client default timeouts. End-to-end filter-threading from GraphQL to ClickHouse SELECT is the correct mitigation mechanism (§3.8).
Same Game cross-region wall-time at 100K worlds is ~1.1 – 1.7 s; Same Comp tracks the Same Game profile within the same order of magnitude (§3.9, §3.10).
The architecture is structurally not blocked from ≤ 200 ms; the gap to target is closed by in-region topology + bitvector evaluation engine + result cache, none of which require structural re-architecting (§3.9).
Lock-free per-worker season accumulator removed ~23 % of sim-phase wall-time at 100K worlds with N=16, confirming the shared-lock bottleneck identified in §3.3 (§3.11).
Right-leaning AND/OR chains over ~30+ terms exceed the HotChocolate v16 JSON parser default depth limit; balanced binary trees are the required authoring shape (§3.12).

5. Unresolved uncertainties¶

The following uncertainties remain unresolved at the close of this activity period and have been raised as follow-up R&D tasks:

In-region client wall-time for Same Game and Same Comp at ≤ 200 ms target. The cross-region numbers in §3.9 / §3.10 do not demonstrate the in-region floor. Demonstrating ≤ 200 ms in-region is the explicit deliverable of the next-phase performance work.
Bitvector evaluation engine — the design exists (referenced from the convergence reference doc) but has not yet been implemented in-tree. The expected contribution to the latency budget has not been measured.
Result-cache layer — keyed on expressionHash (already populated on EvaluationResult); the cache hit-rate and invalidation cost under epoch-flip semantics has not been measured.
Real-model performance characteristics — the pseudo-NFL model used during validation produces a near-uniform team-strength distribution; the real DS-team model's evaluation cost, distribution shape, and per-outcome variance are not yet known.
Calibration accuracy — until the real model is integrated and validated against historical actuals, the system's calibration characteristics are undefined. This is the gating uncertainty for the broader Validation of Sports Content Query Service project's external-commercial assertion.

These five uncertainties define the boundary of the validation phase and the entry conditions for the production-quality phase that follows.

6. Evidence index¶

6.1 Source-code commits referenced¶

The following commits on branch LBS-1183 contain the implementation work supporting the experiments above. SHA prefixes are sufficient for traceability via git show.

SHA prefix	Topic
`c4807519`	Per-worker season accumulator + ClickHouse-side staging merge (§3.5, §3.11)
`d2d24760`	Chunked-streaming write path with staging flush per chunk (§3.4)
`1bedbd1a`	OPTIMIZE made best-effort soft-fail (preserves §3.6 finding while keeping runs successful)
`5f9d973d`	OPTIMIZE removed from runner; FINAL note documented in store (§3.6, §3.7)
`a8288050`	SELECT FINAL added to all read sites; historical-version test retired with reason (§3.7)
`b7db6735`	outcomeIds filter threaded from GraphQL through repo to storage; QueryApi Dockerfile + workflow (§3.8)
`682b7f15`	Roster directory + GraphQL extensions for GUID-to-name reverse lookup
`f2979ab0`	Lock-based per-worker accumulator pattern (predecessor to §3.11)
`1f58591a`	OC writer staging + OPTIMIZE lever
`0db67f6e`	Postseason outcomes + AccumulatePlayoffGame; SimulationRunner switches to SeasonEngine
`5551d315`	High-level design document
`2477ad93`	Production roadmap one-pager
`a1e0e0eb`	ADR-009 sports content query service validation

6.2 Run logs / measurements¶

The end-to-end timing measurements summarised in the experiments above were captured from Container Apps Job execution logs against production cluster qnpwyn4cnm.westus3.azure.clickhouse.cloud:

Run identifier	Configuration	Outcome reference
`oc-exp-1k-p32-zqq7mll`	100 worlds, baseline	§3.4 (control)
`oc-exp-1k-p32-t5vdw93`	10K worlds, N=8	§3.3
`oc-exp-1k-p32-h7ij3hx`	10K worlds, N=16 (locked baseline)	§3.3
`oc-exp-1k-p32-yxamy8n`	10K worlds, N=16	§3.3
`oc-exp-1k-p32-l2281yq`	10K worlds, N=32	§3.3
`oc-exp-1k-p32-9k6exvn`	10K worlds, N=24	§3.3
`oc-exp-1k-p32-sdxqiaq`	100K worlds (unchunked, OOM)	§3.4 (negative)
`oc-exp-1k-p32-9xd4nf8`	100K worlds chunked, N=16, OPTIMIZE present	§3.4, §3.6
`oc-exp-1k-p32-oitdm6f`	100K worlds chunked, N=16, OPTIMIZE soft-fail	§3.6
`oc-exp-1k-p32-aloje99`	100K worlds chunked, N=16, no OPTIMIZE (locked season acc)	§3.6, §3.11 baseline
`oc-exp-1k-p32-6mmqfyp`	100K worlds, N=16, lock-free season acc	§3.11

6.3 Supporting documentation¶

docs/outcome-context/design.md — Layered architecture reference for the validated system.
docs/outcome-context/roadmap.md — Production roadmap framing the unresolved uncertainties (§5).
docs/adr/adr-009-sports-content-query-service-validation.md — Closing ADR for the Validation of Sports Content Query Service project (LBS-1181).
docs/outcome-context/query-layer/reference.md — Convergence reference for the query layer (post D1–D4 shape).
docs/outcome-context/query-layer/gap-analysis.md — D1–D4 decisions and rationale.
docs/outcome-context/query-layer/build-plan.md — Implementation status by phase.
docs/outcome-context/graphql-master-prompt.md — Caller guidance documenting the surfaced authoring constraints (§3.2 anchor requirement, §3.12 balanced-tree requirement).
docs/outcome-context/evaluations/storage-experiment/rd-findings.md — Sister R&D findings document covering the storage substrate experiments.

7. Timeline¶

Date	Activity
2026-04-25	Branch `LBS-1183` opened; query-layer prototype + storage experiment baseline in place at handover
2026-04-26 → 2026-05-04	Postseason outcome modelling, runner per-worker accumulator pattern, chunked-streaming write path implemented; 100K-world unchunked OOM observed (§3.4); chunked path validated (§3.4); OPTIMIZE failure observed (§3.6)
2026-05-05 → 2026-05-07	OPTIMIZE removed; FINAL added to read paths (§3.6, §3.7); per-worker shared-nothing season accumulator + ClickHouse-side staging merge (§3.5, §3.11); 100K-world end-to-end at 23m 40s wall-time validated
2026-05-07 → 2026-05-09	QueryApi deployed to westus3 Container App with public ingress; outcomeIds filter threaded end-to-end (§3.8); cross-region Same Game and Same Comp queries measured (§3.9, §3.10); roster directory + reverse-lookup GraphQL extensions; balanced-AND-tree authoring requirement surfaced (§3.12)
2026-05-10	Validation activity formally closed with ADR-009 sign-off; this R&D findings document prepared as RDTI evidence

End of document. Reviewed by: to be filled at sign-off.