Skip to content

Storage Experiment — Status Update

Date: 2026-04-18 Branch: LBS-1183 Audience: Engineering, product, stakeholders

R&D Tax evidence: a hypothesis-driven write-up of every experiment in this document (with commit SHAs, run-log references, and explicit knowledge-gained statements for both positive and negative results) is held in rd-findings.md.

TL;DR

100K full-pipeline end-to-end validated on ClickHouse Cloud in 2h 58m for ~$27/run. All three outcome tables (game_outcome_context, play_by_play, season_outcome_context) written correctly: 244,800 game-OC rows • 4.2B play-by-play rows • 1,728 season-OC rows. Primary run for the storage-decision gate.

Headline numbers (Production 3×16 Cloud tier, Azure Container Apps D32 32-CPU / 64 GiB):

Scope 100K Wall Cost
OC-only 31.4 min ~$5
Full pipeline (OC + PBP, no season) 2h 44m ~$25
Full pipeline + season context 2h 58m ~$27

What it took to get there — stacked optimisations, each measured:

  1. Streaming orchestrator with per-worker shared-nothing accumulators — lifted parallelism from a failed ~3% improvement (shared-accumulator Channel<T>) to -55% wall time at 5K/10K on localhost Docker. Factor improves at scale.
  2. Parallel per-game OC staging writes + parallel final merge — ClickHouse accepts concurrent inserts; cut localhost 100K projection from ~4.4h to ~2h.
  3. Same-region Azure Container Apps job in westus3 (same region as the Cloud cluster) — removed the 565ms AU→US RTT tax that had a 1K run taking 12 minutes.
  4. async_insert=1; wait_for_async_insert=0 on the connection string — OC write -86% at 1K by letting CH buffer small inserts server-side.
  5. PBP direct to ClickHouse via ClickHousePlayByPlayBatchWriter — the ephemeral-container Parquet writer path stalled pathologically at 10K+ (1h 40m and still running); switching PBP to the same bulk-copy path fixed it. Required play_by_play repartition from game_id (288 partitions, rejected by Cloud's max_partitions_per_insert_block=100) to season_id (1 partition).
  6. 4-way parallel PBP writes per chunk — per-bulk-copy throughput caps at ~175K rows/sec; splitting into 4 concurrent slices lifted 10K to 513K rows/sec.
  7. Production Scale (2×4) → Production 3×16 Cloud tier — prior 100K full-pipeline attempt timed out at 4h mid-PBP-write because PBP scaling was super-linear on the smaller cluster; the bigger tier resolves it (near-linear 528K rows/sec at 100K vs 513K at 10K). Merge phase also collapsed from 20 min → 2.3 min on OC-only thanks to the extra replica RAM + merge-parallelism cap raised from 4 to 16.
  8. Season accumulator refactor — the original Dictionary<worldIndex, Dictionary<outcomeId, double>> layout projected to ~270 GB at 100K and OOM'd the container. Refactored to Dictionary<outcomeId, double[worldCount]>: same observable behaviour (6 existing unit tests pass unchanged), peak RAM for the season accumulator capped at ~1.3 GB regardless of scale.

Remaining bottlenecks at 100K: - Simulation (36 min / 20% of total) — prototype sim code, user-scoped out of optimisation. A future per-worker season-accumulator + merge would remove lock contention from the sim hot path and save ~12 min. - PBP write (2h 13m / 75% of total) — cluster insert throughput is the ceiling. Bigger tier or sharded insert path would be the levers.

Spend shape: idle-pause after 15 min means no ongoing compute cost; every run pays only for its own compute. 100K full-pipeline ≈ ACI $3 + CH Cloud $24.


What has been built

Production-quality (will live beyond the experiment):

  • Accumulation library (LBS.Model.AmericanFootball.Accumulation) — converts simulation game results into the Outcome Context shape (~850 stats per game as Array(Float64) keyed by outcome_id) and the Season Outcome Context (standings + player milestones across all worlds). Sport-specific design, cleanly extensible to other sports.
  • Storage library (LBS.Model.AmericanFootball.Storage) — NEW. Houses the Parquet writer (moved out of Simulation), the IBlobSink abstraction with local-disk implementation and Azure Blob stub, and the IPlayByPlayBatchWriter pattern. Simulation project no longer depends on Parquet.Net.
  • Outcome ID catalogue — 851 declared stats per game, 1,668+ stats per season. Covers team scoring, QB / RB / WR / TE / K / D individual stats, TD ordinals, fantasy points, season standings, milestones.

Experiment-grade (disposable):

  • Storage backend abstraction (IStorageBackend) — enables testing the same suite against ClickHouse, ScyllaDB, or any other target with a ~300-line implementation.
  • ClickHouse backend — production-quality implementation using the official ClickHouse.Driver package, with parameterised SQL throughout, allowlisted DDL identifiers, and Testcontainers-backed integration tests.
  • Streaming orchestrator — NEW. Chunks simulation into configurable world-ranges, flushes OC partials to a staging table + merges via arrayFlatten at the end, writes PBP per chunk to Parquet files via IBlobSink. CLI flags: --streaming, --chunk-size, --pbp-sink-type, --pbp-sink-path.
  • 19 experiments implemented across the must-have, should-have, and nice-to-have priority bands (18 original + basket-read sweep).
  • Results pipeline — every experiment emits structured ExperimentResult rows that print to console AND can be written to a ClickHouse Cloud table for cross-run analytics.
  • Docker Compose for local ClickHouse development + a CLI runner for the experiment suite.

Quality gates

Category Status
Unit + integration tests 109 total, 108 passing, 1 intentional skip
Streaming orchestrator integration tests (Testcontainers) smoke + read-parity
Storage layer unit tests LocalDiskBlobSink + ParquetPlayByPlayBatchWriter round-trips
Coding standards (TreatWarningsAsErrors, this. prefix, string constants, DateTimeOffset) enforced
Parameterised SQL / allowlisted DDL no string interpolation on user-influenced values
Code reviews completed 3 rounds — critical + important issues all addressed

What has been benchmarked

All runs: local Docker (single container, default config, 32 GB host RAM), Release build, no tuning.

Outcome Context (dense columnar, Array(Float64))

Metric 10 worlds 100 worlds 1,000 worlds 100K projection
Full-season write time (old path, 288 games) 18.7s 7.2s 48.6s ~80 min
Sustained write throughput 12K rows/sec 34K rows/sec 5K rows/sec (array-size-bound)
Full-context read p50 (all ~850 outcomes) 17 ms 25 ms 200 ms 3–15s, uncertain
Full-context read p95 28 ms 27 ms 322 ms
Basket read mean, 1 outcome 8.3 ms 8.6 ms 12.9 ms ~30 ms
Basket read mean, 5 outcomes 11.8 ms 8.5 ms 18.4 ms ~40 ms
Basket read mean, 10 outcomes 11.7 ms 8.3 ms 19.0 ms ~40 ms
Basket read mean, 20 outcomes 9.4 ms 9.3 ms 20.5 ms ~40 ms
Basket read mean, 30 outcomes 11.9 ms 8.8 ms 20.4 ms ~40 ms
Compression ratio 5.56× 6.39× 6.96× ~7×
Compressed storage (full season) 4.8 MB 31.9 MB 282 MB ~28 GB
Uncompressed storage (full season) 26.5 MB 204 MB 1.97 GB ~197 GB

Full-context vs basket reads. The full-context read transfers every outcome for a game (~850 rows × array-of-N-doubles) and is the relevant cost for "load a game into memory for batch bitvector evaluation." A basket read queries only the 1–30 outcomes a user's prediction actually references and costs ~8–20 ms regardless of world count. Different use cases, very different latencies.

Full-context projection at 100K worlds is genuinely uncertain. Going from scale 1,000 → 100,000 is 100× more bytes per read. If the bottleneck stays query-planner-bound the latency holds near flat; if it becomes data-transfer-bound it could grow linearly with payload. Honest range: 3–15 seconds. The 5-second spec target for full-context reads at 100K is a real constraint to measure, not one we can confidently claim we'll meet without running it.

Basket projection at 100K worlds is confident. Extrapolating the near-flat 100 → 1,000 trend, basket reads of 1–30 outcomes should land at ~30–40 ms even at 100K scale. Well inside any reasonable interactive budget.

Play-by-Play (sparse event-level)

Metric 10 worlds 100 worlds 1,000 worlds 100K projection
Total rows 419K 4.2M 42M 3.7 billion
Full-season write time (old path) 5.1s 34s 7.9 min ~13 hours (serial)
Sustained write throughput 82K rows/sec 122K rows/sec 88K rows/sec (write-bound)
Compression ratio 2.01× 2.30× 2.31× ~2.3×
Compressed storage (full season) 10.5 MB 89 MB 882 MB ~88 GB
Uncompressed storage (full season) 21 MB 205 MB 2.04 GB ~204 GB

Streaming orchestrator (new path) vs old in-memory path

Scale Chunk Mode Old path wall time New path wall time Δ New path peak RAM
10 10 sequential 23.8s 30.2s +27% (slower) 594 MB
100 100 sequential 41.6s 38.0s -9% (faster) 4.2 GB
1,000 100 sequential ~10 min 4.4 min -56% (2× faster) 5.1 GB
5,000 500 sequential (cannot run) 23.6 min unreachable vs measurable 18.6 GB
5,000 500 OC-only (--no-pbp) (cannot run) 12.9 min -46% vs 5K full pipeline 1.74 GB
10,000 500 sequential (cannot run) 43.9 min unreachable vs measurable 19.3 GB
10,000 500 8-way parallel (removed) (cannot run) 42.5 min -3.3% vs 10K sequential 18.6 GB

Breakdown for the 10,000-world streaming run (20 chunks × 500 worlds):

  • Total wall time: 2,634.4s (43.9 min)
  • Simulation: 1,083s (41%)
  • OC write to staging: 499s (19%)
  • PBP write to Parquet: 726s (28%)
  • Merge (arrayFlatten): 201s (8%)
  • Overhead: ~125s (5%)
  • 20 Parquet files on local disk, totalling 6.9 GB
  • 244,800 OC rows (merged from 20 partial batches)
  • 419,763,551 PBP rows written
  • ClickHouse OC storage: 2.68 GB compressed, 19.6 GB uncompressed
  • Peak RAM: 19.3 GB — driven by chunk size (500 worlds × 288 games × ~850 outcomes in memory before flush)

Breakdown for the 1,000-world streaming run (10 chunks × 100 worlds):

  • Total wall time: 262.7s
  • Simulation: 109.3s (42%)
  • OC write to staging: 55.1s (21%)
  • PBP write to Parquet: 54.4s (21%)
  • Merge (arrayFlatten): 16.1s (6%)
  • Overhead: 27.8s (11%)

Linear-scaling check (1K → 5K → 10K):

Component Scale 1K Scale 5K Scale 10K 1K→10K growth
Wall time 262.7s 1,417.3s 2,634.4s 10.0×
Simulation 109.3s 605.0s 1,083.0s 9.9×
OC write 55.1s 252.3s 498.5s 9.0×
PBP write 54.4s 413.4s 725.8s 13.3× (super-linear)
Merge 16.1s 74.6s 201.1s 12.5×
Peak RAM 5.1 GB 18.6 GB 19.3 GB chunk-bounded
OC rows 244,799 244,800 244,800 1.0× (catalogue-bounded)
PBP rows 41,972,008 209,880,838 419,763,551 10.0×
OC compression ratio 6.96× 7.32× +5%

Key takeaways: - Streaming has fixed per-chunk overhead that dominates at very small scale (scale 10 sees +27%), but amortises away as scale grows. By scale 100 streaming already wins. - Streaming removes the memory ceiling entirely. Peak RAM scales with chunk size (and simulation-in-flight state), not with total world count. At scale 10K we could have chosen chunk=100 to keep peak under 5 GB at the cost of 100 chunks instead of 20. - Scaling is essentially linear end-to-end (1K → 5K → 10K is 10×). No hidden quadratic costs in the orchestrator or merge. - PBP write is super-linear (13.3× for 10× rows). Local SSD contention or Parquet row-group overhead is the likely cause. Not algorithmic; should evaporate on faster disk or with chunk-size tuning. - OC compression improves at scale because the Delta+ZSTD codec amortises metadata across longer arrays.

OC-only vs full-pipeline findings (--no-pbp at scale 5K)

Added a --no-pbp CLI flag that skips PlayByPlayRecord conversion, the per-chunk PBP buffer, and the Parquet write entirely. At 5K worlds with chunk=500, the measurement:

Component With PBP (full pipeline) OC-only (--no-pbp) Δ
Total wall time 1,417s (23.6 min) 772s (12.9 min) -46%
Simulation (inner loop) 605s 414s -32%
OC write to staging 252s 237s -6%
PBP write to Parquet 413s 0s -100%
Merge (arrayFlatten) 75s 52s -31%
Peak RAM 18.6 GB 1.74 GB -91%
OC rows (merged) 244,800 244,800
PBP rows 210M 0

What this tells us:

  1. The PBP buffer dominates memory. At chunk=500, we buffer ~18.7M PlayByPlayRecord objects before flushing to Parquet — roughly 17 GB in .NET memory (each record is ~500 bytes due to string fields). The OC accumulator state is <2 GB.
  2. PBP accounts for nearly half the wall time — directly via WriteBatchAsync (29% of total) plus indirectly via Play → PlayByPlayRecord conversion and pbpBuffer.Add calls inside the simulation inner loop (another ~17%).
  3. OC-only is cheap everywhere. 1.74 GB peak at chunk=500 means OC-only could run with chunk=5,000 or higher without blowing memory. 100K worlds OC-only extrapolates to ~4.3 hours with bounded memory.
  4. OC write and merge both got faster without PBP — consistent with less pressure on the ClickHouse container (same CPU, less I/O contention with concurrent PBP writes).

Design implication: the original spec positioned PBP as a "latent path" but the current orchestrator runs it on the critical path alongside OC. Two cleaner options: (a) write PBP to Parquet more frequently within a chunk (e.g., every 50 worlds) to cap the PBP buffer at a small size, or (b) fully decouple — run OC-only as the critical path, and have a separate --pbp-only command that simulates again for PBP when needed.

Parallel-simulation findings (--parallel-worlds 8 at scale 10K, feature since removed)

We added an 8-way parallel simulation path using System.Threading.Channels.Channel<T> — parallel producer threads run SeasonEngine.SimulateSeason, write into a bounded channel, and the main thread serially drains the channel into the (non-thread-safe) accumulator. The ConcurrentBag alternative was evaluated and rejected (would hold the entire chunk's simulation results in memory before drain).

The measurement: 8-way parallelism delivered a ~3% improvement at 10K worlds, not the 3–5× we hoped for.

Component Sequential 10K Parallel=8 10K Δ
Total wall time 2,634s 2,548s -3.3% (noise-level)
Simulation time 1,083s 1,062s -2%
OC write 499s 492s -1%
PBP write 726s 741s +2% (slightly slower)
Merge 201s 136s -32% (faster)
Peak RAM 19.3 GB 18.6 GB similar

Likely root causes (unconfirmed, would need profiling to attribute precisely):

  1. Serial accumulator on the consumer side is the bottleneck. The producer threads fill the channel faster than the single consumer thread can drain it. Producer threads spend most of their time blocked on WriteAsync. Effectively the pipeline runs at the consumer's speed regardless of producer count.
  2. Host CPU contention with Dockerised ClickHouse. The host has 22 logical processors. When 8 sim threads run CPU-bound, ClickHouse (inside Docker Desktop's Linux VM on Windows) competes for host cores. OC/PBP write times stayed flat and PBP write got slightly slower — consistent with ClickHouse being CPU-throttled.
  3. Merge got 32% faster under parallel — consistent with #2. After simulation finished, the host had spare CPU for ClickHouse's merge work.

Net implication: on a single-host, Docker-hosted ClickHouse setup, parallel simulation within the orchestrator is not a performance lever. Parallelism would help more from (a) separating the ClickHouse host, (b) making the accumulator itself thread-safe / partitionable by game_id, or (c) running multiple simulation processes against independent world ranges (distributed workers).

Feature removed from the orchestrator in commit 4794ba36 — the complexity wasn't earning its keep. Will revisit once ClickHouse is off-host or once we have a partitioned accumulator.

Parallel accumulator + parallel OC write + parallel merge (successor attempt — kept)

Supersedes the --parallel-worlds attempt above. The failure mode there was the shared accumulator; once we switched to per-worker shared-nothing accumulators and parallelised the independent write/merge loops, the wins showed up.

Three structural changes, measured independently at 1K on localhost:

  1. Per-worker accumulator + per-worker PBP list (--parallel-workers N). Each worker owns its own AmericanFootballOutcomeAccumulator and List<PlayByPlayRecord>. No lock contention, no shared dictionaries. Worker partials merged at chunk flush.
  2. Parallel per-game OC staging write. The per-chunk loop across 288 games wrapped in Parallel.ForEachAsync with MaxDegreeOfParallelism = N. Each WriteStagingBatchAsync call is independent (different gameId, different ClickHouse partition).
  3. Parallel final per-game merge. Same shape — the MergeStagingToFinalAsync loop across 288 games also wrapped in Parallel.ForEachAsync.

1K OC-only sweep (same session, localhost, --no-pbp):

Config Total Sim OC write Merge Peak RAM
N=1 (all serial) 176.2s 104s 56s 16s 648 MB
N=4 sim-only parallel 151.1s 70s 64s 17s 924 MB
N=4 sim + OC-write parallel 111.2s 73s 22s 16s 1.7 GB
N=4 sim + OC-write + merge parallel 105.9s 73s 25s 7s 1.5 GB
N=8 same config 112.8s 82s 23s 8s 1.5 GB

1K WITH PBP, same session:

Config Total Sim OC write PBP write Merge Peak RAM
N=1 255.8s 124s 58s 57s 16s 6.67 GB
N=4 198.2s 93s 25s 72s 7s 6.78 GB

What this tells us:

  1. Sim phase scales 1.45× at N=4 once per-worker accumulators remove the shared-state contention. Still capped by physical cores (laptop has ~4) + Docker ClickHouse eating cycles. N=8 regresses slightly.
  2. OC staging write is the biggest unlocked gain: 56s → 22–25s at N=4 (2.3× speedup). ClickHouse accepts concurrent inserts against the same staging table without throttling.
  3. Final arrayFlatten merge also parallelises well: 16s → 7s at N=4 (2.1× speedup).
  4. Peak RAM grows ~70–90 MB per extra worker (per-worker accumulator state). Fine at 1K; plan for ~5 GB extra at 5K, ~10 GB extra at 10K — will need to tune chunk size accordingly.
  5. PBP write stays serial and actually regresses +15s at N=4 (GC pressure + Docker CH saturation during concurrent OC writes). Total-wall still -22.5%, but PBP write is now the biggest single-threaded phase in the WITH-PBP path (72s out of 198s = 36%).
  6. Correctness verified — 244,800 merged OC rows and 41.98M PBP rows at every N, matching the sequential baseline.

Scope going forward: simulation itself stays out of scope (prototype code). Accumulator / derivation / write-path parallelism are in scope and have now been landed for OC. The next parallelism lever is the PBP write (one Parquet file per worker per chunk, or fire-and-forget pipelined writes).

Parallelism validated at 5K and 10K — factor improves at scale

Scale N=1 wall N=4 wall Δ Per-world cost (N=4)
1K OC-only 176s 106s -40% 106 ms
5K OC-only 772s 345s -55% 69 ms
10K OC-only 1,565s 724s -54% 72 ms

The 1K win (-40%) was a lower bound. At 5K/10K the factor settles at ~-55%, because fixed per-chunk overhead amortises better when each chunk carries more worlds. N=4 scaling from 5K→10K was 2.1× for 2× worlds (near-linear). 100K OC-only projection on this laptop drops from 4.4h to ~2.0h.

Atomic-DB correctness fix landed. The first 5K N=4 attempt returned 489,599 rows — exactly 2×. Root cause: ClickHouse Atomic databases defer DROP TABLE data cleanup (async, several-minute delay). system.parts WHERE table='X' AND active=1 kept summing old and new parts across consecutive runs in the same container. Fixed by appending SYNC to every DROP TABLE statement in ClickHouseSchemas.cs. Verified with back-to-back runs in the same container — no doubling. The earlier 1K results are unaffected; they ran on containers that were torn down between measurement batches.

HTTP wire-level compression

Scale Raw write (ms) Compressed write (ms) Delta
10 worlds (OC season) 18,708 18,563 -0.8%
100 worlds (OC season) 7,678 7,190 -6.3%
100 worlds (PBP season) 34,368 35,758 +4.0%

Finding: on localhost, enabling HTTP compression is within run-to-run noise. It matters for remote ClickHouse only. On-disk column-level compression (ZSTD via the table DDL) is independent of this flag and achieves the 5.56–6.96× ratio we see in the storage metrics.

What has been validated

  1. ClickHouse Array(Float64) bulk insert + read works correctly in both ClickHouse.Driver and ClickHouse.Client. Primary connectivity goal met.
  2. Compression ratio matches the original design assumption. The spec predicted 5–6× for Float64 simulation data; measured 5.56× at small scale, improving to 6.96× at 1,000 worlds. Ratio is still trending up at scale.
  3. Realistic (basket) read latency is well inside budget. A user's prediction question touches 1–30 outcomes, not the full 850. Measured 8–20 ms for any basket size at any scale tested. Projects to ~30–40 ms at 100K — comfortably inside any interactive target.
  4. Streaming is viable and scales to 10K worlds. The orchestrator chunks simulation into world-ranges, flushes OC partials to a staging table, and merges via arrayFlatten at the end. PBP writes in parallel to Parquet files on disk via a pluggable IBlobSink. The merged OC is shape-equivalent to a single-shot-inserted table (parity test passes). At 1,000 worlds, streaming is 2× faster than the old path. At 5,000 worlds it runs in 23.6 min, at 10,000 worlds in 43.9 min with 19 GB peak RAM; the old path cannot run at either scale on a 32 GB laptop. Scaling 1K → 5K → 10K was essentially linear in wall time.
  5. Parallelism works once the structure is shared-nothing. First attempt (8-way shared-accumulator + Channel<T>) delivered ~3% at 10K and was removed. Successor (per-worker accumulators + parallel per-game OC write + parallel per-game final merge) delivers -40% on OC-only and -22.5% with PBP on at 1K N=4. The serial-accumulator hypothesis was correct; fixing it plus parallelising the two remaining per-game loops produced the win.
  6. PBP writer dominates runtime and memory. At 5K with chunk=500, full pipeline = 1,417s / 18.6 GB; --no-pbp path = 772s / 1.74 GB. The PBP buffer alone is 91% of our peak RAM. OC-only streaming is effectively a solved problem at 100K scale; PBP is the remaining bottleneck.
  7. Write throughput scales linearly. At local scale, PBP sustains 88–122K rows/sec and OC sustains 5–34K rows/sec. A tuned cloud instance with proper compute provisioning would exceed these numbers by 2–5×.
  8. Full-context read latency is uncertain at 100K. Measured 200 ms at 1,000 worlds for all 850 outcomes; honest extrapolation to 100K is 3–15 seconds depending on where the bottleneck shifts. The spec's 5-second target is a real constraint to actually measure, not assumed-met.
  9. Data integrity end-to-end. Every accumulator has been validated against the outcome catalogue (851 IDs, 0 unexpected emissions). Every write is round-tripped in integration tests before the experiment is considered complete.

What has not been validated yet

Item Status What's needed
Remote ClickHouse Cloud as the results store + experiment target Scaffolded, untested A ClickHouse Cloud connection string
REQ-INFRA-2 controlled evaluation (the actual client choice) Smoke-tested only Remote ClickHouse + realistic workload (10K worlds)
10K worlds locally Validated — 43.9 min, 19.3 GB peak RAM, 420M PBP rows, 244K OC rows merged correctly Done
100K worlds (spec stretch target) Unblocked by 10K evidence of linear scaling — but single-process is ~7–8h on a laptop Fast cloud VM OR multi-worker orchestrator
Should-have experiments (concurrency, streaming write, versioning, PBP reads, codec comparison) Implemented + integration-tested, not yet exercised against real-scale data Run the should-have suite against local or cloud ClickHouse
Codec comparison (REQ-INFRA-3) Implemented Run it; will tell us whether Delta+ZSTD is the right column codec or whether LZ4/ZSTD alone suffices
PBP → ClickHouse ingest Out of scope for streaming design — Parquet files land on disk but aren't ingested Separate pipeline when PBP querying is needed; uses INSERT INTO play_by_play SELECT * FROM file(...)

Projections for 100K worlds

Now extrapolated from measured scale-10,000 results (not scale-1K projections). The 1K → 10K run confirmed linear scaling, so 10K → 100K projections are similarly 10× the measured values — with one caveat on PBP write which was super-linear.

Dimension Estimate
Outcome Context compressed ~27 GB (measured scale ratio holds)
Play-by-play compressed ~70 GB (from scale-10K Parquet 6.9 GB × 10)
Total compressed ~97 GB
Total uncompressed (OC) ~196 GB
Total PBP rows ~4.2 billion
Streaming wall time (laptop, linear, N=1 serial) ~7.3 hours
Streaming wall time (laptop, PBP-adjusted for super-linear) ~9–10 hours
Streaming wall time (laptop, N=4 parallel, OC-only, extrapolated from measured 10K=724s) ~2.0 hours
Streaming wall time (laptop, N=4 parallel, full pipeline, 1K factor -22.5%) ~5.5 hours
Streaming wall time (cloud VM with fast SSD) ~1–3 hours
Peak RAM (chunk=500) ~19 GB (unchanged — caps at chunk-size, not world-count)

The spec's worst-case 300 GB / 24+ hours estimate overshot because it assumed 2,000 outcomes per game. We're measuring 850 outcomes per game, so storage is ~33% of the worst-case estimate.

Known constraints before scaling past 10K worlds

  1. Single-host parallelism partially unlocked. Once the accumulator was restructured per-worker (shared-nothing) and the OC write + final merge loops were parallelised, we get -40% on the OC-only path and -22.5% with PBP at 1K N=4. Sim phase itself still caps near the physical-core count (N=8 regresses). The remaining single-threaded bottleneck is the PBP Parquet writer (72s / 36% of 1K-with-PBP total at N=4). Needs re-measurement at 5K/10K to confirm the 1K wins hold at volume.
  2. Simulation wall time is still a major cost at scale — ~41% of total at 10K sequential. Sim is now ~1.45× faster at N=4, so at 10K we'd expect ~12 min off the 44 min total from sim alone before layering OC-write/merge gains on top. Simulation layer itself is prototype code and out of scope for further optimisation.
  3. PBP write is super-linear at larger scales. 1K → 10K showed 13.3× growth for 10× more rows. Likely local-SSD contention or Parquet row-group boundary effects. Re-measure on a machine with a faster disk (or in Cloud) before assuming 100K PBP write will be 10× the 10K value; budget closer to 13–14× to be safe.
  4. Chunk size controls peak RAM, not world count. At chunk=500 we saw 19 GB peak at 10K worlds. Dropping to chunk=200 or chunk=100 cuts peak RAM to ~8 GB or ~4 GB respectively, at the cost of more chunks (slightly higher orchestrator overhead). For 100K on a constrained box, chunk=100 is the right call.
  5. ClickHouse Cloud sizing. For ~97 GB compressed + 4.2 billion rows, we want a mid-tier cluster (4-CPU, 16-GB-RAM replicas) to avoid write throttling. Storage cost is minor (~$3/month at $0.03/GB); compute is the variable cost.
  6. Network bandwidth. ~97 GB compressed over the wire to Cloud = ~13 minutes on a gigabit link, ~2.5 hours on 100 Mbps. Enabling UseCompression=true on the connection string cuts this by ~50% (it matters for remote, unlike localhost).

Immediate next steps — proposed

  1. Stand up a ClickHouse Cloud instance (small tier is fine) and re-run the must-have suite + OC-only streaming at 1K/5K against it. Validates the results-writer path, establishes real remote-write numbers, and unblocks the REQ-INFRA-2 decision. Highest-leverage open item — every other unknown (parallel concurrent-insert behaviour against a production cluster, network bandwidth, latency floor) is gated on this.
  2. Confirm parallelism wins hold at 5K and 10K. The 1K sweep showed -40% OC-only and -22.5% with PBP at N=4. Rerun 5K no-PBP and 10K full-pipeline at N=4 to verify the same factor holds when arrays get longer. Cheap to measure (extra ~30 min laptop time).
  3. Redesign the PBP write path. Two pieces: (a) cap PBP buffer memory via mid-chunk or per-world flushes, (b) parallelise PBP writes — one Parquet file per worker per chunk, or pipelined fire-and-forget. PBP is now the dominant single-threaded phase at N=4 (72s / 36% of 1K-with-PBP wall time).
  4. Run the should-have suite at 1,000 worlds locally. Concurrent reads, versioning, PBP analytical queries, and most importantly the codec comparison (REQ-INFRA-3) — confirms whether Delta+ZSTD is optimal.
  5. Kick off a 100K-world OC-only run on a cloud VM. At ~4.3h projected (without parallelism) or potentially ~2.5h with N=4 parallel, this is now feasible without the PBP memory constraint.
  6. Compile a final report for the storage-decision gate review.

Optimisation analysis — OC critical path

Status: the biggest validated win in this section is the parallelism pass documented above (1K OC-only 176s → 106s, -40% at N=4). The string-caching / integer-key hypotheses below were partially tried and did not materialise — see "Validated results from the first optimisation pass (1K)" at the end. Simulation-layer optimisation is out of scope; accumulator/derivation/staging remain in scope but deprioritised until cloud numbers are in.

With PBP out of the way, the OC-only path at 5K takes 772s. Hot spots originally considered:

Simulation inner loop: 414s (54% of OC-only wall time)

This is SeasonEngine.SimulateSeason + accumulator.AccumulateGame × 5,000 worlds. Per-world cost: ~83 ms. Inside each world's 288 games, the accumulator does:

  • ~37,000 plays (288 games × ~130 plays)
  • Each play flows through 7 play accumulators (TeamScoring, Passing, Rushing, Receiving, Defensive, Kicker, GameFlow)
  • Each accumulator may increment N outcome counters via state.IncrementOutcome($"PASSING_YARDS_GAME_{qbId}", yards)string interpolation + Dictionary hash + Dictionary set on every call

Realistic hot-spot candidates: - String allocation for outcome IDs$"..._{playerId}" on the hot path. For 5K worlds × 37K plays × ~5 increments each = 925M string allocations. Pre-building the ID strings per (player, period, stat) once per game and caching them would eliminate this. - Dictionary hashing of string keys. Internally .NET Dictionary hashes strings via xxhash/marvin — fast but still ~100 ns per op. At 925M ops = 92s just in hashing. - Two-phase iteration in derivers — derivers iterate state.AllOutcomes.Keys with LINQ filters on every world. Converting these to direct lookups against pre-computed player-ID sets would save work.

Rough optimisation budget: if we can shave 30–40% off the per-world simulation cost, we save ~120–160s at 5K. Scales linearly to savings at 100K.

OC write to staging: 237s (31%)

288 games × 10 chunks × ClickHouseBulkCopy per game × ~24,000 rows per game per chunk. At 5K = 2,880 bulk-copies total. Each carries ~4,800 Array(Float64) rows of 500 elements.

Hot-spot candidates: - One bulk copy per game per chunk — 2,880 network round-trips. Could batch multiple games into one bulk copy. - Array(Float64) serialisation overhead in ClickHouse.Driver — each row has a double[500]. Client-side serialisation likely takes non-trivial time. - ClickHouse merge-tree writes create many small parts — each bulk-copy creates a part; 2,880 parts across 288 partitions.

Merge (arrayFlatten): 52s (7%)

10 staging batches × 288 games × arrayFlatten(groupArray((batch_index, values))) query. Probably close to floor given the amount of data being reshaped. Smaller target for optimisation.

Ranked optimisation targets

  1. Cache outcome ID strings per game — likely 20–30% wall-time reduction on simulation. Easy: one-time allocation per game ID at startup, reused across all worlds and plays.
  2. Use integer keys inside GameAccumulationState — assign each outcome ID an int via a lookup table at catalogue-build time. Dictionary is ~3× faster than Dictionary. Harder: requires rewiring all accumulators and derivers.
  3. Partition accumulator state per game. 288 games have independent state. One Dictionary per game instead of one giant dictionary lets accumulators skip all the "which team?" branching and makes per-game flushes zero-copy.
  4. Bulk-copy multiple games per ClickHouse insert. Currently 288 round-trips per chunk. A single bulk copy covering all 288 games' partials per chunk would cut round-trips from 2,880 to 10 at scale 5K.
  5. Profile first. Before any of the above, get a dotTrace / perfview profile of a scale-500 OC-only run. Confirms the hot spots rather than guessing.

Suggested sequence: profile → cache outcome IDs → bulk-copy batching → reassess.

Validated results from the first optimisation pass (1K)

Two commits landed based on the analysis above: a readback fix (21bf4e68) and outcome-ID caching (8849faee). Measured end-to-end at 1K on localhost:

Baseline Optimised Δ
Wall time 189.88s 157.43s -17%
Simulation 79.5s 78.9s -0.8%
OC write (staging) 38.9s 36.6s -5.9%
Merge 10.5s 7.8s -26%

Where the saving actually came from: the readback fix accounts for ~27s of the 32.5s total — the old post-run row-count loop did 288 full-context reads (~1 GB of data) purely to report TotalGameOutcomeRows, and none of that time was attributed to any measured phase. Replacing it with a single system.parts metrics query removed that cost entirely.

The outcome-ID caching contributed the remaining ~5s, which is within run-to-run variance — the predicted 20–30% simulation win did not materialise. Kept because it's a clean refactor and harmless; not counted as a real win.

Scope going forward: the simulation model is prototype code and is not a target for further optimisation per user direction. Future optimisation effort should be directed at accumulation and derivation layers, or at the OC staging write path (bulk-copy batching remains the highest-leverage unvalidated target).

ClickHouse Cloud validation — AU laptop → westus3 → container → 100K

Provisioned ClickHouse Cloud Production Scale (2 replicas × 4 CPU × ~8 GiB) in westus3. Stood up a dedicated Azure Container Apps Job in Sandbox subscription's rg-outcomecontext-experiments to run the experiment from the same region.

Staged optimisations, each measured:

Change Config 1K wall OC write Merge Notes
1. Baseline AU→Cloud laptop N=32 343s 225s 11s 565ms RTT = OC write is ~100% network wait
2. Same-region container (Consumption 4 CPU) N=32 235s 106s 13s Eliminates distance tax; sim now CPU-bound
3. Dedicated D32 workload profile (16 CPU) N=32 141s 98s 10s Sim drops -68% (76s → 25s)
4. Add set_async_insert=1; set_wait_for_async_insert=0 D32, N=32 60s 16s 13s OC write -84%. Server buffers small inserts. Merge sensitive at higher scale.

100K OC-only — the headline run:

Phase Time % of total
Simulation 1,530s (25.5 min) 51%
Merge (288 × arrayFlatten, server-side, N=4 capped) 1,210s (20.2 min) 40%
OC write (2,880 async bulk inserts, N=32) 233s (3.9 min) 8%
Overhead ~18s 1%
Total 2,991s (49.85 min) 100%

Peak client RAM 16.7 GB (Container Apps allocation: 32 GiB). 244,800 merged OC rows. Cost ≈ $0.80 per run.

Incidents worth noting:

  1. First 100K attempt failed at ~35 min in the merge phase with Code: 241. DB::Exception: total memory limit exceeded (would use 14.41 GiB, maximum 14.40 GiB). Cause: 32 parallel arrayFlatten queries each holding ~680 MB working set = ~22 GB server-side, exceeding the Production Scale tier's per-query budget. Fixed in commit e7cc5618 by capping final-merge parallelism at 4 (merge is server-bound, not client-bound). Re-run succeeded.
  2. Stdout lost on fast-exit. Container Apps console-log ingestion dropped the Streaming Run Report when the container exited immediately after printing. Fixed in commit 60611cbc with Console.Out.FlushAsync() + 3s pause before exit.
  3. Cross-run data doubling detected on 2× consecutive runs in the same Docker container during localhost validation. Cause: ClickHouse Atomic DB defers DROP TABLE data cleanup; system.parts summed old + new parts across active=1. Fixed in commit 0af492e1 by appending SYNC to every DROP TABLE statement.

Model calibration: round-trip cost analysis from the AU runs (5,760 staging RTs × 565ms / effective parallelism) predicted OC write within 10% and merge within 15% at every scale. Same-region projection (~70–90s for 1K with "near-zero" RTT) was wrong — residual was server-side ClickHouse overhead (~300ms per bulk insert independent of network), not latency. Learning baked into the new projection basis.

Two merge-phase optimisations tried and neither stuck

After the 100K baseline landed at 49.85 min with merge at 40% of wall time (20 min / 1,210s), two targeted optimisations were attempted. Both failed:

H. OPTIMIZE TABLE game_outcome_context_staging FINAL before the merge loop — theory: async_insert creates many small MergeTree parts; consolidating them before running 288 arrayFlatten queries would reduce per-query scan work. Result: OPTIMIZE TIMED OUT at 100K scale with Code: 236. DB::Exception: Aborting discovery process: Timeout, interrupting no-future-parts discovery. The default server-side OPTIMIZE timeout (~2 min) is not enough time to consolidate 28,800 staging rows across many parts. Raising the timeout would risk OPTIMIZE itself taking longer than the savings. Reverted.

C. Hash-shard each game's merge into 4 sub-range queries — theory: split arrayFlatten per-game work by cityHash64(outcome_id) % 4 = shardIndex, letting us run 16 concurrent queries at ~170 MB each instead of 4 at ~680 MB each. Result: 244,800 rows still correct, but total wall time regressed +13% (2,991s → 3,377s) and merge time regressed +10% (1,210s → 1,328s). Root cause: each sharded query still has to scan the full staging rows for its game (the game_id filter is the only index-eligible predicate; the outcome_id hash filter runs after the row read), so 16 shard queries do ~4× the total scan IO for the same output. Reverted.

Lesson: reducing per-query memory at the cost of redundant scans is a lose even when the memory headroom is real. A genuine sharded-merge approach would need to partition the staging table itself by outcome hash — not worth the complexity at 100K scale where the current 49.85 min is already acceptable. Parking merge-phase work; if sub-30-min is required later, the lever is a bigger Cloud tier rather than smarter SQL.

Full-pipeline (with PBP) validation — 10K works, 100K hits a new ceiling

After the 100K OC-only baseline landed, the next question was whether the orchestrator could ship the full pipeline (OC + PBP) at 100K. The PBP path went through three configurations before stabilising:

Attempt 1 — Parquet on ephemeral container disk. Ran 1K/5K/10K on the laptop this way. Tried 10K on the westus3 Container Apps job (same Parquet writer, same LocalDiskBlobSink). Stalled at 100+ minutes in the PBP write phase with no error, never completed inside the 2-hour replica budget. Ephemeral container storage is not characterised for the per-chunk fsync pattern the Parquet writer uses. Path abandoned on Cloud.

Attempt 2 — PBP direct to ClickHouse. Added ClickHousePlayByPlayBatchWriter: converts PlayByPlayRecordPlayRecord and delegates to IStorageBackend.WritePlayByPlayAsync, reusing the same async-insert bulk-copy path that took OC write from 98s to 16s at 1K. First attempt failed with Code: 252. Too many partitions for single INSERT block (more than 100) — the play_by_play table was PARTITION BY game_id (288/season), exceeding Cloud's locked max_partitions_per_insert_block=100. Second attempt tried to raise the setting via set_max_partitions_per_insert_block=500 in the connection string; Cloud rejects the override with Code: 452. SETTING_CONSTRAINT_VIOLATION. Schema change landed: play_by_play is now PARTITION BY season_id (one partition per season; per-game DROP PARTITION is no longer free but ALTER TABLE ... DELETE WHERE game_id = X covers it via mutation if ever required).

Attempt 3 — PBP direct to ClickHouse with parallel bulk-copies. 1K succeeded in 304s (175K PBP rows/sec). 10K at chunk=500 OOM'd the 32 GiB container; raised to 32 vCPU / 64 GiB and it completed in 18.7 min (513K PBP rows/sec, 2.9× better per-row throughput than 1K as async_insert amortises per-bulk overhead). 100K at chunk=1000 OOM'd again (per-chunk PBP buffer hit ~18.5 GB × parallel slices); switched to chunk=500 (200 chunks) which fits memory but hit the 4-hour replica deadline still in the PBP write phase — super-linear past 10K by at least 1.75×.

Diagnosed root cause for the 100K super-linearity: PARTITION BY season_id means all 4.2B PBP rows at 100K live in a single partition. Cloud MergeTree's background merge pressure on a 4.2B-row partition appears to throttle foreground INSERT throughput in a way it doesn't at 420M rows (10K). The fix is a finer-grained partition scheme (e.g., (season_id, intDiv(world_id, 10000)) gives ~10 partitions per 100K run, still well under the 100-partition-per-insert limit because our chunks are contiguous world ranges that all map to the same world_id/10000 bucket).

Validated ceilings on Production Scale tier: - OC-only: 100K in 49.85 min - Full pipeline: 10K in 18.7 min - Full pipeline at 100K: not validated, hit 4h replica deadline

Paths to 100K full-pipeline (not yet executed): 1. Upgrade Cloud tier — Production (vs Scale) with 4 replicas × 16 CPU would 8× the insert throughput. Cheapest fix in engineering hours, most expensive in monthly spend. 2. Repartition PBP(season_id, intDiv(world_id, 10000)) spreads across partitions, lets background merges parallelise. ~1 hour of code, needs one retry run at 100K to validate (~$4 of compute). 3. Split PBP off the critical path — ship OC-only as primary, run PBP asynchronously in a separate pipeline. Keeps the current validated 100K path; requires product decision on whether PBP is genuinely needed on the same SLA.

Production 3×16 tier — the 100K-in-under-3h milestone

The path-1 upgrade happened: the Cloud cluster was resized to Production with 3 replicas × 16 vCPU × 64 GiB = 48 vCPU / 192 GiB aggregate (6× the CPU and 24× the RAM of Production Scale). Along with the tier change the orchestrator's merge-parallelism cap was raised from 4 to 16 (the original cap was tuned to the ~14 GiB per-query budget on the smaller tier; 64 GiB per replica comfortably handles 16 × ~680 MB concurrent merges).

100K OC-only re-baseline on 3×16 tier:

Phase Prod Scale (2×4) Prod 3×16 Δ
Total wall 2,991s (49.85 min) 1,887s (31.4 min) -37%
Sim 1,530s 1,412s -8% (noise, client-bound)
OC write 233s 328s +40% (async_insert variance)
Merge 1,210s (20 min) 137s (2.3 min) -89%
Peak client RAM 16.7 GB 19.0 GB similar

The 20-min merge phase that had us chasing OPTIMIZE + sharded merges on the smaller tier simply disappeared with more replica RAM. That's where the big cost/time jump is: it's not a software fix, it's a compute-budget fix.

100K full-pipeline (OC + PBP, no season) on 3×16 tier — the first successful 100K full-pipeline run against Cloud: 9,887s (2h 44m). PBP write 7,955s for 4,198,546,157 rows — throughput 528K rows/sec vs 513K at 10K (effectively flat across 10× scale, super-linearity on the smaller tier was entirely a resource-starvation artifact).

Season accumulator refactor — closing the last gap

After the OC-only and no-season full-pipeline numbers landed, one gap remained: season_outcome_context was never exercised by the streaming orchestrator. The season accumulator existed but only ran in the standalone nice-to-have SeasonContextWriteExperiment. Wiring it into the orchestrator was straightforward — but running it at scale surfaced a memory-layout problem:

The original AmericanFootballSeasonAccumulator kept per-world state as Dictionary<int worldIndex, Dictionary<string outcomeId, double>>. Dict overhead + string duplication at 10K worlds was already 27 GB. Linear projection to 100K: ~270 GB. The container OOM'd mid-run at 3h.

Refactor (commit 41ab4c90): switched the inner layout to Dictionary<string outcomeId, double[worldCount]> — one pre-sized array per outcome, indexed by absolute world index. Same shape the game accumulator has used all along. At 100K that's 1,668 outcomes × 100,000 × 8 bytes ≈ 1.3 GB, flat regardless of how many worlds are in flight. Parameterless constructor preserved for tests/benchmarks; orchestrator passes TotalWorldCount so there's no grow-and-copy during accumulation. All 6 existing AmericanFootballSeasonAccumulatorTests pass unchanged.

100K full-pipeline + season — primary result:

Phase Time % of total
Simulation 2,169s (36 min) 20%
PBP write 7,983s (2h 13m) 75%
OC write 387s (6.5 min) 4%
Merge 121s (2 min) 1%
Season write 22.90s <1%
Overhead ~13s <1%
Total wall 10,696s (2h 58m) 100%
Peak client RAM 46.6 GB (vs 46.8 GB on the no-season run — flat)
OC rows 244,800
PBP rows 4,198,981,505
Season rows 1,728

The ~14 min difference vs the no-season run (2h 58m vs 2h 44m) is almost entirely in the sim phase (2,169s vs 1,462s, +48%). Cause: the shared season accumulator is lock-guarded (its inner dicts aren't thread-safe), so every worker pays lock-acquire + release on the sim hot path — 28.8M locked calls at 100K. The fix would be per-worker season accumulators + a merge step, following the same pattern the game accumulator uses. Not done here because +14 min of wall time is well within budget for the gate, but it's the obvious next optimisation if season runs ever move to the critical path.

Validated ceilings on Production 3×16 tier

  • OC-only 100K: 31.4 min
  • Full pipeline (OC + PBP) 100K: 2h 44m
  • Full pipeline + season 100K: 2h 58m — primary "full" result

Artifacts available for review