Storage Experiment — Status Update¶
Date: 2026-04-18 Branch: LBS-1183 Audience: Engineering, product, stakeholders
R&D Tax evidence: a hypothesis-driven write-up of every experiment in this document (with commit SHAs, run-log references, and explicit knowledge-gained statements for both positive and negative results) is held in
rd-findings.md.
TL;DR¶
100K full-pipeline end-to-end validated on ClickHouse Cloud in 2h 58m for ~$27/run. All three outcome tables (game_outcome_context, play_by_play, season_outcome_context) written correctly: 244,800 game-OC rows • 4.2B play-by-play rows • 1,728 season-OC rows. Primary run for the storage-decision gate.
Headline numbers (Production 3×16 Cloud tier, Azure Container Apps D32 32-CPU / 64 GiB):
| Scope | 100K Wall | Cost |
|---|---|---|
| OC-only | 31.4 min | ~$5 |
| Full pipeline (OC + PBP, no season) | 2h 44m | ~$25 |
| Full pipeline + season context | 2h 58m | ~$27 |
What it took to get there — stacked optimisations, each measured:
- Streaming orchestrator with per-worker shared-nothing accumulators — lifted parallelism from a failed ~3% improvement (shared-accumulator
Channel<T>) to -55% wall time at 5K/10K on localhost Docker. Factor improves at scale. - Parallel per-game OC staging writes + parallel final merge — ClickHouse accepts concurrent inserts; cut localhost 100K projection from ~4.4h to ~2h.
- Same-region Azure Container Apps job in
westus3(same region as the Cloud cluster) — removed the 565ms AU→US RTT tax that had a 1K run taking 12 minutes. async_insert=1; wait_for_async_insert=0on the connection string — OC write -86% at 1K by letting CH buffer small inserts server-side.- PBP direct to ClickHouse via
ClickHousePlayByPlayBatchWriter— the ephemeral-container Parquet writer path stalled pathologically at 10K+ (1h 40m and still running); switching PBP to the same bulk-copy path fixed it. Requiredplay_by_playrepartition fromgame_id(288 partitions, rejected by Cloud'smax_partitions_per_insert_block=100) toseason_id(1 partition). - 4-way parallel PBP writes per chunk — per-bulk-copy throughput caps at ~175K rows/sec; splitting into 4 concurrent slices lifted 10K to 513K rows/sec.
- Production Scale (2×4) → Production 3×16 Cloud tier — prior 100K full-pipeline attempt timed out at 4h mid-PBP-write because PBP scaling was super-linear on the smaller cluster; the bigger tier resolves it (near-linear 528K rows/sec at 100K vs 513K at 10K). Merge phase also collapsed from 20 min → 2.3 min on OC-only thanks to the extra replica RAM + merge-parallelism cap raised from 4 to 16.
- Season accumulator refactor — the original
Dictionary<worldIndex, Dictionary<outcomeId, double>>layout projected to ~270 GB at 100K and OOM'd the container. Refactored toDictionary<outcomeId, double[worldCount]>: same observable behaviour (6 existing unit tests pass unchanged), peak RAM for the season accumulator capped at ~1.3 GB regardless of scale.
Remaining bottlenecks at 100K: - Simulation (36 min / 20% of total) — prototype sim code, user-scoped out of optimisation. A future per-worker season-accumulator + merge would remove lock contention from the sim hot path and save ~12 min. - PBP write (2h 13m / 75% of total) — cluster insert throughput is the ceiling. Bigger tier or sharded insert path would be the levers.
Spend shape: idle-pause after 15 min means no ongoing compute cost; every run pays only for its own compute. 100K full-pipeline ≈ ACI $3 + CH Cloud $24.
What has been built¶
Production-quality (will live beyond the experiment):
- Accumulation library (
LBS.Model.AmericanFootball.Accumulation) — converts simulation game results into the Outcome Context shape (~850 stats per game asArray(Float64)keyed by outcome_id) and the Season Outcome Context (standings + player milestones across all worlds). Sport-specific design, cleanly extensible to other sports. - Storage library (
LBS.Model.AmericanFootball.Storage) — NEW. Houses the Parquet writer (moved out of Simulation), theIBlobSinkabstraction with local-disk implementation and Azure Blob stub, and theIPlayByPlayBatchWriterpattern. Simulation project no longer depends onParquet.Net. - Outcome ID catalogue — 851 declared stats per game, 1,668+ stats per season. Covers team scoring, QB / RB / WR / TE / K / D individual stats, TD ordinals, fantasy points, season standings, milestones.
Experiment-grade (disposable):
- Storage backend abstraction (
IStorageBackend) — enables testing the same suite against ClickHouse, ScyllaDB, or any other target with a ~300-line implementation. - ClickHouse backend — production-quality implementation using the official
ClickHouse.Driverpackage, with parameterised SQL throughout, allowlisted DDL identifiers, and Testcontainers-backed integration tests. - Streaming orchestrator — NEW. Chunks simulation into configurable world-ranges, flushes OC partials to a staging table + merges via
arrayFlattenat the end, writes PBP per chunk to Parquet files viaIBlobSink. CLI flags:--streaming,--chunk-size,--pbp-sink-type,--pbp-sink-path. - 19 experiments implemented across the must-have, should-have, and nice-to-have priority bands (18 original + basket-read sweep).
- Results pipeline — every experiment emits structured
ExperimentResultrows that print to console AND can be written to a ClickHouse Cloud table for cross-run analytics. - Docker Compose for local ClickHouse development + a CLI runner for the experiment suite.
Quality gates¶
| Category | Status |
|---|---|
| Unit + integration tests | 109 total, 108 passing, 1 intentional skip |
| Streaming orchestrator integration tests (Testcontainers) | smoke + read-parity |
| Storage layer unit tests | LocalDiskBlobSink + ParquetPlayByPlayBatchWriter round-trips |
Coding standards (TreatWarningsAsErrors, this. prefix, string constants, DateTimeOffset) |
enforced |
| Parameterised SQL / allowlisted DDL | no string interpolation on user-influenced values |
| Code reviews completed | 3 rounds — critical + important issues all addressed |
What has been benchmarked¶
All runs: local Docker (single container, default config, 32 GB host RAM), Release build, no tuning.
Outcome Context (dense columnar, Array(Float64))¶
| Metric | 10 worlds | 100 worlds | 1,000 worlds | 100K projection |
|---|---|---|---|---|
| Full-season write time (old path, 288 games) | 18.7s | 7.2s | 48.6s | ~80 min |
| Sustained write throughput | 12K rows/sec | 34K rows/sec | 5K rows/sec | (array-size-bound) |
| Full-context read p50 (all ~850 outcomes) | 17 ms | 25 ms | 200 ms | 3–15s, uncertain |
| Full-context read p95 | 28 ms | 27 ms | 322 ms | — |
| Basket read mean, 1 outcome | 8.3 ms | 8.6 ms | 12.9 ms | ~30 ms |
| Basket read mean, 5 outcomes | 11.8 ms | 8.5 ms | 18.4 ms | ~40 ms |
| Basket read mean, 10 outcomes | 11.7 ms | 8.3 ms | 19.0 ms | ~40 ms |
| Basket read mean, 20 outcomes | 9.4 ms | 9.3 ms | 20.5 ms | ~40 ms |
| Basket read mean, 30 outcomes | 11.9 ms | 8.8 ms | 20.4 ms | ~40 ms |
| Compression ratio | 5.56× | 6.39× | 6.96× | ~7× |
| Compressed storage (full season) | 4.8 MB | 31.9 MB | 282 MB | ~28 GB |
| Uncompressed storage (full season) | 26.5 MB | 204 MB | 1.97 GB | ~197 GB |
Full-context vs basket reads. The full-context read transfers every outcome for a game (~850 rows × array-of-N-doubles) and is the relevant cost for "load a game into memory for batch bitvector evaluation." A basket read queries only the 1–30 outcomes a user's prediction actually references and costs ~8–20 ms regardless of world count. Different use cases, very different latencies.
Full-context projection at 100K worlds is genuinely uncertain. Going from scale 1,000 → 100,000 is 100× more bytes per read. If the bottleneck stays query-planner-bound the latency holds near flat; if it becomes data-transfer-bound it could grow linearly with payload. Honest range: 3–15 seconds. The 5-second spec target for full-context reads at 100K is a real constraint to measure, not one we can confidently claim we'll meet without running it.
Basket projection at 100K worlds is confident. Extrapolating the near-flat 100 → 1,000 trend, basket reads of 1–30 outcomes should land at ~30–40 ms even at 100K scale. Well inside any reasonable interactive budget.
Play-by-Play (sparse event-level)¶
| Metric | 10 worlds | 100 worlds | 1,000 worlds | 100K projection |
|---|---|---|---|---|
| Total rows | 419K | 4.2M | 42M | 3.7 billion |
| Full-season write time (old path) | 5.1s | 34s | 7.9 min | ~13 hours (serial) |
| Sustained write throughput | 82K rows/sec | 122K rows/sec | 88K rows/sec | (write-bound) |
| Compression ratio | 2.01× | 2.30× | 2.31× | ~2.3× |
| Compressed storage (full season) | 10.5 MB | 89 MB | 882 MB | ~88 GB |
| Uncompressed storage (full season) | 21 MB | 205 MB | 2.04 GB | ~204 GB |
Streaming orchestrator (new path) vs old in-memory path¶
| Scale | Chunk | Mode | Old path wall time | New path wall time | Δ | New path peak RAM |
|---|---|---|---|---|---|---|
| 10 | 10 | sequential | 23.8s | 30.2s | +27% (slower) | 594 MB |
| 100 | 100 | sequential | 41.6s | 38.0s | -9% (faster) | 4.2 GB |
| 1,000 | 100 | sequential | ~10 min | 4.4 min | -56% (2× faster) | 5.1 GB |
| 5,000 | 500 | sequential | (cannot run) | 23.6 min | unreachable vs measurable | 18.6 GB |
| 5,000 | 500 | OC-only (--no-pbp) |
(cannot run) | 12.9 min | -46% vs 5K full pipeline | 1.74 GB |
| 10,000 | 500 | sequential | (cannot run) | 43.9 min | unreachable vs measurable | 19.3 GB |
| 10,000 | 500 | 8-way parallel (removed) | (cannot run) | 42.5 min | -3.3% vs 10K sequential | 18.6 GB |
Breakdown for the 10,000-world streaming run (20 chunks × 500 worlds):
- Total wall time: 2,634.4s (43.9 min)
- Simulation: 1,083s (41%)
- OC write to staging: 499s (19%)
- PBP write to Parquet: 726s (28%)
- Merge (arrayFlatten): 201s (8%)
- Overhead: ~125s (5%)
- 20 Parquet files on local disk, totalling 6.9 GB
- 244,800 OC rows (merged from 20 partial batches)
- 419,763,551 PBP rows written
- ClickHouse OC storage: 2.68 GB compressed, 19.6 GB uncompressed
- Peak RAM: 19.3 GB — driven by chunk size (500 worlds × 288 games × ~850 outcomes in memory before flush)
Breakdown for the 1,000-world streaming run (10 chunks × 100 worlds):
- Total wall time: 262.7s
- Simulation: 109.3s (42%)
- OC write to staging: 55.1s (21%)
- PBP write to Parquet: 54.4s (21%)
- Merge (arrayFlatten): 16.1s (6%)
- Overhead: 27.8s (11%)
Linear-scaling check (1K → 5K → 10K):
| Component | Scale 1K | Scale 5K | Scale 10K | 1K→10K growth |
|---|---|---|---|---|
| Wall time | 262.7s | 1,417.3s | 2,634.4s | 10.0× |
| Simulation | 109.3s | 605.0s | 1,083.0s | 9.9× |
| OC write | 55.1s | 252.3s | 498.5s | 9.0× |
| PBP write | 54.4s | 413.4s | 725.8s | 13.3× (super-linear) |
| Merge | 16.1s | 74.6s | 201.1s | 12.5× |
| Peak RAM | 5.1 GB | 18.6 GB | 19.3 GB | chunk-bounded |
| OC rows | 244,799 | 244,800 | 244,800 | 1.0× (catalogue-bounded) |
| PBP rows | 41,972,008 | 209,880,838 | 419,763,551 | 10.0× |
| OC compression ratio | 6.96× | — | 7.32× | +5% |
Key takeaways: - Streaming has fixed per-chunk overhead that dominates at very small scale (scale 10 sees +27%), but amortises away as scale grows. By scale 100 streaming already wins. - Streaming removes the memory ceiling entirely. Peak RAM scales with chunk size (and simulation-in-flight state), not with total world count. At scale 10K we could have chosen chunk=100 to keep peak under 5 GB at the cost of 100 chunks instead of 20. - Scaling is essentially linear end-to-end (1K → 5K → 10K is 10×). No hidden quadratic costs in the orchestrator or merge. - PBP write is super-linear (13.3× for 10× rows). Local SSD contention or Parquet row-group overhead is the likely cause. Not algorithmic; should evaporate on faster disk or with chunk-size tuning. - OC compression improves at scale because the Delta+ZSTD codec amortises metadata across longer arrays.
OC-only vs full-pipeline findings (--no-pbp at scale 5K)¶
Added a --no-pbp CLI flag that skips PlayByPlayRecord conversion, the per-chunk PBP buffer, and the Parquet write entirely. At 5K worlds with chunk=500, the measurement:
| Component | With PBP (full pipeline) | OC-only (--no-pbp) |
Δ |
|---|---|---|---|
| Total wall time | 1,417s (23.6 min) | 772s (12.9 min) | -46% |
| Simulation (inner loop) | 605s | 414s | -32% |
| OC write to staging | 252s | 237s | -6% |
| PBP write to Parquet | 413s | 0s | -100% |
| Merge (arrayFlatten) | 75s | 52s | -31% |
| Peak RAM | 18.6 GB | 1.74 GB | -91% |
| OC rows (merged) | 244,800 | 244,800 | — |
| PBP rows | 210M | 0 | — |
What this tells us:
- The PBP buffer dominates memory. At chunk=500, we buffer ~18.7M
PlayByPlayRecordobjects before flushing to Parquet — roughly 17 GB in .NET memory (each record is ~500 bytes due to string fields). The OC accumulator state is <2 GB. - PBP accounts for nearly half the wall time — directly via
WriteBatchAsync(29% of total) plus indirectly viaPlay → PlayByPlayRecordconversion andpbpBuffer.Addcalls inside the simulation inner loop (another ~17%). - OC-only is cheap everywhere. 1.74 GB peak at chunk=500 means OC-only could run with chunk=5,000 or higher without blowing memory. 100K worlds OC-only extrapolates to ~4.3 hours with bounded memory.
- OC write and merge both got faster without PBP — consistent with less pressure on the ClickHouse container (same CPU, less I/O contention with concurrent PBP writes).
Design implication: the original spec positioned PBP as a "latent path" but the current orchestrator runs it on the critical path alongside OC. Two cleaner options: (a) write PBP to Parquet more frequently within a chunk (e.g., every 50 worlds) to cap the PBP buffer at a small size, or (b) fully decouple — run OC-only as the critical path, and have a separate --pbp-only command that simulates again for PBP when needed.
Parallel-simulation findings (--parallel-worlds 8 at scale 10K, feature since removed)¶
We added an 8-way parallel simulation path using System.Threading.Channels.Channel<T> — parallel producer threads run SeasonEngine.SimulateSeason, write into a bounded channel, and the main thread serially drains the channel into the (non-thread-safe) accumulator. The ConcurrentBag alternative was evaluated and rejected (would hold the entire chunk's simulation results in memory before drain).
The measurement: 8-way parallelism delivered a ~3% improvement at 10K worlds, not the 3–5× we hoped for.
| Component | Sequential 10K | Parallel=8 10K | Δ |
|---|---|---|---|
| Total wall time | 2,634s | 2,548s | -3.3% (noise-level) |
| Simulation time | 1,083s | 1,062s | -2% |
| OC write | 499s | 492s | -1% |
| PBP write | 726s | 741s | +2% (slightly slower) |
| Merge | 201s | 136s | -32% (faster) |
| Peak RAM | 19.3 GB | 18.6 GB | similar |
Likely root causes (unconfirmed, would need profiling to attribute precisely):
- Serial accumulator on the consumer side is the bottleneck. The producer threads fill the channel faster than the single consumer thread can drain it. Producer threads spend most of their time blocked on
WriteAsync. Effectively the pipeline runs at the consumer's speed regardless of producer count. - Host CPU contention with Dockerised ClickHouse. The host has 22 logical processors. When 8 sim threads run CPU-bound, ClickHouse (inside Docker Desktop's Linux VM on Windows) competes for host cores. OC/PBP write times stayed flat and PBP write got slightly slower — consistent with ClickHouse being CPU-throttled.
- Merge got 32% faster under parallel — consistent with #2. After simulation finished, the host had spare CPU for ClickHouse's merge work.
Net implication: on a single-host, Docker-hosted ClickHouse setup, parallel simulation within the orchestrator is not a performance lever. Parallelism would help more from (a) separating the ClickHouse host, (b) making the accumulator itself thread-safe / partitionable by game_id, or (c) running multiple simulation processes against independent world ranges (distributed workers).
Feature removed from the orchestrator in commit 4794ba36 — the complexity wasn't earning its keep. Will revisit once ClickHouse is off-host or once we have a partitioned accumulator.
Parallel accumulator + parallel OC write + parallel merge (successor attempt — kept)¶
Supersedes the
--parallel-worldsattempt above. The failure mode there was the shared accumulator; once we switched to per-worker shared-nothing accumulators and parallelised the independent write/merge loops, the wins showed up.
Three structural changes, measured independently at 1K on localhost:
- Per-worker accumulator + per-worker PBP list (
--parallel-workers N). Each worker owns its ownAmericanFootballOutcomeAccumulatorandList<PlayByPlayRecord>. No lock contention, no shared dictionaries. Worker partials merged at chunk flush. - Parallel per-game OC staging write. The per-chunk loop across 288 games wrapped in
Parallel.ForEachAsyncwith MaxDegreeOfParallelism = N. EachWriteStagingBatchAsynccall is independent (different gameId, different ClickHouse partition). - Parallel final per-game merge. Same shape — the
MergeStagingToFinalAsyncloop across 288 games also wrapped inParallel.ForEachAsync.
1K OC-only sweep (same session, localhost, --no-pbp):
| Config | Total | Sim | OC write | Merge | Peak RAM |
|---|---|---|---|---|---|
| N=1 (all serial) | 176.2s | 104s | 56s | 16s | 648 MB |
| N=4 sim-only parallel | 151.1s | 70s | 64s | 17s | 924 MB |
| N=4 sim + OC-write parallel | 111.2s | 73s | 22s | 16s | 1.7 GB |
| N=4 sim + OC-write + merge parallel | 105.9s | 73s | 25s | 7s | 1.5 GB |
| N=8 same config | 112.8s | 82s | 23s | 8s | 1.5 GB |
1K WITH PBP, same session:
| Config | Total | Sim | OC write | PBP write | Merge | Peak RAM |
|---|---|---|---|---|---|---|
| N=1 | 255.8s | 124s | 58s | 57s | 16s | 6.67 GB |
| N=4 | 198.2s | 93s | 25s | 72s | 7s | 6.78 GB |
What this tells us:
- Sim phase scales 1.45× at N=4 once per-worker accumulators remove the shared-state contention. Still capped by physical cores (laptop has ~4) + Docker ClickHouse eating cycles. N=8 regresses slightly.
- OC staging write is the biggest unlocked gain: 56s → 22–25s at N=4 (2.3× speedup). ClickHouse accepts concurrent inserts against the same staging table without throttling.
- Final arrayFlatten merge also parallelises well: 16s → 7s at N=4 (2.1× speedup).
- Peak RAM grows ~70–90 MB per extra worker (per-worker accumulator state). Fine at 1K; plan for ~5 GB extra at 5K, ~10 GB extra at 10K — will need to tune chunk size accordingly.
- PBP write stays serial and actually regresses +15s at N=4 (GC pressure + Docker CH saturation during concurrent OC writes). Total-wall still -22.5%, but PBP write is now the biggest single-threaded phase in the WITH-PBP path (72s out of 198s = 36%).
- Correctness verified — 244,800 merged OC rows and 41.98M PBP rows at every N, matching the sequential baseline.
Scope going forward: simulation itself stays out of scope (prototype code). Accumulator / derivation / write-path parallelism are in scope and have now been landed for OC. The next parallelism lever is the PBP write (one Parquet file per worker per chunk, or fire-and-forget pipelined writes).
Parallelism validated at 5K and 10K — factor improves at scale¶
| Scale | N=1 wall | N=4 wall | Δ | Per-world cost (N=4) |
|---|---|---|---|---|
| 1K OC-only | 176s | 106s | -40% | 106 ms |
| 5K OC-only | 772s | 345s | -55% | 69 ms |
| 10K OC-only | 1,565s | 724s | -54% | 72 ms |
The 1K win (-40%) was a lower bound. At 5K/10K the factor settles at ~-55%, because fixed per-chunk overhead amortises better when each chunk carries more worlds. N=4 scaling from 5K→10K was 2.1× for 2× worlds (near-linear). 100K OC-only projection on this laptop drops from 4.4h to ~2.0h.
Atomic-DB correctness fix landed. The first 5K N=4 attempt returned 489,599 rows — exactly 2×. Root cause: ClickHouse Atomic databases defer DROP TABLE data cleanup (async, several-minute delay). system.parts WHERE table='X' AND active=1 kept summing old and new parts across consecutive runs in the same container. Fixed by appending SYNC to every DROP TABLE statement in ClickHouseSchemas.cs. Verified with back-to-back runs in the same container — no doubling. The earlier 1K results are unaffected; they ran on containers that were torn down between measurement batches.
HTTP wire-level compression¶
| Scale | Raw write (ms) | Compressed write (ms) | Delta |
|---|---|---|---|
| 10 worlds (OC season) | 18,708 | 18,563 | -0.8% |
| 100 worlds (OC season) | 7,678 | 7,190 | -6.3% |
| 100 worlds (PBP season) | 34,368 | 35,758 | +4.0% |
Finding: on localhost, enabling HTTP compression is within run-to-run noise. It matters for remote ClickHouse only. On-disk column-level compression (ZSTD via the table DDL) is independent of this flag and achieves the 5.56–6.96× ratio we see in the storage metrics.
What has been validated¶
- ClickHouse
Array(Float64)bulk insert + read works correctly in bothClickHouse.DriverandClickHouse.Client. Primary connectivity goal met. - Compression ratio matches the original design assumption. The spec predicted 5–6× for Float64 simulation data; measured 5.56× at small scale, improving to 6.96× at 1,000 worlds. Ratio is still trending up at scale.
- Realistic (basket) read latency is well inside budget. A user's prediction question touches 1–30 outcomes, not the full 850. Measured 8–20 ms for any basket size at any scale tested. Projects to ~30–40 ms at 100K — comfortably inside any interactive target.
- Streaming is viable and scales to 10K worlds. The orchestrator chunks simulation into world-ranges, flushes OC partials to a staging table, and merges via
arrayFlattenat the end. PBP writes in parallel to Parquet files on disk via a pluggableIBlobSink. The merged OC is shape-equivalent to a single-shot-inserted table (parity test passes). At 1,000 worlds, streaming is 2× faster than the old path. At 5,000 worlds it runs in 23.6 min, at 10,000 worlds in 43.9 min with 19 GB peak RAM; the old path cannot run at either scale on a 32 GB laptop. Scaling 1K → 5K → 10K was essentially linear in wall time. - Parallelism works once the structure is shared-nothing. First attempt (8-way shared-accumulator +
Channel<T>) delivered ~3% at 10K and was removed. Successor (per-worker accumulators + parallel per-game OC write + parallel per-game final merge) delivers -40% on OC-only and -22.5% with PBP on at 1K N=4. The serial-accumulator hypothesis was correct; fixing it plus parallelising the two remaining per-game loops produced the win. - PBP writer dominates runtime and memory. At 5K with chunk=500, full pipeline = 1,417s / 18.6 GB;
--no-pbppath = 772s / 1.74 GB. The PBP buffer alone is 91% of our peak RAM. OC-only streaming is effectively a solved problem at 100K scale; PBP is the remaining bottleneck. - Write throughput scales linearly. At local scale, PBP sustains 88–122K rows/sec and OC sustains 5–34K rows/sec. A tuned cloud instance with proper compute provisioning would exceed these numbers by 2–5×.
- Full-context read latency is uncertain at 100K. Measured 200 ms at 1,000 worlds for all 850 outcomes; honest extrapolation to 100K is 3–15 seconds depending on where the bottleneck shifts. The spec's 5-second target is a real constraint to actually measure, not assumed-met.
- Data integrity end-to-end. Every accumulator has been validated against the outcome catalogue (851 IDs, 0 unexpected emissions). Every write is round-tripped in integration tests before the experiment is considered complete.
What has not been validated yet¶
| Item | Status | What's needed |
|---|---|---|
| Remote ClickHouse Cloud as the results store + experiment target | Scaffolded, untested | A ClickHouse Cloud connection string |
| REQ-INFRA-2 controlled evaluation (the actual client choice) | Smoke-tested only | Remote ClickHouse + realistic workload (10K worlds) |
| 10K worlds locally | Validated — 43.9 min, 19.3 GB peak RAM, 420M PBP rows, 244K OC rows merged correctly | Done |
| 100K worlds (spec stretch target) | Unblocked by 10K evidence of linear scaling — but single-process is ~7–8h on a laptop | Fast cloud VM OR multi-worker orchestrator |
| Should-have experiments (concurrency, streaming write, versioning, PBP reads, codec comparison) | Implemented + integration-tested, not yet exercised against real-scale data | Run the should-have suite against local or cloud ClickHouse |
| Codec comparison (REQ-INFRA-3) | Implemented | Run it; will tell us whether Delta+ZSTD is the right column codec or whether LZ4/ZSTD alone suffices |
| PBP → ClickHouse ingest | Out of scope for streaming design — Parquet files land on disk but aren't ingested | Separate pipeline when PBP querying is needed; uses INSERT INTO play_by_play SELECT * FROM file(...) |
Projections for 100K worlds¶
Now extrapolated from measured scale-10,000 results (not scale-1K projections). The 1K → 10K run confirmed linear scaling, so 10K → 100K projections are similarly 10× the measured values — with one caveat on PBP write which was super-linear.
| Dimension | Estimate |
|---|---|
| Outcome Context compressed | ~27 GB (measured scale ratio holds) |
| Play-by-play compressed | ~70 GB (from scale-10K Parquet 6.9 GB × 10) |
| Total compressed | ~97 GB |
| Total uncompressed (OC) | ~196 GB |
| Total PBP rows | ~4.2 billion |
| Streaming wall time (laptop, linear, N=1 serial) | ~7.3 hours |
| Streaming wall time (laptop, PBP-adjusted for super-linear) | ~9–10 hours |
| Streaming wall time (laptop, N=4 parallel, OC-only, extrapolated from measured 10K=724s) | ~2.0 hours |
| Streaming wall time (laptop, N=4 parallel, full pipeline, 1K factor -22.5%) | ~5.5 hours |
| Streaming wall time (cloud VM with fast SSD) | ~1–3 hours |
| Peak RAM (chunk=500) | ~19 GB (unchanged — caps at chunk-size, not world-count) |
The spec's worst-case 300 GB / 24+ hours estimate overshot because it assumed 2,000 outcomes per game. We're measuring 850 outcomes per game, so storage is ~33% of the worst-case estimate.
Known constraints before scaling past 10K worlds¶
- Single-host parallelism partially unlocked. Once the accumulator was restructured per-worker (shared-nothing) and the OC write + final merge loops were parallelised, we get -40% on the OC-only path and -22.5% with PBP at 1K N=4. Sim phase itself still caps near the physical-core count (N=8 regresses). The remaining single-threaded bottleneck is the PBP Parquet writer (72s / 36% of 1K-with-PBP total at N=4). Needs re-measurement at 5K/10K to confirm the 1K wins hold at volume.
- Simulation wall time is still a major cost at scale — ~41% of total at 10K sequential. Sim is now ~1.45× faster at N=4, so at 10K we'd expect ~12 min off the 44 min total from sim alone before layering OC-write/merge gains on top. Simulation layer itself is prototype code and out of scope for further optimisation.
- PBP write is super-linear at larger scales. 1K → 10K showed 13.3× growth for 10× more rows. Likely local-SSD contention or Parquet row-group boundary effects. Re-measure on a machine with a faster disk (or in Cloud) before assuming 100K PBP write will be 10× the 10K value; budget closer to 13–14× to be safe.
- Chunk size controls peak RAM, not world count. At chunk=500 we saw 19 GB peak at 10K worlds. Dropping to chunk=200 or chunk=100 cuts peak RAM to ~8 GB or ~4 GB respectively, at the cost of more chunks (slightly higher orchestrator overhead). For 100K on a constrained box, chunk=100 is the right call.
- ClickHouse Cloud sizing. For ~97 GB compressed + 4.2 billion rows, we want a mid-tier cluster (4-CPU, 16-GB-RAM replicas) to avoid write throttling. Storage cost is minor (~$3/month at $0.03/GB); compute is the variable cost.
- Network bandwidth. ~97 GB compressed over the wire to Cloud = ~13 minutes on a gigabit link, ~2.5 hours on 100 Mbps. Enabling
UseCompression=trueon the connection string cuts this by ~50% (it matters for remote, unlike localhost).
Immediate next steps — proposed¶
- Stand up a ClickHouse Cloud instance (small tier is fine) and re-run the must-have suite + OC-only streaming at 1K/5K against it. Validates the results-writer path, establishes real remote-write numbers, and unblocks the REQ-INFRA-2 decision. Highest-leverage open item — every other unknown (parallel concurrent-insert behaviour against a production cluster, network bandwidth, latency floor) is gated on this.
- Confirm parallelism wins hold at 5K and 10K. The 1K sweep showed -40% OC-only and -22.5% with PBP at N=4. Rerun 5K no-PBP and 10K full-pipeline at N=4 to verify the same factor holds when arrays get longer. Cheap to measure (extra ~30 min laptop time).
- Redesign the PBP write path. Two pieces: (a) cap PBP buffer memory via mid-chunk or per-world flushes, (b) parallelise PBP writes — one Parquet file per worker per chunk, or pipelined fire-and-forget. PBP is now the dominant single-threaded phase at N=4 (72s / 36% of 1K-with-PBP wall time).
- Run the should-have suite at 1,000 worlds locally. Concurrent reads, versioning, PBP analytical queries, and most importantly the codec comparison (REQ-INFRA-3) — confirms whether
Delta+ZSTDis optimal. - Kick off a 100K-world OC-only run on a cloud VM. At ~4.3h projected (without parallelism) or potentially ~2.5h with N=4 parallel, this is now feasible without the PBP memory constraint.
- Compile a final report for the storage-decision gate review.
Optimisation analysis — OC critical path¶
Status: the biggest validated win in this section is the parallelism pass documented above (1K OC-only 176s → 106s, -40% at N=4). The string-caching / integer-key hypotheses below were partially tried and did not materialise — see "Validated results from the first optimisation pass (1K)" at the end. Simulation-layer optimisation is out of scope; accumulator/derivation/staging remain in scope but deprioritised until cloud numbers are in.
With PBP out of the way, the OC-only path at 5K takes 772s. Hot spots originally considered:
Simulation inner loop: 414s (54% of OC-only wall time)¶
This is SeasonEngine.SimulateSeason + accumulator.AccumulateGame × 5,000 worlds. Per-world cost: ~83 ms. Inside each world's 288 games, the accumulator does:
- ~37,000 plays (288 games × ~130 plays)
- Each play flows through 7 play accumulators (TeamScoring, Passing, Rushing, Receiving, Defensive, Kicker, GameFlow)
- Each accumulator may increment N outcome counters via
state.IncrementOutcome($"PASSING_YARDS_GAME_{qbId}", yards)— string interpolation + Dictionary hash + Dictionary set on every call
Realistic hot-spot candidates:
- String allocation for outcome IDs — $"..._{playerId}" on the hot path. For 5K worlds × 37K plays × ~5 increments each = 925M string allocations. Pre-building the ID strings per (player, period, stat) once per game and caching them would eliminate this.
- Dictionary hashing of string keys. Internally .NET Dictionary hashes strings via xxhash/marvin — fast but still ~100 ns per op. At 925M ops = 92s just in hashing.
- Two-phase iteration in derivers — derivers iterate state.AllOutcomes.Keys with LINQ filters on every world. Converting these to direct lookups against pre-computed player-ID sets would save work.
Rough optimisation budget: if we can shave 30–40% off the per-world simulation cost, we save ~120–160s at 5K. Scales linearly to savings at 100K.
OC write to staging: 237s (31%)¶
288 games × 10 chunks × ClickHouseBulkCopy per game × ~24,000 rows per game per chunk. At 5K = 2,880 bulk-copies total. Each carries ~4,800 Array(Float64) rows of 500 elements.
Hot-spot candidates:
- One bulk copy per game per chunk — 2,880 network round-trips. Could batch multiple games into one bulk copy.
- Array(Float64) serialisation overhead in ClickHouse.Driver — each row has a double[500]. Client-side serialisation likely takes non-trivial time.
- ClickHouse merge-tree writes create many small parts — each bulk-copy creates a part; 2,880 parts across 288 partitions.
Merge (arrayFlatten): 52s (7%)¶
10 staging batches × 288 games × arrayFlatten(groupArray((batch_index, values))) query. Probably close to floor given the amount of data being reshaped. Smaller target for optimisation.
Ranked optimisation targets¶
- Cache outcome ID strings per game — likely 20–30% wall-time reduction on simulation. Easy: one-time allocation per game ID at startup, reused across all worlds and plays.
- Use integer keys inside
GameAccumulationState— assign each outcome ID anintvia a lookup table at catalogue-build time. Dictionaryis ~3× faster than Dictionary . Harder: requires rewiring all accumulators and derivers. - Partition accumulator state per game. 288 games have independent state. One
Dictionaryper game instead of one giant dictionary lets accumulators skip all the "which team?" branching and makes per-game flushes zero-copy. - Bulk-copy multiple games per ClickHouse insert. Currently 288 round-trips per chunk. A single bulk copy covering all 288 games' partials per chunk would cut round-trips from 2,880 to 10 at scale 5K.
- Profile first. Before any of the above, get a dotTrace / perfview profile of a scale-500 OC-only run. Confirms the hot spots rather than guessing.
Suggested sequence: profile → cache outcome IDs → bulk-copy batching → reassess.
Validated results from the first optimisation pass (1K)¶
Two commits landed based on the analysis above: a readback fix (21bf4e68) and outcome-ID caching (8849faee). Measured end-to-end at 1K on localhost:
| Baseline | Optimised | Δ | |
|---|---|---|---|
| Wall time | 189.88s | 157.43s | -17% |
| Simulation | 79.5s | 78.9s | -0.8% |
| OC write (staging) | 38.9s | 36.6s | -5.9% |
| Merge | 10.5s | 7.8s | -26% |
Where the saving actually came from: the readback fix accounts for ~27s of the 32.5s total — the old post-run row-count loop did 288 full-context reads (~1 GB of data) purely to report TotalGameOutcomeRows, and none of that time was attributed to any measured phase. Replacing it with a single system.parts metrics query removed that cost entirely.
The outcome-ID caching contributed the remaining ~5s, which is within run-to-run variance — the predicted 20–30% simulation win did not materialise. Kept because it's a clean refactor and harmless; not counted as a real win.
Scope going forward: the simulation model is prototype code and is not a target for further optimisation per user direction. Future optimisation effort should be directed at accumulation and derivation layers, or at the OC staging write path (bulk-copy batching remains the highest-leverage unvalidated target).
ClickHouse Cloud validation — AU laptop → westus3 → container → 100K¶
Provisioned ClickHouse Cloud Production Scale (2 replicas × 4 CPU × ~8 GiB) in westus3. Stood up a dedicated Azure Container Apps Job in Sandbox subscription's rg-outcomecontext-experiments to run the experiment from the same region.
Staged optimisations, each measured:
| Change | Config | 1K wall | OC write | Merge | Notes |
|---|---|---|---|---|---|
| 1. Baseline AU→Cloud | laptop N=32 | 343s | 225s | 11s | 565ms RTT = OC write is ~100% network wait |
| 2. Same-region container (Consumption 4 CPU) | N=32 | 235s | 106s | 13s | Eliminates distance tax; sim now CPU-bound |
| 3. Dedicated D32 workload profile (16 CPU) | N=32 | 141s | 98s | 10s | Sim drops -68% (76s → 25s) |
4. Add set_async_insert=1; set_wait_for_async_insert=0 |
D32, N=32 | 60s | 16s | 13s | OC write -84%. Server buffers small inserts. Merge sensitive at higher scale. |
100K OC-only — the headline run:
| Phase | Time | % of total |
|---|---|---|
| Simulation | 1,530s (25.5 min) | 51% |
| Merge (288 × arrayFlatten, server-side, N=4 capped) | 1,210s (20.2 min) | 40% |
| OC write (2,880 async bulk inserts, N=32) | 233s (3.9 min) | 8% |
| Overhead | ~18s | 1% |
| Total | 2,991s (49.85 min) | 100% |
Peak client RAM 16.7 GB (Container Apps allocation: 32 GiB). 244,800 merged OC rows. Cost ≈ $0.80 per run.
Incidents worth noting:
- First 100K attempt failed at ~35 min in the merge phase with
Code: 241. DB::Exception: total memory limit exceeded (would use 14.41 GiB, maximum 14.40 GiB). Cause: 32 parallelarrayFlattenqueries each holding ~680 MB working set = ~22 GB server-side, exceeding the Production Scale tier's per-query budget. Fixed in commite7cc5618by capping final-merge parallelism at 4 (merge is server-bound, not client-bound). Re-run succeeded. - Stdout lost on fast-exit. Container Apps console-log ingestion dropped the Streaming Run Report when the container exited immediately after printing. Fixed in commit
60611cbcwithConsole.Out.FlushAsync()+ 3s pause before exit. - Cross-run data doubling detected on 2× consecutive runs in the same Docker container during localhost validation. Cause: ClickHouse Atomic DB defers DROP TABLE data cleanup;
system.partssummed old + new parts acrossactive=1. Fixed in commit0af492e1by appendingSYNCto every DROP TABLE statement.
Model calibration: round-trip cost analysis from the AU runs (5,760 staging RTs × 565ms / effective parallelism) predicted OC write within 10% and merge within 15% at every scale. Same-region projection (~70–90s for 1K with "near-zero" RTT) was wrong — residual was server-side ClickHouse overhead (~300ms per bulk insert independent of network), not latency. Learning baked into the new projection basis.
Two merge-phase optimisations tried and neither stuck¶
After the 100K baseline landed at 49.85 min with merge at 40% of wall time (20 min / 1,210s), two targeted optimisations were attempted. Both failed:
H. OPTIMIZE TABLE game_outcome_context_staging FINAL before the merge loop — theory: async_insert creates many small MergeTree parts; consolidating them before running 288 arrayFlatten queries would reduce per-query scan work. Result: OPTIMIZE TIMED OUT at 100K scale with Code: 236. DB::Exception: Aborting discovery process: Timeout, interrupting no-future-parts discovery. The default server-side OPTIMIZE timeout (~2 min) is not enough time to consolidate 28,800 staging rows across many parts. Raising the timeout would risk OPTIMIZE itself taking longer than the savings. Reverted.
C. Hash-shard each game's merge into 4 sub-range queries — theory: split arrayFlatten per-game work by cityHash64(outcome_id) % 4 = shardIndex, letting us run 16 concurrent queries at ~170 MB each instead of 4 at ~680 MB each. Result: 244,800 rows still correct, but total wall time regressed +13% (2,991s → 3,377s) and merge time regressed +10% (1,210s → 1,328s). Root cause: each sharded query still has to scan the full staging rows for its game (the game_id filter is the only index-eligible predicate; the outcome_id hash filter runs after the row read), so 16 shard queries do ~4× the total scan IO for the same output. Reverted.
Lesson: reducing per-query memory at the cost of redundant scans is a lose even when the memory headroom is real. A genuine sharded-merge approach would need to partition the staging table itself by outcome hash — not worth the complexity at 100K scale where the current 49.85 min is already acceptable. Parking merge-phase work; if sub-30-min is required later, the lever is a bigger Cloud tier rather than smarter SQL.
Full-pipeline (with PBP) validation — 10K works, 100K hits a new ceiling¶
After the 100K OC-only baseline landed, the next question was whether the orchestrator could ship the full pipeline (OC + PBP) at 100K. The PBP path went through three configurations before stabilising:
Attempt 1 — Parquet on ephemeral container disk. Ran 1K/5K/10K on the laptop this way. Tried 10K on the westus3 Container Apps job (same Parquet writer, same LocalDiskBlobSink). Stalled at 100+ minutes in the PBP write phase with no error, never completed inside the 2-hour replica budget. Ephemeral container storage is not characterised for the per-chunk fsync pattern the Parquet writer uses. Path abandoned on Cloud.
Attempt 2 — PBP direct to ClickHouse. Added ClickHousePlayByPlayBatchWriter: converts PlayByPlayRecord → PlayRecord and delegates to IStorageBackend.WritePlayByPlayAsync, reusing the same async-insert bulk-copy path that took OC write from 98s to 16s at 1K. First attempt failed with Code: 252. Too many partitions for single INSERT block (more than 100) — the play_by_play table was PARTITION BY game_id (288/season), exceeding Cloud's locked max_partitions_per_insert_block=100. Second attempt tried to raise the setting via set_max_partitions_per_insert_block=500 in the connection string; Cloud rejects the override with Code: 452. SETTING_CONSTRAINT_VIOLATION. Schema change landed: play_by_play is now PARTITION BY season_id (one partition per season; per-game DROP PARTITION is no longer free but ALTER TABLE ... DELETE WHERE game_id = X covers it via mutation if ever required).
Attempt 3 — PBP direct to ClickHouse with parallel bulk-copies. 1K succeeded in 304s (175K PBP rows/sec). 10K at chunk=500 OOM'd the 32 GiB container; raised to 32 vCPU / 64 GiB and it completed in 18.7 min (513K PBP rows/sec, 2.9× better per-row throughput than 1K as async_insert amortises per-bulk overhead). 100K at chunk=1000 OOM'd again (per-chunk PBP buffer hit ~18.5 GB × parallel slices); switched to chunk=500 (200 chunks) which fits memory but hit the 4-hour replica deadline still in the PBP write phase — super-linear past 10K by at least 1.75×.
Diagnosed root cause for the 100K super-linearity: PARTITION BY season_id means all 4.2B PBP rows at 100K live in a single partition. Cloud MergeTree's background merge pressure on a 4.2B-row partition appears to throttle foreground INSERT throughput in a way it doesn't at 420M rows (10K). The fix is a finer-grained partition scheme (e.g., (season_id, intDiv(world_id, 10000)) gives ~10 partitions per 100K run, still well under the 100-partition-per-insert limit because our chunks are contiguous world ranges that all map to the same world_id/10000 bucket).
Validated ceilings on Production Scale tier: - OC-only: 100K in 49.85 min - Full pipeline: 10K in 18.7 min - Full pipeline at 100K: not validated, hit 4h replica deadline
Paths to 100K full-pipeline (not yet executed):
1. Upgrade Cloud tier — Production (vs Scale) with 4 replicas × 16 CPU would 8× the insert throughput. Cheapest fix in engineering hours, most expensive in monthly spend.
2. Repartition PBP — (season_id, intDiv(world_id, 10000)) spreads across partitions, lets background merges parallelise. ~1 hour of code, needs one retry run at 100K to validate (~$4 of compute).
3. Split PBP off the critical path — ship OC-only as primary, run PBP asynchronously in a separate pipeline. Keeps the current validated 100K path; requires product decision on whether PBP is genuinely needed on the same SLA.
Production 3×16 tier — the 100K-in-under-3h milestone¶
The path-1 upgrade happened: the Cloud cluster was resized to Production with 3 replicas × 16 vCPU × 64 GiB = 48 vCPU / 192 GiB aggregate (6× the CPU and 24× the RAM of Production Scale). Along with the tier change the orchestrator's merge-parallelism cap was raised from 4 to 16 (the original cap was tuned to the ~14 GiB per-query budget on the smaller tier; 64 GiB per replica comfortably handles 16 × ~680 MB concurrent merges).
100K OC-only re-baseline on 3×16 tier:
| Phase | Prod Scale (2×4) | Prod 3×16 | Δ |
|---|---|---|---|
| Total wall | 2,991s (49.85 min) | 1,887s (31.4 min) | -37% |
| Sim | 1,530s | 1,412s | -8% (noise, client-bound) |
| OC write | 233s | 328s | +40% (async_insert variance) |
| Merge | 1,210s (20 min) | 137s (2.3 min) | -89% |
| Peak client RAM | 16.7 GB | 19.0 GB | similar |
The 20-min merge phase that had us chasing OPTIMIZE + sharded merges on the smaller tier simply disappeared with more replica RAM. That's where the big cost/time jump is: it's not a software fix, it's a compute-budget fix.
100K full-pipeline (OC + PBP, no season) on 3×16 tier — the first successful 100K full-pipeline run against Cloud: 9,887s (2h 44m). PBP write 7,955s for 4,198,546,157 rows — throughput 528K rows/sec vs 513K at 10K (effectively flat across 10× scale, super-linearity on the smaller tier was entirely a resource-starvation artifact).
Season accumulator refactor — closing the last gap¶
After the OC-only and no-season full-pipeline numbers landed, one gap remained: season_outcome_context was never exercised by the streaming orchestrator. The season accumulator existed but only ran in the standalone nice-to-have SeasonContextWriteExperiment. Wiring it into the orchestrator was straightforward — but running it at scale surfaced a memory-layout problem:
The original AmericanFootballSeasonAccumulator kept per-world state as Dictionary<int worldIndex, Dictionary<string outcomeId, double>>. Dict overhead + string duplication at 10K worlds was already 27 GB. Linear projection to 100K: ~270 GB. The container OOM'd mid-run at 3h.
Refactor (commit 41ab4c90): switched the inner layout to Dictionary<string outcomeId, double[worldCount]> — one pre-sized array per outcome, indexed by absolute world index. Same shape the game accumulator has used all along. At 100K that's 1,668 outcomes × 100,000 × 8 bytes ≈ 1.3 GB, flat regardless of how many worlds are in flight. Parameterless constructor preserved for tests/benchmarks; orchestrator passes TotalWorldCount so there's no grow-and-copy during accumulation. All 6 existing AmericanFootballSeasonAccumulatorTests pass unchanged.
100K full-pipeline + season — primary result:
| Phase | Time | % of total |
|---|---|---|
| Simulation | 2,169s (36 min) | 20% |
| PBP write | 7,983s (2h 13m) | 75% |
| OC write | 387s (6.5 min) | 4% |
| Merge | 121s (2 min) | 1% |
| Season write | 22.90s | <1% |
| Overhead | ~13s | <1% |
| Total wall | 10,696s (2h 58m) | 100% |
| Peak client RAM | 46.6 GB | (vs 46.8 GB on the no-season run — flat) |
| OC rows | 244,800 | |
| PBP rows | 4,198,981,505 | |
| Season rows | 1,728 |
The ~14 min difference vs the no-season run (2h 58m vs 2h 44m) is almost entirely in the sim phase (2,169s vs 1,462s, +48%). Cause: the shared season accumulator is lock-guarded (its inner dicts aren't thread-safe), so every worker pays lock-acquire + release on the sim hot path — 28.8M locked calls at 100K. The fix would be per-worker season accumulators + a merge step, following the same pattern the game accumulator uses. Not done here because +14 min of wall time is well within budget for the gate, but it's the obvious next optimisation if season runs ever move to the critical path.
Validated ceilings on Production 3×16 tier¶
- OC-only 100K: 31.4 min
- Full pipeline (OC + PBP) 100K: 2h 44m
- Full pipeline + season 100K: 2h 58m — primary "full" result
Artifacts available for review¶
- Redis vs ClickHouse deferred evaluation —
docs/outcome-context/evaluations/redis-as-store.md - ClickHouse schema review (strengths, proposed tuning, what not to change) —
docs/clickhouse-schema-review.md - Non-technical narrative —
docs/outcome-context/evaluations/storage-experiment/story.md - Sample accumulator outputs (single-world report, first-game diagnostic) —
docs/samples/ - Real benchmark runs (including streaming vs old-path comparison) —
docs/samples/experiment-runs/