Experiment Run Results

Captured output from real runs of the storage experiment against a local ClickHouse container.

Environment

ClickHouse 24.8 in Docker (single node, default config, no resource tuning)
Host: dev laptop (Windows, bash via Git Bash)
.NET 10, Release build
Client: ClickHouse.Driver v0.9.0 and ClickHouse.Client v7.8.0 (REQ-INFRA-2 compares both)

These are indicative results from a dev laptop, not production capacity numbers. A proper benchmark run needs:

- Tuned ClickHouse instance (CPU, memory, disk sized appropriately)
- Controlled network conditions
- Multiple runs for statistical significance

Files

must-have-scale-10.txt — must-have suite at 10 worlds, HTTP no compression
must-have-scale-100.txt — must-have suite at 100 worlds, HTTP no compression
must-have-scale-10-compressed.txt — same at 10 worlds with UseCompression=true
must-have-scale-100-compressed.txt — same at 100 worlds with UseCompression=true
must-have-scale-1000.txt — must-have suite at 1,000 worlds (~10 min total, ~8 min for PBP season write)
streaming-all-scales.txt — streaming orchestrator at 10, 100, 1000 worlds (OC → staging + merge, PBP → local Parquet files)
streaming-scale-10000.txt — streaming at 10,000 worlds (chunk=500, 20 chunks, ~44 min total)
streaming-scale-1000-profiling.txt — 1K baseline run captured for profiling (189.88s wall)
streaming-scale-1000-optimized.txt — same 1K run after the readback fix + outcome-ID caching (157.43s wall, -17%)
streaming-scale-5000-no-pbp.txt — 5K OC-only run via --no-pbp (772s, 1.74 GB peak — isolates OC path from PBP buffering)
streaming-scale-1000-no-pbp-p1.txt — 1K OC-only, --parallel-workers 1 baseline (176.9s)
streaming-scale-1000-no-pbp-p2.txt — 1K OC-only, --parallel-workers 2 (154.7s)
streaming-scale-1000-no-pbp-p4.txt — 1K OC-only, --parallel-workers 4 sweet spot (151.1s)
streaming-scale-1000-no-pbp-p8.txt — 1K OC-only, --parallel-workers 8 regresses (172.1s)
streaming-scale-1000-no-pbp-p1-pwrite.txt — 1K OC-only, N=1 after parallelising OC write loop (176.2s — unchanged at N=1 by design)
streaming-scale-1000-no-pbp-p4-pwrite.txt — 1K OC-only, N=4 with parallel OC write (111.2s)
streaming-scale-1000-no-pbp-p8-pwrite.txt — 1K OC-only, N=8 with parallel OC write (106.2s — OC write at 16.5s)
streaming-scale-1000-no-pbp-p4-pmerge.txt — 1K OC-only, N=4 after also parallelising final merge (105.9s — merge at 7.4s)
streaming-scale-1000-no-pbp-p8-pmerge.txt — 1K OC-only, N=8 with parallel OC write + merge (112.8s — sim crept up, noise)
streaming-scale-1000-with-pbp-p1.txt — 1K WITH PBP, N=1 same-session baseline (255.8s)
streaming-scale-1000-with-pbp-p4.txt — 1K WITH PBP, N=4 (198.2s, -22.5% vs N=1 same session)
streaming-scale-5000-no-pbp-p4.txt — 5K OC-only, N=4 clean volume (345.1s, -55% vs historical N=1 772s)
streaming-scale-10000-no-pbp-p1.txt — 10K OC-only, N=1 clean baseline (1,565s / 26.1 min)
streaming-scale-10000-no-pbp-p4.txt — 10K OC-only, N=4 clean (723.8s / 12.1 min, -54% vs same-session N=1)
streaming-cloud-scale-1000-no-pbp-d32-async.txt — Cloud 1K OC-only, D32 N=32 + async_insert, 59.67s
streaming-cloud-scale-10000-no-pbp-d32-async.txt — Cloud 10K OC-only, D32 N=32 + async_insert, 399.60s
streaming-cloud-scale-100000-no-pbp-d32-async.txt — Cloud 100K OC-only, D32 N=32 + async_insert, merge-parallelism capped at 4, 2,991s (49.85 min)
streaming-cloud-scale-100000-no-pbp-d32-sharded-merge-reverted.txt — 100K with experimental hash-sharded merge (cityHash64(outcome_id) % 4). Correct results but +13% slower (3,377s) because each shard still scans all staging rows for its game. Reverted in commit 2049f9a4.
streaming-cloud-scale-1000-with-pbp-d32-ch.txt — Cloud 1K full-pipeline (OC + PBP), PBP direct to ClickHouse, 303.98s. PBP throughput: 175K rows/sec.
streaming-cloud-scale-10000-with-pbp-d32-ch-parallel.txt — Cloud 10K full-pipeline with 4-way parallel PBP writes, 1,119.68s (18.7 min). Highest validated with-PBP scale. PBP throughput: 513K rows/sec.
streaming-cloud-scale-100000-with-pbp-d32-ch-timed-out.txt — 100K full-pipeline attempt on Production Scale (2×4): hit 4h replica timeout mid-PBP-write. Super-linear PBP scaling past 10K on that tier.
streaming-cloud-scale-100000-oconly-prod3x16.txt — 100K OC-only on Production 3×16: 1,887s (31.4 min). Merge cap raised 4→16 with more replica RAM. -37% total vs Production Scale.
streaming-cloud-scale-100000-fullpipeline-prod3x16.txt — First successful 100K full-pipeline on Production 3×16: 9,887s (2h 44m). PBP write near-linear (528K rows/sec vs 513K at 10K). 4.2B PBP rows.
streaming-cloud-scale-10000-fullpipeline-season-prod3x16.txt — 10K full-pipeline + season context (first Cloud-scale season run): 1,088s. 1,728 season rows, overhead 7s.
streaming-cloud-scale-100000-fullpipeline-season-oom.txt — 100K full-pipeline + season: OOM after 3h with the original nested-dict accumulator. Root cause documented inline.
streaming-cloud-scale-100000-fullpipeline-season-prod3x16.txt — 100K full-pipeline + season with refactored AmericanFootballSeasonAccumulator: 10,696s (2h 58m). Peak RAM 46.6 GB (vs projected 270 GB with old accumulator). 244,800 OC + 4.2B PBP + 1,728 season rows. All three tables at 100K validated.

Streaming vs old-path comparison (end-to-end wall time)

Apples-to-apples: same ClickHouse instance, same simulation, same worlds. Old path = TestDataCache + direct ClickHouse writes. New path = StreamingOrchestrator with OC via staging+merge + PBP to Parquet files.

Scale	Chunk	Old total	New total	Δ	Old peak RAM	New peak RAM
10	10	23.8s	30.2s	+27%	n/a (unbounded)	594 MB
100	100	41.6s	38.0s	-9%	n/a	4.2 GB
1,000	100	~10 min	4.4 min	-56%	n/a	5.1 GB
10,000	500	(old path couldn't run)	43.9 min	unreachable vs measurable	—	19.3 GB

Scale 1,000 breakdown (streaming)

Total wall time: 262.7s
Simulation: 109.3s (42%)
OC write (staging): 55.1s (21%)
PBP write (Parquet): 54.4s (21%)
Merge (arrayFlatten): 16.1s (6%)
Overhead (setup + read-back verify): 27.8s (11%)
10 chunks of 100 worlds completed
Peak working set: 5.1 GB (vs old-path ~20+ GB extrapolated — old path hit memory wall past 1,000 worlds on 32 GB boxes)
OC rows: 244,799 (merged from 10 × ~24K partial rows)
PBP rows: 41,972,008 written to 10 Parquet files on local disk

Scale 10,000 breakdown (streaming, chunk=500, 20 chunks)

Total wall time: 2,634 s (43.9 min)
Simulation: 1,083s (41%)
OC write to staging: 499s (19%)
PBP write to Parquet: 726s (28%)
Merge (arrayFlatten): 201s (8%)
Overhead: ~125s (5%)
Peak working set: 19.3 GB (higher than expected — chunk=500 accumulator holds 500-world arrays per game × 288 games × ~850 outcomes)
OC rows (merged): 244,800
PBP rows: 419,763,551 (~420 million)
PBP Parquet files: 20 files, ~6.9 GB on disk
ClickHouse OC table: 2.68 GB compressed, 19.6 GB uncompressed, ratio 7.32× (up from 6.96× at scale 1,000)

Linear-scaling check — scale 1,000 → 10,000

Metric	Scale 1,000	Scale 10,000	Growth
Total wall time	262.7s	2,634.4s	10.0× (perfectly linear)
Simulation time	109.3s	1,083.0s	9.9×
OC write time	55.1s	498.5s	9.0×
PBP write time	54.4s	725.8s	13.3× (super-linear — disk contention?)
Merge time	16.1s	201.1s	12.5× (row-group growth matters for arrayFlatten)
OC rows	244,799	244,800	~same (catalogue-bounded, not world-bounded)
PBP rows	41,972,008	419,763,551	10.0×
OC compression ratio	6.96×	7.32×	improves further
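
The growth column can be recomputed directly from the two measured columns. The sketch below is only an illustration of that arithmetic: the dictionary keys and the 10% band used to label a factor as linear are arbitrary choices made here, not part of the experiment.

```python
# Timings and row counts copied from the table above (scale 1,000 vs 10,000).
metrics = {
    "total_wall_s": (262.7, 2634.4),
    "simulation_s": (109.3, 1083.0),
    "oc_write_s": (55.1, 498.5),
    "pbp_write_s": (54.4, 725.8),
    "merge_s": (16.1, 201.1),
    "pbp_rows": (41_972_008, 419_763_551),
}

WORLD_GROWTH = 10_000 / 1_000  # 10x more worlds


def growth_factors(table):
    """Return {metric: value_at_10k / value_at_1k}."""
    return {name: big / small for name, (small, big) in table.items()}


def classify(factor, expected=WORLD_GROWTH, tolerance=0.10):
    """Label a factor relative to linear growth; the 10% band is an assumption."""
    if factor > expected * (1 + tolerance):
        return "super-linear"
    if factor < expected * (1 - tolerance):
        return "sub-linear"
    return "linear"


factors = growth_factors(metrics)
for name, factor in factors.items():
    print(f"{name:14s} {factor:5.2f}x  {classify(factor)}")
```

With that band, PBP write and merge are the two phases that land outside linear growth, which matches the annotations in the table.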

Key observations

Streaming is SLOWER than in-memory below a threshold (scale 10 adds 27% overhead from staging + merge). That's the cost of chunking and we accept it.
Streaming wins as scale grows — at 1,000 worlds it's more than 2× faster than the old path, and crucially stays under 6 GB RAM where the old path at this scale struggles on 32 GB machines.
At 10,000 worlds the streaming path completes successfully in 44 minutes with 19 GB peak RAM — the old path couldn't run at this scale on a 32 GB laptop at all.
Scaling is nearly linear from 1K → 10K. End-to-end wall time grew almost exactly 10× (10.03×) for 10× more worlds, so there are no hidden quadratic costs in the total.
OC compression keeps improving at scale — 5.56× → 6.39× → 6.96× → 7.32× as array length grows. The Delta+ZSTD codec is doing more work as the Float64 arrays get longer.
PBP write is super-linear (13.3× for 10× rows) — possibly disk contention on the local SSD, or Parquet row-group boundary effects. Not alarming but worth measuring on faster hardware.
PBP Parquet files land on local disk (20 files, one per chunk). From there a separate pipeline ingests them into ClickHouse if/when needed.

Memory — chunk-size tradeoff

At scale 10,000 with chunk=500, we hit 19.3 GB peak. This is higher than the 5.1 GB we saw at scale 1,000 with chunk=100. The peak scales with chunk size (bigger arrays per accumulator), not total world count. If we'd picked chunk=100 we'd have seen peak closer to 5 GB with 100 chunks instead of 20 — same total time probably, much lower memory ceiling. Worth sweeping chunk size if we want to push toward 100K on a 32 GB box.
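
To make the tradeoff concrete, here is a rough planning sketch. It assumes peak working set grows linearly with chunk size and fits a line through the only two measured points (chunk=100 at 5.1 GB, chunk=500 at 19.3 GB). Those two runs also differ in total scale, so treat the estimates as a starting point for a sweep rather than a prediction. The function names are made up for this example.

```python
import math

# Two measured points from the runs above: (chunk size in worlds, peak GB).
MEASURED = [(100, 5.1), (500, 19.3)]


def peak_gb_estimate(chunk_size):
    """Linear fit through the two measured points (an assumption, not a model)."""
    (c0, m0), (c1, m1) = MEASURED
    slope = (m1 - m0) / (c1 - c0)  # GB per extra world held in a chunk
    return m0 + slope * (chunk_size - c0)


def chunk_plan(total_worlds, chunk_size):
    """Number of chunks needed and the estimated peak working set."""
    chunks = math.ceil(total_worlds / chunk_size)
    return chunks, peak_gb_estimate(chunk_size)


for chunk in (100, 250, 500):
    chunks, peak = chunk_plan(10_000, chunk)
    print(f"chunk={chunk:4d}  chunks={chunks:4d}  est. peak ~ {peak:.1f} GB")
```

A real sweep would replace the two-point fit with measured peaks at each chunk size.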

1K optimisation pass — where the -17% came from

Captured streaming-scale-1000-profiling.txt (baseline) and streaming-scale-1000-optimized.txt (after two commits). Wall-time dropped 189.88s → 157.43s (-32.5s, -17%). Breaking that down:

Component	Baseline	Optimised	Δ
Simulation	79.5s	78.9s	-0.6s
OC write (staging)	38.9s	36.6s	-2.3s
PBP write (Parquet)	41.3s	40.8s	-0.5s
Merge (arrayFlatten)	10.5s	7.8s	-2.7s
Measured subtotal	170.2s	164.1s	-6.1s
Untracked / readback	~19.7s	~-6.7s*	-26.4s
Total wall time	189.88s	157.43s	-32.5s

* Negative untracked is measurement noise — stopwatches overlap by a few seconds.
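
The untracked row is simply total wall time minus the sum of the measured phases. A small sketch of that bookkeeping, using only the numbers from the table (the variable and function names are illustrative):

```python
# Phase timings (seconds) copied from the table above.
baseline = {"simulation": 79.5, "oc_write": 38.9, "pbp_write": 41.3, "merge": 10.5}
optimised = {"simulation": 78.9, "oc_write": 36.6, "pbp_write": 40.8, "merge": 7.8}
TOTAL_BASELINE, TOTAL_OPTIMISED = 189.88, 157.43


def untracked(total_wall, phases):
    """Wall time not attributed to any measured phase."""
    return total_wall - sum(phases.values())


base_untracked = untracked(TOTAL_BASELINE, baseline)
opt_untracked = untracked(TOTAL_OPTIMISED, optimised)

print(f"measured subtotal: {sum(baseline.values()):.1f}s -> {sum(optimised.values()):.1f}s")
print(f"untracked:         {base_untracked:.1f}s -> {opt_untracked:.1f}s")
print(f"total saved:       {TOTAL_BASELINE - TOTAL_OPTIMISED:.2f}s")
```

A negative untracked value means the phase stopwatches sum to more than the wall clock, which is what overlapping timers produce.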

Conclusion: the big win was the readback fix (commit 21bf4e68). The old post-run row-count loop did 288 full-context reads just to report TotalGameOutcomeRows, burning roughly 26s that was not attributed to any measured phase. Replacing it with a single system.parts metrics query accounts for most of the -17% (about 26s of the 32.5s saved). The outcome-ID caching (commit 8849faee) was within run-to-run noise at this scale; it is kept because it's a clean refactor, not because it is the source of the savings. Simulation-layer optimisation isn't planned (it's prototype code); future wins should come from accumulation and derivation.

Parallel accumulator spike — 1K OC-only sweep

Per-worker accumulator + per-worker PBP list, merged at chunk boundary (shared-nothing). Same session, same ClickHouse instance, --no-pbp.

Workers	Total wall	Sim	OC write	Merge	Peak RAM
1	176.9s	101.5s	59.6s	15.7s	648 MB
2	154.7s	77.7s	60.2s	16.5s	790 MB
4	151.1s	69.7s	64.2s	17.0s	924 MB
8	172.1s	90.8s	63.8s	17.3s	986 MB

Findings:

- Sim phase scales 1.45× at N=4 (101→70s). Per-worker accumulators work: the shared-state contention that killed the earlier parallel-worlds attempt is gone.
- Sweet spot N=4, -15% total wall time. N=8 regresses because of thread contention plus Docker ClickHouse stealing cycles on the same laptop.
- OC write + merge are I/O-bound and do not parallelise from sim workers. They're why the total speedup is 1.17× not 4×.
- Peak RAM grows ~70–90 MB per added worker. Manageable at 1K; will matter at 10K+ with PBP on.
- Correctness OK: 244,800 merged OC rows at every N.
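
The gap between the sim-phase speedup and the total speedup is what Amdahl's law predicts when only part of the run is parallelised. The sketch below applies the standard formula to the N=1 and N=4 rows of the sweep; it treats everything outside the sim phase as strictly serial, which is a simplification.

```python
# Timings (seconds) from the N=1 and N=4 rows of the sweep table above.
N1_TOTAL, N1_SIM = 176.9, 101.5
N4_TOTAL, N4_SIM = 151.1, 69.7


def amdahl_speedup(parallel_fraction, parallel_speedup):
    """Overall speedup when only `parallel_fraction` of the run gets faster."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / parallel_speedup)


sim_fraction = N1_SIM / N1_TOTAL   # share of the N=1 run spent simulating
sim_speedup = N1_SIM / N4_SIM      # measured sim-phase speedup at N=4
predicted = amdahl_speedup(sim_fraction, sim_speedup)
observed = N1_TOTAL / N4_TOTAL
ceiling = amdahl_speedup(sim_fraction, float("inf"))  # sim made infinitely fast

print(f"sim speedup {sim_speedup:.2f}x, predicted total {predicted:.2f}x, "
      f"observed {observed:.2f}x, ceiling {ceiling:.2f}x")
```

The observed figure sits slightly below the prediction because OC write and merge were a few seconds slower at N=4 rather than constant. The ceiling shows that sim-only parallelism could not exceed roughly 2.3× on this run however many workers were added.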

Compare to previous Channel<T> + shared-accumulator attempt (~3% on 10K): shared-nothing was the right structural call.
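
For readers unfamiliar with the pattern, the following is a minimal Python sketch of shared-nothing accumulation. It is not the project's C# code: `simulate_world`, `run_worker` and `run_chunk` are hypothetical names, and the simulation is a deterministic stand-in. It shows only the structure: each worker owns a private accumulator, and partial results are merged once at the chunk boundary.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate_world(world_id, game_ids):
    """Stand-in for the real simulation: one deterministic value per game."""
    return {game_id: float(world_id * 1000 + game_id) for game_id in game_ids}


def run_worker(world_ids, game_ids):
    """Each worker fills its own private accumulator; nothing is shared."""
    local = {game_id: [] for game_id in game_ids}
    for world_id in world_ids:
        for game_id, value in simulate_world(world_id, game_ids).items():
            local[game_id].append((world_id, value))
    return local


def run_chunk(world_ids, game_ids, workers):
    """Split the chunk across workers, then merge partials at the chunk boundary."""
    slices = [world_ids[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(run_worker, slices, [game_ids] * workers))
    merged = {game_id: [] for game_id in game_ids}
    for partial in partials:
        for game_id, rows in partial.items():
            merged[game_id].extend(rows)
    # Restore world order so the result does not depend on the worker count.
    return {g: [v for _, v in sorted(rows)] for g, rows in merged.items()}


serial = run_chunk(list(range(8)), [1, 2, 3], workers=1)
parallel = run_chunk(list(range(8)), [1, 2, 3], workers=4)
print(serial == parallel)
```

CPython threads will not speed up CPU-bound work, so this sketch demonstrates the ownership and merge structure only, not the performance effect measured above.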

Push further — parallel OC write + parallel final merge (1K OC-only)

With shared-nothing accumulators in place, the next serial phases were the 288-game merge-and-staging-write loop and the final per-game MergeStagingToFinalAsync loop. Both are embarrassingly parallel (different gameIds, different staging rows, different partitions). Wrapped them in Parallel.ForEachAsync with MaxDegreeOfParallelism = ParallelWorkers.
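
As a language-neutral illustration of bounded per-game parallelism, here is an asyncio sketch that plays the role Parallel.ForEachAsync with MaxDegreeOfParallelism plays in the C# code. It is an analogy, not a port: `for_each_bounded` and `merge_game` are hypothetical names, and the per-game work is simulated with a yield instead of a ClickHouse call.

```python
import asyncio


async def for_each_bounded(items, worker, max_parallelism):
    """Run `worker(item)` for every item with at most `max_parallelism` in flight."""
    gate = asyncio.Semaphore(max_parallelism)

    async def guarded(item):
        async with gate:
            return await worker(item)

    return await asyncio.gather(*(guarded(item) for item in items))


async def demo(game_count=288, parallel_workers=4):
    in_flight = 0
    peak = 0

    async def merge_game(game_id):  # hypothetical stand-in for a per-game merge
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0)      # yield, as a real network call would
        in_flight -= 1
        return game_id

    done = await for_each_bounded(range(game_count), merge_game, parallel_workers)
    return len(done), peak


completed, peak_in_flight = asyncio.run(demo())
print(completed, peak_in_flight)
```

Because each game touches its own rows, no coordination is needed beyond the concurrency cap, which is the property the paragraph above relies on.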

Config	Total	Sim	OC write	Merge
N=1 (all serial)	176.2s	104s	56s	16s
N=4 sim-only parallel	151.1s	70s	64s	17s
N=4 sim + OC-write parallel	111.2s	73s	22s	16s
N=4 sim + OC-write + merge parallel	105.9s	73s	25s	7s
N=8 sim + OC-write + merge parallel	112.8s	82s	23s	8s

Best config: N=4 with all three phases parallel → 105.9s, -40% vs N=1 baseline. Parallel ClickHouse writes are the big unlock: 2.3× faster OC write, 2.1× faster merge. The laptop sim phase caps around N=4 physical cores — N=8 doesn't help and occasionally regresses sim due to thread contention with Docker ClickHouse.

Parallelism scaling — 1K → 5K → 10K (OC-only)

Scale	N=1 wall	N=4 wall	Δ	N=4 per-world
1K	176s	106s	-40%	106 ms
5K	772s	345s	-55%	69 ms
10K	1,565s (26.1m)	724s (12.1m)	-54%	72 ms

The parallel speedup improves with scale rather than degrading. At 1K the gain was -40%; at 5K and 10K it's ~-55%. Fixed per-chunk overhead is amortised better over longer-running chunks, so per-world cost drops from 106 ms to ~70 ms from 5K onwards.

Scaling at N=4: 5K → 10K = 2.1× (near-linear, matching N=1's 2.03×). 100K OC-only projection at N=4 laptop: ~2.0 hours (10K × 10), down from the earlier ~4.4h estimate extrapolated from the 1K factor. All three scales produced the correct 244,800 merged OC rows.
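
The per-world figures and the 100K projection follow from simple proportional arithmetic. The sketch below reproduces them under the same assumption the projection makes, namely that wall time keeps growing linearly with world count beyond 10K, which has not been measured on the laptop.

```python
# N=4 OC-only wall times (seconds) from the runs above, keyed by world count.
N4_WALL = {1_000: 105.9, 5_000: 345.1, 10_000: 723.8}


def per_world_ms(worlds, wall_s):
    """Average wall-clock cost per simulated world, in milliseconds."""
    return wall_s / worlds * 1000.0


def project_linear(base_worlds, base_wall_s, target_worlds):
    """Assume wall time grows in proportion to world count beyond the base point."""
    return base_wall_s * (target_worlds / base_worlds)


for worlds, wall in N4_WALL.items():
    print(f"{worlds:6d} worlds: {per_world_ms(worlds, wall):5.1f} ms/world")

projected_100k_s = project_linear(10_000, N4_WALL[10_000], 100_000)
print(f"100K projected from 10K: {projected_100k_s / 3600:.1f} h")
```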

Side finding — stale-volume hazard fixed in commit. The first 5K N=4 attempt returned 489,599 rows (2×) on a non-fresh Docker volume. Root cause: DROP TABLE IF EXISTS on an Atomic ClickHouse DB defers data cleanup; system.parts WHERE table='X' AND active=1 kept summing old and new parts for several minutes. Fixed by appending SYNC to all DROP TABLE statements in ClickHouseSchemas.cs, which forces synchronous deletion. Verified with back-to-back scale-10 runs: no doubling.
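
A small sketch of the two statements involved, written as Python string builders so they can be unit-tested. The table name in the usage lines is hypothetical; the SQL itself uses the `SYNC` modifier and the `system.parts` columns described above. The helpers interpolate names directly, so they are only suitable for trusted, hard-coded identifiers.

```python
def drop_table_sync(table_name, database=None):
    """Build a DROP statement that waits for the data to be removed."""
    qualified = f"{database}.{table_name}" if database else table_name
    return f"DROP TABLE IF EXISTS {qualified} SYNC"


def active_row_count_query(table_name):
    """Row count summed over active parts, as in the check described above."""
    return (
        "SELECT sum(rows) FROM system.parts "
        f"WHERE table = '{table_name}' AND active = 1"
    )


# Hypothetical table name, for illustration only.
print(drop_table_sync("game_outcomes", database="experiment"))
print(active_row_count_query("game_outcomes"))
```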

1K WITH PBP comparison — per-worker PBP concat verified

Same session, PBP enabled:

Config	Total	Sim	OC write	PBP write	Merge	Peak RAM
N=1	255.8s	124s	58s	57s	16s	6.67 GB
N=4	198.2s	93s	25s	72s	7s	6.78 GB

Wall time -22.5% with PBP on (vs -40% without). PBP write phase remains serial and doesn't parallelise (single Parquet writer per chunk).
PBP write regressed +15s at N=4 — likely Docker CH saturation + GC pressure during the concurrent OC staging write. Not alarming; the overall wall time still drops meaningfully.
Peak RAM effectively unchanged (6.67 → 6.78 GB) — PBP buffer dominates memory at this scale, and 4 partial OC accumulators cost ~80 MB total extra.
Correctness fine: 41.98M PBP rows match sequential.

Compression comparison (localhost, same scale)

Headline: Enabling HTTP body compression on ClickHouse.Driver makes almost no difference on localhost. This is expected — network isn't the bottleneck when the client and server are on the same machine. The compression_ratio metric reflects on-disk column-level ZSTD compression (set in the table DDL), which is independent of the HTTP wire-level compression flag.

Metric	Scale 10 (raw)	Scale 10 (compressed)	Δ	Scale 100 (raw)	Scale 100 (compressed)	Δ
OC full season write (ms)	18,708	18,563	-0.8%	7,678	7,190	-6.3%
OC rows/sec	12,255	12,363	+0.9%	31,751	33,899	+6.8%
OC compression ratio	5.56×	5.54×	~0	6.38×	6.39×	~0
PBP full season write (ms)	5,124	5,019	-2.1%	34,368	35,758	+4.0%
PBP rows/sec	81,853	83,807	+2.4%	122,166	117,424	-3.9%

Interpretation:

- Differences are mostly within run-to-run noise (~5% variance typical); the largest swings are the scale-100 OC figures at -6.3% and +6.8%, only slightly outside that band.
- On-disk compression ratio is effectively identical across runs (set by CODEC in DDL, not wire-level).
- HTTP body compression matters for remote ClickHouse (cloud, across-network), where the compressed binary payload significantly reduces bytes-over-wire. On localhost over loopback, it's a wash.
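
One way to make the noise judgement repeatable is to compute each delta and compare it against the quoted variance. The sketch below does that for the write timings in the table; the 5% band is the "typical" figure quoted above, not a measured confidence interval, and a single run per configuration cannot establish one.

```python
NOISE_BAND_PCT = 5.0  # the "typical" run-to-run variance quoted above

# (raw, compressed) write times in ms, copied from the table above.
runs = {
    "scale10 OC write ms": (18_708, 18_563),
    "scale100 OC write ms": (7_678, 7_190),
    "scale10 PBP write ms": (5_124, 5_019),
    "scale100 PBP write ms": (34_368, 35_758),
}


def pct_change(raw, compressed):
    """Relative change of the compressed run versus the raw run, in percent."""
    return (compressed - raw) / raw * 100.0


for name, (raw, compressed) in runs.items():
    delta = pct_change(raw, compressed)
    verdict = "within band" if abs(delta) <= NOISE_BAND_PCT else "outside band"
    print(f"{name:22s} {delta:+5.1f}%  {verdict}")
```

Repeating each configuration several times and comparing distributions would be the more rigorous check.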

Conclusion for the controlled evaluation: Test compression against a remote ClickHouse instance (ClickHouse Cloud or a cloud VM). For local development, keep it off to eliminate CPU cycles spent compressing bytes that stay inside the same machine.

Scale-10 vs Scale-100 Summary

Metric	10 worlds	100 worlds	Growth
OC full season write	18.7s	7.7s	0.4× (faster!)
OC rows/sec	12,255	31,751	2.6×
OC total rows (288 games)	229,274	243,776	1.06× (row count grows slightly as more players become active)
OC single-game read p50	17.3 ms	25.4 ms	1.5×
OC compressed bytes	4.76 MB	31.9 MB	6.7×
OC uncompressed bytes	26.5 MB	203.9 MB	7.7×
OC compression ratio	5.56×	6.38×	Improves at scale
OC bytes per row (compressed)	20.8	131.1	Array length dominates
PBP full season write	5.1s	34.4s	6.7×
PBP rows/sec	81,853	122,166	1.5× (better batch utilization)
PBP total rows	419,412	4,198,561	10×
PBP compressed bytes	10.5 MB	89.2 MB	8.5×
PBP uncompressed bytes	21.0 MB	205.0 MB	9.8×
PBP compression ratio	2.01×	2.30×	Improves at scale
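
The ratio and bytes-per-row rows are derived from the byte and row counts in the same table. The sketch below recomputes them; it assumes "MB" means 10^6 bytes, and because the inputs are the rounded MB figures, the results can differ from the table in the last digit.

```python
MB = 1_000_000  # assumes the table's "MB" means 10^6 bytes

# (rows, compressed MB, uncompressed MB) copied from the summary table above.
OC = {10: (229_274, 4.76, 26.5), 100: (243_776, 31.9, 203.9)}
PBP = {10: (419_412, 10.5, 21.0), 100: (4_198_561, 89.2, 205.0)}


def compression_ratio(compressed_mb, uncompressed_mb):
    """Uncompressed size divided by compressed size."""
    return uncompressed_mb / compressed_mb


def bytes_per_row(rows, compressed_mb):
    """Average compressed bytes stored per row."""
    return compressed_mb * MB / rows


for label, table in (("OC", OC), ("PBP", PBP)):
    for scale, (rows, comp, uncomp) in table.items():
        print(f"{label:3s} scale {scale:3d}: ratio {compression_ratio(comp, uncomp):.2f}x, "
              f"{bytes_per_row(rows, comp):.1f} B/row")
```

The OC bytes-per-row figure grows roughly 6× between scales while the row count barely moves, which is the "array length dominates" effect noted in the table.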

Key Observations

Compression ratio validates the design. The Outcome Context (dense Array(Float64)) compresses at 5.56–6.38×, right in line with the spec's 5–6× expectation. The play-by-play's sparse MergeTree storage compresses at ~2×.
OC write got FASTER at 100 worlds, both in total (18.7s → 7.7s) and per row. Counter-intuitive but explained: the row count barely changes (229K vs 244K) while each row's array grows from 10 to 100 elements, so ClickHouse.Driver's per-row overhead is amortised over more data.
PBP write time scales sub-linearly. Writing 10× more PBP data takes ~7× longer, so throughput improves rather than degrades. ClickHouse absorbs the load without falling off.
Single-game read is sub-30ms at both scales. Well within the 1-second target for REQ-OC-R1 at 10K and 5-second target at 100K.
REQ-INFRA-2 client comparison is inconclusive. ClickHouse.Driver won at scale 10 (138 vs 99 rows/sec); ClickHouse.Client won at scale 100 (1022 vs 978). Variance dominates at these sample sizes. A controlled evaluation at realistic workload is still required.

How to Reproduce (historical)

Note: The LBS.Model.AmericanFootball.StorageExperiment project that produced these captures has been removed. The exact commands that generated the .txt samples above are no longer runnable as written, so this section is kept only as a historical record of how the original runs were invoked.

The OC simulation now lives under src/Apps/LBS.OutcomeContext.SimulationRunner* (entry point src/Apps/LBS.OutcomeContext.SimulationRunner/Program.cs). It is configured via OC_* environment variables (for example OC_CLICKHOUSE_CONNECTION, OC_ENGINE, OC_WORLD_COUNT) rather than the old --suite / --backend / --scale CLI flags, so it is not a drop-in equivalent. See that project for current invocation details.

The original (now-removed) StorageExperiment was driven roughly like this:

# Start local ClickHouse, run a suite at a given scale against it, then tear down.
# The StorageExperiment project no longer exists — preserved here for context only.
dotnet run --project <StorageExperiment> -c Release -- \
  --suite must-have --backend clickhouse --scale 10 \
  --environment "local-docker" \
  --connection-string "Host=localhost;Port=8123;Database=experiment;Username=default;Password=clickhouse"

What's Not Yet Tested

Should-have experiments (REQ-OC-R3/R4/R5, REQ-OC-W3/W5, REQ-PBP-R1–R5, REQ-INFRA-3) — framework ready but not run here
Nice-to-have experiments (REQ-OC-W4/R6, REQ-PBP-S1)
Must-have suite at 10,000-world and 100,000-world scales (needs tuned ClickHouse, not laptop Docker; only the streaming runs above cover those scales)
ClickHouse Cloud as a results store (no --cloud-connection-string was set for these runs)