Experiment Run Results
Captured output from real runs of the storage experiment against a local ClickHouse container.
Environment
- ClickHouse 24.8 in Docker (single node, default config, no resource tuning)
- Host: dev laptop (Windows, bash via Git Bash)
- .NET 10, Release build
- Client: ClickHouse.Driver v0.9.0 and ClickHouse.Client v7.8.0 (REQ-INFRA-2 compares both)
These are indicative results from a dev laptop, not production capacity numbers. A proper benchmark run needs:
- Tuned ClickHouse instance (CPU, memory, disk sized appropriately)
- Controlled network conditions
- Multiple runs for statistical significance
Files
- must-have-scale-10.txt: must-have suite at 10 worlds, HTTP no compression
- must-have-scale-100.txt: must-have suite at 100 worlds, HTTP no compression
- must-have-scale-10-compressed.txt: same at 10 worlds with UseCompression=true
- must-have-scale-100-compressed.txt: same at 100 worlds with UseCompression=true
- must-have-scale-1000.txt: must-have suite at 1,000 worlds (~10 min total, ~8 min for PBP season write)
- streaming-all-scales.txt: streaming orchestrator at 10, 100, 1000 worlds (OC → staging + merge, PBP → local Parquet files)
- streaming-scale-10000.txt: streaming at 10,000 worlds (chunk=500, 20 chunks, ~44 min total)
- streaming-scale-1000-profiling.txt: 1K baseline run captured for profiling (189.88s wall)
- streaming-scale-1000-optimized.txt: same 1K run after the readback fix + outcome-ID caching (157.43s wall, -17%)
- streaming-scale-5000-no-pbp.txt: 5K OC-only run via --no-pbp (772s, 1.74 GB peak; isolates OC path from PBP buffering)
- streaming-scale-1000-no-pbp-p1.txt: 1K OC-only, --parallel-workers 1, baseline (176.9s)
- streaming-scale-1000-no-pbp-p2.txt: 1K OC-only, --parallel-workers 2 (154.7s)
- streaming-scale-1000-no-pbp-p4.txt: 1K OC-only, --parallel-workers 4, sweet spot (151.1s)
- streaming-scale-1000-no-pbp-p8.txt: 1K OC-only, --parallel-workers 8, regresses (172.1s)
- streaming-scale-1000-no-pbp-p1-pwrite.txt: 1K OC-only, N=1 after parallelising OC write loop (176.2s; unchanged at N=1 by design)
- streaming-scale-1000-no-pbp-p4-pwrite.txt: 1K OC-only, N=4 with parallel OC write (111.2s)
- streaming-scale-1000-no-pbp-p8-pwrite.txt: 1K OC-only, N=8 with parallel OC write (106.2s; OC write at 16.5s)
- streaming-scale-1000-no-pbp-p4-pmerge.txt: 1K OC-only, N=4 after also parallelising final merge (105.9s; merge at 7.4s)
- streaming-scale-1000-no-pbp-p8-pmerge.txt: 1K OC-only, N=8 with parallel OC write + merge (112.8s; sim crept up, noise)
- streaming-scale-1000-with-pbp-p1.txt: 1K WITH PBP, N=1 same-session baseline (255.8s)
- streaming-scale-1000-with-pbp-p4.txt: 1K WITH PBP, N=4 (198.2s, -22.5% vs N=1 same session)
- streaming-scale-5000-no-pbp-p4.txt: 5K OC-only, N=4 clean volume (345.1s, -55% vs historical N=1 772s)
- streaming-scale-10000-no-pbp-p1.txt: 10K OC-only, N=1 clean baseline (1,565s / 26.1 min)
- streaming-scale-10000-no-pbp-p4.txt: 10K OC-only, N=4 clean (723.8s / 12.1 min, -54% vs same-session N=1)
- streaming-cloud-scale-1000-no-pbp-d32-async.txt: Cloud 1K OC-only, D32 N=32 + async_insert, 59.67s
- streaming-cloud-scale-10000-no-pbp-d32-async.txt: Cloud 10K OC-only, D32 N=32 + async_insert, 399.60s
- streaming-cloud-scale-100000-no-pbp-d32-async.txt: Cloud 100K OC-only, D32 N=32 + async_insert, merge-parallelism capped at 4, 2,991s (49.85 min)
- streaming-cloud-scale-100000-no-pbp-d32-sharded-merge-reverted.txt: 100K with experimental hash-sharded merge (cityHash64(outcome_id) % 4). Correct results but +13% slower (3,377s) because each shard still scans all staging rows for its game. Reverted in commit 2049f9a4.
- streaming-cloud-scale-1000-with-pbp-d32-ch.txt: Cloud 1K full-pipeline (OC + PBP), PBP direct to ClickHouse, 303.98s. PBP throughput: 175K rows/sec.
- streaming-cloud-scale-10000-with-pbp-d32-ch-parallel.txt: Cloud 10K full-pipeline with 4-way parallel PBP writes, 1,119.68s (18.7 min). Highest validated with-PBP scale. PBP throughput: 513K rows/sec.
- streaming-cloud-scale-100000-with-pbp-d32-ch-timed-out.txt: 100K full-pipeline attempt on Production Scale (2×4): hit 4h replica timeout mid-PBP-write. Super-linear PBP scaling past 10K on that tier.
- streaming-cloud-scale-100000-oconly-prod3x16.txt: 100K OC-only on Production 3×16: 1,887s (31.4 min). Merge cap raised 4→16 with more replica RAM. -37% total vs Production Scale.
- streaming-cloud-scale-100000-fullpipeline-prod3x16.txt: First successful 100K full-pipeline on Production 3×16: 9,887s (2h 44m). PBP write near-linear (528K rows/sec vs 513K at 10K). 4.2B PBP rows.
- streaming-cloud-scale-10000-fullpipeline-season-prod3x16.txt: 10K full-pipeline + season context (first Cloud-scale season run): 1,088s. 1,728 season rows, overhead 7s.
- streaming-cloud-scale-100000-fullpipeline-season-oom.txt: 100K full-pipeline + season: OOM after 3h with the original nested-dict accumulator. Root cause documented inline.
- streaming-cloud-scale-100000-fullpipeline-season-prod3x16.txt: 100K full-pipeline + season with refactored AmericanFootballSeasonAccumulator: 10,696s (2h 58m). Peak RAM 46.6 GB (vs projected 270 GB with old accumulator). 244,800 OC + 4.2B PBP + 1,728 season rows. All three tables at 100K validated.
Streaming vs old-path comparison (end-to-end wall time)
Apples-to-apples: same ClickHouse instance, same simulation, same worlds. Old path = TestDataCache + direct ClickHouse writes. New path = StreamingOrchestrator with OC via staging+merge + PBP to Parquet files.
| Scale | Chunk | Old total | New total | Δ | Old peak RAM | New peak RAM |
|---|---|---|---|---|---|---|
| 10 | 10 | 23.8s | 30.2s | +27% | n/a (unbounded) | 594 MB |
| 100 | 100 | 41.6s | 38.0s | -9% | n/a | 4.2 GB |
| 1,000 | 100 | ~10 min | 4.4 min | -56% | n/a | 5.1 GB |
| 10,000 | 500 | (old path couldn't run) | 43.9 min | n/a (no old-path baseline) | n/a | 19.3 GB |
Scale 1,000 breakdown (streaming)
- Total wall time: 262.7s
- Simulation: 109.3s (42%)
- OC write (staging): 55.1s (21%)
- PBP write (Parquet): 54.4s (21%)
- Merge (arrayFlatten): 16.1s (6%)
- Overhead (setup + read-back verify): 27.8s (11%)
- 10 chunks of 100 worlds completed
- Peak working set: 5.1 GB (vs old-path ~20+ GB extrapolated — old path hit memory wall past 1,000 worlds on 32 GB boxes)
- OC rows: 244,799 (merged from 10 × ~24K partial rows)
- PBP rows: 41,972,008 written to 10 Parquet files on local disk
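The percentages in this breakdown can be re-derived from the raw phase timings. The following is a minimal sketch, assuming the five phases listed above are the complete set; the dictionary keys are illustrative names, not identifiers from the experiment code.

```python
# Phase timings in seconds, copied from the scale-1,000 breakdown above.
phases = {
    "simulation": 109.3,
    "oc_write_staging": 55.1,
    "pbp_write_parquet": 54.4,
    "merge": 16.1,
    "overhead": 27.8,
}
total_wall = 262.7


def phase_shares(phase_seconds, total):
    """Return each phase's share of total wall time as a whole percentage."""
    return {name: round(100 * secs / total) for name, secs in phase_seconds.items()}


shares = phase_shares(phases, total_wall)
for name, pct in shares.items():
    print(f"{name:>18}: {pct}%")

# Sanity check: the phases should add up to the reported wall time.
print(f"sum of phases: {sum(phases.values()):.1f}s of {total_wall}s")
```

Running the same function on the scale-10,000 timings is a quick way to confirm that the two breakdowns were computed the same way.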
Scale 10,000 breakdown (streaming, chunk=500, 20 chunks)
- Total wall time: 2,634 s (43.9 min)
- Simulation: 1,083s (41%)
- OC write to staging: 499s (19%)
- PBP write to Parquet: 726s (28%)
- Merge (arrayFlatten): 201s (8%)
- Overhead: ~125s (5%)
- Peak working set: 19.3 GB (higher than expected — chunk=500 accumulator holds 500-world arrays per game × 288 games × ~850 outcomes)
- OC rows (merged): 244,800
- PBP rows: 419,763,551 (~420 million)
- PBP Parquet files: 20 files, ~6.9 GB on disk
- ClickHouse OC table: 2.68 GB compressed, 19.6 GB uncompressed, ratio 7.32× (up from 6.96× at scale 1,000)
Linear-scaling check: scale 1,000 → 10,000
| Metric | Scale 1,000 | Scale 10,000 | Growth |
|---|---|---|---|
| Total wall time | 262.7s | 2,634.4s | 10.0× (perfectly linear) |
| Simulation time | 109.3s | 1,083.0s | 9.9× |
| OC write time | 55.1s | 498.5s | 9.0× |
| PBP write time | 54.4s | 725.8s | 13.3× (super-linear — disk contention?) |
| Merge time | 16.1s | 201.1s | 12.5× (row-group growth matters for arrayFlatten) |
| OC rows | 244,799 | 244,800 | ~same (catalogue-bounded, not world-bounded) |
| PBP rows | 41,972,008 | 419,763,551 | 10.0× |
| OC compression ratio | 6.96× | 7.32× | improves further |
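The growth column can be reproduced, and each metric labelled, with a few lines of arithmetic. This is a sketch only: the 15% tolerance band around 10× is a hypothetical threshold chosen for illustration, not one used by the experiment.

```python
# (scale 1,000 value, scale 10,000 value), copied from the table above.
metrics = {
    "total_wall_s": (262.7, 2634.4),
    "simulation_s": (109.3, 1083.0),
    "oc_write_s": (55.1, 498.5),
    "pbp_write_s": (54.4, 725.8),
    "merge_s": (16.1, 201.1),
    "pbp_rows": (41_972_008, 419_763_551),
}
SCALE_FACTOR = 10.0   # 1,000 -> 10,000 worlds
TOLERANCE = 0.15      # hypothetical: within +/-15% of 10x counts as linear


def classify(small, large):
    """Return (growth factor, label) for one metric."""
    growth = large / small
    if growth > SCALE_FACTOR * (1 + TOLERANCE):
        label = "super-linear"
    elif growth < SCALE_FACTOR * (1 - TOLERANCE):
        label = "sub-linear"
    else:
        label = "linear"
    return growth, label


results = {name: classify(*pair) for name, pair in metrics.items()}
for name, (growth, label) in results.items():
    print(f"{name:>14}: {growth:5.1f}x  {label}")
```

With this band, only PBP write and merge are flagged as super-linear; total wall time stays linear because simulation and OC write dominate it.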
Key observations
- Streaming is SLOWER than in-memory below a threshold (scale 10 adds 27% overhead from staging + merge). That's the cost of chunking and we accept it.
- Streaming wins as scale grows — at 1,000 worlds it's more than 2× faster than the old path, and crucially stays under 6 GB RAM where the old path at this scale struggles on 32 GB machines.
- At 10,000 worlds the streaming path completes successfully in 44 minutes with 19 GB peak RAM — the old path couldn't run at this scale on a 32 GB laptop at all.
- Scaling is nearly linear from 1K → 10K. Total wall time grew 10.0× for 10× more worlds, so no quadratic cost dominates the run, although PBP write (13.3×) and merge (12.5×) each grew faster than linear.
- OC compression keeps improving at scale — 5.56× → 6.39× → 6.96× → 7.32× as array length grows. The Delta+ZSTD codec is doing more work as the Float64 arrays get longer.
- PBP write is super-linear (13.3× for 10× rows) — possibly disk contention on the local SSD, or Parquet row-group boundary effects. Not alarming but worth measuring on faster hardware.
- PBP Parquet files land on local disk (20 files, one per chunk). From there a separate pipeline ingests them into ClickHouse if/when needed.
Memory: chunk-size tradeoff
At scale 10,000 with chunk=500, we hit 19.3 GB peak. This is higher than the 5.1 GB we saw at scale 1,000 with chunk=100. The peak scales with chunk size (bigger arrays per accumulator), not total world count. If we'd picked chunk=100 we'd have seen peak closer to 5 GB with 100 chunks instead of 20 — same total time probably, much lower memory ceiling. Worth sweeping chunk size if we want to push toward 100K on a 32 GB box.
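A rough lower bound on the OC accumulator footprint follows from the figures quoted earlier (288 games, ~850 outcomes per game, one Float64 per world). The sketch below assumes exactly one 8-byte value per world per outcome, which is an interpretation of the accumulator description rather than something the captures confirm. It deliberately ignores PBP buffers and runtime overhead, which account for most of the measured peak.

```python
BYTES_PER_FLOAT64 = 8
GAMES = 288                # games per season, from the breakdown above
OUTCOMES_PER_GAME = 850    # approximate figure quoted above


def oc_payload_bytes(chunk_worlds, games=GAMES, outcomes=OUTCOMES_PER_GAME):
    """Raw Float64 bytes held by the OC accumulators for one chunk (lower bound)."""
    return chunk_worlds * games * outcomes * BYTES_PER_FLOAT64


for chunk in (100, 500):
    gb = oc_payload_bytes(chunk) / 1e9
    print(f"chunk={chunk}: ~{gb:.2f} GB raw OC payload")
```

The raw payload grows with chunk size and is independent of total world count, which is the same dependence the measured peaks show. A chunk-size sweep would still be needed to learn how much of the remaining memory also scales with the chunk.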
1K optimisation pass: where the -17% came from
Captured streaming-scale-1000-profiling.txt (baseline) and streaming-scale-1000-optimized.txt (after two commits). Wall-time dropped 189.88s → 157.43s (-32.5s, -17%). Breaking that down:
| Component | Baseline | Optimised | Δ |
|---|---|---|---|
| Simulation | 79.5s | 78.9s | -0.6s |
| OC write (staging) | 38.9s | 36.6s | -2.3s |
| PBP write (Parquet) | 41.3s | 40.8s | -0.5s |
| Merge (arrayFlatten) | 10.5s | 7.8s | -2.7s |
| Measured subtotal | 170.2s | 164.1s | -6.1s |
| Untracked / readback | ~19.7s | ~-6.7s* | -26.4s |
| Total wall time | 189.88s | 157.43s | -32.5s |
* Negative untracked is measurement noise — stopwatches overlap by a few seconds.
Conclusion: the big win was the readback fix (commit 21bf4e68) — the old post-run row-count loop did 288 full-context reads just to report TotalGameOutcomeRows, burning ~27s that wasn't attributed to any measured phase. Replacing it with a single system.parts metrics query gave the full -17%. The outcome-ID caching (commit 8849faee) was within run-to-run noise at this scale — kept because it's a clean refactor but not the source of the savings. Simulation-layer optimisation isn't planned (it's prototype code); future wins should come from accumulation and derivation.
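The untracked row is simply wall time minus the sum of the measured phases. A minimal sketch of that subtraction, using the numbers from the table (the dictionary keys are illustrative):

```python
# Measured phase timings in seconds, copied from the table above.
baseline = {"simulation": 79.5, "oc_write": 38.9, "pbp_write": 41.3, "merge": 10.5}
optimised = {"simulation": 78.9, "oc_write": 36.6, "pbp_write": 40.8, "merge": 7.8}
baseline_wall, optimised_wall = 189.88, 157.43


def untracked(phase_seconds, wall):
    """Wall time not attributed to any measured phase (may go negative if timers overlap)."""
    return wall - sum(phase_seconds.values())


untracked_before = untracked(baseline, baseline_wall)
untracked_after = untracked(optimised, optimised_wall)
measured_delta = sum(optimised.values()) - sum(baseline.values())

print(f"untracked before: {untracked_before:.2f}s")
print(f"untracked after:  {untracked_after:.2f}s")
print(f"untracked delta:  {untracked_after - untracked_before:.2f}s")
print(f"measured delta:   {measured_delta:.2f}s")
```

The untracked delta and the measured delta together add up to the total wall-time change, which makes it easy to see how much of a saving sits outside the instrumented phases.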
Parallel accumulator spike: 1K OC-only sweep
Per-worker accumulator + per-worker PBP list, merged at chunk boundary (shared-nothing). Same session, same ClickHouse instance, --no-pbp.
| Workers | Total wall | Sim | OC write | Merge | Peak RAM |
|---|---|---|---|---|---|
| 1 | 176.9s | 101.5s | 59.6s | 15.7s | 648 MB |
| 2 | 154.7s | 77.7s | 60.2s | 16.5s | 790 MB |
| 4 | 151.1s | 69.7s | 64.2s | 17.0s | 924 MB |
| 8 | 172.1s | 90.8s | 63.8s | 17.3s | 986 MB |
Findings:
- Sim phase scales 1.45× at N=4 (101→70s). Per-worker accumulators work: the shared-state contention that killed the earlier parallel-worlds attempt is gone.
- Sweet spot is N=4, at -15% total wall time. N=8 regresses because of thread contention plus Docker ClickHouse stealing cycles on the same laptop.
- OC write + merge are I/O-bound and do not parallelise from sim workers. They're why the total speedup is 1.17× not 4×.
- Peak RAM grows ~70–90 MB per added worker. Manageable at 1K; will matter at 10K+ with PBP on.
- Correctness OK: 244,800 merged OC rows at every N.
Compare to previous Channel<T> + shared-accumulator attempt (~3% on 10K): shared-nothing was the right structural call.
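The gap between the 1.45× sim speedup and the 1.17× total speedup in this sweep is what Amdahl's law predicts when only one phase is parallelised. The sketch below applies the textbook formula to the N=1 and N=4 rows; treating OC write and merge as entirely serial is a simplifying assumption.

```python
def amdahl_speedup(parallel_fraction, phase_speedup):
    """Overall speedup when only `parallel_fraction` of the run gets faster."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / phase_speedup)


# Timings in seconds from the sweep table (N=1 and N=4 rows).
n1 = {"sim": 101.5, "oc_write": 59.6, "merge": 15.7, "wall": 176.9}
n4 = {"sim": 69.7, "oc_write": 64.2, "merge": 17.0, "wall": 151.1}

sim_fraction = n1["sim"] / n1["wall"]       # share of the N=1 run spent simulating
sim_speedup = n1["sim"] / n4["sim"]         # measured sim-phase speedup at N=4
predicted = amdahl_speedup(sim_fraction, sim_speedup)
observed = n1["wall"] / n4["wall"]
ceiling = 1.0 / (1.0 - sim_fraction)        # limit if the sim phase took zero time

print(f"sim fraction of N=1 run: {sim_fraction:.2f}")
print(f"predicted total speedup: {predicted:.2f}x")
print(f"observed total speedup:  {observed:.2f}x")
print(f"ceiling (sim-only):      {ceiling:.2f}x")
```

The observed figure lands slightly under the prediction because OC write and merge each got a few seconds slower at N=4 instead of staying flat. The ceiling shows that sim-only parallelism could never have exceeded roughly 2.3× at this scale.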
Push further: parallel OC write + parallel final merge (1K OC-only)
With shared-nothing accumulators in place, the next serial phases were the 288-game merge-and-staging-write loop and the final per-game MergeStagingToFinalAsync loop. Both are embarrassingly parallel (different gameIds, different staging rows, different partitions). Wrapped them in Parallel.ForEachAsync with MaxDegreeOfParallelism = ParallelWorkers.
| Config | Total | Sim | OC write | Merge |
|---|---|---|---|---|
| N=1 (all serial) | 176.2s | 104s | 56s | 16s |
| N=4 sim-only parallel | 151.1s | 70s | 64s | 17s |
| N=4 sim + OC-write parallel | 111.2s | 73s | 22s | 16s |
| N=4 sim + OC-write + merge parallel | 105.9s | 73s | 25s | 7s |
| N=8 sim + OC-write + merge parallel | 112.8s | 82s | 23s | 8s |
Best config: N=4 with all three phases parallel → 105.9s, -40% vs N=1 baseline. Parallel ClickHouse writes are the big unlock: 2.3× faster OC write, 2.1× faster merge. The laptop sim phase caps around N=4 physical cores — N=8 doesn't help and occasionally regresses sim due to thread contention with Docker ClickHouse.
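The project does this with .NET's Parallel.ForEachAsync. As a language-neutral illustration of the same bounded fan-out over independent games, here is a minimal Python sketch using asyncio and a semaphore. The merge_game coroutine is a hypothetical stand-in for the per-game merge call; nothing here is taken from the project's C# code.

```python
import asyncio


async def merge_game(game_id):
    """Hypothetical stand-in for one per-game staging-to-final merge."""
    await asyncio.sleep(0.001)  # simulate I/O-bound work against the database
    return game_id


async def for_each_bounded(items, worker, max_parallel):
    """Run worker(item) for every item with at most max_parallel in flight."""
    semaphore = asyncio.Semaphore(max_parallel)
    stats = {"in_flight": 0, "peak": 0}

    async def run(item):
        async with semaphore:
            stats["in_flight"] += 1
            stats["peak"] = max(stats["peak"], stats["in_flight"])
            try:
                return await worker(item)
            finally:
                stats["in_flight"] -= 1

    # gather preserves input order, so results line up with items.
    results = await asyncio.gather(*(run(item) for item in items))
    return results, stats["peak"]


results, peak = asyncio.run(for_each_bounded(range(288), merge_game, max_parallel=4))
print(f"merged {len(results)} games, peak concurrency {peak}")
```

Capping concurrency matters for the same reason the N=8 rows regress: past a certain point the extra in-flight work competes with the database for the same cores.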
Parallelism scaling: 1K → 5K → 10K (OC-only)
| Scale | N=1 wall | N=4 wall | Δ | N=4 per-world |
|---|---|---|---|---|
| 1K | 176s | 106s | -40% | 106 ms |
| 5K | 772s | 345s | -55% | 69 ms |
| 10K | 1,565s (26.1m) | 724s (12.1m) | -54% | 72 ms |
The parallel speedup improves with scale rather than degrading. At 1K the gain was -40%; at 5K and 10K it is about -55%. Fixed per-chunk overhead amortises better over longer chunks, so per-world cost drops from 106 ms at 1K to roughly 70 ms from 5K onwards.
Scaling at N=4: 5K → 10K = 2.1× (near-linear, matching N=1's 2.03×). 100K OC-only projection at N=4 laptop: ~2.0 hours (10K × 10), down from the earlier ~4.4h estimate extrapolated from the 1K factor. All three scales produced the correct 244,800 merged OC rows.
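The per-world costs and the 100K projection are straightforward to re-derive. The sketch below assumes, as the projection does, that the 10K per-world cost holds unchanged up to 100K on the same laptop.

```python
# N=4 OC-only wall times in seconds, keyed by world count (from the table above).
n4_wall_seconds = {1_000: 106.0, 5_000: 345.0, 10_000: 724.0}

per_world_ms = {worlds: 1000 * secs / worlds for worlds, secs in n4_wall_seconds.items()}
for worlds, ms in per_world_ms.items():
    print(f"{worlds:>6} worlds: {ms:.1f} ms per world")

# Linear extrapolation from the 10K run: 10x the worlds, 10x the time.
target_worlds = 100_000
projected_seconds = n4_wall_seconds[10_000] * target_worlds / 10_000
projected_hours = projected_seconds / 3600
print(f"projected {target_worlds} worlds: {projected_seconds:.0f}s (~{projected_hours:.1f} h)")
```

Because the 5K and 10K per-world costs are within a few milliseconds of each other, extrapolating from 10K is less sensitive to fixed overhead than extrapolating from 1K.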
Side finding — stale-volume hazard fixed in commit. The first 5K N=4 attempt returned 489,599 rows (2×) on a non-fresh Docker volume. Root cause: DROP TABLE IF EXISTS on an Atomic ClickHouse DB defers data cleanup; system.parts WHERE table='X' AND active=1 kept summing old and new parts for several minutes. Fixed by appending SYNC to all DROP TABLE statements in ClickHouseSchemas.cs, which forces synchronous deletion. Verified with back-to-back scale-10 runs: no doubling.
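For reference, the two statements involved can be sketched as string builders. This is illustrative only: the database name comes from the connection string used in the original runs, the table name game_outcomes is hypothetical, and the doubling heuristic is an assumption for the example, not the check the project uses.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _checked(name):
    """Reject anything that is not a plain identifier before interpolating it."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def drop_table_sync_sql(database, table):
    """DROP with SYNC so an Atomic database removes the data before returning."""
    return f"DROP TABLE IF EXISTS {_checked(database)}.{_checked(table)} SYNC"


def active_row_count_sql(database, table):
    """Row count from part metadata; only active parts are summed."""
    return (
        "SELECT sum(rows) FROM system.parts "
        f"WHERE database = '{_checked(database)}' "
        f"AND table = '{_checked(table)}' AND active = 1"
    )


def looks_doubled(observed_rows, expected_rows):
    """Heuristic (assumption): flag counts more than 1.5x the expected size."""
    return observed_rows > 1.5 * expected_rows


print(drop_table_sync_sql("experiment", "game_outcomes"))
print(active_row_count_sql("experiment", "game_outcomes"))
print(looks_doubled(489_599, 244_800))
```

A guard like looks_doubled would have caught the 489,599-row result immediately, since the OC row count is bounded by the catalogue and should not change with world count.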
1K WITH PBP comparison: per-worker PBP concat verified
Same session, PBP enabled:
| Config | Total | Sim | OC write | PBP write | Merge | Peak RAM |
|---|---|---|---|---|---|---|
| N=1 | 255.8s | 124s | 58s | 57s | 16s | 6.67 GB |
| N=4 | 198.2s | 93s | 25s | 72s | 7s | 6.78 GB |
- Wall time -22.5% with PBP on (vs -40% without). PBP write phase remains serial and doesn't parallelise (single Parquet writer per chunk).
- PBP write regressed +15s at N=4 — likely Docker CH saturation + GC pressure during the concurrent OC staging write. Not alarming; the overall wall time still drops meaningfully.
- Peak RAM effectively unchanged (6.67 → 6.78 GB) — PBP buffer dominates memory at this scale, and 4 partial OC accumulators cost ~80 MB total extra.
- Correctness fine: 41.98M PBP rows match sequential.
Compression comparison (localhost, same scale)
Headline: Enabling HTTP body compression on ClickHouse.Driver makes almost no difference on localhost. This is expected — network isn't the bottleneck when the client and server are on the same machine. The compression_ratio metric reflects on-disk column-level ZSTD compression (set in the table DDL), which is independent of the HTTP wire-level compression flag.
| Metric | Scale 10 (raw) | Scale 10 (compressed) | Δ | Scale 100 (raw) | Scale 100 (compressed) | Δ |
|---|---|---|---|---|---|---|
| OC full season write (ms) | 18,708 | 18,563 | -0.8% | 7,678 | 7,190 | -6.3% |
| OC rows/sec | 12,255 | 12,363 | +0.9% | 31,751 | 33,899 | +6.8% |
| OC compression ratio | 5.56× | 5.54× | ~0 | 6.38× | 6.39× | ~0 |
| PBP full season write (ms) | 5,124 | 5,019 | -2.1% | 34,368 | 35,758 | +4.0% |
| PBP rows/sec | 81,853 | 83,807 | +2.4% | 122,166 | 117,424 | -3.9% |
Interpretation:
- Differences are small and mostly within run-to-run noise (~5% variance typical). The largest, about 6–7% on the scale-100 OC write, sits only marginally outside that band.
- On-disk compression ratio is identical across runs (set by CODEC in DDL, not wire-level).
- HTTP body compression matters for remote ClickHouse (cloud, across-network), where the compressed binary payload significantly reduces bytes-over-wire. On localhost over loopback, it's a wash.
Conclusion for the controlled evaluation: Test compression against a remote ClickHouse instance (ClickHouse Cloud or a cloud VM). For local development, keep it off to eliminate CPU cycles spent compressing bytes that stay inside the same machine.
Scale-10 vs Scale-100 Summary
| Metric | 10 worlds | 100 worlds | Growth |
|---|---|---|---|
| OC full season write | 18.7s | 7.7s | 0.4× (faster!) |
| OC rows/sec | 12,255 | 31,751 | 2.6× |
| OC total rows (288 games) | 229,274 | 243,776 | 1.06× (row count grows slightly as more players become active) |
| OC single-game read p50 | 17.3 ms | 25.4 ms | 1.5× |
| OC compressed bytes | 4.76 MB | 31.9 MB | 6.7× |
| OC uncompressed bytes | 26.5 MB | 203.9 MB | 7.7× |
| OC compression ratio | 5.56× | 6.38× | Improves at scale |
| OC bytes per row (compressed) | 20.8 | 131.1 | Array length dominates |
| PBP full season write | 5.1s | 34.4s | 6.7× |
| PBP rows/sec | 81,853 | 122,166 | 1.5× (better batch utilization) |
| PBP total rows | 419,412 | 4,198,561 | 10× |
| PBP compressed bytes | 10.5 MB | 89.2 MB | 8.5× |
| PBP uncompressed bytes | 21.0 MB | 205.0 MB | 9.8× |
| PBP compression ratio | 2.01× | 2.30× | Improves at scale |
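The derived rows in this table (compression ratio and compressed bytes per row) follow directly from the row counts and byte totals. The sketch below treats 1 MB as 10^6 bytes, an assumption that reproduces the 20.8 bytes-per-row figure; small differences from the table come from the byte totals being rounded.

```python
def storage_metrics(rows, compressed_mb, uncompressed_mb):
    """Derive compression ratio and compressed bytes per row from table totals."""
    return {
        "compression_ratio": uncompressed_mb / compressed_mb,
        "compressed_bytes_per_row": compressed_mb * 1_000_000 / rows,
    }


# OC table totals copied from the summary table above.
oc_10 = storage_metrics(rows=229_274, compressed_mb=4.76, uncompressed_mb=26.5)
oc_100 = storage_metrics(rows=243_776, compressed_mb=31.9, uncompressed_mb=203.9)

for label, m in (("OC @ 10 worlds", oc_10), ("OC @ 100 worlds", oc_100)):
    print(
        f"{label}: ratio {m['compression_ratio']:.2f}x, "
        f"{m['compressed_bytes_per_row']:.1f} compressed bytes/row"
    )
```

Bytes per row grows about 6.3× while the array length grows 10×, which is consistent with longer arrays compressing better.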
Key Observations
- Compression ratio validates the design. The Outcome Context (dense Array(Float64)) compresses at 5.56–6.38×, in line with the spec's 5–6× expectation. The play-by-play's sparse MergeTree storage compresses at ~2×.
- OC write got faster per row at 100 worlds. Counter-intuitive but explained: the row count barely changes (229,274 vs 243,776), while each row's array grows from 10 elements to 100, so ClickHouse.Driver amortizes its per-row overhead over 10× more values.
- PBP scales slightly better than linearly. Writing 10× more PBP data takes about 6.7× longer. ClickHouse absorbs the load without falling off.
- Single-game read p50 is under 30 ms at both scales (17.3 ms and 25.4 ms). That is well inside the REQ-OC-R1 targets of 1 second at 10K and 5 seconds at 100K, though those larger scales were not measured in this suite.
- REQ-INFRA-2 client comparison is inconclusive. ClickHouse.Driver won at scale 10 (138 vs 99 rows/sec); ClickHouse.Client won at scale 100 (1022 vs 978). Variance dominates at these sample sizes. A controlled evaluation at realistic workload is still required.
How to Reproduce (historical)
Note: The LBS.Model.AmericanFootball.StorageExperiment project that produced these captures has been removed. The exact commands that generated the .txt samples above are no longer runnable as written, so this section is kept only as a historical record of how the original runs were invoked.
The OC simulation now lives under src/Apps/LBS.OutcomeContext.SimulationRunner* (entry point src/Apps/LBS.OutcomeContext.SimulationRunner/Program.cs). It is configured via OC_* environment variables (for example OC_CLICKHOUSE_CONNECTION, OC_ENGINE, OC_WORLD_COUNT) rather than the old --suite / --backend / --scale CLI flags, so it is not a drop-in equivalent. See that project for current invocation details.
The original (now-removed) StorageExperiment was driven roughly like this:
# Start local ClickHouse, run a suite at a given scale against it, then tear down.
# The StorageExperiment project no longer exists — preserved here for context only.
dotnet run --project <StorageExperiment> -c Release -- \
--suite must-have --backend clickhouse --scale 10 \
--environment "local-docker" \
--connection-string "Host=localhost;Port=8123;Database=experiment;Username=default;Password=clickhouse"
What's Not Yet Tested
- Should-have experiments (REQ-OC-R3/R4/R5, REQ-OC-W3/W5, REQ-PBP-R1–R5, REQ-INFRA-3) — framework ready but not run here
- Nice-to-have experiments (REQ-OC-W4/R6, REQ-PBP-S1)
- 10,000-world and 100,000-world scales (need tuned ClickHouse, not laptop Docker)
- ClickHouse Cloud as a results store (no --cloud-connection-string was set for these runs)