Experiment Run Results

Captured output from real runs of the storage experiment against a local ClickHouse container.

Environment

  • ClickHouse 24.8 in Docker (single node, default config, no resource tuning)
  • Host: dev laptop (Windows, bash via Git Bash)
  • .NET 10, Release build
  • Client: ClickHouse.Driver v0.9.0 and ClickHouse.Client v7.8.0 (REQ-INFRA-2 compares both)

These are indicative results from a dev laptop, not production capacity numbers. A proper benchmark run needs:

  • A tuned ClickHouse instance (CPU, memory, and disk sized appropriately)
  • Controlled network conditions
  • Multiple runs for statistical significance

Files

Streaming vs old-path comparison (end-to-end wall time)

Apples-to-apples: same ClickHouse instance, same simulation, same worlds. Old path = TestDataCache + direct ClickHouse writes. New path = StreamingOrchestrator with OC via staging+merge + PBP to Parquet files.

| Scale (worlds) | Chunk (worlds) | Old total | New total | Δ | Old peak RAM | New peak RAM |
|---|---|---|---|---|---|---|
| 10 | 10 | 23.8s | 30.2s | +27% | n/a (unbounded) | 594 MB |
| 100 | 100 | 41.6s | 38.0s | -9% | n/a | 4.2 GB |
| 1,000 | 100 | ~10 min | 4.4 min | -56% | n/a | 5.1 GB |
| 10,000 | 500 | (old path couldn't run) | 43.9 min | unreachable vs measurable | | 19.3 GB |

Scale 1,000 breakdown (streaming)

  • Total wall time: 262.7s
  • Simulation: 109.3s (42%)
  • OC write (staging): 55.1s (21%)
  • PBP write (Parquet): 54.4s (21%)
  • Merge (arrayFlatten): 16.1s (6%)
  • Overhead (setup + read-back verify): 27.8s (11%)
  • 10 chunks of 100 worlds completed
  • Peak working set: 5.1 GB (vs old-path ~20+ GB extrapolated — old path hit memory wall past 1,000 worlds on 32 GB boxes)
  • OC rows: 244,799 (merged from 10 × ~24K partial rows)
  • PBP rows: 41,972,008 written to 10 Parquet files on local disk

Scale 10,000 breakdown (streaming, chunk=500, 20 chunks)

  • Total wall time: 2,634 s (43.9 min)
  • Simulation: 1,083s (41%)
  • OC write to staging: 499s (19%)
  • PBP write to Parquet: 726s (28%)
  • Merge (arrayFlatten): 201s (8%)
  • Overhead: ~125s (5%)
  • Peak working set: 19.3 GB (higher than expected — chunk=500 accumulator holds 500-world arrays per game × 288 games × ~850 outcomes)
  • OC rows (merged): 244,800
  • PBP rows: 419,763,551 (~420 million)
  • PBP Parquet files: 20 files, ~6.9 GB on disk
  • ClickHouse OC table: 2.68 GB compressed, 19.6 GB uncompressed, ratio 7.32× (up from 6.96× at scale 1,000)

Linear-scaling check — scale 1,000 → 10,000

| Metric | Scale 1,000 | Scale 10,000 | Growth |
|---|---|---|---|
| Total wall time | 262.7s | 2,634.4s | 10.0× (linear) |
| Simulation time | 109.3s | 1,083.0s | 9.9× |
| OC write time | 55.1s | 498.5s | 9.0× |
| PBP write time | 54.4s | 725.8s | 13.3× (super-linear; disk contention?) |
| Merge time | 16.1s | 201.1s | 12.5× (row-group growth matters for arrayFlatten) |
| OC rows | 244,799 | 244,800 | ~same (catalogue-bounded, not world-bounded) |
| PBP rows | 41,972,008 | 419,763,551 | 10.0× |
| OC compression ratio | 6.96× | 7.32× | improves further |
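
Each growth factor above is simply the scale-10,000 figure divided by its scale-1,000 counterpart. A minimal Python sketch that recomputes the column from the captured numbers; the dictionary layout and function name are illustrative and not part of the experiment code:

```python
# Recompute the Growth column from the two measured scales.
# Values are copied from the table above; names are illustrative only.
MEASUREMENTS = {
    # metric: (scale 1,000, scale 10,000)
    "total_wall_s": (262.7, 2634.4),
    "simulation_s": (109.3, 1083.0),
    "oc_write_s": (55.1, 498.5),
    "pbp_write_s": (54.4, 725.8),
    "merge_s": (16.1, 201.1),
    "pbp_rows": (41_972_008, 419_763_551),
}


def growth_factors(measurements):
    """Return {metric: large / small}, rounded to one decimal place."""
    return {
        name: round(large / small, 1)
        for name, (small, large) in measurements.items()
    }


if __name__ == "__main__":
    for name, factor in growth_factors(MEASUREMENTS).items():
        # A factor near 10.0 means the phase scaled linearly with world count.
        print(f"{name:14s} {factor:5.1f}x")
```

Phases whose factor sits well above 10 (PBP write and merge here) are the ones worth re-measuring on faster hardware.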

Key observations

  • Streaming is SLOWER than in-memory below a threshold (scale 10 adds 27% overhead from staging + merge). That's the cost of chunking and we accept it.
  • Streaming wins as scale grows — at 1,000 worlds it's more than 2× faster than the old path, and crucially stays under 6 GB RAM where the old path at this scale struggles on 32 GB machines.
  • At 10,000 worlds the streaming path completes successfully in 44 minutes with 19 GB peak RAM — the old path couldn't run at this scale on a 32 GB laptop at all.
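  • Scaling is nearly linear from 1K → 10K. Total wall time grew 10.0× for 10× more worlds, so there is no sign of a hidden quadratic cost in the total, even though PBP write (13.3×) and merge (12.5×) individually grew faster than the world count.
  • OC compression keeps improving at scale: 5.56× → 6.38× → 6.96× → 7.32× at 10, 100, 1,000 and 10,000 worlds, as array length grows. The Delta+ZSTD codec becomes more effective as the Float64 arrays get longer.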
  • PBP write is super-linear (13.3× for 10× rows) — possibly disk contention on the local SSD, or Parquet row-group boundary effects. Not alarming but worth measuring on faster hardware.
  • PBP Parquet files land on local disk (20 files, one per chunk). From there a separate pipeline ingests them into ClickHouse if/when needed.

Memory — chunk-size tradeoff

At scale 10,000 with chunk=500, we hit 19.3 GB peak. This is higher than the 5.1 GB we saw at scale 1,000 with chunk=100. The peak scales with chunk size (bigger arrays per accumulator), not total world count. If we'd picked chunk=100 we'd have seen peak closer to 5 GB with 100 chunks instead of 20 — same total time probably, much lower memory ceiling. Worth sweeping chunk size if we want to push toward 100K on a 32 GB box.
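
To see how chunk size alone drives the accumulator's footprint, the sketch below estimates the raw Float64 payload held for one chunk. It assumes one 8-byte value per world per outcome, 288 games, and the ~850 outcomes per game quoted in the 10,000-world breakdown. Object overhead, PBP buffers, and runtime overhead are not modelled, so this is a lower bound on the accumulator, not a prediction of peak working set:

```python
# Lower-bound estimate of the raw Float64 payload held by the OC accumulator
# for one chunk. Assumptions (not taken from the experiment code): one 8-byte
# value per world per outcome, 288 games, ~850 outcomes per game.
GAMES = 288
OUTCOMES_PER_GAME = 850  # approximate figure quoted in the 10,000-world breakdown
BYTES_PER_FLOAT64 = 8


def chunk_payload_bytes(chunk_worlds: int) -> int:
    """Raw array bytes for one chunk: games x outcomes x worlds x 8."""
    return GAMES * OUTCOMES_PER_GAME * chunk_worlds * BYTES_PER_FLOAT64


if __name__ == "__main__":
    for chunk in (100, 500):
        gb = chunk_payload_bytes(chunk) / 1e9
        print(f"chunk={chunk}: ~{gb:.2f} GB of raw Float64 data")
```

The payload depends only on chunk size, not on the total world count, which matches the observation above. The measured peaks (5.1 GB and 19.3 GB) are much larger than this bound, so a chunk-size sweep should measure the whole process rather than rely on this arithmetic.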

1K optimisation pass — where the -17% came from

Captured streaming-scale-1000-profiling.txt (baseline) and streaming-scale-1000-optimized.txt (after two commits). Wall-time dropped 189.88s → 157.43s (-32.5s, -17%). Breaking that down:

| Component | Baseline | Optimised | Δ |
|---|---|---|---|
| Simulation | 79.5s | 78.9s | -0.6s |
| OC write (staging) | 38.9s | 36.6s | -2.3s |
| PBP write (Parquet) | 41.3s | 40.8s | -0.5s |
| Merge (arrayFlatten) | 10.5s | 7.8s | -2.7s |
| Measured subtotal | 170.2s | 164.1s | -6.1s |
| Untracked / readback | ~19.7s | ~-6.7s* | -26.4s |
| Total wall time | 189.88s | 157.43s | -32.5s |

* Negative untracked is measurement noise — stopwatches overlap by a few seconds.

Conclusion: the big win was the readback fix (commit 21bf4e68). The old post-run row-count loop did 288 full-context reads just to report TotalGameOutcomeRows, burning roughly 26–27s that was not attributed to any measured phase. Replacing it with a single system.parts metrics query accounts for most of the -17%; the measured phases together contributed only about 6s more. The outcome-ID caching (commit 8849faee) was within run-to-run noise at this scale; it was kept because it is a clean refactor, not because it is the source of the savings. Simulation-layer optimisation isn't planned (it's prototype code); future wins should come from accumulation and derivation.
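
The "untracked" row is derived rather than measured: it is the total wall time minus the sum of the phase stopwatches. A small Python sketch of that subtraction, using the figures from the two captures (the dictionary keys and function name are illustrative):

```python
# Attribute wall time that no phase stopwatch covered ("untracked").
# Figures come from the baseline and optimised captures in the table above.
PHASES = ("simulation", "oc_write", "pbp_write", "merge")

BASELINE = {"simulation": 79.5, "oc_write": 38.9, "pbp_write": 41.3,
            "merge": 10.5, "total": 189.88}
OPTIMISED = {"simulation": 78.9, "oc_write": 36.6, "pbp_write": 40.8,
             "merge": 7.8, "total": 157.43}


def untracked_seconds(run: dict) -> float:
    """Total wall time minus the sum of the measured phases."""
    measured = sum(run[phase] for phase in PHASES)
    return run["total"] - measured


if __name__ == "__main__":
    before = untracked_seconds(BASELINE)
    after = untracked_seconds(OPTIMISED)
    print(f"untracked before: {before:6.2f}s")
    # A negative value means the phase stopwatches overlap slightly.
    print(f"untracked after:  {after:6.2f}s")
    print(f"change:           {after - before:6.2f}s")
```

A large positive untracked figure is a hint that some work (here, the readback loop) is running outside every stopwatch.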

Parallel accumulator spike — 1K OC-only sweep

Per-worker accumulator + per-worker PBP list, merged at chunk boundary (shared-nothing). Same session, same ClickHouse instance, --no-pbp.

| Workers | Total wall | Sim | OC write | Merge | Peak RAM |
|---|---|---|---|---|---|
| 1 | 176.9s | 101.5s | 59.6s | 15.7s | 648 MB |
| 2 | 154.7s | 77.7s | 60.2s | 16.5s | 790 MB |
| 4 | 151.1s | 69.7s | 64.2s | 17.0s | 924 MB |
| 8 | 172.1s | 90.8s | 63.8s | 17.3s | 986 MB |

Findings

  • Sim phase speeds up 1.45× at N=4 (101s → 70s). Per-worker accumulators work: the shared-state contention that killed the earlier parallel-worlds attempt is gone.
  • Sweet spot is N=4, at -15% total wall time. N=8 regresses: thread contention plus Docker ClickHouse stealing cycles on the same laptop.
  • OC write and merge are I/O-bound and do not parallelise from sim workers. They are why the total speedup is 1.17× rather than 4×.
  • Peak RAM grows by roughly 90 MB per added worker up to N=4 (648 MB → 924 MB) and by less beyond that (986 MB at N=8). Manageable at 1K; will matter at 10K+ with PBP on.
  • Correctness OK: 244,800 merged OC rows at every N.

Compare to previous Channel<T> + shared-accumulator attempt (~3% on 10K): shared-nothing was the right structural call.
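
The shared-nothing structure can be sketched in a few lines. This is a Python analogue written for illustration, not the project's .NET code: `simulate_world`, the dict-of-lists accumulator, and the slicing scheme are all hypothetical stand-ins. It shows only the shape of the design (private accumulator per worker, one merge at the chunk boundary); CPython threads will not reproduce the speedups measured above.

```python
# Shared-nothing accumulation: each worker owns a private accumulator and the
# partial results are merged once per chunk. Illustrative Python analogue only;
# simulate_world and the dict-of-lists layout are hypothetical stand-ins.
from concurrent.futures import ThreadPoolExecutor


def simulate_world(world_id: int) -> dict[int, float]:
    """Hypothetical simulation: one value per game for this world."""
    return {game_id: float(world_id * 10 + game_id) for game_id in range(3)}


def run_slice(world_ids: list[int]) -> dict[int, list[tuple[int, float]]]:
    """Worker body: accumulate into a local dict, no locks, no shared state."""
    local: dict[int, list[tuple[int, float]]] = {}
    for world_id in world_ids:
        for game_id, value in simulate_world(world_id).items():
            local.setdefault(game_id, []).append((world_id, value))
    return local


def run_chunk(world_ids: list[int], workers: int) -> dict[int, list[float]]:
    """Split the chunk across workers, then merge partials at the boundary."""
    slices = [world_ids[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(run_slice, slices))
    merged: dict[int, list[tuple[int, float]]] = {}
    for partial in partials:
        for game_id, pairs in partial.items():
            merged.setdefault(game_id, []).extend(pairs)
    # Sort by world id so the merged arrays do not depend on worker count.
    return {g: [v for _, v in sorted(pairs)] for g, pairs in merged.items()}


if __name__ == "__main__":
    chunk = list(range(8))
    # Same merged arrays regardless of how many workers produced them.
    assert run_chunk(chunk, workers=4) == run_chunk(chunk, workers=1)
    print(run_chunk(chunk, workers=4)[0])
```

Sorting by world id during the merge is one way to make the result independent of worker count, which is the property the "244,800 merged OC rows at every N" check relies on.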

Push further — parallel OC write + parallel final merge (1K OC-only)

With shared-nothing accumulators in place, the next serial phases were the 288-game merge-and-staging-write loop and the final per-game MergeStagingToFinalAsync loop. Both are embarrassingly parallel (different gameIds, different staging rows, different partitions). Wrapped them in Parallel.ForEachAsync with MaxDegreeOfParallelism = ParallelWorkers.

| Config | Total | Sim | OC write | Merge |
|---|---|---|---|---|
| N=1 (all serial) | 176.2s | 104s | 56s | 16s |
| N=4 sim-only parallel | 151.1s | 70s | 64s | 17s |
| N=4 sim + OC-write parallel | 111.2s | 73s | 22s | 16s |
| N=4 sim + OC-write + merge parallel | 105.9s | 73s | 25s | 7s |
| N=8 sim + OC-write + merge parallel | 112.8s | 82s | 23s | 8s |

Best config: N=4 with all three phases parallel → 105.9s, -40% vs N=1 baseline. Parallel ClickHouse writes are the big unlock: 2.3× faster OC write, 2.1× faster merge. The laptop sim phase caps around N=4 physical cores — N=8 doesn't help and occasionally regresses sim due to thread contention with Docker ClickHouse.
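
For readers unfamiliar with the pattern, a bounded-concurrency loop over independent per-game operations looks roughly like the sketch below. It is a Python asyncio analogue of the idea behind Parallel.ForEachAsync with a MaxDegreeOfParallelism cap, not the project's code; `merge_game` is a hypothetical placeholder for a per-game staging-to-final merge.

```python
# Bounded-concurrency loop over independent per-game operations, analogous in
# spirit to Parallel.ForEachAsync with MaxDegreeOfParallelism. The merge_game
# coroutine is a hypothetical placeholder for a per-game merge.
import asyncio


async def merge_game(game_id: int) -> int:
    """Hypothetical per-game merge; the sleep stands in for I/O."""
    await asyncio.sleep(0.01)
    return game_id


async def for_each_bounded(game_ids, worker, max_parallel: int):
    """Run worker(game_id) for every id with at most max_parallel in flight."""
    gate = asyncio.Semaphore(max_parallel)
    in_flight = 0
    peak = 0

    async def guarded(game_id):
        nonlocal in_flight, peak
        async with gate:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await worker(game_id)
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(guarded(g) for g in game_ids))
    return results, peak


if __name__ == "__main__":
    results, peak = asyncio.run(for_each_bounded(range(288), merge_game, 4))
    print(len(results), "games merged; peak concurrency", peak)
```

The cap matters because the writes land on the same ClickHouse instance: the N=8 row above suggests that raising it beyond what the host can serve does not help.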

Parallelism scaling — 1K → 5K → 10K (OC-only)

| Scale | N=1 wall | N=4 wall | Δ | N=4 per-world |
|---|---|---|---|---|
| 1K | 176s | 106s | -40% | 106 ms |
| 5K | 772s | 345s | -55% | 69 ms |
| 10K | 1,565s (26.1m) | 724s (12.1m) | -54% | 72 ms |

The parallel speedup improves with scale rather than degrading. At 1K the gain was -40%; at 5K and 10K it's ~-55%. Fixed per-chunk overhead amortises better as the work per chunk grows, so per-world cost drops from 106 ms to ~70 ms from 5K onwards.

Scaling at N=4: 5K → 10K = 2.1× (near-linear, matching N=1's 2.03×). 100K OC-only projection at N=4 laptop: ~2.0 hours (10K × 10), down from the earlier ~4.4h estimate extrapolated from the 1K factor. All three scales produced the correct 244,800 merged OC rows.
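
The per-world costs and the 100K projection follow from simple arithmetic on the N=4 wall times. A minimal sketch, assuming straight-line extrapolation from the 10K run (which is exactly the assumption the projection makes); function names are illustrative:

```python
# Per-world cost and a straight-line 100K projection from the N=4 OC-only runs.
# Wall times are copied from the table above; linear extrapolation is an
# assumption, not a measurement.
N4_WALL_SECONDS = {1_000: 106, 5_000: 345, 10_000: 724}


def per_world_ms(worlds: int, wall_seconds: float) -> float:
    """Average wall-clock cost per simulated world, in milliseconds."""
    return wall_seconds / worlds * 1000


def project_hours(target_worlds: int, ref_worlds: int, ref_seconds: float) -> float:
    """Scale a reference run linearly to target_worlds and convert to hours."""
    return ref_seconds * (target_worlds / ref_worlds) / 3600


if __name__ == "__main__":
    for worlds, seconds in N4_WALL_SECONDS.items():
        print(f"{worlds:>6} worlds: {per_world_ms(worlds, seconds):.0f} ms/world")
    hours = project_hours(100_000, 10_000, N4_WALL_SECONDS[10_000])
    print(f"100K projection: ~{hours:.1f} h")
```

The projection holds only if per-world cost stays near the ~70 ms seen at 5K and 10K, and it covers the OC-only path; PBP is not included.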

Side finding — stale-volume hazard fixed in commit. The first 5K N=4 attempt returned 489,599 rows (2×) on a non-fresh Docker volume. Root cause: DROP TABLE IF EXISTS on an Atomic ClickHouse DB defers data cleanup; system.parts WHERE table='X' AND active=1 kept summing old and new parts for several minutes. Fixed by appending SYNC to all DROP TABLE statements in ClickHouseSchemas.cs, which forces synchronous deletion. Verified with back-to-back scale-10 runs: no doubling.
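
The two statements involved in that fix can be written out as plain SQL strings. The helper names and the example table name below are hypothetical (the real statements live in ClickHouseSchemas.cs); only the SQL text reflects ClickHouse syntax, and the sketch builds strings without connecting to a server:

```python
# Build the two statements involved in the stale-volume fix as SQL strings.
# Helper names and the example table name are hypothetical. Table names are
# assumed to be trusted identifiers, since they are interpolated directly.
def drop_table_sync(table: str) -> str:
    """DROP with SYNC, so an Atomic database deletes the data before returning."""
    return f"DROP TABLE IF EXISTS {table} SYNC"


def active_row_count_query(table: str) -> str:
    """Row count summed over the active parts of one table."""
    return (
        "SELECT sum(rows) FROM system.parts "
        f"WHERE table = '{table}' AND active = 1"
    )


if __name__ == "__main__":
    print(drop_table_sync("oc_final"))
    print(active_row_count_query("oc_final"))
```

Without SYNC, the count query can still see parts from the dropped table for a while, which is how the doubled row count appeared.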

1K WITH PBP comparison — per-worker PBP concat verified

Same session, PBP enabled:

| Config | Total | Sim | OC write | PBP write | Merge | Peak RAM |
|---|---|---|---|---|---|---|
| N=1 | 255.8s | 124s | 58s | 57s | 16s | 6.67 GB |
| N=4 | 198.2s | 93s | 25s | 72s | 7s | 6.78 GB |

  • Wall time -22.5% with PBP on (vs -40% without). PBP write phase remains serial and doesn't parallelise (single Parquet writer per chunk).
  • PBP write regressed +15s at N=4 — likely Docker CH saturation + GC pressure during the concurrent OC staging write. Not alarming; the overall wall time still drops meaningfully.
  • Peak RAM effectively unchanged (6.67 → 6.78 GB) — PBP buffer dominates memory at this scale, and 4 partial OC accumulators cost ~80 MB total extra.
  • Correctness fine: 41.98M PBP rows match sequential.

Compression comparison (localhost, same scale)

Headline: Enabling HTTP body compression on ClickHouse.Driver makes almost no difference on localhost. This is expected — network isn't the bottleneck when the client and server are on the same machine. The compression_ratio metric reflects on-disk column-level ZSTD compression (set in the table DDL), which is independent of the HTTP wire-level compression flag.

| Metric | Scale 10 (raw) | Scale 10 (compressed) | Δ (scale 10) | Scale 100 (raw) | Scale 100 (compressed) | Δ (scale 100) |
|---|---|---|---|---|---|---|
| OC full season write (ms) | 18,708 | 18,563 | -0.8% | 7,678 | 7,190 | -6.3% |
| OC rows/sec | 12,255 | 12,363 | +0.9% | 31,751 | 33,899 | +6.8% |
| OC compression ratio | 5.56× | 5.54× | ~0 | 6.38× | 6.39× | ~0 |
| PBP full season write (ms) | 5,124 | 5,019 | -2.1% | 34,368 | 35,758 | +4.0% |
| PBP rows/sec | 81,853 | 83,807 | +2.4% | 122,166 | 117,424 | -3.9% |

Interpretation:

  • Differences are within or close to run-to-run noise (~5% variance typical); the largest swing is +6.8% in OC rows/sec at scale 100.
  • On-disk compression ratio is effectively identical across runs (it is set by the CODEC in the DDL, not by wire-level compression).
  • HTTP body compression matters for remote ClickHouse (cloud, across-network), where the compressed binary payload significantly reduces bytes over the wire. On localhost over loopback, it's a wash.

Conclusion for the controlled evaluation: Test compression against a remote ClickHouse instance (ClickHouse Cloud or a cloud VM). For local development, keep it off to eliminate CPU cycles spent compressing bytes that stay inside the same machine.

Scale-10 vs Scale-100 Summary

| Metric | 10 worlds | 100 worlds | Growth |
|---|---|---|---|
| OC full season write | 18.7s | 7.7s | 0.4× (faster!) |
| OC rows/sec | 12,255 | 31,751 | 2.6× |
| OC total rows (288 games) | 229,274 | 243,776 | 1.06× (row count grows slightly as more players become active) |
| OC single-game read p50 | 17.3 ms | 25.4 ms | 1.5× |
| OC compressed bytes | 4.76 MB | 31.9 MB | 6.7× |
| OC uncompressed bytes | 26.5 MB | 203.9 MB | 7.7× |
| OC compression ratio | 5.56× | 6.38× | Improves at scale |
| OC bytes per row (compressed) | 20.8 | 131.1 | Array length dominates |
| PBP full season write | 5.1s | 34.4s | 6.7× |
| PBP rows/sec | 81,853 | 122,166 | 1.5× (better batch utilization) |
| PBP total rows | 419,412 | 4,198,561 | 10× |
| PBP compressed bytes | 10.5 MB | 89.2 MB | 8.5× |
| PBP uncompressed bytes | 21.0 MB | 205.0 MB | 9.8× |
| PBP compression ratio | 2.01× | 2.30× | Improves at scale |
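
The ratio and bytes-per-row rows are derived from the byte and row counts in the same table. A short sketch of that derivation, assuming MB means 10^6 bytes (an assumption about the table's units) and using illustrative function names. Because the inputs are already rounded, the results agree with the table only to within rounding:

```python
# Recompute derived storage metrics from the byte and row counts above.
# MB is taken as 10^6 bytes, which is an assumption about the table's units.
MB = 1_000_000


def compression_ratio(uncompressed_mb: float, compressed_mb: float) -> float:
    """Uncompressed size divided by compressed size."""
    return uncompressed_mb / compressed_mb


def bytes_per_row(compressed_mb: float, rows: int) -> float:
    """Average compressed bytes stored per row."""
    return compressed_mb * MB / rows


if __name__ == "__main__":
    # (label, uncompressed MB, compressed MB, rows) copied from the table.
    samples = [
        ("OC  @ 10 worlds", 26.5, 4.76, 229_274),
        ("OC  @ 100 worlds", 203.9, 31.9, 243_776),
        ("PBP @ 10 worlds", 21.0, 10.5, 419_412),
        ("PBP @ 100 worlds", 205.0, 89.2, 4_198_561),
    ]
    for label, raw_mb, packed_mb, rows in samples:
        ratio = compression_ratio(raw_mb, packed_mb)
        per_row = bytes_per_row(packed_mb, rows)
        print(f"{label:17s} ratio {ratio:.2f}x, {per_row:.1f} bytes/row")
```

OC bytes per row grow with world count because each row carries a longer array, while PBP bytes per row stay roughly flat because PBP adds rows instead.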

Key Observations

  1. Compression ratio validates the design. The Outcome Context (dense Array(Float64)) compresses at 5.56–6.38×, at or slightly above the spec's 5–6× expectation. The play-by-play's sparse MergeTree storage compresses at ~2×.

  2. OC write got faster at 100 worlds, both in total and per row. Counter-intuitive but explained: the row count barely changes (229,274 → 243,776) while each row's array grows from 10 to 100 elements, so ClickHouse.Driver's per-row overhead is amortized over ten times as much data per row.

  3. PBP scales slightly better than linearly. Writing 10× more PBP rows takes ~6.7× longer. ClickHouse absorbs the load without falling off.

  4. Single-game read is sub-30 ms (p50) at both scales. That is far below the REQ-OC-R1 targets of 1 second at 10K and 5 seconds at 100K, although those targets apply to scales this table does not cover.

  5. REQ-INFRA-2 client comparison is inconclusive. ClickHouse.Driver won at scale 10 (138 vs 99 rows/sec); ClickHouse.Client won at scale 100 (1022 vs 978). Variance dominates at these sample sizes. A controlled evaluation at realistic workload is still required.

How to Reproduce (historical)

Note: The LBS.Model.AmericanFootball.StorageExperiment project that produced these captures has been removed. The exact commands that generated the .txt samples above are no longer runnable as written, so this section is kept only as a historical record of how the original runs were invoked.

The OC simulation now lives under src/Apps/LBS.OutcomeContext.SimulationRunner* (entry point src/Apps/LBS.OutcomeContext.SimulationRunner/Program.cs). It is configured via OC_* environment variables (for example OC_CLICKHOUSE_CONNECTION, OC_ENGINE, OC_WORLD_COUNT) rather than the old --suite / --backend / --scale CLI flags, so it is not a drop-in equivalent. See that project for current invocation details.

The original (now-removed) StorageExperiment was driven roughly like this:

# Start local ClickHouse, run a suite at a given scale against it, then tear down.
# The StorageExperiment project no longer exists — preserved here for context only.
dotnet run --project <StorageExperiment> -c Release -- \
  --suite must-have --backend clickhouse --scale 10 \
  --environment "local-docker" \
  --connection-string "Host=localhost;Port=8123;Database=experiment;Username=default;Password=clickhouse"

What's Not Yet Tested

  • Should-have experiments (REQ-OC-R3/R4/R5, REQ-OC-W3/W5, REQ-PBP-R1–R5, REQ-INFRA-3) — framework ready but not run here
  • Nice-to-have experiments (REQ-OC-W4/R6, REQ-PBP-S1)
  • 10,000-world and 100,000-world scales on a tuned ClickHouse instance (the 10,000-world runs above used laptop Docker, and 100,000 worlds is only a projection)
  • ClickHouse Cloud as a results store (no --cloud-connection-string was set for these runs)