Storage Experiment — R&D Findings¶
Prepared: 2026-04-22
Activity period covered: 2026-04-16 → 2026-04-21 (Australian income year FY2025-26)
Code branch: LBS-1183
Primary author / responsible engineer: Mark Harris
Purpose of this document: Evidence package for the Australian R&D Tax Incentive (RDTI) program. Sets out the technical uncertainties that could not be resolved from existing knowledge at the outset of the project, the systematic progression of hypothesis-driven experimental work undertaken to resolve them, and the new knowledge obtained — including the knowledge obtained from experiments whose outcomes disproved the hypothesis.
This document is structured to align with the AusIndustry guidance on "core R&D activities" under s355-25 ITAA 1997: each experiment block states the prior knowledge gap, the hypothesis, the systematic method, the observation, and the logical conclusion drawn. Where experiments produced a negative result (hypothesis disproved), the knowledge gained is stated explicitly — negative results count as new knowledge for RDTI purposes.
Index¶
- 1. Project context and overall R&D purpose
- 2. Technical uncertainties at project outset
- 3. Systematic progression of experimental work
- 3.1 Columnar store viability for dense Array(Float64) simulation arrays
- 3.2 Basket-read latency as a function of scale
- 3.3 Full-context read latency projection from 1K to 100K
- 3.4 Parallel-simulation via shared accumulator and Channel
- 3.5 Shared-nothing parallel accumulator + parallel staging write + parallel merge
- 3.6 Cross-run data isolation under ClickHouse Atomic databases
- 3.7 Parquet-on-ephemeral-disk write path viability on Azure Container Apps
- 3.8 Play-by-play partitioning under ClickHouse Cloud insert constraints
- 3.9 Connection-level
async_insertbehaviour for many small bulk inserts - 3.10 Server-side memory ceiling under high merge parallelism
- 3.11 Staging-table OPTIMIZE FINAL before per-game merge
- 3.12 Hash-sharded per-game merge
- 3.13 PBP super-linearity on single-partition 100K load
- 3.14 Cloud tier upgrade as a capacity lever
- 3.15 Season-accumulator memory layout at 100K
- 3.16 Spec-driven schema review applied to validated baseline
- 3.17 Redis as candidate primary store
- 4. New knowledge obtained
- 5. Unresolved uncertainties
- 6. Evidence index
- 7. Timeline
1. Project context and overall R&D purpose¶
LuckBox Studios operates a sports prediction platform in which users wager on outcomes derived from simulated competitions. The platform's economic model requires the ability to materialise, store, and serve per-outcome probability distributions produced by running Monte Carlo simulations of a competition at 100,000-world scale — i.e., 100,000 independent simulated seasons of an American Football league, producing approximately 245,000 per-outcome Float64 arrays each of length 100,000 (one array per outcome per game) plus 4.2 billion play-by-play event rows per run.
No published benchmark, vendor guidance, or prior-art reference was known to address the specific combination of (a) data shape — dense Array(Float64) per outcome indexed by world, (b) access pattern — 1-30 outcome "basket" reads with optional full-game context reads under interactive latency budget, (c) total volume — approximately 200 GB uncompressed per run and 4.2 billion ancillary event rows, (d) operational envelope — a single run must complete within an interactive analytical working day (sub-4-hour target, sub-3-hour stretch) on commodity cloud infrastructure at a per-run compute cost consistent with an indie studio operating budget (sub-$30/run target).
Candidate architectures existed in general terms (columnar analytical stores such as ClickHouse; in-memory key-value stores such as Redis; object-store Parquet archives) but for each candidate, the questions of whether it would meet the shape/pattern/volume/envelope constraints in combination, and what implementation-level choices (partitioning, codec, orchestration, parallelism, memory layout) would be required to do so, could not be answered from vendor documentation, published benchmarks, or prior engineering experience. Each candidate choice therefore required experimental validation.
The R&D activity described in this document was undertaken to resolve those uncertainties through a systematic, hypothesis-driven progression of work, and to generate new knowledge that would let LuckBox Studios build the storage subsystem for the platform on a sound empirical footing rather than on speculation.
The outcome of the work (recorded as the primary result throughout this document) is a validated end-to-end path: 100K worlds with all three outcome tables (game outcome context, play-by-play, season outcome context) written correctly in 2 hours 58 minutes for approximately $27 per run on ClickHouse Cloud Production tier (3×16 configuration). A significant portion of the experimental work documented below contributed to refining or invalidating choices that had initially seemed correct from first principles or vendor guidance; those negative results are as important to the R&D record as the positive results and are documented with equal rigour.
2. Technical uncertainties at project outset¶
The following technical uncertainties were identified at project outset. Each could not be resolved from existing public knowledge, vendor documentation, or prior engineering experience, and each required systematic experimental investigation to resolve. Each experiment block in section 3 addresses one or more of these uncertainties.
- Columnar store fit. Whether a columnar analytical database (ClickHouse) could store and retrieve approximately 245,000 dense
Array(Float64)rows per run at the target compression ratio (5× minimum from spec), given that the documented use cases for ClickHouse array columns centre on sparse event data rather than dense Monte Carlo output. - Read-latency projection. Whether read latency for the two access patterns (1-30 outcome basket, full ~850-outcome game context) would remain query-planner-bound (flat with scale) or become data-transfer-bound (linear in payload) between 1K worlds and 100K worlds.
- Write-path concurrency model. Which concurrency model would deliver linear-with-core-count throughput for the accumulator → storage write pipeline, given that the inner accumulator state structure is not inherently thread-safe and shared-producer/single-consumer queuing is the conventionally-recommended pattern.
- Client-library behaviour at Cloud scale. Whether the official
ClickHouse.Driverclient library's bulk-copy path would behave consistently at Cloud scale across different schema choices, client-side chunk sizes, and Cloud-side constraints not documented in the vendor's published best-practice material. - ClickHouse Cloud operational constraints. Whether Cloud-tier constraints (documented and undocumented) would require schema and orchestration choices that diverge from the vendor's own best-practice documentation, and how those constraints interact with the data-shape of a dense simulation output.
- Memory-scale behaviour of dense accumulators. Whether accumulator data structures that tested correct at 1K-world scale would also behave correctly and within memory budget at 100K-world scale, or whether the internal layout (dictionary overhead, string duplication, list growth patterns) would cause super-linear memory growth.
- Alternative-store cost/performance trade-off. Whether an alternative primary store (Redis) could plausibly out-perform ClickHouse on either the read or the write side at the target scale, given the specific data shape and access pattern.
- End-to-end run-cost economics. Whether the total cost of a 100K-world run (compute + storage + data-transfer + any operational overhead) could be brought within the target of approximately $30/run on a publicly-priced Cloud configuration, and by what combination of resource-sizing and software-optimisation choices.
3. Systematic progression of experimental work¶
Each subsection describes one experimental activity following the structure: Prior knowledge gap → Hypothesis → Method → Observation → Conclusion / New knowledge → Evidence references. Experiments are ordered broadly chronologically. Negative results (hypothesis disproved) are retained in full and not elided — each one contributed knowledge that constrained subsequent experimental choices.
3.1 Columnar store viability for dense Array(Float64) simulation arrays¶
Prior knowledge gap. ClickHouse's array column type is documented and benchmarked primarily on event-style data (variable-length, sparse). Its behaviour on dense simulation-derived Array(Float64) of uniform length-N, used as the primary physical representation of per-outcome probability distributions, was not covered by vendor benchmarks or published third-party work.
Hypothesis. A ClickHouse Array(Float64) CODEC(Delta, ZSTD) column, used as the payload of a ReplacingMergeTree(context_version) table ordered by (game_id, outcome_id) and partitioned by season_id, would deliver (a) better than 5× on-disk compression for simulation-derived Float64 arrays and (b) correct round-trip of inserted data through the official ClickHouse.Driver and ClickHouse.Client .NET libraries.
Method. Implemented the game_outcome_context and game_outcome_context_staging schemas in ClickHouseSchemas.cs. Implemented the ClickHouseBackend class using ClickHouse.Driver with parameterised SQL. Ran the MustHaveSuite (REQ-OC-W1 write, REQ-OC-R1 read, REQ-OC-R3 full-context read, REQ-OC-S1 storage-measurement) at scales 10, 100, 1,000 worlds against a local Dockerised ClickHouse 24.8 instance. Measured on-disk compressed and uncompressed size via system.parts.bytes_on_disk and bytes_uncompressed. Measured end-to-end round-trip correctness via integration tests using Testcontainers.
Observation. - Compression ratio at 10 worlds: 5.56×. At 100 worlds: 6.39×. At 1,000 worlds: 6.96×. At 10,000 worlds: 7.32×. Ratio trending upward with array length (longer arrays amortise Delta+ZSTD metadata). - Round-trip correctness: 244,800 rows written and read back with matching values in all integration tests. - Full-season write time at 1,000 worlds: 48.6s sustained (approximately 5K rows/second, array-size-bound).
Conclusion / New knowledge. The hypothesis was confirmed. Dense simulation-derived Array(Float64) does compress at ratios better than 5× under Delta+ZSTD (contrary to the theoretical expectation that Delta codecs do not help on near-random floats — the Delta encoding on the array-metadata path and ZSTD on the payload together deliver the observed ratio). Round-trip is reliable through the official driver. New operational knowledge: compression ratio improves with longer arrays because metadata cost amortises; this means the 100K-scale ratio will be at least as good as the 1K-scale ratio, which was material to the 100K storage-cost projection.
Evidence. Commits 7bd32122, 122a43e8, b985b85b, 85c7a4fb. Run logs docs/samples/experiment-runs/must-have-scale-10.txt, must-have-scale-100.txt, must-have-scale-1000.txt. Schema file src/Models/AmericanFootball/LBS.Model.AmericanFootball.StorageExperiment/Infrastructure/ClickHouse/ClickHouseSchemas.cs (committed form).
3.2 Basket-read latency as a function of scale¶
Prior knowledge gap. The user-facing access pattern is a "basket read" — retrieval of 1 to 30 outcomes for a single game at a time, which maps onto a ClickHouse query with a primary-key-prefix filter (game_id = ?) and an outcome_id IN (...) secondary filter. It was not known whether basket-read latency would remain near-constant at a given basket size regardless of world count (because per-row payload length grows with world count: 1K worlds = 8 KB per row, 100K worlds = 800 KB per row).
Hypothesis. Basket-read latency is dominated by the query planner and index lookup, not by payload-size, up to the point at which network transfer becomes the binding cost. For basket sizes 1-30 rows at world counts up to 1,000, the latency will stay inside a 50ms interactive budget.
Method. Added the REQ-OC-R3-BASKET experiment measuring mean, p50, and p95 latency for basket sizes {1, 5, 10, 20, 30} at world counts {10, 100, 1,000}. Ran the sweep against a local Dockerised ClickHouse. Each measurement: 100 independent basket reads, different random game + outcome set each time.
Observation. Mean latency for basket sizes 1-30: 8.3-20.5 ms across all scales and basket sizes. Latency at basket size 30, 1K worlds: 20.4 ms. No monotonic increase with basket size at a given scale (indicating the per-outcome cost inside a single query is near-zero; the cost is dominated by the one-time query setup).
Conclusion / New knowledge. The hypothesis was confirmed for the scales tested. Basket-read latency is query-planner-bound, not payload-bound, at 1K scale. Projection to 100K (where payload per row is 100× larger) remains within interactive budget: the measured pattern (flat with scale) gives an approximately 30-40 ms projection at 100K, well inside the 1-second spec target. This is the lever that unblocks use-case (1-30 outcome basket prediction) independently of the more expensive full-context read path (next experiment).
Evidence. Commits af8748ce, 07ae9a9f. Run log docs/samples/experiment-runs/basket-all-scales.txt.
3.3 Full-context read latency projection from 1K to 100K¶
Prior knowledge gap. The less common access pattern is a "full-context read" — retrieval of all approximately 850 outcomes for a single game, each row carrying an Array(Float64) of length equal to the world count. At 1K worlds a full game payload is approximately 6.8 MB; at 100K worlds it is approximately 680 MB. It was not known whether query planner dominance (as established in 3.2 for basket reads) would also hold for full-context reads at 100K, or whether payload-transfer would become the binding constraint.
Hypothesis. Full-context read latency would remain query-planner-bound at 100K, giving a projection of approximately 3 seconds. Alternative (fallback) hypothesis: latency becomes transfer-bound, giving a projection of approximately 15 seconds.
Method. Measured full-context read latency (all 850 outcomes for one game) at world counts 10, 100, 1,000. p50 and p95 recorded. Projected forward to 100K under both hypotheses.
Observation. Latency at 10 worlds: p50 17 ms, p95 28 ms. At 100 worlds: p50 25 ms, p95 27 ms. At 1,000 worlds: p50 200 ms, p95 322 ms. The 100 → 1,000 step (10× more data) moved p50 from 25 ms to 200 ms — an 8× increase, indicating partial transfer-boundedness rather than full planner-boundedness.
Conclusion / New knowledge. The hypothesis was partially confirmed and partially disproved. Full-context read latency is neither fully planner-bound nor fully transfer-bound at the measured scales; the growth ratio (8× latency for 10× data) sits between the two expected extremes. Honest projection to 100K is therefore a range of 3-15 seconds, not a point estimate — a material finding for the spec-target evaluation (the spec calls for 5 seconds at 100K). Unresolved uncertainty: which regime dominates at 100K cannot be established without a 100K full-context read measurement, which was not executed within this project's scope. The finding is recorded in status docs as "3-15 seconds honest range" rather than a single projected number.
Evidence. Same run logs as 3.2, and sections of docs/outcome-context/evaluations/storage-experiment/status.md titled "Full-context vs basket reads" and "Full-context projection at 100K worlds".
3.4 Parallel-simulation via shared accumulator and Channel¶
Prior knowledge gap. The simulation inner loop (SeasonEngine.SimulateSeason + accumulator.AccumulateGame) accounts for approximately 42% of end-to-end wall time at 10K worlds. The conventional .NET pattern for parallelising a CPU-bound producer → serial consumer pipeline is System.Threading.Channels.Channel<T> with a bounded capacity and multiple producer tasks. It was not known whether this pattern would deliver 3-5× throughput improvement on a shared in-memory accumulator whose internal dictionaries are not thread-safe.
Hypothesis. An 8-way producer set feeding a bounded Channel<T> that drains into a single-threaded accumulator would deliver 3-5× simulation throughput improvement at 10K worlds.
Method. Implemented the parallel-simulation path (--parallel-worlds N and --channel-capacity N) in StreamingOrchestrator. Ran the 10K streaming pipeline with N=8 and channel capacity 1,000. Compared against a serial baseline at the same scale. Measured wall time breakdown by phase (simulation, OC write, PBP write, merge) and peak client RAM.
Observation. N=8 delivered 2,548s total wall time vs serial 2,634s — a 3.3% improvement (within run-to-run noise). Simulation sub-phase: 1,062s vs 1,083s (2% improvement). PBP write sub-phase: 741s vs 726s (2% regression). Merge sub-phase: 136s vs 201s (32% improvement).
Conclusion / New knowledge. The hypothesis was disproved. The shared-consumer accumulator is the binding bottleneck — producers spend most of their time blocked on channel.WriteAsync because the single consumer cannot drain fast enough. Additionally, on a single-host Docker setup the ClickHouse container competes for the same host cores as the simulation threads, explaining the slight regression on the PBP write phase (where ClickHouse is CPU-throttled by competition with the sim threads). The merge sub-phase improvement (32%) is consistent with this explanation: after simulation finishes, the host has spare CPU that ClickHouse's merge work can use.
New knowledge obtained: on a single-host architecture with the database co-resident, a shared-accumulator pattern cannot be materially parallelised regardless of producer count; the accumulator itself must become shared-nothing (per-worker) before parallelism pays off. This result directly motivated the successor experiment (3.5).
The parallel-simulation feature was reverted from the codebase in commit 4794ba36 because the added complexity was not earning its keep, with a note in the status doc documenting the reason for the revert so the finding would not be lost.
Evidence. Commits 0e528114, 17ed620b, e35eb2ba, 90dfc2e8, 4794ba36 (revert). Run logs docs/samples/experiment-runs/streaming-scale-10000-parallel.txt and streaming-scale-10000.txt.
3.5 Shared-nothing parallel accumulator + parallel staging write + parallel merge¶
Prior knowledge gap. Having established in 3.4 that a shared consumer is the binding bottleneck, it was not known whether a shared-nothing redesign (per-worker accumulator state, merged at chunk flush) combined with parallelisation of the per-game staging write and per-game final merge loops would deliver the originally-hypothesised 3-5× improvement, or whether some other bottleneck (host CPU, ClickHouse container, Docker virtualisation) would limit the gain.
Hypothesis. Per-worker shared-nothing accumulators plus Parallel.ForEachAsync over the 288-game staging-write loop plus Parallel.ForEachAsync over the 288-game final-merge loop would deliver at least 2× end-to-end improvement at 1K worlds, with the factor improving at larger scales as the per-chunk overhead amortises.
Method. Implemented three structural changes in three separate commits:
1. --parallel-workers N flag with per-worker AmericanFootballOutcomeAccumulator and per-worker List<PlayByPlayRecord>; partials merged into a shared chunk-level accumulator at chunk flush.
2. Parallel.ForEachAsync with MaxDegreeOfParallelism = N around the per-chunk staging-write loop.
3. Parallel.ForEachAsync with MaxDegreeOfParallelism = N around the final per-game merge loop.
Swept N ∈ {1, 4, 8} at 1K OC-only (5 distinct configs measured separately), 1K with PBP (2 configs), 5K OC-only (2 configs), 10K OC-only (2 configs). Measured wall time by phase plus peak RAM per config. Verified 244,800 row count at every config matches the serial baseline.
Observation. - 1K OC-only: N=1 176s → N=4 106s (-40%) → N=8 113s (slight regression at N=8 due to physical-core exhaustion on the 4-core test laptop). - 5K OC-only: N=1 772s → N=4 345s (-55%). - 10K OC-only: N=1 1,565s → N=4 724s (-54%). - 1K with PBP: N=1 256s → N=4 198s (-22.5%; PBP write stays serial and now dominates). - All row counts match baseline at every N.
Conclusion / New knowledge. Hypothesis confirmed, and the factor improves with scale as predicted (1K: 40%, 5K/10K: 55%). The three structural changes all earn their complexity: - Per-worker accumulators account for the sim-phase gain. - Parallel staging write is the single largest contributor at 5K/10K (OC write 56s → 22s at N=4). - Parallel merge also delivers 2.1× (16s → 7s at N=4).
New knowledge obtained: (a) ClickHouse accepts concurrent inserts against the same staging table without throttling at N=4 on a local Docker instance; (b) the arrayFlatten merge step is also safely parallelisable by game_id without cross-talk; (c) on a 4-core host, N=8 is counter-productive due to core exhaustion; (d) the PBP Parquet write stays serial under this design and becomes the dominant phase in the full pipeline. Finding (d) motivated experiment 3.7. Finding (a) became a load-bearing assumption for the subsequent Cloud-scale work.
Evidence. Commits 39b24d4c, e0af3495, 0af492e1 (DROP TABLE SYNC fix landed in the same parallel sweep). Run logs docs/samples/experiment-runs/streaming-scale-1000-no-pbp-p1.txt, ...-p4.txt, ...-p8.txt, ...-p1-pwrite.txt, ...-p4-pwrite.txt, ...-p8-pwrite.txt, ...-p8-pmerge.txt, ...-p4-pmerge.txt, streaming-scale-1000-with-pbp-p1.txt, ...-p4.txt, streaming-scale-5000-no-pbp-p4.txt, streaming-scale-10000-no-pbp-p1.txt, ...-p4.txt.
3.6 Cross-run data isolation under ClickHouse Atomic databases¶
Prior knowledge gap. During the parallelism validation at 5K (experiment 3.5), a first-attempt run returned 489,599 rows — exactly 2× the correct 244,800. The cause was not immediately obvious; the code logic had been verified at 1K and the parallelism itself was not changing the semantics of the write.
Hypothesis (diagnostic). Either (a) the new parallelism was causing double-writes, (b) the DROP TABLE + CREATE TABLE reset logic was failing, or (c) the issue was in the measurement path rather than the write path.
Method. Queried system.parts WHERE table='game_outcome_context' AND active=1 in the middle of a run; observed parts from both the current run and the prior run's game_outcome_context still active. Consulted ClickHouse documentation on Atomic database (default since CH 20.10) DROP semantics: DROP TABLE on an Atomic database defers data-file cleanup asynchronously; physical rows can persist under active=1 for several minutes after the logical DROP returns. Tested the fix of appending SYNC to every DROP TABLE statement (DROP TABLE IF EXISTS game_outcome_context SYNC), which forces ClickHouse to wait for data-file cleanup before returning.
Observation. With SYNC appended to all DROP TABLE statements, two consecutive 1K runs in the same Docker container produced 244,800 rows each (correct). Without SYNC, the same sequence reproduced the 2× row count issue 100% of the time. Earlier 1K results (which were unaffected) had all been run in containers that were torn down between measurement batches — the defer didn't fire across container tear-down.
Conclusion / New knowledge. The doubling was a measurement-path problem caused by ClickHouse Atomic database DROP semantics, not a write-path problem, and not caused by the new parallelism. Fix: append SYNC to all DROP TABLE statements in the experiment setup path. New knowledge: any ClickHouse experiment that runs multiple iterations in the same database connection/container must use DROP TABLE ... SYNC or use a fresh schema namespace per run; system.parts active=1 can and will show stale parts from prior runs if DROPs are asynchronous. This is a load-bearing operational finding for all subsequent Cloud-scale runs.
Evidence. Commit 0af492e1. Schema file ClickHouseSchemas.cs (DROP statements include SYNC).
3.7 Parquet-on-ephemeral-disk write path viability on Azure Container Apps¶
Prior knowledge gap. The play-by-play (PBP) write path had been validated on a developer laptop using a Parquet writer writing to the local SSD through an IBlobSink abstraction (LocalDiskBlobSink). It was not known whether this same path would work on Azure Container Apps, where the container's writable filesystem is ephemeral container-local storage rather than a persistent SSD. Azure's documentation does not characterise the ephemeral filesystem for per-chunk fsync workloads such as those the Parquet writer produces.
Hypothesis. The Parquet writer that succeeded at 1K/5K/10K on the laptop would transfer to Azure Container Apps with comparable or better throughput, given the container allocation of 32 vCPU / 64 GiB in westus3.
Method. Deployed the streaming orchestrator with --pbp-sink-type localdisk to the westus3 Container Apps Job. Ran at 10K world scale. Monitored wall-time progression.
Observation. At 10K scale, the PBP write phase stalled without error at over 100 minutes (the previous laptop 10K PBP write completed in 726 seconds / 12 minutes). No exception was thrown; the container continued to accumulate chunk-flush calls in its log stream but made no further progress. The 2-hour replica deadline was reached without the run completing.
Conclusion / New knowledge. Hypothesis disproved. Azure Container Apps' ephemeral container storage is not fit-for-purpose for the Parquet writer's per-chunk fsync pattern at this volume. The failure mode is pathological slowdown rather than outright error, which made it initially difficult to diagnose. New knowledge: Container Apps ephemeral storage cannot be assumed equivalent to local SSD for high-fsync workloads; any Cloud deployment of a write-path that exceeded bandwidth a local test needs re-validation on the target filesystem before commitment. The Parquet-on-ephemeral path was abandoned for Cloud deployment, motivating experiment 3.8 (PBP direct to ClickHouse as an alternative path).
Evidence. Commit 8e549097 (AzureBlobSink stub — the eventually-correct path for a future production deployment; not exercised in this project). Run log docs/samples/experiment-runs/streaming-cloud-scale-100000-with-pbp-d32-ch-timed-out.txt (later-dated attempt showing the same class of stall; the specific 10K stall was not retained as a log but is documented in the status doc).
3.8 Play-by-play partitioning under ClickHouse Cloud insert constraints¶
Prior knowledge gap. Having abandoned the Parquet-on-ephemeral path (3.7), an alternative PBP path was needed: write PBP directly to a play_by_play ClickHouse table using the same bulk-copy code path as the OC writes. The original schema had PARTITION BY game_id, giving 288 partitions per season. It was not known whether ClickHouse Cloud (on the Production Scale 2×4 tier) would accept bulk inserts that span 288 partitions, or how that would interact with the vendor's documented max_partitions_per_insert_block setting.
Hypothesis. ClickHouse Cloud would accept the 288-partition bulk inserts either by default or by overriding max_partitions_per_insert_block via connection string setting (set_max_partitions_per_insert_block=500).
Method. Implemented ClickHousePlayByPlayBatchWriter that delegates to IStorageBackend.WritePlayByPlayAsync, converting PlayByPlayRecord → PlayRecord for the same async-insert bulk-copy path. Wired in via --pbp-sink-type clickhouse. Ran 1K scale against Cloud. When first attempt failed, tried the setting override. When override failed, changed the schema.
Observation.
- First attempt: Code: 252. DB::Exception: Too many partitions for single INSERT block (more than 100). Cloud has max_partitions_per_insert_block = 100 and is not respecting our per-query override attempts at default settings.
- Second attempt with set_max_partitions_per_insert_block=500 in connection string: Code: 452. DB::Exception: SETTING_CONSTRAINT_VIOLATION. Cloud explicitly prohibits raising this setting via client override on the Production Scale tier.
- Third attempt after DDL change to PARTITION BY season_id: succeeded at 1K (304s, 175K PBP rows/sec).
Conclusion / New knowledge. Hypothesis disproved twice. New knowledge: ClickHouse Cloud (Production Scale tier) enforces max_partitions_per_insert_block = 100 with a SETTING_CONSTRAINT_VIOLATION when client attempts to raise it — this constraint is not documented in the vendor's best-practice partitioning guide and only manifests as a runtime error. The practical implication is that any Cloud schema with > 100 partitions per insert block cannot work for a bulk-copy write path. Schema changed: play_by_play now PARTITION BY season_id (1 partition per season). Trade-off accepted: per-game DROP PARTITION becomes per-game ALTER TABLE ... DELETE WHERE game_id = X (a mutation, not a metadata operation). This finding feeds directly into experiment 3.13 (where the 1-partition choice revealed a different scale-boundary problem).
Evidence. Commits 7831e36f (PBP direct path), c815683d (schema change to PARTITION BY season_id). Run logs docs/samples/experiment-runs/streaming-cloud-scale-1000-with-pbp-d32-ch.txt.
3.9 Connection-level async_insert behaviour for many small bulk inserts¶
Prior knowledge gap. The OC write path produces many small bulk-copies (288 games × N chunks = thousands of inserts over a run), each carrying a few thousand rows. ClickHouse exposes an async_insert mechanism in which the server buffers inserts and batches them server-side before committing. The behaviour of async_insert=1; wait_for_async_insert=0 (fire-and-forget from the client perspective) on a write workload of this shape, and the resulting throughput on Cloud, was not documented by the vendor in terms of our specific mix of bulk-copy size and chunk count.
Hypothesis. Enabling async_insert=1; wait_for_async_insert=0 on the connection string would reduce per-insert round-trip overhead and deliver at least a 50% reduction in the OC write phase at 1K scale on the Cloud D32 container.
Method. Added set_async_insert=1; set_wait_for_async_insert=0 to the connection string in ClickHouseBackend. Measured OC-only write phase at 1K, 10K, and 100K on the Production Scale 2×4 tier with 16-CPU and 32-CPU D32 workload profiles.
Observation. - D32 16-CPU, no async: 141s wall / 98s OC write at 1K. - D32 16-CPU, with async: 60s wall / 16s OC write at 1K (OC write −84%). - Scaled to 100K OC-only: OC write took 233s out of 2,991s wall time (7.8% of total) — server-side async buffering keeps the write phase negligible.
Conclusion / New knowledge. Hypothesis confirmed and exceeded. async_insert with fire-and-forget semantics is the dominant write-path lever on Cloud for our workload shape (many small bulk-copies). New knowledge: at Cloud-scale with a many-chunk write pattern, the client-side throughput ceiling is dominated by per-round-trip synchronous server acknowledgement, not by insert processing; fire-and-forget semantics trade a small window of in-flight-data-loss risk (in practice bounded to the CH server's async-insert buffer window, typically a few hundred ms) for a multi-fold throughput improvement. Trade-off acceptable for experimental workloads where the run-level success check is "did all 244,800 rows arrive" rather than "did each individual insert complete before the next was issued".
Evidence. Commits 20c28696 (AU→Cloud benchmark logs + async_insert), test flags throughout. Run logs docs/samples/experiment-runs/streaming-cloud-scale-1000-no-pbp-d32-async.txt, streaming-cloud-scale-10000-no-pbp-d32-async.txt, streaming-cloud-scale-100000-no-pbp-d32-async.txt.
3.10 Server-side memory ceiling under high merge parallelism¶
Prior knowledge gap. The final-merge phase (288 arrayFlatten queries, one per game) had been parallelised at N=4 on the laptop (experiment 3.5) with clean gains. At 100K scale on the Cloud Production Scale 2×4 tier, raising this parallelism cap to N=32 (to match the client-side --parallel-workers setting) was expected to deliver further improvement.
Hypothesis. N=32 parallel arrayFlatten merges would scale merge throughput on the 2×4 Cloud tier to fit the merge phase inside approximately 90 seconds at 100K worlds.
Method. Ran 100K OC-only with merge-parallelism capped at N=32. Observed result.
Observation. Run failed at approximately 35 minutes into the merge phase with Code: 241. DB::Exception: total memory limit exceeded (would use 14.41 GiB, maximum 14.40 GiB). Post-mortem from query-log: 32 parallel arrayFlatten queries each held approximately 680 MB working set, totalling approximately 22 GB server-side — exceeding the Production Scale tier's per-query memory budget of approximately 14.4 GB.
Conclusion / New knowledge. Hypothesis disproved. New knowledge: the merge phase is server-side memory-bound, not client-side throughput-bound, on the Production Scale tier. Each parallel arrayFlatten query has a ~680 MB resident working set at the 100K array length; tier memory budget dictates the maximum safe concurrent merge count. Fix: cap merge parallelism at N=4 on the Production Scale 2×4 tier (~2.7 GB). Later, after tier upgrade (experiment 3.14), cap raised to N=16 on Production 3×16 (~10.8 GB working set against 64 GiB per-replica RAM). This experiment established the pattern that merge-parallelism must be sized to per-query working set, not to client worker count — a load-bearing operational finding.
Evidence. Commit e7cc5618 (the cap-at-4 fix). Run log diagnostic detail in the status.md "Incidents worth noting" section and in docs/samples/experiment-runs/streaming-cloud-scale-100000-no-pbp-d32-async.txt.
3.11 Staging-table OPTIMIZE FINAL before per-game merge¶
Prior knowledge gap. The OC async-insert path creates many small MergeTree parts in the staging table (experiment 3.9). The 288 arrayFlatten merge queries then each scan across all those parts. It was not known whether running OPTIMIZE TABLE game_outcome_context_staging FINAL before the merge loop — consolidating small parts into fewer larger parts — would reduce per-query scan work enough to meaningfully shorten the merge phase.
Hypothesis. OPTIMIZE TABLE FINAL on the staging table before the per-game merge loop would reduce the 100K merge phase from 20 minutes to approximately 12-15 minutes by reducing per-query scan overhead.
Method. Added a call to OPTIMIZE TABLE game_outcome_context_staging FINAL before the per-game merge loop in the orchestrator. Ran at 100K scale on the Cloud Production Scale 2×4 tier.
Observation. OPTIMIZE TIMED OUT at 100K scale with Code: 236. DB::Exception: Aborting discovery process: Timeout, interrupting no-future-parts discovery. The default server-side OPTIMIZE timeout (approximately 2 minutes) was not enough time to consolidate 28,800 staging rows across the many parts.
Conclusion / New knowledge. Hypothesis disproved on operational grounds. New knowledge: the default OPTIMIZE TABLE timeout on the Production Scale Cloud tier is approximately 2 minutes, which is insufficient to consolidate a 100K-scale staging table. Raising the timeout would risk OPTIMIZE itself taking longer than the savings it produced, making the optimisation a net loss even in the success case. Reverted. Feeds forward to experiment 3.12 (the hash-sharding alternative, which was the next hypothesis-test in the same merge-phase optimisation workstream).
Evidence. Commits 6c9c9614 (apply), d3c66d42 (docs of H and C merge experiments — both tried and reverted). The revert is bundled into the subsequent sharded-merge revert.
3.12 Hash-sharded per-game merge¶
Prior knowledge gap. Having established in 3.10 that merge parallelism is server-memory-bound, and in 3.11 that OPTIMIZE doesn't deliver within the operational window, a further hypothesis was tested: if each of the 288 per-game merge queries could be split into 4 sub-range queries filtered by cityHash64(outcome_id) % 4 = shardIndex, then with per-query working set dropping to ~170 MB each, the parallelism cap could safely rise to N=16 — delivering 4× merge throughput.
Hypothesis. Hash-sharding each per-game merge into 4 outcome-hash sub-ranges would reduce per-query memory to ~170 MB, allow merge parallelism of N=16 on the Production Scale 2×4 tier, and deliver approximately 3× improvement in merge phase wall time at 100K scale.
Method. Implemented hash-sharded merge in ClickHouseBackend.MergeStagingToFinalAsync. Ran at 100K OC-only on Production Scale 2×4.
Observation. - 244,800 rows still correct after merge (correctness preserved). - Total wall time regressed +13% (2,991s → 3,377s). - Merge phase regressed +10% (1,210s → 1,328s).
Conclusion / New knowledge. Hypothesis disproved. Root cause identified by query-log analysis: each sharded query must still scan the full staging rows for its game because the game_id filter is the only index-eligible predicate; the outcome_id hash filter runs after the row read, not as an index lookup. Sharding by 4 therefore does approximately 4× the total scan I/O for the same output.
New knowledge obtained: reducing per-query memory at the cost of redundant scans is a net loss even when memory headroom is real. A genuine sharded-merge approach would require partitioning the staging table itself by outcome_id hash (so that the hash filter becomes a partition predicate) — a cost-heavy schema change not justified at the 100K scale where the current 49.85 min merge on the smaller tier, or the 2.3 min merge on the upgraded tier, is already acceptable.
This experiment plus 3.11 together closed out the merge-phase optimisation workstream with a clear recorded finding: the lever for faster merge at 100K is a bigger Cloud tier, not smarter SQL. This finding directly justified the decision in experiment 3.14 to upgrade the Cloud tier rather than continue pursuing software-level merge optimisations.
Evidence. Commits 19888ebd (apply), 2049f9a4 (revert), d3c66d42 (docs). Run log docs/samples/experiment-runs/streaming-cloud-scale-100000-no-pbp-d32-sharded-merge-reverted.txt.
3.13 PBP super-linearity on single-partition 100K load¶
Prior knowledge gap. Experiment 3.8 had established that play_by_play must be PARTITION BY season_id (a single partition per season) on the Cloud Production Scale 2×4 tier. At 10K worlds this partition holds approximately 420M rows. At 100K it holds approximately 4.2B rows. It was not known whether foreground INSERT throughput on a single 4.2B-row MergeTree partition would behave linearly or super-linearly compared to the 420M-row case, and the vendor documentation does not address single-partition scaling at this row count.
Hypothesis. PBP write throughput at 100K would project approximately linearly from the 10K result (1,120s total, 818s PBP write at 513K rows/sec) to approximately 3 hours total — inside the 4-hour replica deadline on the Cloud Container Apps Job.
Method. Ran the full 100K full-pipeline (OC + PBP, no season) on the Cloud Production Scale 2×4 tier with 32 vCPU / 64 GiB container, 4-way parallel PBP writes, chunk=500 (200 chunks). Monitored wall-time progression to estimate PBP throughput.
Observation. Run hit the 4-hour replica deadline with the PBP write phase still running. The projected 3-hour completion time did not hold; actual PBP throughput at 100K was significantly below the 513K rows/sec seen at 10K — super-linear at at least 1.75×.
Conclusion / New knowledge. Hypothesis disproved. New knowledge: at 4.2B rows in a single MergeTree partition on the Production Scale 2×4 tier, background merge pressure throttles foreground INSERT throughput in a way it does not at 420M rows. The throttling is not documented in the vendor best-practice material for single-partition scaling. Specifically, the resource budget of the smaller tier is fully consumed by background merges once the partition grows past an approximate (but not precisely measured) row-count threshold between 420M and 4.2B, leaving insufficient headroom for continued foreground ingest.
Three candidate fixes identified:
1. Upgrade Cloud tier — more aggregate CPU and RAM allows both foreground INSERT and background merge to proceed in parallel. Tested in experiment 3.14.
2. Repartition PBP into (season_id, intDiv(world_id, 10000)) — spreads across ~10 partitions, each getting background merges in parallel. Tested in experiment 3.16 (empirically disproved on client-side grounds).
3. Split PBP off critical path — run OC-only as primary, defer PBP to an async pipeline. Not tested; left as a product-level option.
Evidence. Commit e7801606 (parallel PBP + finish cleanup — the context in which the attempt happened). Run log docs/samples/experiment-runs/streaming-cloud-scale-100000-with-pbp-d32-ch-timed-out.txt.
3.14 Cloud tier upgrade as a capacity lever¶
Prior knowledge gap. Experiments 3.10, 3.11, 3.12 and 3.13 had all concluded, from different angles, that the binding ceiling on the 2×4 Production Scale tier was server-side resources (memory for merges, aggregate throughput for writes), not software design. It was not known how much lift the Production 3×16 tier (3 replicas × 16 vCPU × 64 GiB = 48 vCPU / 192 GiB aggregate — 6× CPU and 24× RAM vs the Production Scale 2×4) would deliver, or whether it would unblock the 100K full-pipeline path within the 4-hour replica budget. The vendor's own tier-comparison guidance does not specifically address this workload shape.
Hypothesis. The Production 3×16 tier would (a) eliminate the server-side merge memory ceiling identified in 3.10, (b) restore near-linear PBP throughput by providing enough aggregate resource for concurrent foreground ingest and background merge, and (c) let a 100K full-pipeline run complete within the 4-hour replica budget.
Method. Resized the Cloud cluster from Production Scale 2×4 to Production 3×16. Raised the orchestrator's merge-parallelism cap from 4 to 16 (commit ba0bb337) to reflect the tier's higher per-replica memory. Re-ran 100K OC-only for baseline, then 100K full-pipeline (OC + PBP, no season). Measured wall time by phase.
Observation. - 100K OC-only on 3×16: 1,887s (31.4 min) — versus 2,991s (49.85 min) on 2×4. Reduction: −37%. - Merge phase specifically: 137s (2.3 min) on 3×16 versus 1,210s (20 min) on 2×4. Reduction: −89%. - 100K full-pipeline (OC + PBP, no season) on 3×16: 9,887s (2h 44m) — first successful 100K full-pipeline run. - PBP write throughput at 100K on 3×16: 528K rows/sec (vs 513K at 10K on the smaller tier — effectively flat across the 10× scale jump, confirming that the super-linearity observed in 3.13 was a resource-starvation artifact of the smaller tier, not an algorithmic property of the write path).
Conclusion / New knowledge. All three hypotheses confirmed. New knowledge: the PBP super-linearity observed in 3.13 was entirely a resource-starvation artifact of the Production Scale 2×4 tier and disappears on the Production 3×16 tier with no software change. The 20-minute merge phase that three separate merge-optimisation experiments (3.10, 3.11, 3.12) had been attempting to chase down was also eliminated by the tier upgrade. This experiment closed out the "can ClickHouse Cloud host the 100K full-pipeline path at all?" question with a positive result at a known-economic cost.
Cost delta: Production 3×16 is approximately $0.83/minute when active (vs $0.29/minute for Production Scale 2×4), but the tier idle-pauses after 15 minutes of inactivity, so the per-run cost is bounded by per-run compute time. 100K full-pipeline on 3×16: approximately $25 per run. 100K OC-only on 3×16: approximately $5 per run. Both are inside the project target of <$30/run.
Evidence. Commits 7b1c98c0 (docs), ba0bb337 (merge parallelism cap raised to 16). Run logs docs/samples/experiment-runs/streaming-cloud-scale-100000-oconly-prod3x16.txt, streaming-cloud-scale-100000-fullpipeline-prod3x16.txt.
3.15 Season-accumulator memory layout at 100K¶
Prior knowledge gap. The season outcome context table (season_outcome_context) had an accumulator (AmericanFootballSeasonAccumulator) that had been validated correct at 1K scale with a data layout of Dictionary<int worldIndex, Dictionary<string outcomeId, double>>. At 100K scale this layout was projected to approximately 270 GB of process RAM based on nested-dict overhead and string duplication per-world, but the exact scaling had not been measured and the projection was not known to be correct.
Hypothesis. The nested-dictionary layout would scale to 100K within the 64 GiB container RAM budget provided by the D32-benchmark workload profile, by virtue of the fact that most outcomes only increment on a small fraction of worlds (sparse emissions).
Method. Wired the season accumulator into the streaming orchestrator (commit 363a7726) and ran 100K full-pipeline + season on the Production 3×16 tier. Monitored peak container RAM.
Observation. Run OOM'd at approximately 3 hours into a run that was otherwise on track for a 2h 58m total. Peak container RAM before OOM was approaching 64 GiB. Post-mortem memory sampling established that emissions were not as sparse as hypothesised — the nested-dict overhead (approximately 72 bytes per Dictionary entry plus string-key duplication across worlds) dominated, giving a linear-in-worlds scaling factor that matched the 270 GB projection.
Conclusion / New knowledge (part 1). Hypothesis disproved. The original layout cannot scale to 100K within any reasonable container RAM budget. Refactor required.
Successor hypothesis. Re-layout the accumulator as Dictionary<string outcomeId, double[worldCount]> — one pre-sized array per outcome, indexed by absolute world index. This mirrors the layout the game accumulator had used all along. Projected 100K memory: 1,668 outcomes × 100,000 doubles × 8 bytes ≈ 1.3 GB flat, regardless of how many worlds are simulated.
Method (part 2). Refactored AmericanFootballSeasonAccumulator (commit 41ab4c90). Preserved the parameterless constructor for test compatibility; added a (int initialCapacity) overload so the orchestrator can pass TotalWorldCount and avoid grow-and-copy during accumulation. Verified all 6 existing AmericanFootballSeasonAccumulatorTests pass unchanged (same observable behaviour, internal layout change only). Ran 100K full-pipeline + season on Production 3×16.
Observation (part 2). Run completed in 10,696s (2h 58m) with peak container RAM 46.6 GB (vs 46.8 GB on the prior no-season run — essentially flat; the season accumulator contributed ~1.3 GB as projected, not 270 GB). Season table written with 1,728 rows — correct. Both OC (244,800 rows) and PBP (4,198,981,505 rows) row counts match prior validated no-season baseline exactly.
Conclusion / New knowledge (part 2). Successor hypothesis confirmed. New knowledge: the memory behaviour of .NET nested-dictionary layouts at large scale is dominated by per-entry overhead and string duplication to an extent that is not obvious from first principles; for dense per-world accumulator shapes, the fixed-array-per-outcome layout is the correct default even when correctness doesn't require it. This is the primary result: 100K full-pipeline + season context written correctly in 2h 58m for approximately $27 per run.
Note on remaining sim-phase cost: the shared season accumulator is lock-guarded (its inner arrays are written by multiple simulation workers). At 100K this is approximately 28.8M locked-write calls, and the ~14 min extra sim time vs the no-season run (2h 58m vs 2h 44m) is attributable to lock contention. A per-worker season accumulator + merge step would eliminate this, following the same pattern as the game accumulator. Not pursued in this project because +14 min of wall time is well within budget; recorded as a known follow-up.
Evidence. Commits 363a7726 (wire season into orchestrator — OOM version), 41ab4c90 (refactor to fixed-array layout), 2d36db1b (docs of successful run), 5c0645d3 (status doc fold-in). Run logs docs/samples/experiment-runs/streaming-cloud-scale-100000-fullpipeline-season-oom.txt (the OOM), streaming-cloud-scale-100000-fullpipeline-season-prod3x16.txt (the success — primary 2h 58m result).
3.16 Spec-driven schema review applied to validated baseline¶
Prior knowledge gap. Having achieved the 2h 58m primary result (experiment 3.15), a review of the ClickHouse schemas against current best-practice documentation was undertaken to establish whether further improvements could be obtained by applying the vendor's own documented recommendations that had not yet been applied. The documented recommendations were: (a) partition PBP by (season_id, intDiv(world_id, 10000)) for background-merge parallelism across multiple partitions, (b) add CODEC(Delta, ZSTD) to monotonic integer columns, (c) add consistency-level PARTITION BY to season_outcome_context and experiment_results.
Hypothesis. Applying the four documented best-practice changes on top of the validated 2h 58m baseline would deliver 20-40% improvement in the PBP write phase (per vendor guidance) while preserving correctness. Improvement would come from parallelisable background merges on the split PBP partition.
Method. Applied all four changes in a single commit (34155f2d). Ran the full 100K full-pipeline + season on Production 3×16. Monitored progress.
Observation. Run catastrophically regressed. Progress after 4 hours: approximately 2,000 of 100,000 worlds processed (2% complete). Run terminated at the 4-hour replica deadline. Comparison vs the 2h 58m validated baseline: approximately 68× slowdown in effective per-world throughput.
Diagnostic investigation (via Cloud system.query_log over the 5-hour run window):
- Insert queries: 387 (baseline for a complete run: approximately 30,000).
- Select queries: 8,737 (ratio 22:1 select-to-insert, vs the approximately 2:1 ratio on the baseline).
- Individual insert latency when they did run: healthy (21s for 5.2M rows = 250K rows/sec, consistent with historical rates).
- No exceptions, no MEMORY_LIMIT errors, no server-side failures.
Root-cause identified from the 22:1 select-to-insert ratio: the ClickHouseBulkCopy client library fires a SELECT ... WHERE ?=? column-metadata query before every insert (InitAsync). With the computed partition expression intDiv(world_id, 10000) now present in the play_by_play DDL, each of those metadata queries became materially slower — likely because the driver's column-introspection path re-parses and evaluates the partition expression per call rather than caching it. Client-side time was stuck between inserts; the server was idle most of the run.
Conclusion / New knowledge. Hypothesis disproved. Major new knowledge: ClickHouse vendor best-practice documentation describes table-level design in isolation and does not account for client-library behaviour (e.g., BulkCopy's per-insert metadata fetch) or specific orchestration patterns. Schema changes that look like clear improvements from a vendor-spec perspective can interact catastrophically with client-library internals at scale. The schema changes were reverted in commit 150ec156. The prior 2h 58m baseline with the "wrong-by-the-book" PARTITION BY season_id remained the validated primary result.
Specific operational knowledge obtained: never commit schema changes as improvements without a full-scale validation run first, even when the changes are derived directly from vendor best-practice documentation. This is a load-bearing engineering-practice finding for any future storage-layer work on this system. Two plausible future paths identified but not pursued (recorded for future work): cache the BulkCopy InitAsync column-metadata result once per run instead of per-insert, or use a different insert path (raw POST of RowBinary without the per-call metadata fetch).
Evidence. Commits 73b6d6e7 (schema review doc), 34155f2d (applied changes), 150ec156 (revert), 4ba604e5 (empirical update to the schema review doc with full diagnostic). Run log docs/samples/experiment-runs/streaming-cloud-scale-100000-fullpipeline-schema-review-regressed.txt. Schema review doc docs/clickhouse-schema-review.md (empirical-result column retained in-situ as a permanent record).
3.17 Redis as candidate primary store¶
Prior knowledge gap. Before fully committing to ClickHouse as the primary store, a comparable candidate (Redis) needed evaluation. The question was whether Redis could plausibly out-perform ClickHouse on either the read or write side at the target scale, and whether it could do so within the economic envelope of the project.
Hypothesis. Redis might deliver superior write throughput due to zero server-side compression cost and in-memory write semantics, potentially making it preferable to ClickHouse at the 100K write-phase target.
Method. Desk evaluation (no empirical Redis benchmark run). Compared Redis and ClickHouse on five axes using the validated 100K run numbers from ClickHouse as the reference baseline: 1. Read-side comparison (point-read, basket read, full-context read). 2. Write-side comparison, with particular attention to the data-shape-specific transpose cost (simulation produces world-by-world, storage consumes outcome-by-outcome). 3. Storage cost (on-disk compressed in CH vs in-RAM uncompressed in Redis). 4. Analytical query capability. 5. Architectural-constant costs (the matrix transpose is a fundamental cost regardless of backend).
Observation.
- Point-reads: Redis faster (~1 ms vs ~10 ms on CH) but both well inside the 1-second spec target, so latency isn't the decision driver.
- Write-side: Redis's no-compression cost on the wire is approximately 7× larger transfer than ClickHouse's Delta+ZSTD compression. Zero server-side processing cost approximately cancels out, making pure write throughput approximately equivalent, not better.
- Transpose: the architectural matrix-transpose cost (world-by-world sim → outcome-by-outcome storage) is an unavoidable constant regardless of backend. ClickHouse absorbs it on the server via arrayFlatten (approximately 137 s on the validated baseline); Redis would need equivalent logic client-side or via Lua scripting.
- Storage cost: 28 GB compressed on disk (CH) vs 196 GB in RAM (Redis). At AWS ElastiCache r6g.16xlarge rates, Redis carries a standing cost of approximately $3,500-5,000/month vs ClickHouse's idle-paused approximately $4-5 per run — a 50-100× cost gap.
- Analytical queries: supported in ClickHouse (full SQL), not supported in Redis (pure KV store).
Conclusion / New knowledge. Hypothesis disproved on reasoning, without empirical Redis run. Redis is not competitive as the primary store for this workload: it is not meaningfully faster on either read or write, has approximately 50-100× higher storage cost at our data volumes, and cannot serve analytical queries. Redis remains a plausible future fit as a hot-path cache in front of ClickHouse once real production read traffic patterns are known, but not as a replacement.
Evidence. Document docs/outcome-context/evaluations/redis-as-store.md. Commit c3f0e6e2.
4. New knowledge obtained¶
The following items of new knowledge were obtained through the systematic progression of work in section 3. Each is directly traceable to the experiment(s) from which it was derived. These are the generalisable findings that constitute the R&D output of the project.
-
Dense
Array(Float64) CODEC(Delta, ZSTD)in ClickHouse compresses at 5.56× → 7.32× across scales 10 → 10,000 worlds, with ratio improving with array length (metadata amortisation). Source: 3.1. -
Basket-read latency (1-30 outcomes / game) is query-planner-bound, not payload-bound, at 1K scale, projecting cleanly to approximately 30-40 ms at 100K worlds. Source: 3.2.
-
Full-context read latency sits in a mixed regime between query-planner-bound and transfer-bound at 1K worlds, giving an honest 3-15 second projection range at 100K rather than a point estimate. Source: 3.3.
-
A shared-consumer
Channel<T>accumulator pattern cannot be materially parallelised because the consumer is the binding constraint regardless of producer count. Successor pattern (per-worker shared-nothing accumulators) delivers -40% at 1K and -55% at 5K/10K, with the factor growing at scale. Sources: 3.4, 3.5. -
ClickHouse Atomic database DROP TABLE is asynchronous by default; operational code that queries
system.parts active=1within a single container lifespan must useDROP TABLE ... SYNCto avoid stale-part contamination. Source: 3.6. -
Azure Container Apps ephemeral filesystem is not fit-for-purpose for per-chunk fsync workloads at bandwidth levels characteristic of Parquet chunk writes; the failure mode is pathological slowdown, not outright error. Source: 3.7.
-
ClickHouse Cloud Production Scale tier enforces
max_partitions_per_insert_block = 100withSETTING_CONSTRAINT_VIOLATIONon client overrides; this constraint is not documented in the vendor's best-practice partitioning guide. Source: 3.8. -
async_insert=1; wait_for_async_insert=0on the connection string reduces OC write phase at 1K scale by 84% on Cloud via server-side buffering, trading a small window of in-flight-data-loss risk for multi-fold throughput improvement. Source: 3.9. -
Merge-phase parallelism is server-side memory-bound, not client-side throughput-bound. Safe parallelism at 100K scale is dictated by
(per-replica-RAM / per-query-working-set), where per-query working set is approximately 680 MB at 100K arrays. Source: 3.10. -
OPTIMIZE TABLE FINAL has an approximately 2-minute default timeout on the Production Scale Cloud tier that is insufficient to consolidate a 100K-scale staging table. Source: 3.11.
-
Reducing per-query memory via hash-sharding at the cost of redundant scans is a net loss even when memory headroom is real; the hash filter runs after the row read, not as a partition predicate. Source: 3.12.
-
PBP write super-linearity on a single-partition 4.2B-row load, as observed on the Production Scale 2×4 tier, is a resource-starvation artifact and disappears on the Production 3×16 tier with no software change. Background merge pressure on a large partition throttles foreground INSERT on smaller tiers but not when the tier has sufficient aggregate resource. Sources: 3.13, 3.14.
-
Tier upgrade, rather than software optimisation, is the correct lever for merge-phase wall time at 100K scale. Three separate merge-phase software-optimisation experiments (3.10, 3.11, 3.12) were unable to deliver what the tier upgrade delivered trivially. Source: 3.14.
-
.NET nested-dictionary layouts exhibit linear-in-cardinality memory overhead that dominates at large scale. For dense per-world accumulator shapes, the fixed-array-per-outcome layout is the correct default regardless of correctness requirements — the memory difference at 100K was approximately 270 GB vs approximately 1.3 GB. Source: 3.15.
-
ClickHouse vendor best-practice documentation describes table-level design in isolation and does not account for client-library behaviour. A schema change that looks correct from vendor-spec perspective (world-bucket partitioning + monotonic-integer codecs) can interact catastrophically with client library internals (
BulkCopy.InitAsyncmetadata-fetch) at scale, producing a 68× regression. Schema changes, regardless of source, must be validated empirically at full scale before acceptance. Source: 3.16. -
Redis is not a competitive primary store for dense Array(Float64) Monte Carlo outputs at this scale: not meaningfully faster, approximately 50-100× higher storage cost, no analytical query support. Remains plausible as a hot-path cache after production read patterns are known. Source: 3.17.
5. Unresolved uncertainties¶
The following technical uncertainties were not fully resolved within this project's scope. Each is recorded for future work.
-
100K full-context read latency. Measured 200 ms at 1,000 worlds; honest projection at 100K is 3-15 seconds depending on which regime dominates. A 100K full-context read measurement has not been executed; the 5-second spec target cannot be asserted as met. (Source: 3.3.)
-
Sub-60-second 100K feasibility path. A separate analytical document (
docs/outcome-context/evaluations/storage-experiment/problem-statement.md) sketches what would be needed (approximately 500 cores distributed sim + proportionally larger Cloud tier). Not empirically tested. -
PBP repartition feasibility with a client-side
InitAsyncworkaround. The 68× regression in 3.16 was caused by client library interaction, not by the schema itself. A cached-metadata or raw-RowBinary-POST path could allow the(season_id, intDiv(world_id, 10000))partition scheme to deliver its theoretical benefit. Not implemented. -
Should-have experiment suite at 100K scale on the upgraded tier. REQ-OC-R1/R3/R4/R5 (concurrent reads, versioning, codec comparison) have been implemented and integration-tested but have not been run at 100K on the Production 3×16 tier. Re-running them would establish read-side performance at the validated write-side scale.
-
Codec comparison for
Array(Float64) valuescolumn.Delta+ZSTDdelivers 7.3× at 10K worlds; alternative codecs (Gorilla,FPC, plainZSTD) have not been measured side-by-side.
6. Evidence index¶
All evidence cited below is either committed source code, committed documentation, or committed run log text files held in this repository under the LBS-1183 branch.
6.1 Experiment run logs¶
All experiment run logs are held under docs/samples/experiment-runs/. Each log is the literal console output of one experimental execution, committed verbatim.
| Log file | Experiment(s) | Scale / environment |
|---|---|---|
must-have-scale-10.txt |
3.1 | 10 worlds, localhost Docker |
must-have-scale-100.txt |
3.1 | 100 worlds, localhost Docker |
must-have-scale-1000.txt |
3.1 | 1,000 worlds, localhost Docker |
must-have-scale-10-compressed.txt, must-have-scale-100-compressed.txt |
HTTP compression sub-experiment | 10/100 worlds, localhost |
basket-all-scales.txt |
3.2 | 10/100/1,000 worlds, basket |
streaming-all-scales.txt, streaming-scale-10000.txt, streaming-scale-5000.txt |
3.5, 3.7 (laptop baselines) | 5K/10K worlds, localhost |
streaming-scale-10000-parallel.txt |
3.4 | 10K, localhost, --parallel-worlds 8 |
streaming-scale-5000-no-pbp.txt |
3.5 baseline | 5K OC-only, localhost |
streaming-scale-1000-profiling.txt, streaming-scale-1000-optimized.txt |
Optimisation sub-pass | 1K localhost |
streaming-scale-1000-no-pbp-p{1,2,4,8}.txt |
3.5 N sweep (sim parallel only) | 1K OC-only localhost |
streaming-scale-1000-no-pbp-p{1,4,8}-pwrite.txt |
3.5 N sweep (sim + write parallel) | 1K OC-only localhost |
streaming-scale-1000-no-pbp-p{4,8}-pmerge.txt |
3.5 N sweep (sim + write + merge parallel) | 1K OC-only localhost |
streaming-scale-1000-with-pbp-p{1,4}.txt |
3.5 (with PBP) | 1K full pipeline localhost |
streaming-scale-5000-no-pbp-p4.txt, streaming-scale-10000-no-pbp-p{1,4}.txt |
3.5 at 5K/10K | Localhost |
streaming-cloud-scale-1000-no-pbp-p{8,16,32}.txt |
Cloud 1K sweep | Cloud Production Scale |
streaming-cloud-scale-100000-no-pbp-d32-async.txt |
3.9, 3.10, 3.11 | 100K OC-only Cloud 2×4 |
streaming-cloud-scale-1000-no-pbp-d32-async.txt, streaming-cloud-scale-10000-no-pbp-d32-async.txt |
3.9 | 1K/10K OC-only Cloud 2×4 |
streaming-cloud-scale-100000-no-pbp-d32-sharded-merge-reverted.txt |
3.12 | 100K sharded-merge Cloud 2×4 |
streaming-cloud-scale-1000-with-pbp-d32-ch.txt, streaming-cloud-scale-10000-with-pbp-d32-ch-parallel.txt |
3.8 | 1K/10K full-pipeline Cloud 2×4 |
streaming-cloud-scale-100000-with-pbp-d32-ch-timed-out.txt |
3.13 | 100K full-pipeline Cloud 2×4 (timed out) |
streaming-cloud-scale-100000-oconly-prod3x16.txt |
3.14 | 100K OC-only Cloud 3×16 — primary |
streaming-cloud-scale-100000-fullpipeline-prod3x16.txt |
3.14 | 100K full pipeline (OC+PBP) Cloud 3×16 |
streaming-cloud-scale-10000-fullpipeline-season-prod3x16.txt |
3.15 checkpoint | 10K full+season Cloud 3×16 |
streaming-cloud-scale-100000-fullpipeline-season-oom.txt |
3.15 (pre-refactor OOM) | 100K full+season Cloud 3×16 |
streaming-cloud-scale-100000-fullpipeline-season-prod3x16.txt |
3.15 PRIMARY RESULT | 100K full+season Cloud 3×16 |
streaming-cloud-scale-100000-fullpipeline-schema-review-regressed.txt |
3.16 | 100K with schema review changes Cloud 3×16 |
6.2 Source-code commits referenced¶
| Commit | Date | Description |
|---|---|---|
4110233b |
2026-04-16 | Storage experiment design spec and implementation plan |
426b2b67 |
2026-04-16 | Create Accumulation, StorageExperiment, StorageExperiment.Tests projects |
7bd32122 |
2026-04-17 | ClickHouseBackend + integration tests |
40b1cb85 |
2026-04-17 | ResultsWriter for ClickHouse Cloud experiment results |
122a43e8 |
2026-04-17 | OC experiments (REQ-OC-W1, R1, R3, S1) |
af8748ce |
2026-04-18 | Basket-size selective read experiment — evidence for 3.2 |
3a26f8cd |
2026-04-18 | Streaming-mode design spec |
1829d509 |
2026-04-18 | StreamingOrchestrator with smoke test |
ad4e5523 |
2026-04-18 | StreamingRunConfig + StreamingRunReport |
0e528114 |
2026-04-18 | 3.4 — parallel-simulation via Channel |
4794ba36 |
2026-04-18 | 3.4 revert |
39b24d4c |
2026-04-18 | 3.5 — per-worker accumulator + PBP list |
e0af3495 |
2026-04-18 | 3.5 — parallel OC write + final merge |
0af492e1 |
2026-04-18 | 3.6 — DROP TABLE SYNC + validated parallelism at 5K/10K |
9160ed28 |
2026-04-19 | Container Apps Job CI workflow |
20c28696 |
2026-04-19 | 3.9 — Slim Dockerfile + async_insert connection string |
e7cc5618 |
2026-04-19 | 3.10 — cap merge parallelism at 4 |
60611cbc |
2026-04-19 | Flush stdout before exit (Container Apps logging fix) |
6c9c9614 |
2026-04-19 | 3.11 — OPTIMIZE staging FINAL (later reverted) |
19888ebd |
2026-04-19 | 3.12 — hash-sharded merge (later reverted) |
2049f9a4 |
2026-04-19 | 3.12 revert |
d3c66d42 |
2026-04-19 | 3.11 and 3.12 documented as tried-and-reverted |
7831e36f |
2026-04-20 | 3.8 — PBP direct to ClickHouse |
c815683d |
2026-04-20 | 3.8 — play_by_play PARTITION BY season_id |
e7801606 |
2026-04-20 | 3.13 — parallel PBP writes |
ba0bb337 |
2026-04-20 | 3.14 — merge parallelism cap 4 → 16 |
363a7726 |
2026-04-20 | 3.15 — orchestrator writes season_outcome_context (OOM version) |
41ab4c90 |
2026-04-20 | 3.15 — refactor season accumulator to per-outcome fixed arrays |
2d36db1b |
2026-04-20 | 3.15 — 100K full+season success docs |
5c0645d3 |
2026-04-20 | 3.15 — long-form status doc fold-in |
c3f0e6e2 |
2026-04-21 | 3.17 — Redis vs ClickHouse deferred evaluation |
73b6d6e7 |
2026-04-21 | 3.16 — ClickHouse schema review document |
34155f2d |
2026-04-21 | 3.16 — schema review changes applied |
150ec156 |
2026-04-21 | 3.16 — schema review changes reverted |
4ba604e5 |
2026-04-21 | 3.16 — schema review doc annotated with empirical result |
6.3 Supporting documentation¶
| Document | Role |
|---|---|
docs/outcome-context/evaluations/storage-experiment/status.md |
Long-form status document (chronological / topical, complement to this doc) |
docs/outcome-context/evaluations/storage-experiment/status-summary.md |
Executive summary of performance results |
docs/outcome-context/evaluations/storage-experiment/story.md |
Non-technical narrative |
docs/outcome-context/evaluations/storage-experiment/problem-statement.md |
Forward-looking analysis: sub-60-second 100K feasibility |
docs/clickhouse-schema-review.md |
Schema review with empirical-result annotation (evidence for 3.16) |
docs/outcome-context/evaluations/redis-as-store.md |
Redis evaluation reasoning (evidence for 3.17) |
docs/superpowers/specs/2026-04-16-storage-experiment-requirements.md |
Original requirements document |
docs/superpowers/specs/2026-04-16-storage-experiment-design.md |
Original design specification |
docs/superpowers/specs/2026-04-18-streaming-mode-design.md |
Streaming-orchestrator design |
docs/superpowers/plans/2026-04-16-storage-experiment.md |
Original implementation plan |
docs/superpowers/plans/2026-04-18-streaming-mode.md |
Streaming-mode implementation plan |
7. Timeline¶
The R&D activities described in this document occurred entirely within the Australian income year FY2025-26 (1 July 2025 - 30 June 2026), specifically within the window 2026-04-16 through 2026-04-21.
| Date | Activity |
|---|---|
| 2026-04-16 | Design spec + implementation plan committed. Accumulator + storage project scaffolding. |
| 2026-04-17 | ClickHouseBackend + integration tests + must-have experiment suite implemented. First localhost benchmark runs (3.1). Season accumulator implemented. |
| 2026-04-18 | Streaming orchestrator designed and implemented (with 1K/5K/10K localhost validation). Parallel-simulation experiment (3.4) tried and reverted. Shared-nothing parallel approach (3.5) implemented and validated at 1K/5K/10K. Basket-read experiment (3.2) executed. DROP TABLE SYNC fix (3.6). |
| 2026-04-19 | Cloud infrastructure stood up in westus3. CI pipeline for container image. AU→Cloud baseline runs. D32 workload profile validated. async_insert lever (3.9) validated. Merge-memory ceiling (3.10) diagnosed. OPTIMIZE FINAL (3.11) and sharded-merge (3.12) tried and reverted. 100K OC-only baseline on Production Scale 2×4 established. |
| 2026-04-20 | PBP direct-to-ClickHouse path implemented (3.8). PBP partitioning schema change. 100K full-pipeline on Production Scale 2×4 hit 4h ceiling (3.13). Cloud tier upgraded to Production 3×16 (3.14). 100K full-pipeline validated. Season accumulator wired in, hit OOM, refactored to per-outcome arrays (3.15). Primary 100K full-pipeline + season result: 2h 58m. |
| 2026-04-21 | Redis evaluation documented (3.17). ClickHouse schema review (3.16) performed. Schema changes applied, full-scale run regressed 68×. Changes reverted. Empirical-update annotation added to schema review doc. |
| 2026-04-22 | R&D findings documentation (this document) prepared. |
End of document.