Storage Experiment — Status Summary¶

Date: 2026-04-19 • Branch: LBS-1183

High-level test state + performance numbers. For process, architecture, and retrospective analysis, see status.md. For the R&D Tax evidence write-up (hypothesis → method → result → knowledge, including negative results), see rd-findings.md.

TL;DR¶

100K full-pipeline validated end-to-end on Production 3×16 tier, covering all three outcome tables (game_outcome_context, play_by_play, season_outcome_context) in 2h 58m for ~$27/run. 244,800 OC + 4.2B PBP + 1,728 season rows all correct.
100K OC-only validated at 31.4 min. Merge collapsed from 20 min → 2.3 min (-89%) with tier upgrade + merge-parallelism cap raised to 16.
100K full-pipeline (OC + PBP, no season) validated at 2h 44m. PBP write near-linear between 10K and 100K at 528K rows/sec.
Season accumulator refactored from per-world nested dicts to per-outcome fixed arrays. Memory at 100K dropped from a projected 270 GB (OOM at 64 GB) to ~1.3 GB. Same observable behaviour; 6 existing unit tests pass unchanged.
10K full-pipeline + season validated at 18.1 min (smaller-scale checkpoint for the same path).
async_insert=1; wait_for_async_insert=0 is the key Cloud lever: OC write -86% at 1K.

Measured performance¶

Localhost (Docker ClickHouse 24.8, laptop 4-core)¶

Scale	Pipeline	N=1 wall	N=4 wall	Δ
1K	OC-only	176s	106s	-40%
1K	OC+PBP	256s	198s	-22.5%
5K	OC-only	772s	345s	-55%
10K	OC-only	1,565s	724s	-54%
10K	OC+PBP	2,634s	not measured	—

Cloud (ClickHouse Cloud Production Scale in westus3)¶

OC-only path (async_insert + merge-parallelism capped at 4):

Environment	Scale	Config	Wall	Sim	OC write	Merge	Notes
AU laptop → Cloud	1K	N=32	343s	63s	225s	11s	~565ms RTT dominates OC write
Container Apps Consumption (4 CPU)	1K	N=32	235s	76s	106s	13s	same-region, sim CPU-bound
Container Apps Consumption (4 CPU)	5K	N=32	625s	367s	156s	62s
Container D32 (16 CPU) + async_insert	1K	N=32	60s	23s	16s	13s	-58% vs D32 no-async
Container D32 (16 CPU) + async_insert	10K	N=32	400s	199s	39s	155s	merge regression at scale
Container D32 (16 CPU) + async_insert	100K	N=32	2,991s (49.85 min)	1,530s	233s	1,210s	primary result on 2×4 tier
Container D32 (32 CPU) + async_insert, Cloud 3×16 tier, merge cap=16	100K	N=32	1,887s (31.4 min)	1,412s	328s	137s	headline — merge -89% via tier+cap

Full pipeline (OC + PBP direct to ClickHouse, 4-way parallel PBP writes):

Cloud tier	Scale	Wall	PBP write	PBP rows	Status
Prod Scale 2×4	1K	304s	239s	42M	OK
Prod Scale 2×4	10K	1,120s	818s	420M	OK
Prod Scale 2×4	100K	>4h	did not complete	—	FAILED at replica-timeout
Prod 3×16	100K	9,887s (2h 44m)	7,955s	4.2B	OK — headline full-pipeline result

Full pipeline + season context (adds season_outcome_context writes):

Cloud tier	Scale	Wall	Season write	Season rows	Status
Prod 3×16	10K	1,088s (18.1 min)	7.08s	1,728	OK
Prod 3×16 (original accumulator)	100K	~3h then OOM	—	—	FAILED — historical
Prod 3×16 (refactored accumulator)	100K	10,696s (2h 58m)	22.90s	1,728	OK — primary full-pipeline result

100K OC-only result detail (headline)¶

On Production 3×16 tier with merge cap=16 (current best): - Wall time: 31.4 min - OC rows written: 244,800 - Peak client RAM: 19.0 GB - Cost: ~$0.53 ACI + ~$4.25 CH Cloud = ~$4.80/run - Phase breakdown: sim 1,412s / 75% • merge 137s / 7% • OC write 328s / 17% • overhead 9s / 1%

On prior Production Scale (2×4) tier with merge cap=4: - Wall time: 49.85 min, cost ~$0.80/run (cheaper hourly but longer) - Same OC rows written, same peak RAM - Merge was 20 min (40% of wall), not 2.3 min

10K full-pipeline result detail (highest validated with-PBP scale)¶

Wall time: 18.7 min (1,120s)
OC rows written: 244,800
PBP rows written: 419,758,294 (matches laptop 10K exactly)
Peak client RAM: 41.8 GB (32 vCPU / 64 GiB container)
Cost: ~$0.32 per run
Phase breakdown: sim 136s, OC write 51s, PBP write 818s (73%), merge 106s

PBP throughput on Cloud (direct bulk-copy path, 4-way parallel): - 1K: 175K rows/sec (single-path bulk) - 10K: 513K rows/sec (parallel + larger bulks amortise async_insert overhead) - Linear projection to 100K: ~2.3h PBP write — did not hold, actual hit 4h timeout

100K full-pipeline: NOT VALIDATED¶

Attempted with 32 vCPU / 64 GiB container, 4-way parallel PBP to ClickHouse. Hit 4-hour replica deadline still in the PBP write phase. PBP scaled worse than linear from 10K (expected ~2.3h, got >4h).

Likely root cause: play_by_play is PARTITION BY season_id (single partition per season due to Cloud's locked max_partitions_per_insert_block=100). At 100K the single partition contains ~4.2B rows; background-merge pressure on that one partition throttles foreground INSERT throughput in a way that doesn't show at 10K (420M rows).

Fixes, ranked: 1. Upgrade Cloud tier (Production Scale → Production) — raises aggregate insert throughput, no code change. 2. Repartition PBP as (season_id, intDiv(world_id, 10000)) — spreads data across ~10 partitions per season, each getting background merges in parallel. Schema change + single re-run to validate. 3. Split PBP off the critical path — ship OC-only as primary, run PBP in a separate pipeline when needed.

100K projection vs measurement¶

Projection basis	Estimate	Delta vs actual (49.85 min)
Localhost extrapolation, N=1 serial	~7.3h	8.7× slower
Localhost, N=4 OC-only	~2.0h	2.4× slower
D32 1K (no async) × 100	~67 min	17 min over
D32 1K + async × 100	~50 min	spot on

Quality gates¶

Gate	Status
OC row count at every scale (localhost + Cloud 1K/5K/10K/100K)	244,800 consistent
Unit + integration tests	108/109 passing (1 intentional skip)
Cross-run data isolation (`DROP TABLE ... SYNC`)	fixed
Parallel-path correctness vs sequential	matches
100K end-to-end on Cloud	successful run documented
Merge OOM under high parallelism	root-caused, capped to N=4 merge-workers

PBP path — from broken to partially-validated¶

The IPlayByPlayBatchWriter abstraction had two production paths; validation found one unusable on Cloud:

Path	Local disk (Parquet)	ClickHouse direct
1K	41s (laptop N=1)	239s (Cloud D32 N=32)
10K	726s (laptop N=1)	818s (Cloud, 4-way parallel)
10K on Container Apps ephemeral disk	STALLED at 100+ min, never completed	—
100K	not attempted	did not complete in 4h

The Parquet-on-ephemeral-container-disk path is dead for Cloud deployment — container local storage is not suited for the Parquet writer's per-chunk flush pattern. ClickHouse direct was the working alternative at 10K; 100K exposes a new ceiling (super-linear PBP write scaling).

Schema change landed as part of the PBP→ClickHouse work: play_by_play switched from PARTITION BY game_id (288/season, rejected by Cloud) to PARTITION BY season_id (1/season).

Known gap: season context not yet in the pipeline¶

Every run documented above writes two tables: - game_outcome_context (per-game OC) — ~244,800 rows at every scale, Array(Float64)[worlds] per row - play_by_play — 42M / 420M / 4.2B rows at 1K / 10K / 100K

A third table, season_outcome_context, holds season-level aggregates (standings, milestones, team W-L across all worlds). It has ~1,668 outcome IDs per season × Array(Float64)[worlds]: - 144× less data than game OC at any given scale - Should write in well under a minute once wired into the orchestrator

The AmericanFootballSeasonAccumulator and WriteSeasonOutcomeContextAsync exist, but are only exercised by standalone experiments (SeasonContextWriteExperiment/SeasonContextReadExperiment in the nice-to-have suite). The streaming orchestrator doesn't currently instantiate the season accumulator, feed per-game results to it, or emit its output. Work item tracked for next pass: feed each simulated season into the season accumulator during the chunk loop and flush once at the end; add a SeasonWriteTime line to StreamingRunReport. Data-volume implications are tiny so headline numbers won't move materially, but "full pipeline" isn't truly full until this is in.

Remaining levers if sub-30-min 100K is needed¶

Unblock merge parallelism — upgrade Cloud tier (more replica RAM). Current Production Scale limits merge to N=4. On a bigger tier we could run N=16 merges → ~20 min wall.
Split merge by outcome range — chunk arrayFlatten into halves or quarters per game, reducing peak working-set per query. Same Cloud tier, more parallel merges. Needs code change (~1 day).
Raise sim parallelism past N=32 — D32 has 32 vCPUs; we're allocating 16. Push to N=32 with 32-vCPU container → sim drops from 25 min to ~15 min.

Next validation points¶

With-PBP at scale on D32 — so far we've validated OC-only. Full-pipeline (PBP enabled) hasn't been run at 10K/100K on Cloud. Prerequisite if we ship full-pipeline to prod.
Australia East Cloud region (when available — user noted Mon/Tue). Removes the long-haul network path from dev workflow and lets the AU team iterate without the westus3 container.
Should-have experiment suite on Cloud (concurrent reads, versioning, REQ-INFRA-3 codec comparison).
Final report for storage-decision gate review.

Decisions needed¶

Production Cloud tier — current Production Scale covers 100K in 50 min. If the decision gate requires sub-30-min runs, upgrade now rather than later.
Full-pipeline 100K — run it, or ship OC-only and handle PBP in a separate pipeline? The OC-only path is validated; PBP at 100K would need a memory redesign regardless of tier.
Sub-60s feasibility path — ratify or defer. See 100K-in-60s problem statement.