Storage Experiment — Status Summary¶
Date: 2026-04-19 • Branch: LBS-1183
High-level test state + performance numbers. For process, architecture, and retrospective analysis, see status.md. For the R&D Tax evidence write-up (hypothesis → method → result → knowledge, including negative results), see rd-findings.md.
TL;DR¶
- 100K full-pipeline validated end-to-end on Production 3×16 tier, covering all three outcome tables (
game_outcome_context,play_by_play,season_outcome_context) in 2h 58m for ~$27/run. 244,800 OC + 4.2B PBP + 1,728 season rows all correct. - 100K OC-only validated at 31.4 min. Merge collapsed from 20 min → 2.3 min (-89%) with tier upgrade + merge-parallelism cap raised to 16.
- 100K full-pipeline (OC + PBP, no season) validated at 2h 44m. PBP write near-linear between 10K and 100K at 528K rows/sec.
- Season accumulator refactored from per-world nested dicts to per-outcome fixed arrays. Memory at 100K dropped from a projected 270 GB (OOM at 64 GB) to ~1.3 GB. Same observable behaviour; 6 existing unit tests pass unchanged.
- 10K full-pipeline + season validated at 18.1 min (smaller-scale checkpoint for the same path).
async_insert=1; wait_for_async_insert=0is the key Cloud lever: OC write -86% at 1K.
Measured performance¶
Localhost (Docker ClickHouse 24.8, laptop 4-core)¶
| Scale | Pipeline | N=1 wall | N=4 wall | Δ |
|---|---|---|---|---|
| 1K | OC-only | 176s | 106s | -40% |
| 1K | OC+PBP | 256s | 198s | -22.5% |
| 5K | OC-only | 772s | 345s | -55% |
| 10K | OC-only | 1,565s | 724s | -54% |
| 10K | OC+PBP | 2,634s | not measured | — |
Cloud (ClickHouse Cloud Production Scale in westus3)¶
OC-only path (async_insert + merge-parallelism capped at 4):
| Environment | Scale | Config | Wall | Sim | OC write | Merge | Notes |
|---|---|---|---|---|---|---|---|
| AU laptop → Cloud | 1K | N=32 | 343s | 63s | 225s | 11s | ~565ms RTT dominates OC write |
| Container Apps Consumption (4 CPU) | 1K | N=32 | 235s | 76s | 106s | 13s | same-region, sim CPU-bound |
| Container Apps Consumption (4 CPU) | 5K | N=32 | 625s | 367s | 156s | 62s | |
| Container D32 (16 CPU) + async_insert | 1K | N=32 | 60s | 23s | 16s | 13s | -58% vs D32 no-async |
| Container D32 (16 CPU) + async_insert | 10K | N=32 | 400s | 199s | 39s | 155s | merge regression at scale |
| Container D32 (16 CPU) + async_insert | 100K | N=32 | 2,991s (49.85 min) | 1,530s | 233s | 1,210s | primary result on 2×4 tier |
| Container D32 (32 CPU) + async_insert, Cloud 3×16 tier, merge cap=16 | 100K | N=32 | 1,887s (31.4 min) | 1,412s | 328s | 137s | headline — merge -89% via tier+cap |
Full pipeline (OC + PBP direct to ClickHouse, 4-way parallel PBP writes):
| Cloud tier | Scale | Wall | PBP write | PBP rows | Status |
|---|---|---|---|---|---|
| Prod Scale 2×4 | 1K | 304s | 239s | 42M | OK |
| Prod Scale 2×4 | 10K | 1,120s | 818s | 420M | OK |
| Prod Scale 2×4 | 100K | >4h | did not complete | — | FAILED at replica-timeout |
| Prod 3×16 | 100K | 9,887s (2h 44m) | 7,955s | **4.2B ** | OK — headline full-pipeline result |
Full pipeline + season context (adds season_outcome_context writes):
| Cloud tier | Scale | Wall | Season write | Season rows | Status |
|---|---|---|---|---|---|
| Prod 3×16 | 10K | 1,088s (18.1 min) | 7.08s | 1,728 | OK |
| Prod 3×16 (original accumulator) | 100K | ~3h then OOM | — | — | FAILED — historical |
| Prod 3×16 (refactored accumulator) | 100K | 10,696s (2h 58m) | 22.90s | **1,728 ** | OK — primary full-pipeline result |
100K OC-only result detail (headline)¶
On Production 3×16 tier with merge cap=16 (current best): - Wall time: 31.4 min - OC rows written: 244,800 - Peak client RAM: 19.0 GB - Cost: ~$0.53 ACI + ~$4.25 CH Cloud = ~$4.80/run - Phase breakdown: sim 1,412s / 75% • merge 137s / 7% • OC write 328s / 17% • overhead 9s / 1%
On prior Production Scale (2×4) tier with merge cap=4: - Wall time: 49.85 min, cost ~$0.80/run (cheaper hourly but longer) - Same OC rows written, same peak RAM - Merge was 20 min (40% of wall), not 2.3 min
10K full-pipeline result detail (highest validated with-PBP scale)¶
- Wall time: 18.7 min (1,120s)
- OC rows written: 244,800
- PBP rows written: 419,758,294 (matches laptop 10K exactly)
- Peak client RAM: 41.8 GB (32 vCPU / 64 GiB container)
- Cost: ~$0.32 per run
- Phase breakdown: sim 136s, OC write 51s, PBP write 818s (73%), merge 106s
PBP throughput on Cloud (direct bulk-copy path, 4-way parallel): - 1K: 175K rows/sec (single-path bulk) - 10K: 513K rows/sec (parallel + larger bulks amortise async_insert overhead) - Linear projection to 100K: ~2.3h PBP write — did not hold, actual hit 4h timeout
100K full-pipeline: NOT VALIDATED¶
Attempted with 32 vCPU / 64 GiB container, 4-way parallel PBP to ClickHouse. Hit 4-hour replica deadline still in the PBP write phase. PBP scaled worse than linear from 10K (expected ~2.3h, got >4h).
Likely root cause: play_by_play is PARTITION BY season_id (single partition per season due to Cloud's locked max_partitions_per_insert_block=100). At 100K the single partition contains ~4.2B rows; background-merge pressure on that one partition throttles foreground INSERT throughput in a way that doesn't show at 10K (420M rows).
Fixes, ranked:
1. Upgrade Cloud tier (Production Scale → Production) — raises aggregate insert throughput, no code change.
2. Repartition PBP as (season_id, intDiv(world_id, 10000)) — spreads data across ~10 partitions per season, each getting background merges in parallel. Schema change + single re-run to validate.
3. Split PBP off the critical path — ship OC-only as primary, run PBP in a separate pipeline when needed.
100K projection vs measurement¶
| Projection basis | Estimate | Delta vs actual (49.85 min) |
|---|---|---|
| Localhost extrapolation, N=1 serial | ~7.3h | 8.7× slower |
| Localhost, N=4 OC-only | ~2.0h | 2.4× slower |
| D32 1K (no async) × 100 | ~67 min | 17 min over |
| D32 1K + async × 100 | ~50 min | spot on |
Quality gates¶
| Gate | Status |
|---|---|
| OC row count at every scale (localhost + Cloud 1K/5K/10K/100K) | 244,800 consistent |
| Unit + integration tests | 108/109 passing (1 intentional skip) |
Cross-run data isolation (DROP TABLE ... SYNC) |
fixed |
| Parallel-path correctness vs sequential | matches |
| 100K end-to-end on Cloud | successful run documented |
| Merge OOM under high parallelism | root-caused, capped to N=4 merge-workers |
PBP path — from broken to partially-validated¶
The IPlayByPlayBatchWriter abstraction had two production paths; validation found one unusable on Cloud:
| Path | Local disk (Parquet) | ClickHouse direct |
|---|---|---|
| 1K | 41s (laptop N=1) | 239s (Cloud D32 N=32) |
| 10K | 726s (laptop N=1) | 818s (Cloud, 4-way parallel) |
| 10K on Container Apps ephemeral disk | STALLED at 100+ min, never completed | — |
| 100K | not attempted | did not complete in 4h |
The Parquet-on-ephemeral-container-disk path is dead for Cloud deployment — container local storage is not suited for the Parquet writer's per-chunk flush pattern. ClickHouse direct was the working alternative at 10K; 100K exposes a new ceiling (super-linear PBP write scaling).
Schema change landed as part of the PBP→ClickHouse work: play_by_play switched from PARTITION BY game_id (288/season, rejected by Cloud) to PARTITION BY season_id (1/season).
Known gap: season context not yet in the pipeline¶
Every run documented above writes two tables:
- game_outcome_context (per-game OC) — ~244,800 rows at every scale, Array(Float64)[worlds] per row
- play_by_play — 42M / 420M / 4.2B rows at 1K / 10K / 100K
A third table, season_outcome_context, holds season-level aggregates (standings, milestones, team W-L across all worlds). It has ~1,668 outcome IDs per season × Array(Float64)[worlds]:
- 144× less data than game OC at any given scale
- Should write in well under a minute once wired into the orchestrator
The AmericanFootballSeasonAccumulator and WriteSeasonOutcomeContextAsync exist, but are only exercised by standalone experiments (SeasonContextWriteExperiment/SeasonContextReadExperiment in the nice-to-have suite). The streaming orchestrator doesn't currently instantiate the season accumulator, feed per-game results to it, or emit its output. Work item tracked for next pass: feed each simulated season into the season accumulator during the chunk loop and flush once at the end; add a SeasonWriteTime line to StreamingRunReport. Data-volume implications are tiny so headline numbers won't move materially, but "full pipeline" isn't truly full until this is in.
Remaining levers if sub-30-min 100K is needed¶
- Unblock merge parallelism — upgrade Cloud tier (more replica RAM). Current Production Scale limits merge to N=4. On a bigger tier we could run N=16 merges → ~20 min wall.
- Split merge by outcome range — chunk
arrayFlatteninto halves or quarters per game, reducing peak working-set per query. Same Cloud tier, more parallel merges. Needs code change (~1 day). - Raise sim parallelism past N=32 — D32 has 32 vCPUs; we're allocating 16. Push to N=32 with 32-vCPU container → sim drops from 25 min to ~15 min.
Next validation points¶
- With-PBP at scale on D32 — so far we've validated OC-only. Full-pipeline (PBP enabled) hasn't been run at 10K/100K on Cloud. Prerequisite if we ship full-pipeline to prod.
- Australia East Cloud region (when available — user noted Mon/Tue). Removes the long-haul network path from dev workflow and lets the AU team iterate without the westus3 container.
- Should-have experiment suite on Cloud (concurrent reads, versioning, REQ-INFRA-3 codec comparison).
- Final report for storage-decision gate review.
Decisions needed¶
- Production Cloud tier — current Production Scale covers 100K in 50 min. If the decision gate requires sub-30-min runs, upgrade now rather than later.
- Full-pipeline 100K — run it, or ship OC-only and handle PBP in a separate pipeline? The OC-only path is validated; PBP at 100K would need a memory redesign regardless of tier.
- Sub-60s feasibility path — ratify or defer. See 100K-in-60s problem statement.