Skip to content

Storage Experiment — Status Summary

Date: 2026-04-19 • Branch: LBS-1183

High-level test state + performance numbers. For process, architecture, and retrospective analysis, see status.md. For the R&D Tax evidence write-up (hypothesis → method → result → knowledge, including negative results), see rd-findings.md.

TL;DR

  • 100K full-pipeline validated end-to-end on Production 3×16 tier, covering all three outcome tables (game_outcome_context, play_by_play, season_outcome_context) in 2h 58m for ~$27/run. 244,800 OC + 4.2B PBP + 1,728 season rows all correct.
  • 100K OC-only validated at 31.4 min. Merge collapsed from 20 min → 2.3 min (-89%) with tier upgrade + merge-parallelism cap raised to 16.
  • 100K full-pipeline (OC + PBP, no season) validated at 2h 44m. PBP write near-linear between 10K and 100K at 528K rows/sec.
  • Season accumulator refactored from per-world nested dicts to per-outcome fixed arrays. Memory at 100K dropped from a projected 270 GB (OOM at 64 GB) to ~1.3 GB. Same observable behaviour; 6 existing unit tests pass unchanged.
  • 10K full-pipeline + season validated at 18.1 min (smaller-scale checkpoint for the same path).
  • async_insert=1; wait_for_async_insert=0 is the key Cloud lever: OC write -86% at 1K.

Measured performance

Localhost (Docker ClickHouse 24.8, laptop 4-core)

Scale Pipeline N=1 wall N=4 wall Δ
1K OC-only 176s 106s -40%
1K OC+PBP 256s 198s -22.5%
5K OC-only 772s 345s -55%
10K OC-only 1,565s 724s -54%
10K OC+PBP 2,634s not measured

Cloud (ClickHouse Cloud Production Scale in westus3)

OC-only path (async_insert + merge-parallelism capped at 4):

Environment Scale Config Wall Sim OC write Merge Notes
AU laptop → Cloud 1K N=32 343s 63s 225s 11s ~565ms RTT dominates OC write
Container Apps Consumption (4 CPU) 1K N=32 235s 76s 106s 13s same-region, sim CPU-bound
Container Apps Consumption (4 CPU) 5K N=32 625s 367s 156s 62s
Container D32 (16 CPU) + async_insert 1K N=32 60s 23s 16s 13s -58% vs D32 no-async
Container D32 (16 CPU) + async_insert 10K N=32 400s 199s 39s 155s merge regression at scale
Container D32 (16 CPU) + async_insert 100K N=32 2,991s (49.85 min) 1,530s 233s 1,210s primary result on 2×4 tier
Container D32 (32 CPU) + async_insert, Cloud 3×16 tier, merge cap=16 100K N=32 1,887s (31.4 min) 1,412s 328s 137s headline — merge -89% via tier+cap

Full pipeline (OC + PBP direct to ClickHouse, 4-way parallel PBP writes):

Cloud tier Scale Wall PBP write PBP rows Status
Prod Scale 2×4 1K 304s 239s 42M OK
Prod Scale 2×4 10K 1,120s 818s 420M OK
Prod Scale 2×4 100K >4h did not complete FAILED at replica-timeout
Prod 3×16 100K 9,887s (2h 44m) 7,955s **4.2B ** OK — headline full-pipeline result

Full pipeline + season context (adds season_outcome_context writes):

Cloud tier Scale Wall Season write Season rows Status
Prod 3×16 10K 1,088s (18.1 min) 7.08s 1,728 OK
Prod 3×16 (original accumulator) 100K ~3h then OOM FAILED — historical
Prod 3×16 (refactored accumulator) 100K 10,696s (2h 58m) 22.90s **1,728 ** OK — primary full-pipeline result

100K OC-only result detail (headline)

On Production 3×16 tier with merge cap=16 (current best): - Wall time: 31.4 min - OC rows written: 244,800 - Peak client RAM: 19.0 GB - Cost: ~$0.53 ACI + ~$4.25 CH Cloud = ~$4.80/run - Phase breakdown: sim 1,412s / 75% • merge 137s / 7% • OC write 328s / 17% • overhead 9s / 1%

On prior Production Scale (2×4) tier with merge cap=4: - Wall time: 49.85 min, cost ~$0.80/run (cheaper hourly but longer) - Same OC rows written, same peak RAM - Merge was 20 min (40% of wall), not 2.3 min

10K full-pipeline result detail (highest validated with-PBP scale)

  • Wall time: 18.7 min (1,120s)
  • OC rows written: 244,800
  • PBP rows written: 419,758,294 (matches laptop 10K exactly)
  • Peak client RAM: 41.8 GB (32 vCPU / 64 GiB container)
  • Cost: ~$0.32 per run
  • Phase breakdown: sim 136s, OC write 51s, PBP write 818s (73%), merge 106s

PBP throughput on Cloud (direct bulk-copy path, 4-way parallel): - 1K: 175K rows/sec (single-path bulk) - 10K: 513K rows/sec (parallel + larger bulks amortise async_insert overhead) - Linear projection to 100K: ~2.3h PBP write — did not hold, actual hit 4h timeout

100K full-pipeline: NOT VALIDATED

Attempted with 32 vCPU / 64 GiB container, 4-way parallel PBP to ClickHouse. Hit 4-hour replica deadline still in the PBP write phase. PBP scaled worse than linear from 10K (expected ~2.3h, got >4h).

Likely root cause: play_by_play is PARTITION BY season_id (single partition per season due to Cloud's locked max_partitions_per_insert_block=100). At 100K the single partition contains ~4.2B rows; background-merge pressure on that one partition throttles foreground INSERT throughput in a way that doesn't show at 10K (420M rows).

Fixes, ranked: 1. Upgrade Cloud tier (Production Scale → Production) — raises aggregate insert throughput, no code change. 2. Repartition PBP as (season_id, intDiv(world_id, 10000)) — spreads data across ~10 partitions per season, each getting background merges in parallel. Schema change + single re-run to validate. 3. Split PBP off the critical path — ship OC-only as primary, run PBP in a separate pipeline when needed.

100K projection vs measurement

Projection basis Estimate Delta vs actual (49.85 min)
Localhost extrapolation, N=1 serial ~7.3h 8.7× slower
Localhost, N=4 OC-only ~2.0h 2.4× slower
D32 1K (no async) × 100 ~67 min 17 min over
D32 1K + async × 100 ~50 min spot on

Quality gates

Gate Status
OC row count at every scale (localhost + Cloud 1K/5K/10K/100K) 244,800 consistent
Unit + integration tests 108/109 passing (1 intentional skip)
Cross-run data isolation (DROP TABLE ... SYNC) fixed
Parallel-path correctness vs sequential matches
100K end-to-end on Cloud successful run documented
Merge OOM under high parallelism root-caused, capped to N=4 merge-workers

PBP path — from broken to partially-validated

The IPlayByPlayBatchWriter abstraction had two production paths; validation found one unusable on Cloud:

Path Local disk (Parquet) ClickHouse direct
1K 41s (laptop N=1) 239s (Cloud D32 N=32)
10K 726s (laptop N=1) 818s (Cloud, 4-way parallel)
10K on Container Apps ephemeral disk STALLED at 100+ min, never completed
100K not attempted did not complete in 4h

The Parquet-on-ephemeral-container-disk path is dead for Cloud deployment — container local storage is not suited for the Parquet writer's per-chunk flush pattern. ClickHouse direct was the working alternative at 10K; 100K exposes a new ceiling (super-linear PBP write scaling).

Schema change landed as part of the PBP→ClickHouse work: play_by_play switched from PARTITION BY game_id (288/season, rejected by Cloud) to PARTITION BY season_id (1/season).

Known gap: season context not yet in the pipeline

Every run documented above writes two tables: - game_outcome_context (per-game OC) — ~244,800 rows at every scale, Array(Float64)[worlds] per row - play_by_play — 42M / 420M / 4.2B rows at 1K / 10K / 100K

A third table, season_outcome_context, holds season-level aggregates (standings, milestones, team W-L across all worlds). It has ~1,668 outcome IDs per season × Array(Float64)[worlds]: - 144× less data than game OC at any given scale - Should write in well under a minute once wired into the orchestrator

The AmericanFootballSeasonAccumulator and WriteSeasonOutcomeContextAsync exist, but are only exercised by standalone experiments (SeasonContextWriteExperiment/SeasonContextReadExperiment in the nice-to-have suite). The streaming orchestrator doesn't currently instantiate the season accumulator, feed per-game results to it, or emit its output. Work item tracked for next pass: feed each simulated season into the season accumulator during the chunk loop and flush once at the end; add a SeasonWriteTime line to StreamingRunReport. Data-volume implications are tiny so headline numbers won't move materially, but "full pipeline" isn't truly full until this is in.

Remaining levers if sub-30-min 100K is needed

  1. Unblock merge parallelism — upgrade Cloud tier (more replica RAM). Current Production Scale limits merge to N=4. On a bigger tier we could run N=16 merges → ~20 min wall.
  2. Split merge by outcome range — chunk arrayFlatten into halves or quarters per game, reducing peak working-set per query. Same Cloud tier, more parallel merges. Needs code change (~1 day).
  3. Raise sim parallelism past N=32 — D32 has 32 vCPUs; we're allocating 16. Push to N=32 with 32-vCPU container → sim drops from 25 min to ~15 min.

Next validation points

  1. With-PBP at scale on D32 — so far we've validated OC-only. Full-pipeline (PBP enabled) hasn't been run at 10K/100K on Cloud. Prerequisite if we ship full-pipeline to prod.
  2. Australia East Cloud region (when available — user noted Mon/Tue). Removes the long-haul network path from dev workflow and lets the AU team iterate without the westus3 container.
  3. Should-have experiment suite on Cloud (concurrent reads, versioning, REQ-INFRA-3 codec comparison).
  4. Final report for storage-decision gate review.

Decisions needed

  • Production Cloud tier — current Production Scale covers 100K in 50 min. If the decision gate requires sub-30-min runs, upgrade now rather than later.
  • Full-pipeline 100K — run it, or ship OC-only and handle PBP in a separate pipeline? The OC-only path is validated; PBP at 100K would need a memory redesign regardless of tier.
  • Sub-60s feasibility path — ratify or defer. See 100K-in-60s problem statement.