Environment: Azure Container Apps Job, D32-benchmark (16 vCPU / 32 GiB), westus3
ClickHouse Cloud Production Scale (2 replicas × 4 CPU × ~8 GiB)
Connection: HTTPS, Compression=true, set_async_insert=1, set_wait_for_async_insert=0
Job execution: oc-exp-1k-p32-v5csdy1
Image: commit 19888ebd (experimental hash-sharded merge at 4 shards/game)
Timestamp: 2026-04-19T23:30 local (13:30 UTC)

Backend: clickhouse, Scale: 100000 worlds, Environment: local
Streaming mode: totalWorlds=100000, chunkSize=1000 [PBP DISABLED] [parallel=32]
PBP sink: local at ./pbp-parquet

Streaming Run Report
--------------------
Total wall time:         3377.17 s
Simulation time:         1808.04 s
OC write time:            227.25 s
PBP write time:             0.00 s
Merge time:              1327.55 s
Peak working set:        16297.1 MB
Chunks completed:            100
OC rows (merged):        244,800
PBP rows (written):            0

Notes:
- Sharding the per-game arrayFlatten merges by
  cityHash64(outcome_id) % 4 produced CORRECT results (244,800 rows)
  but REGRESSED total wall time by 12.9% vs the unsharded baseline
  (2,991s -> 3,377s) and merge time by about 10% (1,210s -> 1,328s).
- Root cause: each sharded query still scans all staging rows for
  its game. The game_id filter is the only index-eligible predicate;
  the outcome_id hash filter is applied only AFTER the rows are read.
  So the 16 sharded queries (4 per game) did ~4x the total scan IO
  of the 4 unsharded queries for the same output.
- Change reverted in commit 2049f9a4. The sharded merge approach
  would need a staging-table redesign (partition or ORDER BY on
  outcome_id hash) to actually reduce scan work per shard.
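
Back-of-envelope check of the numbers in the notes above. This is a
sketch, not the benchmark harness: it assumes every query that can only
prune on game_id reads all staging rows for that game, and the
rows-per-game figure is a hypothetical placeholder that cancels out of
the ratio. The function names are illustrative only.

```python
# Rough model of staging-table scan work for the per-game merge.
# Assumption: a query filtered only by game_id reads every staging row
# for that game, whether or not it also filters on the outcome_id hash.


def rows_scanned(games: int, rows_per_game: int, shards: int = 1) -> int:
    """Total staging rows read when each game is merged by `shards` queries."""
    queries = games * shards
    return queries * rows_per_game  # each query reads the full game


def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


GAMES = 4                   # 16 sharded queries at 4 shards per game
SHARDS = 4                  # cityHash64(outcome_id) % 4
ROWS_PER_GAME = 1_000_000   # hypothetical; cancels out of the ratio

unsharded = rows_scanned(GAMES, ROWS_PER_GAME)
sharded = rows_scanned(GAMES, ROWS_PER_GAME, SHARDS)
amplification = sharded / unsharded

# Timings quoted in the notes (seconds, baseline -> this run).
wall_regression = pct_change(2991, 3377)
merge_regression = pct_change(1210, 1328)

print(f"scan amplification: {amplification:.1f}x")
print(f"wall time:  {wall_regression:+.1f}%")
print(f"merge time: {merge_regression:+.1f}%")
```

Under this model the amplification equals the shard count regardless of
table size, so raising the shard count only helps if the staging table
lets each shard skip rows that belong to the other shards.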
