Environment: Azure Container Apps Job, D32-benchmark (16 vCPU / 32 GiB), westus3 ClickHouse Cloud Production Scale (2 replicas × 4 CPU × ~8 GiB) Connection: HTTPS, Compression=true, set_async_insert=1, set_wait_for_async_insert=0 Job execution: oc-exp-1k-p32-v5csdy1 Image: commit 19888ebd (experimental hash-sharded merge at 4 shards/game) Timestamp: 2026-04-19T23:30 local (13:30 UTC) Backend: clickhouse, Scale: 100000 worlds, Environment: local Streaming mode: totalWorlds=100000, chunkSize=1000 [PBP DISABLED] [parallel=32] PBP sink: local at ./pbp-parquet Streaming Run Report -------------------- Total wall time: 3377.17 s Simulation time: 1808.04 s OC write time: 227.25 s PBP write time: 0.00 s Merge time: 1327.55 s Peak working set: 16297.1 MB Chunks completed: 100 OC rows (merged): 244,800 PBP rows (written): 0 Notes: - This run validated that sharding per-game arrayFlatten merges by cityHash64(outcome_id) % 4 produced CORRECT results (244,800 rows) but REGRESSED total wall time by +12.9% vs the unsharded baseline (2,991s -> 3,377s) and merge time by +10% (1,210s -> 1,328s). - Root cause: each sharded query still has to scan the full staging rows for its game (filter on game_id is the only index-eligible predicate; the outcome_id hash filter runs AFTER the row read). So 16 sharded queries did ~4x the total scan IO of 4 unsharded queries for the same output. - Change reverted in commit 2049f9a4. The sharded merge approach would need a staging-table redesign (partition or ORDER BY on outcome_id hash) to actually reduce scan work per shard.