100K Worlds in Under 60 Seconds — Problem Statement
Date: 2026-04-18
Audience: Executive leadership
Author: Storage Experiment team, LBS-1183
The ask
Run a full 100,000-world season simulation for a single sport, end-to-end, in under 60 seconds.
The math (one slide)
We have measured this on a laptop. The cost is dominated by pure simulation CPU — the storage/database layer is no longer the bottleneck.
| Measurement | Value |
|---|---|
| Current laptop (4-core parallel, OC-only) | 106 seconds for 1,000 worlds |
| Pure simulation CPU per world | ~0.29 CPU-seconds |
| Total CPU work for 100,000 worlds | ~29,200 CPU-seconds (≈ 8 CPU-hours) |
| To finish in 60 seconds | ~490 CPUs running concurrently (theoretical minimum) |
| Realistic target with coordination + I/O overhead | 600–800 CPUs |
Translation: we need to turn ~8 hours of single-core work into 1 minute of fleet work. The work itself is embarrassingly parallel (every simulated world is independent), so this is fundamentally a scale-out problem, not a deep optimisation problem.
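The arithmetic in the table can be reproduced in a few lines. The sketch below is illustrative only: the function name `cpus_needed` and the `reserved_seconds` parameter are assumptions introduced here, not part of the measured prototype. It shows how the theoretical minimum follows from the total CPU work, and how setting aside part of the 60-second window for non-simulation time pushes the count up.

```python
import math

TOTAL_CPU_SECONDS = 29_200   # total simulation work for 100,000 worlds (from the table)
DEADLINE_SECONDS = 60        # wall-clock target


def cpus_needed(total_cpu_seconds: float, deadline_seconds: float,
                reserved_seconds: float = 0.0) -> int:
    """Smallest CPU count that finishes the work in the usable part of the window.

    reserved_seconds is a hypothetical allowance for non-simulation time
    (startup, queue latency, final merge). It is not a measured value.
    """
    usable = deadline_seconds - reserved_seconds
    if usable <= 0:
        raise ValueError("no time left for simulation")
    return math.ceil(total_cpu_seconds / usable)


theoretical_minimum = cpus_needed(TOTAL_CPU_SECONDS, DEADLINE_SECONDS)
with_10s_reserved = cpus_needed(TOTAL_CPU_SECONDS, DEADLINE_SECONDS, reserved_seconds=10)
with_20s_reserved = cpus_needed(TOTAL_CPU_SECONDS, DEADLINE_SECONDS, reserved_seconds=20)

print(theoretical_minimum, with_10s_reserved, with_20s_reserved)
```

With nothing reserved this gives 487 CPUs, which the table rounds to ~490. Reserving 10 to 20 seconds (an assumed range) gives 584 to 730 CPUs, in the same neighbourhood as the 600–800 realistic target.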
Why we cannot do this on one machine
- The largest single-node servers top out at 96–128 cores.
- Even a 96-core server provides only 96 × 60 s = 5,760 CPU-seconds inside the window, about one-fifth of the 29,200 we need. A 128-core server provides 7,680, still roughly a quarter.
- Memory bandwidth and I/O become the ceiling before CPU count does.
A single machine cannot hit this target. This is a distributed-compute problem, not a hardware-upgrade problem.
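As a rough check on the single-machine argument, the sketch below computes the CPU-seconds a server can supply in the window and the best-case wall-clock time for the full run. It assumes perfect scaling across cores, which is optimistic given the memory-bandwidth and I/O ceiling noted above; the helper names are made up for this example.

```python
TOTAL_CPU_SECONDS = 29_200   # total simulation work for 100,000 worlds (from the table)
WINDOW_SECONDS = 60          # wall-clock target


def cpu_seconds_available(cores: int, window_seconds: int) -> int:
    """CPU-seconds a machine can supply inside the window, assuming perfect scaling."""
    return cores * window_seconds


def best_case_wall_clock(total_cpu_seconds: float, cores: int) -> float:
    """Lower bound on wall-clock seconds if every core stays fully busy."""
    return total_cpu_seconds / cores


for cores in (96, 128):
    supplied = cpu_seconds_available(cores, WINDOW_SECONDS)
    shortfall = TOTAL_CPU_SECONDS / supplied
    seconds = best_case_wall_clock(TOTAL_CPU_SECONDS, cores)
    print(f"{cores} cores: {supplied} CPU-s in window, "
          f"{shortfall:.1f}x short, best case {seconds:.0f} s")
```

Even under this optimistic assumption, a 96-core machine needs about five minutes and a 128-core machine just under four.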
Why this is actually good news
The problem has three properties that make it very cloud-friendly:
- Embarrassingly parallel — no coordination between workers during simulation.
- Bursty — we don't run this continuously, only on demand.
- Cheap per run — ~500 vCPU-minutes at spot / consumption pricing is ~$3–7 per 100K run on current Azure list prices.
Standing up a dedicated 500-core cluster and letting it sit idle would cost us far more than paying per-run on bursty infrastructure.
Proposed architecture — Azure Functions + queue (burst pattern)
This fits the problem's shape almost perfectly:
- Fan-out via queue. A producer splits "simulate 100K worlds" into ~1,000 messages of "simulate worlds N..N+99". Messages land on an Azure Storage Queue (or Service Bus for ordering / dead-letter support).
- Azure Functions consume the queue. On Flex Consumption, Functions auto-scale out to hundreds or low thousands of instances — one instance per message. Each instance simulates its 100-world chunk, writes its Outcome Context slice to ClickHouse Cloud, and acknowledges the message.
- ClickHouse Cloud absorbs concurrent writes. The existing staging-table + arrayFlatten-merge pattern handles this today at laptop scale; needs a sized cluster (4–8 shards, NVMe) to absorb ~1.6 GB/s aggregate compressed writes from the fleet.
- Completion signal triggers the final merge. When the last message is acknowledged, a second function fires the merge across the 288 games and the run is done.
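The fan-out and completion steps above can be sketched without any cloud SDK. This is a minimal illustration, not the prototype's code: `ChunkMessage`, `split_run`, and `CompletionTracker` are hypothetical names, and the in-memory set stands in for whatever durable counter or queue metadata the real system would use.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class ChunkMessage:
    run_id: str
    first_world: int   # inclusive
    last_world: int    # inclusive


def split_run(run_id: str, total_worlds: int, chunk_size: int) -> Iterator[ChunkMessage]:
    """Split one run into contiguous world ranges, one queue message per range."""
    for first in range(0, total_worlds, chunk_size):
        last = min(first + chunk_size, total_worlds) - 1
        yield ChunkMessage(run_id, first, last)


class CompletionTracker:
    """In-memory stand-in for a durable record of outstanding chunks."""

    def __init__(self, messages: Iterable[ChunkMessage]) -> None:
        self._outstanding = {(m.first_world, m.last_world) for m in messages}

    def acknowledge(self, message: ChunkMessage) -> bool:
        """Mark a chunk done; return True only when it was the last outstanding one."""
        key = (message.first_world, message.last_world)
        was_outstanding = key in self._outstanding
        self._outstanding.discard(key)
        return was_outstanding and not self._outstanding


messages = list(split_run("run-001", total_worlds=100_000, chunk_size=100))
tracker = CompletionTracker(messages)
results = [tracker.acknowledge(m) for m in messages]
merge_triggers = sum(results)   # how many acknowledgements would fire the final merge

print(len(messages), messages[0], messages[-1], merge_triggers)
```

Because `acknowledge` returns True only for the acknowledgement that empties the set, a redelivered message cannot fire the merge a second time. A real deployment would need the same property from a shared, durable store.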
Cost shape
| Component | Per 100K run | Monthly baseline if idle |
|---|---|---|
| Azure Functions (Flex Consumption, ~1,000 × 60s × 4 GB) | ~$3.50 | $0 |
| ClickHouse Cloud burst (during run) | ~$1–3 | depends on cluster-down-time policy |
| Queue / orchestration | <$0.10 | ~$0 |
| Total per run | ~$5–7 | $0 (scales to zero) |
Key point: cost is per-run and proportional to use. Apart from whatever ClickHouse Cloud idle policy we choose, there is no "always-on" infrastructure tax. At 10 runs a day that is ~$50–70/day; at one run a month it is ~$5–7/month.
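The cost figures can be cross-checked with a short calculation. The sketch below uses only numbers from the table; the per-GB-second rate it prints is the rate implied by the table's ~$3.50 figure, not a quoted Azure price, and the helper name `spend` is made up for this example.

```python
INSTANCES = 1_000               # concurrent Function instances (from the table)
SECONDS_PER_INSTANCE = 60       # execution time per instance
MEMORY_GB = 4                   # memory size per instance
FUNCTIONS_COST_PER_RUN = 3.50   # USD, the table's Functions estimate
PER_RUN_TOTAL = (5.0, 7.0)      # USD, low and high total per run from the table

# Billable compute volume for one 100K run.
gb_seconds = INSTANCES * SECONDS_PER_INSTANCE * MEMORY_GB

# Rate implied by the table (an inference, not a published list price).
implied_rate = FUNCTIONS_COST_PER_RUN / gb_seconds


def spend(runs: int, per_run_range: tuple) -> tuple:
    """Low and high total spend for a given number of runs."""
    low, high = per_run_range
    return (runs * low, runs * high)


daily_at_10_runs = spend(10, PER_RUN_TOTAL)
monthly_at_1_run = spend(1, PER_RUN_TOTAL)

print(gb_seconds, f"{implied_rate:.3e}", daily_at_10_runs, monthly_at_1_run)
```

One run is 240,000 GB-seconds. Ten runs a day lands at $50 to $70, and a single monthly run at $5 to $7.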
Risks and caveats
- Cold start. First-time Function instance spin-up is 3–10 seconds; at 1,000 concurrent instances this can eat 10–20 s of the 60-s budget. Mitigations: always-ready instances on Flex (costs ~$100/month kept warm), or pre-warmers fired 30 s before the run. Not a blocker, but a design input.
- ClickHouse concurrent-insert ceiling. 1,000 concurrent writers will stress the cluster. Needs validation against a real Cloud cluster before we commit.
- Platform support. .NET 10 on Azure Functions Flex Consumption needs to be confirmed (the prototype is on a preview runtime).
- Idempotency. If a queue message is retried, the ClickHouse write must be idempotent. Small design constraint on the write path.
- The 1K parallelism factor is not yet validated at 10K / 100K. Currently being measured; if the factor breaks down at larger scale, the per-worker numbers shift.
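The idempotency constraint above can be illustrated with a write path keyed by chunk identity. This is a sketch under stated assumptions: `StagingStore` is a dictionary-backed stand-in invented for this example, not a ClickHouse client, and the row shape is a placeholder.

```python
from typing import Dict, List, Tuple

ChunkKey = Tuple[str, int, int]   # (run_id, first_world, last_world)


class StagingStore:
    """Dictionary-backed stand-in for the staging table (hypothetical)."""

    def __init__(self) -> None:
        self._slices: Dict[ChunkKey, List[dict]] = {}

    def write_chunk(self, key: ChunkKey, rows: List[dict]) -> bool:
        """Store a chunk's rows once; a repeated delivery is ignored and returns False."""
        if key in self._slices:
            return False
        self._slices[key] = list(rows)
        return True

    def row_count(self) -> int:
        return sum(len(rows) for rows in self._slices.values())


store = StagingStore()
key = ("run-001", 0, 99)
rows = [{"world": w, "outcome": "placeholder"} for w in range(100)]

first_write = store.write_chunk(key, rows)   # original delivery
retry_write = store.write_chunk(key, rows)   # simulated queue retry

print(first_write, retry_write, store.row_count())
```

The retried message leaves the row count at 100. In the real system the check-and-insert would have to be atomic, or the database would have to provide the equivalent guarantee; which mechanism to use is an open design choice, not something this sketch settles.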
Next steps — in order
1. Confirm the parallelism factor at 10K (laptop run in progress). Without this, the 490-core number is an extrapolation from 1K.
2. Stand up a ClickHouse Cloud cluster and re-measure 1K / 5K against it. Removes the Docker-on-laptop confound and gives real concurrent-insert behaviour.
3. Prototype the Functions fan-out at small scale: 10K worlds via 100 queue messages. Validates cold-start budget, idempotency, and end-to-end orchestration without paying for a full 100K burst.
4. Scale to a single 100K burst and time it. This is the go / no-go for the architecture.
5. Commercial decision: does the product actually need <60s, or is 2–3 minutes acceptable? A 2-minute target halves the required concurrency (and the ClickHouse write rate) on the same architecture. Per-run Functions cost stays roughly the same, because consumption billing tracks total compute rather than peak instance count.
Estimated timeline to a working 100K-in-60s proof: 2–3 weeks of engineering once the ClickHouse Cloud cluster is up. Most of that is orchestration and idempotency hardening, not simulation code.
What we need from exec
- Confirmation that <60s is the real target (not an aspirational ceiling).
- Sign-off on a ClickHouse Cloud spend to unblock steps 2–4 above.
- Agreement on scope: is this Outcome Context (OC) only, which is deliverable in 60s, or full-pipeline with play-by-play (PBP)? The latter needs a fundamentally different architecture; sub-60s with PBP is not currently feasible.