100K Worlds in Under 60 Seconds — Problem Statement¶

Date: 2026-04-18 Audience: Executive leadership Author: Storage Experiment team, LBS-1183

The ask¶

Run a full 100,000-world season simulation for a single sport, end-to-end, in under 60 seconds.

The math (one slide)¶

We have measured this on a laptop. The cost is dominated by pure simulation CPU — the storage/database layer is no longer the bottleneck.

Measurement	Value
Current laptop (4-core parallel, OC-only)	106 seconds for 1,000 worlds
Pure simulation CPU per world	~0.3 CPU-seconds
Total CPU work for 100,000 worlds	~29,200 CPU-seconds (≈ 8 CPU-hours)
To finish in 60 seconds	~490 CPUs running concurrently (theoretical minimum)
Realistic target with coordination + I/O overhead	600–800 CPUs

Translation: we need to turn ~8 hours of single-machine work into 1 minute of fleet work. The work itself is embarrassingly parallel — every simulated world is independent — so this is fundamentally a scale-out problem, not a deep optimisation problem.

Why we cannot do this on one machine¶

The largest single-node servers top out at 96–128 cores.
Even a top-end 96-core server = 96 × 60s = 5,760 CPU-seconds of headroom — short of the 29,200 we need by 5×.
Memory bandwidth and I/O become the ceiling before CPU count does.

A single machine cannot hit this target. This is a distributed-compute problem, not a hardware-upgrade problem.

Why this is actually good news¶

The problem has three properties that make it very cloud-friendly:

Embarrassingly parallel — no coordination between workers during simulation.
Bursty — we don't run this continuously, only on demand.
Cheap per run — ~500 vCPU-minutes at spot / consumption pricing is ~$3–7 per 100K run on current Azure list prices.

Standing up a dedicated 500-core cluster and letting it sit idle would cost us far more than paying per-run on bursty infrastructure.

Proposed architecture — Azure Functions + queue (burst pattern)¶

This fits the problem's shape almost perfectly:

Fan-out via queue. A producer splits "simulate 100K worlds" into ~1,000 messages of "simulate worlds N..N+99". Messages land on an Azure Storage Queue (or Service Bus for ordering / dead-letter support).
Azure Functions consume the queue. On Flex Consumption, Functions auto-scale out to hundreds or low thousands of instances — one instance per message. Each instance simulates its 100-world chunk, writes its Outcome Context slice to ClickHouse Cloud, and acknowledges the message.
ClickHouse Cloud absorbs concurrent writes. The existing staging-table + arrayFlatten-merge pattern handles this today at laptop scale; needs a sized cluster (4–8 shards, NVMe) to absorb ~1.6 GB/s aggregate compressed writes from the fleet.
Completion signal triggers the final merge. When the last message is acknowledged, a second function fires the merge across the 288 games and the run is done.

Cost shape¶

Component	Per 100K run	Monthly baseline if idle
Azure Functions (Flex Consumption, ~1,000 × 60s × 4 GB)	~$3.50	$0
ClickHouse Cloud burst (during run)	~$1–3	depends on cluster-down-time policy
Queue / orchestration	<$0.10	~$0
Total per run	~$5–7	$0 (scales to zero)

Key point: cost is per-run and proportional to use. There is no "always-on" infrastructure tax. If we run this 10 times a day it's ~$50/day. If we run it once a month it's ~$7/month.

Risks and caveats¶

Cold start. First-time Function instance spin-up is 3–10 seconds; at 1,000 concurrent instances this can eat 10–20 s of the 60-s budget. Mitigations: always-ready instances on Flex (costs ~$100/month kept warm), or pre-warmers fired 30 s before the run. Not a blocker, but a design input.
ClickHouse concurrent-insert ceiling. 1,000 concurrent writers will stress the cluster. Needs validation against a real Cloud cluster before we commit.
Platform support. .NET 10 on Azure Functions Flex Consumption needs to be confirmed (the prototype is on a preview runtime).
Idempotency. If a queue message is retried, the ClickHouse write must be idempotent. Small design constraint on the write path.
The 1K parallelism factor is not yet validated at 10K / 100K. Currently being measured; if the factor breaks down at larger scale, the per-worker numbers shift.

Next steps — in order¶

Confirm the parallelism factor at 10K (laptop run in progress). Without this, the 490-core number is extrapolation from 1K.
Stand up a ClickHouse Cloud cluster and re-measure 1K / 5K against it. Removes the Docker-on-laptop confound and gives real concurrent-insert behaviour.
Prototype the Functions fan-out at small scale — 10K worlds via 100 queue messages. Validates cold-start budget, idempotency, and end-to-end orchestration without paying for a full 100K burst.
Scale to a single 100K burst and time it. This is the go / no-go for the architecture.
Commercial decision: does the product actually need <60s, or is 2–3 minutes acceptable? Half the cluster, half the cost, same architecture.

Estimated timeline to a working 100K-in-60s proof: 2–3 weeks of engineering once ClickHouse Cloud is standing. Most of that is orchestration + idempotency hardening, not simulation code.

What we need from exec¶

Confirmation that <60s is the real target (not an aspirational ceiling).
Sign-off on a ClickHouse Cloud spend to unblock steps 2–4 above.
Agreement on scope: is this OC-only (deliverable in 60s) or full-pipeline with play-by-play (fundamentally different architecture — sub-60s with PBP is not currently feasible)?