Redis as an Outcome-Context store: deferred evaluation notes
Status: not evaluated in code, not benchmarked. These notes capture the reasoning from an architecture discussion during the storage-experiment work so the decision can be revisited with context rather than re-derived from scratch.
Current primary store: ClickHouse Cloud Production tier (validated 100K full-pipeline end-to-end in 2h 58m — see status-summary.md).
Question
Could Redis be faster than ClickHouse for the Outcome Context table, on either the read side or the write side?
OC workload shape
Numbers from the validated 100K run:
- Rows per season: 244,800 (288 games × ~850 outcomes per game)
- Row shape: `Array(Float64)` of length `worldCount`; at 100K that's 800 KB per row
- Total size at 100K: ~196 GB uncompressed, ~28 GB compressed (7.3× on ClickHouse with Delta+ZSTD)
- Reads:
    - Basket (1–30 outcomes of a single game): the common case for a user's prediction. Measured 8–20 ms on ClickHouse at 10K scale.
    - Full-context (all ~850 outcomes for a game): rare; ~200 ms at 1K scale.
- Writes:
    - Staging + merge pattern from the streaming orchestrator. 100K OC write = 328 s staging bulk-copies + 137 s server-side merge on Cloud.
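- Read latency target (spec): 1 s at 100K. Both candidates are expected to clear this comfortably for the common basket read (the rare full-context read on ClickHouse is projected at ~3–15 s at 100K), so "faster" isn't the primary decision driver.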
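The sizes in this list follow from a few multiplications. A minimal sketch of that arithmetic, using only the figures quoted above (the constant names are ours, and GB means decimal gigabytes here):

```python
# Sizing arithmetic for the Outcome Context table, from the figures in these notes.
GAMES_PER_SEASON = 288
OUTCOMES_PER_GAME = 850        # approximate per-game average
WORLD_COUNT = 100_000          # the "100K" scale
BYTES_PER_FLOAT64 = 8
COMPRESSED_GB = 28             # measured on ClickHouse, copied from the notes

rows = GAMES_PER_SEASON * OUTCOMES_PER_GAME
row_bytes = WORLD_COUNT * BYTES_PER_FLOAT64
total_gb = rows * row_bytes / 1e9

# Payload a reader pulls back, before any compression.
basket_mb = 30 * row_bytes / 1e6
full_context_mb = OUTCOMES_PER_GAME * row_bytes / 1e6

print(f"rows per season:        {rows:,}")
print(f"bytes per row:          {row_bytes:,}")
print(f"uncompressed total:     {total_gb:.1f} GB")
print(f"30-outcome basket:      {basket_mb:.0f} MB")
print(f"full context, one game: {full_context_mb:.0f} MB")
print(f"implied compression:    {total_gb / COMPRESSED_GB:.1f}x")
```

The implied ratio is computed from the rounded ~196 GB and ~28 GB figures, so it lands slightly below the 7.3× quoted above. The 680 MB full-context payload is the same number that appears later as the per-game buffer size.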
Read-side comparison
| | ClickHouse | Redis |
|---|---|---|
| Point-read latency | ~10 ms (measured) | ~1 ms (estimated, point-lookup on key) |
| Basket of 30 outcomes | ~10–20 ms | ~3–5 ms (pipelined GET) |
| Full-context (850 outcomes) | ~200 ms at 1K, projected ~3–15 s at 100K | ~100 ms (pipelined MGET + wire) |
| Range/analytical queries | full SQL | not supported (pure KV) |
| Cold-data access | seamless (just reads from disk) | evict-from-RAM or N/A |
Raw latency: Redis is an estimated ~5–10× faster on point lookups. Both stores are well under the 1-second spec target for point and basket reads, though, so latency alone doesn't justify the change.
Write-side comparison
The write side has three components: network transfer, server-side persistence, and the shape transpose from sim output (world-by-world) to storage output (game-by-game).
Network + server persistence
| | ClickHouse | Redis |
|---|---|---|
| Compression on write | ~7× (Delta+ZSTD on Float64) | none by default (can be added client-side) |
| Wire data at 100K | ~28 GB | ~196 GB |
| Wire time at 10 Gbps intra-region | ~22 s | ~157 s |
| Server-side cost | MergeTree part creation + compression | RAM write, near-instant |
| Measured 100K OC write (client + server) | 328 s | not measured |
Even if Redis's server-side cost is effectively zero, sending the data uncompressed puts ~7× more bytes on the wire than CH (~196 GB vs ~28 GB). Total write time is estimated to come out roughly similar once you account for everything; Redis doesn't obviously win on pure write throughput at this data shape.
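The wire-time rows in the table are ideal-case transfer times. A small sketch of how they are derived, assuming a fully saturated 10 Gbps link with no protocol overhead (the function name is ours):

```python
def wire_seconds(gigabytes: float, link_gbps: float = 10.0) -> float:
    """Ideal transfer time: decimal GB converted to gigabits, divided by link rate."""
    return gigabytes * 8 / link_gbps

CH_WIRE_GB = 28       # compressed payload sent to ClickHouse
REDIS_WIRE_GB = 196   # uncompressed payload sent to Redis

ch_wire_s = wire_seconds(CH_WIRE_GB)
redis_wire_s = wire_seconds(REDIS_WIRE_GB)

print(f"ClickHouse wire time: {ch_wire_s:.1f} s")
print(f"Redis wire time:      {redis_wire_s:.1f} s")
print(f"ratio:                {redis_wire_s / ch_wire_s:.1f}x")
```

These are lower bounds, not predictions. The measured 328 s on ClickHouse also includes client-side work and part creation, and the Redis side has not been measured at all.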
The shape transpose: an architectural constant
Simulation produces world-by-world data (one world simulates all 288 games). Storage wants per-outcome-across-worlds arrays (what readers consume). Something, somewhere, has to pivot the matrix. There are three places the transpose can happen, each with hard tradeoffs:
| Strategy | Where transpose happens | Memory cost | Works on CH? | Works on Redis? |
|---|---|---|---|---|
| Chunked staging + server merge (current) | Server-side arrayFlatten in CH | ~47 GB client peak at 100K | Yes (137 s merge) | Yes, but needs custom Lua / Redis Stack |
| Direct final write, per-game buffer | Client-side, per-game | 680 MB × in-flight games (up to 196 GB) | OOMs at 100K | OOMs at 100K |
| Skip merge, store per-chunk | Read-side reassembles | low write cost | kills read latency (N GETs per outcome) | kills read latency |
The 137 s merge isn't a ClickHouse penalty — it's a transpose cost. Switching to Redis doesn't eliminate it, just moves it elsewhere.
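To make the pivot concrete, here is a toy in-memory sketch of the chunked staging + merge shape. It is not the orchestrator's code: `simulate_world`, `stage_chunk`, and `merge_chunks` are hypothetical names, plain dicts stand in for the staging and final tables, and the sizes are tiny so it runs instantly.

```python
from typing import Dict, List, Tuple

Key = Tuple[int, int]  # (game_id, outcome_id)

def simulate_world(world: int, games: int, outcomes: int) -> Dict[Key, float]:
    """Stand-in simulator: one world yields one value for every outcome of every game."""
    return {(g, o): float(world * 1000 + g * 10 + o)
            for g in range(games) for o in range(outcomes)}

def stage_chunk(worlds: range, games: int, outcomes: int) -> Dict[Key, List[float]]:
    """Pivot one chunk of worlds into partial per-outcome arrays (the staging rows)."""
    chunk: Dict[Key, List[float]] = {}
    for w in worlds:
        for key, value in simulate_world(w, games, outcomes).items():
            chunk.setdefault(key, []).append(value)
    return chunk

def merge_chunks(chunks: List[Dict[Key, List[float]]]) -> Dict[Key, List[float]]:
    """Concatenate the partial arrays in chunk order to get one full array per outcome."""
    merged: Dict[Key, List[float]] = {}
    for chunk in chunks:
        for key, part in chunk.items():
            merged.setdefault(key, []).extend(part)
    return merged

WORLDS, CHUNK, GAMES, OUTCOMES = 8, 3, 2, 3
chunks = [stage_chunk(range(start, min(start + CHUNK, WORLDS)), GAMES, OUTCOMES)
          for start in range(0, WORLDS, CHUNK)]
context = merge_chunks(chunks)

# Every (game, outcome) key now holds one value per world, in world order.
print(len(context), "rows,", len(context[(1, 2)]), "worlds per row")
```

Client memory is bounded by the chunk size rather than by the world count, which is the point of staging. The merge step still has to touch every value once, whichever store ends up executing it.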
Storage cost at 100K
| | ClickHouse | Redis |
|---|---|---|
| Working set on disk / in RAM | ~28 GB compressed on disk | ~196 GB in RAM |
| Hosting cost | Cloud idle-pauses after 15 min; per-run cost ~$4–5 when active | Must stay resident; at AWS ElastiCache r6g.16xlarge (400 GB RAM) ≈ ~$3,500–5,000/month standing cost |
| Seasons retained | many seasons on the same cluster at trivial storage cost | each retained season × ~196 GB RAM × $ |
For the anticipated usage pattern (hot season + many cold seasons), the Redis option is ~50–100× more expensive to keep the same data available.
Where Redis could still make sense
Redis is a strong fit as a hot-path cache in front of ClickHouse, not a replacement:
- If observation in production shows that a small percentage of games get the majority of reads, cache those `(game_id, outcome_id) → Array(Float64)` pairs in Redis.
- Reads get Redis latency for the hot path, ClickHouse storage economics for the cold path.
- Standard "cache-aside" pattern — miss on Redis falls through to CH, populates Redis.
- Cache invalidation is trivial: when a run finishes and emits a new context_version, invalidate all keys for that season.
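The cache-aside flow described in this list can be sketched as follows. This is an illustration under stated assumptions, not a proposed implementation: `ContextCache` is a hypothetical class, a dict stands in for Redis, and `load_from_store` stands in for a ClickHouse query. Embedding `context_version` in the key is one possible way to do the invalidation, chosen here for simplicity.

```python
from typing import Callable, Dict, List, Tuple

class ContextCache:
    """Cache-aside reader: a dict plays the role of Redis, a callable plays ClickHouse."""

    def __init__(self, load_from_store: Callable[[int, int], List[float]]) -> None:
        self._load = load_from_store
        self._cache: Dict[str, List[float]] = {}
        self.context_version = 1
        self.hits = 0
        self.misses = 0

    def _key(self, game_id: int, outcome_id: int) -> str:
        # The version is part of the key, so stale entries can never be read.
        return f"oc:v{self.context_version}:{game_id}:{outcome_id}"

    def get(self, game_id: int, outcome_id: int) -> List[float]:
        key = self._key(game_id, outcome_id)
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        # Miss: fall through to the primary store, then populate the cache.
        self.misses += 1
        values = self._load(game_id, outcome_id)
        self._cache[key] = values
        return values

    def publish_new_version(self) -> None:
        """Called when a run finishes: bump the version and drop all cached entries."""
        self.context_version += 1
        self._cache.clear()

store_calls: List[Tuple[int, int]] = []

def fake_store(game_id: int, outcome_id: int) -> List[float]:
    store_calls.append((game_id, outcome_id))
    return [0.1, 0.2, 0.3]

reader = ContextCache(fake_store)
reader.get(7, 42)             # miss, loads from the store
reader.get(7, 42)             # hit, served from the cache
reader.publish_new_version()
reader.get(7, 42)             # miss again after invalidation
print(reader.hits, reader.misses, len(store_calls))
```

With a real Redis the cached value would be a serialized byte string rather than a Python list, and at 800 KB per entry the cache size limit and eviction policy would need to be chosen deliberately.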
This decision should come after we see real read traffic patterns in production, not as a preemptive design choice.
Summary: why we didn't evaluate further
- Latency isn't driving the decision: both candidates are sub-target for the common basket read.
- Write-side is not obviously faster — Redis's no-compression cost on the wire approximately cancels its zero server-side cost.
- Transpose isn't a DB choice — the ~137 s merge step would exist in some form with Redis too.
- Storage cost is ~50–100× higher at our data volumes.
- Analytical queries (aggregations, cross-world/cross-game queries) are free in CH, impossible in Redis.
- Validated end-to-end on CH at 100K — no equivalent validation effort has been spent on Redis, and this experiment scoped a single backend.
If we ever re-evaluate: do it after production traffic patterns are known, specifically look at the cache-layer pattern, and put a dollar figure on the incremental read latency benefit vs the incremental hosting cost.