Redis as an Outcome-Context store: deferred evaluation notes

Status: not evaluated in code, not benchmarked. These notes capture the reasoning from an architecture discussion during the storage-experiment work so the decision can be revisited with context rather than re-derived from scratch.

Current primary store: ClickHouse Cloud Production tier (validated 100K full-pipeline end-to-end in 2h 58m — see status-summary.md).

Question

Could Redis be faster than ClickHouse for the Outcome Context table, on either the read side or the write side?

OC workload shape

Numbers from the validated 100K run:

- Rows per season: 244,800 (288 games × ~850 outcomes per game)
- Row shape: Array(Float64) of length worldCount; at 100K worlds that is 800 KB per row
- Total size at 100K: ~196 GB uncompressed, ~28 GB compressed (7.3× on ClickHouse with Delta+ZSTD)
- Reads:
  - Basket (1–30 outcomes of a single game): the common case for a user's prediction. Measured 8–20 ms on ClickHouse at 10K scale.
  - Full-context (all ~850 outcomes for a game): rare; ~200 ms at 1K scale.
- Writes:
  - Staging + merge pattern from the streaming orchestrator. 100K OC write = 328 s staging bulk-copies + 137 s server-side merge on Cloud.
- Read latency target (spec): 1 s at 100K. For the common basket read, both candidates are comfortably under this, so "faster" isn't the primary decision driver. The rare full-context read is the exception: ClickHouse is projected at ~3–15 s at 100K.
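The sizing figures above follow from simple arithmetic. The sketch below reproduces them, assuming decimal units (1 GB = 1e9 bytes) and the rounded inputs quoted in the list; the constant names are illustrative.

```python
# Back-of-envelope sizing for the Outcome Context (OC) table.
GAMES_PER_SEASON = 288
OUTCOMES_PER_GAME = 850        # approximate, per the notes
WORLD_COUNT = 100_000
BYTES_PER_FLOAT64 = 8
COMPRESSED_GB = 28             # rounded figure reported for ClickHouse

rows = GAMES_PER_SEASON * OUTCOMES_PER_GAME
row_bytes = WORLD_COUNT * BYTES_PER_FLOAT64
total_gb = rows * row_bytes / 1e9          # decimal GB
ratio = total_gb / COMPRESSED_GB

print(f"rows per season:   {rows:,}")
print(f"row size:          {row_bytes / 1e3:.0f} KB")
print(f"uncompressed:      {total_gb:.1f} GB")
print(f"compression ratio: {ratio:.1f}x")
```

With these rounded inputs the ratio comes out at about 7.0×. The 7.3× quoted above presumably comes from unrounded measurements.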

Read-side comparison

| Metric | ClickHouse | Redis |
|---|---|---|
| Point-read latency | ~10 ms (measured) | ~1 ms (estimated, point lookup on key) |
| Basket of 30 outcomes | ~10–20 ms | ~3–5 ms (estimated, pipelined GET) |
| Full-context (850 outcomes) | ~200 ms at 1K, projected ~3–15 s at 100K | ~100 ms (estimated, pipelined MGET + wire; scale not stated, and at 100K the 680 MB payload alone needs ~0.5 s at 10 Gbps) |
| Range/analytical queries | full SQL | not supported (pure KV) |
| Cold-data access | seamless (reads from disk) | evicted from RAM, or not available |

Raw latency: Redis is an estimated ~5–10× faster on point lookups. But both stores are well under the 1-second spec target for point and basket reads, so latency alone doesn't justify the change.
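For reference, a minimal sketch of what the Redis side of a basket read might look like, assuming one key per (game, outcome) row that holds the per-world vector as raw float64 bytes. The key scheme `oc:{season}:{game}:{outcome}`, the helper names, and the encoding are all hypothetical; nothing like this was built or benchmarked. `client` is expected to behave like a redis-py `Redis` object (without `decode_responses`), whose `pipeline()` queues `get` calls and returns the results from `execute()`.

```python
from array import array


def oc_key(season: str, game_id: int, outcome_id: int) -> str:
    """Hypothetical key scheme: one key per (game, outcome) row."""
    return f"oc:{season}:{game_id}:{outcome_id}"


def encode_row(values) -> bytes:
    """Pack a per-world vector as raw float64 bytes (native byte order)."""
    return array("d", values).tobytes()


def decode_row(blob: bytes) -> list:
    """Inverse of encode_row."""
    out = array("d")
    out.frombytes(blob)
    return out.tolist()


def read_basket(client, season, game_id, outcome_ids):
    """Fetch a basket in one round trip by pipelining one GET per outcome.

    Missing keys come back as None, mirroring GET on an absent key.
    """
    pipe = client.pipeline()
    for outcome_id in outcome_ids:
        pipe.get(oc_key(season, game_id, outcome_id))
    blobs = pipe.execute()
    return {
        oid: (decode_row(blob) if blob is not None else None)
        for oid, blob in zip(outcome_ids, blobs)
    }
```

One round trip per basket is what the ~3–5 ms estimate relies on. At 100K worlds a 30-outcome basket is 24 MB, so transfer time would likely dominate over Redis's lookup cost.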

Write-side comparison

The write side has three components: network transfer, server-side persistence, and the shape transpose from sim output (world-by-world) to storage layout (per game and outcome, across worlds).

Network + server persistence

| Metric | ClickHouse | Redis |
|---|---|---|
| Compression on write | ~7× (Delta+ZSTD on Float64) | none by default (can be added client-side) |
| Wire data at 100K | ~28 GB | ~196 GB |
| Wire time at 10 Gbps intra-region | ~22 s | ~157 s |
| Server-side cost | MergeTree part creation + compression | RAM write, near-instant |
| Measured 100K OC write (client + server) | 328 s (staging bulk-copies; the 137 s server-side merge is separate) | not measured |

Even if Redis's server-side cost is effectively zero, sending uncompressed data means ~7× more bytes on the wire than CH (~157 s vs ~22 s at 10 Gbps). No Redis write was measured, so the working estimate is that total write time comes out roughly similar; Redis doesn't obviously win on pure write throughput at this data shape.
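The wire-time rows in the table are ideal line-rate figures. A short sketch of the calculation, assuming decimal GB, a fully utilized 10 Gbps link, and (as the table does) that the ClickHouse wire payload matches the compressed size:

```python
def wire_seconds(gigabytes: float, link_gbps: float = 10.0) -> float:
    """Ideal transfer time: decimal GB converted to gigabits, over link rate."""
    return gigabytes * 8 / link_gbps


ch_wire = wire_seconds(28)       # compressed payload, per the table
redis_wire = wire_seconds(196)   # uncompressed payload

print(f"ClickHouse wire time:  {ch_wire:.1f} s")
print(f"Redis wire time:       {redis_wire:.1f} s")
print(f"extra time for Redis:  {redis_wire - ch_wire:.1f} s")
```

These are lower bounds; protocol overhead and client-side serialization would add to both.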

The shape transpose: an architectural constant

Simulation produces world-by-world data (one world simulates all 288 games). Storage wants per-outcome-across-worlds arrays (what readers consume). Something, somewhere, has to pivot the matrix. Three places the transpose can happen, each with hard tradeoffs:

| Strategy | Where transpose happens | Memory cost | Works on CH? | Works on Redis? |
|---|---|---|---|---|
| Chunked staging + server merge (current) | Server-side `arrayFlatten` in CH | ~47 GB client peak at 100K | Yes (137 s merge) | Yes, but needs custom Lua / Redis Stack |
| Direct final write, per-game buffer | Client-side, per-game | 680 MB × in-flight games (up to 196 GB) | OOMs at 100K | OOMs at 100K |
| Skip merge, store per-chunk | Read-side reassembles | low write cost | kills read latency (N GETs per outcome) | kills read latency |

The 137 s merge isn't a ClickHouse penalty — it's a transpose cost. Switching to Redis doesn't eliminate it, just moves it elsewhere.
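To make the transpose concrete, here is a toy sketch of the chunked staging + merge idea in plain Python. It is an illustration under simplified assumptions (in-memory dicts standing in for the staging table, hypothetical function names), not the orchestrator's actual code. Each chunk of worlds is sliced into per-outcome pieces, and the merge concatenates the pieces in chunk order.

```python
from collections import defaultdict


def stage_chunk(staging, chunk_index, world_rows):
    """Write one chunk of simulated worlds into staging.

    world_rows: list of worlds, each a list with one value per outcome.
    Staging keeps, per outcome, this chunk's slice of the per-world vector.
    """
    if not world_rows:
        return
    n_outcomes = len(world_rows[0])
    for outcome_id in range(n_outcomes):
        piece = [world[outcome_id] for world in world_rows]  # column slice
        staging[outcome_id].append((chunk_index, piece))


def merge(staging):
    """Concatenate each outcome's pieces in chunk order (the merge step)."""
    final = {}
    for outcome_id, pieces in staging.items():
        ordered = sorted(pieces, key=lambda p: p[0])
        final[outcome_id] = [v for _, piece in ordered for v in piece]
    return final


# Toy run: 4 worlds x 3 outcomes, simulated in two chunks of 2 worlds.
worlds = [[w * 10 + o for o in range(3)] for w in range(4)]
staging = defaultdict(list)
stage_chunk(staging, 1, worlds[2:4])   # chunks may arrive out of order
stage_chunk(staging, 0, worlds[0:2])
final = merge(staging)
print(final[1])   # [1, 11, 21, 31]
```

The client only ever holds one chunk at a time, which is why peak memory stays bounded. The concatenation has to run somewhere regardless of which store holds the pieces.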

Storage cost at 100K

| Metric | ClickHouse | Redis |
|---|---|---|
| Working set on disk / in RAM | ~28 GB compressed on disk | ~196 GB in RAM |
| Hosting cost | Cloud idle-pauses after 15 min; per-run cost ~$4–5 when active | Must stay resident; an AWS ElastiCache cache.r6g.16xlarge node (~400 GB RAM) is ~$3,500–5,000/month standing cost |
| Seasons retained | many seasons on the same cluster at trivial storage cost | each retained season adds ~196 GB of RAM, billed continuously |

For the anticipated usage pattern (hot season + many cold seasons), the Redis option is ~50–100× more expensive to keep the same data available.
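A rough sketch of how the footprint scales with retained seasons, using only the per-season sizes and node size quoted above. It deliberately ignores Redis per-key overhead, replication, and memory headroom, so the node counts are optimistic lower bounds; the function name is illustrative.

```python
import math

UNCOMPRESSED_GB = 196   # per season, resident in Redis RAM
COMPRESSED_GB = 28      # per season, on ClickHouse disk
NODE_RAM_GB = 400       # approximate RAM of the node size quoted above


def footprint(seasons: int) -> dict:
    """Storage needed to keep `seasons` seasons available in each store."""
    ram = seasons * UNCOMPRESSED_GB
    return {
        "redis_ram_gb": ram,
        "redis_nodes": math.ceil(ram / NODE_RAM_GB),
        "clickhouse_disk_gb": seasons * COMPRESSED_GB,
    }


for n in (1, 2, 5):
    print(n, footprint(n))
```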

Where Redis could still make sense

Redis is a strong fit as a hot-path cache in front of ClickHouse, not a replacement:

- If observation in production shows that a small percentage of games get the majority of reads, cache those (game_id, outcome_id) → Array(Float64) pairs in Redis.
- Reads get Redis latency for the hot path, ClickHouse storage economics for the cold path.
- Standard cache-aside pattern: a miss on Redis falls through to CH, then populates Redis.
- Cache invalidation is trivial: when a run finishes and emits a new context_version, invalidate all keys for that season.

This decision should come after we see real read traffic patterns in production, not as a preemptive design choice.
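If the cache layer is ever prototyped, the cache-aside read and season-wide invalidation could look roughly like the sketch below. The key layout, function names, and the `load_from_clickhouse` callable are hypothetical. `cache` is assumed to expose redis-py style `get`, `set`, `scan_iter(match=...)`, and `delete(*names)`.

```python
def cache_key(season: str, game_id: int, outcome_id: int) -> str:
    """Hypothetical key layout; the season prefix makes invalidation easy."""
    return f"oc:{season}:{game_id}:{outcome_id}"


def get_outcome(cache, load_from_clickhouse, season, game_id, outcome_id):
    """Cache-aside read: try Redis first, fall through to ClickHouse on a miss.

    load_from_clickhouse: hypothetical callable returning the row as bytes,
    or None if the row does not exist.
    """
    key = cache_key(season, game_id, outcome_id)
    blob = cache.get(key)
    if blob is None:                        # miss: read from the primary store
        blob = load_from_clickhouse(season, game_id, outcome_id)
        if blob is not None:
            cache.set(key, blob)            # populate for later readers
    return blob


def invalidate_season(cache, season: str) -> int:
    """Drop every cached row for a season, e.g. on a new context_version."""
    keys = list(cache.scan_iter(match=f"oc:{season}:*"))
    if keys:
        cache.delete(*keys)
    return len(keys)
```

Scanning by prefix is simple but walks the keyspace. Embedding the context_version in the key and letting old entries expire would be an alternative worth weighing at that point.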

Summary: why we didn't evaluate further

- Latency isn't driving the decision: for the common basket read, both candidates are well under the target.
- Write side is not obviously faster: Redis's uncompressed wire cost approximately cancels its near-zero server-side cost.
- Transpose isn't a DB choice: the ~137 s merge step would exist in some form with Redis too.
- Storage cost is ~50–100× higher at our data volumes.
- Analytical queries (aggregations, cross-world/cross-game queries) come for free in CH and are not supported by Redis as a pure KV store.
- Validated end-to-end on CH at 100K: no equivalent validation effort has been spent on Redis, and this experiment scoped a single backend.

If we ever re-evaluate: do it after production traffic patterns are known, specifically look at the cache-layer pattern, and put a dollar figure on the incremental read latency benefit vs the incremental hosting cost.