Redis as an Outcome-Context store — deferred evaluation notes

Status: not evaluated in code, not benchmarked. These notes capture the reasoning from an architecture discussion during the storage-experiment work so the decision can be revisited with context rather than re-derived from scratch.

Current primary store: ClickHouse Cloud Production tier (validated 100K full-pipeline end-to-end in 2h 58m — see status-summary.md).

Question

Could Redis be faster than ClickHouse for the Outcome Context table, on either the read side or the write side?

OC workload shape

Numbers from the validated 100K run:

  • Rows per season: 244,800 (288 games × ~850 outcomes per game)
  • Row shape: Array(Float64) of length worldCount — at 100K that's 800 KB per row
  • Total size at 100K: ~196 GB uncompressed, ~28 GB compressed (7.3× on ClickHouse with Delta+ZSTD)
  • Reads:
      • Basket (1–30 outcomes of a single game): the common case for a user's prediction. Measured 8–20 ms on ClickHouse at 10K scale.
      • Full-context (all ~850 outcomes for a game): rare; ~200 ms at 1K scale.
  • Writes:
      • Staging + merge pattern from the streaming orchestrator. 100K OC write = 328 s staging bulk-copies + 137 s server-side merge on Cloud.
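  • Read latency target (spec): 1 s at 100K. Both candidates are expected to beat this comfortably on the common basket read (the rare full-context read on ClickHouse is the exception at 100K), so "faster" isn't the primary decision driver.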
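
The sizing figures above follow from simple arithmetic. The sketch below reproduces them, assuming decimal units (1 KB = 1,000 bytes, 1 GB = 10^9 bytes) and the approximate figure of 850 outcomes per game; the constant names are illustrative only.

```python
# Reproduce the OC sizing numbers from the 100K run (decimal units assumed).
GAMES_PER_SEASON = 288
OUTCOMES_PER_GAME = 850      # approximate, per the notes
WORLD_COUNT = 100_000
BYTES_PER_FLOAT64 = 8

rows_per_season = GAMES_PER_SEASON * OUTCOMES_PER_GAME
row_bytes = WORLD_COUNT * BYTES_PER_FLOAT64      # one Array(Float64) per row
total_bytes = rows_per_season * row_bytes

print(f"rows per season:    {rows_per_season:,}")          # 244,800
print(f"row size:           {row_bytes / 1e3:.0f} KB")      # 800 KB
print(f"uncompressed total: {total_bytes / 1e9:.1f} GB")    # 195.8 GB
```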

Read-side comparison

| | ClickHouse | Redis |
|---|---|---|
| Point-read latency | ~10 ms (measured) | ~1 ms (estimated, point-lookup on key) |
| Basket of 30 outcomes | ~10–20 ms | ~3–5 ms (pipelined GET) |
| Full-context (850 outcomes) | ~200 ms at 1K, projected ~3–15 s at 100K | ~100 ms (pipelined MGET + wire) |
| Range/analytical queries | full SQL | not supported (pure KV) |
| Cold-data access | seamless (just reads from disk) | evict-from-RAM or N/A |

Raw latency: Redis wins by ~5–10× on point-lookups. But both are well under the 1-second spec target, so latency alone doesn't justify the change.
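
For reference, a pipelined basket read on the Redis side might look like the sketch below. Nothing here was built or measured. The key schema (`oc:{game_id}:{outcome_id}`), the packed little-endian float64 value encoding, and every function name are assumptions made for illustration. `read_basket` relies only on the `pipeline()`, `get()`, and `execute()` calls that redis-py exposes, and `StubClient` is an in-memory stand-in so the example runs without a server.

```python
import struct
from typing import Dict, List, Optional, Sequence


def oc_key(game_id: int, outcome_id: int) -> str:
    """Hypothetical key schema for one outcome's per-world array."""
    return f"oc:{game_id}:{outcome_id}"


def pack_worlds(values: Sequence[float]) -> bytes:
    """Encode a per-world array as packed little-endian float64 (assumed encoding)."""
    return struct.pack(f"<{len(values)}d", *values)


def unpack_worlds(blob: bytes) -> List[float]:
    return list(struct.unpack(f"<{len(blob) // 8}d", blob))


def read_basket(client, game_id: int, outcome_ids: Sequence[int]) -> Dict[int, Optional[List[float]]]:
    """Fetch all outcomes of a basket in one round trip via a pipeline."""
    pipe = client.pipeline()
    for outcome_id in outcome_ids:
        pipe.get(oc_key(game_id, outcome_id))
    blobs = pipe.execute()  # one result per queued GET; None for a missing key
    return {
        oid: (unpack_worlds(blob) if blob is not None else None)
        for oid, blob in zip(outcome_ids, blobs)
    }


class _StubPipeline:
    """Minimal stand-in that mimics the queue-then-execute pipeline shape."""

    def __init__(self, store: Dict[str, bytes]):
        self._store, self._keys = store, []

    def get(self, key: str) -> "_StubPipeline":
        self._keys.append(key)
        return self

    def execute(self) -> List[Optional[bytes]]:
        results = [self._store.get(k) for k in self._keys]
        self._keys = []
        return results


class StubClient:
    def __init__(self) -> None:
        self.store: Dict[str, bytes] = {}

    def set(self, key: str, value: bytes) -> None:
        self.store[key] = value

    def pipeline(self) -> _StubPipeline:
        return _StubPipeline(self.store)


client = StubClient()
client.set(oc_key(7, 1), pack_worlds([0.25, 0.5]))
print(read_basket(client, 7, [1, 2]))  # {1: [0.25, 0.5], 2: None}
```

At 100K worlds each value would be an 800 KB blob, so a 30-outcome basket moves about 24 MB over the wire regardless of how the GETs are batched.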

Write-side comparison

The write side has three components: network transfer, server-side persistence, and the shape transpose from sim output (world-by-world) to storage output (game-by-game).

Network + server persistence

| | ClickHouse | Redis |
|---|---|---|
| Compression on write | ~7× (Delta+ZSTD on Float64) | none by default (can be added client-side) |
| Wire data at 100K | ~28 GB | ~196 GB |
| Wire time at 10 Gbps intra-region | ~22 s | ~157 s |
| Server-side cost | MergeTree part creation + compression | RAM write, near-instant |
| Measured 100K OC write (client + server) | 328 s | not measured |

Even if Redis's server-side cost is effectively zero, sending uncompressed data means transferring ~7× more bytes than CH. Total write time is therefore expected to come out roughly similar once everything is accounted for (the Redis side was not measured); Redis doesn't obviously win on pure write throughput at this data shape.
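
The wire-time figures in the table are ideal line-rate estimates. A minimal sketch of that arithmetic, assuming a fully utilized 10 Gbps link, decimal gigabytes, and no protocol overhead (the function name is illustrative):

```python
def wire_seconds(gigabytes: float, link_gbps: float = 10.0) -> float:
    """Ideal transfer time: bytes -> bits, divided by link rate. No overhead modeled."""
    return gigabytes * 8 / link_gbps


ch_seconds = wire_seconds(28)       # compressed payload sent to ClickHouse
redis_seconds = wire_seconds(196)   # uncompressed payload sent to Redis

print(f"ClickHouse: {ch_seconds:.1f} s")               # 22.4 s
print(f"Redis:      {redis_seconds:.1f} s")            # 156.8 s
print(f"ratio:      {redis_seconds / ch_seconds:.1f}x")  # 7.0x
```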

The shape transpose — an architectural constant

Simulation produces world-by-world data (one world simulates all 288 games). Storage wants per-outcome-across-worlds arrays (what readers consume). Something, somewhere, has to pivot the matrix. Three places the transpose can happen, each with hard tradeoffs:

| Strategy | Where transpose happens | Memory cost | Works on CH? | Works on Redis? |
|---|---|---|---|---|
| Chunked staging + server merge (current) | Server-side arrayFlatten in CH | ~47 GB client peak at 100K | Yes (137 s merge) | Yes, but needs custom Lua / Redis Stack |
| Direct final write, per-game buffer | Client-side, per-game | 680 MB × in-flight games (up to 196 GB) | OOMs at 100K | OOMs at 100K |
| Skip merge, store per-chunk | Read-side reassembles | low write cost | kills read latency (N GETs per outcome) | kills read latency |

The 137 s merge isn't a ClickHouse penalty — it's a transpose cost. Switching to Redis doesn't eliminate it, just moves it elsewhere.
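
To make the shape change concrete, here is a toy pure-Python sketch of the staging + merge idea: each chunk of worlds is pivoted to outcome-major partial arrays, then the partial arrays are concatenated per outcome. This is not the orchestrator's code and not ClickHouse's arrayFlatten; the function names and the tiny 4-world, 3-outcome dataset are made up for illustration.

```python
from typing import List

Chunk = List[List[float]]  # chunk[world][outcome]: world-major, as the sim emits it


def stage_chunk(chunk: Chunk) -> List[List[float]]:
    """Pivot one chunk to outcome-major: result[outcome][world_in_chunk]."""
    return [list(column) for column in zip(*chunk)]


def merge_staged(staged: List[List[List[float]]]) -> List[List[float]]:
    """Concatenate per-outcome partial arrays across chunks, in chunk order."""
    n_outcomes = len(staged[0])
    return [
        [value for part in staged for value in part[outcome]]
        for outcome in range(n_outcomes)
    ]


# Toy data: value for (world w, outcome o) is w * 10 + o, split into 2 chunks of 2 worlds.
chunks = [
    [[0, 1, 2], [10, 11, 12]],
    [[20, 21, 22], [30, 31, 32]],
]
staged = [stage_chunk(c) for c in chunks]
final = merge_staged(staged)
print(final)  # [[0, 10, 20, 30], [1, 11, 21, 31], [2, 12, 22, 32]]
```

Whichever backend holds the data, some step equivalent to `merge_staged` has to run before a reader can fetch one outcome's full per-world array in a single lookup.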

Storage cost at 100K

| | ClickHouse | Redis |
|---|---|---|
| Working set on disk / in RAM | ~28 GB compressed on disk | ~196 GB in RAM |
| Hosting cost | Cloud idle-pauses after 15 min; per-run cost ~$4–5 when active | Must stay resident; at AWS ElastiCache r6g.16xlarge (400 GB RAM) ≈ $3,500–5,000/month standing cost |
| Seasons retained | many seasons on the same cluster at trivial storage cost | each retained season × ~196 GB RAM × $ |

For the anticipated usage pattern (hot season + many cold seasons), the Redis option is ~50–100× more expensive to keep the same data available.

Where Redis could still make sense

Redis is a strong fit as a hot-path cache in front of ClickHouse, not a replacement:

  • If observation in production shows that a small percentage of games get the majority of reads, cache those (game_id, outcome_id) → Array(Float64) pairs in Redis.
  • Reads get Redis latency for the hot path, ClickHouse storage economics for the cold path.
  • Standard "cache-aside" pattern — miss on Redis falls through to CH, populates Redis.
  • Cache invalidation is trivial: when a run finishes and emits a new context_version, invalidate all keys for that season.
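
The cache-aside flow described above can be sketched as follows. This is an illustration under stated assumptions, not a design: a plain dict stands in for Redis, a caller-supplied function stands in for the ClickHouse query, and the class name, the `season` component of the key, and the sample values are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Key = Tuple[str, int, int]  # (season, game_id, outcome_id); season in the key is an assumption


class CacheAsideReader:
    """Cache-aside: try the cache, fall through to the backing store on a miss."""

    def __init__(self, load_from_store: Callable[[str, int, int], List[float]]):
        self._load = load_from_store               # stand-in for a ClickHouse read
        self._cache: Dict[Key, List[float]] = {}   # stand-in for Redis
        self.hits = 0
        self.misses = 0

    def get(self, season: str, game_id: int, outcome_id: int) -> List[float]:
        key = (season, game_id, outcome_id)
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        values = self._load(season, game_id, outcome_id)
        self._cache[key] = values                  # populate on miss
        return values

    def invalidate_season(self, season: str) -> int:
        """Drop every cached key for a season, e.g. when a new context_version lands."""
        stale = [k for k in self._cache if k[0] == season]
        for k in stale:
            del self._cache[k]
        return len(stale)


calls = []


def fake_clickhouse_load(season: str, game_id: int, outcome_id: int) -> List[float]:
    calls.append((season, game_id, outcome_id))
    return [0.1, 0.9]


reader = CacheAsideReader(fake_clickhouse_load)
reader.get("2024", 12, 3)                      # miss: loads from the backing store
reader.get("2024", 12, 3)                      # hit: served from the cache
removed = reader.invalidate_season("2024")
reader.get("2024", 12, 3)                      # miss again after invalidation
print(reader.hits, reader.misses, removed, len(calls))  # 1 2 1 2
```

On a real Redis deployment, dropping all keys for a season would need either a per-season index of keys or a key scan; the dict comprehension here sidesteps that detail.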

This decision should come after we see real read traffic patterns in production, not as a preemptive design choice.

Summary — why we didn't evaluate further

  1. Latency isn't driving the decision — both candidates are sub-target.
  2. Write-side is not obviously faster — Redis's no-compression cost on the wire approximately cancels its zero server-side cost.
  3. Transpose isn't a DB choice — the ~137 s merge step would exist in some form with Redis too.
  4. Storage cost is ~50–100× higher at our data volumes.
  5. Analytical queries (aggregations, cross-world/cross-game queries) come for free in CH; Redis, as a pure KV store, does not support them.
  6. Validated end-to-end on CH at 100K — no equivalent validation effort has been spent on Redis, and this experiment scoped a single backend.

If we ever re-evaluate: do it after production traffic patterns are known, specifically look at the cache-layer pattern, and put a dollar figure on the incremental read latency benefit vs the incremental hosting cost.