Skip to content

ClickHouse schema review — Foundry storage experiment

Reviewed: 2026-04-21 • Schemas: ClickHouseSchemas.cs (formerly at src/Models/AmericanFootball/LBS.Model.AmericanFootball.StorageExperiment/Infrastructure/ClickHouse/ClickHouseSchemas.cs — this file and its project have since been removed)

EMPIRICAL UPDATE (2026-04-21): Changes 1, 2, 4, 5 below were applied in commit 34155f2d and validated with a 100K full-pipeline run. The run regressed catastrophically — deadline-exceeded at 4h with only 2% of worlds processed (baseline: 2h 58m complete). Reverted in commit 150ec156. The likely cause is ClickHouseBulkCopy.InitAsync (which fires a SELECT ... WHERE ?=? metadata query before every insert) getting materially slower in the presence of the computed intDiv(world_id, 10000) partition expression: CH Cloud's system.query_log showed 8,737 selects vs 387 inserts over 5 hours (≈22:1, vs the ≈2:1 ratio on the previous run). The inserts that did reach the server ran at historical speed; the client-library overhead is what killed the run. Lesson: schema changes that look like best-practice improvements in isolation can interact badly with client-library behaviour. Never commit schema changes without a validated run. See streaming-cloud-scale-100000-fullpipeline-schema-review-regressed.txt for the diagnostic detail. The proposals below are kept for reference but flagged with where they were reverted.

Evaluates our four MergeTree tables (game_outcome_context, game_outcome_context_staging, season_outcome_context, play_by_play) plus experiment_results against current ClickHouse best-practice docs. Where we diverge from the docs, each item is flagged as Ship as-is / Test / Change with effort + expected impact.

Methodology

Authoritative source for each guideline: the clickhouse/clickhouse-docs reference material (best-practices section, data-compression, sparse-primary-indexes, partitioning-keys). All proposed changes are listed with their expected impact and validation path — none applied blindly.

Evaluation framework (one row per guideline): - ORDER BY — primary key / sort key shape, cardinality ordering, query alignment - PARTITION BY — partitioning strategy, partition count, insert-block constraints - Column codecs — compression codec choice per column shape - Column types — smallest correct numeric type, LowCardinality thresholds - Skip indexes / projections — for reads we haven't validated yet - Engine choice — MergeTree vs ReplacingMergeTree etc.

Summary of findings

Overall verdict: the schemas are well-designed and close to best practice. Most of the tuning levers give small wins (<20% on their own phase). Two stand out as worth validating:

# Change Table Effort Expected impact Empirical result
1 PARTITION BY (season_id, intDiv(world_id, 10000)) play_by_play 1-line DDL PBP write -20–40% (parallel background merges across 10 partitions instead of 1). REVERTED — the computed partition expression made ClickHouseBulkCopy.InitAsync prohibitively slow, killing throughput 68×.
2 Add per-column CODEC(Delta, ZSTD) to monotonic integer columns play_by_play DDL edit PBP storage -15–30%. REVERTED — applied with #1 in the same run; can't isolate its impact but was rolled back together.
3 Test Gorilla/FPC codecs on values Array(Float64) game_outcome_context, season_outcome_context, staging 2-3 side-by-side schemas + one sample run Currently Delta+ZSTD gives 7.3× ratio; Gorilla or plain ZSTD might do similar or slightly better on random floats. Unknown win, worth one benchmark. Not attempted.
4 Add PARTITION BY season_id to season_outcome_context season_outcome_context 1-line DDL Consistency with other tables; enables DROP PARTITION for retention. No perf impact. REVERTED — bundled with #1, could have stayed in but rolled back for cleanliness. Can re-apply in isolation.
5 Add PARTITION BY toYYYYMM(run_timestamp) to experiment_results experiment_results 1-line DDL Retention management. No perf impact. REVERTED — same as #4.
6 Downsize context_version UInt32 → UInt8; batch_index UInt32 → UInt8 game OC tables 2-line DDL Storage -3 bytes/row. Negligible (~0.1% at 100K). Not worth the schema churn. Not attempted.
7 Add skip indexes for primary_player_id, is_touchdown play_by_play Index DDL + materialize Future read optimisation — only worth doing once real query patterns are known. CH docs specifically warn against speculative skip indexes. Not attempted.

Updated recommendation: don't re-apply #1 without a client-side workaround for the InitAsync cost (e.g., caching the column schema once per run, or switching to a different insert path). #3 is still worth a small side-by-side benchmark. #4 and #5 are harmless consistency fixes that can re-apply in isolation if someone wants them.


Per-table evaluation

1. game_outcome_context

CREATE TABLE game_outcome_context (
    game_id          LowCardinality(String),
    season_id        LowCardinality(String),
    context_version  UInt32,
    outcome_id       LowCardinality(String),
    values           Array(Float64) CODEC(Delta, ZSTD)
)
ENGINE = ReplacingMergeTree(context_version)
ORDER BY (game_id, outcome_id)
PARTITION BY season_id
Aspect Evaluation Grade
Engine choice (ReplacingMergeTree) Correct — enables context_version deduplication semantics for versioned contexts. Yes
ORDER BY (game_id, outcome_id) Ascending cardinality order (288 games × 850 outcomes). Aligns with per-game basket-read pattern (measured 8-20 ms at 1-10K). Yes
PARTITION BY season_id One partition per season. Appropriate for multi-season tenancy + per-season DROP PARTITION. Yes
LowCardinality(String) columns game_id (288), season_id (1-N), outcome_id (~850). All well within the 10K LC threshold. Yes
context_version UInt32 Functional but 3 bytes wider than needed. Most contexts will have versions in the UInt8 range. Minor — see Change #6
values Array(Float64) CODEC(Delta, ZSTD) Delta+ZSTD gave 5.56× → 7.32× measured compression at 1K → 100K scale (good). BUT Delta is designed for monotonic sequences; per-outcome value arrays are effectively random floats (simulation-derived). Gorilla or FPC might do equally well or better; plain ZSTD is worth comparing. Caution — see Change #3

Result: ship as-is. Optionally benchmark codec alternatives for values (Change #3).

2. game_outcome_context_staging

CREATE TABLE game_outcome_context_staging (
    game_id, season_id, context_version, batch_index, outcome_id,
    values Array(Float64) CODEC(Delta, ZSTD)
)
ENGINE = MergeTree()
ORDER BY (game_id, outcome_id, batch_index)
-- no PARTITION BY
Aspect Evaluation Grade
No PARTITION BY Correct for a short-lived staging table. TRUNCATE is used at end of run. Yes
ORDER BY (game_id, outcome_id, batch_index) Matches the merge query (GROUP BY game_id, outcome_id + arraySort(x -> x.1, ... batch_index, values)). Exactly what we want. Yes
batch_index UInt32 Chunks are typically ≤ 200, so UInt8 would do. Trivial saving. Minor
Values codec Same discussion as above — Delta on simulation-derived random floats isn't optimal on paper but compresses well in practice. Caution

Result: ship as-is.

3. season_outcome_context

CREATE TABLE season_outcome_context (
    season_id, context_version, outcome_id,
    values Array(Float64) CODEC(Delta, ZSTD)
)
ENGINE = ReplacingMergeTree(context_version)
ORDER BY (season_id, outcome_id)
-- no PARTITION BY
Aspect Evaluation Grade
No PARTITION BY Inconsistent with game_outcome_context which partitions by season_id. With multiple seasons in the same cluster we'd benefit from PARTITION BY season_id for DROP PARTITION and background-merge isolation. — see Change #4
ORDER BY (season_id, outcome_id) Ascending cardinality; aligns with per-season basket reads. Yes
Engine (ReplacingMergeTree) Correct, same reason as game OC. Yes

Result: add PARTITION BY season_id for consistency. No perf change at current data volumes.

4. play_by_play

CREATE TABLE play_by_play (
    game_id           LowCardinality(String),
    season_id         LowCardinality(String),
    world_id          UInt32,
    week              UInt8,
    home_team         LowCardinality(String),
    ...                 -- 38 more columns
)
ENGINE = MergeTree()
ORDER BY (game_id, world_id, play_sequence)
PARTITION BY season_id
Aspect Evaluation Grade
Smallest correct numeric types Extensive use of UInt8, UInt16, Int8 for bounded-range fields. Yes
LowCardinality(String) 16 columns use LC for enum-like strings (teams, personnel, formations, play types, etc). Every one is correct — unique counts are all in the 2-32 range. Yes
ORDER BY (game_id, world_id, play_sequence) Aligns with per-game, per-world query pattern. Note: strict ascending-cardinality would suggest (game_id, play_sequence, world_id) (play_sequence has ~130 unique values vs world_id's 100K), which would compress better. But filters almost always include world_id, so the query-alignment benefit outweighs compression. Yes
PARTITION BY season_id Currently 1 partition per 100K run → ~4.2B rows per partition. All foreground inserts and background merges serialise on that one partition. This is the write-throughput ceiling we hit. — see Change #1
No per-column CODEC on monotonic integers world_id (monotonic within a game), play_sequence (monotonic within a drive), clock_seconds, drive_*_clock_seconds (monotonic within a drive) — all benefit from CODEC(Delta, ZSTD). CH docs cite 15-30% storage reduction on monotonic integer columns with Delta. — see Change #2
No skip indexes on primary_player_id, is_touchdown etc. CH docs: "don't add skip indexes speculatively — do it after real query patterns are known." We haven't exercised the read side at scale yet. Defer. (deferred)

Result: two worth changing (Changes #1 and #2).

5. experiment_results

CREATE TABLE experiment_results (
    run_id, run_timestamp, requirement_id, backend, scale,
    metric_name, metric_value, metric_unit, environment, notes
)
ENGINE = MergeTree()
ORDER BY (requirement_id, backend, scale, run_timestamp)
-- no PARTITION BY
Aspect Evaluation Grade
ORDER BY (requirement_id, backend, scale, run_timestamp) Ascending cardinality + query alignment ("metrics for REQ-X on CH at scale Y over time"). Yes
No PARTITION BY Log-like data that grows monotonically over time. Docs recommend PARTITION BY toYYYYMM(run_timestamp) for this shape — enables easy retention (DROP PARTITION old months). — see Change #5
run_id is String (not LowCardinality) Run IDs are unique per row, so LC would bloat not help. Yes

Result: add monthly partitioning when we're ready to think about result-table retention.


Proposed follow-up sequence (if we pursue schema optimisation)

  1. Change #1 (PBP partitioning) — already detailed in a prior discussion. One DDL line, one 100K full-pipeline rerun to measure. Projected 20-40% improvement in PBP write phase.
  2. Change #2 (PBP column codecs) — add CODEC(Delta, ZSTD) to monotonic columns in the same DDL change. Measure via system.parts.bytes_on_disk before/after (no new run needed, just recreate the table and re-run).
  3. Change #3 (OC values codec comparison) — stand up three parallel tables with different codecs on values, write the same 10K dataset into each, compare compressed size + merge time. Most informative and lowest-risk experiment; ~1 hour of work.
  4. Changes #4 and #5 (consistency partitions) — low-risk DDL cleanup. Roll into the next schema migration.

What we're explicitly NOT changing

  • ReplacingMergeTree(context_version) on the OC tables — correctly supports versioned contexts. Alternative (ReplicatedMergeTree in plain form) loses the version-based dedup semantics we actually use.
  • ORDER BY tuples on any table — all match the observed query patterns. Changing to optimise compression would hurt read latency, which matters more.
  • Skip indexes or projections — CH docs specifically warn against adding these without evidence. Would evaluate once real read patterns emerge.
  • Dropping LowCardinality on high-cardinality-looking columns — every LC use we have is under the 10K threshold where LC still wins.

Gaps this review doesn't cover

  • Read performance against the current schemas at 100K — we have 1K/10K latency measurements but haven't re-run the should-have suite (REQ-OC-R1/R3/R4/R5) against the 3×16 tier. Changes to the schema here (especially skip indexes) should be driven by those numbers.
  • Data-skipping index benchmarks — CH docs are clear these should be data-driven, not speculative. Need real query traces.
  • Materialized views / projections — same argument. Premature without query-pattern evidence.