ClickHouse schema review — Foundry storage experiment¶

Reviewed: 2026-04-21 • Schemas: ClickHouseSchemas.cs (formerly at src/Models/AmericanFootball/LBS.Model.AmericanFootball.StorageExperiment/Infrastructure/ClickHouse/ClickHouseSchemas.cs — this file and its project have since been removed)

EMPIRICAL UPDATE (2026-04-21): Changes 1, 2, 4, 5 below were applied in commit 34155f2d and validated with a 100K full-pipeline run. The run regressed catastrophically — deadline-exceeded at 4h with only 2% of worlds processed (baseline: 2h 58m complete). Reverted in commit 150ec156. The likely cause is ClickHouseBulkCopy.InitAsync (which fires a SELECT ... WHERE ?=? metadata query before every insert) getting materially slower in the presence of the computed intDiv(world_id, 10000) partition expression: CH Cloud's system.query_log showed 8,737 selects vs 387 inserts over 5 hours (≈22:1, vs the ≈2:1 ratio on the previous run). The inserts that did reach the server ran at historical speed; the client-library overhead is what killed the run. Lesson: schema changes that look like best-practice improvements in isolation can interact badly with client-library behaviour. Never commit schema changes without a validated run. See streaming-cloud-scale-100000-fullpipeline-schema-review-regressed.txt for the diagnostic detail. The proposals below are kept for reference but flagged with where they were reverted.

Evaluates our four MergeTree tables (game_outcome_context, game_outcome_context_staging, season_outcome_context, play_by_play) plus experiment_results against current ClickHouse best-practice docs. Where we diverge from the docs, each item is flagged as Ship as-is / Test / Change with effort + expected impact.

Methodology¶

Authoritative source for each guideline: the clickhouse/clickhouse-docs reference material (best-practices section, data-compression, sparse-primary-indexes, partitioning-keys). All proposed changes are listed with their expected impact and validation path — none applied blindly.

Evaluation framework (one row per guideline): - ORDER BY — primary key / sort key shape, cardinality ordering, query alignment - PARTITION BY — partitioning strategy, partition count, insert-block constraints - Column codecs — compression codec choice per column shape - Column types — smallest correct numeric type, LowCardinality thresholds - Skip indexes / projections — for reads we haven't validated yet - Engine choice — MergeTree vs ReplacingMergeTree etc.

Summary of findings¶

Overall verdict: the schemas are well-designed and close to best practice. Most of the tuning levers give small wins (<20% on their own phase). Two stand out as worth validating:

#	Change	Table	Effort	Expected impact	Empirical result
1	PARTITION BY `(season_id, intDiv(world_id, 10000))`	`play_by_play`	1-line DDL	PBP write -20–40% (parallel background merges across 10 partitions instead of 1).	REVERTED — the computed partition expression made `ClickHouseBulkCopy.InitAsync` prohibitively slow, killing throughput 68×.
2	Add per-column CODEC(Delta, ZSTD) to monotonic integer columns	`play_by_play`	DDL edit	PBP storage -15–30%.	REVERTED — applied with #1 in the same run; can't isolate its impact but was rolled back together.
3	Test Gorilla/FPC codecs on `values` Array(Float64)	`game_outcome_context`, `season_outcome_context`, staging	2-3 side-by-side schemas + one sample run	Currently Delta+ZSTD gives 7.3× ratio; Gorilla or plain ZSTD might do similar or slightly better on random floats. Unknown win, worth one benchmark.	Not attempted.
4	Add `PARTITION BY season_id` to `season_outcome_context`	`season_outcome_context`	1-line DDL	Consistency with other tables; enables DROP PARTITION for retention. No perf impact.	REVERTED — bundled with #1, could have stayed in but rolled back for cleanliness. Can re-apply in isolation.
5	Add `PARTITION BY toYYYYMM(run_timestamp)` to `experiment_results`	`experiment_results`	1-line DDL	Retention management. No perf impact.	REVERTED — same as #4.
6	Downsize `context_version` UInt32 → UInt8; `batch_index` UInt32 → UInt8	game OC tables	2-line DDL	Storage -3 bytes/row. Negligible (~0.1% at 100K). Not worth the schema churn.	Not attempted.
7	Add skip indexes for `primary_player_id`, `is_touchdown`	`play_by_play`	Index DDL + materialize	Future read optimisation — only worth doing once real query patterns are known. CH docs specifically warn against speculative skip indexes.	Not attempted.

Updated recommendation: don't re-apply #1 without a client-side workaround for the InitAsync cost (e.g., caching the column schema once per run, or switching to a different insert path). #3 is still worth a small side-by-side benchmark. #4 and #5 are harmless consistency fixes that can re-apply in isolation if someone wants them.

Per-table evaluation¶

1. `game_outcome_context`¶

CREATE TABLE game_outcome_context (
    game_id          LowCardinality(String),
    season_id        LowCardinality(String),
    context_version  UInt32,
    outcome_id       LowCardinality(String),
    values           Array(Float64) CODEC(Delta, ZSTD)
)
ENGINE = ReplacingMergeTree(context_version)
ORDER BY (game_id, outcome_id)
PARTITION BY season_id

Aspect	Evaluation	Grade
Engine choice (ReplacingMergeTree)	Correct — enables `context_version` deduplication semantics for versioned contexts.	Yes
ORDER BY `(game_id, outcome_id)`	Ascending cardinality order (288 games × 850 outcomes). Aligns with per-game basket-read pattern (measured 8-20 ms at 1-10K).	Yes
PARTITION BY `season_id`	One partition per season. Appropriate for multi-season tenancy + per-season DROP PARTITION.	Yes
`LowCardinality(String)` columns	game_id (288), season_id (1-N), outcome_id (~850). All well within the 10K LC threshold.	Yes
`context_version UInt32`	Functional but 3 bytes wider than needed. Most contexts will have versions in the UInt8 range.	Minor — see Change #6
`values Array(Float64) CODEC(Delta, ZSTD)`	Delta+ZSTD gave 5.56× → 7.32× measured compression at 1K → 100K scale (good). BUT Delta is designed for monotonic sequences; per-outcome value arrays are effectively random floats (simulation-derived). Gorilla or FPC might do equally well or better; plain ZSTD is worth comparing.	Caution — see Change #3

Result: ship as-is. Optionally benchmark codec alternatives for values (Change #3).

2. `game_outcome_context_staging`¶

CREATE TABLE game_outcome_context_staging (
    game_id, season_id, context_version, batch_index, outcome_id,
    values Array(Float64) CODEC(Delta, ZSTD)
)
ENGINE = MergeTree()
ORDER BY (game_id, outcome_id, batch_index)
-- no PARTITION BY

Aspect	Evaluation	Grade
No PARTITION BY	Correct for a short-lived staging table. `TRUNCATE` is used at end of run.	Yes
ORDER BY `(game_id, outcome_id, batch_index)`	Matches the merge query (`GROUP BY game_id, outcome_id` + `arraySort(x -> x.1, ... batch_index, values)`). Exactly what we want.	Yes
`batch_index UInt32`	Chunks are typically ≤ 200, so UInt8 would do. Trivial saving.	Minor
Values codec	Same discussion as above — Delta on simulation-derived random floats isn't optimal on paper but compresses well in practice.	Caution

Result: ship as-is.

3. `season_outcome_context`¶

CREATE TABLE season_outcome_context (
    season_id, context_version, outcome_id,
    values Array(Float64) CODEC(Delta, ZSTD)
)
ENGINE = ReplacingMergeTree(context_version)
ORDER BY (season_id, outcome_id)
-- no PARTITION BY

Aspect	Evaluation	Grade
No PARTITION BY	Inconsistent with `game_outcome_context` which partitions by `season_id`. With multiple seasons in the same cluster we'd benefit from `PARTITION BY season_id` for DROP PARTITION and background-merge isolation.	— see Change #4
ORDER BY `(season_id, outcome_id)`	Ascending cardinality; aligns with per-season basket reads.	Yes
Engine (ReplacingMergeTree)	Correct, same reason as game OC.	Yes

Result: add PARTITION BY season_id for consistency. No perf change at current data volumes.

4. `play_by_play`¶

CREATE TABLE play_by_play (
    game_id           LowCardinality(String),
    season_id         LowCardinality(String),
    world_id          UInt32,
    week              UInt8,
    home_team         LowCardinality(String),
    ...                 -- 38 more columns
)
ENGINE = MergeTree()
ORDER BY (game_id, world_id, play_sequence)
PARTITION BY season_id

Aspect	Evaluation	Grade
Smallest correct numeric types	Extensive use of UInt8, UInt16, Int8 for bounded-range fields.	Yes
`LowCardinality(String)`	16 columns use LC for enum-like strings (teams, personnel, formations, play types, etc). Every one is correct — unique counts are all in the 2-32 range.	Yes
ORDER BY `(game_id, world_id, play_sequence)`	Aligns with per-game, per-world query pattern. Note: strict ascending-cardinality would suggest `(game_id, play_sequence, world_id)` (play_sequence has ~130 unique values vs world_id's 100K), which would compress better. But filters almost always include `world_id`, so the query-alignment benefit outweighs compression.	Yes
PARTITION BY `season_id`	Currently 1 partition per 100K run → ~4.2B rows per partition. All foreground inserts and background merges serialise on that one partition. This is the write-throughput ceiling we hit.	— see Change #1
No per-column CODEC on monotonic integers	`world_id` (monotonic within a game), `play_sequence` (monotonic within a drive), `clock_seconds`, `drive_*_clock_seconds` (monotonic within a drive) — all benefit from `CODEC(Delta, ZSTD)`. CH docs cite 15-30% storage reduction on monotonic integer columns with Delta.	— see Change #2
No skip indexes on `primary_player_id`, `is_touchdown` etc.	CH docs: "don't add skip indexes speculatively — do it after real query patterns are known." We haven't exercised the read side at scale yet. Defer.	(deferred)

Result: two worth changing (Changes #1 and #2).

5. `experiment_results`¶

CREATE TABLE experiment_results (
    run_id, run_timestamp, requirement_id, backend, scale,
    metric_name, metric_value, metric_unit, environment, notes
)
ENGINE = MergeTree()
ORDER BY (requirement_id, backend, scale, run_timestamp)
-- no PARTITION BY

Aspect	Evaluation	Grade
ORDER BY `(requirement_id, backend, scale, run_timestamp)`	Ascending cardinality + query alignment ("metrics for REQ-X on CH at scale Y over time").	Yes
No PARTITION BY	Log-like data that grows monotonically over time. Docs recommend `PARTITION BY toYYYYMM(run_timestamp)` for this shape — enables easy retention (DROP PARTITION old months).	— see Change #5
`run_id` is `String` (not LowCardinality)	Run IDs are unique per row, so LC would bloat not help.	Yes

Result: add monthly partitioning when we're ready to think about result-table retention.

Proposed follow-up sequence (if we pursue schema optimisation)¶

Change #1 (PBP partitioning) — already detailed in a prior discussion. One DDL line, one 100K full-pipeline rerun to measure. Projected 20-40% improvement in PBP write phase.
Change #2 (PBP column codecs) — add CODEC(Delta, ZSTD) to monotonic columns in the same DDL change. Measure via system.parts.bytes_on_disk before/after (no new run needed, just recreate the table and re-run).
Change #3 (OC values codec comparison) — stand up three parallel tables with different codecs on values, write the same 10K dataset into each, compare compressed size + merge time. Most informative and lowest-risk experiment; ~1 hour of work.
Changes #4 and #5 (consistency partitions) — low-risk DDL cleanup. Roll into the next schema migration.

What we're explicitly NOT changing¶

ReplacingMergeTree(context_version) on the OC tables — correctly supports versioned contexts. Alternative (ReplicatedMergeTree in plain form) loses the version-based dedup semantics we actually use.
ORDER BY tuples on any table — all match the observed query patterns. Changing to optimise compression would hurt read latency, which matters more.
Skip indexes or projections — CH docs specifically warn against adding these without evidence. Would evaluate once real read patterns emerge.
Dropping LowCardinality on high-cardinality-looking columns — every LC use we have is under the 10K threshold where LC still wins.

Gaps this review doesn't cover¶

Read performance against the current schemas at 100K — we have 1K/10K latency measurements but haven't re-run the should-have suite (REQ-OC-R1/R3/R4/R5) against the 3×16 tier. Changes to the schema here (especially skip indexes) should be driven by those numbers.
Data-skipping index benchmarks — CH docs are clear these should be data-driven, not speculative. Need real query traces.
Materialized views / projections — same argument. Premature without query-pattern evidence.