ClickHouse schema review — Foundry storage experiment¶
Reviewed: 2026-04-21 • Schemas: ClickHouseSchemas.cs (formerly at src/Models/AmericanFootball/LBS.Model.AmericanFootball.StorageExperiment/Infrastructure/ClickHouse/ClickHouseSchemas.cs — this file and its project have since been removed)
EMPIRICAL UPDATE (2026-04-21): Changes 1, 2, 4, 5 below were applied in commit
34155f2dand validated with a 100K full-pipeline run. The run regressed catastrophically — deadline-exceeded at 4h with only 2% of worlds processed (baseline: 2h 58m complete). Reverted in commit150ec156. The likely cause isClickHouseBulkCopy.InitAsync(which fires aSELECT ... WHERE ?=?metadata query before every insert) getting materially slower in the presence of the computedintDiv(world_id, 10000)partition expression: CH Cloud'ssystem.query_logshowed 8,737 selects vs 387 inserts over 5 hours (≈22:1, vs the ≈2:1 ratio on the previous run). The inserts that did reach the server ran at historical speed; the client-library overhead is what killed the run. Lesson: schema changes that look like best-practice improvements in isolation can interact badly with client-library behaviour. Never commit schema changes without a validated run. Seestreaming-cloud-scale-100000-fullpipeline-schema-review-regressed.txtfor the diagnostic detail. The proposals below are kept for reference but flagged with where they were reverted.
Evaluates our four MergeTree tables (game_outcome_context, game_outcome_context_staging, season_outcome_context, play_by_play) plus experiment_results against current ClickHouse best-practice docs. Where we diverge from the docs, each item is flagged as Ship as-is / Test / Change with effort + expected impact.
Methodology¶
Authoritative source for each guideline: the clickhouse/clickhouse-docs reference material (best-practices section, data-compression, sparse-primary-indexes, partitioning-keys). All proposed changes are listed with their expected impact and validation path — none applied blindly.
Evaluation framework (one row per guideline):
- ORDER BY — primary key / sort key shape, cardinality ordering, query alignment
- PARTITION BY — partitioning strategy, partition count, insert-block constraints
- Column codecs — compression codec choice per column shape
- Column types — smallest correct numeric type, LowCardinality thresholds
- Skip indexes / projections — for reads we haven't validated yet
- Engine choice — MergeTree vs ReplacingMergeTree etc.
Summary of findings¶
Overall verdict: the schemas are well-designed and close to best practice. Most of the tuning levers give small wins (<20% on their own phase). Two stand out as worth validating:
| # | Change | Table | Effort | Expected impact | Empirical result |
|---|---|---|---|---|---|
| 1 | PARTITION BY (season_id, intDiv(world_id, 10000)) |
play_by_play |
1-line DDL | PBP write -20–40% (parallel background merges across 10 partitions instead of 1). | REVERTED — the computed partition expression made ClickHouseBulkCopy.InitAsync prohibitively slow, killing throughput 68×. |
| 2 | Add per-column CODEC(Delta, ZSTD) to monotonic integer columns | play_by_play |
DDL edit | PBP storage -15–30%. | REVERTED — applied with #1 in the same run; can't isolate its impact but was rolled back together. |
| 3 | Test Gorilla/FPC codecs on values Array(Float64) |
game_outcome_context, season_outcome_context, staging |
2-3 side-by-side schemas + one sample run | Currently Delta+ZSTD gives 7.3× ratio; Gorilla or plain ZSTD might do similar or slightly better on random floats. Unknown win, worth one benchmark. | Not attempted. |
| 4 | Add PARTITION BY season_id to season_outcome_context |
season_outcome_context |
1-line DDL | Consistency with other tables; enables DROP PARTITION for retention. No perf impact. | REVERTED — bundled with #1, could have stayed in but rolled back for cleanliness. Can re-apply in isolation. |
| 5 | Add PARTITION BY toYYYYMM(run_timestamp) to experiment_results |
experiment_results |
1-line DDL | Retention management. No perf impact. | REVERTED — same as #4. |
| 6 | Downsize context_version UInt32 → UInt8; batch_index UInt32 → UInt8 |
game OC tables | 2-line DDL | Storage -3 bytes/row. Negligible (~0.1% at 100K). Not worth the schema churn. | Not attempted. |
| 7 | Add skip indexes for primary_player_id, is_touchdown |
play_by_play |
Index DDL + materialize | Future read optimisation — only worth doing once real query patterns are known. CH docs specifically warn against speculative skip indexes. | Not attempted. |
Updated recommendation: don't re-apply #1 without a client-side workaround for the InitAsync cost (e.g., caching the column schema once per run, or switching to a different insert path). #3 is still worth a small side-by-side benchmark. #4 and #5 are harmless consistency fixes that can re-apply in isolation if someone wants them.
Per-table evaluation¶
1. game_outcome_context¶
CREATE TABLE game_outcome_context (
game_id LowCardinality(String),
season_id LowCardinality(String),
context_version UInt32,
outcome_id LowCardinality(String),
values Array(Float64) CODEC(Delta, ZSTD)
)
ENGINE = ReplacingMergeTree(context_version)
ORDER BY (game_id, outcome_id)
PARTITION BY season_id
| Aspect | Evaluation | Grade |
|---|---|---|
| Engine choice (ReplacingMergeTree) | Correct — enables context_version deduplication semantics for versioned contexts. |
Yes |
ORDER BY (game_id, outcome_id) |
Ascending cardinality order (288 games × 850 outcomes). Aligns with per-game basket-read pattern (measured 8-20 ms at 1-10K). | Yes |
PARTITION BY season_id |
One partition per season. Appropriate for multi-season tenancy + per-season DROP PARTITION. | Yes |
LowCardinality(String) columns |
game_id (288), season_id (1-N), outcome_id (~850). All well within the 10K LC threshold. | Yes |
context_version UInt32 |
Functional but 3 bytes wider than needed. Most contexts will have versions in the UInt8 range. | Minor — see Change #6 |
values Array(Float64) CODEC(Delta, ZSTD) |
Delta+ZSTD gave 5.56× → 7.32× measured compression at 1K → 100K scale (good). BUT Delta is designed for monotonic sequences; per-outcome value arrays are effectively random floats (simulation-derived). Gorilla or FPC might do equally well or better; plain ZSTD is worth comparing. | Caution — see Change #3 |
Result: ship as-is. Optionally benchmark codec alternatives for values (Change #3).
2. game_outcome_context_staging¶
CREATE TABLE game_outcome_context_staging (
game_id, season_id, context_version, batch_index, outcome_id,
values Array(Float64) CODEC(Delta, ZSTD)
)
ENGINE = MergeTree()
ORDER BY (game_id, outcome_id, batch_index)
-- no PARTITION BY
| Aspect | Evaluation | Grade |
|---|---|---|
| No PARTITION BY | Correct for a short-lived staging table. TRUNCATE is used at end of run. |
Yes |
ORDER BY (game_id, outcome_id, batch_index) |
Matches the merge query (GROUP BY game_id, outcome_id + arraySort(x -> x.1, ... batch_index, values)). Exactly what we want. |
Yes |
batch_index UInt32 |
Chunks are typically ≤ 200, so UInt8 would do. Trivial saving. | Minor |
| Values codec | Same discussion as above — Delta on simulation-derived random floats isn't optimal on paper but compresses well in practice. | Caution |
Result: ship as-is.
3. season_outcome_context¶
CREATE TABLE season_outcome_context (
season_id, context_version, outcome_id,
values Array(Float64) CODEC(Delta, ZSTD)
)
ENGINE = ReplacingMergeTree(context_version)
ORDER BY (season_id, outcome_id)
-- no PARTITION BY
| Aspect | Evaluation | Grade |
|---|---|---|
| No PARTITION BY | Inconsistent with game_outcome_context which partitions by season_id. With multiple seasons in the same cluster we'd benefit from PARTITION BY season_id for DROP PARTITION and background-merge isolation. |
— see Change #4 |
ORDER BY (season_id, outcome_id) |
Ascending cardinality; aligns with per-season basket reads. | Yes |
| Engine (ReplacingMergeTree) | Correct, same reason as game OC. | Yes |
Result: add PARTITION BY season_id for consistency. No perf change at current data volumes.
4. play_by_play¶
CREATE TABLE play_by_play (
game_id LowCardinality(String),
season_id LowCardinality(String),
world_id UInt32,
week UInt8,
home_team LowCardinality(String),
... -- 38 more columns
)
ENGINE = MergeTree()
ORDER BY (game_id, world_id, play_sequence)
PARTITION BY season_id
| Aspect | Evaluation | Grade |
|---|---|---|
| Smallest correct numeric types | Extensive use of UInt8, UInt16, Int8 for bounded-range fields. | Yes |
LowCardinality(String) |
16 columns use LC for enum-like strings (teams, personnel, formations, play types, etc). Every one is correct — unique counts are all in the 2-32 range. | Yes |
ORDER BY (game_id, world_id, play_sequence) |
Aligns with per-game, per-world query pattern. Note: strict ascending-cardinality would suggest (game_id, play_sequence, world_id) (play_sequence has ~130 unique values vs world_id's 100K), which would compress better. But filters almost always include world_id, so the query-alignment benefit outweighs compression. |
Yes |
PARTITION BY season_id |
Currently 1 partition per 100K run → ~4.2B rows per partition. All foreground inserts and background merges serialise on that one partition. This is the write-throughput ceiling we hit. | — see Change #1 |
| No per-column CODEC on monotonic integers | world_id (monotonic within a game), play_sequence (monotonic within a drive), clock_seconds, drive_*_clock_seconds (monotonic within a drive) — all benefit from CODEC(Delta, ZSTD). CH docs cite 15-30% storage reduction on monotonic integer columns with Delta. |
— see Change #2 |
No skip indexes on primary_player_id, is_touchdown etc. |
CH docs: "don't add skip indexes speculatively — do it after real query patterns are known." We haven't exercised the read side at scale yet. Defer. | (deferred) |
Result: two worth changing (Changes #1 and #2).
5. experiment_results¶
CREATE TABLE experiment_results (
run_id, run_timestamp, requirement_id, backend, scale,
metric_name, metric_value, metric_unit, environment, notes
)
ENGINE = MergeTree()
ORDER BY (requirement_id, backend, scale, run_timestamp)
-- no PARTITION BY
| Aspect | Evaluation | Grade |
|---|---|---|
ORDER BY (requirement_id, backend, scale, run_timestamp) |
Ascending cardinality + query alignment ("metrics for REQ-X on CH at scale Y over time"). | Yes |
| No PARTITION BY | Log-like data that grows monotonically over time. Docs recommend PARTITION BY toYYYYMM(run_timestamp) for this shape — enables easy retention (DROP PARTITION old months). |
— see Change #5 |
run_id is String (not LowCardinality) |
Run IDs are unique per row, so LC would bloat not help. | Yes |
Result: add monthly partitioning when we're ready to think about result-table retention.
Proposed follow-up sequence (if we pursue schema optimisation)¶
- Change #1 (PBP partitioning) — already detailed in a prior discussion. One DDL line, one 100K full-pipeline rerun to measure. Projected 20-40% improvement in PBP write phase.
- Change #2 (PBP column codecs) — add
CODEC(Delta, ZSTD)to monotonic columns in the same DDL change. Measure viasystem.parts.bytes_on_diskbefore/after (no new run needed, just recreate the table and re-run). - Change #3 (OC values codec comparison) — stand up three parallel tables with different codecs on
values, write the same 10K dataset into each, compare compressed size + merge time. Most informative and lowest-risk experiment; ~1 hour of work. - Changes #4 and #5 (consistency partitions) — low-risk DDL cleanup. Roll into the next schema migration.
What we're explicitly NOT changing¶
ReplacingMergeTree(context_version)on the OC tables — correctly supports versioned contexts. Alternative (ReplicatedMergeTreein plain form) loses the version-based dedup semantics we actually use.ORDER BYtuples on any table — all match the observed query patterns. Changing to optimise compression would hurt read latency, which matters more.- Skip indexes or projections — CH docs specifically warn against adding these without evidence. Would evaluate once real read patterns emerge.
- Dropping
LowCardinalityon high-cardinality-looking columns — every LC use we have is under the 10K threshold where LC still wins.
Gaps this review doesn't cover¶
- Read performance against the current schemas at 100K — we have 1K/10K latency measurements but haven't re-run the should-have suite (REQ-OC-R1/R3/R4/R5) against the 3×16 tier. Changes to the schema here (especially skip indexes) should be driven by those numbers.
- Data-skipping index benchmarks — CH docs are clear these should be data-driven, not speculative. Need real query traces.
- Materialized views / projections — same argument. Premature without query-pattern evidence.