Canonical Entity Mapping: Presentation Outline
A 5–6 slide deck with a live demo in the middle, aimed at a mixed audience: execs, managers, end users, and developers. The structure leads with the problem in business terms, shows the solution working, then circles back to value, roadmap, and asks.
This is a working draft to take to coworkers — bullets are talking-point density, not final copy.
Slide 1: "Today, every new data provider costs us weeks"
Audience focus: Execs, managers
Key message: Every new provider was a weeks-of-dev tax. Worse, the mapping work happened too late and in the wrong place, so ops couldn't see it and data science couldn't trust it.
- Each provider (FoxSports, StatsPerform, SuperCoach, NRL Web…) had its own bespoke C# importer — adding ESPN for NFL would have been weeks of work
- We stored the same team / player / fixture multiple times, once per provider, and stitched them together with event-sourced "mapping" commands
- Mapping was opaque to operators: the mapping layer didn't show whether an entity had been mapped, when it was mapped, or by whom. That created a steady stream of "is this linked yet?" questions and downstream delays whenever a mapping was missing or wrong.
- Mapping happened too late in the pipeline: data science was effectively re-doing the canonicalisation work themselves on raw provider feeds, because our canonical ids weren't available at the point they needed them. Their training data couldn't reliably reference our internal ids — so we were paying for the same work twice.
- Net result: slow onboarding, blocked operators, and a data science team building on identifiers they couldn't fully trust
Visual: "before" tangle diagram — six provider boxes each with their own pipe into duplicate aggregates, with a mapping arrow tangle. Annotate two pain points: "engineer-only mapping layer" and "DS re-derives canonical ids downstream".
Slide 2: "What we built: one canonical entity, one crosswalk, no per-provider code"
Audience focus: Devs (will love this), managers (need to follow the shape)
Key message: We replaced bespoke importers with a generic, data-driven pipeline. ClickHouse holds the truth; Foundry consumes it.
- Standardised provider tables in ClickHouse: same columns across every provider for a given entity type (`{provider}_teams`, `{provider}_participants`, etc.). Data science populates these.
- One normalised crosswalk (`entity_mapping_crosswalk`): `(canonical_id ↔ provider, provider_id)` for every entity type and every provider, in one table.
- Generic importers: one importer per entity type reads from any provider's standardised table + the crosswalk → emits a `CreateXCommand` against the canonical id. Adding a new provider = zero C# code.
- Operator UI at `/admin/crosswalk` plus a per-entity link-provider dialog for managing mappings without engineering involvement.
Visual: "after" diagram — ClickHouse on the left (provider tables + crosswalk), generic importers in the middle, single canonical aggregates on the right.
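For the dev contingent (or the backup slide), the crosswalk-plus-generic-importer shape on this slide fits in a few lines. The following is a minimal Python sketch under stated assumptions, not the production C# importer. The row fields mirror the `(canonical_id ↔ provider, provider_id)` shape. `CrosswalkRow`, `build_index`, `import_entities`, the `name` column, and every sample id are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CrosswalkRow:
    """One row of the crosswalk (field names are assumptions)."""
    entity_type: str   # e.g. "team"
    canonical_id: str  # our internal id
    provider: str      # e.g. "espn"
    provider_id: str   # the provider's own id for the same entity


def build_index(rows: Iterable[CrosswalkRow]) -> dict[tuple[str, str, str], str]:
    """Index (entity_type, provider, provider_id) -> canonical_id."""
    return {(r.entity_type, r.provider, r.provider_id): r.canonical_id for r in rows}


def import_entities(entity_type: str, provider: str,
                    provider_rows: Iterable[dict],
                    index: dict[tuple[str, str, str], str]):
    """Generic importer: no provider-specific branches anywhere.

    Returns (commands, unmapped_provider_ids). Rows with no crosswalk
    entry are reported rather than imported under a provider-local id.
    """
    commands, unmapped = [], []
    for row in provider_rows:
        canonical_id = index.get((entity_type, provider, row["provider_id"]))
        if canonical_id is None:
            unmapped.append(row["provider_id"])
            continue
        commands.append({
            "command": f"Create{entity_type.capitalize()}Command",
            "canonical_id": canonical_id,
            "name": row["name"],
            "source_provider": provider,
        })
    return commands, unmapped


# Made-up sample data: the same team known to two providers.
crosswalk = [
    CrosswalkRow("team", "c-1", "foxsports", "fx-10"),
    CrosswalkRow("team", "c-1", "espn", "e-77"),
]
index = build_index(crosswalk)
commands, unmapped = import_entities(
    "team", "espn",
    [{"provider_id": "e-77", "name": "Example FC"},
     {"provider_id": "e-99", "name": "Not linked yet"}],
    index,
)
```

The point to make out loud: onboarding a provider changes the data passed in, not the function. Unmapped rows surface as a list an operator can work through instead of silently creating duplicates.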
DEMO (5–7 minutes)
Show the dialog doing the work — far more persuasive than describing it.
Suggested demo path:
- Navigate to `/teams/{some-team-id}` → click Provider Mappings.
- Open the link-provider dialog. Show the three columns: source (LuckBox team), candidates from all providers, preview.
- Demonstrate fuzzy match → narrow with provider filter → link a candidate. Note the live "Mapped to …" pill, conflict detection, diagnostics panel.
- Repeat the structured match for a fixture (`/fixtures/{id}`): show how it matches by home team + away team + start time, not name.
- Close with a quick visit to `/admin/crosswalk` showing the unified view across providers.
Tip: have a recognisable example loaded (e.g. a well-known team across three providers) — abstract test data is forgettable. Have a dev console open in case anyone asks "what's actually happening?" but don't lead with it.
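If someone asks "what's actually happening?" during the demo, the two matching styles can be illustrated with a short sketch. This is an assumption-laden illustration, not the dialog's real scoring logic. It uses Python's standard `difflib.SequenceMatcher` as a stand-in fuzzy scorer. The 30-minute start-time tolerance, the `Fixture` type, and the function names are hypothetical. It also assumes the provider fixture's teams have already been resolved to canonical team ids.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from difflib import SequenceMatcher


def name_score(a: str, b: str) -> float:
    """Fuzzy similarity in [0, 1] between two entity names (case-insensitive)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def rank_candidates(source_name: str, candidates: list[tuple[str, str, str]]):
    """Rank (provider, provider_id, name) candidates by name similarity, best first."""
    scored = [(name_score(source_name, name), provider, pid, name)
              for provider, pid, name in candidates]
    return sorted(scored, key=lambda item: item[0], reverse=True)


@dataclass(frozen=True)
class Fixture:
    home_team_id: str  # canonical team id (assumed already resolved)
    away_team_id: str  # canonical team id (assumed already resolved)
    start: datetime


def fixtures_match(a: Fixture, b: Fixture,
                   tolerance: timedelta = timedelta(minutes=30)) -> bool:
    """Structured match: same home team, same away team, start times close enough.

    Names are never compared, so a provider's fixture label does not matter.
    """
    return (a.home_team_id == b.home_team_id
            and a.away_team_id == b.away_team_id
            and abs(a.start - b.start) <= tolerance)


# Made-up sample data.
ranked = rank_candidates("Example United", [
    ("provider_a", "1", "Sample Rovers"),
    ("provider_b", "2", "Example United"),
])
ours = Fixture("t1", "t2", datetime(2024, 3, 1, 19, 0))
theirs = Fixture("t1", "t2", datetime(2024, 3, 1, 19, 10))
```

Teams and players only have names to go on, so they get a ranked candidate list. Fixtures carry structure (who plays whom, and when), so an exact structural comparison is both simpler and harder to fool.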
Slide 3: "What this unlocks"
Audience focus: Execs, managers
Key message: We've turned a weeks-of-dev problem into a days-of-data problem, and stopped paying for the same identifier work twice.
- New provider onboarding: data eng creates the standardised tables + populates crosswalk → done. No deploy, no PR, no engineer. Weeks → days.
- Operators self-serve: bad or missing mapping? Fix it in the link-provider dialog. Status is visible (who linked this, when, and to what), with no ticket required.
- Canonical ids upstream of data science: standardised provider tables + crosswalk are populated before DS consumes the data, so training pipelines reference our canonical ids natively. No more shadow canonicalisation.
- One source of truth: one team is one team. Reporting, analytics, and ML training all see consistent canonical ids.
- Lower coupling: data engineering and app engineering can move independently — DS owns the provider tables, app eng owns the canonical model.
Visual: side-by-side comparison.
| | Old | New |
|---|---|---|
| Provider onboarding | Weeks of dev | Days, data-only |
| Canonical ids for DS | Post-hoc, DS-built | From day one, infrastructure-provided |
| Operator visibility | Engineer-only | Self-serve UI with audit trail |
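For backup material, here is a small sketch of what "who linked this, when, and to what" could look like at the data level, alongside the conflict detection shown in the demo. It is an illustration under assumptions, not the shipped schema: `MappingRecord`, `Crosswalk`, `MappingConflict`, and the `linked_by` / `linked_at` fields are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class MappingRecord:
    entity_type: str
    canonical_id: str    # "to what"
    provider: str
    provider_id: str
    linked_by: str       # "who"
    linked_at: datetime  # "when" (stored in UTC)


class MappingConflict(Exception):
    """Raised when a provider id is already linked to a different canonical id."""


class Crosswalk:
    def __init__(self) -> None:
        self._records: dict[tuple[str, str, str], MappingRecord] = {}

    def link(self, entity_type: str, canonical_id: str, provider: str,
             provider_id: str, linked_by: str,
             now: Optional[datetime] = None) -> MappingRecord:
        key = (entity_type, provider, provider_id)
        existing = self._records.get(key)
        if existing is not None:
            if existing.canonical_id != canonical_id:
                raise MappingConflict(
                    f"{provider}:{provider_id} is already linked to "
                    f"{existing.canonical_id} by {existing.linked_by}")
            return existing  # re-linking is a no-op; keep the original audit fields
        record = MappingRecord(entity_type, canonical_id, provider, provider_id,
                               linked_by, now or datetime.now(timezone.utc))
        self._records[key] = record
        return record

    def status(self, entity_type: str, provider: str,
               provider_id: str) -> Optional[MappingRecord]:
        """Answers "is this linked yet?" without reading an event log."""
        return self._records.get((entity_type, provider, provider_id))


crosswalk = Crosswalk()
first = crosswalk.link("team", "c-1", "espn", "e-77", linked_by="alice")
```

A second call such as `crosswalk.link("team", "c-2", "espn", "e-77", linked_by="bob")` raises `MappingConflict` in this sketch, which is the kind of situation the dialog flags before an operator commits a link.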
Slide 4: "What's next"
Audience focus: Managers (planning), execs (timing)
Key message: The mapping rebuild is the foundation; here's what we'll stack on top.
- Prove it: onboard ESPN for NFL with zero C# changes — that's the headline validation
- Per-provider import wizard — admin UI to kick off + monitor data loads, replacing the current developer-driven runbook
- Decommission legacy mapping infrastructure: `AggregateRelationshipBuilder`, the dual-command path, and the per-provider importer code can all come out (saves ~20 files, simplifies the domain layer)
- Migrate existing consumers: Ballr, scoreboard, projections currently read AggregateRelations; switch them to the crosswalk
- Composite-id support for SuperCoach Teams + Fixtures (their identifiers don't fit the simple model)
Visual: timeline ribbon with the items above, plus one item already crossed off ("ClickHouse infra + per-entity dialogs").
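If the composite-id item draws questions, one possible direction is to serialise a multi-part provider identifier into a single stable string so it still fits a one-column `provider_id`. This is purely a sketch of the idea, not a design decision. The outline does not say which fields make up the SuperCoach identifiers, so `season` and `team_code` below are invented placeholders, as are the function names.

```python
import json


def encode_composite_id(parts: dict[str, str]) -> str:
    """Serialise identifier parts to a canonical string.

    Keys are sorted and whitespace is removed, so the same parts always
    produce the same string regardless of the order they were supplied in.
    """
    return json.dumps(parts, sort_keys=True, separators=(",", ":"))


def decode_composite_id(provider_id: str) -> dict[str, str]:
    """Recover the identifier parts from the encoded string."""
    return json.loads(provider_id)


# Placeholder fields; the real composite parts are not specified in this outline.
encoded = encode_composite_id({"team_code": "ABC", "season": "2024"})
same = encode_composite_id({"season": "2024", "team_code": "ABC"})
```

The trade-off to flag is that an encoded string keeps the crosswalk table unchanged but pushes parsing onto anything that needs the individual parts. Separate columns would do the opposite.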
Slide 5: "Risks and asks"
Audience focus: Execs (decisions), managers (resourcing)
Key message: The hard work is done; here's what we need from the room.
- Data migration: existing dual aggregates get deleted on cut-over. Already aligned, but worth re-confirming with stakeholders.
- ClickHouse becomes critical path: if the crosswalk is down, importers stall. Need to confirm SLA + monitoring from infra.
- Decision needed: do we want a soft cut-over (legacy importers stay alongside generic ones during validation) or hard (legacy gets deleted on day one)?
- Resourcing: an owner from data engineering for the standardised provider tables — this is now their interface, not a side concern.
Visual: minimal — three traffic-light boxes (green: shipped; amber: in-flight; red: needs decision).
Optional Slide 6: "Q&A / Backup material"
Keep one slide of dense detail in case someone asks: the score-band cutoffs, the FastEndpoints route list, the migration sequence. Don't show it unless asked — having it prevents you from getting derailed.
Speaker tips
- Open with the problem, not the architecture. The exec half of the room needs to feel the pain before they care about the fix.
- Land the "DS shadow canonicalisation" point hard. It reframes the work from "engineering cleanup" to "we've stopped paying for the same identifier problem twice." Worth saying out loud:
"What this means in practice: data science was solving the same identifier problem we were, in parallel, with their own logic. That's now infrastructure — they get our canonical ids the moment data lands, and their models train against the same ids the platform serves."
- Land the "opaque to operators" point explicitly on slide 1. It sets up both the demo and slide 3's self-serve bullet:
"Event sourcing wasn't the wrong technical choice — but it kept the mapping decisions inside the event log, where only engineers could see them. That's what we've fixed."
- For the dev contingent, drop a one-liner like "we deleted ~3,000 lines of bespoke importer code and replaced it with one generic importer per entity type" — they'll perk up.
- Don't get sucked into architecture before you've earned it with the problem statement and the demo.
- Have a real, recognisable demo example loaded — abstract test data is forgettable.