Canonical Entity Mapping — Presentation Outline

A 5–6 slide deck with a live demo in the middle, aimed at a mixed audience: execs, managers, end users, and developers. The structure leads with the problem in business terms, shows the solution working, then circles back to value, roadmap, and asks.

This is a working draft to take to coworkers — bullets are talking-point density, not final copy.

Slide 1 — "Today, every new data provider costs us weeks"

Audience focus: Execs, managers
Key message: Every new provider was a weeks-of-dev tax. Worse, the mapping work happened too late and in the wrong place, so ops couldn't see it and data science couldn't trust it.

Each provider (FoxSports, StatsPerform, SuperCoach, NRL Web…) had its own bespoke C# importer — adding ESPN for NFL would have been weeks of work
We stored the same team / player / fixture multiple times, once per provider, and stitched them together with event-sourced "mapping" commands
Mapping was opaque to operators: the mapping layer didn't show whether an entity had been mapped, when it was mapped, or by whom. That created a steady stream of "is this linked yet?" questions and downstream delays whenever a mapping was missing or wrong.
Mapping happened too late in the pipeline: data science was effectively re-doing the canonicalisation work themselves on raw provider feeds, because our canonical ids weren't available at the point they needed them. Their training data couldn't reliably reference our internal ids — so we were paying for the same work twice.
Net result: slow onboarding, blocked operators, and a data science team building on identifiers they couldn't fully trust

Visual: "before" tangle diagram — six provider boxes each with their own pipe into duplicate aggregates, with a mapping arrow tangle. Annotate two pain points: "engineer-only mapping layer" and "DS re-derives canonical ids downstream".

Slide 2 — "What we built: one canonical entity, one crosswalk, no per-provider code"

Audience focus: Devs (will love this), managers (need to follow the shape)
Key message: We replaced bespoke importers with a generic, data-driven pipeline. ClickHouse holds the truth; Foundry consumes it.

Standardised provider tables in ClickHouse — same columns across every provider for a given entity type ({provider}_teams, {provider}_participants, etc.). Data science populates these.
One normalised crosswalk (entity_mapping_crosswalk) — (canonical_id ↔ provider, provider_id) for every entity type and every provider, in one table
Generic importers — one importer per entity type reads from any provider's standardised table + the crosswalk → emits a CreateXCommand against the canonical id. Adding a new provider = zero C# code.
Operator UI at /admin/crosswalk plus a per-entity link-provider dialog for managing mappings without engineering involvement
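
For the dev contingent, the shape of "one crosswalk, one generic importer" can be sketched in a few lines. This is a minimal Python illustration, not the actual C# implementation. The row fields (provider_id, name), the CreateTeamCommand fields, and the sample ids are all assumptions made up for the example; only the idea of looking up (provider, provider_id) in the crosswalk and emitting a command against the canonical id comes from the bullets above.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class CrosswalkRow:
    """One row of entity_mapping_crosswalk (column names are assumed)."""
    entity_type: str
    canonical_id: str
    provider: str
    provider_id: str


@dataclass(frozen=True)
class CreateTeamCommand:
    """Hypothetical stand-in for a CreateXCommand targeting a canonical id."""
    canonical_id: str
    name: str
    source_provider: str


CrosswalkIndex = Dict[Tuple[str, str, str], str]


def build_index(rows: Iterable[CrosswalkRow]) -> CrosswalkIndex:
    """Index by (entity_type, provider, provider_id); reject conflicting links."""
    index: CrosswalkIndex = {}
    for row in rows:
        key = (row.entity_type, row.provider, row.provider_id)
        if index.get(key, row.canonical_id) != row.canonical_id:
            raise ValueError(f"conflicting mapping for {key}")
        index[key] = row.canonical_id
    return index


def import_teams(
    provider: str, provider_rows: Iterable[dict], index: CrosswalkIndex
) -> Iterator[CreateTeamCommand]:
    """Generic team importer: the same code runs for every provider."""
    for row in provider_rows:
        canonical_id = index.get(("team", provider, row["provider_id"]))
        if canonical_id is None:
            continue  # unmapped: left for an operator to link in the UI
        yield CreateTeamCommand(canonical_id, row["name"], provider)


# Usage: ESPN is "onboarded" purely by adding data, with no importer changes.
crosswalk = [
    CrosswalkRow("team", "team-001", "foxsports", "fs-17"),
    CrosswalkRow("team", "team-001", "espn", "e-902"),
]
espn_teams = [
    {"provider_id": "e-902", "name": "Example FC"},
    {"provider_id": "e-999", "name": "Not yet mapped"},
]
commands = list(import_teams("espn", espn_teams, build_index(crosswalk)))
```

The point to make out loud: nothing in import_teams mentions a specific provider, so a new provider is a matter of new rows, not new code.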

Visual: "after" diagram — ClickHouse on the left (provider tables + crosswalk), generic importers in the middle, single canonical aggregates on the right.

DEMO (5–7 minutes)

Show the dialog doing the work — far more persuasive than describing it.

Suggested demo path:

Navigate to /teams/{some-team-id} → click Provider Mappings.
Open the link-provider dialog. Show the three columns: source (LuckBox team), candidates from all providers, preview.
Demonstrate fuzzy match → narrow with provider filter → link a candidate. Note the live "Mapped to …" pill, conflict detection, diagnostics panel.
Repeat with a fixture (/fixtures/{id}) to show the structured match: it matches on home team + away team + start time, not name.
Close with a quick visit to /admin/crosswalk showing the unified view across providers.
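
If someone asks what "structured match" means for fixtures, the idea can be sketched as below. This is a hedged Python illustration, not the production matcher: it assumes provider team ids have already been resolved to canonical team ids through the crosswalk, and the 30-minute start-time tolerance is a made-up value chosen only for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass(frozen=True)
class Fixture:
    fixture_id: str
    home_team: str       # canonical team id (assumed already resolved)
    away_team: str       # canonical team id (assumed already resolved)
    start_time: datetime


def match_fixture(
    source: Fixture,
    candidates: List[Fixture],
    tolerance: timedelta = timedelta(minutes=30),  # hypothetical tolerance
) -> List[Fixture]:
    """Match on home team + away team + start time; names are never compared."""
    return [
        c for c in candidates
        if c.home_team == source.home_team
        and c.away_team == source.away_team
        and abs(c.start_time - source.start_time) <= tolerance
    ]


def utc(day: int, hour: int, minute: int = 0) -> datetime:
    """Small helper to keep the sample data readable."""
    return datetime(2024, 3, day, hour, minute, tzinfo=timezone.utc)


source = Fixture("src-1", "team-001", "team-002", utc(8, 9))
candidates = [
    Fixture("fs-77", "team-001", "team-002", utc(8, 9, 5)),  # same game, 5 min drift
    Fixture("fs-78", "team-002", "team-001", utc(8, 9)),     # home/away reversed
    Fixture("fs-79", "team-001", "team-002", utc(15, 9)),    # same teams, a week later
]
matches = match_fixture(source, candidates)
```

Only the first candidate survives, which is the behaviour worth pointing at in the demo: two providers can name a fixture differently and still be linked.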

Tip: have a recognisable example loaded (e.g. a well-known team across three providers) — abstract test data is forgettable. Have a dev console open in case anyone asks "what's actually happening?" but don't lead with it.

Slide 3 — "What this unlocks"

Audience focus: Execs, managers
Key message: We've turned a weeks-of-dev problem into a days-of-data problem, and stopped paying for the same identifier work twice.

New provider onboarding: data eng creates the standardised tables + populates crosswalk → done. No deploy, no PR, no engineer. Weeks → days.
Operators self-serve: bad or missing mapping? Fix it in the link-provider dialog. Status is visible (who linked this, when, and to what), so no ticket is required.
Canonical ids upstream of data science: standardised provider tables + crosswalk are populated before DS consumes the data, so training pipelines reference our canonical ids natively. No more shadow canonicalisation.
One source of truth: one team is one team. Reporting, analytics, and ML training all see consistent canonical ids.
Lower coupling: data engineering and app engineering can move independently. Data engineering owns the provider tables; app eng owns the canonical model.

Visual: side-by-side comparison.

Area                   Old                  New
Provider onboarding    Weeks of dev         Days, data-only
Canonical ids for DS   Post-hoc, DS-built   From day one, infrastructure-provided
Operator visibility    Engineer-only        Self-serve UI with audit trail

Slide 4 — "What's next"

Audience focus: Managers (planning), execs (timing)
Key message: The mapping rebuild is the foundation; here's what we'll stack on top.

Prove it: onboard ESPN for NFL with zero C# changes — that's the headline validation
Per-provider import wizard — admin UI to kick off + monitor data loads, replacing the current developer-driven runbook
Decommission legacy mapping infrastructure — AggregateRelationshipBuilder, the dual-command path, and the per-provider importer code can all come out (saves ~20 files, simplifies the domain layer)
Migrate existing consumers — Ballr, scoreboard, projections currently read AggregateRelations; switch them to the crosswalk
Composite-id support for SuperCoach Teams + Fixtures (their identifiers don't fit the simple model)

Visual: timeline ribbon of the items above, with one already crossed off ("ClickHouse infra + per-entity dialogs").

Slide 5 — "Risks and asks"

Audience focus: Execs (decisions), managers (resourcing)
Key message: The hard work is done; here's what we need from the room.

Data migration: existing dual aggregates get deleted on cut-over. Already aligned, but worth re-confirming with stakeholders.
ClickHouse becomes critical path: if the crosswalk is down, importers stall. Need to confirm SLA + monitoring from infra.
Decision needed: do we want a soft cut-over (legacy importers stay alongside generic ones during validation) or hard (legacy gets deleted on day one)?
Resourcing: an owner from data engineering for the standardised provider tables — this is now their interface, not a side concern.

Visual: minimal — three traffic-light boxes (green: shipped; amber: in-flight; red: needs decision).

Optional Slide 6 — "Q&A / Backup material"

Keep one slide of dense detail in case someone asks: the score-band cutoffs, the FastEndpoints route list, the migration sequence. Don't show it unless asked — having it prevents you from getting derailed.
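
If the score-band question does come up, a toy version can help explain the concept. The sketch below is an assumption-heavy Python illustration: it uses the standard library's difflib.SequenceMatcher as a stand-in for whatever fuzzy scorer the dialog really uses, and the 0.90 / 0.70 cutoffs and band names are invented for the example. Substitute the real cutoffs on the backup slide.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# Hypothetical cutoffs, for illustration only.
STRONG_CUTOFF = 0.90
POSSIBLE_CUTOFF = 0.70


def normalise(name: str) -> str:
    """Lower-case and collapse whitespace so trivial differences don't count."""
    return " ".join(name.lower().split())


def score(a: str, b: str) -> float:
    """Similarity in [0, 1] between two names after normalisation."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def band(s: float) -> str:
    """Map a similarity score to a display band."""
    if s >= STRONG_CUTOFF:
        return "strong"
    if s >= POSSIBLE_CUTOFF:
        return "possible"
    return "weak"


def rank_candidates(
    source_name: str, candidates: List[str]
) -> List[Tuple[str, float, str]]:
    """Return (candidate, score, band) tuples, best match first."""
    scored = [(name, score(source_name, name)) for name in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(name, s, band(s)) for name, s in scored]


ranked = rank_candidates(
    "Sydney Roosters",
    ["Brisbane Broncos", "sydney  roosters", "Sydney Roosters FC"],
)
```

The bands are what the operator sees; the raw score stays available for the diagnostics panel.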

Speaker tips

Open with the problem, not the architecture. The exec half of the room needs to feel the pain before they care about the fix.
Land the "DS shadow canonicalisation" point hard. It reframes the work from "engineering cleanup" to "we've stopped paying for the same identifier problem twice." Worth saying out loud:

"What this means in practice: data science was solving the same identifier problem we were, in parallel, with their own logic. That's now infrastructure — they get our canonical ids the moment data lands, and their models train against the same ids the platform serves."
Land the "opaque to users" point explicitly on slide 1. It sets up both the demo and slide 3's self-serve bullet:

"Event sourcing wasn't the wrong technical choice — but it kept the mapping decisions inside the event log, where only engineers could see them. That's what we've fixed."
For the dev contingent, drop a one-liner like "we deleted ~3,000 lines of bespoke importer code and replaced it with one generic importer per entity type" — they'll perk up.
Don't get sucked into architecture before you've earned it with the problem statement and the demo.
Have a real, recognisable demo example loaded — abstract test data is forgettable.