Skip to content

Identity Linking — Problem & Solution Summary

Problem

LBS.Foundry ingests sports data from multiple providers (FoxSports, SuperCoach, StatsPerform, NRL Web, and — in the NFL 2026-2027 initiative — NflVerse, ESPN). The same real-world entity (player, team, fixture, competition, season, venue) has a different ID in each provider, so the system needs a reliable, auditable way to recognise that those IDs all refer to one canonical entity.

What broke before the rework

  • Duplicate aggregates per entity. Each provider created its own aggregate for the same real-world player/team/fixture, plus a canonical "LuckBox" aggregate. Analytics showed e.g. three Patrick Dangerfields instead of one.
  • Bespoke C# importers per provider. Adding a new provider meant writing a new importer class, a new command generator, and a new mapping builder — not a data-only change.
  • Event-sourced mapping layer. Provider-to-canonical links were captured via MapParticipantCommand / ParticipantMappedEvent (and equivalents for team/season/competition/sporting-event), rebuilt through AggregateRelationshipBuilder into a PostgreSQL AggregateRelations table. Dual-command handling (Map*Command plus the normal Create*Command) in every aggregate made the write path expensive to reason about.
  • Limited scope. AggregateRelations really only served participants well; teams, sporting events, competitions, seasons and venues had no consistent mapping story.
  • Manual work did not scale. Name + DOB matching happened in spreadsheets; each new partner produced a fresh round of ad-hoc code.

Solution

Push the mapping and provider data out of Marten/PostgreSQL and into ClickHouse, where data engineering already owns the ETL. Foundry reads both the standardized provider data and the crosswalk from ClickHouse, and resolves a single canonical aggregate per entity.

Architecture (what was shipped)

DATA ENGINEERING
  Transforms raw provider data into standardized ClickHouse tables
  Populates entity_mapping_crosswalk (provider id ↔ canonical id)

CLICKHOUSE (foundry database)
  participants | teams | competitions | seasons | sporting_events | venues
    - One table per entity type (not per provider)
    - `provider LowCardinality(String)` discriminator column
    - ORDER BY (sport, provider, provider_id); venues use (provider, provider_id)
    - Denormalized names (competition_name, season_name, home_team_name, etc.)
      so rows are self-describing with no JOINs required to render

  entity_mapping_crosswalk
    - canonical_id, entity_type, sport, provider, provider_id, display_name
    - ORDER BY (entity_type, provider, provider_id)
    - Bloom-filter skipping index on canonical_id for reverse lookups

FOUNDRY
  IEntityMappingService     → resolves canonical ↔ provider ids against crosswalk
  Generic per-entity importers → read any provider table, yield canonical commands
  Single aggregate per entity (no provider duplicates, no mapping events)
  /admin/crosswalk admin UI → view, link, and unlink provider mappings

Deliberate non-goals

These were weighed and rejected during brainstorming — the problem framing here should not be read as a case for any of them:

  • No confidence scoring. Data engineering owns match quality. The crosswalk records the decision, not a score.
  • No temporal validity / valid-from-valid-to. If a provider reassigns an ID, the crosswalk row is updated in place (ReplacingMergeTree on updated_at); historical replay is not a requirement.
  • No in-app fuzzy matcher, no AI matcher. The admin UI's "Add Link" dialog offers ngram-based candidate search + DOB match as assistance for the human operator, but matching decisions are made by data engineering upstream.
  • No dual aggregates, no event-sourced mapping layer. AggregateRelationshipBuilder, MapXCommand, XMappedEvent, and AggregateRelations were deleted. Canonical IDs are generated GUIDs owned by the crosswalk — not deterministic UUIDv5.
  • No backward-compatibility / dual-write window. Existing dual aggregates were discarded, the schema was recreated, and provider data was re-imported from a clean slate.

What changes for adding a new provider

Zero C# code. The process:

  1. Data engineering creates ETL that writes to the standard tables with provider = "Espn" (or whatever).
  2. Data engineering populates crosswalk entries mapping ESPN provider IDs to canonical IDs (reusing existing canonical IDs where the entity already exists in another provider).
  3. Run the existing generic importer with the new provider name. It reads from the same tables, resolves canonical IDs via the crosswalk, and emits standard domain commands.

No new importer class, no new provider folder, no deployment for the mapping.