Skip to content

Foundry Data Ingestion — Data Engineering Guide

Why This Approach

Data science and data engineering are the first pass on any new data source. By the time a provider's data reaches Foundry, your team has already done the hard work — researching the provider's API, understanding their data model, identifying the core fields, and mapping them to our domain. You know what a "participant" or "team" looks like for that provider better than anyone.

This architecture puts that expertise at the centre. Rather than encoding provider-specific knowledge in application code (which requires a developer, a PR, a deployment), the standardized provider tables let your team define what the data looks like and how it maps. The application just reads what you've prepared.

The result: - You control the data pipeline end-to-end — from raw provider API to standardized table to crosswalk mapping - New providers don't need code changes — your team adds tables and crosswalk entries, Foundry imports from them - Schema is consistent — every provider's participants table has the same columns, so your ETL tooling is reusable across providers - Identity resolution stays in your domain — you decide which FoxSports player is the same as which SuperCoach player, not application code

Overview

Foundry uses ClickHouse as the source of truth for external provider data and entity identity mapping. Data engineering is responsible for:

  1. Populating provider tables — standardized tables containing entity data from each provider
  2. Populating the crosswalk — the entity_mapping_crosswalk table that links provider IDs to canonical Foundry IDs

Foundry's generic importers read from these tables to create domain aggregates. Adding a new provider requires zero application code changes — just new table data and crosswalk entries.

Architecture

YOUR ETL PIPELINE
  Raw provider APIs/files → Transform → Load into ClickHouse provider tables
  Match entities across providers → Load into entity_mapping_crosswalk

CLICKHOUSE (foundry database)
  participants                 ← you populate these (one row per provider per entity)
  teams                        ← all tables carry a `provider` LowCardinality(String) column
  competitions
  seasons
  sporting_events
  venues
  entity_mapping_crosswalk     ← you populate this

FOUNDRY (reads from ClickHouse)
  Generic importers read provider tables + crosswalk → create domain aggregates

Schema design note. Every provider table is keyed by (sport, provider, provider_id) (or (provider, provider_id) for venues — no sport column). One table per entity type, not one per provider × entity. Rationale:

  • Schema evolution is a single ALTER TABLE per entity — not one per provider.
  • Onboarding a new provider is data-only — no DDL, no application change. Just start inserting rows with your new provider value.
  • Mirrors the entity_mapping_crosswalk layout (keyed by (entity_type, provider, provider_id)).
  • Cross-provider queries become one SELECT — no UNION ALL over N tables.
  • Per-provider queries still hit the primary index prefix, so performance is unchanged.

Process Maps

Onboarding a New Provider

flowchart TD
    A[Data Science researches new provider API] --> B[Identify core entities and fields]
    B --> C[Map provider fields to standardized schema]
    C --> D[Data Engineering builds ETL pipeline]
    D --> G[Populate entity tables with provider = 'newguy']
    G --> H{Entities already exist in other providers?}
    H -->|Yes| I[Match to existing canonical IDs via crosswalk]
    H -->|No| J[Generate new canonical IDs]
    I --> K[Insert crosswalk entries]
    J --> K
    K --> L[Trigger Foundry generic import]
    L --> M[Domain aggregates created with canonical IDs]

Ongoing Data Ingestion

flowchart TD
    A[Provider API / data feed updates] --> B[ETL transforms to standardized schema]
    B --> C[Upsert rows into provider tables]
    C --> D{New entities?}
    D -->|Yes| E[Add crosswalk entries - match or generate canonical IDs]
    D -->|No| F[Crosswalk already has mappings]
    E --> G[Trigger Foundry import]
    F --> G
    G --> H[Generic importer reads provider table]
    H --> I[Batch resolve canonical IDs from crosswalk]
    I --> J{Crosswalk entry exists?}
    J -->|Yes| K[Create/update domain aggregate]
    J -->|No| L[Skip - entity not mapped yet]

Adding a New Field

flowchart TD
    A[New field needed - e.g. home_ground on teams] --> B[Foundry developer creates ClickHouse migration]
    B --> C[Migration ALTERs all existing provider team tables]
    C --> D[Migration updates template for future providers]
    D --> E[Foundry developer updates domain types]
    E --> F[Deploy Foundry - migration runs on startup]
    F --> G[Data Engineering populates new column via ETL]
    G --> H[Next import picks up new field]

Cross-Provider Entity Resolution

flowchart TD
    A[FoxSports player 12345] --> D[Crosswalk]
    B[SuperCoach player nrl/2026/classic/1] --> D
    C[StatsPerform player SP-999] --> D
    D --> E[canonical_id: abc-123]
    E --> F[Single Foundry ParticipantAggregate]
    F --> G[ParticipantContract read model]

Provider Tables

Naming Convention

Tables are one-per-entity with a provider LowCardinality(String) discriminator column. Full list: foundry.participants, foundry.teams, foundry.competitions, foundry.seasons, foundry.sporting_events, foundry.venues.

Provider names are lowercase strings in the provider column: foxsports, statsperform, supercoach, espn, nflverse.

Entity tables (one per entity type): foundry.participants, foundry.teams, foundry.competitions, foundry.seasons, foundry.sporting_events, foundry.venues.

Each row carries a provider discriminator — e.g. a FoxSports participant is a row in foundry.participants with provider = 'foxsports'.

Table Schemas

All tables use ENGINE = ReplacingMergeTree(updated_at). Insert a row with the same ORDER BY key and a newer updated_at to update it — ClickHouse deduplicates on merge.

participants

Column Type Required Notes
provider_id String Yes The provider's unique ID for this player
first_name String Yes
surname String Yes
full_name String Yes e.g., "Patrick Dangerfield"
display_name String No Short form, e.g., "P. Dangerfield"
date_of_birth Nullable(Date) No Format: YYYY-MM-DD
nationality String No
height Nullable(Int32) No In centimetres
weight Nullable(Int32) No In kilograms
sport LowCardinality(String) Yes e.g., "RugbyLeague", "Cricket", "AmericanFootball"
status String No e.g., "Active", "Retired", "Injured"
updated_at DateTime64(3) Auto Defaults to now64() if omitted

ORDER BY: (sport, provider_id)

teams

Column Type Required Notes
provider_id String Yes
name String Yes e.g., "Melbourne Storm"
short_name String No e.g., "Storm"
full_name String No
code String No e.g., "MEL"
display_name String No
sport LowCardinality(String) Yes
updated_at DateTime64(3) Auto

ORDER BY: (sport, provider_id)

competitions

Column Type Required Notes
provider_id String Yes
name String Yes e.g., "NRL Telstra Premiership"
short_name String No e.g., "NRL"
code String No e.g., "NRL"
sport LowCardinality(String) Yes
updated_at DateTime64(3) Auto

ORDER BY: (sport, provider_id)

seasons

Column Type Required Notes
provider_id String Yes
season_name String Yes e.g., "2026 NRL Season"
season_year Int32 Yes e.g., 2026
competition_provider_id String Yes FK — must match a provider_id in competitions for the same provider
sport LowCardinality(String) Yes
updated_at DateTime64(3) Auto

ORDER BY: (sport, provider_id)

sporting_events

Column Type Required Notes
provider_id String Yes
competition_provider_id String Yes FK — must match a provider_id in competitions for the same provider
season_provider_id String Yes FK — must match a provider_id in seasons for the same provider
home_team_provider_id String Yes FK — must match a provider_id in teams for the same provider
away_team_provider_id String Yes FK — must match a provider_id in teams for the same provider
start_time_utc DateTime64(3) Yes UTC timestamp
venue_provider_id String No FK — must match a provider_id in venues for the same provider
group_name String No Round/group name, e.g., "Round 1", "Finals Week 1"
sport LowCardinality(String) Yes
updated_at DateTime64(3) Auto

ORDER BY: (sport, provider_id)

venues

Column Type Required Notes
provider_id String Yes
name String Yes e.g., "AAMI Park"
city String No
country String No
state String No
updated_at DateTime64(3) Auto

ORDER BY: (provider_id)

lineups

Per-game lineup membership — one row per (sporting event, team, participant). Unlike the six core tables this is an event-scoped membership fact (like squads is a season-scoped one). It is normally populated by a provider's canonicaliser (e.g. the ESPN scraper writes it from espn.event_participants), not hand-loaded.

Column Type Required Notes
provider_id String Yes Composite, e.g. {event}/{team}/{athlete} — unique per lineup member
sporting_event_provider_id String Yes FK — must match a provider_id in sporting_events for the same provider
team_provider_id String Yes FK — must match a provider_id in teams for the same provider
participant_provider_id String Yes FK — must match a provider_id in participants for the same provider
team_name String No Denormalised for a self-describing row
selection_role LowCardinality(String) Yes Starting, Bench, Interchange, Withdrawn, Replacement
position LowCardinality(String) No Raw provider position; the domain Position mapping is done downstream by the importer
jersey_number Nullable(Int32) No
type LowCardinality(String) No e.g. Player
sport LowCardinality(String) Yes
updated_at DateTime64(3) Auto

ORDER BY: (sport, provider, provider_id)

Important Notes

  • provider_id must be a string. Even if the provider uses numeric IDs, store them as strings (e.g., "12345" not 12345).
  • Cross-references use provider IDs, not canonical IDs. A sporting event's home_team_provider_id should be the same provider's team ID (e.g., if FoxSports calls Melbourne Storm "99", then rows in foundry.sporting_events with provider = 'foxsports' should have home_team_provider_id = "99").
  • Sport names must match Foundry's constants exactly: RugbyLeague, Cricket, AmericanFootball, AustralianFootball. Case-sensitive.
  • Updates are inserts. To update a player's name, insert a new row with the same (sport, provider, provider_id) and a newer updated_at. ClickHouse's ReplacingMergeTree deduplicates automatically.

Entity Mapping Crosswalk

The crosswalk links provider IDs to canonical Foundry IDs. This is where cross-provider entity resolution happens.

Schema

foundry.entity_mapping_crosswalk
Column Type Notes
canonical_id UUID The Foundry canonical ID — generate with generateUUIDv4()
entity_type LowCardinality(String) Participant, Team, Competition, Season, SportingEvent, Venue, Lineup
sport LowCardinality(String) Must match the sport in the provider table
provider LowCardinality(String) Must match the provider table name prefix (e.g., foxsports, statsperform)
provider_id String Must match the provider_id in the provider table
display_name String Human-readable name for admin UI
created_at DateTime64(3) Auto
updated_at DateTime64(3) Auto

ORDER BY: (entity_type, provider, provider_id)

How Canonical IDs Work

Each real-world entity (player, team, etc.) gets one canonical UUID. Multiple providers can map to the same canonical ID:

-- Patrick Dangerfield exists in 3 providers, all map to the same canonical_id
INSERT INTO foundry.entity_mapping_crosswalk VALUES
    ('a1b2c3d4-...', 'Participant', 'RugbyLeague', 'foxsports',    '12345',  'Patrick Dangerfield'),
    ('a1b2c3d4-...', 'Participant', 'RugbyLeague', 'supercoach',   'ABC42',  'Patrick Dangerfield'),
    ('a1b2c3d4-...', 'Participant', 'RugbyLeague', 'statsperform', 'SP-999', 'Patrick Dangerfield');

Rules: - Generate one canonical_id per real-world entity using generateUUIDv4() - All providers that refer to the same entity share the same canonical_id - Each (provider, provider_id) pair appears at most once per entity type - canonical_id values are permanent — never change them once assigned

Populating the Crosswalk

Step 1: Assign canonical IDs to your primary provider

Pick your most complete provider (e.g., FoxSports for NRL) and assign a canonical ID to each entity:

INSERT INTO foundry.entity_mapping_crosswalk
SELECT
    generateUUIDv4() AS canonical_id,
    'Participant' AS entity_type,
    sport,
    'foxsports' AS provider,
    provider_id,
    full_name AS display_name
FROM foundry.participants FINAL
WHERE provider = 'foxsports';

Step 2: Map secondary providers to the same canonical IDs

This is where your matching logic lives. Match by name, DOB, team, or whatever signals work for your data:

-- Example: match supercoach players to foxsports players by name + DOB
INSERT INTO foundry.entity_mapping_crosswalk
SELECT
    cx.canonical_id,
    'Participant' AS entity_type,
    sc.sport,
    'supercoach' AS provider,
    sc.provider_id,
    sc.full_name AS display_name
FROM foundry.participants sc FINAL
JOIN foundry.participants fox FINAL
    ON fox.provider = 'foxsports'
    AND lower(sc.full_name) = lower(fox.full_name)
    AND sc.date_of_birth = fox.date_of_birth
    AND sc.sport = fox.sport
JOIN foundry.entity_mapping_crosswalk cx FINAL
    ON cx.provider = 'foxsports'
    AND cx.provider_id = fox.provider_id
    AND cx.entity_type = 'Participant'
WHERE sc.provider = 'supercoach';

Step 3: Handle unmatched entities

Entities that exist in a secondary provider but not in the primary get their own canonical ID:

-- SuperCoach players with no FoxSports match get new canonical IDs
INSERT INTO foundry.entity_mapping_crosswalk
SELECT
    generateUUIDv4() AS canonical_id,
    'Participant' AS entity_type,
    sc.sport,
    'supercoach' AS provider,
    sc.provider_id,
    sc.full_name AS display_name
FROM foundry.participants sc FINAL
LEFT JOIN foundry.entity_mapping_crosswalk cx FINAL
    ON cx.provider = 'supercoach'
    AND cx.provider_id = sc.provider_id
    AND cx.entity_type = 'Participant'
WHERE sc.provider = 'supercoach'
  AND cx.canonical_id IS NULL;

Import Order

Foundry imports entities in this order due to cross-references:

Competition -> Season -> Team -> Venue -> SportingEvent -> Participant

Ensure crosswalk entries exist for parent entities before child entities. For example, seasons reference competitions — the competition's crosswalk entry must exist before the season import runs.

Multiple IDs Per Provider

Some providers have multiple ID schemes for the same entity. For example, SuperCoach player IDs are only unique within a context of sport + year + game type — the raw integer 1 could be different players depending on the parameters.

Convention: Use Composite IDs

Store the most specific, globally unique form as provider_id in the provider table. For SuperCoach, this means the composite form:

provider_id = "nrl/2026/classic/1"    (not just "1")

The composite form is unique across all seasons and game types.

Multiple Crosswalk Entries for Alternate IDs

If downstream systems reference entities by different ID forms, add additional crosswalk rows for each form. All map to the same canonical_id:

-- Same player, same provider, multiple ID forms
INSERT INTO foundry.entity_mapping_crosswalk VALUES
    ('abc-123...', 'Participant', 'RugbyLeague', 'supercoach', 'nrl/2026/classic/1', 'Reece Walsh'),
    ('abc-123...', 'Participant', 'RugbyLeague', 'supercoach', 'nrl/2025/classic/1', 'Reece Walsh');

Any lookup by any of these IDs resolves to the same canonical entity.

Rules

  • The provider table (supercoach_participants) stores the primary composite ID in provider_id
  • The crosswalk can have multiple rows per (provider, canonical_id) with different provider_id values
  • Each (entity_type, provider, provider_id) combination is unique — no duplicates
  • Downstream consumers query with whatever ID form they have — they don't need to know the scheme
  • The canonical ID stays the same regardless of which ID form was used to look it up

Seasonal ID Changes

When a provider reuses IDs across seasons (same raw ID, different player), use composite IDs that include the season context. Each season's composite ID gets its own crosswalk entry pointing to the correct canonical ID:

-- Player 1 is Reece Walsh in 2026 but was someone else in 2024
('abc-123...', 'Participant', 'RugbyLeague', 'supercoach', 'nrl/2026/classic/1', 'Reece Walsh'),
('def-456...', 'Participant', 'RugbyLeague', 'supercoach', 'nrl/2024/classic/1', 'Other Player'),

Adding a New Provider

  1. Choose a provider name — lowercase, no spaces (e.g., espn, nflverse)
  2. Tables are created automatically — Foundry creates the 6 consolidated tables on startup. Post-V004, no provider list exists in code. To onboard a new provider, just start inserting rows with your new provider value — no DDL, no application change.
  3. Populate the provider tables — ETL your provider's data into the standardized tables
  4. Populate the crosswalk — Match entities to existing canonical IDs or generate new ones
  5. Trigger an import in Foundry — The generic importers will read your tables and create domain aggregates

Validation Queries

-- Check crosswalk coverage: how many entities per provider?
SELECT entity_type, provider, count() AS entries
FROM foundry.entity_mapping_crosswalk FINAL
GROUP BY entity_type, provider
ORDER BY entity_type, provider;

-- Find entities with only one provider (potential missing mappings)
SELECT canonical_id, entity_type, sport, count() AS provider_count
FROM foundry.entity_mapping_crosswalk FINAL
GROUP BY canonical_id, entity_type, sport
HAVING provider_count = 1
ORDER BY entity_type;

-- Check for orphaned provider table entries (no crosswalk mapping)
SELECT fp.provider, fp.provider_id, fp.full_name, fp.sport
FROM foundry.participants fp FINAL
LEFT JOIN foundry.entity_mapping_crosswalk cx FINAL
    ON cx.provider = fp.provider
    AND cx.provider_id = fp.provider_id
    AND cx.entity_type = 'Participant'
WHERE fp.provider = 'foxsports'
  AND cx.canonical_id IS NULL;

-- Check for crosswalk entries with no provider table data
SELECT cx.provider, cx.provider_id, cx.display_name
FROM foundry.entity_mapping_crosswalk cx FINAL
LEFT JOIN foundry.participants fp FINAL
    ON fp.provider = cx.provider
    AND fp.provider_id = cx.provider_id
WHERE cx.provider = 'foxsports'
    AND cx.entity_type = 'Participant'
    AND fp.provider_id IS NULL;

Mapped Lineups Saved Query (LBS-1756)

Foundry's lineup importer reads foundry.lineups joined with the crosswalk to resolve the canonical SportingEvent / Team / Participant ids, via a ClickHouse Cloud saved query. The code references it through ClickHouseQueries.QueryIds.MappedLineupMembers.

To activate it: create the saved query below in ClickHouse Cloud, then replace that placeholder with the saved query's GUID (the GUID is assigned by ClickHouse Cloud — it cannot be set from code).

Parameterized by provider_name + sport; the projected columns map 1:1 to MappedLineupRow:

SELECT
    l.provider                     AS provider,
    l.provider_id                  AS provider_id,
    l.sporting_event_provider_id   AS sporting_event_provider_id,
    l.team_provider_id             AS team_provider_id,
    l.participant_provider_id      AS participant_provider_id,
    l.team_name                    AS team_name,
    l.selection_role               AS selection_role,
    l.position                     AS position,
    l.jersey_number                AS jersey_number,
    l.type                         AS type,
    l.sport                        AS sport,
    se.canonical_id                AS canonical_sporting_event_id,
    tm.canonical_id                AS canonical_team_id,
    pt.canonical_id                AS canonical_participant_id
FROM foundry.lineups l FINAL
LEFT JOIN foundry.entity_mapping_crosswalk se FINAL
    ON se.entity_type = 'SportingEvent'
   AND se.provider = l.provider AND se.provider_id = l.sporting_event_provider_id AND se.sport = l.sport
LEFT JOIN foundry.entity_mapping_crosswalk tm FINAL
    ON tm.entity_type = 'Team'
   AND tm.provider = l.provider AND tm.provider_id = l.team_provider_id AND tm.sport = l.sport
LEFT JOIN foundry.entity_mapping_crosswalk pt FINAL
    ON pt.entity_type = 'Participant'
   AND pt.provider = l.provider AND pt.provider_id = l.participant_provider_id AND pt.sport = l.sport
WHERE l.provider = {provider_name:String}
  AND l.sport = {sport:String}

The canonical-id columns are nullable: a LEFT JOIN miss (the SE/Team/Participant not yet resolved in the crosswalk) leaves them NULL, which the importer treats as "not mapped yet — skip". This mirrors the existing MappedPlayers saved query.

FAQ

Q: Can I update a player's name? A: Insert a new row with the same (sport, provider_id). ReplacingMergeTree keeps the latest updated_at.

Q: What if I map a provider ID to the wrong canonical ID? A: Delete the old row and insert a new one. Or use the admin UI at /admin/crosswalk to unlink and relink.

Q: What if a player's details change (e.g., name, status, height)? A: Update the player's row in the provider table (insert a new row with the same provider_id and a newer updated_at). The canonical ID stays the same — it represents the player, not their current details.

Q: Do I need to populate all 6 entity types? A: No. Populate what your provider has. Foundry's importers skip entity types with no data.

Q: What sport names does Foundry accept? A: RugbyLeague, Cricket, AmericanFootball, AustralianFootball. Case-sensitive. Check SportsDataProviders for the latest list.

Q: We need to add a new field to teams (e.g., home_ground). How does that flow through? A: This is a coordinated change across ClickHouse and Foundry: 1. Foundry developer creates a ClickHouse migration that runs a single ALTER TABLE foundry.teams ADD COLUMN ... (post-V004 there's only one table per entity, so one DDL covers every provider) 2. Foundry developer updates the ProviderTeam read model, CreateTeamCommand, TeamCreatedEvent, and TeamContract to include the new field 3. Data engineering populates the new column in each provider's teams table via your ETL 4. Next import picks up the new field and flows it into the domain

The key point: the column must exist in the ClickHouse table before you can populate it. The migration handles that. You don't need to wait for a Foundry deployment to start preparing your data — just populate the column once the migration runs.

Q: A provider has multiple IDs for the same entity (e.g., SuperCoach composite IDs). How do I handle that? A: Use the composite/unique form as provider_id in the provider table. Add additional crosswalk rows for each alternate ID form, all pointing to the same canonical_id. See the "Multiple IDs Per Provider" section above.

Q: How do I handle provider IDs that change? A: Delete the old crosswalk entry and add a new one with the same canonical_id but the new provider_id. The canonical ID is permanent — only provider IDs change.

Q: A provider reuses the same raw integer ID for different players across seasons. What do I do? A: Build a composite provider_id that includes the season context (e.g., nrl/2026/classic/1). Each season's composite ID gets its own crosswalk entry pointing to the correct canonical entity.

Canonical-ID Scheme (effective 2026-05-20)

Foundry canonicalisation uses a deterministic UUIDv5 over a structured, name-free key. DE owns canonicalisation end-to-end via the Python SDK (foundry_sdk.canonical_ids); Foundry's C# code computes the same canonical_ids via LBS.Anchor.CanonicalIds.Mint. Both sides assert against the same gold-fixture JSON file, so any recipe drift surfaces immediately. Divergence here is the duplicate-identity failure mode the system was rebuilt to avoid.

Namespace

FOUNDRY_CANONICAL_NS = a8441375-cf92-4c19-9a20-59d33b9b967f

Permanent. Do not change without a coordinated scheme transition.

Key shape

key          = "<entity_type>|<sport_lower_or_empty>|<anchor_provider_lower>|<anchor_pid>[|<parent_canon>]"
canonical_id = uuid5(FOUNDRY_CANONICAL_NS, key)

Per-entity recipes

entity_type key
Participant Participant\|{sport_lower}\|{anchor_lower}\|{anchor_pid}
Team Team\|{sport_lower}\|{anchor_lower}\|{anchor_pid}
Competition Competition\|{sport_lower}\|{anchor_lower}\|{anchor_pid}
Season Season\|{sport_lower}\|{anchor_lower}\|{anchor_pid}\|{competition_canonical_id_lowercase_dashed}
SportingEvent SportingEvent\|{sport_lower}\|{anchor_lower}\|{anchor_pid}
Venue Venue\|\|{anchor_lower}\|{anchor_pid} (empty sport segment)

Normalisation

  • UTF-8 byte encoding.
  • Separator is the pipe character |. Reserved: anchor_provider and anchor_pid must not contain a pipe.
  • entity_type is PascalCase (Participant, Team, ...).
  • sport is the SDK Sport.Name lowercased (rugbyleague, cricket). The crosswalk's sport column stays PascalCase; only the hash key is lowercase.
  • anchor_provider is the provider name lowercased (foxsports, nrlweb, statsperform, supercoach).
  • anchor_pid is the source provider's provider_id verbatim — no trimming or case-folding.
  • parent_canon (Season only) is the parent's canonical_id in lowercase dashed form, xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.

Anchor registry (planned, not yet shipped)

Today the anchor provider is foxsports-by-convention — DE callers pass anchor_provider="foxsports" directly to mint_canonical_id(...). There is no DE-owned (entity_type, sport) -> anchor_provider table yet; the planned foundry.entity_resolution_anchor table (spec §3) lands in a later PR. Anchor switches will be ordered procedures — see spec §5.

Gold fixtures

The forward-only test fixtures live at src/Tests/LBS.UnitTests/EntityResolution/Fixtures/canonical-id-gold-fixtures.json. DE's ETL test suite should assert against the same fixture file so any recipe drift surfaces on both sides immediately.

Dual-Recipe Migration (effective 2026-05-28)

Read this before any foxsports backfill. The naive workflow — compute the new deterministic UUIDv5 for every foxsports row and write it to the crosswalk — would mint parallel canonical_ids alongside the historical IDs already there. The result is identity duplication. The dual-recipe workflow below avoids it.

Why foxsports is special

Every existing foxsports canonical_id in foundry.entity_mapping_crosswalk was originally minted by the C# XId.From(...) static methods under per-entity-type namespaces — the legacy recipe. Those IDs are deterministic UUIDv5s, just from a different namespace and key shape than the new recipe.

For a brand-new anchor provider (one DE is onboarding now, never seen by the legacy pipeline), no rows exist; use the new recipe directly. But for foxsports — and for any historical sport whose entities are already in the crosswalk — DE must check whether the legacy ID already exists before minting a new deterministic one. If it exists, reuse it; the legacy ID is still the source of truth for downstream code that has not yet been retired off XId.From(...) (see LBS-1663).

Legacy recipe table

Per-entity-type namespaces, dash-separated keys:

entity_type namespace key
Participant c09d16cc-370a-4c8b-b40f-002b791a7419 {provider_id}-{sport}
Team fa2d3c2a-aa3f-4de3-bdef-692c27ef6874 {provider_id}-{sport}-{name}
Competition a9321893-d6d4-4f69-8087-8d7d460bf682 {provider_id}-{sport}-{name}
Season 669d4c2c-f262-45ef-8e0e-6c4023abaf00 {competition_canon}-{provider_id}-{sport}-{name}
SportingEvent f31ad7d3-5ebd-4999-8039-36a2d5474ced {provider_id}-{sport}-{home_team}-v-{away_team}
Venue 1775701f-28e3-480e-8ed3-56aaefb9d3d1 {provider_id}

Gotchas worth knowing

  • Legacy recipes do not include the provider in the hash. A (foxsports, 121727) and (supercoach, 121727) participant with the same sport produce the same legacy canonical_id. In practice this never bit because only foxsports ever minted; other providers were linked, not minted.
  • Team / Competition / Season / SportingEvent recipes include the name — a rename changes the legacy ID. Use the EXACT name that was used at original mint time. For active backfills this typically means joining against the original feed snapshot, not the current foundry.<table>.
  • Season chains on the parent Competition's legacy canonical_id. Compute the parent's legacy ID first and pass it as the parent_competition_canon arg.
  • Venue legacy key is just the provider_id — no sport, no name. High collision risk if multi-provider Venues ever existed.

Use mint_canonical_id with the legacy_lookup callback. A single call: 1. computes the legacy recipe ID; 2. asks your callback whether that ID is already in entity_mapping_crosswalk; 3. returns the legacy ID if hit, otherwise mints via the new recipe.

from foundry_sdk import mint_canonical_id

def crosswalk_has(canonical_id):
    rows = ch.query(
        "SELECT 1 FROM foundry.entity_mapping_crosswalk FINAL "
        "WHERE canonical_id = %s LIMIT 1",
        (str(canonical_id),)).result_rows
    return bool(rows)

cid = mint_canonical_id(
    entity_type="Participant",
    sport="RugbyLeague",
    anchor_provider="foxsports",
    anchor_pid="121727",
    legacy_lookup=crosswalk_has,
)
# For 121727: returns 8f30a360-ae58-527a-a252-8c8d799b67fd (legacy)
# For a brand-new pid never seen before: returns the new-recipe UUID

For Team / Competition / Season / SportingEvent the call needs the legacy inputs too — pass legacy_name (Team / Competition / Season), legacy_parent_competition_canon (Season), legacy_home_team and legacy_away_team (SportingEvent). Omitted required inputs silently fall through to the new recipe.

Backfill loop

from foundry_sdk import mint_canonical_id

rows = ch.query(
    "SELECT provider_id, full_name FROM foundry.participants FINAL "
    "WHERE provider = 'foxsports' AND sport = 'RugbyLeague' "
    "  AND (lower(provider), provider_id) NOT IN ("
    "      SELECT lower(provider), provider_id FROM foundry.entity_mapping_crosswalk FINAL"
    "      WHERE entity_type = 'Participant' AND sport = 'RugbyLeague')"
).result_rows

for provider_id, full_name in rows:
    cid = mint_canonical_id(
        "Participant", "RugbyLeague", "foxsports", str(provider_id),
        legacy_lookup=crosswalk_has,
    )
    ch.insert(
        "foundry.entity_mapping_crosswalk",
        [(str(cid), 'Participant', 'RugbyLeague', 'foxsports', str(provider_id), full_name)],
        column_names=['canonical_id', 'entity_type', 'sport', 'provider', 'provider_id', 'display_name'],
    )

New anchor provider — no legacy step

For an entity_type / sport / provider combination with no historical crosswalk rows, skip the legacy_lookup and use the new recipe directly:

cid = mint_canonical_id(
    entity_type="Participant",
    sport="<new_sport>",
    anchor_provider="<new_anchor>",
    anchor_pid="42",
)

Direct legacy computation

If DE wants to compute a legacy canonical_id without wrapping it in a mint flow — e.g. to verify a row already in the crosswalk matches the legacy recipe — call legacy_canonical_id directly:

from foundry_sdk import legacy_canonical_id

legacy_id = legacy_canonical_id(
    "Participant", provider_id="121727", sport="RugbyLeague",
)
# 8f30a360-ae58-527a-a252-8c8d799b67fd

C# equivalent

LBS.Anchor.CanonicalIds exposes the same two-step process Python ETL uses: compute the canonical_id via the SDK (with an optional legacy-lookup callback), then write the crosswalk row separately. Two overloads — sync Mint(...) for in-memory / test lookups, async MintAsync(...) for awaitable DB-backed lookups:

using LBS.Anchor;

// The caller supplies the crosswalk-existence check — there is no built-in
// "is this canonical_id present" helper on IEntityMappingService. A typical
// implementation runs a `SELECT 1 FROM foundry.entity_mapping_crosswalk
// FINAL WHERE canonical_id = @id LIMIT 1` against your IClickHouseRepository.
// `CrosswalkContains` below is a stand-in for that caller-owned helper.
//
// The SDK is sync/async-agnostic at this boundary; supply a
// Func<Guid, Task<bool>> for MintAsync or Func<Guid, bool> for Mint.
var resolved = await CanonicalIds.MintAsync(
    entityType: "Participant",
    sport: "RugbyLeague",
    anchorProvider: "foxsports",
    anchorPid: "121727",
    legacyLookup: id => CrosswalkContains(id, ct));   // your helper

// Then write the crosswalk row via the public RegisterMappingAsync
// (idempotent on the same canonical_id, throws CanonicalIdConflictException
// on a different one for the same key).
await mappingService.RegisterMappingAsync(
    resolved,
    entityType: "Participant",
    provider: "foxsports",
    providerId: "121727",
    sport: "RugbyLeague",
    displayName: "Sam Verrills");

For tests / scripts where the lookup state is in-memory, use the sync overload with a HashSet-backed closure — same trick the Python tests use with lambda c: c == expected:

var existing = new HashSet<Guid> { /* canonical_ids already in the crosswalk */ };
var resolved = CanonicalIds.Mint(
    entityType: "Participant",
    sport: "RugbyLeague",
    anchorProvider: "foxsports",
    anchorPid: "121727",
    legacyLookup: candidate => existing.Contains(candidate));

Pure-recipe sync helpers (no DB, no callback) are also available for direct computation:

var legacyId = CanonicalIds.LegacyMint("Participant", "121727", "RugbyLeague");
var newId    = CanonicalIds.Mint("Participant", "RugbyLeague", "foxsports", "121727");

Cross-language parity

Both implementations are pinned by src/Tests/LBS.UnitTests/EntityResolution/Fixtures/canonical-id-legacy-gold-fixtures.json. C# GoldFixturesTests.LegacyGoldFixtures_ProduceExpectedCanonicalIds and Python test_legacy_gold_fixtures_match_against_repo_file_if_available both load and assert against it. A drift between the two implementations fails the test on the affected side.

Permanent invariants (legacy)

  • All 6 legacy namespaces are locked — changing any of them would invalidate every existing crosswalk row under that entity type.
  • For foxsports backfills, the legacy_lookup pattern is the recommended default. For brand-new anchor providers, plain mint_canonical_id is correct.
  • canonical_id values, once written, never change.