Foundry Data Ingestion — Data Engineering Guide¶
Why This Approach¶
Data science and data engineering are the first pass on any new data source. By the time a provider's data reaches Foundry, your team has already done the hard work — researching the provider's API, understanding their data model, identifying the core fields, and mapping them to our domain. You know what a "participant" or "team" looks like for that provider better than anyone.
This architecture puts that expertise at the centre. Rather than encoding provider-specific knowledge in application code (which requires a developer, a PR, a deployment), the standardized provider tables let your team define what the data looks like and how it maps. The application just reads what you've prepared.
The result: - You control the data pipeline end-to-end — from raw provider API to standardized table to crosswalk mapping - New providers don't need code changes — your team adds tables and crosswalk entries, Foundry imports from them - Schema is consistent — every provider's participants table has the same columns, so your ETL tooling is reusable across providers - Identity resolution stays in your domain — you decide which FoxSports player is the same as which SuperCoach player, not application code
Overview¶
Foundry uses ClickHouse as the source of truth for external provider data and entity identity mapping. Data engineering is responsible for:
- Populating provider tables — standardized tables containing entity data from each provider
- Populating the crosswalk — the
entity_mapping_crosswalktable that links provider IDs to canonical Foundry IDs
Foundry's generic importers read from these tables to create domain aggregates. Adding a new provider requires zero application code changes — just new table data and crosswalk entries.
Architecture¶
YOUR ETL PIPELINE
Raw provider APIs/files → Transform → Load into ClickHouse provider tables
Match entities across providers → Load into entity_mapping_crosswalk
CLICKHOUSE (foundry database)
participants ← you populate these (one row per provider per entity)
teams ← all tables carry a `provider` LowCardinality(String) column
competitions
seasons
sporting_events
venues
entity_mapping_crosswalk ← you populate this
FOUNDRY (reads from ClickHouse)
Generic importers read provider tables + crosswalk → create domain aggregates
Schema design note. Every provider table is keyed by (sport, provider, provider_id) (or (provider, provider_id) for venues — no sport column). One table per entity type, not one per provider × entity. Rationale:
- Schema evolution is a single
ALTER TABLEper entity — not one per provider. - Onboarding a new provider is data-only — no DDL, no application change. Just start inserting rows with your new
providervalue. - Mirrors the
entity_mapping_crosswalklayout (keyed by(entity_type, provider, provider_id)). - Cross-provider queries become one SELECT — no UNION ALL over N tables.
- Per-provider queries still hit the primary index prefix, so performance is unchanged.
Process Maps¶
Onboarding a New Provider¶
flowchart TD
A[Data Science researches new provider API] --> B[Identify core entities and fields]
B --> C[Map provider fields to standardized schema]
C --> D[Data Engineering builds ETL pipeline]
D --> G[Populate entity tables with provider = 'newguy']
G --> H{Entities already exist in other providers?}
H -->|Yes| I[Match to existing canonical IDs via crosswalk]
H -->|No| J[Generate new canonical IDs]
I --> K[Insert crosswalk entries]
J --> K
K --> L[Trigger Foundry generic import]
L --> M[Domain aggregates created with canonical IDs]
Ongoing Data Ingestion¶
flowchart TD
A[Provider API / data feed updates] --> B[ETL transforms to standardized schema]
B --> C[Upsert rows into provider tables]
C --> D{New entities?}
D -->|Yes| E[Add crosswalk entries - match or generate canonical IDs]
D -->|No| F[Crosswalk already has mappings]
E --> G[Trigger Foundry import]
F --> G
G --> H[Generic importer reads provider table]
H --> I[Batch resolve canonical IDs from crosswalk]
I --> J{Crosswalk entry exists?}
J -->|Yes| K[Create/update domain aggregate]
J -->|No| L[Skip - entity not mapped yet]
Adding a New Field¶
flowchart TD
A[New field needed - e.g. home_ground on teams] --> B[Foundry developer creates ClickHouse migration]
B --> C[Migration ALTERs all existing provider team tables]
C --> D[Migration updates template for future providers]
D --> E[Foundry developer updates domain types]
E --> F[Deploy Foundry - migration runs on startup]
F --> G[Data Engineering populates new column via ETL]
G --> H[Next import picks up new field]
Cross-Provider Entity Resolution¶
flowchart TD
A[FoxSports player 12345] --> D[Crosswalk]
B[SuperCoach player nrl/2026/classic/1] --> D
C[StatsPerform player SP-999] --> D
D --> E[canonical_id: abc-123]
E --> F[Single Foundry ParticipantAggregate]
F --> G[ParticipantContract read model]
Provider Tables¶
Naming Convention¶
Tables are one-per-entity with a provider LowCardinality(String) discriminator column. Full list: foundry.participants, foundry.teams, foundry.competitions, foundry.seasons, foundry.sporting_events, foundry.venues.
Provider names are lowercase strings in the provider column: foxsports, statsperform, supercoach, espn, nflverse.
Entity tables (one per entity type): foundry.participants, foundry.teams, foundry.competitions, foundry.seasons, foundry.sporting_events, foundry.venues.
Each row carries a provider discriminator — e.g. a FoxSports participant is a row in foundry.participants with provider = 'foxsports'.
Table Schemas¶
All tables use ENGINE = ReplacingMergeTree(updated_at). Insert a row with the same ORDER BY key and a newer updated_at to update it — ClickHouse deduplicates on merge.
participants¶
| Column | Type | Required | Notes |
|---|---|---|---|
| provider_id | String | Yes | The provider's unique ID for this player |
| first_name | String | Yes | |
| surname | String | Yes | |
| full_name | String | Yes | e.g., "Patrick Dangerfield" |
| display_name | String | No | Short form, e.g., "P. Dangerfield" |
| date_of_birth | Nullable(Date) | No | Format: YYYY-MM-DD |
| nationality | String | No | |
| height | Nullable(Int32) | No | In centimetres |
| weight | Nullable(Int32) | No | In kilograms |
| sport | LowCardinality(String) | Yes | e.g., "RugbyLeague", "Cricket", "AmericanFootball" |
| status | String | No | e.g., "Active", "Retired", "Injured" |
| updated_at | DateTime64(3) | Auto | Defaults to now64() if omitted |
ORDER BY: (sport, provider_id)
teams¶
| Column | Type | Required | Notes |
|---|---|---|---|
| provider_id | String | Yes | |
| name | String | Yes | e.g., "Melbourne Storm" |
| short_name | String | No | e.g., "Storm" |
| full_name | String | No | |
| code | String | No | e.g., "MEL" |
| display_name | String | No | |
| sport | LowCardinality(String) | Yes | |
| updated_at | DateTime64(3) | Auto |
ORDER BY: (sport, provider_id)
competitions¶
| Column | Type | Required | Notes |
|---|---|---|---|
| provider_id | String | Yes | |
| name | String | Yes | e.g., "NRL Telstra Premiership" |
| short_name | String | No | e.g., "NRL" |
| code | String | No | e.g., "NRL" |
| sport | LowCardinality(String) | Yes | |
| updated_at | DateTime64(3) | Auto |
ORDER BY: (sport, provider_id)
seasons¶
| Column | Type | Required | Notes |
|---|---|---|---|
| provider_id | String | Yes | |
| season_name | String | Yes | e.g., "2026 NRL Season" |
| season_year | Int32 | Yes | e.g., 2026 |
| competition_provider_id | String | Yes | FK — must match a provider_id in competitions for the same provider |
| sport | LowCardinality(String) | Yes | |
| updated_at | DateTime64(3) | Auto |
ORDER BY: (sport, provider_id)
sporting_events¶
| Column | Type | Required | Notes |
|---|---|---|---|
| provider_id | String | Yes | |
| competition_provider_id | String | Yes | FK — must match a provider_id in competitions for the same provider |
| season_provider_id | String | Yes | FK — must match a provider_id in seasons for the same provider |
| home_team_provider_id | String | Yes | FK — must match a provider_id in teams for the same provider |
| away_team_provider_id | String | Yes | FK — must match a provider_id in teams for the same provider |
| start_time_utc | DateTime64(3) | Yes | UTC timestamp |
| venue_provider_id | String | No | FK — must match a provider_id in venues for the same provider |
| group_name | String | No | Round/group name, e.g., "Round 1", "Finals Week 1" |
| sport | LowCardinality(String) | Yes | |
| updated_at | DateTime64(3) | Auto |
ORDER BY: (sport, provider_id)
venues¶
| Column | Type | Required | Notes |
|---|---|---|---|
| provider_id | String | Yes | |
| name | String | Yes | e.g., "AAMI Park" |
| city | String | No | |
| country | String | No | |
| state | String | No | |
| updated_at | DateTime64(3) | Auto |
ORDER BY: (provider_id)
lineups¶
Per-game lineup membership — one row per (sporting event, team, participant). Unlike the six core
tables this is an event-scoped membership fact (like squads is a season-scoped one). It is normally
populated by a provider's canonicaliser (e.g. the ESPN scraper writes it from espn.event_participants),
not hand-loaded.
| Column | Type | Required | Notes |
|---|---|---|---|
| provider_id | String | Yes | Composite, e.g. {event}/{team}/{athlete} — unique per lineup member |
| sporting_event_provider_id | String | Yes | FK — must match a provider_id in sporting_events for the same provider |
| team_provider_id | String | Yes | FK — must match a provider_id in teams for the same provider |
| participant_provider_id | String | Yes | FK — must match a provider_id in participants for the same provider |
| team_name | String | No | Denormalised for a self-describing row |
| selection_role | LowCardinality(String) | Yes | Starting, Bench, Interchange, Withdrawn, Replacement |
| position | LowCardinality(String) | No | Raw provider position; the domain Position mapping is done downstream by the importer |
| jersey_number | Nullable(Int32) | No | |
| type | LowCardinality(String) | No | e.g. Player |
| sport | LowCardinality(String) | Yes | |
| updated_at | DateTime64(3) | Auto |
ORDER BY: (sport, provider, provider_id)
Important Notes¶
provider_idmust be a string. Even if the provider uses numeric IDs, store them as strings (e.g.,"12345"not12345).- Cross-references use provider IDs, not canonical IDs. A sporting event's
home_team_provider_idshould be the same provider's team ID (e.g., if FoxSports calls Melbourne Storm"99", then rows infoundry.sporting_eventswithprovider = 'foxsports'should havehome_team_provider_id = "99"). - Sport names must match Foundry's constants exactly:
RugbyLeague,Cricket,AmericanFootball,AustralianFootball. Case-sensitive. - Updates are inserts. To update a player's name, insert a new row with the same
(sport, provider, provider_id)and a newerupdated_at. ClickHouse'sReplacingMergeTreededuplicates automatically.
Entity Mapping Crosswalk¶
The crosswalk links provider IDs to canonical Foundry IDs. This is where cross-provider entity resolution happens.
Schema¶
| Column | Type | Notes |
|---|---|---|
| canonical_id | UUID | The Foundry canonical ID — generate with generateUUIDv4() |
| entity_type | LowCardinality(String) | Participant, Team, Competition, Season, SportingEvent, Venue, Lineup |
| sport | LowCardinality(String) | Must match the sport in the provider table |
| provider | LowCardinality(String) | Must match the provider table name prefix (e.g., foxsports, statsperform) |
| provider_id | String | Must match the provider_id in the provider table |
| display_name | String | Human-readable name for admin UI |
| created_at | DateTime64(3) | Auto |
| updated_at | DateTime64(3) | Auto |
ORDER BY: (entity_type, provider, provider_id)
How Canonical IDs Work¶
Each real-world entity (player, team, etc.) gets one canonical UUID. Multiple providers can map to the same canonical ID:
-- Patrick Dangerfield exists in 3 providers, all map to the same canonical_id
INSERT INTO foundry.entity_mapping_crosswalk VALUES
('a1b2c3d4-...', 'Participant', 'RugbyLeague', 'foxsports', '12345', 'Patrick Dangerfield'),
('a1b2c3d4-...', 'Participant', 'RugbyLeague', 'supercoach', 'ABC42', 'Patrick Dangerfield'),
('a1b2c3d4-...', 'Participant', 'RugbyLeague', 'statsperform', 'SP-999', 'Patrick Dangerfield');
Rules:
- Generate one canonical_id per real-world entity using generateUUIDv4()
- All providers that refer to the same entity share the same canonical_id
- Each (provider, provider_id) pair appears at most once per entity type
- canonical_id values are permanent — never change them once assigned
Populating the Crosswalk¶
Step 1: Assign canonical IDs to your primary provider¶
Pick your most complete provider (e.g., FoxSports for NRL) and assign a canonical ID to each entity:
INSERT INTO foundry.entity_mapping_crosswalk
SELECT
generateUUIDv4() AS canonical_id,
'Participant' AS entity_type,
sport,
'foxsports' AS provider,
provider_id,
full_name AS display_name
FROM foundry.participants FINAL
WHERE provider = 'foxsports';
Step 2: Map secondary providers to the same canonical IDs¶
This is where your matching logic lives. Match by name, DOB, team, or whatever signals work for your data:
-- Example: match supercoach players to foxsports players by name + DOB
INSERT INTO foundry.entity_mapping_crosswalk
SELECT
cx.canonical_id,
'Participant' AS entity_type,
sc.sport,
'supercoach' AS provider,
sc.provider_id,
sc.full_name AS display_name
FROM foundry.participants sc FINAL
JOIN foundry.participants fox FINAL
ON fox.provider = 'foxsports'
AND lower(sc.full_name) = lower(fox.full_name)
AND sc.date_of_birth = fox.date_of_birth
AND sc.sport = fox.sport
JOIN foundry.entity_mapping_crosswalk cx FINAL
ON cx.provider = 'foxsports'
AND cx.provider_id = fox.provider_id
AND cx.entity_type = 'Participant'
WHERE sc.provider = 'supercoach';
Step 3: Handle unmatched entities¶
Entities that exist in a secondary provider but not in the primary get their own canonical ID:
-- SuperCoach players with no FoxSports match get new canonical IDs
INSERT INTO foundry.entity_mapping_crosswalk
SELECT
generateUUIDv4() AS canonical_id,
'Participant' AS entity_type,
sc.sport,
'supercoach' AS provider,
sc.provider_id,
sc.full_name AS display_name
FROM foundry.participants sc FINAL
LEFT JOIN foundry.entity_mapping_crosswalk cx FINAL
ON cx.provider = 'supercoach'
AND cx.provider_id = sc.provider_id
AND cx.entity_type = 'Participant'
WHERE sc.provider = 'supercoach'
AND cx.canonical_id IS NULL;
Import Order¶
Foundry imports entities in this order due to cross-references:
Competition -> Season -> Team -> Venue -> SportingEvent -> Participant
Ensure crosswalk entries exist for parent entities before child entities. For example, seasons reference competitions — the competition's crosswalk entry must exist before the season import runs.
Multiple IDs Per Provider¶
Some providers have multiple ID schemes for the same entity. For example, SuperCoach player IDs are only unique within a context of sport + year + game type — the raw integer 1 could be different players depending on the parameters.
Convention: Use Composite IDs¶
Store the most specific, globally unique form as provider_id in the provider table. For SuperCoach, this means the composite form:
The composite form is unique across all seasons and game types.
Multiple Crosswalk Entries for Alternate IDs¶
If downstream systems reference entities by different ID forms, add additional crosswalk rows for each form. All map to the same canonical_id:
-- Same player, same provider, multiple ID forms
INSERT INTO foundry.entity_mapping_crosswalk VALUES
('abc-123...', 'Participant', 'RugbyLeague', 'supercoach', 'nrl/2026/classic/1', 'Reece Walsh'),
('abc-123...', 'Participant', 'RugbyLeague', 'supercoach', 'nrl/2025/classic/1', 'Reece Walsh');
Any lookup by any of these IDs resolves to the same canonical entity.
Rules¶
- The provider table (
supercoach_participants) stores the primary composite ID inprovider_id - The crosswalk can have multiple rows per
(provider, canonical_id)with differentprovider_idvalues - Each
(entity_type, provider, provider_id)combination is unique — no duplicates - Downstream consumers query with whatever ID form they have — they don't need to know the scheme
- The canonical ID stays the same regardless of which ID form was used to look it up
Seasonal ID Changes¶
When a provider reuses IDs across seasons (same raw ID, different player), use composite IDs that include the season context. Each season's composite ID gets its own crosswalk entry pointing to the correct canonical ID:
-- Player 1 is Reece Walsh in 2026 but was someone else in 2024
('abc-123...', 'Participant', 'RugbyLeague', 'supercoach', 'nrl/2026/classic/1', 'Reece Walsh'),
('def-456...', 'Participant', 'RugbyLeague', 'supercoach', 'nrl/2024/classic/1', 'Other Player'),
Adding a New Provider¶
- Choose a provider name — lowercase, no spaces (e.g.,
espn,nflverse) - Tables are created automatically — Foundry creates the 6 consolidated tables on startup. Post-V004, no provider list exists in code. To onboard a new provider, just start inserting rows with your new
providervalue — no DDL, no application change. - Populate the provider tables — ETL your provider's data into the standardized tables
- Populate the crosswalk — Match entities to existing canonical IDs or generate new ones
- Trigger an import in Foundry — The generic importers will read your tables and create domain aggregates
Validation Queries¶
-- Check crosswalk coverage: how many entities per provider?
SELECT entity_type, provider, count() AS entries
FROM foundry.entity_mapping_crosswalk FINAL
GROUP BY entity_type, provider
ORDER BY entity_type, provider;
-- Find entities with only one provider (potential missing mappings)
SELECT canonical_id, entity_type, sport, count() AS provider_count
FROM foundry.entity_mapping_crosswalk FINAL
GROUP BY canonical_id, entity_type, sport
HAVING provider_count = 1
ORDER BY entity_type;
-- Check for orphaned provider table entries (no crosswalk mapping)
SELECT fp.provider, fp.provider_id, fp.full_name, fp.sport
FROM foundry.participants fp FINAL
LEFT JOIN foundry.entity_mapping_crosswalk cx FINAL
ON cx.provider = fp.provider
AND cx.provider_id = fp.provider_id
AND cx.entity_type = 'Participant'
WHERE fp.provider = 'foxsports'
AND cx.canonical_id IS NULL;
-- Check for crosswalk entries with no provider table data
SELECT cx.provider, cx.provider_id, cx.display_name
FROM foundry.entity_mapping_crosswalk cx FINAL
LEFT JOIN foundry.participants fp FINAL
ON fp.provider = cx.provider
AND fp.provider_id = cx.provider_id
WHERE cx.provider = 'foxsports'
AND cx.entity_type = 'Participant'
AND fp.provider_id IS NULL;
Mapped Lineups Saved Query (LBS-1756)¶
Foundry's lineup importer reads foundry.lineups joined with the crosswalk to resolve the canonical
SportingEvent / Team / Participant ids, via a ClickHouse Cloud saved query. The code references it
through ClickHouseQueries.QueryIds.MappedLineupMembers.
To activate it: create the saved query below in ClickHouse Cloud, then replace that placeholder with the saved query's GUID (the GUID is assigned by ClickHouse Cloud — it cannot be set from code).
Parameterized by provider_name + sport; the projected columns map 1:1 to MappedLineupRow:
SELECT
l.provider AS provider,
l.provider_id AS provider_id,
l.sporting_event_provider_id AS sporting_event_provider_id,
l.team_provider_id AS team_provider_id,
l.participant_provider_id AS participant_provider_id,
l.team_name AS team_name,
l.selection_role AS selection_role,
l.position AS position,
l.jersey_number AS jersey_number,
l.type AS type,
l.sport AS sport,
se.canonical_id AS canonical_sporting_event_id,
tm.canonical_id AS canonical_team_id,
pt.canonical_id AS canonical_participant_id
FROM foundry.lineups l FINAL
LEFT JOIN foundry.entity_mapping_crosswalk se FINAL
ON se.entity_type = 'SportingEvent'
AND se.provider = l.provider AND se.provider_id = l.sporting_event_provider_id AND se.sport = l.sport
LEFT JOIN foundry.entity_mapping_crosswalk tm FINAL
ON tm.entity_type = 'Team'
AND tm.provider = l.provider AND tm.provider_id = l.team_provider_id AND tm.sport = l.sport
LEFT JOIN foundry.entity_mapping_crosswalk pt FINAL
ON pt.entity_type = 'Participant'
AND pt.provider = l.provider AND pt.provider_id = l.participant_provider_id AND pt.sport = l.sport
WHERE l.provider = {provider_name:String}
AND l.sport = {sport:String}
The canonical-id columns are nullable: a LEFT JOIN miss (the SE/Team/Participant not yet resolved in
the crosswalk) leaves them NULL, which the importer treats as "not mapped yet — skip". This mirrors
the existing MappedPlayers saved query.
FAQ¶
Q: Can I update a player's name?
A: Insert a new row with the same (sport, provider_id). ReplacingMergeTree keeps the latest updated_at.
Q: What if I map a provider ID to the wrong canonical ID?
A: Delete the old row and insert a new one. Or use the admin UI at /admin/crosswalk to unlink and relink.
Q: What if a player's details change (e.g., name, status, height)?
A: Update the player's row in the provider table (insert a new row with the same provider_id and a newer updated_at). The canonical ID stays the same — it represents the player, not their current details.
Q: Do I need to populate all 6 entity types? A: No. Populate what your provider has. Foundry's importers skip entity types with no data.
Q: What sport names does Foundry accept?
A: RugbyLeague, Cricket, AmericanFootball, AustralianFootball. Case-sensitive. Check SportsDataProviders for the latest list.
Q: We need to add a new field to teams (e.g., home_ground). How does that flow through?
A: This is a coordinated change across ClickHouse and Foundry:
1. Foundry developer creates a ClickHouse migration that runs a single ALTER TABLE foundry.teams ADD COLUMN ... (post-V004 there's only one table per entity, so one DDL covers every provider)
2. Foundry developer updates the ProviderTeam read model, CreateTeamCommand, TeamCreatedEvent, and TeamContract to include the new field
3. Data engineering populates the new column in each provider's teams table via your ETL
4. Next import picks up the new field and flows it into the domain
The key point: the column must exist in the ClickHouse table before you can populate it. The migration handles that. You don't need to wait for a Foundry deployment to start preparing your data — just populate the column once the migration runs.
Q: A provider has multiple IDs for the same entity (e.g., SuperCoach composite IDs). How do I handle that?
A: Use the composite/unique form as provider_id in the provider table. Add additional crosswalk rows for each alternate ID form, all pointing to the same canonical_id. See the "Multiple IDs Per Provider" section above.
Q: How do I handle provider IDs that change?
A: Delete the old crosswalk entry and add a new one with the same canonical_id but the new provider_id. The canonical ID is permanent — only provider IDs change.
Q: A provider reuses the same raw integer ID for different players across seasons. What do I do?
A: Build a composite provider_id that includes the season context (e.g., nrl/2026/classic/1). Each season's composite ID gets its own crosswalk entry pointing to the correct canonical entity.
Canonical-ID Scheme (effective 2026-05-20)¶
Foundry canonicalisation uses a deterministic UUIDv5 over a structured,
name-free key. DE owns canonicalisation end-to-end via the Python SDK
(foundry_sdk.canonical_ids); Foundry's C# code computes the same
canonical_ids via LBS.Anchor.CanonicalIds.Mint. Both sides assert
against the same gold-fixture JSON file, so any recipe drift surfaces
immediately. Divergence here is the duplicate-identity failure mode
the system was rebuilt to avoid.
Namespace¶
Permanent. Do not change without a coordinated scheme transition.
Key shape¶
key = "<entity_type>|<sport_lower_or_empty>|<anchor_provider_lower>|<anchor_pid>[|<parent_canon>]"
canonical_id = uuid5(FOUNDRY_CANONICAL_NS, key)
Per-entity recipes¶
| entity_type | key |
|---|---|
| Participant | Participant\|{sport_lower}\|{anchor_lower}\|{anchor_pid} |
| Team | Team\|{sport_lower}\|{anchor_lower}\|{anchor_pid} |
| Competition | Competition\|{sport_lower}\|{anchor_lower}\|{anchor_pid} |
| Season | Season\|{sport_lower}\|{anchor_lower}\|{anchor_pid}\|{competition_canonical_id_lowercase_dashed} |
| SportingEvent | SportingEvent\|{sport_lower}\|{anchor_lower}\|{anchor_pid} |
| Venue | Venue\|\|{anchor_lower}\|{anchor_pid} (empty sport segment) |
Normalisation¶
- UTF-8 byte encoding.
- Separator is the pipe character
|. Reserved: anchor_provider and anchor_pid must not contain a pipe. entity_typeis PascalCase (Participant,Team, ...).sportis the SDK Sport.Name lowercased (rugbyleague,cricket). The crosswalk'ssportcolumn stays PascalCase; only the hash key is lowercase.anchor_provideris the provider name lowercased (foxsports,nrlweb,statsperform,supercoach).anchor_pidis the source provider'sprovider_idverbatim — no trimming or case-folding.parent_canon(Season only) is the parent's canonical_id in lowercase dashed form,xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.
Anchor registry (planned, not yet shipped)¶
Today the anchor provider is foxsports-by-convention — DE callers pass
anchor_provider="foxsports" directly to mint_canonical_id(...). There is
no DE-owned (entity_type, sport) -> anchor_provider table yet; the
planned foundry.entity_resolution_anchor table (spec §3) lands in a
later PR. Anchor switches will be ordered procedures — see spec §5.
Gold fixtures¶
The forward-only test fixtures live at
src/Tests/LBS.UnitTests/EntityResolution/Fixtures/canonical-id-gold-fixtures.json.
DE's ETL test suite should assert against the same fixture file so any
recipe drift surfaces on both sides immediately.
Dual-Recipe Migration (effective 2026-05-28)¶
Read this before any foxsports backfill. The naive workflow — compute the new deterministic UUIDv5 for every foxsports row and write it to the crosswalk — would mint parallel canonical_ids alongside the historical IDs already there. The result is identity duplication. The dual-recipe workflow below avoids it.
Why foxsports is special¶
Every existing foxsports canonical_id in foundry.entity_mapping_crosswalk
was originally minted by the C# XId.From(...) static methods under
per-entity-type namespaces — the legacy recipe. Those IDs are
deterministic UUIDv5s, just from a different namespace and key shape
than the new recipe.
For a brand-new anchor provider (one DE is onboarding now, never seen
by the legacy pipeline), no rows exist; use the new recipe directly.
But for foxsports — and for any historical sport whose entities are
already in the crosswalk — DE must check whether the legacy ID
already exists before minting a new deterministic one. If it exists,
reuse it; the legacy ID is still the source of truth for downstream
code that has not yet been retired off XId.From(...) (see LBS-1663).
Legacy recipe table¶
Per-entity-type namespaces, dash-separated keys:
| entity_type | namespace | key |
|---|---|---|
| Participant | c09d16cc-370a-4c8b-b40f-002b791a7419 |
{provider_id}-{sport} |
| Team | fa2d3c2a-aa3f-4de3-bdef-692c27ef6874 |
{provider_id}-{sport}-{name} |
| Competition | a9321893-d6d4-4f69-8087-8d7d460bf682 |
{provider_id}-{sport}-{name} |
| Season | 669d4c2c-f262-45ef-8e0e-6c4023abaf00 |
{competition_canon}-{provider_id}-{sport}-{name} |
| SportingEvent | f31ad7d3-5ebd-4999-8039-36a2d5474ced |
{provider_id}-{sport}-{home_team}-v-{away_team} |
| Venue | 1775701f-28e3-480e-8ed3-56aaefb9d3d1 |
{provider_id} |
Gotchas worth knowing¶
- Legacy recipes do not include the provider in the hash. A
(foxsports, 121727)and(supercoach, 121727)participant with the same sport produce the same legacy canonical_id. In practice this never bit because only foxsports ever minted; other providers were linked, not minted. - Team / Competition / Season / SportingEvent recipes include the
name — a rename changes the legacy ID. Use the EXACT name that was
used at original mint time. For active backfills this typically
means joining against the original feed snapshot, not the current
foundry.<table>. - Season chains on the parent Competition's legacy canonical_id.
Compute the parent's legacy ID first and pass it as the
parent_competition_canonarg. - Venue legacy key is just the
provider_id— no sport, no name. High collision risk if multi-provider Venues ever existed.
Recommended foxsports backfill — Python¶
Use mint_canonical_id with the legacy_lookup callback. A single
call:
1. computes the legacy recipe ID;
2. asks your callback whether that ID is already in
entity_mapping_crosswalk;
3. returns the legacy ID if hit, otherwise mints via the new recipe.
from foundry_sdk import mint_canonical_id
def crosswalk_has(canonical_id):
rows = ch.query(
"SELECT 1 FROM foundry.entity_mapping_crosswalk FINAL "
"WHERE canonical_id = %s LIMIT 1",
(str(canonical_id),)).result_rows
return bool(rows)
cid = mint_canonical_id(
entity_type="Participant",
sport="RugbyLeague",
anchor_provider="foxsports",
anchor_pid="121727",
legacy_lookup=crosswalk_has,
)
# For 121727: returns 8f30a360-ae58-527a-a252-8c8d799b67fd (legacy)
# For a brand-new pid never seen before: returns the new-recipe UUID
For Team / Competition / Season / SportingEvent the call needs the
legacy inputs too — pass legacy_name (Team / Competition / Season),
legacy_parent_competition_canon (Season), legacy_home_team and
legacy_away_team (SportingEvent). Omitted required inputs silently
fall through to the new recipe.
Backfill loop¶
from foundry_sdk import mint_canonical_id
rows = ch.query(
"SELECT provider_id, full_name FROM foundry.participants FINAL "
"WHERE provider = 'foxsports' AND sport = 'RugbyLeague' "
" AND (lower(provider), provider_id) NOT IN ("
" SELECT lower(provider), provider_id FROM foundry.entity_mapping_crosswalk FINAL"
" WHERE entity_type = 'Participant' AND sport = 'RugbyLeague')"
).result_rows
for provider_id, full_name in rows:
cid = mint_canonical_id(
"Participant", "RugbyLeague", "foxsports", str(provider_id),
legacy_lookup=crosswalk_has,
)
ch.insert(
"foundry.entity_mapping_crosswalk",
[(str(cid), 'Participant', 'RugbyLeague', 'foxsports', str(provider_id), full_name)],
column_names=['canonical_id', 'entity_type', 'sport', 'provider', 'provider_id', 'display_name'],
)
New anchor provider — no legacy step¶
For an entity_type / sport / provider combination with no historical
crosswalk rows, skip the legacy_lookup and use the new recipe
directly:
cid = mint_canonical_id(
entity_type="Participant",
sport="<new_sport>",
anchor_provider="<new_anchor>",
anchor_pid="42",
)
Direct legacy computation¶
If DE wants to compute a legacy canonical_id without wrapping it in a
mint flow — e.g. to verify a row already in the crosswalk matches the
legacy recipe — call legacy_canonical_id directly:
from foundry_sdk import legacy_canonical_id
legacy_id = legacy_canonical_id(
"Participant", provider_id="121727", sport="RugbyLeague",
)
# 8f30a360-ae58-527a-a252-8c8d799b67fd
C# equivalent¶
LBS.Anchor.CanonicalIds exposes the same two-step process Python ETL
uses: compute the canonical_id via the SDK (with an optional
legacy-lookup callback), then write the crosswalk row separately. Two
overloads — sync Mint(...) for in-memory / test lookups, async
MintAsync(...) for awaitable DB-backed lookups:
using LBS.Anchor;
// The caller supplies the crosswalk-existence check — there is no built-in
// "is this canonical_id present" helper on IEntityMappingService. A typical
// implementation runs a `SELECT 1 FROM foundry.entity_mapping_crosswalk
// FINAL WHERE canonical_id = @id LIMIT 1` against your IClickHouseRepository.
// `CrosswalkContains` below is a stand-in for that caller-owned helper.
//
// The SDK is sync/async-agnostic at this boundary; supply a
// Func<Guid, Task<bool>> for MintAsync or Func<Guid, bool> for Mint.
var resolved = await CanonicalIds.MintAsync(
entityType: "Participant",
sport: "RugbyLeague",
anchorProvider: "foxsports",
anchorPid: "121727",
legacyLookup: id => CrosswalkContains(id, ct)); // your helper
// Then write the crosswalk row via the public RegisterMappingAsync
// (idempotent on the same canonical_id, throws CanonicalIdConflictException
// on a different one for the same key).
await mappingService.RegisterMappingAsync(
resolved,
entityType: "Participant",
provider: "foxsports",
providerId: "121727",
sport: "RugbyLeague",
displayName: "Sam Verrills");
For tests / scripts where the lookup state is in-memory, use the sync
overload with a HashSet-backed closure — same trick the Python tests
use with lambda c: c == expected:
var existing = new HashSet<Guid> { /* canonical_ids already in the crosswalk */ };
var resolved = CanonicalIds.Mint(
entityType: "Participant",
sport: "RugbyLeague",
anchorProvider: "foxsports",
anchorPid: "121727",
legacyLookup: candidate => existing.Contains(candidate));
Pure-recipe sync helpers (no DB, no callback) are also available for direct computation:
var legacyId = CanonicalIds.LegacyMint("Participant", "121727", "RugbyLeague");
var newId = CanonicalIds.Mint("Participant", "RugbyLeague", "foxsports", "121727");
Cross-language parity¶
Both implementations are pinned by
src/Tests/LBS.UnitTests/EntityResolution/Fixtures/canonical-id-legacy-gold-fixtures.json.
C# GoldFixturesTests.LegacyGoldFixtures_ProduceExpectedCanonicalIds
and Python test_legacy_gold_fixtures_match_against_repo_file_if_available
both load and assert against it. A drift between the two
implementations fails the test on the affected side.
Permanent invariants (legacy)¶
- All 6 legacy namespaces are locked — changing any of them would invalidate every existing crosswalk row under that entity type.
- For foxsports backfills, the
legacy_lookuppattern is the recommended default. For brand-new anchor providers, plainmint_canonical_idis correct. canonical_idvalues, once written, never change.