Runbook: Archive Non-LuckBox Aggregates¶

Status: Tool implemented (ArchiveOldAggregates.cs exists); execution status unconfirmed Branch: claude/identity-linking-design Owner: tbd Scope: Marten event streams in events.mt_events whose Create event carried a non-LuckBox DataSource. These are duplicate aggregates left behind by the pre-crosswalk per-provider import flow.

Background¶

Before the ClickHouse crosswalk, each provider's import created its own Marten aggregate for the same real-world entity (e.g. FoxSports' "Reece Walsh" and StatsPerform's "Reece Walsh" were separate ParticipantAggregate streams). The canonical consolidation now lives in ClickHouse (foundry.entity_mapping_crosswalk), and generic importers produce a single DataSource = "LuckBox" aggregate per canonical entity.

The contract builders filter read models to DataSource == "LuckBox", so the duplicates don't appear in the API — but they still take up space in the event store and block cleanup of that filter.

UAT baseline as of 2026-04-17: 16,669 non-LuckBox streams across six aggregate types:

Entity	Non-LuckBox streams	Create-only	With updates
Participant	14,559	12,538	2,021
SportingEvent	1,282	335	947
Venue	460	460	0
Team	346	74	272
Season	15	0	15
Competition	7	4	3

Cross-reference check (UAT): zero LuckBox aggregates reference any non-LuckBox stream. The archive candidates are completely detached from the live LuckBox graph.

Strategy: Marten `ArchiveStream`¶

session.Events.ArchiveStream(streamId) is the right primitive:

Moves the stream's events from mt_events to mt_events_archived (audit trail preserved)
Flags the stream as archived; projections stop firing for it
Any projected document for the stream is removed via the archive projection path
Reversible via session.Events.UnarchiveStream(streamId) if we ever need to recover

Not used: DeleteStream (irreversible) or manual DELETE (bypasses Marten's stream state).

Preconditions¶

Each of these must be verified before running with --apply against any environment.

Code is deployed. The branch containing the IEntityMappingService migration, SQL-injection hardening, and GetDetails handler refactor has been merged and deployed to the target environment. Rolling back to a pre-migration version after archiving will leave code referencing AggregateRelations that no longer has the aggregates it expected.

Discovery report is clean.

set PGPASSWORD=...
cd tools/archive-old-aggregates
dotnet run DiscoverOldAggregates.cs

Required state:

Section [3] cross-reference checks all show zero — no LuckBox aggregate references a non-LuckBox stream.
Section [1] breakdowns match expectations for that environment (investigate any surprises before proceeding).
Backup taken. Full logical backup of the Postgres database (or confirm a recent automated backup exists). Azure PostgreSQL Flexible Server has point-in-time restore, but confirm the retention window covers the archive operation.
--apply variant code-reviewed. The archive variant now exists at tools/archive-old-aggregates/ArchiveOldAggregates.cs. It defaults to dry-run, requires --apply to write, and prompts for an interactive ARCHIVE confirmation (skippable with --yes for CI). Re-review it against the target environment before running with --apply.
Maintenance window agreed. Archive is online-safe but will generate write load. Schedule for a low-traffic window.

Rollout stages¶

Principle: smallest-risk first, confirm invariants between stages. Do not proceed to the next stage if any error count is non-zero after the previous.

Between each stage, re-run DiscoverOldAggregates.cs and confirm: - The archived entity's count dropped to zero - All cross-reference warnings are still zero - No new error logs in the API

Stage 0 — Dry-run verification (no writes)¶

Run DiscoverOldAggregates.cs -- --dump-csv to produce stream-ids.csv. Spot-check a dozen rows by hand — confirm the data_source column matches the provider naming convention and the stream IDs look plausible.

Stage 1 — Venue (460 streams, lowest risk)¶

Venue is the cleanest test: - Every non-LuckBox Venue has a LuckBox counterpart (exact 1:1 ratio — 460 of each) - All are create-only — zero updates, so archiving removes exactly one event per stream - No cross-references

# Dry run
dotnet run ArchiveOldAggregates.cs -- --entity=Venue

# Apply
dotnet run ArchiveOldAggregates.cs -- --entity=Venue --apply --limit=50    # first
dotnet run ArchiveOldAggregates.cs -- --entity=Venue --apply                # remainder

Validate: - Venue rows in mt_events dropped by 460 - Venue rows in mt_events_archived increased by 460 - Venue API queries still work (the LuckBox venues remain)

Stage 2 — Competition and Season (22 streams, small blast radius)¶

Similar pattern, tiny numbers: - Competition: 7 streams - Season: 15 streams

dotnet run ArchiveOldAggregates.cs -- --entity=Competition --apply
dotnet run ArchiveOldAggregates.cs -- --entity=Season --apply

Stage 3 — Team (346 streams)¶

dotnet run ArchiveOldAggregates.cs -- --entity=Team --apply --limit=100   # batch 1
# verify, then
dotnet run ArchiveOldAggregates.cs -- --entity=Team --apply                # remainder

Stage 4 — SportingEvent (1,282 streams)¶

Larger volume, more events per stream (947 of 1,282 have updates):

dotnet run ArchiveOldAggregates.cs -- --entity=SportingEvent --apply --limit=250   # per pass
# repeat 5–6 times with validation between

Stage 5 — Participant (14,559 streams — the big one)¶

Bulk of the work. Expect this to take longest:

dotnet run ArchiveOldAggregates.cs -- --entity=Participant --apply --limit=1000
# repeat ~15 times, monitoring between

Observability during archive¶

Watch these while --apply runs:

Metric	Source	Expected behaviour
Event count on the target entity in `mt_events`	`SELECT COUNT(*) FROM events.mt_events WHERE type = 'participant_created_event' AND data->>'DataSource' != 'LuckBox'`	Decreases per batch
Archived events	`SELECT COUNT(*) FROM events.mt_events_archived`	Increases matching the decrease above
API error rate	Application Insights / APM dashboard	No change
Projection lag	`mt_event_progression`	Transiently elevated during archive; recovers
Postgres connection count	Azure metrics	Within normal bounds
Tool logs	stdout	Progress lines every N streams; zero errors

Abort condition: any of these regress. Stop, re-run discovery, investigate before continuing.

Rollback¶

If something goes wrong mid-stage:

Stop the archive tool (Ctrl+C). Marten archives per-stream in a transaction, so partial state is only ever at stream boundaries — no torn streams.
Inspect the damage. Compare DiscoverOldAggregates.cs output to the baseline. Any LuckBox aggregates now broken?
Reverse individual streams if needed:
```
await session.Events.UnarchiveStream(streamId);
```
The stream-ids.csv from the discovery dump tells you which streams were in scope.
Catastrophic recovery: if the archive table itself is damaged, restore from the pre-archive backup. Azure Postgres point-in-time restore covers this.

Un-archiving only restores the event stream — any downstream Postgres read-model documents removed by the archive projection will repopulate as the projection replays.

Post-archive cleanup¶

Once all stages complete successfully and the discovery tool reports zero non-LuckBox streams:

Remove the DataSource == "LuckBox" filter from contract builders. This was item #1 from the earlier cleanup review — blocked only by the presence of non-LuckBox aggregates in the event store. Files:
src/Domain/LBS.Domain.Sport/Participant/ContractBuilder/ParticipantContractBuilder.cs
src/Domain/LBS.Domain.Sport/Team/ContractBuilder/TeamContractBuilder.cs
src/Domain/LBS.Domain.Sport/Competition/ContractBuilder/CompetitionContractBuilder.cs
src/Domain/LBS.Domain.Sport/Season/ContractBuilder/SeasonContractBuilder.cs
src/Domain/LBS.Domain.Sport/SportingEvent/ContractBuilder/SportingEventContractBuilder.cs
src/Domain/LBS.Domain.Sport/Venue/ContractBuilder/VenueContractBuilder.cs
Decide the fate of readmodels."AggregateRelations". Confirmed earlier: no C# code reads or writes this table. The table itself can be dropped as a follow-up DDL migration when you're comfortable. Until then, it sits dormant.
Optional: compact mt_events_archived if size becomes a concern. Archived events are kept for audit; if that's no longer needed they can be truncated, but that's irreversible.

Open questions for the runbook review¶

Execution order within a batch — current plan archives by arbitrary order. Do we prefer oldest-first, newest-first, or per-provider?
Concurrency — is it safe to run two --entity=X passes in parallel against different entity types? Marten is per-stream transactional, so likely yes, but worth confirming.
Projection rebuild requirement — after archiving, do we want to force a projection rebuild to guarantee read-model consistency, or rely on the per-stream archive path?
Should we log to a table instead of just stdout? Would help auditability.
Who runs it — is this operator-run or a scheduled one-off job?

Appendix: the `--apply` tool¶

The archive variant is implemented at tools/archive-old-aggregates/ArchiveOldAggregates.cs as a single-file .NET tool. It:

Takes --entity=<Participant|Team|Competition|Season|SportingEvent|Venue> (required; one at a time)
Takes --limit=<N> (default 100; safety cap per invocation)
Takes --batch-size=<N> (default 50; streams per Marten transaction)
Defaults to dry-run; requires the --apply flag to actually archive
Prompts for an interactive ARCHIVE confirmation before the first write; pass --yes to skip it (CI mode)
For each stream: opens a Marten session → session.Events.ArchiveStream(id) → SaveChangesAsync → logs progress
Batches in transactions of --batch-size streams to balance throughput vs. rollback granularity
Prints a summary including: attempted, archived, failed (with error), elapsed time
Writes a per-run log line to runs.log (next to the tool) for audit
Re-queries the remaining non-LuckBox stream count as a post-hoc sanity check

Connection details are read from env vars (PG_HOST, PG_DATABASE, PG_USER, PGPASSWORD, EVENTS_SCHEMA).