Skip to content

Runbook: Archive Non-LuckBox Aggregates

Status: Tool implemented (ArchiveOldAggregates.cs exists); execution status unconfirmed Branch: claude/identity-linking-design Owner: tbd Scope: Marten event streams in events.mt_events whose Create event carried a non-LuckBox DataSource. These are duplicate aggregates left behind by the pre-crosswalk per-provider import flow.


Background

Before the ClickHouse crosswalk, each provider's import created its own Marten aggregate for the same real-world entity (e.g. FoxSports' "Reece Walsh" and StatsPerform's "Reece Walsh" were separate ParticipantAggregate streams). The canonical consolidation now lives in ClickHouse (foundry.entity_mapping_crosswalk), and generic importers produce a single DataSource = "LuckBox" aggregate per canonical entity.

The contract builders filter read models to DataSource == "LuckBox", so the duplicates don't appear in the API — but they still take up space in the event store and block cleanup of that filter.

UAT baseline as of 2026-04-17: 16,669 non-LuckBox streams across six aggregate types:

Entity Non-LuckBox streams Create-only With updates
Participant 14,559 12,538 2,021
SportingEvent 1,282 335 947
Venue 460 460 0
Team 346 74 272
Season 15 0 15
Competition 7 4 3

Cross-reference check (UAT): zero LuckBox aggregates reference any non-LuckBox stream. The archive candidates are completely detached from the live LuckBox graph.


Strategy: Marten ArchiveStream

session.Events.ArchiveStream(streamId) is the right primitive:

  • Moves the stream's events from mt_events to mt_events_archived (audit trail preserved)
  • Flags the stream as archived; projections stop firing for it
  • Any projected document for the stream is removed via the archive projection path
  • Reversible via session.Events.UnarchiveStream(streamId) if we ever need to recover

Not used: DeleteStream (irreversible) or manual DELETE (bypasses Marten's stream state).


Preconditions

Each of these must be verified before running with --apply against any environment.

  1. Code is deployed. The branch containing the IEntityMappingService migration, SQL-injection hardening, and GetDetails handler refactor has been merged and deployed to the target environment. Rolling back to a pre-migration version after archiving will leave code referencing AggregateRelations that no longer has the aggregates it expected.

  2. Discovery report is clean.

    set PGPASSWORD=...
    cd tools/archive-old-aggregates
    dotnet run DiscoverOldAggregates.cs
    
    Required state:

  3. Section [3] cross-reference checks all show zero — no LuckBox aggregate references a non-LuckBox stream.
  4. Section [1] breakdowns match expectations for that environment (investigate any surprises before proceeding).

  5. Backup taken. Full logical backup of the Postgres database (or confirm a recent automated backup exists). Azure PostgreSQL Flexible Server has point-in-time restore, but confirm the retention window covers the archive operation.

  6. --apply variant code-reviewed. The archive variant now exists at tools/archive-old-aggregates/ArchiveOldAggregates.cs. It defaults to dry-run, requires --apply to write, and prompts for an interactive ARCHIVE confirmation (skippable with --yes for CI). Re-review it against the target environment before running with --apply.

  7. Maintenance window agreed. Archive is online-safe but will generate write load. Schedule for a low-traffic window.


Rollout stages

Principle: smallest-risk first, confirm invariants between stages. Do not proceed to the next stage if any error count is non-zero after the previous.

Between each stage, re-run DiscoverOldAggregates.cs and confirm: - The archived entity's count dropped to zero - All cross-reference warnings are still zero - No new error logs in the API

Stage 0 — Dry-run verification (no writes)

Run DiscoverOldAggregates.cs -- --dump-csv to produce stream-ids.csv. Spot-check a dozen rows by hand — confirm the data_source column matches the provider naming convention and the stream IDs look plausible.

Stage 1 — Venue (460 streams, lowest risk)

Venue is the cleanest test: - Every non-LuckBox Venue has a LuckBox counterpart (exact 1:1 ratio — 460 of each) - All are create-only — zero updates, so archiving removes exactly one event per stream - No cross-references

# Dry run
dotnet run ArchiveOldAggregates.cs -- --entity=Venue

# Apply
dotnet run ArchiveOldAggregates.cs -- --entity=Venue --apply --limit=50    # first
dotnet run ArchiveOldAggregates.cs -- --entity=Venue --apply                # remainder

Validate: - Venue rows in mt_events dropped by 460 - Venue rows in mt_events_archived increased by 460 - Venue API queries still work (the LuckBox venues remain)

Stage 2 — Competition and Season (22 streams, small blast radius)

Similar pattern, tiny numbers: - Competition: 7 streams - Season: 15 streams

dotnet run ArchiveOldAggregates.cs -- --entity=Competition --apply
dotnet run ArchiveOldAggregates.cs -- --entity=Season --apply

Stage 3 — Team (346 streams)

dotnet run ArchiveOldAggregates.cs -- --entity=Team --apply --limit=100   # batch 1
# verify, then
dotnet run ArchiveOldAggregates.cs -- --entity=Team --apply                # remainder

Stage 4 — SportingEvent (1,282 streams)

Larger volume, more events per stream (947 of 1,282 have updates):

dotnet run ArchiveOldAggregates.cs -- --entity=SportingEvent --apply --limit=250   # per pass
# repeat 5–6 times with validation between

Stage 5 — Participant (14,559 streams — the big one)

Bulk of the work. Expect this to take longest:

dotnet run ArchiveOldAggregates.cs -- --entity=Participant --apply --limit=1000
# repeat ~15 times, monitoring between

Observability during archive

Watch these while --apply runs:

Metric Source Expected behaviour
Event count on the target entity in mt_events SELECT COUNT(*) FROM events.mt_events WHERE type = 'participant_created_event' AND data->>'DataSource' != 'LuckBox' Decreases per batch
Archived events SELECT COUNT(*) FROM events.mt_events_archived Increases matching the decrease above
API error rate Application Insights / APM dashboard No change
Projection lag mt_event_progression Transiently elevated during archive; recovers
Postgres connection count Azure metrics Within normal bounds
Tool logs stdout Progress lines every N streams; zero errors

Abort condition: any of these regress. Stop, re-run discovery, investigate before continuing.


Rollback

If something goes wrong mid-stage:

  1. Stop the archive tool (Ctrl+C). Marten archives per-stream in a transaction, so partial state is only ever at stream boundaries — no torn streams.

  2. Inspect the damage. Compare DiscoverOldAggregates.cs output to the baseline. Any LuckBox aggregates now broken?

  3. Reverse individual streams if needed:

    await session.Events.UnarchiveStream(streamId);
    
    The stream-ids.csv from the discovery dump tells you which streams were in scope.

  4. Catastrophic recovery: if the archive table itself is damaged, restore from the pre-archive backup. Azure Postgres point-in-time restore covers this.

Un-archiving only restores the event stream — any downstream Postgres read-model documents removed by the archive projection will repopulate as the projection replays.


Post-archive cleanup

Once all stages complete successfully and the discovery tool reports zero non-LuckBox streams:

  1. Remove the DataSource == "LuckBox" filter from contract builders. This was item #1 from the earlier cleanup review — blocked only by the presence of non-LuckBox aggregates in the event store. Files:
  2. src/Domain/LBS.Domain.Sport/Participant/ContractBuilder/ParticipantContractBuilder.cs
  3. src/Domain/LBS.Domain.Sport/Team/ContractBuilder/TeamContractBuilder.cs
  4. src/Domain/LBS.Domain.Sport/Competition/ContractBuilder/CompetitionContractBuilder.cs
  5. src/Domain/LBS.Domain.Sport/Season/ContractBuilder/SeasonContractBuilder.cs
  6. src/Domain/LBS.Domain.Sport/SportingEvent/ContractBuilder/SportingEventContractBuilder.cs
  7. src/Domain/LBS.Domain.Sport/Venue/ContractBuilder/VenueContractBuilder.cs

  8. Decide the fate of readmodels."AggregateRelations". Confirmed earlier: no C# code reads or writes this table. The table itself can be dropped as a follow-up DDL migration when you're comfortable. Until then, it sits dormant.

  9. Optional: compact mt_events_archived if size becomes a concern. Archived events are kept for audit; if that's no longer needed they can be truncated, but that's irreversible.


Open questions for the runbook review

  • Execution order within a batch — current plan archives by arbitrary order. Do we prefer oldest-first, newest-first, or per-provider?
  • Concurrency — is it safe to run two --entity=X passes in parallel against different entity types? Marten is per-stream transactional, so likely yes, but worth confirming.
  • Projection rebuild requirement — after archiving, do we want to force a projection rebuild to guarantee read-model consistency, or rely on the per-stream archive path?
  • Should we log to a table instead of just stdout? Would help auditability.
  • Who runs it — is this operator-run or a scheduled one-off job?

Appendix: the --apply tool

The archive variant is implemented at tools/archive-old-aggregates/ArchiveOldAggregates.cs as a single-file .NET tool. It:

  • Takes --entity=<Participant|Team|Competition|Season|SportingEvent|Venue> (required; one at a time)
  • Takes --limit=<N> (default 100; safety cap per invocation)
  • Takes --batch-size=<N> (default 50; streams per Marten transaction)
  • Defaults to dry-run; requires the --apply flag to actually archive
  • Prompts for an interactive ARCHIVE confirmation before the first write; pass --yes to skip it (CI mode)
  • For each stream: opens a Marten session → session.Events.ArchiveStream(id)SaveChangesAsync → logs progress
  • Batches in transactions of --batch-size streams to balance throughput vs. rollback granularity
  • Prints a summary including: attempted, archived, failed (with error), elapsed time
  • Writes a per-run log line to runs.log (next to the tool) for audit
  • Re-queries the remaining non-LuckBox stream count as a post-hoc sanity check

Connection details are read from env vars (PG_HOST, PG_DATABASE, PG_USER, PGPASSWORD, EVENTS_SCHEMA).