Runbook: Archive Non-LuckBox Aggregates¶
Status: Tool implemented (ArchiveOldAggregates.cs exists); execution status unconfirmed
Branch: claude/identity-linking-design
Owner: tbd
Scope: Marten event streams in events.mt_events whose Create event carried a non-LuckBox DataSource. These are duplicate aggregates left behind by the pre-crosswalk per-provider import flow.
Background¶
Before the ClickHouse crosswalk, each provider's import created its own Marten aggregate for the same real-world entity (e.g. FoxSports' "Reece Walsh" and StatsPerform's "Reece Walsh" were separate ParticipantAggregate streams). The canonical consolidation now lives in ClickHouse (foundry.entity_mapping_crosswalk), and generic importers produce a single DataSource = "LuckBox" aggregate per canonical entity.
The contract builders filter read models to DataSource == "LuckBox", so the duplicates don't appear in the API — but they still take up space in the event store and block cleanup of that filter.
UAT baseline as of 2026-04-17: 16,669 non-LuckBox streams across six aggregate types:
| Entity | Non-LuckBox streams | Create-only | With updates |
|---|---|---|---|
| Participant | 14,559 | 12,538 | 2,021 |
| SportingEvent | 1,282 | 335 | 947 |
| Venue | 460 | 460 | 0 |
| Team | 346 | 74 | 272 |
| Season | 15 | 0 | 15 |
| Competition | 7 | 4 | 3 |
Cross-reference check (UAT): zero LuckBox aggregates reference any non-LuckBox stream. The archive candidates are completely detached from the live LuckBox graph.
Strategy: Marten ArchiveStream¶
session.Events.ArchiveStream(streamId) is the right primitive:
- Moves the stream's events from
mt_eventstomt_events_archived(audit trail preserved) - Flags the stream as archived; projections stop firing for it
- Any projected document for the stream is removed via the archive projection path
- Reversible via
session.Events.UnarchiveStream(streamId)if we ever need to recover
Not used: DeleteStream (irreversible) or manual DELETE (bypasses Marten's stream state).
Preconditions¶
Each of these must be verified before running with --apply against any environment.
-
Code is deployed. The branch containing the
IEntityMappingServicemigration, SQL-injection hardening, andGetDetailshandler refactor has been merged and deployed to the target environment. Rolling back to a pre-migration version after archiving will leave code referencingAggregateRelationsthat no longer has the aggregates it expected. -
Discovery report is clean.
Required state: - Section [3] cross-reference checks all show zero — no LuckBox aggregate references a non-LuckBox stream.
-
Section [1] breakdowns match expectations for that environment (investigate any surprises before proceeding).
-
Backup taken. Full logical backup of the Postgres database (or confirm a recent automated backup exists). Azure PostgreSQL Flexible Server has point-in-time restore, but confirm the retention window covers the archive operation.
-
--applyvariant code-reviewed. The archive variant now exists attools/archive-old-aggregates/ArchiveOldAggregates.cs. It defaults to dry-run, requires--applyto write, and prompts for an interactiveARCHIVEconfirmation (skippable with--yesfor CI). Re-review it against the target environment before running with--apply. -
Maintenance window agreed. Archive is online-safe but will generate write load. Schedule for a low-traffic window.
Rollout stages¶
Principle: smallest-risk first, confirm invariants between stages. Do not proceed to the next stage if any error count is non-zero after the previous.
Between each stage, re-run DiscoverOldAggregates.cs and confirm:
- The archived entity's count dropped to zero
- All cross-reference warnings are still zero
- No new error logs in the API
Stage 0 — Dry-run verification (no writes)¶
Run DiscoverOldAggregates.cs -- --dump-csv to produce stream-ids.csv. Spot-check a dozen rows by hand — confirm the data_source column matches the provider naming convention and the stream IDs look plausible.
Stage 1 — Venue (460 streams, lowest risk)¶
Venue is the cleanest test: - Every non-LuckBox Venue has a LuckBox counterpart (exact 1:1 ratio — 460 of each) - All are create-only — zero updates, so archiving removes exactly one event per stream - No cross-references
# Dry run
dotnet run ArchiveOldAggregates.cs -- --entity=Venue
# Apply
dotnet run ArchiveOldAggregates.cs -- --entity=Venue --apply --limit=50 # first
dotnet run ArchiveOldAggregates.cs -- --entity=Venue --apply # remainder
Validate:
- Venue rows in mt_events dropped by 460
- Venue rows in mt_events_archived increased by 460
- Venue API queries still work (the LuckBox venues remain)
Stage 2 — Competition and Season (22 streams, small blast radius)¶
Similar pattern, tiny numbers: - Competition: 7 streams - Season: 15 streams
dotnet run ArchiveOldAggregates.cs -- --entity=Competition --apply
dotnet run ArchiveOldAggregates.cs -- --entity=Season --apply
Stage 3 — Team (346 streams)¶
dotnet run ArchiveOldAggregates.cs -- --entity=Team --apply --limit=100 # batch 1
# verify, then
dotnet run ArchiveOldAggregates.cs -- --entity=Team --apply # remainder
Stage 4 — SportingEvent (1,282 streams)¶
Larger volume, more events per stream (947 of 1,282 have updates):
dotnet run ArchiveOldAggregates.cs -- --entity=SportingEvent --apply --limit=250 # per pass
# repeat 5–6 times with validation between
Stage 5 — Participant (14,559 streams — the big one)¶
Bulk of the work. Expect this to take longest:
dotnet run ArchiveOldAggregates.cs -- --entity=Participant --apply --limit=1000
# repeat ~15 times, monitoring between
Observability during archive¶
Watch these while --apply runs:
| Metric | Source | Expected behaviour |
|---|---|---|
Event count on the target entity in mt_events |
SELECT COUNT(*) FROM events.mt_events WHERE type = 'participant_created_event' AND data->>'DataSource' != 'LuckBox' |
Decreases per batch |
| Archived events | SELECT COUNT(*) FROM events.mt_events_archived |
Increases matching the decrease above |
| API error rate | Application Insights / APM dashboard | No change |
| Projection lag | mt_event_progression |
Transiently elevated during archive; recovers |
| Postgres connection count | Azure metrics | Within normal bounds |
| Tool logs | stdout | Progress lines every N streams; zero errors |
Abort condition: any of these regress. Stop, re-run discovery, investigate before continuing.
Rollback¶
If something goes wrong mid-stage:
-
Stop the archive tool (Ctrl+C). Marten archives per-stream in a transaction, so partial state is only ever at stream boundaries — no torn streams.
-
Inspect the damage. Compare
DiscoverOldAggregates.csoutput to the baseline. Any LuckBox aggregates now broken? -
Reverse individual streams if needed:
Thestream-ids.csvfrom the discovery dump tells you which streams were in scope. -
Catastrophic recovery: if the archive table itself is damaged, restore from the pre-archive backup. Azure Postgres point-in-time restore covers this.
Un-archiving only restores the event stream — any downstream Postgres read-model documents removed by the archive projection will repopulate as the projection replays.
Post-archive cleanup¶
Once all stages complete successfully and the discovery tool reports zero non-LuckBox streams:
- Remove the
DataSource == "LuckBox"filter from contract builders. This was item #1 from the earlier cleanup review — blocked only by the presence of non-LuckBox aggregates in the event store. Files: src/Domain/LBS.Domain.Sport/Participant/ContractBuilder/ParticipantContractBuilder.cssrc/Domain/LBS.Domain.Sport/Team/ContractBuilder/TeamContractBuilder.cssrc/Domain/LBS.Domain.Sport/Competition/ContractBuilder/CompetitionContractBuilder.cssrc/Domain/LBS.Domain.Sport/Season/ContractBuilder/SeasonContractBuilder.cssrc/Domain/LBS.Domain.Sport/SportingEvent/ContractBuilder/SportingEventContractBuilder.cs-
src/Domain/LBS.Domain.Sport/Venue/ContractBuilder/VenueContractBuilder.cs -
Decide the fate of
readmodels."AggregateRelations". Confirmed earlier: no C# code reads or writes this table. The table itself can be dropped as a follow-up DDL migration when you're comfortable. Until then, it sits dormant. -
Optional: compact
mt_events_archivedif size becomes a concern. Archived events are kept for audit; if that's no longer needed they can be truncated, but that's irreversible.
Open questions for the runbook review¶
- Execution order within a batch — current plan archives by arbitrary order. Do we prefer oldest-first, newest-first, or per-provider?
- Concurrency — is it safe to run two
--entity=Xpasses in parallel against different entity types? Marten is per-stream transactional, so likely yes, but worth confirming. - Projection rebuild requirement — after archiving, do we want to force a projection rebuild to guarantee read-model consistency, or rely on the per-stream archive path?
- Should we log to a table instead of just stdout? Would help auditability.
- Who runs it — is this operator-run or a scheduled one-off job?
Appendix: the --apply tool¶
The archive variant is implemented at tools/archive-old-aggregates/ArchiveOldAggregates.cs as a single-file .NET tool. It:
- Takes
--entity=<Participant|Team|Competition|Season|SportingEvent|Venue>(required; one at a time) - Takes
--limit=<N>(default 100; safety cap per invocation) - Takes
--batch-size=<N>(default 50; streams per Marten transaction) - Defaults to dry-run; requires the
--applyflag to actually archive - Prompts for an interactive
ARCHIVEconfirmation before the first write; pass--yesto skip it (CI mode) - For each stream: opens a Marten session →
session.Events.ArchiveStream(id)→SaveChangesAsync→ logs progress - Batches in transactions of
--batch-sizestreams to balance throughput vs. rollback granularity - Prints a summary including: attempted, archived, failed (with error), elapsed time
- Writes a per-run log line to
runs.log(next to the tool) for audit - Re-queries the remaining non-LuckBox stream count as a post-hoc sanity check
Connection details are read from env vars (PG_HOST, PG_DATABASE, PG_USER, PGPASSWORD, EVENTS_SCHEMA).