Multi-Host Concurrency Design¶

Implementation note (current)¶

Distributed locking shipped via an IDistributedLock / IDistributedLockHandle abstraction (src/Core/LBS.Augment/Timers/), with two implementations: RedisDistributedLock for multi-host coordination and InMemoryDistributedLock for single-host and test scenarios. Locks are acquired explicitly through IDistributedLock.AcquireAsync(...) / TryAcquireAsync(...).

There is no useDistributedLock flag on [Timer(...)]. The real TimerAttribute (src/Core/LBS.Augment/Timers/TimerAttribute.cs) takes only maxConcurrency, intervalInSeconds, and the optional maxRunTimeInSeconds, retryOnError, and spreadStart parameters. The design body below predates the shipped implementation; treat this note as the source of truth for what exists today.

Problem Statement¶

The current TimerService implementation only controls concurrency within a single host process:

[Timer(maxConcurrency: 3, intervalInSeconds: 60)]
public async Task ProcessOrdersAsync(CancellationToken ct) { }

Current Behavior: - Host A: Max 3 concurrent executions - Host B: Max 3 concurrent executions - Total across all hosts: 6 concurrent executions (not controlled)

Required Behavior for Multi-Host: - Total across ALL hosts: Max 3 concurrent executions globally

Current Architecture Limitations¶

1. In-Process Concurrency Tracking¶

// TimerService.cs line 84
CurrentConcurrency = 0  // Only tracks local host

This counter is in-memory and not shared across hosts.

2. No Distributed Coordination¶

// TimerService.cs line 297-298
if (timer.CurrentConcurrency < timer.MaxConcurrency)
{
    // Start timer - no awareness of other hosts
}

Each host independently decides whether to run a timer.

3. Spread Start is Random per Host¶

// TimerService.cs line 64
var spreadDelay = TimeSpan.FromTicks((long)(this.random.NextDouble() * timerAttribute.Interval.Ticks));

Each host generates its own random delay, potentially causing all hosts to start at similar times.

Multi-Host Concurrency Strategies¶

Strategy 1: Distributed Locking (Recommended)¶

Concept: Acquire a distributed lock before executing each timer instance.

Pros: - True global concurrency control - Works across any number of hosts - Prevents duplicate work - Handles host failures (locks auto-expire)

Cons: - Requires external dependency (Redis, SQL, etc.) - Additional latency for lock acquisition - Potential single point of failure

Implementation:

[Timer(maxConcurrency: 3, intervalInSeconds: 60, useDistributedLock: true)]
public async Task ProcessOrdersAsync(CancellationToken ct)
{
    // Only 3 instances across ALL hosts will run concurrently
}

Strategy 2: Database-Based Coordination¶

Concept: Use database rows to track running instances globally.

Pros: - Leverages existing database infrastructure - Atomic operations via transactions - Natural audit trail

Cons: - Higher database load - Slower than in-memory locks - Requires cleanup of stale entries

Implementation:

CREATE TABLE TimerExecutions (
    TimerName NVARCHAR(255),
    InstanceId UNIQUEIDENTIFIER PRIMARY KEY,
    HostName NVARCHAR(255),
    StartedAt DATETIME2,
    ExpiresAt DATETIME2
);

Strategy 3: Leader Election¶

Concept: One host becomes the leader and manages all timer scheduling.

Pros: - Simple coordination model - No per-execution overhead - Single source of truth

Cons: - Single point of failure - Underutilizes available hosts - Complex leader election logic

Not Recommended: Defeats the purpose of multi-host deployment.

Strategy 4: Consistent Hashing¶

Concept: Hash timer name to determine which host should execute it.

Pros: - Deterministic distribution - No external dependencies - Automatic rebalancing

Cons: - Doesn't control concurrency across hosts - Requires service discovery - Uneven load distribution

Use Case: Good for partitioning work, not for concurrency control.

Strategy 5: Message Queue Coordination¶

Concept: Use a message queue to distribute timer work items.

Pros: - Natural work distribution - Built-in retry mechanisms - Good for high-volume scenarios

Cons: - Architectural complexity - Requires message broker - Changes timer execution model

Recommended Approach: Distributed Locking¶

Design Overview¶

┌─────────────────────────────────────────────────────────────────┐
│                        Timer Service                             │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │ 1. Check if timer should run (interval, blackout, etc.) │   │
│  └──────────────────────────────────────────────────────────┘   │
│                              │                                   │
│                              ▼                                   │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │ 2. Check LOCAL concurrency (existing behavior)          │   │
│  └──────────────────────────────────────────────────────────┘   │
│                              │                                   │
│                              ▼                                   │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │ 3. Try acquire DISTRIBUTED lock (new)                   │   │
│  │    - Lock Name: "timer:{TimerName}:slot:{0-MaxConc}"    │   │
│  │    - Timeout: MaxRunTime                                │   │
│  │    - Retry: Don't wait, try next slot                   │   │
│  └──────────────────────────────────────────────────────────┘   │
│                              │                                   │
│                    ┌─────────┴─────────┐                         │
│                    │                   │                         │
│               Lock Acquired       Lock Failed                    │
│                    │                   │                         │
│                    ▼                   ▼                         │
│         ┌──────────────────┐  ┌──────────────────┐              │
│         │ Execute Timer    │  │ Skip execution   │              │
│         │ Release lock     │  │ (max concurrency │              │
│         │ when done        │  │  reached)        │              │
│         └──────────────────┘  └──────────────────┘              │
└─────────────────────────────────────────────────────────────────┘

Distributed Lock Store (Redis/SQL)
┌─────────────────────────────────────────────────────────────────┐
│ timer:ProcessOrders:slot:0 → Host-A (expires in 30s)            │
│ timer:ProcessOrders:slot:1 → Host-B (expires in 30s)            │
│ timer:ProcessOrders:slot:2 → Host-C (expires in 30s)            │
│ timer:ProcessOrders:slot:3 → [available]                        │
└─────────────────────────────────────────────────────────────────┘

Lock Slot Algorithm¶

For a timer with MaxConcurrency = 5, we create 5 lock slots: - timer:ProcessOrders:slot:0 - timer:ProcessOrders:slot:1 - timer:ProcessOrders:slot:2 - timer:ProcessOrders:slot:3 - timer:ProcessOrders:slot:4

Acquisition Process: 1. Try to acquire slot 0 2. If fail, try slot 1 3. If fail, try slot 2 4. Continue until all slots exhausted 5. If no slot available → max concurrency reached globally

Benefits: - Distributed hosts automatically coordinate - No central counter needed - Locks auto-expire on host failure - Simple to implement and reason about

Implementation Details¶

1. Lock Naming Convention¶

private string GetLockName(TimerInfo timer, int slot)
{
    var timerName = timer.IsManualTimer
        ? timer.ManualTimerName
        : $"{timer.Type.Name}.{timer.Method.Name}";

    return $"timer:{timerName}:slot:{slot}";
}

2. Lock Acquisition in Execute Loop¶

// Before starting timer execution
IDistributedLockHandle lockHandle = null;
for (int slot = 0; slot < timer.MaxConcurrency; slot++)
{
    var lockName = GetLockName(timer, slot);
    lockHandle = await distributedLock.TryAcquireAsync(
        lockName,
        timer.MaxRunTime,
        cancellationToken);

    if (lockHandle != null)
        break; // Got a slot!
}

if (lockHandle == null)
{
    // All slots taken globally - skip execution
    continue;
}

try
{
    // Execute timer
    await ExecuteTimerAsync(timer, cancellationToken);
}
finally
{
    // Release lock
    await lockHandle.ReleaseAsync();
}

3. Configuration¶

public class TimerService
{
    private readonly IDistributedLock distributedLock;

    public TimerService(
        IServiceProvider serviceProvider,
        ILogger<TimerService> logger,
        IDistributedLock distributedLock = null) // Optional
    {
        this.distributedLock = distributedLock; // Null = single host mode
    }
}

4. Attribute Enhancement¶

[Timer(
    maxConcurrency: 3,
    intervalInSeconds: 60,
    useDistributedLock: true)] // Enable multi-host coordination
public async Task ProcessOrdersAsync(CancellationToken ct) { }

Distributed Lock Implementations¶

Redis Implementation (Recommended)¶

Advantages: - Ultra-fast (in-memory) - Native expiration support - Lua scripting for atomic operations - Battle-tested in production

Setup:

services.AddSingleton<IConnectionMultiplexer>(sp =>
    ConnectionMultiplexer.Connect("localhost:6379"));

services.AddSingleton<IDistributedLock>(sp =>
    new RedisDistributedLock(
        sp.GetRequiredService<IConnectionMultiplexer>(),
        keyPrefix: "myapp:timer-lock:"));

Lock Algorithm:

SET timer:ProcessOrders:slot:0 {unique-guid} NX EX 30
- NX: Only set if not exists
- EX: Expire in 30 seconds
- Returns: 1 if acquired, 0 if already held

SQL Server Implementation¶

Advantages: - No additional infrastructure - Uses existing database - Transactional consistency

Disadvantages: - Slower than Redis - More database load - Requires cleanup job for stale locks

Setup:

services.AddSingleton<IDistributedLock>(sp =>
    new SqlDistributedLock(
        connectionString,
        tableName: "DistributedLocks"));

Schema:

CREATE TABLE DistributedLocks (
    ResourceName NVARCHAR(255) PRIMARY KEY,
    LockId UNIQUEIDENTIFIER NOT NULL,
    AcquiredAt DATETIME2 NOT NULL,
    ExpiresAt DATETIME2 NOT NULL,
    HostName NVARCHAR(255) NULL
);

CREATE INDEX IX_DistributedLocks_ExpiresAt
    ON DistributedLocks(ExpiresAt);

PostgreSQL Implementation¶

Similar to SQL Server but with PostgreSQL-specific features:

CREATE TABLE distributed_locks (
    resource_name VARCHAR(255) PRIMARY KEY,
    lock_id UUID NOT NULL,
    acquired_at TIMESTAMP NOT NULL,
    expires_at TIMESTAMP NOT NULL,
    host_name VARCHAR(255)
);

-- Use advisory locks for better performance
SELECT pg_try_advisory_lock(hashtext('timer:ProcessOrders:slot:0'));

Azure Table Storage / Cosmos DB¶

Advantages: - Cloud-native - Highly available - Automatic scaling

Disadvantages: - Higher latency - More complex error handling - Cost considerations

Performance Considerations¶

Lock Acquisition Latency¶

Backend	Average Latency	P99 Latency
Redis (local)	1-2ms	5ms
Redis (remote)	5-10ms	20ms
SQL Server (local)	5-15ms	50ms
SQL Server (remote)	15-30ms	100ms
PostgreSQL	Similar to SQL Server

Throughput Impact¶

For a timer with 1-minute interval: - Without distributed locks: Instant scheduling decision - With distributed locks: +1-30ms per scheduling decision

Impact: Negligible for most use cases since timers typically run every minute or more.

Optimization: Lock Caching¶

For very high-frequency timers (< 10 seconds):

// Cache lock acquisition results briefly
private readonly MemoryCache lockCache = new MemoryCache(new MemoryCacheOptions());

private async Task<bool> CanAcquireLockCached(string timerName)
{
    var cacheKey = $"lock-check:{timerName}";

    if (lockCache.TryGetValue(cacheKey, out bool canAcquire))
        return canAcquire;

    // Try actual lock acquisition
    var lockHandle = await TryAcquireLock(timerName);
    canAcquire = lockHandle != null;

    // Cache result for 1 second
    lockCache.Set(cacheKey, canAcquire, TimeSpan.FromSeconds(1));

    return canAcquire;
}

Failure Scenarios¶

1. Lock Service Unavailable¶

Scenario: Redis/SQL is down

Current Behavior: Timers cannot acquire locks → no execution

Recommendation: Add fallback mode

if (distributedLock != null)
{
    try
    {
        lockHandle = await distributedLock.TryAcquireAsync(...);
    }
    catch (Exception ex)
    {
        logger.LogError(ex, "Distributed lock failed, falling back to local concurrency");
        // Fall back to local concurrency only
        distributedLock = null;
    }
}

2. Lock Not Released (Host Crash)¶

Scenario: Host crashes while holding lock

Solution: Lock auto-expires after MaxRunTime

// Lock expires automatically after max runtime
await distributedLock.AcquireAsync(
    resourceName,
    lockTimeout: timer.MaxRunTime, // Auto-expire
    acquireTimeout: TimeSpan.FromSeconds(5));

3. Lock Extension for Long-Running Timers¶

Scenario: Timer runs longer than expected but within limits

Solution: Heartbeat lock extension

var lockHandle = await AcquireLockAsync();

// Background task to extend lock
_ = Task.Run(async () =>
{
    while (!cancellationToken.IsCancellationRequested)
    {
        await Task.Delay(timer.MaxRunTime / 2, cancellationToken);
        await lockHandle.ExtendAsync(timer.MaxRunTime);
    }
});

try
{
    await ExecuteTimerAsync(timer, cancellationToken);
}
finally
{
    await lockHandle.ReleaseAsync();
}

Migration Path¶

Phase 1: Single-Host (Current)¶

services.AddSingleton<TimerService>();
// No distributed lock configured

Phase 2: Multi-Host (Opt-In per Timer)¶

services.AddSingleton<IDistributedLock, RedisDistributedLock>();
services.AddSingleton<TimerService>();

// Opt-in per timer
[Timer(maxConcurrency: 3, useDistributedLock: true)]
public async Task ProcessOrdersAsync(CancellationToken ct) { }

Phase 3: Multi-Host (Default for All)¶

// Configure distributed lock globally
services.AddSingleton<IDistributedLock, RedisDistributedLock>();
services.AddSingleton<TimerService>();

// All timers use distributed locks by default
[Timer(maxConcurrency: 3)]
public async Task ProcessOrdersAsync(CancellationToken ct) { }

Monitoring and Observability¶

Metrics to Track¶

Lock Acquisition Rate
Successful acquisitions per minute
Failed acquisitions (max concurrency reached)
Lock Contention
Number of hosts competing for locks
Average wait time for lock acquisition
Lock Duration
How long locks are held
Locks that expire (host crashed)
Lock Service Health
Redis/SQL availability
Lock operation latency

Logging¶

logger.LogInformation(
    "Distributed lock acquired: {Timer}, Slot: {Slot}, Host: {Host}",
    timerName,
    slot,
    Environment.MachineName);

logger.LogWarning(
    "Max global concurrency reached: {Timer}, MaxConcurrency: {Max}",
    timerName,
    timer.MaxConcurrency);

Health Checks¶

public class DistributedLockHealthCheck : IHealthCheck
{
    private readonly IDistributedLock distributedLock;

    public async Task<HealthCheckResult> CheckHealthAsync(HealthCheckContext context)
    {
        try
        {
            // Try to acquire and release a test lock
            var handle = await distributedLock.TryAcquireAsync(
                "health-check",
                TimeSpan.FromSeconds(5));

            if (handle == null)
                return HealthCheckResult.Degraded("Could not acquire test lock");

            await handle.ReleaseAsync();
            return HealthCheckResult.Healthy("Distributed lock working");
        }
        catch (Exception ex)
        {
            return HealthCheckResult.Unhealthy("Distributed lock failed", ex);
        }
    }
}

Best Practices¶

1. Set Appropriate Lock Timeouts¶

// Lock timeout should match max runtime
[Timer(
    maxConcurrency: 3,
    intervalInSeconds: 300,
    maxRunTimeInSeconds: 60)] // Lock expires after 60s

2. Handle Lock Acquisition Failures Gracefully¶

if (lockHandle == null)
{
    // Log but don't fail - timer will retry next interval
    logger.LogDebug("Skipped execution (max concurrency reached)");
    return;
}

3. Use Separate Lock Stores per Environment¶

// Production
redis: "prod-redis:6379"

// Staging
redis: "staging-redis:6379"

// Prevents staging from interfering with production locks

4. Monitor Lock Age¶

// Alert if locks are held too long (potential deadlock)
if (lockAge > timer.MaxRunTime * 1.5)
{
    logger.LogWarning("Lock held longer than expected: {Timer}", timerName);
}

Summary¶

Current State: - Single-host concurrency control - No multi-host coordination

Recommended Solution: - Use distributed locking with lock slots - Implement with Redis (preferred) or SQL - Opt-in per timer via attribute flag - Fail gracefully when lock service unavailable - Monitor lock acquisition and contention

Next Steps: 1. Choose lock backend (Redis recommended) 2. Implement IDistributedLock interface 3. Update TimerService to acquire locks before execution 4. Add monitoring and health checks 5. Gradually migrate timers to use distributed locks