Skip to content

Multi-Host Concurrency Design

Implementation note (current)

Distributed locking shipped via an IDistributedLock / IDistributedLockHandle abstraction (src/Core/LBS.Augment/Timers/), with two implementations: RedisDistributedLock for multi-host coordination and InMemoryDistributedLock for single-host and test scenarios. Locks are acquired explicitly through IDistributedLock.AcquireAsync(...) / TryAcquireAsync(...).

There is no useDistributedLock flag on [Timer(...)]. The real TimerAttribute (src/Core/LBS.Augment/Timers/TimerAttribute.cs) takes only maxConcurrency, intervalInSeconds, and the optional maxRunTimeInSeconds, retryOnError, and spreadStart parameters. The design body below predates the shipped implementation; treat this note as the source of truth for what exists today.

Problem Statement

The current TimerService implementation only controls concurrency within a single host process:

[Timer(maxConcurrency: 3, intervalInSeconds: 60)]
public async Task ProcessOrdersAsync(CancellationToken ct) { }

Current Behavior: - Host A: Max 3 concurrent executions - Host B: Max 3 concurrent executions - Total across all hosts: 6 concurrent executions (not controlled)

Required Behavior for Multi-Host: - Total across ALL hosts: Max 3 concurrent executions globally

Current Architecture Limitations

1. In-Process Concurrency Tracking

// TimerService.cs line 84
CurrentConcurrency = 0  // Only tracks local host

This counter is in-memory and not shared across hosts.

2. No Distributed Coordination

// TimerService.cs line 297-298
if (timer.CurrentConcurrency < timer.MaxConcurrency)
{
    // Start timer - no awareness of other hosts
}

Each host independently decides whether to run a timer.

3. Spread Start is Random per Host

// TimerService.cs line 64
var spreadDelay = TimeSpan.FromTicks((long)(this.random.NextDouble() * timerAttribute.Interval.Ticks));

Each host generates its own random delay, potentially causing all hosts to start at similar times.

Multi-Host Concurrency Strategies

Concept: Acquire a distributed lock before executing each timer instance.

Pros: - True global concurrency control - Works across any number of hosts - Prevents duplicate work - Handles host failures (locks auto-expire)

Cons: - Requires external dependency (Redis, SQL, etc.) - Additional latency for lock acquisition - Potential single point of failure

Implementation:

[Timer(maxConcurrency: 3, intervalInSeconds: 60, useDistributedLock: true)]
public async Task ProcessOrdersAsync(CancellationToken ct)
{
    // Only 3 instances across ALL hosts will run concurrently
}

Strategy 2: Database-Based Coordination

Concept: Use database rows to track running instances globally.

Pros: - Leverages existing database infrastructure - Atomic operations via transactions - Natural audit trail

Cons: - Higher database load - Slower than in-memory locks - Requires cleanup of stale entries

Implementation:

CREATE TABLE TimerExecutions (
    TimerName NVARCHAR(255),
    InstanceId UNIQUEIDENTIFIER PRIMARY KEY,
    HostName NVARCHAR(255),
    StartedAt DATETIME2,
    ExpiresAt DATETIME2
);

Strategy 3: Leader Election

Concept: One host becomes the leader and manages all timer scheduling.

Pros: - Simple coordination model - No per-execution overhead - Single source of truth

Cons: - Single point of failure - Underutilizes available hosts - Complex leader election logic

Not Recommended: Defeats the purpose of multi-host deployment.

Strategy 4: Consistent Hashing

Concept: Hash timer name to determine which host should execute it.

Pros: - Deterministic distribution - No external dependencies - Automatic rebalancing

Cons: - Doesn't control concurrency across hosts - Requires service discovery - Uneven load distribution

Use Case: Good for partitioning work, not for concurrency control.

Strategy 5: Message Queue Coordination

Concept: Use a message queue to distribute timer work items.

Pros: - Natural work distribution - Built-in retry mechanisms - Good for high-volume scenarios

Cons: - Architectural complexity - Requires message broker - Changes timer execution model

Design Overview

┌─────────────────────────────────────────────────────────────────┐
│                        Timer Service                             │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │ 1. Check if timer should run (interval, blackout, etc.) │   │
│  └──────────────────────────────────────────────────────────┘   │
│                              │                                   │
│                              ▼                                   │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │ 2. Check LOCAL concurrency (existing behavior)          │   │
│  └──────────────────────────────────────────────────────────┘   │
│                              │                                   │
│                              ▼                                   │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │ 3. Try acquire DISTRIBUTED lock (new)                   │   │
│  │    - Lock Name: "timer:{TimerName}:slot:{0-MaxConc}"    │   │
│  │    - Timeout: MaxRunTime                                │   │
│  │    - Retry: Don't wait, try next slot                   │   │
│  └──────────────────────────────────────────────────────────┘   │
│                              │                                   │
│                    ┌─────────┴─────────┐                         │
│                    │                   │                         │
│               Lock Acquired       Lock Failed                    │
│                    │                   │                         │
│                    ▼                   ▼                         │
│         ┌──────────────────┐  ┌──────────────────┐              │
│         │ Execute Timer    │  │ Skip execution   │              │
│         │ Release lock     │  │ (max concurrency │              │
│         │ when done        │  │  reached)        │              │
│         └──────────────────┘  └──────────────────┘              │
└─────────────────────────────────────────────────────────────────┘

Distributed Lock Store (Redis/SQL)
┌─────────────────────────────────────────────────────────────────┐
│ timer:ProcessOrders:slot:0 → Host-A (expires in 30s)            │
│ timer:ProcessOrders:slot:1 → Host-B (expires in 30s)            │
│ timer:ProcessOrders:slot:2 → Host-C (expires in 30s)            │
│ timer:ProcessOrders:slot:3 → [available]                        │
└─────────────────────────────────────────────────────────────────┘

Lock Slot Algorithm

For a timer with MaxConcurrency = 5, we create 5 lock slots: - timer:ProcessOrders:slot:0 - timer:ProcessOrders:slot:1 - timer:ProcessOrders:slot:2 - timer:ProcessOrders:slot:3 - timer:ProcessOrders:slot:4

Acquisition Process: 1. Try to acquire slot 0 2. If fail, try slot 1 3. If fail, try slot 2 4. Continue until all slots exhausted 5. If no slot available → max concurrency reached globally

Benefits: - Distributed hosts automatically coordinate - No central counter needed - Locks auto-expire on host failure - Simple to implement and reason about

Implementation Details

1. Lock Naming Convention

private string GetLockName(TimerInfo timer, int slot)
{
    var timerName = timer.IsManualTimer
        ? timer.ManualTimerName
        : $"{timer.Type.Name}.{timer.Method.Name}";

    return $"timer:{timerName}:slot:{slot}";
}

2. Lock Acquisition in Execute Loop

// Before starting timer execution
IDistributedLockHandle lockHandle = null;
for (int slot = 0; slot < timer.MaxConcurrency; slot++)
{
    var lockName = GetLockName(timer, slot);
    lockHandle = await distributedLock.TryAcquireAsync(
        lockName,
        timer.MaxRunTime,
        cancellationToken);

    if (lockHandle != null)
        break; // Got a slot!
}

if (lockHandle == null)
{
    // All slots taken globally - skip execution
    continue;
}

try
{
    // Execute timer
    await ExecuteTimerAsync(timer, cancellationToken);
}
finally
{
    // Release lock
    await lockHandle.ReleaseAsync();
}

3. Configuration

public class TimerService
{
    private readonly IDistributedLock distributedLock;

    public TimerService(
        IServiceProvider serviceProvider,
        ILogger<TimerService> logger,
        IDistributedLock distributedLock = null) // Optional
    {
        this.distributedLock = distributedLock; // Null = single host mode
    }
}

4. Attribute Enhancement

[Timer(
    maxConcurrency: 3,
    intervalInSeconds: 60,
    useDistributedLock: true)] // Enable multi-host coordination
public async Task ProcessOrdersAsync(CancellationToken ct) { }

Distributed Lock Implementations

Advantages: - Ultra-fast (in-memory) - Native expiration support - Lua scripting for atomic operations - Battle-tested in production

Setup:

services.AddSingleton<IConnectionMultiplexer>(sp =>
    ConnectionMultiplexer.Connect("localhost:6379"));

services.AddSingleton<IDistributedLock>(sp =>
    new RedisDistributedLock(
        sp.GetRequiredService<IConnectionMultiplexer>(),
        keyPrefix: "myapp:timer-lock:"));

Lock Algorithm:

SET timer:ProcessOrders:slot:0 {unique-guid} NX EX 30
- NX: Only set if not exists
- EX: Expire in 30 seconds
- Returns: 1 if acquired, 0 if already held

SQL Server Implementation

Advantages: - No additional infrastructure - Uses existing database - Transactional consistency

Disadvantages: - Slower than Redis - More database load - Requires cleanup job for stale locks

Setup:

services.AddSingleton<IDistributedLock>(sp =>
    new SqlDistributedLock(
        connectionString,
        tableName: "DistributedLocks"));

Schema:

CREATE TABLE DistributedLocks (
    ResourceName NVARCHAR(255) PRIMARY KEY,
    LockId UNIQUEIDENTIFIER NOT NULL,
    AcquiredAt DATETIME2 NOT NULL,
    ExpiresAt DATETIME2 NOT NULL,
    HostName NVARCHAR(255) NULL
);

CREATE INDEX IX_DistributedLocks_ExpiresAt
    ON DistributedLocks(ExpiresAt);

PostgreSQL Implementation

Similar to SQL Server but with PostgreSQL-specific features:

CREATE TABLE distributed_locks (
    resource_name VARCHAR(255) PRIMARY KEY,
    lock_id UUID NOT NULL,
    acquired_at TIMESTAMP NOT NULL,
    expires_at TIMESTAMP NOT NULL,
    host_name VARCHAR(255)
);

-- Use advisory locks for better performance
SELECT pg_try_advisory_lock(hashtext('timer:ProcessOrders:slot:0'));

Azure Table Storage / Cosmos DB

Advantages: - Cloud-native - Highly available - Automatic scaling

Disadvantages: - Higher latency - More complex error handling - Cost considerations

Performance Considerations

Lock Acquisition Latency

Backend Average Latency P99 Latency
Redis (local) 1-2ms 5ms
Redis (remote) 5-10ms 20ms
SQL Server (local) 5-15ms 50ms
SQL Server (remote) 15-30ms 100ms
PostgreSQL Similar to SQL Server

Throughput Impact

For a timer with 1-minute interval: - Without distributed locks: Instant scheduling decision - With distributed locks: +1-30ms per scheduling decision

Impact: Negligible for most use cases since timers typically run every minute or more.

Optimization: Lock Caching

For very high-frequency timers (< 10 seconds):

// Cache lock acquisition results briefly
private readonly MemoryCache lockCache = new MemoryCache(new MemoryCacheOptions());

private async Task<bool> CanAcquireLockCached(string timerName)
{
    var cacheKey = $"lock-check:{timerName}";

    if (lockCache.TryGetValue(cacheKey, out bool canAcquire))
        return canAcquire;

    // Try actual lock acquisition
    var lockHandle = await TryAcquireLock(timerName);
    canAcquire = lockHandle != null;

    // Cache result for 1 second
    lockCache.Set(cacheKey, canAcquire, TimeSpan.FromSeconds(1));

    return canAcquire;
}

Failure Scenarios

1. Lock Service Unavailable

Scenario: Redis/SQL is down

Current Behavior: Timers cannot acquire locks → no execution

Recommendation: Add fallback mode

if (distributedLock != null)
{
    try
    {
        lockHandle = await distributedLock.TryAcquireAsync(...);
    }
    catch (Exception ex)
    {
        logger.LogError(ex, "Distributed lock failed, falling back to local concurrency");
        // Fall back to local concurrency only
        distributedLock = null;
    }
}

2. Lock Not Released (Host Crash)

Scenario: Host crashes while holding lock

Solution: Lock auto-expires after MaxRunTime

// Lock expires automatically after max runtime
await distributedLock.AcquireAsync(
    resourceName,
    lockTimeout: timer.MaxRunTime, // Auto-expire
    acquireTimeout: TimeSpan.FromSeconds(5));

3. Lock Extension for Long-Running Timers

Scenario: Timer runs longer than expected but within limits

Solution: Heartbeat lock extension

var lockHandle = await AcquireLockAsync();

// Background task to extend lock
_ = Task.Run(async () =>
{
    while (!cancellationToken.IsCancellationRequested)
    {
        await Task.Delay(timer.MaxRunTime / 2, cancellationToken);
        await lockHandle.ExtendAsync(timer.MaxRunTime);
    }
});

try
{
    await ExecuteTimerAsync(timer, cancellationToken);
}
finally
{
    await lockHandle.ReleaseAsync();
}

Migration Path

Phase 1: Single-Host (Current)

services.AddSingleton<TimerService>();
// No distributed lock configured

Phase 2: Multi-Host (Opt-In per Timer)

services.AddSingleton<IDistributedLock, RedisDistributedLock>();
services.AddSingleton<TimerService>();

// Opt-in per timer
[Timer(maxConcurrency: 3, useDistributedLock: true)]
public async Task ProcessOrdersAsync(CancellationToken ct) { }

Phase 3: Multi-Host (Default for All)

// Configure distributed lock globally
services.AddSingleton<IDistributedLock, RedisDistributedLock>();
services.AddSingleton<TimerService>();

// All timers use distributed locks by default
[Timer(maxConcurrency: 3)]
public async Task ProcessOrdersAsync(CancellationToken ct) { }

Monitoring and Observability

Metrics to Track

  1. Lock Acquisition Rate
  2. Successful acquisitions per minute
  3. Failed acquisitions (max concurrency reached)

  4. Lock Contention

  5. Number of hosts competing for locks
  6. Average wait time for lock acquisition

  7. Lock Duration

  8. How long locks are held
  9. Locks that expire (host crashed)

  10. Lock Service Health

  11. Redis/SQL availability
  12. Lock operation latency

Logging

logger.LogInformation(
    "Distributed lock acquired: {Timer}, Slot: {Slot}, Host: {Host}",
    timerName,
    slot,
    Environment.MachineName);

logger.LogWarning(
    "Max global concurrency reached: {Timer}, MaxConcurrency: {Max}",
    timerName,
    timer.MaxConcurrency);

Health Checks

public class DistributedLockHealthCheck : IHealthCheck
{
    private readonly IDistributedLock distributedLock;

    public async Task<HealthCheckResult> CheckHealthAsync(HealthCheckContext context)
    {
        try
        {
            // Try to acquire and release a test lock
            var handle = await distributedLock.TryAcquireAsync(
                "health-check",
                TimeSpan.FromSeconds(5));

            if (handle == null)
                return HealthCheckResult.Degraded("Could not acquire test lock");

            await handle.ReleaseAsync();
            return HealthCheckResult.Healthy("Distributed lock working");
        }
        catch (Exception ex)
        {
            return HealthCheckResult.Unhealthy("Distributed lock failed", ex);
        }
    }
}

Best Practices

1. Set Appropriate Lock Timeouts

// Lock timeout should match max runtime
[Timer(
    maxConcurrency: 3,
    intervalInSeconds: 300,
    maxRunTimeInSeconds: 60)] // Lock expires after 60s

2. Handle Lock Acquisition Failures Gracefully

if (lockHandle == null)
{
    // Log but don't fail - timer will retry next interval
    logger.LogDebug("Skipped execution (max concurrency reached)");
    return;
}

3. Use Separate Lock Stores per Environment

// Production
redis: "prod-redis:6379"

// Staging
redis: "staging-redis:6379"

// Prevents staging from interfering with production locks

4. Monitor Lock Age

// Alert if locks are held too long (potential deadlock)
if (lockAge > timer.MaxRunTime * 1.5)
{
    logger.LogWarning("Lock held longer than expected: {Timer}", timerName);
}

Summary

Current State: - Single-host concurrency control - No multi-host coordination

Recommended Solution: - Use distributed locking with lock slots - Implement with Redis (preferred) or SQL - Opt-in per timer via attribute flag - Fail gracefully when lock service unavailable - Monitor lock acquisition and contention

Next Steps: 1. Choose lock backend (Redis recommended) 2. Implement IDistributedLock interface 3. Update TimerService to acquire locks before execution 4. Add monitoring and health checks 5. Gradually migrate timers to use distributed locks