Multi-Host Concurrency Design¶
Implementation note (current)¶
Distributed locking shipped via an IDistributedLock / IDistributedLockHandle
abstraction (src/Core/LBS.Augment/Timers/), with two implementations:
RedisDistributedLock for multi-host coordination and InMemoryDistributedLock for
single-host and test scenarios. Locks are acquired explicitly through
IDistributedLock.AcquireAsync(...) / TryAcquireAsync(...).
There is no useDistributedLock flag on [Timer(...)]. The real TimerAttribute
(src/Core/LBS.Augment/Timers/TimerAttribute.cs) takes only
maxConcurrency, intervalInSeconds, and the optional maxRunTimeInSeconds,
retryOnError, and spreadStart parameters. The design body below predates the shipped
implementation; treat this note as the source of truth for what exists today.
Problem Statement¶
The current TimerService implementation only controls concurrency within a single host process:
[Timer(maxConcurrency: 3, intervalInSeconds: 60)]
public async Task ProcessOrdersAsync(CancellationToken ct) { }
Current Behavior: - Host A: Max 3 concurrent executions - Host B: Max 3 concurrent executions - Total across all hosts: 6 concurrent executions (not controlled)
Required Behavior for Multi-Host: - Total across ALL hosts: Max 3 concurrent executions globally
Current Architecture Limitations¶
1. In-Process Concurrency Tracking¶
This counter is in-memory and not shared across hosts.
2. No Distributed Coordination¶
// TimerService.cs line 297-298
if (timer.CurrentConcurrency < timer.MaxConcurrency)
{
// Start timer - no awareness of other hosts
}
Each host independently decides whether to run a timer.
3. Spread Start is Random per Host¶
// TimerService.cs line 64
var spreadDelay = TimeSpan.FromTicks((long)(this.random.NextDouble() * timerAttribute.Interval.Ticks));
Each host generates its own random delay, potentially causing all hosts to start at similar times.
Multi-Host Concurrency Strategies¶
Strategy 1: Distributed Locking (Recommended)¶
Concept: Acquire a distributed lock before executing each timer instance.
Pros: - True global concurrency control - Works across any number of hosts - Prevents duplicate work - Handles host failures (locks auto-expire)
Cons: - Requires external dependency (Redis, SQL, etc.) - Additional latency for lock acquisition - Potential single point of failure
Implementation:
[Timer(maxConcurrency: 3, intervalInSeconds: 60, useDistributedLock: true)]
public async Task ProcessOrdersAsync(CancellationToken ct)
{
// Only 3 instances across ALL hosts will run concurrently
}
Strategy 2: Database-Based Coordination¶
Concept: Use database rows to track running instances globally.
Pros: - Leverages existing database infrastructure - Atomic operations via transactions - Natural audit trail
Cons: - Higher database load - Slower than in-memory locks - Requires cleanup of stale entries
Implementation:
CREATE TABLE TimerExecutions (
TimerName NVARCHAR(255),
InstanceId UNIQUEIDENTIFIER PRIMARY KEY,
HostName NVARCHAR(255),
StartedAt DATETIME2,
ExpiresAt DATETIME2
);
Strategy 3: Leader Election¶
Concept: One host becomes the leader and manages all timer scheduling.
Pros: - Simple coordination model - No per-execution overhead - Single source of truth
Cons: - Single point of failure - Underutilizes available hosts - Complex leader election logic
Not Recommended: Defeats the purpose of multi-host deployment.
Strategy 4: Consistent Hashing¶
Concept: Hash timer name to determine which host should execute it.
Pros: - Deterministic distribution - No external dependencies - Automatic rebalancing
Cons: - Doesn't control concurrency across hosts - Requires service discovery - Uneven load distribution
Use Case: Good for partitioning work, not for concurrency control.
Strategy 5: Message Queue Coordination¶
Concept: Use a message queue to distribute timer work items.
Pros: - Natural work distribution - Built-in retry mechanisms - Good for high-volume scenarios
Cons: - Architectural complexity - Requires message broker - Changes timer execution model
Recommended Approach: Distributed Locking¶
Design Overview¶
┌─────────────────────────────────────────────────────────────────┐
│ Timer Service │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ 1. Check if timer should run (interval, blackout, etc.) │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ 2. Check LOCAL concurrency (existing behavior) │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ 3. Try acquire DISTRIBUTED lock (new) │ │
│ │ - Lock Name: "timer:{TimerName}:slot:{0-MaxConc}" │ │
│ │ - Timeout: MaxRunTime │ │
│ │ - Retry: Don't wait, try next slot │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────┴─────────┐ │
│ │ │ │
│ Lock Acquired Lock Failed │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Execute Timer │ │ Skip execution │ │
│ │ Release lock │ │ (max concurrency │ │
│ │ when done │ │ reached) │ │
│ └──────────────────┘ └──────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Distributed Lock Store (Redis/SQL)
┌─────────────────────────────────────────────────────────────────┐
│ timer:ProcessOrders:slot:0 → Host-A (expires in 30s) │
│ timer:ProcessOrders:slot:1 → Host-B (expires in 30s) │
│ timer:ProcessOrders:slot:2 → Host-C (expires in 30s) │
│ timer:ProcessOrders:slot:3 → [available] │
└─────────────────────────────────────────────────────────────────┘
Lock Slot Algorithm¶
For a timer with MaxConcurrency = 5, we create 5 lock slots:
- timer:ProcessOrders:slot:0
- timer:ProcessOrders:slot:1
- timer:ProcessOrders:slot:2
- timer:ProcessOrders:slot:3
- timer:ProcessOrders:slot:4
Acquisition Process: 1. Try to acquire slot 0 2. If fail, try slot 1 3. If fail, try slot 2 4. Continue until all slots exhausted 5. If no slot available → max concurrency reached globally
Benefits: - Distributed hosts automatically coordinate - No central counter needed - Locks auto-expire on host failure - Simple to implement and reason about
Implementation Details¶
1. Lock Naming Convention¶
private string GetLockName(TimerInfo timer, int slot)
{
var timerName = timer.IsManualTimer
? timer.ManualTimerName
: $"{timer.Type.Name}.{timer.Method.Name}";
return $"timer:{timerName}:slot:{slot}";
}
2. Lock Acquisition in Execute Loop¶
// Before starting timer execution
IDistributedLockHandle lockHandle = null;
for (int slot = 0; slot < timer.MaxConcurrency; slot++)
{
var lockName = GetLockName(timer, slot);
lockHandle = await distributedLock.TryAcquireAsync(
lockName,
timer.MaxRunTime,
cancellationToken);
if (lockHandle != null)
break; // Got a slot!
}
if (lockHandle == null)
{
// All slots taken globally - skip execution
continue;
}
try
{
// Execute timer
await ExecuteTimerAsync(timer, cancellationToken);
}
finally
{
// Release lock
await lockHandle.ReleaseAsync();
}
3. Configuration¶
public class TimerService
{
private readonly IDistributedLock distributedLock;
public TimerService(
IServiceProvider serviceProvider,
ILogger<TimerService> logger,
IDistributedLock distributedLock = null) // Optional
{
this.distributedLock = distributedLock; // Null = single host mode
}
}
4. Attribute Enhancement¶
[Timer(
maxConcurrency: 3,
intervalInSeconds: 60,
useDistributedLock: true)] // Enable multi-host coordination
public async Task ProcessOrdersAsync(CancellationToken ct) { }
Distributed Lock Implementations¶
Redis Implementation (Recommended)¶
Advantages: - Ultra-fast (in-memory) - Native expiration support - Lua scripting for atomic operations - Battle-tested in production
Setup:
services.AddSingleton<IConnectionMultiplexer>(sp =>
ConnectionMultiplexer.Connect("localhost:6379"));
services.AddSingleton<IDistributedLock>(sp =>
new RedisDistributedLock(
sp.GetRequiredService<IConnectionMultiplexer>(),
keyPrefix: "myapp:timer-lock:"));
Lock Algorithm:
SET timer:ProcessOrders:slot:0 {unique-guid} NX EX 30
- NX: Only set if not exists
- EX: Expire in 30 seconds
- Returns: 1 if acquired, 0 if already held
SQL Server Implementation¶
Advantages: - No additional infrastructure - Uses existing database - Transactional consistency
Disadvantages: - Slower than Redis - More database load - Requires cleanup job for stale locks
Setup:
services.AddSingleton<IDistributedLock>(sp =>
new SqlDistributedLock(
connectionString,
tableName: "DistributedLocks"));
Schema:
CREATE TABLE DistributedLocks (
ResourceName NVARCHAR(255) PRIMARY KEY,
LockId UNIQUEIDENTIFIER NOT NULL,
AcquiredAt DATETIME2 NOT NULL,
ExpiresAt DATETIME2 NOT NULL,
HostName NVARCHAR(255) NULL
);
CREATE INDEX IX_DistributedLocks_ExpiresAt
ON DistributedLocks(ExpiresAt);
PostgreSQL Implementation¶
Similar to SQL Server but with PostgreSQL-specific features:
CREATE TABLE distributed_locks (
resource_name VARCHAR(255) PRIMARY KEY,
lock_id UUID NOT NULL,
acquired_at TIMESTAMP NOT NULL,
expires_at TIMESTAMP NOT NULL,
host_name VARCHAR(255)
);
-- Use advisory locks for better performance
SELECT pg_try_advisory_lock(hashtext('timer:ProcessOrders:slot:0'));
Azure Table Storage / Cosmos DB¶
Advantages: - Cloud-native - Highly available - Automatic scaling
Disadvantages: - Higher latency - More complex error handling - Cost considerations
Performance Considerations¶
Lock Acquisition Latency¶
| Backend | Average Latency | P99 Latency |
|---|---|---|
| Redis (local) | 1-2ms | 5ms |
| Redis (remote) | 5-10ms | 20ms |
| SQL Server (local) | 5-15ms | 50ms |
| SQL Server (remote) | 15-30ms | 100ms |
| PostgreSQL | Similar to SQL Server |
Throughput Impact¶
For a timer with 1-minute interval: - Without distributed locks: Instant scheduling decision - With distributed locks: +1-30ms per scheduling decision
Impact: Negligible for most use cases since timers typically run every minute or more.
Optimization: Lock Caching¶
For very high-frequency timers (< 10 seconds):
// Cache lock acquisition results briefly
private readonly MemoryCache lockCache = new MemoryCache(new MemoryCacheOptions());
private async Task<bool> CanAcquireLockCached(string timerName)
{
var cacheKey = $"lock-check:{timerName}";
if (lockCache.TryGetValue(cacheKey, out bool canAcquire))
return canAcquire;
// Try actual lock acquisition
var lockHandle = await TryAcquireLock(timerName);
canAcquire = lockHandle != null;
// Cache result for 1 second
lockCache.Set(cacheKey, canAcquire, TimeSpan.FromSeconds(1));
return canAcquire;
}
Failure Scenarios¶
1. Lock Service Unavailable¶
Scenario: Redis/SQL is down
Current Behavior: Timers cannot acquire locks → no execution
Recommendation: Add fallback mode
if (distributedLock != null)
{
try
{
lockHandle = await distributedLock.TryAcquireAsync(...);
}
catch (Exception ex)
{
logger.LogError(ex, "Distributed lock failed, falling back to local concurrency");
// Fall back to local concurrency only
distributedLock = null;
}
}
2. Lock Not Released (Host Crash)¶
Scenario: Host crashes while holding lock
Solution: Lock auto-expires after MaxRunTime
// Lock expires automatically after max runtime
await distributedLock.AcquireAsync(
resourceName,
lockTimeout: timer.MaxRunTime, // Auto-expire
acquireTimeout: TimeSpan.FromSeconds(5));
3. Lock Extension for Long-Running Timers¶
Scenario: Timer runs longer than expected but within limits
Solution: Heartbeat lock extension
var lockHandle = await AcquireLockAsync();
// Background task to extend lock
_ = Task.Run(async () =>
{
while (!cancellationToken.IsCancellationRequested)
{
await Task.Delay(timer.MaxRunTime / 2, cancellationToken);
await lockHandle.ExtendAsync(timer.MaxRunTime);
}
});
try
{
await ExecuteTimerAsync(timer, cancellationToken);
}
finally
{
await lockHandle.ReleaseAsync();
}
Migration Path¶
Phase 1: Single-Host (Current)¶
Phase 2: Multi-Host (Opt-In per Timer)¶
services.AddSingleton<IDistributedLock, RedisDistributedLock>();
services.AddSingleton<TimerService>();
// Opt-in per timer
[Timer(maxConcurrency: 3, useDistributedLock: true)]
public async Task ProcessOrdersAsync(CancellationToken ct) { }
Phase 3: Multi-Host (Default for All)¶
// Configure distributed lock globally
services.AddSingleton<IDistributedLock, RedisDistributedLock>();
services.AddSingleton<TimerService>();
// All timers use distributed locks by default
[Timer(maxConcurrency: 3)]
public async Task ProcessOrdersAsync(CancellationToken ct) { }
Monitoring and Observability¶
Metrics to Track¶
- Lock Acquisition Rate
- Successful acquisitions per minute
-
Failed acquisitions (max concurrency reached)
-
Lock Contention
- Number of hosts competing for locks
-
Average wait time for lock acquisition
-
Lock Duration
- How long locks are held
-
Locks that expire (host crashed)
-
Lock Service Health
- Redis/SQL availability
- Lock operation latency
Logging¶
logger.LogInformation(
"Distributed lock acquired: {Timer}, Slot: {Slot}, Host: {Host}",
timerName,
slot,
Environment.MachineName);
logger.LogWarning(
"Max global concurrency reached: {Timer}, MaxConcurrency: {Max}",
timerName,
timer.MaxConcurrency);
Health Checks¶
public class DistributedLockHealthCheck : IHealthCheck
{
private readonly IDistributedLock distributedLock;
public async Task<HealthCheckResult> CheckHealthAsync(HealthCheckContext context)
{
try
{
// Try to acquire and release a test lock
var handle = await distributedLock.TryAcquireAsync(
"health-check",
TimeSpan.FromSeconds(5));
if (handle == null)
return HealthCheckResult.Degraded("Could not acquire test lock");
await handle.ReleaseAsync();
return HealthCheckResult.Healthy("Distributed lock working");
}
catch (Exception ex)
{
return HealthCheckResult.Unhealthy("Distributed lock failed", ex);
}
}
}
Best Practices¶
1. Set Appropriate Lock Timeouts¶
// Lock timeout should match max runtime
[Timer(
maxConcurrency: 3,
intervalInSeconds: 300,
maxRunTimeInSeconds: 60)] // Lock expires after 60s
2. Handle Lock Acquisition Failures Gracefully¶
if (lockHandle == null)
{
// Log but don't fail - timer will retry next interval
logger.LogDebug("Skipped execution (max concurrency reached)");
return;
}
3. Use Separate Lock Stores per Environment¶
// Production
redis: "prod-redis:6379"
// Staging
redis: "staging-redis:6379"
// Prevents staging from interfering with production locks
4. Monitor Lock Age¶
// Alert if locks are held too long (potential deadlock)
if (lockAge > timer.MaxRunTime * 1.5)
{
logger.LogWarning("Lock held longer than expected: {Timer}", timerName);
}
Summary¶
Current State: - Single-host concurrency control - No multi-host coordination
Recommended Solution: - Use distributed locking with lock slots - Implement with Redis (preferred) or SQL - Opt-in per timer via attribute flag - Fail gracefully when lock service unavailable - Monitor lock acquisition and contention
Next Steps:
1. Choose lock backend (Redis recommended)
2. Implement IDistributedLock interface
3. Update TimerService to acquire locks before execution
4. Add monitoring and health checks
5. Gradually migrate timers to use distributed locks