Design: Error Handling and Resilience for External Service Calls
Context
Spotter's background sync loops call six external APIs (Navidrome, Spotify, Last.fm, MusicBrainz, Fanart.tv, OpenAI) on fixed timer intervals. Before this design, errors were logged and skipped until the next tick -- transient failures waited the full sync interval (5 minutes to 1 hour) before retry, and permanent failures (revoked credentials) generated noisy log entries indefinitely with no user notification. The user had zero visibility into provider health without reading logs.
This design introduces a two-tier error classification system with exponential backoff for transient errors and event bus notifications for permanent (fatal) errors, giving proportionate responses to both failure modes.
Governing ADRs: [📝 ADR-0020](../../adrs/📝 ADR-0020-error-handling-resilience), [📝 ADR-0007](../../adrs/📝 ADR-0007-in-memory-event-bus), [📝 ADR-0016](../../adrs/📝 ADR-0016-pluggable-provider-factory-pattern), [📝 ADR-0013](../../adrs/📝 ADR-0013-goroutine-ticker-background-scheduling).
Goals / Non-Goals
Goals
- Classify every external service error as exactly one of two classes: retriable or fatal
- Apply exponential backoff with jitter (30s base, 30m cap, +/-25%) for retriable errors
- Maintain per-provider per-user backoff state in memory with mutex protection
- Publish browser-visible toast notifications for fatal errors via the event bus
- Automatically reset backoff state on the next successful provider call
- Prevent duplicate fatal notifications across sync ticks
Non-Goals
- Provider-specific API client implementation or HTTP transport-level retries
- Persistent error state across process restarts (in-memory only by design)
- Circuit breaker state machine (too heavy for a personal server; simple backoff suffices)
- Automatic remediation of fatal errors (user must reconnect the provider)
Decisions
Two-Tier Classification over Circuit Breaker
Choice: Classify errors into exactly two categories -- retriable and fatal -- using a shared `ClassifyError()` utility function.
Rationale: A full circuit breaker (open/half-open/closed) is designed for high-throughput microservice architectures. Spotter makes 3-5 provider calls per sync cycle. A simple counter with a timestamp achieves the same goal in ~50 lines of Go with no external dependency.
Alternatives considered:
- `sony/gobreaker` library: well-tested but adds a dependency for a pattern trivially implemented in the stdlib. The open/half-open model does not naturally distinguish retriable from fatal errors.
- Always retry immediately: hammers failing services, risks rate limiting, and wastes resources on permanent failures.
- No retry (prior behavior): transient errors wait the full sync interval; fatal errors generate infinite log noise.
In-Memory Backoff State over Database Persistence
Choice: Store backoff state in a `sync.RWMutex`-protected map keyed by `(userID, providerType)`.
Rationale: Backoff state is inherently ephemeral. If the process restarts, retrying immediately is the correct behavior (the external service may have recovered). Persisting backoff adds database writes on every sync tick for marginal benefit.
Alternatives considered:
- Database-backed state: survives restarts but adds write load and complexity for a single-user personal server.
- File-based state: simpler than database but introduces filesystem dependency and race conditions.
Heuristic Error Message Matching as Fallback
Choice: After checking typed errors (`HTTPStatusError`, `net.Error`, `net.DNSError`), fall back to string matching against known error message patterns.
Rationale: Not all providers wrap errors in structured types. Last.fm returns XML errors, and some providers include status codes in plain error messages ("spotify API returned status 403"). The heuristic catches these without requiring every provider to implement a custom error type.
Architecture
Error Classification Flow
Backoff State Machine
Integration with Sync Loop
Key Implementation Details
- Error classifier: `internal/services/resilience.go` -- `ClassifyError(err error) ErrorClass` inspects the error chain for `HTTPStatusError`, `net.Error`, `net.DNSError`, `net.OpError`, then falls back to heuristic message matching.
- Backoff state: `BackoffManager` struct with `sync.RWMutex`-protected `map[BackoffKey]*BackoffState`. Key is `(UserID int, ProviderType providers.Type)`.
- Backoff formula: `delay = min(30s * 2^consecutiveFailures, 30m) * jitter[0.75, 1.25]` using `math/rand/v2` for jitter.
- Constants: `backoffBaseDelay = 30s`, `backoffMaxDelay = 30m`.
- Notification dedup: `BackoffState.NotifiedFatal` flag prevents re-publishing the same fatal notification across ticks (REQ-NOTIFY-002). Cleared on `ClearFatal()` or `RecordSuccess()`.
- Fatal recovery: `ClearFatal(key)` resets all state fields -- called when a user reconnects a provider via OAuth.

Files:
- `internal/services/resilience.go` -- all error classification, backoff math, and state management
- `internal/services/resilience_test.go` -- unit tests for classification and backoff calculations
- `internal/events/bus.go` -- `PublishNotification()` convenience method used for fatal alerts
Risks / Trade-offs
- In-memory state lost on restart: A provider backing off for 30 minutes will retry immediately after a process restart. This is acceptable -- the external service may have recovered, and a single retry is harmless. If the error persists, backoff re-engages from the first failure.
- Error classification maintenance: Each new provider may introduce unique error codes or message formats requiring updates to `ClassifyError()` or its heuristic patterns. Mitigated by defaulting unknown errors to retriable (prefer retry over silent failure).
- Heuristic false positives: String matching on error messages is fragile. A provider returning a message containing "timeout" in a non-transient context would be misclassified. Mitigated by checking typed errors first; heuristics are the last resort.
- No notification for long-running retriable backoff: REQ-NOTIFY-003 permits (but does not require) a warning notification when backoff reaches the 30-minute cap. This is not currently implemented -- the user sees nothing for persistent transient failures.
Migration Plan
This feature was implemented incrementally:
- Added `ErrorClass` type, `ClassifyError()`, and `classifyHTTPStatus()` in `internal/services/resilience.go`
- Added `BackoffState`, `BackoffKey`, `BackoffManager` with mutex-protected map
- Added `CalculateBackoff()` with exponential formula and jitter
- Integrated `ShouldSkip()` check into sync loop before each provider call
- Integrated `RecordFailure()` and `RecordSuccess()` after each provider call
- Added `PublishNotification()` call for fatal errors with dedup via `MarkNotified()`
- Added `ClearFatal()` call in OAuth reconnection handlers
No database migration required (all state is in-memory).
Open Questions
- Should retriable backoff reaching the 30-minute cap trigger a user warning notification? The spec permits it (REQ-NOTIFY-003 MAY), but it is not implemented. Adding it risks notification fatigue for transient issues that resolve on their own.
- Should the backoff formula be configurable via environment variables (base delay, max delay, jitter range)? Currently hardcoded, which is simpler but less flexible.
- Should `ClassifyError` support provider-specific overrides (e.g., Subsonic error codes, Last.fm XML error codes)? The spec mentions wrapper functions (REQ-ERR-004 MAY) but none are implemented yet.