Retry policies (exponential backoff with filters & deadline)

When to use

  • Calls to flaky I/O (HTTP, DB, object storage) that usually succeed on retry.
  • You need limits: max retries, deadline, exception filters, and jitter.
  • You want a drop-in decorator around functions/methods.

Avoid when the operation isn't idempotent (not safe to re-run), or when a late answer is worse than a fast failure (strict latency or freshness requirements).

Diagram (text)

call() ── fail ──> sleep(jitter(min(base * 2^n, max_delay))) ── retry ──> ... ──> success or give up (retries/deadline)
            ↑ only for selected exceptions (e.g., TimeoutError)
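
To make the diagram concrete, here is a small sketch of the un-jittered sleep schedule it implies. The helper name backoff_delays is made up for illustration; base and max_delay are assumed to mean the same thing as the decorator parameters in the example that follows.

```python
from typing import List

def backoff_delays(retries: int, base: float = 0.2, max_delay: float = 2.0) -> List[float]:
    """Return the un-jittered sleep before each retry: min(base * 2**n, max_delay)."""
    return [min(base * 2 ** n, max_delay) for n in range(retries)]

if __name__ == "__main__":
    # With base=0.25 and max_delay=2.0 the delay doubles until it hits the cap.
    print(backoff_delays(5, base=0.25, max_delay=2.0))  # [0.25, 0.5, 1.0, 2.0, 2.0]
```

The sum of this list is the worst-case time spent sleeping, which is a useful number to compare against any deadline you set.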

Python example (≤40 lines, type-hinted)

A compact decorator: exponential backoff, optional jitter, exception filtering, deadline.

from __future__ import annotations
import time
from functools import wraps
from typing import Callable, TypeVar, Tuple

T = TypeVar("T")

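def retry(*, retries: int = 3, base: float = 0.2, max_delay: float = 2.0,
          exceptions: Tuple[type[BaseException], ...] = (Exception,),
          jitter: Callable[[float], float] = lambda d: d,
          sleep: Callable[[float], None] = time.sleep,
          now: Callable[[], float] = time.monotonic,  # monotonic: unaffected by wall-clock jumps
          deadline: float | None = None) -> Callable[[Callable[..., T]], Callable[..., T]]:
    """Retry fn up to `retries` extra times; `deadline` is a total time budget in seconds."""
    if retries < 0:
        raise ValueError("retries must be >= 0")
    def deco(fn: Callable[..., T]) -> Callable[..., T]:
        @wraps(fn)
        def wrapper(*a, **kw) -> T:
            delay, start = base, now()
            for attempt in range(retries + 1):
                try:
                    return fn(*a, **kw)
                except exceptions:
                    d = jitter(min(delay, max_delay))  # the sleep we would actually take
                    # Give up if attempts are exhausted or the next sleep would overrun the deadline.
                    if attempt == retries or (deadline is not None and now() - start + d > deadline):
                        raise
                    sleep(d)
                    delay = min(delay * 2, max_delay)
            raise AssertionError("unreachable")  # the last attempt always returns or raises
        return wrapper
    return deco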

Usage & tiny pytest-style checks

def test_succeeds_with_backoff():
    calls, sleeps = {"n": 0}, []
    @retry(retries=5, base=1.0, sleep=sleeps.append, exceptions=(TimeoutError,))
    def flaky():
        calls["n"] += 1
        if calls["n"] < 3: raise TimeoutError()
        return 42
    # Two failures, then success: the recorded delays double (1.0, then 2.0).
    assert flaky() == 42 and sleeps == [1.0, 2.0]

def test_filters_exceptions():
    import pytest
    @retry(retries=2, base=0, exceptions=(TimeoutError,))
    def bad():
        raise ValueError("no retry")  # not in exceptions
    with pytest.raises(ValueError): bad()

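def test_deadline_stops_early():
    import pytest
    t, calls = {"v": 0.0}, {"n": 0}
    def now(): return t["v"]
    def advance(d): t["v"] += d  # fake sleep: moves the fake clock forward
    @retry(retries=10, base=1.0, max_delay=8.0, deadline=2.5, now=now, sleep=advance, exceptions=(TimeoutError,))
    def always_timeout():
        calls["n"] += 1
        raise TimeoutError()
    with pytest.raises(TimeoutError): always_timeout()
    # One 1.0 s sleep fits; the next 2.0 s sleep would pass the 2.5 s deadline, so only 2 of 11 calls happen.
    assert calls["n"] == 2 and t["v"] == 1.0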

Trade-offs & pitfalls

  • Pros: Robust against transient failures; tunable; easy to apply at call sites.
  • Cons: Can hide real errors; increases latency; adds complexity.
  • Pitfalls / anti-patterns:
    • Retrying non-idempotent ops (double charges, duplicate inserts).
    • Catching broad Exception: pass only specific transient types (the example's default of (Exception,) also retries bugs such as TypeError).
    • Infinite or too-long retries: set retries and a deadline, and log each attempt.
    • Blocking sleeps in services: time.sleep holds the calling thread and stalls an asyncio event loop; prefer an async sleep or a background task.
    • No jitter → thundering herds. Use full jitter (random(0, delay)) in scaled systems.
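
For the blocking-sleep and jitter pitfalls, a minimal async sketch may help. It assumes an asyncio-based service; the name async_retry and its defaults are hypothetical, and it leaves out the deadline logic for brevity. It awaits asyncio.sleep instead of calling time.sleep, and applies full jitter on every retry.

```python
from __future__ import annotations
import asyncio
import random
from functools import wraps
from typing import Awaitable, Callable, Tuple, TypeVar

T = TypeVar("T")

def async_retry(*, retries: int = 3, base: float = 0.2, max_delay: float = 2.0,
                exceptions: Tuple[type[BaseException], ...] = (TimeoutError,),
                sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
                ) -> Callable[[Callable[..., Awaitable[T]]], Callable[..., Awaitable[T]]]:
    def deco(fn: Callable[..., Awaitable[T]]) -> Callable[..., Awaitable[T]]:
        @wraps(fn)
        async def wrapper(*a, **kw) -> T:
            delay = base
            for attempt in range(retries + 1):
                try:
                    return await fn(*a, **kw)
                except exceptions:
                    if attempt == retries:
                        raise
                    # Full jitter: wait a random time in [0, capped delay].
                    # Awaiting yields to the event loop instead of blocking it.
                    await sleep(random.uniform(0, min(delay, max_delay)))
                    delay = min(delay * 2, max_delay)
            raise AssertionError("not reached when retries >= 0")
        return wrapper
    return deco

async def demo() -> int:
    calls = {"n": 0}

    @async_retry(retries=5, base=0.01)
    async def flaky() -> int:
        calls["n"] += 1
        if calls["n"] < 3:
            raise TimeoutError()
        return 42

    return await flaky()

if __name__ == "__main__":
    print(asyncio.run(demo()))  # 42
```

The sleep parameter is injectable for the same reason as in the synchronous decorator: tests can pass a fake coroutine and never wait.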

Pythonic alternatives

  • Libraries: tenacity, backoff (rich policies: jitter, stop conditions, wait strategies, exception filters, async support).
  • Decorator stacks: combine with your metrics decorator to time success/fail attempts.
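  • Context managers for scoped retries on blocks if setup/teardown matters. A single with block cannot re-run its body, so the usual shape is a loop that hands out one context manager per attempt (for attempt in ...: with attempt: ...).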
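
The block-scoped alternative can be sketched with the standard library alone. The names Attempt and attempts are hypothetical, and this is one possible shape rather than any library's API: a generator yields one context manager per try, and that context manager suppresses a retryable exception unless it was the final attempt.

```python
from __future__ import annotations
import time
from typing import Callable, Iterator, Tuple

class Attempt:
    """Context manager for one try; swallows a retryable error unless it is the last attempt."""
    def __init__(self, is_last: bool, exceptions: Tuple[type[BaseException], ...]) -> None:
        self.is_last, self.exceptions = is_last, exceptions
        self.failed = False

    def __enter__(self) -> "Attempt":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.failed = exc_type is not None and issubclass(exc_type, self.exceptions)
        # Returning True suppresses the exception so the for-loop can try again.
        return self.failed and not self.is_last

def attempts(*, retries: int = 3, base: float = 0.2, max_delay: float = 2.0,
             exceptions: Tuple[type[BaseException], ...] = (TimeoutError,),
             sleep: Callable[[float], None] = time.sleep) -> Iterator[Attempt]:
    delay = base
    for n in range(retries + 1):
        attempt = Attempt(is_last=(n == retries), exceptions=exceptions)
        yield attempt
        if not attempt.failed:
            return  # the block finished without a retryable error
        sleep(min(delay, max_delay))
        delay = min(delay * 2, max_delay)

def fetch_with_retries() -> str:
    calls = {"n": 0}
    for attempt in attempts(retries=3, base=0.01):
        with attempt:
            calls["n"] += 1  # per-attempt setup/teardown would also live in this block
            if calls["n"] < 3:
                raise TimeoutError("transient")
            return f"ok after {calls['n']} calls"
    raise AssertionError("not reached when retries >= 0")

if __name__ == "__main__":
    print(fetch_with_retries())  # ok after 3 calls
```

Non-retryable exceptions and the final failure propagate out of the with block unchanged, so callers handle them exactly as they would without the loop.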

Mini exercise

Add full jitter:

import random
def full_jitter(d: float) -> float:
    return random.uniform(0, d)

  • Use jitter=full_jitter.
  • Add on_retry(attempt, exc, delay) callback to the decorator to emit logs/metrics.
  • Write tests that check: (1) the callback is called with the expected attempt counts, (2) non-retryable exceptions bypass sleeps.

Checks (quick checklist)

  • Retries limited by count and optionally a deadline.
  • Exception filter lists only transient errors.
  • Backoff increases up to max_delay with jitter.
  • Idempotency considered; non-idempotent ops guarded.
  • Tests cover success-after-retry, filter, and deadline paths.