Retry policies (exponential backoff with filters & deadline)
When to use
- Calls to flaky I/O (HTTP, DB, object storage) that usually succeed on retry.
- You need limits: max retries, deadline, exception filters, and jitter.
- You want a drop-in decorator around functions/methods.
Avoid when the operation isn’t idempotent/safe to re-run, or freshness must be exact.
Diagram (text)
call() ── fail ──> sleep(base * 2^n + jitter) ── retry ──> ... ──> success or give up (retries/deadline)
↑ only for selected exceptions (e.g., TimeoutError)
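Concretely, the sleep sequence in the diagram (ignoring jitter) is a doubling series capped at max_delay. A tiny illustrative helper, not part of the decorator below:

```python
from __future__ import annotations

def backoff_delays(base: float, max_delay: float, retries: int) -> list[float]:
    # Delay before retry n is base * 2**n, capped at max_delay (no jitter here).
    return [min(base * 2 ** n, max_delay) for n in range(retries)]

# backoff_delays(0.2, 2.0, 5) -> [0.2, 0.4, 0.8, 1.6, 2.0]
```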
Python example (≤40 lines, type-hinted)
A compact decorator: exponential backoff, optional jitter, exception filtering, deadline.
```python
from __future__ import annotations
import time
from functools import wraps
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")

def retry(*, retries: int = 3, base: float = 0.2, max_delay: float = 2.0,
          exceptions: Tuple[type[BaseException], ...] = (Exception,),
          jitter: Callable[[float], float] = lambda d: d,
          sleep: Callable[[float], None] = time.sleep,
          now: Callable[[], float] = time.time,
          deadline: float | None = None) -> Callable[[Callable[..., T]], Callable[..., T]]:
    def deco(fn: Callable[..., T]) -> Callable[..., T]:
        @wraps(fn)
        def wrapper(*a, **kw) -> T:
            delay, start = base, now()
            for attempt in range(retries + 1):
                try:
                    return fn(*a, **kw)
                except exceptions:
                    # Give up on the last attempt, or if the next sleep
                    # would carry us past the deadline.
                    if attempt == retries or (
                        deadline is not None and now() - start + delay > deadline
                    ):
                        raise
                    d = min(delay, max_delay)
                    sleep(jitter(d))
                    delay = min(delay * 2, max_delay)  # exponential backoff, capped
        return wrapper
    return deco
```
Usage & tiny pytest-style checks
```python
def test_succeeds_with_backoff():
    calls, sleeps = {"n": 0}, []

    @retry(retries=5, base=0, sleep=lambda d: sleeps.append(d), exceptions=(TimeoutError,))
    def flaky():
        calls["n"] += 1
        if calls["n"] < 3:
            raise TimeoutError()
        return 42

    assert flaky() == 42 and len(sleeps) == 2

def test_filters_exceptions():
    import pytest

    @retry(retries=2, base=0, exceptions=(TimeoutError,))
    def bad():
        raise ValueError("no retry")  # not in exceptions -> raised immediately

    with pytest.raises(ValueError):
        bad()

def test_deadline_stops_early():
    import pytest
    t = {"v": 0.0}
    def now(): return t["v"]
    def advance(d): t["v"] += d  # fake sleep: just move the clock forward

    @retry(retries=10, base=1.0, max_delay=8.0, deadline=2.5,
           now=now, sleep=advance, exceptions=(TimeoutError,))
    def always_timeout():
        raise TimeoutError()

    with pytest.raises(TimeoutError):
        always_timeout()
```
Trade-offs & pitfalls
- Pros: Robust against transient failures; tunable; easy to apply at call sites.
- Cons: Can hide real errors; increases latency; adds complexity.
- Pitfalls / anti-patterns:
- Retrying non-idempotent ops (double charges, duplicate inserts).
- Catching broad Exception; filter for specific transient errors instead.
- Infinite or overly long retries; set retries/deadline and log exhausted retries.
- Sleeping on the main thread in services; prefer async/backoff primitives or background tasks.
- No jitter causes thundering herds. Use full jitter (random.uniform(0, delay)) in scaled systems.
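The last two pitfalls can be addressed together. A minimal asyncio sketch with full jitter (retry_async is an illustrative name, assuming the same parameter conventions as the decorator above, not a drop-in replacement):

```python
from __future__ import annotations
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

async def retry_async(fn: Callable[[], Awaitable[T]], *, retries: int = 3,
                      base: float = 0.2, max_delay: float = 2.0,
                      exceptions: tuple[type[BaseException], ...] = (Exception,)) -> T:
    delay = base
    for attempt in range(retries + 1):
        try:
            return await fn()
        except exceptions:
            if attempt == retries:
                raise
            # Full jitter: a uniform random slice of the capped delay,
            # so concurrent retriers spread out instead of herding.
            await asyncio.sleep(random.uniform(0, min(delay, max_delay)))
            delay = min(delay * 2, max_delay)
    raise AssertionError("unreachable")  # loop always returns or raises
```

Awaiting asyncio.sleep yields the event loop to other tasks, so a service keeps handling requests while a call backs off.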
Pythonic alternatives
- Libraries: tenacity and backoff offer rich policies (jitter, stop/retry/exception filters, async support).
- Decorator stacks: combine with your metrics decorator to time success/fail attempts.
- Context managers for scoped retries on blocks (with retry_ctx: ...) if setup/teardown matters.
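To illustrate the context-manager alternative, here is one possible sketch of a helper that yields one context manager per attempt (the retrying name and design are assumptions, not a standard API):

```python
from __future__ import annotations
import time
from contextlib import contextmanager
from typing import Callable, Iterator

def retrying(*, retries: int = 3, base: float = 0.2, max_delay: float = 2.0,
             exceptions: tuple[type[BaseException], ...] = (Exception,),
             sleep: Callable[[float], None] = time.sleep) -> Iterator:
    # A filtered exception inside the block is swallowed (after a backoff
    # sleep) until the final attempt, which re-raises. A clean exit from
    # the block ends the iteration.
    ok = False
    delay = base

    @contextmanager
    def attempt(last: bool):
        nonlocal ok, delay
        try:
            yield
            ok = True  # block finished without an exception
        except exceptions:
            if last:
                raise  # out of attempts: propagate
            sleep(delay)
            delay = min(delay * 2, max_delay)

    for i in range(retries + 1):
        yield attempt(last=(i == retries))
        if ok:
            return

# Usage sketch:
#   for att in retrying(retries=5, exceptions=(TimeoutError,)):
#       with att:
#           do_work()  # whole block is retried on TimeoutError
```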
Mini exercise
Add full jitter:

```python
import random
full_jitter = lambda d: random.uniform(0, d)
```

- Use jitter=full_jitter.
- Add an on_retry(attempt, exc, delay) callback to the decorator to emit logs/metrics.
- Write tests that: (1) the callback is called with the expected attempt counts, (2) non-retryable exceptions bypass sleeps.
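For reference, one possible shape of the on_retry extension (a sketch following the exercise's naming; your signature and defaults may differ, and the deadline logic is omitted for brevity):

```python
from __future__ import annotations
import time
from functools import wraps
from typing import Callable, TypeVar

T = TypeVar("T")

def retry_with_hook(*, retries: int = 3, base: float = 0.2, max_delay: float = 2.0,
                    exceptions: tuple[type[BaseException], ...] = (Exception,),
                    jitter: Callable[[float], float] = lambda d: d,
                    sleep: Callable[[float], None] = time.sleep,
                    on_retry: Callable[[int, BaseException, float], None] | None = None):
    def deco(fn: Callable[..., T]) -> Callable[..., T]:
        @wraps(fn)
        def wrapper(*a, **kw) -> T:
            delay = base
            for attempt in range(retries + 1):
                try:
                    return fn(*a, **kw)
                except exceptions as exc:
                    if attempt == retries:
                        raise
                    d = jitter(min(delay, max_delay))
                    if on_retry is not None:
                        on_retry(attempt, exc, d)  # hook for logs/metrics
                    sleep(d)
                    delay = min(delay * 2, max_delay)
        return wrapper
    return deco
```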
Checks (quick checklist)
- Retries limited by count and optionally a deadline.
- Exception filter lists only transient errors.
- Backoff increases up to max_delay with jitter.
- Idempotency considered; non-idempotent ops guarded.
- Tests cover success-after-retry, filter, and deadline paths.